
Re: [HTCondor-users] Exit_hook receiving empty job classAd



Hi all,

I have been debugging the issue, and I have noticed three things:

1) When the hook gets called, the execution environment has already been deleted, but the hook is not aware of it. I checked by running pwd and then both ls and ls .. within the hook; the directory that pwd reports (the one under EXECUTE) is no longer there.
2) The hook now (as of HTCondor 8.0) gets killed after 1 or 2 seconds, even if HOOKNAME_HOOK_JOB_EXIT_TIMEOUT is set to 300 (with HOOKNAME replaced by the actual hook name).
3) The output directory is deleted while the script is executing. I tried a loop that runs ls and then sleep 1: in the first second the files are there, in the next they are gone.
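
The once-per-second check described in point 3 could be scripted roughly as follows. This is only a sketch: the function name probe_dir, its arguments, and the log path in the example call are hypothetical and are not part of HTCondor.

```shell
#!/bin/bash
# Hypothetical helper (not part of HTCondor): report whether a directory
# still exists and how many entries it holds, once per interval.
probe_dir() {
    local dir=$1 samples=${2:-5} interval=${3:-1} i count
    for ((i = 1; i <= samples; i++)); do
        if [ -d "$dir" ]; then
            # Count direct children without parsing ls output.
            count=$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l)
            # $((count)) strips any padding that wc adds on some platforms.
            echo "sample=$i dir=$dir state=present entries=$((count))"
        else
            echo "sample=$i dir=$dir state=missing"
        fi
        # Sleep between samples, but not after the last one.
        if (( i < samples )); then sleep "$interval"; fi
    done
}

# Example call from inside a hook; the log path is an assumption.
# probe_dir "$PWD" 5 1 >> /tmp/exit_hook_probe.log
```

Writing the samples to a file outside the execute directory keeps the evidence even if the hook is killed partway through.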

In short, it seems that the cleanup process ignores the hook and goes on deleting everything while the hook is still running. The job itself ended naturally, so I don't think settings such as KILLING_TIMEOUT should even apply. Has this code path been changed recently? Where could I look for this in the source code? Any pointer would be most welcome.

Thanks,

Joan


On 01/07/13 12:28, Joan J. Piles wrote:
Hi all,

We have been having trouble with our JOB_EXIT_HOOKS, both in HTCondor 7.8 and in HTCondor 8.0. Some of the hook invocations (and, strangely, their number is increasing over time) don't get any job classAd at all. At first we thought it could be a timeout issue (we had our share of these as well), but that doesn't seem to be the case, because the hook script continues its execution. Just in case, we have set both KILLING_TIMEOUT and xxxxx_HOOK_JOB_EXIT_TIMEOUT to 300 seconds, which should be more than enough.

The first thing our hook script does is dump the whole classAd to a file (for debugging purposes), but the files it creates are empty:

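#!/bin/bash

# Create a unique temporary file to hold the job classAd.
TMPFILE=$(mktemp /tmp/condorlog.XXXXXX)
# The classAd is expected on standard input; copy it to the file.
cat > "$TMPFILE"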

The script keeps going from there (reading the stored classAd and processing it). We can see that the script tries to do its job, but it complains about not having any data to work on. That's why we have ruled out a timeout.
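
To tell an empty classAd apart from a script that never ran far enough to read it, the dump step could also record how many bytes arrived on standard input. The following is a minimal sketch of that idea; the function name dump_stdin and the paths in the example are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: copy stdin to a file, then report how many bytes
# were received and whether the current working directory still exists.
dump_stdin() {
    local out=$1 bytes cwd
    cat > "$out"
    bytes=$(wc -c < "$out")
    if [ -d "$PWD" ]; then cwd=yes; else cwd=no; fi
    # $((bytes)) strips any padding that wc adds on some platforms.
    echo "bytes=$((bytes)) cwd_exists=$cwd"
}

# Example use in a hook (paths are assumptions):
# TMPFILE=$(mktemp /tmp/condorlog.XXXXXX)
# dump_stdin "$TMPFILE" >> /tmp/exit_hook_debug.log
```

A line reading bytes=0 would show that the hook did start and read its input, but that nothing was written to it.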

I found a similar report on the list from four years ago [1], but it doesn't seem to have been resolved. Is there anything I could do to debug this issue further?

Thanks,

Joan

[1]: https://lists.cs.wisc.edu/archive/htcondor-users/2009-July/msg00165.shtml
-- 
--------------------------------------------------------------------------
Joan Josep Piles Contreras -  Analista de sistemas
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 876 55 51 47 (ext. 845147)
http://i3a.unizar.es -- jpiles@xxxxxxxxx
--------------------------------------------------------------------------



