
Re: [Condor-users] Defining an exit script for condor jobs



On Oct 6, 2005, at 2:13 PM, Terrence Martin wrote:

I asked this question a couple months ago but I wanted to put it out
again because I did not follow up on the one response I got.

My question was whether it is possible to have a script run on job exit
that can go beyond what the normal condor exit does in terms of cleaning
up areas. This is important in the current Open Science Grid clusters I
am working with since often user files are stored in a temporary area that
condor does not necessarily know about. It would be nice to have this
area cleared on exit.


The answer I got was either use a wrapper or Dagman.

The first solution does not work, that is if I follow the rules for
USER_JOB_WRAPPER in the condor documentation to not have the wrapper
fork a child and only call exec. I can do that but it is not clear I
should. What would be nice is that in addition to USER_JOB_WRAPPER there
was a USER_JOB_EXIT_SCRIPT which could define a script that performs
certain cleanup steps on job exit.


As far as DAGman, I am not sure how that would help. DAGman from the
condor documentation is a meta-scheduler that submits to condor. That
sounds like it works on the outside between the user and condor. The
grid software I work with is already thick with schedulers to condor and
I cannot enforce what users make use of on that side. All I can control
is my condor queue and my worker nodes. Admittedly my knowledge of
dagman extends to what I read here http://www.cs.wisc.edu/condor/dagman/
but it does not sound like what I am looking for.


I guess I have another option and try to be clever. Just before my user
wrapper drops to the actual job I could start a monitoring process that
watches for the job to exit and then tries to clean up. It would be simpler
and probably less error prone if condor could just trigger a cleanup
process though. This would also have to end up being an orphan process
since the parent calls an exec right after it spawns the monitor.

I see a few options available, none ideal:

1) Have the USER_JOB_WRAPPER clean up the files of the previous job.

2) If you have any control on the submit side, you can set a post-script in the job ad that will be run after the job. You can use SUBMIT_EXPRS to add it automatically to all jobs.

3) Have the USER_JOB_WRAPPER fork the job instead of exec'ing it. We don't mention this in the manual because it can be tricky to get right. The script must not exit before the job, must exit with the same status as the job, and must catch SIGTERM and forward it to the job. If you run any standard universe jobs, there are several more signals the script has to catch and forward to the job. There may be some other details, but those are the ones I can think of.
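
As a rough illustration of option 3, here is a minimal Python sketch of a forking wrapper. It is not a tested Condor recipe. It assumes the wrapper receives the job's command line as its own arguments, and that the temporary area to clear is named by a hypothetical JOB_SCRATCH_DIR environment variable (that name is invented for this example). It covers only the three requirements listed above: wait for the job, forward SIGTERM, and exit with the job's status. It does not handle the extra signals needed for standard universe jobs.

```python
#!/usr/bin/env python3
"""Sketch of a forking job wrapper: run the job, forward SIGTERM,
clean up afterwards, and exit with the job's own status."""
import os
import shutil
import signal
import subprocess
import sys


def cleanup(scratch_dir):
    """Remove the job's temporary area, if one was named."""
    if scratch_dir:
        shutil.rmtree(scratch_dir, ignore_errors=True)


def run_job(argv, scratch_dir=None):
    """Run argv as a child process and return its return code.

    A negative value -N means the job was killed by signal N.
    """
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Pass the signal on to the job instead of dying ourselves.
        if child.poll() is None:
            child.send_signal(signum)

    previous = signal.signal(signal.SIGTERM, forward)
    try:
        # Do not exit before the job does.
        returncode = child.wait()
    finally:
        signal.signal(signal.SIGTERM, previous)
        cleanup(scratch_dir)
    return returncode


def main():
    # JOB_SCRATCH_DIR is a hypothetical variable used only in this sketch.
    rc = run_job(sys.argv[1:], os.environ.get("JOB_SCRATCH_DIR"))
    if rc < 0:
        # The job died from a signal: die the same way so the caller
        # sees the same status.
        try:
            signal.signal(-rc, signal.SIG_DFL)
        except (OSError, ValueError):
            pass  # the disposition of SIGKILL/SIGSTOP cannot be changed
        os.kill(os.getpid(), -rc)
        sys.exit(128 - rc)  # fallback if the signal did not terminate us
    sys.exit(rc)


# Only act as a wrapper when a job command line was actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

The cleanup runs in a finally block so that it happens however the wait ends. The SIGTERM handler only forwards the signal, so the wrapper itself stays alive until the job has exited.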

+----------------------------------+---------------------------------+
|            Jaime Frey            |  Public Split on Whether        |
|        jfrey@xxxxxxxxxxx         |  Bush Is a Divider              |
|  http://www.cs.wisc.edu/~jfrey/  |         -- CNN Scrolling Banner |
+----------------------------------+---------------------------------+