
Re: [Condor-users] how to terminate jobs automatically



On Thu, Jul 22, 2004 at 11:47:44AM +0100, Dr Ian C. Smith wrote:
> --On 22 July 2004 10:53 +0100 Matt Hope <Matt.Hope@xxxxxxxx> wrote:
> 
> >If your jobs will only ever terminate normally or in response to a vacate
> >at the end of the day, then trap the vacate signal and exit
> >immediately - this will be treated as a successful vacate.
> 
> I'm using a .bat file which writes to stdout at the moment as a simple
> example so I don't know how it should trap the vacate signal. Is there
> a way of doing this for .exe's ?
> 
> >Then set your jobs up to transfer files on vacate.
> 
> I tried
> 
> transfer_files=ON_EXIT_OR_EVICT
> 
> before but had the same problem - perhaps I need the signal handler as
> well?
> 

No, I don't think that would matter. It may depend on how your StartD policy
is configured - take a look at 
http://www.cs.wisc.edu/condor/manual/v6.6/3_6Startd_Policy.html

and in particular, the WANT_VACATE and the KILLING_TIMEOUT expressions are
probably most relevant.


> >
> >This is not perfect since the job will remain in the queue.
> >
> 
> Yes, a bit of a pain - although it would be useful if I could get them to
> pick up from where they left off next time they run.
> 
> >Alternatively have your jobs keep track of the time themselves (making
> >this time an additional argument perhaps) and have them kill themselves
> >(nicely if possible with a message to that effect) a minute or so before
> >condor would (to allow for clock differences)
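
A minimal sketch of the self-timing idea above, assuming the job is a loop
of short work steps and the cut-off time is handed in by the caller. The
names seconds_until, run and margin are made up for illustration and are
not part of Condor; the one-minute margin is the allowance for clock
differences mentioned above.

```python
import sys
from datetime import datetime, timedelta


def seconds_until(deadline, now, margin=timedelta(minutes=1)):
    """Seconds left before (deadline - margin); never negative."""
    return max(0.0, (deadline - margin - now).total_seconds())


def run(work_steps, deadline, clock=datetime.now):
    """Run steps in order; stop early once the safety margin is reached.

    `clock` is injectable so the logic can be exercised without waiting.
    Returns the number of steps that were actually executed.
    """
    done = 0
    for step in work_steps:
        if seconds_until(deadline, clock()) == 0.0:
            # Leave a message so the early exit is visible in stderr.
            print("stopping early: deadline is near", file=sys.stderr)
            break
        step()
        done += 1
    return done
```

The deadline could be passed as an extra job argument and parsed with
datetime.fromisoformat, so the time is not hard coded in the app itself.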
> 
> Yeah I had thought of that but I'm wary of having users hard code this
> kind of implementation detail in their apps in case we change things in the
> future. A signal handler would be more future proof.
> 
> >There are simple ways of doing this as well as extremely fast but
> >unpleasant ones if performance is really an issue, with sufficient
> >granularity to hit a minute no probs...
> >
> 
> On Sun Grid Engine, which is UNIX based, it is possible to send the app
> a "warning" signal to tell it to clean itself up before it sends the KILL
> signal. Is anything like that possible on Condor/Windows?
> 

We provide that today - it's a bit different, and what we send is not
documented very well in the startd policy section of the manual. When we
decide to vacate the job (either because we're preempting it with another
job, or because the machine has decided that it doesn't want to run the job
any longer), Condor will send a WM_CLOSE Win32 message to your app - if your
app wants, it can catch that message and do whatever it needs to shut itself
down. If you have WANT_VACATE = True, it will have as long as the KILL
expression is false - usually 10 minutes - before the machine goes to the
KILLING state. (If you have WANT_VACATE = False, we'd go right to the KILLING
state.) As soon as we enter the KILLING state, your app gets another WM_CLOSE
message (the same thing as if someone clicked on the 'X' in the upper
right-hand corner of the window). If it's still there after KILLING_TIMEOUT
seconds, Condor kills it hard - I'm pretty sure that if we kill a job
hard, we wouldn't transfer any files back.
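
As a rough sketch of the "catch the message and shut down" side, here is
one way an app could be structured so that a close request leads to a clean
exit. How the WM_CLOSE message actually reaches the process is not shown
and depends on the app; request_stop, run_job and on_shutdown are
hypothetical names, not Condor or Win32 APIs.

```python
import threading

# Set by whatever code receives the close request (not shown here).
stop_requested = threading.Event()


def request_stop():
    """Hypothetical hook: call this from the handler that sees the close request."""
    stop_requested.set()


def run_job(steps, on_shutdown):
    """Run steps until finished or until a stop is requested, then clean up once.

    `on_shutdown` receives the number of completed steps so it can flush
    partial output before the process exits.
    """
    completed = 0
    try:
        for step in steps:
            if stop_requested.is_set():
                break
            step()
            completed += 1
    finally:
        on_shutdown(completed)
    return completed
```

The steps need to be short compared with the time allowed before the hard
kill, otherwise the flag will not be checked in time.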

> >What you describe is not really very easy with condor since there are
> >many reasons for jobs to be vacated from a machine so knowing that it is
> >due to the time is more the responsibility of the job itself than condor...
> >
> >Not to mention the question of what to do with jobs that ran for a while,
> >were vacated due to a better job then the night ends...
> >
> >If you are running a vanilla (as opposed to standard) job, which you have
> >to be on Windows, and require the output irrespective of whether the job
> >completes successfully or not, the simplest solution is to write the
> >output you care about directly to the network / database and deal with
> >restarts directly (again, a central database for run counters etc).
> >
> >In this way you can also layer vacate-like functionality in future by
> >serializing sufficient info to restart either at regular check points or
> >in response to the vacate signal.
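
A minimal sketch of the checkpoint idea above, assuming the job's state is
small enough to serialize as JSON and that the checkpoint file lives
somewhere that survives a restart (where exactly is left open). The
function names, the state layout and the stop_after parameter (which only
simulates an interruption) are illustrative assumptions.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a kill mid-write keeps the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def process(items, path, every=10, stop_after=None):
    """Sum `items`, resuming from the checkpoint at `path` if one exists."""
    state = load_checkpoint(path)
    handled = 0
    for i in range(state["next_item"], len(items)):
        if stop_after is not None and handled >= stop_after:
            break  # simulated interruption, for demonstration only
        state["total"] += items[i]
        state["next_item"] = i + 1
        handled += 1
        if state["next_item"] % every == 0:
            save_checkpoint(path, state)  # regular checkpoint
    save_checkpoint(path, state)  # final save, e.g. on a vacate request
    return state
```

Running process a second time with the same path picks up at next_item
instead of starting over.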
> >
> >Note that the above solution has some unpleasant security connotations
> >you may not be able to accept.
> 
> We're stuck with the vanilla universe so I guess this precludes having
> Condor connect the I/O to a shadow running on another machine. I wouldn't
> fancy writing my own version of this - plus I doubt we could live with
> the security problems.
> 

See CHIRP:

http://www.cs.wisc.edu/condor/chirp/

-Erik