
Re: [Condor-users] behavior of condor_master in a glidein-like condor pool



On 04/10/2012 08:33 PM, Yu Huang wrote:
> Because qsub jobs have a time limit (say 24 hours), I instruct the
> condor_master daemon to expire after 23.8 hours (23.8 x 60 = 1428
> minutes). The condor_master command line is usually something like
> "condor_master -f -r 1428".
> 
> One thing I'm desperate to find out: when the condor_master on a
> slave node (not the host) expires, what happens to the jobs that are
> still running? Some time ago I remember seeing a doc that said all
> the jobs would keep running. Could any of you confirm that, or the
> opposite? My impression so far is that most jobs on an expired node
> die immediately (although I did see some mangled output, due to more
> than one job writing to the same file).

Yu,

I don't remember the exact config we used in your case, but I think you
want to set the -r time to (24 hours - max job walltime), converted to
minutes. For example, if your longest job takes 6 hours, set it to 1080
(18 * 60). The startd will then shut down gracefully, which means it
finishes the job it is currently running (the slot shows up as
"Retiring" in condor_status).

In your current setup, running jobs only get 12 minutes (24 hours minus
23.8 hours) to finish before SGE kills the Condor daemons, which means
those jobs will have to restart somewhere else.

-- 
Mats Rynge
USC/ISI - Pegasus Team <http://pegasus.isi.edu>