
Re: [Condor-users] behavior of condor_master in a glidein-like condor pool

On Tue, Apr 10, 2012 at 9:12 PM, Mats Rynge <rynge@xxxxxxx> wrote:
On 04/10/2012 08:33 PM, Yu Huang wrote:
> Because qsub jobs have a time limit (say 24 hours), I instruct the
> condor_master daemon to expire after 23.8 hours (23.8 x 60 = 1428
> minutes). The condor_master command line is usually something like
> "condor_master -f -r 1428".
>
> One thing I'm desperate to find out is: when the condor_master on a
> slave node (not the host) expires, what happens to the jobs that are
> still running? Some time ago I remember seeing documentation saying
> that all the jobs keep running. Could any of you confirm that, or the
> opposite? My impression so far is that most jobs on an expired node
> die immediately (although I did see some mangled output due to more
> than one job writing to the same file).

Yu,

I don't remember the exact config we used in your case, but I think you
want to set the -r time to (24 hours - max_job_walltime). For example, if
your longest job takes 6 hours, set it to 1080 (18 * 60). The startd will
then shut down gracefully, which means it finishes the current job (the
slot shows up as "Retiring" in condor_status).
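As a sketch (untested; MAXJOBRETIREMENTTIME is the knob I believe
controls how long the startd lets a running job finish before giving up):

    # glidein startup for a 24h SGE walltime and a 6h max job walltime;
    # 1080 = (24 - 6) * 60. _CONDOR_* environment variables override
    # condor_config settings for this process tree.
    export _CONDOR_MAXJOBRETIREMENTTIME=$((6 * 3600))  # retirement window, seconds
    condor_master -f -r 1080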
There are times when a few jobs run far longer than I could ever expect,
or the node simply goes down, so it's really hard to tune that expiration
to the job running time (I also usually have a bunch of heterogeneous
jobs).
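For reference, my current qsub wrapper is essentially the following (a
simplified sketch; the config path is just a placeholder):

    #!/bin/bash
    #$ -l h_rt=24:00:00   # SGE hard walltime of 24 hours
    # run the master in the foreground so SGE tracks it; -r 1428 makes
    # it exit on its own after 23.8 hours, before SGE's 24h limit hits
    export CONDOR_CONFIG=$HOME/glidein/condor_config
    condor_master -f -r 1428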

In your current setup, jobs that are still running get only 12 minutes
to finish before SGE kills the Condor daemons.
Actually, the Condor daemons exit on their own before the SGE walltime is
reached, so SGE never comes in to kill them. The jobs that are running
exit when the Condor daemons exit (I just tried one workflow and
confirmed it).

This means that the job will
have to restart somewhere else.

The problem is that instead of those jobs disappearing from the condor
queue (as failures) and restarting elsewhere, they persist in the queue
in the "I" state. I think the host (central manager) simply has no idea
what has happened to the condor daemon on that slave node. What I hope
for is a mechanism that lets the condor daemon on the slave node (1) kill
all jobs that are still running, and (2) notify the central manager that
the jobs have been killed, so that it re-runs them elsewhere rather than
leaving them marked "I".
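Something along these lines in the glidein wrapper is what I have in mind
(an untested sketch; condor_off and job_lease_duration are my guesses at
the relevant pieces):

    #!/bin/bash
    export CONDOR_CONFIG=$HOME/glidein/condor_config  # placeholder path
    condor_master -f -r 1428 &
    # a few minutes before the master's -r timer fires, hard-kill the
    # running jobs so their claims are visibly broken, then let the
    # master exit on its own
    sleep $(( (1428 - 5) * 60 ))
    condor_off -fast -startd   # kills running jobs on this node immediately
    wait

On the submit side, a short lease (e.g. "job_lease_duration = 300" in the
submit file) should make the schedd give up on the broken claim sooner
and put the job back to Idle for matching elsewhere.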

yu
--
Mats Rynge
USC/ISI - Pegasus Team <http://pegasus.isi.edu>



--
Yu Huang
Postdoc in Nelson Freimer Lab,
Center for Neurobehavioral Genetics, UCLA
Office Phone: +1.310-794-9598
Skype ID: crocea
http://www-scf.usc.edu/~yuhuang