
[Condor-users] behavior of condor_master in a glidein-like condor pool



Hi,

I'm in a setting where I don't have root privileges on the SGE (Univa) cluster, so I have to resort to growing a Condor pool out of a bunch of qsub jobs.
The first qsub job runs condor_master with itself as the host; the following qsub jobs run condor_master with the first qsub job's node as their host. In the end, I can log into the first node and submit Condor jobs to the grown pool.
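To make this concrete, each worker's qsub script writes a small per-job configuration roughly like the following (the host name, ALLOW_WRITE pattern and paths are placeholders for my site; the first job's config is the same except that CONDOR_HOST points at itself and DAEMON_LIST also lists the COLLECTOR, NEGOTIATOR and SCHEDD):

    # sketch of the per-worker config written by the qsub script
    export CONDOR_CONFIG=$TMPDIR/condor_config    # $TMPDIR is SGE's per-job scratch dir
    cat > "$CONDOR_CONFIG" <<EOF
    # point at the node running the first qsub job (placeholder name)
    CONDOR_HOST = node001.mycluster
    # execute-only node: just the master and the startd
    DAEMON_LIST = MASTER, STARTD
    LOCAL_DIR   = $TMPDIR
    NUM_CPUS    = $NSLOTS
    START       = True
    SUSPEND     = False
    # placeholder host pattern for the cluster's nodes
    ALLOW_WRITE = *.mycluster
    EOF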

Because qsub jobs have a time limit (say 24 hours), I instruct the condor_master daemon to expire after 23.8 hours (23.8 x 60 = 1428 minutes). The condor_master command line is usually something like "condor_master -f -r 1428".
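In the qsub script I derive that number from the walltime, roughly like this (the 12-minute margin is just my own safety buffer):

    WALLTIME_HOURS=24                                         # SGE h_rt limit of the qsub job
    MARGIN_MINUTES=12                                         # stop a little before SGE kills the job
    RUN_MINUTES=$(( WALLTIME_HOURS * 60 - MARGIN_MINUTES ))   # 24*60 - 12 = 1428
    exec condor_master -f -r $RUN_MINUTES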

One thing I'm desperate to find out: when the condor_master on a slave node (not the host) expires, what happens to the jobs that are still running there? Some time ago I remember seeing documentation saying that all the jobs keep running. Could anyone confirm that, or the opposite? My impression so far is that most jobs on the expired node die immediately (although I did see some mangled output caused by more than one job writing to the same file).

If the jobs on that slave do keep running after condor_master expires, could I configure condor_master to kill all jobs running on that node right before, or at the same moment as, its expiration? I played with the PREEMPT/KILL policy to no avail. Secondly, could I get Condor on the host machine to mark the jobs on the expired/shut-down slave machine as failed (I'm running Pegasus, which uses DAGMan) rather than putting them into the "I" (idle) state? A failure would trigger DAGMan to retry the job elsewhere, while the "I" state doesn't, which is not nice (I have to "condor_rm" them manually).
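For reference, the worker-side knobs I experimented with look roughly like this (appended to the same per-job config as above), so maybe I'm just using them the wrong way:

    # startd knobs I tried, appended to the worker's condor_config
    # (intent: when the startd shuts down, skip retirement and graceful
    #  vacate and hard-kill the jobs right away)
    cat >> "$CONDOR_CONFIG" <<EOF
    PREEMPT              = False
    MAXJOBRETIREMENTTIME = 0
    WANT_VACATE          = False
    KILL                 = True
    EOF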

Any information will be greatly appreciated.
Thanks,
yu