Re: [Condor-users] behavior of condor_master in a glidein-like condor pool



On 04/10/2012 11:33 PM, Yu Huang wrote:
Hi,

I'm working on an SGE (=Univa) cluster where I don't have root
privileges, so I have to resort to growing a Condor pool out of a bunch
of qsub jobs.
The first qsub job runs condor_master with itself as host. The
following qsub jobs run condor_master with the node of the first
qsub job as their host. In the end, I can log into the first node and
submit Condor jobs to the grown pool.

Because qsub jobs have a time limit (say 24 hours), I instruct the
condor_master daemon to expire after 23.8 hours (23.8 x 60 = 1428
minutes). The condor_master command line is usually "condor_master
-f -r 1428".
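
A minimal sketch of the arithmetic above, assuming the batch limit is known in hours and a safety margin is chosen in minutes. The function names are hypothetical; only the "-f" and "-r" options come from the command line quoted above.

```python
import math


def master_runtime_minutes(walltime_hours, margin_minutes):
    """Minutes to pass to 'condor_master -r' so it exits before the batch limit."""
    total = walltime_hours * 60 - margin_minutes
    if total <= 0:
        raise ValueError("margin leaves no time for condor_master to run")
    # Round down so the master never outlives the qsub job.
    return math.floor(total)


def master_command(minutes):
    """Build the argument list for a foreground condor_master with a run-time limit."""
    return ["condor_master", "-f", "-r", str(minutes)]


if __name__ == "__main__":
    # 24-hour limit with a 12-minute margin gives the 1428 minutes used above.
    print(" ".join(master_command(master_runtime_minutes(24, 12))))
```

With a 24-hour limit and a 12-minute margin this reproduces the 1428-minute value; a larger margin may be worth considering if jobs need time to shut down cleanly.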

One thing I'm desperate to find out is what happens to the jobs that
are still running when the condor_master on a slave node (not the host)
expires. Some time ago I remember seeing some documentation saying that
all the jobs will keep running. Could any of you confirm that, or the
opposite? My "impression" so far is that most jobs on the expired node
die immediately (although I did see some mangled output caused by more
than one job writing to the same file).

If the jobs on that slave would still run after condor_master expires,
could I configure condor_master to kill all jobs running on that node
just before, or at the same time as, condor_master expires? I played
with the PREEMPT/KILL policy to no avail. Secondly, could I have Condor
on the host machine mark the jobs on the expired/shut-down slave
machine as "Failure" (I'm running Pegasus, which uses DAGMan) rather
than putting them into the "I" (idle) state? The "Failure" state would
trigger DAGMan to retry the job elsewhere, while the "I" state doesn't
trigger that, which is not nice (as I have to manually "condor_rm"
them).

Any information will be greatly appreciated.
Thanks,
yu

So I'm clear on this, you -
 0) qsub a condor_master that has DAEMON_LIST=MASTER,COLLECTOR,NEGOTIATOR,SCHEDD
 1) get location of (0)
 2) qsub a number of condor_masters that have DAEMON_LIST=MASTER,STARTD & COLLECTOR_HOST=(1)
 3) login to (1) and condor_submit jobs
?
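
For illustration only, the two configurations in steps (0) and (2) could be generated along these lines. This is a sketch under assumptions: the function name and the "node01" host name are hypothetical, and only the DAEMON_LIST and COLLECTOR_HOST settings come from the steps above.

```python
def render_local_config(role, head_host):
    """Return illustrative config text for a 'head' or 'worker' node."""
    daemons = {
        "head": ["MASTER", "COLLECTOR", "NEGOTIATOR", "SCHEDD"],  # step (0)
        "worker": ["MASTER", "STARTD"],                           # step (2)
    }
    if role not in daemons:
        raise ValueError("role must be 'head' or 'worker'")
    lines = ["DAEMON_LIST = " + ", ".join(daemons[role])]
    if role == "worker":
        # Workers report to the collector on the first node, found in step (1).
        lines.append("COLLECTOR_HOST = " + head_host)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "node01" stands in for whatever host the first qsub job landed on.
    print(render_local_config("head", "node01"))
    print(render_local_config("worker", "node01"))
```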

Condor is extremely responsible with the resources it is given. When the condor_master shuts down, it will shut down all the jobs running under it. If it doesn't, there's a bug.

Best,


matt