
Re: [Condor-users] behavior of condor_master in a glidein-like condor pool



Side note: I confirmed that holding the job and then releasing it on the next period doesn't get the job to run; it's still stuck in the "I" state. The only way to get these jobs to run again is to remove them on the next period.

On Wed, Apr 11, 2012 at 1:59 AM, Yu Huang <polyactis@xxxxxxxxx> wrote:
https://www.racf.bnl.gov/docs/sw/condor/dealing-with-evicted-jobs offers a solution. These "I" jobs were basically evicted/preempted by their condor_master when it expired (fast shutdown). For whatever reason (this may be a bug in Condor: it might be treating these vanilla-universe jobs as standard-universe jobs and trying to checkpoint and recover them elsewhere), they remain in the queue and never run again. Specify "periodic_hold" (or "periodic_remove" directly) in the job submit file to hold these "I" jobs the next time the schedd evaluates periodic job actions (hold, remove, release; the evaluation interval is controlled by PERIODIC_EXPR_INTERVAL, which defaults to 60 seconds).

periodic_hold = (NumJobStarts >= 1 && JobStatus == 1)

The original link mislabels "periodic_hold" as "PeriodicHold" (that spelling may work as well), but http://research.cs.wisc.edu/condor/manual/v7.7/condor_submit.html#SECTION0010474000000000000000 states that it should be "periodic_hold". "NumJobStarts >= 1" means the job has run at least once, and "JobStatus == 1" means the job is in the "I" (Idle) state.
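Before turning the policy on, you can check which jobs the expression would match; the query below is just a sketch that mirrors the expression above:

condor_q -constraint '(NumJobStarts >= 1 && JobStatus == 1)'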

In my case, "periodic_remove = (NumJobStarts >= 1 && JobStatus == 1)" is preferred because I want these "stuck" jobs to be removed on the next period, and DAGMan will then re-submit them automatically (dagman.retry=3).
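For illustration, a minimal vanilla-universe submit file carrying this policy might look like the sketch below (the executable and file names are placeholders, not from my actual workflow):

universe        = vanilla
executable      = my_job.sh
output          = my_job.out
error           = my_job.err
log             = my_job.log
# remove the job once it has started at least once but has dropped back to Idle
periodic_remove = (NumJobStarts >= 1 && JobStatus == 1)
queue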

Interestingly enough, there's a shortcut. Instead of adding this line to every job submit file, you can set its counterpart in the daemon config file (only on the central manager): "SYSTEM_PERIODIC_REMOVE = (NumJobStarts >= 1 && JobStatus == 1)", based on https://lists.cs.wisc.edu/archive/condor-users/2008-February/msg00275.shtml. I tested it and it worked. I have no idea what the "SYSTEM_" prefix means, and I couldn't find anywhere in the official documentation saying this usage is valid.
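For reference, this is roughly what the config-file version looks like (a sketch; as far as I can tell SYSTEM_PERIODIC_REMOVE is evaluated by the schedd, which runs on the central manager in my setup, and PERIODIC_EXPR_INTERVAL is included only to make the evaluation period explicit):

# remove any job that has started at least once but is back in the Idle state
SYSTEM_PERIODIC_REMOVE = (NumJobStarts >= 1 && JobStatus == 1)
# how often (in seconds) the schedd evaluates periodic expressions (default 60)
PERIODIC_EXPR_INTERVAL = 60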

A few other useful links:
http://spinningmatt.wordpress.com/category/classads/
http://etutorials.org/Linux+systems/cluster+computing+with+linux/Part+III+Managing+Clusters/Chapter+15+Condor+A+Distributed+Job+Scheduler/15.2+Using+Condor/
http://spinningmatt.wordpress.com/2009/12/05/cap-job-runtime-debugging-periodic-job-policy-in-a-condor-pool/
https://lists.cs.wisc.edu/archive/condor-users/2008-February/msg00275.shtml

yu


On Tue, Apr 10, 2012 at 9:30 PM, Yu Huang <polyactis@xxxxxxxxx> wrote:


On Tue, Apr 10, 2012 at 9:12 PM, Mats Rynge <rynge@xxxxxxx> wrote:
On 04/10/2012 08:33 PM, Yu Huang wrote:
> Because qsub jobs have a time limit (say 24 hours), I instruct the
> condor_master daemon to expire after 23.8 hours (= 23.8 x 60 = 1428
> minutes). Usually the condor_master command line is something like
> "condor_master -f -r 1428".
>
> One thing I'm desperate to find out is: when the condor_master on a
> slave node (not the host) expires, what happens to the jobs that are
> still running? Some time ago I remember seeing a doc saying that all
> the jobs will keep running. Could any of you confirm that, or the
> opposite? My "impression" so far is that most jobs on that expired node
> die immediately (although I did see some mangled output due to more
> than one job writing to the same file).

Yu,

I don't remember the exact config we used in your case, but I think you
want to set the -r time to 24 hours minus the maximum job walltime. For
example, if your longest job takes 6 hours, set it to 1080 (18*60). The
startd will then shut down gracefully, which means finishing the current
job (this shows up as "Retiring" in condor_status).
There are times when a few jobs run far longer than I could expect, or the node just goes down, so it's really hard to tune that expiration to the job running time (also, I usually have a bunch of heterogeneous jobs).

In your current setup, jobs which are running only get 12 minutes to
finish before SGE kills the Condor daemons.
The Condor daemons exit on their own before the SGE walltime is reached, so SGE never comes in and kills them.
The jobs that are running exit when the Condor daemon exits (I just tried one workflow and confirmed it).

This means that the job will
have to restart somewhere else.

The problem is that instead of those jobs disappearing (failing) from the condor queue and restarting elsewhere, they persist in the queue in the "I" state. I think the host (central manager) just has no idea what has happened to the condor daemon on that slave node. What I hope to get is a mechanism that lets the condor daemon on the slave node 1. kill all jobs that are still running; 2. notify the central manager that the jobs are (being) killed, so that the central manager will try to re-run them elsewhere (rather than mark them "I").

yu
--
Mats Rynge
USC/ISI - Pegasus Team <http://pegasus.isi.edu>



--
Yu Huang
Postdoc in Nelson Freimer Lab,
Center for Neurobehavioral Genetics, UCLA
Office Phone: +1.310-794-9598
Skype ID: crocea
http://www-scf.usc.edu/~yuhuang


