
Re: [Condor-users] behavior of condor_master in a glidein-like condor pool





On Wed, Apr 11, 2012 at 2:52 PM, Yu Huang <polyactis@xxxxxxxxx> wrote:

http://research.cs.wisc.edu/condor/manual/v7.7/4_1Condor_s_ClassAd.html is really not that helpful. I have yet to find a list of ClassAd attributes for a machine, or for a job. I know users can define whatever attributes they want, but is the only way to find out all of them through "condor_q -long" for a job and "condor_status -long" for a machine?
I found http://research.cs.wisc.edu/condor/manual/v7.7/10_Appendix_A.html; Appendix A lists all the attributes for jobs and machines. Not that bad.
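
For example (the job ID and machine name below are just placeholders), the full ClassAds can be dumped with:

condor_q -long 123.0         # every attribute in the job ad of job 123.0
condor_status -long node042  # every attribute in the machine ad(s) of node042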
 

Now I do find the definitions of SYSTEM_PERIODIC_HOLD and related settings in the condor_schedd configuration section, http://research.cs.wisc.edu/condor/manual/v7.7/3_3Configuration.html#SECTION004311000000000000000

I hope the Condor documentation can stress and elaborate some key concepts before jumping into the "User's Manual" and "Administrator's Manual".

thanks
yu
 
Best,


matt

http://spinningmatt.wordpress.com/2011/07/04/getting-started-submitting-jobs-to-condor/


On 04/11/2012 04:59 AM, Yu Huang wrote:
https://www.racf.bnl.gov/docs/sw/condor/dealing-with-evicted-jobs offers
a solution. These "I" jobs were basically evicted/preempted by their
condor_master before it expired (fast shutdown). For whatever reason
(this may be a bug in Condor: it might be treating these vanilla-universe
jobs as standard-universe jobs and trying to checkpoint and recover them
elsewhere), they remain in the queue and never run again. Specify
"periodic_hold" (or "periodic_remove" directly) in the job submit file to
hold these "I" jobs the next time the schedd evaluates periodic job
actions (hold, remove, release; the evaluation interval is controlled by
PERIODIC_EXPR_INTERVAL=60).

periodic_hold = (NumJobStarts >= 1 && JobStatus == 1)

The original link labels "periodic_hold" as "PeriodicHold" (which may
work as well), but
http://research.cs.wisc.edu/condor/manual/v7.7/condor_submit.html#SECTION0010474000000000000000 states
that the submit command is "periodic_hold". "NumJobStarts >= 1" means the
job has started at least once, and "JobStatus == 1" means the job is in
the "I" (Idle) state.
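
As a sketch, the line goes into an otherwise ordinary submit description file; the executable and file names here are just placeholders:

universe      = vanilla
# placeholder executable and output files
executable    = my_job.sh
output        = my_job.out
error         = my_job.err
log           = my_job.log
# hold the job if it has already started at least once but is sitting Idle again
periodic_hold = (NumJobStarts >= 1 && JobStatus == 1)
queue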

In my case, "periodic_remove = (NumJobStarts >= 1 && JobStatus == 1)" is
preferred because I want this type of "stuck" job to be removed at the
next evaluation, after which DAGMan re-submits it automatically (dagman.retry=3).
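
For a hand-written DAG, the retry counterpart would look roughly like this (node names and submit file names are placeholders):

# my.dag (placeholder names), submitted with: condor_submit_dag my.dag
JOB A a.submit
JOB B b.submit
PARENT A CHILD B
# let DAGMan re-submit a node up to 3 times if it fails
RETRY A 3
RETRY B 3

With RETRY in place, DAGMan re-submits a failed node automatically, which I believe is what the dagman.retry=3 setting above maps to in my setup.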

Interestingly enough, there is a shortcut for this. Instead of adding this
line to every job submit file, you can set its counterpart in the daemon
configuration file (only on the central manager), "SYSTEM_PERIODIC_REMOVE =
(NumJobStarts >= 1 && JobStatus == 1)", based on
https://lists.cs.wisc.edu/archive/condor-users/2008-February/msg00275.shtml.
I tested it and it worked. I have no idea what the "SYSTEM_" prefix means,
and I could not find anywhere in the official doc saying this usage is valid.
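
A minimal sketch of the configuration, assuming the settings go into the local config file read by the daemons (the exact file location depends on the install):

# how often the schedd evaluates periodic expressions, in seconds
PERIODIC_EXPR_INTERVAL = 60
# remove any job that has started at least once but has fallen back to Idle
SYSTEM_PERIODIC_REMOVE = (NumJobStarts >= 1 && JobStatus == 1)

followed by condor_reconfig (or restarting the daemons) so the schedd picks up the change.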

a few other useful links:
http://spinningmatt.wordpress.com/category/classads/
http://etutorials.org/Linux+systems/cluster+computing+with+linux/Part+III+Managing+Clusters/Chapter+15+Condor+A+Distributed+Job+Scheduler/15.2+Using+Condor/
http://spinningmatt.wordpress.com/2009/12/05/cap-job-runtime-debugging-periodic-job-policy-in-a-condor-pool/
https://lists.cs.wisc.edu/archive/condor-users/2008-February/msg00275.shtml

yu

On Tue, Apr 10, 2012 at 9:30 PM, Yu Huang <polyactis@xxxxxxxxx> wrote:

   On Tue, Apr 10, 2012 at 9:12 PM, Mats Rynge <rynge@xxxxxxx> wrote:

        On 04/10/2012 08:33 PM, Yu Huang wrote:
        > Because qsub jobs have a time limit (say 24 hours), I instruct the
        > condor_master daemon to expire after 23.8 hours (23.8 x 60 = 1428
        > minutes). Usually the condor_master command line is like
        > "condor_master -f -r 1428".
        >
        > One thing I'm desperate to find out is: when the condor_master on a
        > slave node (not the host) expires, what happens to the jobs that are
        > still running? Some time ago I remember seeing some documentation say
        > that all the jobs will keep running. Could any of you confirm that,
        > or the opposite? My "impression" so far is that most jobs on an
        > expired node die immediately (although I did see some mangled output
        > due to more than one job writing to the same file).

       Yu,

        I don't remember the exact config we used in your case, but I think
        you want to try to set the -r time to 24 - max_job_walltime. For
        example, if your longest job takes 6 hours, set it to 1080 (18*60).
        The startd will then shut down gracefully, which means finishing the
        current job (shows up as "Retiring" in condor_status).

   There are times when a few jobs run far longer than I could expect,
   or the node just goes down, so it's really hard to set that expiration
   attuned to the job running time (also, I usually have a bunch of
   heterogeneous jobs).


       In your current setup, jobs which are running only get 12 minutes to
       finish before SGE kills the Condor daemons.

   The Condor daemons exit by themselves before the SGE walltime is
   reached, so SGE never comes in and kills them. The jobs that are
   running exit when the Condor daemon exits (I just tried one workflow
   and confirmed it).

       This means that the job will
       have to restart somewhere else.

   The problem is that instead of those jobs disappearing (failing) from
   the Condor queue and restarting elsewhere, they persist in the queue
   in the "I" state. I think the host (central manager) just has no idea
   what has happened to the Condor daemon on that slave node. What I hope
   to get is a mechanism to let the Condor daemon on the slave node
   (1) kill all jobs that are running, and (2) notify the central manager
   that the jobs are (being) killed, so that the central manager will try
   to re-run them elsewhere (rather than marking them "I").

   yu

       --
       Mats Rynge
       USC/ISI - Pegasus Team <http://pegasus.isi.edu>




   --
   Yu Huang
   Postdoc in Nelson Freimer Lab,
   Center for Neurobehavioral Genetics, UCLA
   Office Phone: +1.310-794-9598

   Skype ID: crocea
   http://www-scf.usc.edu/~yuhuang




--
Yu Huang
Postdoc in Nelson Freimer Lab,
Center for Neurobehavioral Genetics, UCLA
Office Phone: +1.310-794-9598
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxedu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/




--
Yu Huang
Postdoc in Nelson Freimer Lab,
Center for Neurobehavioral Genetics, UCLA
Office Phone: +1.310-794-9598
Skype ID: crocea
http://www-scf.usc.edu/~yuhuang