
Re: [Condor-users] behavior of condor_master in a glidein-like condor pool



A job on a node that gets shut down and goes back to Idle should be re-run. Not re-running sounds like a bug. There have been a good number of fixes in the schedd/shadow and startd/starter code during 7.6 and 7.7. Would you try a current version and see if you can put together a reproducer?

FYI, periodic_hold is a command in the condor_submit language and PeriodicHold is the name of the corresponding job attribute (and a synonym for periodic_hold in the condor_submit language).
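
For example (just a sketch; the job id 12.0 is made up), the submit
command and the attribute it produces look roughly like this:

# in the submit description file
periodic_hold = (NumJobStarts >= 1 && JobStatus == 1)

# after submitting, the job ClassAd should carry the same expression
# under the attribute name PeriodicHold
condor_q -l 12.0 | grep PeriodicHold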

Best,


matt

http://spinningmatt.wordpress.com/2011/07/04/getting-started-submitting-jobs-to-condor/

On 04/11/2012 04:59 AM, Yu Huang wrote:
https://www.racf.bnl.gov/docs/sw/condor/dealing-with-evicted-jobs offers
a solution. These "I" jobs were basically evicted/preempted by their
condor_master before it expired (fast shutdown). For whatever reason
(this may be a bug in Condor: it might be treating these vanilla-universe
jobs as standard-universe jobs and trying to checkpoint and recover them
elsewhere), they remain in the queue and never run again. Specifying
"periodic_hold" (or, directly, "periodic_remove") in the job submit file
holds these "I" jobs the next time the schedd evaluates the periodic job
actions (hold, remove, release; the evaluation interval is controlled by
PERIODIC_EXPR_INTERVAL, default 60 seconds).

periodic_hold = (NumJobStarts >= 1 && JobStatus == 1)

The original link writes "PeriodicHold" instead of "periodic_hold" (which
may work as well), but
http://research.cs.wisc.edu/condor/manual/v7.7/condor_submit.html#SECTION0010474000000000000000
states that the submit-file command is "periodic_hold". "NumJobStarts >= 1"
means the job has run at least once, and "JobStatus == 1" means the job
is in the "I" (Idle) state.
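
As a quick sanity check (just a sketch), you can list the jobs the
expression would currently match before turning it into policy:

condor_q -constraint '(NumJobStarts >= 1 && JobStatus == 1)'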

In my case, "periodic_remove = (NumJobStarts >= 1 && JobStatus == 1)" is
preferred because I want this type of "stuck" job to be removed at the
next evaluation, and DAGMan will then re-submit it automatically
(dagman.retry=3).
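
(In a hand-written DAG the equivalent is a per-node RETRY line in the
.dag file; the node and submit-file names below are made up.)

JOB   my_node  my_node.sub
RETRY my_node  3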

Interestingly enough, there is a shortcut. Instead of adding this line
to every job submit file, you can set its counterpart in the daemon
config file (only on the central manager), "SYSTEM_PERIODIC_REMOVE =
(NumJobStarts >= 1 && JobStatus == 1)", based on
https://lists.cs.wisc.edu/archive/condor-users/2008-February/msg00275.shtml.
I tested it and it worked. I have no idea what the "SYSTEM_" prefix
means, and I could not find this usage documented in the official manual.
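
Concretely, what I added is roughly the following (a sketch; the
PERIODIC_EXPR_INTERVAL line just makes the default explicit, and a
condor_reconfig picks up the change afterwards):

# local config on the host running the schedd (the central manager here)
SYSTEM_PERIODIC_REMOVE = (NumJobStarts >= 1 && JobStatus == 1)
PERIODIC_EXPR_INTERVAL = 60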

A few other useful links:
http://spinningmatt.wordpress.com/category/classads/
http://etutorials.org/Linux+systems/cluster+computing+with+linux/Part+III+Managing+Clusters/Chapter+15+Condor+A+Distributed+Job+Scheduler/15.2+Using+Condor/
http://spinningmatt.wordpress.com/2009/12/05/cap-job-runtime-debugging-periodic-job-policy-in-a-condor-pool/
https://lists.cs.wisc.edu/archive/condor-users/2008-February/msg00275.shtml

yu

On Tue, Apr 10, 2012 at 9:30 PM, Yu Huang <polyactis@xxxxxxxxx> wrote:



    On Tue, Apr 10, 2012 at 9:12 PM, Mats Rynge <rynge@xxxxxxx> wrote:

        On 04/10/2012 08:33 PM, Yu Huang wrote:
         > Because qsub jobs have a time limit (say 24 hours), I instruct
         > the condor_master daemon to expire after 23.8 hours (23.8 x 60
         > = 1428 minutes). Usually the condor_master command line looks
         > like "condor_master -f -r 1428".
         >
         > One thing I'm desperate to find out is: when the condor_master
         > on the slave node (not the host) expires, what happens to the
         > jobs that are still running? Some time ago I remember seeing a
         > doc saying that all the jobs keep running. Could any of you
         > confirm that, or the opposite? My "impression" so far is that
         > most jobs on the expired node die immediately (although I did
         > see some mangled output due to more than one job writing to
         > the same file).

        Yu,

        I don't remember the exact config we used in your case, but I
        think you want to set the -r time to 24 hours minus the maximum
        job walltime. For example, if your longest job takes 6 hours, set
        it to 1080 (18 * 60). The startd will then shut down gracefully,
        which means it finishes the current job (the slot shows up as
        "Retiring" in condor_status).

    There are times when a few jobs run far longer than I could expect,
    or a node just goes down, so it's really hard to tune that expiration
    to the job running time (and I usually have a bunch of heterogeneous
    jobs).


        In your current setup, jobs which are running only get 12 minutes to
        finish before SGE kills the Condor daemons.

    The condor daemons exit on their own before the SGE walltime is
    reached, so SGE never comes in and kills them. The jobs that are
    running exit when the condor daemons exit (I just tried one workflow
    and confirmed it).

        This means that the job will
        have to restart somewhere else.

    The problem is that instead of those jobs disappearing (failing) from
    the condor queue and restarting elsewhere, they persist in the queue
    in the "I" state. I think the host (central manager) just has no idea
    what has happened to the condor daemon on that slave node. What I
    hope to get is a mechanism that lets the condor daemon on the slave
    node (1) kill all jobs that are running, and (2) notify the central
    manager that the jobs are (being) killed, so that the central manager
    re-runs them elsewhere (rather than leaving them marked "I").
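
    One thing I may try (just a sketch, not something I have tested):
    evict the running jobs explicitly a little before the -r deadline, so
    the claims are released cleanly and the schedd can match the jobs
    elsewhere:

    # run on the slave node shortly before the daemons expire
    condor_vacate        # evicted jobs should go back to Idle and be rescheduled
    condor_off -master   # then shut down the local daemons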

    yu

        --
        Mats Rynge
        USC/ISI - Pegasus Team <http://pegasus.isi.edu>








--
Yu Huang
Postdoc in Nelson Freimer Lab,
Center for Neurobehavioral Genetics, UCLA
Office Phone: +1.310-794-9598
Skype ID: crocea
http://www-scf.usc.edu/~yuhuang



_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/