
Re: [HTCondor-users] Managing evictions and reruns



ok, more info...

I just ran the master job and it was suspended. It eventually
restarted, and 4 of the worker jobs were suspended/evicted. My Condor
setup is used only for my code: all jobs are of equal rank, submitted
by the same person, and all running in the java universe. I don't know
what could be causing all of these suspensions and evictions.

Ah... just found something, maybe? "(if the job was preempted because
the machine owner came back)" - I am logged into one of the machines
via SSH - maybe that makes OWNER = True and causes the preemption? From
reading the Policy Configuration section of the manual, it sounds like
I should set
IS_OWNER = FALSE
on all my Condor machines, so that my (or anyone's?) being logged in
via SSH does not cause a job suspension/preemption. Is that right?
These machines exist only to do Condor work.
---
Brian Pipa



On Mon, Feb 4, 2013 at 5:13 PM, Brian Pipa <brianpipa@xxxxxxxxx> wrote:
> I ran, for the first time, my new job in Condor that splits the work
> into multiple worker jobs. Unfortunately, of the 178 worker jobs
> that it produced, one was suspended, unsuspended, evicted, then
> re-run (see log at the bottom). When it re-ran, it got "stuck" (i.e.,
> it hasn't finished yet, condor_wait still says it's running, and
> ps -ef shows it running). I need to set things up so that either
> #1: no jobs get evicted
> or
> #2: if a job does get evicted, do not rerun it
>
> The jobs are Java code that exec a Python script and, evidently, it
> doesn't like to be evicted/suspended then rerun.
>
> I was poking around and it looks like I can set some variables in the
> ClassAd like WANT_SUSPEND, SUSPEND, PREEMPT, WANT_VACATE, CONTINUE but
> I'm having a hard time figuring out what combination of these I
> need to set to make it do #1 or #2 above. Can anyone shed some light
> on this?
>
> It seems like I could:
> set WANT_SUSPEND to FALSE for #1 above
> or
> set CONTINUE to FALSE for #2 above
>
> but I'm just not positive. And do I set this in the job ClassAd itself?
>
> And a side note - how do I figure out why my job was evicted/suspended
> in the first place?
>
> ---
> job's log file
> ---
>> more /workspace/jobs/3150/output/273.14.log
> 000 (273.014.000) 02/04 12:40:09 Job submitted from host: <...62:52527>
> ...
> 001 (273.014.000) 02/04 12:40:10 Job executing on host: <...64:45943>
> ...
> 006 (273.014.000) 02/04 12:40:18 Image size of job updated: 4937388
>         15  -  MemoryUsage of job (MB)
>         14436  -  ResidentSetSize of job (KB)
> ...
> 010 (273.014.000) 02/04 12:42:41 Job was suspended.
>         Number of processes actually suspended: 2
> ...
> 006 (273.014.000) 02/04 12:42:41 Image size of job updated: 10673688
>         37  -  MemoryUsage of job (MB)
>         37280  -  ResidentSetSize of job (KB)
> ...
> 011 (273.014.000) 02/04 12:52:43 Job was unsuspended.
> ...
> 004 (273.014.000) 02/04 12:52:43 Job was evicted.
>         (0) Job was not checkpointed.
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>         0  -  Run Bytes Sent By Job
>         0  -  Run Bytes Received By Job
>         Partitionable Resources :    Usage  Request
>            Cpus                 :                 1
>            Disk (KB)            :     1750     1750
>            Memory (MB)          :       37       37
> ...
> 001 (273.014.000) 02/04 12:58:12 Job executing on host: <...64:45943>
> ...
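
For anyone chasing the same thing across many worker logs (178 of them here), a rough Python sketch that pulls the suspend/unsuspend/evict events out of a job log might help. It assumes the log lines look like the excerpt above (three-digit event code, job id in parentheses, date and time, then a message) and it only knows the three event codes visible in that excerpt (004, 010, 011). The function name scan_log is made up for illustration. It shows when these events happened, not why.

```python
import re
import sys
from collections import defaultdict

# Event header lines look like:
#   004 (273.014.000) 02/04 12:52:43 Job was evicted.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)

# Only the event codes that appear in the log excerpt above.
INTERESTING = {"004": "evicted", "010": "suspended", "011": "unsuspended"}


def scan_log(lines):
    """Return {job_id: [(timestamp, label), ...]} for suspend/evict events.

    job_id is "cluster.proc" with leading zeros dropped, e.g. "273.14".
    Lines that are not event headers (details, "..." separators) are skipped.
    """
    events = defaultdict(list)
    for line in lines:
        m = EVENT_RE.match(line.rstrip("\n"))
        if m and m.group(1) in INTERESTING:
            job_id = f"{int(m.group(2))}.{int(m.group(3))}"
            events[job_id].append((m.group(5), INTERESTING[m.group(1)]))
    return dict(events)


if __name__ == "__main__":
    # Usage: python scan_log.py path/to/job.log [more logs...]
    for path in sys.argv[1:]:
        with open(path) as fh:
            for job, evs in scan_log(fh).items():
                for when, what in evs:
                    print(f"{job} {when} {what}")
```

Running it over every log in the output directory gives a quick list of which jobs were touched and at what time, which can then be lined up against whatever was happening on the execute machine at that moment.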