
[HTCondor-users] Managing evictions and reruns



For the first time, I ran my new Condor job, which splits its work
across multiple worker jobs. Unfortunately, of the 178 worker jobs
it produced, one was suspended, unsuspended, evicted, and then re-run
(see the log at the bottom). When it re-ran, it got "stuck" (i.e., it
hasn't finished yet, condor_wait still says it's running, and ps -ef
shows it running). I need to set things up so that either
#1: no jobs get evicted
or
#2: if a job does get evicted, do not rerun it

The jobs are Java code that execs a Python script, and evidently it
doesn't like being suspended/evicted and then re-run.

I was poking around and it looks like I can set some variables in the
ClassAd, such as WANT_SUSPEND, SUSPEND, PREEMPT, WANT_VACATE, and
CONTINUE, but I'm having a hard time figuring out which combination of
these I need to set to get #1 or #2 above. Can anyone shed some light
on this?

It seems like I could:
set WANT_SUSPEND to FALSE for #1 above
or
set CONTINUE to FALSE for #2 above

but I'm just not positive. Also, do I set these in the job ClassAd itself?

And a side note: how do I figure out why my job was suspended/evicted
in the first place?

---
job's log file
---
> more /workspace/jobs/3150/output/273.14.log
000 (273.014.000) 02/04 12:40:09 Job submitted from host: <...62:52527>
...
001 (273.014.000) 02/04 12:40:10 Job executing on host: <...64:45943>
...
006 (273.014.000) 02/04 12:40:18 Image size of job updated: 4937388
	15  -  MemoryUsage of job (MB)
	14436  -  ResidentSetSize of job (KB)
...
010 (273.014.000) 02/04 12:42:41 Job was suspended.
	Number of processes actually suspended: 2
...
006 (273.014.000) 02/04 12:42:41 Image size of job updated: 10673688
	37  -  MemoryUsage of job (MB)
	37280  -  ResidentSetSize of job (KB)
...
011 (273.014.000) 02/04 12:52:43 Job was unsuspended.
...
004 (273.014.000) 02/04 12:52:43 Job was evicted.
	(0) Job was not checkpointed.
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	Partitionable Resources :    Usage  Request
	   Cpus                 :                 1
	   Disk (KB)            :     1750     1750
	   Memory (MB)          :       37       37
...
001 (273.014.000) 02/04 12:58:12 Job executing on host: <...64:45943>
...
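
With 178 worker logs to check, the suspend/unsuspend/evict events can be
picked out mechanically. Below is a minimal sketch in Python. It assumes
only the header-line layout visible in the log above (three-digit event
code, job id in parentheses, date, time, message) and the three codes
that appear there (004, 010, 011). The function name and the regular
expression are illustrative and not part of HTCondor. It shows when a
job was interrupted, not why.

```python
import re

# Event codes exactly as they appear in the log excerpt above.
EVENTS_OF_INTEREST = {
    "004": "evicted",
    "010": "suspended",
    "011": "unsuspended",
}

# Header line layout assumed from the excerpt:
#   CODE (cluster.proc.subproc) MM/DD HH:MM:SS message
HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def find_interruptions(log_text):
    """Return (job_id, date, time, event) tuples for suspend/evict events.

    Indented detail lines and the "..." separators do not match the
    header pattern and are skipped.
    """
    found = []
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match is None:
            continue
        code, cluster, proc, _subproc, date, time, _message = match.groups()
        if code in EVENTS_OF_INTEREST:
            # "273" and "014" become "273.14", matching the log file name.
            job_id = f"{int(cluster)}.{int(proc)}"
            found.append((job_id, date, time, EVENTS_OF_INTEREST[code]))
    return found
```

Reading one log file into a string and passing it to
find_interruptions() gives one tuple per interruption, in log order, so
the jobs that were suspended or evicted can be listed across all the
worker logs.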