
Re: [Condor-users] how to kill job when output dir removed ?



> Now you can construct an automated policy about what should happen to the
> held jobs by configuring SYSTEM_PERIODIC_RELEASE or
> SYSTEM_PERIODIC_REMOVE (or the user can configure a policy with the
> respective job policy expressions).

Ah yes! Good point. We don't use a try-again policy at Altera but you
can definitely couple a release and remove config setting with the retry
attempt counter in the jobs to limit the number of times Condor would
try and recover from a missing result directory error. Just in case it's
transient. Personally I find stupid users to be rather stable in their
continued existence. :)
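
For anyone who does want the try-again behaviour, here's a rough,
untested sketch of what I mean. It assumes NumSystemHolds is the
counter to key off, and MAX_HOLD_RETRIES is just a macro name I made up,
as are the limit of 3 and the 10-minute wait. As written it would
release every held job, including ones a user held on purpose, so a real
policy would also need a test on the hold reason.

```
# Hypothetical macro: how many system holds to tolerate before giving up.
MAX_HOLD_RETRIES = 3

# Release a held job (JobStatus == 5) once it has been held for more than
# 10 minutes, as long as it has not been held too many times already.
SYSTEM_PERIODIC_RELEASE = (JobStatus == 5) && \
    (NumSystemHolds <= $(MAX_HOLD_RETRIES)) && \
    ((CurrentTime - EnteredCurrentStatus) > 600)

# Remove a held job once it has exceeded the retry limit.
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5) && \
    (NumSystemHolds > $(MAX_HOLD_RETRIES))
```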

> I should also note that there was one case of file transfer errors not
> handled by 6.8's put-on-hold policy.  It is failure while writing the
> output to the submit machine (e.g. because the disk is full).  This has
> been fixed in Condor 6.9.1, so jobs will go on hold in this case too.

> It was difficult to judge whether this was a bug fix or a change in
> behavior (i.e. suitable for 6.8 vs. 6.9).  In the end, I decided to put
> it into 6.9.

If you're accepting lobbying for pushing this into 6.8, consider this my
request. I'd definitely like to see this as a put-on-hold situation.
Right now we're actually monitoring users' disk usage on our NAS box and
holding their jobs when they get within 1% of their hard quota.

- Ian

P.S. Did something change on the mailing list? I'm not seeing my own
emails to the list anymore.