
Re: [Condor-users] how to kill job when output dir removed ?

> I have a recurring problem here where our users submit
> files through a web interface but then inadvertently
> remove the directory the Condor input/output files
> are sitting in without killing the job first. I've
> tried all sorts of safeguards to prevent this, but they
> still seem to find a way of doing it (that's users for ya!).

Truly one of the most annoying features of Condor is the vast amount of
email it generates when a user deletes their result directory before the
jobs are out of the queue. It's at the top of my list of things in
Condor I'd like more control over: a) what Condor should do when this
happens (i.e. retry indefinitely, retry for a while, or give up and drop
the job from the queue); and b) whether it sends email about it.

To deal with this I use a SYSTEM_PERIODIC_REMOVE setting on my scheduler
that catches jobs that end up in the held state because of a missing
result directory and quietly removes them from the queue. The statement
looks like this (it does a little more than just dump jobs of the type
you described: it also removes jobs that have been running for too long):

SYSTEM_PERIODIC_REMOVE = ((JobStatus == 2) && ((CurrentTime - \
EnteredCurrentStatus) > AlteraMaxJobRunTime)) || \
( \
   (JobStatus == 5) && \
   ( \
      HoldReasonCode != 1 && \
      HoldReasonCode != 6 && \
      HoldReasonCode != 11 \
   ) \
)
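
For anyone who finds the nested parentheses hard to read, here is a
rough Python sketch of the same logic. It is only an illustration: the
function name and the dict standing in for a job ad are my own
inventions, the run-time limit is passed in as a plain number instead of
being read from AlteraMaxJobRunTime, and it ignores ClassAd UNDEFINED
semantics. Which HoldReasonCode a missing directory actually produces
may depend on your Condor version, so check a held job with condor_q -l
before copying the list of excluded codes.

```python
# Hypothetical Python mirror of the SYSTEM_PERIODIC_REMOVE expression above.
RUNNING, HELD = 2, 5           # JobStatus values used in the expression
KEEP_HELD_CODES = {1, 6, 11}   # HoldReasonCode values the expression leaves alone


def should_remove(ad, now, max_run_time):
    """Return True if the expression would remove the job described by `ad`.

    `ad` is a plain dict standing in for a job ClassAd (an assumption made
    for this sketch); `now` plays the role of CurrentTime.
    """
    status = ad["JobStatus"]
    if status == RUNNING:
        # Running longer than the site limit (AlteraMaxJobRunTime in the post).
        return (now - ad["EnteredCurrentStatus"]) > max_run_time
    if status == HELD:
        # Held for any reason other than the three excluded codes.
        return ad["HoldReasonCode"] not in KEEP_HELD_CODES
    # Idle, completed, etc. are never touched by this expression.
    return False
```

So a held job is dropped unless its hold reason is one of the three
excluded codes, and a running job is dropped only once it exceeds the
run-time limit.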

To stem the tide of email from Condor, I send all Condor-generated email
through a proxy email account running some sendmail filters. The filters
look at the headers and the body of each message and decide whether or
not to forward it to my cluster admins and/or the user who submitted the
job. I do this for all Condor-generated email, but it was made necessary
by stupid users removing result directories before their jobs were out
of the queue.
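
The forward-or-drop decision could be sketched roughly as below. This is
not my actual sendmail filter set, just a minimal Python illustration of
the idea: the function name and the single pattern are hypothetical, and
a real filter would need rules tuned to the messages your Condor version
sends.

```python
import re
from email import message_from_string

# Hypothetical pattern; substitute whatever text your unwanted mails contain.
DROP_PATTERNS = [re.compile(r"No such file or directory", re.IGNORECASE)]


def should_forward(raw_message, drop_patterns=DROP_PATTERNS):
    """Return False if the subject or body matches any drop pattern."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    payload = msg.get_payload()
    # Only plain, single-part bodies are inspected in this sketch;
    # multipart messages are judged on their subject alone.
    body = payload if isinstance(payload, str) else ""
    text = subject + "\n" + body
    return not any(p.search(text) for p in drop_patterns)
```

Anything for which should_forward() returns True would be passed on to
the admins and/or the submitting user; everything else is dropped.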

- Ian