
[Condor-users] Stopping the flood of condor emails



I have a problem: our central submit system assigns an administrative
email address to our Condor jobs, not the user's email address. So when a
user does something stupid, like removing the working directory for a
job without deleting the job, the admin address gets flooded with
emails. Case in point: I walked in this morning to 12k new messages, all
with this body:

-----------
This is an automated email from the Condor system on machine
"ttc-schedd1.altera.com".  Do not reply.

Your job 9355.108 specified an initial working directory of
/ttcbatch/experiments/mteper/weekly_test/run/sweeps/hardcopyii/no_sweep_parameter/top_general.
This directory currently does not exist.  If this directory is on a
shared filesystem, this could be just a temporary problem.  Thus I will
try again later
-----------

Based on the timestamps of the 12k+ messages in the inbox, it looks like
Condor was trying about every 20 seconds to copy data files back to this
directory, which the user had deleted. And it kept this up for two jobs
in the cluster all weekend long.

There are many questions that arise from a situation like this:

1) How can I throttle Condor so it's not trying these copy-back
operations every 20 seconds? A random exponential back-off time would be
really nice here.

2) How can I set a give-up threshold for this situation, so Condor can
eventually decide this data isn't ever going to get returned and just
toss the job?

3) How can I stop just this particular email from ever being sent? If I
can get solutions to questions 1 and 2 then I don't really want to know
about this. Ever.
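
For illustration only, the behavior asked for in questions 1 and 2 could
look roughly like the Python sketch below. This is not Condor code and
not a Condor configuration setting: the function name, its parameters,
and the default values (20-second base delay, one-hour cap, 10 attempts)
are all hypothetical, chosen just to show a randomized exponential
back-off combined with a give-up threshold.

```python
import random
import time


def retry_with_backoff(operation, sleep=time.sleep, base=20.0, factor=2.0,
                       cap=3600.0, max_attempts=10, rng=None):
    """Call operation() until it returns True or max_attempts is reached.

    Hypothetical sketch, not Condor's implementation. After each failure
    the upper bound on the wait grows by `factor` (up to `cap` seconds),
    and the actual wait is drawn uniformly from [0, bound] so that many
    failing jobs do not all retry at the same moment.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        if operation():
            return True  # e.g. the copy-back finally succeeded
        if attempt + 1 < max_attempts:
            bound = min(cap, base * factor ** attempt)
            sleep(rng.uniform(0.0, bound))
    return False  # give-up threshold reached; caller can drop the job
```

With these example defaults the wait bounds run 20, 40, 80, ... seconds
up to the one-hour cap, so a permanently missing directory would trigger
at most 10 attempts (and at most 10 emails) rather than one every 20
seconds all weekend.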

- Ian