
[Condor-users] Silly user behaviour able to cause lots of Claimed+Idle's in our system



Ran into this problem on the Friday before our long weekend in Canada
(of course). Thought I'd report it here in case someone else has an idea
of how to prevent this from happening.

We had a single, heavily loaded 6.8.0 schedd (20000 jobs across 1000 or
so clusters) feeding about 600 startd's. This is a situation we've
handled comfortably before. The problem started when a user submitted a
100-job cluster and then decided to condor_rm it. Without waiting to see
whether Condor had actually removed the jobs (remember: heavily loaded
schedd, so removal was taking a while), the user deleted the directory
where the log file for the cluster was supposed to be located (it was
on an NFS share).

The result: our schedd got caught in an awful loop trying to write to
the log file for this cluster, to the detriment of everything else it
was supposed to be doing. Job spawning stalled, and over the course of
an hour all our startd's went Claimed+Idle on us. Trying to remove the
cluster with condor_rm -forcex didn't work because the schedd just kept
complaining about not being able to open and write to the log file for
the cluster.

I tried just recreating the log file on disk for the schedd to write to,
but it didn't seem to solve the problem. The schedd continued to
complain about the file being unreachable despite wide open permissions
on the directory and the file.

We had to reboot the schedd to fix the issue. When it came back up the
cluster was removed correctly and the cluster log was written to
properly.

Bit of a nuisance to say the least. It'd be nice if Condor could handle
losing the ability to write to the cluster log file with as much grace
as it does losing the return directory for job output. Or maybe it does
and my use of NFS for the cluster log file was the problem?
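
On the user side, the trigger here was deleting the log directory while
the removal was still in flight. A guard along the following lines might
avoid that. This is only a minimal sketch: the function name
remove_log_dir_when_safe and the cluster_gone callback are hypothetical
(not part of Condor), and how you decide the cluster has really left the
queue, for example by checking condor_q for that cluster, is left to the
caller.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def remove_log_dir_when_safe(
    log_dir: Path,
    cluster_gone: Callable[[], bool],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> bool:
    """Delete log_dir only once cluster_gone() reports True.

    cluster_gone is a caller-supplied (hypothetical) check that the
    cluster has actually left the queue. Returns True if the check
    passed and the directory was removed, False if we gave up and left
    the directory in place so the schedd can still write its log.
    """
    for attempt in range(max_polls):
        if cluster_gone():
            # ignore_errors=True: a directory that is already missing
            # is not treated as a failure.
            shutil.rmtree(log_dir, ignore_errors=True)
            return True
        if attempt < max_polls - 1:
            time.sleep(poll_seconds)
    return False
```

The point of the design is simply that the directory outlives the jobs:
if the queue never confirms the removal, the function returns False and
touches nothing, which is the safe outcome on a slow schedd.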

Cheers!

- Ian

--
Ian R. Chesal <ichesal@xxxxxxxxxx>
Senior Software Engineer

Altera Corporation
Toronto Technology Center
Tel: (416) 926-8300