
[Condor-users] Stuck schedd processes



I'm looking for advice.

The pool I'm managing has 5 submit nodes, and recently 4 of them have been
hit by a problem I reported a couple of weeks ago.

There's a user (let's call him K) who tends to submit big job clusters,
with individual job specs in the submit file (30000 lines for 5400 jobs,
each entry specifying "output=", "error=", and "arguments=", followed by a
"Queue 1" line, repeated 5400 times).

Each job writes a single "status" line to its error output (located on an
NFS server) every 20 seconds or so.

"Sometimes" these jobs become unresponsive or fail, or/and K becomes 
impatient and tries to condor_rm hist jobs. Eventually, the schedd will
get stuck, will no longer respond to condor_q, and the UID of the process
will stay set to K's account.
(And looking into /proc/`pidof condor_schedd`/fd will show that FD 12 is
connected to the file K specified as "log="... no changes to the actual
file made in hours.)
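
A quick way to check this (assuming a single schedd process on the node;
the stat'ed path is a placeholder for whatever K put into "log="):

	# list the file descriptors the stuck schedd still holds open
	ls -l /proc/$(pidof condor_schedd)/fd
	# compare the user log's mtime against the current time (placeholder path)
	stat /nfs/somewhere/K-jobs.log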

At this stage, "/etc/init.d/condor restart" is of no help, and of course
no "condor_rm" or "condor_hold" action succeeds anymore either.

Instead, what I have found to work is:
	/etc/init.d/condor stop
	sleep 300
	killall -TERM condor_master
	sleep 300
Then (very important!) rename the file K specified as "log=", and only then
	/etc/init.d/condor start
and watch - the last time I looked, it took 40 minutes for the condor_schedd
process to be owned by the "condor" user again; the SchedLog(s) contain
a few tens of MB of removal messages, and the spool/history rewrite takes
a very long time.
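
For my co-admins, here's the whole sequence as a script sketch (untested as
a unit; the user log path is a placeholder - substitute whatever the stuck
jobs had in their "log=" line):

	#!/bin/sh
	# recovery sketch for a stuck schedd
	/etc/init.d/condor stop
	sleep 300
	# in case the master survived the init script, kill it explicitly
	killall -TERM condor_master
	sleep 300
	# very important: move the offending user log out of the way (placeholder path)
	mv /nfs/somewhere/K-jobs.log /nfs/somewhere/K-jobs.log.stuck
	/etc/init.d/condor start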

During this recovery procedure it's virtually impossible to disable any
queued jobs (because the usual mechanisms don't work), so one may run into
the same situation again.
Worse yet: user S happened to submit his own set of jobs shortly after K
had submitted his - and wasn't able to see them anymore, so he went on
to another machine and submitted them again. When the crashed schedd was
restarted, his first set of jobs was matched and overwrote results that
already existed.

Is there a way to
- restart a schedd with actual matching disabled
- specify "log=" (and the other files) in such a way that overwriting wouldn't
	happen (e.g. by prefixing the machine name: log=$(Schedd)-condor.log)
- manage jobs in the queue while the schedd isn't fully running
	(moving the spool/ entries around doesn't do the trick completely)
?
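
(Regarding the second point, here's roughly what I have in mind - $ENV() and
the $(Cluster)/$(Process) macros should be usable in the submit file, though
I haven't verified that this actually avoids the clash, and it assumes
HOSTNAME is exported in the submit environment:)

	# per-host, per-job file names - untested sketch
	log    = $ENV(HOSTNAME)-condor_$(Cluster).log
	output = out.$(Cluster).$(Process)
	error  = err.$(Cluster).$(Process)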

Which ways to debug the situation would you recommend?
- set SCHEDD_DEBUG to full debug output, and MAX_SCHEDD_LOG to a huge value
	(see the sketch below)
- attach strace or a debugger to the condor_schedd process when it's stuck again
- anything else?
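
For the first two bullets, I'm thinking of something along these lines
(names from memory - corrections welcome):

	# condor_config.local on the affected submit node
	# (D_ALL instead of D_FULLDEBUG would be even more verbose)
	SCHEDD_DEBUG   = D_FULLDEBUG
	MAX_SCHEDD_LOG = 1000000000

	# once the schedd hangs again, from a root shell:
	strace -f -p $(pidof condor_schedd)
	gdb -p $(pidof condor_schedd)    # then e.g. "bt" for a backtrace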

Please keep your responses on-list (I'll be AFK for a while, and my "proxies"
would like to know as well).

Thanks in advance,
 Steffen

-- 
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html