[Condor-users] condor_schedd slowness causing job leases to expire
- Date: Wed, 19 Mar 2008 13:02:02 -0400
- From: "Robert E. Parrott" <parrott@xxxxxxxxxxxxxxxx>
- Subject: [Condor-users] condor_schedd slowness causing job leases to expire
I'm seeing what appear to be some jobs that are ended due to slow
responsiveness of the condor_schedd.
In particular, one user's parallel job was terminated when another
user submitted something like 1500 parallel jobs all at once.
The condor_schedd became unresponsive, and condor_q reported that the
condor_schedd didn't respond for a time.
This was with Condor v7.0.1 on the head node and v6.8.5 on the compute nodes.
So I'm looking for the following:
1) Workarounds for the startd on the compute nodes, so that a slow
condor_schedd will not cause lease terminations like this (or will do
so only after a long timeout period)
2) Fixes for handling larger numbers of parallel jobs.
Any suggestions here (with #1 being highest priority)?
Robert E. Parrott, Ph.D. (Phys. '06)
Associate Director, Grid and
Project Manager, CrimsonGrid Initiative
Harvard University Sch. of Eng. and App. Sci.
33 Oxford St.
Cambridge, MA 02138