
[Condor-users] condor_schedd slowness causing job leases to expire



Hi Folks,

I'm seeing what appear to be some jobs that are being terminated due to slow responsiveness of the condor_schedd.

In particular, one user's parallel job was terminated when another user submitted something like 1500 parallel jobs all at once.

The condor_schedd became unresponsive, and condor_q reported that the condor_schedd didn't respond for a time.

This was with Condor 7.0.1 on the head node and 6.8.5 on the compute nodes.

So I'm looking for the following:

1) Workarounds for the startd on the compute nodes, so that a slow condor_schedd will not cause lease terminations like this (or will cause them only after a longer timeout period)

2) Fixes for handling larger numbers of parallel jobs.

Any suggestions here (with #1 being highest priority)?

thanks,
rob

==========================
Robert E. Parrott, Ph.D. (Phys. '06)
Associate Director, Grid and
       Supercomputing Platforms
Project Manager, CrimsonGrid Initiative
Harvard University Sch. of Eng. and App. Sci.
Maxwell-Dworkin  211,
33 Oxford St.
Cambridge, MA 02138
(617)-495-5045