
Re: [Condor-users] Understanding Condor's "Claimed" state

On Wed, 29 Jun 2011, Jeff Ramnani wrote:


I have a Condor pool with 10 dedicated compute nodes, and I'm having an issue getting people's jobs scheduled the way I want. Here's what's happening.

If user1 has submitted a large batch of jobs, then users that submit jobs after them aren't getting scheduled until after the first user's jobs are completed, even if the users who came later have better priorities.

One difference I've seen when this happens is that user1, who submits the large batch, does so as one job cluster containing many jobs (in this example, say 100 jobs, e.g. 1.0 .. 1.99), while user2, who has a better priority but submits later, does so as many clusters with one job each (e.g. 2.0, 3.0, 4.0).

Here's an example output of condor_q:

54713.0    user1         6/29 14:00   0+00:02:27 R  0   0.0   user1_job.py 2011.0
54713.1    user1         6/29 14:00   0+00:02:19 R  0   122.1 user1_job.py 2011.0
54713.2    user1         6/29 14:00   0+00:02:14 R  0   0.0   user1_job.py 2011.0
54713.3    user1         6/29 14:00   0+00:02:06 R  0   0.0   user1_job.py 2011.0
54713.99   user1         6/29 14:00   0+00:00:00 R  0   0.0   user1_job.py 2011.0
54488.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh
54489.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh
54490.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh
54491.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh
54492.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh

User2 has a better priority, so I would expect user2's job 54488.0 to be scheduled on the first available machine when one of user1's jobs is completed, but that's not what's happening. It seems like user1 has a "claim" on the machines that lasts longer than an individual job. I've read the following manual pages:
and I'm still not 100% sure I understand how jobs are scheduled in this situation. I've found the configuration setting for CLAIM_WORKLIFE in the manual which states, "If provided, this expression specifies the number of seconds during which a claim will continue accepting new jobs." This leads me to the following questions.

* How long does a user's "claim" last on a machine?

Others know more of the technical detail here, but if CLAIM_WORKLIFE
is not set, the claim lasts indefinitely unless preemption kicks the
user off. If you set CLAIM_WORKLIFE, the claim stops accepting new jobs
once the number of seconds you specify has passed.

* Does a job cluster keep a "claim" open on a machine until all its jobs are completed?

The size of the cluster doesn't make a difference.
If a node is claimed by a single user, then as long as that user
has jobs in the queue it will keep executing that user's jobs,
whether they are in one cluster or many.

CLAIM_WORKLIFE is a very valuable setting.  I set it to 3600 seconds.
I've never quite understood why the default is infinity.
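
As a minimal sketch of that setting, assuming it goes in the local
configuration file on each execute node (where that file lives depends
on your install, and 3600 is just the value I use, not a recommendation
for every pool):

```
# Stop a claim from accepting new jobs once it is an hour old.
# A job already running under the claim is not cut short by this;
# the claim simply takes no further jobs after the limit.
CLAIM_WORKLIFE = 3600
```

After editing, running condor_reconfig on the node should make the
daemons pick up the change, and condor_config_val CLAIM_WORKLIFE should
show the value in effect. Once claims start expiring, freed machines go
back through negotiation, which is where user2's better priority would
come into play.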


Any help is appreciated.


Jeff Ramnani


Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.