
[Condor-users] Understanding Condor's "Claimed" state


I have a Condor pool with 10 dedicated compute nodes, and I'm having an issue getting people's jobs scheduled the way I want.  Here's what's happening.

If user1 submits a large batch of jobs, then users who submit jobs after them don't get scheduled until the first user's jobs have completed, even if the later users have better priorities.

One difference I've noticed when this happens is that user1, who submits the large batch, does so as a single job cluster containing many jobs (in this example, say 100 jobs: 1.0 .. 1.99), while user2, who has a better priority but submits later, does so as many clusters of one job each (e.g. 2.0, 3.0, 4.0).
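For context, I imagine the two submit styles look roughly like this (a sketch -- the real submit files differ, and the filenames are just the ones from my condor_q output below):

    # user1's submit description (sketch): one cluster, 100 procs
    universe   = vanilla
    executable = user1_job.py
    arguments  = 2011.0
    queue 100

    # user2's submit description (sketch): run once per condor_submit,
    # so each submission creates its own single-job cluster
    universe   = vanilla
    executable = user2_job.sh
    queue 1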

Here's an example output of condor_q:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
54713.0    user1         6/29 14:00   0+00:02:27 R  0   0.0  user1_job.py 2011.0
54713.1    user1         6/29 14:00   0+00:02:19 R  0   122.1 user1_job.py 2011.0
54713.2    user1         6/29 14:00   0+00:02:14 R  0   0.0  user1_job.py 2011.0
54713.3    user1         6/29 14:00   0+00:02:06 R  0   0.0  user1_job.py 2011.0
54713.99   user1         6/29 14:00   0+00:00:00 R  0   0.0  user1_job.py 2011.0
54488.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh           
54489.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh           
54490.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh           
54491.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh           
54492.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh

User2 has a better priority, so I would expect user2's job 54488.0 to be scheduled on the first available machine when one of user1's jobs completes, but that's not what's happening.  It seems like user1 has a "claim" on the machines that lasts longer than an individual job.  I've read the relevant manual pages, and I'm still not 100% sure I understand how jobs are scheduled in this situation.  I did find the configuration setting CLAIM_WORKLIFE in the manual, which states: "If provided, this expression specifies the number of seconds during which a claim will continue accepting new jobs."  This leads me to the following questions.

* How long does a user's "claim" last on a machine?
* Does a job cluster keep a "claim" open on a machine until all its jobs are completed?
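If CLAIM_WORKLIFE is indeed the knob I'm after, I'm guessing the fix would be a config fragment along these lines on the execute nodes (the value here is just a guess on my part):

    # condor_config on the execute nodes (example value):
    # stop accepting new jobs on a claim after 20 minutes,
    # forcing renegotiation so better-priority users get slots sooner
    CLAIM_WORKLIFE = 1200

Please correct me if that's not how it's meant to be used.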

Any help is appreciated.

Jeff Ramnani 