Re: [Condor-users] Understanding Condor's "Claimed" state
- Date: Wed, 29 Jun 2011 15:45:08 -0500 (CDT)
- From: Steven Timm <timm@xxxxxxxx>
- Subject: Re: [Condor-users] Understanding Condor's "Claimed" state
On Wed, 29 Jun 2011, Jeff Ramnani wrote:
I have a Condor pool with 10 dedicated compute nodes, and I'm having an issue
getting people's jobs scheduled the way I want. Here's what's happening.
If user1 has submitted a large batch of jobs, then users that submit jobs
after them aren't getting scheduled until after the first user's jobs are
completed, even if the users who came later have better priorities.
One difference I've seen when this happens is that user1 who submits the
large batch of jobs does so as one job cluster that contains many jobs (in
this example, let's say 100 jobs. e.g. 1.0 .. 1.99), and user2 who has a
better priority, but submits later, does so as many clusters with one job
each (e.g. 2.0, 3.0, 4.0).
Here's an example output of condor_q:
ID        OWNER  SUBMITTED     RUN_TIME ST PRI  SIZE CMD
54713.0   user1  6/29 14:00  0+00:02:27 R  0     0.0 user1_job.py
54713.1   user1  6/29 14:00  0+00:02:19 R  0   122.1 user1_job.py
54713.2   user1  6/29 14:00  0+00:02:14 R  0     0.0 user1_job.py
54713.3   user1  6/29 14:00  0+00:02:06 R  0     0.0 user1_job.py
54713.99  user1  6/29 14:00  0+00:00:00 R  0     0.0 user1_job.py
54488.0   user2  6/29 12:03  0+00:00:00 I  0   732.4 user2_job.sh
54489.0   user2  6/29 12:03  0+00:00:00 I  0   732.4 user2_job.sh
54490.0   user2  6/29 12:03  0+00:00:00 I  0   732.4 user2_job.sh
54491.0   user2  6/29 12:03  0+00:00:00 I  0   732.4 user2_job.sh
54492.0   user2  6/29 12:03  0+00:00:00 I  0   732.4 user2_job.sh
User2 has a better priority, so I would expect user2's job 54488.0 to be
scheduled on the first available machine when one of user1's jobs is
completed, but that's not what's happening. It seems like user1 has a
"claim" on the machines that lasts longer than an individual job. I've read
the following manual pages:
and I'm still not 100% sure I understand how jobs are scheduled in this
situation. I've found the configuration setting for CLAIM_WORKLIFE in the
manual which states, "If provided, this expression specifies the number of
seconds during which a claim will continue accepting new jobs." This leads
me to the following questions.
* How long does a user's "claim" last on a machine?
Others know more of the technical detail here, but if CLAIM_WORKLIFE
is not set, the claim lasts indefinitely unless preemption kicks the user off.
If you set CLAIM_WORKLIFE, the claim stops accepting new jobs after the
number of seconds you specify.
* Does a job cluster keep a "claim" open on a machine until all its jobs are
The size of the cluster doesn't make a difference.
If a node is claimed for a single user, then as long as that user
has jobs in the queue, the node will keep executing that user's jobs,
whether they are in one cluster or many.
CLAIM_WORKLIFE is a very valuable setting. I set it to 3600 seconds.
I've never quite understood why the default is infinity.
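For reference, a minimal sketch of the setting described above. It assumes the line goes in the local configuration file read by the condor_startd on each execute node (the file's location varies by installation), and the one-hour value is simply the one I use, not a recommendation for every pool:

```
# Assumption: this lives in the local config file on each execute node.
# A claim stops accepting new jobs once it is 3600 seconds old.
# A job already running on the claim is not killed when the time expires;
# the claim is released once that job finishes or is preempted.
CLAIM_WORKLIFE = 3600
```

After editing the file, running condor_reconfig on the node should make the startd pick up the change, and condor_config_val CLAIM_WORKLIFE should show the value that configuration resolves to.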
Any help is appreciated.
Steven C. Timm, Ph.D (630) 840-8525
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.