
[Condor-users] Understanding Condor's "Claimed" state


I have a Condor pool with 10 dedicated compute nodes, and I'm having an issue getting people's jobs scheduled the way I want.  Here's what's happening.

If user1 submits a large batch of jobs, then users who submit jobs after them don't get scheduled until the first user's jobs have completed, even if the later users have better priorities.

One difference I've noticed when this happens is that user1, who submits the large batch, does so as a single job cluster containing many jobs (in this example, say 100 jobs: 1.0 .. 1.99), while user2, who has a better priority but submits later, does so as many clusters of one job each (e.g. 2.0, 3.0, 4.0).
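For context, I imagine the two submit styles look roughly like this (a sketch -- the real submit files differ, and the filenames are just the ones from my condor_q output below):

    # user1's submit description (sketch): one cluster, 100 procs
    universe   = vanilla
    executable = user1_job.py
    arguments  = 2011.0
    queue 100

    # user2's submit description (sketch): run once per condor_submit,
    # so each submission creates its own single-job cluster
    universe   = vanilla
    executable = user2_job.sh
    queue 1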

Here's an example output of condor_q:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
54713.0    user1         6/29 14:00   0+00:02:27 R  0   0.0  user1_job.py 2011.0
54713.1    user1         6/29 14:00   0+00:02:19 R  0   122.1 user1_job.py 2011.0
54713.2    user1         6/29 14:00   0+00:02:14 R  0   0.0  user1_job.py 2011.0
54713.3    user1         6/29 14:00   0+00:02:06 R  0   0.0  user1_job.py 2011.0
54713.99   user1         6/29 14:00   0+00:00:00 R  0   0.0  user1_job.py 2011.0
54488.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh           
54489.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh           
54490.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh           
54491.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh           
54492.0    user2         6/29 12:03   0+00:00:00 I  0   732.4 user2_job.sh

User2 has a better priority, so I would expect user2's job 54488.0 to be scheduled on the first available machine when one of user1's jobs completes, but that's not what's happening.  It seems like user1 has a "claim" on the machines that lasts longer than an individual job.  I've read the relevant manual pages, and I'm still not 100% sure I understand how jobs are scheduled in this situation.  I did find the configuration setting CLAIM_WORKLIFE in the manual, which states: "If provided, this expression specifies the number of seconds during which a claim will continue accepting new jobs."  This leads me to the following questions.

* How long does a user's "claim" last on a machine?
* Does a job cluster keep a "claim" open on a machine until all its jobs are completed?
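If CLAIM_WORKLIFE is indeed the knob I'm after, I'm guessing the fix would be a config fragment along these lines on the execute nodes (the value here is just a guess on my part):

    # condor_config on the execute nodes (example value):
    # stop accepting new jobs on a claim after 20 minutes,
    # forcing renegotiation so better-priority users get slots sooner
    CLAIM_WORKLIFE = 1200

Please correct me if that's not how it's meant to be used.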

Any help is appreciated.

Jeff Ramnani 