
Re: [Condor-users] negotiation weirdness

Speaking of CLAIM_WORKLIFE and a schedd's claim on a slot: when a job finishes running in a slot and the claim is still held by the schedd, what's the algorithm for picking the next job to run in that slot from the list of jobs in the schedd?
We're using auto-retirement on our jobs, but it costs us about an 8% efficiency penalty due to the negotiation overhead. That is, we can never fill our bigger pools and push them to 100% utilization; we always see ~8% of the pool unutilized as jobs finish and machines get renegotiated. We automatically put machines into the retirement state after 20 minutes.
- Ian
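
As a rough back-of-envelope for the numbers above, here is a small Python sketch. It assumes a deliberately simple model that is not stated in the message: each claim is busy for a fixed stretch (taken here as the 20-minute retirement interval) and then sits idle for a fixed renegotiation gap. The function name is hypothetical. Because a claim also stays busy while its last job finishes, the real busy stretch is longer than 20 minutes, so this gives a lower estimate of the gap.

```python
def idle_gap_seconds(idle_fraction, busy_seconds):
    """Idle gap per claim implied by an observed idle fraction.

    Assumed model (not from the original message):
        idle_fraction = gap / (busy_seconds + gap)
    which rearranges to
        gap = busy_seconds * idle_fraction / (1 - idle_fraction)
    """
    return busy_seconds * idle_fraction / (1.0 - idle_fraction)


# ~8% of the pool idle, machines retired after 20 minutes.
gap = idle_gap_seconds(0.08, 20 * 60)
print(f"implied idle gap per claim: {gap:.0f} s")
```

Under these assumptions the 8% figure corresponds to a bit under two minutes of idle time per claim.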

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Jason Stowe
Sent: Wednesday, October 31, 2007 11:45 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] negotiation weirdness

The negotiation process can be subtle to debug. The Condor negotiator creates matches between schedulers and machines. A match means that a slot is claimed by a scheduler. The claim spans multiple job executions: for efficiency, when the slot finishes a job, it asks the same scheduler for another job with the same significant attributes, without going through the negotiator again.

In this case, I suspect that the other sergey jobs are finishing and new ones are being started because the machine is still "Claimed" by the sergey scheduler. You can specify how long this claim stays in effect using the "CLAIM_WORKLIFE" setting in 6.8.*. The default is -1, which puts no time limit on the claim, so the slot keeps running sergey jobs until *all* of them have finished. If you set it to, say, 1 second, then the first job to execute will still run to completion (presumably taking longer than 1 second), after which the claim is released for a new match-making cycle.
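
To make the claim-reuse behavior concrete, here is a toy Python simulation. It is a sketch, not Condor code: the function name and the job list are hypothetical, and it assumes only what is described above, namely that a claimed slot keeps taking jobs from the same scheduler until the claim's age exceeds the worklife, that a negative worklife means no limit, and that a running job is never cut short.

```python
from collections import deque


def jobs_run_under_claim(job_runtimes, claim_worklife):
    """Count how many queued jobs one claim runs before the slot is released.

    job_runtimes:   runtimes in seconds of the scheduler's idle jobs,
                    in the order they would be handed to the slot.
    claim_worklife: seconds after which the claim accepts no new jobs;
                    a negative value means no limit (the -1 default).
    """
    queue = deque(job_runtimes)
    claim_age = 0
    jobs_run = 0
    while queue:
        # The limit only blocks *new* jobs; it is checked between jobs.
        if claim_worklife >= 0 and claim_age > claim_worklife:
            break
        claim_age += queue.popleft()  # the job runs to completion
        jobs_run += 1
    return jobs_run


# Three 10-minute jobs waiting in the same scheduler.
waiting = [600, 600, 600]
print(jobs_run_under_claim(waiting, -1))  # no limit: the claim drains the queue
print(jobs_run_under_claim(waiting, 1))   # 1 second: released after the first job
```

With no limit, the slot never returns to the negotiator while that scheduler has matching jobs, which is why a better-priority user can be left waiting. With a 1-second worklife, every job completion sends the slot back through match-making.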

Good luck, I believe this is the issue, and let me know how this works out for you.



Jason A. Stowe

Phone: 607.227.9686

Cycle Computing, LLC
Enterprise Condor Support

On 10/31/07, Grant Goodyear < grant@xxxxxxxxxxxxxxxxx> wrote:
> I'm seeing somewhat strange results in job negotiation/scheduling.
> We're running a small (~60-node) condor cluster on a dozen or so windows
> boxes.  One box (crossroads) is the central manager (submit,manage), and
> the rest are all dedicated submit,execute machines with preemption
> turned off.  (The node config can be seen in
> http://www.grantgoodyear.org/~grant/condorlogs/condor_config.txt )
> When one user submits a large number of jobs, we're seeing his jobs get
> scheduled despite the fact that other users have better priorities.
> Here's a 10-minute view of what's running and the user priorities:
> Oct. 30, 10:40am
> http://www.grantgoodyear.org/~grant/condorlogs/running_200710301040.txt
> http://www.grantgoodyear.org/~grant/condorlogs/priorities_200710301040.txt
> Oct. 30, 10:50am
> http://www.grantgoodyear.org/~grant/condorlogs/running_200710301050.txt
> http://www.grantgoodyear.org/~grant/condorlogs/priorities_200710301050.txt
> We script the submission files, and use group accounting, so even though
> all jobs have the same owner, all of the jobs run from c:\sergey have
> +AccountingGroup = "sergey" set, the c:\jgalford jobs are in the
> "jgalford" group, and the c:\ljacobson job is in the "ljacobson" group.
> At 10:40, sergey has an effective priority of 9.57, jobs 52800-52877
> (submitted on crossroads) are running, and jobs 52878-53481 (crossroads)
> are waiting.  Group ljacobson has job 270 (submitted from littleboy)
> running, and nothing waiting in the queue.  His priority is 0.51, but
> since he has nothing waiting it doesn't matter.  Group jgalford has job 498
> (submitted from fatman) running, jobs 483-487 (submitted from
> greenhouse) running, and jobs 499-514 (submitted from fatman) waiting.
> The jgalford effective priority is 3.66.
> So, if I understand the way the negotiation process works, the waiting
> jobs should be sorted so that the jgalford job 499 (fatman) should be
> the next job chosen when a resource frees up, and that would be followed
> by 500 (fatman), ....
> At 10:50, sergey jobs 52800-52808 (crossroads) have finished, and now
> sergey jobs 52809-52904 (crossroads) are running.  No new jgalford
> jobs have started, despite the lower effective priority.
> I've included the crossroads log files
> ( http://www.grantgoodyear.org/~grant/condorlogs/) for this time
> period.  I'm not seeing anything in the logs that explains this
> behavior, but I'm hoping somebody else has better insight.
> I'm thoroughly confused.
> Help?
> Thanks,
> Grant Goodyear
> --
> Grant Goodyear
> web: http://www.grantgoodyear.org
> e-mail: grant@xxxxxxxxxxxxxxxxx
