Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] [CondorLIGO] Problem with parallel universe jobs and dynamic partitioning

Date: Wed, 10 Sep 2014 10:17:58 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] [CondorLIGO] Problem with parallel universe jobs and dynamic partitioning

Without any real evidence other than gut instinct, I think the troubles Steffen is observing may be a symptom of the fact that startd rank currently does not always work properly against dynamic slots in v8.2.2. Greg is working on this now; it is scheduled to be fixed in v8.2.3 (see htcondor-wiki ticket 4580 at http://goo.gl/bY4kXi - the design document there could be enlightening to understand how startd rank + pslots should work). This should explain why the first parallel job works fine, but then a second non-identical parallel job could sit around idle while slots are held in claimed/idle for extended periods of time on a pool that is also running vanilla jobs. Parallel universe relies on the startd rank preemption mechanism to avoid potentially waiting forever to gather all the nodes it needs for a job.
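
For context, the startd rank that the parallel universe depends on is usually set up along the following lines on the execute nodes. This is only a sketch modelled on the dedicated-scheduler setup described in the HTCondor manual, not a copy of Steffen's configuration; the host name submit.example.org is a placeholder assumed for illustration.

    # Sketch only: submit.example.org is a hypothetical host name.
    # Name the schedd that acts as the dedicated scheduler for this node.
    DedicatedScheduler = "DedicatedScheduler@submit.example.org"

    # Publish that attribute in the machine ClassAd.
    STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

    # Rank jobs from the dedicated scheduler above all others, so the
    # startd is willing to preempt other work in favor of a parallel job.
    RANK = Scheduler =?= $(DedicatedScheduler)

With a RANK expression like this, a parallel job outranks vanilla jobs on the node, which is the preemption path that can misbehave against dynamic slots in v8.2.2.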


best regards,
Todd


On 9/9/2014 8:02 AM, Steffen Grunewald wrote:

Good morning,

I'm observing some very strange behaviour on our cluster, which is
configured for dynamic partitioning, with a CLAIM_WORKLIFE=3600 and
JobLeaseDuration=3600 as well.
When a parallel job runs for the first time on a fully idle pool,
it gets scheduled to the requested number of slots (some of which
may reside on the same machine), and usually runs without any
problems.
The defrag daemon will try and pick up the residues later, if there
is no other job.
But: if the claim of the individual slot hasn't expired yet, this
dynamic slot won't be gathered for the rest of the claim worklife,
_nor_ will the slot be re-used - even by the same user.
This, together with some strange starter error, will result in
full fragmentation of the whole pool for quite a long time, and
the only way to get out of this mess is to hold all jobs.
Quite unfortunate.

While I understand why defrag doesn't act on those (still "Claimed")
slots (I will see lots of "arriving" machines within half an hour,
I hope, and a steep decrease in the "Claimed" number, I hope),
I have no idea why they don't get reused by identical jobs.

This is Condor 8.2.2.

I'll set NEGOTIATOR_PRE_JOB_RANK to 0 for now, although I'm in doubt
this setting would affect job matching in such a strange way... is
it me who's terribly wrong?

Thanks,
  Steffen



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685

