Re: [HTCondor-users] [CondorLIGO] Problem with parallel universe jobs and dynamic partitioning
- Date: Wed, 10 Sep 2014 10:17:58 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] [CondorLIGO] Problem with parallel universe jobs and dynamic partitioning
Without any real evidence other than gut instinct, I think the troubles
Steffen is observing may be a symptom of the fact that startd rank
currently does not always work properly against dynamic slots in v8.2.2.
Greg is working on this now; it is scheduled to be fixed in v8.2.3
(see htcondor-wiki ticket 4580 at http://goo.gl/bY4kXi - the design
document there could be enlightening for understanding how startd rank
and pslots should work together). This bug would explain why the first
parallel job works fine, while a second, non-identical parallel job sits idle and
slots are held in claimed/idle for extended periods of time on a pool
that is also running vanilla jobs. Parallel universe relies on the
startd rank preemption mechanism to avoid potentially waiting forever to
gather all the nodes it needs for a job.
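For context, the startd rank preemption that the parallel universe relies on is the standard dedicated-scheduler setup on the execute nodes. A minimal sketch (the submit host name here is a placeholder, not taken from this thread):

```
# condor_config on the execute nodes (sketch; submit.example.com is a placeholder)
DedicatedScheduler = "DedicatedScheduler@submit.example.com"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

# Startd RANK prefers claims from the dedicated scheduler, so parallel
# jobs can preempt vanilla jobs instead of waiting forever to gather nodes.
RANK = Scheduler =?= $(DedicatedScheduler)
```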
On 9/9/2014 8:02 AM, Steffen Grunewald wrote:
I'm observing some very strange behaviour on our cluster, which is
configured for dynamic partitioning, with a CLAIM_WORKLIFE=3600 and
JobLeaseDuration=3600 as well.
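The two settings mentioned above would be configured roughly like this (a sketch, not the poster's exact files):

```
# condor_config, pool-wide: keep a claim alive for at most an hour
CLAIM_WORKLIFE = 3600

# job_lease_duration is set per job in the submit description file:
#   job_lease_duration = 3600
```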
When a parallel job runs for the first time on a fully idle pool,
it gets scheduled to the requested number of slots (some of which
may reside on the same machine), and usually runs without any problems.
The defrag daemon will try and pick up the residues later, if there
is no other job.
But: if the claim of the individual slot hasn't expired yet, this
dynamic slot won't be gathered for the rest of the claim worklife,
_nor_ will the slot be re-used - even by the same user.
This, together with some strange starter error, will result in
full fragmentation of the whole pool for quite a long time, and
the only way to get out of this mess is to hold all jobs.
While I understand why defrag doesn't act on those (still "Claimed")
slots - within half an hour I hope to see lots of machines arriving
and a steep decrease in the "Claimed" count - I have no idea why they
don't get reused, even by identical jobs.
This is Condor 8.2.2.
I'll set NEGOTIATOR_PRE_JOB_RANK to 0 for now, although I doubt this
setting would affect job matching in such a strange way... or am I
terribly wrong?
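Disabling the negotiator's pre-job rank is a one-line change (a sketch):

```
# condor_config on the central manager: make the negotiator's built-in
# preference expression a constant, so only the job's own Rank matters
NEGOTIATOR_PRE_JOB_RANK = 0
```

followed by a condor_reconfig on the central manager to pick up the change.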
Todd Tannenbaum <tannenba@xxxxxxxxxxx>
HTCondor Technical Lead
Center for High Throughput Computing
Department of Computer Sciences, University of Wisconsin-Madison
1210 W. Dayton St. Rm #4257, Madison, WI 53706-1685
Phone: (608) 263-7132