Re: [HTCondor-users] [CondorLIGO] Problem with parallel universe jobs and dynamic partitioning
- Date: Wed, 10 Sep 2014 10:17:58 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] [CondorLIGO] Problem with parallel universe jobs and dynamic partitioning
Without any real evidence other than gut instinct, I think the troubles
Steffen is observing may be a symptom of the fact that startd rank
currently does not always work properly against dynamic slots in v8.2.2.
Greg is working on this now; it is scheduled to be fixed in v8.2.3
(see htcondor-wiki ticket 4580 at http://goo.gl/bY4kXi - the design
document there could be enlightening for understanding how startd rank
and pslots should work together). This bug would explain why the first
parallel job works fine, while a second, non-identical parallel job sits idle and
slots are held in claimed/idle for extended periods of time on a pool
that is also running vanilla jobs. Parallel universe relies on the
startd rank preemption mechanism to avoid potentially waiting forever to
gather all the nodes it needs for a job.
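For context, the startd rank preemption that the parallel universe relies on is the standard dedicated-scheduler setup on the execute nodes. A minimal sketch (the submit host name here is a placeholder, not taken from this thread):

```
# condor_config on the execute nodes (sketch; submit.example.com is a placeholder)
DedicatedScheduler = "DedicatedScheduler@submit.example.com"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

# Startd RANK prefers claims from the dedicated scheduler, so parallel
# jobs can preempt vanilla jobs instead of waiting forever to gather nodes.
RANK = Scheduler =?= $(DedicatedScheduler)
```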
On 9/9/2014 8:02 AM, Steffen Grunewald wrote:
I'm observing some very strange behaviour on our cluster, which is
configured for dynamic partitioning, with a CLAIM_WORKLIFE=3600 and
JobLeaseDuration=3600 as well.
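The two settings mentioned above would be configured roughly like this (a sketch, not the poster's exact files):

```
# condor_config, pool-wide: keep a claim alive for at most an hour
CLAIM_WORKLIFE = 3600

# job_lease_duration is set per job in the submit description file:
#   job_lease_duration = 3600
```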
When a parallel job runs for the first time on a fully idle pool,
it gets scheduled to the requested number of slots (some of which
may reside on the same machine), and usually runs without any problems.
The defrag daemon will try and pick up the residues later, if there
is no other job.
But: if the claim of the individual slot hasn't expired yet, this
dynamic slot won't be gathered for the rest of the claim worklife,
_nor_ will the slot be re-used - even by the same user.
This, together with some strange starter error, will result in
full fragmentation of the whole pool for quite a long time, and
the only way to get out of this mess is to hold all jobs.
While I understand why defrag doesn't act on those (still "Claimed")
slots - within half an hour I hope to see lots of machines arriving
and a steep decrease in the "Claimed" count - I have no idea why they
don't get reused, even by identical jobs.
This is Condor 8.2.2.
I'll set NEGOTIATOR_PRE_JOB_RANK to 0 for now, although I doubt this
setting would affect job matching in such a strange way... or am I
terribly wrong?
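Disabling the negotiator's pre-job rank is a one-line change (a sketch):

```
# condor_config on the central manager: make the negotiator's built-in
# preference expression a constant, so only the job's own Rank matters
NEGOTIATOR_PRE_JOB_RANK = 0
```

followed by a condor_reconfig on the central manager to pick up the change.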
Todd Tannenbaum <tannenba@xxxxxxxxxxx>
HTCondor Technical Lead
Center for High Throughput Computing
Department of Computer Sciences, University of Wisconsin-Madison
1210 W. Dayton St. Rm #4257, Madison, WI 53706-1685
Phone: (608) 263-7132