
[HTCondor-users] Problem with parallel universe jobs and dynamic partitioning



Good morning,

I'm observing some very strange behaviour on our cluster, which is
configured for dynamic partitioning, with CLAIM_WORKLIFE=3600 and
JobLeaseDuration=3600.
When a parallel job runs for the first time on a fully idle pool,
it gets scheduled to the requested number of slots (some of which
may reside on the same machine), and usually runs without any
problems.
The defrag daemon will try to collect the leftover dynamic slots
later, if there is no other job.
But if the claim on an individual dynamic slot hasn't expired yet,
that slot won't be collected for the rest of the claim worklife,
_nor_ will it be reused, even by the same user.
This, together with some strange starter error, results in complete
fragmentation of the whole pool for quite a long time, and the only
way out of this mess is to hold all jobs.
Quite unfortunate.
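
For illustration, a dynamically partitioned execute node with this
claim lifetime is typically configured along the following lines.
This is only a sketch: the slot-type lines are an assumed generic
partitionable-slot setup, not copied from our configuration, and
only the 3600 value is taken from the description above.

```
# Assumed generic setup: one partitionable slot owning the whole machine
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = True

# From the description above: after 3600 seconds a claim stops
# accepting further jobs
CLAIM_WORKLIFE = 3600
```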

While I understand why defrag doesn't act on those (still "Claimed")
slots (I hope to see lots of "arriving" machines within half an hour,
along with a steep drop in the "Claimed" count), I have no idea why
they don't get reused by identical jobs.

This is HTCondor 8.2.2.

I'll set NEGOTIATOR_PRE_JOB_RANK to 0 for now, although I doubt this
setting would affect job matching in such a strange way... or am I
getting something terribly wrong?
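
The temporary change would amount to a single line in the
negotiator's configuration, sketched here on the assumption that it
goes into the local config file on the central manager and is picked
up with condor_reconfig:

```
# Temporary test: make the pre-job rank a constant, so that it no
# longer influences which slot a job is matched to
NEGOTIATOR_PRE_JOB_RANK = 0
```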

Thanks,
 Steffen

-- 
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}