
Re: [HTCondor-users] [CondorLIGO] Problem with parallel universe jobs and dynamic partitioning



On Wed, Sep 10, 2014 at 10:17:58AM -0500, Todd Tannenbaum wrote:
> Without any real evidence other than gut instinct, I think the
> troubles Steffen is observing may be a symptom of the fact that
> startd rank currently does not always work properly against dynamic
> slots in v8.2.2.

Since my gut told me the same, I had intentionally replied to the RANK
discussion ;)
Changing the slot partitioning to static completely removed the issue.
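For anyone hitting the same symptom: the workaround amounts to turning
the partitionable slots back into static single-core ones. A minimal
condor_config sketch (the slot type and core count are placeholders,
not our literal configuration):

```
# Workaround sketch: static slots instead of partitionable ones.
# SLOT_TYPE_1_PARTITIONABLE = True is what dynamic partitioning uses;
# setting it to False gives plain static slots again.
SLOT_TYPE_1               = cpus=1, memory=auto
SLOT_TYPE_1_PARTITIONABLE = False
NUM_SLOTS_TYPE_1          = $(DETECTED_CORES)
```

After restarting the startds, each machine advertises one static
single-core slot per core instead of a single partitionable slot.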

> Greg is working on this now, it is scheduled to be
> fixed in v8.2.3 (see htcondor-wiki ticket 4580 at
> http://goo.gl/bY4kXi - the design document there could be
> enlightening to understand how startd rank+pslots should work). It
> should explain why the first parallel job works fine, but then a
> second non-identical parallel job could sit around idle while slots
> are held in claimed/idle for extended periods of time on a pool that
> is also running vanilla jobs.  Parallel universe relies on the
> startd rank preemption mechanism to avoid potentially waiting
> forever to gather all the nodes it needs for a job.

... and preemption is switched off on our pool on purpose.
That might indeed be the culprit (but I'm not willing to change this
policy as it's a feature, not a bug, for our users).
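For reference, "preemption switched off" on our pool looks roughly like
the following; this is a sketch of the usual no-preemption knobs, not a
verbatim dump of our config:

```
# Sketch of a no-preemption policy (exact knobs on our pool may differ).
PREEMPT                        = False
PREEMPTION_REQUIREMENTS        = False
RANK                           = 0
NEGOTIATOR_CONSIDER_PREEMPTION = False
```

With RANK pinned to 0 the startd never prefers one job over another,
which is exactly the mechanism the parallel universe relies on
according to the ticket.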

I guess I should watch the recent tickets more closely, to get notified
when it's worth trying again. I'm not bound to stable releases,
and I'm willing to get this fixed. Perhaps I should consider setting
up a second DedicatedScheduler for a sub-pool...
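If that turns out to be the way to go, dedicating a sub-pool to a second
scheduler should just mean pointing a group of startds at it (the
hostname below is made up for illustration):

```
# Sketch: mark a group of execute nodes as dedicated to a second schedd.
DedicatedScheduler = "DedicatedScheduler@parallel-schedd.example"
STARTD_ATTRS       = $(STARTD_ATTRS), DedicatedScheduler
```

plus submitting the parallel jobs through that schedd so they match
against those nodes only.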

Thanks,
 Steffen

-- 
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}