
Re: [HTCondor-users] [CondorLIGO] Problem with parallel universe jobs and dynamic partitioning



On Wed, Sep 10, 2014 at 10:17:58AM -0500, Todd Tannenbaum wrote:
> Without any real evidence other than gut instinct, I think the
> troubles Steffen is observing may be a symptom of the fact that
> startd rank currently does not always work properly against dynamic
> slots in v8.2.2.  Greg is working on this now, it is scheduled to be
> fixed in v8.2.3 (see htcondor-wiki ticket 4580 at
> http://goo.gl/bY4kXi - the design document there could be
> enlightening to understand how startd rank+pslots should work).

Thanks for the document - after a first read it looks a bit
overcomplicated to me: 

Why does it need a central defrag process to re-join the resource 
fragments that remain after dynamic slots have become idle (for long
enough)? Shouldn't the local startd take care of these fragments?
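
(For context: as far as I can tell, the central process meant here is the
existing condor_defrag daemon. My understanding - the numbers are purely
illustrative and not taken from the DD - is that it gets switched on
roughly like this:

  # run the defrag daemon alongside the others
  DAEMON_LIST = $(DAEMON_LIST) DEFRAG
  # how often to look for fragmented partitionable slots (seconds)
  DEFRAG_INTERVAL = 600
  # which machines are candidates for draining
  DEFRAG_REQUIREMENTS = PartitionableSlot
  # drain at most this many machines per hour ...
  DEFRAG_DRAINING_MACHINES_PER_HOUR = 2
  # ... and stop once this many whole machines are free
  DEFRAG_MAX_WHOLE_MACHINES = 8
  # what counts as a "whole" machine
  DEFRAG_WHOLE_MACHINE_EXPR = Cpus == TotalCpus

That's quite a bit of central machinery for something the startd already
sees locally.)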

If eviction is the universal solution to making resources available
to other (parallel) jobs, what happens if only parallel jobs are
running? Evicting one thread of a multi-core parallel task comes at
a very high cost, and I can easily imagine a deadlock scenario.
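
(The only way I can see to soften this is retirement time on the execute
side - a sketch of my own, not something the DD proposes, and the 48 hours
are just an example:

  # give a parallel-universe job (universe 11) up to 48 hours before a
  # rank-based preemption actually evicts it; other universes get none
  MAXJOBRETIREMENTTIME = ifThenElse(JobUniverse == 11, 48 * 3600, 0)

But that only postpones the eviction, it doesn't remove the risk of
deadlock.)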

After all, with a very inhomogeneous set of job profiles, it's
pretty hard to come up with a configuration that serves almost
everyone in an almost optimal way. Dynamic partitioning was, and
still is, one of the most important ingredients, and it has worked
impressively well with non-parallel workloads (minimal example
below). I still can't quite see where the DD addresses the specifics
of parallel jobs (as those seem to be what makes the difference).
I'll have to reread the whole thing, at least twice...
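
What I mean by dynamic partitioning is the usual setup with one
partitionable slot per machine, roughly:

  # one partitionable slot owning the whole machine; dynamic slots are
  # carved out of it to match each job's cpu/memory request
  NUM_SLOTS = 1
  NUM_SLOTS_TYPE_1 = 1
  SLOT_TYPE_1 = 100%
  SLOT_TYPE_1_PARTITIONABLE = TRUE

Nothing exotic, and for vanilla jobs of all shapes and sizes it just works.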

> It should explain why the first parallel job works fine, but then a
> second non-identical parallel job could sit around idle while slots
> are held in claimed/idle for extended periods of time on a pool that
> is also running vanilla jobs.  Parallel universe relies on the
> startd rank preemption mechanism to avoid potentially waiting
> forever to gather all the nodes it needs for a job.
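
For reference, the rank mechanism in question is, as far as I understand,
the standard dedicated-scheduler setup from the manual (the hostname below
is just a placeholder):

  # advertise which dedicated scheduler may claim this machine ...
  DedicatedScheduler = "DedicatedScheduler@submit.example.org"
  STARTD_ATTRS = $(STARTD_ATTRS) DedicatedScheduler
  # ... and prefer its (parallel) jobs over whatever else runs here
  RANK = Scheduler =?= $(DedicatedScheduler)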

I'd like it better if the parallel universe didn't rely on preemption
at all. Shouldn't the corresponding user have accumulated enough
priority long before "forever"? In particular, with short claim
lifetimes, that should ensure that such "big" jobs (BTW, we're talking
about ~200 threads/nodes on a 2000-core pool, something I consider
"fair use") get their moment of fame early enough.
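
By "short claim lifetimes" I mean something along the lines of

  # release a claim after at most 20 minutes of reuse, so the slot goes
  # back through negotiation and user priorities can take effect
  CLAIM_WORKLIFE = 1200

i.e. slots return to the negotiator often enough for accumulated priority
to matter.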

Just my 2 cents...
 - S

-- 
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}