
[HTCondor-users] Local Parallel Universe job claims one extra slot?

I'm running some tests on a multi-core machine, trying to make it
available for the Parallel Universe.
For now, this machine is an "island": it runs its own condor
collector/negotiator and has a single 64-core partitionable slot.
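For reference, the island is configured roughly like this (a sketch
from memory, not a verbatim copy of my config; the DedicatedScheduler
attribute is the standard way to make a startd usable by the parallel
universe's dedicated scheduler):

# All daemons on the one host ("island" pool)
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD
CONDOR_HOST = $(FULL_HOSTNAME)

# A single 64-core partitionable slot
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=64
SLOT_TYPE_1_PARTITIONABLE = TRUE

# Advertise the startd to the dedicated (parallel) scheduler
DedicatedScheduler = "DedicatedScheduler@$(FULL_HOSTNAME)"
STARTD_ATTRS = $(STARTD_ATTRS) DedicatedScheduler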

If I submit a "machine_count = 20" parallel job and check with
condor_status, I can see 20 "busy" slots (accordingly numbered
slot1_1 through slot1_20) plus one "idle" slot, slot1_21, for a
total of 21 "claimed" one-CPU slots.
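The submit description is essentially this (the executable and its
arguments are placeholders standing in for my test job):

# test.sub - minimal parallel universe test job
universe       = parallel
executable     = /bin/sleep
arguments      = 300
machine_count  = 20
request_cpus   = 1
queue

submitted with "condor_submit test.sub", then observed with plain
"condor_status".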

This seems to limit the number of available cores to 63.

Is this intentional? Which process is running in the extra slot,
and why does this only happen when the schedd and startd run on
the same machine? Or am I misinterpreting? (When I attempt to
request all 64 cores, the job does eventually start, but only after
a very long time (*) spent claiming slots for it. A 63-core job ...)

(*) 10 minutes, which by coincidence is my CLAIM_WORKLIFE. I had
been running a smaller job shortly before, and unlike vanilla jobs,
parallel ones seem to want "clean" slots. Some more tuning is
obviously needed here...
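In case someone wants to reproduce the timing: the setting in
question is the 10 minutes mentioned above, i.e.

CLAIM_WORKLIFE = 600

(seconds). Whether this knob, or something on the dedicated-scheduler
side, is actually what delays the claiming is exactly what I'm unsure
about.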
What I'm also seeing in the MatchLog: every 20 seconds, only one
slot is matched; the others show "Rejected". Does that mean that
although enough resources are still available in the partitionable
slot, only one chunk at a time gets split off? That wouldn't be
promising for a mixed setup (vanilla + parallel), since vanilla jobs
don't have to wait... (No, preemption is not what I'm looking for.)

- S