
[HTCondor-users] DedicatedScheduler matching, partitioning off slot, but not starting



Hi all,

we set up a small dedicated pool with 24 machines of type A and 21
machines of type B.

A user now submits a parallel universe job, requesting to run on all 24
machines of type A by setting request_cpus to the maximum a type A
machine supports (larger than what type B supports).
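
For reference, the submit description boils down to something like the
following sketch (the executable is a placeholder, the numbers are the
real ones from the job):

    # sketch of the submit file, not the literal one
    universe       = parallel
    executable     = /path/to/mpi_wrapper.sh   # placeholder
    machine_count  = 24
    request_cpus   = 128
    request_memory = 40960
    queue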

As the pool was idle, the negotiator quickly matched the 24 nodes to the
job which can be seen via

$ condor_q 63.0 -af RemoteHosts | xargs -d , -n1 echo|grep -c slot
24

So far, so good. All nodes partition off the subslot 1_1, but for the
next 10 minutes nothing really happens: the StartLog does not contain a
hint, and StarterLog.slot1_1 contains nothing at all (as nothing was
actually started yet).
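
In case it helps, this is roughly how I look at the slot states on one
of the nodes while the job sits there (the hostname is just an example):

    $ condor_status -constraint 'Machine == "nodeA01.atlas.local"' \
        -af:h Name SlotType State Activity Cpus Memory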

After 10 minutes the claim seems to get deleted on the startd, and a few
minutes later the negotiator tries to match the same resources again;
however, the subslot is still present and won't be preempted [1].

And now the jobs simply stay idle.

Some data points:

$ condor_q -bet 63

-- Schedd: condor8.atlas.local : <10.20.30.23:2653>
The Requirements expression for job 63.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
(TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    (TARGET.Cpus >= RequestCpus) && ((TARGET.FileSystemDomain ==
MY.FileSystemDomain) || (TARGET.HasFileTransfer))

Job 63.000 defines the following attributes:

    DiskUsage = 1
    FileSystemDomain = "atlas.local"
    RequestCpus = 128
    RequestDisk = DiskUsage
    RequestMemory = 40960

The Requirements expression for job 63.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          90  TARGET.Arch == "X86_64"
[1]          90  TARGET.OpSys == "LINUX"
[3]          90  TARGET.Disk >= RequestDisk
[5]          90  TARGET.Memory >= RequestMemory
[7]          24  TARGET.Cpus >= RequestCpus
[9]          90  TARGET.FileSystemDomain == MY.FileSystemDomain

Last successful match: Mon Jul 13 19:01:19 2020

Last failed match: Mon Jul 13 19:14:47 2020

Reason for last match failure: PREEMPTION_REQUIREMENTS == False

063.000:  Run analysis summary ignoring user priority.  Of 45 machines,
     21 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
     24 are able to run your job
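
Since the analysis points at PREEMPTION_REQUIREMENTS, for completeness
this is how the negotiator's current setting can be double-checked:

    $ condor_config_val -negotiator PREEMPTION_REQUIREMENTS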

------------------------------------------------
StartLog on one of the nodes:

07/13/20 19:01:14 slot1_1: New machine resource of type -1 allocated
07/13/20 19:01:14 Setting up slot pairings
07/13/20 19:01:14 slot1_1: Request accepted.
07/13/20 19:01:14 slot1_1: Remote owner is USER@xxxxxxxxxxx
07/13/20 19:01:14 slot1_1: State change: claiming protocol successful
07/13/20 19:01:14 slot1_1: Changing state: Owner -> Claimed
07/13/20 19:01:21 slot1_1: Called deactivate_claim()
07/13/20 19:01:21 Can't read ClaimId
07/13/20 19:01:21 condor_write(): Socket closed when trying to write 29
bytes to <10.20.30.23:16847>, fd is 11
07/13/20 19:01:21 Buf::write(): condor_write() failed
07/13/20 19:11:14 slot1_1: State change: claim no longer recognized by
the schedd - removing claim
07/13/20 19:11:14 slot1_1: Changing state and activity: Claimed/Idle ->
Preempting/Killing
07/13/20 19:11:14 slot1_1: State change: No preempting claim, returning
to owner
07/13/20 19:11:14 slot1_1: Changing state and activity:
Preempting/Killing -> Owner/Idle
07/13/20 19:11:14 slot1_1: State change: IS_OWNER is false
07/13/20 19:11:14 slot1_1: Changing state: Owner -> Unclaimed
07/13/20 19:11:14 slot1_1: Changing state: Unclaimed -> Delete
07/13/20 19:11:14 slot1_1: Resource no longer needed, deleting
07/13/20 19:14:46 slot1_1: New machine resource of type -1 allocated
07/13/20 19:14:46 Setting up slot pairings
07/13/20 19:14:47 slot1_1: Request accepted.
07/13/20 19:14:47 slot1_1: Remote owner is USER@xxxxxxxxxxx
07/13/20 19:14:47 slot1_1: State change: claiming protocol successful
07/13/20 19:14:47 slot1_1: Changing state: Owner -> Claimed
07/13/20 19:14:47 Job no longer matches partitionable slot after
MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
07/13/20 19:14:47 slot1: Partitionable slot can't be split to allocate a
dynamic slot large enough for the claim
07/13/20 19:14:47 slot1: State change: claiming protocol failed
07/13/20 19:14:47 slot1: Changing state: Unclaimed -> Owner
07/13/20 19:14:47 slot1: State change: IS_OWNER is false
07/13/20 19:14:47 slot1: Changing state: Owner -> Unclaimed
-------------------------------------
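
As for the "MODIFY_REQUEST_EXPR_ edits" line above, the related startd
knobs can be dumped like this (the hostname is again just an example):

    $ condor_config_val -name nodeA01.atlas.local -startd -dump MODIFY_REQUEST
    $ condor_config_val -name nodeA01.atlas.local -startd MUST_MODIFY_REQUEST_EXPRS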

Apart from the network communication glitch (which I see on various
nodes at that time), there is nothing that really points to an immediate
problem - at least not to me.

Anyone with an idea what is wrong here?

Cheers

Carsten

[1] our preemption policy in this small pool is simple: only jobs whose
JobUniverse is not 11 (parallel) may be preempted, which does not apply
here.
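
In config terms the intent is roughly the following - a sketch rather
than the literal expression, since the exact attribute scoping inside
PREEMPTION_REQUIREMENTS may differ:

    # sketch only: assumes the slot ad carries the running job's JobUniverse
    PREEMPTION_REQUIREMENTS = (JobUniverse =!= 11)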
-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
