
Re: [HTCondor-users] Dynamic Slots in Parallel Universe



Hi Todd,

I'm resurrecting this thread because I think we're still seeing related problems. One of our users has a parallel universe job that has been idle for almost a day. The StartLog on the available nodes seems to indicate that the nodes are held for a while and then released, without the job ever getting enough nodes to start:


03/06/18 15:26:04 slot1_1: State change: received RELEASE_CLAIM command
03/06/18 15:26:04 slot1_1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
03/06/18 15:26:04 slot1_1: State change: No preempting claim, returning to owner
03/06/18 15:26:04 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/06/18 15:26:04 slot1_1: Changing state: Owner -> Delete
03/06/18 15:26:04 slot1_1: Resource no longer needed, deleting
03/06/18 15:26:21 slot1_1: New machine resource of type -1 allocated
03/06/18 15:26:21 slot1: Changing state: Owner -> Unclaimed
03/06/18 15:26:21 slot1: State change: IS_OWNER is TRUE
03/06/18 15:26:21 slot1: Changing state: Unclaimed -> Owner
03/06/18 15:26:21 Setting up slot pairings
03/06/18 15:26:21 slot1_1: Request accepted.
03/06/18 15:26:21 slot1_1: Remote owner is soumi.de@xxxxxxxxxxxxxxxxxxxxxxxxx
03/06/18 15:26:21 slot1_1: State change: claiming protocol successful
03/06/18 15:26:21 slot1_1: Changing state: Owner -> Claimed
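
In case it helps, the same churn can be watched from outside the node by polling the startd directly; a rough sketch (the node name is one of our workers, the 30-second interval is arbitrary):

# Ask the startd itself (bypassing the collector) for its slot states,
# so the Claimed -> Preempting -> Delete cycle above can be followed live.
while true; do
    condor_status -direct CRUSH-SUGWG-OSG-10-5-187-144 \
        -af Name State Activity RemoteOwner
    sleep 30
done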


Restarting condor on the submit node, which is configured as the DedicatedScheduler, results in the following in the SchedLog:


03/09/18 12:01:15 (pid:3405053) Delaying scheduling of parallel jobs because startd query time is long (1) seconds
03/09/18 12:01:29 (pid:3405053) Using negotiation protocol: NEGOTIATE
03/09/18 12:01:29 (pid:3405053) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
03/09/18 12:01:32 (pid:3405053) Finished negotiating for DedicatedScheduler in local pool: 0 matched, 17 rejected

The node running the DedicatedScheduler has:

[root@sugwg-condor ~]# condor_version 
$CondorVersion: 8.7.7 Jan 24 2018 BuildID: 429803 PRE-RELEASE-UWCS $
$CondorPlatform: x86_64_RedHat7 $


The worker nodes have:

[root@CRUSH-SUGWG-OSG-10-5-187-144 ~]# condor_version 
$CondorVersion: 8.6.7 Oct 29 2017 BuildID: 422776 $
$CondorPlatform: x86_64_RedHat7 $

Any suggestions? If you need any additional information, please let me know.

Cheers,

- Larne

On Nov 22, 2017, at 11:10 AM, Duncan Brown <dabrown@xxxxxxx> wrote:

Hi Todd,

I'm having a problem with job claims not succeeding in the parallel universe with dynamic slots. With an 8.6.7 Schedd/Negotiator/Startd setup, I'm seeing pslot preemption work as I'd expect:

11/22/17 10:38:20 Phase 4.1:  Negotiating with schedds ...
11/22/17 10:38:20   Negotiating with DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx at <128.230.190.43:9615?PrivAddr=%3c10.5.2.4:9615%3fsock%3d2474326_26bf%3e&PrivNet=SU_ITS&addrs=128.230.190.43-9615+[--1]-9615&noUDP&sock=2474326_26bf>
11/22/17 10:38:20 0 seconds so far for this submitter
11/22/17 10:38:20 0 seconds so far for this schedd
11/22/17 10:38:20     Request 3486245.00000: autocluster -1 (request count 1 of 0)
11/22/17 10:38:20       Matched 3486245.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <128.230.190.43:9615?PrivAddr=%3c10.5.2.4:9615%3fsock%3d2474326_26bf%3e&PrivNet=SU_ITS&addrs=128.230.190.43-9615+[--1]-9615&noUDP&sock=2474326_26bf> preempting 1 dslots <10.5.182.235:9618?addrs=10.5.182.235-9618+[--1]-9618&noUDP&sock=2006_8712_3> slot1@CRUSH-SUGWG-OSG-10-5-182-235
11/22/17 10:38:20       Successfully matched with slot1@CRUSH-SUGWG-OSG-10-5-182-235
11/22/17 10:38:20     Request 3486245.00000: autocluster -1 (request count 1 of 0)
11/22/17 10:38:20       Matched 3486245.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <128.230.190.43:9615?PrivAddr=%3c10.5.2.4:9615%3fsock%3d2474326_26bf%3e&PrivNet=SU_ITS&addrs=128.230.190.43-9615+[--1]-9615&noUDP&sock=2474326_26bf> preempting none <10.5.183.41:9618?addrs=10.5.183.41-9618+[--1]-9618&noUDP&sock=1879_6a8b_3> slot1@CRUSH-SUGWG-OSG-10-5-183-41
11/22/17 10:38:20       Successfully matched with slot1@CRUSH-SUGWG-OSG-10-5-183-41
11/22/17 10:38:20     Request 3486245.00000: autocluster -1 (request count 1 of 0)
11/22/17 10:38:20       Matched 3486245.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <128.230.190.43:9615?PrivAddr=%3c10.5.2.4:9615%3fsock%3d2474326_26bf%3e&PrivNet=SU_ITS&addrs=128.230.190.43-9615+[--1]-9615&noUDP&sock=2474326_26bf> preempting none <10.5.182.218:9618?addrs=10.5.182.218-9618+[--1]-9618&noUDP&sock=2060_912f_3> slot1@CRUSH-SUGWG-OSG-10-5-182-218
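
For context, pslot preemption is enabled on our negotiator, which is why the "preempting 1 dslots" line shows up above; confirming the knob is a one-liner on the central manager:

# ALLOW_PSLOT_PREEMPTION must be True on the negotiator for it to preempt
# dynamic slots out of a partitionable slot, as in the match log above.
condor_config_val -negotiator ALLOW_PSLOT_PREEMPTION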


And I'm seeing slots claimed by the dedicated scheduler:

[dbrown@sugwg-condor ~]$ condor_userprio -all
Last Priority Update: 11/22 11:00
                                            Effective     Real   Priority   Res   Total Usage       Usage             Last       Time Since
User Name                                     Priority   Priority  Factor   In Use (wghted-hrs)    Start Time       Usage Time    Last Usage
------------------------------------------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ----------
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx         4.38     4.38      1.00    184       418.79 11/02/2017 16:40 11/22/2017 10:38    0+00:21

But the claims are not succeeding. On the schedd I see:

11/22/17 10:50:11 (pid:2474326) Request was NOT accepted for claim slot1@CRUSH-SUGWG-OSG-10-5-182-246 <10.5.182.246:9618?addrs=10.5.182.246-9618+[--1]-9618&noUDP&sock=1998_0a50_3> for DedicatedScheduler -1.-1

and on one of the startds I see:

11/22/17 10:50:11 slot1_1: New machine resource of type -1 allocated
11/22/17 10:50:11 slot1: Changing state: Owner -> Unclaimed
11/22/17 10:50:11 slot1: State change: IS_OWNER is TRUE
11/22/17 10:50:11 slot1: Changing state: Unclaimed -> Owner
11/22/17 10:50:11 Setting up slot pairings
11/22/17 10:50:11 slot1_1: Request accepted.
11/22/17 10:50:11 slot1_1: Remote owner is steven.reyes@xxxxxxxxxxxxxxxxxxxxxxxxx
11/22/17 10:50:11 slot1_1: State change: claiming protocol successful
11/22/17 10:50:11 slot1_1: Changing state: Owner -> Claimed
11/22/17 10:50:11 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
11/22/17 10:50:11 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
11/22/17 10:50:11 slot1: Claiming protocol failed
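
That MODIFY_REQUEST_EXPR_ message points at the request-rounding expressions on the startd, so one thing to look at is what that node rounds the requests up to before it tries to split the pslot. A sketch of the query (standard knob names, asked of the running startd):

# The startd rounds RequestCpus/Memory/Disk up with these expressions
# before deciding whether the partitionable slot can be split.
condor_config_val -name CRUSH-SUGWG-OSG-10-5-182-246 -startd \
    MODIFY_REQUEST_EXPR_REQUESTCPUS \
    MODIFY_REQUEST_EXPR_REQUESTMEMORY \
    MODIFY_REQUEST_EXPR_REQUESTDISK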

This is strange, as the node has been partitioned and the resources are available:

[dbrown@sugwg-condor ~]$ condor_status -long slot1_1@CRUSH-SUGWG-OSG-10-5-182-246 | grep TotalSlot
TotalSlotCpus = 3
TotalSlotDisk = 181808.0
TotalSlotMemory = 35840
TotalSlots = 2


[dbrown@sugwg-condor ~]$ condor_status -long slot1_1@CRUSH-SUGWG-OSG-10-5-182-246 | grep RemoteOwner
RemoteOwner = "steven.reyes@xxxxxxxxxxxxxxxxxxxxxxxxx"
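
A wider dump of all the slots on that node should show the same thing; a sketch using autoformat (SlotType distinguishes the partitionable parent from its dynamic children):

# Cpus/Memory/Disk on the partitionable slot are what is still unallocated;
# on the dynamic slot they are what the claim was actually given.
condor_status CRUSH-SUGWG-OSG-10-5-182-246 \
    -af:h Name SlotType State Activity Cpus Memory Disk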

And these resources are enough to run the job:

[dbrown@sugwg-condor ~]$ condor_q -better 3486245.0


-- Schedd: sugwg-condor.phy.syr.edu : <10.5.2.4:9615?...
The Requirements expression for job 3486245.000 is

   ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )

Job 3486245.000 defines the following attributes:

   RequestCpus = 3
   RequestDisk = 1
   RequestMemory = 35840

The Requirements expression for job 3486245.000 reduces to these conditions:

        Slots
Step    Matched  Condition
-----  --------  ---------
[0]       12897  TARGET.Arch == "X86_64"
[1]       12897  TARGET.OpSys == "LINUX"
[3]       12897  TARGET.Disk >= RequestDisk
[5]         142  TARGET.Memory >= RequestMemory

No successful match recorded.
Last failed match: Wed Nov 22 11:05:03 2017

Reason for last match failure: no match found

3486245.000:  Run analysis summary ignoring user priority.  Of 12897 machines,
 12727 are rejected by your job's requirements
   126 reject your job because of their own requirements
    28 are exhausted partitionable slots
     8 match and are already running your jobs
     8 match but are serving other users
     0 are available to run your job
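
It would also be useful to pin the analysis to the exact pslot that refused the claim; I believe condor_q's analyzer takes a -machine option for that (treat the option name as my assumption):

# Analyze this job against just the slot that logged
# "Partitionable slot can't be split ..." above.
condor_q -better-analyze 3486245.0 \
    -machine slot1@CRUSH-SUGWG-OSG-10-5-182-246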


After the claim timeout, it gives up the claim:

11/22/17 11:00:18 slot1_1: State change: received RELEASE_CLAIM command
11/22/17 11:00:18 slot1_1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
11/22/17 11:00:18 slot1_1: State change: No preempting claim, returning to owner
11/22/17 11:00:18 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
11/22/17 11:00:18 slot1_1: Changing state: Owner -> Delete
11/22/17 11:00:18 slot1_1: Resource no longer needed, deleting

Then rinse and repeat every claim timeout (600 seconds).
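
I assume the 600 seconds is the dedicated scheduler's unused-claim timeout (UNUSED_CLAIM_TIMEOUT, whose documented default is 600); something like this, asked of the schedd, should confirm what we are actually running with:

# How long the dedicated scheduler holds claims it cannot yet use before
# sending the RELEASE_CLAIM seen in the StartLog above.
condor_config_val -name sugwg-condor.phy.syr.edu -schedd UNUSED_CLAIM_TIMEOUT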

Any ideas?

Cheers,
Duncan.

--

Duncan Brown                              Room 263-1, Physics Department
Charles Brightman Professor of Physics     Syracuse University, NY 13244
http://dabrown.expressions.syr.edu                   Phone: 315 443 5993

