
[HTCondor-users] jobs still landing on the machine even after condor_drain



Dear Expert:

In our cluster we are running Condor 8.2.1, and I use partitionable slots on the execute machines.

## Partitionable slots
NUM_SLOTS = 1
SLOT_TYPE_1               = cpus=100%,mem=100%,auto
NUM_SLOTS_TYPE_1          = 1
SLOT_TYPE_1_PARTITIONABLE = TRUE

# Consumption policy
CONSUMPTION_POLICY = True
SLOT_TYPE_1_CONSUMPTION_POLICY = True
SLOT_TYPE_1_CONSUMPTION_CPUS = TARGET.RequestCpus
SLOT_TYPE_1_CONSUMPTION_MEMORY = TARGET.RequestMemory
SLOT_TYPE_1_CONSUMPTION_DISK = TARGET.RequestDisk
SLOT_WEIGHT = Cpus

USE_PID_NAMESPACES = False
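
For what it's worth, these settings can be double-checked remotely with condor_config_val, something like the following (just a sketch; it assumes the startd on node002 is reachable from the submit host, and each command should simply echo the value configured above):

svr019:~# condor_config_val -name node002.beowulf.cluster -startd SLOT_TYPE_1_PARTITIONABLE
svr019:~# condor_config_val -name node002.beowulf.cluster -startd CONSUMPTION_POLICY
svr019:~# condor_config_val -name node002.beowulf.cluster -startd SLOT_TYPE_1_CONSUMPTION_CPUS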



Yesterday we planned to reboot node002 in the Condor cluster, so in the afternoon I ran 'condor_drain node002.beowulf.cluster' to prevent new jobs from being sent to node002. After that I saw the following change in condor_status:

slot1@xxxxxxxxxxxx   LINUX  X86_64  Drained  Retiring  0.020  27750  0+00:00:04
slot1_1@xxxxxxxxxx   LINUX  X86_64  Claimed  Retiring  8.000   2000  0+00:00:04
slot1_2@xxxxxxxxxx   LINUX  X86_64  Claimed  Retiring  8.000   2000  0+00:00:04
slot1_3@xxxxxxxxxx   LINUX  X86_64  Claimed  Retiring  7.990   2000  0+00:00:04

  node002 is a 24-core machine and each job running there is an 8-core job, so these three jobs account for all 24 cores.

  And after some time it changed to:

slot1@xxxxxxxxxxxx   LINUX  X86_64  Unclaimed  Idle  0.020  27750  0+00:02:56
slot1_1@xxxxxxxxxx   LINUX  X86_64  Claimed    Busy  7.890   2000  0+00:02:57
slot1_2@xxxxxxxxxx   LINUX  X86_64  Claimed    Busy  7.810   2000  0+00:02:57
slot1_3@xxxxxxxxxx   LINUX  X86_64  Claimed    Busy  8.240   2000  0+00:02:57
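
(In case it's useful, the slot states can also be watched more directly with condor_status autoformatting, roughly like this; this is only a sketch, and the grep is just a guess at surfacing whatever drain-related attributes the slot ads carry:)

svr019:~# condor_status -constraint 'Machine == "node002.beowulf.cluster"' -af:h Name State Activity
svr019:~# condor_status -long slot1@node002.beowulf.cluster | grep -i drain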



During that period it was still the same 3 jobs running on node002, with the following IDs:
vr019:~# condor_q -run | grep node002
53772.0   prdatlas089     8/12 04:02   0+04:22:08  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54203.0   prdatlas047     8/12 11:00   0+04:12:48  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54204.0   prdatlas047     8/12 11:00   0+04:12:03  slot1@xxxxxxxxxxxxxxxxxxxxxxx

I assumed that once these 3 jobs finished, no more new jobs would be sent to node002, but this morning I found new jobs still landing on node002:
svr019:~# condor_q -run | grep node002
54612.0   prdatlas089     8/12 18:55   0+01:59:12  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54700.0   prdatlas089     8/12 20:38   0+01:57:21  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54709.0   prdatlas089     8/12 20:43   0+00:10:24  slot1@xxxxxxxxxxxxxxxxxxxxxxx


So it looks to me like condor_drain did take some drain action on the execute machine, but the machine was later brought back online automatically. I'm not sure whether this is related to the partitionable slot settings or something else. Any ideas?
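
In case it helps with debugging, I can also check the startd log on node002 for drain-related messages. This is only a sketch: the first command asks the startd for its configured log file (STARTD_LOG), and /var/log/condor/StartLog below is just a typical default location rather than necessarily ours:

svr019:~# condor_config_val -name node002.beowulf.cluster -startd STARTD_LOG
node002:~# grep -i drain /var/log/condor/StartLog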


  Cheers,
  Gang