
[HTCondor-users] jobs still landing on the machine even after condor_drain



Dear Expert:

In our cluster we are running Condor 8.2.1, and I use partitionable slots on the execute machines.

## Partitionable slots
NUM_SLOTS = 1
SLOT_TYPE_1               = cpus=100%,mem=100%,auto
NUM_SLOTS_TYPE_1          = 1
SLOT_TYPE_1_PARTITIONABLE = TRUE

# Consumption policy
CONSUMPTION_POLICY = True
SLOT_TYPE_1_CONSUMPTION_POLICY = True
SLOT_TYPE_1_CONSUMPTION_CPUS = TARGET.RequestCpus
SLOT_TYPE_1_CONSUMPTION_MEMORY = TARGET.RequestMemory
SLOT_TYPE_1_CONSUMPTION_DISK = TARGET.RequestDisk
SLOT_WEIGHT = Cpus

USE_PID_NAMESPACES = False
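
For what it's worth, these settings can be double-checked remotely with condor_config_val, something like the following (just a sketch; it assumes the startd on node002 is reachable from the submit host, and each command should simply echo the value configured above):

svr019:~# condor_config_val -name node002.beowulf.cluster -startd SLOT_TYPE_1_PARTITIONABLE
svr019:~# condor_config_val -name node002.beowulf.cluster -startd CONSUMPTION_POLICY
svr019:~# condor_config_val -name node002.beowulf.cluster -startd SLOT_TYPE_1_CONSUMPTION_CPUS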



Yesterday we planned to reboot node002 in the Condor cluster, so in the afternoon I ran 'condor_drain node002.beowulf.cluster' to prevent new jobs from being sent to node002. After that I saw the following change in condor_status:

slot1@xxxxxxxxxxxx   LINUX  X86_64  Drained  Retiring  0.020  27750  0+00:00:04
slot1_1@xxxxxxxxxx   LINUX  X86_64  Claimed  Retiring  8.000   2000  0+00:00:04
slot1_2@xxxxxxxxxx   LINUX  X86_64  Claimed  Retiring  8.000   2000  0+00:00:04
slot1_3@xxxxxxxxxx   LINUX  X86_64  Claimed  Retiring  7.990   2000  0+00:00:04

  node002 is a 24-core machine and each job running there is an 8-core job, so these three jobs account for all 24 cores.

  And after some time it changed to:

slot1@xxxxxxxxxxxx   LINUX  X86_64  Unclaimed  Idle  0.020  27750  0+00:02:56
slot1_1@xxxxxxxxxx   LINUX  X86_64  Claimed    Busy  7.890   2000  0+00:02:57
slot1_2@xxxxxxxxxx   LINUX  X86_64  Claimed    Busy  7.810   2000  0+00:02:57
slot1_3@xxxxxxxxxx   LINUX  X86_64  Claimed    Busy  8.240   2000  0+00:02:57
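
(In case it's useful, the slot states can also be watched more directly with condor_status autoformatting, roughly like this; this is only a sketch, and the grep is just a guess at surfacing whatever drain-related attributes the slot ads carry:)

svr019:~# condor_status -constraint 'Machine == "node002.beowulf.cluster"' -af:h Name State Activity
svr019:~# condor_status -long slot1@node002.beowulf.cluster | grep -i drain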



During that period it was still the same 3 jobs running on node002, with the following IDs:
vr019:~# condor_q -run | grep node002
53772.0   prdatlas089     8/12 04:02   0+04:22:08  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54203.0   prdatlas047     8/12 11:00   0+04:12:48  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54204.0   prdatlas047     8/12 11:00   0+04:12:03  slot1@xxxxxxxxxxxxxxxxxxxxxxx

I assumed that once these 3 jobs finished, no more new jobs would be sent to node002, but this morning I found new jobs still landing on node002:
svr019:~# condor_q -run | grep node002
54612.0   prdatlas089     8/12 18:55   0+01:59:12  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54700.0   prdatlas089     8/12 20:38   0+01:57:21  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54709.0   prdatlas089     8/12 20:43   0+00:10:24  slot1@xxxxxxxxxxxxxxxxxxxxxxx


So it looks to me like condor_drain did take some drain action on the execute machine, but the machine was later brought back online automatically. I'm not sure whether this is related to the partitionable slot settings or something else. Any ideas?
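
In case it helps with debugging, I can also check the startd log on node002 for drain-related messages. This is only a sketch: the first command asks the startd for its configured log file (STARTD_LOG), and /var/log/condor/StartLog below is just a typical default location rather than necessarily ours:

svr019:~# condor_config_val -name node002.beowulf.cluster -startd STARTD_LOG
node002:~# grep -i drain /var/log/condor/StartLog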


  Cheers,
  Gang