
[HTCondor-users] Slots are in use even when no job is running



Hello Condor Experts,

We are facing a strange issue with MPI jobs.

Condor scheduler "condor_q -all" is not showing any job in queue and we are using only single sched node to submit MPI jobs but still if we submit new job it remains in idle state, runningÂbetter-analyze against that job showing below relevant output. 172 is expected as we have 172 vanila jobs running with another user but since no job is running with this user hence count 10 is totally unexpected.Â

1580.000:  Run analysis summary ignoring user priority. Of 194 machines,
    172 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 are exhausted partitionable slots
     10 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
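
For reference, these are the commands assumed to have produced the listing above (the job id 1580.0 is taken from the summary line):

# condor_q -all
# condor_q -better-analyze 1580.0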

I then checked the slots in use on every node in the cluster and found chunks of 10 CPUs marked as in use, which I believe correspond to earlier attempts of the MPI job (we request 10 cores and 5 nodes when running it). When I log in to any node that shows 10 CPUs in use and run condor_who, pstree -p condor, htop, or top, no user process shows up on that node (see the command sketch after the listing below). Again, the total of 172 single-CPU slots in use is expected, since the other user is running those jobs.
# condor_status -compact -af:h machine cpus totalcpus childcpus 'int(totalcpus-cpus)'
machine                cpus totalcpus childcpus                                      int(totalcpus-cpus)
testnode0001.test.com  10   36.0      { 1,1,1,1,1,1,10,10 }                          26
testnode0002.test.com  11   36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10 }           25
testnode0003.test.com  21   36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 }              15
testnode0004.test.com  15   36.0      { 1,1,1,1,1,10,1,1,1,1,1,1 }                   21
testnode0005.test.com  13   36.0      { 10,1,1,1,1,1,1,1,1,1,1,1,1,1 }               23
testnode0006.test.com  24   36.0      { 1,1,1,1,1,1,1,1,1,1,1,1 }                    12
testnode0007.test.com  8    36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10,1 }     28
testnode0008.test.com  12   36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,10 }             24
testnode0009.test.com  12   36.0      { 1,1,1,1,1,1,1,1,10,1,1,1,1,1,1 }             24
testnode0010.test.com  1    36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10,10 }        35
testnode0011.test.com  15   36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 }  21
testnode0012.test.com  17   36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 }      19
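
For completeness, these are the per-node checks referred to above, run as root on an affected node (testnode0001 is just an example); none of them shows a user job process:

# condor_who
# pstree -p condor
# top -b -n 1 | grep -i condor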
The condor_userprio output does not agree with the output above:

# condor_userprio -all
Last Priority Update: 11/19 03:25
                                   Effective     Real  Priority    Res  Total Usage       Usage             Last       Time Since
User Name                           Priority Priority    Factor In Use (wghted-hrs)   Start Time       Usage Time      Last Usage
------------------------------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ----------
testuser1@xxxxxxxx                   3166.63    31.67    100.00    172     11421.95 11/01/2019 00:05 11/19/2019 03:25      <now>
------------------------------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ----------
Number of users: 1                                                 172     11421.95                  11/18/2019 03:25    0+23:59
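
For reference, a way to list the owner of every slot the collector currently considers Claimed (sketched as a suggestion; we have not captured its output here), so the 10 unexpected claims can be tied to a user and a machine:

# condor_status -constraint 'State == "Claimed"' -af:h Name State RemoteOwner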

Adding to the confusion, I found that some old cgroup directories have not been removed. The output below is from testnode0001, on which only 6 vanilla jobs are actually running, each in a dynamic slot under slot1, yet 14 cgroup directories are present.

# ls -ld /cgroup/cpu/htcondor/condor_spare_condor_slot1_*@testnode0001.test.com/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_11@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:06 /cgroup/cpu/htcondor/condor_spare_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_22@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_23@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:06 /cgroup/cpu/htcondor/condor_spare_condor_slot1_26@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:07 /cgroup/cpu/htcondor/condor_spare_condor_slot1_28@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:06 /cgroup/cpu/htcondor/condor_spare_condor_slot1_29@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Jul  9 13:08 /cgroup/cpu/htcondor/condor_spare_condor_slot1_33@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 18 22:07 /cgroup/cpu/htcondor/condor_spare_condor_slot1_36@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:30 /cgroup/cpu/htcondor/condor_spare_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_5@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_6@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_7@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
-rw-r--r-- 1 root root 0 Nov 19 01:32 /cgroup/cpu/htcondor/condor_spare_condor_slot1_9@xxxxxxxxxxxxxxxxxxxxx/cpu.shares
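
For what it's worth, a quick way to see whether any of these leftover cgroups still has a process attached (sketched as a suggestion, not output we have captured; <node-fqdn> is a placeholder for the masked hostname):

for d in /cgroup/cpu/htcondor/condor_spare_condor_slot1_*@<node-fqdn>; do
    # cgroup v1: the tasks file lists the PIDs attached to this cgroup
    echo "$d: $(wc -l < "$d/tasks") task(s)"
done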


Setup Information:

$CondorVersion: 8.5.8 Dec 13 2016 BuildID: 390781 $
$CondorPlatform: x86_64_RedHat6 $


Thanks & Regards,
Vikrant Aggarwal