
[HTCondor-users] Slots remain Claimed/Idle after MPI job finishes



Hi,

Assume my cluster has 16 slots.
I submit a parallel job that requires all 16 slots. After it finishes, all of those slots remain Claimed/Idle.
If I then submit a vanilla job, it stays idle until those slots become Unclaimed/Idle, which takes about 10 minutes.
Why does the dedicated scheduler not release its claims when a parallel job finishes, and how can I deal with that?

==============================================================================

condor_q -better-analyze

062.000:  Run analysis summary ignoring user priority.  Of 16 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
     16 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
  
==============================================================================

In condor_config.local, I added the following lines to configure the dedicated resources.

DedicatedScheduler = "DedicatedScheduler@ubuntu"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
START = True
SUSPEND   = False
CONTINUE  = True
PREEMPT   = False
KILL      = False
WANT_SUSPEND   = False
WANT_VACATE    = False
RANK      = Scheduler =?= $(DedicatedScheduler)
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
  
=============================================================================

vi `condor_config_val schedd_log`
...
02/03/18 00:45:55 (pid:15673) Found 16 potential dedicated resources in 0 seconds
02/03/18 00:45:55 (pid:15673) Inserting new attribute Scheduler into non-active cluster cid=61 acid=-1
02/03/18 00:45:55 (pid:15673) Found 16 potential dedicated resources in 0 seconds
02/03/18 00:45:55 (pid:15673) Starting add_shadow_birthdate(61.0)
02/03/18 00:45:55 (pid:15673) Started shadow for job 61.0 on slot1@ubuntu <172.18.217.37:9618?addrs=172.18.217.37-9618&noUDP&sock=14871_2961_9> for DedicatedScheduler, (shadow pid = 30325)
02/03/18 00:45:59 (pid:15673) Number of Active Workers 0
02/03/18 00:46:00 (pid:15673) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
02/03/18 00:46:00 (pid:15673) TransferQueueManager upload 1m I/O load: 2145 bytes/s  0.000 disk load  0.000 net load
02/03/18 00:46:00 (pid:15673) TransferQueueManager download 1m I/O load: 47 bytes/s  0.000 disk load  0.000 net load
02/03/18 00:46:00 (pid:15673) In DedicatedScheduler::reaper pid 30325 has status 25600
02/03/18 00:46:00 (pid:15673) Shadow pid 30325 exited with status 100
02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
...
02/03/18 00:56:00 (pid:15673) Resource slot12@ubuntu has been unused for 600 seconds, limit is 600, releasing
02/03/18 00:56:00 (pid:15673) Resource slot4@ubuntu has been unused for 600 seconds, limit is 600, releasing
02/03/18 00:56:00 (pid:15673) Resource slot11@ubuntu has been unused for 600 seconds, limit is 600, releasing
02/03/18 00:56:00 (pid:15673) Resource slot3@ubuntu has been unused for 600 seconds, limit is 600, releasing
02/03/18 00:56:00 (pid:15673) Resource slot10@ubuntu has been unused for 600 seconds, limit is 600, releasing
...
=============================================================================

vi `condor_config_val startd_log`
...
02/03/18 00:46:00 slot14: State change: starter exited
02/03/18 00:46:00 slot14: Changing activity: Busy -> Idle
02/03/18 00:46:00 slot16: Called deactivate_claim_forcibly()
02/03/18 00:46:00 Starter pid 30341 exited with status 0
02/03/18 00:46:00 slot15: State change: starter exited
02/03/18 00:46:00 slot15: Changing activity: Busy -> Idle
02/03/18 00:46:00 slot1: Called deactivate_claim()
02/03/18 00:46:00 Starter pid 30343 exited with status 0
02/03/18 00:46:00 slot16: State change: starter exited
02/03/18 00:46:00 slot16: Changing activity: Busy -> Idle
02/03/18 00:46:00 slot2: Called deactivate_claim()
02/03/18 00:46:00 slot3: Called deactivate_claim()
02/03/18 00:46:00 slot4: Called deactivate_claim()
02/03/18 00:46:00 slot5: Called deactivate_claim()
02/03/18 00:46:00 slot6: Called deactivate_claim()
02/03/18 00:46:00 slot7: Called deactivate_claim()
02/03/18 00:46:00 slot8: Called deactivate_claim()
02/03/18 00:46:00 slot9: Called deactivate_claim()
02/03/18 00:46:00 slot10: Called deactivate_claim()
02/03/18 00:46:00 slot11: Called deactivate_claim()
02/03/18 00:46:00 slot12: Called deactivate_claim()
02/03/18 00:46:00 slot13: Called deactivate_claim()
02/03/18 00:46:00 slot14: Called deactivate_claim()
02/03/18 00:46:00 slot15: Called deactivate_claim()
02/03/18 00:46:00 slot16: Called deactivate_claim()
02/03/18 00:52:02 Got SIGHUP.  Re-reading config files.
02/03/18 00:52:02 History file rotation is enabled.
02/03/18 00:52:02   Maximum history file size is: 20971520 bytes
02/03/18 00:52:02   Number of rotated history files is: 2
02/03/18 00:56:00 slot12: State change: received RELEASE_CLAIM command
02/03/18 00:56:00 slot12: Changing state and activity: Claimed/Idle -> Preempting/Vacating
02/03/18 00:56:00 slot12: State change: No preempting claim, returning to owner
02/03/18 00:56:00 slot12: Changing state and activity: Preempting/Vacating -> Owner/Idle
02/03/18 00:56:00 slot12: State change: IS_OWNER is false
02/03/18 00:56:00 slot12: Changing state: Owner -> Unclaimed
...

=============================================================================

The startd log shows that the startd called deactivate_claim() as soon as the MPI job finished, at 00:46:00. However, it did not receive the RELEASE_CLAIM command until 00:56:00, 10 minutes later. What is the reason for that?
Is there any solution?
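
As a sanity check on the timing, the gap can be computed from the log timestamps themselves. Below is a minimal Python sketch, assuming the "MM/DD/YY HH:MM:SS" timestamp prefix seen in the startd log excerpt; the two sample lines are the slot12 entries from that excerpt (timestamp and message only), and the function names are made up for illustration.

```python
from datetime import datetime

# Two slot12 entries taken from the startd log excerpt (timestamp and message only).
LOG_LINES = [
    "02/03/18 00:46:00 slot12: Called deactivate_claim()",
    "02/03/18 00:56:00 slot12: State change: received RELEASE_CLAIM command",
]


def parse_timestamp(line):
    """Parse the leading 'MM/DD/YY HH:MM:SS' timestamp of a log line."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%m/%d/%y %H:%M:%S")


def claim_idle_seconds(lines, slot):
    """Seconds between deactivate_claim() and RELEASE_CLAIM for one slot."""
    deactivated = released = None
    for line in lines:
        # Only look at lines that belong to the requested slot.
        if f" {slot}: " not in line:
            continue
        if "Called deactivate_claim()" in line:
            deactivated = parse_timestamp(line)
        elif "received RELEASE_CLAIM" in line:
            released = parse_timestamp(line)
    if deactivated is None or released is None:
        raise ValueError(f"missing events for {slot}")
    return (released - deactivated).total_seconds()


print(claim_idle_seconds(LOG_LINES, "slot12"))  # 600.0
```

The result, 600 seconds, matches the "unused for 600 seconds, limit is 600" messages in the schedd log, so the claim appears to be held until that limit is reached rather than released when the job exits.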