
Re: [HTCondor-users] Slots remain Claimed/Idle after MPI job finishes



The dedicated scheduler holds on to claims after jobs complete in case
other parallel universe jobs need to be scheduled. One of the more
seasoned Condor veterans would know the exact reason, but I believe it
is because multi-slot parallel universe jobs can be difficult to
schedule in a pool that is shared with other types of jobs.
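
You can check for these lingering claims from the command line; for
example (just an illustration, adjust the constraint and attributes to
taste), something like

  condor_status -constraint 'State == "Claimed" && Activity == "Idle"' \
      -af Name RemoteOwner

will list the slots that are sitting in Claimed/Idle along with who is
holding the claim.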

In any case, you can modify the length of time that a slot will stay
in Claimed/Idle by changing UNUSED_CLAIM_TIMEOUT. From
condor_config.local.dedicated.submit in the examples directory:

######################################################################
##  Settings you may want to customize:
##  (it is generally safe to leave these untouched)
######################################################################

## If the dedicated scheduler has resources claimed, but nothing to
## use them for (no MPI jobs in the queue that could use them), how
## long should it hold onto them before releasing them back to the
## regular Condor pool?  Specified in seconds.  Default is 10 minutes.
## If you define this to '0', the schedd will never release claims
## (unless the schedd is shutdown).  If your dedicated resources are
## configured to only run dedicated jobs, you should probably set this
## attribute to '0'.
#UNUSED_CLAIM_TIMEOUT = 600
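
So, as a sketch (the 60-second value is just an example, pick whatever
makes sense for your workload), you could put something like this in
the config on your submit machine and then run condor_reconfig there:

  ## Release unused dedicated claims after 1 minute instead of 10
  UNUSED_CLAIM_TIMEOUT = 60

Setting it lower means vanilla jobs get the slots back sooner, at the
cost of the dedicated scheduler having to re-acquire claims the next
time a parallel job shows up.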

Jason

On Fri, Feb 2, 2018 at 11:27 AM, Alan <852016362@xxxxxx> wrote:
> Hi,
>
> Assume there are 16 slots in my cluster.
> I submit a parallel job which requires all 16 slots. After it finishes, all
> of those slots remain Claimed/Idle.
> If I then submit a vanilla job, it stays idle until those slots become
> Unclaimed/Idle, which takes about 10 minutes.
> Why does the dedicated scheduler not release the claims when a parallel job
> finishes? And how can I deal with that?
>
> ==============================================================================
>
> condor_q -better-analyze
>
> 062.000:  Run analysis summary ignoring user priority.  Of 16 machines,
>       0 are rejected by your job's requirements
>       0 reject your job because of their own requirements
>      16 match and are already running your jobs
>       0 match but are serving other users
>       0 are available to run your job
>
> ==============================================================================
>
> In condor_config.local, I added the following lines to configure the
> dedicated resources.
>
>  DedicatedScheduler = "DedicatedScheduler@ubuntu"
>  STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
>  START = True
>  SUSPEND   = False
>  CONTINUE  = True
>  PREEMPT   = False
>  KILL      = False
>  WANT_SUSPEND   = False
>  WANT_VACATE    = False
>  RANK      = Scheduler =?= $(DedicatedScheduler)
>  MPI_CONDOR_RSH_PATH = $(LIBEXEC)
>  CONDOR_SSHD = /usr/sbin/sshd
>  CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
>
> =============================================================================
>
> vi `condor_config_val schedd_log`
> ...
> 02/03/18 00:45:55 (pid:15673) Found 16 potential dedicated resources in 0 seconds
> 02/03/18 00:45:55 (pid:15673) Inserting new attribute Scheduler into non-active cluster cid=61 acid=-1
> 02/03/18 00:45:55 (pid:15673) Found 16 potential dedicated resources in 0 seconds
> 02/03/18 00:45:55 (pid:15673) Starting add_shadow_birthdate(61.0)
> 02/03/18 00:45:55 (pid:15673) Started shadow for job 61.0 on slot1@ubuntu
> <172.18.217.37:9618?addrs=172.18.217.37-9618&noUDP&sock=14871_2961_9> for
> DedicatedScheduler, (shadow pid = 30325)
> 02/03/18 00:45:59 (pid:15673) Number of Active Workers 0
> 02/03/18 00:46:00 (pid:15673) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 02/03/18 00:46:00 (pid:15673) TransferQueueManager upload 1m I/O load: 2145 bytes/s  0.000 disk load  0.000 net load
> 02/03/18 00:46:00 (pid:15673) TransferQueueManager download 1m I/O load: 47 bytes/s  0.000 disk load  0.000 net load
> 02/03/18 00:46:00 (pid:15673) In DedicatedScheduler::reaper pid 30325 has status 25600
> 02/03/18 00:46:00 (pid:15673) Shadow pid 30325 exited with status 100
> 02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
> 02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
> 02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
> 02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
> 02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
> 02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
> 02/03/18 00:46:00 (pid:15673) DedicatedScheduler::deallocMatchRec
> ...
> 02/03/18 00:56:00 (pid:15673) Resource slot12@ubuntu has been unused for 600 seconds, limit is 600, releasing
> 02/03/18 00:56:00 (pid:15673) Resource slot4@ubuntu has been unused for 600 seconds, limit is 600, releasing
> 02/03/18 00:56:00 (pid:15673) Resource slot11@ubuntu has been unused for 600 seconds, limit is 600, releasing
> 02/03/18 00:56:00 (pid:15673) Resource slot3@ubuntu has been unused for 600 seconds, limit is 600, releasing
> 02/03/18 00:56:00 (pid:15673) Resource slot10@ubuntu has been unused for 600 seconds, limit is 600, releasing
> ...
> =============================================================================
>
> vi `condor_config_val startd_log`
> ...
> 02/03/18 00:46:00 slot14: State change: starter exited
> 02/03/18 00:46:00 slot14: Changing activity: Busy -> Idle
> 02/03/18 00:46:00 slot16: Called deactivate_claim_forcibly()
> 02/03/18 00:46:00 Starter pid 30341 exited with status 0
> 02/03/18 00:46:00 slot15: State change: starter exited
> 02/03/18 00:46:00 slot15: Changing activity: Busy -> Idle
> 02/03/18 00:46:00 slot1: Called deactivate_claim()
> 02/03/18 00:46:00 Starter pid 30343 exited with status 0
> 02/03/18 00:46:00 slot16: State change: starter exited
> 02/03/18 00:46:00 slot16: Changing activity: Busy -> Idle
> 02/03/18 00:46:00 slot2: Called deactivate_claim()
> 02/03/18 00:46:00 slot3: Called deactivate_claim()
> 02/03/18 00:46:00 slot4: Called deactivate_claim()
> 02/03/18 00:46:00 slot5: Called deactivate_claim()
> 02/03/18 00:46:00 slot6: Called deactivate_claim()
> 02/03/18 00:46:00 slot7: Called deactivate_claim()
> 02/03/18 00:46:00 slot8: Called deactivate_claim()
> 02/03/18 00:46:00 slot9: Called deactivate_claim()
> 02/03/18 00:46:00 slot10: Called deactivate_claim()
> 02/03/18 00:46:00 slot11: Called deactivate_claim()
> 02/03/18 00:46:00 slot12: Called deactivate_claim()
> 02/03/18 00:46:00 slot13: Called deactivate_claim()
> 02/03/18 00:46:00 slot14: Called deactivate_claim()
> 02/03/18 00:46:00 slot15: Called deactivate_claim()
> 02/03/18 00:46:00 slot16: Called deactivate_claim()
> 02/03/18 00:52:02 Got SIGHUP.  Re-reading config files.
> 02/03/18 00:52:02 History file rotation is enabled.
> 02/03/18 00:52:02   Maximum history file size is: 20971520 bytes
> 02/03/18 00:52:02   Number of rotated history files is: 2
> 02/03/18 00:56:00 slot12: State change: received RELEASE_CLAIM command
> 02/03/18 00:56:00 slot12: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 02/03/18 00:56:00 slot12: State change: No preempting claim, returning to owner
> 02/03/18 00:56:00 slot12: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 02/03/18 00:56:00 slot12: State change: IS_OWNER is false
> 02/03/18 00:56:00 slot12: Changing state: Owner -> Unclaimed
> ...
>
> =============================================================================
>
> The startd log shows that right after the MPI job finishes, the startd calls
> deactivate_claim(). However, it only receives the RELEASE_CLAIM command 10
> minutes later. What is the reason for that?
> Is there any solution?
>