
[HTCondor-users] Jobs stuck and slots hanging Claimed/Idle forever in the parallel universe



Hi,
I'm trying to install and configure a very small HTCondor cluster (6 servers to begin with: 1 execute+submit+manager, 5 execute). This cluster should be able to manage different kinds of jobs, in both the vanilla and the parallel universe (OpenMP/OpenMPI). Since the initial resources are few, and so are the users, I'm not concerned with time and/or resource accounting, and I've defined a single group to which I've assigned all available resources (just to avoid any problem related to user priorities).
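The single-group setup in the negotiator configuration looks roughly like this (a minimal sketch; the group name is a placeholder, the real values are in the environment dump linked below):

# one accounting group that owns the whole pool
GROUP_NAMES = group_all
GROUP_QUOTA_DYNAMIC_group_all = 1.0
GROUP_ACCEPT_SURPLUS = True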

Now, I have a very basic submit file that invokes a trivial script on a shared filesystem, but I'm hitting a strange behaviour. When I submit my job, some (partitionable) slots are matched, but they are then not used to run the job. Depending on the value of machine_count defined in the submit file, either there are sufficient resources left, in which case the leftovers are re-matched and the job completes successfully, or the remaining resources are not enough and the job stays in the queue forever. In both cases the slots matched first just hang as Claimed/Idle forever.
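The stuck claims are visible with the standard tools; this is a sketch of what I run to watch the pool (the job id 1.0 is taken from the logs below):

condor_status                  # the dynamic slots sit at State=Claimed, Activity=Idle
condor_q -better-analyze 1.0   # matching analysis for the stuck job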

I've read and studied the HTCondor Administration manual, following the instructions to set up dedicated scheduling and to avoid any interference from user priorities. Moreover, I've been playing with the configuration and parameters for a week, solving a few minor issues, but now I'm stuck. Can anybody help me? Below you can find some relevant information; a more complete environment dump can be found here:
http://www.bo.ingv.it/~perfetti/htcondor_debug-00/

In particular, the output of condor_gather_info:
http://www.bo.ingv.it/~perfetti/htcondor_debug-00/condor-profile.txt

All the information is taken from the manager node, which also acts as submit and execute node.
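For reference, the execute nodes carry the standard dedicated-scheduling settings from the Administration manual, along these lines (a sketch; the submit host name is a placeholder):

# condor_config on the execute nodes
DedicatedScheduler = "DedicatedScheduler@<submit host>"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
# prefer jobs coming from the dedicated scheduler
RANK = Scheduler =?= $(DedicatedScheduler)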

Thanks for your help,
	Paolo

##############
# /etc/issue #
##############
 Debian GNU/Linux 7 \n \l

#######################
# /etc/debian_version #
#######################
 7.2

############
# uname -a #
############
 Linux asgard01 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64 GNU/Linux

##################
# Condor Version #
##################
 $CondorVersion: 8.0.3 Sep 19 2013 BuildID: 174914 $
 $CondorPlatform: x86_64_Debian7 $


####### submit file
Universe = parallel
executable = <path>/parallel_00.sh
notification = Always
+ParallelShutdownPolicy = "WAIT_FOR_ALL"
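# machine_count is set per test run; see test-00 / test-01 below
machine_count = <N>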

output  = <path>/out/parallel_00.$(Node)
error   = <path>/err/parallel_00.$(Node)
log     = <path>/log/parallel_00.log
Queue
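
Submission itself is just the usual (the submit file name here is only an example):

condor_submit parallel_00.sub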

#################################################
## test-00 : machine_count = 30
#################################################


######## SchedLog
10/18/13 18:23:47 (pid:1194) Sent ad to central manager for <user>@bo.ingv.it
10/18/13 18:23:47 (pid:1194) Sent ad to 1 collectors for <user>@bo.ingv.it
10/18/13 18:23:47 (pid:1194) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:23:47 (pid:1194) Number of Active Workers 1
10/18/13 18:23:47 (pid:1496) Number of Active Workers 0
10/18/13 18:23:47 (pid:1194) Using negotiation protocol: NEGOTIATE
10/18/13 18:23:47 (pid:1194) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47 (pid:1194) DedicatedScheduler: negotiator sent match for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:23:47 (pid:1194) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:23:47 (pid:1194) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:23:47 (pid:1194) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:23:47 (pid:1194) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:23:47 (pid:1194) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:23:47 (pid:1194) Finished negotiating for DedicatedScheduler in local pool: 6 matched, 24 rejected

######## MatchLog
10/18/13 18:23:47 Matched 1.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167> preempting none <192.168.100.160:46215> slot1@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47 Matched 1.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167> preempting none <192.168.100.164:59436> slot1@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47 Matched 1.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167> preempting none <192.168.100.162:42756> slot1@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47 Matched 1.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167> preempting none <192.168.100.165:50720> slot1@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47 Matched 1.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167> preempting none <192.168.100.161:33674> slot1@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47 Matched 1.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167> preempting none <192.168.100.163:38387> slot1@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47 Rejected 1.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167>: no match found
10/18/13 18:23:47 Rejected 1.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167>: no match found


######## StartLog
10/18/13 18:23:47 slot1: Schedd addr = <192.168.100.160:54167>
10/18/13 18:23:47 slot1: Alive interval = 300
10/18/13 18:23:47 slot1: Received ClaimId from schedd (<192.168.100.160:46215>#1382113367#1#...)
10/18/13 18:23:47 slot1: Match requesting resources: cpus=1 memory=128 disk=1%
10/18/13 18:23:47 slot1_1: Rank of this claim is: 1.000000
10/18/13 18:23:47 slot1_1: Request accepted.
10/18/13 18:23:47 Will send partitionable slot leftovers to schedd
10/18/13 18:23:47 slot1_1: Remote owner is <user>@bo.ingv.it
10/18/13 18:23:47 slot1_1: State change: claiming protocol successful
10/18/13 18:23:47 slot1_1: Changing state: Owner -> Claimed
10/18/13 18:23:47 slot1_1: Started ClaimLease timer (16) w/ 1800 second lease duration
10/18/13 18:23:47 slot1: Schedd addr = <192.168.100.160:54167>
10/18/13 18:23:47 slot1: Alive interval = 300
10/18/13 18:23:47 slot1: Received ClaimId from schedd (<192.168.100.160:46215>#1382113367#3#...)
10/18/13 18:23:47 slot1: Match requesting resources: cpus=1 memory=128 disk=1%
10/18/13 18:23:47 slot1_2: Rank of this claim is: 1.000000
10/18/13 18:23:47 slot1_2: Request accepted.
10/18/13 18:23:47 Will send partitionable slot leftovers to schedd
10/18/13 18:23:47 slot1_2: Remote owner is <user>@bo.ingv.it
10/18/13 18:23:47 slot1_2: State change: claiming protocol successful
10/18/13 18:23:47 slot1_2: Changing state: Owner -> Claimed
10/18/13 18:23:47 slot1_2: Started ClaimLease timer (19) w/ 1800 second lease duration
10/18/13 18:23:47 slot1_1: match_info called


#################################################
## test-01 : machine_count = 12
#################################################


######## SchedLog
10/18/13 18:16:24 (pid:31620) Using negotiation protocol: NEGOTIATE
10/18/13 18:16:24 (pid:31620) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxx
10/18/13 18:16:24 (pid:31620) DedicatedScheduler: negotiator sent match for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) DedicatedScheduler: negotiator sent match for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:16:24 (pid:31620) DedicatedScheduler: negotiator sent match for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) DedicatedScheduler: negotiator sent match for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) DedicatedScheduler: negotiator sent match for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) DedicatedScheduler: negotiator sent match for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) Finished negotiating for DedicatedScheduler in local pool: 6 matched, 6 rejected