
Re: [HTCondor-users] Jobs stuck and slots hanging Claimed/Idle forever in parallel universe



Has anyone experienced the same issue, or does anyone have hints on how to investigate or solve it? Any pointers would be much appreciated.

I did some more tests with the "Simplest Example" in section 2.9.3 "Submission Examples" of the User Manual, but I obtained the same results.
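
For reference, that example is roughly the following submit description
(quoted from memory, so the exact values may differ slightly from the
manual's text):

  ######## "Simplest Example" from manual section 2.9.3 (approximate)
  universe      = parallel
  executable    = /bin/sleep
  arguments     = 30
  machine_count = 8
  log           = log
  queue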



Tnx,
	Paolo


On 18/10/2013 19:41, Paolo Perfetti wrote:
Hi,
     I'm trying to install and configure a very small cluster (6 servers
to begin with: 1 execute+submit+manager, 5 execute) based on HTCondor.
This cluster should be able to manage different kinds of jobs, in both
the vanilla and parallel universes (OpenMP/OpenMPI).
Since both the initial resources and the users are few, I'm not
concerned with time or resource accounting, and I've defined only one
group to which I've assigned all available resources (just to avoid any
problems related to user priorities).
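
For context, the single-group part of the negotiator configuration is
along these lines (a minimal sketch, assuming the group is called
group_all; the actual values are in the configuration dump linked
below):

  # one accounting group that may use the whole pool
  GROUP_NAMES                   = group_all
  GROUP_QUOTA_DYNAMIC_group_all = 1.0
  GROUP_ACCEPT_SURPLUS          = True
  GROUP_AUTOREGROUP             = True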

Now I have a very basic submit file that invokes a trivial script on a
shared filesystem, but I'm hitting some strange behaviour.
When I submit my job, some (partitionable) slots are matched, but then
they are not used to run the job.
Depending on the value of machine_count defined in the submit file:
if there are sufficient resources left, the leftovers are re-matched
and the job completes successfully; if the remaining resources are not
enough, the job stays in the queue forever. In both cases the slots
matched first just hang as Claimed/Idle forever.
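
Just for reference, the stuck slots and the idle job can be inspected
with standard condor_status/condor_q queries, e.g.:

  # dynamic slots that are Claimed but not running anything
  condor_status -constraint 'State == "Claimed" && Activity == "Idle"'
  # why job 1.0 is (still) idle
  condor_q -better-analyze 1.0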

I've read and studied the HTCondor Administrator's Manual, following
the instructions for setting up dedicated scheduling (sketched further
below) and for keeping user priorities out of the picture.
Moreover, I've been playing with the configuration and parameters for a
week, solving a few minor issues, but now I'm stuck.
Can anybody help me? Below you can find some relevant information; a
more complete environment dump can be found here:
http://www.bo.ingv.it/~perfetti/htcondor_debug-00/

In particular, the output of condor_gather_info:
http://www.bo.ingv.it/~perfetti/htcondor_debug-00/condor-profile.txt

All the information below was taken from the manager node, which also
acts as a submit and execute node.
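
For what it's worth, the dedicated-scheduling setup on the execute
nodes follows the recipe from the manual, roughly (a sketch; the exact
values are in the dump linked above):

  # advertise the dedicated scheduler and prefer its jobs
  DedicatedScheduler = "DedicatedScheduler@<full hostname of the manager>"
  STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
  RANK = Scheduler =?= $(DedicatedScheduler)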

Thanks for your help,
     Paolo

##############
# /etc/issue #
##############
  Debian GNU/Linux 7 \n \l

#######################
# /etc/debian_version #
#######################
  7.2

############
# uname -a #
############
  Linux asgard01 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64 GNU/Linux

##################
# Condor Version #
##################
  $CondorVersion: 8.0.3 Sep 19 2013 BuildID: 174914 $
  $CondorPlatform: x86_64_Debian7 $


####### submit file
Universe = parallel
executable = <path>/parallel_00.sh
notification = Always
+ParallelShutdownPolicy = "WAIT_FOR_ALL"

output  = <path>/out/parallel_00.$(Node)
error   = <path>/err/parallel_00.$(Node)
log     = <path>/log/parallel_00.log
Queue
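
The submit file above omits machine_count, which is the parameter
varied between the tests below; for each test it is set in the submit
file together with Queue, e.g. for test-00:

  machine_count = 30
  Queue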

#################################################
## test-00 : machine_count = 30
#################################################


######## SchedLog
10/18/13 18:23:47 (pid:1194) Sent ad to central manager for
<user>@bo.ingv.it
10/18/13 18:23:47 (pid:1194) Sent ad to 1 collectors for <user>@bo.ingv.it
10/18/13 18:23:47 (pid:1194) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:23:47 (pid:1194) Number of Active Workers 1
10/18/13 18:23:47 (pid:1496) Number of Active Workers 0
10/18/13 18:23:47 (pid:1194) Using negotiation protocol: NEGOTIATE
10/18/13 18:23:47 (pid:1194) Negotiating for owner:
DedicatedScheduler@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47 (pid:1194) DedicatedScheduler: negotiator sent match
for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:23:47 (pid:1194) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:23:47 (pid:1194) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:23:47 (pid:1194) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:23:47 (pid:1194) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:23:47 (pid:1194) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:23:47 (pid:1194) Finished negotiating for DedicatedScheduler
in local pool: 6 matched, 24 rejected

######## MatchLog
10/18/13 18:23:47       Matched 1.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167>
preempting none <192.168.100.160:46215> slot1@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47       Matched 1.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167>
preempting none <192.168.100.164:59436> slot1@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47       Matched 1.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167>
preempting none <192.168.100.162:42756> slot1@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47       Matched 1.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167>
preempting none <192.168.100.165:50720> slot1@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47       Matched 1.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167>
preempting none <192.168.100.161:33674> slot1@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47       Matched 1.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167>
preempting none <192.168.100.163:38387> slot1@xxxxxxxxxxxxxxxxxxx
10/18/13 18:23:47       Rejected 1.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167>: no match
found
10/18/13 18:23:47       Rejected 1.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxx <192.168.100.160:54167>: no match
found


######## StartLog
10/18/13 18:23:47 slot1: Schedd addr = <192.168.100.160:54167>
10/18/13 18:23:47 slot1: Alive interval = 300
10/18/13 18:23:47 slot1: Received ClaimId from schedd
(<192.168.100.160:46215>#1382113367#1#...)
10/18/13 18:23:47 slot1: Match requesting resources: cpus=1 memory=128
disk=1%
10/18/13 18:23:47 slot1_1: Rank of this claim is: 1.000000
10/18/13 18:23:47 slot1_1: Request accepted.
10/18/13 18:23:47 Will send partitionable slot leftovers to schedd
10/18/13 18:23:47 slot1_1: Remote owner is <user>@bo.ingv.it
10/18/13 18:23:47 slot1_1: State change: claiming protocol successful
10/18/13 18:23:47 slot1_1: Changing state: Owner -> Claimed
10/18/13 18:23:47 slot1_1: Started ClaimLease timer (16) w/ 1800 second
lease duration
10/18/13 18:23:47 slot1: Schedd addr = <192.168.100.160:54167>
10/18/13 18:23:47 slot1: Alive interval = 300
10/18/13 18:23:47 slot1: Received ClaimId from schedd
(<192.168.100.160:46215>#1382113367#3#...)
10/18/13 18:23:47 slot1: Match requesting resources: cpus=1 memory=128
disk=1%
10/18/13 18:23:47 slot1_2: Rank of this claim is: 1.000000
10/18/13 18:23:47 slot1_2: Request accepted.
10/18/13 18:23:47 Will send partitionable slot leftovers to schedd
10/18/13 18:23:47 slot1_2: Remote owner is <user>@bo.ingv.it
10/18/13 18:23:47 slot1_2: State change: claiming protocol successful
10/18/13 18:23:47 slot1_2: Changing state: Owner -> Claimed
10/18/13 18:23:47 slot1_2: Started ClaimLease timer (19) w/ 1800 second
lease duration
10/18/13 18:23:47 slot1_1: match_info called


#################################################
## test-01 : machine_count = 12
#################################################


######## SchedLog
10/18/13 18:16:24 (pid:31620) Using negotiation protocol: NEGOTIATE
10/18/13 18:16:24 (pid:31620) Negotiating for owner:
DedicatedScheduler@xxxxxxxxxxxxxxxxxxx
10/18/13 18:16:24 (pid:31620) DedicatedScheduler: negotiator sent match
for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) DedicatedScheduler: negotiator sent match
for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:16:24 (pid:31620) DedicatedScheduler: negotiator sent match
for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) DedicatedScheduler: negotiator sent match
for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) DedicatedScheduler: negotiator sent match
for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) DedicatedScheduler: negotiator sent match
for slot1@xxxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
10/18/13 18:16:24 (pid:31620) Finished negotiating for
DedicatedScheduler in local pool: 6 matched, 6 rejected