
Re: [HTCondor-users] stuck queued jobs



On 11/5/2015 1:59 PM, Ajao, Remi A. wrote:
Hello,

I'm having issues with various jobs being stuck in the queue for several
hours at a time, even though there are more than enough available
servers to run them.

The only evidence of the job ID in the logs is in the schedd log on the
condor master. I see a whole bunch of lines like this:

Inserting new attribute Scheduler into non-active cluster cid

condor_q -analyze jobID - this actually says the job is matched to a node,
which is not surprising because the submit description restricts it to a
specific host name.


Quick thought --

On your stuck job, please do

  condor_q jobID -af:r scheduler

You should see something like "DedicatedScheduler@xxxxxxxxxxxxxxxxxx"

Now do

  condor_status blah.host.com -af:r rank

(where blah.host.com is the host you restricted your job to via the requirements)

You should see the output from condor_status look something like
  Scheduler=?="DedicatedScheduler@xxxxxxxxxxxxxxxxxx"
where the exact name in quotes (that starts with DedicatedScheduler) needs to be the same as what you got above from condor_q.
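
If it helps, here are the two checks side by side (a sketch only; 123.0 and blah.host.com are placeholders for your own stuck job id and the node named in your requirements):

  # what scheduler the job says it belongs to
  condor_q 123.0 -af:r scheduler

  # what scheduler the startd's Rank expression prefers
  condor_status blah.host.com -af:r rank

If the DedicatedScheduler@... name printed by the first command does not appear, character for character, in the Rank expression printed by the second, the node has not been configured for the dedicated scheduler.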

If it is not the same, see http://is.gd/Zo7lZF for how to set up nodes to run parallel universe jobs.
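
For reference, the usual recipe on each dedicated execute node looks roughly like the snippet below (adapted from the stock condor_config.local.dedicated.resource example that ships with HTCondor; the host name is a placeholder you would replace with your actual submit/schedd machine):

  ## local config on the execute node
  DedicatedScheduler = "DedicatedScheduler@full.hostname.of.submit.machine"
  STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
  RANK = Scheduler =?= $(DedicatedScheduler)

followed by a condor_reconfig (or restart) of the startd so the new Rank and advertised attribute take effect.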

If it is the same, please share with us the output from the above commands along with the output from condor_version.

The bit about
  Inserting new attribute Scheduler into non-active cluster cid
is harmless in the case of a parallel universe job and something we will remove from the log in an upcoming release.

Hope the above helps,
Todd





There is no sign of any of these jobs in the Negotiator log. Here's an
example of what the schedd log looks like regarding the message I
mentioned earlier:

http://pastebin.com/XKQSpDuV

Here's what the submit.txt file looks like -
http://pastebin.com/yx5JUJnY


Executable = g-blah.sh
Universe = parallel
Log = g-blah.sh.log
Error = err.$(cluster).$(process).$(node).txt
Output = out.$(cluster).$(process).$(node).txt
Stream_output = True
Stream_error = True

#+ParallelShutdownPolicy = "WAIT_FOR_ALL"
machine_count = 1
Environment = LOCKHOME=/home/condor/parallel_universe;CLUSTER_ID=sgoyal/vertest_4node_query_1/2015_11_05__22.14.06;SVNBRANCH=trunk;SVNREV=HEAD;RPMURL=http://10.10.10.16/kits/releases/7.1.2-10//7.1.2-10.x86_64.RHEL5.rpm;user=sgoyal;testlist=vertest_4node_query_1;LOCAL_CLUSTER_NNODES=4;TESTFILTERS=four;r_rpmurl=http://10.10.10.16/kits/releases/7.1.2-4/R_lang/R-lang-7.1.2-4.x86_64.RHEL5.rpm;r_analytics_rpmurl=;r_place_rpmurl=http://10.10.10.16/kits/releases/7.1.2-0/place/place-7.1.2-0.x86_64.RHEL5.rpm;r_pulse_rpmurl=http://10.10.10.16/kits/releases/7.1.2-0/pulse/pulse-7.1.2-0.x86_64.RHEL5.rpm;ignore_rpm_rev=true;SVNBRANCH=branches/7.1_DRAGLINE_SP2_HOTFIX;SVNREV=HEAD;JAVA_HOME_QA=/usr/lib/jvm/java-1.7.0-openjdk.x86_64;STORE_RESULTS=true;VETT_BATCHUPDATE=false;;TIMELIMIT=82100;EMAIL_SUCCESS=blah@xxxxxxxx;EMAIL_FAILURE=blah@xxxxxxxx;EMAIL_STATUS=blah@xxxxxxxx
arguments = cluster/query_regress_1
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
# need to explicitly make coresize big otherwise condor sets it to zero by default
coresize = 4000000000
periodic_remove = RemoteWallClockTime > 82800
+TimeLimit = 82800
Requirements = (sf_maintenance == FALSE) && (SF == 1) && (Memory > 1024) && (STAGE == 0) && (QA=?=UNDEFINED) && Machine == "blah.host.com"

Queue

I also enabled more debugging via SCHEDD_DEBUG, but I'm not seeing any
other interesting data. Any help is much appreciated. It's worth
noting that I do have some other jobs that are running, mostly
vanilla universe; it's the parallel universe ones that seem to be
mostly affected.
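
(Roughly along these lines, in the local config on the submit machine, then a condor_reconfig so the schedd picks it up:

  SCHEDD_DEBUG = D_FULLDEBUG

The extra detail then shows up in the SchedLog.)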

Thanks


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
with a subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685