
[HTCondor-users] stuck queued jobs



Hello,

I'm having issues with various jobs being stuck in the queue for more than a few hours, even though there are more than enough available servers to run them.

The only evidence of the job ID in the logs is in the schedd log on the condor master, where I see a whole bunch of lines like this:

Inserting new attribute Scheduler into non-active cluster cid

condor_q -analyze jobID: this actually says the job is matched to a node, which is not surprising because a specific host name is given as part of the job description.

There is no sign of any of these jobs in the Negotiator log. Here's an example of what the schedd log looks like regarding the message I mentioned earlier:

http://pastebin.com/XKQSpDuV

Here's what the submit.txt file looks like - http://pastebin.com/yx5JUJnY


Executable = g-blah.sh
Universe = parallel
Log = g-blah.sh.log
Error  = err.$(cluster).$(process).$(node).txt
Output = out.$(cluster).$(process).$(node).txt
Stream_output = True
Stream_error = True

#+ParallelShutdownPolicy = "WAIT_FOR_ALL"
machine_count = 1
Environment = LOCKHOME=/home/condor/parallel_universe;CLUSTER_ID=sgoyal/vertest_4node_query_1/2015_11_05__22.14.06;SVNBRANCH=trunk;SVNREV=HEAD;RPMURL=http://10.10.10.16/kits/releases/7.1.2-10//7.1.2-10.x86_64.RHEL5.rpm;user=sgoyal;testlist=vertest_4node_query_1;LOCAL_CLUSTER_NNODES=4;TESTFILTERS=four;r_rpmurl=http://10.10.10.16/kits/releases/7.1.2-4/R_lang/R-lang-7.1.2-4.x86_64.RHEL5.rpm;r_analytics_rpmurl=;r_place_rpmurl=http://10.10.10.16/kits/releases/7.1.2-0/place/place-7.1.2-0.x86_64.RHEL5.rpm;r_pulse_rpmurl=http://10.10.10.16/kits/releases/7.1.2-0/pulse/pulse-7.1.2-0.x86_64.RHEL5.rpm;ignore_rpm_rev=true;SVNBRANCH=branches/7.1_DRAGLINE_SP2_HOTFIX;SVNREV=HEAD;JAVA_HOME_QA=/usr/lib/jvm/java-1.7.0-openjdk.x86_64;STORE_RESULTS=true;VETT_BATCHUPDATE=false;;TIMELIMIT=82100;EMAIL_SUCCESS=blah@xxxxxxxx;EMAIL_FAILURE=blah@xxxxxxxx;EMAIL_STATUS=blah@xxxxxxxx;
arguments = cluster/query_regress_1
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
# need to explicitly make coresize big otherwise condor sets it to zero by default
coresize = 4000000000
periodic_remove = RemoteWallClockTime > 82800
+TimeLimit = 82800
Requirements = (sf_maintenance == FALSE) && (SF == 1) && (Memory > 1024) && (STAGE == 0) && (QA=?=UNDEFINED) && Machine == "blah.host.com"

Queue
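
As a side check on the Environment line in the submit file above, here is a minimal Python sketch (not part of the original submission) that splits a semicolon-delimited NAME=value list and reports any name assigned more than once. The helper names parse_environment and duplicate_keys are hypothetical, and the env string is a shortened excerpt of the line above. It only shows that SVNBRANCH and SVNREV each appear twice; it makes no claim about how HTCondor resolves such duplicates or whether they are related to the stuck jobs.

```python
from collections import defaultdict


def parse_environment(value):
    """Split a semicolon-delimited NAME=value list into (name, value) pairs."""
    pairs = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            # Skip empty entries, such as the one produced by ';;' in the file.
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        name, _, val = entry.partition("=")
        pairs.append((name, val))
    return pairs


def duplicate_keys(pairs):
    """Return {name: [values...]} for every name assigned more than once."""
    seen = defaultdict(list)
    for name, val in pairs:
        seen[name].append(val)
    return {name: vals for name, vals in seen.items() if len(vals) > 1}


# Shortened excerpt of the Environment value from the submit file (illustrative).
env = (
    "LOCKHOME=/home/condor/parallel_universe;SVNBRANCH=trunk;SVNREV=HEAD;"
    "ignore_rpm_rev=true;SVNBRANCH=branches/7.1_DRAGLINE_SP2_HOTFIX;SVNREV=HEAD;"
    "STORE_RESULTS=true;VETT_BATCHUPDATE=false;;TIMELIMIT=82100"
)

for name, vals in duplicate_keys(parse_environment(env)).items():
    print(name, vals)
```

Running the same check against the full Environment value would need only the env string replaced.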

I also enabled more debugging via SCHEDD_DEBUG, but I'm not seeing any other interesting data. It's worth noting that I do have some other jobs that are running, mostly vanilla universe; it's the parallel universe ones that seem to be mostly affected. Any help is much appreciated.

Thanks