
Re: [HTCondor-users] stuck queued jobs



Thanks Todd,

Here's the output from the commands. Since I have an older version of
Condor on my condor master machine, I had to tweak them accordingly.

[16:09:36][condor@condor:/condor/local/log]$ condor_q 35259 -format
'%s\n' scheduler
DedicatedScheduler@xxxxxxxxxxxxxxxx

[16:10:48][condor@condor:/condor/local/log]$ condor_status
myhost.blah.com -format '%s\n' rank
Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxx"
Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxx"

[16:10:53][condor@condor:/condor/local/log]$ condor_version
$CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $
$CondorPlatform: x86_64_rhap_5 $

My server farm machines, however, are on 8.2.8. Anything else I can look
out for?
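
If it helps, I can also pull the relevant settings straight from the
startd's config, along these lines (just a sketch; the host name is the
one from my requirements):

   condor_config_val -name myhost.blah.com -startd RANK
   condor_config_val -name myhost.blah.com -startd DedicatedScheduler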

Thanks.


On 11/05/2015 03:57 PM, Todd Tannenbaum wrote:
> On 11/5/2015 1:59 PM, Ajao, Remi A. wrote:
>> Hello,
>>
>> I'm having issues with various jobs being stuck in the queue for
>> hours, even though there are more than enough available servers to
>> run the jobs.
>>
>> The only evidence of the job ID in the logs is in the sched log on
>> the condor master. I see a whole bunch of lines like this:
>>
>> Inserting new attribute Scheduler into non-active cluster cid
>>
>> condor_q -analyze jobID - This actually says it's matched to a node,
>> which is not surprising because a specific host name is given as part
>> of the description.
>>
> Quick thought --
>
> On your stuck job, please do
>
>    condor_q jobID -af:r scheduler
>
> You should see something like "DedicatedScheduler@xxxxxxxxxxxxxxxxxx"
>
> Now do
>
>    condor_status blah.host.com -af:r rank
>
> (where blah.host.com is the host you restricted your job to using in the 
> requirements)
>
> You should see the output from condor_status look something like
>    Scheduler=?="DedicatedScheduler@xxxxxxxxxxxxxxxxxx"
> where the exact name in quotes (that starts with DedicatedScheduler) 
> needs to be the same as what you got above from condor_q.
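>
> (Background: =?= is the ClassAd "is identical to" operator; unlike ==,
> it never evaluates to UNDEFINED. In this Rank expression it is TRUE
> only when the job's Scheduler attribute is defined and exactly matches
> the quoted name, so a missing or mismatched DedicatedScheduler name
> means the rank never matches dedicated jobs.)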
>
> If it is not the same, see http://is.gd/Zo7lZF for how to set up nodes to 
> run parallel universe jobs.
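>
> For reference, the startd-side settings for a dedicated node look
> roughly like this (a sketch based on the manual's
> condor_config.local.dedicated.resource example; substitute your
> schedd's full host name):
>
>    ## name of the dedicated scheduler allowed to claim this node
>    DedicatedScheduler = "DedicatedScheduler@full.host.name"
>    ## advertise the attribute so the dedicated scheduler can find it
>    STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
>    ## always prefer dedicated jobs over everything else
>    RANK = Scheduler =?= $(DedicatedScheduler)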
>
> If it is the same, please share with us the output from the above 
> commands along with the output from condor_version.
>
> The bit about
>    Inserting new attribute Scheduler into non-active cluster cid
> is harmless in the case of a parallel universe job and something we will 
> remove from the log in an upcoming release.
>
> Hope the above helps,
> Todd
>
>
>
>
>
>> There is no sign of any of these jobs in the Negotiator log; here's
>> an example of what the sched log looks like regarding the message I
>> mentioned earlier.
>>
>> http://pastebin.com/XKQSpDuV
>>
>> Here's what the submit.txt file looks like -
>> http://pastebin.com/yx5JUJnY
>>
>>
>> Executable = g-blah.sh
>> Universe = parallel
>> Log = g-blah.sh.log
>> Error = err.$(cluster).$(process).$(node).txt
>> Output = out.$(cluster).$(process).$(node).txt
>> Stream_output = True
>> Stream_error = True
>>
>> #+ParallelShutdownPolicy = "WAIT_FOR_ALL"
>> machine_count = 1
>> Environment = LOCKHOME=/home/condor/parallel_universe;CLUSTER_ID=sgoyal/vertest_4node_query_1/2015_11_05__22.14.06;SVNBRANCH=trunk;SVNREV=HEAD;RPMURL=http://10.10.10.16/kits/releases/7.1.2-10//7.1.2-10.x86_64.RHEL5.rpm;user=sgoyal;testlist=vertest_4node_query_1;LOCAL_CLUSTER_NNODES=4;TESTFILTERS=four;r_rpmurl=http://10.10.10.16/kits/releases/7.1.2-4/R_lang/R-lang-7.1.2-4.x86_64.RHEL5.rpm;r_analytics_rpmurl=;r_place_rpmurl=http://10.10.10.16/kits/releases/7.1.2-0/place/place-7.1.2-0.x86_64.RHEL5.rpm;r_pulse_rpmurl=http://10.10.10.16/kits/releases/7.1.2-0/pulse/pulse-7.1.2-0.x86_64.RHEL5.rpm;ignore_rpm_rev=true;SVNBRANCH=branches/7.1_DRAGLINE_SP2_HOTFIX;SVNREV=HEAD;JAVA_HOME_QA=/usr/lib/jvm/java-1.7.0-openjdk.x86_64;STORE_RESULTS=true;VETT_BATCHUPDATE=false;;TIMELIMIT=82100;EMAIL_SUCCESS=blah@xxxxxxxx;EMAIL_FAILURE=blah@xxxxxxxx;EMAIL_STATUS=blah@xxxxxxxx
>> arguments = cluster/query_regress_1
>> should_transfer_files = YES
>> when_to_transfer_output = ON_EXIT_OR_EVICT
>> # need to explicitly make coresize big otherwise condor sets it to zero by default
>> coresize = 4000000000
>> periodic_remove = RemoteWallClockTime > 82800
>> +TimeLimit = 82800
>> Requirements = (sf_maintenance == FALSE) && (SF == 1) && (Memory > 1024) && (STAGE == 0) && (QA=?=UNDEFINED) && Machine == "blah.host.com"
>>
>> Queue
>>
>> I also enabled more debugging via SCHEDD_DEBUG, but I'm not seeing
>> any other interesting data. Any help is much appreciated. It's worth
>> noting that I do have some other jobs that are running, mostly
>> vanilla universe; it's the parallel universe ones that seem to be
>> mostly affected.
>>
>> Thanks
>>
>>
>>
>