
Re: [HTCondor-users] About idle job on Condor



On 04/16/2017 01:07 AM, Bansal, Vikas wrote:

$ condor_q -submitter dirac -wide

-- Submitter: dirac@* : <<HostName>:10594?noUDP> : <HostName>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1017.1   dirac           3/30 13:02   0+00:31:41 I  0   976.6 DIRAC_M_gc43_pilotwrapper.py
1018.0   dirac           3/30 13:02   0+00:00:00 I  0   0.0  DIRAC_HKjB7D_pilotwrapper.py
...

So besides job 1017.1 which has registered some runtime, all other 370 jobs are sitting as Idle with runtime of 0.


    ( TARGET.Name == "slot11@xxxxxxxxxxxxxxxxx" ) &&

 


Do all the jobs have this same requirement?  Something like

Requirements = Name == "slot11@xxxxxxxxxxxxxxxxx"

in their submit file?
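
One way to answer that question without opening every submit file is to print each queued job's Requirements expression and tally the distinct values. The following is only a sketch: the helper name tally_requirements is made up for illustration, and the commented usage line assumes that condor_q's -af (autoformat) option is available on your version.

```shell
# tally_requirements: read one Requirements expression per line on stdin
# and print each distinct expression with its count, most common first.
# (Hypothetical helper name, used only for this illustration.)
tally_requirements() {
    sort | uniq -c | sort -rn
}

# Assumed usage against the live queue (not run here):
#   condor_q -submitter dirac -af Requirements | tally_requirements
```

If that prints a single line whose count equals the number of queued jobs, every job carries the same expression.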

If every job does carry that requirement, then all of them can run only on one slot on one machine, and they are all queued up behind job 1017.1. That job is idle right now, but it has accumulated 31 minutes of runtime since 3/30.  What does

grep 1017.1 test.log

report?  I suspect something is going wrong with this one job, or with that one machine: job 1017.1 matches, starts running, and gets evicted immediately. It can't make progress, yet it blocks the rest of the jobs from running.
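
To see at a glance whether the job keeps starting and getting evicted, you could count its events in the log by type. This is a rough sketch, not a definitive tool. It assumes the classic user log layout, where each event line begins with a three-digit event code followed by the job ID written as zero-padded (cluster.proc.subproc), and where 001 means the job started executing and 004 means it was evicted. If your log is laid out that way, job 1017.1 would appear as (1017.001.000), so a plain search for "1017.1" may need adjusting. The function name count_events is hypothetical.

```shell
# count_events LOGFILE CLUSTER PROC
# Prints "<event-code> <count>" for every event type recorded for one job.
# Assumes event lines look like "<code> (<cluster>.<proc>.<subproc>) ...".
count_events() {
    log=$1
    # Build the zero-padded ID prefix, e.g. cluster 1017, proc 1 -> "(1017.001."
    id=$(printf '(%03d.%03d.' "$2" "$3")
    # Keep only lines whose second field starts with that prefix,
    # then count them per event code (the first field).
    awk -v id="$id" '
        index($2, id) == 1 { n[$1]++ }
        END { for (c in n) print c, n[c] }
    ' "$log" | sort
}

# Assumed usage:
#   count_events test.log 1017 1
```

Roughly equal counts for 001 and 004, with no termination event, would fit the picture of a job that starts and is evicted over and over.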

-greg