
Re: [HTCondor-users] About idle job on Condor



Vikas,

You're right in that there are as many possible reasons a job couldn't
run as there are jobs. In this case, it looks like your job did run
quite a bit. The 4th column in your condor_q output ("0+06:12:22") is
the accumulated run time, and you'll notice that it has a
LastRemoteHost attribute, which is the last slot the job ran on.

The LastRejMatchReason is the reason for the last time the job failed
to match. That could have been the most recent negotiation cycle or it
could have been from hours or days ago.

Since your requirements look good, I wonder if it isn't a problem with
the job itself. In my experience, if it's a single job failing, the
problem is usually with the job. It could be that the job matches,
starts executing, fails for some reason, and requeues. If so, that 6+
hours of run time could have accumulated from many short attempts, a
few seconds in each negotiation cycle. The NumJobStarts attribute would
tell us how many times the job started. You can also look in the job's
log to see if this is the case. If it is, the output and/or error from
the job may be helpful.
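
As a rough sketch of that check (not something taken from your setup):
the helper name count_job_starts and the file name job.log are made up
for illustration, and the log path is whatever your submit file's
"log" line points to. It counts the "Job executing" events, which the
job event log records with event code 001 at the start of the line.

```shell
# Hypothetical helper: count how many times a job started executing,
# based on the HTCondor job event log given as the first argument.
count_job_starts() {
    # Each event header line begins with a three-digit event code;
    # 001 marks "Job executing on host". n+0 prints 0 when none match.
    awk '/^001 /{n++} END{print n+0}' "$1"
}

# Example (assumes the job's log file is named job.log):
#   count_job_starts job.log
#
# The same information straight from the queue, which needs access to
# the schedd the job was submitted to:
#   condor_q 686.0 -af NumJobStarts LastRemoteHost
```

A large count next to a long accumulated run time would point to the
job starting and failing repeatedly rather than never matching.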


Thanks,
BC

On Thu, Apr 6, 2017 at 10:14 PM, Bansal, Vikas <Vikas.Bansal@xxxxxxxx> wrote:
> Hi,
>
> I am new to Condor. I tried to search in archives about my problem. I suspect it is not a new problem that I am having but I was not able to find a clear solution.
>
> Novice question.
>
> 1. Why do I see a job in Idle state? I suspect there is no general answer to that and it depends on case to case. Is that right?
>
> Here is an example of an idle job
>
> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>  686.0   dirac           3/29 17:30   0+06:12:22 I  0   0.0  DIRAC_HJzfpX_pilot
>
> As far as I can tell, 686.0 never ran (submitted on March 29) and has always been in Idle state.
>
> Let's analyze the job
>
> $ condor_q -analyze 686.0
>
> -- Submitter: dirac-crt.hep.pnnl.gov : <192.101.107.250:10594?noUDP> : dirac-crt.hep.pnnl.gov
>         Last successful match: Fri Apr  7 01:47:54 2017
>
> The Requirements expression for your job is:
>
>     ( TARGET.Name == "slot11@xxxxxxxxxxxxxxxxx" ) &&
>     ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
>     ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
>     ( ( TARGET.HasFileTransfer ) ||
>       ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )
>
>
> Suggestions:
>
>     Condition                         Machines Matched    Suggestion
>     ---------                         ----------------    ----------
> 1   ( TARGET.Name == "slot11@xxxxxxxxxxxxxxxxx" )
>                                       1
> 2   ( TARGET.Arch == "X86_64" )       1298
> 3   ( TARGET.OpSys == "LINUX" )       1298
> 4   ( TARGET.Disk >= 45 )             1298
> 5   ( TARGET.Memory >= ifthenelse(MemoryUsage isnt undefined,MemoryUsage,1) )
>                                       1298
> 6   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "dirac-crt.hep.pnnl.gov" ) )
>                                       1298
>
> ==
>
> So there is a match. One match as I also expect.
>
> Let's also see job detail. Listing only some relevant fields.
>
> $ condor_q -l 686.0
>
>
> LastRemoteHost = "slot11@xxxxxxxxxxxxxxxxx"
> CondorVersion = "$CondorVersion: 8.2.10 Oct 27 2015 $"
> LastRejMatchReason = "no match found"
>
> ==
>
> Why does it say "no match found"?
>
> When I look at the actual node, then it has cpu/memory available.
>
> [cwn-o10 ~]$ top
>
> top - 01:56:18 up 50 days, 12:50,  1 user,  load average: 0.06, 0.07, 0.05
> Tasks: 552 total,   1 running, 551 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem : 63742384 total, 35741192 free,  3025640 used, 24975552 buff/cache
> KiB Swap:        0 total,        0 free,        0 used. 60014588 avail Mem
>
> ==
>
> What else can I look around to conclude why job is in idle state?
>
> Any help to debug this is appreciated.
>
> Thanks,
> Vikas
>
>
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/



-- 
Ben Cotton
Technical Marketing Manager

Cycle Computing
Better Answers. Faster.

http://www.cyclecomputing.com
twitter: @cyclecomputing