
Re: [HTCondor-users] About idle job on Condor



Ben,

The most common reason we see this at our site is a kernel update, and only for jobs in the Standard universe.  Such jobs have a checkpoint image that Condor won't schedule to run on the new kernel, so they have to be removed and resubmitted.  This does not happen with every kernel update, but it often does.  Have you updated your kernel since the job was submitted, and is this a Standard universe job?

Andy

Andy Feldt
Senior System Support Programmer
Affiliate Assistant Professor
Homer L. Dodge Department of Physics & Astronomy
The University of Oklahoma

On Apr 7, 2017, at 7:55 AM, Ben Cotton <ben.cotton@xxxxxxxxxxxxxxxxxx> wrote:

Vikas,

You're right in that there are as many possible reasons a job couldn't
run as there are jobs. In this case, it looks like your job did run
quite a bit. The 4th column in your condor_q output ("0+06:12:22") is
the accumulated run time, and you'll notice that it has a
LastRemoteHost attribute, which is the last slot the job ran on.

The LastRejMatchReason is the reason for the last time the job failed
to match. That could have been the most recent negotiation cycle or it
could have been from hours or days ago.

Since your requirements look good, I wonder if it isn't a problem with
the job itself. In my experience, if it's a single job failing, the
problem is usually with the job. It could be that the job matches,
starts executing, fails for some reason, and requeues. So that 6+
hours of run time could have come from a few seconds of attempt each
negotiation cycle. The NumJobStarts attribute would let us know how
many times it started. You can also look in the job's log to see if
this is the case. If it is, the output and/or error from the job may be
helpful.
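
To make the arithmetic behind that concrete, here is a minimal Python sketch. It converts the RUN_TIME string from the condor_q output ("0+06:12:22", read as days+HH:MM:SS) into seconds and divides it by a start count. The helper names and the NumJobStarts values below are hypothetical, since the actual NumJobStarts for 686.0 is not shown in this thread; substitute the real value from condor_q -l.

```python
import re


def parse_run_time(text):
    """Convert a condor_q RUN_TIME string (days+HH:MM:SS) to seconds."""
    match = re.fullmatch(r"(\d+)\+(\d{2}):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognized RUN_TIME value: {text!r}")
    days, hours, minutes, seconds = (int(part) for part in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def average_seconds_per_start(run_time, num_job_starts):
    """Average accumulated run time per start attempt, in seconds."""
    if num_job_starts <= 0:
        raise ValueError("num_job_starts must be positive")
    return parse_run_time(run_time) / num_job_starts


if __name__ == "__main__":
    run_time = "0+06:12:22"  # RUN_TIME column for job 686.0 in the thread
    # Hypothetical NumJobStarts values, for illustration only.
    for starts in (1, 50, 2000):
        avg = average_seconds_per_start(run_time, starts)
        print(f"NumJobStarts={starts}: about {avg:.1f} s per start")
```

A small average per start would point toward the job repeatedly starting and failing; a single long run would point elsewhere.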


Thanks,
BC

On Thu, Apr 6, 2017 at 10:14 PM, Bansal, Vikas <Vikas.Bansal@xxxxxxxx> wrote:
Hi,

I am new to Condor. I searched the archives for my problem. I suspect it is not a new problem, but I was not able to find a clear solution.

Novice question.

1. Why do I see a job in Idle state? I suspect there is no general answer to that and it varies from case to case. Is that right?

Here is an example of an idle job

ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
686.0   dirac           3/29 17:30   0+06:12:22 I  0   0.0  DIRAC_HJzfpX_pilot

As far as I can tell, 686.0 never ran (submitted on March 29) and has always been in Idle state.

Let's analyze the job

$ condor_q -analyze 686.0

-- Submitter: dirac-crt.hep.pnnl.gov : <192.101.107.250:10594?noUDP> : dirac-crt.hep.pnnl.gov
       Last successful match: Fri Apr  7 01:47:54 2017

The Requirements expression for your job is:

   ( TARGET.Name == "slot11@xxxxxxxxxxxxxxxxx" ) &&
   ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
   ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
   ( ( TARGET.HasFileTransfer ) ||
     ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )


Suggestions:

   Condition                         Machines Matched    Suggestion
   ---------                         ----------------    ----------
1   ( TARGET.Name == "slot11@xxxxxxxxxxxxxxxxx" )  1
2   ( TARGET.Arch == "X86_64" )       1298
3   ( TARGET.OpSys == "LINUX" )       1298
4   ( TARGET.Disk >= 45 )             1298
5   ( TARGET.Memory >= ifthenelse(MemoryUsage isnt undefined,MemoryUsage,1) )
                                     1298
6   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "dirac-crt.hep.pnnl.gov" ) )
                                     1298

==

So there is a match: exactly one, as I expect.

Let's also look at the job details. I am listing only some relevant fields.

$ condor_q -l 686.0


LastRemoteHost = "slot11@xxxxxxxxxxxxxxxxx"
CondorVersion = "$CondorVersion: 8.2.10 Oct 27 2015 $"
LastRejMatchReason = "no match found"

==

Why does it say "no match found"?

When I look at the actual node, it has CPU and memory available.

[cwn-o10 ~]$ top

top - 01:56:18 up 50 days, 12:50,  1 user,  load average: 0.06, 0.07, 0.05
Tasks: 552 total,   1 running, 551 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 63742384 total, 35741192 free,  3025640 used, 24975552 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 60014588 avail Mem

==

What else can I look at to determine why the job is in Idle state?

Any help to debug this is appreciated.

Thanks,
Vikas



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Ben Cotton
Technical Marketing Manager

Cycle Computing
Better Answers. Faster.

http://www.cyclecomputing.com
twitter: @cyclecomputing
