
[HTCondor-users] About idle job on Condor



Hi,

I am new to Condor. I searched the archives for my problem; I suspect it is not a new one, but I could not find a clear solution.

Novice question:

1. Why do I see a job in the Idle state? I suspect there is no general answer and that it depends on the case. Is that right?

Here is an example of an idle job

ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 686.0   dirac           3/29 17:30   0+06:12:22 I  0   0.0  DIRAC_HJzfpX_pilot

As far as I can tell, 686.0 never ran (it was submitted on March 29) and has always been in the Idle state.

Let's analyze the job.

$ condor_q -analyze 686.0

-- Submitter: dirac-crt.hep.pnnl.gov : <192.101.107.250:10594?noUDP> : dirac-crt.hep.pnnl.gov
	Last successful match: Fri Apr  7 01:47:54 2017

The Requirements expression for your job is:

    ( TARGET.Name == "slot11@xxxxxxxxxxxxxxxxx" ) &&
    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( ( TARGET.HasFileTransfer ) ||
      ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )


Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( TARGET.Name == "slot11@xxxxxxxxxxxxxxxxx" )
                                      1                    
2   ( TARGET.Arch == "X86_64" )       1298                 
3   ( TARGET.OpSys == "LINUX" )       1298                 
4   ( TARGET.Disk >= 45 )             1298                 
5   ( TARGET.Memory >= ifthenelse(MemoryUsage isnt undefined,MemoryUsage,1) )
                                      1298                 
6   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "dirac-crt.hep.pnnl.gov" ) )
                                      1298                 

==

So there is a match: exactly one machine, which is what I expect.
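As a rough sketch of how I read that table (plain Python, not real ClassAd evaluation; the machine ads and slot names below are made up for illustration), each condition is counted against every machine on its own, and only a machine that passes all of them is a full match:

```python
# Toy model of the per-condition counts in the condor_q -analyze table.
# The machine ads are hypothetical; only the attribute names come from the output above.
MACHINES = [
    {"Name": "slot11@node-a", "Arch": "X86_64", "OpSys": "LINUX",
     "Disk": 1000, "Memory": 2048, "HasFileTransfer": True},
    {"Name": "slot12@node-a", "Arch": "X86_64", "OpSys": "LINUX",
     "Disk": 1000, "Memory": 2048, "HasFileTransfer": True},
    {"Name": "slot1@node-b", "Arch": "X86_64", "OpSys": "LINUX",
     "Disk": 500, "Memory": 4096, "HasFileTransfer": True},
]

# Job-side values: a disk request of 45 and a memory request of 1, as in the table.
WANTED_SLOT = "slot11@node-a"  # stands in for the masked slot name
REQUEST_DISK = 45
REQUEST_MEMORY = 1

# One (label, predicate) pair per row of the Suggestions table.
CONDITIONS = [
    ("Name", lambda m: m["Name"] == WANTED_SLOT),
    ("Arch", lambda m: m["Arch"] == "X86_64"),
    ("OpSys", lambda m: m["OpSys"] == "LINUX"),
    ("Disk", lambda m: m["Disk"] >= REQUEST_DISK),
    ("Memory", lambda m: m["Memory"] >= REQUEST_MEMORY),
    ("FileTransfer", lambda m: m["HasFileTransfer"]),
]

def per_condition_counts(machines):
    """Count, for each condition separately, how many machines satisfy it."""
    return {label: sum(1 for m in machines if test(m)) for label, test in CONDITIONS}

def full_matches(machines):
    """Names of machines that satisfy every condition at once."""
    return [m["Name"] for m in machines if all(test(m) for _, test in CONDITIONS)]

if __name__ == "__main__":
    print(per_condition_counts(MACHINES))
    print(full_matches(MACHINES))
```

This models only the job-side Requirements expression shown above. It says nothing about whether the slot is currently free or willing to run the job.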

Let's also look at the job details. I am listing only some relevant fields.

$ condor_q -l 686.0


LastRemoteHost = "slot11@xxxxxxxxxxxxxxxxx"
CondorVersion = "$CondorVersion: 8.2.10 Oct 27 2015 $"
LastRejMatchReason = "no match found"

==

Why does it say "no match found"?

When I look at the actual node, it has CPU and memory available.

[cwn-o10 ~]$ top

top - 01:56:18 up 50 days, 12:50,  1 user,  load average: 0.06, 0.07, 0.05
Tasks: 552 total,   1 running, 551 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 63742384 total, 35741192 free,  3025640 used, 24975552 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 60014588 avail Mem

==

What else can I look at to determine why the job is in the Idle state?

Any help to debug this is appreciated.

Thanks,
Vikas