
Re: [HTCondor-users] About idle job on Condor



Ben, Andy, Greg and All,

 

First of all thank you for your quick response.

My apologies; I got busy elsewhere, and by now the original job is no longer in the Condor queue.

So let me ask the same question for another set of idle jobs.

 

To answer Andy, to the best of my knowledge there has not been any kernel update since job submission.

 

I currently have 371 idle jobs in my queue.

 

e.g.

$ date

Sun Apr 16 05:46:06 UTC 2017

 

$ condor_q -submitter dirac -wide

 

 

-- Submitter: dirac@* : <<HostName>:10594?noUDP> : <HostName>

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

1017.1   dirac           3/30 13:02   0+00:31:41 I  0   976.6 DIRAC_M_gc43_pilotwrapper.py

1018.0   dirac           3/30 13:02   0+00:00:00 I  0   0.0  DIRAC_HKjB7D_pilotwrapper.py

1018.1   dirac           3/30 13:02   0+00:00:00 I  0   0.0  DIRAC_HKjB7D_pilotwrapper.py

1019.0   dirac           3/30 13:02   0+00:00:00 I  0   0.0  DIRAC_5LXpxv_pilotwrapper.py

1019.1   dirac           3/30 13:02   0+00:00:00 I  0   0.0  DIRAC_5LXpxv_pilotwrapper.py

1020.0   dirac           3/30 13:02   0+00:00:00 I  0   0.0  DIRAC_W84i2r_pilotwrapper.py

1020.1   dirac           3/30 13:02   0+00:00:00 I  0   0.0  DIRAC_W84i2r_pilotwrapper.py

1021.0   dirac           3/30 13:02   0+00:00:00 I  0   0.0  DIRAC_CAvWfZ_pilotwrapper.py

1021.1   dirac           3/30 13:02   0+00:00:00 I  0   0.0  DIRAC_CAvWfZ_pilotwrapper.py

…

1747.0   dirac           4/13 00:23   0+00:00:00 I  0   0.0  DIRAC_0lLeH1_pilotwrapper.py

1747.1   dirac           4/13 00:23   0+00:00:00 I  0   0.0  DIRAC_0lLeH1_pilotwrapper.py

1758.0   dirac           4/16 00:31   0+00:00:00 I  0   0.0  DIRAC_f6jnKD_pilotwrapper.py

1758.1   dirac           4/16 00:31   0+00:00:00 I  0   0.0  DIRAC_f6jnKD_pilotwrapper.py

1759.0   dirac           4/16 00:32   0+00:00:00 I  0   0.0  DIRAC_alcGPB_pilotwrapper.py

1759.1   dirac           4/16 00:32   0+00:00:00 I  0   0.0  DIRAC_alcGPB_pilotwrapper.py

1760.0   dirac           4/16 01:33   0+00:00:00 I  0   0.0  DIRAC_k8ak8m_pilotwrapper.py

1760.1   dirac           4/16 01:33   0+00:00:00 I  0   0.0  DIRAC_k8ak8m_pilotwrapper.py

1761.0   dirac           4/16 01:34   0+00:00:00 I  0   0.0  DIRAC_fL9q4H_pilotwrapper.py

1761.1   dirac           4/16 01:34   0+00:00:00 I  0   0.0  DIRAC_fL9q4H_pilotwrapper.py

 

371 jobs; 0 completed, 0 removed, 371 idle, 0 running, 0 held, 0 suspended

 

So apart from job 1017.1, which has registered some run time, the other 370 jobs are sitting idle with a run time of zero.

The latest job, 1761.1, is now more than 5 hours old.
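For the record, this is how I tallied the zero-runtime idle jobs. The snippet below is a self-contained sketch that uses a few sample lines standing in for the real condor_q output above (the temp-file path is just for illustration):

```shell
# Recreate a few lines of the condor_q output shown above as sample input.
cat > /tmp/condor_q_sample.txt <<'EOF'
1017.1   dirac           3/30 13:02   0+00:31:41 I  0   976.6 DIRAC_M_gc43_pilotwrapper.py
1018.0   dirac           3/30 13:02   0+00:00:00 I  0   0.0  DIRAC_HKjB7D_pilotwrapper.py
1018.1   dirac           3/30 13:02   0+00:00:00 I  0   0.0  DIRAC_HKjB7D_pilotwrapper.py
EOF

# Field 5 is RUN_TIME and field 6 is the job state (ST); count jobs that
# are idle ("I") and have accumulated no run time at all.
awk '$6 == "I" && $5 == "0+00:00:00" { n++ } END { print n }' /tmp/condor_q_sample.txt
```

Against the real queue, piping `condor_q -submitter dirac -wide` into the same awk filter gives the 370 count.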

 

I also checked the log, following Greg's suggestion:

$ condor_q 1761.1 -af UserLog

/home/dirac/belle_crt/dirac_logs/output/test.log

 

$ grep 1761 test.log 

 

000 (1761.000.000) 04/16 01:34:16 Job submitted from host: <192.168.XXX.XXX:10594>

000 (1761.001.000) 04/16 01:34:17 Job submitted from host: <192.168.XXX.XXX:10594>

 

 

I would like to understand why they are in the idle state, especially the 370 jobs with a run time of zero.

Any idea? Please feel free to ask me for more details.

 

I also checked

 

$ condor_q -analyze  1761.1

 

 

-- Submitter: <>

User priority for dirac@* is not available, attempting to analyze without it.

---

1761.001:  Run analysis summary.  Of 1298 machines,

   1298 are rejected by your job's requirements 

      0 reject your job because of their own requirements 

      0 match and are already running your jobs 

      0 match but are serving other users 

      0 are available to run your job

 

WARNING:  Be advised:

   No resources matched request's constraints

 

The Requirements _expression_ for your job is:

 

    ( TARGET.Name == "slot11@xxxxxxxxxxxxxxxxx" ) &&

    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&

    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&

    ( ( TARGET.HasFileTransfer ) ||

      ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )

 

 

Suggestions:

 

    Condition                         Machines Matched    Suggestion

    ---------                         ----------------    ----------

1   ( TARGET.Name == "slot11@xxxxxxxxxxxxxxxxx" )    0               MODIFY TO "slot1@xxxxxxxxxxxxxxxxx"

2   ( TARGET.Arch == "X86_64" )       1298                 

==

 

It seems there is no match. Is this why the job stays idle?

Also, on what basis does Condor make a suggestion? That is, why does it suggest using slot1@xxxxxxxxxxxxxxxxx?
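To check my own understanding of the "1298 are rejected" line, here is a toy sketch of the matching step (purely my assumption of how it works, not HTCondor code; the host and slot count are made up). A machine matches only if every clause of the Requirements expression evaluates to true against its ad, so a job pinned to a specific slot name can match at most one machine in the whole pool:

```shell
# Toy model: only the Name clause of Requirements is evaluated here.
# The job asks for slot11, but the (hypothetical) machine only has
# slots 1..8, so no machine ad satisfies the clause.
required_name="slot11@cwn-o10"
matched=0
for slot in 1 2 3 4 5 6 7 8; do
  machine_name="slot${slot}@cwn-o10"
  if [ "$machine_name" = "$required_name" ]; then
    matched=$((matched + 1))
  fi
done
echo "machines matched: $matched"
```

If that is roughly right, it would explain why all 1298 machines are rejected by the job's own requirements.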

 

Also, does Condor remove idle jobs after some time? If so, what is the default time limit?
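I ask because I wonder whether our pool has a schedd-side cleanup policy configured, something like the hypothetical condor_config fragment below (illustrative only; I have not verified whether anything like this is set in our configuration):

```
# Hypothetical example: remove jobs that have been idle (JobStatus == 1)
# for more than 7 days. I do not know if our pool sets anything similar.
SYSTEM_PERIODIC_REMOVE = (JobStatus == 1 && (time() - EnteredCurrentStatus) > 7 * 24 * 3600)
```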

 

Thank you for your help.

 

Vikas

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of "Feldt, Andrew N." <afeldt@xxxxxx>
Reply-To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Date: Friday, April 7, 2017 at 6:36 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] About idle job on Condor

 

Ben,

 

The most common reason this occurs in our situation is after a kernel update and only for jobs in the Standard universe.  We have found that these have a checkpoint image that Condor won't schedule to run on the new kernel and they have to be removed and resubmitted.  This does not happen for every kernel update, but often does.  Have you updated your kernel since the job was submitted and is this a Standard universe job?

 

Andy

 

Andy Feldt

Senior System Support Programmer

Affiliate Assistant Professor

Homer L. Dodge Department of Physics & Astronomy

The University of Oklahoma

 

On Apr 7, 2017, at 7:55 AM, Ben Cotton <ben.cotton@xxxxxxxxxxxxxxxxxx> wrote:

 

Vikas,

You're right in that there are as many possible reasons a job couldn't
run as there are jobs. In this case, it looks like your job did run
quite a bit. The 4th column in your condor_q output ("0+06:12:22") is
the accumulated run time, and you'll notice that it has a
LastRemoteHost attribute, which is the last slot the job ran on.

The LastRejMatchReason is the reason for the last time the job failed
to match. That could have been the most recent negotiation cycle or it
could have been from hours or days ago.

Since your requirements look good, I wonder if it isn't a problem with
the job itself. In my experience, if it's a single job failing, the
problem is usually with the job. It could be that the job matches,
starts executing, fails for some reason, and requeues. So that 6+
hours of run time could have come from a few seconds of attempt each
negotiation cycle. The NumJobStarts attribute would let us know how
many times it started. You can also look in the job's log to see if
this is the case. If it is, the output and/or error from the job may be
helpful.
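For example, besides `condor_q 686.0 -af NumJobStarts`, you can count execution attempts directly in the user log: each start is recorded as an event-code-001 ("Job executing") record. A self-contained sketch (the log contents below are invented for illustration; point grep at the job's real UserLog instead):

```shell
# Invented sample user log: one submission, two execute/evict cycles.
cat > /tmp/sample_job.log <<'EOF'
000 (686.000.000) 03/29 17:30:00 Job submitted from host: <192.101.107.250:10594>
001 (686.000.000) 03/29 17:31:12 Job executing on host: <10.0.0.5:9618>
004 (686.000.000) 03/29 17:31:20 Job was evicted.
001 (686.000.000) 03/29 17:45:02 Job executing on host: <10.0.0.6:9618>
004 (686.000.000) 03/29 17:45:11 Job was evicted.
EOF

# Count "Job executing" (event code 001) records: the number of starts.
grep -c '^001 ' /tmp/sample_job.log
```

A large count with near-zero accumulated run time would point at the run-fail-requeue pattern described above.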


Thanks,
BC

On Thu, Apr 6, 2017 at 10:14 PM, Bansal, Vikas <Vikas.Bansal@xxxxxxxx> wrote:

Hi,

I am new to Condor. I tried to search in archives about my problem. I suspect it is not a new problem that I am having but I was not able to find a clear solution.

Novice question.

1. Why do I see a job in Idle state? I suspect there is no general answer to that and it depends from case to case. Is that right?

Here is an example of an idle job

ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
686.0   dirac           3/29 17:30   0+06:12:22 I  0   0.0  DIRAC_HJzfpX_pilot

As far as I can tell, 686.0 never ran (it was submitted on March 29) and has always been in the idle state.

Let's analyze the job:

$ condor_q -analyze 686.0

-- Submitter: dirac-crt.hep.pnnl.gov : <192.101.107.250:10594?noUDP> : dirac-crt.hep.pnnl.gov
       Last successful match: Fri Apr  7 01:47:54 2017

The Requirements _expression_ for your job is:

   ( TARGET.Name == "slot11@xxxxxxxxxxxxxxxxx" ) &&
   ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
   ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
   ( ( TARGET.HasFileTransfer ) ||
     ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )


Suggestions:

   Condition                         Machines Matched    Suggestion
   ---------                         ----------------    ----------
1   ( TARGET.Name == "slot11@xxxxxxxxxxxxxxxxx" )    1
2   ( TARGET.Arch == "X86_64" )       1298
3   ( TARGET.OpSys == "LINUX" )       1298
4   ( TARGET.Disk >= 45 )             1298
5   ( TARGET.Memory >= ifthenelse(MemoryUsage isnt undefined,MemoryUsage,1) )
                                     1298
6   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "dirac-crt.hep.pnnl.gov" ) )
                                     1298

==

So there is a match. One match as I also expect.

Let's also look at the job details, listing only some relevant fields.

$ condor_q -l 686.0


LastRemoteHost = "slot11@xxxxxxxxxxxxxxxxx"
CondorVersion = "$CondorVersion: 8.2.10 Oct 27 2015 $"
LastRejMatchReason = "no match found"

==

Why does it say "no match found"?

When I look at the actual node, it has CPU and memory available.

[cwn-o10 ~]$ top

top - 01:56:18 up 50 days, 12:50,  1 user,  load average: 0.06, 0.07, 0.05
Tasks: 552 total,   1 running, 551 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 63742384 total, 35741192 free,  3025640 used, 24975552 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 60014588 avail Mem

==

What else can I look at to determine why the job is in the idle state?

Any help to debug this is appreciated.

Thanks,
Vikas



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/




--
Ben Cotton
Technical Marketing Manager

Cycle Computing
Better Answers. Faster.

http://www.cyclecomputing.com
twitter: @cyclecomputing
