
Re: [Condor-users] Jobs not finding matches.



On 07/03/2012 11:48 AM, Amy Bush wrote:
Pretty new to condor administration here, and I have a sort of high
level troubleshooting question.

The last few weeks we've had a few deadlines approaching, so a lot of
our users are sending a lot of jobs. Some folks are making good
progress, but some people have a LOT of idle jobs, and I'm trying to
sort out exactly why they're idle. Obviously you can't tell me why, but
maybe you can help me figure out what steps to take to figure out why.

Okay, so here's poor todd, who has 25 running jobs and 747 idle jobs.

whateverasaurus 10:40:57$ condor_q -g | grep " I " | grep todd | wc -l
747
whateverasaurus 10:41:01$ condor_q -g | grep " R " | grep todd | wc -l
25

Yes, he has horrible userprio right now:

whateverasaurus 10:41:37$ condor_userprio
Last Priority Update:  7/3  10:40
                              Effective
User Name                    Priority
-----------------------      ---------
smirarab@cs                   0.50
amy@cs                        0.50
laustin@cs                    0.64
mgebhart@cs                   0.90
kscherer@cs                   0.91
ckcuong@cs                    1.02
bayzid@cs                     1.73
joeraii@cs                    1.99
akanksha@cs                   2.23
elie@cs                       3.00
naga86@cs                     3.01
schrum2@cs                    3.98
dongli@cs                     4.37
julian@cs                    84.29
todd@cs                     395.11
namphuon@cs                 400.82
bsunil@cs                   437.25
<none>                      1404.07
-----------------------      ---------
Number of users shown: 17

And so some of his jobs are getting repeatedly preempted and not making
any progress for that reason. But why are they getting preempted?

Picking an idle todd-job at random (ha, the one I had been looking at
earlier is now running, typical; picking ANOTHER one at random):

whateverasaurus 10:43:29$ condor_q -g -better-analyze 572871.4
<snip>

572871.004:  Run analysis summary.  Of 3179 machines,
    1395 are rejected by your job's requirements
     997 reject your job because of their own requirements
      31 match but are serving users with a better priority in the pool
       0 match but reject the job for unknown reasons
      17 match but will not currently preempt their existing job
       0 match but are currently offline
     739 are available to run your job
         Last successful match: Mon Jul  2 13:16:27 2012

The Requirements expression for your job is:

( target.Memory >= 4000 && target.Lucid ) && ( TARGET.Arch == "X86_64" ) &&
( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) &&
( ( RequestMemory * 1024 ) >= ImageSize ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )

     Condition                         Machines Matched    Suggestion
     ---------                         ----------------    ----------
1   target.Memory >= 4000             2096
2   ( TARGET.Arch == "X86_64" )       2714
3   target.Lucid                      3131
4   ( TARGET.OpSys == "LINUX" )       3179
5   ( TARGET.Disk >= 25000 )          3179
6   ( ( 1024 * ceiling(ifThenElse(JobVMMemory isnt undefined,JobVMMemory,2.197265625000000E+03)) ) >= 2250000 )    3179
7   ( TARGET.FileSystemDomain == "cs" )    3179

So there would appear to be a lot of machines that might run his job.
There's maybe 300 of these that he won't be able to use because they're
restricted to a group he's not in, but that still leaves ~400.

And actually,

whateverasaurus 10:46:34$ condor_status -const 'Memory > 4000' | grep X86_64 | grep Unclaimed | wc -l
658

He should be able to use any of those machines, and they're Unclaimed.

Any suggestions on how to start troubleshooting this? That's a ton of
unclaimed machines, and currently I have 1424 jobs sitting idle that
would love to have a machine.

Thanks, guys!

--
amy

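Poor todd's jobs are likely being preempted because his user priority is so poor (an effective priority of 395.11 in the output above, since he's been using a lot of resources), so jobs from users with a better priority can displace his.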
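
To make the priority point concrete, here is a minimal Python sketch that parses a few rows copied from the condor_userprio output above and lists who is ahead of todd. It assumes the two-column "user, value" layout shown there and relies on a numerically lower effective priority being the better one; the function names are only illustrative, not part of Condor.

```python
# A subset of rows copied from the condor_userprio output quoted above.
USERPRIO_ROWS = """\
User Name                    Priority
-----------------------      ---------
smirarab@cs                   0.50
dongli@cs                     4.37
julian@cs                    84.29
todd@cs                     395.11
namphuon@cs                 400.82
bsunil@cs                   437.25
"""


def parse_userprio(text):
    """Return {user: effective_priority} from 'user value' rows."""
    prios = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # header or blank line
        user, value = parts
        try:
            prios[user] = float(value)
        except ValueError:
            continue  # separator row such as '----- -----'
    return prios


def users_ahead_of(prios, user):
    """Users whose effective priority value is lower (better) than user's."""
    mine = prios[user]
    return sorted((u for u, p in prios.items() if p < mine), key=prios.get)


prios = parse_userprio(USERPRIO_ROWS)
ahead = users_ahead_of(prios, "todd@cs")
print(ahead)
```

Any user in that list who has idle jobs matching the same machines will tend to be served before todd, and may preempt him where the pool's preemption policy allows it.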

You're on a good path starting with condor_q -analyze and investigating with condor_status.

When machines aren't being used at all, the cause is some security configuration problem about 90% of the time. You should verify that those Unclaimed slots do eventually run jobs, or have run some in the past. Check the StartLog on those machines, and possibly the SchedLog on the submit machine, for connection issues.
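
One way to check whether those Unclaimed slots ever do any work is to look at how long each has sat in its current state. The sketch below assumes you have dumped one line per slot with something like `condor_status -format "%s " Name -format "%s " State -format "%d\n" EnteredCurrentState`; the host names and timestamps in the sample are made up, and the one-day threshold is arbitrary.

```python
# Hypothetical sample: "<slot name> <State> <EnteredCurrentState epoch seconds>"
SAMPLE = """\
slot1@node01.example Unclaimed 1341200000
slot1@node02.example Claimed 1341320000
slot1@node03.example Unclaimed 1341330000
"""


def stale_unclaimed(text, now, min_idle_seconds):
    """Return (slot, seconds_unclaimed) pairs, longest-idle first."""
    stale = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore anything that is not a 3-field row
        name, state, entered = parts
        if state != "Unclaimed":
            continue
        idle = now - int(entered)
        if idle >= min_idle_seconds:
            stale.append((name, idle))
    return sorted(stale, key=lambda item: item[1], reverse=True)


NOW = 1341333000  # hypothetical "current time" so the example is repeatable
result = stale_unclaimed(SAMPLE, NOW, 24 * 3600)
for name, idle in result:
    print(f"{name} unclaimed for {idle // 3600}h")
```

Slots that show up here while matching jobs sit idle are the ones whose StartLog is worth reading first.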

Best,


matt