
[Condor-users] How to solve/debug no matching?



Hi all,

the current status of our pool here is pretty much idle:
condor_status |tail -n 6

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX  3813     0      66         0       0          0     3747

               Total  3813     0      66         0       0          0     3747


However, analyzing a job that is idle and, according to the MatchLog,
rejected shows the following (the numbers are bigger now since I just
restarted the submit machines and they are getting to know all compute
nodes again):

condor_q -bet 8487997.0


-- Quill: atlasquill : <10.20.30.1:5432> : atlasquill
---
8487997.000:  Run analysis summary.  Of 4563 machines,
    366 are rejected by your job's requirements
      0 reject your job because of their own requirements
     38 match but are serving users with a better priority in the pool
   4112 match but reject the job for unknown reasons
     47 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Mon Dec 22 14:42:11 2008

The Requirements expression for your job is:

( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) &&
( ( CkptArch == target.Arch ) || ( CkptArch is undefined ) ) &&
( ( CkptOpSys == target.OpSys ) || ( CkptOpSys is undefined ) ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ( 1024 * target.Memory ) >= 300000 )  4197
2   ( target.Arch == "X86_64" )       4563
3   ( target.OpSys == "LINUX" )       4563
4   ( ( "X86_64" == target.Arch ) )   4563
5   ( ( "LINUX" == target.OpSys ) )   4563
6   ( target.Disk >= 7500 )           4563

According to this list, the job should have been sent to the cluster
right away, but it stayed idle over Xmas :(
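
As a sanity check on the analysis summary, here is a minimal Python sketch
that tallies the buckets. The helper name parse_summary and the pasted
SUMMARY string are only for illustration, not anything Condor provides. It
checks whether the six categories account for all 4563 machines and how
large a share the "unknown reasons" bucket is (4112 of 4563, roughly 90%).

```python
import re

# Summary lines pasted from the condor_q analysis output in this mail.
SUMMARY = """\
8487997.000:  Run analysis summary.  Of 4563 machines,
    366 are rejected by your job's requirements
      0 reject your job because of their own requirements
     38 match but are serving users with a better priority in the pool
   4112 match but reject the job for unknown reasons
     47 match but will not currently preempt their existing job
      0 are available to run your job
"""


def parse_summary(text):
    """Return (total_machines, {reason: count}) from an analysis summary."""
    total = int(re.search(r"Of (\d+) machines", text).group(1))
    counts = {}
    for line in text.splitlines():
        # Bucket lines look like "<count> <reason>"; the header line does
        # not match because its leading number is followed by ".000:".
        m = re.match(r"\s*(\d+)\s+(\S.*)$", line)
        if m:
            counts[m.group(2).strip()] = int(m.group(1))
    return total, counts


total, counts = parse_summary(SUMMARY)

# Print the buckets from largest to smallest with their share of the pool.
for reason, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{n:5d}  {100.0 * n / total:5.1f}%  {reason}")

print("buckets sum to total:", sum(counts.values()) == total)
```

So the interesting question is what hides behind "unknown reasons", since
that bucket is by far the largest one.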

The current setup is two submit machines in an HA configuration, running
Quill on PostgreSQL. The Condor version is 7.0.5 with 7.1.4 DAGMan binaries.

Any hints on how I can debug this further to narrow down why it does not work?

Cheers

Carsten