
[Condor-users] Tracing why nodes reject jobs?



Hi All,

I have a user who queued a couple hundred identical Standard Universe
jobs (well, the parameters were a little different, but the ClassAds
were the same). Most completed, but 15 are hanging around in the idle
state after having accumulated some runtime, and they will no longer
match any execute nodes:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 ---
 78745.005:  Run analysis summary.  Of 429 machines,
        19 are rejected by your job's requirements
       410 reject your job because of their own requirements
         0 match but are serving users with a better priority in the pool
         0 match but reject the job for unknown reasons
         0 match but will not currently preempt their existing job
         0 match but are currently offline
         0 are available to run your job
        Last successful match: Tue Jul  6 17:33:30 2010
        Last failed match: Thu Jul 15 11:46:30 2010
        Reason for last match failure: no match found
		
The 19 rejected by the job's requirements are clear (wrong ARCH), but
the 410 rejected by the nodes' own requirements are odd in a couple of
ways:

1) There are 410 total systems available and 344 are currently claimed,
so I'd expect those to show up as either "match but are serving users
with a better priority in the pool" or "match but will not currently
preempt their existing job" (the kind of condor_status query behind
those counts is sketched after this list).

2) Clearly they used to match at least some of these nodes, or the jobs
wouldn't have accumulated any runtime.
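
For reference, counts like the 344 claimed slots can be pulled from
ordinary condor_status queries, e.g. (exact flags from memory, so treat
this as a sketch):

    # summary totals for all slots, broken down by state/activity
    condor_status -total

    # just the slots that are currently Claimed
    condor_status -constraint 'State == "Claimed"' -total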

Where/how can I see why a specific node rejects a given job?
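
For a single node I can presumably dump both ads and compare the
relevant expressions by hand, something like this (the node name below
is just a placeholder):

    # the job's Requirements expression
    condor_q -long 78745.5 | grep -i '^Requirements'

    # one node's Requirements/START expressions, for comparison
    condor_status -long some-node.example.com | egrep -i '^(Requirements|Start)'

but doing that across 410 machines is painful, and it still doesn't
tell me which clause of a node's policy the job is tripping over.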

-Jon