
[Condor-users] match record deleted - jobs match but do not start



I’ve had a frequent problem where jobs apparently match but then do not start.  It usually goes like this: in the course of testing or trying to get something working, I submit small test jobs to Condor many times over the course of a day.  They work fine time after time, but at some point a test job simply refuses to start running.  I’m using a flocking setup, so when I run condor_q -analyze, I see that the job is rejected by all of the machines in my pool (which is expected, because it needs to flock).  So all I have to go on is the SchedLog, which invariably has lines like this:

 

12/9 16:26:53 (pid:21242) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 1

12/9 16:26:53 (pid:21242) Sent ad to central manager for wwwrun@xxxxxxxxx

12/9 16:26:53 (pid:21242) Sent ad to 1 collectors for wwwrun@xxxxxxxxx

12/9 16:26:53 (pid:21242) Sent RELEASE_CLAIM to startd on <128.105.148.103:58031>

12/9 16:26:53 (pid:21242) Match record (<128.105.148.103:58031>, 40927, 0) deleted

 

Sometimes the jobs do run eventually, but they often take hours to start (when the exact same job submitted earlier in the day ran in a matter of minutes), and in the meantime I see a lot of these "Match record ... deleted" lines.
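
To put a number on "a lot", a rough tally of those lines per startd address and job can be pulled out of the SchedLog with a small script.  The following is only a sketch: it assumes the log line layout shown in the excerpt above, and the function name count_deleted_matches is made up for illustration (it is not part of Condor).

```python
import re
from collections import Counter

# Assumed SchedLog line layout, taken from the excerpt above:
#   "12/9 16:26:53 (pid:21242) <message>"
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2}) (\d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")

# Assumed message layout: "Match record (<host:port>, cluster, proc) deleted"
DELETED_RE = re.compile(r"^Match record \(<([^>]+)>, (\d+), (\d+)\) deleted$")


def count_deleted_matches(lines):
    """Count 'Match record ... deleted' lines per (startd address, cluster, proc)."""
    counts = Counter()
    for raw in lines:
        line_match = LINE_RE.match(raw.strip())
        if not line_match:
            continue  # blank line or a layout this sketch does not recognize
        deleted = DELETED_RE.match(line_match.group(4))
        if deleted:
            key = (deleted.group(1), int(deleted.group(2)), int(deleted.group(3)))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the excerpt above; a real run would read the SchedLog file.
    sample = [
        "12/9 16:26:53 (pid:21242) Sent RELEASE_CLAIM to startd on <128.105.148.103:58031>",
        "12/9 16:26:53 (pid:21242) Match record (<128.105.148.103:58031>, 40927, 0) deleted",
    ]
    for (address, cluster, proc), n in count_deleted_matches(sample).items():
        print(f"{address} job {cluster}.{proc}: {n} match record(s) deleted")
```

Reading the real log would just mean passing an open file object instead of the sample list; the path to the SchedLog depends on the local configuration.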

 

My schedd is not getting overloaded (there is usually only the one test job in the queue), and there is no mention of shadow exceptions in the SchedLog (and no activity whatsoever around that time in the ShadowLog).  I’ve checked user priority and machine availability on the flock-to pool, and both are good.

 

Any ideas?  Is it normal for a job to match and then have its match record deleted immediately?  Or am I just misinterpreting the log messages?

 

Michael.