[Condor-users] condor_schedd errors



Hello,

We had the following error in our SchedLog file:

ERROR "ERROR no job status for 14.0 in DedicatedScheduler::reaper()!" at line 1631 in file dedicated_scheduler.C

After this, condor_schedd restarted. A few seconds later, the following error appeared:

ERROR "spawnJobs(): allocation node has no matches!" at line 2356 in file dedicated_scheduler.C

and condor_schedd restarted again.

What do you think the problem is? We found that job 14.0 was running on a machine that had a problem and was rebooted. The owner of job 14.0 tried to remove the job while the machine was down. The error occurred an hour later, though.

We are using Condor version 6.8.6. We upgraded the day before this happened. With version 6.8.4, we also had problems with condor_schedd dying and restarting. Because we use only the parallel and vanilla universes (mostly parallel), we cannot checkpoint our jobs. This kind of error prevents us from getting any work done, because jobs keep restarting and never finish. We were hoping that the new version had fixed the bug that caused our schedd to crash. Now we are not as optimistic...

Sara Campos