[Condor-users] condor_schedd errors
- Date: Thu, 22 Nov 2007 19:07:14 +0000
- From: Sara Campos <scampos@xxxxxxxxxxx>
- Subject: [Condor-users] condor_schedd errors
We had the following error in our SchedLog file:
ERROR "ERROR no job status for 14.0 in DedicatedScheduler::reaper()!" at
line 1631 in file dedicated_scheduler.C
After this, condor_schedd restarted. A few seconds later,
there was the error
ERROR "spawnJobs(): allocation node has no matches!" at line 2356
in file dedicated_scheduler.C
and condor_schedd re-started again.
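
For anyone wanting to collect entries like the two above from a log, here is a minimal sketch. It assumes the entries have the shape shown in the excerpts (ERROR "message" at line N in file F, possibly wrapped across lines); real SchedLog lines may carry timestamps or other prefixes, which the pattern ignores. The function name find_schedd_errors is hypothetical and not part of Condor.

```python
import re

# Pattern assumed from the two excerpts quoted above; it is not taken from
# Condor documentation and may need adjusting for a real SchedLog.
ERROR_RE = re.compile(
    r'ERROR\s+"(?P<msg>.*?)"\s+at\s+line\s+(?P<line>\d+)\s+in\s+file\s+(?P<file>\S+)',
    re.DOTALL,  # let the message and the "at line ..." part span wrapped lines
)

def find_schedd_errors(log_text):
    """Return (message, source_file, source_line) for each ERROR entry found."""
    return [
        # Collapse any wrapped whitespace inside the message to single spaces.
        (" ".join(m.group("msg").split()), m.group("file"), int(m.group("line")))
        for m in ERROR_RE.finditer(log_text)
    ]

# The two entries from this message, wrapped as they appear above.
sample = '''ERROR "ERROR no job status for 14.0 in DedicatedScheduler::reaper()!" at
line 1631 in file dedicated_scheduler.C
ERROR "spawnJobs(): allocation node has no matches!" at line 2356
in file dedicated_scheduler.C
'''

for msg, src, line in find_schedd_errors(sample):
    print(f"{src}:{line}: {msg}")
```

Running this over the whole log rather than the sample would show whether the same two source locations recur before each restart.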
What do you think the problem is? We found out that job 14.0 was on a
machine that had a problem and was rebooted. The owner of job 14.0 tried
to remove it while the machine was not working. The error happened one
hour later, though.
We are using version 6.8.6 of Condor. We upgraded one day
before this happened. When we used version 6.8.4, we also had problems
with condor_schedd dying and restarting. Because we use only the
parallel and vanilla universes (mostly the parallel), we cannot
checkpoint our jobs. This kind of error prevents us from getting any work
done, because jobs keep restarting and never actually finish. We were hoping
that the new version had fixed the bug that caused our schedd to crash.
Now we are not as optimistic.