
Re: [Condor-users] Assertion ERROR on (job->_queuedNodeJobProcs >= 0)



Michael,

I'm not sure why your schedd seems to have changed the DAGMan job's OnExitRemove expression, but it shouldn't have, and we will look into it.

As a workaround in the meantime, you should be able to reset the OnExitRemove expression of your running job using the condor_qedit tool. E.g.:

% condor_qedit 290288 OnExitRemove \
 '((JobStatus!=3) && ((DAGManJobId == 290288 && ClusterId == 290881)))'

If that doesn't work, you could try the most basic expression:

% condor_qedit 290288 OnExitRemove True

But that may allow DAGMan to leave the queue if, e.g., the machine crashes.
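
If you would rather script the workaround than type it by hand, here is a minimal dry-run sketch. It only builds and prints the command lines (the `condor_q -l` inspection plus the two `condor_qedit` variants above) and runs nothing. The helper names `inspect_command` and `qedit_command` are hypothetical, made up for this illustration.

```python
import shlex

# Job id and expression are copied from the commands in this message.
DAGMAN_JOB = "290288"
RESTORED_EXPR = "((JobStatus!=3) && ((DAGManJobId == 290288 && ClusterId == 290881)))"


def inspect_command(job_id):
    """argv for viewing the full job ad, as in `condor_q -l <job>`."""
    return ["condor_q", "-l", job_id]


def qedit_command(job_id, attribute, expression):
    """argv for `condor_qedit <job> <attribute> <expression>`."""
    return ["condor_qedit", job_id, attribute, expression]


if __name__ == "__main__":
    candidates = [
        inspect_command(DAGMAN_JOB),
        qedit_command(DAGMAN_JOB, "OnExitRemove", RESTORED_EXPR),
        # Fallback only: see the caveat about machine crashes.
        qedit_command(DAGMAN_JOB, "OnExitRemove", "True"),
    ]
    for argv in candidates:
        # shlex.join quotes the expression so it survives the shell intact.
        print(shlex.join(argv))
```

Passing the expression as a single argv element (or single-quoting it, as shlex.join does) keeps the shell from interpreting the parentheses and `&&`.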

As an aside, unless you're comfortable with brain surgery, I wouldn't recommend getting in the habit of using condor_qedit like this... but it can be handy in an emergency. :)

-Peter


On Oct 6, 2006, at 4:41 AM, Michael Hess wrote:

Hi,

Condor keeps putting the DAGMan job on hold. As this job has consumed around 200,000 CPU hours and is 67% finished, I would like to get it running again.

The Dagman.out contains the following last lines:

10/6 10:25:37 Event: ULOG_JOB_TERMINATED for Condor Node 80400 (293671.19)
10/6 10:25:37 Node 80400 job proc (293671.19) completed successfully.
10/6 10:25:37 Number of idle job procs: 14
10/6 10:25:37 Event: ULOG_JOB_TERMINATED for Condor Node 80350 (293670.16)
10/6 10:25:37 BAD EVENT: job (293670.16.0) ended, submit count < 1 (0)
10/6 10:25:37 BAD EVENT is warning only
10/6 10:25:37 ERROR "Assertion ERROR on (job->_queuedNodeJobProcs >= 0)" at line 615 in file dag.C

After this, the Scheduler removes the DAGMan scheduler universe job from the process list and logs:

10/6 10:39:31 (pid:25192) constraint ((JobStatus!=3) && ((DAGManJobId == 290288 && ClusterId == 290881))) does not evaluate to bool

and some time later:

10/6 10:40:14 (pid:25192) DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
10/6 10:40:14 (pid:25192) constraint ((JobStatus!=3) && ((DAGManJobId == 290288 && ClusterId == 293316))) does not evaluate to bool
10/6 10:40:14 (pid:25192) constraint ((JobStatus!=3) && ((DAGManJobId == 290288 && ClusterId == 293316))) does not evaluate to bool
10/6 10:40:22 (pid:25192) scheduler universe job (290288.0) pid 1666 exited with status 4
10/6 10:40:23 (pid:25192) (290288.0) Problem parsing user policy for job: The UNKNOWN (never set) OnExitRemove expression '' evaluated to UNDEFINED. Putting job on hold.
10/6 10:40:23 (pid:25192) Job 290288.0 put on hold: The UNKNOWN (never set) OnExitRemove expression '' evaluated to UNDEFINED

condor_q -l shows the following hold reason:

LastHoldReason = "The UNKNOWN (never set) OnExitRemove expression '' evaluated to UNDEFINED"

The same output shows the job's OnExitRemove attribute as:

OnExitRemove = (ExitSignal == 11 || (ExitCode >= 0 && ExitCode <= 2))
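
As a hedged illustration of how an expression of this form can come out as UNDEFINED rather than TRUE or FALSE: ClassAd expressions use three-valued logic, in which comparing an undefined attribute yields UNDEFINED, and UNDEFINED || FALSE stays UNDEFINED. The toy model below is not Condor's evaluator, and its function names are hypothetical. It also assumes that ExitSignal is unset when a job exits normally, as in the "exited with status 4" log line. This is one possible reading only and does not account for the "constraint ... does not evaluate to bool" messages.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value


def cmp3(op, a, b):
    """Comparison that yields UNDEFINED if either operand is undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return op(a, b)


def and3(a, b):
    """Three-valued AND: FALSE dominates, then UNDEFINED."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def or3(a, b):
    """Three-valued OR: TRUE dominates, then UNDEFINED."""
    if a is True or b is True:
        return True
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return False


def on_exit_remove(exit_signal, exit_code):
    """Model of (ExitSignal == 11 || (ExitCode >= 0 && ExitCode <= 2))."""
    signal_is_11 = cmp3(lambda x, y: x == y, exit_signal, 11)
    code_in_range = and3(
        cmp3(lambda x, y: x >= y, exit_code, 0),
        cmp3(lambda x, y: x <= y, exit_code, 2),
    )
    return or3(signal_is_11, code_in_range)


if __name__ == "__main__":
    # Normal exit with code 4 and no ExitSignal attribute (assumed).
    print(on_exit_remove(UNDEFINED, 4))
    # Normal exit with code 0: the right-hand side alone makes it TRUE.
    print(on_exit_remove(UNDEFINED, 0))
```

Under these assumptions, an exit code outside 0..2 with no ExitSignal leaves the left operand UNDEFINED and the right operand FALSE, so the whole expression is UNDEFINED.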

I do not know when this problem really started, but around two days ago the Scheduler exited with signal 4. This might have messed up the job queue. Condor is running at its limit with up to 1682 nodes (at night, nearly all of them are actually running jobs).

Any ideas on how I can get this job running again?

Best regards and thanks in advance,

Michael Hess



--
Peter Couvares                        University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
pfc@xxxxxxxxxxx                       1210 W. Dayton St. Rm #4241
(608) 265-8936                        Madison, WI 53706-1685