Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Assertion ERROR on (job->_queuedNodeJobProcs >= 0)"

Date: Fri, 6 Oct 2006 14:36:43 -0500
From: "Peter F. Couvares" <pfc@xxxxxxxxxxx>
Subject: Re: [Condor-users] Assertion ERROR on (job->_queuedNodeJobProcs >= 0)"

Michael,

I'm not sure why your schedd seems to have changed the DAGMan job's OnExitRemove expression, but it shouldn't have, and we will look into it.

As a workaround in the meantime, you should be able to reset the OnExitRemove expression of your running job using the condor_qedit tool. E.g.:


% condor_qedit 290288 OnExitRemove \
 '((JobStatus!=3) && ((DAGManJobId == 290288 && ClusterId == 290881)))'

If that doesn't work, you could try the most basic expression:

% condor_qedit 290288 OnExitRemove True

But that may allow DAGMan to leave the queue if, e.g., the machine crashes.

As an aside, unless you're comfortable with brain surgery, I wouldn't recommend getting in the habit of using condor_qedit like this... but it can be handy in an emergency. :)
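
For illustration, the workaround above can be sketched as a small shell script that inspects the job ad, resets OnExitRemove, and then releases the held job. The job ID and expression are the ones quoted in this message. The condor_release step and the DRY_RUN switch are assumptions added for this sketch and are not part of the original advice. With DRY_RUN=1 (the default here) the commands are only printed, so they can be reviewed before anything touches the queue.

```shell
#!/bin/sh
# Sketch only: inspect, reset, and release a held DAGMan job.
# JOB_ID and EXPR come from the message above; DRY_RUN is a hypothetical
# safety switch introduced for this example.
JOB_ID=${JOB_ID:-290288}
EXPR=${EXPR:-'((JobStatus!=3) && ((DAGManJobId == 290288 && ClusterId == 290881)))'}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode (shell quoting is not reproduced in
# the printed form); otherwise execute it with its arguments intact.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

fix_held_dagman() {
    # 1. Show the full job ad so the current OnExitRemove can be checked.
    run condor_q -l "$JOB_ID"
    # 2. Reset the expression, as suggested in the message.
    run condor_qedit "$JOB_ID" OnExitRemove "$EXPR"
    # 3. Release the held job (assumed follow-up step, not in the message).
    run condor_release "$JOB_ID"
}

fix_held_dagman
```

Once the printed commands look right, the same script could be rerun with DRY_RUN=0 to actually apply them.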


-Peter


On Oct 6, 2006, at 4:41 AM, Michael Hess wrote:

Hi,
condor keeps on putting the dagman jobs on hold. As this job has consumed around 200.000 CPU hours and is 67% finished, I would like to have it running again.
The Dagman.out contains the following last lines:
10/6 10:25:37 Event: ULOG_JOB_TERMINATED for Condor Node 80400 (293671.19)
10/6 10:25:37 Node 80400 job proc (293671.19) completed successfully.
10/6 10:25:37 Number of idle job procs: 14
10/6 10:25:37 Event: ULOG_JOB_TERMINATED for Condor Node 80350 (293670.16)
10/6 10:25:37 BAD EVENT: job (293670.16.0) ended, submit count < 1 (0)
10/6 10:25:37 BAD EVENT is warning only
10/6 10:25:37 ERROR "Assertion ERROR on (job->_queuedNodeJobProcs >= 0)" at line 615 in file dag.C
After this, the dagman scheduler universe is removed from the process list by the Scheduler with the comment:
10/6 10:39:31 (pid:25192) constraint ((JobStatus!=3) && ((DAGManJobId == 290288 && ClusterId == 290881))) does not evaluate to bool

and some time later:
10/6 10:40:14 (pid:25192) DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
10/6 10:40:14 (pid:25192) constraint ((JobStatus!=3) && ((DAGManJobId == 290288 && ClusterId == 293316))) does not evaluate to bool
10/6 10:40:14 (pid:25192) constraint ((JobStatus!=3) && ((DAGManJobId == 290288 && ClusterId == 293316))) does not evaluate to bool
10/6 10:40:22 (pid:25192) scheduler universe job (290288.0) pid 1666 exited with status 4
10/6 10:40:23 (pid:25192) (290288.0) Problem parsing user policy for job: The UNKNOWN (never set) OnExitRemove expression '' evaluated to UNDEFINED. Putting job on hold.
10/6 10:40:23 (pid:25192) Job 290288.0 put on hold: The UNKNOWN (never set) OnExitRemove expression '' evaluated to UNDEFINED
The hold reason with condor_q -l shows the following line:
LastHoldReason = "The UNKNOWN (never set) OnExitRemove expression '' evaluated to UNDEFINED"
With

OnExitRemove = (ExitSignal == 11 || (ExitCode >= 0 && ExitCode <= 2))
I do not know when this problem really started, but around 2 days ago the Scheduler exited with Signal 4. This might have messed up the job queue. Condor is running at its limit with up to 1682 nodes (at night, nearly all of them really run jobs).
Any ideas of how I can get this job running again?

Best regards and thanks in advance,

Michael Hess



--
Peter Couvares                        University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
pfc@xxxxxxxxxxx                       1210 W. Dayton St. Rm #4241
(608) 265-8936                        Madison, WI 53706-1685
