Re: [HTCondor-users] Assertion Error on GridManager

Hi Antonio, Carles,

Sorry for the delay in responding; I haven't forgotten about this issue! The HTCondor team is looking into the assumptions behind the assertion and is discussing potential solutions. Unfortunately, our GridManager expert has been on leave for a few weeks, so it may be some time before we have a solution. In the meantime, I'll work on reproducing the issue locally so we can investigate the root cause.

We do have some ideas for workarounds, though:

1. In the originating CE's HTCondor configuration: Quarantine the broken jobs to their own GridManager by setting 'GRIDMANAGER_SELECTION_EXPR' to a ClassAd expression using attributes that are unique to the bad jobs. For example, we suspect that the bad jobs would have high NumJobStarts so you could use that to separate the bad jobs:

    GRIDMANAGER_SELECTION_EXPR = strcat(Owner, ifThenElse(NumJobStarts > 5, ".quarantine", ""))

2. In the originating CE's HTCondor-CE configuration: Set SYSTEM_PERIODIC_REMOVE expressions to remove the offending job.
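
As a rough sketch of the second workaround, assuming the same NumJobStarts heuristic as in the first one (the threshold of 5 is only illustrative, and we have not yet confirmed that NumJobStarts reliably identifies the bad jobs):

    # Assumption: jobs that have been started more than 5 times are the broken ones.
    SYSTEM_PERIODIC_REMOVE = (NumJobStarts > 5)

If SYSTEM_PERIODIC_REMOVE is already set in your configuration, you would want to combine the existing expression with this one using || rather than overwrite it.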


On 9/3/19 8:27 AM, Antonio Delgado Peris wrote:
Dear all,

We have found a problem with the GridManager when dealing with certain jobs that our HTCondor setup forwards to remote (Grid) resources.

It's not totally clear what's happened, but we would argue that the GridManager behaviour could be improved. The grand summary is the following:

In certain circumstances (infrequent, but not very rare, in our experience), a single job causes the GridManager to fail an assertion and exit without processing the rest of its jobs, which may be perfectly healthy.

We consider this a bug, or at least a problem, because a problematic job is preventing other jobs from proceeding, and eventually causing all remote activity to halt. In general, we have seen that this requires manual cleaning to recover.

We would rather have (if possible) those problematic jobs ignored (i.e., left on hold), and proceed with the other jobs.

The assertion error looks like the following:

ERROR "Assertion ERROR on (gahp != __null || gmState == 14 || gmState == 12)" at line 391 in file /slots/10/dir_3391896/userdir/.tmpa8mXuo/BUILD/condor-8.8.1/src/condor_gridmanager/condorjob.cpp

Looking at the code (https://github.com/htcondor/htcondor/blob/V8_8_1-branch/src/condor_gridmanager/condorjob.cpp), I see that it happens within CondorJob::doEvaluateState:

ASSERT ( gahp != NULL || gmState == GM_HOLD || gmState == GM_DELETE );

For those interested, let me give more details by illustrating this with the most recent example of this error:

07/27/19 09:39:57 (cid:12488566) Set Attribute for job 2302737.0, HoldReason = "Error from slot1_3@xxxxxxxxxxxx: Job has gone over memory limit of 2048 megabytes. Peak usage: 42532 megabytes."

07/27/19 09:39:50 [1629941] (2314592.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 2


07/27/19 10:12:19 [1629941] Failed to get expiration time of proxy /var/lib/condor-ce/spool/2737/0/cluster2302737.proc0.subproc0/credential_CMSG-v1_0.main_411868: unable to read proxy file

07/27/19 10:12:19 [1629941] Found job 2314592.0 --- inserting

07/27/19 10:12:19 [1629941] (2314592.0) doEvaluateState called: gmState GM_CLEAR_REQUEST, remoteState -1 

07/27/19 10:12:19 [1629941] ERROR "Assertion ERROR on (gahp != __null || gmState == 14 || gmState == 12)" at line 391 in file /slots/10/dir_3391896/userdir/.tmpa8mXuo/BUILD/condor-8.8.1/src/condor_gridmanager/condorjob.cpp

So there are two issues here: one is why the job gets into that funny state, the second one (more important, IMHO), is whether the GridManager should perhaps skip that job, but continue with the rest, instead of dying completely.

We are aware that our configuration may be uncommon, but we would be grateful if this could be avoided somehow (through a configuration change on our side or perhaps a code fix from the developers).

Any comment will be welcomed :-)



