
Re: [HTCondor-users] Assertion Error on GridManager



Hi Antonio,

Yes, you'll want to set GRIDMANAGER_SELECTION_EXPR in the originating CE's batch configuration (i.e. somewhere in /etc/condor/config.d/). It's too bad that we can't rely on NumJobStarts, but there may be another attribute we could work with in the offending jobs, like 'JobStatus == 3' for instance? If you could send me the CE/batch ads for the last set of troublesome jobs, and maybe another set for jobs that look OK, I could take a look and try to find a good attribute to use. Until we find that attribute, though, the multiple-GridManager-processes-per-user approach could work, as long as you limit the number of processes to a handful per user.
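Jobs whose evaluated values of that expression differ are handled by separate GridManager processes, so something along these lines might do it (a rough, untested sketch; the 'JobStatus == 3' test just means the job has been marked for removal):

    # Untested sketch: give jobs marked for removal (JobStatus == 3) their own
    # "quarantine" GridManager so they cannot block the healthy jobs.
    GRIDMANAGER_SELECTION_EXPR = strcat(Owner, ifThenElse(JobStatus == 3, ".quarantine", ""))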

- Brian

On 11/27/19 10:15 AM, Antonio Delgado Peris wrote:
Hi Brian,

Thanks for your email.

It has also taken me a while to reply to you :-) (mostly because I did not immediately realize you had replied... the HTCondor-users list is quite busy...)

As discussed earlier, this problem is not a showstopper for us, only an inconvenience, so it's OK if the solution is delayed, but it's good to see it was not forgotten :-)

Regarding the workarounds, let me start with the second one:

2. In the originating CE's HTCondor-CE configuration: Set SYSTEM_PERIODIC_REMOVE expressions to remove the offending job.
That will surely not work. In the cases we've looked at, the originating CE does remove the job; in fact, in the first case we had, it was the periodic remove that did it. What happens is that the CE copy of the job is deleted, along with the physical job directory (which is shared by the CE and the batch system), but the batch copy of the job is not removed, and that is the one causing problems for the GridManager component.

My interpretation is that the batch system refuses to remove the job (until a manual -forcex is used) because it thinks the remote copy of the job is still there (while perhaps it isn't anymore), and it keeps trying to contact the remote agent to tell it to remove its copy of the job (which can never succeed, due to network problems or because the remote copy was already removed).

Regarding the first one:

1. In the originating CE's HTCondor configuration: Quarantine the broken jobs to their own GridManager by setting 'GRIDMANAGER_SELECTION_EXPR' to a ClassAd expression using attributes that are unique to the bad jobs.

That is a great idea! It requires that we can characterize the bad jobs, though.

For example, we suspect that the bad jobs would have a high NumJobStarts, so you could use that to separate the bad jobs:

    GRIDMANAGER_SELECTION_EXPR = strcat(Owner, ifThenElse(NumJobStarts > 5, ".quarantine", ""))

I'm afraid that won't help: I kept the full ClassAds (as per condor_q -l) of both the CE and batch jobs for the last cases we saw, and NumJobStarts never got higher than 1 (unless the attribute has somehow been reset?).

If we did try this, we should set it in the batch configuration (which is the one spawning the GridManager), not in the HTCondor-CE one, right?

I'm thinking that an alternative (perhaps more brute-force) workaround might be to just use some arbitrary hashing expression so that there are X > 1 GridManager processes per user; an assertion error in one job would then not block all re-routed jobs, but only those that happen to hash to the faulty GridManager.

This would not completely solve the problem, because we have the JobRouter's MaxJobs attribute set to the number of dedicated slots at the remote site, so the stuck jobs would still reduce the number of effectively re-routed jobs (not filling all slots), but the reduction would in principle be 1/X (rather than total).
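For instance, with X = 4 I was imagining something like this in the batch configuration (just a sketch, not tested; the modulus of 4 is arbitrary):

    # Sketch only: spread each user's grid jobs over 4 GridManager processes,
    # so a crashing GridManager only takes ~1/4 of the re-routed jobs with it.
    GRIDMANAGER_SELECTION_EXPR = strcat(Owner, "-", string(ClusterId % 4))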

What do you think about this?

Thanks again!

Antonio



On 11/18/19 4:52 PM, Brian Lin wrote:
Hi Antonio, Carles,

Sorry for the delay in the response, I haven't forgotten about this issue! The HTCondor team is looking into the assumptions made in the assertion error and is discussing potential solutions. Unfortunately, our GridManager expert has been on leave for a few weeks, so it may be some time before we have a solution. In the meantime, I'll work on reproducing the issue locally so we can investigate the root cause.

We do have some ideas for workarounds, though:

1. In the originating CE's HTCondor configuration: Quarantine the broken jobs to their own GridManager by setting 'GRIDMANAGER_SELECTION_EXPR' to a ClassAd expression using attributes that are unique to the bad jobs. For example, we suspect that the bad jobs would have a high NumJobStarts, so you could use that to separate the bad jobs:

    GRIDMANAGER_SELECTION_EXPR = strcat(Owner, ifThenElse(NumJobStarts > 5, ".quarantine", ""))

2. In the originating CE's HTCondor-CE configuration: Set SYSTEM_PERIODIC_REMOVE expressions to remove the offending job.
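For example, something very rough like this (the held-for-a-day condition is just a placeholder; you'd want to tailor it to whatever characterizes the offending jobs):

    # Rough sketch: remove jobs that have been held (JobStatus == 5) for over a day.
    SYSTEM_PERIODIC_REMOVE = (JobStatus == 5) && ((time() - EnteredCurrentStatus) > 86400)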

Thanks,
Brian

On 9/3/19 8:27 AM, Antonio Delgado Peris wrote:
Dear all,

We have found a problem with the GridManager when dealing with certain jobs that our HTCondor setup forwards to remote (Grid) resources.


It's not totally clear what has happened, but we would argue that the GridManager behaviour could be improved. In summary:

In certain circumstances (infrequent, but not very rare, in our experience), a job causes the GridManager code to fail an assertion, so the GridManager exits without processing the rest of its jobs, which may be perfectly healthy.

We consider this a bug, or at least a problem, because a single problematic job prevents other jobs from proceeding, eventually causing all remote activity to halt. In our experience, manual cleaning is required to recover.

We would rather have those problematic jobs ignored (i.e., left on hold), if possible, so that the other jobs can proceed.


The assertion error looks like the following:

ERROR "Assertion ERROR on (gahp != __null || gmState == 14 || gmState == 12)" at line
391 in file /slots/10/dir_3391896/userdir/.tmpa8mXuo/BUILD/condor-8.8.1/src/condor_gridmanager/condorjob.cpp

Looking at the code (https://github.com/htcondor/htcondor/blob/V8_8_1-branch/src/condor_gridmanager/condorjob.cpp), I see that it happens within CondorJob::doEvaluateState:

ASSERT ( gahp != NULL || gmState == GM_HOLD || gmState == GM_DELETE );


For those interested, let me give more details by illustrating this with the most recent example of this error:

  • A job is received by our HTCondor-CE (v3.2.1) and reaches our HTCondor batch (v8.8.1): CE ID: 2302737, batch ID: 2314592

  • The job is routed to Universe 9 (Grid), with GridResource set to a remote HTCondor-CE resource, and starts running there.

  • Eventually, the remote job fails due to memory constraints. This is noticed in the local schedd log:

07/27/19 09:39:57 (cid:12488566) Set Attribute for job 2302737.0, HoldReason = "Error from slot1_3@xxxxxxxxxxxx: Job has gone over memory limit of 2048 megabytes. Peak usage: 42532 megabytes."

  • This causes the local CE job to be removed, but the local batch job remains. However, the physical spool directory for the job files is the same for both jobs (at least in our configuration), and it gets removed when the CE job is deleted.

  • In the GridManager log, we see that the job went to GM_SUBMITTED, and then moves to GM_CLEAR_REQUEST (instead of GM_HOLD, as happens in other cases), causing the assertion error:

07/27/19 09:39:50 [1629941] (2314592.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 2

[...]

07/27/19 10:12:19 [1629941] Failed to get expiration time of proxy /var/lib/condor-ce/spool/2737/0/cluster2302737.proc0.subproc0/credential_CMSG-v1_0.main_411868: unable to read proxy file

07/27/19 10:12:19 [1629941] Found job 2314592.0 --- inserting

07/27/19 10:12:19 [1629941] (2314592.0) doEvaluateState called: gmState GM_CLEAR_REQUEST, remoteState -1 

07/27/19 10:12:19 [1629941] ERROR "Assertion ERROR on (gahp != __null || gmState == 14 || gmState == 12)" at line 391 in file /slots/10/dir_3391896/userdir/.tmpa8mXuo/BUILD/condor-8.8.1/src/condor_gridmanager/condorjob.cpp

  • One can also notice that the job spool dir (CE id: 2302737) is gone (the 'Failed to get expiration time of proxy' line).

  • From that moment on, each time the GridManager wakes up, it hits the same assertion error and exits, failing to handle all the other jobs.

  • This can be cleaned up with 'condor_rm -forcex'.
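In our case that meant forcing the removal of the stuck batch job on the batch schedd, e.g. (using the batch ID from above):

    condor_rm -forcex 2314592.0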


So there are two issues here: one is why the job gets into that strange state; the second (more important, IMHO) is whether the GridManager should perhaps skip that job and continue with the rest, instead of dying completely.

We are aware that our configuration may be uncommon, but we would be grateful if this could be avoided somehow (via a configuration change on our side or perhaps a code fix from the developers).

Any comments are welcome :-)

Cheers,

   Antonio


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/