Re: [Condor-users] ERROR "Assertion ERROR on (result)" at line 319 in file NTreceiver



>> Can someone with condor_shadow code access give me an idea of what might
>> be causing this assert to get triggered?

> Nothing terribly obvious to me - it is asserting while trying to send
> the remote syscall to begin execution. If you could pull out the
> 94673.23 related log (it is mixed in with the 94812.0 logs in the
> snippet you provided), that would help.

> I would guess that it is linked to the errors listed above though.
> "Request to run REFUSED" is normally not a good sign; it can mean that
> shadows are not getting started fast enough, so the startd times out
> the request.

Yup. I think this was exactly the problem. The startd claims were timing
out. 
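
For anyone hitting the same thing, here is a rough toy model of why a big
burst of jobs can end in refused claims. It is only a sketch and not Condor
code: the function name, the fixed shadow start rate, and the single timeout
value are all assumptions made for illustration, and the real schedd/startd
interaction is more involved.

```python
def refused_claims(num_claims, shadow_start_rate, claim_timeout):
    """Count claims whose shadow would start after the timeout.

    Toy assumptions (hypothetical, not Condor's real behaviour):
    - all claims are handed out at t = 0
    - shadows start one after another at a fixed rate (shadows/second)
    - a claim is refused if its shadow is not up within claim_timeout seconds
    """
    refused = 0
    for i in range(num_claims):
        # Time at which the (i + 1)-th shadow has finished starting.
        start_time = (i + 1) / shadow_start_rate
        if start_time > claim_timeout:
            refused += 1
    return refused


if __name__ == "__main__":
    # 100 claims, 2 shadows per second, 30 second timeout.
    print(refused_claims(100, 2.0, 30))
```

In this model anything beyond rate * timeout claims gets refused, which is
roughly the pattern a slow or overloaded schedd would produce.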

I disabled submits and cleared out the errant cluster (it took upwards
of an hour for the schedd to remove the 4k job cluster). Brought it back
to a stable state and then resubmitted and everything was okay. I think
the user may have issued an rm and then immediately resubmitted from the
same spot. Never good.

> Have you tried upgrading the submit host to the latest 6.8?

Probably time to bite the bullet and move to 6.8.5.

> "Fixed a bug in the condor_shadow on Windows where it would fail to
> correctly perform the PASSWORD authentication method." Just a stab in
> the dark.

We're running Linux for our central manager nodes. But I'm sure there
are a number of improvements we can take advantage of in 6.8.5.

Hey Todd! Are any of the submit or remove performance enhancements you
talked about at Condor Week going to get backported to the 6.8.x
series? I'm tempted to move to 6.9.x just so large cluster removals
don't take an hour plus...

Thanks for the help Matt!

- Ian