
Re: [HTCondor-users] Shadow exception



Hi Peter,

Running an 8.8.10 execute node in an 8.6 cluster could definitely be an
issue; there are often significant changes across major versions.
That's where I would start. Is upgrading your pool (or downgrading
that node) an option?

Looking at your previous two emails, the first one reports "Starter
exited with status 1" but your next one mentions a Shadow exception. Is
there any relevant information in your ShadowLog and/or StarterLog
files explaining the error? You might need to set SHADOW_DEBUG =
D_FULLDEBUG (on your submit host) and STARTER_DEBUG = D_FULLDEBUG (on
the execute machine) to get more useful information. You'll have to
look through the individual starter slot logs to find the failure.
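
As a rough sketch, the settings would look something like this (the file
name is only an example; any local config file your pool already reads
will do):

    # On the submit host, in a local config file, e.g.
    # /etc/condor/config.d/99-debug.conf (hypothetical file name)
    SHADOW_DEBUG = D_FULLDEBUG

    # On the execute machine, in its own local config file
    STARTER_DEBUG = D_FULLDEBUG

Then run condor_reconfig on each machine so the daemons pick up the
change. "condor_config_val LOG" prints the log directory; the per-slot
starter logs are usually named StarterLog.slot1, StarterLog.slot2, and
so on.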

If you could include the relevant log files from around the time the
job magically started, that would be super helpful.

Mark

On Thu, Mar 25, 2021 at 5:49 PM Peter Ellevseth
<Peter.Ellevseth@xxxxxxxxxx> wrote:
>
> Gents
>
>
>
> It seems my issue did not raise any big concerns out there. Today I noticed an old job that had been submitted about three weeks ago. The log file is crammed full with three weeks' worth of "007 (162.000 ...) 03/05 ... Shadow exception!" etc., until tonight when I noticed it. I queried the negotiator a bit and checked the local log files, but found nothing new really. Then after a couple of minutes, the job started and is now running. Magic!
>
>
>
> Any thoughts?
>
>
>
> P
>
>
>
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Peter Ellevseth
> Sent: Tuesday, January 19, 2021 00:05
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] Shadow exception
>
>
>
> Gents
>
>
>
> I am having some issues with one of the machines in my cluster. I keep getting "Shadow exception", e.g.
>
>
>
> 01/18/21 23:58:11 condor_read(fd=17 <127.0.0.1:21523>,,size=5,timeout=10,flags=0,non_blocking=0)
>
> 01/18/21 23:58:11 condor_read(): Socket closed abnormally when trying to read 5 bytes from <127.0.0.1:21523>, errno=104 Connection reset by peer
>
> 01/18/21 23:58:11 Stream::get(int) failed to read padding
>
> 01/18/21 23:58:11 CLOSE TCP <127.0.0.1:31043> fd=17
>
> 01/18/21 23:58:11 Starter pid 5973 exited with status 1
>
>
>
> Now, the really strange part is that if I keep fiddling around with the STARTD machine (checking logs, running condor_status, etc.), the job just magically starts. I have no idea what actions make it start, but it does.
>
>
>
> The startd-machine is running a newer version of condor (8.8.10) versus the remaining cluster running 8.6. Could that be an issue?
>
>
>
> I added startd_debug = D_NETWORK, but didn't really learn anything. Are there any other useful debugs I should check out?
>
>
>
> Peter
>



-- 
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison