
Re: [HTCondor-users] Shadow exception



Mark

Thanks for the feedback. I tried updating the submit node to 8.8 and resubmitting with the debug options you specified. From the ShadowLog:

04/13/21 15:24:11 (165.0) (1936823): Completed DC_CHILDALIVE to daemon at <XXXX:15187>
04/13/21 15:24:12 (165.0) (1936823): DaemonKeepAlive: Leaving SendAliveToParent() - success
04/13/21 15:24:12 (165.0) (1936823): condor_read(): Socket closed when trying to read 5 bytes from startd slot1@YYYY

And in the StartLog (on execute node):

04/13/21 15:24:11 CLOSE TCP <10.69.200.50:9618> fd=14
04/13/21 15:24:12 condor_read(fd=15 <127.0.0.1:11601>,,size=5,timeout=10,flags=0,non_blocking=0)
04/13/21 15:24:12 condor_read(): Socket closed abnormally when trying to read 5 bytes from <127.0.0.1:11601>, errno=104 Connection reset by peer
04/13/21 15:24:12 Stream::get(int) failed to read padding
04/13/21 15:24:12 CLOSE TCP <127.0.0.1:32581> fd=15
04/13/21 15:24:12 Starter pid 1983674 exited with status 1
04/13/21 15:24:12 slot1: State change: starter exited
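
One way to line up the two excerpts above is to merge them into a single timeline. The following is only a minimal Python sketch, not an HTCondor tool: `parse` and `merge_timeline` are hypothetical helper names, the sample lines are copied from the excerpts above, and it assumes the clocks on the submit and execute nodes are in sync.

```python
import re
from datetime import datetime

# Timestamp prefix used in the excerpts above: MM/DD/YY HH:MM:SS
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse(lines, source):
    """Yield (timestamp, source, message) for each line that starts with a timestamp."""
    for line in lines:
        match = STAMP.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
            yield when, source, match.group(2)


def merge_timeline(shadow_lines, startd_lines):
    """Merge both logs into one list ordered by timestamp.

    sorted() is stable, so lines that share a second keep their
    original relative order (shadow lines first, then startd lines).
    """
    events = list(parse(shadow_lines, "shadow")) + list(parse(startd_lines, "startd"))
    return sorted(events, key=lambda event: event[0])


# Sample lines copied from the ShadowLog and StartLog excerpts above.
shadow = [
    "04/13/21 15:24:11 (165.0) (1936823): Completed DC_CHILDALIVE to daemon at <XXXX:15187>",
    "04/13/21 15:24:12 (165.0) (1936823): condor_read(): Socket closed when trying to read 5 bytes from startd slot1@YYYY",
]
startd = [
    "04/13/21 15:24:11 CLOSE TCP <10.69.200.50:9618> fd=14",
    "04/13/21 15:24:12 condor_read(): Socket closed abnormally when trying to read 5 bytes from <127.0.0.1:11601>, errno=104 Connection reset by peer",
    "04/13/21 15:24:12 Starter pid 1983674 exited with status 1",
]

for when, source, message in merge_timeline(shadow, startd):
    print(when.strftime("%H:%M:%S"), f"{source:<6}", message)
```

With these sample lines, the shadow's DC_CHILDALIVE and the startd's CLOSE TCP land in the same second, and the failed reads on both sides follow one second later. The logs only have one-second resolution, so the order within a second is not meaningful.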

Is something on the execute node having trouble reading from 127.0.0.1?

P


-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Mark Coatsworth
Sent: mandag 29. mars 2021 19.19
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Shadow exception

Hi Peter,

Running an 8.8.10 execute node in an 8.6 cluster could definitely be an issue; there are often significant changes across major versions.
That's where I would start. Is upgrading your pool (or downgrading that node) an option?

Looking at your previous two emails, the first one reports "Starter exited with status 1" but your next one mentions a Shadow exception. Is there any relevant information in your ShadowLog and/or StarterLog files explaining the error? You might need to set SHADOW_DEBUG = D_FULLDEBUG (on your submit host) and STARTER_DEBUG = D_FULLDEBUG (on the execute machine) to get more useful information. You'll have to look through the individual starter slot logs to find the failure.
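
With full debugging on, those logs get large, so a small filter can help when looking through them. This is a rough Python sketch under stated assumptions: `MARKERS` and `find_failures` are hypothetical names, the marker strings are simply the error messages quoted in this thread, and you would feed it lines from whichever log files your site writes.

```python
# Substrings taken from the error messages quoted in this thread.
MARKERS = (
    "Socket closed",
    "Connection reset by peer",
    "exited with status",
    "Shadow exception",
)


def find_failures(lines):
    """Return (line_number, text) for every line containing one of MARKERS."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits


# Sample lines in the format of the startd log quoted in this thread.
sample = [
    "04/13/21 15:24:12 Stream::get(int) failed to read padding",
    "04/13/21 15:24:12 CLOSE TCP <127.0.0.1:32581> fd=15",
    "04/13/21 15:24:12 Starter pid 1983674 exited with status 1",
]

for number, text in find_failures(sample):
    print(f"{number}: {text}")
```

To scan a real file, pass an open file handle instead of the sample list, for example `find_failures(open(path, errors="replace"))`, where `path` is the location of the log on your machine.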

If you could include the relevant log files from around the time the job magically started, that would be super helpful.

Mark

On Thu, Mar 25, 2021 at 5:49 PM Peter Ellevseth <Peter.Ellevseth@xxxxxxxxxx> wrote:
>
> Gents
>
>
>
> It seems my issue did not raise any big concerns out there. Today I
> noticed an old job that had been submitted about three weeks ago. The
> log-file is crammed full with three weeks' worth of "007 (162.000 ...)
> 03/05... Shadow exception! Etc" entries, up until tonight when I
> noticed it. I queried the negotiator a bit and checked out local log
> files. Found nothing new really. Then after a couple of minutes, the
> job started and is now running. Magic!
>
>
>
> Any thoughts?
>
>
>
> P
>
>
>
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of 
> Peter Ellevseth
> Sent: tirsdag 19. januar 2021 00.05
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] Shadow exception
>
>
>
> Gents
>
>
>
> I am having some issues with one of the machines in my cluster. I keep getting "Shadow exception", e.g.
>
>
>
> 01/18/21 23:58:11 condor_read(fd=17 
> <127.0.0.1:21523>,,size=5,timeout=10,flags=0,non_blocking=0)
>
> 01/18/21 23:58:11 condor_read(): Socket closed abnormally when trying 
> to read 5 bytes from <127.0.0.1:21523>, errno=104 Connection reset by 
> peer
>
> 01/18/21 23:58:11 Stream::get(int) failed to read padding
>
> 01/18/21 23:58:11 CLOSE TCP <127.0.0.1:31043> fd=17
>
> 01/18/21 23:58:11 Starter pid 5973 exited with status 1
>
>
>
> Now, the really strange part is that if I keep fiddling around with the STARTD-machine (checking logs, running condor_status etc), the job just magically starts. I have no idea what actions make it start, but it does.
>
>
>
> The startd-machine is running a newer version of condor (8.8.10) than the rest of the cluster, which is running 8.6. Could that be an issue?
>
>
>
> I added startd_debug = D_NETWORK, but didn't really learn anything. Are there any other useful debugs I should check out?
>
>
>
> Peter
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx 
> with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/



--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison
