
Re: [HTCondor-users] Shadow exception



Gents

 

It seems my issue did not raise any big concerns out there. Today I noticed an old job that had been submitted about three weeks ago. Until tonight, its log file was crammed full of three weeks' worth of ‘007 (162.000 …) 03/05….. Shadow exception! Etc’ entries. I queried the negotiator a bit and checked the local log files, but found nothing new. Then, after a couple of minutes, the job started and is now running. Magic!

 

Any thoughts?

 

P

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Peter Ellevseth
Sent: tirsdag 19. januar 2021 00.05
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Shadow exception

 

Gents

 

I am having some issues with one of the machines in my cluster. I keep getting ‘Shadow exception’, e.g.

 

01/18/21 23:58:11 condor_read(fd=17 <127.0.0.1:21523>,,size=5,timeout=10,flags=0,non_blocking=0)

01/18/21 23:58:11 condor_read(): Socket closed abnormally when trying to read 5 bytes from <127.0.0.1:21523>, errno=104 Connection reset by peer

01/18/21 23:58:11 Stream::get(int) failed to read padding

01/18/21 23:58:11 CLOSE TCP <127.0.0.1:31043> fd=17

01/18/21 23:58:11 Starter pid 5973 exited with status 1

 

Now, the really strange part is that if I keep fiddling around with the startd machine (checking logs, running condor_status, etc.), the job just magically starts. I have no idea which action makes it start, but it does.

 

The startd machine is running a newer version of HTCondor (8.8.10) than the rest of the cluster, which runs 8.6. Could that be an issue?

 

I added startd_debug = D_NETWORK, but didn’t really learn anything from it. Are there any other useful debug settings I should check out?

 

Peter