Re: [HTCondor-users] Held jobs: unable to establish standard (output|error) stream
- Date: Fri, 08 Jan 2021 12:24:56 +0100
- From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] Held jobs: unable to establish standard (output|error) stream
On Thu, 2021-01-07 at 10:14:57 -0600, Greg Thain wrote:
> On 1/7/21 7:03 AM, Thomas Hartmann wrote:
> > I guess there is not much more in the Starter log on your execute
> > node/slot `slot1_3@xxxxxxxxxxxxxxxxxxx` compared to the shadow
> The immediate problem here is that this job is trying to stream standard
> output back to the submit machine in real time by setting
> stream_output = true
> in the submit file. Which is fine.
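For comparison, a submit description that skips live streaming and lets output land on the shared filesystem at job exit might look like the sketch below (executable name and log paths are placeholders, not taken from the actual job):

```
# Hypothetical submit file: write stdout/stderr at job completion
# instead of streaming them back through the shadow in real time.
universe      = vanilla
executable    = my_job.sh                        # placeholder
output        = log/job-$(Cluster)-$(Process).out
error         = log/job-$(Cluster)-$(Process).err
log           = log/job-$(Cluster)-$(Process).log
stream_output = false
stream_error  = false
queue
```

With stream_output/stream_error set to false, the starter does not need a live connection back to the shadow for the std files, which sidesteps exactly the failure mode discussed here.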
I still suspect that this is the only user trying to stream output when it
could simply be written to the shared file system (which is BeeGFS, by the way).
However, for some reason, the starter
> is complaining that it can't setup this stream for the job. If we can see
> the StarterLog.slotXX for that particular job, we should be able to see a
> more specific error.
That would be slot1_3@h0115 ... 21-01-07_09:0* reads as follows there:
21-01-07_09:00:55 ** condor_starter (CONDOR_STARTER) STARTING UP
21-01-07_09:00:55 ** /usr/sbin/condor_starter
21-01-07_09:00:55 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
21-01-07_09:00:55 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
21-01-07_09:00:55 ** $CondorVersion: 8.8.3 May 29 2019 BuildID: Debian-8.8.3-1+deb9u0 PackageID: 8.8.3-1+deb9u0 Debian-8.8.3-1+deb9u0 $
21-01-07_09:00:55 ** $CondorPlatform: X86_64-Debian_9 $
21-01-07_09:00:55 ** PID = 3433
21-01-07_09:00:55 ** Log last touched 1/3 17:13:36
21-01-07_09:00:55 Using config source: /etc/condor/condor_config
21-01-07_09:00:55 Using local config sources:
21-01-07_09:00:55 config Macros = 331, Sorted = 330, StringBytes = 8495, TablesBytes = 11964
21-01-07_09:00:55 CLASSAD_CACHING is OFF
21-01-07_09:00:55 Daemon Log is logging: D_ALWAYS D_ERROR
21-01-07_09:00:55 Daemoncore: Listening at <10.150.1.15:42383> on TCP (ReliSock) and UDP (SafeSock).
21-01-07_09:00:55 DaemonCore: command socket at <10.150.1.15:42383?addrs=10.150.1.15-42383>
21-01-07_09:00:55 DaemonCore: private command socket at <10.150.1.15:42383?addrs=10.150.1.15-42383>
21-01-07_09:00:55 Communicating with shadow <10.150.100.102:8859?addrs=10.150.100.102-8859&noUDP>
21-01-07_09:00:55 Submitting machine is "hypatia2.hypatia.local"
21-01-07_09:00:55 setting the orig job name in starter
21-01-07_09:00:55 setting the orig job iwd in starter
21-01-07_09:00:55 Chirp config summary: IO false, Updates false, Delayed updates true.
21-01-07_09:00:55 Initialized IO Proxy.
21-01-07_09:00:55 Done setting resource limits
21-01-07_09:00:55 File transfer completed successfully.
21-01-07_09:00:55 Job 347480.0 set to execute immediately
21-01-07_09:00:55 Create_Process succeeded, pid=3475
21-01-07_09:00:55 Process exited, pid=3475, status=0
21-01-07_09:00:55 Starting a VANILLA universe job with ID: 347480.0
21-01-07_09:00:55 Current mount, /tmp, is shared.
21-01-07_09:00:55 Current mount, /var, is shared.
21-01-07_09:00:55 IWD: /var/lib/condor/execute/dir_3433
21-01-07_09:00:55 Couldn't seek to end of /.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/log/lalinference-347480-0-.out for stream stdout when append mode was requested: Success
21-01-07_09:00:57 condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.150.100.102:1755>.
21-01-07_09:00:57 IO: Failed to read packet header
21-01-07_09:00:57 i/o error result is 0, errno is 104
21-01-07_09:00:57 condor_write(): Socket closed when trying to write 37 bytes to <10.150.100.102:1755>, fd is 9
21-01-07_09:00:57 Buf::write(): condor_write() failed
21-01-07_09:00:57 i/o error result is 0, errno is 110
21-01-07_09:00:57 Error file: streaming to remote file /.auto/home/xyz/lalsuite-tools/SEOBNRv4HM_PA/SEOBNRv4HM_PA_run/run6/log/lalinference-347480-0-.err
21-01-07_09:00:57 Failed to open some/all of the std files...
21-01-07_09:00:57 Aborting OsProc::StartJob.
21-01-07_09:00:57 Failed to start job, exiting
21-01-07_09:00:57 ShutdownFast all jobs.
21-01-07_09:00:57 Failed to open '.update.ad' to read update ad: No such file or directory (2).
21-01-07_09:00:57 condor_write(): Socket closed when trying to write 102 bytes to <10.150.100.102:1755>, fd is 9
21-01-07_09:00:57 Buf::write(): condor_write() failed
21-01-07_09:00:57 Lost connection to shadow, waiting 2400 secs for reconnect
21-01-07_09:00:57 All jobs have exited... starter exiting
21-01-07_09:00:57 **** condor_starter (condor_STARTER) pid 3433 EXITING WITH STATUS 0
In the output of "errno -l",
errno 104 is "ECONNRESET 104 Connection reset by peer" and
errno 110 is "ETIMEDOUT 110 Connection timed out";
both seem to point to problems opening a port on the schedd/shadow machine (10.150.100.102).
This, obviously, is common to all 16 jobs. (I guess that 1755 is a random port, perhaps
assigned to/opened by the shadow?)
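For anyone without the errno(1) tool installed, the same lookup can be done with the Python standard library alone:

```python
# Cross-check the errno values seen in the StarterLog using only
# Python's stdlib (equivalent to looking them up with "errno -l").
import errno
import os

for num in (104, 110):
    name = errno.errorcode[num]   # symbolic name, e.g. 'ECONNRESET'
    text = os.strerror(num)       # human-readable message
    print(f"errno {num}: {name} - {text}")
```

On Linux this prints ECONNRESET for 104 and ETIMEDOUT for 110, matching the log above.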
Is restarting that machine the only remedy? (I'm not unhappy with that, but many others will be.)
I have another headnode that might not show the issue (which may be an fd leak or something
similar), as it has been recently rebooted and has only 200+ shadows running... I'll try to move
that particular user there.
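The fd-leak suspicion could be checked on the headnode before resorting to a reboot. A minimal sketch (Linux-only, reads /proc; run it against the schedd's or a shadow's PID rather than "self", which is used here only so the example is self-contained):

```python
# Sketch: compare a process's open-fd count against its limit,
# to see whether the schedd/shadows are approaching fd exhaustion.
import os
import resource

def open_fd_count(pid="self"):
    """Number of open file descriptors for the given PID
    (or for the current process when pid == "self")."""
    return len(os.listdir(f"/proc/{pid}/fd"))

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open fds: {open_fd_count()}  (soft limit {soft}, hard limit {hard})")
```

If the count for the schedd (or the sum over all shadows) sits near the soft limit, that would support the leak theory without needing a full restart to find out.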
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
Fon: +49-331-567 7274