
[Condor-users] job restarts



I recently started seeing jobs fail with the errors below.  These jobs
come into our cluster from the Globus job manager, which explicitly
disables streaming output and transfers the output files when the jobs
finish (via the NFSLite package from the VDT).  The file transfer is now
failing, which ultimately results in jobs being requeued and run again
and again.

These errors seem to have started at about the same time that I changed
this particular grid user's shell from /bin/bash to /bin/true.  But
other users with a shell of /bin/true don't have problems with this
output file transfer.

Where else should I look for more information on what's going wrong?

--Mike

8/24 20:49:03 ******************************************************
8/24 20:49:03 ** condor_starter (CONDOR_STARTER) STARTING UP
8/24 20:49:03 ** /opt/condor/sbin/condor_starter
8/24 20:49:03 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $
8/24 20:49:03 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
8/24 20:49:03 ** PID = 25226
8/24 20:49:03 ** Log last touched 8/24 20:49:01
8/24 20:49:03 ******************************************************
8/24 20:49:03 Using config source: /home/condor/condor_config
8/24 20:49:03 Using local config sources:
8/24 20:49:03    /share/apps/condor/hosts/cithep230/condor_config.local
8/24 20:49:03 DaemonCore: Command Socket at <10.255.255.156:45962>
8/24 20:49:03 Done setting resource limits
8/24 20:49:03 Communicating with shadow <10.255.255.216:48267>
8/24 20:49:03 Submitting machine is "gatekeeper-0-2.local"
8/24 20:49:03 setting the orig job name in starter
8/24 20:49:03 setting the orig job iwd in starter
8/24 20:49:03 File transfer completed successfully.
8/24 20:49:04 Job 875666.0 set to execute immediately
8/24 20:49:04 Starting a VANILLA universe job with ID: 875666.0
8/24 20:49:04 IWD: /state/partition1/tmp/cithep230/execute/dir_25226
8/24 20:49:04 Output file:
/state/partition1/tmp/cithep230/execute/dir_25226/_condor_stdout
8/24 20:49:04 Error file:
/state/partition1/tmp/cithep230/execute/dir_25226/_condor_stderr
8/24 20:49:10 Using wrapper
/opt/condor/bin/condor_nfslite_job_wrapper.sh to exec
Summer08-QCD_EMenriched_Pt30to80-IDEAL_V6_v1-32774-JobSpec.xml
8/24 20:49:10 Create_Process succeeded, pid=25229
8/25 08:19:58 Process exited, pid=25229, status=0
8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
failure reading 5 bytes from unknown source.
8/25 08:19:58 IO: Failed to read packet header
8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
8/25 08:19:58 File transfer failed, forcing disconnect.
8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
or for a reconnect attempt
8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
failure reading 5 bytes from unknown source.
8/25 08:19:58 IO: Failed to read packet header
8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
8/25 08:19:58 File transfer failed, forcing disconnect.
8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
or for a reconnect attempt
8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
failure reading 5 bytes from unknown source.
8/25 08:19:58 IO: Failed to read packet header
8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
8/25 08:19:58 File transfer failed, forcing disconnect.
8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
or for a reconnect attempt
8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
failure reading 5 bytes from unknown source.
8/25 08:19:58 IO: Failed to read packet header
8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
8/25 08:19:58 File transfer failed, forcing disconnect.
8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
or for a reconnect attempt
8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
failure reading 5 bytes from unknown source.
8/25 08:19:58 IO: Failed to read packet header
8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
or for a reconnect attempt
8/25 08:19:58 Got SIGQUIT.  Performing fast shutdown.
8/25 08:19:58 ShutdownFast all jobs.
8/25 08:19:58 Result of "get_usage" operation from ProcD: ERROR: No
family with the given PID is registered
8/25 08:19:58 error getting family usage in VanillaProc::PublishUpdateAd()
8/25 08:19:58 condor_write(): Socket closed when trying to write 67
bytes to <10.255.255.216:43187>, fd is 5
8/25 08:19:58 Buf::write(): condor_write() failed
8/25 08:19:58 Failed to send job exit status to shadow
8/25 08:19:58 JobExit() failed, waiting for job lease to expire or for a
reconnect attempt
8/25 08:19:58 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
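
One way to triage a starter log like the one above is to collapse the repeated retry messages into counts and decode the errno values. On Linux, errno 104 is ECONNRESET ("Connection reset by peer"), which suggests the other end of the connection closed it while the starter was waiting for the GoAhead message. The following is a minimal sketch, not part of Condor: the function names are hypothetical, and it assumes every log entry starts with a "M/D HH:MM:SS" timestamp and that any line without one is a mail-wrapped continuation of the previous entry.

```python
import errno
import re
from collections import Counter

# Assumed entry format, taken from the excerpt above: "M/D HH:MM:SS message".
LINE_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) (.*)$")
ERRNO_RE = re.compile(r"errno = (\d+)")


def join_wrapped(lines):
    """Re-join wrapped lines: a line without a timestamp continues the previous entry."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if LINE_RE.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def summarize(lines):
    """Return (message counts, errno counts) for the given log lines."""
    counts = Counter()
    errnos = Counter()
    for entry in join_wrapped(lines):
        m = LINE_RE.match(entry)
        if not m:
            continue
        msg = m.group(2)
        counts[msg] += 1
        e = ERRNO_RE.search(msg)
        if e:
            errnos[int(e.group(1))] += 1
    return counts, errnos


def errno_name(code):
    """Map a numeric errno to its symbolic name on the machine running this script."""
    return errno.errorcode.get(code, "unknown")


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python summarize_starter_log.py StarterLog
    with open(sys.argv[1]) as fh:
        counts, errnos = summarize(fh)
    for msg, n in counts.most_common(10):
        print(f"{n:4d}  {msg}")
    for code, n in errnos.items():
        print(f"errno {code} ({errno_name(code)}): {n} occurrence(s)")
```

Because errno numbering is platform-specific, the symbolic names are only meaningful if the script runs on the same kind of system that produced the log (here, Linux).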
