
Re: [Condor-users] Help for problem debugging needed



Hello Carsten,

You can try looking at the ShadowLog (present on the submit/schedd machine that job 4205801.25 was submitted from). The shadow is the process that the starter is attempting to communicate with, so its log might give more information about the issue.
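
A ShadowLog on a busy submit machine interleaves messages from many jobs, so it can help to pull out only the lines that mention the job in question. Below is a minimal sketch, assuming the job ID appears verbatim (as cluster.proc) in the relevant lines; the function name and the file path in the usage comment are placeholders, not anything Condor provides.

```python
import re


def lines_for_job(log_lines, job_id):
    """Return the log lines that mention job_id (e.g. "4205801.25").

    The lookarounds keep "4205801.25" from matching inside a longer
    number such as "14205801.25" or "4205801.250".
    """
    pattern = re.compile(r"(?<![\d.])" + re.escape(job_id) + r"(?!\d)")
    return [line.rstrip("\n") for line in log_lines if pattern.search(line)]


# Usage (the path is a placeholder; use wherever your ShadowLog lives):
#
#     with open("/path/to/ShadowLog", errors="replace") as fh:
#         for line in lines_for_job(fh, "4205801.25"):
#             print(line)
```

The same filter works on any of the daemon logs, which makes it easier to compare what the submit side and the execute side each recorded for one job.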

Regards,
Rob

--

===================================
Rob Futrick
main: 888.292.5320

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and CycleServer Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com



Carsten Aulbert wrote:
Hi all,

I have a problem and I don't know how to proceed. A user wrote to me
yesterday that his vanilla universe jobs were being stopped on the cluster.
I had a brief look and saw some messages supporting this, but I have no
clue why it happened:

On the node where this happened (StartLog):
10/18 03:17:54 slot1: Got activate_claim request from shadow (<10.20.30.2:44843>)
10/18 03:17:54 slot1: Remote job ID is 4205801.25
10/18 03:17:54 slot1: Got universe "VANILLA" (5) from request classad
10/18 03:17:54 slot1: State change: claim-activation protocol successful
10/18 03:17:54 slot1: Changing activity: Idle -> Busy

Then the StarterLog for that slot:
10/18 03:17:54 Job 4205801.25 set to execute immediately
10/18 03:17:54 Starting a VANILLA universe job with ID: 4205801.25
10/18 03:17:54 IWD: XXXX
10/18 03:17:54 Output file: XXXX
10/18 03:17:54 Error file: XXXX
10/18 03:17:54 About to exec XXXX
10/18 03:17:54 Create_Process succeeded, pid=27623
10/18 06:16:00 condor_write(): Socket closed when trying to write 184 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:16:00 Buf::write(): condor_write() failed
10/18 06:21:00 condor_write(): Socket closed when trying to write 184 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:21:00 Buf::write(): condor_write() failed
10/18 06:23:04 Got SIGTERM. Performing graceful shutdown.
10/18 06:23:04 ShutdownGraceful all jobs.
10/18 06:23:04 Process exited, pid=27623, signal=15
10/18 06:23:04 condor_write(): Socket closed when trying to write 308 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:23:04 Buf::write(): condor_write() failed
10/18 06:23:04 Failed to send job exit status to shadow
10/18 06:23:04 JobExit() failed, waiting for job lease to expire or for a reconnect attempt

10/18 06:53:04 ShutdownFast all jobs.
10/18 06:53:04 Result of "get_usage" operation from ProcD: ERROR: No family with the given PID is registered
10/18 06:53:04 error getting family usage in VanillaProc::PublishUpdateAd()
10/18 06:53:04 condor_write(): Socket closed when trying to write 67 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:53:04 Buf::write(): condor_write() failed
10/18 06:53:04 Failed to send job exit status to shadow
10/18 06:53:04 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/18 06:53:04 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
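
To line up the events in the log above, here is a minimal sketch that parses the leading timestamps and computes the gaps between key events. The helper names are made up for illustration, and because the log lines carry no year, the default year is an arbitrary assumption that does not affect same-day differences.

```python
from datetime import datetime


def parse_stamp(line, year=2000):
    """Parse the leading "MM/DD HH:MM:SS" stamp of a log line.

    The log omits the year, so `year` is an arbitrary placeholder.
    """
    return datetime.strptime(f"{year} {line[:14]}", "%Y %m/%d %H:%M:%S")


def gap_seconds(earlier, later):
    """Seconds elapsed between two log lines."""
    return int((parse_stamp(later) - parse_stamp(earlier)).total_seconds())


# Lines copied from the log excerpt above.
started = "10/18 03:17:54 Create_Process succeeded, pid=27623"
first_failure = "10/18 06:16:00 condor_write(): Socket closed when trying to write 184 bytes"
sigterm = "10/18 06:23:04 Got SIGTERM. Performing graceful shutdown."
shutdown_fast = "10/18 06:53:04 ShutdownFast all jobs."

print(gap_seconds(started, first_failure))   # 10686 (2 h 58 min 6 s)
print(gap_seconds(first_failure, sigterm))   # 424   (7 min 4 s)
print(gap_seconds(sigterm, shutdown_fast))   # 1800  (30 min)
```

So the job ran for almost three hours before the first failed write, the SIGTERM arrived about seven minutes after that, and the fast shutdown followed exactly 30 minutes later.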

It looks as if this job was somehow shut down remotely, but as far as I can see from the logs, no one triggered that!
During that time, no network outage was reported, nor was any NFS service failing.

Are any other log files of interest?

Cheers

Carsten