Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Help for problem debugging needed

Date: Mon, 20 Oct 2008 10:03:58 -0500
From: Robert Futrick <rfutrick@xxxxxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] Help for problem debugging needed

Hello Carsten,

You can try looking at the ShadowLog (present on the submit/schedd machine that job 4205801.25 was scheduled on). The Shadow is the process that the starter is attempting to communicate with, and its log might give more information about the issue.
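
A ShadowLog on a busy submit machine can be large, so it may help to pull out only the lines that mention the job in question. The following is a minimal sketch, not a Condor tool: the helper name `lines_for_job` is made up for illustration, and it simply assumes that relevant ShadowLog lines contain the job ID as text. The log path is site-specific; running `condor_config_val SHADOW_LOG` on the submit machine should print the configured location.

```python
import re
from typing import Iterable, List


def lines_for_job(log_lines: Iterable[str], job_id: str) -> List[str]:
    """Return the log lines that mention job_id (hypothetical helper).

    The lookarounds keep "4205801.25" from matching inside a longer
    number such as "4205801.250" or "14205801.25".
    """
    pattern = re.compile(r"(?<![\d.])" + re.escape(job_id) + r"(?!\d)")
    return [line.rstrip("\n") for line in log_lines if pattern.search(line)]


def print_job_lines(path: str, job_id: str) -> None:
    """Print every line of the log file at `path` that mentions job_id."""
    # errors="replace" avoids crashing on stray non-UTF-8 bytes in the log.
    with open(path, "r", errors="replace") as handle:
        for line in lines_for_job(handle, job_id):
            print(line)
```

Calling `print_job_lines(path_to_shadow_log, "4205801.25")` with the path reported on the submit machine would then show only the entries for that job, which can be compared against the 06:16 to 06:53 window in the starter output.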


Regards,
Rob

--

===================================
Rob Futrick
main: 888.292.5320

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and CycleServer Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com



Carsten Aulbert wrote:

Hi all,

I have a problem and I don't know how to proceed. A user wrote to me
yesterday that his vanilla universe jobs were being stopped on the cluster.
I had a brief look and saw some messages supporting this, but I have no
clue why it happened:

On the node where this happened (StartLog):
10/18 03:17:54 slot1: Got activate_claim request from shadow (<10.20.30.2:44843>)
10/18 03:17:54 slot1: Remote job ID is 4205801.25
10/18 03:17:54 slot1: Got universe "VANILLA" (5) from request classad
10/18 03:17:54 slot1: State change: claim-activation protocol successful
10/18 03:17:54 slot1: Changing activity: Idle -> Busy

Then the StarterLog for that slot:
10/18 03:17:54 Job 4205801.25 set to execute immediately
10/18 03:17:54 Starting a VANILLA universe job with ID: 4205801.25
10/18 03:17:54 IWD: XXXX
10/18 03:17:54 Output file: XXXX
10/18 03:17:54 Error file: XXXX
10/18 03:17:54 About to exec XXXX
10/18 03:17:54 Create_Process succeeded, pid=27623
10/18 06:16:00 condor_write(): Socket closed when trying to write 184 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:16:00 Buf::write(): condor_write() failed
10/18 06:21:00 condor_write(): Socket closed when trying to write 184 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:21:00 Buf::write(): condor_write() failed
10/18 06:23:04 Got SIGTERM. Performing graceful shutdown.
10/18 06:23:04 ShutdownGraceful all jobs.
10/18 06:23:04 Process exited, pid=27623, signal=15
10/18 06:23:04 condor_write(): Socket closed when trying to write 308 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:23:04 Buf::write(): condor_write() failed
10/18 06:23:04 Failed to send job exit status to shadow
10/18 06:23:04 JobExit() failed, waiting for job lease to expire or for a reconnect attempt

10/18 06:53:04 ShutdownFast all jobs.
10/18 06:53:04 Result of "get_usage" operation from ProcD: ERROR: No family with the given PID is registered
10/18 06:53:04 error getting family usage in VanillaProc::PublishUpdateAd()
10/18 06:53:04 condor_write(): Socket closed when trying to write 67 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:53:04 Buf::write(): condor_write() failed
10/18 06:53:04 Failed to send job exit status to shadow
10/18 06:53:04 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/18 06:53:04 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

It looks as if this job was somehow shut down remotely, but as far as I can see from the logs, no one triggered that!
During that time, no network outage was reported, nor did any NFS service fail.
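
To line up the events in the starter output above, the intervals between them can be computed from the timestamps. This is only a rough sketch: the function names are invented for illustration, the year is assumed to be 2008 (the log lines carry no year; the value is taken from the date of this thread), and a log that crosses midnight into a new year is not handled.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def parse_line(line: str, year: int = 2008) -> Optional[Tuple[datetime, str]]:
    """Split an 'MM/DD HH:MM:SS message' line into (timestamp, message)."""
    parts = line.split(" ", 2)
    if len(parts) < 3:
        return None
    try:
        stamp = datetime.strptime(
            f"{year}/{parts[0]} {parts[1]}", "%Y/%m/%d %H:%M:%S"
        )
    except ValueError:
        return None  # not a timestamped log line
    return stamp, parts[2].rstrip("\n")


def gap_between(
    lines: Iterable[str], first_marker: str, second_marker: str
) -> Optional[timedelta]:
    """Time from the first line containing first_marker to the next
    later line containing second_marker, or None if either is missing."""
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if start is None:
            if first_marker in message:
                start = stamp
        elif second_marker in message:
            return stamp - start
    return None


# Three lines copied from the starter output quoted above.
excerpt = [
    "10/18 06:16:00 condor_write(): Socket closed when trying to write 184 bytes to <10.20.30.2:44843>, fd is 5",
    "10/18 06:23:04 Got SIGTERM. Performing graceful shutdown.",
    "10/18 06:53:04 ShutdownFast all jobs.",
]

print(gap_between(excerpt, "condor_write(): Socket closed", "Got SIGTERM"))  # 0:07:04
print(gap_between(excerpt, "Got SIGTERM", "ShutdownFast"))                   # 0:30:00
```

On these lines the first failed write to the shadow precedes the SIGTERM by about seven minutes, and the fast shutdown follows the SIGTERM by exactly 30 minutes. The same function could be pointed at the logs on the submit machine to see what was recorded there around 06:16.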

Are any other log files of interest?

Cheers

Carsten

Follow-Ups:
- Re: [Condor-users] Help for problem debugging needed
  - From: Carsten Aulbert

References:
- [Condor-users] Help for problem debugging needed
  - From: Carsten Aulbert
