[Condor-users] Help for problem debugging needed



Hi all,

I have a problem and don't know how to proceed. Yesterday a user wrote
to me that his vanilla universe jobs were being stopped on the cluster.
I had a brief look and saw some log messages that confirm this, but I
have no clue why it happened:

On the node where this happened (StartLog):
10/18 03:17:54 slot1: Got activate_claim request from shadow (<10.20.30.2:44843>)
10/18 03:17:54 slot1: Remote job ID is 4205801.25
10/18 03:17:54 slot1: Got universe "VANILLA" (5) from request classad
10/18 03:17:54 slot1: State change: claim-activation protocol successful
10/18 03:17:54 slot1: Changing activity: Idle -> Busy

Then the StarterLog for that slot:
10/18 03:17:54 Job 4205801.25 set to execute immediately
10/18 03:17:54 Starting a VANILLA universe job with ID: 4205801.25
10/18 03:17:54 IWD: XXXX
10/18 03:17:54 Output file: XXXX
10/18 03:17:54 Error file: XXXX
10/18 03:17:54 About to exec XXXX
10/18 03:17:54 Create_Process succeeded, pid=27623
10/18 06:16:00 condor_write(): Socket closed when trying to write 184 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:16:00 Buf::write(): condor_write() failed
10/18 06:21:00 condor_write(): Socket closed when trying to write 184 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:21:00 Buf::write(): condor_write() failed
10/18 06:23:04 Got SIGTERM. Performing graceful shutdown.
10/18 06:23:04 ShutdownGraceful all jobs.
10/18 06:23:04 Process exited, pid=27623, signal=15
10/18 06:23:04 condor_write(): Socket closed when trying to write 308 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:23:04 Buf::write(): condor_write() failed
10/18 06:23:04 Failed to send job exit status to shadow
10/18 06:23:04 JobExit() failed, waiting for job lease to expire or for a reconnect attempt

10/18 06:53:04 ShutdownFast all jobs.
10/18 06:53:04 Result of "get_usage" operation from ProcD: ERROR: No family with the given PID is registered
10/18 06:53:04 error getting family usage in VanillaProc::PublishUpdateAd()
10/18 06:53:04 condor_write(): Socket closed when trying to write 67 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:53:04 Buf::write(): condor_write() failed
10/18 06:53:04 Failed to send job exit status to shadow
10/18 06:53:04 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/18 06:53:04 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

It looks as if this job was somehow shut down remotely, but as far as I can see from the logs, no one triggered that!
During that time, no network outage was reported, nor was any NFS service failing.
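
To make the timing easier to compare against other logs, here is a minimal sketch that computes the gaps between the key StarterLog events quoted above. It is only an illustration: the helper names parse_events and gaps are made up for this example, and the year is a placeholder because the log lines carry no year.

```python
from datetime import datetime

# Lines copied from the StarterLog excerpt above.
LOG = """\
10/18 03:17:54 Create_Process succeeded, pid=27623
10/18 06:16:00 condor_write(): Socket closed when trying to write 184 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:21:00 condor_write(): Socket closed when trying to write 184 bytes to <10.20.30.2:44843>, fd is 5
10/18 06:23:04 Got SIGTERM. Performing graceful shutdown.
10/18 06:53:04 ShutdownFast all jobs.
"""


def parse_events(text, year=2000):
    """Return (timestamp, message) pairs.

    The year is an assumed placeholder: the log format omits it.
    Lines that do not start with 'MM/DD HH:MM:SS' are skipped.
    """
    events = []
    for line in text.splitlines():
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue
        try:
            ts = datetime.strptime(
                f"{year}/{parts[0]} {parts[1]}", "%Y/%m/%d %H:%M:%S"
            )
        except ValueError:
            continue
        events.append((ts, parts[2]))
    return events


def gaps(events):
    """Return (time since previous event, message) for each event after the first."""
    return [(cur[0] - prev[0], cur[1]) for prev, cur in zip(events, events[1:])]


events = parse_events(LOG)
for delta, msg in gaps(events):
    print(delta, msg[:45])
```

On this excerpt, the two failed writes are five minutes apart, and the fast shutdown follows the graceful one by exactly 30 minutes. Searching the submit machine's logs around 06:16 and 06:23 for job 4205801.25 would be the natural next step.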

Are any other log files of interest?

Cheers

Carsten


-- 
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31