Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] communications error after periodic_remove

Date: Thu, 16 Dec 2010 13:21:49 -0500
From: Peter Doherty <doherty@xxxxxxxxxxxxxxxxxxx>
Subject: [Condor-users] communications error after periodic_remove

My job class ad has a 30 minute Periodic_remove in it.

I'm using Condor 7.5.4 on the CM, and 7.4.4 on the worker nodes (yes, I know that's not ideal). I'm using CCB to work at remote sites.

I was watching the StarterLog on the node where the job was running, and when the 30 minute timeout arrived, the Starter successfully exited, but then produced some errors that I'm wondering about. Does this mean the shadow didn't stay alive long enough to receive its final bit of information from the Starter?

Some log info is below.
Peter

12/16 12:14:23 (pid:32538) ZKM: successful mapping to frontend
12/16 12:14:23 (pid:32538) File transfer completed successfully.
12/16 12:14:23 (pid:32538) Job 11455392.0 set to execute immediately
12/16 12:14:23 (pid:32538) Starting a VANILLA universe job with ID: 11455392.0
12/16 12:14:23 (pid:32538) IWD: /scratch/condor/execute/dir_26551/glide_q26579/execute/dir_32538
12/16 12:14:23 (pid:32538) Create_Process succeeded, pid=32540
12/16 12:44:23 (pid:32538) Got SIGQUIT.  Performing fast shutdown.
12/16 12:44:23 (pid:32538) ShutdownFast all jobs.
12/16 12:44:23 (pid:32538) Process exited, pid=32540, signal=9
12/16 12:44:23 (pid:32538) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <134.174.140.112:16474>.
12/16 12:44:23 (pid:32538) IO: Failed to read packet header
12/16 12:44:23 (pid:32538) Failed to send job exit status to shadow
12/16 12:44:23 (pid:32538) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
12/16 12:44:23 (pid:32538) Returning from CStarter::JobReaper()
12/16 12:44:47 (pid:32538) Got SIGTERM. Performing graceful shutdown.
12/16 12:44:47 (pid:32538) ShutdownGraceful all jobs.
12/16 12:44:47 (pid:32538) condor_write(): Socket closed when trying to write 316 bytes to <134.174.140.112:16474>, fd is 10
12/16 12:44:47 (pid:32538) Buf::write(): condor_write() failed
12/16 12:44:47 (pid:32538) Failed to send job exit status to shadow
12/16 12:44:47 (pid:32538) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
12/16 12:44:47 (pid:32538) **** condor_starter (condor_STARTER) pid 32538 EXITING WITH STATUS 0

12/16/10 12:14:22 (pid:9225) ******************************************************
12/16/10 12:14:22 (pid:9225) ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/16/10 12:14:22 (pid:9225) ** /storage/app/site/condor-7.5.4/sbin/condor_shadow
12/16/10 12:14:22 (pid:9225) ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
12/16/10 12:14:22 (pid:9225) ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
12/16/10 12:14:22 (pid:9225) ** $CondorVersion: 7.5.4 Oct 18 2010 BuildID: 280908 $
12/16/10 12:14:22 (pid:9225) ** $CondorPlatform: X86_64-LINUX_RHEL5 $
12/16/10 12:14:22 (pid:9225) ** PID = 9225
12/16/10 12:14:22 (pid:9225) ** Log last touched 12/16 12:14:22
12/16/10 12:14:22 (pid:9225) ******************************************************
12/16/10 12:14:22 (pid:9225) Using config source: /storage/app/site/condor/etc/gwms_schedd_config
12/16/10 12:14:22 (pid:9225) Using local config sources:
12/16/10 12:14:22 (pid:9225) /storage/app/site/condor/gwms_schedd/condor_config.local
12/16/10 12:14:22 (pid:9225) DaemonCore: command socket at <134.174.140.112:63643>
12/16/10 12:14:22 (pid:9225) Setting maximum accepts per cycle 4.
12/16/10 12:14:22 (pid:9225) Initializing a VANILLA shadow for job 11455392.0
12/16/10 12:14:22 (pid:9225) (11455392.0) (9225): Request to run on glidein_28572@xxxxxxxxxxx <10.0.54.5:52537?CCBID=134.174.140.112:9636#186593> was ACCEPTED
12/16/10 12:44:23 (pid:9225) (11455392.0) (9225): Job 11455392.0 is being removed: The job attribute PeriodicRemove expression '( ( JobStatus == 2 ) && ( ( CurrentTime - EnteredCurrentStatus ) > 1800 ) )' evaluated to TRUE
12/16/10 12:44:23 (pid:9225) (11455392.0) (9225): **** condor_shadow (condor_SHADOW) pid 9225 EXITING WITH STATUS 113

Follow-Ups:
- Re: [Condor-users] communications error after periodic_remove
  - From: Matthew Farrellee
