
[Condor-users] communications error after periodic_remove



My job class ad has a 30-minute Periodic_remove in it.
I'm using Condor 7.5.4 on the CM and 7.4.4 on the worker nodes (yes, I know that's not ideal), and I'm using CCB to work at remote sites. I was watching the StarterLog on the node where the job was running, and when the 30-minute timeout arrived, the Starter successfully exited, but then produced some errors that I'm wondering about. Does this mean the shadow didn't stay alive long enough to receive its final bit of information from the Starter?
Some log info is below.
Peter

12/16 12:14:23 (pid:32538) ZKM: successful mapping to frontend
12/16 12:14:23 (pid:32538) File transfer completed successfully.
12/16 12:14:23 (pid:32538) Job 11455392.0 set to execute immediately
12/16 12:14:23 (pid:32538) Starting a VANILLA universe job with ID: 11455392.0
12/16 12:14:23 (pid:32538) IWD: /scratch/condor/execute/dir_26551/glide_q26579/execute/dir_32538
12/16 12:14:23 (pid:32538) Create_Process succeeded, pid=32540
12/16 12:44:23 (pid:32538) Got SIGQUIT.  Performing fast shutdown.
12/16 12:44:23 (pid:32538) ShutdownFast all jobs.
12/16 12:44:23 (pid:32538) Process exited, pid=32540, signal=9
12/16 12:44:23 (pid:32538) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <134.174.140.112:16474>.
12/16 12:44:23 (pid:32538) IO: Failed to read packet header
12/16 12:44:23 (pid:32538) Failed to send job exit status to shadow
12/16 12:44:23 (pid:32538) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
12/16 12:44:23 (pid:32538) Returning from CStarter::JobReaper()
12/16 12:44:47 (pid:32538) Got SIGTERM. Performing graceful shutdown.
12/16 12:44:47 (pid:32538) ShutdownGraceful all jobs.
12/16 12:44:47 (pid:32538) condor_write(): Socket closed when trying to write 316 bytes to <134.174.140.112:16474>, fd is 10
12/16 12:44:47 (pid:32538) Buf::write(): condor_write() failed
12/16 12:44:47 (pid:32538) Failed to send job exit status to shadow
12/16 12:44:47 (pid:32538) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
12/16 12:44:47 (pid:32538) **** condor_starter (condor_STARTER) pid 32538 EXITING WITH STATUS 0
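To check the timing in the StarterLog excerpt above, the timestamps can be compared programmatically. The following is only a rough sketch: the `parse_line` helper and the regular expression are illustrative names, not part of Condor, and the year 2010 is taken from the `12/16/10` stamps in the shadow log below.

```python
import re
from datetime import datetime

# Matches lines such as "12/16 12:44:23 (pid:32538) message".
# The two-digit year ("12/16/10") is optional, as in the shadow log.
LOG_LINE = re.compile(
    r"^(\d{2}/\d{2})(?:/\d{2})? (\d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$"
)

def parse_line(line, year=2010):
    """Return (timestamp, pid, message), or None if the line does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    date, clock, pid, message = m.groups()
    stamp = datetime.strptime(f"{year}/{date} {clock}", "%Y/%m/%d %H:%M:%S")
    return stamp, int(pid), message

# Two lines copied from the StarterLog excerpt.
start = parse_line("12/16 12:14:23 (pid:32538) Create_Process succeeded, pid=32540")
quit_ = parse_line("12/16 12:44:23 (pid:32538) Got SIGQUIT.  Performing fast shutdown.")

# Seconds between the job process starting and the starter receiving SIGQUIT.
elapsed = (quit_[0] - start[0]).total_seconds()
print(elapsed)
```

The same helper accepts the shadow log's `12/16/10` style, so both logs can be placed on one timeline.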


12/16/10 12:14:22 (pid:9225) ******************************************************
12/16/10 12:14:22 (pid:9225) ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/16/10 12:14:22 (pid:9225) ** /storage/app/site/condor-7.5.4/sbin/condor_shadow
12/16/10 12:14:22 (pid:9225) ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
12/16/10 12:14:22 (pid:9225) ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
12/16/10 12:14:22 (pid:9225) ** $CondorVersion: 7.5.4 Oct 18 2010 BuildID: 280908 $
12/16/10 12:14:22 (pid:9225) ** $CondorPlatform: X86_64-LINUX_RHEL5 $
12/16/10 12:14:22 (pid:9225) ** PID = 9225
12/16/10 12:14:22 (pid:9225) ** Log last touched 12/16 12:14:22
12/16/10 12:14:22 (pid:9225) ******************************************************
12/16/10 12:14:22 (pid:9225) Using config source: /storage/app/site/condor/etc/gwms_schedd_config
12/16/10 12:14:22 (pid:9225) Using local config sources:
12/16/10 12:14:22 (pid:9225) /storage/app/site/condor/gwms_schedd/condor_config.local
12/16/10 12:14:22 (pid:9225) DaemonCore: command socket at <134.174.140.112:63643>
12/16/10 12:14:22 (pid:9225) Setting maximum accepts per cycle 4.
12/16/10 12:14:22 (pid:9225) Initializing a VANILLA shadow for job 11455392.0
12/16/10 12:14:22 (pid:9225) (11455392.0) (9225): Request to run on glidein_28572@xxxxxxxxxxx <10.0.54.5:52537?CCBID=134.174.140.112:9636#186593> was ACCEPTED
12/16/10 12:44:23 (pid:9225) (11455392.0) (9225): Job 11455392.0 is being removed: The job attribute PeriodicRemove expression '( ( JobStatus == 2 ) && ( ( CurrentTime - EnteredCurrentStatus ) > 1800 ) )' evaluated to TRUE
12/16/10 12:44:23 (pid:9225) (11455392.0) (9225): **** condor_shadow (condor_SHADOW) pid 9225 EXITING WITH STATUS 113
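
For reference, the PeriodicRemove expression quoted in the shadow log can be restated as a small Python sketch. This is not how Condor evaluates ClassAds; the function and constant names are hypothetical, and the only values taken from the log are the JobStatus test against 2 (the running state) and the 1800-second limit.

```python
RUNNING = 2           # JobStatus value tested in the logged expression
LIMIT_SECONDS = 1800  # 30 minutes, as in the logged expression

def periodic_remove(job_status, current_time, entered_current_status):
    """Mirror of: (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 1800)."""
    return (
        job_status == RUNNING
        and (current_time - entered_current_status) > LIMIT_SECONDS
    )

# Times are in seconds; the comparison is strict, so exactly 1800 does not trigger.
print(periodic_remove(RUNNING, 1800, 0))
print(periodic_remove(RUNNING, 1801, 0))
```

Because the comparison is strictly greater than 1800, the expression first becomes true in the 1801st second after EnteredCurrentStatus. That would fit the removal being logged at 12:44:23 if the job entered the running state around 12:14:22, when the shadow started, though the logs do not show EnteredCurrentStatus directly.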