
Re: [Condor-users] communications error after periodic_remove



That's what it looks like. The starter doesn't know that it is being killed off because the shadow gave up on the job. So the starter is likely going through its normal shutdown process while the shadow is already gone.

You should be able to safely ignore these errors.

Best,


matt

On 12/16/2010 01:21 PM, Peter Doherty wrote:
My job ClassAd has a 30-minute periodic_remove in it.
I'm using Condor 7.5.4 on the CM and 7.4.4 on the worker nodes (yes, I
know that's not ideal), and I'm using CCB to work at remote sites.
I was watching the StarterLog on the node where the job was running, and
when the 30 minute timeout arrived, the Starter successfully exited, but
then produced some errors that I'm wondering about.
Does this mean the shadow didn't stay alive long enough to receive its
final bit of information from the Starter?
Some log info is below.
Peter

12/16 12:14:23 (pid:32538) ZKM: successful mapping to frontend
12/16 12:14:23 (pid:32538) File transfer completed successfully.
12/16 12:14:23 (pid:32538) Job 11455392.0 set to execute immediately
12/16 12:14:23 (pid:32538) Starting a VANILLA universe job with ID:
11455392.0
12/16 12:14:23 (pid:32538) IWD:
/scratch/condor/execute/dir_26551/glide_q26579/execute/dir_32538
12/16 12:14:23 (pid:32538) Create_Process succeeded, pid=32540
12/16 12:44:23 (pid:32538) Got SIGQUIT. Performing fast shutdown.
12/16 12:44:23 (pid:32538) ShutdownFast all jobs.
12/16 12:44:23 (pid:32538) Process exited, pid=32540, signal=9
12/16 12:44:23 (pid:32538) condor_read() failed: recv() returned -1,
errno = 104 Connection reset by peer, reading 21 bytes from
<134.174.140.112:16474>.
12/16 12:44:23 (pid:32538) IO: Failed to read packet header
12/16 12:44:23 (pid:32538) Failed to send job exit status to shadow
12/16 12:44:23 (pid:32538) JobExit() failed, waiting for job lease to
expire or for a reconnect attempt
12/16 12:44:23 (pid:32538) Returning from CStarter::JobReaper()
12/16 12:44:47 (pid:32538) Got SIGTERM. Performing graceful shutdown.
12/16 12:44:47 (pid:32538) ShutdownGraceful all jobs.
12/16 12:44:47 (pid:32538) condor_write(): Socket closed when trying to
write 316 bytes to <134.174.140.112:16474>, fd is 10
12/16 12:44:47 (pid:32538) Buf::write(): condor_write() failed
12/16 12:44:47 (pid:32538) Failed to send job exit status to shadow
12/16 12:44:47 (pid:32538) JobExit() failed, waiting for job lease to
expire or for a reconnect attempt
12/16 12:44:47 (pid:32538) **** condor_starter (condor_STARTER) pid
32538 EXITING WITH STATUS 0
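
For reference, the timing in the StarterLog above can be checked with a few lines of Python. This is only a sketch: the helper name elapsed_seconds is made up for illustration, and the year 2010 is taken from the date of this message because the StarterLog timestamps do not carry one.

```python
from datetime import datetime

def elapsed_seconds(start: str, end: str, year: int = 2010) -> int:
    """Seconds between two StarterLog-style 'MM/DD HH:MM:SS' timestamps.

    The log lines carry no year, so one is supplied (assumed: 2010).
    """
    fmt = "%Y/%m/%d %H:%M:%S"
    t0 = datetime.strptime(f"{year}/{start}", fmt)
    t1 = datetime.strptime(f"{year}/{end}", fmt)
    return int((t1 - t0).total_seconds())

# Create_Process succeeded -> Got SIGQUIT: the 30-minute window
print(elapsed_seconds("12/16 12:14:23", "12/16 12:44:23"))  # 1800

# Got SIGQUIT -> Got SIGTERM: the gap between the two failed JobExit() attempts
print(elapsed_seconds("12/16 12:44:23", "12/16 12:44:47"))  # 24
```

The first value matches the 1800-second threshold in the PeriodicRemove expression; the second shows that the starter kept running for 24 seconds after its first failed attempt to report the exit status.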


12/16/10 12:14:22 (pid:9225)
******************************************************
12/16/10 12:14:22 (pid:9225) ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/16/10 12:14:22 (pid:9225) **
/storage/app/site/condor-7.5.4/sbin/condor_shadow
12/16/10 12:14:22 (pid:9225) ** SubsystemInfo: name=SHADOW
type=SHADOW(6) class=DAEMON(1)
12/16/10 12:14:22 (pid:9225) ** Configuration: subsystem:SHADOW
local:<NONE> class:DAEMON
12/16/10 12:14:22 (pid:9225) ** $CondorVersion: 7.5.4 Oct 18 2010
BuildID: 280908 $
12/16/10 12:14:22 (pid:9225) ** $CondorPlatform: X86_64-LINUX_RHEL5 $
12/16/10 12:14:22 (pid:9225) ** PID = 9225
12/16/10 12:14:22 (pid:9225) ** Log last touched 12/16 12:14:22
12/16/10 12:14:22 (pid:9225)
******************************************************
12/16/10 12:14:22 (pid:9225) Using config source:
/storage/app/site/condor/etc/gwms_schedd_config
12/16/10 12:14:22 (pid:9225) Using local config sources:
12/16/10 12:14:22 (pid:9225)
/storage/app/site/condor/gwms_schedd/condor_config.local
12/16/10 12:14:22 (pid:9225) DaemonCore: command socket at
<134.174.140.112:63643>
12/16/10 12:14:22 (pid:9225) Setting maximum accepts per cycle 4.
12/16/10 12:14:22 (pid:9225) Initializing a VANILLA shadow for job
11455392.0
12/16/10 12:14:22 (pid:9225) (11455392.0) (9225): Request to run on
glidein_28572@xxxxxxxxxxx
<10.0.54.5:52537?CCBID=134.174.140.112:9636#186593> was ACCEPTED
12/16/10 12:44:23 (pid:9225) (11455392.0) (9225): Job 11455392.0 is
being removed: The job attribute PeriodicRemove expression '( (
JobStatus == 2 ) && ( ( CurrentTime - EnteredCurrentStatus ) > 1800 ) )'
evaluated to TRUE
12/16/10 12:44:23 (pid:9225) (11455392.0) (9225): **** condor_shadow
(condor_SHADOW) pid 9225 EXITING WITH STATUS 113
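
The PeriodicRemove expression quoted in the ShadowLog above can be restated as a small Python predicate. This is a rough sketch of the logic only, not how Condor evaluates ClassAd expressions; the function name periodic_remove_fires and the sample epoch values are hypothetical.

```python
RUNNING = 2  # JobStatus value 2 means the job is running

def periodic_remove_fires(job_status: int,
                          entered_current_status: int,
                          current_time: int,
                          limit: int = 1800) -> bool:
    """Mirror of:
    ( JobStatus == 2 ) && ( ( CurrentTime - EnteredCurrentStatus ) > 1800 )
    Times are seconds since the epoch.
    """
    return job_status == RUNNING and (current_time - entered_current_status) > limit

start = 1_292_500_000  # hypothetical EnteredCurrentStatus value

print(periodic_remove_fires(RUNNING, start, start + 1800))  # False: not yet past the limit
print(periodic_remove_fires(RUNNING, start, start + 1801))  # True: the job would be removed
print(periodic_remove_fires(1, start, start + 5000))        # False: job is idle, not running
```

Note the strict comparison: the expression only becomes TRUE once more than 1800 seconds have passed in the running state.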