Re: [Condor-users] Debugging job restart issues

Ian Chesal wrote:
I'm trying to figure out why jobs are being dropped from starters and
restarted elsewhere on one of my farms. The scenario is as follows:

A job assigned to a particular Linux execution node will rather
suddenly return to idle and get rescheduled on another machine in the
system. Since we're not using preemption in our system, this is a bit
of a strange occurrence.

If I look at the StarterLog for a machine where this happens, for the
job in question, I see:

4/27 16:12:08 ******************************************************
4/27 16:12:08 ** condor_starter (CONDOR_STARTER) STARTING UP
4/27 16:12:08 ** /tools/condor/6.8.6/linux32/sbin/condor_starter
4/27 16:12:08 ** $CondorVersion: 6.8.6 Sep 13 2007 $
4/27 16:12:08 ** $CondorPlatform: I386-LINUX_RHEL3 $
4/27 16:12:08 ** PID = 29485
4/27 16:12:08 ** Log last touched 4/27 16:12:04
4/27 16:12:08 ******************************************************
4/27 16:12:08 Using config source: /etc/condor/condor_config
4/27 16:12:08 Using local config sources:
4/27 16:12:08    /etc/condor/condor_config.basic
4/27 16:12:08    /etc/condor/os/condor_config.LINUX
4/27 16:12:08    /etc/condor/site/condor_config.SJ
4/27 16:12:08    /etc/condor/machine/condor_config.sj-bs3000-329
4/27 16:12:08    /etc/condor/machine/condor_config.sj-bs3000-329.LINUX
4/27 16:12:08    /etc/condor/patch/condor_config.sj-bs3000-329
4/27 16:12:08    /etc/condor/patch/condor_config.sj-bs3000-329.LINUX
4/27 16:12:08 DaemonCore: Command Socket at <137.57.203.33:44638>
4/27 16:12:08 Done setting resource limits
4/27 16:12:08 Communicating with shadow <137.57.202.107:46960>
4/27 16:12:08 Submitting machine is "sj-schedd1.altera.com"
4/27 16:12:08 File transfer completed successfully.
4/27 16:12:09 Starting a VANILLA universe job with ID: 90268.0
4/27 16:12:09 IWD: /build/condor/execute/dir_29485
4/27 16:12:09 Output file:
/build/condor/execute/dir_29485/_condor_stdout
4/27 16:12:09 Error file: /build/condor/execute/dir_29485/_condor_stderr
4/27 16:12:09 Renice expr "((False =?= True) * 10)" evaluated to 0
4/27 16:12:09 About to exec
/build/condor/execute/dir_29485/condor_exec.exe 2648694
4/27 16:12:09 Create_Process succeeded, pid=29490
4/28 17:35:17 Got SIGTERM. Performing graceful shutdown.
4/28 17:35:17 ShutdownGraceful all jobs.
4/28 17:35:17 Process exited, pid=29490, signal=15
4/28 17:35:17 condor_write(): Socket closed when trying to write 300
bytes to <137.57.202.107:46962>, fd is 16
4/28 17:35:17 Buf::write(): condor_write() failed
4/28 17:35:17 Failed to send job exit status to shadow
4/28 17:35:17 JobExit() failed, waiting for job lease to expire or for a
reconnect attempt
4/28 17:35:17 Last process exited, now Starter is exiting
4/28 17:35:17 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

At 17:35:16 the job was actually doing useful work. What I'm confused
about are the entries that follow. Regarding the line:

4/28 17:35:17 Got SIGTERM. Performing graceful shutdown.

Was that signal sent by the condor_startd?

Yes.

Why? The corresponding StartLog says:

4/28 17:35:17 DaemonCore: Command received via UDP from host
<137.57.202.107:40834>
4/28 17:35:17 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
4/28 17:35:17 vm2: State change: received RELEASE_CLAIM command
4/28 17:35:17 vm2: Changing state and activity: Claimed/Retiring ->
Preempting/Vacating
4/28 17:35:17 DaemonCore: Command received via UDP from host
<137.57.202.107:40834>
4/28 17:35:17 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
4/28 17:35:17 vm2: Got RELEASE_CLAIM while in Preempting state,
ignoring.
4/28 17:35:17 Starter pid 29485 exited with status 0
4/28 17:35:17 vm2: State change: starter exited
4/28 17:35:17 vm2: State change: No preempting claim, returning to owner
4/28 17:35:17 vm2: Changing state and activity: Preempting/Vacating ->
Owner/Idle
4/28 17:35:17 vm2: State change: IS_OWNER is false
4/28 17:35:17 vm2: Changing state: Owner -> Unclaimed

Is command_release_claim responsible for that signal being sent to the
job?
Yes.

Is that only generated by the negotiator? The shadow?

The RELEASE_CLAIM command is only ever sent by the schedd. It could be triggered by the shadow exiting due to some exception.
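
For what it's worth, one way to catch the trigger next time is to
turn up the debug verbosity on the submit machine, so the schedd and
shadows log why the claim was released. A minimal sketch, assuming
the stock debug macros (D_COMMAND is chatty, so it's best left on
only while chasing this):

# in condor_config on the submit machine
# log every command the schedd sends and receives, with full detail
SCHEDD_DEBUG = D_FULLDEBUG D_COMMAND
# full detail from each condor_shadow, including any exceptions
SHADOW_DEBUG = D_FULLDEBUG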

I'm trying to figure out why this machine was told to release its
claim, because that claim should have lasted for two weeks if the job
wanted it. 137.57.202.107 is the machine where the condor_shadow for
the corresponding job was running. Was it the shadow that sent the
command? Why?

Also, there's an I/O failure between the condor_starter and its
shadow. In all cases where I see this random vacate occurring, it
happens in conjunction with an I/O failure trying to write data back
to the submitting location via the shadow.
This seems like an important clue.  What does the error message look like?

Is this a specific condor_starter <-> condor_shadow communication issue
or a more general execute node <-> schedd node network issue?

I think it could be either one.
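
If it does turn out to be a transient network outage, a longer job
lease would give the shadow more time to reconnect to the starter
before the claim is torn down (the starter above was already
"waiting for job lease to expire or for a reconnect attempt"). A
minimal sketch, assuming the 6.8-era job_lease_duration submit
command (the value is in seconds; pick something longer than your
worst expected outage):

# in the job's submit description file
job_lease_duration = 1200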
I wish I had ShadowLog data to share, but that log file rolled over
too quickly. I've increased the size of the ShadowLog, so hopefully
I'll catch it all next time it happens.
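
For reference, the knob that controls when the ShadowLog rolls over
is MAX_SHADOW_LOG. A minimal sketch, assuming the stock macro name
and a value in bytes:

# in condor_config on the submit machine: let the ShadowLog grow to
# roughly 100 MB before it rotates
MAX_SHADOW_LOG = 100000000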

Having the ShadowLog and SchedLog from the time of the incident should help. I hope it happens again soon ... or never again!

--Dan