
[Condor-users] How do I debug a job being mysteriously evicted because startd didn't think the right job was running on the starter?



I'm unable to determine the exact reason why this job was evicted. On
the surface the startd thinks the running job is not the job that
should be running, but I'm uncertain how that decision was made and how
I can prevent this problem from occurring. My log file for the job
shows:


000 (007.000.000) 02/16 17:04:06 Job submitted from host:
<137.57.142.8:3060>
...
001 (007.000.000) 02/16 17:16:20 Job executing on host:
<137.57.176.207:1948>
...
004 (007.000.000) 02/16 17:23:19 Job was evicted.
	(0) Job was not checkpointed.
		Usr 0 00:00:37, Sys 0 00:01:16  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	0  -  Run Bytes Sent By Job
	3519  -  Run Bytes Received By Job
...
009 (007.000.000) 02/16 17:33:10 Job was aborted by the user.
	via condor_rm (by user ichesal)
...

The job was running fine on the remote host. At 17:23:19 the job was
evicted from the machine. The StarterLog.vm2 file for this job reads:


2/16 17:16:18 ******************************************************
2/16 17:16:18 ** condor_starter (CONDOR_STARTER) STARTING UP
2/16 17:16:18 ** d:\abc\condor\bin\condor_starter.exe
2/16 17:16:18 ** $CondorVersion: 6.7.3 Dec 28 2004 $
2/16 17:16:18 ** $CondorPlatform: INTEL-WINNT40 $
2/16 17:16:18 ** PID = 1448
2/16 17:16:18 ******************************************************
2/16 17:16:18 Using config file: d:\abc\condor\condor_config
2/16 17:16:18 Using local config files:
d:\abc\condor\local.TTC-BS3066-207\condor_config.local
2/16 17:16:18 DaemonCore: Command Socket at <137.57.176.207:1805>
2/16 17:16:18 Setting resource limits not implemented!
2/16 17:16:18 Communicating with shadow <137.57.142.8:3716>
2/16 17:16:18 Submitting machine is "TTC-JCHOI2.altera.com"
2/16 17:16:19 File transfer completed successfully.
2/16 17:16:20 Starting a VANILLA universe job with ID: 7.0
2/16 17:16:20 IWD: d:\abc\condor/execute\dir_1448
2/16 17:16:20 Output file: d:\abc\condor/execute\dir_1448\wrapper.log
2/16 17:16:20 Error file: d:\abc\condor/execute\dir_1448\wrapper.err
2/16 17:16:20 Renice expr "10" evaluated to 10
2/16 17:16:20 About to exec C:\WINDOWS\system32\cmd.exe /Q /C
condor_exec.bat
/experiments/ichesal/condor/TPT/icd/icd_script%20%3D%20tsunami_intracell
_delay_xml_writer.tcl/acs_generator
2/16 17:16:20 Create_Process succeeded, pid=1580
2/16 17:23:18 Got SIGTERM. Performing graceful shutdown.
2/16 17:23:18 ShutdownGraceful all jobs.
2/16 17:23:19 Process exited, pid=1580, status=-1073741510
2/16 17:23:19 Last process exited, now Starter is exiting
2/16 17:23:19 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

Something told this starter to shut down gracefully, but there are no
shutdown messages in the MasterLog or StartLog for this machine.
Looking around the same time in the StartLog, I see the following trail
for this job:


2/16 17:16:13 DaemonCore: Command received via UDP from host
<137.57.176.9:49720>
2/16 17:16:13 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
2/16 17:16:13 vm2: match_info called
2/16 17:16:13 vm2: Received match <137.57.176.207:1948>#1108575471#46
2/16 17:16:13 vm2: State change: match notification protocol successful
2/16 17:16:13 vm2: Changing state: Unclaimed -> Matched
2/16 17:16:13 DaemonCore: Command received via TCP from host
<137.57.142.8:3715>
2/16 17:16:13 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler (command_request_claim)
2/16 17:16:13 vm2: Request accepted.
2/16 17:16:13 vm2: Remote owner is ichesal@xxxxxxxxxx
2/16 17:16:13 vm2: State change: claiming protocol successful
2/16 17:16:13 vm2: Changing state: Matched -> Claimed
2/16 17:16:15 DaemonCore: Command received via TCP from host
<137.57.176.207:1804>
2/16 17:16:15 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling
handler (command_query_ads)
2/16 17:16:15 In command_query_ads
2/16 17:16:18 DaemonCore: Command received via TCP from host
<137.57.142.8:3722>
2/16 17:16:18 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
2/16 17:16:18 vm2: Got activate_claim request from shadow
(<137.57.142.8:3722>)
2/16 17:16:18 vm2: Remote job ID is 7.0
2/16 17:16:18 vm2: Got universe "VANILLA" (5) from request classad
2/16 17:16:18 vm2: State change: claim-activation protocol successful
2/16 17:16:18 vm2: Changing activity: Idle -> Busy

This is where the job started. Everything looks fine. Then there are a
bunch of startd queries for status on this machine:

2/16 17:16:45 DaemonCore: Command received via TCP from host
<137.57.176.207:1848>
2/16 17:16:45 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling
handler (command_query_ads)
2/16 17:16:45 In command_query_ads
2/16 17:17:15 DaemonCore: Command received via TCP from host
<137.57.176.207:1862>
2/16 17:17:15 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling
handler (command_query_ads)
2/16 17:17:15 In command_query_ads
--- cut some repetitive stuff out here ---
2/16 17:23:15 DaemonCore: Command received via TCP from host
<137.57.176.207:1920>
2/16 17:23:15 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling
handler (command_query_ads)
2/16 17:23:15 In command_query_ads
2/16 17:23:18 DaemonCore: Command received via UDP from host
<137.57.176.9:49817>
2/16 17:23:18 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
2/16 17:23:18 vm1: match_info called
2/16 17:23:18 vm1: Received match <137.57.176.207:1948>#1108575471#47
2/16 17:23:18 vm1: State change: match notification protocol successful
2/16 17:23:18 vm1: Changing state: Unclaimed -> Matched
2/16 17:23:18 DaemonCore: Command received via UDP from host
<137.57.176.9:49817>
2/16 17:23:18 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
2/16 17:23:18 vm2: match_info called
2/16 17:23:18 DaemonCore: Command received via TCP from host
<137.57.142.159:3137>
2/16 17:23:18 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler (command_request_claim)
2/16 17:23:18 vm1: Request accepted.
2/16 17:23:18 vm1: Remote owner is pguzy@xxxxxxxxxx
2/16 17:23:18 vm1: State change: claiming protocol successful
2/16 17:23:18 vm1: Changing state: Matched -> Claimed

And now a new job is being assigned to vm1.

2/16 17:23:18 DaemonCore: Command received via TCP from host
<137.57.142.159:3138>
2/16 17:23:18 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler (command_request_claim)
2/16 17:23:18 vm2: ClaimId from schedd
(<137.57.176.207:1948>#1108575471#46) doesn't match
(<137.57.176.207:1948>#1108575471#48)

And here is where I think the trouble begins. Something has happened so
that the vm2 instance no longer thinks it's running the right job for
the schedd. What could have occurred to cause this?
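
I assume the first step is to pull every line that mentions either
claim ID out of the daemon logs on both sides and line them up.
Something like this on the execute node is what I've been trying (the
log location is whatever condor_config_val reports, so treat these
commands as a sketch):

REM Ask Condor where this node writes its daemon logs.
condor_config_val LOG

REM From that directory, pull every line mentioning either claim ID
REM out of the StartLog (IDs taken from the excerpt above).
findstr "1108575471#46 1108575471#48" StartLog

REM The same search against the SchedLog on the submit machine should
REM show which claim the schedd thought it was holding for job 7.0.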

2/16 17:23:18 DaemonCore: Command received via UDP from host
<137.57.142.159:3139>
2/16 17:23:18 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
2/16 17:23:18 vm2: State change: received RELEASE_CLAIM command
2/16 17:23:18 vm2: Changing state and activity: Claimed/Busy ->
Preempting/Vacating
2/16 17:23:19 DaemonCore: Command received via TCP from host
<137.57.142.8:3836>
2/16 17:23:19 DaemonCore: received command 404
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
2/16 17:23:19 vm2: Got KILL_FRGN_JOB while in Preempting state,
ignoring.
2/16 17:23:19 DaemonCore: Command received via UDP from host
<137.57.142.8:3839>
2/16 17:23:19 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
2/16 17:23:19 vm2: Got RELEASE_CLAIM while in Preempting state,
ignoring.
2/16 17:23:19 DaemonCore: Command received via UDP from host
<137.57.142.8:3840>
2/16 17:23:19 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_release_claim)
2/16 17:23:19 vm2: Got RELEASE_CLAIM while in Preempting state,
ignoring.
2/16 17:23:19 DaemonCore: Command received via UDP from host
<137.57.176.207:1923>
2/16 17:23:19 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
2/16 17:23:19 Starter pid 1448 exited with status 0
2/16 17:24:50 vm2: State change: starter exited
2/16 17:24:50 vm2: State change: No preempting claim, returning to owner
2/16 17:24:50 vm2: Changing state and activity: Preempting/Vacating ->
Owner/Idle
2/16 17:24:50 vm2: State change: IS_OWNER is false
2/16 17:24:50 vm2: Changing state: Owner -> Unclaimed

And I'm certain this is the origin of the graceful shutdown request I
saw in the VM2 log file. The startd asked the starter to drop the job
because it didn't match what the schedd thought it was running.
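
In the meantime I'm planning to turn the debug levels up on both sides
so the whole claim exchange gets logged the next time this happens.
Something like this in the local config files is what I have in mind
(I believe these are the right knobs, but corrections are welcome):

# Execute node (d:\abc\condor\local.TTC-BS3066-207\condor_config.local):
# log every command the startd receives, plus full debug detail, so the
# MATCH_INFO / REQUEST_CLAIM / RELEASE_CLAIM traffic is all visible.
STARTD_DEBUG = D_FULLDEBUG D_COMMAND
MAX_STARTD_LOG = 64000000

# Submit machine: the schedd/shadow side of the same claim protocol.
SCHEDD_DEBUG = D_FULLDEBUG D_COMMAND
SHADOW_DEBUG = D_FULLDEBUG

Then a condor_reconfig on each machine should pick the changes up
without restarting the daemons.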

How is it that the schedd and the startd end up thinking they're
running different jobs like this? I'm seeing this problem with
troubling frequency among my users' jobs: spontaneous eviction of jobs
from machines because the startd thinks the job that is running and
the job that should be running aren't the same job. Where is the
startd getting those claim IDs from? How can I figure out why the
state of the schedd and the state of the startd don't match?
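
And while a job is still running, I assume the right way to compare the
two views is something like the following (the slot name is from my
pool, and the attribute names are my best guess for 6.7.3, so this is
only a sketch):

REM Submit machine: the schedd's idea of where job 7.0 is running
REM (RemoteHost in the job ad).
condor_q -l 7.0

REM Execute node: the startd's idea of what vm2 is running
REM (JobId, RemoteUser and ClientMachine in vm2's machine ad).
condor_status -l vm2@TTC-BS3066-207

Is that the right pair of things to be comparing, or is there a better
way to see the ClaimId each side is holding?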

Thanks!

- Ian


--
Ian R. Chesal <ichesal@xxxxxxxxxx>
Senior Software Engineer

Altera Corporation
Toronto Technology Center
Tel: (416) 926-8300