
[Condor-users] Perpetual "Claimed Idle" state



Hi,

I'm submitting a vanilla job from a client to a nominated execute host, and it seems that the Shadow loses contact with the Starter, causing the job to be resubmitted from the client node even though the original run carries on on the execute node (sometimes to completion, since all the output files get returned). Afterwards, however, the execute node remains in a "Claimed Idle" state. Here are the relevant parts of the logs; there's a lot of detail to follow, so if this doesn't interest you then feel free to stop reading here.
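For reference, the submit description is roughly along these lines; the executable name, log file name and hostname below are placeholders rather than a copy of the real file (the arguments and the out/err names are taken from the StarterLog further down):

universe                = vanilla
# placeholder name; the binary gets transferred and run as condor_exec.exe
executable              = nvt_code
arguments               = NVT_PARAM CONFIG
output                  = out
error                   = err
# placeholder log file name
log                     = nvt.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# nominate the execute host; this hostname is illustrative
requirements            = (Machine == "node1.grid.private.cam.ac.uk")
queue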

This is the ShadowLog, even after the job "finishes":


9/29 07:48:46 ******************************************************
9/29 07:48:46 ** condor_shadow (CONDOR_SHADOW) STARTING UP
9/29 07:48:46 ** /pond/home/condor/RH9.0/condor-6.6.6/sbin/condor_shadow
9/29 07:48:46 ** $CondorVersion: 6.6.6 Jul 26 2004 $
9/29 07:48:46 ** $CondorPlatform: I386-LINUX_RH9 $
9/29 07:48:46 ** PID = 25976
9/29 07:48:46 ******************************************************
9/29 07:48:46 Using config file: /home/condor/condor_config
9/29 07:48:46 Using local config files: /home/condor/condor_config.local
9/29 07:48:46 DaemonCore: Command Socket at <172.24.116.193:50040>
9/29 07:48:47 Initializing a VANILLA shadow
9/29 07:48:47 (368.0) (25976): Request to run on <172.24.116.33:9665> was ACCEPTED
9/29 08:23:35 (368.0) (25976): ERROR "Can no longer talk to condor_starter on execute machine (172.24.116.33)" at line 63 in file NTreceivers.C
9/29 08:23:35 ******************************************************
9/29 08:23:35 ** condor_shadow (CONDOR_SHADOW) STARTING UP
9/29 08:23:35 ** /pond/home/condor/RH9.0/condor-6.6.6/sbin/condor_shadow
9/29 08:23:35 ** $CondorVersion: 6.6.6 Jul 26 2004 $
9/29 08:23:35 ** $CondorPlatform: I386-LINUX_RH9 $
9/29 08:23:35 ** PID = 26024
9/29 08:23:35 ******************************************************
9/29 08:23:35 Using config file: /home/condor/condor_config
9/29 08:23:35 Using local config files: /home/condor/condor_config.local
9/29 08:23:35 DaemonCore: Command Socket at <172.24.116.193:50120>
9/29 08:23:36 Initializing a VANILLA shadow
9/29 08:23:36 (368.0) (26024): Request to run on <172.24.116.33:9665> was ACCEPTED



... and the corresponding SchedLog:



9/29 07:47:00 DaemonCore: Command received via UDP from host <172.24.116.193:36443>
9/29 07:47:00 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
9/29 07:47:00 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:47:00 Called reschedule_negotiator()
9/29 07:47:00 DaemonCore: Command received via TCP from host <172.24.116.193:50035>
9/29 07:47:00 DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
9/29 07:47:00 Negotiating for owner: mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:47:00 Checking consistency running and runnable jobs
9/29 07:47:00 Tables are consistent
9/29 07:47:00 Out of servers - 0 jobs matched, 1 jobs idle, 1 jobs rejected
9/29 07:47:00 Increasing flock level for mcal00@xxxxxxxxxxxxxxxxxxxxx to 1.
9/29 07:47:05 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:47:49 DaemonCore: Command received via TCP from host <172.24.116.1:9616>
9/29 07:47:49 DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
9/29 07:47:54 Negotiating for owner: mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:47:54 Checking consistency running and runnable jobs
9/29 07:47:54 Tables are consistent
9/29 07:47:54 Out of servers - 0 jobs matched, 1 jobs idle, 1 jobs rejected
9/29 07:47:54 Increasing flock level for mcal00@xxxxxxxxxxxxxxxxxxxxx to 2.
9/29 07:47:54 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:48:33 DaemonCore: Command received via TCP from host <172.24.116.20:9629>
9/29 07:48:33 DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
9/29 07:48:43 Negotiating for owner: mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:48:43 Checking consistency running and runnable jobs
9/29 07:48:43 Tables are consistent
9/29 07:48:44 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 2
9/29 07:48:44 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:48:46 Started shadow for job 368.0 on "<172.24.116.33:9665>", (shadow pid = 25976)
9/29 07:48:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:53:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:58:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:03:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:08:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:13:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:18:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:23:35 ERROR: Shadow exited with job exception code!
9/29 08:23:35 Started shadow for job 368.0 on "<172.24.116.33:9665>", (shadow pid = 26024)
9/29 08:23:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:28:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:33:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:38:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:43:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:48:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:53:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:58:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:03:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:08:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:13:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:18:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:23:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:28:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:33:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:38:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx



So note that we lost contact with the Starter at some stage. Now for the logs on the execute node; here's the StartLog first:



9/29 07:49:30 DaemonCore: Command received via UDP from host <172.24.116.20:9615>
9/29 07:49:30 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
9/29 07:49:30 vm2: match_info called
9/29 07:49:30 vm2: Received match <172.24.116.33:9665>#4009692188
9/29 07:49:30 vm2: State change: match notification protocol successful
9/29 07:49:30 vm2: Changing state: Unclaimed -> Matched
9/29 07:49:30 DaemonCore: Command received via TCP from host <172.24.116.193:50039>
9/29 07:49:30 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
9/29 07:49:30 vm2: Request accepted.
9/29 07:49:30 vm2: Remote owner is mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:49:30 vm2: State change: claiming protocol successful
9/29 07:49:30 vm2: Changing state: Matched -> Claimed
9/29 07:49:33 DaemonCore: Command received via TCP from host <172.24.116.193:50042>
9/29 07:49:33 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
9/29 07:49:33 vm2: Got activate_claim request from shadow (<172.24.116.193:50042>)
9/29 07:49:33 vm2: Remote job ID is 368.0
9/29 07:49:33 vm2: Got universe "VANILLA" (5) from request classad
9/29 07:49:33 vm2: State change: claim-activation protocol successful
9/29 07:49:33 vm2: Changing activity: Idle -> Busy
9/29 08:24:21 Starter pid 23353 died on signal 11 (signal 11)
9/29 08:24:21 vm2: State change: starter exited
9/29 08:24:21 vm2: Changing activity: Busy -> Idle
9/29 08:24:22 DaemonCore: Command received via TCP from host <172.24.116.193:50122>
9/29 08:24:22 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
9/29 08:24:22 vm2: Got activate_claim request from shadow (<172.24.116.193:50122>)
9/29 08:24:22 vm2: Remote job ID is 368.0
9/29 08:24:22 vm2: Got universe "VANILLA" (5) from request classad
9/29 08:24:22 vm2: State change: claim-activation protocol successful
9/29 08:24:22 vm2: Changing activity: Idle -> Busy
9/29 09:33:21 Starter pid 23430 exited with status 4
9/29 09:33:21 vm2: State change: starter exited
9/29 09:33:21 vm2: Changing activity: Busy -> Idle
9/29 10:12:28 State change: RunBenchmarks is TRUE
9/29 10:12:28 vm1: Changing activity: Idle -> Benchmarking
9/29 10:12:32 State change: benchmarks completed
9/29 10:12:32 vm1: Changing activity: Benchmarking -> Idle



Note the signal 11: does this imply that the user application received it, or the Starter process itself? (A sketch of the extra diagnostics that could be switched on to pin this down follows, and then the StarterLog.)
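In case it's useful, switching on core files and fuller logging on the execute node ought to show whether it's the Starter itself that's segfaulting; I believe these are the relevant knobs for 6.6.6, but corrections welcome:

# in the execute node's condor_config.local
# let the Condor daemons (the Starter included) dump core if they crash
CREATE_CORE_FILES = True
# more verbose Starter and Startd logging
STARTER_DEBUG = D_FULLDEBUG
STARTD_DEBUG  = D_FULLDEBUG

Now for the StarterLog: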


9/29 07:49:33 ******************************************************
9/29 07:49:33 ** condor_starter (CONDOR_STARTER) STARTING UP
9/29 07:49:33 ** /Condor/RH7.2/condor-6.6.6/sbin/condor_starter
9/29 07:49:33 ** $CondorVersion: 6.6.6 Jul 26 2004 $
9/29 07:49:33 ** $CondorPlatform: I386-LINUX_RH72 $
9/29 07:49:33 ** PID = 23353
9/29 07:49:33 ******************************************************
9/29 07:49:33 Using config file: /home/condor/condor_config
9/29 07:49:33 Using local config files: /home/condor/hosts/node1/condor_config.local
9/29 07:49:33 DaemonCore: Command Socket at <172.24.116.33:9690>
9/29 07:49:33 Done setting resource limits
9/29 07:49:33 Starter communicating with condor_shadow <172.24.116.193:50040>
9/29 07:49:33 Submitting machine is "cartman--esc.grid.private.cam.ac.uk"
9/29 07:49:33 VM2_USER set, so running job as condor_user2
9/29 07:49:33 File transfer completed successfully.
9/29 07:49:34 Starting a VANILLA universe job with ID: 368.0
9/29 07:49:34 IWD: /Condor_local/execute/dir_23353
9/29 07:49:34 Output file: /Condor_local/execute/dir_23353/out
9/29 07:49:34 Error file: /Condor_local/execute/dir_23353/err
9/29 07:49:34 About to exec /Condor_local/execute/dir_23353/condor_exec.exe NVT_PARAM CONFIG
9/29 07:49:34 Create_Process succeeded, pid=23355
9/29 08:24:22 ******************************************************
9/29 08:24:22 ** condor_starter (CONDOR_STARTER) STARTING UP
9/29 08:24:22 ** /Condor/RH7.2/condor-6.6.6/sbin/condor_starter
9/29 08:24:22 ** $CondorVersion: 6.6.6 Jul 26 2004 $
9/29 08:24:22 ** $CondorPlatform: I386-LINUX_RH72 $
9/29 08:24:22 ** PID = 23430
9/29 08:24:22 ******************************************************
9/29 08:24:22 Using config file: /home/condor/condor_config
9/29 08:24:22 Using local config files: /home/condor/hosts/node1/condor_config.local
9/29 08:24:22 DaemonCore: Command Socket at <172.24.116.33:9651>
9/29 08:24:22 Done setting resource limits
9/29 08:24:22 Starter communicating with condor_shadow <172.24.116.193:50120>
9/29 08:24:22 Submitting machine is "cartman--esc.grid.private.cam.ac.uk"
9/29 08:24:22 VM2_USER set, so running job as condor_user2
9/29 08:24:22 File transfer completed successfully.
9/29 08:24:23 Starting a VANILLA universe job with ID: 368.0
9/29 08:24:23 IWD: /Condor_local/execute/dir_23430
9/29 08:24:23 Output file: /Condor_local/execute/dir_23430/out
9/29 08:24:23 Error file: /Condor_local/execute/dir_23430/err
9/29 08:24:23 About to exec /Condor_local/execute/dir_23430/condor_exec.exe NVT_PARAM CONFIG
9/29 08:24:23 Create_Process succeeded, pid=23432
9/29 09:28:21 Process exited, pid=23432, status=0
9/29 09:33:21 condor_read(): timeout reading buffer.
9/29 09:33:21 ERROR "Assertion ERROR on (result)" at line 270 in file NTsenders.C
9/29 09:33:21 ShutdownFast all jobs.


And that's it. The current situation is that the execute node is in a "Claimed Idle" state but not actually running any Condor job, while the client machine thinks it still has a process running on the execute node (according to condor_q). I repeat that the job did actually run to completion (probably on the 2nd submit, as job 368.0), and that the application runs fine outside of Condor on that execute host when launched from the command line.
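For completeness, this is roughly how I'm inspecting the stuck claim from the command line (the hostname below is a stand-in for the real execute node):

# what the startd still thinks it's doing with the claim
condor_status -l vm2@node1.grid.private.cam.ac.uk
# what the submit side thinks job 368.0 is doing
condor_q -l 368.0
# what I expect would clear the stuck claim, though it shouldn't be necessary
condor_vacate node1.grid.private.cam.ac.uk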

Any clues to this behaviour, chaps?

Cheers,
Mark

--
Mark Calleja
Cambridge eScience Centre, University of Cambridge
Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WA
Tel. (+44/0) 1223 765317, Fax  (+44/0) 1223 765900
http://www.esc.cam.ac.uk/~mcal00