Mailing List Archives
Public Access
|
|
|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Condor-users] Perpetual "Claimed Idle" state
- Date: Wed, 29 Sep 2004 11:55:04 +0100
- From: Mark Calleja <mcal00@xxxxxxxxxxxxx>
- Subject: [Condor-users] Perpetual "Claimed Idle" state
<>Hi,
I'm submitting a vanilla job from a client to a nominated execute host,
and it seems that the Shadow is losing contact with the Starter, causing
the job to be resubmitted on the client node, even though the job
continues running on the execute node (sometimes to completion, as all
the output files get returned). However, the execute node then remains
in a "Claimed Idle" state. Here are the relevant parts of the logs; lots
of detail to follow so if this thing doesn't interest you then feel free
to stop reading here.
This is the ShadowLog, even after the job "finishes":
9/29 07:48:46 ******************************************************
9/29 07:48:46 ** condor_shadow (CONDOR_SHADOW) STARTING UP
9/29 07:48:46 ** /pond/home/condor/RH9.0/condor-6.6.6/sbin/condor_shadow
9/29 07:48:46 ** $CondorVersion: 6.6.6 Jul 26 2004 $
9/29 07:48:46 ** $CondorPlatform: I386-LINUX_RH9 $
9/29 07:48:46 ** PID = 25976
9/29 07:48:46 ******************************************************
9/29 07:48:46 Using config file: /home/condor/condor_config
9/29 07:48:46 Using local config files: /home/condor/condor_config.local
9/29 07:48:46 DaemonCore: Command Socket at <172.24.116.193:50040>
9/29 07:48:47 Initializing a VANILLA shadow
9/29 07:48:47 (368.0) (25976): Request to run on <172.24.116.33:9665>
was ACCEPTED
9/29 08:23:35 (368.0) (25976): ERROR "Can no longer talk to
condor_starter on execute machine (172.24.116.33)" at line 63 in file
NTreceivers.C
9/29 08:23:35 ******************************************************
9/29 08:23:35 ** condor_shadow (CONDOR_SHADOW) STARTING UP
9/29 08:23:35 ** /pond/home/condor/RH9.0/condor-6.6.6/sbin/condor_shadow
9/29 08:23:35 ** $CondorVersion: 6.6.6 Jul 26 2004 $
9/29 08:23:35 ** $CondorPlatform: I386-LINUX_RH9 $
9/29 08:23:35 ** PID = 26024
9/29 08:23:35 ******************************************************
9/29 08:23:35 Using config file: /home/condor/condor_config
9/29 08:23:35 Using local config files: /home/condor/condor_config.local
9/29 08:23:35 DaemonCore: Command Socket at <172.24.116.193:50120>
9/29 08:23:36 Initializing a VANILLA shadow
9/29 08:23:36 (368.0) (26024): Request to run on <172.24.116.33:9665>
was ACCEPTED
... and the corresponding SchedLog:
9/29 07:47:00 DaemonCore: Command received via UDP from host
<172.24.116.193:36443>
9/29 07:47:00 DaemonCore: received command 421 (RESCHEDULE), calling
handler (reschedule_negotiator)
9/29 07:47:00 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:47:00 Called reschedule_negotiator()
9/29 07:47:00 DaemonCore: Command received via TCP from host
<172.24.116.193:50035>
9/29 07:47:00 DaemonCore: received command 416 (NEGOTIATE), calling
handler (negotiate)
9/29 07:47:00 Negotiating for owner: mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:47:00 Checking consistency running and runnable jobs
9/29 07:47:00 Tables are consistent
9/29 07:47:00 Out of servers - 0 jobs matched, 1 jobs idle, 1 jobs rejected
9/29 07:47:00 Increasing flock level for mcal00@xxxxxxxxxxxxxxxxxxxxx to 1.
9/29 07:47:05 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:47:49 DaemonCore: Command received via TCP from host
<172.24.116.1:9616>
9/29 07:47:49 DaemonCore: received command 416 (NEGOTIATE), calling
handler (negotiate)
9/29 07:47:54 Negotiating for owner: mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:47:54 Checking consistency running and runnable jobs
9/29 07:47:54 Tables are consistent
9/29 07:47:54 Out of servers - 0 jobs matched, 1 jobs idle, 1 jobs rejected
9/29 07:47:54 Increasing flock level for mcal00@xxxxxxxxxxxxxxxxxxxxx to 2.
9/29 07:47:54 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:48:33 DaemonCore: Command received via TCP from host
<172.24.116.20:9629>
9/29 07:48:33 DaemonCore: received command 416 (NEGOTIATE), calling
handler (negotiate)
9/29 07:48:43 Negotiating for owner: mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:48:43 Checking consistency running and runnable jobs
9/29 07:48:43 Tables are consistent
9/29 07:48:44 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 2
9/29 07:48:44 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:48:46 Started shadow for job 368.0 on "<172.24.116.33:9665>",
(shadow pid = 25976)
9/29 07:48:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:53:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:58:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:03:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:08:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:13:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:18:49 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:23:35 ERROR: Shadow exited with job exception code!
9/29 08:23:35 Started shadow for job 368.0 on "<172.24.116.33:9665>",
(shadow pid = 26024)
9/29 08:23:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:28:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:33:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:38:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:43:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:48:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:53:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 08:58:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:03:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:08:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:13:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:18:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:23:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:28:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:33:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 09:38:35 Sent ad to central manager for mcal00@xxxxxxxxxxxxxxxxxxxxx
Hence note that we lost contact with the Starter at some stage. Now the
logs on the execute node; here's the StartLog first:
9/29 07:49:30 DaemonCore: Command received via UDP from host
<172.24.116.20:9615>
9/29 07:49:30 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
9/29 07:49:30 vm2: match_info called
9/29 07:49:30 vm2: Received match <172.24.116.33:9665>#4009692188
9/29 07:49:30 vm2: State change: match notification protocol successful
9/29 07:49:30 vm2: Changing state: Unclaimed -> Matched
9/29 07:49:30 DaemonCore: Command received via TCP from host
<172.24.116.193:50039>
9/29 07:49:30 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler (command_request_claim)
9/29 07:49:30 vm2: Request accepted.
9/29 07:49:30 vm2: Remote owner is mcal00@xxxxxxxxxxxxxxxxxxxxx
9/29 07:49:30 vm2: State change: claiming protocol successful
9/29 07:49:30 vm2: Changing state: Matched -> Claimed
9/29 07:49:33 DaemonCore: Command received via TCP from host
<172.24.116.193:50042>
9/29 07:49:33 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
9/29 07:49:33 vm2: Got activate_claim request from shadow
(<172.24.116.193:50042>)
9/29 07:49:33 vm2: Remote job ID is 368.0
9/29 07:49:33 vm2: Got universe "VANILLA" (5) from request classad
9/29 07:49:33 vm2: State change: claim-activation protocol successful
9/29 07:49:33 vm2: Changing activity: Idle -> Busy
9/29 08:24:21 Starter pid 23353 died on signal 11 (signal 11)
9/29 08:24:21 vm2: State change: starter exited
9/29 08:24:21 vm2: Changing activity: Busy -> Idle
9/29 08:24:22 DaemonCore: Command received via TCP from host
<172.24.116.193:50122>
9/29 08:24:22 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
9/29 08:24:22 vm2: Got activate_claim request from shadow
(<172.24.116.193:50122>)
9/29 08:24:22 vm2: Remote job ID is 368.0
9/29 08:24:22 vm2: Got universe "VANILLA" (5) from request classad
9/29 08:24:22 vm2: State change: claim-activation protocol successful
9/29 08:24:22 vm2: Changing activity: Idle -> Busy
9/29 09:33:21 Starter pid 23430 exited with status 4
9/29 09:33:21 vm2: State change: starter exited
9/29 09:33:21 vm2: Changing activity: Busy -> Idle
9/29 10:12:28 State change: RunBenchmarks is TRUE
9/29 10:12:28 vm1: Changing activity: Idle -> Benchmarking
9/29 10:12:32 State change: benchmarks completed
9/29 10:12:32 vm1: Changing activity: Benchmarking -> Idle
Note that we received a signal 11; does this imply that the user
application got this or the Starter process itself? Now for the StarterLog:
9/29 07:49:33 ******************************************************
9/29 07:49:33 ** condor_starter (CONDOR_STARTER) STARTING UP
9/29 07:49:33 ** /Condor/RH7.2/condor-6.6.6/sbin/condor_starter
9/29 07:49:33 ** $CondorVersion: 6.6.6 Jul 26 2004 $
9/29 07:49:33 ** $CondorPlatform: I386-LINUX_RH72 $
9/29 07:49:33 ** PID = 23353
9/29 07:49:33 ******************************************************
9/29 07:49:33 Using config file: /home/condor/condor_config
9/29 07:49:33 Using local config files:
/home/condor/hosts/node1/condor_config.local
9/29 07:49:33 DaemonCore: Command Socket at <172.24.116.33:9690>
9/29 07:49:33 Done setting resource limits
9/29 07:49:33 Starter communicating with condor_shadow
<172.24.116.193:50040>
9/29 07:49:33 Submitting machine is "cartman--esc.grid.private.cam.ac.uk"
9/29 07:49:33 VM2_USER set, so running job as condor_user2
9/29 07:49:33 File transfer completed successfully.
9/29 07:49:34 Starting a VANILLA universe job with ID: 368.0
9/29 07:49:34 IWD: /Condor_local/execute/dir_23353
9/29 07:49:34 Output file: /Condor_local/execute/dir_23353/out
9/29 07:49:34 Error file: /Condor_local/execute/dir_23353/err
9/29 07:49:34 About to exec
/Condor_local/execute/dir_23353/condor_exec.exe NVT_PARAM CONFIG
9/29 07:49:34 Create_Process succeeded, pid=23355
9/29 08:24:22 ******************************************************
9/29 08:24:22 ** condor_starter (CONDOR_STARTER) STARTING UP
9/29 08:24:22 ** /Condor/RH7.2/condor-6.6.6/sbin/condor_starter
9/29 08:24:22 ** $CondorVersion: 6.6.6 Jul 26 2004 $
9/29 08:24:22 ** $CondorPlatform: I386-LINUX_RH72 $
9/29 08:24:22 ** PID = 23430
9/29 08:24:22 ******************************************************
9/29 08:24:22 Using config file: /home/condor/condor_config
9/29 08:24:22 Using local config files:
/home/condor/hosts/node1/condor_config.local
9/29 08:24:22 DaemonCore: Command Socket at <172.24.116.33:9651>
9/29 08:24:22 Done setting resource limits
9/29 08:24:22 Starter communicating with condor_shadow
<172.24.116.193:50120>
9/29 08:24:22 Submitting machine is "cartman--esc.grid.private.cam.ac.uk"
9/29 08:24:22 VM2_USER set, so running job as condor_user2
9/29 08:24:22 File transfer completed successfully.
9/29 08:24:23 Starting a VANILLA universe job with ID: 368.0
9/29 08:24:23 IWD: /Condor_local/execute/dir_23430
9/29 08:24:23 Output file: /Condor_local/execute/dir_23430/out
9/29 08:24:23 Error file: /Condor_local/execute/dir_23430/err
9/29 08:24:23 About to exec
/Condor_local/execute/dir_23430/condor_exec.exe NVT_PARAM CONFIG
9/29 08:24:23 Create_Process succeeded, pid=23432
9/29 09:28:21 Process exited, pid=23432, status=0
9/29 09:33:21 condor_read(): timeout reading buffer.
9/29 09:33:21 ERROR "Assertion ERROR on (result)" at line 270 in file
NTsenders.C
9/29 09:33:21 ShutdownFast all jobs.
And that's it. The current situation is that the execute node's in a
"Claimed Idle" state, but not actually running any condor job, while the
client machine thinks that it still has a process running on the execute
node (from condor_q). I repeat that the job actually ran to completion
(probably on the 2nd submit, as job 368.0), and that the application
runs fine outside of condor on that execute host when launched via the
command line.
Any clues to this behaviour chaps?
Cheers,
Mark
--
Mark Calleja
Cambridge eScience Centre, University of Cambridge
Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WA
Tel. (+44/0) 1223 765317, Fax (+44/0) 1223 765900
http://www.esc.cam.ac.uk/~mcal00