
[Condor-users] Claimed Idle jobs on windows machines



Hello,

We have been having problems where jobs end up marked as running in the
queue, but condor_status shows the machines they were running on as
either Claimed Idle or in the Owner state.

This happens whether the job completes on its own or someone starts
using the machine. I wrote a test program that does nothing but sleep
for 30 minutes and then output some text to the console. When submitted,
it runs for 30 minutes as it should and I see the output, but when it
exits the machine goes to Claimed Idle and the job is still shown as
running in the queue. Below is the log from the STARTD on the machine
that ran my test job. The two things I am confused about are why the
starter is exiting with status 4, and why the machine state is not
changed back to Unclaimed when the job is done. Our central manager is a
Linux machine and the machines in the pool are Windows XP. They are all
running Condor 6.6.5. Any help would be appreciated. Please let me know
if you need additional information.
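
For reference, here is a minimal sketch of a test job of the kind
described above. It assumes Python, since the language of the actual
test program is not stated; the function name `run_test_job` and the
message text are hypothetical.

```python
import sys
import time


def run_test_job(sleep_seconds=30 * 60, out=sys.stdout):
    """Sleep for sleep_seconds, then write one line of text to out.

    Hypothetical stand-in for the test program: the real one slept for
    30 minutes and then printed some text to the console.
    """
    time.sleep(sleep_seconds)
    out.write("test job finished\n")
    out.flush()
    return 0  # value a wrapper script could pass to sys.exit()
```

A job script would call `sys.exit(run_test_job())`. Passing a small
`sleep_seconds` makes it quick to check whether the problem depends on
how long the job runs.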

Thanks in advance.

--Joe Rinkovsky
Unix Systems Support Group
Indiana University


6/24 08:31:10 Changing state: Unclaimed -> Matched
6/24 08:31:10 DaemonCore: Command received via TCP from host <129.79.4.13:10599>
6/24 08:31:10 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
6/24 08:31:10 Request accepted.
6/24 08:31:11 Remote owner is jrinkovs@xxxxxxxxxxxxxxxxxxxxxxxxxxx
6/24 08:31:11 State change: claiming protocol successful
6/24 08:31:11 Changing state: Matched -> Claimed
6/24 08:31:13 DaemonCore: Command received via TCP from host <129.79.4.13:11795>
6/24 08:31:13 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/24 08:31:13 Got activate_claim request from shadow (<129.79.4.13:11795>)
6/24 08:31:13 Remote job ID is 1913.0
6/24 08:31:13 Got universe "VANILLA" (5) from request classad
6/24 08:31:13 State change: claim-activation protocol successful
6/24 08:31:13 Changing activity: Idle -> Busy
6/24 08:31:44 DaemonCore: Command received via TCP from host <129.79.4.13:11095>
6/24 08:31:44 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling handler (command_query_ads)
6/24 08:31:44 In command_query_ads
6/24 08:31:52 DaemonCore: Command received via TCP from host <129.79.4.13:10904>
6/24 08:31:52 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling handler (command_query_ads)
6/24 08:31:52 In command_query_ads
6/24 08:31:56 DaemonCore: Command received via TCP from host <129.79.4.13:11307>
6/24 08:31:56 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling handler (command_query_ads)
6/24 08:31:56 In command_query_ads
6/24 08:32:59 DaemonCore: Command received via TCP from host <129.79.4.13:12008>
6/24 08:32:59 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling handler (command_query_ads)
6/24 08:32:59 In command_query_ads
6/24 08:33:05 DaemonCore: Command received via TCP from host <129.79.4.13:11125>
6/24 08:33:05 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling handler (command_query_ads)
6/24 08:33:05 In command_query_ads
6/24 09:02:05 DaemonCore: Command received via UDP from host <129.79.19.65:1126>
6/24 09:02:05 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
6/24 09:02:05 Starter pid 584 exited with status 4
6/24 09:02:06 State change: starter exited
6/24 09:02:06 Changing activity: Busy -> Idle
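
To make the sequence of transitions easier to see, here is a rough
sketch of a filter that keeps only the "Changing state" and "Changing
activity" entries from a log like the one above. It assumes one entry
per line in the startd's log file and the timestamp format shown here;
the names `TRANSITION` and `transitions` are hypothetical and not part
of Condor.

```python
import re

# Matches entries such as "6/24 09:02:06 Changing activity: Busy -> Idle".
# The timestamp layout is assumed from the log excerpt, not from any spec.
TRANSITION = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) Changing (state|activity): (\w+) -> (\w+)$"
)


def transitions(lines):
    """Return (timestamp, kind, old, new) tuples for each transition entry."""
    found = []
    for line in lines:
        match = TRANSITION.match(line.strip())
        if match:
            found.append(match.groups())
    return found
```

Run over the excerpt above, this would list the Unclaimed -> Matched and
Matched -> Claimed state changes and the Idle -> Busy and Busy -> Idle
activity changes, with no state change back to Unclaimed after the
starter exits.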