[HTCondor-users] What makes the condor_startd stucked?
- Date: Wed, 11 Feb 2015 10:46:27 +0000
- From: qing <gang.qin@xxxxxxxxxxxxx>
- Subject: [HTCondor-users] What makes the condor_startd stucked?
Dear Condor experts,
Recently we have found that, from time to time, 'condor_status -state'
does not report the memory of an execute node correctly: the last digit
of the memory value is missing. In the following example,
'condor_status -state -wide' reports 21175 while plain 'condor_status'
reports 211750, so a '0' is missing. Meanwhile, the free partitionable
slot stays in the 'Matched' state, although it should be 'Unclaimed'.
node064:~# condor_status -state -wide | grep node064
slot1@xxxxxxxxxxxxxxxxxxxxxxx 54 21175 0.180 0+00:01:20 Matched 0+05:37:30 Idle 0+05:37:30
slot1_12@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+09:17:05 Busy 0+05:37:10
slot1_17@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+11:31:54 Busy 0+05:37:10
slot1_2@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+05:36:52 Busy 0+05:36:11
slot1_36@xxxxxxxxxxxxxxxxxxxxxxx 1 4000 1.000 53+22:26:55 Claimed 0+16:26:48 Busy 0+05:37:10
slot1_47@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+09:16:30 Busy 0+05:37:10
slot1_53@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+09:16:30 Busy 0+05:37:10
slot1_57@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+10:58:15 Busy 0+05:37:10
slot1_61@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+11:00:09 Busy 0+05:37:10
slot1_64@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+10:59:43 Busy 0+05:37:10
slot1_8@xxxxxxxxxxxxxxxxxxxxxxx 1 4000 1.000 53+22:26:55 Claimed 0+21:33:27 Busy 0+05:37:10
node064:~# condor_status | grep node064
slot1@xxxxxxxxxxxx LINUX X86_64 Matched Idle 0.270 211750 0+05:40:40
slot1_12@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40
slot1_17@xxxxxxxxx LINUX X86_64 Claimed Busy 0.380 3000 0+05:41:40
slot1_2@xxxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:40:41
slot1_36@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 4000 0+05:41:40
slot1_47@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40
slot1_53@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40
slot1_57@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40
slot1_61@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40
slot1_64@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40
slot1_8@xxxxxxxxxx LINUX X86_64 Claimed Busy 1.000 4000 0+05:41:40
From the StartLog we can see that the last state change of slot1 was
at about 04:22:
node064:~# cat /var/log/condor/StartLog | grep slot1 | grep -v '_' | tail -n 15
02/11/15 02:56:30 slot1: State change: match notification protocol successful
02/11/15 02:56:30 slot1: Changing state: Unclaimed -> Matched
02/11/15 02:56:30 slot1: Changing state: Matched -> Unclaimed
02/11/15 03:01:36 slot1: Received match <10.141.0.64:43678>#1418988885#44480#...
02/11/15 03:01:36 slot1: State change: match notification protocol successful
02/11/15 03:01:36 slot1: Changing state: Unclaimed -> Matched
02/11/15 03:01:36 slot1: Changing state: Matched -> Unclaimed
02/11/15 03:01:41 slot1: State change: entering Drained state
02/11/15 03:01:41 slot1: Changing state and activity: Unclaimed/Idle -> Drained/Retiring
02/11/15 04:21:46 slot1: State change: slot is no longer draining.
02/11/15 04:21:46 slot1: Changing state and activity: Drained/Retiring -> Owner/Idle
02/11/15 04:21:46 slot1: Changing state: Owner -> Unclaimed
02/11/15 04:22:46 slot1: Received match <10.141.0.64:43678>#1418988885#44490#...
02/11/15 04:22:46 slot1: State change: match notification protocol successful
02/11/15 04:22:46 slot1: Changing state: Unclaimed -> Matched
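
The same check can be scripted instead of done with grep. What follows is
only a minimal sketch: the function name `last_state_change`, the regular
expression, and the assumption that each StartLog entry sits on one
unwrapped line (as in the file itself) are illustrative choices, not part
of HTCondor. Matching the slot name exactly plays the role of
`grep -v '_'` and skips the dynamic slots.

```python
import re

# Matches StartLog transition lines such as
# "02/11/15 04:22:46 slot1: Changing state: Unclaimed -> Matched"
# and the "Changing state and activity" variant.
STATE_CHANGE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\S+): "
    r"Changing state(?: and activity)?: (\S+) -> (\S+)$"
)


def last_state_change(lines, slot="slot1"):
    """Return (timestamp, old, new) of the last transition logged for
    exactly `slot`, or None if the log holds no such line."""
    last = None
    for line in lines:
        m = STATE_CHANGE.match(line.strip())
        if m and m.group(2) == slot:
            last = (m.group(1), m.group(3), m.group(4))
    return last


# Entries copied from the StartLog excerpt above, one per line.
startlog_sample = [
    "02/11/15 04:21:46 slot1: Changing state and activity: Drained/Retiring -> Owner/Idle",
    "02/11/15 04:21:46 slot1: Changing state: Owner -> Unclaimed",
    "02/11/15 04:22:46 slot1: Received match <10.141.0.64:43678>#1418988885#44490#...",
    "02/11/15 04:22:46 slot1: State change: match notification protocol successful",
    "02/11/15 04:22:46 slot1: Changing state: Unclaimed -> Matched",
]

print(last_state_change(startlog_sample))
```

For these entries the result is the Unclaimed -> Matched transition at
04:22:46, with no later Matched -> Unclaimed transition following it.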
On the central manager side, the CollectorLog shows the following:
svr021:~# grep node064 /var/log/condor/CollectorLog | grep Inserting | tail -n 10
02/11/15 02:56:35 StartdAd : Inserting ** "< slot1_58@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 02:56:35 StartdPvtAd : Inserting ** "< slot1_58@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 03:01:41 StartdAd : Inserting ** "< slot1_59@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 03:01:41 StartdPvtAd : Inserting ** "< slot1_59@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:25 StartdAd : Inserting ** "< slot1_2@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:25 StartdPvtAd : Inserting ** "< slot1_2@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:45 StartdAd : Inserting ** "< slot1_4@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:45 StartdPvtAd : Inserting ** "< slot1_4@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:49 StartdAd : Inserting ** "< slot1_5@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
02/11/15 04:22:49 StartdPvtAd : Inserting ** "< slot1_5@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"
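
To see when each slot's ad was last inserted, the CollectorLog lines can
be summarized per slot. This is a rough sketch under assumptions: the
helper name `last_inserts` and the regular expression are made up for
illustration, each entry is expected on one unwrapped line, and the exact
spacing in the real log may differ from what the archive shows, so the
pattern is deliberately tolerant of whitespace.

```python
import re

# Matches CollectorLog lines such as
# '02/11/15 04:22:49 StartdAd : Inserting ** "< slot1_5@host , 10.141.0.64 >"'
INSERT = re.compile(
    r'^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+(\w+)\s*: Inserting \*\* '
    r'"<\s*(\S+)\s*,\s*(\S+)\s*>"$'
)


def last_inserts(lines, ad_type="StartdAd"):
    """Map slot name (the part before '@') to the timestamp of its most
    recent 'Inserting' entry of the given ad type, in file order."""
    latest = {}
    for line in lines:
        m = INSERT.match(line.strip())
        if m and m.group(2) == ad_type:
            slot = m.group(3).split("@")[0]
            latest[slot] = m.group(1)
    return latest


# Entries copied from the CollectorLog excerpt above, one per line.
collector_sample = [
    '02/11/15 04:22:25 StartdAd : Inserting ** "< slot1_2@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"',
    '02/11/15 04:22:45 StartdAd : Inserting ** "< slot1_4@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"',
    '02/11/15 04:22:49 StartdAd : Inserting ** "< slot1_5@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"',
    '02/11/15 04:22:49 StartdPvtAd : Inserting ** "< slot1_5@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"',
]

for slot, stamp in last_inserts(collector_sample).items():
    print(slot, stamp)
```

Only 'Inserting' entries are counted here, so the summary says when an ad
was last inserted, not whether later updates for it arrived.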
It seems to me that the startd stopped sending information to the
collector at about 04:22. Usually this can be fixed by simply restarting
the startd daemon. But what could lead to such behavior?
Cheers,
Gang