[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] What makes the condor_startd stucked?



Dear Condor Expert:

Recently we found that from time to time 'condor_status -state' does not report the memory of a execute node correctly, the last number in the memory is missing. In the following example you will see that 'condor_status -state -wide' reports that it's 21175 but 'condor_status' says it's 211750, a '0' is missing. Meanwhile the free partionable slot is always at 'Matched' status, while it should be 'Unclaimed'.

  node064:~# condor_status -state -wide | grep node064
slot1@xxxxxxxxxxxxxxxxxxxxxxx 54 21175 0.180 0+00:01:20 Matched 0+05:37:30 Idle 0+05:37:30 slot1_12@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+09:17:05 Busy 0+05:37:10 slot1_17@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+11:31:54 Busy 0+05:37:10 slot1_2@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+05:36:52 Busy 0+05:36:11 slot1_36@xxxxxxxxxxxxxxxxxxxxxxx 1 4000 1.000 53+22:26:55 Claimed 0+16:26:48 Busy 0+05:37:10 slot1_47@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+09:16:30 Busy 0+05:37:10 slot1_53@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+09:16:30 Busy 0+05:37:10 slot1_57@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+10:58:15 Busy 0+05:37:10 slot1_61@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+11:00:09 Busy 0+05:37:10 slot1_64@xxxxxxxxxxxxxxxxxxxxxxx 1 3000 1.000 53+22:26:55 Claimed 0+10:59:43 Busy 0+05:37:10 slot1_8@xxxxxxxxxxxxxxxxxxxxxxx 1 4000 1.000 53+22:26:55 Claimed 0+21:33:27 Busy 0+05:37:10

node064:~# condor_status  | grep node064
slot1@xxxxxxxxxxxx LINUX X86_64 Matched Idle 0.270 211750 0+05:40:40 slot1_12@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40 slot1_17@xxxxxxxxx LINUX X86_64 Claimed Busy 0.380 3000 0+05:41:40 slot1_2@xxxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:40:41 slot1_36@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 4000 0+05:41:40 slot1_47@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40 slot1_53@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40 slot1_57@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40 slot1_61@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40 slot1_64@xxxxxxxxx LINUX X86_64 Claimed Busy 1.000 3000 0+05:41:40 slot1_8@xxxxxxxxxx LINUX X86_64 Claimed Busy 1.000 4000 0+05:41:40

From the StartLog we can see that the last status change of slot1 one is at ~ 4:22.

node064:~# cat /var/log/condor/StartLog | grep slot1 | grep -v '_' | tail -n 15 02/11/15 02:56:30 slot1: State change: match notification protocol successful
02/11/15 02:56:30 slot1: Changing state: Unclaimed -> Matched
02/11/15 02:56:30 slot1: Changing state: Matched -> Unclaimed
02/11/15 03:01:36 slot1: Received match <10.141.0.64:43678>#1418988885#44480#... 02/11/15 03:01:36 slot1: State change: match notification protocol successful
02/11/15 03:01:36 slot1: Changing state: Unclaimed -> Matched
02/11/15 03:01:36 slot1: Changing state: Matched -> Unclaimed
02/11/15 03:01:41 slot1: State change: entering Drained state
02/11/15 03:01:41 slot1: Changing state and activity: Unclaimed/Idle -> Drained/Retiring
02/11/15 04:21:46 slot1: State change: slot is no longer draining.
02/11/15 04:21:46 slot1: Changing state and activity: Drained/Retiring -> Owner/Idle
02/11/15 04:21:46 slot1: Changing state: Owner -> Unclaimed
02/11/15 04:22:46 slot1: Received match <10.141.0.64:43678>#1418988885#44490#... 02/11/15 04:22:46 slot1: State change: match notification protocol successful
02/11/15 04:22:46 slot1: Changing state: Unclaimed -> Matched

  On the central service side we also have.

svr021:~# grep node064 /var/log/condor/CollectorLog | grep Inserting | tail -n 10 02/11/15 02:56:35 StartdAd : Inserting ** "< slot1_58@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >" 02/11/15 02:56:35 StartdPvtAd : Inserting ** "< slot1_58@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >" 02/11/15 03:01:41 StartdAd : Inserting ** "< slot1_59@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >" 02/11/15 03:01:41 StartdPvtAd : Inserting ** "< slot1_59@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >" 02/11/15 04:22:25 StartdAd : Inserting ** "< slot1_2@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >" 02/11/15 04:22:25 StartdPvtAd : Inserting ** "< slot1_2@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >" 02/11/15 04:22:45 StartdAd : Inserting ** "< slot1_4@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >" 02/11/15 04:22:45 StartdPvtAd : Inserting ** "< slot1_4@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >" 02/11/15 04:22:49 StartdAd : Inserting ** "< slot1_5@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >" 02/11/15 04:22:49 StartdPvtAd : Inserting ** "< slot1_5@xxxxxxxxxxxxxxxxxxxxxxx , 10.141.0.64 >"

Seems to me that the startd stopped sending information to collector at 04:22. Usually this can be fixed by a simple restart of the startd daemon. But what could lead to such a behavior?

  Cheers,Gang