Re: [HTCondor-users] slots stay claimed/idle even after UNUSED_CLAIM_TIMEOUT expired



Hello Greg,

> Are all these machines being used by parallel universe jobs?
Yes, all machines run only parallel universe jobs.

----------
Sergey Komissarov
Senior Software Developer
DATADVANCE

----- Original Message -----
From: "Greg Thain" <gthain@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Friday, August 6, 2021 6:22:42 PM
Subject: Re: [HTCondor-users] slots stay claimed/idle even after UNUSED_CLAIM_TIMEOUT expired

On 8/6/21 10:18 AM, Stanislav V. Markevich via HTCondor-users wrote:
> Hi,
>
> I set UNUSED_CLAIM_TIMEOUT to 180, but some (dynamic) slots stay in the Claimed/Idle state forever (see the last column):


UNUSED_CLAIM_TIMEOUT is only used when running parallel universe jobs.
Are all these machines being used by parallel universe jobs? In the
worst case, I believe that "condor_vacate" should be able to put
Claimed/Idle slots back to Unclaimed.

-greg


>
>
> condor_status -af:h Name OpSys State Activity Cpus Memory TotalTimeClaimedIdle
>
> Name                                    OpSys   State     Activity Cpus Memory TotalTimeClaimedIdle
> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX   Unclaimed Idle     191  107    undefined
> slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    512    5
> slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    512    5
> slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    5
> slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    73153
> slot1_5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    73153
> slot1_6@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    73153
> slot1_8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    84669
> slot1_9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    84669
> slot1_10@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX   Claimed   Idle     1    128    84669
> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX   Unclaimed Idle     191  107    undefined
> slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    512    18
> slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    18
> slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    65209
> slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    65209
> slot1_5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    83370
> slot1_6@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    512    65209
> slot1_7@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    65209
> slot1_8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    65209
> slot1_9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    16593
> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX   Unclaimed Idle     191  107    undefined
> slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    512    23
> slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    512    73171
> slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    23
> slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    84608
> slot1_5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    84608
> slot1_9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    84608
> slot1_10@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX   Claimed   Idle     1    128    84608
> slot1_11@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX   Claimed   Idle     1    128    84608
> slot1_12@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX   Claimed   Idle     1    128    84608
>
>
> Normally, when a slot exceeds UNUSED_CLAIM_TIMEOUT, there is a record in the log saying that the slot is released:
>
> 2021-08-06T14:45:50.956165411Z condor_schedd[3032]: Resource slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx has been unused for 182 seconds, limit is 180, releasing
>
> But for the problematic slots, the last records in the log are from hours ago (~24h):
>
> 2021-08-05T16:34:12.960897270Z condor_startd[859]: slot1_12: State change: starter exited
> 2021-08-05T16:34:12.960904225Z condor_startd[859]: slot1_12: Changing activity: Busy -> Idle
> 2021-08-05T16:34:12.960968125Z condor_startd[859]: slot1_12: State change: idle claim shutting down due to CLAIM_WORKLIFE
> 2021-08-05T16:34:12.960974666Z condor_startd[859]: slot1_12: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 2021-08-05T16:34:12.962018643Z condor_startd[859]: slot1_12: State change: No preempting claim, returning to owner
> 2021-08-05T16:34:12.962359058Z condor_startd[859]: slot1_12: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 2021-08-05T16:34:12.962697322Z condor_startd[859]: slot1_12: State change: IS_OWNER is false
> 2021-08-05T16:34:12.962706591Z condor_startd[859]: slot1_12: Changing state: Owner -> Unclaimed
> 2021-08-05T16:34:12.962748296Z condor_startd[859]: slot1_12: Changing state: Unclaimed -> Delete
> 2021-08-05T16:34:12.962880429Z condor_startd[859]: slot1_12: Resource no longer needed, deleting
>
> and then nothing. The slots are still there and claimed.
>
> Is this a bug? Is there a way to release these slots forcefully?
>
>
> Best regards,
> Stanislav V. Markevich
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/

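For anyone who wants to spot such slots automatically, here is a minimal
sketch in Python that scans the text printed by the condor_status command
quoted above and lists the Claimed/Idle slots whose last column exceeds the
timeout. It is an illustration only, not part of HTCondor: the function name
find_stuck_slots and the host name host.example are hypothetical, the 180
second limit is the value from the original post, and TotalTimeClaimedIdle is
treated as a rough indicator simply because it is the column shown there.

```python
UNUSED_CLAIM_TIMEOUT = 180  # seconds; the value quoted in the original post


def find_stuck_slots(status_text, limit=UNUSED_CLAIM_TIMEOUT):
    """Return (name, idle_seconds) for Claimed/Idle slots idle longer than limit.

    Hypothetical helper: expects the 7-column text produced by
    condor_status -af:h Name OpSys State Activity Cpus Memory TotalTimeClaimedIdle
    """
    stuck = []
    for line in status_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything with an unexpected shape.
        if len(fields) != 7 or fields[0] == "Name":
            continue
        name, _opsys, state, activity, _cpus, _memory, idle = fields
        if state != "Claimed" or activity != "Idle":
            continue
        # Partitionable parent slots report "undefined" in the last column.
        if not idle.isdigit():
            continue
        if int(idle) > limit:
            stuck.append((name, int(idle)))
    return stuck


# Made-up sample in the same layout as the listing in the post.
sample = """\
Name                 OpSys State     Activity Cpus Memory TotalTimeClaimedIdle
slot1@host.example   LINUX Unclaimed Idle     191  107    undefined
slot1_1@host.example LINUX Claimed   Idle     1    512    5
slot1_4@host.example LINUX Claimed   Idle     1    128    73153
"""

for name, idle in find_stuck_slots(sample):
    print(f"{name} has been claimed/idle for {idle}s (limit {UNUSED_CLAIM_TIMEOUT}s)")
```

The names it reports are the candidates for the condor_vacate workaround
suggested in the reply.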
