
Re: [HTCondor-users] slots stay claimed/idle even after UNUSED_CLAIM_TIMEOUT expired



On 8/6/21 10:18 AM, Stanislav V. Markevich via HTCondor-users wrote:
Hi,

I set UNUSED_CLAIM_TIMEOUT to 180, but some (dynamic) slots stay in the Claimed/Idle state forever (see the last column):


UNUSED_CLAIM_TIMEOUT is only used when running parallel universe jobs. Are all these machines being used by parallel universe jobs? In the worst case, I believe that "condor_vacate" should be able to put Claimed/Idle slots back to Unclaimed.

-greg




condor_status -af:h Name OpSys State Activity Cpus Memory TotalTimeClaimedIdle

Name                                    OpSys   State     Activity Cpus Memory TotalTimeClaimedIdle
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX   Unclaimed Idle     191  107    undefined
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    512    5
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    512    5
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    5
slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    73153
slot1_5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    73153
slot1_6@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    73153
slot1_8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    84669
slot1_9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    84669
slot1_10@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX   Claimed   Idle     1    128    84669
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX   Unclaimed Idle     191  107    undefined
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    512    18
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    18
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    65209
slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    65209
slot1_5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    83370
slot1_6@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    512    65209
slot1_7@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    65209
slot1_8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    65209
slot1_9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    16593
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX   Unclaimed Idle     191  107    undefined
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    512    23
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    512    73171
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    23
slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    84608
slot1_5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    84608
slot1_9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX   Claimed   Idle     1    128    84608
slot1_10@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX   Claimed   Idle     1    128    84608
slot1_11@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX   Claimed   Idle     1    128    84608
slot1_12@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX   Claimed   Idle     1    128    84608
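
To make the listing above easier to scan, here is a minimal Python sketch that filters saved condor_status output for Claimed/Idle slots whose TotalTimeClaimedIdle exceeds the configured timeout. The function name stale_claims and the sample host names are made up for illustration, and the parser assumes the seven-column layout shown above. Note that TotalTimeClaimedIdle is a cumulative counter, so a large value is a hint that a claim is stuck, not proof.

```python
UNUSED_CLAIM_TIMEOUT = 180  # seconds, the value quoted in the post


def stale_claims(status_text, timeout=UNUSED_CLAIM_TIMEOUT):
    """Return (name, idle_seconds) for Claimed/Idle slots idle longer than timeout.

    Assumes the column order:
    Name OpSys State Activity Cpus Memory TotalTimeClaimedIdle
    """
    stale = []
    for line in status_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything with an unexpected shape.
        if len(fields) != 7 or fields[0] == "Name":
            continue
        name, _opsys, state, activity, _cpus, _memory, idle = fields
        # Partitionable parent slots report "undefined" here; isdigit() skips them.
        if state == "Claimed" and activity == "Idle" and idle.isdigit():
            if int(idle) > timeout:
                stale.append((name, int(idle)))
    return stale


# Hypothetical sample in the same layout as the listing above.
sample = """\
Name         OpSys State     Activity Cpus Memory TotalTimeClaimedIdle
slot1@host   LINUX Unclaimed Idle     191  107    undefined
slot1_1@host LINUX Claimed   Idle     1    512    5
slot1_4@host LINUX Claimed   Idle     1    128    73153
"""

for name, idle in stale_claims(sample):
    print(f"{name} claimed/idle for {idle}s (limit {UNUSED_CLAIM_TIMEOUT}s)")
```

Feeding it the real output of the condor_status command above would list only the slots that look stuck.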


Normally, when a slot exceeds UNUSED_CLAIM_TIMEOUT, there is a record in the log saying that the slot is being released:

2021-08-06T14:45:50.956165411Z condor_schedd[3032]: Resource slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx has been unused for 182 seconds, limit is 180, releasing

But for the problematic slots, the last records in the log were written hours ago (~24h):

2021-08-05T16:34:12.960897270Z condor_startd[859]: slot1_12: State change: starter exited
2021-08-05T16:34:12.960904225Z condor_startd[859]: slot1_12: Changing activity: Busy -> Idle
2021-08-05T16:34:12.960968125Z condor_startd[859]: slot1_12: State change: idle claim shutting down due to CLAIM_WORKLIFE
2021-08-05T16:34:12.960974666Z condor_startd[859]: slot1_12: Changing state and activity: Claimed/Idle -> Preempting/Vacating
2021-08-05T16:34:12.962018643Z condor_startd[859]: slot1_12: State change: No preempting claim, returning to owner
2021-08-05T16:34:12.962359058Z condor_startd[859]: slot1_12: Changing state and activity: Preempting/Vacating -> Owner/Idle
2021-08-05T16:34:12.962697322Z condor_startd[859]: slot1_12: State change: IS_OWNER is false
2021-08-05T16:34:12.962706591Z condor_startd[859]: slot1_12: Changing state: Owner -> Unclaimed
2021-08-05T16:34:12.962748296Z condor_startd[859]: slot1_12: Changing state: Unclaimed -> Delete
2021-08-05T16:34:12.962880429Z condor_startd[859]: slot1_12: Resource no longer needed, deleting

and then nothing. The slots are still there and claimed.
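
A rough Python sketch for checking how long a slot has been silent in the startd log, assuming the line format shown above (ISO timestamp, daemon name with PID, slot name, message) and that the log is in chronological order. The function name last_startd_event and the regular expression are illustrative and not part of HTCondor.

```python
import re
from datetime import datetime, timezone

# Matches lines like:
# 2021-08-05T16:34:12.960897270Z condor_startd[859]: slot1_12: State change: ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z\s+"
    r"condor_startd\[\d+\]:\s+(?P<slot>slot\w+):\s+(?P<msg>.*)$"
)


def last_startd_event(log_text):
    """Map each slot name to (timestamp, message) of its last startd line.

    Fractional seconds are dropped; later lines overwrite earlier ones,
    so the input is assumed to be in chronological order.
    """
    last = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S")
        last[m.group("slot")] = (ts.replace(tzinfo=timezone.utc), m.group("msg"))
    return last


sample_log = """\
2021-08-05T16:34:12.960897270Z condor_startd[859]: slot1_12: State change: starter exited
2021-08-05T16:34:12.962880429Z condor_startd[859]: slot1_12: Resource no longer needed, deleting
"""

events = last_startd_event(sample_log)
ts, msg = events["slot1_12"]
# Reference time taken from the schedd log line quoted earlier in the post.
now = datetime(2021, 8, 6, 14, 45, 50, tzinfo=timezone.utc)
print(f"slot1_12 last logged '{msg}' {now - ts} ago")
```

Comparing the silent slots found this way against the condor_status listing shows which claims the schedd still holds for slots the startd believes it has deleted.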

Is this a bug? Is there a way to release these slots forcefully?


Best regards,
Stanislav V. Markevich