
Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId



Dear Jaime Frey,

Is there any news? Could I help somehow? At the moment some of my jobs are waiting 20+ hours before execution. ;(

Thanks in advance,
Dmitry.

----- Original Message -----
From: "Jaime Frey" <jfrey@xxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
Sent: Thursday, July 8, 2021 9:58:01 PM
Subject: Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId

I am able to reproduce the problem, or something very close to it.
I am about to go on vacation, so will not be able to investigate further until a later date.

 - Jaime

> On Jul 5, 2021, at 4:53 AM, Dmitry A. Golubkov via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
> 
> Dear Jaime Frey,
> 
> Could I help somehow? Is the issue reproducible on your side?
> 
> Thanks in advance,
> Dmitry.
> 
> ----- Original Message -----
> From: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
> Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
> Sent: Monday, June 28, 2021 12:23:23 PM
> Subject: Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId
> 
> Dear Jaime Frey,
> 
> One more thing: in the log I see only four "Got RELEASE_CLAIM from" messages instead of five.
> 
> Dmitry.
> 
> ----- Original Message -----
> From: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
> To: "Jaime Frey" <jfrey@xxxxxxxxxxx>
> Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>, "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
> Sent: Monday, June 28, 2021 11:14:21 AM
> Subject: Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId
> 
> Dear Jaime Frey,
> 
>> Do you see this problem for every job or just some of the time?
> 
> Just sometimes, but in my configuration the problem is reproducible almost 100% of the time. I do the following to reproduce the issue:
> I restart the HTCondor cluster before each experiment. The size of my cluster (CPU/memory), with two execute nodes, allows only one job (with 5 tasks inside) to run at a time. I start the first job, wait until all slots are released after it finishes, and right after that I run the same job again. Just to remind you, I use dynamic slots; a sketch of my submit file is below.
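> 
> A minimal sketch of that submit description (the executable, arguments, resource requests, and log name here are placeholders, not my real files):
> 
>   # parallel-universe job: all 5 tasks must be matched at once
>   universe       = parallel
>   executable     = /bin/sleep
>   arguments      = 60
>   machine_count  = 5
>   request_cpus   = 1
>   request_memory = 512
>   log            = job.log
>   queue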
> 
> 
>> It looks like you're running parallel universe jobs. Do you know if the same issue happens with vanilla universe jobs?
> 
> I have never tried, because in my case a job can't be run on a single machine. You can ask me to do any experiments if it helps.
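> 
> If it would help as an experiment, I could try a vanilla-universe variant of the same test, something along these lines (a sketch only; it runs 5 independent single-machine jobs, so it does not match my real workload):
> 
>   universe       = vanilla
>   executable     = /bin/sleep
>   arguments      = 60
>   request_cpus   = 1
>   request_memory = 512
>   log            = vanilla-test.log
>   queue 5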
> 
> 
>> When the startd kills the claim after the job completes, it sends a RELEASE_CLAIM command to the schedd, which should prevent the schedd from attempting to reuse it.
> 
> Here is the full log file: 
> - https://github.com/herclogon/htcondor/files/6704882/one-success-run-goodresearch-softtimeout.log (https://github.com/herclogon/htcondor/issues/2)
> 
> And YES, I have the line you said in the log:
> 
> --- LOG ---
> 
> 2021-06-22T17:12:55.203059519Z condor_shadow[335]: ParallelShadow::shutDown, exitReason: 100
> 2021-06-22T17:12:55.203062945Z condor_shadow[335]: condor_read(): Socket closed when trying to read 21 bytes from startd at <10.42.0.139:41293>
> 2021-06-22T17:12:55.203066572Z condor_shadow[335]: IO: EOF reading packet header
> 2021-06-22T17:12:55.203069804Z condor_shadow[335]: ParallelShadow::shutDown, exitReason: 100
> 2021-06-22T17:12:55.203073090Z condor_schedd[129]: Got RELEASE_CLAIM from <10.42.0.139:34663>
> 2021-06-22T17:12:55.203076705Z condor_schedd[129]: Deleted match rec for slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> 2021-06-22T17:12:55.203080267Z condor_shadow[335]: Inside RemoteResource::updateFromStarter()
> 2021-06-22T17:12:55.204729750Z condor_collector[52]: Got INVALIDATE_STARTD_ADS
> 2021-06-22T17:12:55.204747252Z condor_collector[52]: #011#011**** Removed(1) ad(s): "< slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx , 10.42.0.139 >"
> 2021-06-22T17:12:55.204753022Z condor_collector[52]: (Invalidated 1 ads)
> 2021-06-22T17:12:55.204757222Z condor_collector[52]: #011#011**** Removed(1) ad(s): "< slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx , 10.42.0.139 >"
> 2021-06-22T17:12:55.204761256Z condor_collector[52]: (Invalidated 1 ads)
> 2021-06-22T17:12:55.204771895Z condor_collector[52]: In OfflineCollectorPlugin::update ( 13 )
> 2021-06-22T17:12:55.204693098Z condor_shadow[335]: Inside RemoteResource::updateFromStarter()
> 
> --- LOG ---
> 
> Thanks in advance,
> Dmitry.
> 
> 
> ----- Original Message -----
> From: "Jaime Frey" <jfrey@xxxxxxxxxxx>
> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
> Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
> Sent: Friday, June 25, 2021 7:59:44 PM
> Subject: Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId
> 
> Do you see this problem for every job or just some of the time?
> 
> It looks like you're running parallel universe jobs. Do you know if the same issue happens with vanilla universe jobs?
> 
> When the startd kills the claim after the job completes, it sends a RELEASE_CLAIM command to the schedd, which should prevent the schedd from attempting to reuse it. Do you see a message like this in the schedd log:
> 
> 06/25/21 10:50:36.627 Got RELEASE_CLAIM from <192.168.4.40:56731>
> 
> - Jaime
> 
>> On Jun 23, 2021, at 2:04 PM, Dmitry A. Golubkov via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>> 
>> Dear all,
>> 
>> It looks like an issue in HTCondor. If you set CLAIM_WORKLIFE = 0 and use partitionable slots, the execution of jobs will periodically hang.
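>> 
>> The relevant part of my startd configuration looks roughly like this (a sketch; the actual resource split on my machines differs):
>> 
>>   # one partitionable slot owning all resources; dynamic slots are carved out of it per job
>>   NUM_SLOTS_TYPE_1          = 1
>>   SLOT_TYPE_1               = 100%
>>   SLOT_TYPE_1_PARTITIONABLE = TRUE
>>   # release the claim as soon as the job that used it finishes
>>   CLAIM_WORKLIFE            = 0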
>> 
>> 
>> So, if you set CLAIM_WORKLIFE = 0 and use partitionable slots, everything goes well at first. HTCondor creates the dynamic slots, claims them, and starts the job. After execution the startd shuts down the claim because of CLAIM_WORKLIFE:
> ...
>> As you can see, the next job tries to use the claim "a7c85e11eaae4e09b2f4173a6d293e41bd457ebe", which must already be dead after the first run; as a result we get the error "Error: can't find resource with ClaimId". After this error, the job's state changes from RUN -> IDLE and the launch is postponed until the next attempt. After some time (10-15 minutes) this "wrong" claim disappears from "somewhere" and the job can be run successfully. Just to check, I commented out the handling of the DEACTIVATE_CLAIM command in the startd source code so that it does nothing when the shadow sends the command to the startd, and all my jobs then executed quickly without any of the problems described above.
>> 
>> I am not enough of an expert to solve this issue myself. Could I open an issue somewhere? Or maybe someone already has a patch? Or any ideas on how to fix this correctly? I would appreciate any help.
>> 
>> 
>> Thanks in advance,
>> Dmitry
>