
Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId



Do you see this problem for every job or just some of the time?

It looks like you're running parallel universe jobs. Do you know if the same issue happens with vanilla universe jobs?

When the startd kills the claim after the job completes, it sends a RELEASE_CLAIM command to the schedd, which should prevent the schedd from attempting to reuse it. Do you see a message like this in the schedd log:

06/25/21 10:50:36.627 Got RELEASE_CLAIM from <192.168.4.40:56731>
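
To check for that message across a whole log, a small script along these lines may help. It is only a sketch: the regular expression is modeled on the single sample line above, so other HTCondor versions or logging settings may need a different pattern, and the function name find_release_claims is made up for this example.

```python
import re

# Pattern modeled on the one sample line quoted above (an assumption):
# "MM/DD/YY HH:MM:SS.mmm Got RELEASE_CLAIM from <host:port>"
RELEASE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}(?:\.\d+)?) "
    r".*Got RELEASE_CLAIM from <(?P<addr>[^>]+)>"
)


def find_release_claims(lines):
    """Return (date, time, sender address) for every RELEASE_CLAIM line."""
    hits = []
    for line in lines:
        match = RELEASE_RE.search(line)
        if match:
            hits.append(
                (match.group("date"), match.group("time"), match.group("addr"))
            )
    return hits
```

Passing an open schedd log file to find_release_claims() gives one tuple per matching line, which can then be compared against the time at which the startd log reports the claim being shut down.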

 - Jaime

> On Jun 23, 2021, at 2:04 PM, Dmitry A. Golubkov via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
> 
> Dear all,
> 
> It looks like an issue in HTCondor. If you set CLAIM_WORKLIFE = 0 and use partitionable slots, job execution periodically hangs.
> 
> 
> With CLAIM_WORKLIFE = 0 and partitionable slots, everything goes well at first: HTCondor creates the slots, claims them, and starts the job. After the job finishes, the startd shuts down the claim due to CLAIM_WORKLIFE:
...
> As you can see, the next job tries to use the claim "a7c85e11eaae4e09b2f4173a6d293e41bd457ebe", which should already be dead after the first run, and as a result we get the error "Error: can't find resource with ClaimId". After this error, startd changes the state of the job from RUN -> IDLE and postpones the launch until the next attempt. After some time (10-15 minutes) this "wrong" claim disappears from "somewhere" and the job can run successfully. Just to check, I changed the startd source code so that the DEACTIVATE_CLAIM command does nothing when the shadow sends it to the startd, and all my jobs then ran quickly without any of the problems described above.
> 
> I am not expert enough to solve this issue myself. Could I open an issue somewhere? Does someone perhaps have a patch, or any ideas on how to fix this correctly? I would appreciate any help.
> 
> 
> Thanks in advance,
> Dmitry