
Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId



Dear Todd,

> Stupid question, but did you set CLAIM_WORKLIFE to 0 on both the schedd and the startd?

Yes. I set CLAIM_WORKLIFE = 0 on both sides.

Currently, I see the following:

Before the job starts, the schedd creates the correct allocations:

2021-06-21T18:44:44.368923441Z condor_schedd[128]: Allocation for job 2.0, nprocs: 5
2021-06-21T18:44:44.368951374Z condor_schedd[128]: 2.0.0: LINUX#011X86_64#011slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, "<10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#102#..."
2021-06-21T18:44:44.368959602Z condor_schedd[128]: 2.1.0: LINUX#011X86_64#011slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, "<10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#100#..."
2021-06-21T18:44:44.368979457Z condor_schedd[128]: 2.2.0: LINUX#011X86_64#011slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, "<10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#10#..."
2021-06-21T18:44:44.368993495Z condor_schedd[128]: 2.3.0: LINUX#011X86_64#011slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, "<10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#11#..."
2021-06-21T18:44:44.369010378Z condor_schedd[128]: 2.4.0: LINUX#011X86_64#011slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, "<10.42.0.199:46325?addrs=10.42.0.199-46325&alias=pseven-htcondorexecute-deploy-c6cb554f7-mnt4g.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624300884#102#..."

Then the shadow asks the schedd for matches, and the last match it receives contains the wrong claim ID (#1624300884#1# where it should be #1624300884#102#):

2021-06-21T18:44:50.575922474Z condor_shadow[511]: Got 1 matches for proc # 0
2021-06-21T18:44:50.575937537Z condor_shadow[511]: Got host: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c> id: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#102#...
2021-06-21T18:44:50.575953422Z condor_shadow[511]: Got 1 matches for proc # 1
2021-06-21T18:44:50.575957549Z condor_shadow[511]: Got host: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c> id: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#100#...
2021-06-21T18:44:50.575961975Z condor_shadow[511]: in RemoteResource::initStartdInfo()
2021-06-21T18:44:50.575964948Z condor_shadow[511]: Got 1 matches for proc # 2
2021-06-21T18:44:50.575968013Z condor_shadow[511]: Got host: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c> id: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#10#...
2021-06-21T18:44:50.575999725Z condor_shadow[511]: in RemoteResource::initStartdInfo()
2021-06-21T18:44:50.576911403Z condor_shadow[511]: Got 1 matches for proc # 3
2021-06-21T18:44:50.576924099Z condor_shadow[511]: Got host: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c> id: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#11#...
2021-06-21T18:44:50.576939208Z condor_shadow[511]: in RemoteResource::initStartdInfo()
2021-06-21T18:44:50.577602632Z condor_shadow[511]: Got 1 matches for proc # 4
2021-06-21T18:44:50.577613974Z condor_shadow[511]: Got host: <10.42.0.199:46325?addrs=10.42.0.199-46325&alias=pseven-htcondorexecute-deploy-c6cb554f7-mnt4g.pseven-htcondor&noUDP&sock=startd_86_e1a3> id: <10.42.0.199:46325?addrs=10.42.0.199-46325&alias=pseven-htcondorexecute-deploy-c6cb554f7-mnt4g.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624300884#1#...
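
The comparison between the two excerpts can also be done mechanically. What follows is only a minimal sketch, not part of HTCondor: the regular expressions are written against the line shapes shown in these excerpts, the function names are hypothetical, and it assumes that the middle number in a schedd allocation line (the 4 in "2.4.0:") corresponds to the shadow's "proc # 4". It compares only the two numeric fields that follow the startd address in each claim ID.

```python
import re

# Patterns are guesses based on the log lines in this message only.
# Schedd allocation line: '<cluster>.<proc>.<n>: ... "<addr>#<field1>#<field2>#..."'
SCHEDD_RE = re.compile(
    r'condor_schedd\[\d+\]: \d+\.(\d+)\.\d+: .*"<[^>]*>#(\d+)#(\d+)#'
)
# Shadow header line: 'Got <n> matches for proc # <proc>'
SHADOW_PROC_RE = re.compile(
    r'condor_shadow\[\d+\]: Got \d+ matches for proc # (\d+)'
)
# Shadow match line: 'Got host: <addr> id: <addr>#<field1>#<field2>#...'
SHADOW_ID_RE = re.compile(
    r'condor_shadow\[\d+\]: Got host: .* id: <[^>]*>#(\d+)#(\d+)#'
)


def claim_prefixes(lines):
    """Return ({proc: (field1, field2)} from schedd lines, same from shadow lines)."""
    schedd, shadow = {}, {}
    current_proc = None  # proc announced by the latest "Got N matches" line
    for line in lines:
        m = SCHEDD_RE.search(line)
        if m:
            schedd[int(m.group(1))] = (m.group(2), m.group(3))
            continue
        m = SHADOW_PROC_RE.search(line)
        if m:
            current_proc = int(m.group(1))
            continue
        m = SHADOW_ID_RE.search(line)
        if m and current_proc is not None:
            shadow[current_proc] = (m.group(1), m.group(2))
    return schedd, shadow


def mismatches(lines):
    """Return {proc: (schedd_fields, shadow_fields)} where the two sides differ."""
    schedd, shadow = claim_prefixes(lines)
    common = sorted(schedd.keys() & shadow.keys())
    return {p: (schedd[p], shadow[p]) for p in common if schedd[p] != shadow[p]}
```

Passing the lines of both excerpts to mismatches() should report only proc 4, where the schedd side has 1624300884/102 and the shadow side has 1624300884/1.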


I am trying to understand why this happens. Any ideas?

Thanks in advance,
Dmitry.


----- Original Message -----
From: "Todd L Miller" <tlmiller@xxxxxxxxxxx>
To: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
Cc: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Monday, June 21, 2021 11:47:28 PM
Subject: Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId

> There is one negotiator. After some prints in the source code, at now I 
> understand the following: the schedd executes the first job on dynamic 
> slots, CLAIM_WORKLIFE = 0 in my configuration (to re-create slots each 
> time), but the schedd tries to execute the next job on already expired 
> slots, why it does so, I'm still investigating.

 	Stupid question, but did you set CLAIM_WORKLIFE to 0 on both the 
schedd and the startd?  (I don't remember if the schedd is supposed to 
bounce, or if it's supposed to know better than to try, actually, but if 
it can't know...)

- ToddM