
Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId



Dear all,

It looks like an issue in HTCondor itself: if you set CLAIM_WORKLIFE = 0 and use partitionable slots, job execution periodically hangs.
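
Roughly, the relevant part of the execute-node configuration looks like this (a minimal sketch; the slot-type lines are only for illustration and may differ from the real setup, the important part is CLAIM_WORKLIFE = 0):

    # one partitionable slot owning all resources (illustrative sketch only)
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = 100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE

    # shut idle claims down immediately, so every job gets a fresh
    # dynamic slot and a fresh claim
    CLAIM_WORKLIFE = 0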


With this configuration everything goes well at first: HTCondor creates the dynamic slots, claims them, and starts the job. After the job finishes, the startd shuts the idle claim down due to CLAIM_WORKLIFE:

--- LOG ---

2021-06-22T17:12:55.220289649Z condor_startd[122]: slot1_4: State change: starter exited
2021-06-22T17:12:55.220293082Z condor_startd[122]: slot1_4: Changing activity: Busy -> Idle
2021-06-22T17:12:55.220296864Z condor_startd[122]: slot1_4: publicClaimId: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#101#...
2021-06-22T17:12:55.220301122Z condor_startd[122]: slot1_4: Id: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#101#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]4a2a950a37efae30d785d9eb1fadce0ac0277289
2021-06-22T17:12:55.220306043Z condor_startd[122]: slot1_4: Computing claimWorklifeExpired(); ClaimAge=12, ClaimWorklife=0
2021-06-22T17:12:55.220310454Z condor_startd[122]: slot1_4: State change: idle claim shutting down due to CLAIM_WORKLIFE
2021-06-22T17:12:55.220314512Z condor_startd[122]: slot1_4: Changing state and activity: Claimed/Idle -> Preempting/Vacating
2021-06-22T17:12:55.220318342Z condor_startd[122]: slot1_4: publicClaimId: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#101#...
2021-06-22T17:12:55.220322782Z condor_startd[122]: slot1_4: Id: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#101#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]4a2a950a37efae30d785d9eb1fadce0ac0277289
2021-06-22T17:12:55.220331259Z condor_startd[122]: Entered vacate_client <10.42.0.145:38549?addrs=10.42.0.145-38549&alias=submit.pseven-htcondor&noUDP&sock=schedd_87_913c> 10.42.0.145...
2021-06-22T17:12:55.220335799Z condor_startd[122]: SharedPortClient: sent connection request to schedd at <10.42.0.145:38549> for shared port id schedd_87_913c
2021-06-22T17:12:55.220339955Z condor_startd[122]: slot1_4: Canceled ClaimLease timer (56)
2021-06-22T17:12:55.220343458Z condor_startd[122]: slot1_4: State change: No preempting claim, returning to owner
2021-06-22T17:12:55.220347405Z condor_startd[122]: slot1_4: Acquiring consumption policy configuration for slot type 1
2021-06-22T17:12:55.220350930Z condor_startd[122]: slot1_4: Changing state and activity: Preempting/Vacating -> Owner/Idle
2021-06-22T17:12:55.220354668Z condor_startd[122]: DBG: ClaimIdParser: setClaimId: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#220#f0b87011397f2d7df3e609d128d247b4800c2d00
2021-06-22T17:12:55.220359148Z condor_startd[122]: slot1_4: State change: IS_OWNER is false
2021-06-22T17:12:55.220362964Z condor_startd[122]: slot1_4: Changing state: Owner -> Unclaimed
2021-06-22T17:12:55.220366307Z condor_startd[122]: slot1_4: publicClaimId: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#220#...
2021-06-22T17:12:55.220371837Z condor_startd[122]: slot1_4: Id: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#220#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]f0b87011397f2d7df3e609d128d247b4800c2d00
2021-06-22T17:12:55.220376798Z condor_startd[122]: slot1_4: Changing state: Unclaimed -> Delete
2021-06-22T17:12:55.220380773Z condor_startd[122]: slot1_4: publicClaimId: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#220#...
2021-06-22T17:12:55.220385091Z condor_startd[122]: slot1_4: Id: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#220#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]f0b87011397f2d7df3e609d128d247b4800c2d00
2021-06-22T17:12:55.220881242Z condor_startd[122]: Trying to update collector <10.43.82.162:19618?alias=pseven-htcondormanager-svc.default.svc.cluster.local>
2021-06-22T17:12:55.220884317Z condor_startd[122]: Attempting to send update via TCP to collector pseven-htcondormanager-svc.default.svc.cluster.local <10.43.82.162:19618?alias=pseven-htcondormanager-svc.default.svc.cluster.local>
2021-06-22T17:12:55.222519373Z condor_startd[122]: slot1_4: Resource no longer needed, deleting
2021-06-22T17:12:55.222525752Z condor_startd[122]: slot1_4: unbind_DevIds for slot1.4 before :
2021-06-22T17:12:55.222528657Z condor_startd[122]: slot1_4: unbind_DevIds for slot1.4 after :

--- LOG ---

So far everything is fine. BUT! At some point the shadow starts its shutdown process and sends a DCStartd::deactivateClaim command to the startd at the same time as the startd is releasing the claim due to CLAIM_WORKLIFE (sorry for the long log excerpt):

--- LOG ---

2021-06-22T17:12:57.760263322Z condor_shadow[335]: ParallelShadow::shutDown, exitReason: 100
2021-06-22T17:12:57.760267446Z condor_shadow[335]: In handleJobRemoval, sig 0
2021-06-22T17:12:57.760270929Z condor_shadow[335]: Entering DCStartd::deactivateClaim(forceful)
2021-06-22T17:12:57.760273930Z condor_shadow[335]: SharedPortClient: sent connection request to <10.42.0.139:41293> for shared port id startd_86_e1a3
2021-06-22T17:12:57.760278579Z condor_shadow[335]: DCStartd::deactivateClaim: successfully sent command
2021-06-22T17:12:57.759582253Z condor_starter[192]: All jobs have exited... starter exiting
2021-06-22T17:12:57.760363434Z condor_shadow[335]: Killed starter (fast) at <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>
2021-06-22T17:12:57.760157886Z condor_startd[122]: DBG: get_by_cur_id: Start...
2021-06-22T17:12:57.760370715Z condor_shadow[335]: Entering DCStartd::deactivateClaim(forceful)
2021-06-22T17:12:57.760172320Z condor_startd[122]: DBG: get_by_cur_id: looking for resources...
2021-06-22T17:12:57.760723472Z condor_shadow[335]: SharedPortClient: sent connection request to <10.42.0.139:41293> for shared port id startd_86_e1a3
2021-06-22T17:12:57.760251417Z condor_startd[122]: slot1_1: Computing claimWorklifeExpired(); ClaimAge=14, ClaimWorklife=0
2021-06-22T17:12:57.760258352Z condor_startd[122]: Failed to write ToE tag to .job.ad file (13): Permission denied
2021-06-22T17:12:57.760262399Z condor_startd[122]: slot1_1: Called deactivate_claim_forcibly()
2021-06-22T17:12:57.761052512Z condor_startd[122]: slot1_1[1.0]: In Starter::kill() with pid 192, sig 3 (SIGQUIT)
2021-06-22T17:12:57.761064230Z condor_startd[122]: Send_Signal(): Doing kill(192,3) [SIGQUIT]
2021-06-22T17:12:57.761068993Z condor_startd[122]: slot1_1[1.0]: in starter:killHard starting kill timer
2021-06-22T17:12:57.761072065Z condor_startd[122]: slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
2021-06-22T17:12:57.761075116Z condor_startd[122]: slot1_1: publicClaimId: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#1#...
2021-06-22T17:12:57.761078464Z condor_startd[122]: slot1_1: Id: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#1#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]a7c85e11eaae4e09b2f4173a6d293e41bd457ebe
2021-06-22T17:12:57.761082050Z condor_startd[122]: slot1_1[1.0]: In Starter::kill() with pid 192, sig 15 (SIGTERM)
2021-06-22T17:12:57.761084851Z condor_startd[122]: Send_Signal(): Doing kill(192,15) [SIGTERM]
2021-06-22T17:12:57.761087567Z condor_startd[122]: slot1_1[1.0]: Using max vacate time of 600s for this job.
2021-06-22T17:12:57.761090388Z condor_startd[122]: slot1_1: Acquiring consumption policy configuration for slot type 1
2021-06-22T17:12:57.761093580Z condor_startd[122]: slot1_1: Skipping refresh of .update.ad because Starter is alive, but sent final update
2021-06-22T17:12:57.761096271Z condor_startd[122]: Trying to update collector <10.43.82.162:19618?alias=pseven-htcondormanager-svc.default.svc.cluster.local>
2021-06-22T17:12:57.761099572Z condor_startd[122]: Attempting to send update via TCP to collector pseven-htcondormanager-svc.default.svc.cluster.local <10.43.82.162:19618?alias=pseven-htcondormanager-svc.default.svc.cluster.local>
2021-06-22T17:12:57.761263225Z condor_startd[122]: slot1_1: Sent update to 1 collector(s)
2021-06-22T17:12:57.761275091Z condor_starter[192]: **** condor_starter (condor_STARTER) pid 192 EXITING WITH STATUS 0
2021-06-22T17:12:57.761795251Z condor_startd[122]: Error: problem finding resource for 404 (DEACTIVATE_CLAIM_FORCIBLY)
2021-06-22T17:12:57.762463529Z condor_collector[52]: StartdAd     : Updating ... "< slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx , 10.42.0.139 >"
2021-06-22T17:12:57.763485324Z condor_shadow[335]: condor_read(): Socket closed when trying to read 21 bytes from <10.42.0.139:41293>
2021-06-22T17:12:57.763366546Z condor_collector[52]: StartdPvtAd  : Updating ... "< slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx , 10.42.0.139 >"
2021-06-22T17:12:57.763507677Z condor_startd[122]: Starter pid 192 exited with status 0
2021-06-22T17:12:57.763519936Z condor_startd[122]: Saving classad to history file
2021-06-22T17:12:57.764320433Z condor_startd[122]: slot1_1[1.0]: Canceled hardkill-starter timer (77)
2021-06-22T17:12:57.764330659Z condor_startd[122]: slot1_1[1.0]: Canceled softkill-starter timer (78)
2021-06-22T17:12:57.764334287Z condor_procd[120]: PROC_FAMILY_KILL_FAMILY
2021-06-22T17:12:57.764337131Z condor_procd[120]: taking a snapshot...
2021-06-22T17:12:57.764340233Z condor_procd[120]: ProcAPI: read 8 pid entries out of 71 total entries in /proc
2021-06-22T17:12:57.764343668Z condor_procd[120]: process 192 (of family 192) has exited
2021-06-22T17:12:57.764346763Z condor_procd[120]: process 267 (of family 198) has exited
2021-06-22T17:12:57.764349926Z condor_procd[120]: watcher 192 of family with root 198 has died; family removed
2021-06-22T17:12:57.764353234Z condor_procd[120]: ...snapshot complete
2021-06-22T17:12:57.764355938Z condor_procd[120]: sending signal 9 to family with root 192
2021-06-22T17:12:57.765880534Z condor_startd[122]: slot1_1: State change: starter exited
2021-06-22T17:12:57.765891139Z condor_startd[122]: slot1_1: Canceled ClaimLease timer (46)
2021-06-22T17:12:57.765894152Z condor_startd[122]: slot1_1: State change: No preempting claim, returning to owner
2021-06-22T17:12:57.765896594Z condor_startd[122]: slot1_1: Acquiring consumption policy configuration for slot type 1
2021-06-22T17:12:57.765898918Z condor_startd[122]: slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
2021-06-22T17:12:57.765904362Z condor_startd[122]: slot1_1: State change: IS_OWNER is false
2021-06-22T17:12:57.765913477Z condor_startd[122]: slot1_1: Changing state: Owner -> Unclaimed
2021-06-22T17:12:57.765921368Z condor_startd[122]: slot1_1: publicClaimId: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#221#...
2021-06-22T17:12:57.765924850Z condor_startd[122]: slot1_1: Id: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#221#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]bb5347b48e21a6835014d7d9b660d65a176ded26
2021-06-22T17:12:57.765871544Z condor_collector[52]: Got INVALIDATE_STARTD_ADS
2021-06-22T17:12:57.765888681Z condor_collector[52]: #011#011**** Removed(1) ad(s): "< slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx , 10.42.0.139 >"
2021-06-22T17:12:57.765894741Z condor_collector[52]: (Invalidated 1 ads)
2021-06-22T17:12:57.765928063Z condor_startd[122]: slot1_1: Changing state: Unclaimed -> Delete
2021-06-22T17:12:57.765930330Z condor_startd[122]: slot1_1: publicClaimId: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#221#...
2021-06-22T17:12:57.765898622Z condor_collector[52]: #011#011**** Removed(1) ad(s): "< slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx , 10.42.0.139 >"
2021-06-22T17:12:57.765902534Z condor_collector[52]: (Invalidated 1 ads)
2021-06-22T17:12:57.765905981Z condor_collector[52]: In OfflineCollectorPlugin::update ( 13 )
2021-06-22T17:12:57.764170335Z condor_shadow[335]: IO: EOF reading packet header
2021-06-22T17:12:57.764188766Z condor_shadow[335]: FAILED to get number of expressions.
2021-06-22T17:12:57.764192919Z condor_shadow[335]: RemoteResource::killStarter(): Could not send command to startd
2021-06-22T17:12:57.764196106Z condor_shadow[335]: Entering DCStartd::deactivateClaim(forceful)
2021-06-22T17:12:57.764199236Z condor_shadow[335]: SharedPortClient: sent connection request to <10.42.0.139:41293> for shared port id startd_86_e1a3
2021-06-22T17:12:57.765932991Z condor_startd[122]: slot1_1: Id: <10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#221#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]bb5347b48e21a6835014d7d9b660d65a176ded26
[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]1e5a411da7ad59e73c2d67f2870dd31dbef25215)
[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]bb5347b48e21a6835014d7d9b660d65a176ded26)
[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]bb5347b48e21a6835014d7d9b660d65a176ded26)
2021-06-22T17:12:57.765966071Z condor_startd[122]: Trying to update collector <10.43.82.162:19618?alias=pseven-htcondormanager-svc.default.svc.cluster.local>
2021-06-22T17:12:57.765970983Z condor_startd[122]: Attempting to send update via TCP to collector pseven-htcondormanager-svc.default.svc.cluster.local <10.43.82.162:19618?alias=pseven-htcondormanager-svc.default.svc.cluster.local>
2021-06-22T17:12:57.765979753Z condor_startd[122]: slot1_1: Resource no longer needed, deleting
2021-06-22T17:12:57.765983947Z condor_startd[122]: slot1_1: unbind_DevIds for slot1.1 before :
2021-06-22T17:12:57.765986116Z condor_startd[122]: slot1_1: unbind_DevIds for slot1.1 after :

--- LOG ---

The ClaimId we are interested in is "...pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#1#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]a7c85e11eaae4e09b2f4173a6d293e41bd457ebe". I don't know why, but in this particular case the claim data is kept somewhere and HTCondor tries to use it for the next run:

--- NEXT RUN LOG ---

2021-06-22T17:13:10.354529594Z condor_startd[122]: command_activate_claim call, 444 ACTIVATE_CLAIM
2021-06-22T17:13:10.354546664Z condor_startd[122]: DBG: command_activate_claim: id: (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#1#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]a7c85e11eaae4e09b2f4173a6d293e41bd457ebe)
2021-06-22T17:13:10.354552318Z condor_startd[122]: DBG: get_by_cur_id: Start...
2021-06-22T17:13:10.354556050Z condor_startd[122]: DBG: get_by_cur_id: looking for resources...
2021-06-22T17:13:10.354563134Z condor_startd[122]: DBG: get_by_cur_id: matching: (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#236#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]08a961f866fb19fb999b5ed0176661e934e6ae59) with (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#1#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]a7c85e11eaae4e09b2f4173a6d293e41bd457ebe)
2021-06-22T17:13:10.354569175Z condor_startd[122]: DBG: get_by_cur_id: matching: (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#102#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]a935f039302ec5b1cd9a566b3cd79035d0b3dc4a) with (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#1#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]a7c85e11eaae4e09b2f4173a6d293e41bd457ebe)
2021-06-22T17:13:10.354576347Z condor_startd[122]: DBG: get_by_cur_id: matching: (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#103#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]9ecb93629d688a63ac46e50293f81d5d259475de) with (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#1#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]a7c85e11eaae4e09b2f4173a6d293e41bd457ebe)
2021-06-22T17:13:10.354598715Z condor_startd[122]: DBG: get_by_cur_id: matching: (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#104#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]c4e8ba34c672882c943c6951d2900d30501c82b3) with (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#1#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]a7c85e11eaae4e09b2f4173a6d293e41bd457ebe)
2021-06-22T17:13:10.354604664Z condor_startd[122]: DBG: get_by_cur_id: matching: (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#105#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]74c33aacb5483a42c69a6f420fda0c88eb5bd9dd) with (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#1#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]a7c85e11eaae4e09b2f4173a6d293e41bd457ebe)
2021-06-22T17:13:10.354609972Z condor_startd[122]: Error: can't find resource with ID (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#1#[CryptoMethods="BLOWFISH";Encryption="NO";Integrity="YES";]a7c85e11eaae4e09b2f4173a6d293e41bd457ebe) for 444 (ACTIVATE_CLAIM)
2021-06-22T17:13:10.354613515Z condor_startd[122]: Error: can't find resource with ClaimId (<10.42.0.139:41293?addrs=10.42.0.139-41293&alias=pseven-htcondorexecute-deploy-746cd664b9-pp7pg.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624381701#1#...) for 444 (ACTIVATE_CLAIM)
2021-06-22T17:13:10.354617079Z condor_startd[122]: Failed to read end of message from <10.42.0.145:38731>; 1794 untouched bytes.

--- NEXT RUN LOG ---


As you can see, the next job tries to use the claim "a7c85e11eaae4e09b2f4173a6d293e41bd457ebe", which should already be dead after the first run, and as a result we get the error "Error: can't find resource with ClaimId". After this error the job's state changes from RUN back to IDLE and the launch is postponed until the next attempt. After some time (10-15 minutes) this "wrong" claim disappears from "somewhere" and the job runs successfully. Just to check, I changed the startd source code so that the DEACTIVATE_CLAIM command does nothing when the shadow sends it, and all my jobs then ran quickly without any of the problems described above.

I am not enough of an expert to solve this myself. Can I open an issue somewhere? Does anyone already have a patch, or any ideas on how to fix this correctly? Any help would be appreciated.


Thanks in advance,
Dmitry

----- Original Message -----
From: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
To: "Todd L Miller" <tlmiller@xxxxxxxxxxx>
Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>, "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, June 22, 2021 11:18:13 AM
Subject: Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId

Dear Todd,

> Stupid question, but did you set CLAIM_WORKLIFE to 0 on both the schedd and the startd?

Yes. I set CLAIM_WORKLIFE = 0 on both sides.
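
For reference, the values can be confirmed against the running daemons with condor_config_val (the execute host name below is just a placeholder):

    condor_config_val -schedd CLAIM_WORKLIFE                        # asks the running schedd
    condor_config_val -name <execute-host> -startd CLAIM_WORKLIFE   # asks the running startd

Both report 0 here.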

Currently, I see the following:

Before the start, the schedd creates the correct allocations:

2021-06-21T18:44:44.368923441Z condor_schedd[128]: Allocation for job 2.0, nprocs: 5
2021-06-21T18:44:44.368951374Z condor_schedd[128]: 2.0.0: LINUX#011X86_64#011slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, "<10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#102#..."
2021-06-21T18:44:44.368959602Z condor_schedd[128]: 2.1.0: LINUX#011X86_64#011slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, "<10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#100#..."
2021-06-21T18:44:44.368979457Z condor_schedd[128]: 2.2.0: LINUX#011X86_64#011slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, "<10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#10#..."
2021-06-21T18:44:44.368993495Z condor_schedd[128]: 2.3.0: LINUX#011X86_64#011slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, "<10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#11#..."
2021-06-21T18:44:44.369010378Z condor_schedd[128]: 2.4.0: LINUX#011X86_64#011slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, "<10.42.0.199:46325?addrs=10.42.0.199-46325&alias=pseven-htcondorexecute-deploy-c6cb554f7-mnt4g.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624300884#102#..."

Then the shadow asks the schedd for its matches, and the last match it gets back contains a wrong claim id (#1624300884#1# where it should be #1624300884#102#):

2021-06-21T18:44:50.575922474Z condor_shadow[511]: Got 1 matches for proc # 0
2021-06-21T18:44:50.575937537Z condor_shadow[511]: Got host: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c> id: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#102#...
2021-06-21T18:44:50.575953422Z condor_shadow[511]: Got 1 matches for proc # 1
2021-06-21T18:44:50.575957549Z condor_shadow[511]: Got host: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c> id: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#100#...
2021-06-21T18:44:50.575961975Z condor_shadow[511]: in RemoteResource::initStartdInfo()
2021-06-21T18:44:50.575964948Z condor_shadow[511]: Got 1 matches for proc # 2
2021-06-21T18:44:50.575968013Z condor_shadow[511]: Got host: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c> id: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#10#...
2021-06-21T18:44:50.575999725Z condor_shadow[511]: in RemoteResource::initStartdInfo()
2021-06-21T18:44:50.576911403Z condor_shadow[511]: Got 1 matches for proc # 3
2021-06-21T18:44:50.576924099Z condor_shadow[511]: Got host: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c> id: <10.42.0.207:40533?addrs=10.42.0.207-40533&alias=pseven-htcondorexecute-deploy-c6cb554f7-drgp6.pseven-htcondor&noUDP&sock=startd_87_913c>#1624300886#11#...
2021-06-21T18:44:50.576939208Z condor_shadow[511]: in RemoteResource::initStartdInfo()
2021-06-21T18:44:50.577602632Z condor_shadow[511]: Got 1 matches for proc # 4
2021-06-21T18:44:50.577613974Z condor_shadow[511]: Got host: <10.42.0.199:46325?addrs=10.42.0.199-46325&alias=pseven-htcondorexecute-deploy-c6cb554f7-mnt4g.pseven-htcondor&noUDP&sock=startd_86_e1a3> id: <10.42.0.199:46325?addrs=10.42.0.199-46325&alias=pseven-htcondorexecute-deploy-c6cb554f7-mnt4g.pseven-htcondor&noUDP&sock=startd_86_e1a3>#1624300884#1#...


I am still trying to understand why this happens. Any ideas?

Thanks in advance,
Dmitry.


----- Original Message -----
From: "Todd L Miller" <tlmiller@xxxxxxxxxxx>
To: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
Cc: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Monday, June 21, 2021 11:47:28 PM
Subject: Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId

> There is one negotiator. After some prints in the source code, at now I 
> understand the following: the schedd executes the first job on dynamic 
> slots, CLAIM_WORKLIFE = 0 in my configuration (to re-create slots each 
> time), but the schedd tries to execute the next job on already expired 
> slots, why it does so, I'm still investigating.

 	Stupid question, but did you set CLAIM_WORKLIFE to 0 on both the 
schedd and the startd?  (I don't remember if the schedd is supposed to 
bounce, or if it's supposed to know better than to try, actually, but if 
it can't know...)

- ToddM