[HTCondor-users] Trouble Releasing COD Claims



Hello Condor Experts,

I am trying to use the Condor "Compute On Demand" (COD) functionality to run a job, and am having trouble releasing the claim after the job completes. I can request and activate a claim successfully using both the condor_cod tool and the Python bindings' Claim class. If I use the Python bindings and still have the Claim object around after the job is finished, I can call claim.release() to successfully release the claim.

However, I run into issues when trying to release the claim later from a separate process. If I try to release the claim using the condor_cod tool, I get the following error:
(HTCondor v8.8.3, running the command from the host that has the claim.)

dev-exechost-01 [no job set:~] 149% condor_cod release -id "<100.110.25.193:9618>#1563927560#415#cd7bebdd59491a01553804b1cda5cb86939bc09f"
Attempt to send CA_RELEASE_CLAIM to startd <100.110.25.193:9618> failed
AUTHENTICATE:1002:Failure performing handshake

And I see the following in the SharedPortLog on that host in response:

10/09/19 14:58:42 DaemonCommandProtocol: Not enough bytes are ready for read.
10/09/19 14:58:42 SharedPortServer: Passing a request from <100.110.25.193:44569> for command 1000 to ID collector.
10/09/19 14:58:42 SharedPortServer: server was busy, failed to connect collector as requested by <100.110.25.193:44569>: primary (fc41ae4b192bf846de08119c9a81c47579587046fc0ee86597e574a317a5e71b/collector): Connection refused (111); alt (/opt/condor/lock/condor/daemon_sock/collector): Connection refused (111)

If I try to use the Python bindings later, I have trouble re-creating the Claim object in a way that allows releasing the COD claim. (This is the only Startd in this dev pool, and I had created several other COD claims on it beforehand that had not been released.)

>>> import htcondor
>>> col = htcondor.Collector()
>>> startds = col.query(htcondor.AdTypes.Startd)
>>> private_startds = col.query(htcondor.AdTypes.StartdPrivate)
>>> len(startds)
1
>>> len(private_startds)
1

>>> claim = htcondor.Claim(private_startds[0])
>>> claim
Claim <100.110.25.193:9618>#1563927560#430#[CryptoMethods="3DES";Encryption="NO";Integrity="NO";]3c0aa3f8fad4113469b677c0f26cc408c3290534
>>> claim.release()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Startd failed to release claim.

>>> claim = htcondor.Claim(startds[0])
>>> claim
Unclaimed startd at <100.110.25.193:9618?addrs=100.110.25.193-9618&noUDP&sock=16686_c95e_3>
>>> claim.release()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: No claim set for object.

>>> claim.requestCOD()
>>> claim
Claim <100.110.25.193:9618>#1570839909#20#8882a40347d9f999410e5fdc25d0278ca3a31cec
>>> claim.release()
>>>
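
For reference, the claim ID strings shown in this message all seem to share the same layout: an address in angle brackets, two numeric fields, and a hex string, separated by '#', with an optional bracketed block of security settings in front of the hex string. Below is a minimal, stdlib-only sketch for pulling such a string apart so the IDs can be compared. The field names (address, startd_birth, sequence, session_info, secret) are my own labels based on what the strings look like, not official HTCondor terminology, and parse_claim_id is a hypothetical helper, not part of the bindings.

```python
import re

# Layout inferred from the claim IDs printed above (an assumption):
#   <address>#number#number#[optional session info]hexstring
CLAIM_ID_RE = re.compile(
    r"^(?P<address><[^>]+>)"
    r"#(?P<startd_birth>\d+)"
    r"#(?P<sequence>\d+)"
    r"#(?P<session_info>\[[^\]]*\])?"
    r"(?P<secret>[0-9a-f]+)$"
)


def parse_claim_id(text):
    """Split a claim ID string into labeled fields (hypothetical helper)."""
    text = text.strip()
    # The Claim object prints as "Claim <...>", so tolerate that prefix.
    if text.startswith("Claim "):
        text = text[len("Claim "):]
    match = CLAIM_ID_RE.match(text)
    if match is None:
        raise ValueError("unrecognized claim ID: %r" % text)
    fields = match.groupdict()
    fields["startd_birth"] = int(fields["startd_birth"])
    fields["sequence"] = int(fields["sequence"])
    return fields
```

Run on the strings above, this reports a session_info block only for the ID built from the StartdPrivate ad; the two COD claim IDs have none.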

Is there a way to initialize a new Claim object for an existing COD claim, so I can release it? Or is there a better way of doing this?
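
To make the question concrete, this is roughly what I have in mind. It is an untested sketch that rests on one assumption I have not confirmed: that htcondor.Claim takes the startd address from the ad's MyAddress attribute and the claim from its ClaimId attribute. That would be consistent with the session above, where the public Startd ad (which carries no claim) produced an "Unclaimed startd" object, but I have not verified it. The helper names build_claim_ad_fields and release_saved_cod_claim are hypothetical, and the COD claim ID would have to be saved by the process that originally requested the claim.

```python
def build_claim_ad_fields(startd_address, cod_claim_id):
    """Return the attributes a Claim is ASSUMED to read from its ad.

    startd_address: the full address from the public Startd ad,
                    including any shared-port "sock=" part.
    cod_claim_id:   the COD claim ID saved when the claim was requested.
    """
    cod_claim_id = cod_claim_id.strip()
    # Tolerate the "Claim <...>" form that the Claim object prints.
    if cod_claim_id.startswith("Claim "):
        cod_claim_id = cod_claim_id[len("Claim "):]
    if not cod_claim_id.startswith("<"):
        raise ValueError("does not look like a claim ID: %r" % cod_claim_id)
    return {"MyAddress": startd_address, "ClaimId": cod_claim_id}


def release_saved_cod_claim(startd_address, cod_claim_id):
    """Try to rebuild a Claim for an existing COD claim and release it."""
    # Imported here so the helper above works without the bindings installed.
    import classad
    import htcondor

    ad = classad.ClassAd(build_claim_ad_fields(startd_address, cod_claim_id))
    claim = htcondor.Claim(ad)  # assumption: uses MyAddress and ClaimId
    claim.release()
```

Here startd_address would come from startds[0]["MyAddress"] in the collector query above rather than from the bare address embedded in the claim ID, since the latter has no shared-port socket name.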

I'd appreciate any feedback.

Thanks,
Collin
--
Collin Mehring | PE-JoSE - Software Engineer