
Re: [HTCondor-users] Trouble Releasing COD Claims



I was able to get this to work by manually adding the ClaimId I wanted to release to the Startd ad before initializing a new Claim object from it.

Example:
>>> import htcondor
>>> col = htcondor.Collector()
>>> startds = col.query(htcondor.AdTypes.Startd)
>>> len(startds)
1
>>> claim = htcondor.Claim(startds[0])
>>> claim
Unclaimed startd at <100.110.25.193:9618?addrs=100.110.25.193-9618&noUDP&sock=16686_c95e_3>
>>> claim.requestCOD()
>>> claim
Claim <100.110.25.193:9618>#1570839909#24#929a4e5d8a58f6c9ff67f6c92d10521c569dde7b
>>> test_startd = startds[0]
>>> test_startd['ClaimId'] = "<100.110.25.193:9618>#1570839909#24#929a4e5d8a58f6c9ff67f6c92d10521c569dde7b"
>>> test_claim = htcondor.Claim(test_startd)
>>> test_claim.release()
>>>

This leaves the original 'claim' object in a confused state, but I don't plan on keeping it around anyway. The Startd looks normal so far after doing this, but it could potentially have other side effects I'm not yet aware of.
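
For anyone who wants to reuse this from a separate process, the steps above could be wrapped in a small helper. This is only a sketch: release_cod_claim and its claim_cls parameter are names made up here and are not part of the Python bindings. It relies only on the htcondor.Claim(ad) constructor and the release() call shown in the session above, and it modifies the ad passed in.

```python
def release_cod_claim(startd_ad, claim_id, claim_cls=None):
    """Release an existing COD claim given only its claim ID string.

    Hypothetical helper, not part of the htcondor module. startd_ad is a
    Startd ad as returned by Collector.query(); it is modified in place.
    """
    if claim_cls is None:
        # Imported lazily so the helper can be exercised without a pool.
        import htcondor
        claim_cls = htcondor.Claim

    # Inject the ID of the claim to release, as in the session above.
    startd_ad["ClaimId"] = claim_id

    # Build a Claim from the modified ad and release it.
    claim = claim_cls(startd_ad)
    claim.release()
    return claim
```

Usage would look something like release_cod_claim(col.query(htcondor.AdTypes.Startd)[0], saved_claim_id), where saved_claim_id is the claim ID string recorded when the claim was requested. The same caveat about possible side effects applies.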

Best,
Collin

On Mon, Oct 14, 2019 at 12:32 PM Collin Mehring <collin.mehring@xxxxxxxxxxxxxx> wrote:
Hello Condor Experts,

I am trying to use the Condor "Compute On Demand" functionality to run a job, and am having trouble releasing the claim after the job completes. I can request and activate a claim successfully using both the condor_cod tool and the Python bindings Claim class. If I use the Python bindings and still have the Claim object around after the job is finished, I can call claim.release() to successfully release the claim.

However, I run into issues trying to release the claim later from a separate process. If I try to release the claim using the condor_cod tool I get the following error:
(HTCondor v8.8.3, running command from the host that has the claim.)

dev-exechost-01 [no job set:~] 149% condor_cod release -id "<100.110.25.193:9618>#1563927560#415#cd7bebdd59491a01553804b1cda5cb86939bc09f"
Attempt to send CA_RELEASE_CLAIM to startd <100.110.25.193:9618> failed
AUTHENTICATE:1002:Failure performing handshake

And I see the following in the SharedPortLog on that host in response:

10/09/19 14:58:42 DaemonCommandProtocol: Not enough bytes are ready for read.
10/09/19 14:58:42 SharedPortServer: Passing a request from <100.110.25.193:44569> for command 1000 to ID collector.
10/09/19 14:58:42 SharedPortServer: server was busy, failed to connect collector as requested by <100.110.25.193:44569>: primary (fc41ae4b192bf846de08119c9a81c47579587046fc0ee86597e574a317a5e71b/collector): Connection refused (111); alt (/opt/condor/lock/condor/daemon_sock/collector): Connection refused (111)

If I try to use the Python bindings later, I have trouble re-creating the Claim object in a way that allows releasing the COD claim. (This is the only Startd in this dev pool, and I had created several other COD claims on it beforehand that had not been released.)

>>> import htcondor
>>> col = htcondor.Collector()
>>> startds = col.query(htcondor.AdTypes.Startd)
>>> private_startds = col.query(htcondor.AdTypes.StartdPrivate)
>>> len(startds)
1
>>> len(private_startds)
1

>>> claim = htcondor.Claim(private_startds[0])
>>> claim
Claim <100.110.25.193:9618>#1563927560#430#[CryptoMethods="3DES";Encryption="NO";Integrity="NO";]3c0aa3f8fad4113469b677c0f26cc408c3290534
>>> claim.release()
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
RuntimeError: Startd failed to release claim.

>>> claim = htcondor.Claim(startds[0])
>>> claim
Unclaimed startd at <100.110.25.193:9618?addrs=100.110.25.193-9618&noUDP&sock=16686_c95e_3>
>>> claim.release()
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ValueError: No claim set for object.

>>> claim.requestCOD()
>>> claim
Claim <100.110.25.193:9618>#1570839909#20#8882a40347d9f999410e5fdc25d0278ca3a31cec
>>> claim.release()
>>>

Is there a way to initialize a new Claim object for an existing COD claim, so I can release it? Or is there a better way of doing this?

I'd appreciate any feedback.

Thanks,
Collin
--
Collin Mehring | PE-JoSE - Software Engineer


