
[HTCondor-users] Python Bindings crash without exception when remotely holding jobs



Hi All,

The system I'm working on inspects job queues from various HTCondor instances and provisions cloud resources to run the jobs.

As part of this process, jobs that do not meet certain conditions are held. A list of those jobs is compiled and then held using:

condor_session.act(htcondor.JobAction.Hold, held_job_ids)
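
As a rough illustration of the "compile a list" step, here is a minimal sketch. It assumes each job is available as a dict-like ad with ClusterId, ProcId and Owner attributes, and that the condition is a check of the owner against a known set of users. The helper name select_jobs_to_hold and the valid_users set are hypothetical and are not taken from the real poller.

```python
def select_jobs_to_hold(job_ads, valid_users):
    """Return "ClusterId.ProcId" strings for jobs whose Owner is not allowed.

    job_ads and valid_users are hypothetical inputs used for illustration.
    """
    held_job_ids = []
    for ad in job_ads:
        if ad.get("Owner") not in valid_users:
            held_job_ids.append("%d.%d" % (ad["ClusterId"], ad["ProcId"]))
    return held_job_ids


# Made-up job ads standing in for the result of a queue query.
jobs = [
    {"ClusterId": 12, "ProcId": 0, "Owner": "alice"},
    {"ClusterId": 13, "ProcId": 0, "Owner": "mallory"},
    {"ClusterId": 13, "ProcId": 1, "Owner": "mallory"},
]
held_job_ids = select_jobs_to_hold(jobs, valid_users={"alice", "bob"})
print(held_job_ids)
```

The resulting list of job ID strings is the kind of value passed as held_job_ids in the call above.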

For a little more context:

try:
    logging.debug("Executing job action hold on %s" % condor_host)
    hold_result = condor_session.act(htcondor.JobAction.Hold, held_job_ids)
    logging.debug("Hold result: %s" % hold_result)
    condor_session.edit(held_job_ids, "HoldReason", '"Invalid user or group name for htondor host %s, held by job poller"' % condor_host)

except Exception as exc:
    logging.error("Failure holding jobs: %s" % exc)
    logging.error("Aborting cycle...")
    abort_cycle = True
    break


I am pretty sure the error has something to do with the configuration on the remote HTCondor host, but my real issue is that it causes the Python code to crash without raising an exception.
This is a snippet of the schedd log from the remote HTCondor host in question:

03/22/19 10:53:33 (pid:2277705) AUTHENTICATE: handshake failed!
03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: authentication of <IPADDR:44307> did not result in a valid mapped user name, which is required for this command (478 ACT_ON_JOBS), so aborting.
03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXMc7VmW)

Any ideas on how I can stop this crash?
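
One way a crash like this might be contained, sketched under assumptions: if the bindings take down the whole interpreter rather than raising, no try/except in the same process can catch it, so the hold call could be moved into a child process whose exit status the poller checks. The helper name run_isolated is hypothetical, and the script strings below are stand-ins for a real script that would import htcondor and call act(). Nothing here is confirmed to be the cause of, or the fix for, this particular failure.

```python
import subprocess
import sys


def run_isolated(script, timeout=60):
    """Run a Python snippet in a child interpreter and report how it ended.

    Returns (ok, returncode, stderr). On POSIX a negative returncode means
    the child was killed by a signal. subprocess.TimeoutExpired is raised
    if the child does not finish within `timeout` seconds.
    """
    proc = subprocess.run(
        [sys.executable, "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return proc.returncode == 0, proc.returncode, proc.stderr


# Stand-ins for the real work: one child exits cleanly, one dies abruptly
# without raising any Python exception.
ok, code, _ = run_isolated("print('held')")
crashed_ok, crashed_code, _ = run_isolated("import os; os.abort()")
print(ok, code)
print(crashed_ok, crashed_code != 0)
```

With this shape, the parent would see a non-zero exit status from the child and could log the failure and abort the cycle itself, instead of dying along with the hold call.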

Thanks,
Colson