Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] Python Bindings crash without exception when remotely holding jobs

Date: Mon, 25 Mar 2019 11:17:47 -0700
From: Colson Driemel <colsond@xxxxxxx>
Subject: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs

Hi All,

So the system I'm working on inspects job queues from various condor instances and provisions cloud resources to run the jobs.

As part of this process, jobs are held if they do not conform to certain conditions. A list of job IDs is compiled and then held using:


condor_session.act(htcondor.JobAction.Hold, held_job_ids)

For a little more context:

try:
    logging.debug("Executing job action hold on %s" % condor_host)
    hold_result = condor_session.act(htcondor.JobAction.Hold, held_job_ids)
    logging.debug("Hold result: %s" % hold_result)
    # Record why the jobs were held
    condor_session.edit(held_job_ids, "HoldReason", '"Invalid user or group name for htcondor host %s, held by job poller"' % condor_host)
except Exception as exc:
    logging.error("Failure holding jobs: %s" % exc)
    logging.error("Aborting cycle...")
    abort_cycle = True
    break

I am pretty sure the error has something to do with the configuration on the remote condor host, but my real issue is that it causes the Python code to crash with no exception.
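One possible way to keep a failure like this from taking down the whole poller, offered as a sketch rather than a confirmed fix, is to run the risky call in a child interpreter and inspect its exit status. The name run_isolated and the stand-in snippet below are hypothetical and not part of the htcondor bindings; os.abort() only simulates a hard crash, and the real child would have to import htcondor and perform the hold itself.

```python
import subprocess
import sys


def run_isolated(code, timeout=60):
    """Run a Python snippet in a child interpreter.

    Returns (returncode, stdout, stderr). A hard crash in native code
    ends only the child, so the caller can log it and carry on.
    Raises subprocess.TimeoutExpired if the child runs past the timeout.
    """
    proc = subprocess.run(
        # -X faulthandler makes the child dump a Python traceback to
        # stderr if it dies from a fatal signal such as SIGSEGV or SIGABRT.
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


# Stand-in for the real work: os.abort() simulates a crash that no
# "except Exception" in the child could catch. In the poller this snippet
# would instead import htcondor and perform the hold (hypothetical usage).
CRASHING_SNIPPET = "import os; os.abort()"

returncode, _, stderr = run_isolated(CRASHING_SNIPPET)
if returncode != 0:
    print("child failed with exit status %s" % returncode)
```

On POSIX systems a negative return code means the child was terminated by that signal number, and any traceback that faulthandler wrote to stderr may help show which call was in progress when the process died.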

This is a snapshot of the Schedd log from the remote condor in question:

03/22/19 10:53:33 (pid:2277705) AUTHENTICATE: handshake failed!

03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: authentication of <IPADDR:44307> did not result in a valid mapped user name, which is required for this command (478 ACT_ON_JOBS), so aborting.

03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXMc7VmW)


Any ideas on how I can stop this crash?

Thanks,
Colson

Follow-Ups:
- Re: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs
  - From: John M Knoeller
