
Re: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs



What does the local log file say? (I'm assuming ToolLog is where your logging.debug messages go?)
Do you get a core file when the Python script aborts?

What I'm trying to get at is: is this a segfault, or is HTCondor aborting on purpose because of some failure?
This will be easy to fix if we can figure out exactly where in the HTCondor code the segfault or abort is happening.
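
If no core file turns up, one rough way to tell the two cases apart is to run the failing step in a child Python process and look at how it died. This is only a sketch, assuming a POSIX host and Python 3; the script name hold_jobs.py in the usage note is hypothetical and stands in for whatever file contains your act() call.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short description."""
    if returncode == 0:
        return "clean exit"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed
        # by the signal with that number.
        sig = signal.Signals(-returncode)
        if sig == signal.SIGSEGV:
            return "segfault (SIGSEGV)"
        if sig == signal.SIGABRT:
            return "deliberate abort (SIGABRT)"
        return "killed by " + sig.name
    return "exit status %d" % returncode


def run_and_describe(script_path):
    """Run a script in a child interpreter and report how it ended."""
    # -X faulthandler asks Python to print a traceback to stderr when a
    # fatal signal such as SIGSEGV or SIGABRT arrives.
    proc = subprocess.run([sys.executable, "-X", "faulthandler", script_path])
    return describe_exit(proc.returncode)
```

For example, print(run_and_describe("hold_jobs.py")) would report whether the child segfaulted or aborted, and the faulthandler traceback on stderr should show which Python line was executing at the time. Running the hold step in a child like this also means the parent process survives the crash.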

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Colson Driemel
Sent: Monday, March 25, 2019 1:18 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs

Hi All,

So the system I'm working on inspects job queues from various condor 
instances and provisions cloud resources to run the jobs.

As part of this process, jobs are held if they do not conform to 
certain conditions. A list of job IDs is compiled and then held using:

condor_session.act(htcondor.JobAction.Hold, held_job_ids)

For a little more context:

try:
     logging.debug("Executing job action hold on %s" % condor_host)
     hold_result = condor_session.act(htcondor.JobAction.Hold, held_job_ids)
     logging.debug("Hold result: %s" % hold_result)
     condor_session.edit(held_job_ids, "HoldReason", '"Invalid user or group name for htcondor host %s, held by job poller"' % condor_host)

except Exception as exc:
     logging.error("Failure holding jobs: %s" % exc)
     logging.error("Aborting cycle...")
     abort_cycle = True
     break


I am pretty sure the error has something to do with the configuration on 
the remote condor host, but my real issue is that it causes the Python 
code to crash with no exception.
This is a snapshot of the Schedd log from the remote condor host in question:

03/22/19 10:53:33 (pid:2277705) AUTHENTICATE: handshake failed!
03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: authentication of 
<IPADDR:44307> did not result in a valid mapped user name, which is 
required for this command (478 ACT_ON_JOBS), so aborting.
03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: reason for 
authentication failure: AUTHENTICATE:1002:Failure performing 
handshake|AUTHENTICATE:1004:Failed to authenticate using 
KERBEROS|AUTHENTICATE:1004:Failed to authenticate using 
FS|FS:1004:Unable to lstat(/tmp/FS_XXXMc7VmW)

Any ideas on how I can stop this crash?

Thanks,
Colson
