
Re: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs

I've tried everything I could come up with to get something back from the call, but nothing generates anything useful.

The subprocess is set up to pipe any output back to stdout and stderr, and nothing comes through those channels before the crash. I tried to use Python's faulthandler to generate a trace, but this is what I got:
Fatal Python error: Segmentation fault

Current thread 0x00007f7287d31740 (most recent call first):
  File "/opt/cloudscheduler/data_collectors/condor/csjobs.py", line 309 in job_poller
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93 in run
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258 in _bootstrap
  File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 73 in _launch
  File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 19 in __init__
  File "/usr/lib64/python3.6/multiprocessing/context.py", line 277 in _Popen
  File "/usr/lib64/python3.6/multiprocessing/context.py", line 223 in _Popen
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 105 in start
  File "/opt/cloudscheduler/data_collectors/condor/cloudscheduler/lib/ProcessMonitor.py", line 65 in start_all
  File "/opt/cloudscheduler/data_collectors/condor/csjobs.py", line 526 in <module>

Not terribly useful, unfortunately, as csjobs.py line 309 is:

hold_result = condor_session.act(htcondor.JobAction.Hold, held_job_ids)

I don't have a lot of experience with C extensions in Python, so if anyone knows a way that I can get my hands on the core dump, I'd appreciate it.

I tried using gdb and backtrace, but since it was only a subprocess that died I wasn't able to come up with anything.
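
One thing that might help in getting a core file from just the child is to raise the core-size limit inside the subprocess itself, before the call that crashes. The following is only a minimal sketch, assuming Linux and a hard limit that permits core files. `enable_core_dumps` and `prepare_child_for_crash_debugging` are hypothetical helper names for illustration; they are not part of HTCondor or of csjobs.py.

```python
import faulthandler
import resource
import sys


def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit for this process.

    Returns the (soft, hard) pair in effect afterwards. If the hard limit
    is 0, core files stay disabled and an administrator has to raise it.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def prepare_child_for_crash_debugging():
    """Hypothetical hook: call first thing inside the subprocess target."""
    # Print the Python-level stack of every thread if a fatal signal arrives.
    faulthandler.enable(file=sys.stderr, all_threads=True)
    # Allow the kernel to write a core file for this process.
    return enable_core_dumps()
```

Where the core file ends up depends on the system's `kernel.core_pattern` setting (see `/proc/sys/kernel/core_pattern`); on some distributions a crash handler collects it instead of writing it to the working directory. Once a core file exists, loading it with `gdb` against the same Python binary and running `bt` should show the native frames.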


On 03/26/2019 02:00 PM, John M Knoeller wrote:
Is there a way to get a stack trace for the SIGSEGV?

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Colson Driemel
Sent: Tuesday, March 26, 2019 1:13 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs

The local log initially said very little. The debug messages go to a
configured log, and the only message is the main process noticing that
the process in question has died and restarting it.
I've added the exit code of the subprocess to the log, and it is
returning -11, which is SIGSEGV (segmentation fault).
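
For reference, `multiprocessing` reports a child killed by signal N as exit code -N, which is how -11 maps to SIGSEGV. Below is a small sketch of decoding that value for a log line. `describe_exitcode` is a hypothetical helper name and is not taken from the monitoring code discussed here.

```python
import signal


def describe_exitcode(exitcode):
    """Turn a multiprocessing.Process.exitcode value into readable text."""
    if exitcode is None:
        # The process has not terminated yet.
        return "still running"
    if exitcode < 0:
        # A negative value -N means the child was terminated by signal N.
        try:
            return "killed by " + signal.Signals(-exitcode).name
        except ValueError:
            return "killed by unknown signal %d" % -exitcode
    return "exited with status %d" % exitcode
```

Logging `describe_exitcode(process.exitcode)` from the parent would make a signal death stand out from an ordinary non-zero exit.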

2019-03-26 10:52:47,968 - Job Poller   - DEBUG - Adding job
2019-03-26 10:52:47,980 - Job Poller   - DEBUG - No alias found in
requirements expression
2019-03-26 10:52:47,981 - Job Poller   - DEBUG - {'requirements':
'(group_name is "test-dev2" && TARGET.Arch == "x86_64") && (TARGET.OpSys
== "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >=
RequestMemory) && (TARGET.Cpus >= RequestCpus) &&
(TARGET.HasFileTransfer)', 'request_ram': 15000, 'request_disk':
94371840, 'q_date': 1553432582, 'proc_id': 0, 'job_status': 1, 'user':
'<removed>', 'request_cpus': 4, 'job_priority': 10,
'entered_current_status': 1553432582, 'global_job_id':
'<removed>#16432.0#1553432582', 'cluster_id': 16432, 'group_name':
2019-03-26 10:52:47,982 - Job Poller   - DEBUG -
inventory_item_hash(old): None
2019-03-26 10:52:47,982 - Job Poller   - DEBUG -
is "test-dev2" && TARGET.Arch == "x86_64") && (TARGET.OpSys == "LINUX")
&& (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
(TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer),user=<removed>
2019-03-26 10:52:47,982 - Job Poller   - DEBUG - Adding job
2019-03-26 10:52:47,988 - Job Poller   - DEBUG - No alias found in
requirements expression
2019-03-26 10:52:47,988 - Job Poller   - DEBUG - testing is not a valid
group for csv2-dev2.heprc.uvic.ca, ignoring foreign job.
2019-03-26 10:52:47,989 - Job Poller   - INFO - 6845 jobs held or to be
held due to invalid user or group specifications.
2019-03-26 10:52:47,992 - Job Poller   - DEBUG - Holding: ['16335.0',
'16335.1', '16335.2', '16335.3', '16335.4', <SHORTENED FOR READABILITY> '']
2019-03-26 10:52:47,993 - Job Poller   - DEBUG - Executing job action
hold on csv2-dev2.heprc.uvic.ca
2019-03-26 10:52:55,698 - MainProcess  - ERROR - job process died,
2019-03-26 10:52:55,993 - MainProcess  - DEBUG - exit code: -11
2019-03-26 10:52:57,158 - Job Poller   - INFO - Retrieved inventory from
the database.
2019-03-26 10:52:57,159 - Job Poller   - DEBUG - Beginning poller cycle


On 03/25/2019 02:51 PM, John M Knoeller wrote:
What does the local log file say? (I'm assuming ToolLog is where your logging.debug messages go?)
Do you get a core file when the Python script aborts?

What I'm trying to get at is: is this a segfault, or is HTCondor aborting on purpose because of some failure?
This will be easy to fix if we can figure out exactly where in the HTCondor code the segfault or abort is happening.


-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Colson Driemel
Sent: Monday, March 25, 2019 1:18 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs

Hi All,

So the system I'm working on inspects job queues from various condor
instances and provisions cloud resources to run the jobs.

As part of this process, jobs are held if they do not conform to
certain conditions: a list of jobs is compiled and then held using:

condor_session.act(htcondor.JobAction.Hold, held_job_ids)

For a little more context:

try:
    logging.debug("Executing job action hold on %s" % condor_host)
    hold_result = condor_session.act(htcondor.JobAction.Hold, held_job_ids)
    logging.debug("Hold result: %s" % hold_result)
    condor_session.edit(held_job_ids, "HoldReason", '"Invalid user or group name for htondor host %s, held by job poller"' % condor_host)

except Exception as exc:
    logging.error("Failure holding jobs: %s" % exc)
    logging.error("Aborting cycle...")
    abort_cycle = True

I am pretty sure the error has something to do with the configuration on
the remote condor host, but my real issue is that it causes the Python
code to crash with no exception.
This is a snapshot of the Schedd log from the remote condor in question:

03/22/19 10:53:33 (pid:2277705) AUTHENTICATE: handshake failed!
03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: authentication of
<IPADDR:44307> did not result in a valid mapped user name, which is
required for this command (478 ACT_ON_JOBS), so aborting.
03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: reason for
authentication failure: AUTHENTICATE:1002:Failure performing
handshake|AUTHENTICATE:1004:Failed to authenticate using
KERBEROS|AUTHENTICATE:1004:Failed to authenticate using
FS|FS:1004:Unable to lstat(/tmp/FS_XXXMc7VmW)

Any ideas on how I can stop this crash?


HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting

The archives can be found at: