
Re: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs



Is there a way to get a stack trace for the SIGSEGV?
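
One way to at least see the Python frames at the moment of the crash (a
sketch, assuming the poller runs under plain CPython): enable faulthandler
before the act() call. The native frames inside the bindings would still
need gdb or a core file (e.g. run the script under gdb and ask for "bt"
after the fault).

import faulthandler
import sys

# Print the Python stack of every thread to stderr when the process
# receives SIGSEGV (or SIGABRT, SIGBUS, ...), just before it dies.
faulthandler.enable(file=sys.stderr, all_threads=True)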

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Colson Driemel
Sent: Tuesday, March 26, 2019 1:13 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs

The local log initially said very little: the debug messages go to a 
configured log, and the only message was the main process noticing that 
the process in question had died and restarting it.
I've added the exit code of the subprocess to the log, and it is 
returning -11, which is SIGSEGV (segmentation fault).
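
For reference, this is roughly how the -11 maps to a signal name (just a 
sketch; job_proc here is a stand-in for the actual multiprocessing.Process 
the poller starts):

import logging
import signal

# A negative exitcode on a multiprocessing.Process means the child was
# killed by a signal, so -11 corresponds to signal 11, SIGSEGV.
if job_proc.exitcode is not None and job_proc.exitcode < 0:
    sig = signal.Signals(-job_proc.exitcode)
    logging.error("job process died with %s, restarting...", sig.name)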

2019-03-26 10:52:47,968 - Job Poller   - DEBUG - Adding job 
<removed>#15688.0#1553144582
2019-03-26 10:52:47,980 - Job Poller   - DEBUG - No alias found in 
requirements expression
2019-03-26 10:52:47,981 - Job Poller   - DEBUG - {'requirements': 
'(group_name is "test-dev2" && TARGET.Arch == "x86_64") && (TARGET.OpSys 
== "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= 
RequestMemory) && (TARGET.Cpus >= RequestCpus) && 
(TARGET.HasFileTransfer)', 'request_ram': 15000, 'request_disk': 
94371840, 'q_date': 1553432582, 'proc_id': 0, 'job_status': 1, 'user': 
'<removed>', 'request_cpus': 4, 'job_priority': 10, 
'entered_current_status': 1553432582, 'global_job_id': 
'<removed>#16432.0#1553432582', 'cluster_id': 16432, 'group_name': 
'test-dev2'}
2019-03-26 10:52:47,982 - Job Poller   - DEBUG - 
inventory_item_hash(old): None
2019-03-26 10:52:47,982 - Job Poller   - DEBUG - 
inventory_item_hash(new): 
97b4c6c61bad8e44a72dfd34cfe1d6f8,cluster_id=16432,entered_current_status=1553432582,global_job_id=<removed>#16432.0#1553432582,job_priority=10,job_status=1,proc_id=0,q_date=1553432582,request_cpus=4,request_disk=94371840,request_ram=15000,requirements=(group_name 
is "test-dev2" && TARGET.Arch == "x86_64") && (TARGET.OpSys == "LINUX") 
&& (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && 
(TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer),user=<removed>
2019-03-26 10:52:47,982 - Job Poller   - DEBUG - Adding job 
csv2-dev2.heprc.uvic.ca#16432.0#1553432582
2019-03-26 10:52:47,988 - Job Poller   - DEBUG - No alias found in 
requirements expression
2019-03-26 10:52:47,988 - Job Poller   - DEBUG - testing is not a valid 
group for csv2-dev2.heprc.uvic.ca, ignoring foreign job.
2019-03-26 10:52:47,989 - Job Poller   - INFO - 6845 jobs held or to be 
held due to invalid user or group specifications.
2019-03-26 10:52:47,992 - Job Poller   - DEBUG - Holding: ['16335.0', 
'16335.1', '16335.2', '16335.3', '16335.4', <SHORTENED FOR READABILITY> '']
2019-03-26 10:52:47,993 - Job Poller   - DEBUG - Executing job action 
hold on csv2-dev2.heprc.uvic.ca
2019-03-26 10:52:55,698 - MainProcess  - ERROR - job process died, 
restarting...
2019-03-26 10:52:55,993 - MainProcess  - DEBUG - exit code: -11
2019-03-26 10:52:57,158 - Job Poller   - INFO - Retrieved inventory from 
the database.
2019-03-26 10:52:57,159 - Job Poller   - DEBUG - Beginning poller cycle

-Colson


On 03/25/2019 02:51 PM, John M Knoeller wrote:
> What does the local log file say?  (I'm assuming ToolLog is where your logging.debug messages go?)
> Do you get a core file when the python script aborts?
>
> What I'm trying to get at is: is this a segfault, or is HTCondor aborting on purpose because of some failure?
> This will be easy to fix if we can figure out exactly where in the HTCondor code the segfault or abort is happening.
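>
> If no core file shows up, the soft limit may simply be zero; a minimal
> sketch for raising it from inside the script (assumes Linux, and that
> core_pattern points somewhere writable):
>
> import resource
>
> # Raise the soft core-file size limit to the hard limit so a SIGSEGV
> # leaves a core dump that gdb can open.
> soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
> resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))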
>
> -tj
>
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Colson Driemel
> Sent: Monday, March 25, 2019 1:18 PM
> To: htcondor-users@xxxxxxxxxxx
> Subject: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs
>
> Hi All,
>
> So the system I'm working on inspects job queues from various condor
> instances and provisions cloud resources to run the jobs.
>
> As a part of this process, jobs are held if they do not conform to
> certain conditions: a list of jobs is compiled and then held using:
>
> condor_session.act(htcondor.JobAction.Hold, held_job_ids)
>
> for a little more context:
>
> try:
>     logging.debug("Executing job action hold on %s" % condor_host)
>     hold_result = condor_session.act(htcondor.JobAction.Hold, held_job_ids)
>     logging.debug("Hold result: %s" % hold_result)
>     condor_session.edit(held_job_ids, "HoldReason",
>                         '"Invalid user or group name for htcondor host %s, held by job poller"' % condor_host)
>
> except Exception as exc:
>     logging.error("Failure holding jobs: %s" % exc)
>     logging.error("Aborting cycle...")
>     abort_cycle = True
>     break
>
>
> I am pretty sure the error has something to do with the configuration on
> the remote condor host, but my real issue is that it causes the Python
> code to crash without raising an exception.
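>
> I'll try to reproduce it outside the poller with something like this
> (just a sketch, and it assumes the schedd I'm acting on is named after
> condor_host itself), so the call can be run under a debugger:
>
> import htcondor
>
> # Locate the remote schedd through its collector and issue the same
> # hold on a single job id from the held list.
> coll = htcondor.Collector(condor_host)
> schedd_ad = coll.locate(htcondor.DaemonTypes.Schedd, condor_host)
> schedd = htcondor.Schedd(schedd_ad)
> schedd.act(htcondor.JobAction.Hold, ["16335.0"])
>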
> This is a snapshot of the Schedd log from the remote condor in question:
>
> 03/22/19 10:53:33 (pid:2277705) AUTHENTICATE: handshake failed!
> 03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: authentication of
> <IPADDR:44307> did not result in a valid mapped user name, which is
> required for this command (478 ACT_ON_JOBS), so aborting.
> 03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: reason for
> authentication failure: AUTHENTICATE:1002:Failure performing
> handshake|AUTHENTICATE:1004:Failed to authenticate using
> KERBEROS|AUTHENTICATE:1004:Failed to authenticate using
> FS|FS:1004:Unable to lstat(/tmp/FS_XXXMc7VmW)
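>
> On my side the bindings just use whatever the local condor_config gives
> them, so one thing I may try is pinning the client-side security knobs
> for the poller process before creating the schedd object. Untested
> sketch, and the method/file below are only examples; the remote schedd
> would of course have to accept the same method:
>
> import htcondor
>
> # Override client-side security settings for this process only.
> htcondor.param['SEC_CLIENT_AUTHENTICATION_METHODS'] = 'PASSWORD'
> htcondor.param['SEC_PASSWORD_FILE'] = '/etc/condor/pool_password'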
>
> Any ideas on how I can stop this crash?
>
> Thanks,
> Colson
>

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/