
Re: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs



Ah, right you are! That was good enough to get an exception instead of a crash. Thanks, everyone!

-Colson


On 03/27/2019 10:31 AM, Jason Patton wrote:
Colson,

Based on the trace you posted, it looks like you're using the bindings from PyPI (installed via pip), e.g.
"/usr/lib64/python3.6/site-packages/htcondor/../htcondor.libs/libpyclassad3-8d384f47.6_8_7_9_clean.so"

You can update those independently of your system HTCondor install.
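
For example, something along these lines should pull the newest bindings from PyPI and confirm what actually got loaded (just a sketch; adjust the pip/python executable names for your environment):

pip3 install --upgrade htcondor
python3 -c "import htcondor; print(htcondor.__file__, htcondor.version())"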

Jason

On Wed, Mar 27, 2019 at 12:28 PM Colson Driemel <colsond@xxxxxxx> wrote:
Oh nice, that's good to hear it may just be a version issue. However, when
I try to update condor via yum it only goes to 8.6.13. I went to the
repo defined in yum and found that there is a separate directory for 8.8,

i.e.
https://research.cs.wisc.edu/htcondor/yum/stable/rhel7/
vs.
https://research.cs.wisc.edu/htcondor/yum/stable/8.8/rhel7/

Would it be safe to change the URL in htcondor-stable-rhel7.repo and do
the update again? I read
http://research.cs.wisc.edu/htcondor/manual/v8.9/Upgradingfromthe86seriestothe88seriesofHTCondor.html
and it didn't seem like there would be any upgrade issues, but I'm not
sure if this is the best way to do the update.
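
Concretely, what I had in mind is roughly the following, assuming the repo file sits in the usual /etc/yum.repos.d/ location and selects the repository with a baseurl line (I haven't tried it yet):

# edit /etc/yum.repos.d/htcondor-stable-rhel7.repo so that baseurl points at the 8.8 tree:
#   baseurl=https://research.cs.wisc.edu/htcondor/yum/stable/8.8/rhel7/
yum clean all
yum update condor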

With regard to authentication methods, I didn't see any configuration
using KERBEROS (I should note that we use GSI in production and the
Python script's user would have a valid grid proxy, but that is disabled
for development):
condor_config_val -dump | grep -I KERBEROS
KERBEROS_MAP_FILE =

It didn't seem like there was any authentication configuration set at all:

# condor_config_val -dump | grep -I AUTH
ATTR_SEC_AUTHENTICATION_METHODS_LIST = GSI
DISABLE_AUTHENTICATION_IP_CHECK = false
GSI_AUTHENTICATION_TIMEOUT = 120
SEC_CLIENT_AUTHENTICATION_TIMEOUT = 120
SEC_DEFAULT_AUTHENTICATION_TIMEOUT = 20
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = true
SEC_READ_AUTHENTICATION_TIMEOUT = 120
XAUTHORITY_USERS =


Hopefully the condor update will solve the issue without having to
change the configuration too much.

Thanks,
-Colson


On 03/27/2019 09:21 AM, John M Knoeller wrote:
> This looks like a bug that we fixed in the HTCondor 8.8.1 release. Could you try that version and see if it fixes your problem?
>
> You could also try removing KERBEROS from the list of authentication methods for your
> python script. You could do this by adding a TOOL specialization to your configuration for the authentication methods.
>
> Something like this
>
> TOOL.<knob> = FS, GSI
>
> where <knob> is any of the configuration variables that you currently set to FS, KERBEROS, GSI.
>
> This might only be SEC_DEFAULT_AUTHENTICATION_METHODS, or might be more than one knob, depending on your configuration.
>
> condor_config_val -dump | grep -I KERBEROS
>
> will show you all of the knobs, including built-in defaults.
>
> Adding a TOOL specialization to your config would affect all tools (including Python scripts). They would refuse to use KERBEROS authentication, but the HTCondor daemons could still use it to authenticate to each other, just not to authenticate a tool.
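>
> For example, if SEC_DEFAULT_AUTHENTICATION_METHODS turns out to be the only knob involved, a sketch of the change would be just this one line (swap in whichever knob names your dump actually shows):
>
> TOOL.SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI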
>
> -tj
>
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Colson Driemel
> Sent: Wednesday, March 27, 2019 10:49 AM
> To: htcondor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] Python Bindings crash without exception when remotely holding jobs
>
> I re-factored the code a bit so it wasn't running as a subprocess, and I
> was able to use gdb to get this backtrace, which seems helpful:
>
> (gdb) backtrace
> #0  0x0000000000000000 in ?? ()
> #1  0x00007fffe82187ba in Condor_Auth_Kerberos::init_user (this=this@entry=0x858900)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/condor_auth_kerberos.cpp:725
> #2  0x00007fffe8219fbf in Condor_Auth_Kerberos::authenticate (this=0x858900)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/condor_auth_kerberos.cpp:286
> #3  0x00007fffe8212985 in Authentication::authenticate_continue (this=this@entry=0x1df8190, errstack=errstack@entry=0x1c595b8, non_blocking=<optimized out>)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/authentication.cpp:331
> #4  0x00007fffe8212f13 in Authentication::authenticate_inner (this=this@entry=0x1df8190,
>     hostAddr=hostAddr@entry=0xe010d0 "<206.12.154.223:9618?addrs=206.12.154.223-9618+[2607-f8f0-c10-70f3-2--223]-9618&noUDP&sock=2297468_8802_4>",
>     auth_methods=auth_methods@entry=0x1e43860 "FS,KERBEROS,GSI", errstack=errstack@entry=0x1c595b8, timeout=timeout@entry=20, non_blocking=non_blocking@entry=false)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/authentication.cpp:163
> #5  0x00007fffe8212fba in Authentication::authenticate (this=this@entry=0x1df8190,
>     hostAddr=0xe010d0 "<206.12.154.223:9618?addrs=206.12.154.223-9618+[2607-f8f0-c10-70f3-2--223]-9618&noUDP&sock=2297468_8802_4>",
>     auth_methods=auth_methods@entry=0x1e43860 "FS,KERBEROS,GSI", errstack=errstack@entry=0x1c595b8, timeout=timeout@entry=20, non_blocking=non_blocking@entry=false)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/authentication.cpp:117
> #6  0x00007fffe821300b in Authentication::authenticate (this=this@entry=0x1df8190, hostAddr=<optimized out>, key=@0x1c597d8: 0x0,
>     auth_methods=auth_methods@entry=0x1e43860 "FS,KERBEROS,GSI", errstack=errstack@entry=0x1c595b8, timeout=timeout@entry=20, non_blocking=non_blocking@entry=false)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/authentication.cpp:105
> #7  0x00007fffe8238337 in ReliSock::perform_authenticate (this=0x7fffffffd110, with_key=with_key@entry=true, key=@0x1c597d8: 0x0, methods=0x1e43860 "FS,KERBEROS,GSI",
>     errstack=0x1c595b8, auth_timeout=20, non_blocking=false, method_used=method_used@entry=0x0)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/reli_sock.cpp:1181
> #8  0x00007fffe82383cc in ReliSock::authenticate (this=<optimized out>, key=<optimized out>, methods=<optimized out>, errstack=<optimized out>, auth_timeout=<optimized out>,
>     non_blocking=<optimized out>, method_used=0x0)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/reli_sock.cpp:1238
> #9  0x00007fffe822e107 in SecManStartCommand::authenticate_inner (this=0x1c59570)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/condor_secman.cpp:1920
> #10 0x00007fffe8233b25 in SecManStartCommand::startCommand_inner (this=this@entry=0x1c59570)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/condor_secman.cpp:1295
> #11 0x00007fffe8233cda in SecManStartCommand::startCommand (this=this@entry=0x1c59570)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/condor_secman.cpp:1227
> #12 0x00007fffe8233f71 in SecMan::startCommand (this=0x7fffffffd788, cmd=cmd@entry=478, sock=sock@entry=0x7fffffffd110, raw_protocol=<optimized out>, errstack=errstack@entry=0x0,
>     subcmd=<optimized out>, callback_fn=0x0, misc_data=0x0, nonblocking=false, cmd_description=0x0, sec_session_id_hint=0x0)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_io/condor_secman.cpp:1119
> #13 0x00007fffe824df24 in Daemon::startCommand (cmd=cmd@entry=478, sock=sock@entry=0x7fffffffd110, timeout=timeout@entry=0, errstack=errstack@entry=0x0, subcmd=subcmd@entry=0,
>     callback_fn=callback_fn@entry=0x0, misc_data=misc_data@entry=0x0, nonblocking=nonblocking@entry=false, cmd_description=cmd_description@entry=0x0, sec_man=0x0,
>     sec_man@entry=0x7fffffffd788, raw_protocol=raw_protocol@entry=false, sec_session_id=sec_session_id@entry=0x0)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_daemon_client/daemon.cpp:567
> #14 0x00007fffe824e1eb in Daemon::startCommand (this=this@entry=0x7fffffffd700, cmd=cmd@entry=478, sock=sock@entry=0x7fffffffd110, timeout=timeout@entry=0,
>     errstack=errstack@entry=0x0, cmd_description=cmd_description@entry=0x0, raw_protocol=raw_protocol@entry=false, sec_session_id=sec_session_id@entry=0x0)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_daemon_client/daemon.cpp:728
> #15 0x00007fffe825fd78 in DCSchedd::actOnJobs (this=this@entry=0x7fffffffd700, action="" constraint=constraint@entry=0x0, ids=ids@entry=0x7fffffffd670,
>     reason=reason@entry=0x1b54498 "Python-initiated action.", reason_attr=reason_attr@entry=0x7fffe82e8a1b "HoldReason", reason_code=reason_code@entry=0x0,
>     reason_code_attr=reason_code_attr@entry=0x7fffe82c0326 "HoldReasonSubCode", result_type=result_type@entry=AR_TOTALS, errstack=errstack@entry=0x0)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_daemon_client/dc_schedd.cpp:1409
> #16 0x00007fffe82603ec in DCSchedd::holdJobs (this=this@entry=0x7fffffffd700, ids=ids@entry=0x7fffffffd670, reason=reason@entry=0x1b54498 "Python-initiated action.",
>     reason_code=reason_code@entry=0x0, errstack=errstack@entry=0x0, result_type=result_type@entry=AR_TOTALS)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/condor_daemon_client/dc_schedd.cpp:134
> #17 0x00007fffe92a74b1 in Schedd::actOnJobs (this=this@entry=0x7fffe1aa3040, action="" job_spec=..., reason=...)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/python-bindings/schedd.cpp:1399
> #18 0x00007fffe92a85a0 in Schedd::actOnJobs2 (this=0x7fffe1aa3040, action="" job_spec=...)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_source/src/python-bindings/schedd.cpp:1466
> #19 0x00007fffe928a0ef in invoke<boost::python::to_python_value<boost::python::api::object const&>, boost::python::api::object (Schedd::*)(JobAction, boost::python::api::object), boost::python::arg_from_python<Schedd&>, boost::python::arg_from_python<JobAction>, boost::python::arg_from_python<boost::python::api::object> > (ac1=<synthetic pointer>, ac0=...,
>     tc=<synthetic pointer>, f=@0xade1e8: (boost::python::api::object (Schedd::*)(Schedd * const, JobAction, boost::python::api::object)) 0x7fffe92a8530 <Schedd::actOnJobs2(JobAction, boost::python::api::object)>, rc=...)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_pypi_build/bld_external/boost-1.66.0/install/include/boost/python/detail/invoke.hpp:86
> #20 operator() (args_=<optimized out>, this=0xade1e8)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_pypi_build/bld_external/boost-1.66.0/install/include/boost/python/detail/caller.hpp:221
> #21 boost::python::objects::caller_py_function_impl<boost::python::detail::caller<boost::python::api::object (Schedd::*)(JobAction, boost::python::api::object), boost::python::default_call_policies, boost::mpl::vector4<boost::python::api::object, Schedd&, JobAction, boost::python::api::object> > >::operator() (this=0xade1e0, args=<optimized out>, kw=<optimized out>)
>     at /var/lib/condor/execute/slot1/dir_5037/htcondor_pypi_build/bld_external/boost-1.66.0/install/include/boost/python/object/py_function.hpp:38
> #22 0x00007fffe8f6033a in boost::python::objects::function::call(_object*, _object*) const ()
>     from /usr/lib64/python3.6/site-packages/htcondor/../htcondor.libs/libpyclassad3-8d384f47.6_8_7_9_clean.so
> #23 0x00007fffe8f606a8 in boost::detail::function::void_function_ref_invoker0<boost::python::objects::(anonymous namespace)::bind_return, void>::invoke(boost::detail::function::function_buffer&) ()
>     from /usr/lib64/python3.6/site-packages/htcondor/../htcondor.libs/libpyclassad3-8d384f47.6_8_7_9_clean.so
> #24 0x00007fffe8f5ac63 in boost::python::handle_exception_impl(boost::function0<void>) ()
>     from /usr/lib64/python3.6/site-packages/htcondor/../htcondor.libs/libpyclassad3-8d384f47.6_8_7_9_clean.so
> #25 0x00007fffe8f5efb3 in function_call ()
>     from /usr/lib64/python3.6/site-packages/htcondor/../htcondor.libs/libpyclassad3-8d384f47.6_8_7_9_clean.so
> #26 0x00007ffff795e88b in _PyObject_FastCallDict () from /lib64/libpython3.6m.so.1.0
> #27 0x00007ffff7a1e244 in call_function () from /lib64/libpython3.6m.so.1.0
> #28 0x00007ffff7a224c4 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0
> #29 0x00007ffff7a1d5e0 in _PyFunction_FastCall () from /lib64/libpython3.6m.so.1.0
> #30 0x00007ffff7a1e2f6 in call_function () from /lib64/libpython3.6m.so.1.0
> #31 0x00007ffff7a224c4 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0
> #32 0x00007ffff7a1df45 in _PyEval_EvalCodeWithName () from /lib64/libpython3.6m.so.1.0
> #33 0x00007ffff7a1e47d in PyEval_EvalCodeEx () from /lib64/libpython3.6m.so.1.0
> #34 0x00007ffff7a1e4cb in PyEval_EvalCode () from /lib64/libpython3.6m.so.1.0
> #35 0x00007ffff7a49914 in run_mod () from /lib64/libpython3.6m.so.1.0
> #36 0x00007ffff7a4bf5d in PyRun_FileExFlags () from /lib64/libpython3.6m.so.1.0
> #37 0x00007ffff7a4c0c7 in PyRun_SimpleFileExFlags () from /lib64/libpython3.6m.so.1.0
> #38 0x00007ffff7a62733 in Py_Main () from /lib64/libpython3.6m.so.1.0
> #39 0x0000000000400a3e in main ()
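>
> (For anyone trying to reproduce this: the trace above came from running the refactored poller directly under gdb, along the lines of "gdb --args python3 csjobs.py", then "run" and "bt" at the gdb prompt; exact paths and arguments elided.)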
>
>
> -Colson
>
>
> On 03/26/2019 03:20 PM, Colson Driemel wrote:
>> I've tried as many things as I could come up with to try to get
>> something back from the call, but nothing seems to generate anything
>> useful.
>>
>> The subprocess is set up to pipe any output back to stdout and stderr,
>> and nothing is coming through those channels before the crash. I tried
>> to use Python's fault handler to generate a trace, but this is what I got:
>> Fatal Python error: Segmentation fault
>>
>> Current thread 0x00007f7287d31740 (most recent call first):
>>   File "/opt/cloudscheduler/data_collectors/condor/csjobs.py", line 309 in job_poller
>>   File "/usr/lib64/python3.6/multiprocessing/process.py", line 93 in run
>>   File "/usr/lib64/python3.6/multiprocessing/process.py", line 258 in _bootstrap
>>   File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 73 in _launch
>>   File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 19 in __init__
>>   File "/usr/lib64/python3.6/multiprocessing/context.py", line 277 in _Popen
>>   File "/usr/lib64/python3.6/multiprocessing/context.py", line 223 in _Popen
>>   File "/usr/lib64/python3.6/multiprocessing/process.py", line 105 in start
>>   File "/opt/cloudscheduler/data_collectors/condor/cloudscheduler/lib/ProcessMonitor.py", line 65 in start_all
>>   File "/opt/cloudscheduler/data_collectors/condor/csjobs.py", line 526 in <module>
>>
>> Not terribly useful, unfortunately, as csjobs.py line 309 is:
>>
>> hold_result = condor_session.act(htcondor.JobAction.Hold, held_job_ids)
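>>
>> (For reference, the fault handler was turned on with roughly the following near the top of the script; setting PYTHONFAULTHANDLER=1 in the environment would do the same thing.)
>>
>> import faulthandler
>> faulthandler.enable()   # dump a Python traceback on SIGSEGV and similar fatal signals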
>>
>>
>> I don't have a lot of experience with C extensions in Python, so if
>> anyone knows a way that I can get my hands on the core dump I'd
>> appreciate it.
>>
>> I tried using gdb and backtrace, but since it was only a subprocess
>> that died I wasn't able to come up with anything.
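>>
>> My guess at the usual recipe would be something like the following, though I'm not sure it applies to a forked child (the paths here are placeholders):
>>
>> ulimit -c unlimited                 # allow core files for this shell and its children
>> cat /proc/sys/kernel/core_pattern   # see where cores get written (may be handled by systemd-coredump)
>> gdb /usr/bin/python3 /path/to/core  # then 'bt' for the backtrace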
>>
>> -Colson
>>
>>
>> On 03/26/2019 02:00 PM, John M Knoeller wrote:
>>> Is there a way to get a stack trace for the SIGSEGV?
>>>
>>> -----Original Message-----
>>> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
>>> Of Colson Driemel
>>> Sent: Tuesday, March 26, 2019 1:13 PM
>>> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
>>> Subject: Re: [HTCondor-users] Python Bindings crash without exception
>>> when remotely holding jobs
>>>
>>> The local log initially said very little; the debug messages go
>>> to a configured log, and the only message there is the main process
>>> noticing that the process in question has died and restarting it.
>>> I've added the exit code of the subprocess to the log, and it is
>>> returning -11, which is SIGSEGV (segmentation fault):
>>>
>>> 2019-03-26 10:52:47,968 - Job Poller  - DEBUG - Adding job <removed>#15688.0#1553144582
>>> 2019-03-26 10:52:47,980 - Job Poller  - DEBUG - No alias found in requirements expression
>>> 2019-03-26 10:52:47,981 - Job Poller  - DEBUG - {'requirements': '(group_name is "test-dev2" && TARGET.Arch == "x86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer)', 'request_ram': 15000, 'request_disk': 94371840, 'q_date': 1553432582, 'proc_id': 0, 'job_status': 1, 'user': '<removed>', 'request_cpus': 4, 'job_priority': 10, 'entered_current_status': 1553432582, 'global_job_id': '<removed>#16432.0#1553432582', 'cluster_id': 16432, 'group_name': 'test-dev2'}
>>> 2019-03-26 10:52:47,982 - Job Poller  - DEBUG - inventory_item_hash(old): None
>>> 2019-03-26 10:52:47,982 - Job Poller  - DEBUG - inventory_item_hash(new): 97b4c6c61bad8e44a72dfd34cfe1d6f8,cluster_id=16432,entered_current_status=1553432582,global_job_id=<removed>#16432.0#1553432582,job_priority=10,job_status=1,proc_id=0,q_date=1553432582,request_cpus=4,request_disk=94371840,request_ram=15000,requirements=(group_name is "test-dev2" && TARGET.Arch == "x86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer),user=<removed>
>>> 2019-03-26 10:52:47,982 - Job Poller  - DEBUG - Adding job csv2-dev2.heprc.uvic.ca#16432.0#1553432582
>>> 2019-03-26 10:52:47,988 - Job Poller  - DEBUG - No alias found in requirements expression
>>> 2019-03-26 10:52:47,988 - Job Poller  - DEBUG - testing is not a valid group for csv2-dev2.heprc.uvic.ca, ignoring foreign job.
>>> 2019-03-26 10:52:47,989 - Job Poller  - INFO - 6845 jobs held or to be held due to invalid user or group specifications.
>>> 2019-03-26 10:52:47,992 - Job Poller  - DEBUG - Holding: ['16335.0', '16335.1', '16335.2', '16335.3', '16335.4', <SHORTENED FOR READABILITY> '']
>>> 2019-03-26 10:52:47,993 - Job Poller  - DEBUG - Executing job action hold on csv2-dev2.heprc.uvic.ca
>>> 2019-03-26 10:52:55,698 - MainProcess - ERROR - job process died, restarting...
>>> 2019-03-26 10:52:55,993 - MainProcess - DEBUG - exit code: -11
>>> 2019-03-26 10:52:57,158 - Job Poller  - INFO - Retrieved inventory from the database.
>>> 2019-03-26 10:52:57,159 - Job Poller  - DEBUG - Beginning poller cycle
>>>
>>> -Colson
>>>
>>>
>>> On 03/25/2019 02:51 PM, John M Knoeller wrote:
>>>> What does the local log file say? (I'm assuming ToolLog is where
>>>> your logging.debug messages go?)
>>>> Do you get a core file when the python script aborts?
>>>>
>>>> What I'm trying to get at is: is this a segfault, or is HTCondor
>>>> aborting on purpose because of some failure?
>>>> This will be easy to fix if we can figure out exactly where in the
>>>> HTCondor code the segfault or abort is happening.
>>>>
>>>> -tj
>>>>
>>>> -----Original Message-----
>>>> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
>>>> Of Colson Driemel
>>>> Sent: Monday, March 25, 2019 1:18 PM
>>>> To: htcondor-users@xxxxxxxxxxx
>>>> Subject: [HTCondor-users] Python Bindings crash without exception
>>>> when remotely holding jobs
>>>>
>>>> Hi All,
>>>>
>>>> So the system I'm working on inspects job queues from various condor
>>>> instances and provisions cloud resources to run the jobs.
>>>>
>>>> As a part of this process, jobs are held if they do not conform to
>>>> certain conditions; a list of jobs is compiled and then held using:
>>>>
>>>> condor_session.act(htcondor.JobAction.Hold, held_job_ids)
>>>>
>>>> for a little more context:
>>>>
>>>> try:
>>>>     logging.debug("Executing job action hold on %s" % condor_host)
>>>>     hold_result = condor_session.act(htcondor.JobAction.Hold, held_job_ids)
>>>>     logging.debug("Hold result: %s" % hold_result)
>>>>     condor_session.edit(held_job_ids, "HoldReason",
>>>>                         '"Invalid user or group name for htcondor host %s, held by job poller"' % condor_host)
>>>> except Exception as exc:
>>>>     logging.error("Failure holding jobs: %s" % exc)
>>>>     logging.error("Aborting cycle...")
>>>>     abort_cycle = True
>>>>     break
>>>>
>>>>
>>>> I am pretty sure the error has something to do with the configuration on
>>>> the remote condor host, but my real issue is that it causes the Python
>>>> code to crash with no exception.
>>>> This is a snapshot of the Schedd log from the remote condor in question:
>>>>
>>>> 03/22/19 10:53:33 (pid:2277705) AUTHENTICATE: handshake failed!
>>>> 03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: authentication of
>>>> <IPADDR:44307> did not result in a valid mapped user name, which is
>>>> required for this command (478 ACT_ON_JOBS), so aborting.
>>>> 03/22/19 10:53:33 (pid:2277705) DC_AUTHENTICATE: reason for
>>>> authentication failure: AUTHENTICATE:1002:Failure performing
>>>> handshake|AUTHENTICATE:1004:Failed to authenticate using
>>>> KERBEROS|AUTHENTICATE:1004:Failed to authenticate using
>>>> FS|FS:1004:Unable to lstat(/tmp/FS_XXXMc7VmW)
>>>>
>>>> Any ideas on how I can stop this crash?
>>>>
>>>> Thanks,
>>>> Colson
>>>>

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

