Re: [HTCondor-users] Abaqus + JobWrapper: Unable to kill job via condor_rm
- Date: Fri, 22 Jan 2021 11:46:32 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Abaqus + JobWrapper: Unable to kill job via condor_rm
This is partly a UNIX 'problem': by using exec you replace the previous bash process. exec never comes back; it replaces the process that called it, so signals trapped by the previous process will not be handled or forwarded either.
I don't see the necessity for your two-line bash script; wouldn't something like:

executable = /opt/Abaqus/Commands/abq2017
arguments = job=sim01_NI1100 input=sim01_NI1100.inp user=umat.f inter

be more straightforward?
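For context, a complete minimal submit description built around that suggestion might look like the sketch below. Only the executable and arguments lines come from the thread; the file-transfer lines and queue statement are assumptions you would adapt to your site:

```
# Sketch of a minimal submit file (transfer settings are assumptions)
executable = /opt/Abaqus/Commands/abq2017
arguments  = job=sim01_NI1100 input=sim01_NI1100.inp user=umat.f inter
should_transfer_files = YES
transfer_input_files  = sim01_NI1100.inp, umat.f
queue
```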
In addition to Christoph's suggestion above to simply get rid of the
wrapper script, what version of HTCondor are you using on the
execute nodes (condor_version will tell you), and was HTCondor
installed / running as root on the execute nodes? I ask because
with HTCondor v8.8 and above, when started as root HTCondor should
be using Linux's control groups (cgroups) mechanism by default to
make sure all processes involved with a job get killed --- even
'orphaned' processes like Christoph describes.
On your execute machine does "condor_config_val BASE_CGROUP" return
"htcondor" (which it should by default) ?
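The checks above can be run in one short pass on an execute node; a sketch, assuming HTCondor's command-line tools are on the PATH (the `ps` line is an extra assumption, added to confirm whether the daemons run as root):

```shell
# Quick sanity checks on an execute node
condor_version                  # reports the HTCondor version
condor_config_val BASE_CGROUP   # should print "htcondor" by default
ps -o user= -C condor_master    # shows which user condor_master runs as
```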
What distro of Linux are you using?
Another random thought is to add the following config knob to your
config (on your execute nodes, or on all nodes is fine as well):
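The knob itself appears to have been dropped from the archived message; based on the pid-namespace description that follows, it is presumably HTCondor's USE_PID_NAMESPACES setting (an assumption reconstructed from context, not quoted from the original mail):

```
# Assumption: reconstructed from the surrounding text; enables a
# per-job pid namespace on Linux execute nodes
USE_PID_NAMESPACES = True
```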
This will tell Linux to put each job in its own pid namespace,
meaning the job cannot "see" other processes running on the system
with things like /bin/ps ... this can cause problems for a very
small percentage of applications, but works fine with 95% of apps
out there. In your case, another advantage of pid namespaces is
that it tells the Linux kernel itself to track and kill all
processes associated with a job.
Final random thought: were any of your still-running Abaqus
processes stuck in the "D" (uninterruptible disk I/O) state when you
looked at them with /bin/ps? On Linux, processes stuck on I/O cannot
be killed, even with "kill -9". I have seen this happen when, for
instance, a job is using a stale/stuck NFS mount....
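A quick way to check for that is sketched below; it simply filters the process table for D-state entries (nothing Abaqus-specific is assumed, so scan the output for your job's processes):

```shell
# List processes stuck in uninterruptible sleep ("D" state);
# if an Abaqus process appears here, even kill -9 cannot remove it
ps -eo pid,stat,comm | awk '$2 ~ /^D/'
```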
Hope the above helps