
Re: [HTCondor-users] Abaqus + JobWrapper: Unable to kill job via condor_rm

On 1/22/2021 3:22 AM, christoph.beyer@xxxxxxx wrote:
Hi Felix,

this is partly a UNIX 'problem': exec never returns, it replaces the bash process that called it, so any traps you set up in that script will never be handled or forwarded.

I don't see the necessity for your 2-line bash script; shouldn't something like:

executable = /opt/Abaqus/Commands/abq2017
arguments = job=sim01_NI1100 input=sim01_NI1100.inp user=umat.f inter

be more straightforward?
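Spelled out as a complete submit description, that might look like the sketch below (the transfer/output/error/log lines are illustrative guesses, not taken from your setup):

```
universe             = vanilla
executable           = /opt/Abaqus/Commands/abq2017
arguments            = job=sim01_NI1100 input=sim01_NI1100.inp user=umat.f inter
transfer_input_files = sim01_NI1100.inp, umat.f
output               = sim01_NI1100.out
error                = sim01_NI1100.err
log                  = sim01_NI1100.log
queue
```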

In addition to Christoph's suggestion above to simply get rid of the wrapper script: what version of HTCondor are you using on the execute nodes (condor_version will tell you), and was HTCondor installed and running as root on the execute nodes? I ask because with HTCondor v8.8 and above, when started as root, HTCondor should by default use Linux's control groups (cgroups) mechanism to make sure all processes involved with a job get killed, including 'orphaned' processes like the ones Christoph describes.

On your execute machine does "condor_config_val BASE_CGROUP" return "htcondor" (which it should by default) ?
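You can also check by hand whether a job's processes actually landed in a cgroup: every process lists its cgroup memberships in /proc/&lt;pid&gt;/cgroup, and on a cgroup-enabled execute node a job's entry should contain a path with "htcondor" in it. A quick sketch, using the shell's own PID just to show the mechanics (substitute the PID of a stuck Abaqus process):

```shell
#!/bin/sh
# Every process's cgroup memberships are listed in /proc/<pid>/cgroup,
# one "hierarchy-id:controllers:path" line each.  For a job running
# under a cgroup-enabled HTCondor, the path should mention "htcondor".
pid=$$
cat /proc/"$pid"/cgroup
```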

What distro of Linux are you using?

Another random thought is to add the following config knob to your config (on your execute nodes, or on all nodes is fine as well):

  USE_PID_NAMESPACES = true
This will tell Linux to put each job in its own pid namespace, meaning the job cannot "see" other processes running on the system with tools like /bin/ps. This can cause problems for a very small percentage of applications, but works fine with 95% of apps out there. In your case, another advantage of pid namespaces is that they tell the Linux kernel itself to track and kill all processes associated with a job.
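As a side note, you can see which pid namespace a process lives in from /proc; two processes in the same namespace report the same inode number. A small sketch (comparing the shell against itself here, purely to show the mechanics):

```shell
#!/bin/sh
# /proc/<pid>/ns/pid is a symlink naming the pid namespace a process
# belongs to, e.g. "pid:[4026531836]".  With pid namespaces enabled,
# a job's processes would report a different inode than, say, the
# condor_startd or other system daemons.
readlink /proc/$$/ns/pid
```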

Final random thought: were any of your still-running Abaqus processes stuck in the "D" (uninterruptible disk I/O) state when you looked at them with /bin/ps? On Linux, processes stuck on I/O cannot be killed, even with "kill -9". I have seen this happen when, for instance, a job is using a stale/stuck NFS mount.
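To hunt for those, something like the following lists every process currently in the "D" state (the column format specifiers here are from procps-ng ps on Linux):

```shell
#!/bin/sh
# Print the ps header plus any process whose state field starts with
# "D" (uninterruptible sleep).  The wchan column shows the kernel
# function the process is blocked in -- often an NFS or block-I/O
# routine when a mount has gone stale.
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'
```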

Hope the above helps