Re: [HTCondor-users] Abaqus + JobWrapper: Unable to kill job via condor_rm
- Date: Fri, 22 Jan 2021 11:46:32 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Abaqus + JobWrapper: Unable to kill job via condor_rm
This is partly a UNIX 'problem': by using exec you replace the previous bash process. exec never comes back; it replaces the process that called it, so signals trapped by the previous process will not be handled or forwarded either.
I don't see the necessity for your two-line bash script; wouldn't something like:

executable = /opt/Abaqus/Commands/abq2017
arguments = job=sim01_NI1100 input=sim01_NI1100.inp user=umat.f inter

be more straightforward?
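For context, a complete minimal submit description built around that suggestion might look like the sketch below. Only the executable and arguments lines come from the thread; the file-transfer lines and queue statement are assumptions you would adapt to your site:

```
# Sketch of a minimal submit file (transfer settings are assumptions)
executable = /opt/Abaqus/Commands/abq2017
arguments  = job=sim01_NI1100 input=sim01_NI1100.inp user=umat.f inter
should_transfer_files = YES
transfer_input_files  = sim01_NI1100.inp, umat.f
queue
```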
In addition to Christoph's suggestion above to simply get rid of the
wrapper script, what version of HTCondor are you using on the
execute nodes (condor_version will tell you), and was HTCondor
installed / running as root on the execute nodes? I ask because
with HTCondor v8.8 and above, when started as root HTCondor should
be using Linux's control groups (cgroups) mechanism by default to
make sure all processes involved with a job get killed --- even
'orphaned' processes like Christoph describes.
On your execute machine does "condor_config_val BASE_CGROUP" return
"htcondor" (which it should by default) ?
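The checks above can be run in one short pass on an execute node; a sketch, assuming HTCondor's command-line tools are on the PATH (the `ps` line is an extra assumption, added to confirm whether the daemons run as root):

```shell
# Quick sanity checks on an execute node
condor_version                  # reports the HTCondor version
condor_config_val BASE_CGROUP   # should print "htcondor" by default
ps -o user= -C condor_master    # shows which user condor_master runs as
```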
What distro of Linux are you using?
Another random thought is to add the following config knob to your
config (on your execute nodes, or on all nodes is fine as well):
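The knob itself appears to have been dropped from the archived message; based on the pid-namespace description that follows, it is presumably HTCondor's USE_PID_NAMESPACES setting (an assumption reconstructed from context, not quoted from the original mail):

```
# Assumption: reconstructed from the surrounding text; enables a
# per-job pid namespace on Linux execute nodes
USE_PID_NAMESPACES = True
```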
This will tell Linux to put each job in its own pid namespace,
meaning the job cannot "see" other processes running on the system
with things like /bin/ps ... this can cause problems for a very
small percentage of applications, but works fine with 95% of apps
out there. In your case, another advantage of pid namespaces is
that it tells the Linux kernel itself to track and kill all
processes associated with a job.
Final random thought: were any of your still-running Abaqus
processes stuck in the "D" (uninterruptible disk I/O) state when you
looked at them with /bin/ps? On Linux, processes stuck on I/O cannot
be killed, even with "kill -9". I have seen this happen when, for
instance, a job is using a stale/stuck NFS mount....
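A quick way to check for that is sketched below; it simply filters the process table for D-state entries (nothing Abaqus-specific is assumed, so scan the output for your job's processes):

```shell
# List processes stuck in uninterruptible sleep ("D" state);
# if an Abaqus process appears here, even kill -9 cannot remove it
ps -eo pid,stat,comm | awk '$2 ~ /^D/'
```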
Hope the above helps