
Re: [HTCondor-users] Problem with condor_pid_ns_init



On 2/9/2017 3:25 PM, Duncan Brown wrote:
Hi Todd,

On Feb 9, 2017, at 4:10 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:

On 2/9/2017 2:29 PM, Duncan Brown wrote:
Hi all,

We're trying to use PID NAMESPACES, but I'm seeing the following error in my starter logs:

02/07/17 18:08:42 (pid:6967) ERROR "Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly" at line 771 in file /slots/15/dir_2683933/userdir/.tmpCdykJF/BUILD/condor-8.6.0/src/condor_starter.V6.1/vanilla_proc.cpp


The line in the starter log immediately preceding the ERROR line above may provide some clues. Could you include 10-20 lines from the log leading up to the error?
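
To pull that context out of a starter log, something like the following might help. This is only a sketch: context_before_error is a made-up helper name, not an HTCondor tool, and the log path in the usage line is an assumption (the starter log location varies per slot and per site).

```shell
# Print the N lines leading up to (and including) each line that
# contains "ERROR" in a log file. N defaults to 20.
# Usage: context_before_error LOGFILE [N]
# NOTE: hypothetical helper for illustration; it only wraps grep -B.
context_before_error() {
    grep -B "${2:-20}" 'ERROR' "$1"
}
```

For example, assuming the log lives at /var/log/condor/StarterLog.slot1:

    context_before_error /var/log/condor/StarterLog.slot1 20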

Full log below.



Thanks.

The interesting line is

 02/07/17 18:09:59 (pid:6872) Got SIGQUIT.  Performing fast shutdown.

So at 18:09:59 the condor_starter proceeded to kill -9 everything, which explains why "condor_pid_ns_init did not run properly". Clearly the condor_starter could be smarter, and not expect condor_pid_ns_init to do its thing given that it just killed it with SIGKILL. :)
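
One way to check other starter logs for this same pattern is sketched below. It assumes the log line layout shown in this thread (date, time, then a "(pid:N)" field), and killed_during_fast_shutdown is a hypothetical helper name, not part of HTCondor.

```shell
# Exit 0 if the log contains a "condor_pid_ns_init did not run properly"
# line that was preceded by "Got SIGQUIT" from the same starter pid;
# exit 1 otherwise.
# Usage: killed_during_fast_shutdown LOGFILE
# ASSUMPTION: field 3 of each log line is the "(pid:N)" tag.
killed_during_fast_shutdown() {
    awk '
        /Got SIGQUIT/ { quit[$3] = 1 }
        /condor_pid_ns_init did not run properly/ {
            if ($3 in quit) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

If it returns 0, the ERROR is likely just a side effect of the fast shutdown rather than a problem with condor_pid_ns_init itself.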

As for why the condor_starter was sent a SIGQUIT, you could look in the StartLog around that same time (18:09:59) for the answer. Off the top of my head, I imagine common causes would be: the startd KILL expression in condor_config evaluated to True; PREEMPT evaluated to True while WANT_VACATE evaluated to False; someone with sufficient permissions ran condor_vacate or condor_vacate_job; or the HTCondor service was asked to shut down on that machine via something like "service condor stop" or "condor_off -fast".
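
A rough sketch of how to narrow the StartLog down to the minute in question follows. lines_at_minute is a hypothetical helper, and it assumes each log line starts with a "MM/DD/YY HH:MM:SS" timestamp as in the starter log in this thread.

```shell
# Print the log lines whose leading timestamp starts with the given
# prefix, e.g. '02/07/17 18:09' for everything logged in that minute.
# Usage: lines_at_minute LOGFILE 'MM/DD/YY HH:MM'
# NOTE: hypothetical helper; matches only at the start of a line.
lines_at_minute() {
    awk -v ts="$2" 'index($0, ts) == 1' "$1"
}
```

For example, assuming the StartLog is at /var/log/condor/StartLog (on many installs condor_config_val STARTD_LOG should report the actual path):

    lines_at_minute /var/log/condor/StartLog '02/07/17 18:09'

Running condor_config_val KILL (and likewise PREEMPT and WANT_VACATE) on the execute node would show the policy expressions mentioned above.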

Other than the ERROR entry in the starter log, does this cause problems for your workflows? I.e., does the job go on hold (I hope not!)?

regards
Todd


What does condor_config_val USE_PID_NAMESPACE_INIT say on your execute node?

[root@CRUSH-SUGWG-OSG-10-5-149-26 ~]# condor_config_val USE_PID_NAMESPACE_INIT
Not defined: USE_PID_NAMESPACE_INIT

Other stuff that may be related:

[root@CRUSH-SUGWG-OSG-10-5-149-26 ~]# condor_config_val -dump | grep -i pid
EC2_GAHP_DEBUG = D_PID
MAX_PID_COLLISION_RETRY = 9
PID = 122064
PID_SNAPSHOT_INTERVAL = 15
PPID = 121992
SCHEDD_DEBUG = D_PID
STARTER_DEBUG = D_PID
USE_PID_NAMESPACES = True

[root@CRUSH-SUGWG-OSG-10-5-149-26 ~]# cat /etc/condor/sugwg-job-wrapper.sh
#!/bin/bash

# Set the umask for LSC users so that others do not have read permission
if [ $UID -gt 199 ] && [ "`id -gn`" = "lsc" ]; then
    umask 027
else
    umask 022
fi

exec "$@"
error=$?
echo "Failed to exec($error): $@" > $_CONDOR_WRAPPER_ERROR_FILE
exit 1

Cheers,
Duncan.

02/07/17 17:55:06 (pid:6872) ******************************************************
02/07/17 17:55:06 (pid:6872) ** condor_starter (CONDOR_STARTER) STARTING UP
02/07/17 17:55:06 (pid:6872) ** /usr/sbin/condor_starter
02/07/17 17:55:06 (pid:6872) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
02/07/17 17:55:06 (pid:6872) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
02/07/17 17:55:06 (pid:6872) ** $CondorVersion: 8.6.0 Jan 26 2017 BuildID: 395190 $
02/07/17 17:55:06 (pid:6872) ** $CondorPlatform: x86_64_RedHat7 $
02/07/17 17:55:06 (pid:6872) ** PID = 6872
02/07/17 17:55:06 (pid:6872) ** Log last touched time unavailable (No such file or directory)
02/07/17 17:55:06 (pid:6872) ******************************************************
02/07/17 17:55:06 (pid:6872) Using config source: /etc/condor/condor_config
02/07/17 17:55:06 (pid:6872) Using local config sources:
02/07/17 17:55:06 (pid:6872)    /etc/condor/condor_config.local
02/07/17 17:55:06 (pid:6872) config Macros = 91, Sorted = 90, StringBytes = 2942, TablesBytes = 3324
02/07/17 17:55:06 (pid:6872) CLASSAD_CACHING is OFF
02/07/17 17:55:06 (pid:6872) Daemon Log is logging: D_ALWAYS D_ERROR
02/07/17 17:55:06 (pid:6872) SharedPortEndpoint: waiting for connections to named socket 3113_23d8_23
02/07/17 17:55:06 (pid:6872) DaemonCore: command socket at <10.5.149.26:9618?addrs=10.5.149.26-9618+[--1]-9618&noUDP&sock=3113_23d8_23>
02/07/17 17:55:06 (pid:6872) DaemonCore: private command socket at <10.5.149.26:9618?addrs=10.5.149.26-9618+[--1]-9618&noUDP&sock=3113_23d8_23>
02/07/17 17:55:06 (pid:6872) Communicating with shadow <128.230.146.18:9615?addrs=128.230.146.18-9615+[--1]-9615&noUDP&sock=2859_4fee_8552>
02/07/17 17:55:06 (pid:6872) Submitting machine is "10.5.2.3"
02/07/17 17:55:06 (pid:6872) setting the orig job name in starter
02/07/17 17:55:06 (pid:6872) setting the orig job iwd in starter
02/07/17 17:55:06 (pid:6872) Chirp config summary: IO false, Updates false, Delayed updates true.
02/07/17 17:55:06 (pid:6872) Initialized IO Proxy.
02/07/17 17:55:06 (pid:6872) Done setting resource limits
02/07/17 17:55:06 (pid:6872) File transfer completed successfully.
02/07/17 17:55:06 (pid:6872) Job 5371589.0 set to execute immediately
02/07/17 17:55:06 (pid:6872) Starting a VANILLA universe job with ID: 5371589.0
02/07/17 17:55:06 (pid:6872) IWD: /var/lib/condor/execute/dir_6872
02/07/17 17:55:06 (pid:6872) Output file: /var/lib/condor/execute/dir_6872/_condor_stdout
02/07/17 17:55:06 (pid:6872) Error file: /var/lib/condor/execute/dir_6872/_condor_stderr
02/07/17 17:55:06 (pid:6872) Renice expr "0" evaluated to 0
02/07/17 17:55:06 (pid:6872) Using wrapper /etc/condor/sugwg-job-wrapper.sh to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe
02/07/17 17:55:06 (pid:6872) Running job as user dbrown
02/07/17 17:55:06 (pid:6872) Create_Process succeeded, pid=6876
02/07/17 17:55:06 (pid:6872) Limiting (soft) memory usage to 2147483648 bytes
02/07/17 17:55:06 (pid:6872) Limiting (hard) memory usage to 48914759680 bytes
02/07/17 17:55:06 (pid:6872) Limiting memsw usage to 48914763776 bytes
02/07/17 18:09:59 (pid:6872) Got SIGQUIT.  Performing fast shutdown.
02/07/17 18:09:59 (pid:6872) ShutdownFast all jobs.
02/07/17 18:09:59 (pid:6872) Process exited, pid=6876, signal=9
02/07/17 18:09:59 (pid:6872) JobReaper: condor_pid_ns_init didn't drop filename /var/lib/condor/execute/dir_6872/.condor_pid_ns_status (2)
02/07/17 18:09:59 (pid:6872) ERROR "Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly" at line 771 in file /slots/15/dir_2683933/userdir/.tmpCdykJF/BUILD/condor-8.6.0/src/condor_starter.V6.1/vanilla_proc.cpp



thanks
Todd




The program exists and is executable:

[root@CRUSH-SUGWG-OSG-10-5-149-17 ~]# locate condor_pid_ns_init
/usr/libexec/condor/condor_pid_ns_init
[root@CRUSH-SUGWG-OSG-10-5-149-17 ~]# less /usr/libexec/condor/condor_pid_ns_init
"/usr/libexec/condor/condor_pid_ns_init" may be a binary file.  See it anyway?

[root@CRUSH-SUGWG-OSG-10-5-149-17 ~]# /usr/libexec/condor/condor_pid_ns_init
[root@CRUSH-SUGWG-OSG-10-5-149-17 ~]# echo $?
0

[root@CRUSH-SUGWG-OSG-10-5-149-17 ~]# /usr/libexec/condor/condor_pid_ns_init -help

(silence)

Any ideas?

Cheers,
Duncan.



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685


