
Re: [HTCondor-users] Problem with condor_pid_ns_init



Hi Todd,

> On Feb 9, 2017, at 4:10 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
> 
> On 2/9/2017 2:29 PM, Duncan Brown wrote:
>> Hi all,
>> 
>> We're trying to use PID NAMESPACES, but I'm seeing the following error in my starter logs:
>> 
>> 02/07/17 18:08:42 (pid:6967) ERROR "Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly" at line 771 in file /slots/15/dir_2683933/userdir/.tmpCdykJF/BUILD/condor-8.6.0/src/condor_starter.V6.1/vanilla_proc.cpp
>> 
> 
> The line in the starter log immediately preceding the ERROR line above may provide some clues.  Could you include 10-20 lines from the log leading up to the error?

Full log below.

> What does condor_config_val USE_PID_NAMESPACE_INIT say on your execute node?

[root@CRUSH-SUGWG-OSG-10-5-149-26 ~]# condor_config_val USE_PID_NAMESPACE_INIT
Not defined: USE_PID_NAMESPACE_INIT

Other stuff that may be related:

[root@CRUSH-SUGWG-OSG-10-5-149-26 ~]# condor_config_val -dump | grep -i pid
EC2_GAHP_DEBUG = D_PID
MAX_PID_COLLISION_RETRY = 9
PID = 122064
PID_SNAPSHOT_INTERVAL = 15
PPID = 121992
SCHEDD_DEBUG = D_PID
STARTER_DEBUG = D_PID 
USE_PID_NAMESPACES = True

[root@CRUSH-SUGWG-OSG-10-5-149-26 ~]# cat /etc/condor/sugwg-job-wrapper.sh
#!/bin/bash

# Set the umask for LSC users so that others do not have read permission
if [ $UID -gt 199 ] && [ "`id -gn`" = "lsc" ]; then
    umask 027
else
    umask 022
fi

exec "$@"
error=$?
echo "Failed to exec($error): $@" > $_CONDOR_WRAPPER_ERROR_FILE
exit 1
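
One way to rule the wrapper in or out is to run it by hand, outside HTCondor. This is only a rough sketch: it writes a copy of the script above to a temporary file and runs a command through it. The helper name check_wrapper and the temporary error-file path are made up for illustration. It shows only whether the wrapper passes a command's exit status through exec. It says nothing about how condor_pid_ns_init behaves under the starter.

```shell
#!/bin/bash
# Write a copy of the wrapper shown above to a temporary file.
wrapper_copy="$(mktemp)"
cat > "$wrapper_copy" <<'EOF'
#!/bin/bash
if [ $UID -gt 199 ] && [ "`id -gn`" = "lsc" ]; then
    umask 027
else
    umask 022
fi
exec "$@"
error=$?
echo "Failed to exec($error): $@" > $_CONDOR_WRAPPER_ERROR_FILE
exit 1
EOF

# Hypothetical helper: run a command through the wrapper and report
# the exit status that comes back.
check_wrapper() {
    local wrapper="$1"
    shift
    # Normally set by the starter; this path is an assumption for the test.
    export _CONDOR_WRAPPER_ERROR_FILE="${TMPDIR:-/tmp}/wrapper_error.$$"
    bash "$wrapper" "$@"
    local status=$?
    echo "wrapper exit status: $status"
    return "$status"
}
```

For example, `check_wrapper "$wrapper_copy" sh -c 'exit 7'` should report status 7 if the wrapper hands control to the command as intended.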

Cheers,
Duncan.

02/07/17 17:55:06 (pid:6872) ******************************************************
02/07/17 17:55:06 (pid:6872) ** condor_starter (CONDOR_STARTER) STARTING UP
02/07/17 17:55:06 (pid:6872) ** /usr/sbin/condor_starter
02/07/17 17:55:06 (pid:6872) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
02/07/17 17:55:06 (pid:6872) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
02/07/17 17:55:06 (pid:6872) ** $CondorVersion: 8.6.0 Jan 26 2017 BuildID: 395190 $
02/07/17 17:55:06 (pid:6872) ** $CondorPlatform: x86_64_RedHat7 $
02/07/17 17:55:06 (pid:6872) ** PID = 6872
02/07/17 17:55:06 (pid:6872) ** Log last touched time unavailable (No such file or directory)
02/07/17 17:55:06 (pid:6872) ******************************************************
02/07/17 17:55:06 (pid:6872) Using config source: /etc/condor/condor_config
02/07/17 17:55:06 (pid:6872) Using local config sources: 
02/07/17 17:55:06 (pid:6872)    /etc/condor/condor_config.local
02/07/17 17:55:06 (pid:6872) config Macros = 91, Sorted = 90, StringBytes = 2942, TablesBytes = 3324
02/07/17 17:55:06 (pid:6872) CLASSAD_CACHING is OFF
02/07/17 17:55:06 (pid:6872) Daemon Log is logging: D_ALWAYS D_ERROR
02/07/17 17:55:06 (pid:6872) SharedPortEndpoint: waiting for connections to named socket 3113_23d8_23
02/07/17 17:55:06 (pid:6872) DaemonCore: command socket at <10.5.149.26:9618?addrs=10.5.149.26-9618+[--1]-9618&noUDP&sock=3113_23d8_23>
02/07/17 17:55:06 (pid:6872) DaemonCore: private command socket at <10.5.149.26:9618?addrs=10.5.149.26-9618+[--1]-9618&noUDP&sock=3113_23d8_23>
02/07/17 17:55:06 (pid:6872) Communicating with shadow <128.230.146.18:9615?addrs=128.230.146.18-9615+[--1]-9615&noUDP&sock=2859_4fee_8552>
02/07/17 17:55:06 (pid:6872) Submitting machine is "10.5.2.3"
02/07/17 17:55:06 (pid:6872) setting the orig job name in starter
02/07/17 17:55:06 (pid:6872) setting the orig job iwd in starter
02/07/17 17:55:06 (pid:6872) Chirp config summary: IO false, Updates false, Delayed updates true.
02/07/17 17:55:06 (pid:6872) Initialized IO Proxy.
02/07/17 17:55:06 (pid:6872) Done setting resource limits
02/07/17 17:55:06 (pid:6872) File transfer completed successfully.
02/07/17 17:55:06 (pid:6872) Job 5371589.0 set to execute immediately
02/07/17 17:55:06 (pid:6872) Starting a VANILLA universe job with ID: 5371589.0
02/07/17 17:55:06 (pid:6872) IWD: /var/lib/condor/execute/dir_6872
02/07/17 17:55:06 (pid:6872) Output file: /var/lib/condor/execute/dir_6872/_condor_stdout
02/07/17 17:55:06 (pid:6872) Error file: /var/lib/condor/execute/dir_6872/_condor_stderr
02/07/17 17:55:06 (pid:6872) Renice expr "0" evaluated to 0
02/07/17 17:55:06 (pid:6872) Using wrapper /etc/condor/sugwg-job-wrapper.sh to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe
02/07/17 17:55:06 (pid:6872) Running job as user dbrown
02/07/17 17:55:06 (pid:6872) Create_Process succeeded, pid=6876
02/07/17 17:55:06 (pid:6872) Limiting (soft) memory usage to 2147483648 bytes
02/07/17 17:55:06 (pid:6872) Limiting (hard) memory usage to 48914759680 bytes
02/07/17 17:55:06 (pid:6872) Limiting memsw usage to 48914763776 bytes
02/07/17 18:09:59 (pid:6872) Got SIGQUIT.  Performing fast shutdown.
02/07/17 18:09:59 (pid:6872) ShutdownFast all jobs.
02/07/17 18:09:59 (pid:6872) Process exited, pid=6876, signal=9
02/07/17 18:09:59 (pid:6872) JobReaper: condor_pid_ns_init didn't drop filename /var/lib/condor/execute/dir_6872/.condor_pid_ns_status (2)
02/07/17 18:09:59 (pid:6872) ERROR "Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly" at line 771 in file /slots/15/dir_2683933/userdir/.tmpCdykJF/BUILD/condor-8.6.0/src/condor_starter.V6.1/vanilla_proc.cpp
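
For anyone who wants only the lines leading up to the error rather than the full log, something like the following would pull them out. It is a sketch, not a tested recipe for every install: the function name lines_before_error is made up, and the log path in the usage line is an assumption (the starter log location depends on the LOG setting and the slot name).

```shell
#!/bin/bash
# Print the N lines preceding (and including) each PID-namespace ERROR line
# in a starter log. N defaults to 20.
# Usage: lines_before_error <logfile> [N]
lines_before_error() {
    local logfile="$1"
    local n="${2:-20}"
    # -F matches the message as a fixed string; -B prints N lines of leading context.
    grep -B "$n" -F 'ERROR "Starter configured to use PID NAMESPACES' "$logfile"
}
```

Usage would be along the lines of `lines_before_error /var/log/condor/StarterLog.slot1 20`, with the path adjusted to wherever the starter log lives on the execute node.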



> thanks
> Todd
> 
> 
> 
> 
>> The program exists and is executable:
>> 
>> [root@CRUSH-SUGWG-OSG-10-5-149-17 ~]# locate condor_pid_ns_init
>> /usr/libexec/condor/condor_pid_ns_init
>> [root@CRUSH-SUGWG-OSG-10-5-149-17 ~]# less /usr/libexec/condor/condor_pid_ns_init
>> "/usr/libexec/condor/condor_pid_ns_init" may be a binary file.  See it anyway?
>> 
>> [root@CRUSH-SUGWG-OSG-10-5-149-17 ~]# /usr/libexec/condor/condor_pid_ns_init
>> [root@CRUSH-SUGWG-OSG-10-5-149-17 ~]# echo $?
>> 0
>> 
>> [root@CRUSH-SUGWG-OSG-10-5-149-17 ~]# /usr/libexec/condor/condor_pid_ns_init -help
>> 
>> (silence)
>> 
>> Any ideas?
>> 
>> Cheers,
>> Duncan.
>> 
> 
> 
> -- 
> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
> Center for High Throughput Computing   Department of Computer Sciences
> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                  Madison, WI 53706-1685
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/

-- 

Duncan Brown                         http://dbrown10.expressions.syr.edu
Charles Brightman Professor of Physics     Room 263-1 Physics Department
Director of the Graduate Program      Syracuse University, NY 13244, USA
Phone: 315 443 5993                                    Fax: 315 443 9103