
Re: [HTCondor-users] SYSTEM_PERIODIC_REMOVE question



I tried running condor as root and still get the same PID namespace error, but maybe that’s not the real cause…

 

06/27/14 11:42:52 (112.0) (10538): ERROR "Error from slot1@xxxxxxxxxxxxxxxxxxxxxxx: Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly" at line 558 in file /slots/01/dir_36628/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp

06/27/14 11:42:53 Result of reading /etc/issue:  Scientific Linux release 6.5 (Carbon)

 

The file is there, though:

 

[root@dev7242 condor]# ll /usr/libexec/condor/condor_pid_ns_init

-rwxr-xr-x 1 root root 8576 Jun 20 18:27 /usr/libexec/condor/condor_pid_ns_init

 

?

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On behalf of SCHAER Frederic
Sent: Friday, 27 June 2014 09:58
To: HTCondor-Users Mail List
Subject: [PROVENANCE INTERNET] Re: [HTCondor-users] SYSTEM_PERIODIC_REMOVE question

 

Hi,

 

Attached is the condor-generated job, as I submitted things through an ARC CE.

I am running condor-8.2.0-254849.x86_64.

 

The ARC CE xRSL is:

 

& (executable="testarc.sh")

(inputFiles=("testarc.sh" ""))

(stdout="stdout.txt")

(stderr="stderr.txt")

(count=1)

(memory=100)

(gmlog=".arc")

 

The testarc.sh script just runs a few commands such as “env” and “mount”, followed by a very long sleep. I wanted to test job killing (memory, walltime…).

 

Since I had to locate the condor-generated log anyway, I also found this in the logs:

 

...

007 (106.000.000) 06/25 16:42:03 Shadow exception!

        Error from slot1@xxxxxxxxxxxxxxxxxxxxxxx: Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly

        0  -  Run Bytes Sent By Job

        0  -  Run Bytes Received By Job

...

001 (106.000.000) 06/25 16:45:53 Job executing on host: <192.54.207.242:60981>

...

007 (106.000.000) 06/25 16:50:53 Shadow exception!

        Error from slot1@xxxxxxxxxxxxxxxxxxxxxxx: Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly

        0  -  Run Bytes Sent By Job

        19085  -  Run Bytes Received By Job

...

001 (106.000.000) 06/25 16:52:53 Job executing on host: <192.54.207.242:60981>

...

007 (106.000.000) 06/25 16:57:53 Shadow exception!

        Error from slot1@xxxxxxxxxxxxxxxxxxxxxxx: Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly

        0  -  Run Bytes Sent By Job

        19085  -  Run Bytes Received By Job

 

This goes on for a very long time, until, I guess, the job/sleep ends.
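
To put a number on how often this loop repeats, the event headers in the job log can be tallied per job. The following is only a minimal Python sketch: the header layout and the meaning of the codes (001 for "Job executing", 007 for "Shadow exception") are inferred solely from the excerpt above, and the function name count_events is hypothetical, not part of HTCondor.

```python
import re
from collections import Counter

# Event header layout as seen in the excerpt above, e.g.
# "007 (106.000.000) 06/25 16:42:03 Shadow exception!"
# (assumption: all event headers in the log follow this shape)
EVENT_HEADER = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)


def count_events(log_text):
    """Count event codes per job ("cluster.proc") in a job log.

    Returns a Counter keyed by (job_id, event_code). Lines that are not
    event headers (indented details, "..." separators) are ignored.
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line.strip())
        if match:
            job_id = f"{int(match['cluster'])}.{int(match['proc'])}"
            counts[(job_id, match["code"])] += 1
    return counts
```

Calling count_events on the contents of the job log and looking up the key ("106.0", "007") would give the number of shadow exceptions recorded for job 106.0, which can then be compared with the number of 001 "executing" events.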

I have “USE_PID_NAMESPACES = true” in the startd config.d directory

 

I configured condor to run as the condor user rather than root, since I read that it just drops privileges anyway (and running as root prevents the benchmarks from succeeding at startup). The CONDOR_IDS variable is correctly set to the condor uid/gid, but I realize the condor UID on the startd machine differs from the one on the scheduler and collector machines: might that be an issue?
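
One way to check the uid/gid setup described above on each machine is to compare the CONDOR_IDS value (written as uid.gid) against the local condor account. This is a hedged Python sketch: the helper names are hypothetical, the numeric ids in the usage note are made up for illustration, and it does not settle whether differing UIDs between machines actually cause the error.

```python
import pwd


def parse_condor_ids(value):
    """Split a CONDOR_IDS value of the form 'uid.gid' into two integers."""
    uid_text, gid_text = value.strip().split(".", 1)
    return int(uid_text), int(gid_text)


def local_ids(username="condor"):
    """Return (uid, gid) of a local account, or None if it does not exist."""
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        return None
    return entry.pw_uid, entry.pw_gid


def ids_match(condor_ids_value, username="condor"):
    """True if CONDOR_IDS agrees with the local account's uid and gid."""
    return local_ids(username) == parse_condor_ids(condor_ids_value)
```

For example, ids_match("496.493") run on the startd machine would report whether a (hypothetical) CONDOR_IDS of 496.493 matches the local condor account there; running the same check on the scheduler and collector machines shows where the ids diverge.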

 

Regards

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On behalf of Greg Thain
Sent: Thursday, 26 June 2014 17:53
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] SYSTEM_PERIODIC_REMOVE question

 

On 06/26/2014 05:20 AM, SCHAER Frederic wrote:

Hmmm…

 

According to the shadowlog, the job in fact is killed, but it’s put back in queue and restarted afterwards:

06/26/14 12:04:04 (106.0) (2973): Updating Job Queue: SetAttribute(NumJobStarts = 180)

06/26/14 12:04:04 (106.0) (2973): Updating Job Queue: SetAttribute(RecentBlockReadKbytes = 0)

06/26/14 12:04:04 (106.0) (2973): Updating Job Queue: SetAttribute(RecentBlockReads = 0)

 

Looks like I am missing something needed to really kill the job and remove it from the queue: any idea?

 


A quick test shows this working here -- can you share your job submit file with us, and tell us which version of condor you are running?

-greg