Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] child failed because PRIV_USER_FINAL process was still root before exec

Date: Thu, 21 Feb 2013 13:46:39 -0500
From: Jason Ferrara <jason.ferrara@xxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] child failed because PRIV_USER_FINAL process was still root before exec

On 2/20/2013 11:05 AM, Nathan Panike wrote:

On Tue, Feb 19, 2013 at 11:07:13PM -0500, Jason Ferrara wrote:

When running a DAGMan job with approximately 10000 nodes, I'm
seeing occasional random job failures with

02/19/13 22:16:14 Starting a VANILLA universe job with ID: 240791.0
02/19/13 22:16:14 IWD: /my/data/dir
02/19/13 22:16:14 About to exec /home/jferrara/bin/myprog.py /my/input/dir/infile
02/19/13 22:16:14 Running job as user jferrara
02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py): child failed because PRIV_USER_FINAL process was still root before exec()
02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py, /my/input/dir/infile, ...) failed: (errno=666666: 'Unknown error 666666')
02/19/13 22:16:15 Failed to start job, exiting

in the Starter log.

This is on a setup with one central manager and 6 execute systems,
all running Linux.

Where and when the jobs fail seems completely random. Often I can get
through all 10000 jobs without a failure.

Does anyone have any idea what's going on or have any suggestions on
how to debug this?

Possibly you landed on a misconfigured machine?

No, which is why I'm at a loss. A given execute machine will run a bunch of jobs successfully, and then fail a job.

Is it possible there is a timeout issue in condor when querying user information? I'm using ldap+sssd for user accounts, and I've noticed that while account info is usually returned immediately (when running "groups <username>", for example), every once in a while it takes a couple of seconds.
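
One way to check whether slow account lookups line up with the failures is to time the lookup repeatedly on an execute machine. The following is only a rough sketch, not anything HTCondor itself does: the function names, trial count, and threshold are made up for illustration, and it assumes that a passwd lookup followed by a group-list lookup is a fair stand-in for what "groups <username>" exercises. It also cannot tell an answer served from the sssd cache apart from one that went to LDAP, so a fast run proves little on its own.

```python
import os
import pwd
import time


def lookup_account(username):
    """Resolve the passwd entry and group list for username via NSS.

    Hypothetical helper: approximates the lookups behind `groups <username>`.
    Raises KeyError if the user cannot be resolved.
    """
    entry = pwd.getpwnam(username)
    return os.getgrouplist(username, entry.pw_gid)


def time_lookups(username, trials=100, threshold=0.5, lookup=lookup_account):
    """Run the lookup `trials` times and report the outliers.

    Returns (slow, failed): slow is a list of (trial_index, seconds) for
    lookups that took at least `threshold` seconds, failed is a list of
    trial indexes where the lookup raised KeyError (user not resolved).
    """
    slow = []
    failed = []
    for i in range(trials):
        start = time.perf_counter()
        try:
            lookup(username)
        except KeyError:
            failed.append(i)
        elapsed = time.perf_counter() - start
        if elapsed >= threshold:
            slow.append((i, elapsed))
    return slow, failed
```

Calling, say, time_lookups("jferrara", trials=1000, threshold=1.0) while the DAG is running, and comparing the wall-clock times of any slow or failed trials with the timestamps in the Starter log, would show whether the two coincide. A lookup that fails outright would be more interesting than one that is merely slow.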
