
Re: [HTCondor-users] child failed because PRIV_USER_FINAL process was still root before exec



On 2/20/2013 11:05 AM, Nathan Panike wrote:
On Tue, Feb 19, 2013 at 11:07:13PM -0500, Jason Ferrara wrote:
When running a DAGMan job with approximately 10000 nodes, I'm
seeing occasional random job failures with

02/19/13 22:16:14 Starting a VANILLA universe job with ID: 240791.0
02/19/13 22:16:14 IWD: /my/data/dir
02/19/13 22:16:14 About to exec /home/jferrara/bin/myprog.py /my/input/dir/infile
02/19/13 22:16:14 Running job as user jferrara
02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py): child failed because PRIV_USER_FINAL process was still root before exec()
02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py, /my/input/dir/infile, ...) failed: (errno=666666: 'Unknown error 666666')
02/19/13 22:16:15 Failed to start job, exiting

in the Starter log.

This is on a setup with one central manager and 6 execute systems,
all running Linux.

Where and when the jobs fail seems completely random. Often I can get
through all 10000 jobs without a failure.

Does anyone have any idea what's going on or have any suggestions on
how to debug this?

Possibly you landed on a misconfigured machine?

No, which is why I'm at a loss. A given execute machine will run a bunch of jobs successfully, and then fail a job.

Is it possible there is a timeout issue in Condor when querying user information? I'm using LDAP+sssd for user accounts, and I've noticed that account info is usually returned immediately (when running "groups <username>", for example), but every once in a while it takes a couple of seconds.
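
One way to put numbers on the slow-lookup suspicion would be to time the account lookups repeatedly on an execute machine and record the outliers. The following is only a rough sketch: the function names, the 0.5 second threshold, and the iteration count are made up for illustration, and it measures lookups through the system's name service (roughly what "groups <username>" does), not whatever HTCondor does internally. Local caching (sssd, nscd) may hide slow responses, so a clean run does not rule the theory out.

```python
import os
import pwd
import time


def time_user_lookup(username):
    """Return (elapsed_seconds, error) for one passwd + group-list lookup."""
    start = time.monotonic()
    error = None
    try:
        # passwd lookup through the system name service (LDAP/sssd if configured)
        entry = pwd.getpwnam(username)
        # supplementary group list, similar to what `groups <username>` reports
        os.getgrouplist(username, entry.pw_gid)
    except KeyError as exc:
        # pwd.getpwnam raises KeyError when the user cannot be found
        error = str(exc)
    return time.monotonic() - start, error


def find_slow_lookups(username, iterations=100, threshold=0.5, pause=0.0):
    """Repeat the lookup and collect (index, elapsed, error) for slow or failed tries.

    The threshold (in seconds) and iteration count are arbitrary illustrative values.
    """
    slow = []
    for i in range(iterations):
        elapsed, error = time_user_lookup(username)
        if error is not None or elapsed >= threshold:
            slow.append((i, elapsed, error))
        if pause:
            time.sleep(pause)
    return slow


if __name__ == "__main__":
    # Time lookups for the account running this script.
    name = pwd.getpwuid(os.getuid()).pw_name
    for index, elapsed, error in find_slow_lookups(name, iterations=200, pause=0.1):
        print(f"lookup {index}: {elapsed:.3f}s error={error}")
```

Running this in a loop on an execute machine while jobs are active, and comparing the timestamps of any slow or failed lookups against the failures in the Starter log, might show whether the two are related.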