
Re: [HTCondor-users] child failed because PRIV_USER_FINAL process was still root before exec



On Tue, Feb 19, 2013 at 11:07:13PM -0500, Jason Ferrara wrote:
> When running a dagman job with approximately 10000  nodes, I'm
> seeing occasional random job failures with
> 
> 02/19/13 22:16:14 Starting a VANILLA universe job with ID: 240791.0
> 02/19/13 22:16:14 IWD: /my/data/dir
> 02/19/13 22:16:14 About to exec /home/jferrara/bin/myprog.py
> /my/input/dir/infile
> 02/19/13 22:16:14 Running job as user jferrara
> 02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py):
> child failed because PRIV_USER_FINAL process was still root before
> exec()
> 02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py,
> /my/input/dir/infile, ...) failed: (errno=666666: 'Unknown error
> 666666')
> 02/19/13 22:16:15 Failed to start job, exiting
> 
> in the Starter log.
> 
> This is on a setup with one central manager and 6 execute systems,
> all running linux.
> 
> Where and when the jobs fail seem completely random. Often I can get
> through all 10000 jobs without a failure.
> 
> Does anyone have any idea what's going on or have any suggestions on
> how to debug this?

Possibly you landed on a misconfigured machine?

With DAGMan, you can add a "RETRY" line for a node, so that DAGMan will
retry that job instead of simply marking it as failed. This is valuable
when the failures really are random/intermittent.

http://research.cs.wisc.edu/htcondor/manual/v7.9/2_10DAGMan_Applications.html#dagman:retry
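
As a rough sketch, a RETRY entry in your .dag file might look like the
following (the node name, submit file, and retry count here are just
placeholders, not taken from your setup):

    # Define the node and allow DAGMan to retry it up to 3 times on failure
    JOB MyNode myprog.sub
    RETRY MyNode 3

DAGMan will then resubmit that node up to 3 times before treating it as
a permanent failure.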

Nathan Panike