
Re: [HTCondor-users] child failed because PRIV_USER_FINAL process was still root before exec



On Tue, Feb 19, 2013 at 11:07:13PM -0500, Jason Ferrara wrote:
> When running a dagman job with approximately 10000  nodes, I'm
> seeing occasional random job failures with
> 
> 02/19/13 22:16:14 Starting a VANILLA universe job with ID: 240791.0
> 02/19/13 22:16:14 IWD: /my/data/dir
> 02/19/13 22:16:14 About to exec /home/jferrara/bin/myprog.py
> /my/input/dir/infile
> 02/19/13 22:16:14 Running job as user jferrara
> 02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py):
> child failed because PRIV_USER_FINAL process was still root before
> exec()
> 02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py,
> /my/input/dir/infile, ...) failed: (errno=666666: 'Unknown error
> 666666')
> 02/19/13 22:16:15 Failed to start job, exiting
> 
> in the Starter log.
> 
> This is on a setup with one central manager and 6 execute systems,
> all running linux.
> 
> Where and when the jobs fail seem completely random. Often I can get
> through all 10000 jobs without a failure.
> 
> Does anyone have any idea what's going on or have any suggestions on
> how to debug this?

Possibly you landed on a misconfigured machine?

With DAGMan, you can add a "RETRY" line for a node, so that DAGMan will
retry that job instead of simply marking it as failed. This is valuable
when the failures really are random/intermittent.

http://research.cs.wisc.edu/htcondor/manual/v7.9/2_10DAGMan_Applications.html#dagman:retry
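
As a rough sketch, a RETRY entry in your .dag file might look like the
following (the node name, submit file, and retry count here are just
placeholders, not taken from your setup):

    # Define the node and allow DAGMan to retry it up to 3 times on failure
    JOB MyNode myprog.sub
    RETRY MyNode 3

DAGMan will then resubmit that node up to 3 times before treating it as
a permanent failure.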

Nathan Panike