
[HTCondor-users] child failed because PRIV_USER_FINAL process was still root before exec



When running a DAGMan job with approximately 10000 nodes, I'm seeing occasional random job failures with

02/19/13 22:16:14 Starting a VANILLA universe job with ID: 240791.0
02/19/13 22:16:14 IWD: /my/data/dir
02/19/13 22:16:14 About to exec /home/jferrara/bin/myprog.py /my/input/dir/infile
02/19/13 22:16:14 Running job as user jferrara
02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py): child failed because PRIV_USER_FINAL process was still root before exec()
02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py, /my/input/dir/infile, ...) failed: (errno=666666: 'Unknown error 666666')
02/19/13 22:16:15 Failed to start job, exiting

in the Starter log.

This is on a setup with one central manager and 6 execute systems, all running Linux.

Where and when the jobs fail seems completely random. Often I can get through all 10000 jobs without a failure.

Does anyone have any idea what's going on, or any suggestions on how to debug this?

Thanks