
[Condor-users] ERROR: unable to update job ad!



	I was recently tasked with setting up a Condor pool here at SCU on
our CentOS 4 systems.  I started with condor-6.6.10 and worked around a
few issues related to Linux 2.6 support, only to be stymied by the
following error in the StarterLog whenever I tried to submit the sh_loop
example job:

ERROR: unable to update job ad!  Aborting OsProc::StartJob

	I could find no reference to this error in any mailing list posting
or on any web page about Condor.  So I decided to move to 6.7.17, as it was
supposed to have better support for the 2.6 kernel.  Indeed it does, and I
was able to take out all my config workarounds.  However, I still get the
same error when I try to submit the "sh_loop" example job.

	Ultimately everything points to condor_starter dying unexpectedly.  
So I set STARTER_DEBUG = D_ALL in condor_config.local for the one execute 
node in the test cluster and resubmitted the job.  This gives the 
following output in StarterLog:

3/14 15:33:31 (fd:11) (pid:13452) Starting a VANILLA universe job with ID: 2.0
3/14 15:33:31 (fd:11) (pid:13452) In OsProc::OsProc()
3/14 15:33:31 (fd:11) (pid:13452) Main job KillSignal: 15 (SIGTERM)
3/14 15:33:31 (fd:11) (pid:13452) Main job RmKillSignal: 15 (SIGTERM)
3/14 15:33:31 (fd:11) (pid:13452) Main job HoldKillSignal: 15 (SIGTERM)
3/14 15:33:31 (fd:11) (pid:13452) in VanillaProc::StartJob()
3/14 15:33:31 (fd:11) (pid:13452) in OsProc::StartJob()
3/14 15:33:31 (fd:11) (pid:13452) IWD: /users/student/ctracy/condor
3/14 15:33:31 (fd:11) (pid:13452) PRIV_CONDOR --> PRIV_USER at os_proc.C:214
3/14 15:33:31 (fd:12) (pid:13452) Input file: /dev/null
3/14 15:33:31 (fd:13) (pid:13452) Output file: /users/student/ctracy/condor/sh_loop.out
3/14 15:33:31 (fd:14) (pid:13452) Error file: /users/student/ctracy/condor/sh_loop.err
3/14 15:33:31 (fd:14) (pid:13452) Doing CONDOR_begin_execution
3/14 15:33:31 (fd:14) (pid:13452) condor_read(): nfds=6
3/14 15:33:31 (fd:14) (pid:13452) condor_read(): nfound=1
3/14 15:33:31 (fd:14) (pid:13452) condor_read(): nfds=6
3/14 15:33:31 (fd:14) (pid:13452) condor_read(): nfound=1
3/14 15:33:31 (fd:14) (pid:13452) ERROR: unable to update job ad!  Aborting OsProc::StartJob...
3/14 15:33:31 (fd:14) (pid:13452) Failed to start job, exiting
3/14 15:33:31 (fd:14) (pid:13452) ShutdownFast all jobs.
3/14 15:33:31 (fd:14) (pid:13452) Got ShutdownFast when no jobs running.
3/14 15:33:31 (fd:14) (pid:13452) PRIV_USER --> PRIV_ROOT at directory.C:408
3/14 15:33:31 (fd:15) (pid:13452) PRIV_ROOT --> PRIV_USER at directory.C:420
3/14 15:33:31 (fd:15) (pid:13452) Removing /opt/condor-6.7.17-local/execute/dir_13452
3/14 15:33:31 (fd:15) (pid:13452) PRIV_USER --> PRIV_ROOT at directory.C:714
3/14 15:33:31 (fd:15) (pid:13452) Attempting to remove /opt/condor-6.7.17-local/execute/dir_13452 as SuperUser (root)
3/14 15:33:31 (fd:15) (pid:13452) PRIV_ROOT --> PRIV_USER at directory.C:760
3/14 15:33:31 (fd:9) (pid:13452) KEYCACHE: deleted: 0x8415570
3/14 15:33:31 (fd:9) (pid:13452) CLOSE <129.210.16.106:43801> fd=6
3/14 15:33:31 (fd:6) (pid:13452) **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
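
	For reference, the debug change that produced the output above is a
single line in the execute node's local config file.  A minimal sketch,
assuming the file is condor_config.local (the comment line is only
illustrative and is not part of my actual config):

    # Log everything the starter does while chasing the StartJob failure
    STARTER_DEBUG = D_ALL

	With that line removed, the StarterLog shows little more than the
"unable to update job ad" error itself.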

	I'm at a loss as to what to do at this point.  If I had the code 
I'd go look to see what was in OsProc::StartJob, but alas, I don't.  Has 
anyone ever encountered this issue before?

	The setup is currently one central manager that is NOT an execute 
node (master,submit), and one execute node that is also a submit node 
(submit,execute).  Both are running CentOS 4.2 on Pentium 4 chips.  They 
share a common NFS filesystem, with local-state data stored on a local 
filesystem on each.

	I'd be happy to provide any further information anyone would like.  
I just didn't want to deluge the list any more than this initially.

	Thanks for your time,

	Chris

---------------------------------
Chris Tracy
System/Network Administrator
Engineering Design Center
Santa Clara University
"Wherever you go, there you are."