
Re: [Condor-users] termination with signal 66



On Thu, Oct 27, 2005 at 12:21:31PM -0400, Ian Chesal wrote:
> > If something was working and then just stopped, the first thing to
> > look for is what changed, and Windows Update is a first guess. Suddenly
> > exiting with a 66 sounds a DLL change to me. Checking the starter log
> > and the stdout/stderr of the job are another thing to check.
> 
> Interesting, we actually saw a number of our jobs fail last night with
> the same error message. All were running on XP but NONE of the machines
> are set to do auto-updates. They are rack machines that don't have
> access to the outside world.
> 
> What we did see happen was the samba server, where the global config
> files are stored for these machines, started locking out the machines so
> they couldn't access their config files.
> 
> Could this cause a spontaneous 66 error to a running job?

It shouldn't (which is different than it can't :) 

If a daemon couldn't read its config file, it should refuse to start up.
The user job itself shouldn't try to read the config file, and Condor
daemons don't read the config files after they've started (unless they
get a reconfig).

If there was a problem on the execute machine, what should happen is the
starter would fail to run, the shadow would figure out that the starter
isn't there, and the job should stay in the queue. The job shouldn't leave
the queue with an exit status of 66 unless Condor knows the job started
and then exited with a status of 66. (If the job needs something from
the Samba share and can't read it, of course it may exit with status 66, and
Condor will report that.)

It'd be interesting to see the starter log from an execute machine, and the
shadow and schedd logs from the submit machine, covering the run that exited
with status 66.
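
If it helps to narrow those logs down before sending them, here is a rough
sketch of a filter that pulls out lines mentioning 66. It is only an
illustration: the function name find_matching_lines and the default pattern
are my own assumptions, not anything Condor ships, and the exact wording of
the log lines will differ, so adjust the pattern to whatever your logs
actually say.

```python
import re
from pathlib import Path


def find_matching_lines(paths, pattern=r"\b66\b"):
    """Return (file name, line number, line text) for every matching line.

    The default pattern is an assumption: it matches a standalone "66",
    not a specific Condor log message.
    """
    regex = re.compile(pattern)
    hits = []
    for path in map(Path, paths):
        if not path.is_file():
            continue  # skip logs that do not exist on this machine
        # errors="replace" keeps going if a log contains odd bytes
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((path.name, number, line.rstrip()))
    return hits
```

Call it with the paths to your copies of the starter, shadow and schedd
logs and print the tuples it returns. The lines just before each hit are
usually the interesting part.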

-Erik