
Re: [Condor-users] termination with signal 66



> On Thu, Oct 27, 2005 at 12:21:31PM -0400, Ian Chesal wrote:
> > > If something was working and then just stopped, the first thing to
> > > look for is what changed, and Windows Update is a first guess.
> > > Suddenly exiting with a 66 sounds like a DLL change to me. Checking
> > > the starter log and the stdout/stderr of the job are another thing
> > > to check.
> >
> > Interesting, we actually saw a number of our jobs fail last night with
> > the same error message. All were running on XP, but NONE of the
> > machines are set to do auto-updates. They are rack machines that don't
> > have access to the outside world.
> >
> > What we did see happen was that the Samba server, where the global
> > config files for these machines are stored, started locking the
> > machines out so they couldn't access their config files.
> >
> > Could this cause a spontaneous 66 error in a running job?
> 
> It shouldn't (which is different than it can't :)
> 
> If a daemon couldn't read its config file, it should refuse to start
> up. The user job itself shouldn't try to read the config file, and
> Condor daemons don't read the config files after they've started
> (unless they get a reconfig).
> 
> If there was a problem on the execute machine, what should happen is
> that the starter would fail to run, the shadow would figure out that
> the starter isn't there, and the job should stay in the queue. The job
> shouldn't leave the queue with an exit status of 66 unless Condor knows
> the job started and then ran with an exit status of 66. (If the job
> needs something from the Samba share and can't read it, of course it
> may exit with status 66, and Condor will report that.)
> 
> It'd be interesting to see the starter log from an execute machine,
> and the shadow and schedd logs of the submit machine during the run
> that exited with status 66.

Conveniently, I can provide all of the above. ;-)

The "Error 66" value comes from the completion email that the user gets
from the system about the job.
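
To illustrate the point above about Condor reporting the job's own exit
status verbatim: here's a minimal sketch in Python of a wrapper that
exits 66 when it can't read something off the share. The share path is
the real config share from our starter log, but the file name is made
up, and our real jobs actually run through condor_exec.bat:

import sys

# Hypothetical input on the Samba share; \\ttc-sunserve\abc_conf is
# the real config share from the starter log, the file name is made up.
INPUT = r"\\ttc-sunserve\abc_conf\some_input.dat"

try:
    with open(INPUT) as f:
        f.read()
except OSError:
    # Condor reports this status in the completion email as-is;
    # it is the job's own exit code, not a Condor-internal error code.
    sys.exit(66)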

In the starter log I'm seeing:

10/27 01:21:39 ******************************************************
10/27 01:21:39 ** condor_starter (CONDOR_STARTER) STARTING UP
10/27 01:21:39 ** d:\abc\condor\bin\condor_starter.exe
10/27 01:21:39 ** $CondorVersion: 6.7.12 Sep 24 2005 $
10/27 01:21:39 ** $CondorPlatform: INTEL-WINNT50 $
10/27 01:21:39 ** PID = 1076
10/27 01:21:39 ******************************************************
10/27 01:21:39 Using config file:
\\ttc-sunserve\abc_conf\condor_config.WINNT51
10/27 01:21:39 Using local config files:
d:/abc/condor/local.TTC-JSLAVKIN3/condor_config.local
d:\abc\condor/condor_config.local
10/27 01:21:39 DaemonCore: Command Socket at <137.57.142.107:4812>
10/27 01:21:39 Setting resource limits not implemented!
10/27 01:21:39 Communicating with shadow <137.57.176.238:51489>
10/27 01:21:39 Submitting machine is "ttc-schedd1.altera.com"
10/27 01:21:39 File transfer completed successfully.
10/27 01:21:40 Starting a VANILLA universe job with ID: 18.1
10/27 01:21:40 IWD: d:\abc\condor/execute\dir_1076
10/27 01:21:40 Output file: d:\abc\condor/execute\dir_1076\wrapper.log
10/27 01:21:41 Error file: d:\abc\condor/execute\dir_1076\wrapper.err
10/27 01:21:41 Renice expr "10" evaluated to 10
10/27 01:21:41 About to exec C:\WINDOWS\system32\cmd.exe /Q /C
condor_exec.bat /experiments/tvanderh/condor/armstrong2/run2/sipo40
10/27 01:21:41 Create_Process succeeded, pid=6104
10/27 01:24:18 Process exited, pid=6104, status=0
10/27 01:24:18 ReliSock: put_file: TransmitFile() failed, errno=10054
10/27 01:24:18 ERROR "DoUpload: Failed to send file
d:\abc\condor/execute\dir_1076\job_done, exiting at 1624 " at line 1623
in file ..\src\condor_c++_util\file_transfer.C
10/27 01:24:18 ShutdownFast all jobs.
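
For what it's worth, errno 10054 is Winsock's WSAECONNRESET
("connection reset by peer"), so the starter's upload of job_done died
because the other end of the transfer hung up mid-stream. A quick way
to decode the errno values that show up in both logs (13 on the shadow
side is plain POSIX), sketched in Python:

import errno, os

# errno values taken from the starter and shadow logs in this message.
# 10054 is a Winsock code, so look it up by hand; 13 is plain POSIX.
WINSOCK = {10054: "WSAECONNRESET (connection reset by peer)"}

for code in (10054, 13):
    name = WINSOCK.get(code)
    if name is None:
        name = "%s (%s)" % (errno.errorcode.get(code, "?"), os.strerror(code))
    print(code, "->", name)

Run on a Linux box this prints "13 -> EACCES (Permission denied)",
which matches the shadow-side failure shown below.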

And in the shadow log on our schedd I'm seeing:

10/27 01:23:25 ******************************************************
10/27 01:23:25 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/27 01:23:25 ** /opt/condor/sbin/condor_shadow
10/27 01:23:25 ** $CondorVersion: 6.7.12 Sep 24 2005 $
10/27 01:23:25 ** $CondorPlatform: I386-LINUX_RH9 $
10/27 01:23:25 ** PID = 13744
10/27 01:23:25 ******************************************************
10/27 01:23:25 Using config file:
/opt/condor/configs/condor_config.LINUX
10/27 01:23:25 Using local config files:
/build/condor/condor_config.local.LINUX
10/27 01:23:25 DaemonCore: Command Socket at <137.57.176.238:51521>
10/27 01:23:25 Initializing a VANILLA shadow for job 18.0
10/27 01:23:26 (18.0) (13744): Request to run on <137.57.142.132:2162>
was ACCEPTED
10/27 01:23:52 (6.2) (13042): Connect failed for 30 seconds; returning
FALSE
10/27 01:23:52 (6.2) (13042): Attempt to reconnect failed: Failed to
connect to starter <137.57.142.120:2406>
10/27 01:23:52 (6.2) (13042): JobLeaseDuration remaining: 650
10/27 01:23:52 (6.2) (13042): Scheduling another attempt to reconnect in
16 seconds
10/27 01:24:08 (6.2) (13042): Attempting to reconnect to starter
<137.57.142.120:2406>
10/27 01:24:10 (6.2) (13042): getpeername failed so connect must have
failed
10/27 01:24:29 (6.0) (13707): Attempting to reconnect to starter
<137.57.142.154:4408>
10/27 01:24:31 (6.0) (13707): getpeername failed so connect must have
failed
10/27 01:24:33 (6.1) (13736): get_file(): Failed to open file
/ttcbatch/experiments3/tvanderh/condor/armstrong/run2/sipo40/job_done,
errno = 13.
10/27 01:24:33 (6.1) (13736): condor_read(): recv() returned -1, errno =
104, assuming failure.
10/27 01:24:33 (6.1) (13736): Can no longer talk to condor_starter
<137.57.142.38:1029>
10/27 01:24:33 (6.1) (13736): Trying to reconnect to disconnected job
10/27 01:24:33 (6.1) (13736): LastJobLeaseRenewal: 1130390535 Thu Oct 27
01:22:15 2005
10/27 01:24:33 (6.1) (13736): JobLeaseDuration: 720 seconds
10/27 01:24:33 (6.1) (13736): JobLeaseDuration remaining: 582
10/27 01:24:33 (6.1) (13736): Attempting to reconnect to starter
<137.57.142.38:1414>
10/27 01:24:35 (6.1) (13736): getpeername failed so connect must have
failed
10/27 01:24:39 (6.2) (13042): Connect failed for 30 seconds; returning
FALSE
10/27 01:24:39 (6.2) (13042): Attempt to reconnect failed: Failed to
connect to starter <137.57.142.120:2406>
10/27 01:24:39 (6.2) (13042): JobLeaseDuration remaining: 603
10/27 01:24:39 (6.2) (13042): Scheduling another attempt to reconnect in
32 seconds
10/27 01:24:48 (18.1) (13737): get_file(): Failed to open file
/ttcbatch/experiments3/tvanderh/condor/armstrong2/run2/sipo40/job_done,
errno = 13.
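
errno = 13 is EACCES: the shadow couldn't open the job_done file under
/ttcbatch for writing, which is a permission failure on the submit
side's filesystem, not on the execute machine, and it lines up with the
file server locking machines out. Presumably that failed open is why
the starter's TransmitFile() saw the 10054 reset above: the shadow
closed the transfer connection when its open failed. A minimal check
one could run on the submit machine, as the user the shadow runs as,
sketched in Python (the path is the one from the shadow log):

import os, sys

# Path taken from the shadow log above; run on the submit machine as
# the user the shadow runs as.
path = "/ttcbatch/experiments3/tvanderh/condor/armstrong2/run2/sipo40/job_done"

try:
    # get_file() is the receiving side, so it opens the destination
    # for writing; mimic that here.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.close(fd)
    print("open for write OK")
except OSError as e:
    print("open failed: errno=%d (%s)" % (e.errno, e.strerror),
          file=sys.stderr)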

And in our schedd log I've got nothing useful at all around that time
frame:

10/27 00:00:14 Sent ad to central manager for Priority1@xxxxxxxxxx
10/27 00:00:14 Sent ad to 1 collectors for Priority1@xxxxxxxxxx
10/27 00:01:14 Sent ad to central manager for Priority1@xxxxxxxxxx
10/27 00:01:14 Sent ad to 1 collectors for Priority1@xxxxxxxxxx
10/27 00:01:23 Shadow pid 13303 for job 18.0 exited with status 107
10/27 00:01:23 Sent RELEASE_CLAIM to startd on <137.57.142.38:1029>
10/27 00:01:23 Match record (<137.57.142.38:1029>, 18, 0) deleted
10/27 00:01:23 DaemonCore: Command received via TCP from host
<137.57.142.38:4604>
10/27 00:01:23 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/27 00:01:23 Got VACATE_SERVICE from <137.57.142.38:4604>
10/27 00:02:14 Sent ad to central manager for Priority1@xxxxxxxxxx

- Ian