
Re: [Condor-users] termination with signal 66



On Thu, Oct 27, 2005 at 12:50:47PM -0400, Ian Chesal wrote:
> 32 seconds
> 10/27 01:24:48 (18.1) (13737): get_file(): Failed to open file
> /ttcbatch/experiments3/tvanderh/condor/armstrong2/run2/sipo40/job_done,
> errno = 13.
> 
> And in our schedd log I've got nothing useful at all around that time
> frame:
> 

The Shadow and the Schedd logs are about an hour and 10 minutes apart :)

How close are the clocks on your execute and submit machines? I'm wondering
about scenarios with disconnected shadows/starters: maybe the shadow
got disconnected from the starter and now can't reconnect, and the schedd
does the wrong thing on the shadow exit. It'd be helpful to see all of
the shadow logs involving job 18.0 (I'd like to see both instances
of it connecting, and then trying to reconnect), plus the schedd log for the
whole interval.
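
If it helps with pulling those lines together, here is a rough sketch of a
filter, not anything that ships with Condor. It assumes the "MM/DD HH:MM:SS"
prefix seen in the excerpts in this thread, and that a job shows up either as
"(18.0)" or as "job 18.0". The function names and the year default are made up
for illustration (the log lines carry no year).

```python
import re
from datetime import datetime

# Assumed line prefix, based on the excerpts quoted in this thread.
STAMP = re.compile(r"(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2})")


def parse_stamp(line, year=2005):
    """Return the timestamp at the start of a log line, or None.

    The year is an assumption supplied by the caller, since the
    log prefix does not include one.
    """
    m = STAMP.match(line)
    if m is None:
        return None
    month, day, hh, mm, ss = map(int, m.groups())
    return datetime(year, month, day, hh, mm, ss)


def lines_for_job(lines, job_id):
    """Keep lines that mention job_id as "(18.0)" or "job 18.0"."""
    pat = re.compile(r"(?:\(|job )" + re.escape(job_id) + r"(?!\d)")
    return [ln for ln in lines if pat.search(ln)]


def in_window(lines, start, end):
    """Keep lines whose timestamp falls within [start, end]."""
    kept = []
    for ln in lines:
        ts = parse_stamp(ln)
        if ts is not None and start <= ts <= end:
            kept.append(ln)
    return kept
```

Feed it the lines of each log (for example from open(path) on whatever files
your shadow and schedd log settings point at), first through lines_for_job()
for the shadow log, then through in_window() for the schedd log over the same
interval. Wrapped continuation lines have no timestamp, so in_window() drops
them; widen the filter if you need those too.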

Thanks,

-Erik

> 10/27 00:00:14 Sent ad to central manager for Priority1@xxxxxxxxxx
> 10/27 00:00:14 Sent ad to 1 collectors for Priority1@xxxxxxxxxx
> 10/27 00:01:14 Sent ad to central manager for Priority1@xxxxxxxxxx
> 10/27 00:01:14 Sent ad to 1 collectors for Priority1@xxxxxxxxxx
> 10/27 00:01:23 Shadow pid 13303 for job 18.0 exited with status 107
> 10/27 00:01:23 Sent RELEASE_CLAIM to startd on <137.57.142.38:1029>
> 10/27 00:01:23 Match record (<137.57.142.38:1029>, 18, 0) deleted
> 10/27 00:01:23 DaemonCore: Command received via TCP from host
> <137.57.142.38:4604>
> 10/27 00:01:23 DaemonCore: received command 443 (VACATE_SERVICE),
> calling handler (vacate_service)
> 10/27 00:01:23 Got VACATE_SERVICE from <137.57.142.38:4604>
> 10/27 00:02:14 Sent ad to central manager for Priority1@xxxxxxxxxx
> 
> - Ian