
Re: [Condor-users] startd hangs when using job hooks



Here's another tarball with log files. This time with ALL_DEBUG = D_FULLDEBUG turned on.

It's quite strange. The way my hooks work right now, every hook invocation returns a job to Condor to run, the job being just a dummy sleep-forever loop for testing. Condor didn't get the jobs even though the hooks said they returned them. And only 3 of the 8 slots on this box (I switched to a faster box) fired the fetch-work hook script and wrote out log files.
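
For reference, the kind of fetch-work hook described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual hook (the real hooks here are Perl scripts launched via cmd.exe). It assumes the hook receives the slot ad as "Attr = Value" lines and replies with a job ad in the same form. The attribute set (Cmd, Owner, JobUniverse, Iwd) and all values are assumptions for illustration, and `parse_classad` and `render_classad` are hypothetical helper names.

```python
def parse_classad(text):
    """Parse 'Attr = Value' lines into a dict of raw value strings."""
    ad = {}
    for line in text.splitlines():
        # Split on the first '=' only, so expressions containing '==' survive.
        name, sep, value = line.partition("=")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


def render_classad(attrs):
    """Render a dict as 'Attr = Value' lines; strings are quoted, numbers are not.

    No escaping of embedded quotes is attempted in this sketch.
    """
    lines = []
    for name, value in attrs.items():
        if isinstance(value, str):
            lines.append(f'{name} = "{value}"')
        else:
            lines.append(f"{name} = {value}")
    return "\n".join(lines) + "\n"


# Hypothetical slot ad, standing in for what the hook would read from stdin.
slot_ad = parse_classad('SlotID = 3\nName = "slot3@testbox"\n')

# Hypothetical dummy job: a script that sleeps forever (all values assumed).
job_ad = {
    "Cmd": "D:/arc/test/sleep_forever.bat",
    "Owner": "testuser",
    "JobUniverse": 5,
    "Iwd": "D:/arc/test",
}

# A real hook would write this to stdout and exit 0 to hand the job back.
print(render_classad(job_ad), end="")
```

A real hook would read the slot ad from stdin instead of the hard-coded sample, and could log the SlotID it saw, which makes it easier to tell which slots actually invoked the hook.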

And after each hook ran once, Condor stopped running them. The machine doesn't show up in condor_status output, and a direct query times out:

D:\arc\condor>condor_status -debug -direct localhost
03/02 14:03:52 Locale: English_United States.1252
03/02 14:03:52 Initialized the following authorization table:
03/02 14:03:52 Authorizations yet to be resolved:
03/02 14:03:52 allow NEGOTIATOR:  */137.57.206.237 */sj-arcdev.altera.com
03/02 14:03:52 allow ADMINISTRATOR:  */137.57.202.96 */sj-arcdev.altera.com */sqal19-test.altera.com */137.57.206.237 */sj-sw-web5.altera.com */sw-cycleserver.altera.com */sj-bs5450i-380.altera.com */137.57.206.83 */137.57.206.69 */137.57.206.69
03/02 14:03:52 allow OWNER:  */137.57.202.96 */sj-arcdev.altera.com */sqal19-test.altera.com */137.57.206.237 */sj-sw-web5.altera.com */sw-cycleserver.altera.com */sj-bs5450i-380.altera.com */sj-bs5450i-380.altera.com */137.57.206.83 */137.57.206.83 */137.57.206.69 */137.57.206.69
03/02 14:03:52 allow CONFIG:  */137.57.202.96 */sj-arcdev.altera.com */sqal19-test.altera.com */137.57.206.237 */sj-sw-web5.altera.com */sw-cycleserver.altera.com */sj-bs5450i-380.altera.com */sj-bs5450i-380.altera.com */137.57.206.83 */137.57.206.83 */137.57.206.69 */137.57.206.69
03/02 14:04:52 condor_read(): timeout reading 5 bytes from collector at <137.57.206.83:3868>.
03/02 14:04:52 IO: Failed to read packet header

Ouch.

I didn't update my ALLOW_* statements -- maybe I should revisit permissions for the 7.4 release?

- Ian

On Tue, Mar 2, 2010 at 3:33 PM, Ian Chesal <ian.chesal@xxxxxxxxx> wrote:
On Fri, Feb 12, 2010 at 1:27 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
If you can get the issue to reproduce let us know and we can get a new ticket filed.

I'm trying this now. Condor 7.4.1 on Windows XP SP1 64-bit.

It's hung up for sure. I'm attaching the log files from the machine, no debugging on right now but I can turn it on and try again if you like. Config files in use are also included. The entry point is config/condor_config.

I can see the process tree as:

condor_master.exe
   condor_startd.exe
      cmd.exe
         perl.exe
      cmd.exe
         perl.exe
      cmd.exe
         perl.exe

Which is weird. I expect four cmd/perl trees because it's a 4 slot machine but I only ever get three.

My hook scripts write log data to ARCFetchWorkLog.N, and one of them started writing to its log and then just stopped.

Daemon startup was at 12:03 and I grabbed those logs at 12:32 -- as you can see it hasn't done much in that time.

I made no changes to my config files going from 7.2.2 to 7.4.1; none appeared necessary, and I don't get any errors on daemon startup.

- Ian

Attachment: 7_4_1-hook-script-logs.tar.gz
Description: GNU Zip compressed data