
Re: [Condor-users] scheduling problem?



On Wed, 2006-05-24 at 08:05 +0000, John Coulthard wrote:

> It would be great if someone could 
> tell me what's happened but failing that is there a list where I can lookup 
> what "died due to signal #" and  "EXITING WITH STATUS ###" mean?

Processes can be sent signals.  A list of all the signals that exist (at
least on this Linux box I'm working on) can be obtained via `man 7
signal`.
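
For a quick lookup without paging through the manual, Python's standard `signal` module can translate a number into a name. This is only a sketch: `signal_name` is a helper name made up for illustration, and the mapping it reports is the one for the machine it runs on, which may differ from the machine that wrote the log.

```python
import signal

def signal_name(number):
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # Not a signal number known to this platform.
        return "unknown signal %d" % number

if __name__ == "__main__":
    # 25 is the number reported in the log excerpt under discussion.
    print(25, signal_name(25))
```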

Some signals will cause a process to terminate.  When processes
terminate (either Condor daemons, or a user's condor job), they can
return a numeric status value.  Zero, by convention, means "I ran
successfully"; non-zero indicates that an error occurred.  

(The meaning of different return codes is application-specific.)
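
The two ways a process can end (returning a status, or being killed by a signal) can be seen from a parent process. The sketch below uses Python's `subprocess` module, which on POSIX systems reports "killed by signal N" as a return code of -N. The helper names `describe_exit` and `run_child` are invented here, and the wording of the messages only imitates the log lines; it is not Condor's own code.

```python
import signal
import subprocess
import sys

def describe_exit(returncode):
    """Describe a subprocess return code in the style of a log line."""
    if returncode < 0:
        # On POSIX, subprocess encodes death by signal N as -N.
        return "died due to signal %d" % -returncode
    return "exited with status %d" % returncode

def run_child(code):
    """Run a short Python snippet in a child process; return its return code."""
    return subprocess.run([sys.executable, "-c", code]).returncode

if __name__ == "__main__":
    # A child that returns an application-specific status.
    print(describe_exit(run_child("import sys; sys.exit(100)")))
    # A child that is terminated by a signal it sends to itself.
    print(describe_exit(run_child(
        "import os, signal; os.kill(os.getpid(), signal.SIGTERM)")))
```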

> The MarterLog (end of)
> 5/24 06:08:08 The SCHEDD (pid 30865) died due to signal 25

I'm guessing that you're running on a BSD-derived operating system (e.g.
Mac OS X), though x86 Linux uses the same number.  Signal 25 is described
in `man 7 signal` as follows (the "4.2 BSD" note records where the signal
originated):

SIGXFSZ     25,25,31    Core    File size limit exceeded (4.2 BSD)

It looks like the SCHEDD exceeded some file size limit enforced by the
operating system, possibly in its history or log files.  As a result,
the OS sent a SIGXFSZ (#25) signal to it, which killed it.
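
The mechanism can be reproduced on a POSIX system by giving a child process a tiny file size limit and letting it write past it. This is a sketch of the general behaviour only, not of what the SCHEDD does internally; the 1 KiB limit, the file name `grows.log` and the helper `run_until_file_limit` are all made up for the demonstration. The child resets SIGXFSZ to its default action because the Python interpreter ignores that signal at startup.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Code run in the child process; sys.argv[1] is the file to grow.
CHILD = r"""
import os, resource, signal, sys
# CPython ignores SIGXFSZ at startup; restore the default (fatal) action.
signal.signal(signal.SIGXFSZ, signal.SIG_DFL)
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))         # no core file
resource.setrlimit(resource.RLIMIT_FSIZE, (1024, 1024))  # 1 KiB file limit
fd = os.open(sys.argv[1], os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
while True:
    os.write(fd, b"x" * 4096)  # growing past the limit raises SIGXFSZ
"""

def run_until_file_limit():
    """Run the child in a scratch directory and return its return code."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "grows.log")
        return subprocess.run([sys.executable, "-c", CHILD, path]).returncode

if __name__ == "__main__":
    rc = run_until_file_limit()
    # A negative return code means the child was killed by that signal.
    print("child return code:", rc)
```

A negative return code equal to `-signal.SIGXFSZ` indicates the child was killed by the file size signal, which is the same situation the MasterLog line reports for the SCHEDD.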

> The ShadowLog (end of)
> 5/24 06:44:28 (84901.58) (31039): **** condor_shadow (condor_SHADOW) EXITING 
> WITH STATUS 100
> 5/24 06:44:31 getpeername failed so connect must have failed
> 5/24 06:49:29 Connect failed for 300 seconds; returning FALSE
> 5/24 06:49:29 Can't connect to queue manager
> CEDAR:6001:Failed to connect to <192.168.0.40:52226>
> 5/24 06:49:29 ERROR "Failed to connect to schedd!" at line 102 in file 
> shadow_initializer.C

Someone more familiar with Condor can tell you what return code 100
indicates, but the error "Failed to connect to schedd!" is a bit of a
giveaway.  The shadow is failing because it can't talk to the local
SCHEDD (probably because the OS killed it!).

> The StartLog (end of)

This looks fairly normal.

So, in short, it looks like your root problem is that the SCHEDD on your
job-submission host is keeling over and dying.  I'd have a look around
to see if any of the files it normally uses have grown very large (e.g.
to around 2GB in size).
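
A small script can do the looking. The sketch below walks a directory tree and lists the largest regular files; `find_large_files` is a name invented here, and which directories to scan is an assumption (on a Condor install, `condor_config_val LOG` and `condor_config_val SPOOL` should report likely candidates). A file stuck at a 2GB limit will be just under 2^31 bytes, so the default threshold is set lower, at 1 GiB.

```python
import os
import stat

ONE_GIB = 1024 ** 3

def find_large_files(root, threshold=ONE_GIB):
    """Return (size, path) pairs for regular files under root whose size
    is at least threshold bytes, largest first."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_size >= threshold:
                hits.append((info.st_size, path))
    return sorted(hits, reverse=True)

if __name__ == "__main__":
    import sys
    # Usage: python find_large.py /path/to/condor/log /path/to/condor/spool
    for root in sys.argv[1:]:
        for size, path in find_large_files(root):
            print("%12d  %s" % (size, path))
```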

Hope this helps.

Cheers,
David
-- 
David McBride <dwm@xxxxxxxxxxxx>
Department of Computing, Imperial College, London
