
Re: [Condor-users] scheduling problem?



For status codes, you can also see:
http://www.cs.wisc.edu/~adesmet/status.html

In this case, the shadow exit status probably isn't too helpful;
"100 	JOB_EXITED 	The job exited (not killed)"

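For the "died due to signal #" half of the question, here is a minimal Python sketch that asks the local machine what a signal number is called and tells a normal exit apart from death by signal. The function names are my own, not anything from Condor, and signal numbering is platform-specific, so it should be run on the affected host:

```python
import signal
import subprocess
import sys


def describe_signal(number):
    """Return the local platform's name for a signal number, or None if unknown."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return None


def describe_returncode(returncode):
    """Explain a subprocess-style return code (negative means killed by a signal)."""
    if returncode < 0:
        name = describe_signal(-returncode) or "unknown signal"
        return f"died due to signal {-returncode} ({name})"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Look up signal 25 as this machine numbers it.
    print(25, describe_signal(25))
    # A child that exits normally with status 100, like the shadow in the log.
    child = subprocess.run([sys.executable, "-c", "import sys; sys.exit(100)"])
    print(describe_returncode(child.returncode))
```

This only names the signal or status; what a given non-zero status means is still up to the program that returned it.
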
JK

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of David McBride
> Sent: Wednesday, May 24, 2006 11:27 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] scheduling problem?
> 
> 
> On Wed, 2006-05-24 at 08:05 +0000, John Coulthard wrote:
> 
> > It would be great if someone could tell me what's happened but failing
> > that is there a list where I can lookup what "died due to signal #" and
> > "EXITING WITH STATUS ###" mean?
> 
> Processes can be sent signals.  A list of all the signals that exist (at
> least on this Linux box I'm working on) can be obtained via `man 7 signal`.
> 
> Some signals will cause a process to terminate.  When processes
> terminate (either Condor daemons, or a user's condor job), they can
> return a numeric status value.  Zero, by convention, means "I ran
> successfully"; non-zero indicates that an error occurred.  
> 
> (The meaning of different return codes is application-specific.)
> 
> > The MasterLog (end of)
> > 5/24 06:08:08 The SCHEDD (pid 30865) died due to signal 25
> 
> I'm guessing that you're running on a BSD-derived operating system (e.g.
> MacOS X).  Signal 25 on BSD 4.2 machines is described as follows:
> 
> SIGXFSZ     25,25,31    Core    File size limit exceeded (4.2 BSD)
> 
> It looks like the SCHEDD exceeded some hard-coded file size limit in the
> operating system, possibly in its history or log files.  As a result,
> the OS sent a SIGXFSZ (#25) signal to it, which killed it.
> 
> > The ShadowLog (end of)
> > 5/24 06:44:28 (84901.58) (31039): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
> > 5/24 06:44:31 getpeername failed so connect must have failed
> > 5/24 06:49:29 Connect failed for 300 seconds; returning FALSE
> > 5/24 06:49:29 Can't connect to queue manager
> > CEDAR:6001:Failed to connect to <192.168.0.40:52226>
> > 5/24 06:49:29 ERROR "Failed to connect to schedd!" at line 102 in file shadow_initializer.C
> 
> Someone more familiar with Condor can tell you what return code 100
> indicates, but the error "Failed to connect to schedd!" is a bit of a
> giveaway.  It's failing because it can't talk to the local Schedd
> (probably because the OS killed it!)
> 
> > The StartLog (end of)
> 
> This looks fairly normal.
> 
> So, in short, it looks like your root problem is that the SCHEDD on your
> job-submission host is keeling over and dying.  I'd have a look around
> to see if any of the files it normally uses have gotten very large (e.g.
> over 2GB in size).
> 
> Hope this helps.
> 
> Cheers,
> David
> -- 
> David McBride <dwm@xxxxxxxxxxxx>
> Department of Computing, Imperial College, London
>
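
To follow up on the suggestion of looking for very large files, a rough Python sketch along these lines could walk a directory and report anything at or above a size threshold. The function name and the 2 GiB default are assumptions for illustration (the default just mirrors the ~2GB figure mentioned above), and which directories to scan depends on where your Condor installation keeps its log and spool files:

```python
import os
import sys

TWO_GIB = 2 * 1024 ** 3  # assumed threshold, mirroring the ~2GB figure above


def find_large_files(root, threshold=TWO_GIB):
    """Yield (path, size_in_bytes) for files under root at or above threshold."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size >= threshold:
                yield path, size


if __name__ == "__main__":
    # Usage: python find_large.py DIR [DIR ...]
    for directory in sys.argv[1:]:
        for path, size in find_large_files(directory):
            print(f"{size:>15d}  {path}")
```

It only reports sizes; deciding whether a large history or log file should be rotated or truncated is a separate step.
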