
Re: [Condor-users] shadow exited



> On Tue, Jun 15, 2004 at 03:54:06PM -0500, Scott Koranda wrote:
> > Hi,
> > 
> > I found this in the SchedLog:
> > 
> > 6/15 12:48:11 Shadow pid 25284 for job 584394.0 exited with status 107
> > 6/15 12:48:11 Called send_vacate( <129.89.200.23:41351>, 443 )
> > 6/15 12:48:11 Sent RELEASE_CLAIM to startd on <129.89.200.23:41351>
> > 6/15 12:48:11 Match record (<129.89.200.23:41351>, 584394, 0) deleted
> > 6/15 12:48:11 Capability of deleted match: <129.89.200.23:41351>#1298467496
> > 6/15 12:48:11 Entered delete_shadow_rec( 25284 )
> > 6/15 12:48:11 Deleting shadow rec for PID 25284, job (584394.0)
> > 6/15 12:48:11 Entered check_zombie( 25284, 0x92af31c, st=2 )
> > 6/15 12:48:11 Marked job 584394.0 as IDLE
> > 
> > Why might the shadow for job 584394.0 have exited with
> > status 107?
> > 
> 
> We want the job to go back into the queue. It's normal.  
> 
> Shadow exit codes should be considered a black box - we don't
> follow UNIX conventions of exiting with '0' for success and
> '1' for failure - we've got about 10 or so exit codes, all of
> which are for success.
> 
> Is there something else going on that you're worried about?
> 

The job seems to be matched, it starts running, and then a
short time later it is forcibly evicted and goes back into the
queue. 

This is happening every few minutes and I am trying to
understand why, so I am poking around in the logs.

The job is part of a DAG and there are a lot of jobs logging
to the same file, but here are the relevant parts of the
log file for this job:

000 (584394.000.000) 06/15 12:43:40 Job submitted from host: <129.89.201.233:32786>
001 (584394.000.000) 06/15 12:45:31 Job executing on host: <129.89.200.23:41351>
004 (584394.000.000) 06/15 12:48:11 Job was evicted.
001 (584394.000.000) 06/15 12:52:31 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 12:55:17 Job was evicted.
001 (584394.000.000) 06/15 12:59:59 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 13:02:24 Job was evicted.
001 (584394.000.000) 06/15 13:06:48 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 13:09:30 Job was evicted.
001 (584394.000.000) 06/15 13:13:51 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 13:16:36 Job was evicted.
001 (584394.000.000) 06/15 13:20:56 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 13:23:42 Job was evicted.
001 (584394.000.000) 06/15 13:28:16 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 13:30:48 Job was evicted.
001 (584394.000.000) 06/15 13:35:16 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 13:37:54 Job was evicted.
001 (584394.000.000) 06/15 13:42:22 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 13:45:00 Job was evicted.
001 (584394.000.000) 06/15 13:49:36 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 13:52:06 Job was evicted.
001 (584394.000.000) 06/15 13:56:38 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 13:59:12 Job was evicted.
001 (584394.000.000) 06/15 14:03:45 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 14:05:49 Job was evicted.
001 (584394.000.000) 06/15 14:07:57 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 14:10:33 Job was evicted.
001 (584394.000.000) 06/15 14:15:14 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 14:17:38 Job was evicted.
001 (584394.000.000) 06/15 14:22:16 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 14:24:44 Job was evicted.
001 (584394.000.000) 06/15 14:29:12 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 14:31:50 Job was evicted.
001 (584394.000.000) 06/15 14:36:21 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 14:38:55 Job was evicted.
001 (584394.000.000) 06/15 14:43:28 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 14:46:00 Job was evicted.
001 (584394.000.000) 06/15 14:50:36 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 14:53:09 Job was evicted.
001 (584394.000.000) 06/15 14:57:45 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 15:00:22 Job was evicted.
001 (584394.000.000) 06/15 15:05:10 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 15:07:34 Job was evicted.
001 (584394.000.000) 06/15 15:11:59 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 15:14:40 Job was evicted.
001 (584394.000.000) 06/15 15:19:09 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 15:21:45 Job was evicted.
001 (584394.000.000) 06/15 15:26:24 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 15:29:35 Job was evicted.
001 (584394.000.000) 06/15 15:33:14 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 15:35:55 Job was evicted.
001 (584394.000.000) 06/15 15:39:22 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 15:42:05 Job was evicted.
001 (584394.000.000) 06/15 15:46:34 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 15:49:11 Job was evicted.
001 (584394.000.000) 06/15 15:53:41 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 15:56:16 Job was evicted.
001 (584394.000.000) 06/15 16:00:54 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 16:03:23 Job was evicted.
001 (584394.000.000) 06/15 16:07:51 Job executing on host: <129.89.200.12:32773>
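
To see how regular the evictions are, one could measure how long the
job ran before each one. The following is only a rough sketch, not
anything that ships with Condor: the function name run_lengths and the
year argument are my own assumptions (the log lines carry no year), and
it relies only on the "001 ... Job executing" and "004 ... Job was
evicted" lines having the layout shown above.

```python
import re
from datetime import datetime

# Matches the start of a user-log event such as
# "004 (584394.000.000) 06/15 12:48:11". It also finds events that
# have been run together on a single line.
EVENT = re.compile(r"(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d)")

def run_lengths(log_text, year=2004):
    """Return (job, seconds) for each executing (001) -> evicted (004) pair.

    The year is an assumption supplied by the caller, because the
    log timestamps do not include one.
    """
    started = {}  # job id -> time of its most recent 001 event
    runs = []
    for match in EVENT.finditer(log_text):
        code, cluster, proc, stamp = match.groups()
        job = f"{int(cluster)}.{int(proc)}"
        when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
        if code == "001":
            started[job] = when
        elif code == "004" and job in started:
            elapsed = when - started.pop(job)
            runs.append((job, int(elapsed.total_seconds())))
    return runs

sample = """\
001 (584394.000.000) 06/15 12:45:31 Job executing on host: <129.89.200.23:41351>
004 (584394.000.000) 06/15 12:48:11 Job was evicted.
001 (584394.000.000) 06/15 12:52:31 Job executing on host: <129.89.200.12:32773>
004 (584394.000.000) 06/15 12:55:17 Job was evicted.
"""

for job, seconds in run_lengths(sample):
    print(job, seconds)
```

For the first two runs in the log this gives 160 and 166 seconds, so
the job is getting well under three minutes on the machine each time
before it is evicted.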