
Re: [Condor-users] shadow exited



> > > On Tue, Jun 15, 2004 at 03:54:06PM -0500, Scott Koranda wrote:
> > > > Hi,
> > > > 
> > > > I found this in the SchedLog:
> > > > 
> > > > 6/15 12:48:11 Shadow pid 25284 for job 584394.0 exited with status 107
> > > > 6/15 12:48:11 Called send_vacate( <129.89.200.23:41351>, 443 )
> > > > 6/15 12:48:11 Sent RELEASE_CLAIM to startd on <129.89.200.23:41351>
> > > > 6/15 12:48:11 Match record (<129.89.200.23:41351>, 584394, 0) deleted
> > > > 6/15 12:48:11 Capability of deleted match: <129.89.200.23:41351>#1298467496
> > > > 6/15 12:48:11 Entered delete_shadow_rec( 25284 )
> > > > 6/15 12:48:11 Deleting shadow rec for PID 25284, job (584394.0)
> > > > 6/15 12:48:11 Entered check_zombie( 25284, 0x92af31c, st=2 )
> > > > 6/15 12:48:11 Marked job 584394.0 as IDLE
> > > > 
> > > > Why might have the shadow for job 584394.0 exited with status
> > > > 107?
> > > > 
> > > 
> > > We want the job to go back into the queue. It's normal.  
> > > 
> > > Shadow exit codes should be considered a black box - we don't
> > > follow UNIX conventions of exiting with '0' for success and
> > > '1' for failure - we've got about 10 or so exit codes, all of
> > > which are for success.
> > > 
> > > Is there something else going on that you're worried about?
> > > 
> > 
> > The job seems to be matched, it starts running, and then a
> > short time later it is forcibly evicted and goes back into the
> > queue. 
> > 
> > This is happening every few minutes, and I am trying to
> > understand why, so I am poking around in the logs.
> > 
> > The job is part of a DAG and there are many jobs logging
> > to the same file, but here are the relevant parts of the
> > log file for this job:
> > 
> > 000 (584394.000.000) 06/15 12:43:40 Job submitted from host: <129.89.201.233:32786>
> > 001 (584394.000.000) 06/15 12:45:31 Job executing on host: <129.89.200.23:41351>
> > 004 (584394.000.000) 06/15 12:48:11 Job was evicted.
> > 001 (584394.000.000) 06/15 12:52:31 Job executing on host: <129.89.200.12:32773>
> > 004 (584394.000.000) 06/15 12:55:17 Job was evicted.
> > 001 (584394.000.000) 06/15 12:59:59 Job executing on host: <129.89.200.12:32773>
> > 004 (584394.000.000) 06/15 13:02:24 Job was evicted.
> > 001 (584394.000.000) 06/15 13:06:48 Job executing on host: <129.89.200.12:32773>
> > 004 (584394.000.000) 06/15 13:09:30 Job was evicted.
> > 001 (584394.000.000) 06/15 13:13:51 Job executing on host: <129.89.200.12:32773>
> > 004 (584394.000.000) 06/15 13:16:36 Job was evicted.
> > 001 (584394.000.000) 06/15 13:20:56 Job executing on host: <129.89.200.12:32773>
> > 004 (584394.000.000) 06/15 13:23:42 Job was evicted.
> > 001 (584394.000.000) 06/15 13:28:16 Job executing on host: <129.89.200.12:32773>
> > 004 (584394.000.000) 06/15 13:30:48 Job was evicted.
> 
> <snip>
> 
> I should add that on machine 129.89.200.12, in the StartLog, I
> see many cycles for job 584394.0 that look like this:
> 
> 6/15 12:59:57 Remote job ID is 584394.0
> 6/15 12:59:57 exec_starter( medusa-master003.medusa.phys.uwm.edu, 10, 11 ) : pid 3981
> 6/15 12:59:57 execl(/opt/condor/sbin/condor_starter.std, "condor_starter", medusa-master003.medusa.phys.uwm.edu, 0)
> 6/15 12:59:57 Got RemoteUser (duncan@xxxxxxxxxxxxxxxxxxx) from request classad
> 6/15 12:59:57 Got universe "STANDARD" (1) from request classad
> 6/15 12:59:57 State change: claim-activation protocol successful
> 6/15 12:59:57 Changing activity: Idle -> Busy
> 6/15 13:00:01 Attempting to send update via UDP to collector condor.medusa.phys.uwm.edu <129.89.201.238:9618>
> 6/15 13:00:01 Sent update to 1 collector(s)
> 6/15 13:02:03 DaemonCore: Command received via TCP from host <129.89.201.233:48680>
> 6/15 13:02:03 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> 6/15 13:02:03 Called deactivate_claim_forcibly()
> 6/15 13:02:03 In Starter::kill() with pid 3981, sig 3 (SIGQUIT)
> 6/15 13:02:03 DaemonCore: No more children processes to reap.
> 6/15 13:02:03 Starter pid 3981 exited with status 0
> 6/15 13:02:03 Canceled hardkill-starter timer (1747)
> 6/15 13:02:03 ProcAPI::buildFamily failed: parent 3981 not found on system.
> 6/15 13:02:03 ProcAPI: pid 3981 does not exist.
> 6/15 13:02:03 ProcAPI: pid 3984 does not exist.
> 6/15 13:02:03 cleanup_execute_dir(): /opt/condor/home/execute/dir_3981 does not exist
> 6/15 13:02:03 State change: starter exited
> 6/15 13:02:03 Changing activity: Busy -> Idle
> 
> Further, this is Condor 6.6.1.
> 
> Thanks,
> 
> Scott

This job continues to cycle in this way: it gets matched and
placed onto a host, and then is forcibly evicted a few minutes
later. It goes back into the queue, and the cycle repeats.

It has been going on for over 12 hours now.

Above I have shown what I found in the StartLog on the machines
to which the job is sent, as well as what I see in the SchedLog
on the machine from which the job was submitted.
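
In case it is useful, here is a rough sketch of the sort of
script one could use to tally the execute/evict cycles for this
cluster from the user log. The event numbers, regex, and
timestamp format are just assumptions based on the log excerpt
above, not anything official:

import re
import sys
from datetime import datetime

# Hypothetical helper: pair up "Job executing" (001) and "Job was
# evicted" (004) events for one cluster and report how long each
# run lasted before the eviction.
EVENT_RE = re.compile(r'^(\d{3}) \((\d+)\.\d+\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d)')

def cycles(logfile, cluster):
    started = None
    for line in open(logfile):
        m = EVENT_RE.match(line)
        if not m or m.group(2) != cluster:
            continue
        event, _, stamp = m.groups()
        when = datetime.strptime(stamp, '%m/%d %H:%M:%S')
        if event == '001':                       # Job executing
            started = when
        elif event == '004' and started is not None:  # Job was evicted
            yield started, when, (when - started).seconds
            started = None

if __name__ == '__main__':
    # usage: python cycles.py <userlog> <cluster>
    for start, evict, secs in cycles(sys.argv[1], sys.argv[2]):
        print('ran %s -> %s (%d seconds)' % (start.time(), evict.time(), secs))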

I looked in the logs for the Negotiator and I see this over
and over:

6/15 13:40:40   Negotiating with duncan@xxxxxxxxxxxxxxxxxxx at <129.89.201.233:32786>
6/15 13:40:40   Calculating schedd limit with the following parameters
6/15 13:40:40     ScheddPrio       = 4.000000
6/15 13:40:40     ScheddPrioFactor = 1.000000
6/15 13:40:40     scheddShare      = 0.738193
6/15 13:40:40     scheddAbsShare   = 0.249999
6/15 13:40:40     ScheddUsage      = 3
6/15 13:40:40     scheddLimit      = 222
6/15 13:40:40     MaxscheddLimit   = 222
6/15 13:40:40 Socket to <129.89.201.233:32786> already in cache, reusing
6/15 13:40:40     Sending SEND_JOB_INFO/eom
6/15 13:40:40     Getting reply from schedd ...
6/15 13:40:40     Got JOB_INFO command; getting classad/eom
6/15 13:40:40     Request 584394.00000:
6/15 13:40:40       Connecting to startd medusa-slave002.medusa.phys.uwm.edu at <129.89.200.12:32773>
6/15 13:40:40       Sending MATCH_INFO/capability
6/15 13:40:40       (Capability is "<129.89.200.12:32773>#8957484439" )
6/15 13:40:40       Sending PERMISSION, capability, startdAd to schedd
6/15 13:40:40       Notifying the accountant

I can't find anything that tells me why the job is being
forcibly evicted (as seen in the StartLog).
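
The only hint so far is that the DEACTIVATE_CLAIM_FORCIBLY in
the StartLog arrives over TCP from 129.89.201.233, which is the
same address the job was submitted from, so something on the
schedd side appears to be requesting the eviction. A rough
sketch of how one might pull all of those events out of the
StartLog (the line formats are assumed from the excerpt above):

import re
import sys

# Hypothetical helper: walk a StartLog and report which host sent
# each DEACTIVATE_CLAIM_FORCIBLY, using the "Command received via
# TCP from host <ip:port>" line that precedes it.
FROM_RE = re.compile(r'Command received via TCP from host <([\d.]+):\d+>')

def evictors(startlog):
    last_peer = None
    for line in open(startlog):
        m = FROM_RE.search(line)
        if m:
            last_peer = m.group(1)
        elif 'DEACTIVATE_CLAIM_FORCIBLY' in line:
            # the timestamp is the first two whitespace-separated fields
            stamp = ' '.join(line.split()[:2])
            yield stamp, last_peer

if __name__ == '__main__':
    # usage: python evictors.py <StartLog>
    for stamp, peer in evictors(sys.argv[1]):
        print('%s  eviction requested by %s' % (stamp, peer))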

There is also nothing in the MatchLog.

So I have examined the

- StartLog
- SchedLog
- NegotiatorLog
- MatchLog

but I can't find a reason why this job is cycling over and
over again onto machines only to be forcibly evicted.

Any help with this investigation would be really appreciated.

Thanks,

Scott