
Re: [Condor-users] jobs are being killed after 30-45 minutes



On Wed, Jul 26, 2006 at 01:21:14PM +0100, Santanu Das wrote:
> 7/25 09:56:13 State change: claim-activation protocol successful
> 7/25 09:56:13 Changing activity: Idle -> Busy
> 7/25 10:07:16 Starter pid 29124 died on signal 11 (signal 11)
> 7/25 10:07:16 State change: starter exited
> 7/25 10:07:16 Changing activity: Busy -> Idle
> 7/25 10:07:17 DaemonCore: Command received via TCP from host <172.24.116.151:9583>
> 7/25 10:07:17 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
> 7/25 10:07:17 Got activate_claim request from shadow (<172.24.116.151:9583>)
> 7/25 10:07:17 Remote job ID is 7773.0
> 7/25 10:07:17 Got universe "VANILLA" (5) from request classad
> 7/25 10:07:17 State change: claim-activation protocol successful
> 7/25 10:07:17 Changing activity: Idle -> Busy
> 
> 
> Is this due to "signal 11" issue - what does this actually mean?
> 


It means the condor_starter crashed with a segmentation fault
(signal 11 is SIGSEGV), and that means there's a bug in Condor.
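
To spot these crashes in a long StartLog, something like the following
may help. It is only a sketch and not part of Condor: the function name
and the regular expression are assumptions based on the log lines quoted
above, and the signal name comes from Python's standard signal module on
the machine where the script runs.

```python
import re
import signal

# Matches lines such as:
#   7/25 10:07:16 Starter pid 29124 died on signal 11 (signal 11)
# The pattern is an assumption derived from the excerpt in this thread.
CRASH_RE = re.compile(
    r"^(?P<when>\d+/\d+ \d+:\d+:\d+) Starter pid (?P<pid>\d+) "
    r"died on signal (?P<sig>\d+)"
)


def find_starter_crashes(lines):
    """Return (timestamp, pid, signal name) for each starter crash line."""
    crashes = []
    for line in lines:
        # Tolerate mail-quoted input by dropping leading '>' and spaces.
        match = CRASH_RE.match(line.lstrip("> ").rstrip())
        if not match:
            continue
        signum = int(match.group("sig"))
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number unknown on this platform; keep the raw number.
            name = f"signal {signum}"
        crashes.append((match.group("when"), int(match.group("pid")), name))
    return crashes


sample = [
    "7/25 09:56:13 Changing activity: Idle -> Busy",
    "7/25 10:07:16 Starter pid 29124 died on signal 11 (signal 11)",
    "7/25 10:07:16 State change: starter exited",
]
print(find_starter_crashes(sample))
```

On Linux the sample line is reported with the name SIGSEGV.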

If possible, could you upgrade to 6.8.0? We'd much rather find out
whether the bug is still present in that release; based on the log
so far, I suspect it's something that has already been fixed.

If you can't, for now please set

STARTER_DEBUG = D_ALL
MAX_STARTER_LOG = 10000000

on all machines where your job may run. Then run a job and send
the StarterLog from the machine where the starter crashed
to condor-admin@xxxxxxxxxxx
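
Before re-running the job, it may be worth confirming that both settings
actually made it into each machine's config. Running condor_config_val
STARTER_DEBUG on the machine is the direct check; the sketch below is a
rough offline alternative that scans config text for the two values. The
parser is a simplification and an assumption on my part: it handles only
plain NAME = value lines, ignores line continuations and macro
expansion, and treats names as case-insensitive.

```python
# The two settings requested in this thread.
WANTED = {"STARTER_DEBUG": "D_ALL", "MAX_STARTER_LOG": "10000000"}


def read_condor_settings(text):
    """Parse simple NAME = value lines; later assignments override earlier ones."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep:
            settings[name.strip().upper()] = value.strip()
    return settings


def missing_settings(text):
    """Return the wanted settings that are absent or set to another value."""
    have = read_condor_settings(text)
    return {key: val for key, val in WANTED.items() if have.get(key) != val}


example = """
# local config additions (hypothetical file contents)
STARTER_DEBUG = D_ALL
MAX_STARTER_LOG = 10000000
"""
print(missing_settings(example))
```

An empty result means both values were found as requested.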

-Erik