
[Condor-users] Jobs that should be suspended are evicted, and STARTD crashes



condor -version
$CondorVersion: 6.7.12 Sep 24 2005 $
$CondorPlatform: INTEL-WINNT50 $

I'm trying to suspend a job using a simple keyboard idle test.
I have PREEMPT = FALSE in the condor_config file.  When I touch the
keyboard, however, the job is evicted instead of being suspended, as is
another job from the same cluster running on a different VM of the same
SMP machine.  I notice the following bug was fixed recently and wonder
whether it is still lingering:

- Fixed a bug that would cause the condor startd to crash under certain
  conditions during job eviction. This bug was introduced in Condor
  version 6.6.6.
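
A minimal sketch of the kind of keyboard-idle suspend policy described
above, for reference.  The thresholds (60 and 300 seconds) are
illustrative assumptions and are not taken from the actual condor_config
in use; only PREEMPT = FALSE is.

```
# Sketch only: thresholds are assumed values, not the real settings.
# Evaluate SUSPEND rather than going straight to PREEMPT.
WANT_SUSPEND = TRUE
# Suspend when the keyboard was touched within the last 60 seconds.
SUSPEND = (KeyboardIdle < 60)
# Resume once the keyboard has been idle for 5 minutes.
CONTINUE = (KeyboardIdle > 300)
# Never evict the job (this line is from the setup described above).
PREEMPT = FALSE
```

With a policy of this shape, keyboard activity should move the job to
the suspended state rather than evict it.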


Pool Manager MasterLog:

10/24 13:43:51 DaemonCore: Command received via UDP from host <136.200.32.102:2979>
10/24 13:43:51 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
10/24 13:43:51 The STARTD (pid 2572) exited with status 4
10/24 13:43:56 Procfamily: ERROR: Could not open pid 2932 (err=87). Maybe it exited already?
10/24 13:44:04 Sending obituary for "Z:\Condor/bin/condor_startd.exe"
10/24 13:44:04 restarting Z:\Condor/bin/condor_startd.exe in 10 seconds
10/24 13:44:14 Started DaemonCore process "Z:\Condor/bin/condor_startd.exe", pid and pgroup = 632
 

The job log:

...
007 (448.001.000) 10/24 13:44:30 Shadow exception!
 Can no longer talk to condor_starter <136.200.32.102:2086>
 0  -  Run Bytes Sent By Job
 113527448  -  Run Bytes Received By Job
...
007 (448.000.000) 10/24 13:44:34 Shadow exception!
 Can no longer talk to condor_starter <136.200.32.102:2086>
 0  -  Run Bytes Sent By Job
 113527448  -  Run Bytes Received By Job
...

 
Ralph Finch, P.E.
Dept. of Water Resources
Bay-Delta Office, Room 215-13
Sacramento, CA  95814
916-653-7552
rfinch@xxxxxxxxxxxx