[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] Instabilities in private network pool. Why?




We are having some problems running a pool composed by win 2000 nodes and controlled
by a central manager also with win 2000. The pool is running Condor version 6.7.2.
Because this pool is working in computers that are used in classes during the day,
it is entirely located behind a firewall and is connected to a private
network (central manager IP#10.0.0.15 and other nodes IP#10.0.0.***).


Opposed to another pool that we are running in our department with valid
network addresses, and which is configured in a very similar way (the
config files are almost identical, except for the default_domain_name which
we haven't defined in the private pool), we are experiencing quite a few
instabilities.

These are some symptoms that we can not understand and explain:
- The schedd daemon crashes and is then restarted quite often;
- Lots of jobs seem to be run several times in different machines before
getting completed; they seem to be receiving orders to be vacated!

We have tried to analyze the log files but we don't understand what is(are)
the problem(s) that you are facing and what is causing them.


It has also happened at least for three times that the collector daemon seems to get "crazy" with a very strange behavior. It starts to use lots of CPU and accesses to the hard-disk and begins to increase the memory in use up to values of 80MB! In normal functioning conditions it uses less than 5MB with very low CPU use! In all the three times this has occured, only a reboot in the central manager machine seemed to work around this abnormal behavior.

Is there any suggestion that anyone can give after reading this description
and taking a look to the parts of the log files that I append to this message?


Many thanks in advance for your attention. Best regards,

Paulo Amado Mendes
----------------------
Civil Engineering Department
University of Coimbra, Portugal




These are the last lines in the log files before condor crashed completely last time in the central manager:


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
MasterLog:
10/29 21:44:48 DaemonCore: Command received via UDP from host <10.0.0.15:2410>
10/29 21:44:48 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
10/29 21:44:48 The SCHEDD (pid 2304) exited with status 44
10/29 21:44:53 Procfamily: ERROR: Could not open pid 1684 (err=87)
10/29 23:52:16 ******************************************************
10/29 23:52:16 ** Condor (CONDOR_MASTER) STARTING UP
10/29 23:52:16 ** C:\Condor\bin\condor_master.exe
10/29 23:52:16 ** $CondorVersion: 6.7.2 Oct 5 2004 $
10/29 23:52:16 ** $CondorPlatform: INTEL-WINNT40 $
10/29 23:52:16 ** PID = 2204
10/29 23:52:16 ******************************************************
10/29 23:52:16 Using config file: C:\Condor\condor_config
10/29 23:52:16 Using local config files: C:\Condor/condor_config.local
10/29 23:52:17 DaemonCore: Command Socket at <10.0.0.15:3621>
10/29 23:52:18 Started DaemonCore process "C:\Condor/bin/condor_collector.exe", pid and pgroup = 1584
10/29 23:52:18 Started DaemonCore process "C:\Condor/bin/condor_negotiator.exe", pid and pgroup = 1316
10/29 23:52:19 Started DaemonCore process "C:\Condor/bin/condor_schedd.exe", pid and pgroup = 1676
10/30 00:52:20 Preen pid is 2392
10/30 00:52:22 DaemonCore: Command received via UDP from host <10.0.0.15:3083>
10/30 00:52:22 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
10/30 00:52:22 Child 2392 died, but not a daemon -- Ignored
10/30 03:22:23 DaemonCore: Command received via UDP from host <10.0.0.15:2244>
10/30 03:22:23 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
10/30 03:22:23 The SCHEDD (pid 1676) exited with status 44
10/30 03:24:36 Sending obituary for "C:\Condor/bin/condor_schedd.exe"
10/30 03:24:39 restarting C:\Condor/bin/condor_schedd.exe in 10 seconds
10/30 03:24:49 Started DaemonCore process "C:\Condor/bin/condor_schedd.exe", pid and pgroup = 1132



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
SchedLog:
10/29 21:44:06 DaemonCore: Command received via UDP from host <10.0.0.15:2139>
10/29 21:44:06 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
10/29 21:44:09 DaemonCore: Command received via TCP from host <10.0.0.29:3653>
10/29 21:44:09 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
10/29 21:44:09 Got VACATE_SERVICE from <10.0.0.29:3653>
10/29 21:44:09 Sent RELEASE_CLAIM to startd on <10.0.0.29:1029>
10/29 21:44:09 Match record (<10.0.0.29:1029>, 23, 168) deleted
10/29 21:44:09 match or classad for job 23.168 was deleted - not forking a shadow
10/29 21:44:12 Activity on stashed negotiator socket
10/29 21:44:12 Negotiating for owner: Administrator@poolform
10/29 21:44:12 Checking consistency running and runnable jobs
10/29 21:44:12 Tables are consistent
10/29 21:44:15 Out of servers - 17 jobs matched, 994 jobs idle, 3 jobs rejected
10/29 21:44:15 DaemonCore: Command received via UDP from host <10.0.0.15:2183>
10/29 21:44:15 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
10/29 21:44:38 Started shadow for job 23.168 on "<10.0.0.23:1963>", (shadow pid = 2388)
10/29 21:44:38 DaemonCore: Command received via TCP from host <10.0.0.23:1612>
10/29 21:44:38 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
10/29 21:44:38 Got VACATE_SERVICE from <10.0.0.23:1612>
10/29 21:44:38 Sent RELEASE_CLAIM to startd on <10.0.0.23:1963>
10/29 21:44:38 Match record (<10.0.0.23:1963>, 23, 168) deleted
10/29 21:44:38 DaemonCore: Command received via TCP from host <10.0.0.28:3746>
10/29 21:44:38 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
10/29 21:44:38 Got VACATE_SERVICE from <10.0.0.28:3746>
10/29 21:44:38 Sent RELEASE_CLAIM to startd on <10.0.0.28:1029>
10/29 21:44:38 Match record (<10.0.0.28:1029>, 23, 140) deleted
10/29 21:44:38 DaemonCore: Command received via UDP from host <10.0.0.15:2300>
10/29 21:44:38 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
10/29 21:44:38 DaemonCore: Command received via UDP from host <10.0.0.15:2304>
10/29 21:44:38 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
10/29 21:44:40 Started shadow for job 23.192 on "<10.0.0.36:1029>", (shadow pid = 2364)
10/29 21:44:40 DaemonCore: Command received via UDP from host <10.0.0.15:2315>
10/29 21:44:40 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
10/29 21:44:40 Scheduler::Relinquish - mrec is NULL, can't relinquish
10/29 21:44:40 Null parameter --- match not deleted
10/29 21:44:41 DaemonCore: Command received via TCP from host <10.0.0.22:4152>
10/29 21:44:41 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
10/29 21:44:41 Got VACATE_SERVICE from <10.0.0.22:4152>
10/29 21:44:41 Sent RELEASE_CLAIM to startd on <10.0.0.22:1272>
10/29 21:44:41 Match record (<10.0.0.22:1272>, 23, 194) deleted
10/29 21:44:42 Started shadow for job 23.188 on "<10.0.0.25:2522>", (shadow pid = 2004)
10/29 21:44:44 Started shadow for job 23.179 on "<10.0.0.29:1029>", (shadow pid = 2404)
10/29 21:44:46 Started shadow for job 23.177 on "<10.0.0.30:1029>", (shadow pid = 1684)
10/29 23:52:19 ******************************************************



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
ShadowLog:
10/29 21:43:58 ******************************************************
10/29 21:43:58 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/29 21:43:58 ** C:\Condor\bin\condor_shadow.exe
10/29 21:43:58 ** $CondorVersion: 6.7.2 Oct 5 2004 $
10/29 21:43:58 ** $CondorPlatform: INTEL-WINNT40 $
10/29 21:43:58 ** PID = 2472
10/29 21:43:58 ******************************************************
10/29 21:43:58 Using config file: C:\Condor\condor_config
10/29 21:43:58 Using local config files: C:\Condor/condor_config.local
10/29 21:43:58 DaemonCore: Command Socket at <10.0.0.15:2127>
10/29 21:43:58 Initializing a VANILLA shadow for job 23.168
10/29 21:43:58 (23.168) (2472): Request to run on <10.0.0.30:1029> was REFUSED
10/29 21:43:58 (23.168) (2472): Job 23.168 is being evicted
10/29 21:43:58 (23.168) (2472): logEvictEvent with unknown reason (108), aborting
10/29 21:43:58 (23.168) (2472): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 108
10/29 21:44:06 (23.141) (2328): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
10/29 21:44:15 (23.158) (2004): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
10/29 21:44:38 (23.140) (2204): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
10/29 21:44:38 ******************************************************
10/29 21:44:38 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/29 21:44:38 ** C:\Condor\bin\condor_shadow.exe
10/29 21:44:38 ** $CondorVersion: 6.7.2 Oct 5 2004 $
10/29 21:44:38 ** $CondorPlatform: INTEL-WINNT40 $
10/29 21:44:38 ** PID = 2388
10/29 21:44:38 ******************************************************
10/29 21:44:38 Using config file: C:\Condor\condor_config
10/29 21:44:38 Using local config files: C:\Condor/condor_config.local
10/29 21:44:38 DaemonCore: Command Socket at <10.0.0.15:2297>
10/29 21:44:38 Initializing a VANILLA shadow for job 23.168
10/29 21:44:38 (23.157) (1904): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
10/29 21:44:38 (23.168) (2388): Request to run on <10.0.0.23:1963> was REFUSED
10/29 21:44:38 (23.168) (2388): Job 23.168 is being evicted
10/29 21:44:38 (23.168) (2388): logEvictEvent with unknown reason (108), aborting
10/29 21:44:40 (23.168) (2388): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 108
10/29 21:44:40 ******************************************************
10/29 21:44:40 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/29 21:44:40 ** C:\Condor\bin\condor_shadow.exe
10/29 21:44:40 ** $CondorVersion: 6.7.2 Oct 5 2004 $
10/29 21:44:40 ** $CondorPlatform: INTEL-WINNT40 $
10/29 21:44:40 ** PID = 2364
10/29 21:44:40 ******************************************************
10/29 21:44:40 Using config file: C:\Condor\condor_config
10/29 21:44:40 Using local config files: C:\Condor/condor_config.local
10/29 21:44:40 DaemonCore: Command Socket at <10.0.0.15:2312>
10/29 21:44:40 Initializing a VANILLA shadow for job 23.192
10/29 21:44:40 (23.192) (2364): Request to run on <10.0.0.36:1029> was ACCEPTED
10/29 21:44:42 ******************************************************
10/29 21:44:42 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/29 21:44:42 ** C:\Condor\bin\condor_shadow.exe
10/29 21:44:42 ** $CondorVersion: 6.7.2 Oct 5 2004 $
10/29 21:44:42 ** $CondorPlatform: INTEL-WINNT40 $
10/29 21:44:42 ** PID = 2004
10/29 21:44:42 ******************************************************
10/29 21:44:42 Using config file: C:\Condor\condor_config
10/29 21:44:42 Using local config files: C:\Condor/condor_config.local
10/29 21:44:42 DaemonCore: Command Socket at <10.0.0.15:2363>
10/29 21:44:42 Initializing a VANILLA shadow for job 23.188
10/29 21:44:43 (23.188) (2004): Request to run on <10.0.0.25:2522> was ACCEPTED
10/29 21:44:44 ******************************************************
10/29 21:44:44 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/29 21:44:44 ** C:\Condor\bin\condor_shadow.exe
10/29 21:44:44 ** $CondorVersion: 6.7.2 Oct 5 2004 $
10/29 21:44:44 ** $CondorPlatform: INTEL-WINNT40 $
10/29 21:44:44 ** PID = 2404
10/29 21:44:44 ******************************************************
10/29 21:44:44 Using config file: C:\Condor\condor_config
10/29 21:44:44 Using local config files: C:\Condor/condor_config.local
10/29 21:44:44 DaemonCore: Command Socket at <10.0.0.15:2379>
10/29 21:44:44 Initializing a VANILLA shadow for job 23.179
10/29 21:44:44 (23.179) (2404): Request to run on <10.0.0.29:1029> was ACCEPTED
10/29 21:44:49 (23.176) (1852): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:49 (23.176) (1852): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (23.192) (2364): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:49 (23.172) (2512): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:49 (23.172) (2512): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (23.166) (1320): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:49 (23.166) (1320): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (23.170) (2516): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:49 (23.170) (2516): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (23.188) (2004): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:49 (23.188) (2004): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 ******************************************************
10/29 21:44:49 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/29 21:44:49 ** C:\Condor\bin\condor_shadow.exe
10/29 21:44:49 ** $CondorVersion: 6.7.2 Oct 5 2004 $
10/29 21:44:49 ** $CondorPlatform: INTEL-WINNT40 $
10/29 21:44:49 ** PID = 2484
10/29 21:44:49 ******************************************************
10/29 21:44:49 Using config file: C:\Condor\condor_config
10/29 21:44:49 Using local config files: C:\Condor/condor_config.local
10/29 21:44:49 OpenProcess() failed - Error 87
10/29 21:44:49 ERROR "Assertion ERROR on (pidtmp->hProcess)" at line 5765 in file ..\src\condor_daemon_core.V6\daemon_core.C
10/29 21:44:49 ******************************************************
10/29 21:44:49 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/29 21:44:49 ** C:\Condor\bin\condor_shadow.exe
10/29 21:44:49 ** $CondorVersion: 6.7.2 Oct 5 2004 $
10/29 21:44:49 ** $CondorPlatform: INTEL-WINNT40 $
10/29 21:44:49 ** PID = 1684
10/29 21:44:49 ******************************************************
10/29 21:44:49 Using config file: C:\Condor\condor_config
10/29 21:44:49 Using local config files: C:\Condor/condor_config.local
10/29 21:44:49 OpenProcess() failed - Error 87
10/29 21:44:49 ERROR "Assertion ERROR on (pidtmp->hProcess)" at line 5765 in file ..\src\condor_daemon_core.V6\daemon_core.C
10/29 21:44:50 (23.179) (2404): Can't connect to <10.0.0.15:1184>:0, errno = 10061
10/29 21:44:50 (23.179) (2404): Will keep trying for 10 seconds...
10/29 21:44:49 (23.175) (1572): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:50 (23.175) (1572): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (23.139) (1600): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:50 (23.139) (1600): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (23.155) (1512): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:50 (23.155) (1512): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (23.169) (2128): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:50 (23.169) (2128): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (23.171) (1872): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:50 (23.171) (1872): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (23.167) (1740): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:50 (23.167) (1740): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (25.92) (1316): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:50 (25.92) (1316): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (23.192) (2364): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (23.174) (2136): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:51 (23.174) (2136): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:49 (23.173) (1804): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:51 (23.173) (1804): Got SIGTERM. Performing graceful shutdown.
10/29 21:44:59 (23.179) (2404): Connect failed for 10 seconds; returning FALSE
10/29 21:44:59 (23.179) (2404): ERROR: SECMAN:2003:TCP connection to <10.0.0.15:1184> failed


10/29 21:44:59 (23.179) (2404): Our Parent process (pid 2304) exited; shutting down
10/29 21:44:59 (23.179) (2404): Got SIGTERM. Performing graceful shutdown.
10/29 21:45:24 (23.155) (1512): Job 23.155 is being evicted
10/29 21:45:27 (23.155) (1512): getpeername failed so connect must have failed
10/29 21:47:38 (23.139) (1600): Job 23.139 is being evicted
10/29 21:47:41 (23.139) (1600): getpeername failed so connect must have failed
10/29 21:47:57 (25.92) (1316): Job 25.92 is being evicted
10/29 21:48:00 (25.92) (1316): getpeername failed so connect must have failed
10/29 21:50:24 (23.155) (1512): Connect failed for 300 seconds; returning FALSE
10/29 21:50:24 (23.155) (1512): Can't connect to queue manager: CEDAR:6001:Failed to connect to <10.0.0.15:1187>
10/29 21:50:24 (23.155) (1512): Failed to update job queue!
10/29 21:50:24 (23.155) (1512): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
10/29 21:52:39 (23.139) (1600): Connect failed for 300 seconds; returning FALSE
10/29 21:52:39 (23.139) (1600): Can't connect to queue manager: CEDAR:6001:Failed to connect to <10.0.0.15:1187>
10/29 21:52:39 (23.139) (1600): Failed to update job queue!
10/29 21:52:39 (23.139) (1600): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
10/29 21:52:58 (25.92) (1316): Connect failed for 300 seconds; returning FALSE
10/29 21:52:58 (25.92) (1316): Can't connect to queue manager: CEDAR:6001:Failed to connect to <10.0.0.15:1187>
10/29 21:52:58 (25.92) (1316): Failed to update job queue!
10/29 21:52:58 (25.92) (1316): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
CollectorLog:
10/29 21:41:31 DaemonCore: Can't receive command request (perhaps a timeout?)
10/29 21:41:40 StartdAd : Inserting ** "< vm1@labest05 , 10.0.0.185 >"
10/29 21:41:40 StartdPvtAd : Inserting ** "< vm1@labest05 , 10.0.0.185 >"
10/29 21:42:08 (Sent 59 ads in response to query)
10/29 21:42:08 Got QUERY_STARTD_PVT_ADS
10/29 21:42:08 (Sent 33 ads in response to query)
10/29 21:43:32 DaemonCore: Can't receive command request (perhaps a timeout?)
10/29 21:44:11 (Sent 59 ads in response to query)
10/29 21:44:11 Got QUERY_STARTD_PVT_ADS
10/29 21:44:11 (Sent 33 ads in response to query)
10/29 21:44:28 Got QUERY_MASTER_ADS
10/29 21:44:28 (Sent 1 ads in response to query)
10/29 21:44:28 Got QUERY_MASTER_ADS
10/29 21:44:28 (Sent 1 ads in response to query)
10/29 21:44:29 Got QUERY_MASTER_ADS
10/29 21:44:29 (Sent 1 ads in response to query)
10/29 21:44:29 Got QUERY_MASTER_ADS
10/29 21:44:29 (Sent 1 ads in response to query)
10/29 21:44:30 Got QUERY_MASTER_ADS
10/29 21:44:30 (Sent 1 ads in response to query)
10/29 21:44:30 Got QUERY_MASTER_ADS
10/29 21:44:30 (Sent 1 ads in response to query)
10/29 21:44:31 Got QUERY_MASTER_ADS
10/29 21:44:31 (Sent 1 ads in response to query)
10/29 21:44:31 Got QUERY_MASTER_ADS
10/29 21:44:31 (Sent 1 ads in response to query)
10/29 21:44:32 Got QUERY_MASTER_ADS
10/29 21:44:32 (Sent 1 ads in response to query)
10/29 21:44:32 Got QUERY_MASTER_ADS
10/29 21:44:32 (Sent 1 ads in response to query)
10/29 21:44:33 Got QUERY_MASTER_ADS
10/29 21:44:33 (Sent 1 ads in response to query)
10/29 21:44:33 Got QUERY_MASTER_ADS
10/29 21:44:33 (Sent 1 ads in response to query)
10/29 21:44:34 Got QUERY_MASTER_ADS
10/29 21:44:34 (Sent 1 ads in response to query)
10/29 21:44:34 Got QUERY_MASTER_ADS
10/29 21:44:34 (Sent 1 ads in response to query)
10/29 21:44:34 Got QUERY_MASTER_ADS
10/29 21:44:34 (Sent 1 ads in response to query)
10/29 21:44:34 Got QUERY_MASTER_ADS
10/29 21:44:34 (Sent 1 ads in response to query)
10/29 21:44:34 Got QUERY_MASTER_ADS
10/29 21:44:34 (Sent 1 ads in response to query)
10/29 21:44:35 Got QUERY_MASTER_ADS
10/29 21:44:35 (Sent 1 ads in response to query)
10/29 21:44:35 Got QUERY_MASTER_ADS
10/29 21:44:35 (Sent 1 ads in response to query)
10/29 21:44:36 Got QUERY_MASTER_ADS
10/29 21:44:36 (Sent 1 ads in response to query)
10/29 21:44:36 Got QUERY_MASTER_ADS
10/29 21:44:36 (Sent 1 ads in response to query)
10/29 21:44:37 Got QUERY_MASTER_ADS
10/29 21:44:37 (Sent 1 ads in response to query)
10/29 21:44:38 Got QUERY_STARTD_ADS
10/29 21:44:38 (Sent 1 ads in response to query)
10/29 21:44:39 Got QUERY_STARTD_ADS
10/29 21:44:39 (Sent 1 ads in response to query)
10/29 21:44:39 Got QUERY_STARTD_ADS
10/29 21:44:39 (Sent 1 ads in response to query)
10/29 21:44:40 Got QUERY_STARTD_ADS
10/29 21:44:40 (Sent 1 ads in response to query)
10/29 21:44:40 Got QUERY_STARTD_ADS
10/29 21:44:40 (Sent 1 ads in response to query)
10/29 21:44:40 Got QUERY_STARTD_ADS
10/29 21:44:40 (Sent 1 ads in response to query)
10/29 21:44:41 Got QUERY_STARTD_ADS
10/29 21:44:41 (Sent 2 ads in response to query)
10/29 21:44:41 Got QUERY_STARTD_ADS
10/29 21:44:41 (Sent 2 ads in response to query)
10/29 21:44:41 Got QUERY_STARTD_ADS
10/29 21:44:41 (Sent 2 ads in response to query)
10/29 21:44:42 Got QUERY_STARTD_ADS
10/29 21:44:42 (Sent 2 ads in response to query)
10/29 21:44:42 Got QUERY_STARTD_ADS
10/29 21:44:42 (Sent 1 ads in response to query)
10/29 21:44:42 Got QUERY_STARTD_ADS
10/29 21:44:42 (Sent 1 ads in response to query)
10/29 21:44:43 Got QUERY_STARTD_ADS
10/29 21:44:43 (Sent 1 ads in response to query)
10/29 21:44:43 Got QUERY_STARTD_ADS
10/29 21:44:43 (Sent 1 ads in response to query)
10/29 21:44:43 Got QUERY_STARTD_ADS
10/29 21:44:43 (Sent 1 ads in response to query)
10/29 21:44:43 Got QUERY_STARTD_ADS
10/29 21:44:43 (Sent 1 ads in response to query)
10/29 21:44:44 Got QUERY_STARTD_ADS
10/29 21:44:44 (Sent 2 ads in response to query)
10/29 21:44:44 Got QUERY_STARTD_ADS
10/29 21:44:44 (Sent 2 ads in response to query)
10/29 21:44:44 Got QUERY_STARTD_ADS
10/29 21:44:44 (Sent 2 ads in response to query)
10/29 21:44:45 Got QUERY_STARTD_ADS
10/29 21:44:45 (Sent 2 ads in response to query)
10/29 21:44:45 Got QUERY_STARTD_ADS
10/29 21:44:45 (Sent 2 ads in response to query)
10/29 21:44:45 Got QUERY_STARTD_ADS
10/29 21:44:45 (Sent 2 ads in response to query)
10/29 21:44:46 Got QUERY_STARTD_ADS
10/29 21:44:46 (Sent 3 ads in response to query)
10/29 21:44:46 Got QUERY_STARTD_ADS
10/29 21:44:46 (Sent 3 ads in response to query)
10/29 21:44:46 Got QUERY_STARTD_ADS
10/29 21:44:46 (Sent 3 ads in response to query)
10/29 21:44:46 Got QUERY_STARTD_ADS
10/29 21:44:46 (Sent 3 ads in response to query)
10/29 21:44:46 Got QUERY_STARTD_ADS
10/29 21:44:46 (Sent 3 ads in response to query)
10/29 21:44:46 Got QUERY_STARTD_ADS
10/29 21:44:46 (Sent 3 ads in response to query)
10/29 21:44:47 Got QUERY_STARTD_ADS
10/29 21:44:47 (Sent 3 ads in response to query)
10/29 21:44:47 Got QUERY_STARTD_ADS
10/29 21:44:47 (Sent 3 ads in response to query)
10/29 21:44:47 Got QUERY_STARTD_ADS
10/29 21:44:47 (Sent 3 ads in response to query)
10/29 21:44:47 Got QUERY_STARTD_ADS
10/29 21:44:47 (Sent 3 ads in response to query)
10/29 21:44:47 Got QUERY_STARTD_ADS
10/29 21:44:47 (Sent 3 ads in response to query)
10/29 21:44:47 Got QUERY_STARTD_ADS
10/29 21:44:47 (Sent 3 ads in response to query)
10/29 21:44:47 Got QUERY_STARTD_ADS
10/29 21:44:47 (Sent 3 ads in response to query)
10/29 21:44:48 Got QUERY_STARTD_ADS
10/29 21:44:48 (Sent 3 ads in response to query)
10/29 21:44:48 Got QUERY_STARTD_ADS
10/29 21:44:48 (Sent 3 ads in response to query)
10/29 21:44:48 Got QUERY_STARTD_ADS
10/29 21:44:48 (Sent 3 ads in response to query)
10/29 21:44:48 Got QUERY_STARTD_ADS
10/29 21:44:48 (Sent 3 ads in response to query)
10/29 21:44:48 Got QUERY_STARTD_ADS
10/29 21:44:48 (Sent 3 ads in response to query)
10/29 21:44:48 Got QUERY_STARTD_ADS
10/29 21:44:49 (Sent 3 ads in response to query)
10/29 21:44:49 Got QUERY_STARTD_ADS
10/29 21:44:49 (Sent 3 ads in response to query)
10/29 21:44:49 Got QUERY_STARTD_ADS
10/29 21:44:49 (Sent 3 ads in response to query)
10/29 21:44:49 Got QUERY_STARTD_ADS
10/29 21:44:49 (Sent 3 ads in response to query)
10/29 21:44:59 Our Parent process (pid 2208) exited; shutting down
10/29 21:44:59 Got SIGTERM. Performing graceful shutdown.
10/29 21:44:59 **** condor_collector.exe (condor_COLLECTOR) EXITING WITH STATUS 0




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
NegotiatorLog:
10/29 21:42:11 ---------- Finished Negotiation Cycle ----------
10/29 21:43:30 DaemonCore: Can't receive command request (perhaps a timeout?)
10/29 21:44:11 ---------- Started Negotiation Cycle ----------
10/29 21:44:11 Phase 1: Obtaining ads from collector ...
10/29 21:44:11 Getting all public ads ...
10/29 21:44:11 Sorting 59 ads ...
10/29 21:44:11 Getting startd private ads ...
10/29 21:44:11 Got ads: 59 public and 33 private
10/29 21:44:11 Public ads include 1 submitter, 33 startd
10/29 21:44:11 Phase 2: Performing accounting ...
10/29 21:44:12 Phase 3: Sorting submitter ads by priority ...
10/29 21:44:12 Phase 4.1: Negotiating with schedds ...
10/29 21:44:12 Negotiating with Administrator@poolform at <10.0.0.15:1184>
10/29 21:44:12 Request 00023.00168:
10/29 21:44:12 Matched 23.168 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.30:1029>
10/29 21:44:12 Successfully matched with vm2@sc34pc10
10/29 21:44:12 Request 00023.00177:
10/29 21:44:13 Matched 23.177 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.27:1029>
10/29 21:44:13 Successfully matched with vm1@sc34pc07
10/29 21:44:13 Request 00023.00178:
10/29 21:44:13 Matched 23.178 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.27:1029>
10/29 21:44:13 Successfully matched with vm2@sc34pc07
10/29 21:44:13 Request 00023.00179:
10/29 21:44:13 Matched 23.179 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.29:1029>
10/29 21:44:13 Successfully matched with vm1@sc34pc09
10/29 21:44:13 Request 00023.00180:
10/29 21:44:13 Matched 23.180 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.29:1029>
10/29 21:44:13 Successfully matched with vm2@sc34pc09
10/29 21:44:13 Request 00023.00181:
10/29 21:44:13 Matched 23.181 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.182:1029>
10/29 21:44:13 Successfully matched with vm1@labest02
10/29 21:44:13 Request 00023.00182:
10/29 21:44:13 Matched 23.182 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.182:1029>
10/29 21:44:13 Successfully matched with vm2@labest02
10/29 21:44:13 Request 00023.00183:
10/29 21:44:13 Matched 23.183 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.183:1029>
10/29 21:44:14 Successfully matched with vm1@labest03
10/29 21:44:14 Request 00023.00184:
10/29 21:44:14 Matched 23.184 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.183:1029>
10/29 21:44:14 Successfully matched with vm2@labest03
10/29 21:44:14 Request 00023.00185:
10/29 21:44:14 Matched 23.185 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.32:2123>
10/29 21:44:14 Successfully matched with SC34PC12
10/29 21:44:14 Request 00023.00186:
10/29 21:44:14 Matched 23.186 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.24:4898>
10/29 21:44:14 Successfully matched with SC34PC04
10/29 21:44:14 Request 00023.00187:
10/29 21:44:14 Matched 23.187 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.33:3427>
10/29 21:44:14 Successfully matched with SC34PC13
10/29 21:44:14 Request 00023.00188:
10/29 21:44:14 Matched 23.188 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.25:2522>
10/29 21:44:14 Successfully matched with SC34PC05
10/29 21:44:14 Request 00023.00189:
10/29 21:44:14 Matched 23.189 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.34:3190>
10/29 21:44:14 Successfully matched with SC34PC14
10/29 21:44:14 Request 00023.00190:
10/29 21:44:14 Matched 23.190 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.26:1029>
10/29 21:44:14 Successfully matched with SC34PC06
10/29 21:44:14 Request 00023.00191:
10/29 21:44:14 Matched 23.191 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.35:1029>
10/29 21:44:14 Successfully matched with SC34PC15
10/29 21:44:14 Request 00023.00192:
10/29 21:44:15 Matched 23.192 Administrator@poolform <10.0.0.15:1184> preempting none <10.0.0.36:1029>
10/29 21:44:15 Successfully matched with SC34PC16
10/29 21:44:15 Over submitter resource limit (17) ... only consider startd ranks
10/29 21:44:15 Request 00023.00193:
10/29 21:44:15 Rejected 23.193 Administrator@poolform <10.0.0.15:1184>: no match found
10/29 21:44:15 Request 00024.00000:
10/29 21:44:15 Rejected 24.0 Administrator@poolform <10.0.0.15:1184>: no match found
10/29 21:44:15 Request 00025.00093:
10/29 21:44:15 Rejected 25.93 Administrator@poolform <10.0.0.15:1184>: no match found
10/29 21:44:15 Got NO_MORE_JOBS; done negotiating
10/29 21:44:15 Phase 4.2: Negotiating with schedds ...
10/29 21:44:15 Negotiating with Administrator@poolform at <10.0.0.15:1184>
10/29 21:44:15 ---------- Finished Negotiation Cycle ----------
10/29 21:44:59 Our Parent process (pid 2208) exited; shutting down
10/29 21:44:59 Got SIGTERM. Performing graceful shutdown.
10/29 21:44:59 **** condor_negotiator.exe (condor_NEGOTIATOR) EXITING WITH STATUS 0