[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] unresponsive condor_quill



Hi all,

we currently have problems with quill (again) and maybe someone hit this 
problem before and might share a hint with us:

On one of our submit hosts which is pretty loaded right now, condor_quill sees 
to be unresponsive and is killed by the master from time to time:

From the log files:

MasterLog:
[...]
8/31 15:23:32 ERROR: Child pid 15813 appears hung! Killing it hard.
8/31 15:23:32 The QUILL (pid 15813) was killed because it was no longer 
responding
8/31 15:23:32 Sending obituary for "/opt/condor/sbin/condor_quill"
8/31 15:23:32 restarting /opt/condor/sbin/condor_quill in 10 seconds
8/31 15:23:42 Started DaemonCore process "/opt/condor/sbin/condor_quill", pid 
and pgroup = 15138
8/31 17:15:44 ERROR: Child pid 15138 appears hung! Killing it hard.
8/31 17:15:49 The QUILL (pid 15138) was killed because it was no longer 
responding
8/31 17:15:50 Sending obituary for "/opt/condor/sbin/condor_quill"
8/31 17:15:50 restarting /opt/condor/sbin/condor_quill in 10 seconds
8/31 17:16:00 Started DaemonCore process "/opt/condor/sbin/condor_quill", pid 
and pgroup = 8487

QuillLog:
8/31 14:33:20 ++++++++ Sending Quill ad to collector ++++++++ 
8/31 14:33:20 ++++++++ Sent Quill ad to collector ++++++++    
8/31 14:33:30 ******** Start of Polling Job Queue Log ********
8/31 14:33:30 JOB QUEUE POLLING RESULT: ADDED                 
8/31 15:23:42 ******************************************************
8/31 15:23:42 ** condor_quill (CONDOR_QUILL) STARTING UP            
8/31 15:23:42 ** /opt/condor-7.2.4/sbin/condor_quill                
8/31 15:23:42 ** SubsystemInfo: name=QUILL type=DAEMON(10) class=DAEMON(1)
8/31 15:23:42 ** Configuration: subsystem:QUILL local:<NONE> class:DAEMON 
8/31 15:23:42 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $      
8/31 15:23:42 ** $CondorPlatform: X86_64-LINUX_DEBIAN50 $                 
8/31 15:23:42 ** PID = 15138                                              
8/31 15:23:42 ** Log last touched 8/31 14:33:30                           
8/31 15:23:42 ******************************************************      
8/31 15:23:42 Using config source: /opt/condor-7.2.4/etc/condor_config    
8/31 15:23:42 Using local config sources:                                 
8/31 15:23:42    /etc/default/condor|                                     
8/31 15:23:42 DaemonCore: Command Socket at <10.20.30.2:38508>            
8/31 15:23:42 main_init() called                                          
8/31 15:23:42 configuring tt options from config file                     
8/31 15:23:42 Using Polling Period = 10
8/31 15:23:42 Using logs 8/31 15:23:42 /var/log/condor/schedd_sql.log 8/31 
15:23:42 /var/log/condor/sql.log 8/31 15:23:42
8/31 15:23:42 Using Job Queue File /local/condor.atlas2/spool/job_queue.log
8/31 15:23:42 Using Database Type = Postgres
8/31 15:23:42 Using Database IpAddress = 10.20.30.4:5432
8/31 15:23:42 Using Database Name = quill
8/31 15:23:42 Using Database User = quillwriter
8/31 15:23:42 ******** Start of Polling Job Queue Log ********
8/31 15:23:42 JOB QUEUE POLLING RESULT: COMPRESSED
8/31 16:15:39 ********* End of Polling Job Queue Log *********
8/31 16:15:42 ******** Start of Polling Event Log ********
8/31 16:15:44 ********* End of Polling Event Log *********
8/31 16:15:44 ******** Start of Polling XML Log ********
8/31 16:15:44 ********* End of Polling XML Log *********
8/31 16:15:44 ++++++++ Sending Quill ad to collector ++++++++
8/31 16:15:44 ++++++++ Sent Quill ad to collector ++++++++
8/31 16:15:54 ******** Start of Polling Job Queue Log ********
8/31 16:16:00 JOB QUEUE POLLING RESULT: ADDED
8/31 17:16:00 ******************************************************
8/31 17:16:00 ** condor_quill (CONDOR_QUILL) STARTING UP
8/31 17:16:00 ** /opt/condor-7.2.4/sbin/condor_quill
8/31 17:16:00 ** SubsystemInfo: name=QUILL type=DAEMON(10) class=DAEMON(1)
8/31 17:16:00 ** Configuration: subsystem:QUILL local:<NONE> class:DAEMON
8/31 17:16:00 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
8/31 17:16:00 ** $CondorPlatform: X86_64-LINUX_DEBIAN50 $
8/31 17:16:00 ** PID = 8487
8/31 17:16:00 ** Log last touched 8/31 16:16:00
8/31 17:16:00 ******************************************************
8/31 17:16:00 Using config source: /opt/condor-7.2.4/etc/condor_config
8/31 17:16:00 Using local config sources:
8/31 17:16:00    /etc/default/condor|
8/31 17:16:00 DaemonCore: Command Socket at <10.20.30.2:53809>
8/31 17:16:00 main_init() called
8/31 17:16:00 configuring tt options from config file
8/31 17:16:00 Using Polling Period = 10
8/31 17:16:00 Using logs 8/31 17:16:00 /var/log/condor/schedd_sql.log 8/31 
17:16:00 /var/log/condor/sql.log 8/31 17:16:00
8/31 17:16:00 Using Job Queue File /local/condor.atlas2/spool/job_queue.log
8/31 17:16:00 Using Database Type = Postgres
8/31 17:16:00 Using Database IpAddress = 10.20.30.4:5432
8/31 17:16:00 Using Database Name = quill
8/31 17:16:00 Using Database User = quillwriter
8/31 17:16:00 ******** Start of Polling Job Queue Log ********
8/31 17:16:00 JOB QUEUE POLLING RESULT: COMPRESSED

As I've seen that this will happen soon again, I started strace against 
condor_quill about 2 minutes before it was restarted and it was doing "stuff":

[...]
sendto(5, "Q\0\0\0\225DELETE FROM ClusterAds_Verti"..., 150, 0, NULL, 0) = 150
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])   
recvfrom(5, "C\0\0\0\rDELETE 0\0Z\0\0\0\5T"..., 16384, 0, NULL, NULL) = 20    
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
sendto(5, "Q\0\0\0\233INSERT INTO ClusterAds_Verti"..., 156, 0, NULL, 0) = 156
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])   
recvfrom(5, "C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T"..., 16384, 0, NULL, NULL) = 22 
lseek(6, 1107835382, SEEK_SET)          = 1107835382                          
read(6, "1"..., 1)                      = 1                                   
read(6, "0"..., 1)                      = 1                                   
read(6, "3"..., 1)                      = 1                                   
read(6, " "..., 1)                      = 1                                   
read(6, "0"..., 1)                      = 1                                   
read(6, "1"..., 1)                      = 1                                   
read(6, "2"..., 1)                      = 1                                   
read(6, "7"..., 1)                      = 1                                   
read(6, "3"..., 1)                      = 1                                   
read(6, "7"..., 1)                      = 1                                   
read(6, "2"..., 1)                      = 1                                   
read(6, "6"..., 1)                      = 1                                   
read(6, "."..., 1)                      = 1                                   
read(6, "-"..., 1)                      = 1                                   
read(6, "1"..., 1)                      = 1                                   
read(6, " "..., 1)                      = 1                                   
read(6, "D"..., 1)                      = 1                                   
read(6, "A"..., 1)                      = 1                                   
read(6, "G"..., 1)                      = 1                                   
read(6, "M"..., 1)                      = 1                                   
read(6, "a"..., 1)                      = 1                                   
read(6, "n"..., 1)                      = 1                                   
read(6, "J"..., 1)                      = 1                                   
read(6, "o"..., 1)                      = 1                                   
read(6, "b"..., 1)                      = 1                                   
read(6, "I"..., 1)                      = 1                                   
read(6, "d"..., 1)                      = 1                                   
read(6, " "..., 1)                      = 1                                   
read(6, "1"..., 1)                      = 1                                   
read(6, "2"..., 1)                      = 1                                   
read(6, "6"..., 1)                      = 1                                   
read(6, "8"..., 1)                      = 1                                   
read(6, "0"..., 1)                      = 1                                   
read(6, "4"..., 1)                      = 1                                   
read(6, "8"..., 1)                      = 1                                   
read(6, "\n"..., 1)                     = 1                                   
lseek(6, 0, SEEK_CUR)                   = 1107835418                          
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
sendto(5, "Q\0\0\0\216DELETE FROM ClusterAds_Verti"..., 143, 0, NULL, 0) = 143
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])   
recvfrom(5, "C\0\0\0\rDELETE 0\0Z\0\0\0\5T"..., 16384, 0, NULL, NULL) = 20    
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
sendto(5, "Q\0\0\0\231INSERT INTO ClusterAds_Verti"..., 154, 0, NULL, 0) = 154
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])   
recvfrom(5, "C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T"..., 16384, 0, NULL, NULL) = 22 
lseek(6, 1107835418, SEEK_SET)          = 1107835418                          
read(6, "1"..., 1)                      = 1                                   
read(6, "0"..., 1)                      = 1                                   
read(6, "3"..., 1)                      = 1                                   
read(6, " "..., 1)                      = 1                                   
read(6, "1"..., 1)                      = 1                                   
read(6, "2"..., 1)                      = 1                                   
read(6, "7"..., 1)                      = 1                                   
read(6, "3"..., 1)                      = 1                                   
read(6, "7"..., 1)                      = 1                                   
read(6, "2"..., 1)                      = 1                                   
read(6, "6"..., 1)                      = 1                                   
read(6, "."..., 1)                      = 1                                   
read(6, "0"..., 1)                      = 1                                   
read(6, " "..., 1)                      = 1                                   
read(6, "P"..., 1)                      = 1                                   
read(6, "r"..., 1)                      = 1                                   
read(6, "o"..., 1)                      = 1                                   
read(6, "c"..., 1)                      = 1                                   
read(6, "I"..., 1)                      = 1                                   
read(6, "d"..., 1)                      = 1                                   
read(6, " "..., 1)                      = 1                                   
read(6, "0"..., 1)                      = 1                                   
read(6, "\n"..., 1)                     = 1                                   
lseek(6, 0, SEEK_CUR)                   = 1107835441                          
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
sendto(5, "Q\0\0\0\230DELETE FROM ProcAds_Vertical"..., 153, 0, NULL, 0) = 153
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])   
recvfrom(5, "C\0\0\0\rDELETE 0\0Z\0\0\0\5T"..., 16384, 0, NULL, NULL) = 20    
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
sendto(5, "Q\0\0\0\231INSERT INTO ProcAds_Vertical"..., 154, 0, NULL, 0) = 154
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])   
recvfrom(5, "C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T"..., 16384, 0, NULL, NULL) = 22 
lseek(6, 1107835441, SEEK_SET)          = 1107835441                          
read(6, "1"..., 1)                      = 1                                   
read(6, "0"..., 1)                      = 1                                   
read(6, "6"..., 1)                      = 1                                   
read(6, " "..., 1)                      = 1                                   
read(6, "\n"..., 1)                     = 1                                   
lseek(6, 0, SEEK_CUR)                   = 1107835446                          
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
sendto(5, "Q\0\0\0\vCOMMIT\0"..., 12, 0, NULL, 0) = 12                        
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])   
recvfrom(5, "C\0\0\0\vCOMMIT\0Z\0\0\0\5I"..., 16384, 0, NULL, NULL) = 18      
lseek(6, 1107835446, SEEK_SET)          = 1107835446                          
read(6, "1"..., 1)                      = 1                                   
read(6, "0"..., 1)                      = 1                                   
read(6, "5"..., 1)                      = 1                                   
read(6, " "..., 1)                      = 1                                   
read(6, "\n"..., 1)                     = 1                                   
lseek(6, 0, SEEK_CUR)                   = 1107835451                          
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
sendto(5, "Q\0\0\0\vBEGIN;\0"..., 12, 0, NULL, 0) = 12                        
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0                            
poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])   
recvfrom(5, "C\0\0\0\nBEGIN\0Z\0\0\0\5T"..., 16384, 0, NULL, NULL) = 17       
lseek(6, 1107835451, SEEK_SET)          = 1107835451
read(6, "1"..., 1)                      = 1
read(6, "0"..., 1)                      = 1
read(6, "3"..., 1)                      = 1
read(6, " "..., 1)                      = 1
read(6, "0"..., 1)                      = 1
read(6, "."..., 1)                      = 1
read(6, "0"..., 1)                      = 1
read(6, " "..., 1)                      = 1
read(6, "N"..., 1)                      = 1
read(6, "e"..., 1)                      = 1
read(6, "x"..., 1)                      = 1
read(6, "t"..., 1)                      = 1
read(6, "C"..., 1)                      = 1
read(6, "l"..., 1)                      = 1
read(6, "u"..., 1)                      = 1
read(6, "s"..., 1)                      = 1
read(6, "t"..., 1)                      = 1
read(6, "e"..., 1)                      = 1
read(6, "r"..., 1)                      = 1
read(6, "N"..., 1)                      = 1
read(6, "u"..., 1)                      = 1
read(6, "m"..., 1)                      = 1
read(6, " "..., 1)                      = 1
read(6, "1"..., 1)                      = 1
read(6, "2"..., 1)                      = 1
read(6, "7"..., 1)                      = 1
read(6, "3"..., 1)                      = 1
read(6, "7"..., 1)                      = 1
read(6, "2"..., 1)                      = 1
read(6, "8"..., 1)                      = 1
read(6, "\n"..., 1)                     = 1
lseek(6, 0, SEEK_CUR)                   = 1107835482
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0
sendto(5, "Q\0\0\0\232DELETE FROM ProcAds_Vertical"..., 155, 0, NULL, 0) = 155
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0
poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
recvfrom(5, "C\0\0\0\rDELETE 1\0Z\0\0\0\5T"..., 16384, 0, NULL, NULL) = 20
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0
sendto(5, "Q\0\0\0\241INSERT INTO ProcAds_Vertical"..., 162, 0, NULL, 0) = 162
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0
poll([{fd=5, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
+++ killed by SIGKILL +++


Anyone seen this before or has a hint for us how to keep it running happily?

Cheers

Carsten