
Re: [Condor-users] condor_collector starting and dying without even writing to the log



Aha! Our sys-admins tracked the problem down to a missing compatibility library (for libstdc++). I still don't understand why condor_collector could be launched from the command line, but our setup is back up and running, so I won't look a gift horse in the mouth.
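
For anyone who hits the same symptom, one way to check a daemon binary for unresolved shared libraries is to look for "not found" entries in the output of ldd. The following is only a rough sketch: the helper names and the sample ldd output (including the library name libexample.so.1) are made up for illustration and are not taken from our machines.

```python
import re
import subprocess

# Matches ldd lines of the form "<library> => not found".
_NOT_FOUND = re.compile(r"^\s*(\S+)\s+=>\s+not found\s*$")


def missing_libraries(ldd_text):
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_text.splitlines():
        match = _NOT_FOUND.match(line)
        if match:
            missing.append(match.group(1))
    return missing


def run_ldd(binary_path):
    """Run ldd on a binary and return its text output (needs ldd on PATH)."""
    result = subprocess.run(["ldd", binary_path], capture_output=True, text=True)
    return result.stdout


# Hypothetical ldd output, for illustration only.
SAMPLE_LDD_OUTPUT = """\
\tlibexample.so.1 => not found
\tlibm.so.6 => /lib/libm.so.6 (0x00b5d000)
\tlibc.so.6 => /lib/libc.so.6 (0x00a2f000)
"""

if __name__ == "__main__":
    print(missing_libraries(SAMPLE_LDD_OUTPUT))
```

On a real system, something like missing_libraries(run_ldd("/usr/local/condor/sbin/condor_collector")) should list whatever the loader cannot resolve, assuming the check runs in the same environment the daemon is started in.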



This is for the RH9 "** $CondorVersion: 6.6.11 Mar 23 2006 $" build
running on Fedora Core 4. Interestingly, we weren't seeing this for the
RH7 "** $CondorVersion: 6.6.9 Mar 10 2005 $" build running on Fedora Core 4.

From the MasterLog:

11/22 14:38:32 Started DaemonCore process
"/usr/local/condor/sbin/condor_collector", pid and pgroup = 4480
11/22 14:38:32 enter Daemons::UpdateCollector
11/22 14:38:32 Attempting to send update via UDP to collector
XXX.corefa.com <XXX.XXX.XXX.XXX:9618>
11/22 14:38:32 Can't connect to <XXX.XXX.XXX.XXX:9618>:0, errno = 111
11/22 14:38:32 Will keep trying for 10 seconds...
11/22 14:38:42 Connect failed for 10 seconds; returning FALSE
11/22 14:38:42 ERROR:
SECMAN:2003:TCP connection to <XXX.XXX.XXX.XXX:9618> failed

11/22 14:38:42 Can't send UPDATE_MASTER_AD to collector XXX.corefa.com
<XXX.XXX.XXX.XXX:9618>: Failed to send UDP update command to collector
11/22 14:38:42 DaemonCore: No more children processes to reap.
11/22 14:38:42 start recover timer (63)
11/22 14:38:42 Started DaemonCore process
"/usr/local/condor/sbin/condor_schedd", pid and pgroup = 4485
11/22 14:38:42 enter Daemons::UpdateCollector
11/22 14:38:42 Attempting to send update via UDP to collector
XXX.corefa.com <XXX.XXX.XXX.XXX:9618>
11/22 14:38:42 Can't connect to <XXX.XXX.XXX.XXX:9618>:0, errno = 111
11/22 14:38:42 Will keep trying for 10 seconds...
11/22 14:38:52 Connect failed for 10 seconds; returning FALSE
11/22 14:38:52 ERROR:
SECMAN:2003:TCP connection to <XXX.XXX.XXX.XXX:9618> failed

11/22 14:38:52 Can't send UPDATE_MASTER_AD to collector XXX.corefa.com
<XXX.XXX.XXX.XXX:9618>: Failed to send UDP update command to collector
11/22 14:38:52 The COLLECTOR (pid 4480) exited with status 127
11/22 14:38:52 ProcAPI::buildFamily failed: parent 4480 not found on system.
11/22 14:38:52 restarting /usr/local/condor/sbin/condor_collector in 265
seconds
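
A small script can pull the daemon exit lines out of a MasterLog so that a status like 127 stands out. This is a rough sketch that assumes the line format shown in the excerpt above; the function name is made up. As far as I know, 127 is the status the shell uses for "command not found" and the one the glibc dynamic loader exits with when a required shared library cannot be loaded, and errno 111 on Linux is ECONNREFUSED, which fits a collector that never comes up.

```python
import re

# Matches lines such as:
#   11/22 14:38:52 The COLLECTOR (pid 4480) exited with status 127
_EXIT_LINE = re.compile(r"The (\w+) \(pid (\d+)\) exited with status (\d+)")


def daemon_exits(log_text):
    """Return (daemon, pid, status) tuples for each exit line in the log."""
    exits = []
    for line in log_text.splitlines():
        match = _EXIT_LINE.search(line)
        if match:
            daemon, pid, status = match.groups()
            exits.append((daemon, int(pid), int(status)))
    return exits


# Two lines copied from the MasterLog excerpt above.
SAMPLE_LOG = """\
11/22 14:38:42 Started DaemonCore process
11/22 14:38:52 The COLLECTOR (pid 4480) exited with status 127
"""

if __name__ == "__main__":
    for daemon, pid, status in daemon_exits(SAMPLE_LOG):
        note = " (possibly a loader or exec failure)" if status == 127 else ""
        print(f"{daemon} pid={pid} status={status}{note}")
```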


This looks to be similar to:
https://www-auth.cs.wisc.edu/lists/condor-users/2006-May/msg00289.shtml

Any ideas on how I can diagnose this further? Some other observations:

* I can launch condor_collector from the command-line, and THEN it
writes to the log. But the process doesn't actually do any collecting

* If I hammer 'ps' to see when Condor is launching condor_collector, the
launched process is owned by 'root', not 'condor'
