
[Condor-users] condor_master dying for no apparent reason



Over the last week we've had two instances of the Condor daemons on a machine going down for no apparent reason.  These were two different machines, but both were "submit only" machines (running only condor_master and condor_schedd).  I'm hoping someone can take a quick look at my log file and see if there's anything here that would help with diagnosis.  The repeated entries saying

 

9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.

9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.

 

worry me, but I don't know what they mean.  I've confirmed that the pid doesn't exist, but I don't know why it's looking for it.

 

One other potential item (although it might be a red herring) is that the condor_config and local config files are in the ~condor directory on an NFS-mounted partition, and we've had occasional trouble with that mount failing on us.  Normally that gives a pretty obvious error message, though, and we aren't getting anything here.  However, if there's a "cd ~condor" command or equivalent somewhere in the code, that could be a problem, since you can't cd to ~condor on our systems.  You can 'cd /home/condor' and 'ls ~condor', but 'cd ~condor' is disabled (I don't know why).

 

 

Finally, is there a way to ensure that we get notified when the condor_master daemon goes down?  I have PUBLISH_OBITUARIES set to True and OBITUARY_LOG_LENGTH set to 20, but I'm not getting any emails at the ADMIN address at all when these issues occur.
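
As a stopgap that does not depend on Condor's own notification, an external check run from cron could send mail whenever no condor_master process is found. The following is only a minimal sketch, not anything Condor provides: the function names, the use of /proc/<pid>/cmdline to find the process, and delivery through an SMTP server on localhost are all assumptions made for illustration.

```python
import os
import smtplib
import socket
from email.message import EmailMessage


def is_running(name, proc_root="/proc"):
    """Return True if any process has an argv[0] whose basename equals name."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a pid directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                argv0 = f.read().split(b"\0", 1)[0].decode(errors="replace")
        except OSError:
            continue  # process exited between listdir() and open()
        if os.path.basename(argv0) == name:
            return True
    return False


def build_alert(daemon, host, to_addr, from_addr):
    """Build the notification message (addresses are supplied by the caller)."""
    msg = EmailMessage()
    msg["Subject"] = f"{daemon} is not running on {host}"
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg.set_content(f"No process named {daemon} was found on {host}.")
    return msg


def send_via_local_smtp(msg):
    """Assumption: a mail server accepts connections on localhost port 25."""
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


def check_and_alert(to_addr, from_addr, daemon="condor_master",
                    proc_root="/proc", send=send_via_local_smtp):
    """Send one alert if the daemon is missing; return True if an alert was sent."""
    if is_running(daemon, proc_root):
        return False
    send(build_alert(daemon, socket.gethostname(), to_addr, from_addr))
    return True
```

Calling check_and_alert() with the admin address from a cron job every few minutes would report an outage shortly after it happens. The script should live on local disk rather than on the NFS mount, so that a failed mount cannot silence the check itself.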

 

 

Thanks in advance for any help.  I'm stumped, so anything at all would be appreciated.

 

-Colin

 

 

System information:

 

Condor version 6.7.14

Red Hat Enterprise Linux 4 (Linux 2.6.9-5.0.5.ELsmp) on the two machines that went down

Red Hat Enterprise Linux 3 (Linux 2.4.21-32.0.1.ELsmp) on the central manager

 

 

 



9/25 17:13:25 Getting monitoring info for pid 2796
9/25 17:13:38 enter Daemons::UpdateCollector
9/25 17:13:38 Trying to update collector <xx.xx.xx.xx:9618>
9/25 17:13:38 Attempting to send update via UDP to collector mnappmb00.fairisaac.com <xx.xx.xx.xx:9618>
9/25 17:13:38 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
9/25 17:13:38 exit Daemons::UpdateCollector
9/25 17:13:38 enter Daemons::CheckForNewExecutable
9/25 17:13:38 Time stamp of running /opt/condor/sbin/condor_master: 1139286549
9/25 17:13:38 GetTimeStamp returned: 1139286549
9/25 17:13:38 Time stamp of running /opt/condor/sbin/condor_schedd: 1139286549
9/25 17:13:38 GetTimeStamp returned: 1139286549
9/25 17:13:38 exit Daemons::CheckForNewExecutable
9/25 17:13:58 ProcAPI::getProcInfo() pid 8109 does not exist.
9/25 17:13:58 ProcAPI::getProcInfo() pid 8109 does not exist.
9/25 17:13:58 ProcAPI::getProcInfo() pid 8109 does not exist.
9/25 17:13:58 ProcAPI::getProcInfo() pid 8109 does not exist.
9/25 17:13:58 ProcAPI::getProcInfo() pid 8109 does not exist.
9/25 17:13:58 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:14:58 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:15:59 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:16:59 ProcAPI::getProcInfo() pid 13414 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13414 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13414 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13414 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13414 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.
9/25 17:16:59 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:17:25 Getting monitoring info for pid 2796
9/25 17:17:59 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:18:38 enter Daemons::UpdateCollector
9/25 17:18:38 Trying to update collector <xx.xx.xx.xx:9618>
9/25 17:18:38 Attempting to send update via UDP to collector mnappmb00.fairisaac.com <xx.xx.xx.xx:9618>
9/25 17:18:38 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
9/25 17:18:38 exit Daemons::UpdateCollector
9/25 17:18:38 enter Daemons::CheckForNewExecutable
9/25 17:18:38 Time stamp of running /opt/condor/sbin/condor_master: 1139286549
9/25 17:18:38 GetTimeStamp returned: 1139286549
9/25 17:18:38 Time stamp of running /opt/condor/sbin/condor_schedd: 1139286549
9/25 17:18:38 GetTimeStamp returned: 1139286549
9/25 17:18:38 exit Daemons::CheckForNewExecutable
9/25 17:18:59 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:20:00 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:21:00 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:21:25 Getting monitoring info for pid 2796
9/25 17:22:00 ProcAPI::getProcInfo() pid 21664 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21664 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21664 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21664 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21664 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21666 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21666 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21666 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21666 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21666 does not exist.
9/25 17:22:00 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:22:24 DaemonCore: Command received via UDP from host <xx.xx.xx.xx:33012>
9/25 17:22:24 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
9/25 17:23:01 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:23:38 enter Daemons::UpdateCollector
9/25 17:23:38 Trying to update collector <xx.xx.xx.xx:9618>
9/25 17:23:38 Attempting to send update via UDP to collector mnappmb00.fairisaac.com <xx.xx.xx.xx:9618>
9/25 17:23:38 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
9/25 17:23:38 exit Daemons::UpdateCollector
9/25 17:23:38 enter Daemons::CheckForNewExecutable
9/25 17:23:38 Time stamp of running /opt/condor/sbin/condor_master: 1139286549
9/25 17:23:38 GetTimeStamp returned: 1139286549
9/25 17:23:38 Time stamp of running /opt/condor/sbin/condor_schedd: 1139286549
9/25 17:23:38 GetTimeStamp returned: 1139286549
9/25 17:23:38 exit Daemons::CheckForNewExecutable
9/26 12:28:03 NET_REMAP_ENABLE is undefined, using default value of False
9/26 12:28:03 NET_REMAP_ENABLE is undefined, using default value of False
9/26 12:28:03 PASSWD_CACHE_REFRESH is undefined, using default value of 300
9/26 12:28:03 ******************************************************
9/26 12:28:03 ** condor_master (CONDOR_MASTER) STARTING UP
9/26 12:28:03 ** /opt/condor/sbin/condor_master
9/26 12:28:03 ** $CondorVersion: 6.7.14 Dec 13 2005 $
9/26 12:28:03 ** $CondorPlatform: I386-LINUX_RH9 $
9/26 12:28:03 ** PID = 32758
9/26 12:28:03 ******************************************************
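
In the log above, the last entry before the restart is stamped 9/25 17:23:38 and the next is 9/26 12:28:03, with no shutdown or error message in between. To pin down when the master stopped logging on other machines or other days, a small script along these lines could scan a master log for long silences. This is a rough sketch that assumes every entry starts with the "M/D HH:MM:SS" stamp seen here; the function name, the ten-minute threshold, and the placeholder year (the log does not record one) are illustrative choices, not Condor settings.

```python
import re
from datetime import datetime, timedelta

# Matches the "M/D HH:MM:SS " prefix used by the log entries above.
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")


def find_gaps(lines, threshold=timedelta(minutes=10), year=2000):
    """Return (last_line_before, first_line_after, gap) for each long silence.

    year is an arbitrary placeholder needed only to build datetime objects;
    a log that spans New Year would need extra handling.
    """
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # blank lines or continuation text without a stamp
        month, day, hh, mm, ss = map(int, m.groups())
        now = datetime(year, month, day, hh, mm, ss)
        if prev_time is not None and now - prev_time > threshold:
            gaps.append((prev_line.rstrip(), line.rstrip(), now - prev_time))
        prev_time, prev_line = now, line
    return gaps
```

Passing the open log file to find_gaps() and printing each tuple shows the last line written before every silence, which is the most useful place to start comparing against system logs (for example, NFS or out-of-memory messages) from the same minute.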