
Re: [Condor-users] Can't find address of local schedd



Hello Marcelo,

Based on what you've written, it sounds like you're experiencing case #1 in Jason's email.  Your daemons are configured to run on the correct server, but stopped running suddenly and now will not start again.

Considering you didn't make any other changes, and given the sudden nature of the stop, you might be out of disk space.  That's a common cause of daemons that stop logging mid-line. Another possibility is that permissions or something else changed to prevent Condor from writing to that directory.

Try running "df /var/opt/condor/log" to make sure you have disk space there. Being out of disk space is not the only reason Condor could have stopped working, but it is a good initial check.
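If you want to script that check, here is a rough sketch (the log path is the one from your message; defaulting to the current directory and treating zero free blocks/inodes as "full" are just illustrative choices, not Condor specifics):

```shell
#!/bin/sh
# Rough sketch: check whether the filesystem holding the Condor log
# directory has run out of blocks or inodes -- either one will stop
# the daemons from writing their logs mid-line.
# Pass the log directory as $1, e.g. /var/opt/condor/log;
# defaults to the current directory so the script runs anywhere.
LOG_DIR=${1:-.}

# df -P forces POSIX single-line output so awk can grab column 4 reliably.
free_blocks=$(df -P "$LOG_DIR" | awk 'NR==2 {print $4}')
free_inodes=$(df -P -i "$LOG_DIR" | awk 'NR==2 {print $4}')

echo "free 1K-blocks under $LOG_DIR: $free_blocks"
echo "free inodes under $LOG_DIR:    $free_inodes"

if [ "$free_blocks" -eq 0 ] || [ "$free_inodes" -eq 0 ]; then
    echo "filesystem full -- this would explain the truncated logs"
    exit 1
fi
echo "space looks OK; check directory permissions next"
```

If this shows plenty of free space, the next thing to check would be whether the condor user can still write to that directory (for example, touch a file there as that user).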

Regards,
Rob

Marcelo Chiapparini wrote:
Jason,

thank you for the help. Below are the results of following your advice:

2009/4/14 Jason Stowe <jstowe@xxxxxxxxxxxxxxxxxx>:
  
Marcelo,
The errors you are getting could be caused by a few problems, so below
is a more detailed process to help you debug this:
    
$ condor_status
CEDAR:6001:Failed to connect to <xxx.xx.xxx.xx:xxxx>
Error: Couldn't contact the condor_collector on cluster-name.domain

Extra Info: the condor_collector is a process that runs on the central
      
...
    
responding. Also see the Troubleshooting section of the manual.
      
This error indicates that the condor_status command couldn't
communicate with the collector. This most likely means:
(1) the collector (and the condor_master/other daemons) isn't running
on the central manager,
(2) the collector is running, but not on the server the command thinks
it is, or
(3) the collector is running where condor_status thinks it is, but
condor_status doesn't have permission to talk with it.

To rule out #1, on the central manager of the pool, after you run
condor_master on the head node for the cluster, what do you get when
you run:
$ ps -ef | grep condor
Does the condor_master/condor_collector show up here?
    

No. The daemons are not running on the central node:

# condor_master
# ps -ef | grep condor
root     25980 15002  0 09:41 pts/1    00:00:00 grep condor

  
This should tell you the directory log files are located in:
$ condor_config_val -config -verbose LOG
    

I found them! They are in /var/opt/condor/log. Thanks!

  
To check for option #2, determine where the collector should be by running:
condor_config_val -verbose COLLECTOR_HOST
    

# condor_config_val -verbose COLLECTOR_HOST
COLLECTOR_HOST: lacad-dft.fis.uerj.br

  
Does this match the machine you expect to be the central manager?
    

Yes!

  
For situation #3, do you get permission denied errors in the logfiles?
Checking the HOSTALLOW_READ settings on the central manager will be
the next step:
http://www.cs.wisc.edu/condor/manual/v7.2/3_6Security.html#sec:Host-Security
    

# condor_config_val -verbose HOSTALLOW_READ
HOSTALLOW_READ: *
  Defined in '/opt/condor/etc/condor_config', line 209.


Looking at the CollectorLog file, it is clear that something happened
at 14:42:01, because the last write to this log was interrupted in the
middle of a line. See the last lines of the CollectorLog:

<snip>
4/13 14:40:22 NegotiatorAd  : Inserting ** "< lacad-dft.fis.uerj.br >"
4/13 14:41:55 (Sending 84 ads in response to query)
4/13 14:41:55 Got QUERY_STARTD_PVT_ADS
4/13 14:41:55 (Sending 64 ads in response to query)
4/13 14:42:01 Got QUERY

and nothing more has been written since then. That was yesterday, when
Condor stopped working.
The MasterLog file shows the same thing: logging was interrupted
abruptly at 14:42:14. (Sorry for the long log, but I want to give a
good idea of what happened...)

<snip>
4/10 10:50:18 Preen pid is 10018
4/10 10:50:18 Child 10018 died, but not a daemon -- Ignored
4/11 10:50:18 Preen pid is 12156
4/11 10:50:18 Child 12156 died, but not a daemon -- Ignored
4/12 10:50:18 Preen pid is 10655
4/12 10:50:18 Child 10655 died, but not a daemon -- Ignored
4/13 10:50:18 Preen pid is 18824
4/13 10:50:18 Child 18824 died, but not a daemon -- Ignored
4/13 14:34:51 The SCHEDD (pid 4063) exited with status 4
4/13 14:34:51 Sending obituary for "/opt/condor/sbin/condor_schedd"
4/13 14:34:51 restarting /opt/condor/sbin/condor_schedd in 10 seconds
4/13 14:35:01 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20203
4/13 14:35:01 The SCHEDD (pid 20203) exited with status 4
4/13 14:35:01 Sending obituary for "/opt/condor/sbin/condor_schedd"
4/13 14:35:01 restarting /opt/condor/sbin/condor_schedd in 11 seconds
4/13 14:35:12 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20210
4/13 14:35:12 The SCHEDD (pid 20210) exited with status 44
4/13 14:35:12 Sending obituary for "/opt/condor/sbin/condor_schedd"
4/13 14:35:12 restarting /opt/condor/sbin/condor_schedd in 13 seconds
4/13 14:35:25 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20214
4/13 14:35:25 The SCHEDD (pid 20214) exited with status 44
4/13 14:35:25 Sending obituary for "/opt/condor/sbin/condor_schedd"
4/13 14:35:25 restarting /opt/condor/sbin/condor_schedd in 17 seconds
4/13 14:35:42 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20218
4/13 14:35:42 The SCHEDD (pid 20218) exited with status 44
4/13 14:35:42 restarting /opt/condor/sbin/condor_schedd in 25 seconds
4/13 14:36:07 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20219
4/13 14:36:07 The SCHEDD (pid 20219) exited with status 44
4/13 14:36:07 restarting /opt/condor/sbin/condor_schedd in 41 seconds
4/13 14:36:48 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20220
4/13 14:36:48 The SCHEDD (pid 20220) exited with status 44
4/13 14:36:48 restarting /opt/condor/sbin/condor_schedd in 73 seconds
4/13 14:38:01 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20222
4/13 14:38:01 The SCHEDD (pid 20222) exited with status 44
4/13 14:38:01 restarting /opt/condor/sbin/condor_schedd in 137 seconds
4/13 14:40:18 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20226
4/13 14:40:18 The SCHEDD (pid 20226) exited with status 44
4/13 14:40:18 restarting /opt/condor/sbin/condor_schedd in 265 seconds
4/13 14:42:01 The COLLECTOR (pid 3779) exited with status 44
4/13 14:42:01 Sending obituary for "/opt/condor/sbin/condor_collector"
4/13 14:42:01 restarting /opt/condor/sbin/condor_collector in 10 seconds
4/13 14:42:01 attempt to connect to <152.92.133.74:9618> failed:
Connection refused (connect errno = 111).
4/13 14:42:01 ERROR: SECMAN:2003:TCP connection to <152.92.133.74:9618> failed

4/13 14:42:01 Failed to start non-blocking update to <152.92.133.74:9618>.
4/13 14:42:11 Started DaemonCore process
"/opt/condor/sbin/condor_collector", pid and pgroup = 20233
4/13 14:42:14 attempt to connect to <152.92.133.74:9618> failed:
Connection refused (connect errno = 111).
4/13 14:42:14 ERROR: SECMAN:2003:TCP connection to <152.92.133.74:9618> failed

4/13 14:42:14 Failed to start non-blocking update to <152.92.133.74:9618>.
4/13 14:42:14 The COLLECTOR (pid 20233) exited with status 44
4/13 14:42:14 Sending obituary for "/opt/condor/sbin/condor_collector"
4/13 14:42:

Is this a hardware problem? I physically rebooted the cluster today,
4/14, but Condor still refuses to run. Nothing has been written to the
logs since yesterday, 4/13 14:42:14.

Any help will be very welcome,

Regards

Marcelo
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: 
https://lists.cs.wisc.edu/archive/condor-users/
  


-- 

===================================
Rob Futrick
main: 888.292.5320

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and CycleServer Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com