
Re: [Condor-users] Can't find address of local schedd



Hi Rob,

Bingo! You were right:

# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             15872604   4889488  10163804  33% /
/dev/sda5            828959588   2753132 783418536   1% /state/partition1
/dev/sda2              3968124   3831872         0 100% /var
tmpfs                  4087108         0   4087108   0% /dev/shm
tmpfs                  1995656      4992   1990664   1% /var/lib/ganglia/rrds

/var is full!


Now I have to figure out the reason, fix it, and prevent it from
happening again. The user is compiling his programs with
condor_compile and submitting them in the standard universe. Maybe
/var is full of his checkpoint images? If not, any help will be very
welcome!
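
To see what is actually filling /var, I will try something like this
(a rough sketch: the -h flag of sort assumes GNU coreutils, and
condor_config_val SPOOL prints where Condor keeps job state, which is
where standard universe checkpoint images usually end up unless a
checkpoint server is configured):

# du -xsh /var/* | sort -h
# condor_config_val SPOOL

If SPOOL points somewhere under /var, the checkpoint images would
explain the full partition.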

Regards

Marcelo

ps: I want to thank everyone on this marvelous list for their support!


2009/4/14 Robert Futrick <rfutrick@xxxxxxxxxxxxxxxxxx>:
> Hello Marcelo,
>
> Based on what you've written, it sounds like you're experiencing case #1 in
> Jason's email.  Your daemons are configured to run on the correct server,
> but stopped running suddenly and now will not start again.
>
> Since you didn't make any other changes, and the daemons stopped
> suddenly, you might be out of disk space.  That's a common cause of a
> daemon's log stopping mid-line.  Another possibility is that permissions
> or something else changed and Condor can no longer write to that
> directory.
>
> Try running "df" on /var/opt/condor/log to make sure you have disk
> space.  Being out of disk space is not the only reason Condor could have
> stopped working, but it is a good initial check.
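>
> For example, something like (df's path argument makes it report on the
> filesystem that directory lives on, and ls -ld shows the directory's
> ownership and permissions, in case a permissions change is the culprit):
>
> $ df -h /var/opt/condor/log
> $ ls -ld /var/opt/condor/log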
>
> Regards,
> Rob
>
> Marcelo Chiapparini wrote:
>
> Jason,
>
> thank you for the help. Below are the results of following your advice:
>
> 2009/4/14 Jason Stowe <jstowe@xxxxxxxxxxxxxxxxxx>:
>
>
> Marcelo,
> The errors you are getting could be caused by a few problems, so below
> is a more detailed process to help you debug this:
>
>
> $ condor_status
> CEDAR:6001:Failed to connect to <xxx.xx.xxx.xx:xxxx>
> Error: Couldn't contact the condor_collector on cluster-name.domain
>
> Extra Info: the condor_collector is a process that runs on the central
>
>
> ...
>
>
> responding. Also see the Troubleshooting section of the manual.
>
>
> This error indicates that the condor_status command couldn't
> communicate with the collector. This most likely means:
> (1) the collector (and the condor_master/other daemons) isn't running
> on the central manager,
> (2) the collector is running, but not on the server the command thinks
> it is, or
> (3) the collector is running where condor_status thinks it is, but
> condor_status doesn't have permission to talk with it.
>
> To rule out #1, on the central manager of the pool, after you run
> condor_master on the head node for the cluster, what do you get when
> you run:
> $ ps -ef | grep condor
> Does the condor_master/condor_collector show up here?
>
>
> No. The daemons are not running on the central node:
>
> # condor_master
> # ps -ef | grep condor
> root     25980 15002  0 09:41 pts/1    00:00:00 grep condor
>
>
>
> This should tell you the directory the log files are located in:
> $ condor_config_val -config -verbose LOG
>
>
> I found them! They are in /var/opt/condor/log. Thanks!
>
>
>
> To check for option #2, determine where the collector should be by running:
> condor_config_val -verbose COLLECTOR_HOST
>
>
> # condor_config_val -verbose COLLECTOR_HOST
> COLLECTOR_HOST: lacad-dft.fis.uerj.br
>
>
>
> Does this match the machine you expect to be the central manager?
>
>
> Yes!
>
>
>
> For situation #3, do you get permission denied errors in the logfiles?
> Checking the HOSTALLOW_READ settings on the central manager will be
> the next step:
> http://www.cs.wisc.edu/condor/manual/v7.2/3_6Security.html#sec:Host-Security
>
>
> # condor_config_val -verbose HOSTALLOW_READ
> HOSTALLOW_READ: *
>   Defined in '/opt/condor/etc/condor_config', line 209.
>
>
> Looking at the CollectorLog file, it is clear that something happened
> at 14:42:01, because writing to this log was interrupted mid-line. See
> the last lines of the CollectorLog:
>
> <snip>
> 4/13 14:40:22 NegotiatorAd  : Inserting ** "< lacad-dft.fis.uerj.br >"
> 4/13 14:41:55 (Sending 84 ads in response to query)
> 4/13 14:41:55 Got QUERY_STARTD_PVT_ADS
> 4/13 14:41:55 (Sending 64 ads in response to query)
> 4/13 14:42:01 Got QUERY
>
> and nothing more has been written since. That was yesterday, when
> Condor stopped working.
> The MasterLog shows the same thing: again, the log was cut off abruptly
> at 14:42:14. (Sorry for the long log, but I want to give a good idea of
> what happened...)
>
> <snip>
> 4/10 10:50:18 Preen pid is 10018
> 4/10 10:50:18 Child 10018 died, but not a daemon -- Ignored
> 4/11 10:50:18 Preen pid is 12156
> 4/11 10:50:18 Child 12156 died, but not a daemon -- Ignored
> 4/12 10:50:18 Preen pid is 10655
> 4/12 10:50:18 Child 10655 died, but not a daemon -- Ignored
> 4/13 10:50:18 Preen pid is 18824
> 4/13 10:50:18 Child 18824 died, but not a daemon -- Ignored
> 4/13 14:34:51 The SCHEDD (pid 4063) exited with status 4
> 4/13 14:34:51 Sending obituary for "/opt/condor/sbin/condor_schedd"
> 4/13 14:34:51 restarting /opt/condor/sbin/condor_schedd in 10 seconds
> 4/13 14:35:01 Started DaemonCore process
> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20203
> 4/13 14:35:01 The SCHEDD (pid 20203) exited with status 4
> 4/13 14:35:01 Sending obituary for "/opt/condor/sbin/condor_schedd"
> 4/13 14:35:01 restarting /opt/condor/sbin/condor_schedd in 11 seconds
> 4/13 14:35:12 Started DaemonCore process
> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20210
> 4/13 14:35:12 The SCHEDD (pid 20210) exited with status 44
> 4/13 14:35:12 Sending obituary for "/opt/condor/sbin/condor_schedd"
> 4/13 14:35:12 restarting /opt/condor/sbin/condor_schedd in 13 seconds
> 4/13 14:35:25 Started DaemonCore process
> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20214
> 4/13 14:35:25 The SCHEDD (pid 20214) exited with status 44
> 4/13 14:35:25 Sending obituary for "/opt/condor/sbin/condor_schedd"
> 4/13 14:35:25 restarting /opt/condor/sbin/condor_schedd in 17 seconds
> 4/13 14:35:42 Started DaemonCore process
> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20218
> 4/13 14:35:42 The SCHEDD (pid 20218) exited with status 44
> 4/13 14:35:42 restarting /opt/condor/sbin/condor_schedd in 25 seconds
> 4/13 14:36:07 Started DaemonCore process
> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20219
> 4/13 14:36:07 The SCHEDD (pid 20219) exited with status 44
> 4/13 14:36:07 restarting /opt/condor/sbin/condor_schedd in 41 seconds
> 4/13 14:36:48 Started DaemonCore process
> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20220
> 4/13 14:36:48 The SCHEDD (pid 20220) exited with status 44
> 4/13 14:36:48 restarting /opt/condor/sbin/condor_schedd in 73 seconds
> 4/13 14:38:01 Started DaemonCore process
> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20222
> 4/13 14:38:01 The SCHEDD (pid 20222) exited with status 44
> 4/13 14:38:01 restarting /opt/condor/sbin/condor_schedd in 137 seconds
> 4/13 14:40:18 Started DaemonCore process
> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20226
> 4/13 14:40:18 The SCHEDD (pid 20226) exited with status 44
> 4/13 14:40:18 restarting /opt/condor/sbin/condor_schedd in 265 seconds
> 4/13 14:42:01 The COLLECTOR (pid 3779) exited with status 44
> 4/13 14:42:01 Sending obituary for "/opt/condor/sbin/condor_collector"
> 4/13 14:42:01 restarting /opt/condor/sbin/condor_collector in 10 seconds
> 4/13 14:42:01 attempt to connect to <152.92.133.74:9618> failed:
> Connection refused (connect errno = 111).
> 4/13 14:42:01 ERROR: SECMAN:2003:TCP connection to <152.92.133.74:9618>
> failed
>
> 4/13 14:42:01 Failed to start non-blocking update to <152.92.133.74:9618>.
> 4/13 14:42:11 Started DaemonCore process
> "/opt/condor/sbin/condor_collector", pid and pgroup = 20233
> 4/13 14:42:14 attempt to connect to <152.92.133.74:9618> failed:
> Connection refused (connect errno = 111).
> 4/13 14:42:14 ERROR: SECMAN:2003:TCP connection to <152.92.133.74:9618>
> failed
>
> 4/13 14:42:14 Failed to start non-blocking update to <152.92.133.74:9618>.
> 4/13 14:42:14 The COLLECTOR (pid 20233) exited with status 44
> 4/13 14:42:14 Sending obituary for "/opt/condor/sbin/condor_
> collector"
> 4/13 14:42:
>
> Is this a physical problem with the hardware? I physically rebooted
> the cluster today, 4/14, but Condor refuses to run. Nothing has been
> written to the logs since yesterday, 4/13 14:42:14.
>
> Any help will be very welcome,
>
> Regards
>
> Marcelo
>
>
> --
>
> ===================================
> Rob Futrick
> main: 888.292.5320
>
> Cycle Computing, LLC
> Leader in Condor Grid Solutions
> Enterprise Condor Support and CycleServer Management Tools
>
> http://www.cyclecomputing.com
> http://www.cyclecloud.com
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>
>



-- 
Marcelo Chiapparini
http://sites.google.com/site/marcelochiapparini