
Re: [Condor-users] Can't find address of local schedd



Dear condorers

After cleaning up the /var/opt/condor/spool directory, I was able to
start Condor with condor_master. Things are now up and running again.
Next I have to see why the clusterN.procM.subproc0 files became so big.
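For future monitoring, a quick way to spot oversized spool files before /var fills up again might be something like the sketch below (the path is the one used in this thread; check `condor_config_val SPOOL` for your own installation):

```shell
# List regular files over 100 MB in the Condor spool, largest first.
# SPOOL below is an assumption taken from this thread's paths; adjust
# it to the value reported by `condor_config_val SPOOL`.
SPOOL=/var/opt/condor/spool
find "$SPOOL" -maxdepth 1 -type f -size +100M -printf '%s\t%p\n' 2>/dev/null \
    | sort -rn
```

Run periodically (for example from cron), this prints size-and-path lines for any spool file past the threshold, so a runaway checkpoint shows up before the partition is full.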

Thanks a lot to everyone who helped!

Regards

Marcelo


2009/4/14 Marcelo Chiapparini <marcelo.chiappa@xxxxxxxxx>:
> Hello,
>
> I looked at /var/opt/condor/spool directory. Here is it content:
>
> # ls -all
> total 3508704
> drwxr-xr-x 3 condor condor      4096 Apr 13 14:35 .
> drwxr-xr-x 5 condor condor      4096 Dec 15 11:15 ..
> -rw------- 1 condor condor    248004 Apr 13 14:41 Accountantnew.log
> -rwxr-xr-x 1 condor condor   2077155 Apr 13 13:45 cluster15.ickpt.subproc0
> -rwxr-xr-x 1 condor condor   2077155 Apr 13 08:50 cluster8.ickpt.subproc0
> -rw-r--r-- 1 condor condor 277414943 Apr 13 11:43 cluster8.proc0.subproc0
> -rw-r--r-- 1 condor condor   2322432 Apr 13 14:34 cluster8.proc0.subproc0.tmp
> -rw-r--r-- 1 condor condor 277414943 Apr 13 12:09 cluster8.proc1.subproc0
> -rw-r--r-- 1 condor condor 277419039 Apr 13 11:33 cluster8.proc2.subproc0
> -rw-r--r-- 1 condor condor 277419039 Apr 13 12:02 cluster8.proc4.subproc0
> -rw-r--r-- 1 condor condor 277414943 Apr 13 11:43 cluster8.proc5.subproc0
> -rwxr-xr-x 1 condor condor   2077155 Apr 13 09:07 cluster9.ickpt.subproc0
> -rw-r--r-- 1 condor condor 101482496 Apr 13 12:24 cluster9.proc0.subproc0.tmp
> -rw-r--r-- 1 condor condor 277410847 Apr 13 11:48 cluster9.proc10.subproc0
> -rw-r--r-- 1 condor condor 277414943 Apr 13 11:58 cluster9.proc14.subproc0
> -rw-r--r-- 1 condor condor 277410847 Apr 13 11:48 cluster9.proc15.subproc0
> -rw-r--r-- 1 condor condor   1024000 Apr 13 14:29 cluster9.proc15.subproc0.tmp
> -rw-r--r-- 1 condor condor  43974656 Apr 13 12:24 cluster9.proc16.subproc0.tmp
> -rw-r--r-- 1 condor condor  16863232 Apr 13 12:33 cluster9.proc17.subproc0.tmp
> -rw-r--r-- 1 condor condor 277414943 Apr 13 11:48 cluster9.proc1.subproc0
> -rw-r--r-- 1 condor condor  77766656 Apr 13 12:33 cluster9.proc2.subproc0.tmp
> -rw-r--r-- 1 condor condor 277419039 Apr 13 11:58 cluster9.proc4.subproc0
> -rw-r--r-- 1 condor condor 277414943 Apr 13 11:48 cluster9.proc6.subproc0
> -rw-r--r-- 1 condor condor   9547776 Apr 13 12:33 cluster9.proc7.subproc0.tmp
> -rw-r--r-- 1 condor condor 277419039 Apr 13 11:58 cluster9.proc9.subproc0
> -rw-r--r-- 1 condor condor    218377 Apr 13 13:45 history
> -rw------- 2 condor condor    262144 Apr 13 14:34 job_queue.log
> -rw------- 2 condor condor    262144 Apr 13 14:34 job_queue.log.4
> -rw------- 1 condor condor         0 Apr 13 14:35 job_queue.log.tmp
> drwxrwxrwt 2 condor condor      4096 Dec 15 11:15 local_univ_execute
>
>
> As can be seen, there are many files named clusterN.procM.subproc0,
> each of them huge (277 MB). The directory's contents amount to 3.5 GB,
> while the /var partition is only 3.8 GB (the default Rocks
> installation). So the spool directory is consuming nearly all the room
> in /var. What is the content of the clusterN.procM.subproc0 files? How
> can I prevent them from growing so much? Is it safe to erase them?
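One way to see how much of the spool these files account for is to total them directly; this is a sketch, and the glob simply assumes the naming pattern visible in the listing above:

```shell
# Total the clusterN.procM.subproc0 files (and their .tmp partials),
# reporting apparent size in bytes; the final line is the grand total.
du -cb /var/opt/condor/spool/cluster*.proc*.subproc0* 2>/dev/null | tail -n 1
```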
>
> Thanks in advance
>
> Marcelo
>
>
> 2009/4/14 Marcelo Chiapparini <marcelo.chiappa@xxxxxxxxx>:
>> Hi Rob,
>>
>> Bingo! You were right:
>>
>> # df
>> Filesystem           1K-blocks      Used Available Use% Mounted on
>> /dev/sda1             15872604   4889488  10163804  33% /
>> /dev/sda5            828959588   2753132 783418536   1% /state/partition1
>> /dev/sda2              3968124   3831872         0 100% /var
>> tmpfs                  4087108         0   4087108   0% /dev/shm
>> tmpfs                  1995656      4992   1990664   1% /var/lib/ganglia/rrds
>>
>> /var is full:
>>
>> /dev/sda2              3968124   3831872         0 100% /var
>>
>>
>> Now I have to figure out the reason, fix it, and prevent it from
>> happening again. The user is compiling his programs with
>> condor_compile and submitting them in the standard universe. Maybe
>> /var is full with his checkpoint images? If not, any help will be
>> very welcome!
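To find what is actually eating /var, a per-directory breakdown is usually the fastest check. A sketch (`-x` keeps du on the /var filesystem itself, so other mounts under it, like the tmpfs at /var/lib/ganglia/rrds in the `df` output above, don't skew the numbers):

```shell
# Show first-level /var subdirectories by disk usage, largest first.
du -x --max-depth=1 /var 2>/dev/null | sort -rn | head
```

If checkpoint images are the culprit, /var/opt/condor should dominate the output.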
>>
>> Regards
>>
>> Marcelo
>>
>> PS: I want to thank everyone on this marvelous list for their support!
>>
>>
>> 2009/4/14 Robert Futrick <rfutrick@xxxxxxxxxxxxxxxxxx>:
>>> Hello Marcelo,
>>>
>>> Based on what you've written, it sounds like you're experiencing case #1 in
>>> Jason's email.  Your daemons are configured to run on the correct server,
>>> but stopped running suddenly and now will not start again.
>>>
>>> Considering you didn't make any other changes, and the sudden nature of the
>>> stop, you might be out of disk space.  That's a common cause of daemons
>>> stopping logging mid-logline. Another possibility is that permissions or
>>> something else changed to prevent Condor from writing to that directory.
>>>
>>> Try running "df" on /var/opt/condor/log to make sure you have disk
>>> space. Being out of disk space is not the only reason Condor could have
>>> stopped working, but it is a good initial check.
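Rob's check can be sketched as below. The inode check is an extra suggestion beyond what he describes, since a filesystem can report free blocks yet still be out of inodes, which stops daemons from writing just the same (the path is the log directory from this thread):

```shell
# Free space and inode usage for the filesystem holding the log dir.
LOGDIR=/var/opt/condor/log
df -h "$LOGDIR"   # human-readable block usage
df -i "$LOGDIR"   # inode usage; 100% IUse% also prevents writes
```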
>>>
>>> Regards,
>>> Rob
>>>
>>> Marcelo Chiapparini wrote:
>>>
>>> Jason,
>>>
>>> thank you for the help. Below are the results of following your advice:
>>>
>>> 2009/4/14 Jason Stowe <jstowe@xxxxxxxxxxxxxxxxxx>:
>>>
>>>
>>> Marcelo,
>>> The errors you are getting could be caused by a few problems, so below
>>> is a more detailed process to help you debug this:
>>>
>>>
>>> $ condor_status
>>> CEDAR:6001:Failed to connect to <xxx.xx.xxx.xx:xxxx>
>>> Error: Couldn't contact the condor_collector on cluster-name.domain
>>>
>>> Extra Info: the condor_collector is a process that runs on the central
>>>
>>>
>>> ...
>>>
>>>
>>> responding. Also see the Troubleshooting section of the manual.
>>>
>>>
>>> This error indicates that the condor_status command couldn't
>>> communicate with the collector. This most likely means:
>>> (1) the collector (and the condor_master/other daemons) isn't running
>>> on the central manager,
>>> (2) the collector is running, but not on the server the command thinks
>>> it is, or
>>> (3) the collector is running where condor_status thinks it is, but
>>> condor_status doesn't have permission to talk with it.
>>>
>>> To rule out #1, on the central manager of the pool, after you run
>>> condor_master on the head node for the cluster, what do you get when
>>> you run:
>>> $ ps -ef | grep condor
>>> Does the condor_master/condor_collector show up here?
>>>
>>>
>>> No. The daemons are not running on the central node:
>>>
>>> # condor_master
>>> # ps -ef | grep condor
>>> root     25980 15002  0 09:41 pts/1    00:00:00 grep condor
>>>
>>>
>>>
>>> This should tell you the directory log files are located in:
>>> $ condor_config_val -config -verbose LOG
>>>
>>>
>>> I found them! They are in /var/opt/condor/log. Thanks!
>>>
>>>
>>>
>>> To check for option #2, determine where the collector should be by running:
>>> condor_config_val -verbose COLLECTOR_HOST
>>>
>>>
>>> # condor_config_val -verbose COLLECTOR_HOST
>>> COLLECTOR_HOST: lacad-dft.fis.uerj.br
>>>
>>>
>>>
>>> Does this match the machine you expect to be the central manager?
>>>
>>>
>>> Yes!
>>>
>>>
>>>
>>> For situation #3, do you get permission denied errors in the logfiles?
>>> Checking the HOSTALLOW_READ settings on the central manager will be
>>> the next step:
>>> http://www.cs.wisc.edu/condor/manual/v7.2/3_6Security.html#sec:Host-Security
>>>
>>>
>>> # condor_config_val -verbose HOSTALLOW_READ
>>> HOSTALLOW_READ: *
>>>   Defined in '/opt/condor/etc/condor_config', line 209.
>>>
>>>
>>> Looking at the CollectorLog file, it is clear that something happened
>>> at 14:42:01, because the write to this log was interrupted in the
>>> middle of a sentence. See the last lines of the CollectorLog:
>>>
>>> <snip>
>>> 4/13 14:40:22 NegotiatorAd  : Inserting ** "< lacad-dft.fis.uerj.br >"
>>> 4/13 14:41:55 (Sending 84 ads in response to query)
>>> 4/13 14:41:55 Got QUERY_STARTD_PVT_ADS
>>> 4/13 14:41:55 (Sending 64 ads in response to query)
>>> 4/13 14:42:01 Got QUERY
>>>
>>> and nothing more has been written since. This was yesterday, when
>>> Condor stopped working.
>>> Looking at the MasterLog file, we find the same thing: again, things
>>> were interrupted abruptly at 14:42:14. (Sorry for the long log, but I
>>> want to give a good idea of what happened...)
>>>
>>> <snip>
>>> 4/10 10:50:18 Preen pid is 10018
>>> 4/10 10:50:18 Child 10018 died, but not a daemon -- Ignored
>>> 4/11 10:50:18 Preen pid is 12156
>>> 4/11 10:50:18 Child 12156 died, but not a daemon -- Ignored
>>> 4/12 10:50:18 Preen pid is 10655
>>> 4/12 10:50:18 Child 10655 died, but not a daemon -- Ignored
>>> 4/13 10:50:18 Preen pid is 18824
>>> 4/13 10:50:18 Child 18824 died, but not a daemon -- Ignored
>>> 4/13 14:34:51 The SCHEDD (pid 4063) exited with status 4
>>> 4/13 14:34:51 Sending obituary for "/opt/condor/sbin/condor_schedd"
>>> 4/13 14:34:51 restarting /opt/condor/sbin/condor_schedd in 10 seconds
>>> 4/13 14:35:01 Started DaemonCore process
>>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20203
>>> 4/13 14:35:01 The SCHEDD (pid 20203) exited with status 4
>>> 4/13 14:35:01 Sending obituary for "/opt/condor/sbin/condor_schedd"
>>> 4/13 14:35:01 restarting /opt/condor/sbin/condor_schedd in 11 seconds
>>> 4/13 14:35:12 Started DaemonCore process
>>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20210
>>> 4/13 14:35:12 The SCHEDD (pid 20210) exited with status 44
>>> 4/13 14:35:12 Sending obituary for "/opt/condor/sbin/condor_schedd"
>>> 4/13 14:35:12 restarting /opt/condor/sbin/condor_schedd in 13 seconds
>>> 4/13 14:35:25 Started DaemonCore process
>>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20214
>>> 4/13 14:35:25 The SCHEDD (pid 20214) exited with status 44
>>> 4/13 14:35:25 Sending obituary for "/opt/condor/sbin/condor_schedd"
>>> 4/13 14:35:25 restarting /opt/condor/sbin/condor_schedd in 17 seconds
>>> 4/13 14:35:42 Started DaemonCore process
>>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20218
>>> 4/13 14:35:42 The SCHEDD (pid 20218) exited with status 44
>>> 4/13 14:35:42 restarting /opt/condor/sbin/condor_schedd in 25 seconds
>>> 4/13 14:36:07 Started DaemonCore process
>>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20219
>>> 4/13 14:36:07 The SCHEDD (pid 20219) exited with status 44
>>> 4/13 14:36:07 restarting /opt/condor/sbin/condor_schedd in 41 seconds
>>> 4/13 14:36:48 Started DaemonCore process
>>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20220
>>> 4/13 14:36:48 The SCHEDD (pid 20220) exited with status 44
>>> 4/13 14:36:48 restarting /opt/condor/sbin/condor_schedd in 73 seconds
>>> 4/13 14:38:01 Started DaemonCore process
>>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20222
>>> 4/13 14:38:01 The SCHEDD (pid 20222) exited with status 44
>>> 4/13 14:38:01 restarting /opt/condor/sbin/condor_schedd in 137 seconds
>>> 4/13 14:40:18 Started DaemonCore process
>>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20226
>>> 4/13 14:40:18 The SCHEDD (pid 20226) exited with status 44
>>> 4/13 14:40:18 restarting /opt/condor/sbin/condor_schedd in 265 seconds
>>> 4/13 14:42:01 The COLLECTOR (pid 3779) exited with status 44
>>> 4/13 14:42:01 Sending obituary for "/opt/condor/sbin/condor_collector"
>>> 4/13 14:42:01 restarting /opt/condor/sbin/condor_collector in 10 seconds
>>> 4/13 14:42:01 attempt to connect to <152.92.133.74:9618> failed:
>>> Connection refused (connect errno = 111).
>>> 4/13 14:42:01 ERROR: SECMAN:2003:TCP connection to <152.92.133.74:9618>
>>> failed
>>>
>>> 4/13 14:42:01 Failed to start non-blocking update to <152.92.133.74:9618>.
>>> 4/13 14:42:11 Started DaemonCore process
>>> "/opt/condor/sbin/condor_collector", pid and pgroup = 20233
>>> 4/13 14:42:14 attempt to connect to <152.92.133.74:9618> failed:
>>> Connection refused (connect errno = 111).
>>> 4/13 14:42:14 ERROR: SECMAN:2003:TCP connection to <152.92.133.74:9618>
>>> failed
>>>
>>> 4/13 14:42:14 Failed to start non-blocking update to <152.92.133.74:9618>.
>>> 4/13 14:42:14 The COLLECTOR (pid 20233) exited with status 44
>>> 4/13 14:42:14 Sending obituary for "/opt/condor/sbin/condor_
>>> collector"
>>> 4/13 14:42:
>>>
>>> Is this a physical problem with the hardware? I physically rebooted the
>>> cluster today, 4/14, but Condor refuses to run. Nothing has been written
>>> to the logs since yesterday, 4/13 14:42:14.
>>>
>>> Any help will be very welcome,
>>>
>>> Regards
>>>
>>> Marcelo
>>> _______________________________________________
>>> Condor-users mailing list
>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/condor-users/
>>>
>>>
>>> --
>>>
>>> ===================================
>>> Rob Futrick
>>> main: 888.292.5320
>>>
>>> Cycle Computing, LLC
>>> Leader in Condor Grid Solutions
>>> Enterprise Condor Support and CycleServer Management Tools
>>>
>>> http://www.cyclecomputing.com
>>> http://www.cyclecloud.com
>>>
>>>
>>>
>