
Re: [HTCondor-users] node gone from condor_status?



Nodes disappear from condor_status when they fail to send an update ad to the collector for too long, and the collector discards the expired ad. There are a variety of reasons why that can happen.
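For reference, the expiration window is a collector-side knob; if memory serves it is CLASSAD_LIFETIME (default 900 seconds), so a startd whose updates stop arriving drops out of condor_status after roughly 15 minutes. A sketch of the relevant setting, assuming that knob name:

```
# Central manager condor_config -- a sketch, not a recommendation.
# How long (in seconds) the collector keeps an ad without a fresh update
# before discarding it:
CLASSAD_LIFETIME = 900
```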

Is the node still running an HTCondor startd? Check the StartLog on the execute node.
Did the ALLOW_ list on the collector change? Maybe the node is no longer allowed to send updates? Check the CollectorLog on the central manager node.

Is the node configured to send updates via TCP? Check the UPDATE_COLLECTOR_WITH_TCP knob on the execute node.
If updates are sent via UDP, it's possible that only the initial update (right after the reconfig) got through to the collector,
because the initial update is always TCP, but if UPDATE_COLLECTOR_WITH_TCP=false, the subsequent updates will be UDP.
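If dropped UDP updates turn out to be the culprit, forcing TCP is a one-line change on the execute node (a sketch; run condor_reconfig afterwards so the startd picks it up):

```
# Execute node condor_config:
UPDATE_COLLECTOR_WITH_TCP = True
```

You can confirm the value the daemon actually sees with `condor_config_val UPDATE_COLLECTOR_WITH_TCP` on that node.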

-tj

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Dimitri Maziuk
Sent: Friday, March 3, 2017 4:11 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] node gone from condor_status?

Hi all,

> [root@turkey ~]# condor_status turkey ; echo $?
> 0
> [root@turkey ~]# ps -AF | grep condor
> condor       789       1  0 17379  6336   2 Feb06 ?        00:00:23 /usr/sbin/condor_master -f
> root         934     789  0  6202  4452   3 Feb06 ?        00:39:01 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 501
> condor       935     789  0 12347  5568   3 Feb06 ?        00:00:26 condor_shared_port -f
> condor      1003     789  0 12829  7848   2 Feb06 ?        00:44:20 condor_startd -f
> condor    179933    1003  0 12705  6480   3 Mar01 ?        00:00:59 condor_starter -f -a slot7 exocet.bmrb.wisc.edu
> bbee      179940  179933  0  4493  1468   3 Mar01 ?        00:00:00 /bin/sh /var/lib/condor/execute/dir_179933/condor_exec.exe 208 22

The last line is the one last job still running.

I set START=FALSE and did a condor_reconfig for the 8.6.1 update yesterday; it's taking a while for the jobs to taper off. A couple of hours ago there were two running jobs and the rest of the cores were in the Owner state. Sometime between then and now the node disappeared from condor_status output. Any idea why?

TIA
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu