
Re: [Condor-users] jobs vacating reason



In case anyone wants to know the solution: the symptom is processes dying after exactly 20 minutes. That's the clue that ALIVEs aren't getting through.

Removing the entry in /etc/hosts that mapped "f0.<mydom>.local" to 127.0.0.1 on the schedd machine (which was also the collector/negotiator, so I'm not sure the problem is specific to the schedd) immediately allowed ALIVEs to go through.
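
For anyone who wants to check their own machines for the same kind of entry, here is a minimal sketch in Python. It only parses hosts-file text and lists non-localhost names mapped to a loopback address; it knows nothing about Condor itself. The function name, the ignore list, and the "f0.example.local" sample hostname are my own placeholders, not anything from Condor or from my real config.

```python
import ipaddress

# Names that are expected to map to loopback; this list is an assumption
# and may need adjusting for a particular distribution.
EXPECTED_LOOPBACK_NAMES = {
    "localhost", "localhost.localdomain",
    "localhost4", "localhost6",
    "ip6-localhost", "ip6-loopback",
}

def loopback_hostnames(hosts_text):
    """Return (hostname, address) pairs for non-localhost names on loopback."""
    flagged = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # skip lines that do not start with a valid address
        if not addr.is_loopback:
            continue
        for name in fields[1:]:
            if name.lower() not in EXPECTED_LOOPBACK_NAMES:
                flagged.append((name, str(addr)))
    return flagged

# Hypothetical hosts file contents, for illustration only.
SAMPLE = """
127.0.0.1    localhost
127.0.0.1    f0.example.local f0   # the kind of entry that caused trouble
192.168.1.10 node1.example.local
::1          ip6-localhost ip6-loopback
"""

if __name__ == "__main__":
    for name, addr in loopback_hostnames(SAMPLE):
        print(f"{name} maps to loopback address {addr}")
```

To run it against a real machine, pass in the contents of /etc/hosts, e.g. loopback_hostnames(open("/etc/hosts").read()). Anything it lists is worth a look on a machine that other nodes need to reach by name.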

Apparently, the schedd (or perhaps collector/negotiator) server uses its own /etc/hosts to tell the startd compute server which IP to connect to for ALIVE pings? It seems rather backward... there are good reasons why the startd server should use its own DNS (multi-segment networks, failover, etc.). (NO_DNS is false in my config.)

Anyway, it's fixed. I put my suspend options back on, and Condor works like a dream and is already saving our a***es by letting us schedule massive compute jobs.

Thanks!