
[Condor-users] various condor naming problems



I'm having a range of problems, and I think they might all be related to
DNS or to host naming more generally.  Here's the latest example:

I'm sitting at my Linux desktop, which is a member of our condor pool,
trying to retrieve logs from jobs that failed on a Windows machine.  In my
local log file I have:

   001 (022.000.000) 08/14 12:42:07 Job executing on host: <128.135.36.115:3303>

A few lines later that job dies with a shadow exception, "Can no longer
talk to condor_starter <128.135.36.115:3303>"
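
For reference, the value in angle brackets is an IP:port pair.  Here is a
minimal Python sketch for pulling it out of a log line, assuming the line
format shown above; the function name parse_exec_host is made up for
illustration and is not part of Condor:

```python
import re

# Matches the "<ip:port>" address as it appears in the log line quoted above.
ADDR_RE = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")


def parse_exec_host(line):
    """Return (ip, port) from a log line, or None if no address is present."""
    m = ADDR_RE.search(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2))


line = "001 (022.000.000) 08/14 12:42:07 Job executing on host: <128.135.36.115:3303>"
print(parse_exec_host(line))
```

The IP half of that pair is what I then looked up by hand.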

Okay, that IP address points to c-pc-19.uchicago.edu, so I should be
able to do:

   $ condor_fetchlog c-pc-19.uchicago.edu STARTER

...right?  But I get an error:

   Couldn't locate daemon on c-pc-19.uchicago.edu: Can't find address for master c-pc-19.uchicago.edu

I get the same error whether I use "c-pc-19", "C-PC-19" (the value I get
for "Machine =" when I run 'condor_status -long'), the system's FQDN,
or its IP address.
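
Since DNS is the suspect, one way to check whether reverse and forward
lookups agree for that address is sketched below.  This is only a sketch
using Python's standard socket module; the helper name dns_round_trip and
its injectable resolver arguments are assumptions for illustration, and it
says nothing about how Condor itself resolves names:

```python
import socket


def dns_round_trip(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, then forward-resolve the resulting name.

    Returns (name, addresses, consistent).  name is None if the reverse
    lookup fails; consistent is True only if ip is among the addresses
    that the name resolves back to.
    """
    try:
        name = reverse(ip)[0]          # (hostname, aliases, addresses)
    except OSError:                    # socket.herror / socket.gaierror
        return None, [], False
    try:
        addresses = forward(name)[2]   # (hostname, aliases, addresses)
    except OSError:
        return name, [], False
    return name, addresses, ip in addresses


# Example call (performs real DNS queries):
#   dns_round_trip("128.135.36.115")
```

A False in the last position would point at a DNS mismatch for the node;
True would suggest looking elsewhere.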

I also get the same error when I run the command on the cluster's
central manager.

So, why might it be that the cluster can contact the compute node well
enough to start a job on it, but can't find an address for it otherwise?