
Re: [HTCondor-users] Failed to get address of starter for this job in SchedLog



On 6/15/2015 1:43 PM, DiPalma, Michelle wrote:
Hello,

We’re seeing the following error in SchedLog for about 1 in 10 jobs:

GET_JOB_CONNECT_INFO failed: Failed to get address of starter for this job

When this error occurs, the slot for the job is created successfully but
the user is never connected via ssh (no address to connect to).

About our pool:

All of the jobs on this particular pool are interactive. The hostnames
are not fully qualified, but DEFAULT_DOMAIN_NAME is set properly. It seems to
happen about 1 in 10 jobs all over the pool. Other interactive jobs
submitted just before or after this error (from the same user and
matched against the same host) work perfectly. We turned on full
debugging but nothing useful was logged.

What are we missing?

Any help is much appreciated.

-Michelle


Hi Michelle,

This sounds suspiciously like a bug that was just fixed a couple weeks ago. Take a peek at
  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5070

The fix first appears in HTCondor v8.3.6, which should be available on the web site by the end of this week or early next week. In the meantime, setting JOB_START_DELAY = 0 in condor_config may reduce how often the failure occurs.
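
A minimal sketch of that workaround, assuming the setting is added to the configuration of the machine running the condor_schedd (which file holds it, condor_config itself or a local config file, depends on the install and is an assumption here):

```
# Workaround sketch until the fixed release is installed:
# do not pause between batches of job starts on the schedd.
JOB_START_DELAY = 0
```

After editing the file, running condor_reconfig on that machine should make the daemons re-read their configuration, and condor_config_val JOB_START_DELAY should then report the value in effect.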

regards,
Todd