
Re: [HTCondor-users] issues with condor_q and imports from jobs



CEDAR:6001:Failed to connect to <192.168.10.17:1571>

If the schedd is actually listening at that address and port, then "Failed to connect" is almost certainly caused by a firewall or router, not by the HTCondor configuration.
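
One way to check this independently of HTCondor is a plain TCP connect from the machine where condor_q fails. The following is only a sketch: can_connect is a made-up helper, not part of HTCondor, and the host and port in the usage note are simply the values from the error message.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try a plain TCP connection to host:port.

    Returns (True, None) on success, or (False, reason) where reason
    names the exception that was raised.
    """
    try:
        # create_connection resolves the host and opens a TCP socket;
        # the "with" block closes it again as soon as we are done.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Covers refused connections, unreachable networks and timeouts.
        return False, f"{type(exc).__name__}: {exc}"
```

Calling can_connect("192.168.10.17", 1571) on the submit machine and again from another host shows whether the port is reachable at all. As a rough rule, a quick "connection refused" suggests nothing is listening on that port or something is actively rejecting it, while a timeout is more typical of a firewall silently dropping packets.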

Does

    condor_status -schedd -af Name MyAddress

show the address above? 

Does condor_q work when you run it on the machine that is running the schedd?
 
You can have a look at the SchedLog to see if it is actively refusing the connection, but I don't think you will see anything. If the problem were the HTCondor configuration causing the schedd to refuse the command, I would expect a different error message from condor_q.

You don't say what version of HTCondor is being used, but I'm assuming it is an older one, because starting with 8.6 the default is to use shared port, in which case the port would be 9618 and not the 1571 shown above.
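
To make the port comparison concrete, here is a small sketch that pulls the host and port out of an address written in the angle-bracket form used in the error message and checks the port against 9618. The function names are invented for illustration, the regular expression only handles IPv4 or hostname addresses, and the "?sock=abc" suffix in the usage note is a made-up example of trailing parameters, not output from a real pool.

```python
import re

# Port mentioned above as the one used when shared port is enabled.
SHARED_PORT_DEFAULT = 9618

# Matches "<host:port>" with an optional "?params" part before the ">".
# Assumption: IPv4 or hostname only; bracketed IPv6 is not handled.
_ADDRESS = re.compile(r"^<(?P<host>[^:>?]+):(?P<port>\d+)(?:\?[^>]*)?>$")


def parse_address(address):
    """Return (host, port) from a string such as "<192.168.10.17:1571>"."""
    match = _ADDRESS.match(address.strip())
    if match is None:
        raise ValueError(f"unrecognized address: {address!r}")
    return match.group("host"), int(match.group("port"))


def uses_shared_port_default(address):
    """True if the address points at the default shared port number."""
    _, port = parse_address(address)
    return port == SHARED_PORT_DEFAULT
```

With the address from the error, parse_address("<192.168.10.17:1571>") gives the host "192.168.10.17" and port 1571, so uses_shared_port_default returns False. A string such as "<192.168.10.17:9618?sock=abc>" would return True.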

As for the import failures, is there a shared file system? That could result in intermittent errors. Or perhaps a slow-motion disk failure? Do all of the failures happen on one of the execute machines, or do they happen on both?
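
To answer the "one machine or both" question, the job script could record where it ran and what the import saw. This is a sketch under assumptions: import_report is a hypothetical helper, and the idea is that the job would write the result to its stderr or output file so that failures can later be grouped by host.

```python
import importlib
import os
import socket
import sys
import traceback


def import_report(module_name):
    """Try to import module_name and describe the environment it ran in.

    Returns a dict with the host name, working directory, sys.path,
    and the full traceback if the import failed.
    """
    report = {
        "module": module_name,
        "host": socket.gethostname(),
        "cwd": os.getcwd(),
        "sys_path": list(sys.path),
        "ok": True,
        "error": None,
    }
    try:
        importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so both land here.
        report["ok"] = False
        report["error"] = traceback.format_exc()
    return report
```

Calling this at the top of the job for the module that sometimes fails, and printing the report whenever "ok" is False, shows which execute machine the failure happened on and whether sys.path pointed at a shared file system at the time.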

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Larry Martell
Sent: Thursday, November 29, 2018 6:15 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] issues with condor_q and imports from jobs

I had a condor deployment I set up at a customer site 9 months ago. It
had a master and 2 execute hosts. All was well and my customer was
happily running jobs. Now they called me and said many jobs are
sporadically failing. I took a look and the first thing I noticed was
that condor_q does not work. On the master I get:

-- Failed to fetch ads from: <192.168.10.17:1571> : liszt
CEDAR:6001:Failed to connect to <192.168.10.17:1571>

condor_status works and systemctl status condor reports all is well. I
tried restarting condor on all hosts but still get the same error.
None of the configs appear to have been changed.

Next, I looked at the job failures. The jobs they run are all the same
program and they are invoked using the python interface. The jobs are
python scripts. They run thousands of them every night. On any given
night some will fail with an import error on a module. The module
being imported does exist, and it clearly can be imported, as some
jobs work and some do not and they are all the same code.

Anyone have any thoughts as to what can be going on and/or how I can
debug this more?