
[HTCondor-users] issues with condor_q and imports from jobs



I set up a Condor deployment at a customer site 9 months ago. It
has a master and 2 execute hosts. All was well and my customer was
happily running jobs. Now they have called to tell me that many jobs
are sporadically failing. I took a look, and the first thing I noticed
was that condor_q does not work. On the master I get:

-- Failed to fetch ads from: <192.168.10.17:1571> : liszt
CEDAR:6001:Failed to connect to <192.168.10.17:1571>

condor_status works and systemctl status condor reports all is well. I
tried restarting condor on all hosts but still get the same error.
None of the configs appear to have been changed.
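
One check that might help narrow this down is whether anything is
accepting TCP connections at the address in the error at all. Below is
a minimal sketch, assuming the <192.168.10.17:1571> in the message is
the address condor_q is trying to reach. The helper names parse_sinful
and can_connect are made up for illustration and are not part of
HTCondor.

```python
import socket


def parse_sinful(sinful):
    """Split an address like '<192.168.10.17:1571>' into (host, port)."""
    body = sinful.strip().lstrip("<").rstrip(">")
    body = body.split("?", 1)[0]  # drop any '?key=value' suffix if present
    host, port = body.rsplit(":", 1)
    return host, int(port)


def can_connect(host, port, timeout=5.0):
    """Try a plain TCP connect; return (True, None) or (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:  # covers refused, unreachable, and timeouts
        return False, str(exc)


if __name__ == "__main__":
    # Address copied from the condor_q error above.
    host, port = parse_sinful("<192.168.10.17:1571>")
    ok, reason = can_connect(host, port)
    print(f"{host}:{port} reachable={ok} reason={reason}")
```

Running it from the master and from each execute host might show
whether the failure is local to one machine. A refused connection and
a timeout would point in different directions (nothing listening
versus something dropping the traffic).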

Next, I looked at the job failures. The jobs they run are all the same
program, and they are invoked using the Python interface. The jobs are
Python scripts. They run thousands of them every night. On any given
night some will fail with an import error on a module. The module
being imported does exist, and it clearly can be imported, since some
jobs work and some do not, and they are all the same code.
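
To see what differs between a job that works and one that fails, the
job script could record where it ran and what Python saw whenever the
import fails. Below is a minimal sketch. The function names
import_with_diagnostics and collect_diagnostics are hypothetical, and
the list of environment variables is only my guess at what is worth
capturing (I am assuming _CONDOR_SLOT and _CONDOR_SCRATCH_DIR are set
in the job environment).

```python
import importlib
import os
import socket
import sys

# Environment variables to record; this selection is an assumption.
ENV_KEYS = ("PYTHONPATH", "PATH", "_CONDOR_SLOT", "_CONDOR_SCRATCH_DIR")


def collect_diagnostics():
    """Gather details that may differ between execute hosts or slots."""
    return {
        "hostname": socket.gethostname(),
        "executable": sys.executable,
        "cwd": os.getcwd(),
        "sys_path": list(sys.path),
        "env": {key: os.environ.get(key) for key in ENV_KEYS},
    }


def import_with_diagnostics(module_name):
    """Import a module; on failure, dump diagnostics to stderr and re-raise."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        for key, value in collect_diagnostics().items():
            print(f"[import-debug] {key}: {value!r}", file=sys.stderr)
        raise
```

In the job script, the failing import would be replaced with something
like mymod = import_with_diagnostics("mymod"), where mymod stands in
for the real module name. The output lands in the job's stderr file,
so failed jobs can be compared by host, interpreter, and search path.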

Anyone have any thoughts as to what can be going on and/or how I can
debug this more?