
[HTCondor-users] Going from Condor 7.7 to HTCondor 8.8



Background: I'm the sysadmin of a small CentOS 6 computing farm. For years our small Condor pool ran Condor 7.7; newer versions offered no features we needed. Then a user required a new (unrelated) software installation that was incompatible with the old CentOS 5 Condor 7.7 libraries, and they requested that I upgrade to HTCondor 8.8.

From that point until now, I have not been able to get HTCondor 8.8 to fully run on the farm. My debugging steps included erasing the condor_config* files and replacing them with those from the RPMs and completely wiping the contents of LOCAL_DIR.

Where I'm at now: Although the condor services start up properly, I can't submit any jobs. The error is:

# condor_submit myfile.cmd
Submitting job(s)
ERROR: Failed to connect to local queue manager
SECMAN:2007:Failed to end classad message.

The results of web searches on this error have not helped. For the record:

- I've followed the instructions at <https://lists.cs.wisc.edu/archive/htcondor-users/2008-March/msg00178.shtml> multiple times. Since I had started with a fresh LOCAL_DIR, the file LOCAL_DIR/spool/job_queue.log had no invalid entries, but I gave it a try anyway.

- At present, the users are not submitting any condor jobs, so schedd is not busy.

- Schedd is running:

# ps -elf | grep schedd
4 S condor 60019 59973 0 80 0 - 13065 poll_s May22 ? 00:00:07 condor_schedd -f

- The firewall is off. Neither iptables nor netfilter is running. (Our site has a Cisco firewall that I've configured to block off port 9618 from the outside, so I'm concerned.)

- nmap tells me that port 9618 on the CONDOR_HOST is open.

- The only error in SchedLog is
DC_AUTHENTICATE: Unable to reconcile!

- I turned on debugging in condor_config.local:
  TOOL_DEBUG = D_ALL
  SUBMIT_DEBUG = D_ALL

and ran the job with
# condor_submit -debug myfile.cmd

I can post the results on request. I'm no expert, but the relevant lines appear to be:

05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN: command 1112 QMGMT_WRITE_CMD to schedd at <129.236.252.84:9618> from TCP port 19038 (blocking).
05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN:: default CLIENT meths: FS,KERBEROS,GSI,CLAIMTOBE
05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_write(fd=4 schedd at <129.236.252.84:9618>,,size=416,timeout=0,flags=0,non_blocking=0)
05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_read(fd=4 schedd at <129.236.252.84:9618>,,size=5,timeout=0,flags=0,non_blocking=0)
05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) Stream::get(int) failed to read padding
05/23/19 15:57:02 (fd:5) (pid:863797) (D_ALWAYS) SECMAN: no classad from server, failing


- The only non-default lines in the condor_config file are:

BIND_ALL_INTERFACES = TRUE
SEC_DEFAULT_AUTHENTICATION = NEVER
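
For anyone who wants to repeat the port check above without nmap, here is a minimal Python sketch. It only tests whether a TCP connection to the port can be opened; it says nothing about whether the daemon behind it accepts the security handshake. The function name port_is_open and the 3-second timeout are illustrative choices, not part of HTCondor.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical usage; replace the host with the pool's CONDOR_HOST:
#   port_is_open("condor-host.example.org", 9618)
```

A True result corresponds to what nmap reports as "open"; a False result could be caused by a firewall, a stopped daemon, or a wrong host name.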

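The D_ALL output from condor_submit -debug is long, so picking out the security-related entries by hand is tedious. The following is a rough Python sketch that splits debug text into one entry per timestamp and keeps only chosen categories. It assumes every entry starts with a "MM/DD/YY HH:MM:SS" timestamp and carries a "(D_...)" tag, as in the excerpt above; the function names and the abbreviated sample text are made up for illustration.

```python
import re

# Timestamp that starts each entry in the excerpt, e.g. "05/23/19 15:57:02 ".
TIMESTAMP = re.compile(r"\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")
# Debug category tag such as "(D_SECURITY)" or "(D_ALWAYS)".
CATEGORY = re.compile(r"\((D_[A-Z_]+)\)")


def split_entries(text):
    """Split debug output into entries, one per leading timestamp."""
    starts = [m.start() for m in TIMESTAMP.finditer(text)]
    if not starts:
        return []
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def filter_by_category(entries, wanted):
    """Keep only the entries whose first (D_...) tag is in the wanted set."""
    kept = []
    for entry in entries:
        match = CATEGORY.search(entry)
        if match and match.group(1) in wanted:
            kept.append(entry)
    return kept


# Illustrative sample; the last two messages are abbreviated with "...".
SAMPLE = (
    "05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN:: default "
    "CLIENT meths: FS,KERBEROS,GSI,CLAIMTOBE "
    "05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_read(...) "
    "05/23/19 15:57:02 (fd:5) (pid:863797) (D_ALWAYS) SECMAN: ..."
)

for line in filter_by_category(split_entries(SAMPLE), {"D_SECURITY", "D_ALWAYS"}):
    print(line)
```

Splitting on timestamps rather than on newlines means the sketch also copes with output whose line breaks were lost when it was pasted into a mail.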

Is there anything else I can do?

Thanks!
