Re: [HTCondor-users] Going from Condor 7.7 to HTCondor 8.8



Hi again William,

I should also mention that since you are making such a big leap in version, the quickest way to get things working would be to completely blow away the old HTCondor stuff and install the latest RPM.  I know you said you moved the configuration over, but so much has changed over the years, including various directory layouts and permissions, that it will almost certainly be faster to start with a fresh install than to debug an upgrade over the existing install.
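
For what it's worth, a clean reinstall might look roughly like the sketch below. Everything specific in it is an assumption on my part rather than something taken from your setup: the package name "condor", the "service condor" init script, the stock RPM directories /etc/condor, /var/lib/condor and /var/log/condor, and a yum repository for 8.8 that is already configured. The helper names run and clean_reinstall are made up for illustration. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of a clean reinstall; paths and package name are assumptions.
# DRY_RUN=1 (the default) only prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
STAMP="${STAMP:-$(date +%Y%m%d)}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

clean_reinstall() {
    run service condor stop
    run yum -y remove condor
    # Move the old trees aside rather than deleting them, so the
    # 7.7 configuration and logs stay available for reference.
    for dir in /etc/condor /var/lib/condor /var/log/condor; do
        if [ "$DRY_RUN" = "1" ] || [ -e "$dir" ]; then
            run mv "$dir" "$dir.pre-8.8.$STAMP"
        fi
    done
    run yum -y install condor
    run service condor start
    run condor_version
}

clean_reinstall
```

Run it once as-is to review the command list, then again as root with DRY_RUN=0 if the list matches your layout. Moving the directories aside instead of removing them keeps the old files out of the way without losing them.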

Once you have done that and have it working, then let's discuss why you want to turn off authentication, and if you have a compelling reason we can certainly do that.  But I'd suggest starting from a working default installation first so we know there are no vestiges of the 7.7 release still around.
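
To confirm what the fresh install is actually using, something along these lines should show the security settings in effect and where they come from. This is a sketch, not a diagnosis: show_security_config is a made-up helper name, and I am assuming condor_config_val is on the PATH and that its -dump option accepts a name pattern, as it does in recent 8.x releases.

```shell
#!/bin/sh
# Sketch: print the security-related settings the local tools will use.

show_security_config() {
    if ! command -v condor_config_val >/dev/null 2>&1; then
        echo "condor_config_val not found in PATH" >&2
        return 127
    fi
    # List the configuration files that are being read.
    condor_config_val -config || true
    # Show the value and the file that defined it; an undefined
    # variable makes the tool exit non-zero, which is ignored here.
    condor_config_val -verbose SEC_DEFAULT_AUTHENTICATION || true
    # Show every variable whose name matches SEC_ (assumed pattern support).
    condor_config_val -dump SEC_ || true
    return 0
}

show_security_config || echo "nothing to show on this host"
```

Running this on both the submit machine and the CONDOR_HOST makes it easy to spot a setting left over from the old configuration.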

Let me know please if I can help out with any of that.


Cheers,
-zach




On 5/23/19, 3:23 PM, "HTCondor-users on behalf of William Seligman" <htcondor-users-bounces@xxxxxxxxxxx on behalf of seligman@xxxxxxxxxxxxxxxxxx> wrote:

    Background: I'm the sysadmin of a small CentOS 6 computing farm. For years our 
    small condor pool was running Condor 7.7; higher versions offered no new 
    features we needed. Then the users required a new (unrelated) software 
    installation for which the old CentOS 5 condor 7.7 libraries were incompatible 
    and they requested I upgrade to HTCondor 8.8.
    
     From that point until now, I have not been able to get HTCondor 8.8 to fully 
    run on the farm. My debugging steps included erasing the condor_config* files 
    and replacing them with those from the RPMs and completely wiping the contents 
    of LOCAL_DIR.
    
    Where I'm at now: Although the condor services start up properly, I can't submit 
    any jobs. The error is:
    
    # condor_submit myfile.cmd
    Submitting job(s)
    ERROR: Failed to connect to local queue manager
    SECMAN:2007:Failed to end classad message.
    
    The results of web searches on this error have not helped. For the record:
    
    - I've followed the instructions at 
    <https://lists.cs.wisc.edu/archive/htcondor-users/2008-March/msg00178.shtml> 
    multiple times. Since I had started with a fresh LOCAL_DIR, the file 
    LOCAL_DIR/spool/job_queue.log had no invalid entries, but I gave it a try anyway.
    
    - At present, the users are not submitting any condor jobs, so schedd is not busy.
    
    - Schedd is running:
    
    # ps -elf | grep schedd
    4 S condor     60019   59973  0  80   0 - 13065 poll_s May22 ?        00:00:07 
    condor_schedd -f
    
    - The firewall is off. Neither iptables nor netfilter is running. (Our site has 
    a Cisco firewall that I've configured to block off port 9618 from the outside, so 
    I'm not concerned.)
    
    - nmap tells me that port 9618 on the CONDOR_HOST is open.
    
    - The only error in SchedLog is
    DC_AUTHENTICATE: Unable to reconcile!
    
    - I turned on debugging in condor_config.local:
       TOOL_DEBUG = D_ALL
       SUBMIT_DEBUG = D_ALL
    
    and ran the job with
    # condor_submit -debug myfile.cmd
    
    I can post the results on request. I'm no expert, but the relevant lines appear 
    to be:
    
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN: command 1112 
    QMGMT_WRITE_CMD to schedd at <129.236.252.84:9618> from TCP port 19038 (blocking).
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN:: default CLIENT 
    meths: FS,KERBEROS,GSI,CLAIMTOBE
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_write(fd=4 schedd at 
    <129.236.252.84:9618>,,size=416,timeout=0,flags=0,non_blocking=0)
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_read(fd=4 schedd at 
    <129.236.252.84:9618>,,size=5,timeout=0,flags=0,non_blocking=0)
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) Stream::get(int) failed to read 
    padding
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_ALWAYS) SECMAN: no classad from 
    server, failing
    
    
    - The only non-default lines in the condor_config file are:
    
    BIND_ALL_INTERFACES = TRUE
    SEC_DEFAULT_AUTHENTICATION = NEVER
    
    
    Is there anything else I can do?
    
    Thanks!