
Re: [HTCondor-users] Going from Condor 7.7 to HTCondor 8.8



Hi William,

Running "condor_submit -debug ..." shows you the client side of the conversation.  The other side is in the SchedLog file, which will probably explain why the SchedD appears to be closing the connection during submit.  It could be due to a permissions issue.  You may need to set new ALLOW_WRITE or other ALLOW_* parameters, since some of the defaults may have changed across such a large version jump.
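
As a rough sketch of how to review the security-related settings mentioned above: the helper below (the name filter_security_settings is made up for illustration) keeps only the ALLOW_*, DENY_* and SEC_* lines from "NAME = value" text on stdin. It assumes the output of "condor_config_val -dump" uses that one-setting-per-line layout.

```shell
# filter_security_settings: hypothetical helper, not part of HTCondor.
# Reads "NAME = value" lines on stdin and prints only the ALLOW_*, DENY_*
# and SEC_* settings, which are the ones most likely to affect submission.
filter_security_settings() {
    grep -E '^(ALLOW|DENY|SEC)_[A-Z0-9_]* *='
}

# Usage on a host with HTCondor installed (not run here):
#   condor_config_val -dump | filter_security_settings
```

Comparing that short list against the settings you used under 7.7 is usually quicker than reading the whole dump.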

To get more useful information in the SchedLog, you may need to set "SCHEDD_DEBUG=D_ALL" in your condor_config, perform a condor_reconfig, then repeat the test, and then see if anything in the log jumps out.  (PERMISSION DENIED, perhaps?)
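
The steps above could be sketched roughly as follows. The function name scan_schedlog and the list of search strings are my own guesses at what is worth looking for, not an official list, and the log location is assumed to be whatever SCHEDD_LOG resolves to on your host.

```shell
# scan_schedlog: hypothetical helper, not part of HTCondor.
# Prints the numbered lines of the given log file that mention common
# authentication or authorization problems. The pattern list is a guess.
scan_schedlog() {
    grep -nE 'PERMISSION DENIED|DC_AUTHENTICATE|SECMAN' "$1"
}

# Typical sequence on the submit host (not run here):
#   1. add "SCHEDD_DEBUG = D_ALL" to condor_config.local
#   2. condor_reconfig
#   3. condor_submit -debug myfile.cmd
#   4. scan_schedlog "$(condor_config_val SCHEDD_LOG)"
```

With D_ALL enabled the log grows quickly, so filtering on a few keywords first is a reasonable way to find the interesting part before reading the surrounding lines.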

Feel free to send it to me offline and I'd be happy to take a look because, as you said, it's an insane amount of information, especially if you don't know what you are looking for.  If you do that, you could also attach your config files or the output of "condor_config_val -dump".

Thanks!


Cheers,
-zach


On 5/23/19, 3:23 PM, "HTCondor-users on behalf of William Seligman" <htcondor-users-bounces@xxxxxxxxxxx on behalf of seligman@xxxxxxxxxxxxxxxxxx> wrote:

    Background: I'm the sysadmin of a small CentOS 6 computing farm. For years our 
    small condor pool was running Condor 7.7; higher versions offered no new 
    features we needed. Then the user required a new (unrelated) software 
    installation for which the old CentOS 5 condor 7.7 libraries were incompatible 
    and they requested I upgrade to HTCondor 8.8.
    
     From that point until now, I have not been able to get HTCondor 8.8 to fully 
    run on the farm. My debugging steps included erasing the condor_config* files 
    and replacing them with those from the RPMs and completely wiping the contents 
    of LOCAL_DIR.
    
    Where I'm at now: Although the condor services start up properly, I can't submit 
    any jobs. The error is:
    
    # condor_submit myfile.cmd
    Submitting job(s)
    ERROR: Failed to connect to local queue manager
    SECMAN:2007:Failed to end classad message.
    
    The results of web searches on this error have not helped. For the record:
    
    - I've followed the instructions at 
    <https://lists.cs.wisc.edu/archive/htcondor-users/2008-March/msg00178.shtml> 
    multiple times. Since I had started with a fresh LOCAL_DIR, the file 
    LOCAL_DIR/spool/job_queue.log had no invalid entries, but I gave it a try anyway.
    
    - At present, the users are not submitting any condor jobs, so schedd is not busy.
    
    - Schedd is running:
    
    # ps -elf | grep schedd
    4 S condor     60019   59973  0  80   0 - 13065 poll_s May22 ?        00:00:07 
    condor_schedd -f
    
    - The firewall is off. Neither iptables nor netfilter is running. (Our site has 
    a Cisco firewall that I've configured to block off port 9618 from the outside, so 
    I'm not concerned.)
    
    - nmap tells me that port 9618 on the CONDOR_HOST is open.
    
    - The only error in SchedLog is
    DC_AUTHENTICATE: Unable to reconcile!
    
    - I turned on debugging in condor_config.local:
       TOOL_DEBUG = D_ALL
       SUBMIT_DEBUG = D_ALL
    
    and ran the job with
    # condor_submit -debug myfile.cmd
    
    I can post the results on request. I'm no expert, but the relevant lines appear 
    to be:
    
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN: command 1112 
    QMGMT_WRITE_CMD to schedd at <129.236.252.84:9618> from TCP port 19038 (blocking).
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN:: default CLIENT 
    meths: FS,KERBEROS,GSI,CLAIMTOBE
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_write(fd=4 schedd at 
    <129.236.252.84:9618>,,size=416,timeout=0,flags=0,non_blocking=0)
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_read(fd=4 schedd at 
    <129.236.252.84:9618>,,size=5,timeout=0,flags=0,non_blocking=0)
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) Stream::get(int) failed to read 
    padding
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_ALWAYS) SECMAN: no classad from 
    server, failing
    
    
    - The only non-default lines in the condor_config file are:
    
    BIND_ALL_INTERFACES = TRUE
    SEC_DEFAULT_AUTHENTICATION = NEVER
    
    
    Is there anything else I can do?
    
    Thanks!