
Re: [Condor-users] high availability schedd - condor_submit failover



Rob de Graaf wrote:
> Hello,
> 
> I'm looking into setting up a highly available job queue. I've followed 
> the instructions in section 3.10.1 of the manual, and set up a spool dir 
> on nfs. I've set SCHEDD_NAME = ha-schedd@ on the machines sharing that 
> spool, and when I disconnect the active schedd, another will spawn and 
> take control of the queue.
> 
> According to the manual, I should be able to submit jobs to the highly 
> available queue using condor_submit -n ha-schedd@ but this only works on 
> the machine which is the active schedd, not on the possible replacement 
> schedds. When I try from one of the backup schedds, I get:
> 
> user@host:~/job$ condor_submit -n ha-schedd@ job.submit
> Submitting job(s)
> ERROR: Failed to connect to queue manager ha-schedd@
> AUTHENTICATE:1003:Failed to authenticate with any method
> AUTHENTICATE:1004:Failed to authenticate using GSI
> GSI:5003:Failed to authenticate.  Globus is reporting error (851968:45). 
>   There is probably a problem with your credentials.  (Did you run 
> grid-proxy-init?)
> AUTHENTICATE:1004:Failed to authenticate using KERBEROS
> AUTHENTICATE:1004:Failed to authenticate using FS
> 
> I'm guessing this is because I'm using host-based authentication, and 
> "remote" submission to the active schedd requires a stronger mechanism? 
> Is there a simple way around this, or do I have to set up kerberos / GSI 
> to be able to have condor_submit failover? I only need job submission to 
> work from those machines who may become the active schedd at some point.
> 
> Thanks!
> 
> Rob

It's probably because not all of your submit nodes are set up with the
same security settings. Your backup may be asking for stronger
authentication than your primary.
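
One way to check for a mismatch is to capture the configuration on each node
(for example with condor_config_val -dump) and compare only the
security-related settings. The sketch below is only an illustration: it
assumes the dump consists of "NAME = value" lines, the helper names
parse_dump and security_diff are made up for this example, and the prefix
list is a guess at which settings matter here.

```python
# Hypothetical helpers for comparing security settings between two nodes.
# Assumption: each dump is text made of "NAME = value" lines, with optional
# "#" comment lines, as saved from a config dump on each machine.

# Prefixes of settings treated as security-related (an assumption).
SECURITY_PREFIXES = ("SEC_", "ALLOW_", "DENY_", "HOSTALLOW_", "HOSTDENY_")


def parse_dump(text):
    """Turn 'NAME = value' lines into a dict keyed by upper-cased name."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip().upper()] = value.strip()
    return settings


def security_diff(dump_a, dump_b):
    """Return {name: (value_a, value_b)} for security settings that differ."""
    a, b = parse_dump(dump_a), parse_dump(dump_b)
    names = {n for n in a.keys() | b.keys() if n.startswith(SECURITY_PREFIXES)}
    return {n: (a.get(n), b.get(n))
            for n in sorted(names) if a.get(n) != b.get(n)}


if __name__ == "__main__":
    import sys
    # Usage: python secdiff.py primary.dump backup.dump
    with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
        for name, (va, vb) in security_diff(fa.read(), fb.read()).items():
            print(f"{name}: primary={va!r} backup={vb!r}")
```

Any setting it prints is one where the two nodes disagree. A value of None
means the setting is missing from that node's dump.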

Best,


matt