[Condor-users] high availability schedd - condor_submit failover

Hello,

I'm looking into setting up a highly available job queue. I've followed the instructions in section 3.10.1 of the manual and set up a spool directory on NFS. I've set SCHEDD_NAME = ha-schedd@ on the machines sharing that spool, and when I disconnect the active schedd, another one spawns and takes control of the queue.
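
Roughly, the configuration on each of those machines follows the manual's example (the shared path here is illustrative, not my actual mount point):

    # Job queue and spool shared over NFS by all candidate schedds
    SPOOL = /nfs/condor/spool

    # Let the condor_master manage schedd failover, coordinating via
    # a lock file kept on the shared filesystem
    MASTER_HA_LIST = SCHEDD
    HA_LOCK_URL = file:/nfs/condor/spool

    # Trailing @ keeps the schedd name from being qualified with the
    # local host name, so every machine advertises the same queue
    SCHEDD_NAME = ha-schedd@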

According to the manual, I should be able to submit jobs to the highly available queue using condor_submit -n ha-schedd@, but this only works on the machine that is currently the active schedd, not on the possible replacement schedds. When I try from one of the backups, I get:

user@host:~/job$ condor_submit -n ha-schedd@ job.submit
Submitting job(s)
ERROR: Failed to connect to queue manager ha-schedd@
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:45). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS

I'm guessing this is because I'm using host-based authentication, and "remote" submission to the active schedd requires a stronger mechanism? Is there a simple way around this, or do I have to set up Kerberos / GSI to get condor_submit failover? I only need job submission to work from the machines that may become the active schedd at some point.
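
One untested thought: since the candidate schedds already share the NFS mount, would FS_REMOTE authentication over a shared directory be enough here? Something like this (the directory path is just a guess on my part):

    # Untested: fall back to filesystem authentication over a
    # directory on the shared NFS mount, writable by submitting users
    SEC_DEFAULT_AUTHENTICATION_METHODS = FS, FS_REMOTE
    FS_REMOTE_DIR = /nfs/condor/fs_remote

Or is that still considered too weak for remote submits?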

Thanks!

Rob