
Re: [HTCondor-users] Auto failover between these two schedulers



Hi Gagan,

It is possible to set up two host machines to work together such that if the schedd on one host falls over, the other host will start up the schedd and reconnect with all the corresponding running jobs. This is done with the High Availability Schedd. Otherwise, there is currently no built-in mechanism for an AP to pick up the work of another one if that system has fallen over.
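
For reference, a minimal sketch of what the High Availability Schedd configuration might look like on both hosts. It assumes a shared file system mounted at /share/spool on each machine; that path and the name had-schedd@ are placeholders, not values from your pool, so please check the high availability section of the manual for your HTCondor version before relying on it.

```
# Sketch only: same settings on both submit hosts.
# /share/spool is an assumed shared directory visible to both machines.
MASTER_HA_LIST = SCHEDD
SPOOL = /share/spool
HA_LOCK_URL = file:/share/spool
VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock

# Assumed shared schedd name so the job queue looks the same from either host.
SCHEDD_NAME = had-schedd@
```

With a shared name like this, tools such as condor_q and condor_submit would typically need to be pointed at it with -name had-schedd@. Only one of the two hosts holds the lock and runs the schedd at any given time.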

-Cole Bollig

From: gagan tiwari <gagan.tiwari@xxxxxxxxxxxxxxxxxx>
Sent: Monday, August 14, 2023 11:28 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Auto failover between these two schedulers
 
Hi Todd / Cole,

Thanks for pointing that out.

condor_q -global -all did the trick. I am able to get job details from the remote schedd now.

Now, please let me know how to set up failover between these two schedulers. If one of the submit nodes goes down, all jobs submitted through it should fail over to the other submit node.

Thanks,
Gagan

On Mon, Aug 14, 2023 at 7:07 PM Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Hi Gagan,

This may be because condor_q normally only shows jobs for the particular user. Try condor_q -global -all

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of gagan tiwari <gagan.tiwari@xxxxxxxxxxxxxxxxxx>
Sent: Friday, August 11, 2023 11:49 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Get job details with multiple submit nodes
 
Hi Todd,

    Here is the output.

condor_q -global

-- Schedd: ms-r1 : <192.168.30.72:9618?... @ 08/11/23 22:18:36
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for CONDOR_ANONYMOUS_USER: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 11 jobs; 0 completed, 0 removed, 0 idle, 11 running, 0 held, 0 suspended


So, it shows that 11 jobs are running, but it doesn't show any details under the columns:

OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

If I need to set up authentication to get the above details, please point me to any documentation on how to set it up.

Thanks,
Gagan


On Fri, Aug 11, 2023 at 10:06 PM Todd L Miller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
> We tried the *condor_q -global*  option but that also doesn't show the
> details of all jobs submitted through  all submit nodes.

        It should.  What does it do instead?  (You may need to set up
authentication that works over the network.)

- ToddM
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/