
Re: [HTCondor-users] How to minimize the reschedule interval for jobs on failed machines?



Maybe ALIVE_INTERVAL and STARTD_SENDS_ALIVES are related. MAX_CLAIM_ALIVES_MISSED is listed as a startd configuration variable in the documentation, but it may be related as well.
I will try them.
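
A rough way to reason about these knobs: if a claim is given up after MAX_CLAIM_ALIVES_MISSED consecutive keepalives are missed, and keepalives are sent every ALIVE_INTERVAL seconds, then the worst-case detection time is roughly their product. The sketch below only does that arithmetic. The default values (300 seconds and 6) are my reading of the manual and should be checked against your version, the tuned values (60 and 3) are purely illustrative, and the helper name is hypothetical.

```python
def claim_timeout_seconds(alive_interval: int, max_alives_missed: int) -> int:
    """Approximate upper bound, in seconds, before a dead claim is noticed.

    Assumes the claim is released after `max_alives_missed` consecutive
    keepalives, sent every `alive_interval` seconds, have been missed.
    """
    if alive_interval <= 0 or max_alives_missed <= 0:
        raise ValueError("both values must be positive")
    return alive_interval * max_alives_missed


# Assumed defaults; verify them for your HTCondor version.
DEFAULT_ALIVE_INTERVAL = 300
DEFAULT_MAX_CLAIM_ALIVES_MISSED = 6

default_timeout = claim_timeout_seconds(
    DEFAULT_ALIVE_INTERVAL, DEFAULT_MAX_CLAIM_ALIVES_MISSED
)

# Illustrative tuned values, not a recommendation.
tuned_timeout = claim_timeout_seconds(60, 3)

print(f"default: about {default_timeout / 60:.0f} minutes")
print(f"tuned:   about {tuned_timeout / 60:.0f} minutes")
```

If those defaults are right, the bound is about 30 minutes, which is at least in the same range as the 22 minutes reported below. Shorter intervals mean more keepalive traffic and a higher chance of dropping a claim during a brief network hiccup, so it is probably worth lowering them gradually.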

On 2013-9-10 at 7:50 PM, "Andrey Kuznetsov" <akuznet1@xxxxxxxx> wrote:
Try reading 3.3.11 http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html

ALIVE_INTERVAL perhaps?

condor_q -r (or -run) shows only jobs in the running state; without it, condor_q shows ALL jobs submitted on that machine. Use -g (or -global) to see the queue for the whole cluster.


On Tue, Sep 10, 2013 at 3:30 AM, 钱晓明 <kyleqian@xxxxxxxxx> wrote:

I find that Condor will re-run jobs in other slots when the machine they were running on fails. But I think the interval is too long: about 22 minutes on my 5-node cluster.
So how can I minimize this interval? Condor should know that the machine is down, because new jobs are not sent to it.
By the way, condor_q always shows those jobs as being in the running state. Is that right?


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Andrey Kuznetsov <akuznet1@xxxxxxxx>
