
Re: [HTCondor-users] How to minimize the reschedule interval for jobs on failed machines?



Hi, I find that ALIVE_INTERVAL has no effect on this.
After setting ALIVE_INTERVAL to 60 and restarting all machines, it still took 20 minutes to reschedule the jobs from the failed machines.
I checked the log files and found the following:
SchedLog:
09/11/13 18:54:59 (pid:4143) Completed REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxx <10.255.255.252:48979> for nobody
09/11/13 18:54:59 (pid:4143) Starting add_shadow_birthdate(86.0)
09/11/13 18:54:59 (pid:4143) Started shadow for job 86.0 on slot1@xxxxxxxxxxxxxxx <10.255.255.252:48979> for nobody, (shadow pid = 4344)
......
09/11/13 19:15:28 (pid:4143) Match record (slot1@xxxxxxxxxxxxxxx <10.255.255.252:48979> for nobody, 86.0) deleted
.....
09/11/13 19:15:48 (pid:4143) Activity on stashed negotiator socket: <192.168.1.100:44745>
09/11/13 19:15:48 (pid:4143) Using negotiation protocol: NEGOTIATE
09/11/13 19:15:48 (pid:4143) Negotiating for owner: nobody@local
09/11/13 19:15:48 (pid:4143) Checking consistency running and runnable jobs
09/11/13 19:15:48 (pid:4143) Tables are consistent
09/11/13 19:15:48 (pid:4143) Rebuilt prioritized runnable job list in 0.000s.
09/11/13 19:15:48 (pid:4143) Finished negotiating for nobody in local pool: 2 matched, 0 rejected
09/11/13 19:15:48 (pid:4143) Completed REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxx <10.255.255.251:49636> for nobody
09/11/13 19:15:48 (pid:4143) Starting add_shadow_birthdate(86.0)
09/11/13 19:15:48 (pid:4143) Started shadow for job 86.0 on slot1@xxxxxxxxxxxxxxx <10.255.255.251:49636> for nobody, (shadow pid = 5128)
 
Between these three groups of messages, there are repeated messages like this:
09/11/13 19:12:21 (pid:4143) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
09/11/13 19:12:21 (pid:4143) Sent ad to central manager for nobody@local
09/11/13 19:12:21 (pid:4143) Sent ad to 1 collectors for nobody@local
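The delay visible in the SchedLog excerpt can be measured directly from its timestamps. The following is a minimal sketch, assuming the log prefix is in MM/DD/YY HH:MM:SS form; the two input strings are abbreviated copies of the "Started shadow" and "Match record ... deleted" lines above, and the helper name parse_schedlog_time is only illustrative.

```python
from datetime import datetime


def parse_schedlog_time(line):
    """Parse the leading 'MM/DD/YY HH:MM:SS' timestamp of a SchedLog line.

    The date format is an assumption based on the excerpt in this message.
    """
    stamp = " ".join(line.split()[:2])
    return datetime.strptime(stamp, "%m/%d/%y %H:%M:%S")


# Abbreviated copies of two lines from the SchedLog excerpt.
started = "09/11/13 18:54:59 (pid:4143) Started shadow for job 86.0"
deleted = "09/11/13 19:15:28 (pid:4143) Match record deleted"

gap = parse_schedlog_time(deleted) - parse_schedlog_time(started)
gap_seconds = int(gap.total_seconds())
print(gap_seconds, "seconds")  # prints: 1229 seconds
```

The result, 1229 seconds, is the same figure that job_queue.log records as RemoteWallClockTime for job 86.0.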
 
In job_queue.log (in the spool directory), I found these:
103 86.0 LastJobLeaseRenewal 1378896909
106
105
103 86.0 RemoteWallClockTime 1229.000000
104 86.0 WallClockCheckpoint
103 86.0 CumulativeSlotTime 1229.000000
103 86.0 LastRemoteHost "slot1@xxxxxxxxxxxxxxx"
104 86.0 LastRemotePool
103 86.0 LastPublicClaimId "<10.255.255.252:48979>#1378896622#1#..."
104 86.0 ClaimId
104 86.0 PublicClaimId
104 86.0 ClaimIds
104 86.0 PublicClaimIds
104 86.0 StartdIpAddr
104 86.0 RemoteHost
104 86.0 RemotePool
104 86.0 RemoteSlotID
104 86.0 RemoteVirtualMachineID
104 86.0 DelegatedProxyExpiration
104 86.0 ShadowBday
106
103 86.0 CurrentHosts 0
105
103 86.0 CurrentHosts 0
104 86.0 ShadowBday
103 86.0 LastJobStatus 2
103 86.0 JobStatus 1
103 86.0 EnteredCurrentStatus 1378898128
103 86.0 LastSuspensionTime 0
103 86.0 MaxHosts 1
104 86.0 RemotePool
There is no timestamp on each line, but EnteredCurrentStatus (1378898128) - LastJobLeaseRenewal (1378896909) is 1219 seconds, just over 20 minutes.
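The subtraction can be checked with a few lines of Python. This is only a sketch of the arithmetic using the two epoch values quoted in this message; the variable names are illustrative. It also converts both values to UTC so they can be compared with the SchedLog times.

```python
from datetime import datetime, timezone

# Epoch values quoted in this message for job 86.0.
last_job_lease_renewal = 1378896909
entered_current_status = 1378898128

gap_seconds = entered_current_status - last_job_lease_renewal
minutes, seconds = divmod(gap_seconds, 60)
print(f"{gap_seconds} s = {minutes} min {seconds} s")  # prints: 1219 s = 20 min 19 s

# Convert to UTC wall-clock time for comparison with the SchedLog entries.
renewal_utc = datetime.fromtimestamp(last_job_lease_renewal, tz=timezone.utc)
status_utc = datetime.fromtimestamp(entered_current_status, tz=timezone.utc)
print(renewal_utc.strftime("%H:%M:%S"), status_utc.strftime("%H:%M:%S"))
# prints: 10:55:09 11:15:28
```

If the SchedLog timestamps are in a UTC+8 local time zone (an assumption, not stated in the logs), 11:15:28 UTC corresponds to the 19:15:28 "Match record ... deleted" entry, and the last lease renewal falls about ten seconds after the shadow was started at 18:54:59.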
 
Can someone help me? Which configuration variable affects this?


2013/9/10 Andrey Kuznetsov <akuznet1@xxxxxxxx>
Try reading 3.3.11 http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html

ALIVE_INTERVAL perhaps?

condor_q -r (or -run) shows only jobs in the running state; otherwise it shows ALL jobs submitted on that machine. Use -g (or -global) to see the queue for the whole cluster.


On Tue, Sep 10, 2013 at 3:30 AM, 钱晓明 <kyleqian@xxxxxxxxx> wrote:

I find that Condor will run jobs in other slots when the machine they are on fails. But I think the interval is too long: about 22 minutes in my 5-node cluster.
So how can I minimize this interval? Condor should know that the machine is down, because new jobs are not sent to it.
By the way, condor_q always shows that the jobs are in the running state; is that right?


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Andrey Kuznetsov <akuznet1@xxxxxxxx>
