
Re: [HTCondor-users] questions about condor_restart



On 7/22/19 10:53 AM, Shawn A Kwang wrote:
I have a couple of "best-practices questions" for condor cluster
administration.

Is it safe to run 'condor_restart' (-graceful) on the components of a
running condor pool? Of course you may ask: what do I mean by 'safe'?

Let me ask this question another way. What happens if I run
condor_restart on a 1) Central manager, 2) Submit node (running schedd),
or 3) Compute node? All while users are actively running jobs.


Shawn:


This is a great question. Assuming everything comes back after a restart, here is what happens when you restart:


o) The central manager. All running jobs stay running. No new matches can be made. Schedds can start new jobs only by reusing existing matches for the same user. condor_status doesn't work while the collector is down.

o) Submit node. All running jobs stay running for up to the lease duration. If the schedd comes back before the job lease expires, it reconnects to the running jobs and the jobs stay running. If the schedd is down for too long, the jobs get preempted and go back to idle. The default job lease duration is 20 minutes.

o) Execute machines. All running jobs on that execute machine are preempted and killed. The schedd will notice the jobs have been preempted, mark them as Idle, and try to start them again from scratch.
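
The three cases above can be summarized in a small sketch. This is only a simplified model of the behaviour described in this message, not HTCondor code: the function name restart_effect, the role labels, and the return strings are made up for illustration, and the 20-minute default is the figure quoted above (check the value your pool actually uses).

```python
# Simplified model of the restart behaviour described above.
# Hypothetical helper for illustration only; it does not talk to HTCondor.

# Default job lease as quoted in this message (20 minutes); an assumption
# for your pool, so verify the configured value before relying on it.
DEFAULT_JOB_LEASE_SECONDS = 20 * 60


def restart_effect(role, downtime_seconds=0,
                   lease_seconds=DEFAULT_JOB_LEASE_SECONDS):
    """Return what happens to already-running jobs when `role` restarts.

    role: "central_manager", "submit", or "execute" (labels are made up).
    downtime_seconds: how long the schedd is unreachable (submit case only).
    """
    if role == "central_manager":
        # Running jobs are untouched; only new matchmaking stops.
        return "keep running"
    if role == "submit":
        # Jobs survive only if the schedd returns before the lease expires.
        if downtime_seconds < lease_seconds:
            return "keep running"
        return "back to idle"
    if role == "execute":
        # Jobs on the restarted execute machine are preempted and killed.
        return "back to idle"
    raise ValueError(f"unknown role: {role!r}")


if __name__ == "__main__":
    cases = [
        ("central_manager", 0),
        ("submit", 5 * 60),    # schedd back after 5 minutes
        ("submit", 30 * 60),   # schedd down longer than the lease
        ("execute", 0),
    ]
    for role, downtime in cases:
        print(role, downtime, "->", restart_effect(role, downtime))
```

The practical takeaway from the model is that the submit-node case is the only one where timing matters: the shorter the schedd outage relative to the job lease, the safer the restart.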


-greg



Thanks in advance.

Sincerely,
Shawn

