
Re: [Condor-users] Reboot / pause condor master server (VM)



Rob Stevenson wrote:
> Hi Guys,
> Just a quickie. I'm not sure what happens when the condor master server
> is rebooted. I would guess that all the slaves will continue running
> their jobs with no problem? What would happen if a run finishes while
> the master server is still unavailable?
>  
> Perhaps I should have experimented before using the system in a live
> environment? :S
>  
> Many thanks for any pointers,
> Rob Stevenson - Systems Administrator
> IS Services

I'm going to assume that by "condor master server" you mean the machine running the condor_schedd daemon. In Condor terminology there is a Central Manager, running the collector and negotiator; some number of Submit Nodes, running the schedd; and many (hopefully!) Execute Nodes, running the startd.

When you take down a Submit node, all the jobs it was managing keep running on their respective Execute nodes. The schedd holds a lease on each startd where it is running a job. If a job completes while the schedd is down, the startd will wait until either the schedd returns to collect the output or the lease expires. Searching the manual for the terms job, lease, and claim should give you a better idea of how a lease might expire while the schedd is down.
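
To make the lease behaviour concrete, here is a rough Python sketch of the decision for a job that finishes while the schedd is down. It is only a toy model, not Condor code: the Claim class, the outcome_for_finished_job function, and the 1200-second figure are all made up for illustration, and the rule that the lease runs from the last renewal is my simplifying assumption. I believe the lease length is set per job with job_lease_duration in the submit file, but check the manual for your version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    """Toy stand-in for a schedd's claim on a startd (hypothetical)."""
    lease_duration: float  # seconds a lease lasts after a renewal
    last_renewal: float    # time (seconds) of the schedd's last renewal

    def lease_expiry(self) -> float:
        # Assumption: the lease runs from the most recent renewal.
        return self.last_renewal + self.lease_duration


def outcome_for_finished_job(claim: Claim, schedd_back_at: Optional[float]) -> str:
    """What happens to a job that finished while the schedd was down.

    schedd_back_at is the time the schedd returns, or None if it never does.
    """
    if schedd_back_at is not None and schedd_back_at <= claim.lease_expiry():
        # The schedd came back in time, so it can collect the output.
        return "output collected"
    # The startd stopped waiting once the lease ran out.
    return "lease expired"


# Illustrative numbers only: lease renewed at t=0 and lasting 1200 seconds.
claim = Claim(lease_duration=1200, last_renewal=0)
print(outcome_for_finished_job(claim, schedd_back_at=600))
print(outcome_for_finished_job(claim, schedd_back_at=1800))
```

The practical upshot is that a reboot shorter than the remaining lease time should be harmless, while a longer outage risks losing the work of any job whose lease runs out first.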

Aside - I've often wondered if the schedd shouldn't renew all its leases before gracefully shutting down.

Best,


matt