
Re: [Condor-users] CM Failover with submits from CM

Dan Bradley wrote:
> Janzen Brewer wrote:
>> Dan Bradley wrote:
>>> Condor supports fail-over of the submit node.
>> I understand that the submit node can be failed over, but I'm curious as 
>> to what happens to the output of a completed job if the submit node from 
>> which it was submitted failed during its execution. Does the execute 
>> node keep the output until the secondary submit node undergoes failback? 
>> Or does it attempt to write it to the same directory on the secondary 
>> submit node?
> I don't know much about schedd failover.
> I think the directories where output is to be stored would all need to 
> be on a shared disk accessible to both submit nodes.  Jobs that are 
> running when the primary submit node fails will wait for up to the job 
> lease duration (default 20 minutes) for the secondary submit node to 
> take over.  When the job finishes, whether it finishes during that time 
> or after that time, the output would get copied back to the functioning 
> submit node onto the shared disk.
> Of course, if you do all this only to make the shared filesystem into a 
> single point of failure, you've probably only made things slightly worse.
> --Dan

That's accurate.

The SPOOL directory and all schedd state are shared between the HA Schedds.
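
For reference, here is a rough sketch of what the shared-state setup might look like in the configuration of both submit nodes. The knob names are the ones I recall from the manual's section on high availability of the job queue. The /share/spool path and the had-schedd@ name are placeholders I made up for illustration, so check everything against the manual for your Condor version before relying on it.

```
# Sketch only: assumes /share/spool is a directory on a shared
# filesystem mounted on both submit nodes (hypothetical path).

# Let condor_master run the schedd under HA control, so that only
# one of the two nodes runs it at any time.
MASTER_HA_LIST = SCHEDD

# Keep the job queue and spooled files on the shared disk.
SPOOL = /share/spool

# Lock file location used to decide which node runs the schedd.
HA_LOCK_URL = file:/share/spool
VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock

# Same schedd name on both nodes, so the queue looks like a single
# schedd regardless of which machine is currently serving it.
SCHEDD_NAME = had-schedd@
```

The job lease Dan mentions is a per-job setting. As far as I know it can be adjusted with job_lease_duration (in seconds) in the submit description file if the secondary needs longer than the default to take over.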