Re: [Condor-users] CM Failover with submits from CM



Dan Bradley wrote:
> 
> Janzen Brewer wrote:
>> Dan Bradley wrote:
>>   
>>> Condor supports fail-over of the submit node.
>>>   
>>>     
>> I understand that the submit node can be failed over, but I'm curious as 
>> to what happens to the output of a completed job if the submit node from 
>> which it was submitted failed during its execution. Does the execute 
>> node keep the output until the secondary submit node undergoes failback? 
>> Or does it attempt to write it to the same directory on the secondary 
>> submit node?
>>   
> 
> I don't know much about schedd failover.
> 
> I think the directories where output is to be stored would all need to 
> be on a shared disk accessible to both submit nodes.  Jobs that are 
> running when the primary submit node fails will wait for up to the job 
> lease duration (default 20 minutes) for the secondary submit node to 
> take over.  When the job finishes, whether it finishes during that time 
> or after that time, the output would get copied back to the functioning 
> submit node onto the shared disk.
> 
> Of course, if you do all this only to make the shared filesystem into a 
> single point of failure, you've probably only made things slightly worse.
> 
> --Dan

That's accurate.

The SPOOL directory and all schedd state are shared between the HA schedds.
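
As a rough sketch of what that sharing looks like in configuration, assuming 
both submit machines mount the same filesystem at /share/spool (a 
hypothetical path), the setup is along these lines. The knob names are the 
ones I recall from the manual's section on high availability of the job 
queue; the values are illustrative only, so check them against your version:

```
# Have condor_master run the schedd only on the machine that holds the lock
MASTER_HA_LIST = SCHEDD
# Job queue and spooled files live on the shared filesystem (hypothetical path)
SPOOL = /share/spool
# Lock used to decide which machine currently runs the schedd
HA_LOCK_URL = file:/share/spool
# Keep condor_preen from cleaning up the lock file
VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock
# Same schedd name on both machines so the queue keeps one identity
SCHEDD_NAME = had-schedd@
```

The lease Dan mentions can also be set per job with job_lease_duration (in 
seconds) in the submit file, if the default window is too short for your 
failover time.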

Best,


matt