
Re: [Condor-users] CM Failover with submits from CM

Janzen Brewer wrote:
Dan Bradley wrote:
Condor supports fail-over of the submit node.

I understand that the submit node can be failed over, but I'm curious as to what happens to the output of a completed job if the submit node from which it was submitted failed during its execution. Does the execute node keep the output until the secondary submit node undergoes failback? Or does it attempt to write it to the same directory on the secondary submit node?

I don't know much about schedd failover.

I think the directories where output is to be stored would all need to be on a shared disk accessible to both submit nodes. Jobs that are running when the primary submit node fails will wait for up to the job lease duration (default 20 minutes) for the secondary submit node to take over. When the job finishes, whether it finishes during that window or after it, the output would get copied back to the functioning submit node onto the shared disk.
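
To make the lease timing concrete, here is a minimal Python sketch of the window described above. It is only a toy model, not a Condor API: the function name and constant are made up for illustration, and the 1200-second default is just the 20-minute figure quoted above. It only answers whether the secondary submit node took over before the lease ran out. What Condor does after the lease expires is not modelled here.

```python
# Toy model of the job lease window; these names are illustrative, not Condor APIs.
DEFAULT_LEASE_SECONDS = 20 * 60  # the 20-minute default mentioned above


def takeover_within_lease(outage_seconds, lease_seconds=DEFAULT_LEASE_SECONDS):
    """Return True if the secondary submit node took over before the lease ran out.

    outage_seconds: time between the primary failing and the secondary taking over.
    lease_seconds:  how long a running job waits for a submit node to come back.
    """
    if outage_seconds < 0 or lease_seconds < 0:
        raise ValueError("durations must be non-negative")
    return outage_seconds <= lease_seconds


if __name__ == "__main__":
    # Compare a short, a borderline, and a long outage against the default lease.
    for outage in (5 * 60, 20 * 60, 45 * 60):
        print(outage, takeover_within_lease(outage))
```

If the failover usually takes longer than the lease, the practical options are a longer lease or a faster failover. Either way the output directories still have to be on the shared disk.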

Of course, if you do all this only to make the shared filesystem into a single point of failure, you've probably only made things slightly worse.

--Dan