Re: [HTCondor-users] [PATCH] Speeding up condor_dagman submission
- Date: Tue, 11 Aug 2015 11:46:04 +0100
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: Re: [HTCondor-users] [PATCH] Speeding up condor_dagman submission
On 10/08/2015 02:55, R. Kent Wenger wrote:
> Ah, the fundamental thing is this: we want to avoid having two
> instances of DAGMan simultaneously running on the same DAG. That
> would goof things up, because the two DAGMans would be using the same
> log for their node jobs, and the events would get mixed together.
>
> To avoid this, DAGMan creates a lock file at startup, containing its
> UniquePID information. When DAGMan starts up, it first looks for an
> existing lock file. If the file exists, DAGMan tries to read the
> UniquePID info from it. If it succeeds, and the corresponding process
> is still alive, DAGMan says, "Oops, there's another DAGMan already
> running on this DAG", and exits. If DAGMan can't read the UniquePID
> info, or that process no longer exists, DAGMan assumes an earlier
> instance of DAGMan was running on that DAG but has since died, so the
> just-started DAGMan continues in recovery mode.
>
> Hopefully that all makes sense...
It does indeed. Thank you!
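For anyone following the thread, the startup decision Kent describes could be sketched roughly like this. This is a minimal illustration only, not DAGMan's actual code: the lock file name, helper names, and return values are all made up, and a real implementation would also have to worry about races when creating the lock file.

```python
import os

LOCK_FILE = "mydag.dag.lock"  # hypothetical name for illustration

def process_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0: existence check, sends nothing
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user

def startup():
    """Mimic the described logic: run normally, recover, or exit."""
    if os.path.exists(LOCK_FILE):
        try:
            with open(LOCK_FILE) as f:
                pid = int(f.read().strip())
        except (ValueError, OSError):
            pid = None  # can't read the PID info -> treat lock as stale
        if pid is not None and process_alive(pid):
            # Another instance is already running on this DAG.
            return "exit: another instance is running"
        mode = "recovery"  # stale lock: an earlier instance died
    else:
        mode = "normal"
    # Claim the lock by recording our own PID.
    with open(LOCK_FILE, "w") as f:
        f.write(str(os.getpid()))
    return mode
```

Calling `startup()` with no lock file present returns `"normal"` and writes the lock; a second call in the same process sees its own live PID and refuses to proceed; a lock file whose PID can't be read (or whose process is gone) leads to recovery mode.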