
Re: [HTCondor-users] [PATCH] Speeding up condor_dagman submission



On 10/08/2015 02:55, R. Kent Wenger wrote:

Ah, the fundamental thing is this: we want to avoid having two instances of DAGMan simultaneously running on the same DAG. This will goof things up because the two DAGMans will be using the same log for their node jobs, and the events will get mixed together.

So, to avoid this, each DAGMan writes a lock file (which contains its UniquePID information) when it starts up. Before writing it, DAGMan checks whether a lock file already exists. If it does, DAGMan tries to read the UniquePID info from the lock file. If that succeeds, and the corresponding process is still alive, DAGMan says, "Oops, there's another DAGMan already running on this DAG", and exits. If DAGMan can't read the UniquePID info, or that process no longer exists, DAGMan assumes that an earlier instance of DAGMan was running on that DAG but has since gone away, so the just-started DAGMan continues in recovery mode.
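
For illustration, here is a minimal sketch of that lock-file check in C++, assuming a POSIX system. The file name and the use of a bare numeric PID are assumptions made for the example; the real DAGMan lock file stores richer UniquePID information and the real code has more careful error handling. This is not DAGMan's actual implementation, just the shape of the logic described above.

    // Sketch of the "is another DAGMan already running?" check.
    // Assumptions: lock file holds only a numeric PID; POSIX kill() is available.
    #include <fstream>
    #include <iostream>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main() {
        const char *lockPath = "example.dag.lock";  // hypothetical lock file name

        pid_t lockedPid = 0;
        std::ifstream lock(lockPath);
        if (lock >> lockedPid) {
            // kill(pid, 0) delivers no signal; it only reports whether the
            // process exists (returns 0 if it does).
            if (kill(lockedPid, 0) == 0) {
                std::cerr << "Another DAGMan (PID " << lockedPid
                          << ") is already running on this DAG -- exiting.\n";
                return 1;
            }
            std::cerr << "Stale lock file (PID " << lockedPid
                      << " no longer exists) -- continuing in recovery mode.\n";
        }
        lock.close();

        // Claim the lock by writing our own PID.
        std::ofstream out(lockPath, std::ios::trunc);
        out << getpid() << "\n";

        // ... run the DAG (in recovery mode if a stale lock was found) ...
        return 0;
    }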

Hopefully that all makes sense...

It does indeed. Thank you!