
Re: [HTCondor-users] [PATCH] Speeding up condor_dagman submission



On Sat, 8 Aug 2015, Brian Candler wrote:

Incidentally, by grepping for sleep I just found src/condor_procapi/WISDOM which says: "See UniqueProcessId.pdf in this folder for a more indepth discussion of how the new ProcAPI ProcessId code works"

And I read that document. But I still don't understand why this PID birthday issue applies to condor_dagman running in the scheduler universe, but not to a regular job running on a worker.

Ah, the fundamental thing is this: we want to avoid having two instances of DAGMan simultaneously running on the same DAG. This will goof things up because the two DAGMans will be using the same log for their node jobs, and the events will get mixed together.

So, to avoid this, DAGMan creates a lock file at startup (which contains the UniquePID information). When DAGMan starts up, it first looks for an existing lock file. If the file exists, DAGMan tries to read the UniquePID info from it. If it succeeds in doing that, and the corresponding process is still alive, DAGMan says, "Oops, there's another DAGMan already running on this DAG", and exits. If DAGMan can't read the UniquePID info, or that process no longer exists, DAGMan assumes that an earlier instance of DAGMan was running on that DAG but has since gone away, so the just-started DAGMan continues in recovery mode.
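To make the startup logic concrete, here's a minimal sketch of that lock-file check. This is not the actual condor_dagman code: the file name and the helpers read_lock_pid() and process_alive() are made up for illustration, and the real implementation uses the ProcAPI UniquePID machinery (which guards against PID reuse) rather than a bare kill(pid, 0) probe.

#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include <errno.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: try to read a PID from the lock file.
// Returns true and fills 'pid' on success.
static bool read_lock_pid(const char *lockfile, pid_t *pid) {
    FILE *fp = fopen(lockfile, "r");
    if (!fp) return false;
    long value = 0;
    bool ok = (fscanf(fp, "%ld", &value) == 1);
    fclose(fp);
    if (ok) *pid = (pid_t)value;
    return ok;
}

// Hypothetical helper: is the process with this PID still around?
// (The real code checks the richer UniquePID info, not just the PID.)
static bool process_alive(pid_t pid) {
    return (kill(pid, 0) == 0) || (errno == EPERM);
}

int main() {
    const char *lockfile = "my.dag.lock";   // illustrative name
    pid_t old_pid;

    if (read_lock_pid(lockfile, &old_pid) && process_alive(old_pid)) {
        // Another DAGMan is already running on this DAG -- bail out.
        fprintf(stderr, "Error: lock file %s owned by live PID %ld\n",
                lockfile, (long)old_pid);
        return EXIT_FAILURE;
    }

    // No live owner: either there was no lock file, we couldn't parse it,
    // or the previous DAGMan is gone.  Take over the lock and continue
    // (in recovery mode if a stale lock file existed).
    FILE *fp = fopen(lockfile, "w");
    if (fp) {
        fprintf(fp, "%ld\n", (long)getpid());
        fclose(fp);
    }
    // ... proceed with normal DAG execution or recovery ...
    return EXIT_SUCCESS;
}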

Hopefully that all makes sense...

Kent