Re: [HTCondor-users] [PATCH] Speeding up condor_dagman submission
- Date: Sun, 09 Aug 2015 20:55:11 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] [PATCH] Speeding up condor_dagman submission
On Sat, 8 Aug 2015, Brian Candler wrote:
> Incidentally, by grepping for sleep I just found src/condor_procapi/WISDOM
> "See UniqueProcessId.pdf in this folder for a more indepth discussion of how
> the new ProcAPI ProcessId code works"
> And I read that document. But I still don't understand why this PID birthday
> issue applies to condor_dagman running in the scheduler universe, but not to
> a regular job running on a worker.
Ah, the fundamental thing is this: we want to avoid having two instances
of DAGMan simultaneously running on the same DAG. This will goof things
up because the two DAGMans will be using the same log for their node jobs,
and the events will get mixed together.
So, to avoid this, DAGMan creates a lock file at startup (which contains
the UniquePID information). When DAGMan starts up, it looks for the lock
file. If the file exists, DAGMan tries to read the UniquePID info from
the lock file. If it succeeds in doing that, and the corresponding
process is still alive, DAGMan says, "Oops, there's another DAGMan already
running on this DAG", and exits. If DAGMan can't read the UniquePID
info, or that process does not exist, DAGMan assumes that there was an
earlier instance of DAGMan running on that DAG, but that instance no
longer exists.  So the just-started DAGMan then continues in recovery
mode.
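
To make the startup check concrete, here is a rough sketch in Python. It is not the actual DAGMan code, only an illustration of the decision described above. The JSON lock-file layout, the names read_lock and startup_action, and the birthday_of callback (which should return a live process's start time, or None if the PID does not exist) are all made up for the example. The "normal" result when no lock file exists is also my assumption about the ordinary startup path.

```python
import json
import os
from typing import Callable, Optional

# Hypothetical lock-file layout, for illustration only:
#   {"pid": <int>, "birthday": <int>}
# The real DAGMan lock file format is not described in this thread.


def read_lock(path: str) -> Optional[dict]:
    """Return the recorded process identity, or None if it cannot be read."""
    try:
        with open(path) as f:
            info = json.load(f)
        return {"pid": int(info["pid"]), "birthday": int(info["birthday"])}
    except (OSError, ValueError, KeyError, TypeError):
        return None


def startup_action(
    path: str,
    my_pid: int,
    my_birthday: int,
    birthday_of: Callable[[int], Optional[int]],
) -> str:
    """Decide what a newly started instance should do.

    Returns "exit" if the lock names a process that is still alive,
    "recovery" if a lock exists but is unreadable or stale, and
    "normal" if there was no lock file at all.
    """
    if not os.path.exists(path):
        action = "normal"
    else:
        info = read_lock(path)
        # Comparing the birthday as well as the PID guards against the PID
        # having been reused by an unrelated process.
        if info is not None and birthday_of(info["pid"]) == info["birthday"]:
            return "exit"  # another instance is already running on this DAG
        action = "recovery"
    # Record our own identity so that later instances can find us.
    with open(path, "w") as f:
        json.dump({"pid": my_pid, "birthday": my_birthday}, f)
    return action
```

Note that a live process with the recorded PID but a different birthday counts as "that process does not exist", which is why the lock file stores more than a bare PID.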
Hopefully that all makes sense...