Re: [HTCondor-users] [PATCH] Speeding up condor_dagman submission
- Date: Sun, 09 Aug 2015 17:03:17 +0200
- From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] [PATCH] Speeding up condor_dagman submission
> On Aug 8, 2015, at 10:19 AM, Brian Candler <b.candler@xxxxxxxxx> wrote:
> On 07/08/2015 21:35, R. Kent Wenger wrote:
>> This will be improved in 8.3.8 -- we've changed the sleep from 12 seconds to 3. And we have some ideas for getting rid of the sleep entirely in the 8.5 series, but they're too big a change to squeeze into 8.3.8.
> That's good news.
> Incidentally, by grepping for sleep I just found src/condor_procapi/WISDOM which says:
> "See UniqueProcessId.pdf in this folder for a more indepth discussion of how the new ProcAPI ProcessId code works"
> And I read that document. But I still don't understand why this PID birthday issue applies to condor_dagman running in the scheduler universe, but not to a regular job running on a worker.
I've also been poking around the code and following the conversation in the ticket.
Basically, ProcessId is an attempt to create a unique identifier for a process on the host that can be persisted to disk.
When a DAG starts, DAGMan reads the ID from a lockfile; if the process with that ProcessId still exists, DAGMan exits (on the assumption that no two DAGMan instances should be running on the same DAG). If no lockfile exists, or the old process is dead, DAGMan creates the lockfile and writes its own ProcessId to it.
Since the ProcessId embeds an integer timestamp, the sleep is there to make sure the timestamp is unique (i.e., avoid the case where a PID is reused in <1s).
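To make the lockfile logic concrete, here is a rough sketch in Python. It is not HTCondor's actual code: the "pid birthday" file format, the function names, and the process_table dict (a stand-in for asking the OS which PIDs exist and when they started) are all assumptions for illustration.

```python
import os
import tempfile


def write_lock(path, pid, birthday):
    # Persist the identifier as "pid birthday"; birthday is an integer
    # start time in seconds (hypothetical format, for illustration only).
    with open(path, "w") as f:
        f.write(f"{pid} {birthday}\n")


def read_lock(path):
    # Return (pid, birthday) from the lockfile, or None if it is
    # missing or malformed.
    try:
        with open(path) as f:
            pid_text, birthday_text = f.read().split()
        return int(pid_text), int(birthday_text)
    except (FileNotFoundError, ValueError):
        return None


def lock_holder_alive(path, process_table):
    # process_table maps pid -> integer birthday for every live process.
    # It stands in for a real query of the host's process list.
    holder = read_lock(path)
    if holder is None:
        return False
    pid, birthday = holder
    return process_table.get(pid) == birthday


def try_acquire(path, my_pid, my_birthday, process_table):
    # Refuse to start if the recorded holder still exists; otherwise
    # (no lockfile, or the old process is dead) record ourselves.
    if lock_holder_alive(path, process_table):
        return False
    write_lock(path, my_pid, my_birthday)
    return True
```

With whole-second birthdays, a PID that is reused within the same second produces an identical (pid, birthday) pair, so a dead holder would look alive. That is the case the sleep guards against. Note also that the check and the write are two separate steps, so nothing here stops two processes from racing between them.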
Now, none of this is really bulletproof. It depends on (a) all processes being visible to all other processes and (b) no two processes existing on the system with the same PID. Neither assumption is true on hosts that support PID namespaces (which oh-by-the-way, HTCondor can use).
Turns out that DAGMan is the only consumer of this particular technique. All other parts of HTCondor utilize POSIX locks (or whatever the Windows equivalent is); for these, the kernel imposes uniqueness, handles race conditions, and locks are global across all namespaces.
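For comparison, a minimal sketch of the POSIX-lock approach, using Python's fcntl.lockf (which wraps fcntl() record locks). The function name try_lock is hypothetical, and this only illustrates the general technique, not what HTCondor or a future DAGMan actually does.

```python
import fcntl
import os


def try_lock(path):
    """Try to take an exclusive, non-blocking POSIX lock on path.

    Returns the open file object (keep it open to hold the lock),
    or None if another process already holds the lock.
    """
    f = open(path, "a+")  # read/write without truncating; creates the file if needed
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Another process holds a conflicting lock.
        f.close()
        return None
    return f
```

The kernel arbitrates between competing lockers and drops the lock when the holding process exits, so there is no stale-lock check and no need to identify the holder by PID and timestamp.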
If I'm reading Kent's ticket updates correctly, the plan is to move DAGMan to POSIX locks and nuke the ProcessId stuff.
PS - Historically, HTCondor had a heckuva time with lock files due to NFS. This may be why the ProcessId approach was taken for DAGMan (I wasn't there; just guessing). Since then, the issues with lock files in HTCondor have been cleaned up, one-by-one.