
Re: [Condor-users] Jobs License Management

On Oct 15, 2008, at 9:01 AM, Ian Chesal wrote:

Stuart Anderson wrote:
In the context of the new Concurrency Limit, will it be possible for a
running job to drop a resource constraint when it is done with it, or
is it implicitly assumed that all jobs require their specified
resources for their entire lifetime?

The motivation for this is managing I/O resources where a typical
flow is to launch a large number of jobs that each read in a large
amount of data from a shared filesystem (or set of filesystems), and
crunch on the data for a long time before outputting a relatively
small amount of results. It would be interesting to be able to hand out
tokens for filer access but then be able to return them after the
intensive phase of each individual job is done.


Stuart Anderson  anderson@xxxxxxxxxxxxxxxx

Right now the limits exist for the lifetime of the job. It is
conceivable that jobs able to modify their ad, via chirp, would be
able to update the limits they use. However, this is currently not part of
the implementation.
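
To make "limits exist for the lifetime of the job" concrete, here is a minimal sketch of a submit description that claims a limit. It assumes the `concurrency_limits` submit command and a matching `<NAME>_LIMIT` setting (for example `FILER_LIMIT = 10`) in the pool configuration, as described in the Condor manual; the limit name `filer` and the executable name are hypothetical.

```python
def render_submit(executable, limits, queue=1):
    """Return submit-description text for a job that claims the named limits.

    The limits are held from job start to job exit; nothing in this
    file lets the job hand a token back early.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # Comma-separated list of limit names (hypothetical names here).
        f"concurrency_limits = {','.join(limits)}",
        f"queue {queue}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_submit("read_and_crunch.sh", ["filer"]))
```

With a pool-wide `FILER_LIMIT = 10`, at most ten such jobs would run at once, whether or not they are still reading from the filer.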

We deal with this now in our own pre-Condor resource scheduler, and truly
the best answer we have come up with to this problem is: divide up
the jobs. It is more work on the part of the job developer, but
ultimately it lets you keep the simplest resource request and
partitioning scheme. Predictability wins out time and time again for us
over complexity.

We'll often see developers writing flows that use limited, expensive
Tool A, then B, then C and submitting a job that requires all three that
then blocks for an eternity, starved, trying to get all three, while
jobs that only need one of the three fly by it. The answer is always:
write a job that submits a job. Your entry job uses Tool A, finishes,
submits a job that uses Tool B, etc. DAGs make this even easier.

Stuart, in your case a DAG would work very well: the first point on the
DAG is your file-transfer intensive portion of the job, and it needs a
resource, the second point that follows is the number crunching portion
and it doesn't need any resources.
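
A rough sketch of that two-node layout, written as a small Python helper that emits DAGMan input. It uses the DAGMan keywords JOB, PARENT/CHILD, CATEGORY and MAXJOBS; the submit file names (`io.sub`, `crunch.sub`), the node names, and the category name `filer` are hypothetical, and throttling the I/O nodes with MAXJOBS is one possible way to model "it needs a resource", not the only one.

```python
def render_dag(n_jobs, io_submit="io.sub", crunch_submit="crunch.sub",
               category="filer", max_io=4):
    """Return DAGMan input text: one I/O node feeding one compute node per job.

    Only the I/O nodes are placed in the throttled category, so the
    number-crunching nodes run without holding any filer slot.
    """
    lines = []
    for i in range(n_jobs):
        io_node, crunch_node = f"io_{i}", f"crunch_{i}"
        lines.append(f"JOB {io_node} {io_submit}")
        lines.append(f"JOB {crunch_node} {crunch_submit}")
        # The compute node starts only after its I/O node has finished.
        lines.append(f"PARENT {io_node} CHILD {crunch_node}")
        lines.append(f"CATEGORY {io_node} {category}")
    # At most max_io nodes of this category are submitted at any time.
    lines.append(f"MAXJOBS {category} {max_io}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dag(2, max_io=1))
```

The cost of this layout is that the I/O node has to leave its data somewhere the compute node can pick it up, since the two nodes are separate processes.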

That is working well for us for jobs with static resource requirements, i.e., we heavily use the DAGMan CATEGORY and MAXJOBS keywords. However, the next level of control I am looking for is for jobs that have transient resource requirements. Put another way, I would rather not have to break up individual processes that do I/O and then number crunching into multiple processes, e.g., using shared memory as an exchange method.


Stuart Anderson  anderson@xxxxxxxxxxxxxxxx
