On Oct 15, 2008, at 9:01 AM, Ian Chesal wrote:
Stuart Anderson wrote:

In the context of the new Concurrency Limit, will it be possible for a running job to drop a resource constraint when it is done with it, or is it implicitly assumed that all jobs require their specified resources for their entire lifetime?

The motivation for this is managing I/O resources where a typical workflow is to launch a large number of jobs that each read in a large amount of data from a shared filesystem (or set of filesystems), and then crunch on the data for a long time before outputting a relatively small amount of results. It would be interesting to be able to hand out tokens for filer access but then be able to return them after the I/O intensive phase of each individual job is done.

Thanks.

-- Stuart Anderson anderson@xxxxxxxxxxxxxxxx http://www.ligo.caltech.edu/~anderson

Right now the limits exist for the lifetime of the job. It is conceivable that jobs able to modify their ad, via chirp, would be able to update the limits they use. However, this is currently not part of the implementation.

We deal with this now in our own pre-Condor resource scheduler, and truly the best answer we have come up with to this problem is: divide up the jobs. It is more work on the part of the job developer, but ultimately it lets you keep the simplest resource request and partitioning scheme. Predictability wins out time and time again for us over complexity.

We'll often see developers writing flows that use limited, expensive Tool A, then B, then C, and submitting a job that requires all three. That job then blocks for an eternity, starved, trying to get all three, while jobs that only need one of the three fly by it. The answer is always: write a job that submits a job. Your entry job uses Tool A, finishes, submits a job that uses Tool B, etc.

DAGs make this even easier. Stuart, in your case a DAG would work very well: the first point on the DAG is the file-transfer intensive portion of your job, and it needs a resource; the second point that follows is the number crunching portion, and it doesn't need any resources.
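
As a rough illustration of the two-point DAG described above, here is a minimal Python sketch that writes DAGMan input text with one I/O node and one number-crunching node per dataset. The submit file names (fetch.sub, crunch.sub), the node names, the category name filer_io, and the dataset variable are all hypothetical; only the JOB, VARS, PARENT/CHILD, CATEGORY and MAXJOBS keywords are DAGMan's own. It also assumes the fetch node leaves its data somewhere the crunch node can find it, which this thread does not cover.

```python
def build_dag(datasets, max_fetch=4):
    """Return DAGMan input text: one fetch node and one crunch node per dataset."""
    lines = []
    for name in datasets:
        # fetch.sub and crunch.sub are hypothetical submit description files.
        lines.append(f"JOB fetch_{name} fetch.sub")
        lines.append(f"JOB crunch_{name} crunch.sub")
        lines.append(f'VARS fetch_{name} dataset="{name}"')
        lines.append(f'VARS crunch_{name} dataset="{name}"')
        # The crunch node starts only after its fetch node has finished.
        lines.append(f"PARENT fetch_{name} CHILD crunch_{name}")
        # Only the I/O-heavy nodes are placed in the throttled category.
        lines.append(f"CATEGORY fetch_{name} filer_io")
    # At most max_fetch nodes of category filer_io are submitted at once.
    lines.append(f"MAXJOBS filer_io {max_fetch}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_dag(["a", "b"], max_fetch=1), end="")
```

Because the crunch nodes are outside the throttled category, they do not count against the filer limit once their fetch node has completed.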
That is working well for us for jobs with static resource requirements, i.e., we heavily use the DAGMan CATEGORY and MAXJOBS keywords. However, the next level of control I am looking for is for jobs that have transient resource requirements. Put another way, I would rather not have to break up individual processes that do I/O and then number crunching into multiple processes, e.g., using shared memory as an exchange method.
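
To make the transient-requirement idea concrete, here is a small Python sketch of the token semantics being asked for, using threads in a single process. It is not a Condor feature or API (per the reply quoted above, limits are currently held for the lifetime of the job); the function name run_jobs and the counters are made up for illustration.

```python
import threading


def run_jobs(n_jobs, n_tokens):
    """Run n_jobs threads that hold one of n_tokens only during their I/O phase."""
    tokens = threading.BoundedSemaphore(n_tokens)
    lock = threading.Lock()
    state = {"reading": 0, "peak_reading": 0, "done": 0}

    def job():
        with tokens:  # take a filer token for the I/O phase only
            with lock:
                state["reading"] += 1
                state["peak_reading"] = max(state["peak_reading"], state["reading"])
            # ... the large read from the shared filesystem would happen here ...
            with lock:
                state["reading"] -= 1
        # The token has been returned; the long compute phase runs without it.
        sum(i * i for i in range(10000))
        with lock:
            state["done"] += 1

    threads = [threading.Thread(target=job) for _ in range(n_jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


if __name__ == "__main__":
    print(run_jobs(n_jobs=8, n_tokens=2))
```

The point of the sketch is that the number of concurrent readers never exceeds the token count, while any number of jobs may be in their compute phase at the same time.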
Thanks. -- Stuart Anderson anderson@xxxxxxxxxxxxxxxx http://www.ligo.caltech.edu/~anderson