
Re: [Condor-users] Restricting the number of jobs copying data



> What I would like to be able to do is submit all the jobs to Condor,
> then restrict the number of jobs that are allowed to be copying; as
> each job finishes copying its data, the processing part of the job
> will start, allowing a new job to start copying data.
> Each job has three parts: Part A copies data, then Part B processes
> the data, then Part C writes the results. I'd like to limit the number
> of jobs that are in Parts A and C at any one time, while allowing any
> number of jobs to be in Part B.
>
>  The 'simple' solution would be to allow only 50 jobs to run at a
> time. But once those 50 jobs have finished copying their data and can
> start processing, I'd like the next 50 to start copying data.
>
>  I'm pretty much at a loss as to where to start to create this
> restriction, so any hints would be appreciated.

Personally I'd try to solve this outside of Condor with a mutex
approach:

http://en.wikipedia.org/wiki/Mutual_exclusion

What you want to do is create a shared counter on your file system. When
a job wants to copy data from the file system it checks the counter
file. If the count is less than X it takes an exclusive lock on the
file; if the count is still less than X once it holds the lock, it
increments the count, writes the file, and releases the lock. If the
count has reached X in the meantime, it releases the lock, waits a bit,
and tries again. Once it holds a slot it can do the copy. When it's
done: exclusive lock on the file, decrement the counter, release the
lock.

In this way you can have all 500 nodes working away at jobs, but the
copying of data from your filer would be restricted.
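
For what it's worth, here's a rough sketch of that protocol in Python.
It assumes a POSIX file system where fcntl locks are actually honored
(see the caveat below about remote file systems), and the path, limit,
and polling interval are all made up for illustration:

#!/usr/bin/env python
# Counter-file mutex sketch: at most MAX_COPIERS jobs copy at once.
import fcntl
import os
import time

COUNTER_FILE = "/shared/copy_slots"   # hypothetical shared path
MAX_COPIERS = 50                      # the "X" above
RETRY_SECONDS = 30

def _open_locked(path):
    # Open the counter file (creating it if needed) and take an
    # exclusive lock on it.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    f = os.fdopen(fd, "r+")
    fcntl.flock(f, fcntl.LOCK_EX)
    return f

def _read_count(f):
    f.seek(0)
    data = f.read().strip()
    return int(data) if data else 0

def _write_count(f, count):
    f.seek(0)
    f.truncate()
    f.write(str(count))

def acquire_copy_slot():
    # Block until this job is allowed to start copying.
    while True:
        f = _open_locked(COUNTER_FILE)
        try:
            count = _read_count(f)
            if count < MAX_COPIERS:
                _write_count(f, count + 1)
                return
        finally:
            f.close()   # closing the file releases the lock
        time.sleep(RETRY_SECONDS)

def release_copy_slot():
    # Give the slot back once the copy has finished.
    f = _open_locked(COUNTER_FILE)
    try:
        _write_count(f, max(0, _read_count(f) - 1))
    finally:
        f.close()

The job wrapper would call acquire_copy_slot() before Part A (and
again before Part C, perhaps against a second counter file) and
release_copy_slot() when the copy is done.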

The tricky part is obtaining a lock on a file on a remote file system.
That's not always easy, but it's not an insurmountable obstacle. You
could always deploy a really lightweight database for the mutex. That
would actually simplify the code, and it would work consistently across
platforms.
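
A database makes the conditional increment a single atomic statement.
Here's a sketch using Python's sqlite3 module as a stand-in; in
practice you'd point a real client library at a small networked server
(MySQL, PostgreSQL, etc.), since an SQLite file sitting on NFS inherits
exactly the same locking problem. The table and column names are made
up:

import sqlite3

DB_PATH = "/shared/slots.db"   # hypothetical; use a network DB instead
MAX_COPIERS = 50

def setup(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS copy_slots "
                 "(id INTEGER PRIMARY KEY, count INTEGER NOT NULL)")
    conn.execute("INSERT OR IGNORE INTO copy_slots VALUES (1, 0)")
    conn.commit()

def acquire_slot(conn):
    # Atomic conditional increment: the UPDATE only matches while the
    # count is below the limit, so rowcount tells us whether we won.
    cur = conn.execute("UPDATE copy_slots SET count = count + 1 "
                       "WHERE id = 1 AND count < ?", (MAX_COPIERS,))
    conn.commit()
    return cur.rowcount == 1

def release_slot(conn):
    conn.execute("UPDATE copy_slots SET count = count - 1 "
                 "WHERE id = 1 AND count > 0")
    conn.commit()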

The other option would be to explore solving this problem with a DAG:
the first node in the graph stages the data and is restricted to
running only 50 instances at a time, and the second node processes the
data without any parallelism restriction. I'm not well versed in DAGs,
so maybe Kent or someone else can fill in whether this is even possible
(data staging as a DAG step).
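
Sketching from memory (so check the DAGMan manual for your version),
recent DAGMan releases support node categories with per-category
throttles, which looks like a good fit: put every staging node in one
category and cap it at 50, leaving the processing nodes unthrottled.
Something along these lines, with made-up submit-file names:

# one staging node and one processing node per job
JOB    stageA_1  stage_data.sub
JOB    procB_1   process_data.sub
PARENT stageA_1 CHILD procB_1
CATEGORY stageA_1 staging
# ...repeat the above for each job...
MAXJOBS staging 50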

- Ian
