
Re: [Condor-users] Using Condor as a substitute for OpenPBS



On 10/26/05, David McGiven <david.mcgiven@xxxxxxxxxxxx> wrote:
> Dear Condor Users,
>
> I'm currently using OpenPBS. Since that project has been unmaintained for
> some years now, I would like to switch to a free scheduler that is under
> active development.
>
> I would like to know if I can use Condor like I use PBS:
>
> - There's a single queue into which all jobs submitted by the different
> users go. PBS allocates free nodes for those jobs and starts running
> them.
>
> When there are no more free nodes, jobs keep entering the queue but have
> to wait until there are free nodes, of course.

I shall ignore the app-specific aspects and concentrate on the above...

Short answer in general: yes, but with issues as the number of execute
nodes and the job interval (the time a job takes) increase.

To achieve total FIFO behaviour you would have one schedd (the submit and
job control point) on one machine, and all submissions to it would have to
come from the same user.

At this point the schedd would allocate jobs in order of submission.
However, what happens if job 1 starts, job 2 starts, job 3 starts, and
then job 1 dies (say the execute node loses power)?

What do we do with job 1 - should we kill job 3 and run job 1 in its
place, or should we just put job 1 onto the next available machine?

If the answer to this question is the latter then you have no problem.
Otherwise you will need to set things up on the execute nodes with RANK
settings so that an earlier submit time causes a job to rank higher. This
is somewhat nasty, since it utterly precludes the latter behaviour, or any
expansion to, say, two schedds in future, without layering more and more
complexity into that expression, which I don't recommend.
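
For illustration only, a sketch of the sort of execute-node setting I mean
(it assumes the standard QDate job attribute, i.e. submission time in
seconds since the epoch, and whether preemption actually happens also
depends on your pool's preemption policy):

    # condor_config fragment on each execute node: prefer jobs that were
    # submitted earlier, so an older job out-ranks (and may preempt) a
    # newer job already running there.
    RANK = 0 - TARGET.QDate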

The issue is that communication with the schedd is largely single-threaded,
so it becomes a choke point (a job finishing can take a lot of resources,
depending on the file-transfer requirements).

Also, since every running job requires a shadow process on the submit
machine, you may hit file descriptor / memory / other resource limits.

Also, you have a single point of failure, since if that machine dies all
jobs are lost (with 6.7.x series job leasing you can be resilient to a
reboot, but if the job log goes so do your jobs).
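
As a hedged sketch of what I mean by job leasing (6.7.x or later; the
value here is just an example):

    # submit-file line: let a running job survive loss of contact with
    # its schedd for up to 20 minutes before being given up on.
    job_lease_duration = 1200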

> I don't want any other functionality of Condor like

> checkpointing,
Not a problem - run in the vanilla universe.
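
For what it's worth, a minimal vanilla-universe submit file might look
something like this (the executable, argument and file names are
placeholders):

    universe   = vanilla
    executable = my_app
    arguments  = input.dat
    output     = my_app.$(Process).out
    error      = my_app.$(Process).err
    log        = my_app.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue 10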

> scheduling priorities

You can simply not use them (or force them to be the same via some
constrained submit gateway), but this requires all jobs to appear to come
from the same user and, again, that there is a single schedd.
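
One possible way such a gateway could do that (a sketch only; the group
name is a placeholder) is to inject a shared accounting group into every
job it submits, so the negotiator treats them all as one identity:

    # line added to every submit file by the hypothetical gateway
    +AccountingGroup = "fifo_pool.shared"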

>, idle cycle harvesting,

trivial
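
For example, a sketch of a desktop-harvesting policy (KeyboardIdle and
LoadAvg are standard machine attributes; the thresholds are arbitrary):

    # condor_config fragment: only start jobs when the console has been
    # idle for 15 minutes and the load is low; suspend them as soon as
    # the owner returns, and resume once the machine is idle again.
    START    = KeyboardIdle > 15 * 60 && LoadAvg < 0.3
    SUSPEND  = KeyboardIdle < 60
    CONTINUE = KeyboardIdle > 15 * 60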

> plain queue with a FIFO order.

> Also, I don't want to apply any kernel patches, nor linking any external
> libraries to my applications.

The vanilla universe again (see the submit-file sketch above).

> Is it possible to replace OpenPBS with Condor and have the same
> functionality ?

I do not know whether there is any other functionality that you rely on
but have not mentioned.

Hope this helps

Matt