
Re: [Condor-users] Using Condor as a substitute for OpenPBS



Dear Matt,

Thanks for your advice.

> what to do with 1 - should we kill job 3 and replace job 1 there,
> should we just put 1 in to the next available machine?

Absolutely not. I'm sorry, Job 1 ... you'll have to wait again in the queue!

Regards,
David

----- Original Message -----

> On 10/26/05, David McGiven <david.mcgiven@xxxxxxxxxxxx> wrote:
> > Dear Condor Users,
> >
> > I'm actually using OpenPBS. Since that project has been unmaintained for
> > some years now, I would like to switch to a free one but under active
> > development.
> >
> > I would like to know if I can use Condor like I use PBS :
> >
> > - There's a single queue, into which all jobs submitted by the different
> > users go. PBS allocates free nodes for those jobs and starts running
> > them.
> >
> > When there are no more free nodes, the jobs keep entering the queue,
> > but have to wait until there are free nodes, of course.
>
> I shall ignore the app specific aspects and concentrate on the above...
>
> Short answer in general: yes, but with issues as the number of execute
> nodes and the job interval (the time a job takes) increase.
>
> To achieve total FIFO behaviour you would have one schedd (the submit
> and job control point) on one machine, and all submissions to it must
> come from the same user.
>
> At this point the schedd would allocate jobs in order of submission.
> However, what happens if job 1 starts, job 2 starts, job 3 starts,
> and then job 1 dies (say the execute node loses power)?
>
> what to do with 1 - should we kill job 3 and replace job 1 there,
> should we just put 1 in to the next available machine?
>
> If the answer to this question is the latter then you have no
> problems. Otherwise you will need to set things up on the execute
> nodes with RANK settings so that an earlier submit time causes a job
> to rank higher - this is somewhat nasty, since it utterly precludes
> use of the latter behaviour, or any expansion to say two schedds in
> future, without layering more and more complexity into this
> expression, which I don't recommend.
>
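As a rough sketch of that RANK approach - assuming the usual startd
configuration syntax and the standard QDate job attribute - something
like this on each execute node should make earlier-submitted jobs rank
higher:

    # condor_config on each execute node (untested sketch)
    # QDate is the job's submit time in seconds since the epoch, so
    # negating it ranks earlier-submitted jobs strictly higher and lets
    # them displace later ones via startd RANK preemption.
    RANK = 0 - TARGET.QDate

A newer job then never displaces an older one, but as noted above,
folding anything else into that one expression gets ugly fast.
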
> The issue is that communication with the schedd is largely single
> threaded, so it becomes a choke point (a job finishing can take a lot
> of resources, depending on the file transfer requirements).
>
> Also, since every running job requires a shadow process on the submit
> machine, you may hit file descriptor / memory / other resource limits.
>
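One knob that helps here, assuming the parameter name is right for your
version (worth checking in the manual), is the schedd's cap on
concurrently running jobs, and therefore on shadows:

    # condor_config on the submit machine (sketch)
    # Limit how many jobs this schedd will run at once, and hence how
    # many shadow processes it spawns, to stay under fd/memory limits.
    MAX_JOBS_RUNNING = 200
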
> Also, you have a single point of failure: if that machine dies, all
> jobs are lost (with the 6.7.x series' job leasing you could be
> resilient to a reboot, but if the job log goes, so do your jobs).
>
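If memory serves for the 6.7.x submit syntax (treat this as an
assumption to verify), the lease is requested per job in the submit
description:

    # in the submit description file (sketch)
    # Keep the claim on the execute node alive for 20 minutes so running
    # jobs can survive a brief outage or reboot of the submit machine.
    job_lease_duration = 1200
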
> > I don't want any other functionalities of Condor like
>
> > checkpointing,
> Not a problem - run in the vanilla universe.
>
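For reference, a minimal vanilla-universe submit file looks roughly
like this (the file names here are just placeholders):

    # submit.condor (sketch)
    universe   = vanilla
    executable = my_app
    arguments  = input.dat
    output     = my_app.out
    error      = my_app.err
    log        = my_app.log
    queue

Running condor_submit submit.condor then drops it into the queue, with
no kernel patches or relinking required.
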
> > scheduling priorities
>
> You just don't use them (or force them to be the same via some
> constrained submit gateway), but this requires all jobs to appear to
> come from the same user and, again, that there is a single schedd.
>
> > idle cycle harvesting,
>
> trivial
>
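To make the execute nodes behave like dedicated PBS-style nodes rather
than scavenged desktops, the usual recipe (again a sketch - check it
against the manual for your version) is to always start jobs and never
suspend or preempt them:

    # condor_config on each execute node (sketch)
    START   = TRUE
    SUSPEND = FALSE
    PREEMPT = FALSE
    KILL    = FALSE
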
> > plain queue with a FIFO order.
>
> > Also, I don't want to apply any kernel patches, nor link any external
> > libraries into my applications.
>
> vanilla universe again
>
> > Is it possible to replace OpenPBS with Condor and have the same
> > functionality ?
>
> I do not know whether there is any other functionality that you rely
> on but have not mentioned.
>
> Hope this helps
>
> Matt