
Re: [Condor-users] Running long jobs



Do you have the condor_config and condor_config.local files you could
post or email?

The log files will show why a job was preempted. I forget whether it's the
MasterLog or the StartLog, but the StartLog (the condor_startd's log) is the
likelier one.  You'll probably have to ask your Condor admin for them.

Ralph Finch, P.E.
Dept. of Water Resources
Bay-Delta Office, Room 215-13
Sacramento, CA  95814
916-653-7552
rfinch@xxxxxxxxxxxx
 

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Daniel 
> R Figueiredo
> Sent: Monday, December 05, 2005 2:51 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Running long jobs
> 
> Hi Erik and Ralph,
> 
> Thanks for your respective messages. I now understand better the idea of
> using two VMs per processor and how this could indeed lead to a solution.
> However, I still don't understand why a simpler solution, such as the one
> suggested by Ralph, would not work. To be clear, I don't know why Condor
> decides to evict the long jobs (after, say, around 15 hours). It could be
> keyboard activity, as suggested. However, it could also be due to user
> priorities (this is probably more likely). Recall that this job is running
> in a heavily loaded Condor cluster (several users, dispatch queue with a
> large backlog), which could make the long job receive low priority (over
> time) compared to newly submitted jobs from users with few jobs. Can this
> case also be handled with an approach similar to the one suggested by
> Ralph? If not, is this why we need the VM approach?
> 
> Sorry for the long exchange of messages in resolving this issue, but I
> would like to understand what is going on here.
> 
> Thanks,
> Daniel
> 
> 
> 
> On Sun, 4 Dec 2005, Finch, Ralph wrote:
> 
> > I don't think Daniel needs two VMs; he simply wants his one job to
> > suspend for some reason, then resume when the "reason" no longer
> > applies.
> >
> > Looking at his original post, Daniel said:
> >
> > "The problem is that after the job has been running for some hours (say
> > 10 hours) Condor decides to evict the job from the machine."
> >
> > Why it gets evicted is not said, so we don't know the criteria for
> > suspending a job.  I'll assume keyboard activity. Then "the minimal set
> > of configuration fields that must be changed in order to achieve
> > [suspension instead of eviction]" is:
> >
> > WANT_SUSPEND            = TRUE
> > PREEMPT                 = FALSE
> > PREEMPTION_REQUIREMENTS = FALSE
> > KILL                    = FALSE
> >
> > ContinueIdleTime        = 5 * $(MINUTE)
> > SUSPEND                 = $(KeyboardBusy)
> > CONTINUE                = (KeyboardIdle > $(ContinueIdleTime))
> >
> > Ralph Finch, P.E.
> > Dept. of Water Resources
> > Bay-Delta Office, Room 215-13
> > Sacramento, CA  95814
> > 916-653-7552
> > rfinch@xxxxxxxxxxxx
> >
> >
> >> -----Original Message-----
> >> From: condor-users-bounces@xxxxxxxxxxx
> >> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
> >> Sent: Saturday, December 03, 2005 11:39 AM
> >> To: Condor-Users Mail List
> >> Subject: Re: [Condor-users] Running long jobs
> >>
> >> On Sat, Dec 03, 2005 at 07:01:43PM +0100, Daniel R Figueiredo wrote:
> >>>
> >>> On Wed, 30 Nov 2005, Erik Paulson wrote:
> >>>
> >>> Thanks for your message. It's now clear that I'll need support from the
> >>> Condor administrator. However, I looked through the report "Condor and
> >>> The Bologna Batch System" as you suggested, but it was not clear how to
> >>> configure Condor to run long jobs with preemption implemented via
> >>> suspension (as opposed to preemption via termination). In particular, I
> >>> would like to know: what is the minimal set of configuration fields that
> >>> must be changed in order to achieve this? Recall that I would like
> >>> long jobs to be preempted via suspension (as opposed to terminated
> >>> through a signal) and later resume from where they stopped (as opposed
> >>> to restarting from the beginning). Any ideas on how to do this? I could
> >>> then suggest something concrete to our local Condor administrator.
> >>>
> >>
> >> You need to create 2 VMs. There is no way to have one VM suspend a job,
> >> start another one, and resume the first one later - if a job has
> >> state on a machine, it must have a VM watching over it, and a VM can only
> >> watch over one job at a time.
> >>
> >> You can emulate your desired behaviour with 2 VMs - the second VM can be
> >> configured to suspend the job whenever it sees the state of the first VM
> >> switch to "Claimed". The BBS document should give you all of the details
> >> you need.
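> >>
> >> As a rough, untested sketch of that idea (not taken from the BBS
> >> document): it assumes a single-CPU machine advertised as two VMs, with
> >> the long job on vm2, and it leaves out the START expressions that would
> >> be needed to steer long jobs to vm2 and other jobs to vm1.
> >>
> >> # Assumption: a one-CPU machine, advertised as two VMs
> >> NUM_CPUS         = 2
> >> # Publish each VM's State into the other VM's ad as vm<N>_State
> >> STARTD_VM_EXPRS  = State
> >> # Suspend rather than evict
> >> WANT_SUSPEND     = TRUE
> >> PREEMPT          = FALSE
> >> KILL             = FALSE
> >> # Only vm2 suspends: while vm1 is Claimed, the job on vm2 waits
> >> SUSPEND          = (VirtualMachineID == 2) && (vm1_State == "Claimed")
> >> CONTINUE         = (VirtualMachineID == 2) && (vm1_State != "Claimed")
> >>
> >> A suspended job keeps its memory on the machine, so this probably only
> >> makes sense where both jobs fit in RAM or swap at the same time.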
> >>
> >> -Erik
> >> _______________________________________________
> >> Condor-users mailing list
> >> Condor-users@xxxxxxxxxxx
> >> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>
> >