
Re: [Condor-users] Running long jobs



Hi Erik and Ralph,

Thanks for your respective messages. I now better understand the idea of using two VMs per processor and how this could indeed lead to a solution. However, I still don't understand why a simpler solution, such as the one suggested by Ralph, would not work. To be clear, I don't know why Condor decides to evict the long jobs (after, say, around 15 hours). It could be keyboard activity, as suggested. However, it could also be due to user priorities (this is probably more likely). Recall that this job is running in a heavily loaded Condor cluster (several users, dispatch queue with a large backlog), which could cause the long job to receive lower priority over time compared to newly submitted jobs from users with few jobs. Can this case also be handled with an approach similar to the one suggested by Ralph? If not, is this why we need the VM approach?

Sorry for the long exchange of messages in resolving this issue, but I would like to understand what is going on here.

Thanks,
Daniel



On Sun, 4 Dec 2005, Finch, Ralph wrote:

I don't think Daniel needs two VMs; he simply wants his one job to
suspend for some reason, then resume when the "reason" no longer
applies.

Looking at his original post, Daniel said:

"The problem is that after the job has been running for some hours (say
10 hours) Condor decides to evict the job from the machine."

Why it gets evicted is not said, so we don't know the criteria for
suspending a job.  I'll assume keyboard activity. Then "the minimal set
of configuration fields that must be changed in order to achieve
[suspension instead of eviction]" is:

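# Let the startd suspend the job when SUSPEND becomes true.
WANT_SUSPEND            = TRUE
# Never preempt (vacate) the job on the execute machine.
PREEMPT                 = FALSE
# Read by the negotiator: no preemption based on user priority.
PREEMPTION_REQUIREMENTS = FALSE
# Never hard-kill the job.
KILL                    = FALSE

# Suspend on keyboard activity; resume after 5 idle minutes.
# KeyboardBusy is a macro from the default configuration file.
ContinueIdleTime        = 5 * $(MINUTE)
SUSPEND                 = $(KeyboardBusy)
CONTINUE                = (KeyboardIdle > $(ContinueIdleTime))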

Ralph Finch, P.E.
Dept. of Water Resources
Bay-Delta Office, Room 215-13
Sacramento, CA  95814
916-653-7552
rfinch@xxxxxxxxxxxx


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Saturday, December 03, 2005 11:39 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Running long jobs

On Sat, Dec 03, 2005 at 07:01:43PM +0100, Daniel R Figueiredo wrote:

On Wed, 30 Nov 2005, Erik Paulson wrote:

Thanks for your message. It's now clear that I'll need support from the
Condor administrator. However, I looked through the report "Condor and
The Bologna Batch System" as you suggested, but it was not clear how to
configure Condor to run long jobs with preemption implemented via
suspension (as opposed to preemption via termination). In particular,
what is the minimal set of configuration fields that must be changed in
order to achieve this? Recall that I would like long jobs to be
preempted via suspension (as opposed to terminated through a signal)
and to later resume from where they stopped (as opposed to restarting
from the beginning). Any ideas on how to do this? I could then suggest
something concrete to our local Condor administrator.


You need to create 2 VMs. There is no way to have one VM suspend a job,
start another one, and resume the first one later - if a job has state
on a machine, it must have a VM watching over it, and a VM can only
watch over one job at a time.

You can emulate your desired behaviour with 2 VMs - the second VM can be
configured to suspend the job whenever it sees the state of the first VM
switch to "Claimed". The BBS document should give you all of the details
you need.
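
A rough sketch of what such a two-VM policy might look like is given
below. It is not taken from the BBS document. It assumes a Condor
version that provides NUM_VIRTUAL_MACHINES and STARTD_VM_EXPRS (later
releases rename VMs to slots), and it assumes the long job runs on the
second VM. The START expressions that would steer long jobs to vm2 and
short jobs to vm1 are left out and would have to be worked out locally.

```
# Sketch only, under the assumptions stated above.

# Advertise two VMs on a machine with one physical CPU (assumed layout).
NUM_CPUS             = 2
NUM_VIRTUAL_MACHINES = 2

# Copy each VM's State into the other VM's ad as vm1_State / vm2_State.
STARTD_VM_EXPRS      = State

# Only the second VM suspends; the first VM runs jobs normally.
WANT_SUSPEND = (VirtualMachineID == 2)

# Suspend the job on vm2 while vm1 is claimed, resume once it is not.
SUSPEND      = (VirtualMachineID == 2) && (vm1_State =?= "Claimed")
CONTINUE     = (VirtualMachineID == 2) && (vm1_State =!= "Claimed")
```

The suspended job stays resident on vm2, which is why a second VM is
needed: vm1 is free to run another job while vm2 keeps watching over
the suspended one.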

-Erik
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

