Re: [Condor-users] BOINC running, all machine Owner

On Fri, 7 Apr 2006 19:06:22 +0200 (CEST)  Emmanuel Le Guirriec wrote:

> I tried with EVICT_BACKFILL = FALSE and i can see that when boinc is 
> running in backfill, the LoadAvg = 1.0000 and the CondorLoadAvg = 0.00
> Then the CpuBusy = true ( 1 - 0 >= 0.5).

i decided not to include BOINC-generated load in CondorLoadAvg.  but i
forgot to a) document this fact and b) provide an alternative.
whoops. ;)  sorry about that.
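
to spell out the arithmetic from the quoted message, here is a rough python sketch.  the function name and the 0.5 threshold are just taken from the numbers quoted above; they are assumptions for illustration, not the actual condor implementation, and your local CpuBusy definition may differ.

```python
def cpu_busy(load_avg: float, condor_load_avg: float, threshold: float = 0.5) -> bool:
    """Hypothetical model of CpuBusy: true when the load NOT attributed
    to condor reaches the threshold (0.5 in the quoted example)."""
    non_condor_load = load_avg - condor_load_avg
    return non_condor_load >= threshold


# BOINC running in backfill: its load shows up in LoadAvg but not in
# CondorLoadAvg, so the machine looks busy with non-condor work.
print(cpu_busy(load_avg=1.0, condor_load_avg=0.0))

# a regular condor job: its load is counted in both values, so the
# difference is 0 and the machine does not look busy.
print(cpu_busy(load_avg=1.0, condor_load_avg=1.0))
```

the first call is the BOINC case from the quoted message (1 - 0 >= 0.5), which is why CpuBusy comes out true.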

maybe we should add something like BackfillLoadAvg, and then folks can
re-write their policies based on that.  i'll have to think about this
more.  feedback from the community is most welcome...

> I think i've to find a policy to EVICT_BACKFILL that doesn't use
> CpuBusy. Any idea ?

yeah, take "CpuBusy" out of your EVICT_BACKFILL. ;)  keep in mind:

a) you can still refer to keyboard/mouse activity if it's an
interactive machine.  in condor "keyboard" activity includes ssh
sessions, etc, not just the physical keyboard.

b) if condor finds "real" work to give it, the backfill job will be
evicted, anyway, regardless of this expression.

so, if it's an interactive machine, just do this:

EVICT_BACKFILL = $(KeyboardBusy)

if it's just a compute node in a rack managed by condor, do this:

EVICT_BACKFILL = FALSE

the only problem this creates is if your machines can have background
computations spawned by some method other than interactive users or
condor.  in that case, you can go back to the dark ages of what condor
policy expressions did before CondorLoadAvg existed:

EVICT_BACKFILL = $(KeyboardBusy) || (LoadAvg > 1.3)

(or something).  so, if a BOINC job is running on this machine, the
load should be 1.0 (since BOINC is usually nice and CPU-bound).  if
there's something else running on the CPU generating at least a 0.3
load, then it's probably time to get BOINC out of the way.
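
as a sanity check on that reasoning, here is a small python sketch of how the expression would evaluate.  the function and parameter names are made up for illustration, and the 1.3 limit is just the example value from above, so treat this as a model of the policy and not as anything condor itself runs.

```python
def evict_backfill(keyboard_busy: bool, load_avg: float, limit: float = 1.3) -> bool:
    """Hypothetical model of: $(KeyboardBusy) || (LoadAvg > 1.3)."""
    return keyboard_busy or load_avg > limit


# BOINC alone on the CPU, nobody at the keyboard: load is about 1.0, keep it.
print(evict_backfill(keyboard_busy=False, load_avg=1.0))

# BOINC plus some other background computation pushing the load past 1.3.
print(evict_backfill(keyboard_busy=False, load_avg=1.4))

# an interactive user shows up, whatever the load is.
print(evict_backfill(keyboard_busy=True, load_avg=1.0))
```

only the first case leaves the backfill job running; the other two evict it.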

there are some problems with this approach (especially on SMP machines,
but the current BOINC support doesn't work very well on SMP machines
yet, anyway), but it's a stop-gap measure until we can come up with a
better solution.   again, if all jobs on the machines are either
started directly via interactive users or by condor, none of this