Re: [Condor-users] preempt and then hold?

On Tue, 2 Aug 2005 09:47:32 -0700  "Michael Yoder" wrote:

> > How often is the ImageSize computed for
> > 
> > a) standard universe jobs ?
> > b) vanilla universe jobs ?
> It's the same, and defaults to 5 minutes.  This can be configured by
> (http://docs.optena.com/display/CONDOR/UPDATE_INTERVAL).  The
> information is also sent during state changes.

Sorry, that's not really true (granted, the startd's code is
confusing, and I'm sure the behavior isn't obvious just from looking
at it).  The startd recomputes the image size of the starter and
all its children much more frequently than the UPDATE_INTERVAL.  You
only see the changed value with condor_status every UPDATE_INTERVAL
and/or on a state change (which might be what Mike's talking about).
However, anything about a job or machine that's potentially changing
rapidly, and that a machine owner might want to use in policy
expressions, is recomputed every POLLING_INTERVAL (defaults to 5
seconds).  Examples are the load average and the Condor load average
(don't ask), memory usage, etc.  So, the startd itself has a new
version of the ImageSize every 5 seconds, and you could watch that
with "condor_status -direct" if you wanted to.
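For example, a machine owner could tie these pieces together in the
condor_config like this (a minimal sketch; POLLING_INTERVAL,
UPDATE_INTERVAL, and PREEMPT are the documented knobs, but the
threshold and the exact policy are just illustrative):

```
# How often the startd recomputes fast-changing attributes
# like ImageSize, LoadAvg, memory usage, etc. (seconds)
POLLING_INTERVAL = 5

# How often the startd pushes updates to the collector, i.e.
# how often plain condor_status (without -direct) sees new values
UPDATE_INTERVAL = 300

# Preempt any job whose image size outgrows this machine's
# physical memory (ImageSize is in KB, Memory in MB)
PREEMPT = (ImageSize > (Memory * 1024))
```

Since ImageSize is refreshed every POLLING_INTERVAL, a PREEMPT
expression like this reacts within seconds, even though condor_status
only shows the new value every UPDATE_INTERVAL.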

> Yes.  The ImageSize isn't determined via checkpointing magic - the
> information comes from the operating system: usually somewhere in
> /proc.

That's also not entirely true.  For vanilla jobs, that's right.
However, for standard universe jobs, it's more complicated.  The
startd *does* come up with its own version of the image size as
described above, and, once the job starts running, that's the value
you'd see when evaluating the PREEMPT expression.  So, if the question
is "what value of ImageSize am I getting if I use it in my PREEMPT
expression?", that's all true, for vanilla *and* standard universe
jobs.  However, for standard universe jobs, the checkpointing magic
*is* used to update the real version of the ImageSize as stored in the
job queue (visible with condor_q), which is what would be used for
subsequent attempts to find a match for the job, etc.  So, if you're
talking about a standard universe job, and you're not doing periodic
checkpointing, and you're not checkpointing when you leave the
machine (due to condor_vacate -fast, the KILL expression becoming
true, etc.), then your job ad at the schedd will still have the
initial size of the executable itself as the ImageSize, and that won't
be particularly accurate for matching if your job allocates and uses a
lot of memory.
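If that's a concern, one fix is to make sure periodic checkpointing is
enabled on your execute machines, so the ImageSize in the job ad gets
refreshed.  A sketch modeled on the stock example config (the LastCkpt
helper macro and the 3-hour threshold shown here follow the shipped
defaults, but check your own condor_config):

```
# Helper macro: seconds since the job's last periodic checkpoint
LastCkpt = (CurrentTime - LastPeriodicCheckpoint)

# Take a periodic checkpoint of a standard universe job roughly
# every 3 hours; as a side effect, this updates the ImageSize
# stored in the job ad at the schedd, so future matchmaking uses
# a realistic value instead of the executable's initial size
PERIODIC_CHECKPOINT = ($(LastCkpt) > (3 * $(HOUR)))
```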

Hope this helps clarify,