Re: [HTCondor-users] File last modification time or job last write() attribute?
- Date: Wed, 01 Jun 2016 14:13:48 -0400
- From: Michael V Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] File last modification time or job last write() attribute?
From: MIRON LIVNY <miron@xxxxxxxxxxx>
Date: 05/27/2016 01:50 AM
> You are right, Dimitri.
> The reason I used C was to make the point that the definition of "stuck"
> has an impact on the frequency of the probe. I can see cases where
> probe is expensive.
> In the case if all goes well we will probe in a very low frequency.
With this comment in mind, I tweaked my hook script to only chirp if
the value has changed. Thanks!
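A minimal sketch of the "only chirp if the value has changed" idea, in Python. The hook script itself is not shown in this thread, so the attribute name WatchedFileMtime, the state file, and the function names are hypothetical. It assumes condor_chirp is on the PATH and that its set_job_attr subcommand is usable from the hook.

```python
import os
import subprocess


def changed_mtime(watched_path, state_path):
    """Return the watched file's mtime (whole seconds) if it differs from
    the value recorded in state_path on the previous probe, else None."""
    current = int(os.stat(watched_path).st_mtime)
    try:
        with open(state_path) as fh:
            previous = int(fh.read().strip())
    except (OSError, ValueError):
        previous = None  # first probe, or the state file is unreadable
    if current == previous:
        return None
    with open(state_path, "w") as fh:
        fh.write(str(current))
    return current


def chirp_if_changed(watched_path, state_path, attr="WatchedFileMtime"):
    """Publish the new mtime to the job ad only when it has changed.
    'attr' is a hypothetical attribute name chosen for this sketch."""
    mtime = changed_mtime(watched_path, state_path)
    if mtime is not None:
        subprocess.run(
            ["condor_chirp", "set_job_attr", attr, str(mtime)], check=True
        )
    return mtime
```

When nothing has been written since the last probe, the script does no chirp at all, so a healthy but quiet job costs only one stat() per probe.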
Perhaps this idea could be expressed more generally as a "watchdog
service" for a job.
The linked article tells a sad tale of the demise of the Clementine
mission for want of a few lines of hardware-WDT enablement: they had
implemented a thruster timeout in software, but that froze too when
the processor hung. Clementine's mission to the near-Earth asteroid
Geographos had to be abandoned for want of the fuel that was spewed
out during the hang. http://www.ganssle.com/watchdogs.htm
As we know, HTCondor's startd can provide the equivalent of that
unused hardware watchdog, outside the purview of the job itself.
There are already a number of job characteristics that can be
evaluated by a periodic_hold expression, such as BlockWrites,
BlockReads, RecentBlockWrites, ResidentSetSize_RAW, RemoteSysCpu,
RemoteWallClockTime, and so on. And from what I gathered at the
delightful and informative HTCondor Week 2016, there will be even
more stats available on a variety of other aspects of the job in
future revisions.
I considered using RecentBlockWrites to watchdog the job in our
situation, but the trouble there lies in the fact that other
elements of the job may be writing other things unrelated to the
hung element. That activity is reflected in RBW, and so creates a
"noise floor" which would need to be characterized in order to
avoid false positives.
CPU utilization is another potential sensor to use, but without a
"RecentRemoteUserCpu" it's tricky to make a decision based on it.
In one case we're looking for the overall utilization since job
startup to fall below about 20% - a safe threshold for the job in
question - but if it's been running a long time there's a long tail
there.
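The long tail can be put into numbers with a short sketch. It assumes, as a simplification, that the job runs at 100% CPU until it hangs and at 0% afterwards; the function name is hypothetical. Under that assumption, a job that was busy for ten hours has to sit idle for forty more before its utilization since startup drops below 20%.

```python
def idle_hours_until_below(busy_hours, threshold):
    """Hours a job must sit idle (0% CPU) after 'busy_hours' at 100% CPU
    before its utilization averaged since startup drops below 'threshold'."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be between 0 and 1")
    # busy / (busy + idle) < threshold  =>  idle > busy * (1/threshold - 1)
    return busy_hours * (1.0 / threshold - 1.0)


# The detection delay grows in proportion to how long the job ran first.
for busy in (1, 10, 100):
    idle = idle_hours_until_below(busy, 0.20)
    print(busy, "busy hours ->", idle, "idle hours before the trigger")
```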
The last modification time of a given file is really
statistical sensor. (I'm also looking at adding a regexp which can
be looked for in the tail end of the specified file.)
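As a rough illustration of those two file-based probes, here is a sketch in Python. The function names and the 4096-byte tail size are assumptions made for this example, not details of the actual script.

```python
import os
import re
import time


def seconds_since_modified(path, now=None):
    """Age of the file's last modification, in seconds."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime


def tail_matches(path, pattern, tail_bytes=4096):
    """True if the regexp 'pattern' is found in the last tail_bytes of the file."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        fh.seek(max(0, size - tail_bytes))  # never seek before the start
        tail = fh.read().decode("utf-8", errors="replace")
    return re.search(pattern, tail) is not None
```

Reading only the tail keeps the probe cheap even when the watched file has grown large.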
So perhaps a direction which could be explored is
type - we have periodic_hold, periodic_remove, periodic_checkpoint,
and even periodic_memory_sync... what about a "periodic_run" or
"periodic_info" directive for condor_submit?
It would be given an input-transferred executable
be run by the startd during the standard periodic
bring "update_job_info" in from the hooks and "+" notation to
submit-native functionality. It would be given a copy of the job
classad on stdin, and deliver an update classad on stdout. It would
probably need to be handled asynchronously like the
The executable would only be responsible for updating
based on specific details it's looking for, while
watchdog trigger and action would be handled by the
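To make the proposed stdin/stdout contract concrete, here is a sketch of what such an info executable might look like. No "periodic_info" directive exists; this only illustrates the idea. The attribute names ProbeLastCpu and ProbeProgressTime are hypothetical, and the parser handles only flat "Name = Value" lines rather than full ClassAd syntax.

```python
def parse_flat_ad(text):
    """Parse 'Name = Value' lines into a dict of raw strings.
    Simplification: nested or multi-line ClassAd syntax is not handled."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            ad[name.strip()] = value.strip()
    return ad


def build_update(ad, now):
    """Return update-ad lines, but only when user CPU time has advanced
    since the value this script published on an earlier run."""
    try:
        cpu = float(ad.get("RemoteUserCpu", "0"))
        last = float(ad.get("ProbeLastCpu", "-1"))  # hypothetical attribute
    except ValueError:
        return []
    if cpu <= last:
        return []  # no progress: leave the previous timestamp in place
    return [f"ProbeLastCpu = {cpu}", f"ProbeProgressTime = {int(now)}"]


def main(instream, outstream, now):
    """Job ad in on 'instream', update ad (possibly empty) out on 'outstream'."""
    for line in build_update(parse_flat_ad(instream.read()), now):
        outstream.write(line + "\n")
```

A real script would end by calling main(sys.stdin, sys.stdout, time.time()). The executable only records when progress was last seen; deciding that the job is stuck, and what to do about it, would be left to an expression that compares the current time against ProbeProgressTime.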
Another use case that comes to mind is to use "strace" in the
info script, to attach to the running process based on the
JobPid attribute, and look for patterns in its execution
to detect problematic behavior such as an infinite loop or a
This certainly gives a nice length of stout rope to hang oneself
with, but when you see folks parsing the stdout of condor_q in a
watchdog script they wrote themselves, you realize that they have
quite a bit of gallows rope on hand to begin with.
I suppose there's nothing inherently wrong with doing this with an
update_job_info hook, aside from the constraints that have existed
in the hook mechanisms, such as the inability (as far as I know) to
mix different hooks together, since it's not possible to specify a
comma-delimited list of hook keywords.
Food for thought...