
Re: [HTCondor-users] File last modification time or job last write() attribute?

From: MIRON LIVNY <miron@xxxxxxxxxxx>
Date: 05/27/2016 01:50 AM

> You are right, Dimitri.
> The reason I used C was to make the point that the definition of "stuck"
> has an impact on the frequency of the probe. I can see cases where the
> probe is expensive.
> In the case if all goes well we will probe in a very low frequency.

With this comment in mind, I tweaked my hook script to only chirp if the
value has changed. Thanks!
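
For anyone following along, the "only chirp on change" idea can be sketched
roughly as below. This is not my actual hook script, just a minimal
illustration. The attribute name passed in and the "run" parameter are
hypothetical (the latter exists only so the command can be stubbed out),
and it assumes condor_chirp is on the PATH and usable from wherever the
script runs.

```python
import subprocess


def chirp_if_changed(attr, value, last_value, run=subprocess.run):
    """Publish attr=value with condor_chirp only when the value changed.

    Returns the value to remember for the next probe.
    """
    if value == last_value:
        # Nothing new to report, so skip the relatively costly chirp.
        return last_value
    # "condor_chirp set_job_attr <name> <value>" updates the job ad.
    run(["condor_chirp", "set_job_attr", attr, str(value)], check=True)
    return value
```

The caller keeps the returned value between probes, so a quiet job costs
one comparison per interval rather than one chirp.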

Perhaps this idea could be expressed more generally as a "watchdog
service" for a job.

The linked article tells a sad tale of the demise of the Clementine
mission for want of a few lines of hardware-WDT enablement code. Although
they had implemented a thruster timeout in software, that froze too
when the processor hung. Clementine's mission to near-Earth asteroid
Geographos had to be abandoned for want of the fuel that was spewed
out during the hang. http://www.ganssle.com/watchdogs.htm

As we know, HTCondor's startd can provide the equivalent of Clementine's
unused hardware watchdog, outside the purview of the job. There are
already a number of job characteristics that can be evaluated by a
periodic_hold expression, such as BlockWrites, BlockReads, BytesSent,
RecentBlockWrites, ResidentSetSize_RAW, RemoteSysCpu / RemoteUserCpu,
RemoteWallClockTime, and so on. And from what I gathered at
the delightful and informative HTCondor Week 2016 -
http://research.cs.wisc.edu/htcondor/HTCondorWeek2016/ - there will
be even more stats available on a variety of other aspects of the
job in future revisions.

I considered using RecentBlockWrites to watchdog the job in
our situation, but the trouble there is that other elements of
the job may be writing things unrelated to the hung element.
That activity is also reflected in RecentBlockWrites, creating
a "noise floor" which would need testing to characterize
in order to avoid false positives.

CPU utilization is another potential sensor to use, but without
a "RecentRemoteUserCpu" it's tricky to make decisions based
on it. In one case we're looking for the overall utilization
since job startup to fall below about 20% - a safe noise floor
for the job in question - but if it's been running a long time
there's a long tail there.
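
To put rough numbers on that long tail, here is a small worked
calculation. The 90% busy figure is an assumption made up for
illustration; only the 20% threshold comes from our case. It treats
cumulative utilization as CPU time divided by wall-clock time since
startup.

```python
def hours_until_trip(busy_hours, utilization, threshold):
    """Idle hours needed before cumulative CPU/wall falls below threshold.

    busy_hours  -- wall-clock hours the job ran normally before hanging
    utilization -- CPU fraction while it was running normally (assumed)
    threshold   -- cumulative utilization that trips the watchdog
    """
    cpu_hours = busy_hours * utilization
    # After h idle hours the ratio is cpu_hours / (busy_hours + h);
    # solve cpu_hours / (busy_hours + h) = threshold for h.
    return max(0.0, cpu_hours / threshold - busy_hours)


for busy in (1, 10, 100):
    print(busy, "busy hours ->", hours_until_trip(busy, 0.9, 0.2), "idle hours")
```

Under those assumptions a job that hangs after 10 healthy hours has to
sit idle for roughly 35 more hours before the cumulative figure dips
under 20%, and the delay grows in proportion to how long it ran.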

The last modification time of a given file is really just another
statistical sensor. (I'm also looking at adding a regexp which can
be looked for in the tail end of the specified file.)
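
A rough sketch of those two sensors, the file's age and a regexp
search over its tail, might look like the following. The function
names and the 4096-byte tail size are hypothetical choices for
illustration, not what my script actually uses.

```python
import os
import re
import time


def file_age_seconds(path, now=None):
    """Seconds since the file at path was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def tail_matches(path, pattern, tail_bytes=4096):
    """True if the regexp pattern is found in the last tail_bytes of the file."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Read only the tail so large output files stay cheap to probe.
        f.seek(max(0, size - tail_bytes))
        tail = f.read().decode("utf-8", errors="replace")
    return re.search(pattern, tail) is not None
```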

So perhaps a direction which could be explored is another "periodic"
type - we have periodic_hold, periodic_remove, periodic_checkpoint,
and even periodic_memory_sync... what about a "periodic_run" or
"periodic_info" directive for condor_submit?

It would be given an input-transferred executable to
be run by the startd during the standard periodic interval, to
bring "update_job_info" in from the hooks and "+" notation to
submit-native functionality. It would be given a copy of the job
classad on stdin, and deliver an update classad on stdout. It would
probably need to be handled asynchronously like the job info
hook is.

The executable would only be responsible for updating classads
based on specific details it's looking for, while the actual
watchdog trigger and action would be handled by the other
periodic_* expressions.
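
To make the proposal concrete, here is what such an executable might
look like. Everything about the interface is hypothetical, since
"periodic_info" does not exist: the sketch assumes the job ad arrives
on stdin as plain "Attr = value" lines, and the WatchdogFile and
WatchdogFileAge attribute names are invented for the example.

```python
import os
import sys
import time


def parse_classad(text):
    """Parse 'Attr = value' lines into a dict of raw value strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


def build_update(ad, now=None):
    """Return the attributes to publish; empty if nothing can be measured."""
    now = time.time() if now is None else now
    # WatchdogFile is a hypothetical attribute naming the file to watch.
    path = ad.get("WatchdogFile", "").strip('"')
    if not path or not os.path.exists(path):
        return {}
    return {"WatchdogFileAge": int(now - os.path.getmtime(path))}


def main():
    ad = parse_classad(sys.stdin.read())
    for name, value in build_update(ad).items():
        print(f"{name} = {value}")


# Only run when a job ad is actually piped in on stdin.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

The script itself takes no action. A periodic_hold expression along the
lines of "WatchdogFileAge > 3600" would then decide when the job counts
as stuck.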

Another use case that comes to mind is to use "strace" in an
info script, to attach to the running process based on the
JobPid attribute, and look for patterns in its execution
to detect problematic behavior such as an infinite loop or a
hung call.

This certainly gives a nice length of stout rope to users, but
when you see folks parsing the stdout of condor_q in a watchdog
script they wrote themselves, you realize that they already
have quite a bit of gallows rope on hand to begin with.

I suppose there's nothing inherently wrong with doing this with an
update_job_info hook, aside from the constraints that have always
existed in the hook mechanisms such as the inability (as far as
I know) to mix different hooks together since it's not possible
to specify a comma-delimited list of hook keywords.

Food for thought...

        -Michael Pelletier.