Re: [Condor-users] Job Attributes and Job Policy Expressions
- Date: Mon, 12 Jul 2010 13:04:49 +0800
- From: <Greg.Hitchen@xxxxxxxx>
- Subject: Re: [Condor-users] Job Attributes and Job Policy Expressions
Sorry, I also meant to ask where/what is evaluating these job policies?
Is it the schedd on the submit node or the startd on the execute node?
I'm guessing it has to be the schedd for the on_exit_hold and
on_exit_remove policies, but is this also the case for the periodic
policies?

Is anyone aware of anything documenting job attributes, particularly in
relation to what attributes are available at what times? e.g.
JobStartDate doesn't appear until a job has transitioned from idle to
running. It is possible to use "condor_q -l" to see a job's attributes
but I was hoping for a listing of ALL possible attributes and when they
are "available".

The reason being that I have been fiddling with some job policy
expressions to "overcome" some issues we have on occasion when
submitting jobs, e.g. some jobs exiting too early and some seeming to
run forever. If we re-run the "too early" jobs then they seem to mostly
run OK. Manually putting the "run forever" jobs on hold and then
manually releasing them causes them to mostly run OK. This can be a
laborious process with 10,000+ submitted jobs, so we were looking at a
way to make this happen automatically using on_exit_remove,
periodic_hold, etc.

I now have something that seems to work for us, but it was a bit of a
trial and error process as some of the existing docs/examples don't
seem to work? [...] doesn't exist, i.e. is not defined) and even some
of the attributes seen with "condor_q -l" give "undefined" errors.
[...] docs/example give one like:

[...] == False) && (ExitSignal != 0)) || (ServerStartTime - JobStartdate < 3600 )

As far as I can tell there is no ServerStartTime. There is, however, a
ServerTime, but even a reference to that says it is undefined, yet I can
see it with condor_q.

BTW this is for Windows version 7.2.4.

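One way to picture the "undefined" results is with a small model. The
sketch below is plain Python, not Condor code. It assumes the usual
ClassAd rules: a missing attribute evaluates to UNDEFINED, arithmetic
and comparisons on UNDEFINED stay UNDEFINED, TRUE || anything is TRUE,
and attribute names are case-insensitive. The job ad contents and all
helper names are made up for illustration.

```python
# Minimal model of ClassAd-style three-valued logic (not HTCondor code).
UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value


def lookup(ad, name):
    """Return the attribute value, or UNDEFINED if the ad lacks it.

    Attribute names are matched case-insensitively.
    """
    lowered = {key.lower(): value for key, value in ad.items()}
    return lowered.get(name.lower(), UNDEFINED)


def sub(a, b):
    """Subtraction: any UNDEFINED operand makes the result UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a - b


def less_than(a, b):
    """Comparison: any UNDEFINED operand makes the result UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a < b


def logical_or(a, b):
    """TRUE || anything is TRUE; otherwise UNDEFINED propagates."""
    if a is True or b is True:
        return True
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return False


# Hypothetical job ad: JobStartDate is present, ServerStartTime is not.
job_ad = {"JobStartDate": 1278910000, "ExitBySignal": False}

elapsed = sub(lookup(job_ad, "ServerStartTime"), lookup(job_ad, "JobStartdate"))
result = logical_or(False, less_than(elapsed, 3600))
print(result is UNDEFINED)  # True
```

In this model a single missing attribute is enough to make the whole
right-hand side UNDEFINED, unless the other side of the || is already
TRUE.
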
Our trial and error solution gave us the following, which seems to work
OK for our purposes. This particular test setup is for jobs that should
run for 20 minutes; any less than this or more than this by 5 mins means
something dodgy has happened, so we want to try re-running the job.

MINUTE = 60
[...] > (15 * $(MINUTE))
periodic_hold = (CurrentTime - JobCurrentStartDate) > (30 * $(MINUTE))
periodic_release = (CurrentTime - EnteredCurrentStatus) > (5 * $(MINUTE))

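As a rough check of the arithmetic in the periodic_hold and
periodic_release expressions, here is a small Python sketch. It only
mirrors the two comparisons; it does not model when or where Condor
evaluates them, and the timeline values are hypothetical.

```python
MINUTE = 60  # seconds, matching the MINUTE macro in the submit file


def periodic_hold(current_time, job_current_start_date):
    """Mirror of: (CurrentTime - JobCurrentStartDate) > (30 * $(MINUTE))"""
    return (current_time - job_current_start_date) > 30 * MINUTE


def periodic_release(current_time, entered_current_status):
    """Mirror of: (CurrentTime - EnteredCurrentStatus) > (5 * $(MINUTE))"""
    return (current_time - entered_current_status) > 5 * MINUTE


# Hypothetical timeline, in seconds, with the job starting at time 0.
start = 0
print(periodic_hold(start + 20 * MINUTE, start))  # False
print(periodic_hold(start + 31 * MINUTE, start))  # True

# Suppose the job enters the held state at the 31 minute mark.
held_at = start + 31 * MINUTE
print(periodic_release(held_at + 5 * MINUTE, held_at))      # False
print(periodic_release(held_at + 5 * MINUTE + 1, held_at))  # True
```

Both comparisons are strict, so a job is held only once it has been
running for more than 30 minutes, and released only once it has been in
its current state for more than 5 minutes.
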
Thanks for any help.