
Re: [Condor-users] Job Attributes and Job Policy Expressions

Sorry, I also meant to ask: which daemon evaluates these job policies?
Is it the schedd on the submit node or the startd on the execute node?
I'm guessing it has to be the schedd for the on_exit_hold and on_exit_remove
policies, but is this also the case for the periodic policies?

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Greg.Hitchen@xxxxxxxx
Sent: Monday, 12 July 2010 11:39 AM
To: condor-users@xxxxxxxxxxx
Subject: [ExternalEmail] [Condor-users] Job Attributes and Job Policy Expressions

Hi All
Is anyone aware of anything documenting job attributes, particularly in
relation to what attributes are available at what times? e.g. JobStartDate
obviously won't appear until a job has transitioned from idle to running.
It is possible to use "condor_q -l" to see a job's attributes but I was hoping
for a listing of ALL possible attributes and when they are "available".
The reason is that I have been fiddling with some job policy expressions
to "overcome" some issues we occasionally have when submitting jobs,
e.g. some jobs exiting too early and some seeming to run forever. If we
manually resubmit the "too early" jobs then they mostly run OK.
Manually putting the "run forever" jobs on hold and then manually releasing
them also causes them to mostly run OK. This can be a laborious
process with 10,000+ submitted jobs, so we were looking for a way to make
this happen automatically using on_exit_remove, periodic_hold, etc.
I now have something that seems to work for us, but getting there was a bit of
a trial and error process, as some of the existing docs/examples don't seem to
work (the attribute doesn't exist, i.e. is not defined) and even some of the
attributes seen with "condor_q -l" give "undefined" errors.
e.g. the docs/example give one like:
== False) && (ExitSignal != 0)) || (ServerStartTime - JobStartdate < 3600 )
As far as I can tell there is no ServerStartTime. There is, however, a ServerTime,
but even a reference to that says it is undefined, yet I can see it with condor_q -l.
BTW this is for the Windows version, 7.2.4.
Our trial and error solution gave us the following, which seems to work
OK for our purposes. This particular test setup is for jobs that should run
for 20 minutes. Anything less than this, or more than this by 5 mins, means
something dodgy has happened, so we want to try re-running the job.


- JobCurrentStartDate) > (15 * $(MINUTE))

periodic_hold = (CurrentTime - JobCurrentStartDate) > (30 * $(MINUTE))

periodic_release = (CurrentTime - EnteredCurrentStatus) > (5 * $(MINUTE))
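
The arithmetic in the two complete expressions above can be checked outside Condor before submitting. The Python below is only a rough sketch of that arithmetic, not of how Condor itself evaluates ClassAds. It assumes that $(MINUTE) is a submit-file macro defined as 60 and that CurrentTime, JobCurrentStartDate and EnteredCurrentStatus are timestamps in seconds. The function names and the sample start time are made up for illustration.

```python
# Assumption: $(MINUTE) is a submit-file macro equal to 60 (seconds).
MINUTE = 60


def periodic_hold(current_time, job_current_start_date):
    """Mirrors: (CurrentTime - JobCurrentStartDate) > (30 * $(MINUTE))."""
    return (current_time - job_current_start_date) > 30 * MINUTE


def periodic_release(current_time, entered_current_status):
    """Mirrors: (CurrentTime - EnteredCurrentStatus) > (5 * $(MINUTE))."""
    return (current_time - entered_current_status) > 5 * MINUTE


# Hypothetical timestamps, in seconds, chosen only to exercise the thresholds.
start = 1_000_000

# A job 25 minutes into its current run is left alone; at 31 minutes it is held.
hold_at_25 = periodic_hold(start + 25 * MINUTE, start)
hold_at_31 = periodic_hold(start + 31 * MINUTE, start)

# A job that entered its current (held) state 6 minutes ago qualifies for release.
held_since = start + 31 * MINUTE
release_at_4 = periodic_release(held_since + 4 * MINUTE, held_since)
release_at_6 = periodic_release(held_since + 6 * MINUTE, held_since)

print(hold_at_25, hold_at_31, release_at_4, release_at_6)
```

Both comparisons are strict, so a job at exactly 30 minutes of run time, or exactly 5 minutes in its current state, does not yet trigger the policy in this sketch.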


Thanks for any help