Re: [Condor-users] Forcing a job classad from config file?
- Date: Tue, 15 Feb 2005 10:46:14 -0600
- From: "Doak Bane" <doak.bane@xxxxxxxxxx>
- Subject: Re: [Condor-users] Forcing a job classad from config file?
Thanks for your prompt responses. A bit more background...
I am (obviously) running version 6.7.3, on Red Hat WS 3 Update 3.
My current environment is purely for testing - soon to be deployed
in production. My site has been running Condor 6.4.7 for about 2 years
with very _minimal_ changes to, or understanding of, the default
packaged configuration. My job is to change that - to define and
implement a set of workable policies. (I am also experimenting with
some of the policies outlined in the Bologna Batch System paper.)
The target execution environment is a dedicated cluster of dual-processor
diskless/headless machines, possibly to be expanded to desktop machines
in the future. My test playground is a small environment that mimics
that.
Other comments embedded. Thanks Dan,
From: "Dan Bradley" <dan@xxxxxxxxxxxx>
The MaxJobRetirementTime setting in the config file controls how long the
startd (i.e. execution machine) will let the job run when its claim is
being retired. So that setting of MaxJobRetirementTime refers to the
machine policy and it goes in the machine ClassAd, not the job ClassAd.
There is a more obscure case where you may want to set
MaxJobRetirementTime in the job ClassAd. Doing this allows you to specify
a _shorter_ retirement time than the one granted by the machine policy.
By default, standard universe and nice-user jobs have their
MaxJobRetirementTime=0, so they don't wait around in retirement. In all
other cases, the default is to not define MaxJobRetirementTime in the job
ClassAd, so the job will use the maximum amount of retirement time granted
by the machine.
Testing has been mostly on standard universe jobs but most user jobs
are vanilla universe.
So from your post, I assume that you want MaxJobRetirementTime to be
non-zero for either standard universe or nice-user jobs. In all other
cases it should already be working. Is this correct?
I will continue testing today with vanilla-only jobs. I was not aware of
the different behavior of standard vs. vanilla universe jobs. That seems
to be the source of my confusion. My test jobs need to match the
production environment more closely.
The problem I'm trying to solve in the current production environment is
that User-A would submit thousands of vanilla jobs (one or more clusters).
Runtime for each job is typically under 1 hour. User-B submits a few
jobs and never gets access to any machines. The greater insult is that
User-A then submits more jobs and they run before the User-B jobs.
I don't necessarily want preemption to kill/checkpoint/restart the
User-A jobs, just to insert a wedge so User-B can get access to some
resources within a reasonable period of time. I stumbled on
MaxJobRetirementTime from reading this mailing list; not finding it in
the version 6.6.7 manual, I began exploring 6.7.3. It does EXACTLY what
I need: it is simple, clean, and straightforward when used with a simple
PREEMPTION_REQUIREMENTS expression based on priority, and a
shorter (than 1 day) PRIORITY_HALFLIFE.
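
A minimal sketch of the combination described above, for illustration only.
The specific values (a one-hour retirement time, a one-hour half-life, and
the 1.2 priority margin) are assumptions and are not taken from the original
message:

```
# Negotiator side: allow preemption only when the user currently running on
# the machine has a numerically larger (i.e. worse) priority than the
# candidate user by at least 20% (the margin is an assumed value).
PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2

# Accumulated usage decays with this half-life, in seconds.
# The default is 86400 (1 day); 3600 here is an assumed example.
PRIORITY_HALFLIFE = 3600

# Startd (machine) side: once preempted, a running job may continue for
# up to this many seconds before it is actually evicted (assumed value).
MAXJOBRETIREMENTTIME = 3600
```

With job runtimes typically under an hour, a retirement time of this size
would let most preempted jobs finish on their own before the machine is
handed to the other user.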
Since my testing has been with standard universe jobs, and I want
MaxJobRetirementTime in the job ClassAd to be non-zero, my first thought
is that the machine ClassAd value should be "copied" to the job ClassAd
in standard universe as well. But then you've gotta remember, I'm not a
Condor expert. There may be many good reasons not to do that.
I have verified that using SUBMIT_EXPRS to set the default
MaxJobRetirementTime in the job ClassAd does not work for standard
universe and nice-user jobs, because the value gets overwritten with 0.
Another problem is that you can't independently set the machine and job
attributes, since they both have the same name. I'll think about this and
try to provide a solution.
From my perspective, I think having that option would be nice.
One workaround that may or may not be useful to you until a fix becomes
available is to use condor_submit -a MaxJobRetirementTime=X.
Doak Bane wrote:
What I want is to force all job classads to (by default) take on the
value for MaxJobRetirementTime as defined in a config file. Just defining
a value in the config file does not pass any value to job classads. Jobs
just get truly preempted, with no chance to retire, and restart later. I
also tried this:
MaxJobRetirementTime = 3600
SUBMIT_EXPRS = MaxJobRetirementTime
With, or without, the SUBMIT_EXPRS all job classads still show:
MaxJobRetirementTime = 0
Of course, if MaxJobRetirementTime is explicitly defined in the submit
command file then things work correctly and jobs retire as expected.
Is there a way to make this work besides trickery with wrappers or
changing all submit files?
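
For reference, a sketch of what explicitly defining the attribute in a submit
command file might look like. The original message does not show the exact
form used; this assumes the "+" syntax for inserting an attribute into the
job ClassAd, and the executable name and the 3600-second value are
placeholders:

```
# Hypothetical submit description file.
universe   = vanilla
executable = my_job

# A leading "+" places the attribute directly in the job ClassAd.
# The job-side value can only shorten the retirement time granted by
# the machine policy, not extend it.
+MaxJobRetirementTime = 3600

queue
```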
Condor-users mailing list