
Re: [Condor-users] Forcing a job classad from config file?

Thanks for your prompt responses. A bit more background...

I am (obviously) running version 6.7.3 - on RedHat WS 3 Update 3.
My current environment is purely for testing - soon to be deployed
in production. My site has been running Condor 6.4.7 for about 2 years
with very _minimal_ changes to, or understanding of, the default
packaged configuration. My job is to change that - to define and
implement a set of workable policies. (I am also experimenting with
some of the policies outlined in the Bologna Batch System paper.)
The target execution environment is a dedicated cluster of dual
processor diskless/headless machines - maybe to be expanded to
desktop machines in the future. My test playground is a small
environment that mimics that.

Other comments embedded. Thanks Dan,
Doak Bane

From: "Dan Bradley" <dan@xxxxxxxxxxxx>

The MaxJobRetirementTime setting in the config file controls how long the startd (i.e. execution machine) will let the job run when its claim is being retired. So that setting of MaxJobRetirementTime refers to the machine policy and it goes in the machine ClassAd, not the job ClassAd.

There is a more obscure case where you may want to set MaxJobRetirementTime in the job ClassAd. Doing this allows you to specify a _shorter_ retirement time than the one granted by the machine policy. By default, standard universe and nice-user jobs have their MaxJobRetirementTime=0, so they don't wait around in retirement. In all other cases, the default is to not define MaxJobRetirementTime in the job ClassAd, so the job will use the maximum amount of retirement time granted by the machine.

Testing has been mostly on standard universe jobs but most user jobs are vanilla universe.

So from your post, I assume that you want MaxJobRetirementTime to be non-zero for either standard universe or nice-user jobs. In all other cases it should already be working. Is this correct?

I will continue testing today with vanilla only jobs. I was not aware of the different behavior with standard vs. vanilla universes. That seems to be the source of my confusion. My test jobs need to more closely match the production environment.

The problem I'm trying to solve in the current production environment is
that User-A would submit thousands of vanilla jobs (one or more clusters).
Runtime for each job is typically under 1 hour. User-B submits a few
jobs and never gets access to any machines. The greater insult is that
User-A then submits more jobs and they run before the User-B jobs.
I don't necessarily want preemption to kill/checkpoint/restart the
User-A jobs, just to insert a wedge so User-B can get access to some
resources within a reasonable period of time. I stumbled on
MaxJobRetirementTime from reading this mailing list - not finding
it in the version 6.6.7 manual, I began exploring 6.7.3. It does EXACTLY
what I need - simple, clean, and straightforward when used with a
simple PREEMPTION_REQUIREMENTS expression based on priority, and a
shorter (than 1 day) PRIORITY_HALFLIFE.
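
A minimal sketch of such a policy in the config file, using only the settings named above. The numeric values (one hour of retirement, the 1.2 priority factor, a one-hour half-life) are illustrative assumptions, not the settings actually tested here:

   # Machine policy: let a running job keep going for up to 1 hour
   # once its claim is being retired.
   MaxJobRetirementTime = 3600
   # Negotiator: preempt only when the running user's priority value is
   # noticeably worse (larger) than that of the user waiting for a machine.
   PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2
   # Let accumulated usage decay with a 1 hour half-life instead of 1 day.
   PRIORITY_HALFLIFE = 3600

With jobs that typically finish in under 1 hour, a retirement time of this size should let User-A's running jobs complete before the machine is handed to User-B.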

Since my testing has been with standard universe, and I want the
MaxJobRetirementTime job classad value to be non-zero, my first thought
is that the machine classad value should be "copied" to the job classad
in standard universe as well - but then you've gotta remember, I'm not a
Condor expert. There may be many good reasons not to do that.

I have verified that using SUBMIT_EXPRS to set the default MaxJobRetirementTime in the job ClassAd does not work for standard universe and nice-user jobs, because the value is overwritten with 0. Another problem is that you can't independently set the machine and job attributes, since they both have the same name. I'll think about this and try to provide a solution.

From my perspective, I think having that option would be nice.

One workaround that may or may not be useful to you until a fix becomes available is to use condor_submit -a MaxJobRetirementTime=X.

--Dan Bradley

Doak Bane wrote:

What I want is to force all job classads to (by default) take on the value for MaxJobRetirementTime as defined in a config file. Just defining a value in the config file does not pass any value to job classads. Jobs just get truly preempted, with no chance to retire, and restart later. I also tried this:
MaxJobRetirementTime = 3600
SUBMIT_EXPRS = MaxJobRetirementTime

With, or without, the SUBMIT_EXPRS all job classads still show:
   MaxJobRetirementTime = 0

Of course, if MaxJobRetirementTime is explicitly defined in the submit command file then things work correctly and jobs retire as expected.

Is there a way to make this work besides trickery with wrappers or changing all submit files?

Doak Bane
Condor-users mailing list