
Re: [HTCondor-users] HTCondor: Increase requested RAM memory if a job is retried



Romain,

The submit-conditional approach doesn't work because "NumJobStarts" is a job attribute, and during submission, no job has yet been created.

Think of the submit file as a recipe for creating the job. Once the cake is baked / job is queued, there's no longer any need for the recipe, and your if/then/else statement is in the recipe, not the job.

(One minor note - 4GB is actually interpreted as four giga-blocks by condor_submit, or two gigabytes. Use "4G" instead.)

You'll want to use a ClassAd expression for this purpose, since that's baked into the job and is managed by the scheduler. To replicate your if/then/else statement as an expression for request_memory, you can do this in the submit file:

request_memory = ifThenElse(isUndefined(NumJobStarts) || NumJobStarts == 0, 2048, 8192)

That will work in any version of HTCondor back to 2014 and beyond. An alternative syntax supported by the newer releases is:

request_memory = ((NumJobStarts?:0) == 0) ? 2048 : 8192

When your job restarts, this expression will evaluate to 8192 and your job will get an 8-gigabyte slot allocation.

Your requirements expression below wouldn't affect the RequestMemory attribute; it would only make sure that the machine on which the job is to be run has at least that amount of memory available.

You can get as fancy as you wish in the request_memory expression:

request_memory =  ((NumJobStarts?:0) == 0) \
			? 2048 \
			: (NumJobStarts == 1 \
				? 8192 \
				: 16384)

This sets it to 2g on the first run, 8g on the second run, and 16g on the third and subsequent runs. The ?: operator substitutes in "0" if NumJobStarts is undefined. That is probably a belt-and-suspenders approach, because NumJobStarts is, as far as we know, always initialized when the job is submitted, but why leave that loose end?

We don't need to use the ?: operator in the second instance of NumJobStarts because if we're in the "false" section of the first "equals 0" test, we know that the attribute is defined.
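
If you want to sanity-check the tiers before submitting, the same logic can be mirrored in plain Python. This is only a rough model of the three-tier expression as written above, not HTCondor code: the function name request_memory_mb is made up for illustration, and None stands in for an undefined NumJobStarts.

```python
from typing import Optional


def request_memory_mb(num_job_starts: Optional[int]) -> int:
    """Hypothetical model of the three-tier request_memory expression (MB)."""
    # Mirrors (NumJobStarts ?: 0): fall back to 0 when the attribute is undefined.
    starts = 0 if num_job_starts is None else num_job_starts
    if starts == 0:
        return 2048   # first run
    if starts == 1:
        return 8192   # second run
    return 16384      # third and subsequent runs


if __name__ == "__main__":
    # Print the tier chosen for an undefined value and a few start counts.
    for starts in (None, 0, 1, 2, 5):
        print(starts, request_memory_mb(starts))
```

The model only covers the arithmetic; when and how the scheduler re-evaluates the expression is up to HTCondor itself.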

You can even do math against any other attributes in the job. For instance, the CST Microwave Studio product has "acceleration token" licenses that stack up based on the number of GPU cards the job is using, so the number of tokens is a function of the RequestGPUs attribute:

          MY.CSTAccelTokensGPU='(RequestGpus?:0) <= 2 \
              ? (RequestGpus?:0) \
              : (RequestGpus <= 4 ? 3 \
                  :(RequestGpus <= 8 ? 4 \
                      : (5 + int((RequestGpus - 1) / 16))))'

This translates to: 0 GPU is 0 tokens, 1 GPU is 1, 2 GPUs is 2, 3 is 3, 4 is 3, 5 to 8 is 4, 9 to 16 is 5, and then one additional token for each additional 16 GPUs. Fold this into the concurrency limit expression and you're good to go.
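
The token table can be double-checked the same way with a small Python sketch of the expression. The function name cst_accel_tokens is hypothetical, the default of 0 stands in for the (RequestGpus?:0) fallback, and the floor division assumes a non-negative GPU count, where it matches the truncation done by int().

```python
def cst_accel_tokens(request_gpus: int = 0) -> int:
    """Hypothetical model of the CSTAccelTokensGPU expression."""
    if request_gpus <= 2:
        return request_gpus        # 0, 1 or 2 GPUs: one token per GPU
    if request_gpus <= 4:
        return 3                   # 3 or 4 GPUs
    if request_gpus <= 8:
        return 4                   # 5 to 8 GPUs
    # 9 to 16 GPUs need 5 tokens, plus one more per additional 16 GPUs.
    return 5 + (request_gpus - 1) // 16


if __name__ == "__main__":
    # Print the token count at each tier boundary.
    for gpus in (0, 1, 2, 3, 4, 5, 8, 9, 16, 17, 32, 33):
        print(gpus, cst_accel_tokens(gpus))
```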

I hope this helps!

Michael V. Pelletier
Digital Technology
HPC Support Team
Raytheon Missiles and Defense

>     On 3/2/22 19:48, romain.bouquet04@xxxxxxxxx wrote:
>      > Dear all,
>      >
>      > I have jobs that I set to be retried automatically by condor in case of
>      > failure.
>      > I was wondering if there is a way for condor to automatically increase
>      > the requested RAM for a job in case it failed and it is retried.
>      >
>      > I was looking at the NumJobStarts which counts the number of times a job
>      > is started
>      >
>      > https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html
>      >
>      > And I was trying to add something as below in the submit file (but it
>      > does not work):
>      > (based on
>      > https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html#using-conditionals-in-the-submit-description-file)
>      >
>      > if NumJobStarts == 0
>      >    request_memory = 2GB
>      > else
>      >    request_memory = 8GB
>      > endif
>      >
>      > I could use requirement with a syntax like
>      > requirement = (NumJobStarts == 0 && TARGET.Memory >= 2GB) ||
>      > (NumJobStarts >= 1 && TARGET.Memory >= 8GB)
>      > But apparently it is not recommended to request memory that way
>      >
>      > Would anyone have a better solution?
>      >
>      > Many thanks in advance
>      > Best,
>      > Romain Bouquet