
Re: [HTCondor-users] HTCondor: Increase requested RAM memory if a job is retried



Hi Romain,
I did this in the past.
I remember that sometimes jobs didn't start, and I never found the reason for that.
Please give it a try and let me know if it works for you.

StartMemory = 1024
PlusMemory = 4096
# Request PlusMemory only after a memory hold (code 34) on a run that had StartMemory
request_memory = ifthenelse(((LastHoldReasonCode != 34) || (MemoryProvisioned != $(StartMemory)) || IsUndefined(MemoryProvisioned)),$(StartMemory),$(PlusMemory))
periodic_release = (JobStatus == 5) && (HoldReasonCode == 34) && (MemoryProvisioned == $(StartMemory))
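
The two-step policy these lines aim for (first attempt with StartMemory, one retry with PlusMemory after a memory hold, hold reason code 34) can be sketched as a small shell function. This is only an illustration of the decision logic under those assumptions, not something HTCondor runs; `pick_request_memory` and its arguments are hypothetical names.

```shell
#!/bin/bash
# Hypothetical helper: mirrors the two-step memory policy outside HTCondor.
START_MEMORY=1024   # MB, used for the first attempt
PLUS_MEMORY=4096    # MB, used for the retry after a memory hold

# Usage: pick_request_memory <last_hold_reason_code> <memory_provisioned>
# Pass an empty string for a value that is still undefined.
pick_request_memory() {
    local last_hold_code="$1" provisioned="$2"
    # Ask for more memory only if the job was last held for exceeding
    # memory (code 34) while it was running with the starting amount.
    if [ "$last_hold_code" = "34" ] && [ "$provisioned" = "$START_MEMORY" ]; then
        echo "$PLUS_MEMORY"
    else
        echo "$START_MEMORY"
    fi
}

pick_request_memory "" ""      # first attempt: prints 1024
pick_request_memory 34 1024    # after a memory hold: prints 4096
```

A job held again after running with PlusMemory is not released a second time, because the release condition only matches runs that had StartMemory.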



Thanks
David


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of romain.bouquet04@xxxxxxxxx <romain.bouquet04@xxxxxxxxx>
Sent: 03 March 2022 13:20
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor: Increase requested RAM memory if a job is retried
 
Hi again Gianmauro,

Thanks, but I don't think this would be a "solution" for my jobs, which run for a long time: I don't want a cron process running in parallel.
Thanks a lot anyway for your answers! I really appreciate you proposing that solution.

Best,
Romain

On Thu, 3 Mar 2022 at 11:06, <gmauro@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
I have a cron job that runs the script every 5 minutes.
It works fine for us.

Gianmauro

On 3/3/22 11:01, romain.bouquet04@xxxxxxxxx wrote:
> Hi Gianmauro,
>
> Thanks for your answer, but from what I understand you launch this script
> manually, right?
> What I would like is to find a way for condor to increase the memory
> itself, as my jobs are retried automatically.
>
> Best,
> Romain
>
> On Wed, 2 Mar 2022 at 20:12, <gmauro@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>     Hi Romain,
>
>     I use this script for exactly the purpose you described.
>     It will relaunch the job with 3 times the memory previously provisioned
>     until it reaches a cap.
>     Every relaunch is recorded in a log file.
>
>     $ cat /usr/bin/htcondor-release-held-jobs
>
>     #!/bin/bash
>     # Release jobs held for exceeding memory (HoldReasonCode 34) with more memory.
>     CAP=524288 # 512 GB, in MB
>     MULTIPLIER=3
>     LOG=/data/dnb01/maintenance/condor_rerun_held_jobs.log
>
>     if [ ! -f "$LOG" ]; then
>         touch "$LOG"
>         echo "Created $LOG"
>     fi
>
>     # Loop over the cluster IDs of held jobs whose hold reason code is 34
>     for j in $(condor_q -hold -autoformat ClusterId HoldReasonCode | awk '$2 == 34 {print $1}')
>     do
>         JOB_DESCRIPTION=$(condor_q "$j" -autoformat JobDescription)
>         MEMORY_PROVISIONED=$(condor_q "$j" -autoformat MemoryProvisioned)
>
>         # Multiply the provisioned memory, but never exceed the cap
>         REQUEST_MEMORY=$((MEMORY_PROVISIONED * MULTIPLIER))
>         if [ "$REQUEST_MEMORY" -gt "$CAP" ]; then
>             REQUEST_MEMORY=$CAP
>         fi
>         # Short host name of the machine the job last ran on
>         REMOTE_HOST=$(condor_q "$j" -autoformat LastRemoteHost | cut -f2 -d@ | cut -f1 -d.)
>
>         DATE_WITH_TIME=$(date "+%d/%m/%Y-%H:%M:%S")
>         echo "$DATE_WITH_TIME, rerunning held job, id $j, description $JOB_DESCRIPTION, memory_provisioned $MEMORY_PROVISIONED, request_memory $REQUEST_MEMORY, $REMOTE_HOST" >>"$LOG"
>
>         condor_qedit "$j" RequestMemory=$REQUEST_MEMORY
>         condor_release "$j"
>     done
>
>     Hope it helps,
>     Gianmauro
>
>
>     On 3/2/22 19:48, romain.bouquet04@xxxxxxxxx wrote:
>      > Dear all,
>      >
>      > I have jobs that I set to be retried automatically by condor in case of
>      > failure.
>      > I was wondering if there is a way for condor to automatically increase
>      > the requested RAM for a job in case it failed and it is retried.
>      >
>      > I was looking at NumJobStarts, which counts the number of times a job
>      > is started:
>      > https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html
>      >
>      > And I was trying to add something like the following to the submit file
>      > (but it does not work), based on
>      > https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html#using-conditionals-in-the-submit-description-file
>      >
>      >
>      > if NumJobStarts == 0
>      >     request_memory = 2GB
>      > else
>      >     request_memory = 8GB
>      > endif
>      >
>      > I could use requirements with a syntax like
>      > requirements = (NumJobStarts == 0 && TARGET.Memory >= 2GB) ||
>      > (NumJobStarts >= 1 && TARGET.Memory >= 8GB)
>      > But apparently it is not recommended to request memory that way.
>      >
>      > Would anyone have a better solution?
>      >
>      > Many thanks in advance
>      > Best,
>      > Romain Bouquet
>      >
>      > _______________________________________________
>      > HTCondor-users mailing list
>      > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>      > with a subject: Unsubscribe
>      > You can also unsubscribe by visiting
>      > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>      >
>      > The archives can be found at:
>      > https://lists.cs.wisc.edu/archive/htcondor-users/
>
>     --
>     Gianmauro Cuccuru
>
>     UseGalaxy.eu
>     Bioinformatics Group
>     Department of Computer Science
>     Albert-Ludwigs-University Freiburg
>     Georges-Köhler-Allee 106
>     79110 Freiburg, Germany
