
Re: [HTCondor-users] HTCondor: Increase requested RAM memory if a job is retried



Hi Jason and David,

Thanks a lot for the solutions; what you propose looks great!
I will try them, and I am confident they will work.
Thinking about it now, maybe the if/else from my first message was not working because NumJobStarts is not defined when the jobs are submitted,
but only after the job has run at least once.
So maybe something like the following could work:

if defined NumJobStarts
  request_memory = 2GB
else
  request_memory = 8GB
endif

Anyway, thanks again; the solutions look promising and should work.
Best,
Romain

On Thu, 3 Mar 2022 at 16:39, Jason Patton <jpatton@xxxxxxxxxxx> wrote:

Like David, we also have a recipe that we share with our local users:


periodic_release = (JobStatus == 5) && (HoldReasonCode == 34) && (NumJobStarts < 5)

request_memory = ifthenelse(MemoryUsage =!= undefined,(MAX({second-mem,MemoryUsage * 3/2})),first-mem)


Replace "first-mem" and "second-mem" with the memory values to request on the first and second attempts. Each subsequent time the job goes on hold, it will request 50% more memory than it last used, for a maximum of 5 tries.
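
To make the arithmetic concrete, here is a minimal shell sketch of the sequence of requests the recipe above should produce. It is only an illustration under assumptions: next_request is a hypothetical helper (not an HTCondor command), the values 2048 and 8192 (in MB) stand in for first-mem and second-mem, and the optional third argument plays the role of the job's MemoryUsage attribute.

```shell
#!/bin/bash
# Hypothetical helper mirroring the request_memory expression:
#   ifthenelse(MemoryUsage =!= undefined, MAX({second-mem, MemoryUsage * 3/2}), first-mem)
# Usage: next_request FIRST_MB SECOND_MB [MEMORY_USAGE_MB]
next_request() {
    local first=$1 second=$2 usage=${3:-}

    # MemoryUsage is undefined until the job has run: request first-mem.
    if [ -z "$usage" ]; then
        echo "$first"
        return
    fi

    # Otherwise request the larger of second-mem and 1.5 x MemoryUsage
    # (integer arithmetic, as in the expression above).
    local grown=$(( usage * 3 / 2 ))
    if [ "$grown" -gt "$second" ]; then
        echo "$grown"
    else
        echo "$second"
    fi
}

next_request 2048 8192        # first attempt, no MemoryUsage yet -> 2048
next_request 2048 8192 2100   # MemoryUsage was 2100 MB -> 8192
next_request 2048 8192 8200   # MemoryUsage was 8200 MB -> 12300
```

The periodic_release expression is what bounds this sequence: a job held with HoldReasonCode 34 is only released while NumJobStarts is below 5.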


Jason Patton


On 3/3/22 6:59 AM, duduhandelman@xxxxxxxxxxx wrote:
Hi Romain,
I did this in the past.
I remember that sometimes jobs didn't start, and I never found the reason for that.
Please give it a try and let me know if it works for you.

StartMemory = 1024
PlusMemory = 4096
request_memory = ifthenelse(((LastHoldReasonCode != 34) || (MemoryProvisioned != $(PlusMemory)) || IsUndefined(MemoryProvisioned)),$(StartMemory),$(PlusMemory))
periodic_release = (JobStatus == 5) && (HoldReasonCode == 34) && (MemoryProvisioned == $(StartMemory))



Thanks
David


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of romain.bouquet04@xxxxxxxxx <romain.bouquet04@xxxxxxxxx>
Sent: 03 March 2022 13:20
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor: Increase requested RAM memory if a job is retried
Hi again Gianmauro,

Thanks, but I don't think it would be a solution for my long-running jobs, as I don't want a cron process running in parallel.
But thanks a lot anyway for your answers! I really appreciate you proposing that solution.

Best,
Romain

On Thu, 3 Mar 2022 at 11:06, <gmauro@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
I have a cron job that runs the script every 5 minutes.
It works fine for us.

Gianmauro

On 3/3/22 11:01, romain.bouquet04@xxxxxxxxx wrote:
> Hi Gianmauro,
>
> Thanks for your answer, but from what I understand you launch this script
> manually, right?
> What I would like is to find a way for condor to increase the memory
> itself, as my jobs are retried automatically.
>
> Best,
> Romain
>
> On Wed, 2 Mar 2022 at 20:12, <gmauro@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>     Hi Romain,
>
>     I use this script for exactly the purpose you described.
>     It will relaunch the job with 3 times the provisioned memory until it
>     reaches a cap.
>     Every relaunch is recorded in a log file.
>
>     $ cat /usr/bin/htcondor-release-held-jobs
>
>     #!/bin/bash
>     CAP=524288 # 512 GB, expressed in MB
>     MULTIPLIER=3
>     LOG=/data/dnb01/maintenance/condor_rerun_held_jobs.log
>
>     # Create the log file on first use
>     if [ ! -f "$LOG" ]; then
>         touch "$LOG"
>         echo "Created $LOG"
>     fi
>
>     # Loop over held jobs whose HoldReasonCode is 34
>     for j in $(condor_q -hold -autoformat ClusterId HoldReasonCode | awk '$2 == 34 {print $1}')
>     do
>         JOB_DESCRIPTION=$(condor_q "$j" -autoformat JobDescription)
>         MEMORY_PROVISIONED=$(condor_q "$j" -autoformat MemoryProvisioned)
>
>         # Multiply the provisioned memory, but never exceed the cap
>         if [ $((MEMORY_PROVISIONED * MULTIPLIER)) -gt "$CAP" ]; then
>             REQUEST_MEMORY=$CAP
>         else
>             REQUEST_MEMORY=$((MEMORY_PROVISIONED * MULTIPLIER))
>         fi
>         # Short name of the host the job last ran on
>         REMOTE_HOST=$(condor_q "$j" -autoformat LastRemoteHost | cut -f2 -d@ | cut -f1 -d.)
>
>         # Record the relaunch in the log file
>         DATE_WITH_TIME=$(date "+%d/%m/%Y-%H:%M:%S")
>         echo "$DATE_WITH_TIME, rerunning held job, id $j, description $JOB_DESCRIPTION, memory_provisioned $MEMORY_PROVISIONED, request_memory $REQUEST_MEMORY, $REMOTE_HOST" >>"$LOG"
>
>         # Raise the request and release the job
>         condor_qedit "$j" RequestMemory=$REQUEST_MEMORY
>         condor_release "$j"
>     done
>
>     Hope it helps,
>     Gianmauro
>
>
>     On 3/2/22 19:48, romain.bouquet04@xxxxxxxxx wrote:
>     > Dear all,
>     >
>     > I have jobs that I set to be retried automatically by condor in case of
>     > failure.
>     > I was wondering if there is a way for condor to automatically increase
>     > the requested RAM for a job in case it failed and is retried.
>     >
>     > I was looking at NumJobStarts, which counts the number of times a job
>     > is started:
>     > https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html
>     >
>     > I was trying to add something like the following to the submit file
>     > (but it does not work), based on
>     > https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html#using-conditionals-in-the-submit-description-file
>     >
>     > if NumJobStarts == 0
>     >   request_memory = 2GB
>     > else
>     >   request_memory = 8GB
>     > endif
>     >
>     > I could use requirements with a syntax like
>     > requirements = (NumJobStarts == 0 && TARGET.Memory >= 2GB) ||
>     > (NumJobStarts >= 1 && TARGET.Memory >= 8GB)
>     > but apparently it is not recommended to request memory that way.
>     >
>     > Would anyone have a better solution?
>     >
>     > Many thanks in advance,
>     > Best,
>     > Romain Bouquet
>     >
>     > _______________________________________________
>     > HTCondor-users mailing list
>     > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>     > subject: Unsubscribe
>     > You can also unsubscribe by visiting
>     > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>     >
>     > The archives can be found at:
>     > https://lists.cs.wisc.edu/archive/htcondor-users/
>
>     --
>     Gianmauro Cuccuru
>
>     UseGalaxy.eu
>     Bioinformatics Group
>     Department of Computer Science
>     Albert-Ludwigs-University Freiburg
>     Georges-Köhler-Allee 106
>     79110 Freiburg, Germany

--
Gianmauro Cuccuru

UseGalaxy.eu
Bioinformatics Group
Department of Computer Science
Albert-Ludwigs-University Freiburg
Georges-Köhler-Allee 106
79110 Freiburg, Germany