
Re: [HTCondor-users] HTCondor: Increase requested RAM memory if a job is retried



I have a cron job that runs the script every 5 minutes.
It works fine for us.
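For reference, a user crontab entry like the one below would run the script every 5 minutes. It is a sketch that assumes two things: the script is installed at /usr/bin/htcondor-release-held-jobs, as in the quoted message, and the crontab belongs to an account allowed to run condor_qedit and condor_release on the affected jobs (the job owner or a queue super user).

```
# minute hour day-of-month month day-of-week command
*/5 * * * * /usr/bin/htcondor-release-held-jobs
```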

Gianmauro

On 3/3/22 11:01, romain.bouquet04@xxxxxxxxx wrote:
Hi Gianmauro,

Thanks for your answer, but from what I understand you launch this script manually, right? What I would like is a way for condor to increase the memory itself, since my jobs are retried automatically.

Best,
Romain

On Wed, 2 Mar 2022 at 20:12, <gmauro@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

    Hi Romain,

    I use this script for exactly the purpose you described.
    It releases each job that was held for exceeding its memory limit,
    raising its memory request to three times the memory it was
    provisioned, until the request reaches a cap.
    Every release is recorded in a log file.
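To see how quickly the request grows, the sketch below prints the successive RequestMemory values (in MB) when each release multiplies the previous value by 3 and clamps it at 524288 MB, the MULTIPLIER and CAP used in the script. The `escalation` function is a hypothetical helper, not part of the script, and it assumes MemoryProvisioned equals the previous request; in practice the provisioned amount can be slightly larger.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the RequestMemory (MB) a job would get on each
# successive release, multiplying by MULTIPLIER and clamping at CAP.
# Assumes MemoryProvisioned == previous request on every attempt.
escalation() {
    local mem=$1 multiplier=$2 cap=$3
    while (( mem < cap )); do
        mem=$(( mem * multiplier ))
        if (( mem > cap )); then
            mem=$cap
        fi
        echo "$mem"
    done
}

# Starting from a 2048 MB request, with the script's MULTIPLIER=3 and CAP=524288
escalation 2048 3 524288
```

Starting from 2048 MB this prints 6144, 18432, 55296, 165888, 497664 and finally 524288, so a job could be released six times before it reaches the cap.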

    $ cat /usr/bin/htcondor-release-held-jobs

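    #!/bin/bash
    # Release jobs held for exceeding their memory limit (HoldReasonCode 34),
    # raising RequestMemory to MULTIPLIER times the memory they were
    # provisioned, capped at CAP. Every release is logged to LOG.
    CAP=524288   # MB, i.e. 512GB
    MULTIPLIER=3
    LOG=/data/dnb01/maintenance/condor_rerun_held_jobs.log

    if [ ! -f "$LOG" ]; then
        touch "$LOG"
        echo "Created $LOG"
    fi

    # One job ID (ClusterId.ProcId) per held job whose hold reason code is 34
    for j in $(condor_q -hold -autoformat ClusterId ProcId HoldReasonCode |
               awk '$3 == 34 {print $1 "." $2}')
    do
        JOB_DESCRIPTION=$(condor_q "$j" -autoformat JobDescription)
        MEMORY_PROVISIONED=$(condor_q "$j" -autoformat MemoryProvisioned)

        # Skip jobs with no numeric MemoryProvisioned, and jobs that already
        # ran at the cap, since a bigger request is no longer possible.
        case "$MEMORY_PROVISIONED" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$MEMORY_PROVISIONED" -ge "$CAP" ]; then
            continue
        fi

        REQUEST_MEMORY=$(( MEMORY_PROVISIONED * MULTIPLIER ))
        if [ "$REQUEST_MEMORY" -gt "$CAP" ]; then
            REQUEST_MEMORY=$CAP
        fi

        # Short name of the last execute host (slot1@host.domain -> host)
        REMOTE_HOST=$(condor_q "$j" -autoformat LastRemoteHost | cut -f2 -d@ | cut -f1 -d.)

        DATE_WITH_TIME=$(date "+%d/%m/%Y-%H:%M:%S")
        echo "$DATE_WITH_TIME, rerunning held job, id $j," \
             "description $JOB_DESCRIPTION," \
             "memory_provisioned $MEMORY_PROVISIONED," \
             "request_memory $REQUEST_MEMORY, $REMOTE_HOST" >> "$LOG"

        condor_qedit "$j" RequestMemory "$REQUEST_MEMORY"
        condor_release "$j"
    done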
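The job-selection step can be tried out without a pool by feeding it text shaped like `condor_q -hold -autoformat ClusterId ProcId HoldReasonCode` output (one job per line, space-separated). HoldReasonCode 34 is the code HTCondor uses when a job's memory usage exceeded a memory limit. Selecting ClusterId.ProcId rather than ClusterId alone means each proc in a cluster is edited and released on its own. The `select_memory_held` function and the job IDs below are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "ClusterId ProcId HoldReasonCode" lines on stdin
# and print the job IDs (ClusterId.ProcId) held with reason code 34.
select_memory_held() {
    awk '$3 == 34 { print $1 "." $2 }'
}

# Made-up sample input; in the script this comes from
#   condor_q -hold -autoformat ClusterId ProcId HoldReasonCode
printf '%s\n' '101 0 34' '101 1 34' '102 0 1' '103 0 34' | select_memory_held
```

With the sample input above, the function prints 101.0, 101.1 and 103.0, one per line; job 102.0 is skipped because it was held for a different reason.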

    Hope it helps,
    Gianmauro


    On 3/2/22 19:48, romain.bouquet04@xxxxxxxxx wrote:
     > Dear all,
     >
     > I have jobs that I set to be retried automatically by condor in case of
     > failure.
     > I was wondering if there is a way for condor to automatically increase
     > the requested RAM for a job when it failed and is retried.
     >
     > I was looking at NumJobStarts, which counts the number of times a job
     > is started:
     > https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html
     >
     > And I was trying to add something like the following to the submit file
     > (but it does not work), based on
     > https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html#using-conditionals-in-the-submit-description-file
     >
     > if NumJobStarts == 0
     >     request_memory = 2GB
     > else
     >     request_memory = 8GB
     > endif
     >
     > I could use requirements with a syntax like
     > requirements = (NumJobStarts == 0 && TARGET.Memory >= 2048) ||
     >                (NumJobStarts >= 1 && TARGET.Memory >= 8192)
     > but apparently it is not recommended to request memory that way.
     >
     > Would anyone have a better solution?
     >
     > Many thanks in advance,
     > Best,
     > Romain Bouquet
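One possible alternative, sketched here and not tested in this thread, is to make request_memory a ClassAd expression instead of a submit-file if/else. The if/else above is evaluated once by condor_submit, where NumJobStarts is not a submit variable. An expression stored in the job's RequestMemory attribute, by contrast, is evaluated against the job ad whenever the job is matched, so a retried job can be matched to a larger slot. The values are in MB and mirror the 2GB/8GB in the question; `max_retries = 3` is an assumption about how the retries are configured.

```
# Hypothetical fragment: request 2048 MB on the first start and 8192 MB
# on later starts. NumJobStarts may be undefined before the first start.
request_memory = ifThenElse(NumJobStarts =!= undefined && NumJobStarts > 0, 8192, 2048)
max_retries    = 3
```

NumJobStarts is incremented when a run starts, so the expression already evaluates to 8192 while the first attempt is running; the larger value only matters at the next match.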
     > _______________________________________________
     > HTCondor-users mailing list
     > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
     > with a subject: Unsubscribe
     > You can also unsubscribe by visiting
     > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
     >
     > The archives can be found at:
     > https://lists.cs.wisc.edu/archive/htcondor-users/

    --
    Gianmauro Cuccuru

    UseGalaxy.eu
    Bioinformatics Group
    Department of Computer Science
    Albert-Ludwigs-University Freiburg
    Georges-Köhler-Allee 106
    79110 Freiburg, Germany