
Re: [HTCondor-users] Are evicted jobs' memory requirements automatically adjusted?



Chris,

Thanks. My lesson here: don't write answers from memory. :)


On Thu, Feb 7, 2013 at 9:52 AM, Chris Filo Gorgolewski <krzysztof.gorgolewski@xxxxxxxxx> wrote:
I'm new to Condor, but I believe that by default ImageSize is not used directly in the Requirements expression that drives matching. It only enters through RequestMemory when the user did not specify that explicitly. Therefore, when the user specifies "request_memory", ImageSize is not used for matching even though it keeps being updated.
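
(To spell this out with a sketch; the exact default expressions can differ between HTCondor versions, so please check a real job ad rather than trusting my memory:)

    # Sketch only. Without request_memory, RequestMemory is derived from the job's own
    # measurements, roughly like this:
    RequestMemory = ifthenelse(MemoryUsage =!= undefined, MemoryUsage, (ImageSize + 1023) / 1024)
    # With "request_memory = 500" in the submit file it is just the literal:
    RequestMemory = 500
    # Either way, matching goes through a clause that condor_submit appends to the job's Requirements:
    Requirements = ... && (TARGET.Memory >= RequestMemory)

So once request_memory is given explicitly, a growing ImageSize no longer feeds back into matching.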

Best,
Chris


On 7 February 2013 18:47, Ian Chesal <ian.chesal@xxxxxxxxx> wrote:
TJ,

The matching is driven by ImageSize, no? And that references RequestMemory, which should be updated to 2000 MB after the job is bumped back to the queue.

(Sorry, I don't have a Condor queue on hand to check the ImageSize expression that gets written out when you use request_memory in your submission file, so I'm working from memory here...)
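
For anyone who does have a queue handy, something like this (with 123.0 standing in for a real job id) should show the relevant attributes:

    condor_q -long 123.0 | grep -i -E 'RequestMemory|ImageSize|MemoryUsage|Requirements'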

- Ian


On Thu, Feb 7, 2013 at 8:36 AM, John (TJ) Knoeller <johnkn@xxxxxxxxxxx> wrote:
As HTCondor is configured by default, it's 5a. If you want 5b, the job should have:

    RequestMemory = Max(500, MemoryUsage)
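
In submit-file terms that would be something like the sketch below. I have spelled the max out with ifthenelse, since MemoryUsage is only defined once the job has actually run; treat it as a sketch and check that your condor_submit accepts an expression for request_memory:

    # Sketch: keep at least 500 MB, but grow to the measured usage after an eviction.
    request_memory = ifthenelse(MemoryUsage =!= undefined && MemoryUsage > 500, MemoryUsage, 500)

After the fact, condor_qedit <cluster>.<proc> RequestMemory <MB> can also be used to bump a single idle job by hand.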

But you probably don't really want that.   Jobs that have MemoryUsage > RequestMemory should
probably not match anything until the user intervenes and sets RequestMemory correctly. 

Keep in mind that jobs that lie about their memory usage like this will waste time on larger and
larger slots as they hunt for one that can accommodate them.

-tj


On 2/6/2013 8:48 AM, Chris Filo Gorgolewski wrote:
When a job gets suspended and evicted, are its memory requirements adjusted to the actual values measured during its execution?

For example:
1. A job with "request_memory = 500" is submitted (a minimal submit sketch follows after this list).
2. The job gets assigned to a node, starts running.
3. It allocates 2000MB.
4. The job gets suspended and evicted (the reason does not really matter in this example).
5a. Job gets resubmitted to another node with memory requirement of 500MB (as requested by the user)
5b. Job gets resubmitted to another node with memory requirement of 2000MB (as measured during previous execution)
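
(For concreteness, step 1 might use a minimal submit file like the one below; the executable name is made up:)

    # Hypothetical minimal submit file for step 1.
    executable     = my_job.sh
    request_memory = 500
    queue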

So which one is it, 5a or 5b?

If 5a, how can I achieve 5b (increasing the memory requirement to the maximum memory allocation from the previous failed execution)?

This is especially important when using the policy described here: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitMemoryUsage
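
(As I read that page, the eviction side is roughly a startd configuration along these lines. I am paraphrasing from memory, the page itself has the authoritative recipe, and the $(PREEMPT)/$(WANT_SUSPEND) references assume those macros were already defined earlier in the config:)

    # Rough sketch of the idea only; see the wiki page above for the real recipe.
    # Evict (rather than suspend) a job whose measured usage exceeds the slot's provisioned memory.
    MEMORY_EXCEEDED = MemoryUsage =!= undefined && MemoryUsage > Memory
    PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
    WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE

If the evicted job then comes back with its original RequestMemory (case 5a), it would presumably just hit the same limit again, hence the question.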

Thanks in advance!

Best,
Chris


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

