Re: [Condor-users] Dynamic memory checking on ClassAd



A quick look at condor_sysapi/virt_mem.c says sysapi_swap_space is dwAvailPageFile/1024L on Windows and sysinfo:freeswap*sysinfo:mem_unit on Linux, which is what condor_startd.V6/ResAttributes.cpp publishes as TotalVirtualMemory.
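
As a rough illustration of the Linux arithmetic above, here is a minimal Python sketch. The helper name and the division by 1024 to get Kbytes are assumptions on my part (the docs describe VirtualMemory in Kbytes); the actual C code in virt_mem.c may differ in detail.

```python
def swap_space_kb(freeswap: int, mem_unit: int) -> int:
    """Hypothetical helper: free swap in Kbytes from sysinfo-style fields."""
    # sysinfo reports freeswap as a count of mem_unit-byte units,
    # so the product is bytes; dividing by 1024 gives Kbytes (assumed unit).
    return (freeswap * mem_unit) // 1024


# Sample values, not taken from a real machine: 524288 units of 4096 bytes.
print(swap_space_kb(524288, 4096))  # 2097152
```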

Ian is spot on, and the value is recomputed at least as often as your UPDATE_INTERVAL.

The VirtualMemory attribute is a fraction of the total for each slot, controlled by a per slot /swap/ specification.
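
To make the per-slot split concrete, a small Python sketch follows. The helper names, the 25% swap share, and the sample numbers are all hypothetical. It only mimics how a check such as `Memory >= 650 && VirtualMemory >= 200000` (200 MB expressed in Kbytes) would evaluate against one slot's ad, not how the startd or negotiator actually implement it.

```python
def slot_virtual_memory_kb(total_virtual_memory_kb: int, swap_fraction: float) -> int:
    """Hypothetical: one slot's share of the machine-wide swap, in Kbytes."""
    return int(total_virtual_memory_kb * swap_fraction)


def slot_matches(slot_ad: dict, min_memory_mb: int, min_virtual_kb: int) -> bool:
    """Mimic a Requirements check against a slot ad (Memory in MB, VirtualMemory in KB)."""
    return (slot_ad["Memory"] >= min_memory_mb
            and slot_ad["VirtualMemory"] >= min_virtual_kb)


# Assumed machine: 400000 KB of free swap, this slot is given a 25% share.
total_kb = 400_000
slot_ad = {"Memory": 1024, "VirtualMemory": slot_virtual_memory_kb(total_kb, 0.25)}

print(slot_ad["VirtualMemory"])             # 100000
print(slot_matches(slot_ad, 650, 200_000))  # False: the slot's share is below 200 MB
```

Note that in this example the machine as a whole still has more than 200 MB of swap free, so a test against TotalVirtualMemory would pass where the per-slot VirtualMemory test fails.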

Best,


matt

BYONG WU CHONG wrote:
> Thank you, Ian
> 
> I knew someone could answer this question, though I was surprised that the solution takes an insider's view.
> 
> I think your suggestion will work. I will try your idea and report the result here.
> 
> - Byong-Wu
> 
> 
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
> Sent: Thursday, October 29, 2009 8:39 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Dynamic memory checking on ClassAd
> 
>> Is there an easy way to prevent a job from starting on a machine which has
>> a temporary memory shortage problem?
> 
> Kind of. Condor isn't 100% realtime about its system metric tracking. So you can't get a perfect solution, but you can probably get close. The key is making sure the system metrics you're interested in are captured in the machine's classad.
> 
> 
>> I use the following Requirements.
>>       Requirements      = Memory >= 650
> 
> This expression only checks against the statically allocated memory that a slot has. Memory isn't a dynamic value. Condor detects TotalMemory in the box when it starts up, divides it between your slots, and that's how they get their Memory attributes set.
> 
> 
>> But this causes a problem if the target machine has only about 100 MB of
>> memory due to zombie processes. The schedd process will start new
>> jobs on the target machine and jobs will be killed. Very soon, the
>> whole job list gets exhausted due to just one problematic machine in the
>> cluster.
>>
>> I know that zombie processes must be removed soon, but I want to make
>> Condor act smartly on this unfortunate event and save the job list to
>> itself. So my question is this.
>>
>> *** I want to prevent schedd from starting new jobs when the current
>> available virtual space is smaller than the given threshold value. ***
>>
>>
>> So what I want is something like this.
>>       Requirements = Memory >= 650 && Dynamic_Available_Memory_Size >= 200
>>
>> I tried setting the Image_Size attribute to 200000 KB, following this manual section.
>>
>> http://www.cs.wisc.edu/condor/manual/v7.3/2_5Submitting_Job.html#2336
>>
>> But, somehow, the job was submitted to the machine which had less than
>> 200MB virtual memory space and eventually this job was killed due to
>> memory shortage. I changed Image_Size to about 700MB, but still schedd
>> assigned jobs to the problematic machine.
>>
>>
>>
>> Can you help me on this issue?
> 
> From the docs, the VirtualMemory attribute represents:
> 
>         The amount of currently available virtual memory (swap space) expressed in Kbytes.
> 
> (ref: http://www.cs.wisc.edu/condor/manual/v7.2/Appendix_A_ClassAd.html#77198)
> 
> There's no doc reference for TotalVirtualMemory, but based on how the other Total* attributes in machine classads behave I'll go out on a limb and guess that it's the sum of the VirtualMemory attributes across all slots, in Kbytes. So that means you want:
> 
>         Requirements = Memory >= 650 && TotalVirtualMemory >= 200000
> 
> But like I said above: it's possible that TotalVirtualMemory will lag reality by a few minutes or more.
> 
> Someone from the Condor team can confirm or deny the TotalVirtualMemory attribute behaviour and let you know how often it's updated and how accurate it is for each operating system.
> 
> - Ian
> 
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> The archives can be found at: 
> https://lists.cs.wisc.edu/archive/condor-users/
> _______________________________________________