
Re: [Condor-users] Dynamic memory checking on ClassAd



Thank you, Ian

I knew someone could answer this question, though I was surprised to see that the solution takes insider knowledge.

I think your suggestion will work. I will try your idea and report the result here.

- Byong-Wu


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: Thursday, October 29, 2009 8:39 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Dynamic memory checking on ClassAd

> Is there an easy way to prevent starting a job to the machine which has
> a temporary memory shortage problem?

Kind of. Condor isn't 100% realtime about its system metric tracking. So you can't get a perfect solution, but you can probably get close. The key is making sure the system metrics you're interested in are captured in the machine's classad.


> I use the following Requirements.
>       Requirements      = Memory >= 650

This expression only checks against the statically allocated memory that a slot has. Memory isn't a dynamic value. Condor detects TotalMemory in the box when it starts up, divides it between your slots, and that's how they get their Memory attributes set.
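As a rough illustration of why Memory never reflects a temporary shortage, here is a minimal Python sketch of the static split described above. It is a simplified model, not Condor's code: the function name static_slot_memory and the even split are assumptions for illustration.

```python
def static_slot_memory(total_memory_mb: int, num_slots: int) -> list[int]:
    """Model of the startup-time split: each slot gets a fixed share.

    Hypothetical helper for illustration only; assumes an even split.
    """
    if num_slots <= 0:
        raise ValueError("num_slots must be positive")
    return [total_memory_mb // num_slots] * num_slots


# A 4096 MB box with 4 slots advertises the same Memory per slot
# no matter how much memory is actually free later on.
slots = static_slot_memory(4096, 4)
print(slots)
```

Because the value is computed once, a check like Memory >= 650 keeps passing even when zombie processes have eaten most of the free memory.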


> But this causes a problem if the target machine has only about 100 MB of
> memory available due to zombie processes. The schedd process will start new
> jobs on the target machine and the jobs will be killed. Very soon, the
> whole job list gets exhausted due to just one problematic machine in the
> cluster.
>
> I know that zombie processes must be removed soon, but I want to make
> Condor act smartly in this unfortunate event and preserve the job
> list. So my question is this.
>
> *** I want to prevent schedd from starting new jobs when the current
> available virtual space is smaller than the given threshold value. ***
>
>
> So what I want is something like this.
>       Requirements = Memory >= 650 && Dynamic_Available_Memory_Size >= 200
>
> I tried setting the Image_Size attribute to 200000 KB according to this manual section.
>
> http://www.cs.wisc.edu/condor/manual/v7.3/2_5Submitting_Job.html#2336
>
> But, somehow, the job was submitted to the machine which had less than
> 200MB virtual memory space and eventually this job was killed due to
> memory shortage. I changed Image_Size to about 700MB, but still schedd
> assigned jobs to the problematic machine.
>
>
>
> Can you help me on this issue?

From the docs, the VirtualMemory attribute represents:

        The amount of currently available virtual memory (swap space) expressed in Kbytes.

(ref: http://www.cs.wisc.edu/condor/manual/v7.2/Appendix_A_ClassAd.html#77198)

There's no doc reference for TotalVirtualMemory, but based on how the other Total* attributes in machine classads behave I'll go out on a limb and guess that it's the sum of the VirtualMemory attributes across all slots, in Kbytes. So that means you want:

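        Requirements = Memory >= 650 && TotalVirtualMemory >= 200 * 1024

(If TotalVirtualMemory is in Kbytes as guessed, a 200 MB threshold has to be written as 200 * 1024, not 200.)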

But like I said above: it's possible that TotalVirtualMemory will lag reality by a few minutes or more.
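
To make the unit handling concrete, here is a small Python sketch of the check this Requirements expression is meant to perform. It is only a model of the comparison, not how Condor evaluates classads: the function meets_requirements and the sample ads are hypothetical, Memory is taken to be in Mbytes, and TotalVirtualMemory is assumed to be in Kbytes as guessed above.

```python
KB_PER_MB = 1024


def meets_requirements(machine_ad: dict, min_memory_mb: int, min_virtual_mb: int) -> bool:
    """Return True if the machine ad passes both memory checks.

    Hypothetical helper: Memory is compared in Mbytes, while
    TotalVirtualMemory (assumed Kbytes) is compared against the
    threshold converted from Mbytes to Kbytes.
    """
    memory_ok = machine_ad["Memory"] >= min_memory_mb
    virtual_ok = machine_ad["TotalVirtualMemory"] >= min_virtual_mb * KB_PER_MB
    return memory_ok and virtual_ok


# Made-up sample ads: same static Memory, different swap availability.
healthy = {"Memory": 1024, "TotalVirtualMemory": 2_000_000}
starved = {"Memory": 1024, "TotalVirtualMemory": 100_000}

print(meets_requirements(healthy, 650, 200))
print(meets_requirements(starved, 650, 200))
```

Both sample machines pass the static Memory test; only the virtual memory term tells them apart, and only as accurately as the last update of the machine ad allows.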

Someone from the Condor team can confirm or deny the TotalVirtualMemory attribute behaviour and let you know how often it's updated and how accurate it is for each operating system.

- Ian

