
Re: [Condor-users] Dynamic memory checking on ClassAd



Could anyone help me with this?

 

I thought this was a common problem that a typical Condor user might face.

 

Is there an easy way to prevent a job from starting on a machine that has a temporary memory shortage?

 

The users on the machines I use tend to behave quite badly, and a single problematic machine clogged with zombie processes drains the whole Condor task list within an hour. It takes careful preparation to build a Condor task list, and the whole list is lost to just one problematic node in the cluster.

 

I hope there is at least one person who can help me with this.

 

Thanks,

 

 

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of BYONG WU CHONG
Sent: Wednesday, October 28, 2009 5:25 AM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Dynamic memory checking on ClassAd

 

Hello Condor-users,

 

I have a question about dynamic memory checking with ClassAds.

 

I use the following Requirements.

Requirements      = Memory >= 650

 

But this causes a problem if the target machine has only about 100 MB of memory available due to zombie processes. The schedd process will start new jobs on the target machine, and the jobs will be killed. Very soon, the whole job list is exhausted because of just one problematic machine in the cluster.

I know that the zombie processes must be removed soon, but I want Condor to handle this unfortunate event smartly and preserve the job list on its own. So my question is this.

 

*** I want to prevent schedd from starting new jobs when the currently available virtual memory is smaller than a given threshold value. ***

 

So what I want is something like this.

Requirements = Memory >= 650 && Dynamic_Available_Memory_Size >= 200
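
To make the intent concrete, here is a minimal sketch of such an expression. It assumes that the machine ClassAd attribute VirtualMemory (available swap space, reported in KB) is an acceptable stand-in for the hypothetical Dynamic_Available_Memory_Size above; the 200 MB threshold is the one from this message. This has not been tested, and the value is only as fresh as the machine's last ClassAd update, so a machine could still be matched shortly after it runs low on memory.

```
# Sketch only: VirtualMemory is reported in KB, so 200 MB = 200 * 1024 KB.
Requirements = (Memory >= 650) && (VirtualMemory >= 200 * 1024)
```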

 

I tried setting the Image_Size attribute to 200000 KB according to this manual section.

http://www.cs.wisc.edu/condor/manual/v7.3/2_5Submitting_Job.html#2336

But, somehow, the job was submitted to a machine that had less than 200 MB of virtual memory space, and eventually the job was killed due to memory shortage. I changed Image_Size to about 700 MB, but schedd still assigned jobs to the problematic machine.

 

Can you help me with this issue?

 

Thanks,

 

- Byong-Wu