
Re: [Condor-users] Memory issues when running condor jobs:



Yeye, 

Does the application finish on the machine if it is run manually by itself (with no other jobs)? It is unlikely that the machine is preempting the job (though we could check your PREEMPT expression to be sure). More likely, the job is failing because it needs more memory than the 32-bit address space allows, and then never re-matches. We've seen jobs that fail when their memory footprint gets too large; because the ImageSize for the job has by then been updated to the larger 2-3GB number, the job never reschedules, since no slot has that much memory available.
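
To illustrate why a job can get stuck idle after its ImageSize grows, here is a rough sketch in Python. It is a simplified model, not Condor's actual matchmaker: it assumes a slot is eligible only when its memory (in MB, converted to KB) covers the job's ImageSize (in KB). The pool sizes and the function names are hypothetical.

```python
# Simplified model of the memory part of matchmaking (an assumption for
# illustration only; real matching evaluates full ClassAd expressions).

def slot_can_run(slot_memory_mb, image_size_kb):
    """True if the slot's memory, converted to KB, covers the job's ImageSize."""
    return slot_memory_mb * 1024 >= image_size_kb


def eligible_slots(slots_mb, image_size_kb):
    """Return the slots (by memory size in MB) that could accept the job."""
    return [mb for mb in slots_mb if slot_can_run(mb, image_size_kb)]


# Hypothetical pool: four 1 GB slots and two 2 GB slots.
pool_mb = [1024, 1024, 1024, 1024, 2048, 2048]

image_size_at_submit_kb = 512 * 1024             # 512 MB when first submitted
image_size_after_run_kb = int(2.5 * 1024 * 1024)  # 2.5 GB after the failed run

print(len(eligible_slots(pool_mb, image_size_at_submit_kb)))  # 6
print(len(eligible_slots(pool_mb, image_size_after_run_kb)))  # 0
```

Under that assumption, every slot accepts the job at submit time, but once the recorded ImageSize reaches 2.5GB no slot in this pool qualifies, which would look the same as what condor_q -analyze is reporting.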

To test this, run the job manually on the machine in question while nothing else is running, and see whether it completes successfully.

If the job does run when nothing else is running on the machine, you might decrease the number of slots on the machine so that each slot has more RAM.
If the job fails because it runs out of memory in the 32-bit address space, Condor can't change that, because it merely schedules jobs.
Otherwise, you might find that the job fails for a reason other than memory.
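
The three cases above can be sketched as a small decision helper. This is only an illustration under stated assumptions: RAM is taken to be divided evenly among slots, and the usable address space of a 32-bit process is taken to be about 3GB (the real figure depends on the OS and its configuration). All names and numbers here are hypothetical.

```python
def memory_per_slot_mb(total_ram_mb, num_slots):
    """RAM per slot, assuming the machine's RAM is split evenly among slots."""
    return total_ram_mb // num_slots


def diagnose(peak_mb, total_ram_mb, num_slots, usable_address_space_mb=3072):
    """Classify a failure by comparing the job's peak memory to two limits.

    usable_address_space_mb is an assumed figure for a 32-bit process.
    """
    if peak_mb > usable_address_space_mb:
        # No scheduling change helps; the process itself cannot address this much.
        return "exceeds 32-bit address space"
    if peak_mb > memory_per_slot_mb(total_ram_mb, num_slots):
        # The machine has enough RAM overall, but each slot gets too little.
        return "reduce number of slots"
    return "look for a non-memory cause"


# Hypothetical machine: 4 GB of RAM.
print(diagnose(2500, 4096, 4))  # reduce number of slots
print(diagnose(3500, 4096, 1))  # exceeds 32-bit address space
print(diagnose(800, 4096, 4))   # look for a non-memory cause
```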

Hope this helps!

Best,
Doug


-- 
===================================
Douglas Clayton
main: 888.292.5320

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and Management Tools 


On Oct 15, 2008, at 1:22 AM, Yeye He wrote:

Hi all,

I am running a very memory-intensive job via Condor, and every time
the virtual memory size goes beyond 1.6GB or so, the job is evicted and
never picked up again (condor_q -analyze shows that all machines
that qualify to run the program refuse to do so). I understand it may
have something to do with the machines' local job policy, whereby the
job gets evicted when its image size exceeds a certain threshold.

One obvious solution on my end is to limit the memory footprint of this
program. But this is a "naive" approach that someone proposed in a paper
and we are proposing something else to beat it. We are trying to get
some data points on some non-trivial dataset to show that we indeed beat
it in terms of both efficiency and quality. The problem is that in the
"naive" approach its memory footprint can easily go beyond 2-3 GB, which
my 32bit workstation cannot handle due to limits of address space
(allocation error). I don't have access to a 64 bit machine so that is
why I am using Condor to start with.

I am just wondering whether, without rewriting the current implementation
(potentially by moving some data to disk and removing it from the
in-memory structure), there is any workaround to this problem.

Any suggestions are highly appreciated!
-Yeye

-- 
===================================
Douglas Clayton
phone: 919.647.9648

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com