
Re: [Condor-users] Memory issues when running condor jobs:



You should get in touch with the Condor team and see if you can be given priority on a 64-bit machine so that your job can use all of its resources.

Best,


matt

Yeye He wrote:
Hi Doug,

Yes, the job cannot finish on a 32-bit machine due to limits of address space. The problem is that I don't have access to a 64-bit machine myself, so I was using Condor to run the job on some 64-bit machine somewhere in the pool. That's why I am trying to run it on Condor to start with.

I believe that some of those 64 bit machines are divided up into 2-4 slots, such that, as you stated, when the memory footprint of my job goes beyond 2G, it gets kicked out.

Since I don't own those machines in the pool, I cannot change how those machines are configured in the way they are shared. But from a user's perspective, I hope to grab all the slots on a machine when they are all available, so that I can run my memory-intensive job to completion.

Dividing a machine up into smaller slots would serve more small jobs simultaneously but in the case that they are all sitting there idle, having the flexibility to dedicate all resources on one big machine to serve a big job seems appealing to me in this situation. From a Condor user's perspective, is there a way to do that?

Thanks!
-Yeye

Douglas Clayton wrote:
Yeye, does the application finish on the machine if it is run manually by itself (with no other jobs)? It is unlikely that the machine is preempting the job (though we could check your PREEMPT expression to be sure). It is more likely that the job is failing because it needs more memory than the 32-bit address space allows, and then never re-matching. We've seen jobs fail when their memory footprint gets too large; because the ImageSize for the job has by then been updated to the larger 2-3GB number, the job never reschedules, since no slots have that much memory available. To test this, run the job manually on the machine in question while nothing else is running, and see if it completes successfully.

If the job does run when nothing else is running on the machine, you might decrease the number of slots on the machine, so each slot has more RAM. If the job doesn't run because it runs out of memory in the 32-bit address space, Condor won't change that because it merely schedules jobs. Otherwise, you might find that the job fails for a reason other than memory.
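To make the slot arithmetic above concrete, here is a minimal sketch in Python. It is not Condor code: the function names, the even split of RAM across slots, the 8 GB machine, and the units (slot memory in MB, job ImageSize in KB) are all assumptions chosen for illustration, and a real pool may apply a different policy.

```python
# Illustrative only: hypothetical helpers, not part of Condor.
# Assumed units: slot memory in MB, job ImageSize in KB.

def slot_memory_mb(total_ram_mb, num_slots):
    """RAM available to each slot if the machine's memory is split evenly."""
    return total_ram_mb // num_slots


def job_fits(slot_mb, image_size_kb):
    """True if a slot's memory covers the job's recorded image size."""
    return slot_mb * 1024 >= image_size_kb


def matching_slot_counts(total_ram_mb, image_size_kb, slot_options=(1, 2, 4)):
    """Slot counts under which the job could still match this machine."""
    return [
        n for n in slot_options
        if job_fits(slot_memory_mb(total_ram_mb, n), image_size_kb)
    ]


if __name__ == "__main__":
    total_ram_mb = 8192                    # assumed 8 GB machine
    image_size_kb = int(2.5 * 1024 * 1024) # job whose image has grown to 2.5 GB
    print(matching_slot_counts(total_ram_mb, image_size_kb))
```

Under these assumptions the job still fits when the machine is configured with one or two slots, but not with four, since each of four slots would have only 2 GB. That matches the behaviour described above: once the recorded image size exceeds what any single slot offers, nothing matches.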

Hope this helps!

Best,
Doug


--
===================================
Douglas Clayton
main: 888.292.5320

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and Management Tools
http://www.cyclecomputing.com
http://www.cyclecloud.com

On Oct 15, 2008, at 1:22 AM, Yeye He wrote:

Hi all,

I am running a very memory-intensive job via Condor, and every time
the virtual memory size goes beyond 1.6GB or so, the job is evicted and
never picked up again (condor_q -analyze shows that all machines
that qualify to run the program refuse to do so). I understand it may
have something to do with the machines' local job policy, under which
my job gets evicted when its image size exceeds a certain threshold.

One obvious solution on my end is to limit the memory footprint of this
program. But this is a "naive" approach that someone proposed in a paper
and we are proposing something else to beat it. We are trying to get
some data points on some non-trivial dataset to show that we indeed beat
it in terms of both efficiency and quality. The problem is that in the
"naive" approach its memory footprint can easily go beyond 2-3 GB, which
my 32bit workstation cannot handle due to limits of address space
(allocation error). I don't have access to a 64 bit machine so that is
why I am using Condor to start with.

I am just wondering whether, without rewriting the current implementation
(potentially by moving some data to disk and removing it from the in-memory
structure), there is any workaround to this problem?

Any suggestions are highly appreciated!
-Yeye
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
--
===================================
Douglas Clayton
phone: 919.647.9648

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com

