Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Memory issues when running condor jobs:

Date: Wed, 15 Oct 2008 13:09:43 -0500
From: Matthew Farrellee <matt@xxxxxxxxxx>
Subject: Re: [Condor-users] Memory issues when running condor jobs:

You should get in touch with the Condor team and see if you can be given priority on a 64-bit machine so that your job can use all of its resources.


Best,


matt

Yeye He wrote:

Hi Doug,
Yes, the job cannot finish on a 32-bit machine due to limits of address space. The problem is that I don't have access to a 64-bit machine myself, so I was using Condor to run the job on some 64-bit machine somewhere in the pool. That's why I am trying to run it on Condor to start with.
I believe that some of those 64-bit machines are divided up into 2-4 slots, such that, as you stated, when the memory footprint of my job goes beyond 2G, it gets kicked out.
Since I don't own those machines in the pool, I cannot change how those machines are configured in the way they are shared. But from a user's perspective, I hope to grab all the slots on a machine when they are all available, so that I can run my memory-intensive job to completion.
Dividing a machine up into smaller slots would serve more small jobs simultaneously, but in the case that they are all sitting there idle, having the flexibility to dedicate all resources on one big machine to serve a big job seems appealing to me in this situation. From a Condor user's perspective, is there a way to do that?
Thanks!
-Yeye

Douglas Clayton wrote:
Yeye,
Does the application finish on the machine if it is run manually by itself (with no other jobs)? It is unlikely the machine is preempting the job (but we could check your PREEMPT expression to double check), and more likely that the job is failing because it uses more memory than the 32-bit address space allows, and then never re-matching. We've seen jobs that fail when they get too large a memory footprint, and then, because the ImageSize for the job has been updated to the larger 2-3GB number, the job never reschedules because no slots have that much memory available.
To test this, manually run the job on the machine in question while nothing else is running, to see if it completes successfully.
If the job does run when nothing else is running on the machine, you might decrease the number of slots on the machine, so each slot has more RAM.
If the job doesn't run because it runs out of memory in the 32-bit address space, Condor won't change that because it merely schedules jobs.
Otherwise, you might find that the job fails for a reason other than memory.
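As a rough illustration of why fewer slots can help, here is a minimal Python sketch of the arithmetic. It is a simplified model, not Condor's actual matchmaking logic: it assumes a machine's RAM is split evenly across its slots and that a slot only accepts a job whose recorded image size fits in that share. The function names and the 8 GB machine are hypothetical.

```python
# Simplified model of slot memory vs. job image size.
# Assumption: RAM is divided evenly among slots, and a slot accepts a job
# only if its share is at least the job's recorded image size.
# This is an illustration, not Condor's real matchmaker.


def slot_memory_mb(total_memory_mb: int, num_slots: int) -> int:
    """Memory available to each slot when RAM is split evenly."""
    if num_slots < 1:
        raise ValueError("a machine needs at least one slot")
    return total_memory_mb // num_slots


def can_match(image_size_mb: int, total_memory_mb: int, num_slots: int) -> bool:
    """True if a single slot's memory share covers the job's image size."""
    return slot_memory_mb(total_memory_mb, num_slots) >= image_size_mb


if __name__ == "__main__":
    total_mb = 8192  # hypothetical 8 GB machine
    for image_mb in (1024, 3072):
        for slots in (4, 2, 1):
            verdict = "matches" if can_match(image_mb, total_mb, slots) else "no match"
            print(f"image={image_mb} MB, slots={slots}: {verdict}")
```

Under these assumptions, a job whose image size has grown to 3 GB no longer fits a machine split into four 2 GB slots, but it fits once the same machine is configured with two slots or one.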
Hope this helps!

Best,
Doug


--
===================================
Douglas Clayton
main: 888.292.5320

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and Management Tools
http://www.cyclecomputing.com
http://www.cyclecloud.com

On Oct 15, 2008, at 1:22 AM, Yeye He wrote:
Hi all,

I am running a very memory-intensive job via Condor, and every time the
virtual memory size goes beyond 1.6GB or so, the job is evicted and
never picked up again (condor_q -analyze shows that all machines
that qualify to run the program refuse to do so). I understand it may
have something to do with the machines' local job policy, where the job
gets evicted when its image size exceeds a certain threshold.

One obvious solution on my end is to limit the memory footprint of this
program. But the program implements a "naive" approach that someone
proposed in a paper, and we are proposing something else to beat it. We
are trying to get some data points on a non-trivial dataset to show that
we indeed beat it in terms of both efficiency and quality. The problem is
that the memory footprint of the "naive" approach can easily go beyond
2-3 GB, which my 32-bit workstation cannot handle due to limits of
address space (allocation error). I don't have access to a 64-bit
machine, so that is why I am using Condor to start with.
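
For reference, the address-space limit mentioned above can be worked out with a few lines of Python. This is only a sketch of the arithmetic: the 2 GiB figure for the portion of a 32-bit address space usable by one process is an assumption (it varies by operating system and configuration), and the helper names are hypothetical.

```python
# Arithmetic behind the 32-bit address-space limit.
# Assumption: a single 32-bit process can use only part of the 4 GiB space;
# 2 GiB is used here as an example, the real figure depends on the OS.

GIB = 2**30


def address_space_bytes(pointer_bits: int) -> int:
    """Total number of bytes addressable with pointers of the given width."""
    return 2**pointer_bits


def fits(footprint_bytes: int, usable_bytes: int) -> bool:
    """True if the job's memory footprint fits in the usable address space."""
    return footprint_bytes <= usable_bytes


if __name__ == "__main__":
    total_32 = address_space_bytes(32)
    usable_32 = 2 * GIB  # assumed per-process limit on a 32-bit system
    footprint = 3 * GIB  # upper end of the 2-3 GB footprint described above
    print("32-bit address space in GiB:", total_32 // GIB)
    print("3 GiB job fits in assumed 32-bit user space:", fits(footprint, usable_32))
    print("3 GiB job fits in 64-bit address space:", fits(footprint, address_space_bytes(64)))
```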

I am just wondering, without rewriting the current implementation
(potentially by moving some data to disk and removing it from the
in-memory structure), is there any workaround to this problem?

Any suggestions are highly appreciated!
-Yeye
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
--
===================================
Douglas Clayton
phone: 919.647.9648

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com



