
Re: [Condor-users] OS unable to allocate memory to job when run under condor



Here is a reproducible test case:

-------------- start of test.do ------------------
* requires Stata and: ssc install runmlwin

use "http://www.bristol.ac.uk/cmm/media/runmlwin/tutorial.dta", clear 
describe
! ulimit -a

* This call to runmlwin writes out a script and calls a Windows program via wine1.3 with the script as an argument.
runmlwin normexam cons standlrt, level2(school:) level1(student: cons) mlwinsettings(size(85000)) batch

----------- end of test.do------------------------

                                                                                                                       
-------------test.submit---------------------
########################################################
##
##
## CONDOR SUBMIT FILE FOR LINUX
##
##
########################################################


universe = vanilla

# I'm using NFS:
should_transfer_files = no
getenv = true

executable = /path/to/stata
arguments = -s do test.do

output = test_stata_submit.cout
error =  test_stata_submit.cerr
log =    /tmp/test_stata_submit.clog

queue

------------ END OF test.submit -----------------


The Stata file test.do works when run from the command line (# stata -s do test.do).
But when I run it via Condor (# condor_submit test.submit), I get the following error (I've modified runmlwin.ado to produce more debug output):

---------start of output------------ 
NOTE   Initialise MLwiN storage
INIT 3 85000 1500 3

 error while obeying batch file /var/lib/condor/execute/dir_7545/St07546.000008 at line number 9:
INIT 3 85000 1500 3
Unable to allocate 85000 k worksheet cells. Worksheet size unchanged.

number of levels            :    3
worksheet size(k cells)     :50000
number of columns           : 1500
number of expl. vars.       :    3
number of links             :   20

 error while obeying batch file /var/lib/condor/execute/dir_7545/St07546.000008 at line number 9:
 INIT 3 85000 1500 3

 Unable to allocate 85000 k worksheet cells. Worksheet size unchanged..
Execution completed --- Begin MLwiN error log --- 
---------end of output------------ 


We've already gone to great lengths to make sure Linux and Condor aren't configured to limit memory allocation. This makes me think that wine may be counting libraries that Condor shares with Stata as part of the MLwiN process's address space, effectively limiting the memory available. But I'm reaching beyond my expertise here, hence the plea for help!

thanks, jason








On Mar 28, 2012, at 5:48 PM, Ian Chesal wrote:

Don't rely on condor_ssh here. Run a job that's a script that just outputs 'ulimit -a' and returns the results from that call.
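
A minimal sketch of such a job script, assuming bash is available on the execute node. The file name limits.sh and the function name report_limits are made up for illustration; nothing here is specific to Condor itself.

```shell
#!/bin/bash
# limits.sh (hypothetical name): report resource limits as the job itself sees them.

report_limits() {
    # Identify where and as whom the job is actually running.
    echo "== node: $(uname -n), user: $(id -un) =="

    echo "== ulimit -a (soft limits) =="
    ulimit -Sa

    echo "== ulimit -a (hard limits) =="
    ulimit -Ha

    # On Linux, the kernel's view of the limits is also available under /proc.
    # The limits shown are those of the cat process, inherited from this shell.
    if [ -r /proc/self/limits ]; then
        echo "== /proc/self/limits =="
        cat /proc/self/limits
    fi
}

report_limits
```

Submitted with the same submit file as the real job, but with executable pointing at this script and no arguments, the output file should then contain the limits in effect for the job rather than for an interactive session attached to it.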

Regards,
- Ian

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools


On Wednesday, 28 March, 2012 at 5:33 PM, jason herman wrote:

I've looked into that, but I'm new to ulimit, so perhaps I'm missing something.

First, about the machines: I'm running on EC2. For these tests I've been running on a single machine. I've tested paravirtual and HVM instances separately.

Logged in via ssh, command line:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 59623
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 59623
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


When I condor_ssh_to_job:

ulimit -a
core file size          (blocks, -c) 1317074
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 59623
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 59623
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

So there are differences (core file size and stack size), but not in max memory size, data seg size, or virtual memory.
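
Rather than comparing the two listings by eye, they could be diffed. This is only a sketch: it assumes each `ulimit -a` listing was saved to a file, and the function name compare_limits and the file names in the usage comment are placeholders.

```shell
#!/bin/bash
# Print only the limits that differ between two saved `ulimit -a` listings.
# Usage (placeholder file names): compare_limits interactive.txt condor.txt
compare_limits() {
    # Lines starting with "<" come from the first file, ">" from the second.
    # diff and grep exit non-zero when files differ or nothing matches;
    # neither case is an error here, hence the "|| true".
    diff "$1" "$2" | grep '^[<>]' || true
}
```

Applied to the two listings above, this should report only the core file size and stack size rows.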


I turned on wine debugging. This causes the process to run in slow motion, so we can examine /proc/pid/limits for the wine process:

root@master:/proc/11457# cat limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            unlimited            unlimited            bytes     
Max core file size        0                    0                    bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             59623                59623                processes 
Max open files            4096                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       59623                59623                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        


(The listing above was taken from within the Condor job: Stata launches the wine process via Stata's shell command.)
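
Checking /proc/pid/limits for a running process could also be scripted. A sketch, assuming a Linux /proc filesystem; the function name show_mem_limits is hypothetical, and the process name in the usage comment is an assumption that may need adjusting.

```shell
#!/bin/bash
# Print the header and the memory-related rows of /proc/<pid>/limits.
show_mem_limits() {
    local pid="$1"
    grep -E '^(Limit|Max (data size|stack size|resident set|locked memory|address space))' \
        "/proc/${pid}/limits"
}

# Usage, assuming pgrep is installed and the process is named "wine":
#   for p in $(pgrep wine); do show_mem_limits "$p"; done
```

Filtering to these rows keeps the output short enough to paste next to the `ulimit -a` results when comparing a manual run against a Condor run.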


So it would appear that no Linux memory limit applies.


any additional thoughts?

jason





On Mar 28, 2012, at 3:47 PM, Ian Chesal wrote:

On Monday, 26 March, 2012 at 7:30 PM, jason herman wrote:
So the question is what could be preventing the OS from allocating memory to a process that a condor job forked via shell?

It could be that a different set of system limits is being applied when jobs are run via Condor.

What does the following report for the shell from where you run your jobs manually:

ulimit -a

And what does that same command report when you run it as a Condor job the same way you run your application?

Also: are your machines homogeneous? Are the machines where you run your commands by hand the same (RAM, disk, CPU, etc.) as the machines where jobs run under Condor's control?

Regards,
- Ian

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools


_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
