
Re: [Condor-users] Trying to run condor_glidein on the National Grid Service


As far as this error is concerned:

> 10/15 22:19:07 ERROR: Can't allocate 5th virtual machine of type 0
>        Requesting: Cpus: 1, Memory: 1, Swap: 10.00%, Disk: 10.00%
>        Available:  Cpus: 6, Memory: 0, Swap: 60.00%, Disk: 60.00%
> 10/15 22:19:07 ERROR "Ran out of system resources" at line 361 in file
> ResMgr.C

I think the problem is that -count 10 sets NUM_CPUS=10 in the glidein
configuration, which causes the startd to try to create 10 VMs on
every node. That's probably why it runs out of resources. IMHO the
-count and -vms arguments to condor_glidein are extremely confusing and
can't do what you actually want them to do in most cases. I think that
condor_glidein was originally intended to start a single glidein at a
time. What I usually do is generate a submit script and modify it to
get the distribution of hosts/processes I want.

Try this (notice the -gensubmit):

condor_glidein -count 10 -arch 6.6.7-i686-pc-Linux-2.4 \
    -setup_jobmanager jobmanager-fork -gensubmit

It won't submit a job, but it will produce a glidein_run.submit script
that looks like this:

Universe = Grid
Executable = $(DOLLAR)(HOME)/Condor_glidein/6.6.7-i686-pc-Linux-2.4/glidein_startup
Arguments = -dyn -f
Environment = CONDOR_CONFIG=$(DOLLAR)(HOME)/Condor_glidein/glidein_condor_config;_condor_CONDOR_HOST=array.usc.edu;_condor_GLIDEIN_HOS
Transfer_Executable = False
GlobusRSL = (count=10)(jobtype=single)
Grid_Resource = gt2 ngs.leeds.ac.uk/jobmanager-pbs
Notification = Never

First, check out GlobusRSL. It is requesting count=10, but
jobtype=single. That means that it will reserve 10 nodes, but start a
condor_master on only one of them! If you want 10 nodes, you should
change that to:

GlobusRSL = (count=10)(jobtype=multiple)

Or if you want to be even more explicit:

GlobusRSL = (host_count=10)(count=10)(jobtype=multiple)

Also notice that in the environment it is setting _condor_NUM_CPUS=10.
That's probably why the startd is failing. You should either remove
that, or change it to the number of VMs you actually want on each node.
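
If you end up regenerating the submit file a lot, both edits can be
scripted. Here is a rough sketch, not something that ships with Condor:
fix_glidein_submit is a name I made up, and it assumes the setting
appears literally as _condor_NUM_CPUS=<number> in the Environment line
and that the RSL contains (jobtype=single) exactly as shown above.

```shell
#!/bin/sh
# Sketch only: patch a generated glidein submit file in place.
# fix_glidein_submit is a hypothetical helper, not part of Condor.
#   $1 = path to the submit file (e.g. glidein_run.submit)
#   $2 = VMs wanted per node (defaults to 1; an assumption)
fix_glidein_submit() {
    file="$1"
    ncpus="${2:-1}"
    # 1) ask Globus to start a process on every reserved node
    # 2) stop the startd from trying to carve out 10 VMs per node
    sed -e 's/(jobtype=single)/(jobtype=multiple)/' \
        -e "s/_condor_NUM_CPUS=[0-9][0-9]*/_condor_NUM_CPUS=${ncpus}/" \
        "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

You would call it as "fix_glidein_submit glidein_run.submit 1", check
the result by eye, and then run condor_submit on the file yourself.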