
Re: [Condor-users] Trying to run condor_glidein on the National Grid Service



Jean-Alain,

As far as this error is concerned:

> 10/15 22:19:07 ERROR: Can't allocate 5th virtual machine of type 0
>        Requesting: Cpus: 1, Memory: 1, Swap: 10.00%, Disk: 10.00%
>        Available:  Cpus: 6, Memory: 0, Swap: 60.00%, Disk: 60.00%
> 10/15 22:19:07 ERROR "Ran out of system resources" at line 361 in file
> ResMgr.C

I think the problem is that -count 10 sets NUM_CPUS=10 in the glidein
configuration, which causes the startd to try to create 10 VMs on
every node. That's probably why it runs out of resources. IMHO the
-count and -vms arguments to condor_glidein are extremely confusing and
can't do what you actually want in most cases. I think condor_glidein
was originally intended to start a single glidein at a time. What I
usually do is generate a submit script and modify it to get the
distribution of hosts/processes I want.

Try this (notice the -gensubmit):

condor_glidein -count 10 -arch 6.6.7-i686-pc-Linux-2.4 \
    -setup_jobmanager jobmanager-fork -gensubmit \
    ngs.leeds.ac.uk/jobmanager-pbs

It won't submit a job, but it will produce a glidein_run.submit script
that looks like this:

Universe = Grid
Executable = $(DOLLAR)(HOME)/Condor_glidein/6.6.7-i686-pc-Linux-2.4/glidein_startup
Arguments = -dyn -f
Environment = CONDOR_CONFIG=$(DOLLAR)(HOME)/Condor_glidein/glidein_condor_config;_condor_CONDOR_HOST=array.usc.edu;_condor_GLIDEIN_HOST=array.usc.edu;_condor_LOCAL_DIR=$(DOLLAR)(HOME)/Condor_glidein/local;_condor_SBIN=$(DOLLAR)(HOME)/Condor_glidein/6.6.7-i686-pc-Linux-2.4;_condor_CONDOR_ADMIN=juve@xxxxxxxxxxxxx;_condor_NUM_CPUS=10;_condor_UID_DOMAIN=leeds.ac.uk;_condor_FILESYSTEM_DOMAIN=leeds.ac.uk;_condor_MAIL=/bin/mail;_condor_STARTD_NOCLAIM_SHUTDOWN=1200;_condor_START_owner=juve;_condor_UPDATE_COLLECTOR_WITH_TCP=True
Transfer_Executable = False
GlobusRSL = (count=10)(jobtype=single)
Grid_Resource = gt2 ngs.leeds.ac.uk/jobmanager-pbs
Notification = Never
Queue

First, look at the GlobusRSL line. It requests count=10 but
jobtype=single. That means it will reserve 10 nodes, but start a
condor_master on only one of them! If you want a glidein on each of
the 10 nodes, you should change that to:

GlobusRSL = (count=10)(jobtype=multiple)

Or if you want to be even more explicit:

GlobusRSL = (host_count=10)(count=10)(jobtype=multiple)
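
If you end up regenerating the submit script a lot, the jobtype edit
can be scripted rather than done by hand. Here is a minimal Python
sketch of that. The function name set_rsl_jobtype is made up for
illustration (it is not part of Condor), and it only assumes the
GlobusRSL line looks like the one in the generated script above:

```python
import re

def set_rsl_jobtype(submit_text, jobtype="multiple"):
    """Return submit_text with (jobtype=...) on the GlobusRSL line replaced.

    Hypothetical helper, not part of Condor or Globus.
    """
    out = []
    for line in submit_text.splitlines(True):  # keep line endings
        if line.strip().lower().startswith("globusrsl"):
            # Swap whatever jobtype is currently set for the requested one.
            line = re.sub(r"\(jobtype=[^)]*\)", "(jobtype=%s)" % jobtype, line)
        out.append(line)
    return "".join(out)

example = "GlobusRSL = (count=10)(jobtype=single)\n"
print(set_rsl_jobtype(example), end="")
```

You would read glidein_run.submit into a string, pass it through the
function, and write the result back before running condor_submit.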

Also notice that in the environment it is setting _condor_NUM_CPUS=10.
That's probably why the startd is failing. You should either remove
that, or change it to the number of VMs you actually want.
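
The NUM_CPUS edit is also easy to script if you don't want to pick
through that long Environment line by hand. A rough Python sketch,
assuming the Environment line is a single semicolon-separated line as
in the generated script; set_num_cpus is a name I made up, not a
Condor tool:

```python
def set_num_cpus(submit_text, num_cpus=None):
    """Drop _condor_NUM_CPUS from the Environment line, or set it to num_cpus.

    Hypothetical helper: num_cpus=None removes the override entirely.
    """
    out = []
    for line in submit_text.splitlines(True):  # keep line endings
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "environment":
            newline = "\n" if value.endswith("\n") else ""
            entries = [e for e in value.strip().split(";") if e]
            # Remove any existing NUM_CPUS override.
            entries = [e for e in entries
                       if not e.startswith("_condor_NUM_CPUS=")]
            if num_cpus is not None:
                entries.append("_condor_NUM_CPUS=%d" % num_cpus)
            line = "Environment = " + ";".join(entries) + newline
        out.append(line)
    return "".join(out)

example = "Environment = _condor_NUM_CPUS=10;_condor_MAIL=/bin/mail\n"
print(set_num_cpus(example), end="")       # override removed
print(set_num_cpus(example, 2), end="")    # override set to 2 VMs
```

All other lines of the submit script pass through unchanged.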

Cheers,
Gideon