
Re: [Condor-users] MPI condor Config



On 10/28/06, Diego Bello <dbello@xxxxxxxxx> wrote:
On 10/24/06, Becky Gietzel <bgietzel@xxxxxxxxxxx> wrote:
>
> On Oct 20, 2006, at 5:50 PM, Diego Bello wrote:
>
> > Hi everyone, I have a Condor pool made of workstations that supports MPI
> > jobs, simple jobs and DAGs, all using Globus. The Condor version is 6.8.0.
> >
> > What I need is for MPI jobs to be stopped when a machine is in use,
> > which is normally between 10 am and 9 pm. I have tried some
> > configurations taken from the Condor manual, but some jobs don't
> > start. I think there could be a problem with the START configuration.
> >
> > I'm now trying a DAG job made of three jobs that do nothing more than
> > run /bin/hostname. It gets into the queue and the first job starts
> > running but, after several hours, it still hasn't finished. If I submit
> > a Globus job directly, it works. My proxy is valid for 48 hrs.
> >
> > I have attached my central manager's and my exec nodes' config files.
> > Can someone tell me if there is something wrong with my config files?
> >
>
> You'll want to adjust your START policy for the execute nodes as
> follows:
>
> Add:
> IsNighttime = (ClockMin < 600 || ClockMin > 1260)
>
>
> Replace the START and PREEMPT expressions with:
> START   = ( (Scheduler =?= $(DedicatedScheduler) && $(IsNighttime) =?= TRUE \
>             && $(KeyboardIdleTime) > $(StartIdleTime) ) ||  $(START) )
> PREEMPT = (Scheduler =!= $(DedicatedScheduler) && $(KeyboardBusy)
>
> This policy will allow MPI jobs to start only during the nighttime
> hours if nobody is actively using the machine.  Once you set up the
> new policy, make sure you are able to run a simple Vanilla universe
> /bin/hostname job.  When that is working, try the DAG job with
> /bin/hostname again.  Then try an MPI job.
>
> If you are using the MPI Universe for your MPI jobs I'd recommend
> switching to the Parallel Universe.
>
>
> Thanks,
>
> Becky
>

Thanks for the reply!

I tried what you said, but now the Condor daemons fail to start on the
exec nodes. This is the error message I get:

*** Last 20 line(s) of file StartLog:
10/28 23:00:39 Using config source: /etc/condor/condor_config
10/28 23:00:39 Using local config sources:
10/28 23:00:39    /opt/condor-6.8.0/local.chaparro/condor_config.local
10/28 23:00:39 DaemonCore: Command Socket at <200.1.19.171:9642>
10/28 23:00:39 ERROR "Syntax error in START expression: '( (Scheduler
=?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx" && (ClockMin < 600 ||
ClockMin > 1260) =?= TRUE &&  > 15 * 60 ) ||  ( (KeyboardIdle > 15 *
60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed"
&& State != "Owner")) ) )'" at line 286 in file util.C
10/28 23:00:52 passwd_cache::cache_uid(): getpwnam("condor") failed:
user not found
10/28 23:00:52 passwd_cache::cache_uid(): getpwnam("condor") failed:
user not found
10/28 23:00:52 ******************************************************
10/28 23:00:52 ** condor_startd (CONDOR_STARTD) STARTING UP
10/28 23:00:52 ** /opt/condor-6.8.0/sbin/condor_startd
10/28 23:00:52 ** $CondorVersion: 6.8.0 Jul 19 2006 $
10/28 23:00:52 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/28 23:00:52 ** PID = 5351
10/28 23:00:52 ** Log last touched 10/28 23:00:39
10/28 23:00:52 ******************************************************
10/28 23:00:52 Using config source: /etc/condor/condor_config
10/28 23:00:52 Using local config sources:
10/28 23:00:52    /opt/condor-6.8.0/local.chaparro/condor_config.local
10/28 23:00:52 DaemonCore: Command Socket at <200.1.19.171:9683>
10/28 23:00:52 ERROR "Syntax error in START expression: '( (Scheduler
=?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx" && (ClockMin < 600 ||
ClockMin > 1260) =?= TRUE &&  > 15 * 60 ) ||  ( (KeyboardIdle > 15 *
60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed"
&& State != "Owner")) ) )'" at line 286 in file util.C
*** End of file StartLog


I tried putting a START=TRUE expression before the lines you suggested,
and then removing it. The only difference was that the error message
showed TRUE instead of ((LoadAvg - Condor.......

In the PREEMPT line, I suppose there is a missing ")" at the end, am I right?

Can you help me find out what is going wrong?

Thanks.

--
Diego Bello Carreño
Estudiante Memorista de Ingeniería Civil Informática
UTFSM, Valparaíso, Chile
Usuario #294897 counter.li.org


I have finally found the reason!!!

I thought that "KeyboardIdleTime" was a built-in Condor variable, but
no, I have to define it myself. So I gave it a value of 60 (mins) and
now it works!!!
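
For anyone who hits the same syntax error: the StartLog above shows
"&&  > 15 * 60", which suggests that the undefined $(KeyboardIdleTime)
macro was replaced by nothing, leaving ">" without a left-hand operand.
Below is a minimal Python sketch of that substitution step. It is not
Condor's actual code; the function name expand_macros and the config
dictionary are made up for illustration, and it assumes that undefined
macros expand to an empty string, as the log suggests.

```python
import re

def expand_macros(expr, config):
    """Replace each $(NAME) in expr with its value from config.

    Assumption for this sketch: an undefined NAME becomes an empty string.
    Only one substitution pass is made, so nested macros are not handled.
    """
    return re.sub(r"\$\((\w+)\)", lambda m: config.get(m.group(1), ""), expr)

# StartIdleTime appears in the log already expanded to "15 * 60".
config = {"StartIdleTime": "15 * 60"}
clause = "$(KeyboardIdleTime) > $(StartIdleTime)"

# KeyboardIdleTime is not defined yet: the left operand disappears.
broken = expand_macros(clause, config)

# After defining it (60 is the value used above), the clause is complete.
config["KeyboardIdleTime"] = "60"
fixed = expand_macros(clause, config)

print(repr(broken))
print(repr(fixed))
```

The first result is the same fragment that appears in the error message,
and the second is a complete comparison. Note that with the value 60 the
clause reads 60 > 15 * 60, a constant that does not depend on keyboard
activity; the default START in the log compares the KeyboardIdle
attribute instead, so that may be the name that was originally intended.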

Thanks very much Becky for your help.

--
Diego Bello Carreño
Estudiante Memorista de Ingeniería Civil Informática
UTFSM, Valparaíso, Chile
Usuario #294897 counter.li.org