
Re: [Condor-users] MPI condor Config



On 10/24/06, Becky Gietzel <bgietzel@xxxxxxxxxxx> wrote:

On Oct 20, 2006, at 5:50 PM, Diego Bello wrote:

> Hi everyone, I have a Condor pool made of workstations to support MPI,
> simple jobs and DAGs, all using Globus. The Condor version is 6.8.0.
>
> What I need is for MPI jobs to be stopped when a machine is in use,
> which is normally between 10 am and 9 pm. I have tried some
> configurations taken from the Condor manual, but some jobs don't
> start. I think there could be a problem with the start configuration.
>
> I'm now trying a DAG job with three jobs that do nothing more than
> run /bin/hostname. It gets into the queue and the first job starts
> running, but after several hours it still hasn't finished. If I submit
> a Globus job directly, it works. My proxy is valid for 48 hours.
>
> I have attached the config files for my central manager and my exec
> nodes. Can someone tell me if there is something wrong with them?
>

You'll want to adjust your START policy for the execute nodes as
follows:

Add:
IsNighttime = (ClockMin < 600 || ClockMin > 1260)


Replace the START and PREEMPT expressions with:
START   = ( (Scheduler =?= $(DedicatedScheduler) && $(IsNighttime) =?= TRUE && $(KeyboardIdleTime) > $(StartIdleTime) ) ||  $(START) )
PREEMPT = (Scheduler =!= $(DedicatedScheduler) && $(KeyboardBusy)

This policy will allow MPI jobs to start only during the nighttime
hours if nobody is actively using the machine.  Once you set up the
new policy, make sure you are able to run a simple Vanilla universe
/bin/hostname job.  When that is working, try the dag job with
/bin/hostname again.  Then try an MPI job.
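
As a quick sanity check of the time window, the IsNighttime expression
can be mirrored in Python. This is only a sketch: it assumes ClockMin is
the number of minutes since midnight (so 600 is 10:00 and 1260 is 21:00),
and the helper names is_nighttime and to_clock_min are made up for
illustration; they are not part of Condor.

```python
def to_clock_min(hour: int, minute: int = 0) -> int:
    """Convert a wall-clock time to minutes since midnight (assumed ClockMin unit)."""
    return hour * 60 + minute


def is_nighttime(clock_min: int) -> bool:
    """Mirror of: IsNighttime = (ClockMin < 600 || ClockMin > 1260)."""
    return clock_min < 600 or clock_min > 1260


if __name__ == "__main__":
    # Print the result for a few sample hours around the window edges.
    for hour in (2, 9, 10, 15, 21, 22):
        print(f"{hour:02d}:00 -> nighttime={is_nighttime(to_clock_min(hour))}")
```

Note that both comparisons are strict, so exactly 10:00 and exactly 21:00
count as daytime under this expression.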

If you are using the MPI Universe for your MPI jobs I'd recommend
switching to the Parallel Universe.


Thanks,

Becky


Thanks for the reply!

I tried what you said, but the Condor daemons won't start on the exec
nodes. This is the error message I get:

*** Last 20 line(s) of file StartLog:
10/28 23:00:39 Using config source: /etc/condor/condor_config
10/28 23:00:39 Using local config sources:
10/28 23:00:39    /opt/condor-6.8.0/local.chaparro/condor_config.local
10/28 23:00:39 DaemonCore: Command Socket at <200.1.19.171:9642>
10/28 23:00:39 ERROR "Syntax error in START expression: '( (Scheduler
=?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx" && (ClockMin < 600 ||
ClockMin > 1260) =?= TRUE &&  > 15 * 60 ) ||  ( (KeyboardIdle > 15 *
60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed"
&& State != "Owner")) ) )'" at line 286 in file util.C
10/28 23:00:52 passwd_cache::cache_uid(): getpwnam("condor") failed:
user not found
10/28 23:00:52 passwd_cache::cache_uid(): getpwnam("condor") failed:
user not found
10/28 23:00:52 ******************************************************
10/28 23:00:52 ** condor_startd (CONDOR_STARTD) STARTING UP
10/28 23:00:52 ** /opt/condor-6.8.0/sbin/condor_startd
10/28 23:00:52 ** $CondorVersion: 6.8.0 Jul 19 2006 $
10/28 23:00:52 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/28 23:00:52 ** PID = 5351
10/28 23:00:52 ** Log last touched 10/28 23:00:39
10/28 23:00:52 ******************************************************
10/28 23:00:52 Using config source: /etc/condor/condor_config
10/28 23:00:52 Using local config sources:
10/28 23:00:52    /opt/condor-6.8.0/local.chaparro/condor_config.local
10/28 23:00:52 DaemonCore: Command Socket at <200.1.19.171:9683>
10/28 23:00:52 ERROR "Syntax error in START expression: '( (Scheduler
=?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx" && (ClockMin < 600 ||
ClockMin > 1260) =?= TRUE &&  > 15 * 60 ) ||  ( (KeyboardIdle > 15 *
60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed"
&& State != "Owner")) ) )'" at line 286 in file util.C
*** End of file StartLog
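
The "&&  > 15 * 60" fragment in the logged expression is what one would
expect if $(KeyboardIdleTime) expanded to nothing while $(StartIdleTime)
expanded to 15 * 60. The Python sketch below is not Condor's code; it
only imitates $(NAME) substitution under the assumption that an
undefined macro expands to an empty string. The expand function and the
macros table are hypothetical names used for illustration.

```python
import re

MACRO = re.compile(r"\$\((\w+)\)")


def expand(text: str, macros: dict) -> str:
    """Replace each $(NAME) with its value; undefined names become '' (assumption)."""
    previous = None
    # Repeat until nothing changes, so macros that reference macros resolve too.
    while previous != text:
        previous = text
        text = MACRO.sub(lambda m: macros.get(m.group(1), ""), text)
    return text


if __name__ == "__main__":
    # KeyboardIdleTime is deliberately left undefined in this table.
    macros = {"StartIdleTime": "15 * 60"}
    fragment = "$(KeyboardIdleTime) > $(StartIdleTime)"
    print(repr(expand(fragment, macros)))
```

If that is what happened here, it may be worth checking whether
KeyboardIdleTime is defined anywhere in the config files, or whether the
KeyboardIdle attribute, which the rest of the logged expression uses,
was intended.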


I tried putting a START = TRUE expression before the lines you
suggested, and then removing it. The only difference was that the error
message showed TRUE instead of ((LoadAvg - Condor.......

In the PREEMPT line, I suppose there is a missing ")" at the end, am I right?

Can you help me find out what is going wrong?

Thanks.

--
Diego Bello Carreño
Estudiante Memorista de Ingeniería Civil Informática
UTFSM, Valparaíso, Chile
Usuario #294897 counter.li.org