
Re: [Condor-users] MPI jobs not executing



Junaid,

Have a look at the example file
(install directory)/etc/examples/condor_config.local.dedicated.resource
I think you are going to need to define more than what you have defined so far.
In my case I'm running MPI jobs. Outside of Condor, when an MPI job is run
you usually need to set up some sort of passwordless login. However, Condor has
taken care of this for you and includes 2 scripts:
lamscript
mp1script
What these scripts do is "essentially" take care of passwordless login for you
while your MPI job is running. I'll be mucking around with Condor later today
and I'll let you know what I come up with.
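
As a rough sketch of how one of these scripts is typically used (an
assumption about a common setup, not something tested on your pool): the
submit file names the script as the executable and passes the MPI binary as
its argument. Here a.out is the program from your submit file, a copy of
mp1script is assumed to sit in the submit directory, and the file transfer
lines are assumptions for a pool without a shared file system.

    # Sketch only: mp1script wraps the MPI launch on the allocated nodes
    universe = parallel
    executable = mp1script
    # The real MPI program is passed as the first argument to the script
    arguments = a.out
    log = logfile
    error = log.error
    output = log.output
    machine_count = 4
    # Assumed: no shared file system, so ship the binary with the job
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    transfer_input_files = a.out
    queue

Note that mp1script targets MPICH 1; if your MPI is LAM, lamscript would be
the one to try instead.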

Cheers

Danny Nayar
New Mexico State University





Quoting "Junaid N. Sahibzada" <sjunaidn@xxxxxxxxx>:

> Hi all,
>   
>   I will give a step-by-step account of what I have done so that you can
> tell me where I am making a mistake.
>   
>   1. I changed the local config files of all the compute nodes, so all the
> dedicated nodes have the following local config file:
>   
>   CONDOR_HOST = caudate-nh.nsw.cmis.csiro.au
>   
>   RELEASE_DIR = /usr/local/condor
>   
>   LOCAL_DIR = /home/condor/
>   
>   CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>   
>   UID_DOMAIN = nsw.cmis.csiro.au
>   
>   FILESYSTEM_DOMAIN = nsw.cmis.csiro.au
>   
>   CONDOR_IDS = 000.0
>   
>   DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
>   
>   LOCK = /tmp/condor-lock.$(HOSTNAME)0.874095049061911
>   
>   DAEMON_LIST = MASTER,  SCHEDD, STARTD
>   
>   JAVA = /usr/bin/java
>   
>   
>   ##--------------------------------------------------------------------
>   ## 2) Always run jobs, but prefer dedicated ones
>   ##--------------------------------------------------------------------
>   START           = True
>   SUSPEND = False
>   
>   CONTINUE        = True
>   
>   PREEMPT = False
>   
>   KILL            = False
>   
>   WANT_SUSPEND    = False
>   
>   WANT_VACATE     = False
>   
>   RANK            = Scheduler =?= $(DedicatedScheduler)
>   
>   MPI_CONDOR_RSH_PATH = $(LIBEXEC)
>   
>   
>   CONDOR_SSHD = /usr/sbin/sshd
>   
>   
>   CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
>   
>   
>   STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
>          
>   
>   
>   2. The dedicated submit machine has the following local config file:
>   
>     ##  What machine is your central manager?
>   
>   CONDOR_HOST = caudate-nh.nsw.cmis.csiro.au
>   
>   
>   ##  Pathnames:
>   ##  Where have you installed the bin, sbin and lib condor directories?
>   
>   RELEASE_DIR = /usr/local/condor
>   
>   
>   ##  Where is the local condor directory for each host?
>   ##  This is where the local config file(s), logs and
>   ##  spool/execute directories are located
>   
>   LOCAL_DIR = /home/condor/
>   
>   
>   ##  Mail parameters:
>   ##  When something goes wrong with condor at your site, who should get
>   ##  the email?
>   
>   CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
>   
>   
>   ##  Network domain parameters:
>   ##  Internet domain of machines sharing a common UID space.  If your
>   ##  machines don't share a common UID space, set it to
>   ##  UID_DOMAIN = $(FULL_HOSTNAME)
>   ##  to specify that each machine has its own UID space.
>   
>   UID_DOMAIN = nsw.cmis.csiro.au
>   
>   
>   ##  Internet domain of machines sharing a common file system.
>   ##  If your machines don't use a network file system, set it to
>   ##  FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
>   ##  to specify that each machine has its own file system.
>   
>   FILESYSTEM_DOMAIN = nsw.cmis.csiro.au
>   
>   
>   ##  The user/group ID <uid>.<gid> of the "Condor" user.
>   ##  (this can also be specified in the environment)
>   ##  Note: the CONDOR_IDS setting is ignored on Win32 platforms
>   
>   CONDOR_IDS = 000.0
>   
>   LOCK = /tmp/condor-lock.$(HOSTNAME)0.597654629106732
>   
>   
>   ##  condor_master
>   ##  Daemons you want the master to keep running for you:
>   
>   DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
>   
>   
>   DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
>   STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
>   
>   
>   ##  Java parameters:
>   ##  If you would like this machine to be able to run Java jobs,
>   ##  then set JAVA to the path of your JVM binary.  If you are not
>   ##  interested in Java, there is no harm in leaving this entry
>   ##  empty or incorrect.
>   
>   JAVA = /usr/bin/java
>   UNUSED_CLAIM_TIMEOUT = 600
>   
>   
>   START    = Owner == "sah006" || Owner == "condor"
>   SUSPEND  = False
>   CONTINUE = True
>   PREEMPT  = False
>   KILL     = False
>   
>   
>   These are the changes I have made to the compute nodes and to the dedicated
> submit node.
>   
>   I have submitted MPI jobs but they are not being executed.
>   
>   Here is my submit file:
>   
>   universe = parallel
>   executable = a.out
>   log = logfile
>   error = log.error
>   output = log.output
>   machine_count = 4
>   queue
>   
>   
>   The program is a simple one which I copied and pasted from a website. It
> compiles and runs perfectly from the command line.
>   
>   Now can anyone tell me what the problem is?
>   
>   And by the way, do I have to start an mpd ring before I send jobs to
> Condor?
>   
>   I have tried both ways. It's not working.
>   
>   regards
>   
>   
>   
>   
>   
> 
> Junaid N. Sahibzada
> Cell # (+61) 404 998 494 
> 284/9 Crystal St. Waterloo, 2017, NSW, Australia
> International Student MSc Internetworking, UTS, Australia
> Bachelor of Information Technology, NUST, Pakistan
> 