
[Condor-users] MPI jobs not executing



Hi all,

I will give a step-by-step account of what I have done so that you can tell me where I am making a mistake.

1. I changed the local config files of all the compute nodes, so all the dedicated nodes now have the following local config file:

CONDOR_HOST = caudate-nh.nsw.cmis.csiro.au

RELEASE_DIR = /usr/local/condor

LOCAL_DIR = /home/condor/

CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

UID_DOMAIN = nsw.cmis.csiro.au

FILESYSTEM_DOMAIN = nsw.cmis.csiro.au

CONDOR_IDS = 000.0

DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"

LOCK = /tmp/condor-lock.$(HOSTNAME)0.874095049061911

DAEMON_LIST = MASTER,  SCHEDD, STARTD

JAVA = /usr/bin/java


##--------------------------------------------------------------------
## 2) Always run jobs, but prefer dedicated ones
##--------------------------------------------------------------------
START           = True
SUSPEND = False

CONTINUE        = True

PREEMPT = False

KILL            = False

WANT_SUSPEND    = False

WANT_VACATE     = False

RANK            = Scheduler =?= $(DedicatedScheduler)

MPI_CONDOR_RSH_PATH = $(LIBEXEC)


CONDOR_SSHD = /usr/sbin/sshd


CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen


STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
     

The dedicated submit machine has the following local config file:

##  What machine is your central manager?

CONDOR_HOST = caudate-nh.nsw.cmis.csiro.au


##  Pathnames:
##  Where have you installed the bin, sbin and lib condor directories?

RELEASE_DIR = /usr/local/condor


##  Where is the local condor directory for each host?
##  This is where the local config file(s), logs and
##  spool/execute directories are located

LOCAL_DIR = /home/condor/


##  Mail parameters:
##  When something goes wrong with condor at your site, who should get
##  the email?

CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxxxxxxxx


##  Network domain parameters:
##  Internet domain of machines sharing a common UID space.  If your
##  machines don't share a common UID space, set it to
##  UID_DOMAIN = $(FULL_HOSTNAME)
##  to specify that each machine has its own UID space.

UID_DOMAIN = nsw.cmis.csiro.au


##  Internet domain of machines sharing a common file system.
##  If your machines don't use a network file system, set it to
##  FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
##  to specify that each machine has its own file system.

FILESYSTEM_DOMAIN = nsw.cmis.csiro.au


##  The user/group ID <uid>.<gid> of the "Condor" user.
##  (this can also be specified in the environment)
##  Note: the CONDOR_IDS setting is ignored on Win32 platforms

CONDOR_IDS = 000.0

LOCK = /tmp/condor-lock.$(HOSTNAME)0.597654629106732


##  condor_master
##  Daemons you want the master to keep running for you:

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD


DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxo.au"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler


##  Java parameters:
##  If you would like this machine to be able to run Java jobs,
##  then set JAVA to the path of your JVM binary.  If you are not
##  interested in Java, there is no harm in leaving this entry
##  empty or incorrect.

JAVA = /usr/bin/java
UNUSED_CLAIM_TIMEOUT = 600


START    = Owner == "sah006" || Owner == "condor"
SUSPEND  = False
CONTINUE = True
PREEMPT  = False
KILL     = False

These are the changes I have made to the compute nodes and to the dedicated submit node.
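
To double-check that the two local config files agree on the settings that matter, a small script along these lines could be used. This is only a sketch: it handles plain `KEY = VALUE` lines and `#` comments, does no macro expansion, and is not how Condor itself reads its config. The function names and the example host names are made up for illustration. My understanding is that the DedicatedScheduler string on the nodes has to name the scheduler on the submit machine exactly, so that is the value I would compare first.

```python
def parse_config(text):
    """Parse plain 'KEY = VALUE' lines, skipping blanks and '#' comments.

    Keys are upper-cased because Condor config names are case-insensitive.
    Later assignments override earlier ones; no $(MACRO) expansion is done.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values such as
        # 'Scheduler =?= $(DedicatedScheduler)' stay intact.
        key, sep, value = line.partition("=")
        if not sep:
            continue
        settings[key.strip().upper()] = value.strip()
    return settings


def differing_keys(node_cfg, submit_cfg, keys):
    """Return {key: (node_value, submit_value)} for keys whose values differ."""
    diffs = {}
    for key in keys:
        node_value = node_cfg.get(key.upper())
        submit_value = submit_cfg.get(key.upper())
        if node_value != submit_value:
            diffs[key] = (node_value, submit_value)
    return diffs


# Hypothetical config snippets; the host names are placeholders.
node_text = (
    "CONDOR_HOST = cm.example.org\n"
    'DedicatedScheduler = "DedicatedScheduler@submit.example.org"\n'
    "START = True\n"
)
submit_text = (
    "## What machine is your central manager?\n"
    "CONDOR_HOST = cm.example.org\n"
    'DedicatedScheduler = "DedicatedScheduler@other.example.org"\n'
)

diffs = differing_keys(
    parse_config(node_text),
    parse_config(submit_text),
    ["CONDOR_HOST", "DedicatedScheduler"],
)
for key, (node_value, submit_value) in diffs.items():
    print(f"{key}: node has {node_value}, submit machine has {submit_value}")
```

With the placeholder text above, only DedicatedScheduler is reported, since CONDOR_HOST matches in both snippets.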

I have submitted MPI jobs, but they are not being executed.

Here is my submit file:

universe = parallel
executable = a.out
log = logfile
error = log.error
output = log.output
machine_count = 4
queue
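
As a sanity check on a submit description like this one, a rough sketch such as the following could confirm that the parallel universe is selected, that machine_count is set as its own command, and that there is a queue statement. It assumes one command per line and is not a replacement for what condor_submit actually does; `parse_submit` and `check_parallel` are hypothetical helper names, and the two example descriptions are invented for the illustration.

```python
def parse_submit(text):
    """Return (commands, queue_seen) for a submit description.

    Only 'name = value' lines and a 'queue' line are recognised.
    """
    commands = {}
    queue_seen = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.lower().split()[0] == "queue":
            queue_seen = True
            continue
        # Everything after the first '=' is taken as the value.
        name, sep, value = line.partition("=")
        if sep:
            commands[name.strip().lower()] = value.strip()
    return commands, queue_seen


def check_parallel(text):
    """List problems a parallel-universe submit description might have."""
    commands, queue_seen = parse_submit(text)
    problems = []
    if commands.get("universe", "").lower() != "parallel":
        problems.append("universe is not parallel")
    if not commands.get("machine_count", "").isdigit():
        problems.append("machine_count is missing or not a number")
    if not queue_seen:
        problems.append("no queue statement")
    return problems


# One command per line: nothing is reported.
good = "universe = parallel\nexecutable = a.out\nmachine_count = 4\nqueue\n"

# Two commands run together on one line: machine_count ends up inside
# the value of 'output', so the check reports it as missing.
merged = (
    "universe = parallel\nexecutable = a.out\n"
    "output = log.output machine_count = 4\nqueue\n"
)

print(check_parallel(good))
print(check_parallel(merged))
```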

The program is a simple one that I copied and pasted from a website. It compiles and runs perfectly from the command line.

Can anyone tell me what the problem is?

By the way, do I have to start an mpd ring before I send jobs to Condor?

I have tried it both ways, and it does not work either way.

Regards,






Junaid N. Sahibzada
Cell # (+61) 404 998 494 
284/9 Crystal St. Waterloo, 2017, NSW, Australia
International Student MSc Internetworking, UTS, Australia
Bachelor of Information Technology, NUST, Pakistan

