Re: [Condor-users] MPI jobs not executing



Danny

Thanks for the reply. There is a bug in the local config files provided as examples: they are incomplete.

The example local config files do not contain the rest of the normal directives.

They assume that you will add the rest of them yourself, and you have to add them to the file before things work.

Look at my config files. I have added the rest of the directives as well.
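
One way to confirm that a node is actually picking up the added directives is to ask Condor for each value. This is only a sketch: it assumes `condor_config_val` is on the PATH and exits non-zero for a name that is not defined, and `check_condor_settings` is a hypothetical helper name, not something shipped with Condor.

```shell
# Print the value of each named Condor setting, or report it as missing.
# Returns 1 if at least one setting is not defined, 0 otherwise.
# Assumption: condor_config_val exits non-zero for an undefined name.
check_condor_settings() {
    missing=0
    for name in "$@"; do
        if value=$(condor_config_val "$name" 2>/dev/null); then
            printf '%s = %s\n' "$name" "$value"
        else
            printf '%s is not defined\n' "$name"
            missing=1
        fi
    done
    return "$missing"
}

# Example (run on a dedicated node):
#   check_condor_settings DedicatedScheduler START RANK STARTD_EXPRS
```

The names in the usage comment are the ones that appear in the config files quoted further down; any other directive can be checked the same way.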

Also, can you tell me: when you run your MPI jobs, do you set up your mpd rings outside Condor before you run them, or do you not need rings because Condor handles that for you?

Somehow, after I set up my ring, something goes wrong: the mpd daemon is still there, but the ring is gone.

This happens every 30 to 40 minutes or so, and I have to set up the ring again manually.

Also, if you can't set up passwordless logins, you can follow this method to create a ring.

First, go to one machine and do this:

mpd &
mpdtrace -l

This starts an mpd daemon and gives you the IP address and port on which mpd is running.

Then go to the other machine and run this command:

mpd -h IP -p port &

This starts a daemon on the second machine and joins it to the ring on the first machine. The IP and the port must be those of the first machine.
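
The two steps above can be sketched as a pair of small shell helpers. This assumes that each line of `mpdtrace -l` output has the form `hostname_port (ip)`; check that against what your MPICH2 version prints. The function names `parse_mpdtrace_line` and `join_ring` are made up for illustration.

```shell
# Extract "ip port" from one line of `mpdtrace -l` output.
# Assumption: the line looks like "hostname_port (ip)".
parse_mpdtrace_line() {
    line=$1
    name=${line%% *}      # "hostname_port"
    port=${name##*_}      # text after the last underscore
    ip=${line##*\(}       # "ip)"
    ip=${ip%\)}           # strip the closing parenthesis
    printf '%s %s\n' "$ip" "$port"
}

# Start a local mpd and join it to the ring at the given IP and port,
# using the same command as in the message above.
join_ring() {
    mpd -h "$1" -p "$2" &
}

# On the first machine:
#   mpd &
#   parse_mpdtrace_line "$(mpdtrace -l | head -n 1)"
# On the second machine, with the two values printed above:
#   join_ring IP PORT
```

Nothing here needs passwordless logins; the IP and port are simply carried from the first machine to the second by hand.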

regards


rnayar@xxxxxxxx wrote:
Junaid,

Have a look at the example
(install directory)/etc/examples/condor_config.local.dedicated.resource
I think you are going to need to define more than what you have defined so far.
In my case I'm running MPI jobs. Outside of Condor, when an MPI job is run
you usually need to set up some sort of passwordless login. However, Condor has taken
care of this for you and includes 2 scripts:
lamscript
mp1script
What these scripts do is "essentially" take care of passwordless login for you
while your MPI job is running. I'll be mucking around with Condor later today
and I'll let you know what I come up with.
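
As a rough sketch of how such a wrapper is usually wired in, the submit file names the script as the executable and passes the MPI program as its argument. The helper below just prints a submit description in that shape; `write_mpi_submit` is a hypothetical name, and it assumes a copy of mp1script sits in the submit directory with its MPI path edited for your install. mp1script was written for MPICH 1, so whether it fits an mpd-based setup is something to verify.

```shell
# Print a parallel-universe submit description that runs the given
# MPI program through mp1script.
#   $1 = MPI executable (for example a.out)
#   $2 = number of machines
write_mpi_submit() {
    program=$1
    count=$2
    cat <<EOF
universe = parallel
executable = mp1script
arguments = $program
machine_count = $count
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = $program
log = logfile
error = log.error
output = log.output
queue
EOF
}

# Example:
#   write_mpi_submit a.out 4 > mpi.submit
#   condor_submit mpi.submit
```

The file transfer lines are there because the wrapper, not the MPI program, is the executable, so the program itself has to be shipped as an input file.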

Cheers

Danny Nayar
New Mexico State University





Quoting "Junaid N. Sahibzada" :

> Hi all,
>
> I will give a step-by-step account of what I have done so that you can
> tell me where I am making a mistake.
>
> 1. I changed the local config files of all the compute nodes, so all the
> dedicated nodes have the following local config file:
>
> CONDOR_HOST = caudate-nh.nsw.cmis.csiro.au
>
> RELEASE_DIR = /usr/local/condor
>
> LOCAL_DIR = /home/condor/
>
> CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> UID_DOMAIN = nsw.cmis.csiro.au
>
> FILESYSTEM_DOMAIN = nsw.cmis.csiro.au
>
> CONDOR_IDS = 000.0
>
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
>
> LOCK = /tmp/condor-lock.$(HOSTNAME)0.874095049061911
>
> DAEMON_LIST = MASTER, SCHEDD, STARTD
>
> JAVA = /usr/bin/java
>
>
> ##--------------------------------------------------------------------
> ## 2) Always run jobs, but prefer dedicated ones
> ##--------------------------------------------------------------------
> START = True
> SUSPEND = False
>
> CONTINUE = True
>
> PREEMPT = False
>
> KILL = False
>
> WANT_SUSPEND = False
>
> WANT_VACATE = False
>
> RANK = Scheduler =?= $(DedicatedScheduler)
>
> MPI_CONDOR_RSH_PATH = $(LIBEXEC)
>
>
> CONDOR_SSHD = /usr/sbin/sshd
>
>
> CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
>
>
> STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
>
>
>
> And the dedicated submit machine has the following local config file
>
> ## What machine is your central manager?
>
> CONDOR_HOST = caudate-nh.nsw.cmis.csiro.au
>
>
> ## Pathnames:
> ## Where have you installed the bin, sbin and lib condor directories?
>
> RELEASE_DIR = /usr/local/condor
>
>
> ## Where is the local condor directory for each host?
> ## This is where the local config file(s), logs and
> ## spool/execute directories are located
>
> LOCAL_DIR = /home/condor/
>
>
> ## Mail parameters:
> ## When something goes wrong with condor at your site, who should get
> ## the email?
>
> CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
>
> ## Network domain parameters:
> ## Internet domain of machines sharing a common UID space. If your
> ## machines don't share a common UID space, set it to
> ## UID_DOMAIN = $(FULL_HOSTNAME)
> ## to specify that each machine has its own UID space.
>
> UID_DOMAIN = nsw.cmis.csiro.au
>
>
> ## Internet domain of machines sharing a common file system.
> ## If your machines don't use a network file system, set it to
> ## FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
> ## to specify that each machine has its own file system.
>
> FILESYSTEM_DOMAIN = nsw.cmis.csiro.au
>
>
> ## The user/group ID . of the "Condor" user.
> ## (this can also be specified in the environment)
> ## Note: the CONDOR_IDS setting is ignored on Win32 platforms
>
> CONDOR_IDS = 000.0
>
> LOCK = /tmp/condor-lock.$(HOSTNAME)0.597654629106732
>
>
> ## condor_master
> ## Daemons you want the master to keep running for you:
>
> DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
>
>
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
>
>
> ## Java parameters:
> ## If you would like this machine to be able to run Java jobs,
> ## then set JAVA to the path of your JVM binary. If you are not
> ## interested in Java, there is no harm in leaving this entry
> ## empty or incorrect.
>
> JAVA = /usr/bin/java
> UNUSED_CLAIM_TIMEOUT = 600
>
>
> START = Owner == "sah006" || Owner == "condor"
> SUSPEND = False
> CONTINUE = True
> PREEMPT = False
> KILL = False
>
>
> These are the changes I have made to the compute nodes and to the dedicated
> submit node.
>
> I have submitted MPI jobs, but they are not being executed.
>
> Here is my submit file:
>
> universe = parallel
> executable = a.out
> log = logfile
> error = log.error
> output = log.output
> machine_count = 4
> queue
>
>
> The program is a simple one that I copied and pasted from a website. It
> compiles and runs perfectly from the command line.
>
> Now can anyone tell me what the problem is?
>
> And by the way, do I have to start an mpd ring before I send jobs to
> Condor?
>
> I have tried both ways. It's not working.
>
> regards
>
>
>
>
>
>
> Junaid N. Sahibzada
> Cell # (+61) 404 998 494
> 284/9 Crystal St. Waterloo, 2017, NSW, Australia
> International Student MSc Internetworking, UTS, Australia
> Bachelor of Information Technology, NUST, Pakistan


_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users



Junaid N. Sahibzada
Cell # (+61) 404 998 494 
284/9 Crystal St. Waterloo, 2017, NSW, Australia
International Student MSc Internetworking, UTS, Australia
Bachelor of Information Technology, NUST, Pakistan