[Condor-users] LAM/MPI and the lamscript
- Date: Tue, 24 Apr 2007 18:12:08 +0100
- From: Sara Campos <scampos@xxxxxxxxxxx>
- Subject: [Condor-users] LAM/MPI and the lamscript
I've posted to the mailing list before but didn't receive any
answer. My main question was how to use the lamscript. Every
time I tried to use it the job stayed idle and condor_q -analyze
showed "6 match but reject the job for unknown reasons" (I am testing
with 3 computers, each with 2 processors). The submit script was
something like this:
Executable = lamscript
Universe = parallel
machine_count = 2
arguments = test_2.sh
output = run.out
error = run.error
log = run.log
+WantParallelSchedulingGroups = True
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = test_2.sh
And in the local config files I have something like this:
ParallelSchedulingGroup = "$(HOSTNAME)"
DedicatedScheduler = "DedicatedScheduler@$(FULL_HOSTNAME)"
Startd_EXPRS = $(STARTD_EXPRS), DedicatedScheduler,
RANK = Scheduler =?= $(DedicatedScheduler)
I was able to run test_2.sh in parallel outside Condor, so the
executable works and LAM and MPI are working on the machines.
In the lamscript I changed the LAMDIR and adapted the lamboot
command. I didn't add the LAMDIR to the .cshrc file as suggested
in the script because I don't have a .cshrc file (and honestly I didn't
understand why it was necessary to do that). I don't know whether I should
have changed the script in other places, whether I am doing something else
wrong that has nothing to do with the lamscript, or whether the problem is
related to the .cshrc file.
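For concreteness, the bash counterpart of the .cshrc change that the script suggests would presumably look like the sketch below. The path /usr/local/lam is only a placeholder for whatever LAMDIR is set to in lamscript, and I have not confirmed that putting this in a bash startup file is what lamscript actually needs.

```shell
# Placeholder install prefix (an assumption); use the LAMDIR value from lamscript.
LAMDIR=/usr/local/lam

# Prepend $LAMDIR/bin to PATH only if it is not already present,
# so sourcing this file repeatedly does not grow PATH.
case ":$PATH:" in
  *":$LAMDIR/bin:"*) ;;                 # already on PATH, nothing to do
  *) PATH="$LAMDIR/bin:$PATH" ;;        # put the LAM binaries first
esac

export LAMDIR PATH
```

Whether the shells started by lamboot on the remote machines would read such a file is a separate question that I have not verified.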
I hope someone can help me with this ... I didn't find much information
in the archives.
Thanks in advance
PS: Below you can see my previous message, which has some questions mostly
concerned with this problem.
We are thinking of using Condor to manage a pool of dedicated
multiprocessor machines. One of our goals is to be able to run
parallel jobs using LAM/MPI, running each job on a single machine
(using its different processors). We have been doing some tests with
only a few machines, but some questions have come up.
1. We tried to use the lamscript script provided but it didn't work,
probably because the user's login shell is bash. Is it necessary to have
csh as a login shell in order to run the lamscript? If so, how can we
work around that, since all users in our pool use bash? If I am confused,
what exactly is meant by this paragraph taken from the manual: "For LAM,
there is a similar path setting, but it is called LAMDIR in the
lamscript script. In addition, this path must be part of the path set in
the user’s .cshrc script. As of this writing, the LAM implementation
does not work if the user’s login shell is the Bourne or compatible shell."?
2. Is it imperative to define a dedicated scheduler in order to run
parallel jobs, or is this only optional? If it is optional, what are the
advantages of defining one? What happens, for instance, when the submission
script defines a scheduler but is submitted from a different machine (one
that is not the dedicated scheduler)? Finally, how does the central manager
order the jobs from the different submit machines' queues, and is this
related to the usefulness of defining a dedicated scheduler?
I hope I haven't asked too many boring questions... Thanks in advance.