
Re: [Condor-users] Not managing to get the parallel universe example from manual section "2.11.2 Parallel Job Submission" to run



Hello,
Someone who experienced a similar issue asked me how this problem was solved.

I assumed that I needed only one computer to run a single node (i.e. that the machine running the dedicated scheduler would also run a node). This was wrong. To run a single node, you need two machines. The machine running the dedicated scheduler uses a 'normal' condor_config.local file: it will submit the jobs but will not run a node locally.

Something like this (you don't need the commented-out lines):

## DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"

START     = True
SUSPEND   = False
CONTINUE  = True
PREEMPT   = False
KILL      = False
WANT_SUSPEND   = False
WANT_VACATE    = False

HIGHPORT = 9700 #Required by my firewall
LOWPORT = 9600  #Required by my firewall

UNUSED_CLAIM_TIMEOUT = 600 #this comes from the example /usr/local/condor/etc/examples/condor_config.local.dedicated.submit

## RANK      = Scheduler =?= $(DedicatedScheduler)

## STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

This isn't the entire file, but the remaining part was configured when Condor was installed and hasn't been altered since.

The second machine (which will run a node) needs to be configured to use the dedicated scheduler on the first machine. Its condor_config file looks like this (see the example /usr/local/condor/etc/examples/condor_config.local.dedicated.resource):

DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"

START           = True
SUSPEND         = False
CONTINUE        = True
PREEMPT         = False
KILL            = False
WANT_SUSPEND    = False
WANT_VACATE     = False
RANK            = Scheduler =?= $(DedicatedScheduler)
HIGHPORT        = 9700 # Required by my firewall
LOWPORT         = 9600 # Required by my firewall

MPI_CONDOR_RSH_PATH = $(LIBEXEC)

CONDOR_SSHD = /usr/sbin/sshd

CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen

STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

This was all with Condor (6.7.14, IIRC) on Fedora Core 4.

I was then able to run the initial submit file:

universe = parallel
executable = /bin/cat
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue

So that was OK with Condor. But it took something like 5 minutes to start, during which the job sat idle.

To speed this up I changed this:

NEGOTIATOR_INTERVAL     = 61 # was 300

Now the job starts much more quickly.

I also managed to run an MPI job on a single node with mpich-1.2.4 (I think simplempi is one of the provided MPI examples).

######################################
## MPI example submit description file
######################################
universe = MPI
executable = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
queue

I have not yet run these scripts on several nodes.

Currently I am trying to get the parallel universe to run this MPI example (it seems this would allow using LAM or newer versions of MPI).

######################################
## Example submit description file
## for MPICH 1 MPI
## works with MPICH 1.2.4, 1.2.5 and 1.2.6
######################################
universe = parallel
executable = mp1script
arguments = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
queue

Unfortunately the job starts 'running' but then blocks. For some reason it opens some connections but does not seem to recognize them (and then tries the next new port, again and again). I looked at the files to try to find the reason for this. In /usr/local/condor/libexec/sshd.sh there is a line like this:

	if grep "^Server listening on 0.0.0.0 port" sshd.out > /dev/null 2>&1

I replaced this by :

	if grep "Server listening on :: port" sshd.out > /dev/null 2>&1

I am not at all sure whether this was a typo, but the line with the '^' was there on both computers.
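For anyone wanting to check which form their sshd actually prints, here is a quick shell sketch. The sample log line is an assumption based on typical OpenSSH output: when sshd binds to the IPv6 wildcard it reports '::' instead of '0.0.0.0', which is why the anchored IPv4 pattern never matches.

```shell
# Simulated sshd.out from a host where sshd bound to the IPv6 wildcard.
# (This log line is illustrative, not copied from a real run.)
printf 'Server listening on :: port 9623.\n' > sshd.out

# The original test in sshd.sh: anchored IPv4 pattern -- fails here.
if grep "^Server listening on 0.0.0.0 port" sshd.out > /dev/null 2>&1
then echo "IPv4 pattern matched"
else echo "IPv4 pattern did not match"
fi

# The relaxed pattern: matches the IPv6 wildcard line.
if grep "Server listening on :: port" sshd.out > /dev/null 2>&1
then echo "IPv6 pattern matched"
else echo "IPv6 pattern did not match"
fi

rm sshd.out
```

On a host where sshd binds to 0.0.0.0 the original pattern works, so a safer change might be to accept both forms rather than swap one for the other.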

The next problem is that simplempi does not seem to be transferred to the temporary folder of the remote node, so there is an error (can't find the executable). I am not sure whether there is a nice way (a few lines to add) to transfer the executable.
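I have not tried this on the setup above, but since mp1script is the executable here and simplempi is only passed as an argument, Condor's file-transfer mechanism presumably does not know about it. Listing it explicitly with the transfer_input_files submit command might help, something like:

```
universe = parallel
executable = mp1script
arguments = simplempi
transfer_input_files = simplempi
machine_count = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
queue
```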