
Re: [Condor-users] Not managing to get the parallel universe example from manual section "2.11.2 Parallel Job Submission" to run



Hello,
Someone who experienced a similar issue asked me how this problem was solved.

I assumed that I needed only one computer to run a single node (i.e. that the machine running the dedicated scheduler would also run a node). This was wrong. To run a single node, you need two machines. The machine running the dedicated scheduler uses a 'normal' condor_config.local file: it will submit the jobs but will not run a node locally.

Something like this (you don't need the commented-out lines):

## DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"

START     = True
SUSPEND   = False
CONTINUE  = True
PREEMPT   = False
KILL      = False
WANT_SUSPEND   = False
WANT_VACATE    = False

HIGHPORT = 9700 #Required by my firewall
LOWPORT = 9600  #Required by my firewall

UNUSED_CLAIM_TIMEOUT = 600 #this comes from the example /usr/local/condor/etc/examples/condor_config.local.dedicated.submit

## RANK      = Scheduler =?= $(DedicatedScheduler)

## STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

This isn't the entire file, but the remaining part was configured when Condor was installed and hasn't been altered since.

The second machine (which will run a node) needs to be configured to use the dedicated scheduler on the first machine. Its condor_config file looks like this (see the example /usr/local/condor/etc/examples/condor_config.local.dedicated.resource):

DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"

START           = True
SUSPEND         = False
CONTINUE        = True
PREEMPT         = False
KILL            = False
WANT_SUSPEND    = False
WANT_VACATE     = False
RANK            = Scheduler =?= $(DedicatedScheduler)
HIGHPORT        = 9700 # Required by my firewall
LOWPORT         = 9600 # Required by my firewall

MPI_CONDOR_RSH_PATH = $(LIBEXEC)

CONDOR_SSHD = /usr/sbin/sshd

CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen

STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

This was all with Condor (6.7.14, IIRC) on Fedora Core 4.

I was then able to run the initial submit file:

universe = parallel
executable = /bin/cat
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue

So that was OK with Condor. But it took something like 5 minutes to start, during which the job sat idle.

To speed this up I changed this:

NEGOTIATOR_INTERVAL     = 61 # was 300

Now the job starts much more quickly.

I also managed to run an MPI job on a single node with mpich-1.2.4 (I think simplempi is one of the provided MPI examples).

######################################
## MPI example submit description file
######################################
universe = MPI
executable = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
queue

I have not yet run these scripts on several nodes.

Currently I am trying to get the parallel universe to run this MPI example (it seems this would allow using LAM or newer versions of MPI).

######################################
## Example submit description file
## for MPICH 1 MPI
## works with MPICH 1.2.4, 1.2.5 and 1.2.6
######################################
universe = parallel
executable = mp1script
arguments = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
queue

Unfortunately the job starts 'running' but then blocks. For some reason it opens some connections but does not seem to recognize them (and then tries the next new port, again and again). I looked at the files to try to find the reason for this. In /usr/local/condor/libexec/sshd.sh there is a line like this:

	if grep "^Server listening on 0.0.0.0 port" sshd.out > /dev/null 2>&1

I replaced this by :

	if grep "Server listening on :: port" sshd.out > /dev/null 2>&1

I am not at all sure whether this was a typo, but the line with the '^' was there on both computers.
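For anyone wanting to check which form their sshd actually prints, here is a quick shell sketch. The sample log line is an assumption based on typical OpenSSH output: when sshd binds to the IPv6 wildcard it reports '::' instead of '0.0.0.0', which is why the anchored IPv4 pattern never matches.

```shell
# Simulated sshd.out from a host where sshd bound to the IPv6 wildcard.
# (This log line is illustrative, not copied from a real run.)
printf 'Server listening on :: port 9623.\n' > sshd.out

# The original test in sshd.sh: anchored IPv4 pattern -- fails here.
if grep "^Server listening on 0.0.0.0 port" sshd.out > /dev/null 2>&1
then echo "IPv4 pattern matched"
else echo "IPv4 pattern did not match"
fi

# The relaxed pattern: matches the IPv6 wildcard line.
if grep "Server listening on :: port" sshd.out > /dev/null 2>&1
then echo "IPv6 pattern matched"
else echo "IPv6 pattern did not match"
fi

rm sshd.out
```

On a host where sshd binds to 0.0.0.0 the original pattern works, so a safer change might be to accept both forms rather than swap one for the other.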

The next problem is that simplempi does not seem to be transferred to the temporary folder of the remote node, so there is an error (can't find the executable). I am not sure whether there is a nice way (a few lines to add) to transfer the executable.
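I have not tried this on the setup above, but since mp1script is the executable here and simplempi is only passed as an argument, Condor's file-transfer mechanism presumably does not know about it. Listing it explicitly with the transfer_input_files submit command might help, something like:

```
universe = parallel
executable = mp1script
arguments = simplempi
transfer_input_files = simplempi
machine_count = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
queue
```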