
[HTCondor-users] help needed :) on configuration



Hi,


I have been working (seriously) with Condor for two weeks. It works perfectly on my test cluster.

I then tried it on a "real life" cluster installation, which has a master and 4 nodes.

The master has 2 network cards: one has the private address 192.168.1.1 and the other is on a public VLAN.

The nodes have the addresses 192.168.1.{100, 101, 102, 103}.

Running mpirun directly gives the results I want.

My command is:

mpirun clustalw-mpi -infile=/home/galaxy/galaxy-dist/database/files/000/dataset_11.dat -outfile=/home/galaxy/test/parallel/clustal/res.dat -OUTORDER=ALIGNED -SEQNOS=OFF -TYPE=DNA

But when I submit the job with condor_submit, using this submit file:


universe                = parallel
executable              = openmpiscript
arguments = "clustalw-mpi -infile=/home/galaxy/galaxy-dist/database/files/000/dataset_11.dat -outfile=/home/galaxy/test/parallel/clustal/res.dat -OUTORDER=ALIGNED -SEQNOS=OFF -TYPE=DNA"
Getenv                  = True
machine_count           = 8
transfer_input_files = contact,/home/galaxy/galaxy-dist/database/files/000/dataset_11.dat,/usr/bin/clustalw-mpi
transfer_output_files    = /home/galaxy/test/parallel/clustal/res.dat
should_transfer_files   = yes
when_to_transfer_output = on_exit
Log                     = logs/mpi_$(Process).log
Output                  = logs/mpi_$(Process).out
Error                   = logs/mpi_$(Process).error
notification            = never
queue 1


I do not get the "res.dat" file back.

Condor seems to run clustalw-mpi, but it returns no results.


This is my first problem.
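
A possible variant of the submit file, as an untested sketch. It assumes that Condor's file transfer copies each input file into the job's scratch directory under its base name, and that transfer_output_files is looked up relative to that scratch directory, so the absolute paths in the arguments would not point at the transferred copies. It also assumes that res.dat is written on the node whose output is transferred back. Only the lines that differ from the file above are shown:

# Refer to the transferred files by base name inside the scratch directory
arguments = "clustalw-mpi -infile=dataset_11.dat -outfile=res.dat -OUTORDER=ALIGNED -SEQNOS=OFF -TYPE=DNA"
# Name the output relative to the scratch directory
transfer_output_files   = res.dat
# Store the returned file at the original destination on the submit machine
transfer_output_remaps  = "res.dat = /home/galaxy/test/parallel/clustal/res.dat"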

A second problem occurs when I try NFS: jobs alternate periodically between the Idle and Running states. In the Condor log I see "write failed" or "read failed" errors for "/home/condor/execute/", but this path appears in neither the master's nor the slaves' configuration file.
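
The directory in that message is probably the one given by the EXECUTE setting. Since neither file shown in this message sets it, it may come from one of the files named in LOCAL_CONFIG_FILE or LOCAL_CONFIG_DIR; running "condor_config_val -verbose EXECUTE" on a node should report where it is defined. A minimal sketch of an override, assuming the goal is to keep each node's execute directory on local disk rather than on the NFS share (the path is hypothetical and not taken from this installation):

# Hypothetical local directory; it must exist on every node and be
# writable by the user Condor runs as
EXECUTE = /var/lib/condor/execute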



My installation runs on Debian with the condor 7.8.0 package.


My request is this: if you have working configuration files for a MASTER and a SLAVE, please send them to me :)

Or have a look below to understand my errors :)


The Master

FULL_HOSTNAME   = 192.168.1.1
CONDOR_HOST     = 192.168.1.1
RELEASE_DIR             = /usr
LOCAL_DIR               = /var
WORK_DIR                = /home/share/condor/host/$(HOSTNAME)
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local,/etc/condor/condor_config.local.dedicated.resource
LOCAL_CONFIG_DIR        = /etc/condor/config.d
UID_DOMAIN              = cluster.galaxy
FILESYSTEM_DOMAIN       = cluster.galaxy
COLLECTOR_NAME          = PoolGalaxy
CONDOR_IDS=0.0
ALLOW_READ = *
ALLOW_WRITE = *



For Slaves

FULL_HOSTNAME = 192.168.1.1
CONDOR_HOST     = $(FULL_HOSTNAME)
RELEASE_DIR             = /usr
LOCAL_DIR               = /var
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local,/etc/condor/condor_config.local.dedicated.resource
LOCAL_CONFIG_DIR        = /etc/condor/config.d
CONDOR_ADMIN            = jerome.leconte@xxxxxxxxxxxxxxx
MAIL                    = /bin/mail
UID_DOMAIN              = cluster.galaxy
FILESYSTEM_DOMAIN       = cluster.galaxy
COLLECTOR_NAME          = PoolGalaxy
ALLOW_READ = *
ALLOW_WRITE = $(FULL_HOSTNAME), $(IP_ADDRESS), *
BIND_ALL_INTERFACES = FALSE
NETWORK_INTERFACE = 192.168.1.*
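
One detail that may be worth checking: the slave file sets FULL_HOSTNAME to the master's address and then derives CONDOR_HOST from it. FULL_HOSTNAME is normally filled in by Condor with the local machine's own name, and it is unclear how an explicit value is treated. A sketch of the alternative, assuming the only intent was to point the slaves at the master (untested on this cluster, all other lines unchanged):

# Name the central manager directly and leave FULL_HOSTNAME unset,
# so that each node keeps its own host name
CONDOR_HOST = 192.168.1.1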




Thank you