Re: [Condor-users] New to Condor, Need to RUN MPI
- Date: Tue, 3 Feb 2009 14:39:00 -0500
- From: Samir Khanal <skhanal@xxxxxxxx>
- Subject: Re: [Condor-users] New to Condor, Need to RUN MPI
Hi,
I have been trying hard to get Condor working for MPI jobs.
I have run into another issue: condor_status does not show the compute nodes.
I only see the front-end slots and none of the compute nodes.
[skhanal@xxxxx ~]$ condor_status
Name               OpSys  Arch    State      Activity  LoadAv  Mem  ActvtyTime
slot1@xxxxxxxxxxxx LINUX  X86_64  Owner      Idle      0.000   990  0+00:15:05
slot2@xxxxxxxxxxxx LINUX  X86_64  Unclaimed  Idle      0.000   990  0+02:30:07
slot3@xxxxxxxxxxxx LINUX  X86_64  Unclaimed  Idle      0.000   990  0+02:30:08
slot4@xxxxxxxxxxxx LINUX  X86_64  Unclaimed  Idle      0.000   990  0+02:30:09
slot5@xxxxxxxxxxxx LINUX  X86_64  Unclaimed  Idle      0.000   990  0+02:30:10
slot6@xxxxxxxxxxxx LINUX  X86_64  Unclaimed  Idle      0.000   990  0+02:30:11
slot7@xxxxxxxxxxxx LINUX  X86_64  Unclaimed  Idle      0.000   990  0+02:30:12
slot8@xxxxxxxxxxxx LINUX  X86_64  Unclaimed  Idle      0.000   990  0+02:30:05

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX      8      1        0          7        0           0         0
       Total      8      1        0          7        0           0         0
---------------------------------------------------------------------------
[skhanal@xxxxx~]$ ps -el | grep condor
5 S 407 3678 1 0 75 0 - 6700 - ? 00:00:02 condor_master
4 S 407 3695 3678 0 75 0 - 6936 - ? 00:00:00 condor_collecto
4 S 407 3859 3678 0 75 0 - 6745 - ? 00:00:00 condor_schedd
4 S 407 3861 3678 0 78 0 - 6604 - ? 00:00:07 condor_startd
4 S 0 3865 3859 0 78 0 - 4981 - ? 00:00:00 condor_procd
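With many slots per machine it can be hard to tell at a glance which hosts have actually reported to the collector. The following is a minimal sketch, not part of Condor: the helper name `slots_per_host` is made up for illustration, and it assumes the default `condor_status` layout shown above (eight whitespace-separated columns, with the last column in `days+hh:mm:ss` form).

```shell
# Summarize condor_status output read from stdin as "host slot-count" lines.
# Hypothetical helper; assumes the default eight-column condor_status layout.
slots_per_host() {
    awk 'NF == 8 && $8 ~ /^[0-9]+[+][0-9:]+$/ {
             host = $1
             sub(/^[^@]*@/, "", host)   # drop a leading "slotN@" if present
             count[host]++
         }
         END { for (h in count) print h, count[h] }' | sort
}
```

It would be used as `condor_status | slots_per_host`. If only the front end is listed, the compute nodes' startds have not advertised themselves to the collector named by CONDOR_HOST.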
The local config on the front end has these entries:
--------------------------------------------------------
CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxx
CONDOR_DEVELOPERS = NONE
CONDOR_DEVELOPERS_COLLECTOR = NONE
CONDOR_HOST = xxxxx.xx.xxxx.xxx
CONDOR_IDS = 407.407
DAEMON_LIST = MASTER, SCHEDD, STARTD,COLLECTOR
EMAIL_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = xx.xxxx.xxx
HOSTALLOW_WRITE = xxxxx.xx.xxxx.xxx
JAVA = /usr/java/latest/bin/java
LOCAL_DIR = /var/opt/condor
LOCK = /tmp/condor-lock.$(HOSTNAME)
MAIL = /bin/mail
NEGOTIATOR_INTERVAL = 120
NETWORK_INTERFACE = 129.1.64.81
RELEASE_DIR = /opt/condor
UID_DOMAIN = xx.xxxx.xxx
and the one on the compute nodes has these:
######################################################################
#
# Condor local configuration file for compute node.
#
CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxx
CONDOR_DEVELOPERS = NONE
CONDOR_DEVELOPERS_COLLECTOR = NONE
CONDOR_HOST = xxxxx.xx.xxxx.xxx
CONDOR_IDS = 407.407
DAEMON_LIST = MASTER, SCHEDD, STARTD
EMAIL_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = xx.xxxx.xxx
HOSTALLOW_WRITE = xxxxx.xx.xxxx.xxx, *.local, *.xx.xxxx.xxx
JAVA = /usr/java/latest/bin/java
LOCAL_DIR = /var/opt/condor
LOCK = /tmp/condor-lock.$(HOSTNAME)
MAIL = /bin/mail
NEGOTIATOR_INTERVAL = 120
RELEASE_DIR = /opt/condor
UID_DOMAIN = xx.xxxx.xxx
# First set JAVA_MAXHEAP_ARGUMENT to null, to disable the default of max RAM
JAVA_MAXHEAP_ARGUMENT =
# Now set the argument with the Sun-specific maximum allowable value
JAVA_EXTRA_ARGUMENTS = -Xmx1906m
NETWORK_INTERFACE = 10.1.255.254
##--------------------------------------------------------------------
## 1) Only run dedicated jobs
##--------------------------------------------------------------------
START = Scheduler =?= $(DedicatedScheduler)
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
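The START and RANK expressions above expand $(DedicatedScheduler), but the excerpt does not show where that macro is defined. The Condor manual's dedicated-scheduling setup defines it as a quoted string of the form "DedicatedScheduler@full.host.name" and adds it to STARTD_ATTRS (STARTD_EXPRS in older releases) so that it is advertised in the machine ad. The sketch below is a rough check of a single config file for those two settings; the function name `check_dedicated` is made up for illustration, and the definitions may equally live in the global config file rather than the local one shown here.

```shell
# Report whether a Condor config file defines the dedicated-scheduler settings.
# Hypothetical helper; it inspects only the one file it is given.
check_dedicated() {
    cfg="$1"
    status=0
    # Macro names in Condor config files are case-insensitive, hence -i.
    if grep -Eiq '^[[:space:]]*DedicatedScheduler[[:space:]]*=' "$cfg"; then
        echo "DedicatedScheduler: defined"
    else
        echo "DedicatedScheduler: MISSING"
        status=1
    fi
    if grep -Eiq '^[[:space:]]*STARTD_(ATTRS|EXPRS)[[:space:]]*=.*DedicatedScheduler' "$cfg"; then
        echo "STARTD_ATTRS: advertises DedicatedScheduler"
    else
        echo "STARTD_ATTRS: does not mention DedicatedScheduler"
        status=1
    fi
    return $status
}
```

Running `condor_config_val DedicatedScheduler` on a compute node shows the effective value across all config files, which is the more reliable check when the setting could be in either file.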
Am I missing anything?
Samir
________________________________________
From: condor-users-bounces@xxxxxxxxxxx [condor-users-bounces@xxxxxxxxxxx] On Behalf Of Samir Khanal [skhanal@xxxxxxxx]
Sent: Friday, January 30, 2009 3:29 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] New to Condor, Need to RUN MPI
Hi Todd,
As per your suggestion, I just changed MPDIR:
---------------------------
# Set this to the bin directory of MPICH installation
MPDIR=/opt/mpich/gnu/bin
PATH=$MPDIR:.:$PATH
export PATH
export P4_RSHCOMMAND=$CONDOR_SSH
CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE
# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
## run the actual mpijob
mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@
--------------------
That strange message seems to have gone away, but I still get the following:
--------------------------
running /var/opt/condor/execute/dir_6084/bones on 2 LINUX ch_p4 processors
Cannot read machines.
Looked for files with extension LINUX in
directory /opt/mpich/gnu/share .
---------------------------
I checked, and there is a file called machines.LINUX in that directory.
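The "Cannot read machines." message suggests that mpirun could not use the `machines` file the script writes, rather than the machines.LINUX file in the share directory. One possibility, which is only a guess from the script as posted, is that the `sort -n +0` step produced an empty file: `+0` is an obsolete way of naming the first sort key, and newer versions of sort may treat it as a file name instead. The sketch below builds the machinefile with the POSIX `-k` spelling and fails early if the result is empty. The function name `make_machinefile` is made up for illustration, and it assumes each contact-file line begins with a numeric node number followed by the host name.

```shell
# Build an MPICH machinefile from a Condor contact file.
# Hypothetical helper; assumes "node-number hostname ..." on each contact line.
make_machinefile() {
    contact="$1"
    out="$2"
    if [ ! -r "$contact" ]; then
        echo "contact file not readable: $contact" >&2
        return 1
    fi
    # -k 1,1 sorts numerically on the first field (the POSIX form of "+0").
    sort -n -k 1,1 "$contact" | awk '{print $2}' > "$out"
    if [ ! -s "$out" ]; then
        echo "machinefile is empty: $out" >&2
        return 1
    fi
}
```

In the script this would replace the sort line, for example `make_machinefile "$CONDOR_CONTACT_FILE" machines || exit 1`, so that an unusable machinefile is reported before mpirun starts.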
Thanks
Samir Khanal
CS Grad Student
Hayes 226
Bowling Green State University
Bowling Green, OH 43402
skhanal@xxxxxxxx
________________________________________
From: condor-users-bounces@xxxxxxxxxxx [condor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum [tannenba@xxxxxxxxxxx]
Sent: Friday, January 30, 2009 3:03 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] New to Condor, Need to RUN MPI
Samir Khanal wrote:
> I tried Parallel Universe too, here is what i get
[snip]
> running /home/skhanal/condor/bones on 2 LINUX ch_p4 processors
> Created /var/opt/condor/execute/dir_5352/PILxVizf5531
> Host compute-0-0 is not in contact file /var/opt/condor/execute/dir_5352/contact
> p0_5556: p4_error: Child process exited while making connection to remote process on compute-0-0: 0
> p0_5556: (2.003906) net_send: could not write to fd=4, errno = 32
>
>
> The job does not complete successfully with above messages.
>
> Help ! Help!
>
Why did you feel compelled to hack the sample mp1script included with
Condor? Are you trying to use mpich? If so, just set the path
correctly (to MPDIR) in the sample script where the comment says so; no
other changes should be needed.
Your customizations to the sample mp1script look very suspect to me.
regards,
Todd
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/