
Re: [Condor-users] New to Condor, Need to RUN MPI



Hi,
I have been trying hard to get Condor working for MPI jobs.

I have another issue: condor_status does not show the compute nodes.
I see only the frontend slots.

[skhanal@xxxxx ~]$ condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@xxxxxxxxxxxx LINUX      X86_64 Owner     Idle     0.000   990  0+00:15:05
slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   990  0+02:30:07
slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   990  0+02:30:08
slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   990  0+02:30:09
slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   990  0+02:30:10
slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   990  0+02:30:11
slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   990  0+02:30:12
slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   990  0+02:30:05

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     8     1       0         7       0          0        0

               Total     8     1       0         7       0          0        0
---------------------------------------------------------------------------
[skhanal@xxxxx~]$ ps -el | grep condor
5 S   407  3678     1  0  75   0 -  6700 -      ?        00:00:02 condor_master
4 S   407  3695  3678  0  75   0 -  6936 -      ?        00:00:00 condor_collecto
4 S   407  3859  3678  0  75   0 -  6745 -      ?        00:00:00 condor_schedd
4 S   407  3861  3678  0  78   0 -  6604 -      ?        00:00:07 condor_startd
4 S     0  3865  3859  0  78   0 -  4981 -      ?        00:00:00 condor_procd


The local config on the frontend has these entries:
--------------------------------------------------------

CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxx
CONDOR_DEVELOPERS = NONE
CONDOR_DEVELOPERS_COLLECTOR = NONE
CONDOR_HOST = xxxxx.xx.xxxx.xxx
CONDOR_IDS = 407.407
DAEMON_LIST = MASTER, SCHEDD, STARTD,COLLECTOR
EMAIL_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = xx.xxxx.xxx
HOSTALLOW_WRITE = xxxxx.xx.xxxx.xxx
JAVA = /usr/java/latest/bin/java
LOCAL_DIR = /var/opt/condor
LOCK = /tmp/condor-lock.$(HOSTNAME)
MAIL = /bin/mail
NEGOTIATOR_INTERVAL = 120
NETWORK_INTERFACE = 129.1.64.81
RELEASE_DIR = /opt/condor
UID_DOMAIN = xx.xxxx.xxx



and the one on the compute nodes has these:

######################################################################
#
#  Condor local configuration file for compute node.
#
CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxx
CONDOR_DEVELOPERS = NONE
CONDOR_DEVELOPERS_COLLECTOR = NONE
CONDOR_HOST = xxxxx.xx.xxxx.xxx
CONDOR_IDS = 407.407
DAEMON_LIST = MASTER, SCHEDD, STARTD
EMAIL_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = xx.xxxx.xxx
HOSTALLOW_WRITE = xxxxx.xx.xxxx.xxx, *.local, *.xx.xxxx.xxx
JAVA = /usr/java/latest/bin/java
LOCAL_DIR = /var/opt/condor
LOCK = /tmp/condor-lock.$(HOSTNAME)
MAIL = /bin/mail
NEGOTIATOR_INTERVAL = 120
RELEASE_DIR = /opt/condor
UID_DOMAIN = xx.xxxx.xxx
# First set JAVA_MAXHEAP_ARGUMENT to null, to disable the default of max RAM
JAVA_MAXHEAP_ARGUMENT =
# Now set the argument with the Sun-specific maximum allowable value
JAVA_EXTRA_ARGUMENTS = -Xmx1906m
NETWORK_INTERFACE = 10.1.255.254

##--------------------------------------------------------------------
## 1) Only run dedicated jobs
##--------------------------------------------------------------------
START           = Scheduler =?= $(DedicatedScheduler)
SUSPEND = False
CONTINUE        = True
PREEMPT = False
KILL            = False
WANT_SUSPEND    = False
WANT_VACATE     = False
RANK            = Scheduler =?= $(DedicatedScheduler)
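
For comparison, START and RANK above both reference $(DedicatedScheduler), but the file as pasted does not define that macro. Below is a sketch of the kind of definition the dedicated-resource setup in the Condor manual describes, assuming it is not already set in the global config. The host name is a placeholder for the submit machine's full host name, not a value from this cluster, and some older releases document the second setting as STARTD_EXPRS.

```
# Placeholder host: replace full.host.name with the machine running the dedicated schedd
DedicatedScheduler = "DedicatedScheduler@full.host.name"
# Publish the attribute in the machine ClassAd so START and RANK can refer to it
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
```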


Am I missing anything?

Samir 



________________________________________
From: condor-users-bounces@xxxxxxxxxxx [condor-users-bounces@xxxxxxxxxxx] On Behalf Of Samir Khanal [skhanal@xxxxxxxx]
Sent: Friday, January 30, 2009 3:29 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] New to Condor, Need to RUN MPI

Hi Todd,
As per your suggestion, I changed just the MPDIR setting:

---------------------------
# Set this to the bin directory of MPICH installation
MPDIR=/opt/mpich/gnu/bin
PATH=$MPDIR:.:$PATH
export PATH

export P4_RSHCOMMAND=$CONDOR_SSH

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE

# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

## run the actual mpijob
mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@

--------------------
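
The error below names `machines`, which is the relative path the script passes to `-machinefile`, so one thing worth checking is whether that file is actually written and readable where mpirun runs. Here is a minimal standalone sketch of the contact-file step, assuming a contact file whose first field is the node number and whose second field is the host name (as the script's comment describes). The sample contact lines and the temporary directory are made up for illustration, and `sort -n -k 1,1` is used in place of the obsolescent `+0` key syntax.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing in the real scratch dir is touched.
workdir=$(mktemp -d)

# Hypothetical contact file: only fields 1 (node number) and 2 (host name)
# matter here; the remaining fields are invented filler.
cat > "$workdir/contact" <<'EOF'
1 compute-0-1 4445 user /tmp/b 1
0 compute-0-0 4444 user /tmp/a 1
EOF

# Sort numerically on the node number, then keep only the host name.
sort -n -k 1,1 "$workdir/contact" | awk '{print $2}' > "$workdir/machines"

# Confirm the machine file exists and is readable before handing it to mpirun.
if [ -r "$workdir/machines" ]; then
    cat "$workdir/machines"
else
    echo "machines file is missing or unreadable" >&2
fi
```

If the same check fails inside the job's scratch directory, the problem is in producing the file rather than in mpirun itself.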

That strange message seems to have gone away, but I still get the following:

--------------------------
running /var/opt/condor/execute/dir_6084/bones on 2 LINUX ch_p4 processors
Cannot read machines.
Looked for files with extension LINUX in
directory /opt/mpich/gnu/share .
---------------------------
I checked, and there is a file called machines.LINUX in that directory.

Thanks

Samir Khanal
CS Grad Student
Hayes 226
Bowling Green State University
Bowling Green, OH 43402
skhanal@xxxxxxxx

________________________________________
From: condor-users-bounces@xxxxxxxxxxx [condor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum [tannenba@xxxxxxxxxxx]
Sent: Friday, January 30, 2009 3:03 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] New to Condor, Need to RUN MPI

Samir Khanal wrote:
> I tried Parallel Universe too, here is what i get
[snip]
> running /home/skhanal/condor/bones on 2 LINUX ch_p4 processors
> Created /var/opt/condor/execute/dir_5352/PILxVizf5531
> Host compute-0-0 is not in contact file /var/opt/condor/execute/dir_5352/contact
> p0_5556:  p4_error: Child process exited while making connection to remote process on compute-0-0: 0
> p0_5556: (2.003906) net_send: could not write to fd=4, errno = 32
>
>
> The job does not complete successfully with above messages.
>
> Help ! Help!
>

Why did you feel compelled to hack the sample mp1script included with
Condor?  Are you trying to use mpich?  If so, just set the path
correctly (to MPDIR) in the sample script where the comment says so; no
other changes should be needed.

Your customizations to the sample mp1script look very suspect to me.

regards,
Todd


_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/