
Re: [Condor-users] New to Condor, Need to RUN MPI



I tried the parallel universe too; here is what I get:

-- Job script----
universe = parallel
#getenv  = True
executable = /home/skhanal/condor/mp1script
arguments = /home/skhanal/condor/bones
Output = foo.out.$(NODE)
log = userlog.txt
error = foo.err.$(NODE)
machine_count = 2
should_transfer_files = yes
when_to_transfer_output = on_exit
#transfer_input_files =
queue


----mp1script----

# Set this to the bin directory of MPICH installation
MPDIR=/opt/mpich/gnu/bin
PATH=$MPDIR:.:$PATH
export PATH

export P4_RSHCOMMAND=$CONDOR_SSH

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE

# The second field in the contact file is the machine name
# that condor_ssh knows how to use
#sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

## run the actual mpijob
/opt/mpich/gnu/bin/mpirun -v -np $_CONDOR_NPROCS -machinefile /home/skhanal/condor/machines $EXECUTABLE $@

sshd_cleanup
rm -f machines


---foo.out.0 file------
running /home/skhanal/condor/bones on 2 LINUX ch_p4 processors
Created /var/opt/condor/execute/dir_5352/PILxVizf5531
Host compute-0-0 is not in contact file /var/opt/condor/execute/dir_5352/contact
p0_5556:  p4_error: Child process exited while making connection to remote process on compute-0-0: 0
p0_5556: (2.003906) net_send: could not write to fd=4, errno = 32


The job does not complete successfully; it ends with the messages above.
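
If I am reading the comments in mp1script correctly, the stock script is supposed to rebuild the machines file on every run from the contact file that Condor writes into the job's scratch directory, rather than pointing mpirun at a fixed file. The part I commented out would look roughly like this (same MPICH path as in my script above):

# Rebuild the machines list from Condor's contact file; the second
# field of each line is the host name that condor_ssh knows how to reach.
sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

# Point mpirun at the freshly generated machines file in the scratch directory
/opt/mpich/gnu/bin/mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@

Could the "Host compute-0-0 is not in contact file" error simply mean that my hard-coded /home/skhanal/condor/machines lists hosts that Condor did not assign to this particular run?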

Help! Help!

________________________________________
From: condor-users-bounces@xxxxxxxxxxx [condor-users-bounces@xxxxxxxxxxx] On Behalf Of Samir Khanal [skhanal@xxxxxxxx]
Sent: Friday, January 30, 2009 1:14 PM
To: Condor-Users Mail List
Subject: [Condor-users] New to Condor, Need to RUN MPI

Hi
I am using and configuring Condor for the first time and am trying to get a sample MPI job to work on my cluster
(it's Rocks 5.1 with Condor).

I was able to get the application to work under PBS/Torque, but I am having a hard time configuring Condor for MPI.

I have changed condor_config.local on compute-0-0 so that it is the MPI machine with the dedicated scheduler.
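
The changes are based on the dedicated-resource example config that ships with Condor; roughly along these lines (the scheduler host name below is only a placeholder, not my actual value):

DedicatedScheduler = "DedicatedScheduler@frontend.hostname"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

# Always run dedicated jobs; never suspend or preempt them
START    = True
SUSPEND  = False
CONTINUE = True
PREEMPT  = False
KILL     = False
WANT_SUSPEND = False
WANT_VACATE  = False
RANK     = Scheduler =?= $(DedicatedScheduler)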

condor_status shows
---------------------

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@compute-0-0. LINUX      X86_64 Owner     Idle     0.000   954  0+00:02:27
slot2@compute-0-0. LINUX      X86_64 Owner     Idle     0.000   954  0+00:02:28
slot3@compute-0-0. LINUX      X86_64 Owner     Idle     0.000   954  0+00:02:29
slot4@compute-0-0. LINUX      X86_64 Owner     Idle     0.000   954  0+00:02:30
slot1@compute-0-1. LINUX      X86_64 Unclaimed Idle     0.000   954  1+12:29:53
slot2@compute-0-1. LINUX      X86_64 Unclaimed Idle     0.000   954  0+00:10:05
slot3@compute-0-1. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:56:07
slot4@compute-0-1. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:56:08
slot1@compute-0-2. LINUX      X86_64 Unclaimed Idle     0.000   954  0+00:05:04
slot2@compute-0-2. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:51:08
slot3@compute-0-2. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:51:09
slot4@compute-0-2. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:51:10
slot1@compute-0-3. LINUX      X86_64 Unclaimed Idle     0.000   954  1+12:30:23
slot2@compute-0-3. LINUX      X86_64 Unclaimed Idle     0.010   954  0+00:05:05
slot3@compute-0-3. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:51:10
slot4@compute-0-3. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:51:11
slot1@compute-0-4. LINUX      X86_64 Unclaimed Idle     0.000   954  1+12:27:23
slot2@compute-0-4. LINUX      X86_64 Unclaimed Idle     0.000   954  0+00:00:00
slot3@compute-0-4. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:46:11
slot4@compute-0-4. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:46:12
slot1@compute-0-5. LINUX      X86_64 Unclaimed Idle     0.010   954  0+00:00:00
slot2@compute-0-5. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:46:06
slot3@compute-0-5. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:46:07
slot4@compute-0-5. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:46:08
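
In case it helps with diagnosis: my understanding is that the slots that actually advertise the DedicatedScheduler attribute from the config change above should show up with something like

condor_status -constraint 'DedicatedScheduler =!= UNDEFINED' -format "%s\n" Name

though I have not verified that yet.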



My Job file
----------------

universe = MPI
executable = /home/skhanal/condor/bones
log = logfile
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 2
should_transfer_files = yes
when_to_transfer_output = on_exit
queue

When submitted, the job goes into the "R" state and then ends with the following messages in the output and log files.

outfile.0 says
------------------------
p0_4788:  p4_error: Child process exited while making connection to remote process on compute-0-0.local: 0
p0_4788: (6.007812) net_send: could not write to fd=4, errno = 32

and outfile.1 says
-------------------------
rm_4794: (-) net_recv failed for fd = 3
rm_4794:  p4_error: net_recv read, errno = : 104


The logfile says
----------------------------

000 (029.000.000) 01/30 12:53:45 Job submitted from host: <129.1.64.81:39320>
...
014 (029.000.000) 01/30 12:53:48 Node 0 executing on host: <10.1.255.254:54415>
...
014 (029.000.001) 01/30 12:53:49 Node 1 executing on host: <10.1.255.254:54415>
...
015 (029.000.000) 01/30 12:53:54 Node 0 terminated.
        (1) Normal termination (return value 1)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        173  -  Run Bytes Sent By Node
        1919489  -  Run Bytes Received By Node
        173  -  Total Bytes Sent By Node
        1919489  -  Total Bytes Received By Node
...
015 (029.000.001) 01/30 12:53:54 Node 1 terminated.
        (1) Normal termination (return value 139)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        89  -  Run Bytes Sent By Node
        1919329  -  Run Bytes Received By Node
        89  -  Total Bytes Sent By Node
        1919329  -  Total Bytes Received By Node
...
005 (029.000.000) 01/30 12:53:54 Job terminated.
        (1) Normal termination (return value 1)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        262  -  Run Bytes Sent By Job
        3838818  -  Run Bytes Received By Job
        262  -  Total Bytes Sent By Job
        3838818  -  Total Bytes Received By Job

--------------------------------------------------------------------
Is there anything else I need to change for MPI to work?

I read something about the shadow daemon (condor_shadow), but could not quite figure out whether anything there needs to be configured to get Condor working for MPI.

Please help


Samir Khanal
Networking Lab
Bowling Green State University
Bowling Green, OH 43402
skhanal@xxxxxxxx