
[Condor-users] New to Condor, Need to RUN MPI



Hi,
I am configuring Condor for the first time and am trying to get a sample application to work on my cluster
(it's Rocks 5.1 with Condor).

I was able to get the application to work under PBS/Torque, but I am having a hard time configuring Condor for MPI.

I have changed condor_config.local on compute-0-0 to make it the MPI machine with the dedicated scheduler.

condor_status shows
---------------------

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@compute-0-0. LINUX      X86_64 Owner     Idle     0.000   954  0+00:02:27
slot2@compute-0-0. LINUX      X86_64 Owner     Idle     0.000   954  0+00:02:28
slot3@compute-0-0. LINUX      X86_64 Owner     Idle     0.000   954  0+00:02:29
slot4@compute-0-0. LINUX      X86_64 Owner     Idle     0.000   954  0+00:02:30
slot1@compute-0-1. LINUX      X86_64 Unclaimed Idle     0.000   954  1+12:29:53
slot2@compute-0-1. LINUX      X86_64 Unclaimed Idle     0.000   954  0+00:10:05
slot3@compute-0-1. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:56:07
slot4@compute-0-1. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:56:08
slot1@compute-0-2. LINUX      X86_64 Unclaimed Idle     0.000   954  0+00:05:04
slot2@compute-0-2. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:51:08
slot3@compute-0-2. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:51:09
slot4@compute-0-2. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:51:10
slot1@compute-0-3. LINUX      X86_64 Unclaimed Idle     0.000   954  1+12:30:23
slot2@compute-0-3. LINUX      X86_64 Unclaimed Idle     0.010   954  0+00:05:05
slot3@compute-0-3. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:51:10
slot4@compute-0-3. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:51:11
slot1@compute-0-4. LINUX      X86_64 Unclaimed Idle     0.000   954  1+12:27:23
slot2@compute-0-4. LINUX      X86_64 Unclaimed Idle     0.000   954  0+00:00:00
slot3@compute-0-4. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:46:11
slot4@compute-0-4. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:46:12
slot1@compute-0-5. LINUX      X86_64 Unclaimed Idle     0.010   954  0+00:00:00
slot2@compute-0-5. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:46:06
slot3@compute-0-5. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:46:07
slot4@compute-0-5. LINUX      X86_64 Unclaimed Idle     0.000   954  1+23:46:08



My Job file
----------------

universe = MPI
executable = /home/skhanal/condor/bones
log = logfile
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 2
should_transfer_files = yes
when_to_transfer_output = on_exit
queue

The job, when submitted, goes into the "R" state and ends with the following messages in the output and log files.

outfile.0 says
------------------------
p0_4788:  p4_error: Child process exited while making connection to remote process on compute-0-0.local: 0
p0_4788: (6.007812) net_send: could not write to fd=4, errno = 32

and outfile.1 says
-------------------------
rm_4794: (-) net_recv failed for fd = 3
rm_4794:  p4_error: net_recv read, errno = : 104


the logfile says
----------------------------

000 (029.000.000) 01/30 12:53:45 Job submitted from host: <129.1.64.81:39320>
...
014 (029.000.000) 01/30 12:53:48 Node 0 executing on host: <10.1.255.254:54415>
...
014 (029.000.001) 01/30 12:53:49 Node 1 executing on host: <10.1.255.254:54415>
...
015 (029.000.000) 01/30 12:53:54 Node 0 terminated.
        (1) Normal termination (return value 1)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        173  -  Run Bytes Sent By Node
        1919489  -  Run Bytes Received By Node
        173  -  Total Bytes Sent By Node
        1919489  -  Total Bytes Received By Node
...
015 (029.000.001) 01/30 12:53:54 Node 1 terminated.
        (1) Normal termination (return value 139)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        89  -  Run Bytes Sent By Node
        1919329  -  Run Bytes Received By Node
        89  -  Total Bytes Sent By Node
        1919329  -  Total Bytes Received By Node
...
005 (029.000.000) 01/30 12:53:54 Job terminated.
        (1) Normal termination (return value 1)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        262  -  Run Bytes Sent By Job
        3838818  -  Run Bytes Received By Job
        262  -  Total Bytes Sent By Job
        3838818  -  Total Bytes Received By Job
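
For reference, the numeric codes in the output and log above can be decoded with a short script. This is only a sketch: it assumes Linux errno numbering (where 32 is EPIPE and 104 is ECONNRESET), and it assumes the return value 139 follows the shell convention of 128 plus a signal number, which would make it signal 11 (SIGSEGV). Whether Condor's reported return value actually follows that convention here is an assumption. The helper names are hypothetical.

```python
import errno
import signal


def describe_errno(code):
    """Return the symbolic errno name on this platform, or None if unknown."""
    return errno.errorcode.get(code)


def describe_exit_status(status):
    """Interpret a status above 128 as 128 + signal number (shell convention).

    Returns the signal name, or None if the status does not fit that pattern.
    This mapping is an assumption about how the return value was produced.
    """
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            return None
    return None


if __name__ == "__main__":
    # errno values taken from the p4 error messages in the output files
    for code in (32, 104):
        print(code, describe_errno(code))
    # return value reported for node 1 in the log file
    print(139, describe_exit_status(139))
```

If that reading is right, node 1 crashed with a segmentation fault and the broken pipe and connection reset on the other side are consequences of it.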

--------------------------------------------------------------------
Is there anything else I need to change for MPI to work?

I read something about the shadow, but could not quite work out whether it is needed to get Condor working with MPI.

Please help.


Samir Khanal
Networking Lab
Bowling Green State University
Bowling Green, OH 43402
skhanal@xxxxxxxx