
Re: [Condor-users] New to Condor, Need to RUN MPI



Hi all,

I have some more problems: I could not even get a simple MPI program to work through Condor.

My mp1script is the default one, and I have installed MPICH 1.2.4 in my home directory, which is available on all compute nodes.

------------------ a part of it looks like ----------------
# Set this to the bin directory of MPICH installation
MPDIR=/home/skhanal/mpich-1.2.4/ch_p4/bin
#MPDIR=/opt/mpich/gnu/bin
PATH=$MPDIR:.:$PATH
export PATH

export P4_RSHCOMMAND=$CONDOR_SSH

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE

# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

## run the actual mpijob
mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@

--------------------------------------------
My Submit file

universe = parallel
executable = mp1script
arguments = /home/skhanal/condor/a.out
Output = foo.out.$(NODE)
log = userlog.txt
error = foo.err.$(NODE)
machine_count = 2
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files =/home/skhanal/condor/a.out
queue

-------------Program-----------------
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>

#include <mpi.h>

int main (int argc, char *argv[]) {
        int myrank, size;
        char HOST[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

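        /* Zero the buffer so the host name is always NUL-terminated */
        memset(HOST, 0, sizeof(HOST));
        gethostname(HOST, sizeof(HOST) - 1);

        /* Each rank prints the name of the machine it runs on */
        printf("%s \n", HOST);

        MPI_Finalize();
        return 0;
}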

------------------------------------

What I get:


[skhanal@comet condor]$ cat foo.err.0 

sort: open failed: +0: No such file or directory

[skhanal@comet condor]$ cat foo.out.0 

running /home/skhanal/condor/a.out on 2 LINUX ch_p4 processors
Could not find enough machines for architecture LINUX
-------------------------------------
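
Note on the `sort` error above: the message suggests that this system's `sort` does not accept the obsolete `+0` key syntax and treats `+0` as a file name, so `machines` is probably left empty. A minimal sketch of the same pipeline using the POSIX `-k` option follows. The contact-file contents and the host names `host-a` and `host-b` are made up for illustration; only the "second field is the machine name" layout is taken from the script comment.

```shell
# Hypothetical contact file: field 1 is the node number, field 2 the machine
# name; the remaining fields are invented placeholders.
workdir=$(mktemp -d)
CONDOR_CONTACT_FILE="$workdir/contact"
printf '%s\n' \
    '1 host-b 4444 user /tmp/dir_b' \
    '0 host-a 4444 user /tmp/dir_a' > "$CONDOR_CONTACT_FILE"

# "-k 1" is the POSIX spelling of the old "+0": sort numerically on field 1,
# then keep only the machine name (field 2).
sort -n -k 1 < "$CONDOR_CONTACT_FILE" | awk '{print $2}' > "$workdir/machines"

cat "$workdir/machines"
```

With this sample input the `machines` file lists `host-a` first and `host-b` second, i.e. one host per line in node order.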

Also, when I change

mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@
to 
mpirun -v -np $_CONDOR_NPROCS -machinefile /home/skhanal/condor/machines $EXECUTABLE $@

I get the following message:

running /home/skhanal/condor/a.out on 2 LINUX ch_p4 processors
Could not find enough machines for architecture LINUX
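
Note: since the message is identical with an absolute path, one possibility (an assumption, not confirmed in this thread) is that the machine file simply holds fewer entries than `-np` requests, for example because it is empty. A small check along these lines could be run just before `mpirun`. The variable `machinefile` and the simulated empty file are stand-ins for illustration; in the real script the count would come from `$_CONDOR_NPROCS` as set by Condor.

```shell
# Hypothetical stand-ins for values the real script receives from Condor
_CONDOR_NPROCS=2
machinefile=$(mktemp)
: > "$machinefile"    # simulate an empty machine file

# Count non-blank lines; grep exits non-zero when the count is 0,
# so "|| true" keeps the script going while still capturing "0".
have=$(grep -c . "$machinefile" || true)

if [ "$have" -lt "$_CONDOR_NPROCS" ]; then
    status="short"
    echo "machine file has $have entries, need $_CONDOR_NPROCS" >&2
else
    status="ok"
fi
echo "$status"
```

If the check reports a short file, the problem lies in how `machines` is generated rather than in the `mpirun` invocation itself.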

I have configured one Linux box to be used by the MPI dedicated scheduler.
----------------------------------
condor_status -long compute-0-0
results in:

MyType = "Machine"
TargetType = "Job"
Name = "slot1@xxxxxxxxxxxxxxxxx"
Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxx"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
MyCurrentTime = 1233597152
Machine = "compute-0-0.local"
PublicNetworkIpAddr = "<10.1.255.254:54415>"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxx"
COLLECTOR_HOST_STRING = "comet.cs.bgsu.edu"
CondorVersion = "$CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846 $"
CondorPlatform = "$CondorPlatform: X86_64-LINUX_RHEL5 $"
SlotID = 1
VirtualMachineID = 1
ImageSize = 60528
ExecutableSize = 2
JobUniverse = 11
NiceUser = FALSE
VirtualMemory = 255029
TotalDisk = 3622408
Disk = 905602
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 335780
ConsoleIdle = 335780
Memory = 954
Cpus = 1
StartdIpAddr = "<10.1.255.254:54415>"
Arch = "X86_64"
OpSys = "LINUX"
UidDomain = "cs.bgsu.edu"
FileSystemDomain = "cs.bgsu.edu"
Subnet = "10.1.255"
HasIOProxy = TRUE
CheckpointPlatform = "LINUX X86_64 2.6.x normal 0xffffffffff600000"
TotalVirtualMemory = 1020116
TotalCpus = 4
TotalMemory = 3816
KFlops = 1708126
Mips = 8131
LastBenchmark = 1233261373
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 772
ClockDay = 1
TotalSlots = 4
TotalVirtualMachines = 4
HasFileTransfer = TRUE
HasPerFileEncryption = TRUE
HasReconnect = TRUE
HasMPI = TRUE
HasTDP = TRUE
HasJobDeferral = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
JavaVendor = "Sun Microsystems Inc."
JavaVersion = "1.6.0_07"
JavaMFlops = 722.844727
HasJava = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasPerFileEncryption,HasReconnect,HasMPI,HasTDP,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasJava,HasVM,HasRemoteSyscalls,HasCheckpointing"
HasVM = FALSE
CpuBusyTime = 0
CpuIsBusy = FALSE
TimeToLive = 2147483647
State = "Claimed"
EnteredCurrentState = 1233595058
Activity = "Idle"
EnteredCurrentActivity = 1233597148
TotalTimeOwnerIdle = 322444
TotalTimeClaimedIdle = 10342
TotalTimeClaimedBusy = 2998
Start = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxx"
Requirements = (START) && (IsValidCheckpointPlatform)
IsValidCheckpointPlatform = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
MaxJobRetirementTime = 0
CurrentRank = 1.000000
RemoteUser = "DedicatedScheduler@xxxxxxxxxxxxxxxxx"
RemoteOwner = "DedicatedScheduler@xxxxxxxxxxxxxxxxx"
ClientMachine = "comet.cs.bgsu.edu"
TotalClaimRunTime = 910
MonitorSelfTime = 1233597134
MonitorSelfCPUUsage = 0.150046
MonitorSelfImageSize = 26520.000000
MonitorSelfResidentSetSize = 4556
MonitorSelfAge = 0
MonitorSelfRegisteredSocketCount = 2
DaemonStartTime = 1233261361
UpdateSequenceNumber = 1182
MyAddress = "<10.1.255.254:54415>"
LastHeardFrom = 1233597152
UpdatesTotal = 1487
UpdatesSequenced = 1485
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
----------------------------------

Any help is appreciated.

Thanks,
Samir


________________________________________
From: condor-users-bounces@xxxxxxxxxxx [condor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum [tannenba@xxxxxxxxxxx]
Sent: Friday, January 30, 2009 3:03 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] New to Condor, Need to RUN MPI

Samir Khanal wrote:
> I tried Parallel Universe too, here is what i get
[snip]
> running /home/skhanal/condor/bones on 2 LINUX ch_p4 processors
> Created /var/opt/condor/execute/dir_5352/PILxVizf5531
> Host compute-0-0 is not in contact file /var/opt/condor/execute/dir_5352/contact
> p0_5556:  p4_error: Child process exited while making connection to remote process on compute-0-0: 0
> p0_5556: (2.003906) net_send: could not write to fd=4, errno = 32
>
>
> The job does not complete successfully with above messages.
>
> Help ! Help!
>

Why did you feel compelled to hack the sample mp1script included with
Condor?  Are you trying to use mpich?  If so, just set the path
correctly (to MPDIR) in the sample script where the comment says so; no
other changes should be needed.

Your customizations to the sample mp1script look very suspect to me.

regards,
Todd

