
[Condor-users] MPI on Windows XP



Hello

I am new to Condor, so I apologise in advance for my lack of 
experience with it (I have first done my best to solve the problem by 
looking at existing resources on the internet).

I am having a problem running MPI on Windows XP with Condor. I 
have read the relevant parts of the manual and searched the Condor 
user archives but can't find the answer, so I hope someone can help 
me or give me some hints. That is, I cannot get MPI to run under Condor, 
although it runs without Condor on the same PCs.

In case it is relevant I point out that we had some problems getting 
Condor installed successfully because our network does not use MS 
networking, but Novell, and all machine names are in DNS. Thus we 
had to enter the domains for read and write access as IP ranges, not 
the accounting domain name. However, we now have Condor 
successfully installed and I can successfully run Condor sample runs, 
such as printname, and see all the PCs in the pool with condor_status.

I am using Condor 6.6.6 on a pool of Windows XP PCs.

As the Condor manual says it only supports MPICH 1.2.4 or below, I 
uninstalled MPICH 1.2.5 and reinstalled MPICH 1.2.4 from Argonne. I 
can successfully run, for instance, their sample cpi program compiled 
under 1.2.4 from the command line using mpirun, but not using 
Condor.

I have set up the PCs in the pool as dedicated resources by 
uncommenting, editing and using the 3rd option in the sample file in 
my condor_config.local file. I have also, for testing purposes, set 
START=True for all PCs in the pool (currently 5 for testing, but more 
later). Currently (for testing) the main Condor server node, the 
dedicated scheduler and the submit node are all the same node (TR4985).

The submit file I am using is:

######################################
## MPI example submit description file
######################################
universe = MPI
executable = cpi124.exe
log = logfile
#input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 2
queue
## If the dedicated scheduler has resources claimed, but nothing to
## use them for (no MPI jobs in the queue that could use them), how
## long should it hold onto them before releasing them back to the
## regular Condor pool?  Specified in seconds.  Default is 10 minutes.
## If you define this to '0', the schedd will never release claims
## (unless the schedd is shutdown).  If your dedicated resources are
## configured to only run jobs, you should probably set this attribute
## to '0'
#UNUSED_CLAIM_TIMEOUT = 600

However, when I submit this it hangs in the queue and never 
progresses. 

I have looked in the logs and cannot see anything that gives me any 
clue. Please can someone suggest what I should be looking for, or 
what further information you would like me to post to help you assist 
me? Seeing as I am new to Condor and not necessarily familiar with 
how to get all this information, could you also tell me how to get it?

I append below the parts of the logs from the relevant times, the output 
for these 5 machines from condor_status -l, and the output from 
condor_q -analyze, even though its own warning says the analysis is not 
meaningful for MPI universe jobs. (At our university the PCs have names 
that look like TR4985, which DNS resolves to IP addresses. One can ping 
these names by themselves, but if one tries to ping 
TR4985.lincoln.ac.nz it doesn't work.) However, Condor examples such as 
printname work when submitted, just not MPI jobs.
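
Since the machine ads below use fully qualified names (for example 
TR4977.lincoln.ac.nz) while only the short names respond to ping, one 
way to document the mismatch is to try both forms from each PC. The 
following is a minimal Python sketch, not a Condor tool. The function 
name resolve_names and its injectable resolve parameter are made up 
for illustration.

```python
import socket


def resolve_names(short_names, domain, resolve=socket.gethostbyname):
    """Look up each short name and its fully qualified form.

    Returns a dict mapping every name tried to its IP address,
    or to None when the lookup fails. `resolve` is a hypothetical
    hook so the function can be exercised without a network.
    """
    results = {}
    for short in short_names:
        for name in (short, f"{short}.{domain}"):
            try:
                results[name] = resolve(name)
            except OSError:
                # socket.gaierror and socket.herror are OSError subclasses
                results[name] = None
    return results
```

Calling resolve_names(["TR4985", "TR4977"], "lincoln.ac.nz") on each PC 
would show which forms that PC can actually look up. A None entry means 
the name did not resolve there.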

(BTW, I also have an unrelated query. In my collector log I see the 
following several times, and I don't know what the problem is:

9/2 10:25:59 Can't connect to <128.105.143.14:9618>:0, errno = 10060
9/2 10:25:59 Will keep trying for 10 seconds...
9/2 10:26:00 Connect failed for 10 seconds; returning FALSE
9/2 10:26:00 ERROR:
SECMAN:2003:TCP connection to <128.105.143.14:9618> failed
9/2 10:26:00 Can't send UPDATE_COLLECTOR_AD to collector (condor.cs.wisc.edu): Failed to send UDP update command to collector

Other collector communications to PCs in my domain seem to work.)
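
To see how often this happens and whether any other address is 
affected, the collector log can be scanned for failed connects. Below 
is a minimal Python sketch, assuming the log text has already been read 
into a string. failed_connects is a hypothetical helper, not part of 
Condor. For what it is worth, Windows socket error 10060 is 
WSAETIMEDOUT (connection timed out).

```python
import re
from collections import Counter

# Matches lines such as:
#   Can't connect to <128.105.143.14:9618>:0, errno = 10060
# \s* also spans a line break, in case the mail client wrapped the line.
FAILED_CONNECT = re.compile(
    r"Can't connect to <([\d.]+):(\d+)>:\d+,\s*errno\s*=\s*(\d+)"
)


def failed_connects(log_text):
    """Count failed connection attempts per (ip, port, errno)."""
    return Counter(
        (ip, int(port), int(err))
        for ip, port, err in FAILED_CONNECT.findall(log_text)
    )
```

If every entry points at the same address, the failures are limited to 
that one remote collector rather than to machines in the local pool.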

Many thanks for any help.

Elizabeth (Hi Bruce)

From Collector log:

9/2 10:13:30 (Sent 16 ads in response to query)
9/2 10:13:30 Got QUERY_STARTD_PVT_ADS
9/2 10:13:30 (Sent 5 ads in response to query)
9/2 10:13:32 Got QUERY_STARTD_ADS
9/2 10:13:32 (Sent 0 ads in response to query)
9/2 10:13:39 Got QUERY_STARTD_ADS
9/2 10:13:39 (Sent 5 ads in response to query)
9/2 10:13:50 Got QUERY_STARTD_ADS
9/2 10:13:50 (Sent 5 ads in response to query)

From Negotiator log:

9/2 10:13:30 ---------- Started Negotiation Cycle ----------
9/2 10:13:30 Phase 1:  Obtaining ads from collector ...
9/2 10:13:30   Getting all public ads ...
9/2 10:13:30   Sorting 16 ads ...
9/2 10:13:30   Getting startd private ads ...
9/2 10:13:30 Got ads: 16 public and 5 private
9/2 10:13:30 Public ads include 2 submitter, 5 startd
9/2 10:13:30 Phase 2:  Performing accounting ...
9/2 10:13:30 Phase 3:  Sorting submitter ads by priority ...
9/2 10:13:30 Phase 4.1:  Negotiating with schedds ...
9/2 10:13:30 ---------- Finished Negotiation Cycle ----------
9/2 10:13:39 Getting state information from the accountant

From schedd log:

9/2 10:12:58 DaemonCore: Command received via TCP from host <138.75.3.19:3714>
9/2 10:12:58 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
9/2 10:13:30 DaemonCore: Command received via UDP from host <138.75.3.19:3717>
9/2 10:13:30 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
9/2 10:13:30 Sent ad to central manager for postea@xxxxxxxxxxxxx
9/2 10:13:30 Called reschedule_negotiator()

Name = "TR4977.lincoln.ac.nz"
Machine = "TR4977.lincoln.ac.nz"
Rank = (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxx" * 
1000000) + 1
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "TR4985"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxx"
CondorVersion = "$CondorVersion: 6.6.6 Jul 26 2004 $"
CondorPlatform = "$CondorPlatform: INTEL-WINNT40 $"
VirtualMachineID = 1
VirtualMemory = 1045788
Disk = 5355212
CondorLoadAvg = 0.000000
LoadAvg = 0.010000
KeyboardIdle = 600
ConsoleIdle = 600
Memory = 512
Cpus = 1
StartdIpAddr = "<127.0.0.1:1245>"
Arch = "INTEL"
OpSys = "WINNT51"
UidDomain = "lincoln.ac.nz"
FileSystemDomain = "lincoln.ac.nz"
Subnet = "138.75.7"
HasIOProxy = TRUE
TotalVirtualMemory = 1045788
TotalDisk = 5355212
KFlops = 663595
Mips = 2061
LastBenchmark = 1094064583
TotalLoadAvg = 0.010000
TotalCondorLoadAvg = 0.000000
ClockMin = 636
ClockDay = 4
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
StarterAbilityList = 
"HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin"

CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Unclaimed"
EnteredCurrentState = 1094077004
Activity = "Idle"
EnteredCurrentActivity = 1094077004
Start = TRUE
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1094075025
UpdateSequenceNumber = 11
MyAddress = "<127.0.0.1:1245>"
LastHeardFrom = 1094078208
UpdatesTotal = 12
UpdatesSequenced = 11
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"

MyType = "Machine"
TargetType = "Job"
Name = "TR4979.lincoln.ac.nz"
Machine = "TR4979.lincoln.ac.nz"
Rank = (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxx" * 
1000000) + 1
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "TR4985"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxx"
CondorVersion = "$CondorVersion: 6.6.6 Jul 26 2004 $"
CondorPlatform = "$CondorPlatform: INTEL-WINNT40 $"
VirtualMachineID = 1
VirtualMemory = 1039924
Disk = 5703444
CondorLoadAvg = 0.000000
LoadAvg = 0.020000
KeyboardIdle = 2100
ConsoleIdle = 2100
Memory = 512
Cpus = 1
StartdIpAddr = "<138.75.7.155:1030>"
Arch = "INTEL"
OpSys = "WINNT51"
UidDomain = "lincoln.ac.nz"
FileSystemDomain = "lincoln.ac.nz"
Subnet = "138.75.7"
HasIOProxy = TRUE
TotalVirtualMemory = 1039924
TotalDisk = 5703444
KFlops = 648038
Mips = 2062
LastBenchmark = 1094070390
TotalLoadAvg = 0.020000
TotalCondorLoadAvg = 0.000000
ClockMin = 634
ClockDay = 4
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
StarterAbilityList = 
"HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin"

CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Unclaimed"
EnteredCurrentState = 1093998280
Activity = "Idle"
EnteredCurrentActivity = 1094070390
Start = TRUE
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1094075083
UpdateSequenceNumber = 10
MyAddress = "<138.75.7.155:1030>"
LastHeardFrom = 1094078087
UpdatesTotal = 11
UpdatesSequenced = 10
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"

MyType = "Machine"
TargetType = "Job"
Name = "TR4983.lincoln.ac.nz"
Machine = "TR4983.lincoln.ac.nz"
Rank = (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxx" * 
1000000) + 1
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "TR4985"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxx"
CondorVersion = "$CondorVersion: 6.6.6 Jul 26 2004 $"
CondorPlatform = "$CondorPlatform: INTEL-WINNT40 $"
VirtualMachineID = 1
VirtualMemory = 1041160
Disk = 5660856
CondorLoadAvg = 0.000000
LoadAvg = 0.010000
KeyboardIdle = 2400
ConsoleIdle = 2400
Memory = 512
Cpus = 1
StartdIpAddr = "<138.75.3.150:1288>"
Arch = "INTEL"
OpSys = "WINNT51"
UidDomain = "lincoln.ac.nz"
FileSystemDomain = "lincoln.ac.nz"
Subnet = "138.75.3"
HasIOProxy = TRUE
TotalVirtualMemory = 1041160
TotalDisk = 5660856
KFlops = 660611
Mips = 2102
LastBenchmark = 1094064607
TotalLoadAvg = 0.010000
TotalCondorLoadAvg = 0.000000
ClockMin = 636
ClockDay = 4
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
StarterAbilityList = 
"HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin"

CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Unclaimed"
EnteredCurrentState = 1094006986
Activity = "Idle"
EnteredCurrentActivity = 1094064607
Start = TRUE
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1094074891
UpdateSequenceNumber = 11
MyAddress = "<138.75.3.150:1288>"
LastHeardFrom = 1094078196
UpdatesTotal = 12
UpdatesSequenced = 11
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"

MyType = "Machine"
TargetType = "Job"
Name = "TR4984.lincoln.ac.nz"
Machine = "TR4984.lincoln.ac.nz"
Rank = (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxx" * 
1000000) + 1
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "TR4985"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxx"
CondorVersion = "$CondorVersion: 6.6.6 Jul 26 2004 $"
CondorPlatform = "$CondorPlatform: INTEL-WINNT40 $"
VirtualMachineID = 1
VirtualMemory = 1047988
Disk = 5680152
CondorLoadAvg = 0.000000
LoadAvg = 0.020000
KeyboardIdle = 2101
ConsoleIdle = 2101
Memory = 512
Cpus = 1
StartdIpAddr = "<138.75.7.241:1173>"
Arch = "INTEL"
OpSys = "WINNT51"
UidDomain = "lincoln.ac.nz"
FileSystemDomain = "lincoln.ac.nz"
Subnet = "138.75.7"
HasIOProxy = TRUE
TotalVirtualMemory = 1047988
TotalDisk = 5680152
KFlops = 664599
Mips = 2056
LastBenchmark = 1094064454
TotalLoadAvg = 0.020000
TotalCondorLoadAvg = 0.000000
ClockMin = 635
ClockDay = 4
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
StarterAbilityList = 
"HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin"

CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Unclaimed"
EnteredCurrentState = 1094006832
Activity = "Idle"
EnteredCurrentActivity = 1094064454
Start = TRUE
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1094075128
UpdateSequenceNumber = 10
MyAddress = "<138.75.7.241:1173>"
LastHeardFrom = 1094078133
UpdatesTotal = 11
UpdatesSequenced = 10
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"

MyType = "Machine"
TargetType = "Job"
Name = "TR4985.lincoln.ac.nz"
Machine = "TR4985.lincoln.ac.nz"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "TR4985"
CondorVersion = "$CondorVersion: 6.6.6 Jul 26 2004 $"
CondorPlatform = "$CondorPlatform: INTEL-WINNT40 $"
VirtualMachineID = 1
VirtualMemory = 1002220
Disk = 5671580
CondorLoadAvg = 0.000000
LoadAvg = 0.020000
KeyboardIdle = 0
ConsoleIdle = 0
Memory = 512
Cpus = 1
StartdIpAddr = "<138.75.3.19:3487>"
Arch = "INTEL"
OpSys = "WINNT51"
UidDomain = "lincoln.ac.nz"
FileSystemDomain = "lincoln.ac.nz"
Subnet = "138.75.3"
HasIOProxy = TRUE
TotalVirtualMemory = 1002220
TotalDisk = 5671580
KFlops = 500851
Mips = 1538
LastBenchmark = 1094074802
TotalLoadAvg = 0.020000
TotalCondorLoadAvg = 0.000000
ClockMin = 635
ClockDay = 4
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
StarterAbilityList = 
"HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin"

CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Unclaimed"
EnteredCurrentState = 1094075244
Activity = "Idle"
EnteredCurrentActivity = 1094075244
Start = TRUE
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1094074794
UpdateSequenceNumber = 14
MyAddress = "<138.75.3.19:3487>"
LastHeardFrom = 1094078107
UpdatesTotal = 15
UpdatesSequenced = 14
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
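
With five long ads it is easy to miss a machine that differs from the 
others. The following is a minimal Python sketch (not a Condor tool) 
that parses condor_status -l output as pasted above, including lines 
wrapped by the mail client, and flags two things that can be checked 
mechanically: a loopback address in StartdIpAddr or MyAddress, and a 
missing DedicatedScheduler attribute. The function names parse_ads and 
check_ad are made up for illustration. Whether either condition 
explains the idle MPI job is an open question, not something the script 
can decide.

```python
def parse_ads(text):
    """Split `condor_status -l` output into a list of {attribute: value} dicts."""
    ads, current, last_key = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key.isidentifier():
            # A new ad starts at MyType, or whenever an attribute repeats.
            if current and (key == "MyType" or key in current):
                ads.append(current)
                current = {}
            current[key] = value.strip()
            last_key = key
        elif last_key is not None:
            # Wrapped line: glue it onto the previous attribute's value.
            current[last_key] = (current[last_key] + " " + line).strip()
    if current:
        ads.append(current)
    return ads


def check_ad(ad):
    """Return a list of human-readable warnings for one machine ad."""
    warnings = []
    for attr in ("StartdIpAddr", "MyAddress"):
        if ad.get(attr, "").strip('"').startswith("<127."):
            warnings.append(f"{attr} is a loopback address: {ad[attr]}")
    if "DedicatedScheduler" not in ad:
        warnings.append("no DedicatedScheduler attribute advertised")
    return warnings
```

Running check_ad over each result of parse_ads and printing the Name 
attribute next to any warnings gives a one-line-per-machine summary of 
the dump.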

(The output below is from a different submission from the one described 
above, which is why the time is different.)
D:\MPI>condor_q -analyze


-- Submitter: TR4985.lincoln.ac.nz : <138.75.3.19:3488> : 
TR4985.lincoln.ac.nz
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
006.000:  Run analysis summary.  Of 5 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match, but are serving users with a better priority in the pool
      1 match, match, but reject the job for unknown reasons
      0 match, but will not currently preempt their existing job
      4 are available to run your job

WARNING: Analysis is meaningless for MPI universe jobs.

1 jobs; 1 idle, 0 running, 0 held