
[Condor-users] MPI job problem



Dear all,

 My MPI job always stays idle in my computing pool.
 The job is cpi (calculate pi), an example program from the "example"
 subdirectory of the MPICH package.
 I have set up the dedicated scheduler and the dedicated resources (with NFS).
 The layout is:
 pragma001.grid.sinica.edu.tw - central manager and dedicated scheduler
 pragma002.grid.sinica.edu.tw - dedicated resource
 pragma004.grid.sinica.edu.tw - dedicated resource
 
Below are the job description file, some command output, the local
configuration files, and the SchedLog.

=================================================================
Job description file :

universe = MPI
executable = cpi
machine_count = 1
log = logofcpi.new
error = errofcpi.$(NODE).new
output = outofcpi.$(NODE).new
queue

=================================================================

[lyho@pragma001 pragma001]$ condor_q


-- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 136.0   lyho            4/29 14:22   0+00:00:00 I  0   0.3  cpi

1 jobs; 1 idle, 0 running, 0 held


[lyho@pragma001 pragma001]$ condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

pragma001.gri LINUX       INTEL  Owner      Idle       0.000   469  0+00:35:04
pragma002.gri LINUX       INTEL  Owner      Idle       1.000   469  0+03:42:04
pragma004.gri LINUX       INTEL  Owner      Idle       1.000  1004  0+03:40:06

                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX        3     3       0         0       0          0

               Total        3     3       0         0       0          0

=================================================================

[lyho@pragma001 pragma001]$ condor_q -analyze


-- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
136.000:  Run analysis summary.  Of 3 machines,
      0 are rejected by your job's requirements
      3 reject your job because of their own requirements
      0 match, but are serving users with a better priority in the pool
      0 match, match, but reject the job for unknown reasons
      0 match, but will not currently preempt their existing job
      0 are available to run your job

WARNING:  Be advised:   Request 136.0 did not match any resource's constraints


WARNING: Analysis is meaningless for MPI universe jobs.

1 jobs; 1 idle, 0 running, 0 held

===================================================================

[lyho@pragma001 pragma001]$ condor_status -l|less

MyType = "Machine"
TargetType = "Job"
Name = "pragma001.grid.sinica.edu.tw"
Machine = "pragma001.grid.sinica.edu.tw"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 945720
Disk = 59017960
CondorLoadAvg = 0.000000
LoadAvg = 0.010000
KeyboardIdle = 175
ConsoleIdle = 30290412
Memory = 469
Cpus = 1
StartdIpAddr = "<140.109.98.21:33669>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "grid.sinica.edu.tw"
FileSystemDomain = "grid.sinica.edu.tw"
Subnet = "140.109.98"
HasIOProxy = TRUE
TotalVirtualMemory = 945720
TotalDisk = 59017960
KFlops = 868714
Mips = 1941
LastBenchmark = 1114753475
TotalLoadAvg = 0.010000
TotalCondorLoadAvg = 0.000000
ClockMin = 894
ClockDay = 5
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Owner"
EnteredCurrentState = 1114755875
Activity = "Idle"
EnteredCurrentActivity = 1114755875
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1114695432
UpdateSequenceNumber = 210
MyAddress = "<140.109.98.21:33669>"
LastHeardFrom = 1114757679
UpdatesTotal = 211
UpdatesSequenced = 210
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"

MyType = "Machine"
TargetType = "Job"
Name = "pragma002.grid.sinica.edu.tw"
Machine = "pragma002.grid.sinica.edu.tw"
Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 953140
Disk = 59017960
CondorLoadAvg = 0.000000
LoadAvg = 0.870000
KeyboardIdle = 1564555
ConsoleIdle = 1564995
Memory = 469
Cpus = 1
StartdIpAddr = "<140.109.98.22:48852>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "grid.sinica.edu.tw"
FileSystemDomain = "grid.sinica.edu.tw"
Subnet = "140.109.98"
HasIOProxy = TRUE
TotalVirtualMemory = 953140
TotalDisk = 59017960
KFlops = 832323
Mips = 2033
LastBenchmark = 1114744656
TotalLoadAvg = 0.870000
TotalCondorLoadAvg = 0.000000
ClockMin = 893
ClockDay = 5
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 4
CpuIsBusy = TRUE
State = "Owner"
EnteredCurrentState = 1114744651
Activity = "Idle"
EnteredCurrentActivity = 1114744651
Start = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1114744650
UpdateSequenceNumber = 44
MyAddress = "<140.109.98.22:48852>"
LastHeardFrom = 1114757675
UpdatesTotal = 107
UpdatesSequenced = 105
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"

MyType = "Machine"
TargetType = "Job"
Name = "pragma004.grid.sinica.edu.tw"
Machine = "pragma004.grid.sinica.edu.tw"
Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 2013048
Disk = 59017960
CondorLoadAvg = 0.000000
LoadAvg = 0.890000
KeyboardIdle = 8300
ConsoleIdle = 30290424
Memory = 1004
Cpus = 1
StartdIpAddr = "<140.109.98.24:35849>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "grid.sinica.edu.tw"
FileSystemDomain = "grid.sinica.edu.tw"
Subnet = "140.109.98"
HasIOProxy = TRUE
TotalVirtualMemory = 2013048
TotalDisk = 59017960
KFlops = 547145
Mips = 1324
LastBenchmark = 1114744778
TotalLoadAvg = 0.890000
TotalCondorLoadAvg = 0.000000
ClockMin = 894
ClockDay = 5
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 4
CpuIsBusy = TRUE
State = "Owner"
EnteredCurrentState = 1114744769
Activity = "Idle"
EnteredCurrentActivity = 1114744769
Start = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1114744768
UpdateSequenceNumber = 44
MyAddress = "<140.109.98.24:35849>"
LastHeardFrom = 1114757675
UpdatesTotal = 106
UpdatesSequenced = 104
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
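
Since the full ads are long, a small script can reduce them to the attributes that seem relevant here. This is only a minimal Python sketch: it assumes the `condor_status -l` output has been saved as text with each attribute on a single line and a blank line between ads. The helper names `parse_ads` and `summarize` are made up for illustration and are not part of Condor.

```python
def parse_ads(text):
    """Split `condor_status -l` style output into one dict per machine ad."""
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line separates one ad from the next.
            if current:
                ads.append(current)
                current = {}
            continue
        # Split on the first " = " only, so expressions containing "=?=" stay intact.
        name, sep, value = line.partition(" = ")
        if sep:
            current[name] = value
    if current:
        ads.append(current)
    return ads


def summarize(ads, fields=("Name", "State", "Activity", "Start", "DedicatedScheduler")):
    """Return one row per ad holding only the attributes of interest."""
    return [{f: ad.get(f, "(undefined)") for f in fields} for ad in ads]


# Shortened sample taken from the ads above.
sample = """\
Name = "pragma001.grid.sinica.edu.tw"
State = "Owner"
Activity = "Idle"

Name = "pragma002.grid.sinica.edu.tw"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
State = "Owner"
Activity = "Idle"
Start = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
"""

for row in summarize(parse_ads(sample)):
    print(row)
```

In the sample, pragma001 has no DedicatedScheduler attribute, so that column shows "(undefined)" for it.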

================================================================

pragma001 local configuration file :

COLLECTOR_NAME          = ASCC-Condor
DAEMON_LIST   = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD
COLLECTOR     = $(SBIN)/condor_collector
NEGOTIATOR    = $(SBIN)/condor_negotiator
UNUSED_CLAIM_TIMEOUT = 0

=================================================================

pragma002 and pragma004 local configuration file :

DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"

##--------------------------------------------------------------------
## 1) Only run dedicated jobs
##--------------------------------------------------------------------
START           = Scheduler =?= $(DedicatedScheduler)
SUSPEND = False
CONTINUE        = True
PREEMPT = False
KILL            = False
WANT_SUSPEND    = False
WANT_VACATE     = False
RANK            = Scheduler =?= $(DedicatedScheduler)
MPI_CONDOR_RSH_PATH = $(SBIN)
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
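
To check my understanding of the START expression above, here is a minimal Python sketch of how `=?=` behaves. It assumes the usual ClassAd rule that `=?=` is TRUE only when both sides have the same type and value, and that a missing attribute evaluates to UNDEFINED. The scheduler name in the sketch is a hypothetical placeholder (the real one is masked above), and `meta_equals` and `evaluate_start` are illustrative names, not Condor functions.

```python
class Undefined:
    """Stand-in for the ClassAd UNDEFINED value."""

    def __repr__(self):
        return "UNDEFINED"


UNDEFINED = Undefined()


def meta_equals(left, right):
    """Model of ClassAd `=?=`: always TRUE or FALSE, never UNDEFINED."""
    if left is UNDEFINED or right is UNDEFINED:
        return left is right
    return type(left) is type(right) and left == right


def evaluate_start(machine_ad, job_ad=None):
    """Evaluate START = (Scheduler =?= DedicatedScheduler) for one machine.

    `Scheduler` comes from the job ad; with no job ad it is UNDEFINED.
    """
    scheduler = (job_ad or {}).get("Scheduler", UNDEFINED)
    return meta_equals(scheduler, machine_ad["DedicatedScheduler"])


# Hypothetical name, used only for this illustration.
NAME = "DedicatedScheduler@submit.example.org"
machine = {"DedicatedScheduler": NAME}

print(evaluate_start(machine))                         # no job ad
print(evaluate_start(machine, {"Scheduler": NAME}))    # matching dedicated job
print(evaluate_start(machine, {"Scheduler": "other"})) # some other scheduler
```

Under these assumptions START is FALSE whenever there is no job ad to supply Scheduler, which may be why the machines report the Owner state while no dedicated job is being matched.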



========================================================================

SchedLog on pragma001 :

4/29 15:12:50 Found idle MPI cluster 136
4/29 15:12:50 Started timer (182) to call handleDedicatedJobs() in 2 secs
4/29 15:12:50 JobsRunning = 0
4/29 15:12:50 JobsIdle = 0
4/29 15:12:50 JobsHeld = 0
4/29 15:12:50 JobsRemoved = 0
4/29 15:12:50 SchedUniverseJobsRunning = 0
4/29 15:12:50 SchedUniverseJobsIdle = 0
4/29 15:12:50 N_Owners = 1
4/29 15:12:50 MaxJobsRunning = 200
4/29 15:12:50 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
4/29 15:12:50 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/29 15:12:50 Sent HEART BEAT ad to central mgr: Number of submittors=1
4/29 15:12:50 Attempting to send update via UDP to collector marlin.bii.a-star.edu.sg <202.6.243.157:9618>
4/29 15:12:50 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/29 15:12:50 Changed attribute: RunningJobs = 0
4/29 15:12:50 Changed attribute: IdleJobs = 0
4/29 15:12:50 Changed attribute: HeldJobs = 0
4/29 15:12:50 Changed attribute: FlockedJobs = 0
4/29 15:12:50 Changed attribute: Name = "lyho@xxxxxxxxxxxxxxxxxx"
4/29 15:12:50 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
4/29 15:12:50 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/29 15:12:50 Sent ad to central manager for lyho@xxxxxxxxxxxxxxxxxx
4/29 15:12:50 ============ Begin clean_shadow_recs =============
4/29 15:12:50 ============ End clean_shadow_recs =============
4/29 15:12:52 Starting DedicatedScheduler::handleDedicatedJobs
4/29 15:12:52 Found 1 idle dedicated job(s)
4/29 15:12:52 DedicatedScheduler: Listing all dedicated jobs -
4/29 15:12:52 Dedicated job: 136.0 lyho
4/29 15:12:52 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/29 15:12:52 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/29 15:12:52 Found 0 potential dedicated resources
4/29 15:12:52 Displaying dedicated resources:
4/29 15:12:52  No resources claimed
4/29 15:12:52 In DedicatedScheduler::publishRequestAd()
4/29 15:12:52 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
4/29 15:12:52 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/29 15:12:52 Finished DedicatedScheduler::handleDedicatedJobs
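
The line that looks most relevant to me in this log is "Found 0 potential dedicated resources". To watch that count across scheduling cycles, something like the following minimal Python sketch could be used. It assumes the log lines have the same "date time message" layout as above, and the function name `potential_resource_counts` is made up for illustration.

```python
import re

# Matches the dedicated scheduler's resource count line as it appears above.
PATTERN = re.compile(r"Found (\d+) potential dedicated resources")


def potential_resource_counts(log_text):
    """Return a (timestamp, count) pair for each matching SchedLog line."""
    results = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            # The first two whitespace-separated fields are the date and time.
            timestamp = " ".join(line.split()[:2])
            results.append((timestamp, int(match.group(1))))
    return results


# Shortened sample taken from the log above.
sample_log = """\
4/29 15:12:52 Found 1 idle dedicated job(s)
4/29 15:12:52 Found 0 potential dedicated resources
4/29 15:12:52  No resources claimed
"""

print(potential_resource_counts(sample_log))  # [('4/29 15:12:52', 0)]
```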


==========================================================================



I noticed that the resources are always in the "Owner" state. Is that the problem?


Can anyone help me with this?
Thanks a lot