
Re: [Condor-users] MPI job problem



The problem seems to be that all your computers are in the "Owner"
state, i.e. Condor is NOT allowed to start any job on them.
You are evidently using a START expression (in condor_config) that
makes your resources reject Condor jobs when they are under load or
when there is keyboard activity. (The output you sent was produced on
pragma001, so you were working on it, and the two others have a load
average of 1.000.)
To TEST that MPI really works, you might want to disable this by setting
    START = TRUE
(which allows any job to start, regardless of the current activity on
the machine), or
    START = ($(START)) || (Scheduler =?= $(DedicatedScheduler))
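
As a rough illustration, the two options could look like this in the
local configuration file of each execute machine. This is only a sketch:
it assumes DedicatedScheduler is already defined in that file (as in your
pragma002/pragma004 file), and only one of the two START lines should be
active at a time.

```
## Option 1 (testing only): accept any job, regardless of keyboard
## activity or load on the machine.
START = TRUE

## Option 2: keep the current policy, but additionally accept jobs
## that come from the dedicated scheduler. Commented out here so that
## only one START definition is in effect.
# START = ($(START)) || (Scheduler =?= $(DedicatedScheduler))
```

After changing the file, run condor_reconfig on that machine and check
with condor_status whether it leaves the "Owner" state.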
Mark


On Fri, 2005-04-29 at 15:24 +0800, Li-Yung_Ho wrote:
> Dear all
> 
>  My MPI job always stays IDLE in my computing pool.
>  The job is an example from the mpich package's "example"
>  subdirectory: cpi (calculate pi).
>  I have set up the dedicated scheduler and dedicated resources (with NFS).
>  The model is
>  pragma001.grid.sinica.edu.tw - central manager and dedicated scheduler
>  pragma002.grid.sinica.edu.tw - dedicated resource
>  pragma004.grid.sinica.edu.tw - dedicated resource
>  
> The following are some messages, the job description file, the local
> configuration files, and the SchedLog.
> 
> =================================================================
> Job description file :
> 
> universe = MPI
> executable = cpi
> machine_count = 1
> log = logofcpi.new
> error = errofcpi.$(NODE).new
> output = outofcpi.$(NODE).new
> queue
> 
> =================================================================
> 
> [lyho@pragma001 pragma001]$ condor_q
> 
> 
> -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>  136.0   lyho            4/29 14:22   0+00:00:00 I  0   0.3  cpi
> 
> 1 jobs; 1 idle, 0 running, 0 held
> 
> 
> [lyho@pragma001 pragma001]$ condor_status
> 
> Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
> 
> pragma001.gri LINUX       INTEL  Owner      Idle       0.000   469  0+00:35:04
> pragma002.gri LINUX       INTEL  Owner      Idle       1.000   469  0+03:42:04
> pragma004.gri LINUX       INTEL  Owner      Idle       1.000  1004  0+03:40:06
> 
>                      Machines Owner Claimed Unclaimed Matched Preempting
> 
>          INTEL/LINUX        3     3       0         0       0          0
> 
>                Total        3     3       0         0       0          0
> 
> =================================================================
> 
> [lyho@pragma001 pragma001]$ condor_q -analyze
> 
> 
> -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> ---
> 136.000:  Run analysis summary.  Of 3 machines,
>       0 are rejected by your job's requirements
>       3 reject your job because of their own requirements
>       0 match, but are serving users with a better priority in the pool
>       0 match, match, but reject the job for unknown reasons
>       0 match, but will not currently preempt their existing job
>       0 are available to run your job
> 
> WARNING:  Be advised:   Request 136.0 did not match any resource's constraints
> 
> 
> WARNING: Analysis is meaningless for MPI universe jobs.
> 
> 1 jobs; 1 idle, 0 running, 0 held
> 
> ===================================================================
> 
> [lyho@pragma001 pragma001]$ condor_status -l|less
> 
> MyType = "Machine"
> TargetType = "Job"
> Name = "pragma001.grid.sinica.edu.tw"
> Machine = "pragma001.grid.sinica.edu.tw"
> Rank = 0.000000
> CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
> COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
> CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> VirtualMachineID = 1
> VirtualMemory = 945720
> Disk = 59017960
> CondorLoadAvg = 0.000000
> LoadAvg = 0.010000
> KeyboardIdle = 175
> ConsoleIdle = 30290412
> Memory = 469
> Cpus = 1
> StartdIpAddr = "<140.109.98.21:33669>"
> Arch = "INTEL"
> OpSys = "LINUX"
> UidDomain = "grid.sinica.edu.tw"
> FileSystemDomain = "grid.sinica.edu.tw"
> Subnet = "140.109.98"
> HasIOProxy = TRUE
> TotalVirtualMemory = 945720
> TotalDisk = 59017960
> KFlops = 868714
> Mips = 1941
> LastBenchmark = 1114753475
> TotalLoadAvg = 0.010000
> TotalCondorLoadAvg = 0.000000
> ClockMin = 894
> ClockDay = 5
> TotalVirtualMachines = 1
> HasFileTransfer = TRUE
> HasMPI = TRUE
> HasJICLocalConfig = TRUE
> HasJICLocalStdin = TRUE
> HasPVM = TRUE
> HasRemoteSyscalls = TRUE
> HasCheckpointing = TRUE
> StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
> CpuBusyTime = 0
> CpuIsBusy = FALSE
> State = "Owner"
> EnteredCurrentState = 1114755875
> Activity = "Idle"
> EnteredCurrentActivity = 1114755875
> Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
> Requirements = START
> CurrentRank = 0.000000
> DaemonStartTime = 1114695432
> UpdateSequenceNumber = 210
> MyAddress = "<140.109.98.21:33669>"
> LastHeardFrom = 1114757679
> UpdatesTotal = 211
> UpdatesSequenced = 210
> UpdatesLost = 0
> UpdatesHistory = "0x00000000000000000000000000000000"
> 
> MyType = "Machine"
> TargetType = "Job"
> Name = "pragma002.grid.sinica.edu.tw"
> Machine = "pragma002.grid.sinica.edu.tw"
> Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
> COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> VirtualMachineID = 1
> VirtualMemory = 953140
> Disk = 59017960
> CondorLoadAvg = 0.000000
> LoadAvg = 0.870000
> KeyboardIdle = 1564555
> ConsoleIdle = 1564995
> Memory = 469
> Cpus = 1
> StartdIpAddr = "<140.109.98.22:48852>"
> Arch = "INTEL"
> OpSys = "LINUX"
> UidDomain = "grid.sinica.edu.tw"
> FileSystemDomain = "grid.sinica.edu.tw"
> Subnet = "140.109.98"
> HasIOProxy = TRUE
> TotalVirtualMemory = 953140
> TotalDisk = 59017960
> KFlops = 832323
> Mips = 2033
> LastBenchmark = 1114744656
> TotalLoadAvg = 0.870000
> TotalCondorLoadAvg = 0.000000
> ClockMin = 893
> ClockDay = 5
> TotalVirtualMachines = 1
> HasFileTransfer = TRUE
> HasMPI = TRUE
> HasJICLocalConfig = TRUE
> HasJICLocalStdin = TRUE
> HasPVM = TRUE
> HasRemoteSyscalls = TRUE
> HasCheckpointing = TRUE
> StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
> CpuBusyTime = 4
> CpuIsBusy = TRUE
> State = "Owner"
> EnteredCurrentState = 1114744651
> Activity = "Idle"
> EnteredCurrentActivity = 1114744651
> Start = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> Requirements = START
> CurrentRank = 0.000000
> DaemonStartTime = 1114744650
> UpdateSequenceNumber = 44
> MyAddress = "<140.109.98.22:48852>"
> LastHeardFrom = 1114757675
> UpdatesTotal = 107
> UpdatesSequenced = 105
> UpdatesLost = 0
> UpdatesHistory = "0x00000000000000000000000000000000"
> 
> MyType = "Machine"
> TargetType = "Job"
> Name = "pragma004.grid.sinica.edu.tw"
> Machine = "pragma004.grid.sinica.edu.tw"
> Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
> COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> VirtualMachineID = 1
> VirtualMemory = 2013048
> Disk = 59017960
> CondorLoadAvg = 0.000000
> LoadAvg = 0.890000
> KeyboardIdle = 8300
> ConsoleIdle = 30290424
> Memory = 1004
> Cpus = 1
> StartdIpAddr = "<140.109.98.24:35849>"
> Arch = "INTEL"
> OpSys = "LINUX"
> UidDomain = "grid.sinica.edu.tw"
> FileSystemDomain = "grid.sinica.edu.tw"
> Subnet = "140.109.98"
> HasIOProxy = TRUE
> TotalVirtualMemory = 2013048
> TotalDisk = 59017960
> KFlops = 547145
> Mips = 1324
> LastBenchmark = 1114744778
> TotalLoadAvg = 0.890000
> TotalCondorLoadAvg = 0.000000
> ClockMin = 894
> ClockDay = 5
> TotalVirtualMachines = 1
> HasFileTransfer = TRUE
> HasMPI = TRUE
> HasJICLocalConfig = TRUE
> HasJICLocalStdin = TRUE
> HasPVM = TRUE
> HasRemoteSyscalls = TRUE
> HasCheckpointing = TRUE
> StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
> CpuBusyTime = 4
> CpuIsBusy = TRUE
> State = "Owner"
> EnteredCurrentState = 1114744769
> Activity = "Idle"
> EnteredCurrentActivity = 1114744769
> Start = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> Requirements = START
> CurrentRank = 0.000000
> DaemonStartTime = 1114744768
> UpdateSequenceNumber = 44
> MyAddress = "<140.109.98.24:35849>"
> LastHeardFrom = 1114757675
> UpdatesTotal = 106
> UpdatesSequenced = 104
> UpdatesLost = 0
> UpdatesHistory = "0x00000000000000000000000000000000"
> 
> ================================================================
> 
> pragma001 local configuration file :
> 
> COLLECTOR_NAME          = ASCC-Condor
> DAEMON_LIST   = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD
> COLLECTOR     = $(SBIN)/condor_collector
> NEGOTIATOR    = $(SBIN)/condor_negotiator
> UNUSED_CLAIM_TIMEOUT = 0
> 
> =================================================================
> 
> pragma002 and pragma004 local configuration file :
> 
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> 
> ##--------------------------------------------------------------------
> ## 1) Only run dedicated jobs
> ##--------------------------------------------------------------------
> START           = Scheduler =?= $(DedicatedScheduler)
> SUSPEND = False
> CONTINUE        = True
> PREEMPT = False
> KILL            = False
> WANT_SUSPEND    = False
> WANT_VACATE     = False
> RANK            = Scheduler =?= $(DedicatedScheduler)
> MPI_CONDOR_RSH_PATH = $(SBIN)
> STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
> 
> 
> 
> ========================================================================
> 
> SchedLog on pragma001 :
> 
> 4/29 15:12:50 Found idle MPI cluster 136
> 4/29 15:12:50 Started timer (182) to call handleDedicatedJobs() in 2 secs
> 4/29 15:12:50 JobsRunning = 0
> 4/29 15:12:50 JobsIdle = 0
> 4/29 15:12:50 JobsHeld = 0
> 4/29 15:12:50 JobsRemoved = 0
> 4/29 15:12:50 SchedUniverseJobsRunning = 0
> 4/29 15:12:50 SchedUniverseJobsIdle = 0
> 4/29 15:12:50 N_Owners = 1
> 4/29 15:12:50 MaxJobsRunning = 200
> 4/29 15:12:50 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
> 4/29 15:12:50 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 4/29 15:12:50 Sent HEART BEAT ad to central mgr: Number of submittors=1
> 4/29 15:12:50 Attempting to send update via UDP to collector marlin.bii.a-star.edu.sg <202.6.243.157:9618>
> 4/29 15:12:50 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 4/29 15:12:50 Changed attribute: RunningJobs = 0
> 4/29 15:12:50 Changed attribute: IdleJobs = 0
> 4/29 15:12:50 Changed attribute: HeldJobs = 0
> 4/29 15:12:50 Changed attribute: FlockedJobs = 0
> 4/29 15:12:50 Changed attribute: Name = "lyho@xxxxxxxxxxxxxxxxxx"
> 4/29 15:12:50 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
> 4/29 15:12:50 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 4/29 15:12:50 Sent ad to central manager for lyho@xxxxxxxxxxxxxxxxxx
> 4/29 15:12:50 ============ Begin clean_shadow_recs =============
> 4/29 15:12:50 ============ End clean_shadow_recs =============
> 4/29 15:12:52 Starting DedicatedScheduler::handleDedicatedJobs
> 4/29 15:12:52 Found 1 idle dedicated job(s)
> 4/29 15:12:52 DedicatedScheduler: Listing all dedicated jobs -
> 4/29 15:12:52 Dedicated job: 136.0 lyho
> 4/29 15:12:52 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
> 4/29 15:12:52 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 4/29 15:12:52 Found 0 potential dedicated resources
> 4/29 15:12:52 Displaying dedicated resources:
> 4/29 15:12:52  No resources claimed
> 4/29 15:12:52 In DedicatedScheduler::publishRequestAd()
> 4/29 15:12:52 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
> 4/29 15:12:52 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 4/29 15:12:52 Finished DedicatedScheduler::handleDedicatedJobs
> 
> 
> ==========================================================================
> 
> 
> 
> I found that the resources' state is always "Owner". Is that the problem?
> 
> 
> Can anyone give me some help?
> Thanks a lot