
Re: [Condor-users] MPI job problem



Good morning:

The problem is that the dedicated scheduler can't find any resources
dedicated to it.  This is because the name of the dedicated scheduler is
DedicatedScheduler@lyho@pragma001.grid.sinica.edu.tw (as the Scheduler
attribute of your job ad shows), but the startds list just
DedicatedScheduler@xxxxxxxxxxxxxxxxxxx.  Your username is part of the
scheduler name because you are running Condor as your userid (which is
fine).  If you change the DedicatedScheduler attribute in the startds'
configuration to DedicatedScheduler@lyho@pragma001.grid.sinica.edu.tw,
I think it will work.
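
For reference, here is a minimal sketch of what the relevant lines in the
local configuration files of pragma002 and pragma004 might look like after
that change.  It assumes the usual 6.6-series dedicated-scheduling setup
and the attributes visible in your condor_status -l output; the rest of
your policy settings are not shown in this thread, so treat everything
other than the scheduler name itself as an assumption:

```
# Must match the Scheduler attribute of the job ad (see condor_q -l)
DedicatedScheduler = "DedicatedScheduler@lyho@pragma001.grid.sinica.edu.tw"
# Advertise the attribute in the machine ad so the dedicated scheduler
# can find this resource (assumed to be set already, since your machine
# ads do list a DedicatedScheduler attribute)
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
# Prefer jobs from the dedicated scheduler (your machine ads already
# show a Rank of this form)
RANK = Scheduler =?= $(DedicatedScheduler)
```

The startds have to re-read their configuration (for example with
condor_reconfig) before condor_status -l shows the new value.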

Good luck,

-greg


Li-Yung_Ho wrote:
> Dear Greg,
> Of course, and thanks for your help.
> This is the SchedLog of pragma001.grid.sinica.edu.tw;
> there is nothing in the startd log.
> ------------------------------------------------------------------------
> 5/3 09:04:01 -------- Begin starting jobs --------
> 5/3 09:04:01 -------- Done starting jobs --------
> 5/3 09:04:02 JobsRunning = 0
> 5/3 09:04:02 JobsIdle = 0
> 5/3 09:04:02 JobsHeld = 0
> 5/3 09:04:02 JobsRemoved = 0
> 5/3 09:04:02 SchedUniverseJobsRunning = 0
> 5/3 09:04:02 SchedUniverseJobsIdle = 0
> 5/3 09:04:02 N_Owners = 0
> 5/3 09:04:02 MaxJobsRunning = 200
> 5/3 09:04:02 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
> 5/3 09:04:02 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 5/3 09:04:02 Sent HEART BEAT ad to central mgr: Number of submittors=0
> 5/3 09:04:02 Attempting to send update via UDP to collector marlin.bii.a-star.edu.sg <202.6.243.157:9618>
> 5/3 09:04:02 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 5/3 09:04:02 ============ Begin clean_shadow_recs =============
> 5/3 09:04:02 ============ End clean_shadow_recs =============
> 5/3 09:06:28 DaemonCore: Command received via TCP from host <140.109.98.21:44215>
> 5/3 09:06:28 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
> 5/3 09:06:28 condor_read(): Socket closed when trying to read buffer
> 5/3 09:06:28 QMGR Connection closed
> 5/3 09:07:35 DaemonCore: Command received via TCP from host <140.109.98.21:44245>
> 5/3 09:07:35 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
> 5/3 09:07:35 AUTHENTICATE_FS: used file /tmp/qmgr_6LKOTY, status: 1
> 5/3 09:07:35 OwnerCheck retval 1 (success), super_user
> 5/3 09:07:35 OwnerCheck retval 1 (success), super_user
> 5/3 09:07:36 wrote 300788 bytes
> 5/3 09:07:36 done with transfer, errno = 0
> 5/3 09:07:36 condor_read(): Socket closed when trying to read buffer
> 5/3 09:07:36 QMGR Connection closed
> 5/3 09:07:36 DaemonCore: Command received via TCP from host <140.109.98.21:44256>
> 5/3 09:07:36 DaemonCore: received command 464 (ATTEMPT_ACCESS), calling handler (attempt_access_handler)
> 5/3 09:07:36 ATTEMPT_ACCESS: Switching to user uid: 510 gid: 510.
> 5/3 09:07:36 Checking file /home/lyho/test/examples/condor_test/outofcpi.0.new for write permission.
> 5/3 09:07:36 Switching back to old priv state.
> 5/3 09:07:36 DaemonCore: Command received via TCP from host <140.109.98.21:44257>
> 5/3 09:07:36 DaemonCore: received command 464 (ATTEMPT_ACCESS), calling handler (attempt_access_handler)
> 5/3 09:07:36 ATTEMPT_ACCESS: Switching to user uid: 510 gid: 510.
> 5/3 09:07:36 Checking file /home/lyho/test/examples/condor_test/errofcpi.0.new for write permission.
> 5/3 09:07:36 Switching back to old priv state.
> 5/3 09:07:36 Found idle MPI cluster 143
> 5/3 09:07:36 Started timer (1035) to call handleDedicatedJobs() in 2 secs
> 5/3 09:07:36 JobsRunning = 0
> 5/3 09:07:36 JobsIdle = 0
> 5/3 09:07:36 JobsHeld = 0
> 5/3 09:07:36 JobsRemoved = 0
> 5/3 09:07:36 SchedUniverseJobsRunning = 0
> 5/3 09:07:36 SchedUniverseJobsIdle = 0
> 5/3 09:07:36 N_Owners = 1
> 5/3 09:07:36 MaxJobsRunning = 200
> 5/3 09:07:36 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
> 5/3 09:07:36 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 5/3 09:07:36 Sent HEART BEAT ad to central mgr: Number of submittors=1
> 5/3 09:07:36 Attempting to send update via UDP to collector marlin.bii.a-star.edu.sg <202.6.243.157:9618>
> 5/3 09:07:36 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 5/3 09:07:36 Changed attribute: RunningJobs = 0
> 5/3 09:07:36 Changed attribute: IdleJobs = 0
> 5/3 09:07:36 Changed attribute: HeldJobs = 0
> 5/3 09:07:36 Changed attribute: FlockedJobs = 0
> 5/3 09:07:36 Changed attribute: Name = "lyho@xxxxxxxxxxxxxxxxxx"
> 5/3 09:07:36 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
> 5/3 09:07:36 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 5/3 09:07:36 Sent ad to central manager for lyho@xxxxxxxxxxxxxxxxxx
> 5/3 09:07:36 ============ Begin clean_shadow_recs =============
> 5/3 09:07:36 ============ End clean_shadow_recs =============
> 5/3 09:07:36 Called reschedule_negotiator()
> 5/3 09:07:36 Sending RESCHEDULE command to negotiator(s)
> 5/3 09:07:36 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 5/3 09:07:36 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 5/3 09:07:38 Starting DedicatedScheduler::handleDedicatedJobs
> 5/3 09:07:38 Found 1 idle dedicated job(s)
> 5/3 09:07:38 DedicatedScheduler: Listing all dedicated jobs -
> 5/3 09:07:38 Dedicated job: 143.0 lyho
> 5/3 09:07:38 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
> 5/3 09:07:38 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 5/3 09:07:38 Found 0 potential dedicated resources
> 5/3 09:07:38 Displaying dedicated resources:
> 5/3 09:07:38  No resources claimed
> 5/3 09:07:38 In DedicatedScheduler::publishRequestAd()
> 5/3 09:07:38 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
> 5/3 09:07:38 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 5/3 09:07:38 Finished DedicatedScheduler::handleDedicatedJobs
> 5/3 09:07:38 DaemonCore: Command received via TCP from host <140.109.98.21:44271>
> 5/3 09:07:38 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
> 5/3 09:07:38 condor_read(): Socket closed when trying to read buffer
> 5/3 09:07:38 QMGR Connection closed
> 5/3 09:07:39 DaemonCore: Command received via TCP from host <140.109.98.21:44284>
> 5/3 09:07:39 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
> 5/3 09:07:39 condor_read(): Socket closed when trying to read buffer
> 5/3 09:07:39 QMGR Connection closed
> 5/3 09:07:40 DaemonCore: Command received via TCP from host <140.109.98.21:44297>
> 5/3 09:07:40 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
> 5/3 09:07:40 condor_read(): Socket closed when trying to read buffer
> 5/3 09:07:40 QMGR Connection closed
> ---------------------------------------------------------------------------
> job status :
> 
> ---------------------------------------------------------------------------
> [lyho@pragma001 log]$ condor_q
> 
> 
> -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>  143.0   lyho            5/3  09:07   0+00:00:00 I  0   0.3  cpi
> 
> 1 jobs; 1 idle, 0 running, 0 held
> ---------------------------------------------------------------------------
> 
> [lyho@pragma001 log]$ condor_q -l
> 
> 
> -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
> MyType = "Job"
> TargetType = "Machine"
> ClusterId = 143
> QDate = 1115082455
> CompletionDate = 0
> Owner = "lyho"
> RemoteWallClockTime = 0.000000
> LocalUserCpu = 0.000000
> LocalSysCpu = 0.000000
> RemoteUserCpu = 0.000000
> RemoteSysCpu = 0.000000
> ExitStatus = 0
> NumCkpts = 0
> NumRestarts = 0
> NumSystemHolds = 0
> CommittedTime = 0
> TotalSuspensions = 0
> LastSuspensionTime = 0
> CumulativeSuspensionTime = 0
> ExitBySignal = FALSE
> CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> RootDir = "/"
> Iwd = "/home/lyho/test/examples/condor_test"
> JobUniverse = 8
> Cmd = "/home/lyho/test/examples/condor_test/cpi"
> CurrentHosts = 0
> WantRemoteSyscalls = FALSE
> WantCheckpoint = FALSE
> MinHosts = 2
> MaxHosts = 2
> JobStatus = 1
> EnteredCurrentStatus = 1115082456
> JobPrio = 0
> User = "lyho@xxxxxxxxxxxxxxxxxx"
> NiceUser = FALSE
> Env = ""
> JobNotification = 2
> UserLog = "/home/lyho/test/examples/condor_test/logofcpi.new"
> CoreSize = 0
> KillSig = "SIGTERM"
> Rank = 0.000000
> In = "/dev/null"
> TransferIn = FALSE
> Out = "outofcpi.#MpInOdE#.new"
> Err = "errofcpi.#MpInOdE#.new"
> BufferSize = 524288
> BufferBlockSize = 32768
> ShouldTransferFiles = "NO"
> TransferFiles = "NEVER"
> ImageSize = 294
> ExecutableSize = 294
> DiskUsage = 294
> Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasMPI) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
> FileSystemDomain = "grid.sinica.edu.tw"
> PeriodicHold = FALSE
> PeriodicRelease = FALSE
> PeriodicRemove = FALSE
> OnExitHold = FALSE
> OnExitRemove = TRUE
> LeaveJobInQueue = FALSE
> Args = ""
> ProcId = 0
> Scheduler = "DedicatedScheduler@lyho@pragma001.grid.sinica.edu.tw"
> ServerTime = 1115083476
> -------------------------------------------------------------------------
> machine status:
> -------------------------------------------------------------------------
> [lyho@pragma001 log]$ condor_status
> 
> Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
> 
> pragma001.gri LINUX       INTEL  Owner      Idle       0.000   469  0+00:15:04
> pragma002.gri LINUX       INTEL  Unclaimed  Idle       0.890   469  0+03:36:01
> pragma004.gri LINUX       INTEL  Unclaimed  Idle       1.000  1004  0+03:34:48
> 
>                      Machines Owner Claimed Unclaimed Matched Preempting
> 
>          INTEL/LINUX        3     1       0         2       0          0
> 
>                Total        3     1       0         2       0          0
> 
> 
> -------------------------------------------------------------------------
> 
> [lyho@pragma001 log]$ condor_status -l
> MyType = "Machine"
> TargetType = "Job"
> Name = "pragma001.grid.sinica.edu.tw"
> Machine = "pragma001.grid.sinica.edu.tw"
> Rank = 0.000000
> CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
> COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
> CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> VirtualMachineID = 1
> VirtualMemory = 940764
> Disk = 58974996
> CondorLoadAvg = 0.000000
> LoadAvg = 0.010000
> KeyboardIdle = 154
> ConsoleIdle = 30616471
> Memory = 469
> Cpus = 1
> StartdIpAddr = "<140.109.98.21:33669>"
> Arch = "INTEL"
> OpSys = "LINUX"
> UidDomain = "grid.sinica.edu.tw"
> FileSystemDomain = "grid.sinica.edu.tw"
> Subnet = "140.109.98"
> HasIOProxy = TRUE
> TotalVirtualMemory = 940764
> TotalDisk = 58974996
> KFlops = 875905
> Mips = 1905
> LastBenchmark = 1115071434
> TotalLoadAvg = 0.010000
> TotalCondorLoadAvg = 0.000000
> ClockMin = 568
> ClockDay = 2
> TotalVirtualMachines = 1
> HasFileTransfer = TRUE
> HasMPI = TRUE
> HasJICLocalConfig = TRUE
> HasJICLocalStdin = TRUE
> HasPVM = TRUE
> HasRemoteSyscalls = TRUE
> HasCheckpointing = TRUE
> StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
> CpuBusyTime = 0
> CpuIsBusy = FALSE
> State = "Owner"
> EnteredCurrentState = 1115082534
> Activity = "Idle"
> EnteredCurrentActivity = 1115082534
> Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
> Requirements = START
> CurrentRank = 0.000000
> DaemonStartTime = 1114695432
> UpdateSequenceNumber = 1297
> MyAddress = "<140.109.98.21:33669>"
> LastHeardFrom = 1115083738
> UpdatesTotal = 1298
> UpdatesSequenced = 1297
> UpdatesLost = 0
> UpdatesHistory = "0x00000000000000000000000000000000"
> 
> MyType = "Machine"
> TargetType = "Job"
> Name = "pragma002.grid.sinica.edu.tw"
> Machine = "pragma002.grid.sinica.edu.tw"
> Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
> COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> VirtualMachineID = 1
> VirtualMemory = 945368
> Disk = 58974996
> CondorLoadAvg = 0.000000
> LoadAvg = 0.990000
> KeyboardIdle = 44595
> ConsoleIdle = 1891066
> Memory = 469
> Cpus = 1
> StartdIpAddr = "<140.109.98.22:48852>"
> Arch = "INTEL"
> OpSys = "LINUX"
> UidDomain = "grid.sinica.edu.tw"
> FileSystemDomain = "grid.sinica.edu.tw"
> Subnet = "140.109.98"
> HasIOProxy = TRUE
> TotalVirtualMemory = 945368
> TotalDisk = 58974996
> KFlops = 801365
> Mips = 1880
> LastBenchmark = 1115070484
> TotalLoadAvg = 0.990000
> TotalCondorLoadAvg = 0.000000
> ClockMin = 568
> ClockDay = 2
> TotalVirtualMachines = 1
> HasFileTransfer = TRUE
> HasMPI = TRUE
> HasJICLocalConfig = TRUE
> HasJICLocalStdin = TRUE
> HasPVM = TRUE
> HasRemoteSyscalls = TRUE
> HasCheckpointing = TRUE
> StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
> CpuBusyTime = 304
> CpuIsBusy = TRUE
> State = "Unclaimed"
> EnteredCurrentState = 1115011084
> Activity = "Idle"
> EnteredCurrentActivity = 1115070484
> Start = TRUE
> Requirements = START
> CurrentRank = 0.000000
> DaemonStartTime = 1114744650
> UpdateSequenceNumber = 1132
> MyAddress = "<140.109.98.22:48852>"
> LastHeardFrom = 1115083745
> UpdatesTotal = 1195
> UpdatesSequenced = 1193
> UpdatesLost = 0
> UpdatesHistory = "0x00000000000000000000000000000000"
> 
> MyType = "Machine"
> TargetType = "Job"
> Name = "pragma004.grid.sinica.edu.tw"
> Machine = "pragma004.grid.sinica.edu.tw"
> Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
> COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> VirtualMachineID = 1
> VirtualMemory = 2009408
> Disk = 58974912
> CondorLoadAvg = 0.000000
> LoadAvg = 1.000000
> KeyboardIdle = 37227
> ConsoleIdle = 30616285
> Memory = 1004
> Cpus = 1
> StartdIpAddr = "<140.109.98.24:35849>"
> Arch = "INTEL"
> OpSys = "LINUX"
> UidDomain = "grid.sinica.edu.tw"
> FileSystemDomain = "grid.sinica.edu.tw"
> Subnet = "140.109.98"
> HasIOProxy = TRUE
> TotalVirtualMemory = 2009408
> TotalDisk = 58974912
> KFlops = 575797
> Mips = 1281
> LastBenchmark = 1115070647
> TotalLoadAvg = 1.000000
> TotalCondorLoadAvg = 0.000000
> ClockMin = 565
> ClockDay = 2
> TotalVirtualMachines = 1
> HasFileTransfer = TRUE
> HasMPI = TRUE
> HasJICLocalConfig = TRUE
> HasJICLocalStdin = TRUE
> HasPVM = TRUE
> HasRemoteSyscalls = TRUE
> HasCheckpointing = TRUE
> StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
> CpuBusyTime = 9305
> CpuIsBusy = TRUE
> State = "Unclaimed"
> EnteredCurrentState = 1114767739
> Activity = "Idle"
> EnteredCurrentActivity = 1115070647
> Start = TRUE
> Requirements = START
> CurrentRank = 0.000000
> DaemonStartTime = 1114744768
> UpdateSequenceNumber = 1130
> MyAddress = "<140.109.98.24:35849>"
> LastHeardFrom = 1115083535
> UpdatesTotal = 1192
> UpdatesSequenced = 1190
> UpdatesLost = 0
> UpdatesHistory = "0x00000000000000000000000000000000"
> --------------------------------------------------------------------------
> condor_q -analyze :
> 
> --------------------------------------------------------------------------
> [lyho@pragma001 log]$ condor_q -analyze
> 
> 
> -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> ---
> 143.000:  Run analysis summary.  Of 3 machines,
>       0 are rejected by your job's requirements
>       1 reject your job because of their own requirements
>       0 match, but are serving users with a better priority in the pool
>       2 match, match, but reject the job for unknown reasons
>       0 match, but will not currently preempt their existing job
>       0 are available to run your job
> 
> WARNING: Analysis is meaningless for MPI universe jobs.
> 
> 1 jobs; 1 idle, 0 running, 0 held
> 
> --------------------------------------------------------------------------
> 
> I really appreciate your help!
> 
> Leon
> 
> 
> On Mon, 02 May 2005 07:59:06 -0500, Greg Thain wrote
> 
>>Can you send us the log from the schedd and the startd?
>>
>>Thanks,
>>
>>-greg
>>
>>Li-Yung_Ho wrote:
>>
>>>Hi Mark and Greg
>>>Thanks for your responses
>>>
>>>I changed the START attribute from Scheduler =?= $(DedicatedScheduler) to True
>>>in the pragma002 and pragma004 local configuration files, and indeed the status
>>>became "Unclaimed"
>>>------------------------------------------------------------------------
>>>[lyho@pragma001 lyho]$ condor_status
>>>
>>>Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
>>>
>>>pragma001.gri LINUX       INTEL  Owner      Idle       0.010   469  0+00:10:04
>>>pragma002.gri LINUX       INTEL  Unclaimed  Idle       0.290   469  0+03:21:02
>>>pragma004.gri LINUX       INTEL  Unclaimed  Idle       0.150  1004  0+03:19:48
>>>
>>>                     Machines Owner Claimed Unclaimed Matched Preempting
>>>
>>>         INTEL/LINUX        3     1       0         2       0          0
>>>
>>>               Total        3     1       0         2       0          0
>>>
>>>-------------------------------------------------------------------------
>>>
>>>but the job is still idle
>>>
>>>-------------------------------------------------------------------------
>>>[lyho@pragma001 lyho]$ condor_q
>>>
>>>
>>>-- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
>>> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>>> 140.0   lyho            4/29 17:44   0+00:00:00 I  0   0.3  cpi
>>>
>>>1 jobs; 1 idle, 0 running, 0 held
>>>
>>>------------------------------------------------------------------------
>>>
>>>I then tested a vanilla universe job
>>>with this job description file:
>>>============================
>>>universe = vanilla
>>>executable = cpi
>>>log = logofcpi.new
>>>error = errofcpi.$(NODE).new
>>>output = outofcpi.$(NODE).new
>>>queue
>>>=============================
>>>
>>>and it ran successfully
>>>
>>>------------------------------------------------------------------------
>>>[lyho@pragma001 condor_test]$ condor_q
>>>
>>>
>>>-- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
>>> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>>> 142.0   lyho            5/2  13:18   0+00:00:00 R  0   0.3  cpi
>>>
>>>1 jobs; 0 idle, 1 running, 0 held
>>>---------------------------------------------------------------------
>>>
>>>The log, error and output files:
>>>
>>>---------------------------------------------------------------------
>>>[lyho@pragma001 condor_test]$ more *.new
>>>::::::::::::::
>>>errofcpi..new
>>>::::::::::::::
>>>Process 0 on pragma002.grid.sinica.edu.tw
>>>::::::::::::::
>>>logofcpi.new
>>>::::::::::::::
>>>000 (142.000.000) 05/02 13:18:57 Job submitted from host: <140.109.98.21:33670>
>>>...
>>>001 (142.000.000) 05/02 13:19:00 Job executing on host: <140.109.98.22:48852>
>>>...
>>>005 (142.000.000) 05/02 13:19:00 Job terminated.
>>>        (1) Normal termination (return value 0)
>>>                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>>>                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>>>                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
>>>                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>>>        0  -  Run Bytes Sent By Job
>>>        0  -  Run Bytes Received By Job
>>>        0  -  Total Bytes Sent By Job
>>>        0  -  Total Bytes Received By Job
>>>...
>>>::::::::::::::
>>>outofcpi..new
>>>::::::::::::::
>>>pi is approximately 3.1416009869231254, Error is 0.0000083333333323
>>>wall clock time = 0.000055
>>>
>>>--------------------------------------------------------------------
>>>
>>>So something is wrong with the MPI job.
>>>
>>>Can anyone help me?
>>>
>>>
>>>
>>>On Fri, 29 Apr 2005 12:11:53 +0300, Mark Silberstein wrote
>>>
>>>
>>>>The problem seems to be that all your computers are in 
>>>>the "Owner" state, i.e. Condor is NOT allowed to start any job on them.
>>>>Evidently you're using a START expression (in the condor_config)
>>>>that makes your resources reject Condor jobs when they are under 
>>>>load or when there's some keyboard activity (the output you sent was
>>>>produced on pragma001, so you were working on it, and the two others 
>>>>have a load average of 1.000). To TEST that MPI really works you 
>>>>might want to disable this by putting START=TRUE (which would 
>>>>allow any job to be started, regardless of the current computer 
>>>>activity), or START=($(START))||(Scheduler =?= $(DedicatedScheduler)).
>>>>Mark
>>>>
>>>
>>>
>>>_______________________________________________
>>>Condor-users mailing list
>>>Condor-users@xxxxxxxxxxx
>>>https://lists.cs.wisc.edu/mailman/listinfo/condor-users