
Re: [Condor-users] MPI dedicated job don't start



Hi, finally some help... Unfortunately, that doesn't seem to be the problem.

It looks like my copy/paste failed. Here is what it should look like, and as you can see, the START expression is not that complex. Maybe someone has another idea, or a direction to look in?

Thanks in advance
Nicolas

$ condor_status -l vm1@calisto
MyType = "Machine"
TargetType = "Job"
Name = "vm1@xxxxxxxxxxxxxxxxx"
Machine = "calisto.my.domain"
Rank = (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxx" * 10000000) + 0
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "io.my.domain"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxx"
CondorVersion = "$CondorVersion: 6.8.3 Jan  4 2007 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL3 $"
VirtualMachineID = 1
VirtualMemory = 1048568
Disk = 4847562
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 0
ConsoleIdle = 0
Memory = 1010
Cpus = 1
StartdIpAddr = "<172.XX.XX.XX:53487>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "my.domain"
FileSystemDomain = "my.domain"
Subnet = "172.X.X"
HasIOProxy = TRUE
CheckpointPlatform = "LINUX INTEL 2.4.x normal"
TotalVirtualMemory = 2097136
TotalDisk = 9695124
TotalCpus = 2
TotalMemory = 2020
KFlops = 676519
Mips = 1963
LastBenchmark = 1171976958
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 849
ClockDay = 2
TotalVirtualMachines = 2
HasFileTransfer = TRUE
HasPerFileEncryption = TRUE
HasReconnect = TRUE
HasMPI = TRUE
HasTDP = TRUE
HasJobDeferral = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasPerFileEncryption,HasReconnect,HasMPI,HasTDP,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
TimeToLive = 2147483647
State = "Unclaimed"
EnteredCurrentState = 1171976958
Activity = "Idle"
EnteredCurrentActivity = 1171976958
Start = (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner"))
Requirements = (START) && (IsValidCheckpointPlatform)
IsValidCheckpointPlatform = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
MaxJobRetirementTime = 0
CurrentRank = 0.000000
MonitorSelfTime = 1171976958
MonitorSelfCPUUsage = 0.000000
MonitorSelfImageSize = 7712.000000
MonitorSelfResidentSetSize = 3384
MonitorSelfAge = 0
MonitorSelfRegisteredSocketCount = 2
DaemonStartTime = 1171976951
UpdateSequenceNumber = 0
MyAddress = "<172.XX.XX.XX:53487>"
LastHeardFrom = 1171976962
UpdatesTotal = 6878
UpdatesSequenced = 6877
UpdatesLost = 1
UpdatesHistory = "0x00000000000000000000000000000000"
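
As a rough cross-check of the ad above, the Start, IsValidCheckpointPlatform and Requirements expressions can be evaluated by hand against a parallel-universe job. The sketch below does this in Python rather than with Condor's own ClassAd evaluator, so it only approximates ClassAd semantics (UNDEFINED is modelled as a missing key, and the job ad is assumed to define JobUniverse). The job ad is hypothetical, and JobUniverse = 11 for the parallel universe is an assumption, not something shown in the output above.

```python
# Values copied from the vm1@calisto machine ad above.
machine = {
    "LoadAvg": 0.0,
    "CondorLoadAvg": 0.0,
    "State": "Unclaimed",
    "CheckpointPlatform": "LINUX INTEL 2.4.x normal",
}

# Hypothetical job ad. Assumption: JobUniverse 11 is the parallel universe.
job = {"JobUniverse": 11}


def start(my):
    # Start = ((LoadAvg - CondorLoadAvg) <= 0.3)
    #         || (State != "Unclaimed" && State != "Owner")
    return (my["LoadAvg"] - my["CondorLoadAvg"]) <= 0.3 or (
        my["State"] != "Unclaimed" and my["State"] != "Owner"
    )


def is_valid_checkpoint_platform(my, target):
    # Simplified: assumes TARGET.JobUniverse is defined in the job ad.
    if target["JobUniverse"] != 1:
        return True  # (TARGET.JobUniverse == 1) == FALSE
    platform = my.get("CheckpointPlatform")  # None stands in for UNDEFINED
    return platform is not None and (
        target.get("LastCheckpointPlatform") == platform
        or target.get("NumCkpts") == 0
    )


def requirements(my, target):
    # Requirements = (START) && (IsValidCheckpointPlatform)
    return start(my) and is_valid_checkpoint_platform(my, target)


print("Start:", start(machine))
print("Requirements:", requirements(machine, job))
```

Under these assumptions both expressions come out true, and the CheckpointPlatform clause is only consulted when the job's JobUniverse is 1. If the real evaluation differs, the difference would have to come from attributes not modelled here.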


----------------
On Mon, 19 Feb 2007 11:25:48 +0100
"François Bachmann" <f.bachmann@xxxxxxxxx> wrote:

> Hello Nicolas
> 
> I'd start here:
> 
> >The following attributes are missing from the job ClassAd:
> >CheckpointPlatform
> 
> Seeing that your START expression is rather complex, try cutting it down to
> something not containing CheckpointPlatform and see if it works. Then work
> your way back from there...
> 
> HTH
> François
> 
> On 2/5/07, Nicolas GUIOT <nicolas.guiot@xxxxxxx> wrote:
> >
> > Hi
> >
> > I tried to submit the simplest MPI example from the manual, but didn't
> > succeed in running it.
> > Could you please help me find out why my MPI jobs are rejected?
> >
> > Here is the result of condor_q -better-analyze, then condor_status -l for
> > one of the boxes that should run the job, and the submission file.
> >
> > Thanks in advance
> > Nicolas
> >
> > ############################
> > ##   submit description file for a parallel program
> > #############################################
> > universe = parallel
> > executable = /bin/sleep
> > arguments = 30
> > machine_count = 4
> > queue
> >
> >
> > ######################################
> > $ condor_q -better-analyze 16
> > -- Submitter: seurat.my.domain : <172.XX.XX.XX:32857> :
> > seurat.my.domain 
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > AddConstraint: Condition value not literal
> > ---
> > 016.000:  Run analysis summary.  Of 51 machines,
> >      24 are rejected by your job's requirements
> >       9 reject your job because of their own requirements
> >       0 match but are serving users with a better priority in the pool
> >      18 match but reject the job for unknown reasons
> >       0 match but will not currently preempt their existing job
> >       0 are available to run your job
> >
> > The Requirements expression for your job is:
> >
> > ( target.Arch == "INTEL" ) && ( target.OpSys == "LINUX" ) &&
> > ( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize )
> > &&
> > ( TARGET.FileSystemDomain == MY.FileSystemDomain )
> >
> >     Condition                         Machines Matched    Suggestion
> >     ---------                         ----------------    ----------
> > 1   ( target.Arch == "INTEL" )        27
> > 2   ( target.OpSys == "LINUX" )       51
> > 3   ( target.Disk >= 10000 )          51
> > 4   ( ( 1024 * target.Memory ) >= 10000 )51
> > 5   ( TARGET.FileSystemDomain == "my.domain" )51
> >
> > The following attributes are missing from the job ClassAd:
> >
> > CheckpointPlatform
> > #############################################
> >
> >
> > $ condor_status -l calisto
> > MyType = "Machine"
> > TargetType = "Job"
> > Name = "vm1@xxxxxxxxxxxxxxxxx"
> > Machine = "calisto.my.domain"
> > Rank = (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxx" *
> > 10000000) + 0 CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
> > COLLECTOR_HOST_STRING = "io.my.domain"
> > DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxx"
> > CondorVersion = "$CondorVersion: 6.8.3 Jan  4 2007 $"
> > CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL3 $"
> > VirtualMachineID = 1
> > VirtualMemory = 1048568
> > Disk = 3006996
> > CondorLoadAvg = 0.000000
> > LoadAvg = 0.000000
> > KeyboardIdle = 236778
> > ConsoleIdle = 236778
> > Memory = 1010
> > Cpus = 1
> > StartdIpAddr = "<172.XX.XX.XX:32814>"
> > Arch = "INTEL"
> > OpSys = "LINUX"
> > UidDomain = "my.domain"
> > FileSystemDomain = "my.domain"
> > Subnet = "172.XX.XX"
> > HasIOProxy = TRUE
> > CheckpointPlatform = "LINUX INTEL 2.4.x normal"
> > TotalVirtualMemory = 2097136
> > TotalDisk = 6013992
> > TotalCpus = 2
> > TotalMemory = 2020
> > KFlops = 672176
> > Mips = 1887
> > LastBenchmark = 1170660059
> > TotalLoadAvg = 0.010000
> > TotalCondorLoadAvg = 0.000000
> > ClockMin = 680
> > ClockDay = 1
> > TotalVirtualMachines = 2
> > HasFileTransfer = TRUE
> > HasPerFileEncryption = TRUE
> > HasReconnect = TRUE
> > HasMPI = TRUE
> > HasTDP = TRUE
> > HasJobDeferral = TRUE
> > HasJICLocalConfig = TRUE
> > HasJICLocalStdin = TRUE
> > HasPVM = TRUE
> > HasRemoteSyscalls = TRUE
> > HasCheckpointing = TRUE
> > StarterAbilityList =
> >
> > "HasFileTransfer,HasPerFileEncryption,HasReconnect,HasMPI,HasTDP,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
> > CpuBusyTime = 0 CpuIsBusy = FALSE
> > TimeToLive = 2147483647
> > State = "Unclaimed"
> > EnteredCurrentState = 1170613946
> > Activity = "Idle"
> > EnteredCurrentActivity = 1170613946
> > Start = (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxx") ||
> > (((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000)
> > || (State != "Unclaimed" && State != "Owner")))) Requirements = (START)
> > && (IsValidCheckpointPlatform) IsValidCheckpointPlatform =
> > (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!=
> > UNDEFINED) && ((TARGET.LastCheckpointPlatform =?=
> > MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
> > MaxJobRetirementTime = 0 CurrentRank = 0.000000 MonitorSelfTime =
> > 1170670728 MonitorSelfCPUUsage = 0.012512 MonitorSelfImageSize =
> > 8156.000000 MonitorSelfResidentSetSize = 3876 MonitorSelfAge = 0
> > MonitorSelfRegisteredSocketCount = 2
> > DaemonStartTime = 1170434080
> > UpdateSequenceNumber = 806
> > MyAddress = "<172.XX.XX.XX:32814>"
> > LastHeardFrom = 1170670862
> > UpdatesTotal = 807
> > UpdatesSequenced = 806
> > UpdatesLost = 0
> > UpdatesHistory = "0x00000000000000000000000000000000"
> >


----------------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE

Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
----------------------------------------------------