Re: [Condor-users] Jobs are Executed Only on the Central Manager



By the way, the executable (simple) is also present on the other machines,
at /home/condor/test/simple. Or do I also have to transfer the executable?

So that others can follow, I have the following submit file:

Universe   = vanilla
Executable = simple
Arguments  = 4 10
Log        = simple.log
Output     = simple.out
Error      = simple.error

should_transfer_file = YES
when_to_transfer_file = ON_EXIT

Queue 25
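
For comparison, here is a sketch of the same submit file with the two
file-transfer commands spelled the way the Condor manual gives them,
should_transfer_files and when_to_transfer_output. This assumes file
transfer is what was intended; the condor_q output below reports
ShouldTransferFiles = "NO", which suggests the commands as written above
were not picked up.

```
Universe   = vanilla
Executable = simple
Arguments  = 4 10
Log        = simple.log
Output     = simple.out
Error      = simple.error

# Assumed intent: let Condor copy the executable to the execute machine
# and bring the output files back when the job exits.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

Queue 25
```

With transfer enabled, Condor sends the executable along with the job by
default, so it would not need to be present on the other machines.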


After running condor_submit I got the following. I submitted the job
from my central manager (in this case phys-ugradlab01).

While the jobs were executing:

#############################################
[condor@phys-ugradlab01 test]$ condor_q -l
MyType = "Job"
TargetType = "Machine"
ClusterId = 5
QDate = 1158556776
CompletionDate = 0
Owner = "condor"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
LastSuspensionTime = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 6.8.0 Jul 19 2006 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL3 $"
RootDir = "/"
Iwd = "/home/condor/test"
JobUniverse = 5
Cmd = "/home/condor/test/simple"
MinHosts = 1
MaxHosts = 1
CurrentHosts = 0
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
JobStatus = 1
EnteredCurrentStatus = 1158556776
JobPrio = 0
User = "condor@xxxxxxxxxxxxxxxxxxxxx"
NiceUser = FALSE
Environment = ""
JobNotification = 2
WantRemoteIO = TRUE
UserLog = "/home/condor/test/simple.log"
CoreSize = 0
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "simple.out"
StreamOut = FALSE
Err = "simple.error"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize_RAW = 5
ImageSize = 10000
ExecutableSize_RAW = 5
ExecutableSize = 10000
DiskUsage_RAW = 5
DiskUsage = 10000
Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
FileSystemDomain = "physics.msuiit.edu.ph"
JobLeaseDuration = 1200
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Args = "4 10"
GlobalJobId = "phys-ugradlab01.physics.msuiit.edu.ph#1158556776#5.24"
ProcId = 24
AutoClusterId = 0
AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,DiskUsage,ImageSize,FileSystemDomain,Requirements"
ServerTime = 1158556833
###################################################


After all the jobs are executed:

#############################################
[condor@phys-ugradlab01 test]$ condor_q -l

-- Submitter: phys-ugradlab01.physics.msuiit.edu.ph : <10.0.40.148:33456> : phys-ugradlab01.physics.msuiit.edu.ph
#############################################

#############################################
[condor@phys-ugradlab01 test]$ condor_status -l
MyType = "Machine"
TargetType = "Job"
Name = "nucleus.cluster.physics.msuiit.edu.ph"
Machine = "nucleus.cluster.physics.msuiit.edu.ph"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "phys-ugradlab01.physics.msuiit.edu.ph"
CondorVersion = "$CondorVersion: 6.8.0 Jul 19 2006 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL3 $"
VirtualMachineID = 1
VirtualMemory = 1052216
Disk = 3093260
CondorLoadAvg = 0.000000
LoadAvg = 0.520000
KeyboardIdle = 0
ConsoleIdle = 0
Memory = 1011
Cpus = 1
StartdIpAddr = "<10.0.40.250:34509>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "physics.msuiit.edu.ph"
FileSystemDomain = "physics.msuiit.edu.ph"
Subnet = "10.0.40"
HasIOProxy = TRUE
CheckpointPlatform = "LINUX INTEL 2.6.x normal"
TotalVirtualMemory = 1052216
TotalDisk = 3093260
TotalCpus = 1
TotalMemory = 1011
KFlops = 638760
Mips = 2108
LastBenchmark = 1158554363
TotalLoadAvg = 0.520000
TotalCondorLoadAvg = 0.000000
ClockMin = 784
ClockDay = 1
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasPerFileEncryption = TRUE
HasReconnect = TRUE
HasMPI = TRUE
HasTDP = TRUE
HasJobDeferral = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasPerFileEncryption,HasReconnect,HasMPI,HasTDP,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 4
CpuIsBusy = TRUE
TimeToLive = 2147483647
State = "Owner"
EnteredCurrentState = 1158554356
Activity = "Idle"
EnteredCurrentActivity = 1158554356
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = (START) && (IsValidCheckpointPlatform)
IsValidCheckpointPlatform = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
MaxJobRetirementTime = 0
CurrentRank = 0.000000
MonitorSelfTime = 1158555803
MonitorSelfCPUUsage = 0.000000
MonitorSelfImageSize = 8768.000000
MonitorSelfResidentSetSize = 3384
MonitorSelfAge = 0
DaemonStartTime = 1158554355
UpdateSequenceNumber = 5
MyAddress = "<10.0.40.250:34509>"
LastHeardFrom = 1158555615
UpdatesTotal = 6
UpdatesSequenced = 5
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"

MyType = "Machine"
TargetType = "Job"
Name = "phys-ugradlab01.physics.msuiit.edu.ph"
Machine = "phys-ugradlab01.physics.msuiit.edu.ph"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "phys-ugradlab01.physics.msuiit.edu.ph"
CondorVersion = "$CondorVersion: 6.8.0 Jul 19 2006 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL3 $"
VirtualMachineID = 1
VirtualMemory = 1052248
Disk = 2831836
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 600
ConsoleIdle = 600
Memory = 1003
Cpus = 1
StartdIpAddr = "<10.0.40.148:33457>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "physics.msuiit.edu.ph"
FileSystemDomain = "physics.msuiit.edu.ph"
Subnet = "10.0.40"
HasIOProxy = TRUE
CheckpointPlatform = "LINUX INTEL 2.6.x normal"
TotalVirtualMemory = 1052248
TotalDisk = 2831836
TotalCpus = 1
TotalMemory = 1003
KFlops = 780303
Mips = 2277
LastBenchmark = 1158553890
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 301
ClockDay = 1
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasPerFileEncryption = TRUE
HasReconnect = TRUE
HasMPI = TRUE
HasTDP = TRUE
HasJobDeferral = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
JavaVendor = "Sun Microsystems Inc."
JavaVersion = "1.4.2_12"
JavaMFlops = 164.002945
HasJava = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasPerFileEncryption,HasReconnect,HasMPI,HasTDP,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasJava,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
TimeToLive = 2147483647
State = "Unclaimed"
EnteredCurrentState = 1158554808
Activity = "Idle"
EnteredCurrentActivity = 1158554808
Start = TRUE
Requirements = (START) && (IsValidCheckpointPlatform)
IsValidCheckpointPlatform = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
MaxJobRetirementTime = 0
CurrentRank = 0.000000
MonitorSelfTime = 1158555570
MonitorSelfCPUUsage = 0.004167
MonitorSelfImageSize = 8784.000000
MonitorSelfResidentSetSize = 3640
MonitorSelfAge = 0
DaemonStartTime = 1158553884
UpdateSequenceNumber = 31
MyAddress = "<10.0.40.148:33457>"
LastHeardFrom = 1158555694
UpdatesTotal = 32
UpdatesSequenced = 31
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"

MyType = "Machine"
TargetType = "Job"
Name = "phys-ugradlab02.physics.msuiit.edu.ph"
Machine = "phys-ugradlab02.physics.msuiit.edu.ph"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "phys-ugradlab01.physics.msuiit.edu.ph"
CondorVersion = "$CondorVersion: 6.8.0 Jul 19 2006 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL3 $"
VirtualMachineID = 1
VirtualMemory = 1052248
Disk = 2858884
CondorLoadAvg = 0.000000
LoadAvg = 0.040000
KeyboardIdle = 0
ConsoleIdle = 0
Memory = 1003
Cpus = 1
StartdIpAddr = "<10.0.40.139:33390>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "physics.msuiit.edu.ph"
FileSystemDomain = "physics.msuiit.edu.ph"
Subnet = "10.0.40"
HasIOProxy = TRUE
CheckpointPlatform = "LINUX INTEL 2.6.x normal"
TotalVirtualMemory = 1052248
TotalDisk = 2858884
TotalCpus = 1
TotalMemory = 1003
KFlops = 758748
Mips = 2189
LastBenchmark = 1158554342
TotalLoadAvg = 0.040000
TotalCondorLoadAvg = 0.000000
ClockMin = 784
ClockDay = 1
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasPerFileEncryption = TRUE
HasReconnect = TRUE
HasMPI = TRUE
HasTDP = TRUE
HasJobDeferral = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
JavaVendor = "Sun Microsystems Inc."
JavaVersion = "1.4.2_12"
JavaMFlops = 174.977432
HasJava = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasPerFileEncryption,HasReconnect,HasMPI,HasTDP,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasJava,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
TimeToLive = 2147483647
State = "Owner"
EnteredCurrentState = 1158554337
Activity = "Idle"
EnteredCurrentActivity = 1158554337
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = (START) && (IsValidCheckpointPlatform)
IsValidCheckpointPlatform = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
MaxJobRetirementTime = 0
CurrentRank = 0.000000
MonitorSelfTime = 1158555782
MonitorSelfCPUUsage = 0.000000
MonitorSelfImageSize = 8268.000000
MonitorSelfResidentSetSize = 3380
MonitorSelfAge = 0
DaemonStartTime = 1158554336
UpdateSequenceNumber = 5
MyAddress = "<10.0.40.139:33390>"
LastHeardFrom = 1158555586
UpdatesTotal = 6
UpdatesSequenced = 5
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"

MyType = "Machine"
TargetType = "Job"
Name = "phys-ugradlab03.physics.msuiit.edu.ph"
Machine = "phys-ugradlab03.physics.msuiit.edu.ph"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "phys-ugradlab01.physics.msuiit.edu.ph"
CondorVersion = "$CondorVersion: 6.8.0 Jul 19 2006 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL3 $"
VirtualMachineID = 1
VirtualMemory = 1052248
Disk = 2873652
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 0
ConsoleIdle = 0
Memory = 1003
Cpus = 1
StartdIpAddr = "<10.0.40.112:32878>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "physics.msuiit.edu.ph"
FileSystemDomain = "physics.msuiit.edu.ph"
Subnet = "10.0.40"
HasIOProxy = TRUE
CheckpointPlatform = "LINUX INTEL 2.6.x normal"
TotalVirtualMemory = 1052248
TotalDisk = 2873652
TotalCpus = 1
TotalMemory = 1003
KFlops = 754579
Mips = 2277
LastBenchmark = 1158554650
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 789
ClockDay = 1
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasPerFileEncryption = TRUE
HasReconnect = TRUE
HasMPI = TRUE
HasTDP = TRUE
HasJobDeferral = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
JavaVendor = "Sun Microsystems Inc."
JavaVersion = "1.4.2_12"
JavaMFlops = 182.371674
HasJava = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasPerFileEncryption,HasReconnect,HasMPI,HasTDP,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasJava,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
TimeToLive = 2147483647
State = "Owner"
EnteredCurrentState = 1158554644
Activity = "Idle"
EnteredCurrentActivity = 1158554644
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = (START) && (IsValidCheckpointPlatform)
IsValidCheckpointPlatform = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
MaxJobRetirementTime = 0
CurrentRank = 0.000000
MonitorSelfTime = 1158556090
MonitorSelfCPUUsage = 0.000000
MonitorSelfImageSize = 9112.000000
MonitorSelfResidentSetSize = 3388
MonitorSelfAge = 0
DaemonStartTime = 1158554643
UpdateSequenceNumber = 5
MyAddress = "<10.0.40.112:32878>"
LastHeardFrom = 1158555600
UpdatesTotal = 6
UpdatesSequenced = 5
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
#####################################################
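
The Start expression reported above for nucleus, phys-ugradlab02 and
phys-ugradlab03 requires KeyboardIdle > 15 * 60, while each of those
machines reports KeyboardIdle = 0 and State = "Owner"; only
phys-ugradlab01 reports Start = TRUE. As a minimal sketch, and assuming
the execute machines are meant to be dedicated to Condor (an assumption,
not something established in this thread), a local configuration along
the lines of the "always run jobs" policy in the Condor manual might look
like this:

```
# condor_config.local on an execute machine
# (assumes the machine should accept jobs regardless of keyboard activity)
START    = TRUE
SUSPEND  = FALSE
CONTINUE = TRUE
PREEMPT  = FALSE
KILL     = FALSE
```

Running condor_reconfig on that machine afterwards should make the startd
re-read its configuration. On desktops that people actually use, the
default keyboard-idle policy is probably the better choice.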

> On Sun, Sep 17, 2006 at 04:01:57AM +0800, leo@xxxxxxxxxxxxxxxxxxxxx wrote:
>>
>> I see now. What is meant by "...their own requirements"?
>>
>
> It means that the condor_startd on those machines does not believe your
> job matches the machine: either your job is missing something or (more
> likely) the startd isn't willing to run a job at the moment, usually
> because the keyboard or mouse is not idle and so the machine is in the
> "Owner" state.
>
> We can answer this very quickly if you run, from the machine where you
> submitted your job:
>
> condor_status -l
> and
> condor_q -l
>
> -Erik

Thanks.

Leo