
[Condor-users] Not managing to get the parallel universe example from manual section "2.11.2 Parallel Job Submission" to run



Hello,

I am trying to get this example to work on my Condor pool (restricted
for this purpose to a single computer). I am using Fedora Core 4 and
Condor 6.7.14.

#############################################
##   submit description file for parallel program
#############################################
universe = parallel
executable = /bin/sleep
arguments = 30
machine_count = 8
queue

Since I have only one computer in my pool (the submitting machine),
I changed machine_count to 1, so the file becomes:

#############################################
##   submit description file for parallel program
#############################################
universe = parallel
executable = /bin/sleep
arguments = 30
machine_count = 1
queue

I changed condor_config.local so that it reads as follows:

DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"

START     = True
SUSPEND   = False
CONTINUE  = True
PREEMPT   = False
KILL      = False
WANT_SUSPEND   = False
WANT_VACATE    = False
RANK      = Scheduler =?= $(DedicatedScheduler)

STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
ALL_DEBUG               = D_FULLDEBUG

Now I submit the job and it remains idle.

condor_q -analyze


-- Submitter: ys.cap.ed.ac.uk : <129.215.181.34:35702> : ys.cap.ed.ac.uk
  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
005.000:  Run analysis summary.  Of 1 machines,
       0 are rejected by your job's requirements
       0 reject your job because of their own requirements
       0 match but are serving users with a better priority in the pool
       0 match but reject the job for unknown reasons
       0 match but will not currently preempt their existing job
       1 are available to run your job

1 jobs; 1 idle, 0 running, 0 held

condor_q

-- Submitter: ys.cap.ed.ac.uk : <129.215.181.34:35702> : ys.cap.ed.ac.uk
  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
    5.0   jgrunche        2/3  14:42   0+00:00:00 I  0   0.0  sleep 30

1 jobs; 1 idle, 0 running, 0 held

Even after running condor_reschedule a couple of times, the job stays idle.

I think all the necessary daemons are running:

[jgrunche@ys MPI]$ ps -A|grep condor
27734 ?        00:00:01 condor_master
27735 ?        00:00:00 condor_collecto
27736 ?        00:00:00 condor_negotiat
27737 ?        00:00:05 condor_startd
27738 ?        00:00:00 condor_schedd

I can get the same job to run in the vanilla universe:

#############################################
##   submit description file for parallel program
#############################################
universe = vanilla
executable = /bin/sleep
arguments = 30
queue

with this altered condor_config.local:

#DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"

START     = True
SUSPEND   = False
CONTINUE  = True
PREEMPT   = False
KILL      = False
WANT_SUSPEND   = False
WANT_VACATE    = False
#RANK      = Scheduler =?= $(DedicatedScheduler)

#STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
#ALL_DEBUG               = D_FULLDEBUG
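
To make the difference between the two condor_config.local files explicit, here is a minimal Python sketch that lists the settings active only in the parallel-universe version. The helper name active_settings and the two string constants are made up for this illustration; the config text is copied from the two listings above. It only compares setting names and values and says nothing about which of them, if any, keeps the job idle.

```python
def active_settings(text):
    """Return {name: value} for every uncommented 'NAME = value' line."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so 'RANK = Scheduler =?= ...' stays intact.
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


# Hypothetical constants holding the two configs quoted in this message.
PARALLEL_CONFIG = '''
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"
START     = True
SUSPEND   = False
CONTINUE  = True
PREEMPT   = False
KILL      = False
WANT_SUSPEND   = False
WANT_VACATE    = False
RANK      = Scheduler =?= $(DedicatedScheduler)
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
ALL_DEBUG               = D_FULLDEBUG
'''

VANILLA_CONFIG = '''
#DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"
START     = True
SUSPEND   = False
CONTINUE  = True
PREEMPT   = False
KILL      = False
WANT_SUSPEND   = False
WANT_VACATE    = False
#RANK      = Scheduler =?= $(DedicatedScheduler)
#STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
#ALL_DEBUG               = D_FULLDEBUG
'''

parallel = active_settings(PARALLEL_CONFIG)
vanilla = active_settings(VANILLA_CONFIG)

# Settings that are active only in the parallel-universe config.
only_in_parallel = sorted(set(parallel) - set(vanilla))
# Settings present in both files that have the same value in each.
shared_identical = all(parallel[name] == vanilla[name] for name in vanilla)

for name in only_in_parallel:
    print(f"{name} = {parallel[name]}")
```

So the two setups differ only in the four dedicated-scheduler and debug lines; the START/SUSPEND/PREEMPT policy is the same in both.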

So I really don't know whether there is a parameter that I have not set
correctly for this example, which is taken from the manual.


I include the output of condor_status -l and condor_q -l:

[jgrunche@ys condor]$ condor_status -l
MyType = "Machine"
TargetType = "Job"
Name = "ys.cap.ed.ac.uk"
Machine = "ys.cap.ed.ac.uk"
Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxx"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "ys.cap.ed.ac.uk"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"
CondorVersion = "$CondorVersion: 6.7.14 Dec 13 2005 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 2031608
Disk = 37631704
CondorLoadAvg = 0.000000
LoadAvg = 0.160000
KeyboardIdle = 0
ConsoleIdle = 0
Memory = 2027
Cpus = 1
StartdIpAddr = "<129.215.181.34:35701>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "ys.cap.ed.ac.uk"
FileSystemDomain = "ys.cap.ed.ac.uk"
Subnet = "129.215.181"
HasIOProxy = TRUE
TotalVirtualMemory = 2031608
TotalDisk = 37631704
TotalCpus = 1
TotalMemory = 2027
KFlops = 750455
Mips = 2189
LastBenchmark = 1138977608
TotalLoadAvg = 0.160000
TotalCondorLoadAvg = 0.000000
ClockMin = 900
ClockDay = 5
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasPerFileEncryption = TRUE
HasReconnect = TRUE
HasMPI = TRUE
HasTDP = TRUE
HasJobDeferral = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
JavaVendor = "Free Software Foundation, Inc."
JavaVersion = "1.4.2"
JavaMFlops = 3.289999
HasJava = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasPerFileEncryption,HasReconnect,HasMPI,HasTDP,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasJava,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
TimeToLive = 2147483647
State = "Unclaimed"
EnteredCurrentState = 1138977608
Activity = "Idle"
EnteredCurrentActivity = 1138977608
Start = TRUE
Requirements = START
MaxJobRetirementTime = 0
CurrentRank = 0.000000
MonitorSelfTime = 1138978808
MonitorSelfCPUUsage = 0.000000
MonitorSelfImageSize = 7492.000000
MonitorSelfResidentSetSize = 3300
MonitorSelfAge = 1213
DaemonStartTime = 1138977601
UpdateSequenceNumber = 4
MyAddress = "<129.215.181.34:35701>"
LastHeardFrom = 1138978812
UpdatesTotal = 5
UpdatesSequenced = 4
UpdatesLost = 0



[jgrunche@ys condor]$ condor_q -l


-- Submitter: ys.cap.ed.ac.uk : <129.215.181.34:35702> : ys.cap.ed.ac.uk
MyType = "Job"
TargetType = "Machine"
ClusterId = 5
QDate = 1138977720
CompletionDate = 0
Owner = "jgrunche"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
LastSuspensionTime = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 6.7.14 Dec 13 2005 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
RootDir = "/"
Iwd = "/home/jgrunche/condex/MPI"
JobUniverse = 11
Cmd = "/bin/sleep"
WantIOProxy = TRUE
CurrentHosts = 0
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
RemoteSpoolDir = "/home/condor/spool/cluster5.proc0.subproc0"
MinHosts = 1
MaxHosts = 1
JobStatus = 1
EnteredCurrentStatus = 1138977720
JobPrio = 0
User = "jgrunche@xxxxxxxxxxxxxxx"
NiceUser = FALSE
Env = ""
JobNotification = 2
WantRemoteIO = TRUE
CoreSize = 0
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "/dev/null"
TransferOut = FALSE
Err = "/dev/null"
TransferErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize = 19
ExecutableSize = 19
DiskUsage = 19
Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
FileSystemDomain = "ys.cap.ed.ac.uk"
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Args = "30"
GlobalJobId = "ys.cap.ed.ac.uk#1138977720#5.0"
ProcId = 0
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"
ServerTime = 1138979122
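
One thing that can be checked mechanically from these two dumps is whether the machine's DedicatedScheduler string is identical to the job's Scheduler string, since the RANK expression above compares exactly those two attributes. Below is a minimal Python sketch, assuming the long-format output consists of "Name = value" lines as shown above. The function parse_classad is a hypothetical helper, not part of Condor, and the two excerpts are copied from this message (the address is masked in both). Matching names are only one condition, and this does not by itself explain the idle job.

```python
def parse_classad(text):
    """Parse 'Name = value' lines from long-format output into a dict.

    Lines without ' = ' (banners, wrapped continuation lines) are skipped.
    """
    ad = {}
    for line in text.splitlines():
        # Split on the first ' = ' so values containing '=?=' stay intact.
        name, sep, value = line.partition(" = ")
        name = name.strip()
        if sep and name.isidentifier():
            ad[name] = value.strip()
    return ad


# Excerpt of the condor_status -l output quoted in this message.
MACHINE_AD_TEXT = '''
MyType = "Machine"
Name = "ys.cap.ed.ac.uk"
Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxx"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"
State = "Unclaimed"
Activity = "Idle"
'''

# Excerpt of the condor_q -l output quoted in this message.
JOB_AD_TEXT = '''
-- Submitter: ys.cap.ed.ac.uk : <129.215.181.34:35702> : ys.cap.ed.ac.uk
MyType = "Job"
JobUniverse = 11
JobStatus = 1
MinHosts = 1
MaxHosts = 1
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"
'''

machine_ad = parse_classad(MACHINE_AD_TEXT)
job_ad = parse_classad(JOB_AD_TEXT)

# The machine's Rank evaluates 'Scheduler =?= DedicatedScheduler', so the
# two strings have to be identical for the rank to be true.
names_match = machine_ad["DedicatedScheduler"] == job_ad["Scheduler"]
print("DedicatedScheduler on machine:", machine_ad["DedicatedScheduler"])
print("Scheduler on job:             ", job_ad["Scheduler"])
print("identical:", names_match)
```

On the real pool the same comparison would have to be made on the unmasked values, for example by saving both outputs to files and passing their contents to parse_classad.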

Thanks for your help.