
RE: [Condor-users] config problems....



hi...

i'm still trying to figure out why only the local/submit machine seems to be
processing the submitted jobs. there appear to be issues with the config
file, the submit file, or both.

i'd like to know if someone can provide me with a sample config file that
demonstrates how to deal with two different machines that don't have an NFS
share, as well as a sample submit file that shows how to have an app run
across the network, on both the local and remote machines.

also, in the future i hope to be able to establish an NFS share on one of
the systems. could someone provide a sample config file (and submit file)
that shows how to handle that kind of setup?
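
one setting usually involved in both cases is FILESYSTEM_DOMAIN (with
UID_DOMAIN alongside it). a minimal sketch, assuming the stock macro names
from the condor manual; the domain string example.edu is hypothetical and
not taken from either machine's real config:

```
##  Case 1: no NFS share between the machines.
##  Give every machine its own filesystem domain so Condor does not
##  assume that files are visible on the other host.
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

##  Case 2: an NFS share mounted at the same path on both machines.
##  Use the same string on both hosts (hypothetical domain name).
##  UID_DOMAIN should only be shared if both machines have the same
##  user accounts and uids.
#FILESYSTEM_DOMAIN = example.edu
#UID_DOMAIN        = example.edu
```

in case 1 the submit file has to ask for condor's file transfer, since
vanilla jobs that rely on a shared filesystem are normally matched only to
machines in the submitter's own filesystem domain.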

i've read/reviewed a good deal of what i could find within the docs/google,
but i still can't seem to figure out what i'm missing....

at this point, this is looking to be a show stopper!!!

thanks for any help/assistance/thoughts/comments/etc....

regards,

-bruce



-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of bruce
Sent: Saturday, October 09, 2004 11:42 AM
To: 'Condor-Users Mail List'; 'Nathan Mueller'
Subject: [Condor-users] config problems....


hi...

it appears that i've hit a wall....

i have two systems running condor: a manager and a client. i can see that
the client is communicating with the manager, via 'condor_status'. i can
submit test jobs on either machine, but each test job only runs on the
local/submit machine.

the problem i'm having is that i can't seem to get condor to run the test
jobs on both the submit machine and the remote machine. i submit my test on
the client machine (it simply runs a test perl script 50 times), and the
submitted jobs only run on the client machine. there is nothing running on
the manager machine, so i'm clueless as to why the jobs don't roll over to
run on the manager machine. i'm also not sure how to set the config file to
ensure that multiple perl scripts are being run simultaneously on a given
server.

the setup that i'm including only seems to have a few jobs running at the
same time!!!!!
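
on the number of simultaneous jobs: by default the startd advertises one
slot (a "virtual machine" in condor 6.x terms) per detected CPU, and each
slot runs one job at a time. a minimal sketch of one way to raise that,
assuming the NUM_CPUS macro is available in the installed version; the
value 4 is only an illustration:

```
##  Local config on the execute machine (hypothetical value).
##  Tell the startd to act as if the machine had 4 CPUs, so that it
##  advertises 4 slots and can run 4 jobs at once.
NUM_CPUS = 4
```

the startd generally has to be restarted, not just reconfigured, to pick
this up; 'condor_status' should then list one line per slot.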

i'm including the 'part 3' sections of the condor_config file for both
machines, although the config files are nearly the same. i'm also
including the test submit file that i'm using.

if anyone has any insight/comments/thoughts as to what i'm
missing/overlooking, i'd appreciate it!!

thanks....

-bruce

ps.. this is critical to being able to actually evaluate/confirm that condor
will do what we need!!!!!


------------------------------------------------------------
client - lserver2 config file
------------------------------------------------------------
##  This section contains macros that are here to help write legible
##  expressions:
MINUTE          = 60
HOUR            = (60 * $(MINUTE))
StateTimer      = (CurrentTime - EnteredCurrentState)
ActivityTimer   = (CurrentTime - EnteredCurrentActivity)
ActivationTimer = (CurrentTime - JobStart)
LastCkpt        = (CurrentTime - LastPeriodicCheckpoint)

##  The JobUniverse attribute is just an int.  These macros can be
##  used to specify the universe in a human-readable way:
STANDARD        = 1
PVM             = 4
VANILLA         = 5
MPI             = 8
IsPVM           = (TARGET.JobUniverse == $(PVM))
IsMPI           = (TARGET.JobUniverse == $(MPI))
IsVanilla       = (TARGET.JobUniverse == $(VANILLA))
IsStandard      = (TARGET.JobUniverse == $(STANDARD))

SmallJob        = (TARGET.ImageSize <  (15 * 1024))

NonCondorLoadAvg        = (LoadAvg - CondorLoadAvg)
BackgroundLoad          = 1.2
HighLoad                = 0.5
StartIdleTime           = 15 * $(MINUTE)
ContinueIdleTime        =  5 * $(MINUTE)
MaxSuspendTime          = 10 * $(MINUTE)
MaxVacateTime           = 10 * $(MINUTE)

KeyboardBusy            = (KeyboardIdle < $(MINUTE))
ConsoleBusy             = (ConsoleIdle  < $(MINUTE))
CPUIdle                 = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
CPUBusy                 = ($(NonCondorLoadAvg) >= $(HighLoad))
KeyboardNotBusy         = ($(KeyboardBusy) == False)

BigJob          = (TARGET.ImageSize >= (50 * 1024))
MediumJob       = (TARGET.ImageSize >= (15 * 1024) && TARGET.ImageSize < (50 * 1024))
SmallJob        = (TARGET.ImageSize <  (15 * 1024))

JustCPU                 = ($(CPUBusy) && ($(KeyboardBusy) == False))
MachineBusy             = ($(CPUBusy) || $(KeyboardBusy))

##  The RANK expression controls which jobs this machine prefers to
##  run over others.  Some examples from the manual include:
##    RANK = TARGET.ImageSize
##    RANK = (Owner == "coltrane") + (Owner == "tyner") \
##                  + ((Owner == "garrison") * 10) + (Owner == "jones")
##  By default, RANK is always 0, meaning that all jobs have an equal
##  ranking.
#RANK                   = 0


#####################################################################
##  This is where you choose the configuration that you would like to
##  use.  It has no defaults so it must be defined.  We start this
##  file off with the UWCS_* policy.
######################################################################

##  Also here is what is referred to as the TESTINGMODE_*, which is
##  a quick hardwired way to test Condor.
##  Replace UWCS_* with TESTINGMODE_* if you wish to do testing mode.
##  For example:
##  WANT_SUSPEND                = $(UWCS_WANT_SUSPEND)
##  becomes
##  WANT_SUSPEND                = $(TESTINGMODE_WANT_SUSPEND)

#bdouglas test params....
WANT_SUSPEND            = False
WANT_VACATE             = False
START                   = ($(Test_START)) || Owner == "apache"
SUSPEND                 = False
CONTINUE                = True
PREEMPT                 = False
KILL                    = False
PERIODIC_CHECKPOINT     = False
PREEMPTION_REQUIREMENTS = False
PREEMPTION_RANK         = 0

###
#WANT_SUSPEND           = $(UWCS_WANT_SUSPEND)
#WANT_VACATE            = $(UWCS_WANT_VACATE)
#START                  = $(UWCS_START)
#SUSPEND                        = $(UWCS_SUSPEND)
#CONTINUE               = $(UWCS_CONTINUE)
#PREEMPT                        = $(UWCS_PREEMPT)
#KILL                   = $(UWCS_KILL)
#PERIODIC_CHECKPOINT    = $(UWCS_PERIODIC_CHECKPOINT)
#PREEMPTION_REQUIREMENTS        = $(UWCS_PREEMPTION_REQUIREMENTS)
#PREEMPTION_RANK                = $(UWCS_PREEMPTION_RANK)

#####################################################################
## B Douglas - Test network attribs...
#####################################################################
Test_START      = $(CPUIdle)

#####################################################################
## This is the UWisc - CS Department Configuration.
#####################################################################
UWCS_WANT_SUSPEND       = ( $(SmallJob) || $(KeyboardNotBusy) \
                            || $(IsPVM) || $(IsVanilla) )
UWCS_WANT_VACATE        = ( $(ActivationTimer) > 10 * $(MINUTE) \
                            || $(IsPVM) || $(IsVanilla) )

# Only start jobs if:
# 1) the keyboard has been idle long enough, AND
# 2) the load average is low enough OR the machine is currently
#    running a Condor job
# (NOTE: Condor will only run 1 job at a time on a given resource.
# The reasons Condor might consider running a different job while
# already running one are machine Rank (defined above), and user
# priorities.)
UWCS_START      = ( (KeyboardIdle > $(StartIdleTime)) \
                    && ( $(CPUIdle) || \
                         (State != "Unclaimed" && State != "Owner")) )

# Suspend jobs if:
# 1) the keyboard has been touched, OR
# 2a) The cpu has been busy for more than 2 minutes, AND
# 2b) the job has been running for more than 90 seconds
UWCS_SUSPEND = ( $(KeyboardBusy) || \
                 ( (CpuBusyTime > 2 * $(MINUTE)) \
                   && $(ActivationTimer) > 90 ) )

# Continue jobs if:
# 1) the cpu is idle, AND
# 2) we've been suspended more than 10 seconds, AND
# 3) the keyboard hasn't been touched in a while
UWCS_CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) \
                  && (KeyboardIdle > $(ContinueIdleTime)) )

# Preempt jobs if:
# 1) The job is suspended and has been suspended longer than we want
# 2) OR, we don't want to suspend this job, but the conditions to
#    suspend jobs have been met (someone is using the machine)
UWCS_PREEMPT = ( ((Activity == "Suspended") && \
                  ($(ActivityTimer) > $(MaxSuspendTime))) \
                 || (SUSPEND && (WANT_SUSPEND == False)) )

# Kill jobs if they have taken too long to vacate gracefully
UWCS_KILL = $(ActivityTimer) > $(MaxVacateTime)

##  Only define vanilla versions of these if you want to make them
##  different from the above settings.
#SUSPEND_VANILLA  = ( $(KeyboardBusy) || \
#       ((CpuBusyTime > 2 * $(MINUTE)) && $(ActivationTimer) > 90) )
#CONTINUE_VANILLA = ( $(CPUIdle) && ($(ActivityTimer) > 10) \
#                     && (KeyboardIdle > $(ContinueIdleTime)) )
#PREEMPT_VANILLA  = ( ((Activity == "Suspended") && \
#                     ($(ActivityTimer) > $(MaxSuspendTime))) \
#                     || (SUSPEND_VANILLA && (WANT_SUSPEND == False)) )
#KILL_VANILLA    = $(ActivityTimer) > $(MaxVacateTime)

##  We use a simple Periodic checkpointing mechanism, but then
##  again we have a very fast network.
UWCS_PERIODIC_CHECKPOINT        = $(LastCkpt) > (3 * $(HOUR))

##  You might want to checkpoint a little less often.  A good
##  example of this is below.  For jobs smaller than 60 megabytes, we
##  periodic checkpoint every 6 hours.  For larger jobs, we only
##  checkpoint every 12 hours.
#UWCS_PERIODIC_CHECKPOINT       = ( (TARGET.ImageSize < 60000) && \
#                           ($(LastCkpt) > (6 * $(HOUR))) ) || \
#                         ( $(LastCkpt) > (12 * $(HOUR)) )

##  The negotiator will not preempt a job running on a given machine
##  unless the PREEMPTION_REQUIREMENTS expression evaluates to true
##  and the owner of the idle job has a better priority than the owner
##  of the running job.  This expression defaults to true.
UWCS_PREEMPTION_REQUIREMENTS = $(StateTimer) > (1 * $(HOUR)) && \
                               RemoteUserPrio > SubmittorPrio * 1.2

##  The PREEMPTION_RANK expression is used to rank machines which the
##  job ranks the same.  For example, if the job has no preference, it
##  is usually preferable to preempt a job with a small ImageSize
##  instead of a job with a large ImageSize.  The default is to rank
##  all preemptable matches the same.  However, the negotiator will
##  always prefer to match the job with an idle machine over a
##  preemptable machine, if the job has no preference between them.
UWCS_PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize


#####################################################################
##  This is a Configuration that will cause your Condor jobs to
##  always run.  This is intended for testing only.
######################################################################

##  This mode will cause your jobs to start on a machine and will let
##  them run to completion.  Condor will ignore all of what is going
##  on in the machine (load average, keyboard activity, etc.)

TESTINGMODE_WANT_SUSPEND        = False
TESTINGMODE_WANT_VACATE         = False
TESTINGMODE_START               = True
TESTINGMODE_SUSPEND             = False
TESTINGMODE_CONTINUE            = True
TESTINGMODE_PREEMPT             = False
TESTINGMODE_KILL                = False
TESTINGMODE_PERIODIC_CHECKPOINT = False
TESTINGMODE_PREEMPTION_REQUIREMENTS = False
TESTINGMODE_PREEMPTION_RANK = 0


------------------------------------------------------------
manager - lserver5 config file
------------------------------------------------------------

##  This section contains macros that are here to help write legible
##  expressions:
MINUTE      = 60
HOUR        = (60 * $(MINUTE))
StateTimer  = (CurrentTime - EnteredCurrentState)
ActivityTimer   = (CurrentTime - EnteredCurrentActivity)
ActivationTimer = (CurrentTime - JobStart)
LastCkpt    = (CurrentTime - LastPeriodicCheckpoint)

##  The JobUniverse attribute is just an int.  These macros can be
##  used to specify the universe in a human-readable way:
STANDARD    = 1
PVM         = 4
VANILLA     = 5
MPI         = 8
IsPVM           = (TARGET.JobUniverse == $(PVM))
IsMPI           = (TARGET.JobUniverse == $(MPI))
IsVanilla       = (TARGET.JobUniverse == $(VANILLA))
IsStandard      = (TARGET.JobUniverse == $(STANDARD))

SmallJob    = (TARGET.ImageSize <  (15 * 1024))

NonCondorLoadAvg    = (LoadAvg - CondorLoadAvg)
BackgroundLoad      = 1.1
HighLoad     = 0.5
StartIdleTime       = 15 * $(MINUTE)
ContinueIdleTime    =  5 * $(MINUTE)
MaxSuspendTime      = 10 * $(MINUTE)
MaxVacateTime       = 10 * $(MINUTE)

KeyboardBusy    = (KeyboardIdle < $(MINUTE))
ConsoleBusy  = (ConsoleIdle  < $(MINUTE))
CPUIdle      = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
CPUBusy      = ($(NonCondorLoadAvg) >= $(HighLoad))
KeyboardNotBusy     = ($(KeyboardBusy) == False)

BigJob      = (TARGET.ImageSize >= (50 * 1024))
MediumJob   = (TARGET.ImageSize >= (15 * 1024) && TARGET.ImageSize < (50 * 1024))
SmallJob    = (TARGET.ImageSize <  (15 * 1024))

JustCPU      = ($(CPUBusy) && ($(KeyboardBusy) == False))
MachineBusy  = ($(CPUBusy) || $(KeyboardBusy))

##  The RANK expression controls which jobs this machine prefers to
##  run over others.  Some examples from the manual include:
##    RANK = TARGET.ImageSize
##    RANK = (Owner == "coltrane") + (Owner == "tyner") \
##                  + ((Owner == "garrison") * 10) + (Owner == "jones")
##  By default, RANK is always 0, meaning that all jobs have an equal
##  ranking.
#RANK        = 0


#####################################################################
##  This is where you choose the configuration that you would like to
##  use.  It has no defaults so it must be defined.  We start this
##  file off with the UWCS_* policy.
######################################################################

##  Also here is what is referred to as the TESTINGMODE_*, which is
##  a quick hardwired way to test Condor.
##  Replace UWCS_* with TESTINGMODE_* if you wish to do testing mode.
##  For example:
##  WANT_SUSPEND     = $(UWCS_WANT_SUSPEND)
##  becomes
##  WANT_SUSPEND     = $(TESTINGMODE_WANT_SUSPEND)

#bdouglas test params
WANT_SUSPEND        = FALSE
WANT_VACATE  = FALSE
START        = ($(Test_Start)) || Owner == "apache"
SUSPEND      = FALSE
CONTINUE     = TRUE
PREEMPT      = FALSE
KILL         = FALSE
PERIODIC_CHECKPOINT = FALSE
PREEMPTION_REQUIREMENTS = FALSE
PREEMPTION_RANK     = 0

#####################################################################
## bdouglas - test network attribs
#####################################################################
Test_Start    = $(CPUIdle)

#WANT_SUSPEND       = $(UWCS_WANT_SUSPEND)
#WANT_VACATE    = $(UWCS_WANT_VACATE)
#START       = $(UWCS_START)
#SUSPEND        = $(UWCS_SUSPEND)
#CONTINUE    = $(UWCS_CONTINUE)
#PREEMPT        = $(UWCS_PREEMPT)
#KILL        = $(UWCS_KILL)
#PERIODIC_CHECKPOINT = $(UWCS_PERIODIC_CHECKPOINT)
#PREEMPTION_REQUIREMENTS = $(UWCS_PREEMPTION_REQUIREMENTS)
#PREEMPTION_RANK     = $(UWCS_PREEMPTION_RANK)

#####################################################################
## This is the UWisc - CS Department Configuration.
#####################################################################
UWCS_WANT_SUSPEND   = ( $(SmallJob) || $(KeyboardNotBusy) \
                            || $(IsPVM) || $(IsVanilla) )
UWCS_WANT_VACATE    = ( $(ActivationTimer) > 10 * $(MINUTE) \
                            || $(IsPVM) || $(IsVanilla) )

# Only start jobs if:
# 1) the keyboard has been idle long enough, AND
# 2) the load average is low enough OR the machine is currently
#    running a Condor job
# (NOTE: Condor will only run 1 job at a time on a given resource.
# The reasons Condor might consider running a different job while
# already running one are machine Rank (defined above), and user
# priorities.)
UWCS_START  = ( (KeyboardIdle > $(StartIdleTime)) \
                    && ( $(CPUIdle) || \
                         (State != "Unclaimed" && State != "Owner")) )

# Suspend jobs if:
# 1) the keyboard has been touched, OR
# 2a) The cpu has been busy for more than 2 minutes, AND
# 2b) the job has been running for more than 90 seconds
UWCS_SUSPEND = ( $(KeyboardBusy) || \
                 ( (CpuBusyTime > 2 * $(MINUTE)) \
                   && $(ActivationTimer) > 90 ) )

# Continue jobs if:
# 1) the cpu is idle, AND
# 2) we've been suspended more than 10 seconds, AND
# 3) the keyboard hasn't been touched in a while
UWCS_CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) \
                  && (KeyboardIdle > $(ContinueIdleTime)) )

# Preempt jobs if:
# 1) The job is suspended and has been suspended longer than we want
# 2) OR, we don't want to suspend this job, but the conditions to
#    suspend jobs have been met (someone is using the machine)
UWCS_PREEMPT = ( ((Activity == "Suspended") && \
                  ($(ActivityTimer) > $(MaxSuspendTime))) \
             || (SUSPEND && (WANT_SUSPEND == False)) )

# Kill jobs if they have taken too long to vacate gracefully
UWCS_KILL = $(ActivityTimer) > $(MaxVacateTime)

##  Only define vanilla versions of these if you want to make them
##  different from the above settings.
#SUSPEND_VANILLA  = ( $(KeyboardBusy) || \
#       ((CpuBusyTime > 2 * $(MINUTE)) && $(ActivationTimer) > 90) )
#CONTINUE_VANILLA = ( $(CPUIdle) && ($(ActivityTimer) > 10) \
#                     && (KeyboardIdle > $(ContinueIdleTime)) )
#PREEMPT_VANILLA  = ( ((Activity == "Suspended") && \
#                     ($(ActivityTimer) > $(MaxSuspendTime))) \
#                     || (SUSPEND_VANILLA && (WANT_SUSPEND == False)) )
#KILL_VANILLA    = $(ActivityTimer) > $(MaxVacateTime)

##  We use a simple Periodic checkpointing mechanism, but then
##  again we have a very fast network.
UWCS_PERIODIC_CHECKPOINT = $(LastCkpt) > (3 * $(HOUR))

##  You might want to checkpoint a little less often.  A good
##  example of this is below.  For jobs smaller than 60 megabytes, we
##  periodic checkpoint every 6 hours.  For larger jobs, we only
##  checkpoint every 12 hours.
#UWCS_PERIODIC_CHECKPOINT       = ( (TARGET.ImageSize < 60000) && \
#                ($(LastCkpt) > (6 * $(HOUR))) ) || \
#              ( $(LastCkpt) > (12 * $(HOUR)) )

##  The negotiator will not preempt a job running on a given machine
##  unless the PREEMPTION_REQUIREMENTS expression evaluates to true
##  and the owner of the idle job has a better priority than the owner
##  of the running job.  This expression defaults to true.
UWCS_PREEMPTION_REQUIREMENTS = $(StateTimer) > (1 * $(HOUR)) && \
                               RemoteUserPrio > SubmittorPrio * 1.2

##  The PREEMPTION_RANK expression is used to rank machines which the
##  job ranks the same.  For example, if the job has no preference, it
##  is usually preferable to preempt a job with a small ImageSize
##  instead of a job with a large ImageSize.  The default is to rank
##  all preemptable matches the same.  However, the negotiator will
##  always prefer to match the job with an idle machine over a
##  preemptable machine, if the job has no preference between them.
UWCS_PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize


#####################################################################
##  This is a Configuration that will cause your Condor jobs to
##  always run.  This is intended for testing only.
######################################################################

##  This mode will cause your jobs to start on a machine and will let
##  them run to completion.  Condor will ignore all of what is going
##  on in the machine (load average, keyboard activity, etc.)

TESTINGMODE_WANT_SUSPEND = False
TESTINGMODE_WANT_VACATE  = False
TESTINGMODE_START    = True
TESTINGMODE_SUSPEND  = False
TESTINGMODE_CONTINUE    = True
TESTINGMODE_PREEMPT  = False
TESTINGMODE_KILL     = False
TESTINGMODE_PERIODIC_CHECKPOINT = False
TESTINGMODE_PREEMPTION_REQUIREMENTS = False
TESTINGMODE_PREEMPTION_RANK = 0


the test submit file.....
$ cat submit1.txt
Executable = ctest.pl
Universe = vanilla
#Output = hello.out
Error = ctest.err
Log = ctest.log

#TRANSFER_FILES = ALWAYS
#Requirements = (OpSys == "LINUX")
initialdir    = /college/data/

Queue 50
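
for comparison, a hedged sketch of the same submit file with condor's file
transfer mechanism turned on, which is the usual approach when the machines
share no filesystem. it assumes a condor version that understands
should_transfer_files and when_to_transfer_output (the commented-out
TRANSFER_FILES line appears to be the older form of the same idea) and that
perl is installed at the same path on every execute machine. the $(Process)
suffix in the file names is an addition, not part of the file above:

```
Executable = ctest.pl
Universe   = vanilla

##  One output/error file per job, so the 50 jobs do not overwrite
##  each other ($(Process) expands to 0..49).
Output     = ctest.$(Process).out
Error      = ctest.$(Process).err
Log        = ctest.log

##  Ship the executable to the execute machine and copy results back
##  when the job exits, instead of relying on a shared filesystem.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

##  Directory on the submit machine where output files are written.
initialdir = /college/data/

Queue 50
```

without the two transfer lines, condor_submit normally adds a requirement
that the execute machine be in the same FileSystemDomain as the submit
machine, which would keep every job on the machine it was submitted from.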



