
[Condor-users] Long time to reallocation of jobs



Hi, I have a Condor manager and two nodes configured to act as a
dedicated cluster (only for testing, for now), with these options configured:

   Modified condor_config file:

   #START = $(UWCS_START)
   START = True
   #SUSPEND = $(UWCS_SUSPEND)
   #CONTINUE = $(UWCS_CONTINUE)
   #PREEMPT = $(UWCS_PREEMPT)
   SUSPEND = False
   CONTINUE = True
   PREEMPT = False
   #KILL = $(UWCS_KILL)
   KILL = $(ActivityTimer) > $(MaxVacateTime)
   #PREEMPTION_REQUIREMENTS = $(UWCS_PREEMPTION_REQUIREMENTS)
   PREEMPTION_REQUIREMENTS=False

I submit jobs with this description file:

Executable = job2
Universe = vanilla
Requirements = (Arch == "INTEL") || (Arch == "X86_64")
Log = job2.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
job_lease_duration = 180
Queue 10

To test the reallocation of jobs, I shut down one of the nodes and
verified in the ShadowLog that jobs take about 130 minutes to be moved to
another node. Look:

8/11 11:40:49 (7.8) (26492): Request to run on <200.200.x.x:59245> was ACCEPTED
8/11 11:55:53 (7.8) (26492): ZKM: setting default map to (null)
8/11 13:52:18 (7.8) (26492): condor_read(): recv() returned -1, errno = 110, assuming failure reading 5 bytes from unknown source.
8/11 13:52:18 (7.8) (26492): IO: Failed to read packet header
8/11 13:52:18 (7.8) (26492): Can no longer talk to condor_starter <200.200.x.x:59245>
8/11 13:52:18 (7.8) (26492): Trying to reconnect to disconnected job
8/11 13:52:18 (7.8) (26492): LastJobLeaseRenewal: 1218465663 Mon Aug 11 11:41:03 2008
8/11 13:52:18 (7.8) (26492): JobLeaseDuration: 180 seconds
8/11 13:52:18 (7.8) (26492): JobLeaseDuration remaining: EXPIRED!
8/11 13:52:18 (7.8) (26492): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (180 seconds) expired
8/11 13:52:18 (7.8) (26492): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
8/11 13:55:45 Initializing a VANILLA shadow for job 7.8
8/11 13:55:46 (7.8) (23196): Request to run on <200.100.x.x:60004> was ACCEPTED
8/11 14:10:48 (7.8) (23196): ZKM: setting default map to (null)
8/11 14:15:48 (7.8) (23196): ZKM: setting default map to (null)
8/11 14:15:48 (7.8) (23196): Job 7.8 terminated: exited with status 0
8/11 14:15:48 (7.8) (23196): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
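
For reference, the delays in this excerpt can be worked out with a short
Python sketch. The timestamps are copied from the log lines above; the year
2008 is assumed from the LastJobLeaseRenewal line (the log stamps omit it),
and the function and variable names are only illustrative.

```python
from datetime import datetime

# Assumption: the year comes from the LastJobLeaseRenewal line,
# because the ShadowLog stamps themselves carry only month/day.
YEAR = 2008
JOB_LEASE_SECONDS = 180  # job_lease_duration from the submit file


def parse_stamp(stamp, year=YEAR):
    """Turn a ShadowLog 'M/D HH:MM:SS' stamp into a datetime."""
    return datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")


# Timestamps copied from the log excerpt for job 7.8.
last_renewal = parse_stamp("8/11 11:41:03")   # LastJobLeaseRenewal
failure_seen = parse_stamp("8/11 13:52:18")   # recv() returned -1, errno = 110
re_accepted = parse_stamp("8/11 13:55:46")    # accepted on the other node

# Time from the last lease renewal until the shadow noticed the failure.
detect_seconds = int((failure_seen - last_renewal).total_seconds())

# How far past the configured job lease the detection happened.
lease_overrun_seconds = detect_seconds - JOB_LEASE_SECONDS

# Time from the shadow giving up until the job ran on another node.
restart_seconds = int((re_accepted - failure_seen).total_seconds())

print(f"failure noticed after {detect_seconds} s "
      f"({detect_seconds / 60:.2f} min)")
print(f"lease overrun: {lease_overrun_seconds} s")
print(f"restart on another node after {restart_seconds} s")
```

With these values, the shadow noticed the failure 7875 seconds (about 131
minutes) after the last lease renewal, far beyond the 180-second
job_lease_duration, and the job was accepted on the other node 208 seconds
after that.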


My question is: how can I configure Condor to detect a node failure in
less time, say 15 minutes, and then move the job to another node?

I have another question: what is the default interval at which the Condor
manager performs a "condor_reschedule"? Is it possible to change this interval?

Thanks in advance,

Juliao