
Re: [Condor-users] Jobs repeatedly evicted after 30 mins



Hi Greg,

Your problem rings a bell. I'm sure I've had exactly the same symptoms
on some of our Unix execute nodes. It's been a while, so my memory is a
bit hazy, but I think the cause was that although I opened the relevant
TCP ports in the firewall of the execute node, I forgot to open the same
UDP ports as well.

I'm fairly new to Condor, so I'm not entirely sure what the exact reason
for the observed behaviour was (maybe someone else can shed more light
on this?), but it appeared to be the following. The scheduler on the
submit node and the starter daemon on the execute node could communicate
fine over TCP, and the jobs would start. After almost exactly half an
hour, the submit node tried to 'check up' on the job, but this time over
UDP. Since the UDP ports were blocked, this failed, and the submit node
then cancelled the job.
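
One way to check whether UDP really is blocked between two machines is to
send a single datagram and see whether it arrives. The Python sketch below
is only an illustration and not part of Condor: the function names are
made up, the port you test is whichever one your firewall rule is supposed
to open, and it assumes you can run Python on both nodes.

```python
import socket

def open_udp_listener(port, host="0.0.0.0"):
    """Bind a UDP socket on the machine whose firewall is under test."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock

def wait_for_datagram(sock, timeout=10.0):
    """Return the first payload received, or None if nothing arrives in time."""
    sock.settimeout(timeout)
    try:
        payload, _sender = sock.recvfrom(1024)
        return payload
    except socket.timeout:
        return None

def send_datagram(host, port, payload=b"ping"):
    """Send one datagram; UDP gives the sender no delivery confirmation."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Run the listener on the execute node and the sender on the submit node
(and then the other way round). If wait_for_datagram returns None while a
TCP connection to the same port works, the firewall is a likely suspect.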

As I say, I might be completely wrong. Let me know if this isn't the
problem and I'll have another think; maybe I'm mixing something up here.

I hope it helps, though.

regards,

 Mike

--On Thursday, March 02, 2006 11:19 AM +0800 Greg.Hitchen@xxxxxxxx wrote:


We have the situation where a user submits ~10 jobs,
all of which should run for ~5 hours. Many/most of
them get repeatedly evicted after 30 mins and requeued.
Below are the relevant logs from the submitting and execute
machines for one particular instance.

I have tested this myself with different jobs and the eviction
ALWAYS happens a few seconds (20?) under 30 minutes after the job starts.

The line in the START LOG:

3/1 05:57:16 State change: claim timed out (condor_schedd gone?)

seems to be the relevant one?

ALL of the evictions (for different execute machines and different
jobs, same submit machine) occur at 30 minutes.

Thanks for any help.

Cheers

Greg

RELEVANT CONFIG SETTINGS?

MachineBusy = ( $(WorkHours) || $(CPUBusy) || $(KeyboardBusy) )

WorkHours = ( (ClockMin >= 480 && ClockMin < 1080) && \
              (ClockDay > 0 && ClockDay < 6) )
AfterHours = ( (ClockMin < 480 || ClockMin >= 1080) || \
               (ClockDay == 0 || ClockDay == 6) )

WANT_SUSPEND	= False
WANT_VACATE		= False
START			= $(AfterHours) && $(CPUIdle) && KeyboardIdle > $(StartIdleTime)
SUSPEND		= $(UWCS_SUSPEND)
CONTINUE		= $(UWCS_CONTINUE)
PREEMPT		= $(UWCS_PREEMPT)
KILL			= True
PERIODIC_CHECKPOINT	= $(UWCS_PERIODIC_CHECKPOINT)
PREEMPTION_REQUIREMENTS	= $(UWCS_PREEMPTION_REQUIREMENTS)
PREEMPTION_RANK		= $(UWCS_PREEMPTION_RANK)
NEGOTIATOR_PRE_JOB_RANK = $(UWCS_NEGOTIATOR_PRE_JOB_RANK)
NEGOTIATOR_POST_JOB_RANK = $(UWCS_NEGOTIATOR_POST_JOB_RANK)
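
For reference, the WorkHours and AfterHours expressions above can be
mirrored in Python to check what they evaluate to at a given time. This is
only a sketch: it assumes ClockMin is minutes since midnight and ClockDay
runs from 0 = Sunday to 6 = Saturday, and the function names are invented
for illustration.

```python
from datetime import datetime

def clock_values(now):
    """Return (ClockMin, ClockDay) under the assumptions stated above."""
    clock_min = now.hour * 60 + now.minute
    clock_day = (now.weekday() + 1) % 7  # Python has Monday=0; here Sunday=0
    return clock_min, clock_day

def work_hours(now):
    """Mirror of WorkHours: 08:00 up to 18:00, Monday to Friday."""
    clock_min, clock_day = clock_values(now)
    return (480 <= clock_min < 1080) and (0 < clock_day < 6)

def after_hours(now):
    """Mirror of AfterHours: outside 08:00-18:00, or a weekend day."""
    clock_min, clock_day = clock_values(now)
    return (clock_min < 480 or clock_min >= 1080) or clock_day in (0, 6)

# The execute machine's logs show the job starting at 05:27 on 3/1.
job_start = datetime(2006, 3, 1, 5, 27)
print(work_hours(job_start), after_hours(job_start))
```

Under these assumptions AfterHours is simply the negation of WorkHours, so
the AfterHours term in START holds on weekends and on weekdays outside
08:00-18:00 local time on the execute machine.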


SHADOW LOG FROM SUBMITTING MACHINE

3/1 08:27:18 ******************************************************
3/1 08:27:18 ** condor_shadow (CONDOR_SHADOW) STARTING UP
3/1 08:27:18 ** C:\Condor\bin\condor_shadow.exe
3/1 08:27:18 ** $CondorVersion: 6.6.10 Jun 22 2005 $
3/1 08:27:18 ** $CondorPlatform: INTEL-WINNT50 $
3/1 08:27:18 ** PID = 1148
3/1 08:27:18 ******************************************************
3/1 08:27:18 Using config file: c:\condor\condor_config
3/1 08:27:18 Using local config files: C:\Condor/condor_config.local
3/1 08:27:18 DaemonCore: Command Socket at <130.155.67.83:9149>
3/1 08:27:19 Initializing a VANILLA shadow
3/1 08:27:20 (72.0) (1148): Request to run on <130.116.147.60:9836> was ACCEPTED
3/1 08:57:16 (72.0) (1148): Job 72.0 is being evicted
3/1 08:57:16 (72.0) (1148): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107

SCHEDD LOG FROM SUBMITTING MACHINE

3/1 08:27:18 Started shadow for job 72.0 on "<130.116.147.60:9836>", (shadow pid = 1148)
3/1 08:57:16 Shadow pid 1148 for job 72.0 exited with status 107
3/1 08:57:16 Sent RELEASE_CLAIM to startd on <130.116.147.60:9836>
3/1 08:57:16 Match record (<130.116.147.60:9836>, 72, 0) deleted

STARTER LOG FROM EXECUTE MACHINE

3/1 05:27:21 ******************************************************
3/1 05:27:21 ** condor_starter (CONDOR_STARTER) STARTING UP
3/1 05:27:21 ** C:\Condor\bin\condor_starter.exe
3/1 05:27:21 ** $CondorVersion: 6.6.10 Jun 22 2005 $
3/1 05:27:21 ** $CondorPlatform: INTEL-WINNT50 $
3/1 05:27:21 ** PID = 800
3/1 05:27:21 ******************************************************
3/1 05:27:21 Using config file: c:\condor\condor_config
3/1 05:27:21 Using local config files: C:\Condor/condor_config.local
3/1 05:27:21 DaemonCore: Command Socket at <130.116.147.60:9931>
3/1 05:27:21 Setting resource limits not implemented!
3/1 05:27:21 Starter communicating with condor_shadow <130.155.67.83:9149>
3/1 05:27:21 Submitting machine is "student3-lu.minerals.csiro.au"
3/1 05:27:36 File transfer completed successfully.
3/1 05:27:37 Starting a VANILLA universe job with ID: 72.0
3/1 05:27:37 IWD: C:\Condor/execute\dir_800
3/1 05:27:37 Output file: C:\Condor/execute\dir_800\EA+mrAD.log
3/1 05:27:37 Renice expr "10" evaluated to 10
3/1 05:27:37 About to exec C:\Condor\execute\dir_800\condor_exec.exe EA+mrAD.egs
3/1 05:27:37 Create_Process succeeded, pid=3536
3/1 05:57:16 Got SIGQUIT.  Performing fast shutdown.
3/1 05:57:16 ShutdownFast all jobs.
3/1 05:57:16 Process exited, pid=3536, status=0
3/1 05:57:17 Last process exited, now Starter is exiting
3/1 05:57:17 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

START LOG FROM EXECUTE MACHINE

3/1 05:27:16 DaemonCore: Command received via UDP from host <130.116.131.60:9593>
3/1 05:27:16 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
3/1 05:27:16 match_info called
3/1 05:27:16 Received match <130.116.147.60:9836>#2100392750
3/1 05:27:16 State change: match notification protocol successful
3/1 05:27:16 Changing state: Unclaimed -> Matched
3/1 05:27:16 DaemonCore: Command received via TCP from host <130.155.67.83:9600>
3/1 05:27:16 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
3/1 05:27:16 Request accepted.
3/1 05:27:16 Remote owner is odw010@xxxxxxxx
3/1 05:27:16 State change: claiming protocol successful
3/1 05:27:16 Changing state: Matched -> Claimed
3/1 05:27:20 DaemonCore: Command received via TCP from host <130.155.67.83:9540>
3/1 05:27:20 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
3/1 05:27:20 Got activate_claim request from shadow (<130.155.67.83:9540>)
3/1 05:27:20 Remote job ID is 72.0
3/1 05:27:21 Got universe "VANILLA" (5) from request classad
3/1 05:27:21 State change: claim-activation protocol successful
3/1 05:27:21 Changing activity: Idle -> Busy
3/1 05:57:16 State change: claim timed out (condor_schedd gone?)
3/1 05:57:16 Changing state and activity: Claimed/Busy -> Preempting/Killing
3/1 05:57:17 DaemonCore: Command received via TCP from host <130.155.67.83:9584>
3/1 05:57:17 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
3/1 05:57:17 Got KILL_FRGN_JOB while in Preempting state, ignoring.
3/1 05:57:17 DaemonCore: Command received via UDP from host <130.116.147.60:9482>
3/1 05:57:17 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
3/1 05:57:17 Starter pid 800 exited with status 0
3/1 05:57:17 State change: starter exited
3/1 05:57:17 State change: No preempting claim, returning to owner
3/1 05:57:17 Changing state and activity: Preempting/Killing -> Owner/Idle
3/1 05:57:17 State change: IS_OWNER is false
3/1 05:57:17 Changing state: Owner -> Unclaimed
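
The intervals in these logs can be checked with a little arithmetic. The
Python sketch below just subtracts timestamps copied from the log lines
above; the helper name and variable names are made up for illustration.

```python
from datetime import datetime

def elapsed(start, end, fmt="%H:%M:%S"):
    """Time between two same-day log timestamps."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

# START LOG: "Matched -> Claimed" until "claim timed out"
claim_to_timeout = elapsed("05:27:16", "05:57:16")
# STARTER LOG: "Create_Process succeeded" until "Got SIGQUIT"
job_runtime = elapsed("05:27:37", "05:57:16")
# SHADOW LOG: "Request to run ... ACCEPTED" until "is being evicted"
shadow_lifetime = elapsed("08:27:20", "08:57:16")

print(claim_to_timeout)   # 0:30:00
print(job_runtime)        # 0:29:39
print(shadow_lifetime)    # 0:29:56
```

So the claim times out exactly 30 minutes after it was made, and the job
itself has been running about 20 seconds less than that because of the
activation and file transfer that happen first.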

-----------------------------------------------------------------------
Greg Hitchen
greg.hitchen@xxxxxxxx
CSIRO Exploration and Mining                  phone: +61 8 6436 8663
Australian Resources Research Centre (ARRC)   fax:   +61 8 6436 8555
Postal address:                               mob:   0407 952 748
PO Box 1130, Bentley WA 6102, Australia
Street Address:
26 Dick Perry Avenue, Kensington WA 6151
-----------------------------------------------------------------------

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users



--------------------------------------------------------------
Michael Tyka, Computational Protein Folding
C.62, Department of Biochemistry,
University of Bristol
http://www.bch.bris.ac.uk/staff/pfdg/mike.htm