[condor-users] Why was job evicted?



Hello condor-users@xxxxxxxxxxx,

Joe Blow's job (I've changed all the names, hosts, and IP addresses
to satisfy our security department) was evicted unexpectedly after
running for over 6 hours, and I don't understand why.  I have these
questions about the material that follows:

1.	Why did Joe Blow's job get evicted?  I expected that
	my condor_config file on runhost would prevent any job from
	being evicted once it had started.

2.	What is a "command 404 (DEACTIVATE_CLAIM_FORCIBLY)," i.e. what
	command would I issue from a shell to do this?

3.	Why would a "command 404 (DEACTIVATE_CLAIM_FORCIBLY)" be
	initiated automatically from homehost?  I'm assuming it was
	done automatically, because it's highly unlikely anyone would've
	been working at 22:25 (10:25 P.M.).  BTW homehost isn't my
	Central Manager, which is mgrhost(192.168.1.125).

4.	What causes the messages like "DC_AUTHENTICATE: attempt to open
	invalid session runhost:13840:1063767050:316, failing" that I
	see scattered throughout my log?

Here are the details of the situation:

runhost(RedHat 7.0)  is running Condor 6.4.7 Jan 26 2003.
homehost(RedHat 9.0) is running Condor 6.5.3 Jul  2 2003
mgrhost(HP-UX 11.0)  is running Condor 6.5.3 Jul  1 2003

One of our people, Joe Blow, submitted a Vanilla Flow-3D job
from homehost (192.168.1.245) at 16:17:31 on 09/16.  It started
on runhost (192.168.1.213) and was first evicted at 22:25:50,
after using 6:07:59 (hours:minutes:seconds) of user processing time,
according to his job.log:

	$ cat /home/rdodge/Flow3D/75-2-cu/job.log
	000 (106.000.000) 09/16 16:17:31 Job submitted from host: <192.168.1.245:33135>
	001 (106.000.000) 09/16 16:17:36 Job executing on host: <192.168.1.213:1034>
	006 (106.000.000) 09/16 16:17:45 Image size of job updated: 116560
	004 (106.000.000) 09/16 22:25:50 Job was evicted.
	        (0) Job was not checkpointed.
	                Usr 0 06:07:59, Sys 0 00:00:07  -  Run Remote Usage
	                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	        0  -  Run Bytes Sent By Job
	        2618  -  Run Bytes Received By Job
	001 (106.000.000) 09/16 22:55:51 Job executing on host: <192.168.1.213:1034>
	004 (106.000.000) 09/17 06:25:50 Job was evicted.
	        (0) Job was not checkpointed.
	                Usr 0 07:29:41, Sys 0 00:00:09  -  Run Remote Usage
	                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	        0  -  Run Bytes Sent By Job
	        2618  -  Run Bytes Received By Job
	001 (106.000.000) 09/17 06:51:00 Job executing on host: <192.168.1.213:1034>
	004 (106.000.000) 09/17 08:53:59 Job was evicted.
	        (0) Job was not checkpointed.
	                Usr 0 02:02:51, Sys 0 00:00:02  -  Run Remote Usage
	                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	        0  -  Run Bytes Sent By Job
	        2618  -  Run Bytes Received By Job
	009 (106.000.000) 09/17 08:54:00 Job was aborted by the user.
	        via condor_rm (by user rdodge)
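
For anyone who wants to check the wall-clock length of each run in the
job.log above, here is a minimal Python sketch.  It pairs each "001"
(executing) event with the next "004" (evicted) event.  The function
name run_intervals and the year 2003 are my own assumptions (the log
itself carries no year), and the regular expression only covers the
event header lines shown above, not every event Condor can write.

```python
import re
from datetime import datetime

# Header lines look like:
#   001 (106.000.000) 09/16 16:17:36 Job executing on host: <...>
EVENT_RE = re.compile(
    r"^\s*(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d)/(\d\d) (\d\d):(\d\d):(\d\d) "
)

def run_intervals(log_text, year=2003):
    """Pair each 001 (executing) event with the following 004 (evicted) event.

    The year is an assumption; the job log records only month and day.
    """
    intervals = []
    started = None
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue  # detail lines such as "(0) Job was not checkpointed."
        code = m.group(1)
        month, day, hh, mm, ss = (int(g) for g in m.groups()[4:])
        stamp = datetime(year, month, day, hh, mm, ss)
        if code == "001":
            started = stamp
        elif code == "004" and started is not None:
            intervals.append((started, stamp, stamp - started))
            started = None
    return intervals

# Event header lines copied from the job.log above.
SAMPLE_LOG = """\
000 (106.000.000) 09/16 16:17:31 Job submitted from host: <192.168.1.245:33135>
001 (106.000.000) 09/16 16:17:36 Job executing on host: <192.168.1.213:1034>
004 (106.000.000) 09/16 22:25:50 Job was evicted.
001 (106.000.000) 09/16 22:55:51 Job executing on host: <192.168.1.213:1034>
004 (106.000.000) 09/17 06:25:50 Job was evicted.
001 (106.000.000) 09/17 06:51:00 Job executing on host: <192.168.1.213:1034>
004 (106.000.000) 09/17 08:53:59 Job was evicted.
"""

intervals = run_intervals(SAMPLE_LOG)
for start, end, elapsed in intervals:
    print(f"{start:%m/%d %H:%M:%S} -> {end:%m/%d %H:%M:%S}  ran {elapsed}")
```

For the log above this reports runs of 6:08:14, 7:29:59, and 2:02:59
of wall-clock time, which is close to the user CPU time in each case.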

The StartLog on runhost shows these messages at the time of eviction:

	9/16 22:20:50 DC_AUTHENTICATE: attempt to open invalid session runhost:13840:1063767050:316, failing.
	9/16 22:25:50 DC_AUTHENTICATE: attempt to open invalid session runhost:13840:1063767050:316, failing.
	9/16 22:25:50 vm1: State change: claim timed out (condor_schedd gone?)
	9/16 22:25:50 vm1: Changing state and activity: Claimed/Busy -> Preempting/Killing
	9/16 22:25:50 DaemonCore: Command received via TCP from host <192.168.1.245:47182>
	9/16 22:25:50 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
	9/16 22:25:50 vm1: Got deactivate_claim_forcibly while in Preempting state, ignoring.
	9/16 22:25:50 Starter pid 8588 exited with status 0
	9/16 22:25:50 vm1: State change: starter exited
	9/16 22:25:50 vm1: State change: No preempting match, returning to owner
	9/16 22:25:50 vm1: Changing state and activity: Preempting/Killing -> Owner/Idle
	9/16 22:25:50 vm1: State change: IS_OWNER is false
	9/16 22:25:50 vm1: Changing state: Owner -> Unclaimed
	9/16 22:25:50 DC_AUTHENTICATE: attempt to open invalid session runhost:13840:1063767050:316, failing.
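
To see the order of events in the StartLog excerpt above without the
surrounding noise, a small Python sketch like the following pulls out
just the "State change:" lines.  The function name state_changes is
hypothetical, and the pattern is only a guess based on the lines in
this excerpt; it assumes each log entry sits on a single line.

```python
import re

# Matches lines such as:
#   9/16 22:25:50 vm1: State change: starter exited
STATE_CHANGE_RE = re.compile(
    r"^\s*(\d+/\d+ \d\d:\d\d:\d\d) (vm\d+): State change: (.+)$"
)

def state_changes(log_text):
    """Return (timestamp, vm, reason) for every 'State change:' line."""
    changes = []
    for line in log_text.splitlines():
        m = STATE_CHANGE_RE.match(line.rstrip())
        if m:
            changes.append(m.groups())
    return changes

# Lines copied from the StartLog excerpt above.
SAMPLE_STARTLOG = """\
9/16 22:25:50 vm1: State change: claim timed out (condor_schedd gone?)
9/16 22:25:50 vm1: Changing state and activity: Claimed/Busy -> Preempting/Killing
9/16 22:25:50 Starter pid 8588 exited with status 0
9/16 22:25:50 vm1: State change: starter exited
9/16 22:25:50 vm1: State change: No preempting match, returning to owner
9/16 22:25:50 vm1: State change: IS_OWNER is false
9/16 22:25:50 vm1: Changing state: Owner -> Unclaimed
"""

changes = state_changes(SAMPLE_STARTLOG)
for stamp, vm, reason in changes:
    print(stamp, vm, reason)
```

The first reason it lists for the 22:25:50 eviction is the
"claim timed out (condor_schedd gone?)" message.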

The StarterLog.vm1 on runhost shows these messages:

	9/17 06:50:59 ******************************************************
	9/17 06:50:59 ** condor_starter (CONDOR_STARTER) STARTING UP
	9/17 06:50:59 ** $CondorVersion: 6.4.7 Jan 26 2003 $
	9/17 06:50:59 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
	9/17 06:50:59 ** PID = 19871
	9/17 06:50:59 ******************************************************
	9/17 06:50:59 DaemonCore: Command Socket at <192.168.1.213:2790>
	9/17 06:50:59 Submitting machine is "homehost.kcc.com"
	9/17 06:50:59 Done setting resource limits
	9/17 06:50:59 File transfer completed successfully.
	9/17 06:51:00 Starting a VANILLA universe job.
	9/17 06:51:00 Output file: /home/condor/execute/dir_19871/job.stdout
	9/17 06:51:00 Error file: /home/condor/execute/dir_19871/job.stderr
	9/17 06:51:00 About to exec /home/condor/execute/dir_19871/condor_exec.exe
	9/17 06:51:00 Create_Process succeeded, pid=19873
	9/17 08:53:59 Got SIGQUIT.  Performing fast shutdown.
	9/17 08:53:59 ShutdownFast all jobs.
	9/17 08:53:59 Job exited, pid=19873, signal=9
	9/17 08:53:59 Last process exited, now Starter is exiting
	9/17 08:53:59 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

On runhost, I have these settings in /usr/local/condor/etc/condor_config,
which I thought would prevent any job from being suspended or evicted:

	WANT_SUSPEND            = $(TESTINGMODE_WANT_SUSPEND)
	WANT_VACATE             = $(TESTINGMODE_WANT_VACATE)
	START                   = $(TESTINGMODE_START)
	SUSPEND                 = $(TESTINGMODE_SUSPEND)
	CONTINUE                = $(TESTINGMODE_CONTINUE)
	PREEMPT                 = $(TESTINGMODE_PREEMPT)
	KILL                    = $(TESTINGMODE_KILL)
	PERIODIC_CHECKPOINT     = $(TESTINGMODE_PERIODIC_CHECKPOINT)
	PREEMPTION_REQUIREMENTS = $(TESTINGMODE_PREEMPTION_REQUIREMENTS)
	PREEMPTION_RANK         = $(TESTINGMODE_PREEMPTION_RANK)
	TESTINGMODE_WANT_SUSPEND        = False
	TESTINGMODE_WANT_VACATE         = False
	TESTINGMODE_START               = True
	TESTINGMODE_SUSPEND             = False
	TESTINGMODE_CONTINUE            = True
	TESTINGMODE_PREEMPT             = False
	TESTINGMODE_KILL                = False
	TESTINGMODE_PERIODIC_CHECKPOINT = False
	TESTINGMODE_PREEMPTION_REQUIREMENTS = False
	TESTINGMODE_PREEMPTION_RANK = 0
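
As a sanity check that the $(TESTINGMODE_...) references above resolve
to the values intended, here is a minimal Python sketch of the macro
substitution.  It is a simplified stand-in, not Condor's own config
parser: parse_config and expand are hypothetical helper names, names
are matched exactly as written, and an undefined macro raises KeyError
here.  On a real machine, condor_config_val is the tool to ask what
Condor itself sees.

```python
import re

MACRO_RE = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")

def parse_config(text):
    """Read simple NAME = VALUE lines into a dict (later lines win)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings

def expand(name, settings, depth=0):
    """Recursively substitute $(NAME) references in a setting's value."""
    if depth > 20:
        raise ValueError(f"macro expansion too deep at {name}")
    value = settings[name]  # KeyError if undefined (a choice made in this sketch)
    return MACRO_RE.sub(
        lambda m: expand(m.group(1), settings, depth + 1), value
    )

# A subset of the settings shown above.
SAMPLE_CONFIG = """\
WANT_SUSPEND            = $(TESTINGMODE_WANT_SUSPEND)
WANT_VACATE             = $(TESTINGMODE_WANT_VACATE)
START                   = $(TESTINGMODE_START)
PREEMPT                 = $(TESTINGMODE_PREEMPT)
KILL                    = $(TESTINGMODE_KILL)
PREEMPTION_RANK         = $(TESTINGMODE_PREEMPTION_RANK)
TESTINGMODE_WANT_SUSPEND        = False
TESTINGMODE_WANT_VACATE         = False
TESTINGMODE_START               = True
TESTINGMODE_PREEMPT             = False
TESTINGMODE_KILL                = False
TESTINGMODE_PREEMPTION_RANK = 0
"""

settings = parse_config(SAMPLE_CONFIG)
for name in ("WANT_SUSPEND", "WANT_VACATE", "START",
             "PREEMPT", "KILL", "PREEMPTION_RANK"):
    print(f"{name} = {expand(name, settings)}")
```

This only confirms that the macros expand as written; it says nothing
about which of these expressions Condor consults in a given situation.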


