[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] Can't find reason for eviction of vanilla jobs



I've been running some vanilla jobs in the past few days, but there has been trouble with evictions - only on two machines within the pool.

This is with version 7.2.1. I would appreciate guidance in tracking this down. Various log file excerpts are below.

Adam


Here's a typical example (8.018):

001 (008.018.000) 03/30 14:31:56 Job executing on host: <execute.host:48432>
...
001 (008.019.000) 03/30 14:32:01 Job executing on host: <execute.host:48432>
...
006 (008.018.000) 03/30 14:32:04 Image size of job updated: 241764
...
006 (008.019.000) 03/30 14:37:09 Image size of job updated: 241764
...
004 (008.018.000) 03/30 14:51:55 Job was evicted.
       (0) Job was not checkpointed.
               Usr 0 00:19:58, Sys 0 00:00:00  -  Run Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
       0  -  Run Bytes Sent By Job
       978049  -  Run Bytes Received By Job

On the relevant execute hose StarterLog.slot1:

3/30 14:51:55 Got SIGQUIT.  Performing fast shutdown.
3/30 14:51:55 ShutdownFast all jobs.
3/30 14:51:55 Process exited, pid=3908, signal=9
3/30 14:51:55 Last process exited, now Starter is exiting
3/30 14:51:55 **** condor_starter (condor_STARTER) pid 3906 EXITING WITH STATUS 0
3/30 14:56:55 ******************************************************
3/30 14:56:55 ** condor_starter (CONDOR_STARTER) STARTING UP
3/30 14:56:55 ** /usr/sbin/condor_starter
3/30 14:56:55 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
3/30 14:56:55 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
3/30 14:56:55 ** $CondorVersion: 7.2.1 Mar 27 2009 BuildID: Fedora-7.2.1-1.el5 $
3/30 14:56:55 ** $CondorPlatform: X86_64-LINUX_CENTOS52el5centos $
3/30 14:56:55 ** PID = 4003
3/30 14:56:55 ** Log last touched 3/30 14:51:55
3/30 14:56:55 ******************************************************
3/30 14:56:55 Using config source: /etc/condor/condor_config
3/30 14:56:55 Using local config sources:
3/30 14:56:55    /var/lib/condor/condor_config.local
3/30 14:56:55 DaemonCore: Command Socket at <127.0.0.1:57074>
3/30 14:56:55 Done setting resource limits
3/30 14:56:55 Communicating with shadow <central manager:60732>
3/30 14:56:55 Submitting machine is "x.x.x.x"
3/30 14:56:55 setting the orig job name in starter
3/30 14:56:55 setting the orig job iwd in starter
3/30 14:56:55 File transfer completed successfully.
3/30 14:56:56 warning: unable to chmod condor_exec.exe to ensure execute bit is set: Operation not permitted
3/30 14:56:56 Job 8.18 set to execute immediately
3/30 14:56:56 Starting a VANILLA universe job with ID: 8.18
3/30 14:56:56 IWD: /var/lib/condor/execute/dir_4003
3/30 14:56:56 Input file: /var/lib/condor/execute/dir_4003/dammin-Fib4nter.18.inp 3/30 14:56:56 Output file: /var/lib/condor/execute/dir_4003/dammin-Fib4nter.18.out 3/30 14:56:56 Error file: /var/lib/condor/execute/dir_4003/dammin-Fib4nter.18.err 3/30 14:56:56 About to exec /var/lib/condor/execute/dir_4003/condor_exec.exe
3/30 14:56:56 Create_Process succeeded, pid=4005
3/30 15:16:55 Got SIGQUIT.  Performing fast shutdown.
3/30 15:16:55 ShutdownFast all jobs.
3/30 15:16:55 Process exited, pid=4005, signal=9
3/30 15:16:55 Last process exited, now Starter is exiting
3/30 15:16:55 **** condor_starter (condor_STARTER) pid 4003 EXITING WITH STATUS 0

and here's the StartLog from that machine:

3/30 14:51:55 slot1: State change: claim lease expired (condor_schedd gone?)
3/30 14:51:55 slot1: Changing state and activity: Claimed/Busy -> Preempting/Killing
3/30 14:51:55 slot1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
3/30 14:51:55 Starter pid 3906 exited with status 0
3/30 14:51:55 slot1: State change: starter exited
3/30 14:51:55 slot1: State change: No preempting claim, returning to owner
3/30 14:51:55 slot1: Changing state and activity: Preempting/Killing -> Owner/Idle
3/30 14:51:55 slot1: State change: IS_OWNER is false
3/30 14:51:55 slot1: Changing state: Owner -> Unclaimed

In the ShadowLog on the central manager:

3/30 14:51:55 (8.18) (8272): Job 8.18 is being evicted from slot1@xxxxxxxxxxxx 3/30 14:51:55 (8.18) (8272): SECMAN: command 404 KILL_FRGN_JOB to <execute.host:48432> from TCP port 60686 (blocking). 3/30 14:51:55 (8.18) (8272): SECMAN: using session execute.host:2701:1238419915:42 for {<execute.host:48432>,<404>}.
3/30 14:51:55 (8.18) (8272): SECMAN: startCommand succeeded.
3/30 14:51:55 (8.18) (8272): SECMAN: command 1111 QMGMT_CMD to schedd at <central.manager:47959> from TCP port 60687 (blocking). 3/30 14:51:55 (8.18) (8272): SECMAN: using session central.manager:32568:1238419915:405 for {<central.manager:47959>,<1111>}.
3/30 14:51:55 (8.18) (8272): SECMAN: startCommand succeeded.
3/30 14:51:55 (8.18) (8272): HANDSHAKE: in handshake(my_methods = 'FS,KERBEROS,GSI')
3/30 14:51:55 (8.18) (8272): HANDSHAKE: handshake() - i am the client
3/30 14:51:55 (8.18) (8272): HANDSHAKE: sending (methods == 100) to server
3/30 14:51:55 (8.18) (8272): HANDSHAKE: server replied (method = 4)
3/30 14:51:55 (8.18) (8272): AUTHENTICATE_FS: used dir /tmp/FS_XXX7autYz, status: 1
3/30 14:51:55 (8.18) (8272): Authentication was a Success.
3/30 14:51:55 (8.18) (8272): ZKM: setting default map to (null)
3/30 14:51:55 (8.18) (8272): ZKM: post-map: current user is '(null)'
3/30 14:51:55 (8.18) (8272): ZKM: post-map: current domain is '(null)'
3/30 14:51:55 (8.18) (8272): ZKM: post-map: current FQU is '(null)'
3/30 14:51:55 (8.18) (8272): **** condor_shadow (condor_SHADOW) pid 8272 EXITING WITH STATUS 107

And SchedLog on the central manager:

3/30 14:51:55 (pid:32568) DC_AUTHENTICATE: received DC_AUTHENTICATE from <central.manager:60687> 3/30 14:51:55 (pid:32568) DC_AUTHENTICATE: resuming session id sienna:32568:1238419915:405 with return address <central.manager:60568>:
3/30 14:51:55 (pid:32568) DC_AUTHENTICATE: Success.
3/30 14:51:55 (pid:32568) PERMISSION GRANTED to unauthenticated user from host central.manager for command 1111 (QMGMT_CMD), access level READ: reason: READ authorization policy allows access by anyone 3/30 14:51:55 (pid:32568) Calling authenticate(FS,KERBEROS,GSI) in qmgmt_receivers 3/30 14:51:55 (pid:32568) HANDSHAKE: in handshake(my_methods = 'FS,KERBEROS,GSI')
3/30 14:51:55 (pid:32568) HANDSHAKE: handshake() - i am the server
3/30 14:51:55 (pid:32568) HANDSHAKE: client sent (methods == 100)
3/30 14:51:55 (pid:32568) HANDSHAKE: i picked (method == 4)
3/30 14:51:55 (pid:32568) HANDSHAKE: client received (method == 4)
3/30 14:51:55 (pid:32568) FS: client template is /tmp/FS_XXXXXXXXX
3/30 14:51:55 (pid:32568) FS: client filename is /tmp/FS_XXX7autYz
3/30 14:51:55 (pid:32568) AUTHENTICATE_FS: used dir /tmp/FS_XXX7autYz, status: 1
3/30 14:51:55 (pid:32568) Authentication was a Success.
3/30 14:51:55 (pid:32568) ZKM: setting default map to condor@xxxxxxxxxxxxxxx
3/30 14:51:55 (pid:32568) ZKM: post-map: current user is 'condor'
3/30 14:51:55 (pid:32568) ZKM: post-map: current domain is 'central.manager'
3/30 14:51:55 (pid:32568) ZKM: post-map: current FQU is 'condor@xxxxxxxxxxxxxxx' 3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx from host central.manager for queue management, access level WRITE: reason: cached result for WRITE; see first case for the full reason 3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx from host central.manager for queue management, access level WRITE: reason: cached result for WRITE; see first case for the full reason 3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx from host central.manager for queue management, access level WRITE: reason: cached result for WRITE; see first case for the full reason 3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx from host central.manager for queue management, access level WRITE: reason: cached result for WRITE; see first case for the full reason 3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx from host central.manager for queue management, access level WRITE: reason: cached result for WRITE; see first case for the full reason 3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx from host central.manager for queue management, access level WRITE: reason: cached result for WRITE; see first case for the full reason 3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx from host central.manager for queue management, access level WRITE: reason: cached result for WRITE; see first case for the full reason 3/30 14:51:55 (pid:32568) PERMISSION GRANTED to condor@xxxxxxxxxxxxxxx from host central.manager for queue management, access level WRITE: reason: cached result for WRITE; see first case for the full reason 3/30 14:51:55 (pid:32568) DC_INVALIDATE_KEY: removed key id sienna:32568:1238419915:404. 3/30 14:51:55 (pid:32568) DC_INVALIDATE_KEY: removed key id sienna:32568:1238419915:405. 3/30 14:51:55 (pid:32568) Shadow pid 8272 for job 8.18 exited with status 107 3/30 14:51:55 (pid:32568) SECMAN: command 443 RELEASE_CLAIM to startd at <execute.host:48432> from UDP port 45734 (non-blocking). 3/30 14:51:55 (pid:32568) SECMAN: using session marsantes:2701:1238411954:8 for {<execute.host:48432>,<443>}. 3/30 14:51:55 (pid:32568) SECMAN: UDP, m_have_session == 1, peer_can_negotiate == 1
3/30 14:51:55 (pid:32568) SECMAN: startCommand succeeded.
3/30 14:51:55 (pid:32568) Sent RELEASE_CLAIM to startd at <execute.host:48432> 3/30 14:51:55 (pid:32568) Match record (slot1@xxxxxxxxxxxx <execute.host:48432> for adam@xxxxxxxxxxxxxxx, 8.18) deleted 3/30 14:51:55 (pid:32568) IpVerify::FillHole: open count at level DAEMON for execute.host now 1 3/30 14:51:55 (pid:32568) IpVerify::FillHole: open count at level WRITE for execute.host now 1 3/30 14:51:55 (pid:32568) IpVerify::FillHole: open count at level READ for execute.host now 3 3/30 14:51:55 (pid:32568) IpVerify::FillHole: open count at level READ for execute.host now 2
3/30 14:51:55 (pid:32568) Null parameter --- match not deleted