[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] Strange condor eviction




I have a user who is reporting that his job is getting evicted
after running for several hours.  This is puzzling since I have
PREEMPT and PREEMPTION_REQUIREMENTS set to FALSE across my cluster,
e.g. if a user starts they should get to finish.  The user reports
(and I confirm) that this same job has been evicted a couple times
in a row now.  Nevertheless it restarted on another node after
the eviction and is still running now.  It doesn't appear to be
a case of memory.  This is condor 6.7.13

Here's the log of the job:

[root@fnpcsrv1 condor_log]# more lia_fd_05_11_1_4.log
000 (53837.000.000) 01/23 10:49:58 Job submitted from host: <131.225.167.42:3908
2>
...
001 (53837.000.000) 01/23 10:50:03 Job executing on host: <131.225.167.170:32772

...
006 (53837.000.000) 01/23 10:50:11 Image size of job updated: 6360
...
006 (53837.000.000) 01/23 11:10:11 Image size of job updated: 154944
...
004 (53837.000.000) 01/23 17:22:45 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 04:52:00, Sys 0 00:07:16  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
       0  -  Run Bytes Received By Job
...
001 (53837.000.000) 01/23 18:11:30 Job executing on host: <131.225.167.156:32772

...
006 (53837.000.000) 01/23 18:31:40 Image size of job updated: 150152
...
006 (53837.000.000) 01/23 22:51:39 Image size of job updated: 150552
...
006 (53837.000.000) 01/23 23:31:39 Image size of job updated: 153320
...
006 (53837.000.000) 01/24 03:51:41 Image size of job updated: 155168
...
006 (53837.000.000) 01/24 05:11:39 Image size of job updated: 158732


As you can see, after the job was evicted, it restarted on another
node and seems to be going fine now.

--------------------------------------

From the node in question, here's the appropriate section of the StartLog

1/23 10:49:59 DaemonCore: Command received via UDP from host <131.225.167.42:191
77>
1/23 10:49:59 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
1/23 10:49:59 vm1: match_info called
1/23 10:49:59 vm1: Received match <131.225.167.170:32772>#1137606512#551
1/23 10:49:59 vm1: State change: match notification protocol successful
1/23 10:49:59 vm1: Changing state: Unclaimed -> Matched
1/23 10:49:59 DaemonCore: Command received via TCP from host <131.225.167.42:177
17>
1/23 10:49:59 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler
(command_request_claim)
1/23 10:49:59 vm1: Request accepted.
1/23 10:49:59 vm1: Remote owner is rubin@xxxxxxxx
1/23 10:49:59 vm1: State change: claiming protocol successful
1/23 10:49:59 vm1: Changing state: Matched -> Claimed
1/23 10:50:03 DaemonCore: Command received via TCP from host <131.225.167.42:177
23>
1/23 10:50:03 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler
 (command_activate_claim)
1/23 10:50:03 vm1: Got activate_claim request from shadow (<131.225.167.42:17723
)
1/23 10:50:03 vm1: Remote job ID is 53837.0
1/23 10:50:03 vm1: Got universe "VANILLA" (5) from request classad
1/23 10:50:03 vm1: State change: claim-activation protocol successful
1/23 10:50:03 vm1: Changing activity: Idle -> Busy
<clip>
1/23 17:22:44 vm1: State change: claim lease expired (condor_schedd gone?)
1/23 17:22:44 vm1: Changing state and activity: Claimed/Busy -> Preempting/Killi
ng
1/23 17:22:44 vm2: State change: claim lease expired (condor_schedd gone?)
1/23 17:22:44 vm2: Changing state and activity: Claimed/Idle -> Preempting/Killi
ng
1/23 17:22:44 vm2: State change: No preempting claim, returning to owner
1/23 17:22:44 vm2: Changing state and activity: Preempting/Killing -> Owner/Idle
1/23 17:22:44 vm2: State change: IS_OWNER is false
1/23 17:22:44 vm2: Changing state: Owner -> Unclaimed
1/23 17:22:45 DaemonCore: Command received via TCP from host <131.225.167.42:384
24>
1/23 17:22:45 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), call
ing handler (command_handler)
1/23 17:22:45 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
1/23 17:22:45 Starter pid 18109 exited with status 0
1/23 17:22:45 vm1: State change: starter exited
1/23 17:22:45 vm1: State change: No preempting claim, returning to owner
1/23 17:22:45 vm1: Changing state and activity: Preempting/Killing -> Owner/Idle
1/23 17:22:45 vm1: State change: IS_OWNER is false
1/23 17:22:45 vm1: Changing state: Owner -> Unclaimed
1/23 17:26:51 State change: RunBenchmarks is TRUE
1/23 17:26:51 vm2: Changing activity: Idle -> Benchmarking
1/23 17:26:55 State change: benchmarks completed
1/23 17:26:55 vm2: Changing activity: Benchmarking -> Idle
1/23 17:26:55 State change: RunBenchmarks is TRUE
1/23 17:26:55 vm1: Changing activity: Idle -> Benchmarking
1/23 17:27:00 State change: benchmarks completed
1/23 17:27:00 vm1: Changing activity: Benchmarking -> Idle

At 17:22 the job was evicted from vm1.
---------------------------------------
Here's the log from StarterLog.vm1

1/23 10:50:03 ******************************************************
1/23 10:50:03 ** condor_starter (CONDOR_STARTER) STARTING UP
1/23 10:50:03 ** /opt/condor-6.7.13/sbin/condor_starter
1/23 10:50:03 ** $CondorVersion: 6.7.13 Nov  7 2005 $
1/23 10:50:03 ** $CondorPlatform: I386-LINUX_RH9 $
1/23 10:50:03 ** PID = 18109
1/23 10:50:03 ******************************************************
1/23 10:50:03 Using config file: /etc/condor/condor_config
1/23 10:50:03 Using local config files: /opt/condor-6.7.13/etc/group_params.conf
ig /opt/condor-6.7.13/local/condor_config.local
1/23 10:50:03 DaemonCore: Command Socket at <131.225.167.170:37303>
1/23 10:50:03 Done setting resource limits
1/23 10:50:03 Communicating with shadow <131.225.167.42:17722>
1/23 10:50:03 Submitting machine is "fngp-osg.fnal.gov"
1/23 10:50:03 Starting a VANILLA universe job with ID: 53837.0
1/23 10:50:03 IWD: /home/minfarm/condor_submit
1/23 10:50:03 Output file: /home/minfarm/condor_log/lia_fd_05_11_1_4.1205.1.out 1/23 10:50:03 Error file: /home/minfarm/condor_log/lia_fd_05_11_1_4.1205.1.err 1/23 10:50:03 About to exec /home/rubin/condor/ana_li 1205 1 R1.11 fd_05_11_1_4
1/23 10:50:03 Create_Process succeeded, pid=18118
1/23 17:22:44 Got SIGQUIT.  Performing fast shutdown.
1/23 17:22:44 ShutdownFast all jobs.
1/23 17:22:44 Process exited, pid=18118, signal=9
1/23 17:22:45 Last process exited, now Starter is exiting
1/23 17:22:45 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

--------------------------------------------------------------
Nothing in the system logs indicates any problems at this time.
The SchedLog has unfortunately already rotated past this time frame.


Any ideas what might be going on?

Steve Timm



--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team