
Re: [Condor-users] Strange condor eviction



On Tue, 24 Jan 2006, Matt Hope wrote:

On 1/24/06, Steven Timm <timm@xxxxxxxx> wrote:

I have a user who is reporting that his job is getting evicted
after running for several hours.  This is puzzling since I have
PREEMPT and PREEMPTION_REQUIREMENTS set to FALSE across my cluster,
i.e., once a job starts it should be allowed to run to completion.
The user reports (and I confirm) that this same job has now been
evicted a couple of times in a row.  Nevertheless it restarted on
another node after the eviction and is still running now.  It doesn't
appear to be a memory issue.  This is Condor 6.7.13.
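For reference, a no-preemption policy of the sort described here would
look roughly like the following in condor_config (a sketch of the two
knobs named above, not the actual config from this cluster):

    ## startd policy: never kick a job off once it has started running
    PREEMPT = FALSE

    ## negotiator policy: never preempt a running claim in favor of a
    ## higher-priority user
    PREEMPTION_REQUIREMENTS = FALSE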

Here's the log of the job:

[root@fnpcsrv1 condor_log]# more lia_fd_05_11_1_4.log
000 (53837.000.000) 01/23 10:49:58 Job submitted from host: <131.225.167.42:39082>
...
001 (53837.000.000) 01/23 10:50:03 Job executing on host: <131.225.167.170:32772>
...
006 (53837.000.000) 01/23 10:50:11 Image size of job updated: 6360
...
006 (53837.000.000) 01/23 11:10:11 Image size of job updated: 154944
...
004 (53837.000.000) 01/23 17:22:45 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 04:52:00, Sys 0 00:07:16  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
--------------------------------------

From the node in question, here's the relevant section of the StartLog:

1/23 10:49:59 DaemonCore: Command received via UDP from host <131.225.167.42:19177>
1/23 10:49:59 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
1/23 10:49:59 vm1: match_info called
1/23 10:49:59 vm1: Received match <131.225.167.170:32772>#1137606512#551
1/23 10:49:59 vm1: State change: match notification protocol successful
1/23 10:49:59 vm1: Changing state: Unclaimed -> Matched
1/23 10:49:59 DaemonCore: Command received via TCP from host <131.225.167.42:17717>
1/23 10:49:59 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
1/23 10:49:59 vm1: Request accepted.
1/23 10:49:59 vm1: Remote owner is rubin@xxxxxxxx
1/23 10:49:59 vm1: State change: claiming protocol successful
1/23 10:49:59 vm1: Changing state: Matched -> Claimed
1/23 10:50:03 DaemonCore: Command received via TCP from host <131.225.167.42:17723>
1/23 10:50:03 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
1/23 10:50:03 vm1: Got activate_claim request from shadow(<131.225.167.42:17723>)
1/23 10:50:03 vm1: Remote job ID is 53837.0
1/23 10:50:03 vm1: Got universe "VANILLA" (5) from request classad
1/23 10:50:03 vm1: State change: claim-activation protocol successful
1/23 10:50:03 vm1: Changing activity: Idle -> Busy
<clip>
1/23 17:22:44 vm1: State change: claim lease expired (condor_schedd gone?)
1/23 17:22:44 vm1: Changing state and activity: Claimed/Busy -> Preempting/Killing
1/23 17:22:44 vm2: State change: claim lease expired (condor_schedd gone?)
1/23 17:22:44 vm2: Changing state and activity: Claimed/Idle -> Preempting/Killing

<snip>

The execute machine seemed unable to talk to the submit machine.  Did
the user reboot their machine?  Is there a stateful firewall between
them that is timing out their connection?

It happened to both vms at the same time, but I would guess that is
because two jobs from the same user started at the same time rather
than a network glitch.  Worth checking, though.
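
For context, the "claim lease expired (condor_schedd gone?)" message is
the startd concluding that the schedd has stopped sending the periodic
keep-alives it expects for an active claim, so it gives the claim up and
kills the job.  If memory serves, the relevant knobs are roughly these
(a sketch with what I believe are the defaults; worth verifying against
the 6.7 manual):

    ## schedd sends a keep-alive for each claim this often (seconds)
    ALIVE_INTERVAL = 300

    ## startd releases the claim after missing this many keep-alives,
    ## i.e. roughly 30 minutes of silence with the defaults
    MAX_CLAIM_ALIVES_MISSED = 6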

On closer examination I see in my MasterLog the following:

1/23 17:47:40 The SCHEDD (pid 30331) was killed because it was no longer responding

This corresponds to just after the job was evicted.  So it could
be that the schedd was in some weird state.  The ps output shows that
the schedd was indeed restarted at that time last night.

What would make a schedd get so confused that it would
have to get restarted, and how does condor detect that the schedd is confused?
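
My guess on the detection side is that every daemon the condor_master
spawns is supposed to send it periodic "alive" messages through
DaemonCore; if the master hears nothing for long enough, it logs a line
like the one above, kills the daemon, and restarts it.  If I remember
right the timeout is configurable along these lines (knob names and
defaults from memory, so treat as an assumption and check the manual):

    ## seconds of silence before the master declares a daemon hung
    NOT_RESPONDING_TIMEOUT = 3600

    ## per-daemon override, e.g. just for the schedd
    SCHEDD_NOT_RESPONDING_TIMEOUT = 3600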

Steve




Any ideas what might be going on?

There could be all sorts of reasons for the loss of connectivity - you
could try searching the archive for "condor_schedd gone"

You could try enabling the D_PROTOCOL debug level to see what is going on...
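
Concretely, that would mean adding something like the following to the
config on the submit and execute machines and pushing it out with
condor_reconfig (a sketch; D_PROTOCOL and D_COMMAND are additive debug
flags):

    ## more verbose logging on both ends of the claim
    SCHEDD_DEBUG = D_PROTOCOL D_COMMAND
    STARTD_DEBUG = D_PROTOCOL D_COMMAND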

Matt

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users


--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team