Re: [Condor-users] Strange condor eviction



On 1/24/06, Steven Timm <timm@xxxxxxxx> wrote:
>
> I have a user who is reporting that his job is getting evicted
> after running for several hours.  This is puzzling since I have
> PREEMPT and PREEMPTION_REQUIREMENTS set to FALSE across my cluster,
> e.g. if a user starts they should get to finish.  The user reports
> (and I confirm) that this same job has been evicted a couple times
> in a row now.  Nevertheless it restarted on another node after
> the eviction and is still running now.  It doesn't appear to be
> a case of memory.  This is condor 6.7.13
>
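For reference, a no-preemption policy like the one described above
would normally look something like this in condor_config - a sketch of
typical settings, not necessarily this cluster's actual config:

    # Startd policy: never kick a running job off this machine.
    PREEMPT = FALSE

    # Negotiator / central-manager policy: never preempt one user's
    # claim to hand the slot to another user.
    PREEMPTION_REQUIREMENTS = FALSE

With those in place the startd should only leave Claimed/Busy when the
job exits - or, as in the log below, when the claim itself dies.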
> Here's the log of the job:
>
> [root@fnpcsrv1 condor_log]# more lia_fd_05_11_1_4.log
> 000 (53837.000.000) 01/23 10:49:58 Job submitted from host:
> <131.225.167.42:39082>
> ...
> 001 (53837.000.000) 01/23 10:50:03 Job executing on host:
> <131.225.167.170:32772>
> ...
> 006 (53837.000.000) 01/23 10:50:11 Image size of job updated: 6360
> ...
> 006 (53837.000.000) 01/23 11:10:11 Image size of job updated: 154944
> ...
> 004 (53837.000.000) 01/23 17:22:45 Job was evicted.
>         (0) Job was not checkpointed.
>                 Usr 0 04:52:00, Sys 0 00:07:16  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>         0  -  Run Bytes Sent By Job
>         0  -  Run Bytes Received By Job
> --------------------------------------
>
> From the node in question, here's the appropriate section of the StartLog:
>
> 1/23 10:49:59 DaemonCore: Command received via UDP from host
> <131.225.167.42:19177>
> 1/23 10:49:59 DaemonCore: received command 440 (MATCH_INFO), calling
> handler (command_match_info)
> 1/23 10:49:59 vm1: match_info called
> 1/23 10:49:59 vm1: Received match <131.225.167.170:32772>#1137606512#551
> 1/23 10:49:59 vm1: State change: match notification protocol successful
> 1/23 10:49:59 vm1: Changing state: Unclaimed -> Matched
> 1/23 10:49:59 DaemonCore: Command received via TCP from host
> <131.225.167.42:17717>
> 1/23 10:49:59 DaemonCore: received command 442 (REQUEST_CLAIM), calling
> handler (command_request_claim)
> 1/23 10:49:59 vm1: Request accepted.
> 1/23 10:49:59 vm1: Remote owner is rubin@xxxxxxxx
> 1/23 10:49:59 vm1: State change: claiming protocol successful
> 1/23 10:49:59 vm1: Changing state: Matched -> Claimed
> 1/23 10:50:03 DaemonCore: Command received via TCP from host
> <131.225.167.42:17723>
> 1/23 10:50:03 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
> handler (command_activate_claim)
> 1/23 10:50:03 vm1: Got activate_claim request from shadow(<131.225.167.42:17723>)
> 1/23 10:50:03 vm1: Remote job ID is 53837.0
> 1/23 10:50:03 vm1: Got universe "VANILLA" (5) from request classad
> 1/23 10:50:03 vm1: State change: claim-activation protocol successful
> 1/23 10:50:03 vm1: Changing activity: Idle -> Busy
<clip>
> 1/23 17:22:44 vm1: State change: claim lease expired (condor_schedd gone?)
> 1/23 17:22:44 vm1: Changing state and activity: Claimed/Busy -> Preempting/Killing
> 1/23 17:22:44 vm2: State change: claim lease expired (condor_schedd gone?)
> 1/23 17:22:44 vm2: Changing state and activity: Claimed/Idle -> Preempting/Killing

<snip>

The execute machine seemed unable to talk to the submit machine. Did
the user reboot their machine? Is there a stateful firewall between
them that is timing out their connection?
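For background: the schedd keeps each claim alive by sending periodic
keepalives to the startd, and the startd gives up on the claim (the
"claim lease expired" message above) once enough of them go missing in
a row. The relevant knobs look roughly like this - a sketch using what
I believe are the stock defaults, so check the manual for 6.7.13:

    # Submit side: how often (in seconds) the schedd sends a
    # keepalive for each claim it holds.
    ALIVE_INTERVAL = 300

    # Execute side: how many consecutive keepalives the startd
    # will tolerate missing before declaring the claim dead.
    MAX_CLAIM_ALIVES_MISSED = 6

With those numbers, roughly 30 minutes of lost keepalives (300 * 6
seconds) is enough to trigger exactly this eviction, even though the
job itself was running happily.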

It happened to both VMs at the same time, but I'd guess that is down
to two jobs from the same user starting at the same time rather than
a network glitch - still worth checking, though.

> Any ideas what might be going on?

There could be all sorts of reasons for the loss of connectivity - you
could try searching the list archives for "condor_schedd gone".

You could try enabling the D_PROTOCOL debug level to see what is going on...
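A sketch of how that might look on the execute node - STARTD_DEBUG
controls what the startd writes to its StartLog, and adding
D_FULLDEBUG alongside it is my suggestion here, not something the log
above requires:

    # Add protocol-level tracing (claim, keepalive and command
    # traffic between the startd and the schedd) to the StartLog.
    STARTD_DEBUG = D_FULLDEBUG D_PROTOCOL

Run condor_reconfig on the node afterwards so the startd picks the
setting up, then watch the StartLog around the next eviction.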

Matt