Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] out-of-memory event?

Date: Thu, 26 Oct 2017 09:10:04 -0400
From: Michael Di Domenico <mdidomenico4@xxxxxxxxx>
Subject: Re: [HTCondor-users] out-of-memory event?

On Thu, Oct 12, 2017 at 10:33 AM, Michael Di Domenico
<mdidomenico4@xxxxxxxxx> wrote:
> found further evidence in the starterlog
>
> "job was held due to OOM event: job has encountered an out-of-memory event"
>
> however, when i look through the system logs, the OOM killer doesn't
> seem to have killed anything.

turns out this might have been a bit of a red herring.  after several
days i finally tracked down that the jobs were failing on only a few
specific hosts, and at exactly the same time every day.  there is a
cron job on those machines that runs 'systemctl restart gdm.service'.
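
for anyone chasing a similar pattern, here is a rough sketch of the kind of check that would surface it: group failure events by host and time of day, then keep the slots that repeat on more than one day.  everything in it is an assumption for illustration: the function name, the (host, timestamp) input format and the sample data are made up, and it does not read any real condor log format.

```python
from datetime import datetime


def recurring_failures(events, min_days=2):
    """Map (host, 'HH:MM') to the number of distinct days on which a
    failure was seen in that slot, keeping slots seen on >= min_days days.

    `events` is a hypothetical list of (host, 'YYYY-MM-DD HH:MM:SS') pairs
    that you would have to extract from your own logs.
    """
    days_seen = {}
    for host, stamp in events:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        key = (host, when.strftime("%H:%M"))  # bucket by minute of day
        days_seen.setdefault(key, set()).add(when.date())
    return {key: len(days) for key, days in days_seen.items()
            if len(days) >= min_days}


# made-up sample data, not taken from the real cluster
events = [
    ("node01", "2017-10-23 03:00:12"),
    ("node01", "2017-10-24 03:00:09"),
    ("node01", "2017-10-25 03:00:15"),
    ("node02", "2017-10-24 14:37:50"),
]

print(recurring_failures(events))
```

a slot that shows up on several days for the same host is a hint to look at cron jobs or timers on that host rather than at the job itself.  note that bucketing by minute will miss events that straddle a minute boundary, so a wider window may be needed in practice.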

it's not clear exactly why restarting gdm kills off the jobs, and it's
also not clear why condor reports this as an out-of-memory event.  my
own supposition is that condor assumes that if a job is killed by
something other than itself, it must have been an OOM kill.  but i don't
know the code, so i'm likely wrong.
