
Re: [HTCondor-users] out-of-memory event?



On Thu, Oct 12, 2017 at 10:33 AM, Michael Di Domenico
<mdidomenico4@xxxxxxxxx> wrote:
> found further evidence in the starterlog
>
> "job was held due to OOM event: job has encountered an out-of-memory event"
>
> however, when i look through the system logs, the OOM killer doesn't
> seem to have killed anything.

turns out this might have been a bit of a red herring.  after several
days i finally tracked down that the jobs were failing on only a few
specific hosts and at exactly the same time every day.  there is a
cronjob on those machines that does 'systemctl restart gdm.service'.
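
for anyone chasing something similar, here is a rough sketch of the kind of tally that makes this pattern stand out.  it assumes you have already pulled (host, time) pairs for the held jobs out of your own logs; the host names and timestamps below are made up for illustration, and none of this is a condor tool or api.

```python
from collections import Counter
from datetime import datetime


def cluster_failures(events):
    """Count failures per (host, HH:MM) so recurring daily times stand out."""
    counts = Counter()
    for host, ts in events:
        # Drop the date and keep only the time of day, so the same
        # minute on different days lands in the same bucket.
        counts[(host, ts.strftime("%H:%M"))] += 1
    return counts


# Hypothetical sample data: (host, time the job went on hold).
events = [
    ("node07", datetime(2017, 10, 9, 3, 0)),
    ("node07", datetime(2017, 10, 10, 3, 0)),
    ("node12", datetime(2017, 10, 10, 3, 0)),
    ("node03", datetime(2017, 10, 10, 14, 22)),
]

# Most frequent (host, time-of-day) buckets first.
for (host, hhmm), n in cluster_failures(events).most_common():
    print(host, hhmm, n)
```

a bucket that keeps showing up on the same hosts at the same minute is a hint to go look at cron or timers on those machines.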

it's not clear exactly why restarting gdm kills off the jobs, and it's
also not clear why condor thinks this is an out-of-memory event.  my
own supposition is that condor assumes that if a job is killed by
someone other than itself, it must have been OOM.  but i don't know
the code, so i'm likely wrong.