Re: [HTCondor-users] out-of-memory event?

On Thu, Oct 12, 2017 at 10:33 AM, Michael Di Domenico
<mdidomenico4@xxxxxxxxx> wrote:
> found further evidence in the starterlog
> "job was held due to OOM event: job has encountered an out-of-memory event"
> however, when i look through the system logs, the OOM killer doesn't
> seem to have killed anything.

turns out this might have been a bit of a red herring.  after several
days i finally tracked down that the jobs were failing on only a few
specific hosts, and at exactly the same time every day.  there
is a cronjob on those machines that does 'systemctl restart

it's not clear exactly why restarting gdm kills off the jobs, and it's
also not clear why condor thinks this is an out-of-memory event.  my
own supposition is that condor assumes that if a job is killed by someone
other than itself, it must have been OOM.  but i don't know the code, so
i'm likely wrong.
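
for anyone chasing something similar, here is a minimal sketch of the kind of check that exposes a "same hosts, same time every day" pattern. it is not tied to any condor log format: it assumes you have already pulled (host, timestamp) pairs for the held jobs out of your own logs, and the function name, host names, and sample data are all made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def recurring_failures(events, min_days=2):
    """Group failures by (host, HH:MM) and keep slots seen on >= min_days distinct days.

    `events` is an iterable of (host, datetime) pairs. How you extract
    them from your logs is up to you; nothing here parses condor output.
    """
    days_by_slot = defaultdict(set)
    for host, ts in events:
        # Same host at the same minute of the day, regardless of date.
        days_by_slot[(host, ts.strftime("%H:%M"))].add(ts.date())
    return {
        slot: sorted(days)
        for slot, days in days_by_slot.items()
        if len(days) >= min_days
    }


# Hypothetical sample data, not taken from any real cluster.
events = [
    ("node01", datetime(2017, 10, 9, 3, 0)),
    ("node01", datetime(2017, 10, 10, 3, 0)),
    ("node01", datetime(2017, 10, 11, 3, 0)),
    ("node02", datetime(2017, 10, 10, 14, 37)),
]

for (host, slot), days in sorted(recurring_failures(events).items()):
    print(f"{host} fails at {slot} on {len(days)} different days")
```

a host/time pair that shows up on several different days is a good hint to go look at cron or timers on that host rather than at the job itself.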