Re: [HTCondor-users] out-of-memory event?
- Date: Thu, 26 Oct 2017 09:10:04 -0400
- From: Michael Di Domenico <mdidomenico4@xxxxxxxxx>
- Subject: Re: [HTCondor-users] out-of-memory event?
On Thu, Oct 12, 2017 at 10:33 AM, Michael Di Domenico wrote:
> found further evidence in the starterlog
> "job was held due to OOM event: job has encountered an out-of-memory event"
> however, when i look through the system logs, the OOM killer doesn't
> seem to have killed anything.
turns out this might have been a bit of a red herring. after several
days i finally tracked down that the jobs were failing on only a few
specific hosts and at exactly the same time every day. turns out there
is a cronjob on those machines that does 'systemctl restart
it's not clear exactly why restarting gdm kills off the jobs. it's
also not clear why condor thinks this is an out-of-memory event. my
own supposition is that condor assumes if a job is killed by someone
other than itself it must have been OOM. but i don't know the code, so
i'm likely wrong.
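
for anyone chasing something similar: the pattern described above (only a few hosts, same time every day) shows up quickly if you bucket the hold events by host and minute of day. here is a minimal sketch, assuming you have already pulled (host, timestamp) pairs out of the starter logs yourself. the function name, the timestamp format, and the sample data are all made up for illustration and are not anything condor provides.

```python
from collections import Counter
from datetime import datetime


def recurring_failures(events, min_days=2):
    """Return {(host, "HH:MM"): day_count} for every host/minute slot
    that saw a failure on at least `min_days` distinct days."""
    seen = set()
    for host, stamp in events:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        # Keep one entry per host, minute of day, and calendar day so
        # several failures on the same day count only once.
        seen.add((host, t.strftime("%H:%M"), t.date()))
    counts = Counter((host, hhmm) for host, hhmm, _day in seen)
    return {slot: n for slot, n in counts.items() if n >= min_days}


# hypothetical sample data, not taken from real logs
events = [
    ("node07", "2017-10-10 03:00:12"),
    ("node07", "2017-10-11 03:00:09"),
    ("node07", "2017-10-12 03:00:15"),
    ("node12", "2017-10-11 14:22:40"),
]

print(recurring_failures(events))
```

a host that fails at the same minute on several days is a good hint to go look at that machine's crontab before blaming memory.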