
Re: [HTCondor-users] incomplete output files

to follow up with more detail, it looks like this is related to our
use of hard cgroup memory limits

if i set request_memory=1023 the job will fail with incomplete writes
if i set request_memory=1025 the job will succeed
(the test program below allocates exactly 1024mb, so the cutoff lines up)
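for reference, a minimal submit description along these lines reproduces
the two cases (the executable name is made up; request_memory is in mb):

```
# hypothetical submit file for the 1gb write test
executable     = writefile
arguments      = $(Process)
request_memory = 1023   # fails with a truncated output file; 1025 succeeds
queue 30
```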

so the question is why am i not getting notified that the job has
exceeded its memory limit?  we do see condor put jobs on hold for
exceeding memory from time to time, but not with these

i'm still reading through the condor manual on cgroup memory controls.
i definitely don't want to lift the memory limits on users entirely, bad
things happen.  but it's not clear if there's a way to enable both soft
and hard limits in condor
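for reference, the knob i've found so far is CGROUP_MEMORY_LIMIT_POLICY
in the startd config; if i'm reading the manual right it takes hard,
soft, or none, e.g.:

```
# condor_config sketch: use a soft cgroup limit instead of a hard one,
# so jobs can exceed request_memory while the machine has free memory
CGROUP_MEMORY_LIMIT_POLICY = soft
```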

On Tue, Nov 17, 2020 at 2:18 PM Michael Di Domenico
<mdidomenico4@xxxxxxxxx> wrote:
> has anyone seen issues where jobs running in condor fail to completely
> write out their output?
> take this piece of code
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     size_t nbytes = (size_t)256*128*256*128;   /* 1gb */
>     unsigned char *c2;
>     char fname[256];
>     FILE *cf;
>     snprintf(fname, sizeof(fname), "/blah/blah.%s", argv[1]);
>     c2 = malloc(nbytes);
>     cf = fopen(fname, "w");
>     fwrite(c2, nbytes, 1, cf);
>     fflush(cf);
>     fclose(cf);
>     return 0;
> }
> if i run this with queue 30 under condor, some percentage of the files
> will not write out their full 1GB of data
> if i run the exact same program outside of condor (on my workstation,
> under slurm, etc.) it works fine
> we're running condor 8.8.6 and have cgroup controls turned on (but
> it's only limiting memory).  otherwise we're not doing anything
> special