
Re: [HTCondor-users] incomplete output files



You don't say which version of HTCondor you are using. However, we have
improved cgroup memory management in the recent 8.9.9 release.
Previously, you could choose to set either the hard or the soft limit.
Now, HTCondor will always set both the hard and soft limit in cgroups.
Here is a quick summary:

When CGROUP_MEMORY_LIMIT_POLICY is soft, the soft limit is set to the
slot size, and the hard limit is set to the TotalMemory of the whole
startd. When CGROUP_MEMORY_LIMIT_POLICY is hard, the hard limit is set
to the slot size, and the soft limit is set 90% lower.
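
As a rough illustration of the summary above, a startd configuration that
selects the hard policy might look like the fragment below. This is only a
sketch: the file name in the comment and the cgroup name "htcondor" are
assumptions, so check them against your own installation and the manual
for your version.

```
# Hypothetical local config file, e.g. a file under /etc/condor/config.d/
# (the exact location is an assumption and varies by install).

# Name of the cgroup HTCondor places jobs under; "htcondor" is an assumed value.
BASE_CGROUP = htcondor

# "hard": hard limit = slot size, soft limit set lower (per the 8.9.9 notes).
# "soft": soft limit = slot size, hard limit = TotalMemory of the startd.
CGROUP_MEMORY_LIMIT_POLICY = hard
```

A changed policy would normally only apply to jobs started after the startd
has picked up the new configuration.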

https://htcondor.readthedocs.io/en/latest/version-history/development-release-series-89.html#version-8-9-9

I hope you find this helpful.

...Tim

On 11/17/20 1:53 PM, mdidomenico4@xxxxxxxxx wrote:
> to follow up with more detail, it looks like this is related to our
> usage of cgroup memory limit hard
>
> if i set the request_memory=1023 the job will fail with incomplete writes
> if i set the request_memory=1025 the job will succeed
>
> so the question is why am i not getting notified that the job has
> exceeded its memory limit?  we do see condor put jobs on hold for
> exceeding memory from time to time, but not with these
>
> i'm still reading through the condor manual on cgroup memory controls.
> i definitely don't want to unrestrict users from memory limits, bad
> things happen.  but it's not clear if there's a way to enable soft and
> hard limits in condor
>
> On Tue, Nov 17, 2020 at 2:18 PM Michael Di Domenico
> <mdidomenico4@xxxxxxxxx> wrote:
>> has anyone seen issues where jobs running in condor fail to completely
>> write out their output?
>>
>> take this piece of code
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> int main(int argc, char **argv)
>> {
>> unsigned char *c2;
>> char fname[256];
>> FILE *cf;
>> sprintf(fname, "/blah/blah.%s", argv[1]);
>> c2 = malloc((size_t) 256*128*256*128);
>> cf = fopen(fname, "w");
>> fwrite(c2,256*128*256*128,1,cf);
>> fflush(cf);
>> fclose(cf);
>> return(0);
>> }
>>
>> if i run this with queue 30 under condor, some percentage of the files
>> will not write out their 1GB of data
>>
>> if i run the exact same program outside of condor, using my
>> workstation, slurm, etc, it works fine.
>>
>> we're running condor 8.8.6 and have cgroup controls turned on (but
>> it's only limiting memory).  otherwise we're not doing anything
>> special
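
The test program quoted above ignores the return values of malloc, fopen,
fwrite and fclose, so a failed allocation or a short write would pass
silently. A minimal sketch of a checked variant follows. The helper name
write_block is hypothetical and not from the thread, and it uses calloc
rather than the original malloc so the buffer contents are defined. None of
these checks can report anything if the process is killed outright, for
example by the kernel OOM killer.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the thread): write nbytes of zeroed data
 * to path, checking every call. Returns 0 on success, -1 on any failure. */
int write_block(const char *path, size_t nbytes)
{
    /* calloc is a deliberate swap for the original malloc. */
    unsigned char *buf = calloc(1, nbytes);
    if (buf == NULL) {
        fprintf(stderr, "calloc(%zu) failed: %s\n", nbytes, strerror(errno));
        return -1;
    }

    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        fprintf(stderr, "fopen(%s) failed: %s\n", path, strerror(errno));
        free(buf);
        return -1;
    }

    int rc = 0;

    /* With an element size of 1, fwrite returns the number of bytes written. */
    size_t written = fwrite(buf, 1, nbytes, fp);
    if (written != nbytes) {
        fprintf(stderr, "short write: %zu of %zu bytes\n", written, nbytes);
        rc = -1;
    }

    /* fclose flushes buffered data, so a late write error can surface here. */
    if (fclose(fp) != 0) {
        fprintf(stderr, "fclose failed: %s\n", strerror(errno));
        rc = -1;
    }

    free(buf);
    return rc;
}
```

Calling write_block(fname, (size_t)1 << 30) from main and returning a
non-zero exit code on failure would match the 1 GiB (256*128*256*128 bytes)
written by the original program.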

-- 
Tim Theisen
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736