
Re: [HTCondor-users] cgroups v2 not working



On 5/5/23 11:18, Mira Kuntz wrote:

Hello,
I am using HTCondor 10.4.0 on Rocky Linux 9.1 with a shared filesystem (NFS).
Today I switched to cgroups v1 because, as far as I know, cgroups v2 does not provide statistics about actual memory usage, which I need for my use case.


Hi Mira:

I'm sorry that HTCondor doesn't seem to be getting you the memory information you need. There are some fixes in HTCondor 10.5; perhaps when that comes out, v2 will work for you.

In the meantime, I'm curious about this error.  I think the problem is on the starter side -- can you email me (off list) the StarterLog for this job?


-greg



After the switch to cgroups v1, my jobs no longer run.
I found the following error in the ShadowLog:

05/05/23 16:03:13 ******************************************************
05/05/23 16:03:13 ** condor_shadow (CONDOR_SHADOW) STARTING UP
05/05/23 16:03:13 ** /usr/sbin/condor_shadow
05/05/23 16:03:13 ** SubsystemInfo: name=SHADOW type=SHADOW(5) class=DAEMON(1)
05/05/23 16:03:13 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
05/05/23 16:03:13 ** $CondorVersion: 10.4.0 2023-04-06 BuildID: 638308 PackageID: 10.4.0-1 $
05/05/23 16:03:13 ** $CondorPlatform: x86_64_AlmaLinux9 $
05/05/23 16:03:13 ** PID = 8028
05/05/23 16:03:13 ** Log last touched 5/5 16:03:13
05/05/23 16:03:13 ******************************************************
05/05/23 16:03:13 Using config source: /etc/condor/condor_config
05/05/23 16:03:13 Using local config sources:
05/05/23 16:03:13    /etc/condor/config.d/01-central-manager.config
05/05/23 16:03:13    /etc/condor/config.d/10-stash-plugin.conf
05/05/23 16:03:13    /etc/condor/condor_config.local
05/05/23 16:03:13 config Macros = 73, Sorted = 73, StringBytes = 2133, TablesBytes = 1232
05/05/23 16:03:13 CLASSAD_CACHING is OFF
05/05/23 16:03:13 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
05/05/23 16:03:13 SharedPortEndpoint: waiting for connections to named socket shadow_7616_e92f_20
05/05/23 16:03:13 DaemonCore: command socket at <xxx.xx.xx.xx:xxxx?addrs=xxx.xx.xx.xx:xxxx&alias=vgcn-mira-central-manager.pulsar.novalocal&noUDP&sock=shadow_7616_e92f_20>
05/05/23 16:03:13 DaemonCore: private command socket at <xxx.xx.xx.xx:xxxx?addrs=xxx.xx.xx.xx:xxxx&alias=vgcn-mira-central-manager.pulsar.novalocal&noUDP&sock=shadow_7616_e92f_20>
05/05/23 16:03:13 Initializing a VANILLA shadow for job 8.0
05/05/23 16:03:13 (8.0) (8028): LIMIT_DIRECTORY_ACCESS = <unset>
05/05/23 16:03:13 (8.0) (8028): Request to run on slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <192.168.199.168:9618?addrs=192.168.199.168-9618&alias=vgcn-mira-exec-node-1.pulsar.novalocal&noUDP&sock=startd_6688_fd7d> was ACCEPTED
05/05/23 16:03:13 (7.0) (8027): ERROR "Error from slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Assertion ERROR on (false)" at line 585 in file /var/lib/condor/execute/slot1/dir_3409276/userdir/.tmpiiiaCw/BUILD/condor-10.4.0/src/condor_shadow.V6.1/pseudo_ops.cpp
05/05/23 16:03:13 (8.0) (8028): ERROR "Error from slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Assertion ERROR on (false)" at line 585 in file /var/lib/condor/execute/slot1/dir_3409276/userdir/.tmpiiiaCw/BUILD/condor-10.4.0/src/condor_shadow.V6.1/pseudo_ops.cpp
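For anyone triaging a similar ShadowLog, the ERROR entries can be pulled out of the startup noise with a short script. The following is only a sketch: the regular expression is inferred from the two ERROR lines quoted above, not from any documented HTCondor log format, and the names `ERROR_LINE` and `find_shadow_errors` are hypothetical helpers, not part of HTCondor.

```python
import re

# Assumed layout, inferred from the ERROR lines quoted above:
#   MM/DD/YY HH:MM:SS (cluster.proc) (pid): ERROR "message" at line N in file PATH
ERROR_LINE = re.compile(
    r'^(?P<timestamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) '
    r'\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): '
    r'ERROR "(?P<message>.*)" at line (?P<line>\d+) in file (?P<file>\S+)$'
)


def find_shadow_errors(lines):
    """Return one dict per ERROR line found in an iterable of log lines."""
    errors = []
    for raw in lines:
        match = ERROR_LINE.match(raw.strip())
        if match:
            record = match.groupdict()
            record["line"] = int(record["line"])  # source line number as int
            errors.append(record)
    return errors
```

Passing an open ShadowLog file object to `find_shadow_errors` yields the job ID, shadow PID, message, and source location of each failure, which makes it easier to see that jobs 7.0 and 8.0 hit the same assertion.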

Besides the cgroups, I changed nothing in my setup and the exact same jobs were completing before the switch.
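When only the cgroup setup has changed, it can help to confirm which hierarchy the execute node actually booted with. The sketch below assumes the conventional mount point `/sys/fs/cgroup` and relies on the fact that a cgroup v2 (unified) root contains a `cgroup.controllers` file; the function name `cgroup_version` is hypothetical, and a hybrid setup would be reported here as v1.

```python
import os


def cgroup_version(root="/sys/fs/cgroup"):
    """Guess the cgroup version mounted at root (assumed mount point).

    Returns 2 if root looks like a unified (v2) hierarchy, 1 if the
    directory exists without the v2 marker file, and None if root is
    missing altogether.
    """
    # The v2 unified hierarchy exposes cgroup.controllers at its root.
    if os.path.isfile(os.path.join(root, "cgroup.controllers")):
        return 2
    if os.path.isdir(root):
        return 1
    return None
```

Running `cgroup_version()` on the execute node after the switch shows whether the change to v1 actually took effect there.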

I would be very grateful for any information.

Thanks

Mira

