
[HTCondor-users] cgroups v2 not working



Hello,
I am using HTCondor 10.4.0 on Rocky Linux 9.1 with a shared filesystem (NFS).
Today I switched to cgroups v1 because, as far as I know, cgroups v2 does not provide statistics about actual memory usage, which I need for my use case.

After the switch to cgroups v1, my jobs are not executed anymore.
I found the following error in the ShadowLog:

05/05/23 16:03:13 ******************************************************
05/05/23 16:03:13 ** condor_shadow (CONDOR_SHADOW) STARTING UP
05/05/23 16:03:13 ** /usr/sbin/condor_shadow
05/05/23 16:03:13 ** SubsystemInfo: name=SHADOW type=SHADOW(5) class=DAEMON(1)
05/05/23 16:03:13 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
05/05/23 16:03:13 ** $CondorVersion: 10.4.0 2023-04-06 BuildID: 638308 PackageID: 10.4.0-1 $
05/05/23 16:03:13 ** $CondorPlatform: x86_64_AlmaLinux9 $
05/05/23 16:03:13 ** PID = 8028
05/05/23 16:03:13 ** Log last touched 5/5 16:03:13
05/05/23 16:03:13 ******************************************************
05/05/23 16:03:13 Using config source: /etc/condor/condor_config
05/05/23 16:03:13 Using local config sources:
05/05/23 16:03:13 /etc/condor/config.d/01-central-manager.config
05/05/23 16:03:13 /etc/condor/config.d/10-stash-plugin.conf
05/05/23 16:03:13 /etc/condor/condor_config.local
05/05/23 16:03:13 config Macros = 73, Sorted = 73, StringBytes = 2133, TablesBytes = 1232
05/05/23 16:03:13 CLASSAD_CACHING is OFF
05/05/23 16:03:13 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
05/05/23 16:03:13 SharedPortEndpoint: waiting for connections to named socket shadow_7616_e92f_20
05/05/23 16:03:13 DaemonCore: command socket at <xxx.xx.xx.xx:xxxx?addrs=xxx.xx.xx.xx:xxxx&alias=vgcn-mira-central-manager.pulsar.novalocal&noUDP&sock=shadow_7616_e92f_20>
05/05/23 16:03:13 DaemonCore: private command socket at <xxx.xx.xx.xx:xxxx?addrs=xxx.xx.xx.xx:xxxx&alias=vgcn-mira-central-manager.pulsar.novalocal&noUDP&sock=shadow_7616_e92f_20>
05/05/23 16:03:13 Initializing a VANILLA shadow for job 8.0
05/05/23 16:03:13 (8.0) (8028): LIMIT_DIRECTORY_ACCESS = <unset>
05/05/23 16:03:13 (8.0) (8028): Request to run on slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <192.168.199.168:9618?addrs=192.168.199.168-9618&alias=vgcn-mira-exec-node-1.pulsar.novalocal&noUDP&sock=startd_6688_fd7d> was ACCEPTED
05/05/23 16:03:13 (7.0) (8027): ERROR "Error from slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Assertion ERROR on (false)" at line 585 in file /var/lib/condor/execute/slot1/dir_3409276/userdir/.tmpiiiaCw/BUILD/condor-10.4.0/src/condor_shadow.V6.1/pseudo_ops.cpp
05/05/23 16:03:13 (8.0) (8028): ERROR "Error from slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Assertion ERROR on (false)" at line 585 in file /var/lib/condor/execute/slot1/dir_3409276/userdir/.tmpiiiaCw/BUILD/condor-10.4.0/src/condor_shadow.V6.1/pseudo_ops.cpp

Besides the cgroups, I changed nothing in my setup and the exact same jobs were completing before the switch.
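
In case it helps with diagnosis, here is a minimal sketch of how one could confirm which cgroup hierarchy an execute node actually has mounted after such a switch. It assumes the usual mount point /sys/fs/cgroup. The function name detect_cgroup_version is hypothetical and not part of HTCondor, and on a hybrid setup (v1 controllers plus a separate unified mount) this check would report v1.

```python
from pathlib import Path


def detect_cgroup_version(root="/sys/fs/cgroup"):
    """Return 2 for a unified (v2) hierarchy, 1 for legacy (v1), None otherwise.

    Hypothetical helper for illustration; `root` is assumed to be the
    cgroup mount point of the node being checked.
    """
    base = Path(root)
    if not base.is_dir():
        return None
    # A unified v2 hierarchy exposes cgroup.controllers at its root.
    if (base / "cgroup.controllers").is_file():
        return 2
    # A v1 hierarchy mounts one directory per controller, e.g. memory/.
    if (base / "memory").is_dir():
        return 1
    return None


if __name__ == "__main__":
    print("cgroup version:", detect_cgroup_version())
```

Running this on the execute node rather than on the central manager would show whether the node really booted with the hierarchy the configuration expects.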

I would be very grateful for any information.

Thanks

Mira
