
[HTCondor-users] Problem with ALICE jobs on AL9



Hi list,
we are in the middle of migrating our infrastructure from CentOS 7 to
AlmaLinux 9.
Most of the infra is still on CC7 with condor 9.*, and we have one test
WN cluster with AL9 on condor 23.
So far the ATLAS workload works fine on the new cluster, but the ALICE
jobs land on the WN and fail right away without producing any output.

One possible cause:
06/14/24 09:39:54 (pid:1) Unexpected permissions failure in setting
hard limit for max core sizesetrlimit(4, new = [rlim_cur =
18446744073709551615, rlim_max = 18446744073709551615]) : old =
[rlim_cur = 0, rlim_max = 0], errno: 1(Operation not permitted).
Attempting workaround.                                                
06/14/24 09:39:54 (pid:1) Workaround not applicable, no hard limit
enforcement for max core size. 

We have disabled core dumps for the condor service (a local hack to
keep our local users from flooding the FS with core files).
But even after setting the test WN to allow core dumps, the
behaviour is still the same.
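
For reference, on Linux resource 4 in the setrlimit call is RLIMIT_CORE, and 18446744073709551615 (2^64 - 1) is RLIM_INFINITY, so the log shows an attempt to raise the core size limit from a hard limit of 0 to unlimited, which an unprivileged process is not allowed to do. A minimal sketch for checking what a process on the WN actually sees is below. It is not HTCondor code; the helper names fmt_limit, core_limits and can_raise_core_limit are made up for illustration, and it only uses Python's standard resource module.

```python
import resource


def fmt_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def core_limits():
    """Return the (soft, hard) core-size limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_CORE)


def can_raise_core_limit():
    """Try the same request as in the log: set both limits to unlimited.

    Returns True if the kernel accepts it, False if it is refused.
    The original limits are restored afterwards when the call succeeded.
    """
    soft, hard = core_limits()
    unlimited = (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
    try:
        resource.setrlimit(resource.RLIMIT_CORE, unlimited)
    except (ValueError, OSError):
        # CPython reports EPERM here as ValueError; other failures as OSError.
        return False
    resource.setrlimit(resource.RLIMIT_CORE, (soft, hard))
    return True


if __name__ == "__main__":
    soft, hard = core_limits()
    print("core soft limit:", fmt_limit(soft))
    print("core hard limit:", fmt_limit(hard))
    print("can raise to unlimited:", can_raise_core_limit())
```

Running this both in an interactive shell as the job user and inside a small test job submitted to the WN would show whether the limits differ between the manual run and the batch environment. It only reports the limits and does not explain the job failure by itself.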

I do not know what to do next; there is no information about why the
job failed in condor_history or in the logs on the WN.
The payload (job agent) works fine when run manually under the
appropriate user.

We have a planned outage next week to migrate most of our infra to
AL9 and HTC23, to make things even more interesting.

Side notes: the CEs are ARC, and the WNs use cgroups v1 because v2 is
not working.

Cheers
AM


-- 
Alexandr Mikula
Oddělení síťování a výpočetní techniky & Výpočetní středisko
Fyzikální ústav Akademie věd České republiky, v. v. i.
Institute of Physics of the Czech Academy of Sciences 
