
Re: [HTCondor-users] Jobs fail after updating from 10.5.0 to 10.6.0/10.7.0



Hi,

Just curious whether a problem was found here? I've noticed something similar when testing releases > 10.5 on el9 (tested 10.7 and 23.0). For me, jobs go on hold immediately with:

"Error from slot1_1@xxxxxxxxxxxxxxxxxxx: Failed to execute '/pool/condor/dir_2355954/condor_pid_ns_init' with arguments /afs/cern.ch/user/b/bejones/tmp/condor/hello.sh hello: (errno=2: 'No such file or directory')"

Looks like it's looking for condor_pid_ns_init in the sandbox rather than in LIBEXEC. Same config on the EP works on 10.5. Are we missing some config?
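
A minimal sketch for checking where the helper actually lives on an execute node, assuming you pass in the LIBEXEC directory (for example as reported by `condor_config_val LIBEXEC`) and a job sandbox directory yourself. The function name `helper_locations` is hypothetical and not part of HTCondor:

```python
import os

HELPER = "condor_pid_ns_init"

def helper_locations(libexec_dir, sandbox_dir):
    """Report whether the helper exists and is executable in each directory."""
    result = {}
    for label, directory in (("libexec", libexec_dir), ("sandbox", sandbox_dir)):
        path = os.path.join(directory, HELPER)
        # True only for a regular file that the current user may execute
        result[label] = os.path.isfile(path) and os.access(path, os.X_OK)
    return result
```

If this reports the helper under LIBEXEC but not in the sandbox, that would be consistent with the errno=2 hold message above, where the starter tries to execute it from the sandbox path.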

cheers,
Ben

On 20 Sep 2023, at 16:11, Tim Theisen via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Hello Carles,

We do not know of any issues with PID namespaces. However, it is possible that it is no longer working properly. We will try to reproduce the problem here.

...Tim

On 9/19/23 08:09, Carles Acosta wrote:
Hello again,

I updated the CE and the testing WNs on Alma9 to HTCondor 10.8.0, but the jobs continued to fail. So my last option was to set USE_PID_NAMESPACES to false on the Alma9 WNs. After that, the CE jobs started to run again.
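
To confirm which value a given WN is actually using, something like the following sketch may help. It assumes `condor_config_val` is on the PATH and that the knob is set to a plain true/false literal. The helper names are hypothetical, and expressions or macros are deliberately not evaluated:

```python
import subprocess

def parse_condor_bool(text):
    """Interpret a simple literal true/false value; return None for anything else."""
    value = text.strip().lower()
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    return None  # unset, or an expression this sketch does not evaluate

def pid_namespaces_enabled():
    """Query the local configuration for USE_PID_NAMESPACES."""
    try:
        proc = subprocess.run(
            ["condor_config_val", "USE_PID_NAMESPACES"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return None  # condor_config_val is not installed on this host
    if proc.returncode != 0:
        return None  # the query failed, for example because the knob is not defined
    return parse_condor_bool(proc.stdout)
```

Calling `pid_namespaces_enabled()` on a WN where the workaround is applied should return `False`; a result of `None` means the sketch could not determine the value.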

Was an issue introduced in HTCondor 10.6.0 that affects CE jobs and PID namespaces on AlmaLinux 9?

As I mentioned, this issue started with HTCondor version 10.6.0 on AlmaLinux 9 WNs and apparently affects only jobs routed from a CE.

Cheers,

Carles

On Tue, 5 Sept 2023 at 08:14, Carles Acosta <cacosta@xxxxxx> wrote:
Hi,

After more testing, we have discovered that not all jobs are failing, only the ones coming from the HTCondor-CE. 

According to the HTCondor release highlights, in version 10.6.0 the executable is no longer renamed to condor_exec.exe. Could the problem be related to this? I do not know.

We have run the StarterLog with D_ALL debug for one example job. We can send the log if necessary. 

Thank you again.

Cheers,

Carles

On Fri, 1 Sept 2023 at 16:14, Carles Acosta <cacosta@xxxxxx> wrote:
Hello,

We have WNs on AlmaLinux 9 with HTCondor 10.5.0 that were apparently running fine. However, after updating to 10.6.0 (or 10.7.0), new jobs are not executed correctly. These errors appear in the StarterLog.slotX_X:

09/01/23 07:23:30 (pid:54345) Create_Process succeeded, pid=54393
09/01/23 07:23:30 (pid:54345) Process exited, pid=54393, status=127
09/01/23 07:23:30 (pid:54345) JobReaper: condor_pid_ns_init didn't drop filename /home/execute/dir_54345/.condor_pid_ns_status (2)
09/01/23 07:23:30 (pid:54345) ERROR "Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly" at line 751 in file /var/lib/condor/execute/slot1/dir_3398586/userdir/build-ytPdzf/BUILD/condor-10.7.0/src/condor_starter.V6.1/vanilla_proc.cpp

09/01/23 07:23:30 (pid:54345) ShutdownFast all jobs.
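
A rough sketch for pulling the relevant lines out of a StarterLog, assuming the line prefix matches the excerpt above (date, time, then the starter pid in parentheses). The function name `find_pid_ns_failures` is made up for illustration:

```python
import re

# Prefix as seen in the excerpt: "MM/DD/YY HH:MM:SS (pid:N) message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")

def find_pid_ns_failures(lines):
    """Return (timestamp, starter_pid, message) for lines mentioning condor_pid_ns_init."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and "condor_pid_ns_init" in m.group(3):
            hits.append((m.group(1), int(m.group(2)), m.group(3)))
    return hits

sample = [
    "09/01/23 07:23:30 (pid:54345) Create_Process succeeded, pid=54393",
    "09/01/23 07:23:30 (pid:54345) Process exited, pid=54393, status=127",
    "09/01/23 07:23:30 (pid:54345) JobReaper: condor_pid_ns_init didn't drop "
    "filename /home/execute/dir_54345/.condor_pid_ns_status (2)",
]

for timestamp, starter_pid, message in find_pid_ns_failures(sample):
    print(timestamp, starter_pid, message)
```

Reading a real log would be a matter of passing `open(path)` instead of `sample`; the path to the StarterLog depends on the local LOG setting.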

I do not see any other hint in the StartLog:

09/01/23 07:23:30 Starter pid 54345 exited with status 4
09/01/23 07:23:30 slot1_1: State change: starter exited
09/01/23 07:23:30 slot1_1: Changing activity: Busy -> Idle

Rereading the version history, I'm not sure which change causes this error. Has anyone had a similar problem?

Thank you in advance.

Best regards,

Carles

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice:  http://legal.ifae.es


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
-- 
Tim Theisen (he, him, his)
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736