Re: [HTCondor-users] jobs fail to start after update from 8.6.13 to 8.8.1



The default value of the configuration knob MOUNT_UNDER_SCRATCH changed between 8.6 and 8.8.
The new default value is:

MOUNT_UNDER_SCRATCH = /tmp,/var/tmp

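Since your execute directory is under /tmp, the attempt to mount /tmp into the job sandbox is recursive,
causing problems. That would presumably explain the 'No such file or directory' error when the starter
tries to exec the job. I'm surprised it doesn't fail earlier.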

You can fix this either by adding this to your configuration:

MOUNT_UNDER_SCRATCH = /var/tmp

or this, which leaves the list empty:

MOUNT_UNDER_SCRATCH = 

Or you can fix it by moving your execute directory so that it is no longer under /tmp.
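
To check whether a given node is affected, the overlap test described above can be sketched in a few lines of Python. This is only an illustration, not part of HTCondor: the function name overlapping_mounts is made up here, and it assumes MOUNT_UNDER_SCRATCH is a plain comma-separated list of absolute paths, as in this thread. The two values would have to be looked up by hand, for example with condor_config_val on the execute node.

```python
from pathlib import PurePosixPath


def overlapping_mounts(execute_dir, mount_under_scratch):
    """Return the MOUNT_UNDER_SCRATCH entries that contain execute_dir.

    Hypothetical helper for illustration only. Assumes the knob value
    is a plain comma-separated list of absolute paths.
    """
    execute = PurePosixPath(execute_dir)
    # Split the list and drop empty entries (an empty knob yields no entries).
    entries = [e.strip() for e in mount_under_scratch.split(",") if e.strip()]
    # An entry is a problem if it is the execute directory or one of its parents.
    return [
        e for e in entries
        if execute == PurePosixPath(e) or PurePosixPath(e) in execute.parents
    ]


# The setup from this thread with the 8.8 default:
print(overlapping_mounts("/tmp/condor/execute", "/tmp,/var/tmp"))
# The two suggested configuration changes:
print(overlapping_mounts("/tmp/condor/execute", "/var/tmp"))
print(overlapping_mounts("/tmp/condor/execute", ""))
```

A non-empty result means the execute directory sits under one of the listed mount points, which is the situation described above; an empty result means neither of the workarounds should be needed.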

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Laurent Wandrebeck
Sent: Thursday, March 21, 2019 4:35 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] jobs fail to start after update from 8.6.13 to 8.8.1

Hi there,

We've been happily running HTCondor for quite a while on CentOS 7.
After the update to 8.8.1, jobs now fail to start. Simple setup, one
master, and some execute nodes.

Everything seems to be related to EXECUTE, which is
/tmp/condor/execute, defined as x /tmp/condor/execute in
etc/tmpfiles.d/condor.conf.

On an execute node:
03/21/19 10:14:08 (pid:22633) Job 4443.944 set to execute immediately
03/21/19 10:14:08 (pid:22633) Starting a VANILLA universe job with ID: 4443.944
03/21/19 10:14:08 (pid:22633) Current mount, /tmp, is shared.
03/21/19 10:14:08 (pid:22633) Current mount, /, is shared.
03/21/19 10:14:08 (pid:22633) IWD: /tmp/condor/execute/dir_22633
03/21/19 10:14:08 (pid:22633) Renice expr "0" evaluated to 0
03/21/19 10:14:08 (pid:22633) About to exec /tmp/condor/execute/dir_22633/condor_exec.exe 
03/21/19 10:14:08 (pid:22633) Running job as user low
03/21/19 10:14:08 (pid:22633) Warning: Create_Process: failed to read child process failure code
03/21/19 10:14:08 (pid:22633) Create_Process(/tmp/condor/execute/dir_22633/condor_exec.exe,, ...) failed: (errno=2: 'No such file or directory')
03/21/19 10:14:08 (pid:22633) Failed to start job, exiting
03/21/19 10:14:08 (pid:22633) ShutdownFast all jobs.
03/21/19 10:14:08 (pid:22633) condor_read() failed: recv(fd=13) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.1.71.91:18752>.
03/21/19 10:14:08 (pid:22633) IO: Failed to read packet header
03/21/19 10:14:08 (pid:22633) Lost connection to shadow, waiting 2400 secs for reconnect
03/21/19 10:14:08 (pid:22633) All jobs have exited... starter exiting
03/21/19 10:14:08 (pid:22633) **** condor_starter (condor_STARTER) pid 22633 EXITING WITH STATUS 0

Any ideas? (SELinux is not the culprit.)
Thanks,
-- 
Laurent Wandrebeck
HYGEOS, Earth Observation Department / Observation de la Terre
Euratechnologies
165 Avenue de Bretagne
59000 Lille, France
tel: +33 3 20 08 24 98
https://www.hygeos.com

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/