
Re: [HTCondor-users] Jobs that require root permissions




On Mar 19, 2013, at 4:47 AM, Michael Hanke <michael.hanke@xxxxxxxxx> wrote:

On Mon, Mar 18, 2013 at 9:38 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
Hi Michael,

I suspect we are chasing an incorrect lead with respect to the job suspension; the fakeroot is being leaked to the mount namespace, not the HTCondor one (so the bug I thought of does not apply here).

However, if you add:

MOUNT_UNDER_SCRATCH=/tmp

it should make those warning/error messages go away.

Could you tell me a little more about why bind-mounting /tmp will disable the warnings? From the documentation it is not obvious to me.

Sorry, I'm too terse sometimes - 

Condor is complaining about sandbox cleaning (I think) because it is finding files owned by root in the job sandbox (there are assumptions littered throughout the code, especially sandbox cleanup, that there is only one UID for files in a sandbox; we hit similar issues when using glexec).

It sounds like the root-owned files are all from filesystems which are remounted / bind-mounted into the sandbox by pbuilder (/proc, /dev/pts).  By enabling MOUNT_UNDER_SCRATCH, HTCondor will put the job in a separate "mount namespace" that makes mounts in the job invisible to the rest of the system; this is required to give the job a private /tmp, but the private /tmp is a side-effect in this case.

Hence, /proc and /dev/pts would be invisible to the condor_starter and wouldn't be cleaned up.
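
To check which files in a sandbox break the one-UID assumption, something like the following could be used. It is only a rough sketch and not HTCondor code: the function names are made up for illustration, and the sandbox path and expected UID are whatever applies on your node.

```python
import os
from collections import Counter


def owners_in_tree(root):
    """Count entries under root by owning UID, without following symlinks."""
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                counts[os.lstat(path).st_uid] += 1
            except OSError:
                # The entry vanished or is unreadable; skip it.
                continue
    return counts


def foreign_owned(root, expected_uid):
    """List paths under root whose owner is not expected_uid."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_uid != expected_uid:
                    found.append(path)
            except OSError:
                continue
    return sorted(found)
```

Running foreign_owned() on the job's scratch directory with the job user's UID would list the root-owned entries (for example the mount points left by pbuilder) that the cleanup code is likely tripping over.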

 
What are your SUSPEND-related attributes set to on that worker node?


% condor_config_val -dump |grep -i suspend
MAXSUSPENDTIME = 10 * $(MINUTE)
SUSPEND = $(UWCS_SUSPEND)
TESTINGMODE_SUSPEND = False
TESTINGMODE_WANT_SUSPEND = False
UWCS_PREEMPT = ( ((Activity == "Suspended") && ($(ActivityTimer) > $(MaxSuspendTime))) || (SUSPEND && (WANT_SUSPEND == False)) )
UWCS_SUSPEND = ( $(KeyboardBusy) || ( (CpuBusyTime > 2 * $(MINUTE)) && $(ActivationTimer) > 90 ) )
UWCS_WANT_SUSPEND = ( $(SmallJob) || $(KeyboardNotBusy) || $(IsVanilla) ) && ( $(SUSPEND) )
VM_SOFT_SUSPEND = True
WANT_SUSPEND = $(UWCS_WANT_SUSPEND)
 
This is a dedicated cluster node -- no keyboard.

Ah - 

What does CpuBusyTime look like? If there's enough system activity (or if the root-owned processes are not being tracked by the procd and counting as system activity), then the SUSPEND expression could trigger.
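
To see why a node with no keyboard can still suspend jobs, the UWCS_SUSPEND expression from the dump above can be transcribed into Python. This is only a rough sketch: MINUTE = 60 is assumed, the function name and arguments are made up for illustration, and ClassAd details such as UNDEFINED values are not modeled.

```python
MINUTE = 60  # assumed value of $(MINUTE), in seconds


def uwcs_suspend(keyboard_busy, cpu_busy_time, activation_timer):
    """Mirror of: KeyboardBusy || ((CpuBusyTime > 2 * MINUTE) && ActivationTimer > 90)."""
    return keyboard_busy or (
        cpu_busy_time > 2 * MINUTE and activation_timer > 90
    )


# No keyboard activity, but the CPU has been busy with non-HTCondor load
# for more than two minutes and the job has been running for a while.
print(uwcs_suspend(keyboard_busy=False, cpu_busy_time=300, activation_timer=600))
```

So even with KeyboardBusy always false, the second half of the expression is enough to suspend the job once CpuBusyTime passes two minutes.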

If it's a dedicated cluster - and you have no need for job suspension - you can set:

SUSPEND = FALSE
WANT_SUSPEND = FALSE

Hope this helps!

Brian
