
[Condor-users] Condor with Hooks is not running jobs as Owner, always runs them as CONDOR_IDS



I might have hit a bug with hooks. Here's what my fetch work hook is passing
to Condor:

3/23 10:59:30 Warning, hook /tools/arc/scripts/hooks/arc_job_fetch (pid
26887) printed to stderr: DEBUG: Slot State="Unclaimed"
Found job 38443
Cmd = "/tools/arc/scripts/arc_execute.sh"
Owner = "ichesal"
IWD = "/data/ichesal"
Out = "/data/ichesal/job/20090323/1000/38443/stdout.txt"
Err = "/data/ichesal/job/20090323/1000/38443/stderr.txt"
Args = "38443"

3/23 10:59:30 State change: Finished fetching work successfully
3/23 10:59:30 Changing state: Unclaimed -> Claimed
3/23 10:59:30 Warning: starting ClaimLease timer before lease duration
set.
3/23 10:59:30 Got universe "VANILLA" (5) from request classad
3/23 10:59:30 Changing activity: Idle -> Busy
3/23 10:59:30 DaemonCore: return from reaper for pid 26887
3/23 10:59:30 Received UDP command 60008 (DC_CHILDALIVE) from
<137.57.202.213:44611>, access level DAEMON
3/23 10:59:30 Calling HandleReq <HandleChildAliveCommand> (0)
3/23 10:59:30 Return from HandleReq <HandleChildAliveCommand> (handler:
0.000s, sec: 0.001s)
3/23 10:59:36 Calling pipe Handler <DC stderr pipe handler> for Pipe
end=65540 <DC stderr pipe>
3/23 10:59:36 Return from pipe Handler
3/23 10:59:36 Calling pipe Handler <DC stdout pipe handler> for Pipe
end=65538 <DC stdout pipe>
3/23 10:59:36 Return from pipe Handler
3/23 10:59:36 Calling pipe Handler <DC stderr pipe handler> for Pipe
end=65540 <DC stderr pipe>
3/23 10:59:36 Return from pipe Handler
3/23 10:59:36 DaemonCore: pid 26910 exited with status 0, invoking
reaper 1 <HookClientMgr Output Reaper>
3/23 10:59:36 Warning, hook /tools/arc/scripts/hooks/arc_job_fetch (pid
26910) printed to stderr: DEBUG: Slot State="Claimed"

(That output is from my StartLog on the machine running the hook)
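
For reference, the hook follows the usual fetch-work contract: the startd
hands it the slot ClassAd on stdin, it prints a job ClassAd on stdout and
exits 0. Boiled down, it amounts to something like the sketch below (just an
illustration with the values from the log above hard-coded, not the real
arc_job_fetch):

#!/usr/bin/env python
# Sketch of a fetch-work hook. The startd passes the slot ClassAd on stdin;
# whatever the hook prints on stdout is taken as the job ClassAd; anything
# written to stderr shows up as a "Warning, hook ... printed to stderr"
# line in the StartLog.
import sys

slot_ad = sys.stdin.read()                # slot ClassAd, "Attr = value" lines
sys.stderr.write('DEBUG: got slot ad\n')  # lands in StartLog as a warning

job_id = 38443                            # in reality this comes from our external queue
print('Cmd = "/tools/arc/scripts/arc_execute.sh"')
print('Owner = "ichesal"')
print('IWD = "/data/ichesal"')
print('Out = "/data/ichesal/job/20090323/1000/%d/stdout.txt"' % job_id)
print('Err = "/data/ichesal/job/20090323/1000/%d/stderr.txt"' % job_id)
print('Args = "%d"' % job_id)
print('JobUniverse = 5')                  # vanilla
sys.exit(0)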

And the StarterLog on the machine looks like this:

3/23 10:59:30 ******************************************************
3/23 10:59:30 ** condor_starter (CONDOR_STARTER) STARTING UP
3/23 10:59:30 **
/net/sj-swnas1/tools/condor/7.2.1/linux32/sbin/condor_starter
3/23 10:59:30 ** SubsystemInfo: name=STARTER type=STARTER(8)
class=DAEMON(1)
3/23 10:59:30 ** Configuration: subsystem:STARTER local:<NONE>
class:DAEMON
3/23 10:59:30 ** $CondorVersion: 7.2.1 Feb 18 2009 BuildID: 133382 $
3/23 10:59:30 ** $CondorPlatform: I386-LINUX_RHEL5 $
3/23 10:59:30 ** PID = 26896
3/23 10:59:30 ** Log last touched 3/23 10:58:05
3/23 10:59:30 ******************************************************
3/23 10:59:30 Using config source: /etc/condor/condor_config
3/23 10:59:30 Using local config sources:
3/23 10:59:30    /tools/arc/condor/condor_config.basic
3/23 10:59:30    /tools/arc/condor/os/condor_config.LINUX
3/23 10:59:30    /tools/arc/condor/site/condor_config.SJDEV
3/23 10:59:30    /tools/arc/condor/machine/condor_config.sqal08
3/23 10:59:30    /tools/arc/condor/machine/condor_config.sqal08.LINUX
3/23 10:59:30    /tools/arc/condor/patch/condor_config.sqal08
3/23 10:59:30    /tools/arc/condor/patch/condor_config.sqal08.LINUX
3/23 10:59:30    /tools/arc/condor/cycleserver/sqal08.config
3/23 10:59:30 DaemonCore: Command Socket at <137.57.202.213:45378>
3/23 10:59:30 Done setting resource limits
3/23 10:59:30 Starter running a local job with no shadow
3/23 10:59:30 Reading job ClassAd from "STDIN"
3/23 10:59:30 Found ClassAd data in "STDIN"
3/23 10:59:30 setting the orig job name in starter
3/23 10:59:30 setting the orig job iwd in starter
3/23 10:59:30 Job 1.0 set to execute immediately
3/23 10:59:30 Starting a VANILLA universe job with ID: 1.0
3/23 10:59:30 IWD: /data/ichesal
3/23 10:59:30 Output file:
/data/ichesal/job/20090323/1000/38443/stdout.txt
3/23 10:59:30 Error file:
/data/ichesal/job/20090323/1000/38443/stderr.txt
3/23 10:59:30 Renice expr "((False =?= True) * 10)" evaluated to 0
3/23 10:59:30 About to exec /tools/arc/scripts/arc_execute.sh 38443
3/23 10:59:30 Create_Process succeeded, pid=26897
3/23 11:09:31 Process exited, pid=26897, status=0
3/23 11:09:31 All jobs have exited... starter exiting
3/23 11:09:31 **** condor_starter (condor_STARTER) pid 26896 EXITING
WITH STATUS 0

Which seems great. Except: my jobs aren't running as user ichesal on the
machine. They're running as the daemon user set in CONDOR_IDS: aceadmin.

The COD documentation claims Condor should switch its user context to the
Owner ID passed via the fetch work script. See:
http://www.cs.wisc.edu/condor/manual/v7.2/4_3Computing_On.html#sec:cod-application-attributes

But that's not happening.

On my non-hook machines the user context switching occurs as expected, and
jobs I submit to those machines run as me: the condor_starter process is
still owned by aceadmin, but the sub-process spawned to run my Cmd is owned
by me. The aceadmin account has no special privileges on any machine.
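
(The check itself is nothing fancy: run something like this in place of the
real Cmd and it prints the account the job actually lands on. Just an
illustration, not a piece of arc_execute.sh.)

#!/usr/bin/env python
# Print the account this process is running as.
import os, pwd
print('running as uid=%d (%s)' % (os.getuid(), pwd.getpwuid(os.getuid()).pw_name))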

Is this a bug with hooks? Or is this a problem with the ClassAd output from
my fetch work script? Because there's nary a peep from Condor about my
ClassAd being wrong, I'm thinking this is a hooks bug...
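
In case the setup matters: the hook is wired in through the standard startd
job hook knobs, roughly like this (the ARC keyword is just a stand-in for my
local keyword name):

STARTD_JOB_HOOK_KEYWORD = ARC
ARC_HOOK_FETCH_WORK = /tools/arc/scripts/hooks/arc_job_fetch
FetchWorkDelay = 60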

- Ian
