
[Condor-users] Jobs vacating before run.



I'm currently testing a new Condor rollout (v6.8.2) and have run
into an odd situation.

I have three classes of system at this point: four submit/execute
nodes, sixteen execute-only nodes, and a "private" batch of
submit/execute nodes whose configs set 'START = Owner == "someuser"'.

Other than the START statement on the private nodes, all the policy
bits are in the global config on a shared filesystem and are in the
supplied "testing" mode, which should accept everything and preempt
nothing, as I understand it.

This all seems to work nicely for me.  My one other user can run on
all the nodes except the execute-only nodes.  These he can claim, but
after a brief time they revert to owner state.  I've included the
relevant StartLog from the execute system below.

The one difference about these systems that springs to mind is that,
unlike the submit nodes, they do not have a complete /etc/passwd file
(as an admin I have an account on them; the failing user does not).
My understanding was that users did not need login capability on
execute nodes.  I extrapolated this to mean they didn't need to be in
the password file.  Am I incorrect?
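
To confirm which accounts actually resolve on a given execute node, a
minimal Python sketch along these lines could be run there.  The
function name account_exists and the username "someuser" are
placeholders for illustration only; the lookup uses the standard pwd
module, which consults whatever password database the node is
configured with.

```python
import pwd


def account_exists(username):
    """Return True if `username` resolves in the node's password database."""
    try:
        pwd.getpwnam(username)  # raises KeyError when there is no such entry
        return True
    except KeyError:
        return False


# "someuser" is a placeholder; substitute the account being investigated.
print("someuser", "found" if account_exists("someuser") else "NOT found")
```

This only reports whether the name resolves on the node; it does not by
itself show that a missing entry is what makes the starter exit.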


StartLog fragment:

2/26 06:57:28 DaemonCore: Command received via UDP from condor@xxxxxxxxxxxxx from host <128.30.2.158:41113>
2/26 06:57:28 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
2/26 06:57:28 match_info called
2/26 06:57:28 Received match <128.30.2.170:44965>#1171060608#653
2/26 06:57:28 State change: match notification protocol successful
2/26 06:57:28 Changing state: Unclaimed -> Matched
2/26 06:57:30 DaemonCore: Command received via TCP from condor@xxxxxxxxxxxxx from host <128.30.108.56:34130>
2/26 06:57:30 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
2/26 06:57:30 Request accepted.
2/26 06:57:30 Remote owner is vkm@xxxxxxxxxxxxx
2/26 06:57:30 State change: claiming protocol successful
2/26 06:57:30 Changing state: Matched -> Claimed
2/26 06:58:42 DaemonCore: Command received via TCP from condor@xxxxxxxxxxxxx from host <128.30.108.56:44523>
2/26 06:58:42 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
2/26 06:58:42 Got activate_claim request from shadow (<128.30.108.56:44523>)
2/26 06:58:42 Remote job ID is 19860.0
2/26 06:58:42 Got universe "VANILLA" (5) from request classad
2/26 06:58:42 State change: claim-activation protocol successful
2/26 06:58:42 Changing activity: Idle -> Busy
2/26 06:58:42 Starter pid 24310 exited with status 1
2/26 06:58:42 State change: starter exited
2/26 06:58:42 Changing activity: Busy -> Idle
2/26 06:58:42 DaemonCore: Command received via TCP from condor@xxxxxxxxxxxxx from host <128.30.108.56:34876>
2/26 06:58:42 DaemonCore: received command 1200 (CA_CMD), calling handler (command_classad_handler)
2/26 06:58:42 Aborting CA_LOCATE_STARTER
2/26 06:58:42 ClaimId (<128.30.2.170:44965>#1171060608#653) and GlobalJobId ( cocosci-1.csail.mit.edu#1172286171#19860.0 ) not found
2/26 06:58:42 DaemonCore: Command received via UDP from condor@xxxxxxxxxxxxx from host <128.30.108.56:55493>
2/26 06:58:42 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
2/26 06:58:42 State change: received RELEASE_CLAIM command
2/26 06:58:42 Changing state and activity: Claimed/Idle -> Preempting/Vacating
2/26 06:58:42 State change: No preempting claim, returning to owner
2/26 06:58:42 Changing state and activity: Preempting/Vacating -> Owner/Idle
2/26 06:58:42 State change: IS_OWNER is false
2/26 06:58:42 Changing state: Owner -> Unclaimed
2/26 06:58:42 DaemonCore: Command received via UDP from condor@xxxxxxxxxxxxx from host <128.30.108.56:55493>
2/26 06:58:42 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
2/26 06:58:42 Warning: can't find resource with ClaimId (<128.30.2.170:44965>#1171060608#653)
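
For skimming a longer StartLog, the state and activity transitions can
be pulled out with a short Python sketch like the one below.  The
regular expression is an assumption based only on the "Changing ..."
lines visible in this fragment, not on any documented log format, and
the names TRANSITION and transitions are made up for illustration.

```python
import re

# Assumed line shape, taken from the fragment above, e.g.
# "2/26 06:57:28 Changing state: Unclaimed -> Matched"
TRANSITION = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) "
    r"Changing (?:state and activity|state|activity): "
    r"(\S+) -> (\S+)$"
)


def transitions(lines):
    """Return (timestamp, old, new) tuples for each transition line."""
    found = []
    for line in lines:
        match = TRANSITION.match(line.strip())
        if match:
            found.append(match.groups())
    return found


sample = [
    "2/26 06:58:42 Changing activity: Idle -> Busy",
    "2/26 06:58:42 Starter pid 24310 exited with status 1",
    "2/26 06:58:42 Changing activity: Busy -> Idle",
]
for stamp, old, new in transitions(sample):
    print(stamp, old, "->", new)
```

In the fragment above, the Busy -> Idle transition carries the same
timestamp as "Starter pid 24310 exited with status 1", so the starter
exits within the same second the claim is activated.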

Thanks,
-Jon