
Re: [Condor-users] windows xp log off kills jobs



We have the same problem (v6.8.6 and 7.0.0):
vanilla universe jobs are killed when the interactive (desktop) user logs off.

job classads:
ExitBySignal = FALSE
ExitCode = 5   (sometimes 6)
JobStatus = 4

For our type of workflow (short runs, 15 minutes each; long job chains) the
following workaround helps "enough" for now:

===submit excerpt===
OnExitRemove = (ExitCode != 6) && (ExitCode != 5)

PeriodicRelease = (LastHoldReasonCode == 11) || (LastHoldReasonCode == 13) || (HoldReasonCode == 13) || (HoldReasonCode == 11)
===========
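
To make the logic of the two expressions explicit, here is a minimal Python
sketch that mirrors them. The function names are only illustrative and are not
part of Condor, and the sketch ignores ClassAd UNDEFINED semantics (a missing
attribute is simply treated as "no match"). The idea, as I understand it: a job
that exits with code 5 or 6 is not removed from the queue and so gets another
run, and a job held with reason code 11 or 13 is released again.

```python
# Illustrative mirror of the submit-file policy above; not Condor code.
# The names on_exit_remove/periodic_release are hypothetical helpers.

RERUN_EXIT_CODES = (5, 6)       # exit codes seen when the desktop user logs off
RELEASE_HOLD_CODES = (11, 13)   # hold reason codes the workaround releases


def on_exit_remove(exit_code):
    """True means the job may leave the queue; False means keep it for a rerun."""
    return exit_code not in RERUN_EXIT_CODES


def periodic_release(hold_reason_code=None, last_hold_reason_code=None):
    """True means a held job should be released.

    Assumption: None stands in for an undefined attribute and never matches.
    """
    return (last_hold_reason_code in RELEASE_HOLD_CODES
            or hold_reason_code in RELEASE_HOLD_CODES)


if __name__ == "__main__":
    for code in (0, 1, 5, 6):
        action = "remove from queue" if on_exit_remove(code) else "keep and rerun"
        print("ExitCode=%d: %s" % (code, action))
```

One consequence worth keeping in mind: if the program itself ever returns 5 or 6
for a real error, such a job would be rerun again and again under this policy.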

walter
P.S.:
client:   Windows XP SP2 (German), Condor 6.8.6 / 7.0.0
(simple installation: "always run condor jobs")
server/submit: Linux, Condor 6.8.6
executable: built with old Fortran compilers (Digital, Intel)

On Mon, 31.12.2007, 18:49, Finch, Ralph wrote:
> We are not, though that looks useful and we probably will start using it
> for another type of job we run under Condor.
>
> We just tested your 2nd question.  Yes, my Condor jobs are killed when
> someone else logs on, then logs off.
>
> Below is the portion of the StartLog on the machine, with my comments.
>
> 12/31 08:23:09 vm1: Got activate_claim request from shadow (<136.200.32.179:4851>)
> 12/31 08:23:09 vm1: Remote job ID is 413.7
> 12/31 08:23:10 vm1: Got universe "VANILLA" (5) from request classad
> 12/31 08:23:10 vm1: State change: claim-activation protocol successful
> 12/31 08:23:10 vm1: Changing activity: Idle -> Busy
> ## the other person touches the keyboard to login; job on vm1 suspended
> 12/31 08:54:01 vm1: State change: SUSPEND is TRUE
> 12/31 08:54:01 vm1: Changing activity: Busy -> Suspended
> ## other person logs out; apparently jobs on vm2, 3, and 4 are forced off.  Why?
> 12/31 08:55:23 DaemonCore: Command received via TCP from host <136.200.32.179:2291>
> 12/31 08:55:23 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> 12/31 08:55:23 vm2: Called deactivate_claim_forcibly()
> 12/31 08:55:23 DaemonCore: Command received via TCP from host <136.200.32.179:2293>
> 12/31 08:55:23 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> 12/31 08:55:23 vm3: Called deactivate_claim_forcibly()
> 12/31 08:55:23 DaemonCore: Command received via UDP from host <136.200.32.102:2307>
> 12/31 08:55:23 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
> 12/31 08:55:23 Starter pid 2404 exited with status 0
> 12/31 08:55:23 vm2: State change: starter exited
> 12/31 08:55:23 vm2: Changing activity: Busy -> Idle
> 12/31 08:55:23 vm2: State change: idle claim shutting down due to CLAIM_WORKLIFE
> 12/31 08:55:23 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 12/31 08:55:23 vm2: State change: No preempting claim, returning to owner
> 12/31 08:55:23 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 12/31 08:55:23 vm2: State change: IS_OWNER is false
> 12/31 08:55:23 vm2: Changing state: Owner -> Unclaimed
> 12/31 08:55:23 DaemonCore: Command received via TCP from host <136.200.32.179:2295>
> 12/31 08:55:23 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> 12/31 08:55:23 vm4: Called deactivate_claim_forcibly()
> 12/31 08:55:23 DaemonCore: Command received via UDP from host <136.200.32.179:2297>
> 12/31 08:55:23 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
> 12/31 08:55:23 Warning: can't find resource with ClaimId (<136.200.32.102:1037>#1198783236#65#...)
> 12/31 08:55:23 Starter pid 4084 exited with status 0
> 12/31 08:55:23 vm3: State change: starter exited
> 12/31 08:55:23 vm3: Changing activity: Busy -> Idle
> 12/31 08:55:23 vm3: State change: idle claim shutting down due to CLAIM_WORKLIFE
> 12/31 08:55:23 vm3: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 12/31 08:55:23 vm3: State change: No preempting claim, returning to owner
> 12/31 08:55:23 vm3: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 12/31 08:55:23 vm3: State change: IS_OWNER is false
> 12/31 08:55:23 vm3: Changing state: Owner -> Unclaimed
> 12/31 08:55:23 DaemonCore: Command received via UDP from host <136.200.32.102:2309>
> 12/31 08:55:23 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
> 12/31 08:55:23 Starter pid 3284 exited with status 0
> 12/31 08:55:23 vm4: State change: starter exited
> 12/31 08:55:23 vm4: Changing activity: Busy -> Idle
> 12/31 08:55:23 vm4: State change: idle claim shutting down due to CLAIM_WORKLIFE
> 12/31 08:55:23 vm4: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 12/31 08:55:23 vm4: State change: No preempting claim, returning to owner
> 12/31 08:55:23 vm4: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 12/31 08:55:23 vm4: State change: IS_OWNER is false
> 12/31 08:55:23 vm4: Changing state: Owner -> Unclaimed
> 12/31 08:55:23 DaemonCore: Command received via UDP from host <136.200.32.102:2313>
> 12/31 08:55:23 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
> 12/31 08:55:23 DaemonCore: Command received via UDP from host <136.200.32.179:2299>
> 12/31 08:55:23 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
> 12/31 08:55:23 Warning: can't find resource with ClaimId (<136.200.32.102:1037>#1198783236#67#...)
> 12/31 08:55:23 DaemonCore: Command received via UDP from host <136.200.32.179:2301>
> 12/31 08:55:23 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
> 12/31 08:55:23 Warning: can't find resource with ClaimId (<136.200.32.102:1037>#1198783236#68#...)
> ## job on vm1 continues from suspension, then is forced off too!
> 12/31 08:56:05 vm1: State change: CONTINUE is TRUE
> 12/31 08:56:05 vm1: Changing activity: Suspended -> Busy
> 12/31 08:56:05 vm2: State change: IS_OWNER is TRUE
> 12/31 08:56:05 vm2: Changing state: Unclaimed -> Owner
> 12/31 08:56:05 DaemonCore: Command received via TCP from host <136.200.32.179:2335>
> 12/31 08:56:05 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> 12/31 08:56:05 vm1: Called deactivate_claim_forcibly()
> 12/31 08:56:05 DaemonCore: Command received via UDP from host <136.200.32.102:2374>
> 12/31 08:56:05 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
> 12/31 08:56:05 Starter pid 1664 exited with status 0
> 12/31 08:56:05 vm1: State change: starter exited
> 12/31 08:56:05 vm1: Changing activity: Busy -> Idle
> 12/31 08:56:06 vm1: State change: START is false
> 12/31 08:56:06 vm1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 12/31 08:56:06 vm1: State change: No preempting claim, returning to owner
> 12/31 08:56:06 vm1: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 12/31 08:56:06 DaemonCore: Command received via UDP from host <136.200.32.179:2337>
> 12/31 08:56:06 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
> 12/31 08:56:06 Warning: can't find resource with ClaimId (<136.200.32.102:1037>#1198783236#69#...)
> 12/31 08:56:08 vm2: State change: IS_OWNER is false
> 12/31 08:56:08 vm2: Changing state: Owner -> Unclaimed
>
> Ralph Finch
> 916-653-7552
>
>
> -----Original Message-----
> Hmm... Are you using RunAsOwner? If so, does it happen if you run a job
> and then someone else logs on then off?
>
> Clutching at straws here...
>
> Matt
>
>

-- 
Walter Penits
ITS-Management
Computational Physics
University Vienna

http://homepage.univie.ac.at/walter.penits/

If you want to do the impossible, don't hire an expert because he knows it
can't be done.
- Henry Ford