
Re: [Condor-users] How claim lease and lease expired duration calculated.



Johnson,

If you add D_COMMAND to STARTD_DEBUG, then you will be able to see a record of ALIVE messages received by the startd. If you add D_PROTOCOL to SCHEDD_DEBUG, you will be able to see a record of ALIVE messages sent by the schedd to the startds. Perhaps this will help clarify what is happening.

--Dan

Johnson koil Raj wrote:
Hi,
Claim leases are expiring frequently. When that happens, the condor_startd drops the claim and the running jobs get killed.

1. As per the manual, "The length of the claim lease is the job's ClassAd attribute JobLeaseDuration." For my job, JobLeaseDuration = 1200, but the claim lease duration is set to 2400.

2. Is the claim dropped based on MAX_CLAIM_ALIVES_MISSED, or because the startd did not receive any ALIVE packets within the claim lease duration?

In my pool I have the following configuration.

ALIVE_INTERVAL          = 600 (default 300)
REQUEST_CLAIM_TIMEOUT   = default (30 min)
MAX_CLAIM_ALIVES_MISSED = default (6) at startd

I have copied the relevant part of the StartLog below.

8/6 16:27:55 slot1.50: State change: claiming protocol successful
8/6 16:27:55 slot1.50: Changing state: Owner -> Claimed
8/6 16:27:55 slot1.50: Started ClaimLease timer (172447) w/ 2400 second lease duration
8/6 16:27:55 slot1.50: Got activate_claim request from shadow (<10.207.100.66:9978>)
8/6 16:27:55 slot1.50: Read request ad and starter from shadow.
8/6 16:27:56 slot1.50: JobLeaseDuration defined in job ClassAd: 1200
8/6 16:27:56 slot1.50: Resetting ClaimLease timer (172447) with new duration
8/6 16:27:56 slot1.50: About to Create_Process "condor_starter -f -a slot1.50 gridprime.pesgrid.wipro.com"
8/6 16:27:56 slot1.50: State change: claim-activation protocol successful
8/6 16:27:56 slot1.50: Changing activity: Idle -> Busy
8/6 17:22:05 slot1.50: State change: claim lease expired (condor_schedd gone?)
8/6 17:22:05 slot1.50: Changing state and activity: Claimed/Busy -> Preempting/Killing
8/6 17:22:05 slot1.50: In Starter::kill() with pid 15687, sig 3 (SIGQUIT)
8/6 17:22:05 slot1.50: Got ALIVE while in Preempting state, ignoring.
8/6 17:23:11 slot1.50: State change: No preempting claim, returning to owner
8/6 17:23:11 slot1.50: Changing state and activity: Preempting/Killing -> Owner/Idle
8/6 17:23:11 slot1.50: State change: IS_OWNER is false
8/6 17:23:11 slot1.50: Changing state: Owner -> Unclaimed
8/6 17:23:11 slot1.50: Changing state: Unclaimed -> Delete
8/6 17:23:11 slot1.50: Resource no longer needed, deleting
8/6 17:25:27 slot1.50: New machine resource of type -1 allocated
8/6 17:25:29 slot1.50: Rank of this claim is: 0.000000
8/6 17:25:29 slot1.50: Request accepted.
8/6 17:25:29 slot1.50: State change: claiming protocol successful
8/6 17:25:29 slot1.50: Changing state: Owner -> Claimed
8/6 17:25:29 slot1.50: Started ClaimLease timer (176480) w/ 2400 second lease duration
8/6 17:25:30 slot1.50: Got activate_claim request from shadow (<10.207.100.66:9845>)
8/6 17:25:30 slot1.50: Read request ad and starter from shadow.
8/6 17:25:31 slot1.50: JobLeaseDuration defined in job ClassAd: 1200
8/6 17:25:31 slot1.50: Resetting ClaimLease timer (176480) with new duration
8/6 17:25:31 slot1.50: About to Create_Process "condor_starter -f -a slot1.50 gridprime.pesgrid.wipro.com"
8/6 17:25:32 slot1.50: State change: claim-activation protocol successful
8/6 17:25:32 slot1.50: Changing activity: Idle -> Busy
8/6 18:15:28 slot1.50: State change: claim lease expired (condor_schedd gone?)
8/6 18:15:28 slot1.50: Changing state and activity: Claimed/Busy -> Preempting/Killing
8/6 18:15:28 slot1.50: In Starter::kill() with pid 17698, sig 3 (SIGQUIT)
8/6 18:15:30 slot1.50: Got ALIVE while in Preempting state, ignoring.
8/6 18:17:12 slot1.50: State change: No preempting claim, returning to owner
8/6 18:17:12 slot1.50: Changing state and activity: Preempting/Killing -> Owner/Idle
8/6 18:17:12 slot1.50: State change: IS_OWNER is false
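
As a rough check on the excerpt above, the following Python sketch measures how long each claim survived between the last logged "Resetting ClaimLease timer" line and the following "claim lease expired" line. It is only an illustration: the function names and the placeholder year are assumptions (the StartLog timestamps carry no year), and at this debug level the excerpt does not record when ALIVE messages were received, so the numbers say nothing about when the lease was last renewed.

```python
from datetime import datetime

# Placeholder year: the log timestamps carry none. Only same-day
# differences are computed, so the exact value does not matter.
ASSUMED_YEAR = 2000

# Lines copied from the StartLog excerpt (both claims on slot1.50).
LOG = """\
8/6 16:27:55 slot1.50: Started ClaimLease timer (172447) w/ 2400 second lease duration
8/6 16:27:56 slot1.50: JobLeaseDuration defined in job ClassAd: 1200
8/6 16:27:56 slot1.50: Resetting ClaimLease timer (172447) with new duration
8/6 17:22:05 slot1.50: State change: claim lease expired (condor_schedd gone?)
8/6 17:25:29 slot1.50: Started ClaimLease timer (176480) w/ 2400 second lease duration
8/6 17:25:31 slot1.50: JobLeaseDuration defined in job ClassAd: 1200
8/6 17:25:31 slot1.50: Resetting ClaimLease timer (176480) with new duration
8/6 18:15:28 slot1.50: State change: claim lease expired (condor_schedd gone?)
"""


def stamp(line):
    """Parse the leading 'M/D HH:MM:SS' fields of a log line."""
    date, clock = line.split()[:2]
    return datetime.strptime(f"{ASSUMED_YEAR}/{date} {clock}", "%Y/%m/%d %H:%M:%S")


def lease_lifetimes(log_text):
    """Seconds from each logged timer reset to the next lease expiry."""
    lifetimes = []
    reset_at = None
    for line in log_text.splitlines():
        if "Resetting ClaimLease timer" in line:
            reset_at = stamp(line)
        elif "claim lease expired" in line and reset_at is not None:
            lifetimes.append(int((stamp(line) - reset_at).total_seconds()))
            reset_at = None
    return lifetimes


print(lease_lifetimes(LOG))  # [3249, 2997]
```

Both intervals are longer than either the 1200-second JobLeaseDuration or the initial 2400-second lease. That is consistent with the lease having been renewed at least once before it expired, but the excerpt alone cannot confirm it.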


by
Johnson
