
[Condor-users] fetchwork vs. claim_worklife



Hi all

We are currently experimenting with Condor's fetchwork scheme as a successor 
to backfill. However, it does not seem to get along with CLAIM_WORKLIFE, 
which is partially understandable.

I found this in one node's log file and it does not really make much sense to 
me:

04/12 13:29:39 slot3: match_info called
04/12 13:29:39 slot3: Received match <10.12.0.6:49978>#1301995497#556#...
04/12 13:29:39 slot3: Started match timer (24339) for 120 seconds.
04/12 13:29:39 slot3: State change: match notification protocol successful
04/12 13:29:39 slot3: Changing state: Owner -> Matched
[...] # NOTE: 10.12.0.6 is this node's IP
04/12 13:29:39 slot3: Canceled match timer (24339)
04/12 13:29:39 slot3: Schedd addr = <10.20.30.1:40974>
04/12 13:29:39 slot3: Alive interval = 300
04/12 13:29:39 slot3: Received ClaimId from schedd 
(<10.12.0.6:49978>#1301995497#556#...)
[...] # NOTE: 10.20.30.1 is one of our submit machines
04/12 13:29:39 slot3: Rank of this claim is: 1.000000
04/12 13:29:39 slot3: Request accepted.
04/12 13:29:39 slot3: Remote owner is USER@xxxxxxxxxxx
04/12 13:29:39 slot3: State change: claiming protocol successful
04/12 13:29:39 slot3: Changing state: Matched -> Claimed
04/12 13:29:39 slot3: Started ClaimLease timer (24344) w/ 6000 second lease 
duration
[...]
04/12 13:29:40 slot3: Got activate_claim request from shadow 
(<10.20.30.1:38426>)
04/12 13:29:40 slot3: Read request ad and starter from shadow.
[...]
04/12 13:29:40 slot3: Remote job ID is 15052166.0
04/12 13:29:40 slot3: Remote global job ID is HOST....#15052166.0#1302384456
04/12 13:29:40 slot3: JobLeaseDuration defined in job ClassAd: 7200
04/12 13:29:40 slot3: Resetting ClaimLease timer (24344) with new duration
04/12 13:29:40 Create_Process: using fast clone() to create child process.
04/12 13:29:40 slot3: Got RemoteUser (USER@xxxxxxxxxxx) from request classad
04/12 13:29:40 slot3: Got universe "STANDARD" (1) from request classad
04/12 13:29:40 slot3: State change: claim-activation protocol successful
04/12 13:29:40 slot3: Changing activity: Idle -> Busy


So far so good. Fetchwork won't run just yet, as the rank is too low:
04/12 13:29:50 slot3: Rank of this fetched claim is: 0.000000
04/12 13:29:50 slot3: Fetched claim doesn't have sufficient rank, refusing.

Job finishes:
04/12 13:34:50 slot3: State change: starter exited
04/12 13:34:50 slot3: Changing activity: Busy -> Idle
04/12 13:34:50 slot3: Computing claimWorklifeExpired(); ClaimAge=311, 
ClaimWorklife=3600

Interestingly, fetchwork's rank is still bad:
04/12 13:35:00 slot3: Rank of this fetched claim is: 0.000000
04/12 13:35:00 slot3: Fetched claim doesn't have sufficient rank, refusing.

Finally:
04/12 14:29:40 slot3: Computing claimWorklifeExpired(); ClaimAge=3601, 
ClaimWorklife=3600
04/12 14:29:40 slot3: State change: idle claim shutting down due to 
CLAIM_WORKLIFE
04/12 14:29:40 slot3: Changing state and activity: Claimed/Idle -> 
Preempting/Vacating
04/12 14:29:40 Entered vacate_client <10.20.30.1:40974> atlas1.atlas.local...
04/12 14:29:40 slot3: Canceled ClaimLease timer (24344)
04/12 14:29:40 slot3: State change: No preempting claim, returning to owner
04/12 14:29:40 slot3: Changing state and activity: Preempting/Vacating -> 
Owner/Idle
04/12 14:29:41 slot3: Rank of this fetched claim is: 0.000000
04/12 14:29:41 State change: Finished fetching work successfully
04/12 14:29:41 slot3: Set destination state to Claimed
04/12 14:29:41 slot3: Changing state: Owner -> Claimed
04/12 14:29:41 slot3: Warning: starting ClaimLease timer before lease duration 
set.
04/12 14:29:41 slot3: Started ClaimLease timer (24466) w/ 1200 second lease 
duration
04/12 14:29:41 slot3: JobLeaseDuration defined in job ClassAd: 604800
04/12 14:29:41 slot3: Resetting ClaimLease timer (24466) with new duration
04/12 14:29:41 slot3: Sending Machine Ad to Starter
04/12 14:29:41 slot3: About to Create_Process "condor_starter -f -job-input-ad 
-"
04/12 14:29:41 Create_Process: using fast clone() to create child process.
04/12 14:29:41 slot3: Got RemoteUser (nice-user.boinc) from request classad
04/12 14:29:41 slot3: Got universe "VANILLA" (5) from request classad
04/12 14:29:41 slot3: Changing activity: Idle -> Busy


Can anyone tell me why it has to leave this core idle for close to an hour?
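
For reference, the "close to an hour" figure can be checked against the 
timestamps in the log excerpts above. This is only a small sketch: the three 
timestamps are copied from the log lines, while the variable and function 
names are made up for illustration and are not anything Condor provides.

```python
from datetime import datetime

# Timestamp layout used by the startd log lines quoted above (no year field).
FMT = "%m/%d %H:%M:%S"


def seconds_between(start, end):
    """Return the number of whole seconds from start to end."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Timestamps copied from the log; the names are illustrative only.
claimed_at = "04/12 13:29:39"         # Matched -> Claimed
job_exited_at = "04/12 13:34:50"      # starter exited, Busy -> Idle
claim_released_at = "04/12 14:29:40"  # idle claim shut down by CLAIM_WORKLIFE

claim_age_at_exit = seconds_between(claimed_at, job_exited_at)
claim_age_at_release = seconds_between(claimed_at, claim_released_at)
idle_seconds = seconds_between(job_exited_at, claim_released_at)

print("claim age when job exited:", claim_age_at_exit)
print("claim age when released:  ", claim_age_at_release)
print("idle time on claimed slot:", idle_seconds, "s =",
      idle_seconds // 60, "min", idle_seconds % 60, "s")
```

The two claim ages agree with the ClaimAge values the startd itself logged 
(311 and 3601), and the slot sat in Claimed/Idle for 3290 seconds, i.e. just 
under 55 minutes.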

Cheers && TIA

Carsten