[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Unable to unset monitors in claim destructor. The StartOfJob* attributes will be stale.



The starter daemon for this job was created (we should probably be logging that in the startdâs log). You should see data in StarterLog.slot2_1.
In this instance, the problem is shown in the shadow log on the submit machine. The SYSTEM_PERIODIC_REMOVE expression is evaluating to True, apparently because the job ran for longer than the MaxJobRetirementTime value in its ad, which looks to be 3 hours.

 - Jaime

> On Aug 19, 2019, at 3:08 AM, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
> 
> Hi,
> 
> I do see tons of these in the startlog on workernodes that don't run any successful jobs anymore from one time to another, can someone point me in the right direction what is acutally going on ? 
> 
> 
> 08/18/19 12:26:19 slot2_1: New machine resource of type -2 allocated
> 08/18/19 12:26:20 slot2_1: Request accepted.
> 08/18/19 12:26:20 slot2_1: Remote owner is butensch@xxxxxxx
> 08/18/19 12:26:20 slot2_1: State change: claiming protocol successful
> 08/18/19 12:26:20 slot2_1: Changing state: Owner -> Claimed
> 08/18/19 12:26:20 slot2_1: Got activate_claim request from shadow (131.169.223.42)
> 08/18/19 12:26:20 slot2_1: Remote job ID is 2390585.6605
> 08/18/19 12:26:20 slot2_1: Got universe "VANILLA" (5) from request classad
> 08/18/19 12:26:20 slot2_1: State change: claim-activation protocol successful
> 08/18/19 12:26:20 slot2_1: Changing activity: Idle -> Busy
> 08/18/19 12:26:20 slot2_1: Failed to open '/var/lib/condor/execute/dir_44371/.update.ad.tmp' for writing update ad: No such file or directory (2).
> 08/18/19 15:26:23 slot2_1: Called deactivate_claim()
> 08/18/19 15:26:23 slot2_1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
> 08/18/19 15:26:23 slot2_1: State change: starter exited
> 08/18/19 15:26:23 slot2_1: State change: No preempting claim, returning to owner
> 08/18/19 15:26:23 slot2_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 08/18/19 15:26:23 slot2_1: State change: IS_OWNER is false
> 08/18/19 15:26:23 slot2_1: Changing state: Owner -> Unclaimed
> 08/18/19 15:26:23 slot2_1: Changing state: Unclaimed -> Delete
> 08/18/19 15:26:23 slot2_1: Resource no longer needed, deleting
> 08/18/19 15:26:23 slot2_1: Unable to unset monitors in claim destructor.  The StartOfJob* attributes will be stale.  (0x1d215c0, (nil))
> 
> The job id does not appear in any other log on the workernode hence the snippet above seems to be the only wittness of failure ? 
> 
> The Shadowlog for the same job looks like this: 
> 
> /var/log/condor/SchedLog:08/16/19 17:54:58 (pid:1021145) job_transforms for 2390585.6605: 13 considered, 10 applied (T01SysDefaultProject,T02JobDefaults,T03JobValues,T04JobEnhance,T05JobClasses,T06AccountingGroup,T07AccountingStatusHold,T10BirdResource,T11ShellEnvironment,T12JobHistory)
> /var/log/condor/SchedLog:08/18/19 12:26:20 (pid:1021145) Starting add_shadow_birthdate(2390585.6605)
> /var/log/condor/SchedLog:08/18/19 12:26:20 (pid:1021145) Started shadow for job 2390585.6605 on slot2@xxxxxxxxxxxxxxxxx <131.169.160.158:36448?addrs=131.169.160.158-36448> for BIRD_other.lite.butensch, (shadow pid = 770760)
> /var/log/condor/SchedLog:08/18/19 15:26:23 (pid:1021145) Shadow pid 770760 for job 2390585.6605 exited with status 113
> /var/log/condor/SchedLog:08/18/19 15:26:23 (pid:1021145) Removing job 2390585.6605
> /var/log/condor/ShadowLog:08/18/19 12:26:20 Initializing a VANILLA shadow for job 2390585.6605
> /var/log/condor/ShadowLog:08/18/19 12:26:20 (2390585.6605) (770760): Request to run on slot2@xxxxxxxxxxxxxxxxx <131.169.160.158:36448?addrs=131.169.160.158-36448> was ACCEPTED
> /var/log/condor/ShadowLog:08/18/19 15:26:23 (2390585.6605) (770760): Job 2390585.6605 is being removed: The system macro SYSTEM_PERIODIC_REMOVE expression '((JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 14 * 24 * 3600)) || (JobRunCount > 10) || ((JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > MaxJobRetirementTime))' evaluated to TRUE
> /var/log/condor/ShadowLog:08/18/19 15:26:23 (2390585.6605) (770760): Job 2390585.6605 is being removed: The system macro SYSTEM_PERIODIC_REMOVE expression '((JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 14 * 24 * 3600)) || (JobRunCount > 10) || ((JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > MaxJobRetirementTime))' evaluated to TRUE
> /var/log/condor/ShadowLog:08/18/19 15:26:23 (2390585.6605) (770760): Job 2390585.6605 is being removed: The system macro SYSTEM_PERIODIC_REMOVE expression '((JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 14 * 24 * 3600)) || (JobRunCount > 10) || ((JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > MaxJobRetirementTime))' evaluated to TRUE
> /var/log/condor/ShadowLog:08/18/19 15:26:23 (2390585.6605) (770760): Job 2390585.6605 is being evicted from slot2@xxxxxxxxxxxxxxxxx
> /var/log/condor/ShadowLog:08/18/19 15:26:23 (2390585.6605) (770760): logEvictEvent with unknown reason (113), not logging.
> /var/log/condor/ShadowLog:08/18/19 15:26:23 (2390585.6605) (770760): **** condor_shadow (condor_SHADOW) pid 770760 EXITING WITH STATUS 113
> 
> -- 
> Christoph Beyer
> DESY Hamburg
> IT-Department
> 
> Notkestr. 85
> Building 02b, Room 009
> 22607 Hamburg
> 
> phone:+49-(0)40-8998-2317
> mail: christoph.beyer@xxxxxxx
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/