[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] Unable to unset monitors in claim destructor. The StartOfJob* attributes will be stale.



Hi,

I do see tons of these in the startlog on workernodes that don't run any successful jobs anymore from one time to another, can someone point me in the right direction what is acutally going on ? 


08/18/19 12:26:19 slot2_1: New machine resource of type -2 allocated
08/18/19 12:26:20 slot2_1: Request accepted.
08/18/19 12:26:20 slot2_1: Remote owner is butensch@xxxxxxx
08/18/19 12:26:20 slot2_1: State change: claiming protocol successful
08/18/19 12:26:20 slot2_1: Changing state: Owner -> Claimed
08/18/19 12:26:20 slot2_1: Got activate_claim request from shadow (131.169.223.42)
08/18/19 12:26:20 slot2_1: Remote job ID is 2390585.6605
08/18/19 12:26:20 slot2_1: Got universe "VANILLA" (5) from request classad
08/18/19 12:26:20 slot2_1: State change: claim-activation protocol successful
08/18/19 12:26:20 slot2_1: Changing activity: Idle -> Busy
08/18/19 12:26:20 slot2_1: Failed to open '/var/lib/condor/execute/dir_44371/.update.ad.tmp' for writing update ad: No such file or directory (2).
08/18/19 15:26:23 slot2_1: Called deactivate_claim()
08/18/19 15:26:23 slot2_1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
08/18/19 15:26:23 slot2_1: State change: starter exited
08/18/19 15:26:23 slot2_1: State change: No preempting claim, returning to owner
08/18/19 15:26:23 slot2_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
08/18/19 15:26:23 slot2_1: State change: IS_OWNER is false
08/18/19 15:26:23 slot2_1: Changing state: Owner -> Unclaimed
08/18/19 15:26:23 slot2_1: Changing state: Unclaimed -> Delete
08/18/19 15:26:23 slot2_1: Resource no longer needed, deleting
08/18/19 15:26:23 slot2_1: Unable to unset monitors in claim destructor.  The StartOfJob* attributes will be stale.  (0x1d215c0, (nil))

The job id does not appear in any other log on the workernode hence the snippet above seems to be the only wittness of failure ? 

The Shadowlog for the same job looks like this: 

/var/log/condor/SchedLog:08/16/19 17:54:58 (pid:1021145) job_transforms for 2390585.6605: 13 considered, 10 applied (T01SysDefaultProject,T02JobDefaults,T03JobValues,T04JobEnhance,T05JobClasses,T06AccountingGroup,T07AccountingStatusHold,T10BirdResource,T11ShellEnvironment,T12JobHistory)
/var/log/condor/SchedLog:08/18/19 12:26:20 (pid:1021145) Starting add_shadow_birthdate(2390585.6605)
/var/log/condor/SchedLog:08/18/19 12:26:20 (pid:1021145) Started shadow for job 2390585.6605 on slot2@xxxxxxxxxxxxxxxxx <131.169.160.158:36448?addrs=131.169.160.158-36448> for BIRD_other.lite.butensch, (shadow pid = 770760)
/var/log/condor/SchedLog:08/18/19 15:26:23 (pid:1021145) Shadow pid 770760 for job 2390585.6605 exited with status 113
/var/log/condor/SchedLog:08/18/19 15:26:23 (pid:1021145) Removing job 2390585.6605
/var/log/condor/ShadowLog:08/18/19 12:26:20 Initializing a VANILLA shadow for job 2390585.6605
/var/log/condor/ShadowLog:08/18/19 12:26:20 (2390585.6605) (770760): Request to run on slot2@xxxxxxxxxxxxxxxxx <131.169.160.158:36448?addrs=131.169.160.158-36448> was ACCEPTED
/var/log/condor/ShadowLog:08/18/19 15:26:23 (2390585.6605) (770760): Job 2390585.6605 is being removed: The system macro SYSTEM_PERIODIC_REMOVE expression '((JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 14 * 24 * 3600)) || (JobRunCount > 10) || ((JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > MaxJobRetirementTime))' evaluated to TRUE
/var/log/condor/ShadowLog:08/18/19 15:26:23 (2390585.6605) (770760): Job 2390585.6605 is being removed: The system macro SYSTEM_PERIODIC_REMOVE expression '((JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 14 * 24 * 3600)) || (JobRunCount > 10) || ((JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > MaxJobRetirementTime))' evaluated to TRUE
/var/log/condor/ShadowLog:08/18/19 15:26:23 (2390585.6605) (770760): Job 2390585.6605 is being removed: The system macro SYSTEM_PERIODIC_REMOVE expression '((JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 14 * 24 * 3600)) || (JobRunCount > 10) || ((JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > MaxJobRetirementTime))' evaluated to TRUE
/var/log/condor/ShadowLog:08/18/19 15:26:23 (2390585.6605) (770760): Job 2390585.6605 is being evicted from slot2@xxxxxxxxxxxxxxxxx
/var/log/condor/ShadowLog:08/18/19 15:26:23 (2390585.6605) (770760): logEvictEvent with unknown reason (113), not logging.
/var/log/condor/ShadowLog:08/18/19 15:26:23 (2390585.6605) (770760): **** condor_shadow (condor_SHADOW) pid 770760 EXITING WITH STATUS 113

-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx