
Re: [HTCondor-users] Starter Log not getting updated with jobs (nominally) started on the slot



These logs suggest that the AP is successfully claiming the EP but is not able to start any jobs on it. Jobs matched to the EP should still be in the AP's queue.

Look for lines in the StartLog after the "Changing state: Owner -> Claimed" line. For a normal job start, you would see these lines:

03/25/24 22:26:18 slot1_1: Got activate_claim request from shadow (192.168.4.135)
03/25/24 22:26:18 slot1_1: Remote job ID is 2728.0
03/25/24 22:26:18 slot1_1: Got universe "VANILLA" (5) from request classad
03/25/24 22:26:18 slot1_1: State change: claim-activation protocol successful
03/25/24 22:26:18 slot1_1: Changing activity: Idle -> Busy

A failure to activate the claim (i.e. start a job) should show some different entries.

If that's not informative, then look at the ShadowLog on the AP.
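
The StartLog check described above can be sketched as a small script. This is only a rough illustration, not an HTCondor tool: the function name unactivated_claims is hypothetical, the two message strings are copied from the log excerpts in this thread, and it assumes that a successful activation always shows up as "Changing activity: Idle -> Busy" before the next claim on the same slot. Claims that start from a state other than Owner are not considered.

```python
import re
from typing import Iterable, List, Tuple

# StartLog line layout as seen in this thread: "MM/DD/YY HH:MM:SS slotname: message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (slot[\w.]+): (.*)$")

# Message strings copied from the excerpts in this thread (assumed to be stable).
CLAIMED = "Changing state: Owner -> Claimed"
ACTIVATED = "Changing activity: Idle -> Busy"


def unactivated_claims(lines: Iterable[str], slot: str) -> List[Tuple[str, List[str]]]:
    """Return (claim timestamp, messages logged after the claim) for every
    claim on `slot` that was not followed by Idle -> Busy before the next
    claim on that slot (or before the end of the input)."""
    results = []
    current = None  # [timestamp, messages since claim, activated?]
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match or match.group(2) != slot:
            continue  # other slots, daemon-level lines, continuation lines
        timestamp, _, message = match.groups()
        if message == CLAIMED:
            # A new claim closes the previous one; report it if it never went Busy.
            if current is not None and not current[2]:
                results.append((current[0], current[1]))
            current = [timestamp, [], False]
        elif current is not None:
            if message == ACTIVATED:
                current[2] = True
            else:
                current[1].append(message)
    # The last claim in the log may also be unactivated.
    if current is not None and not current[2]:
        results.append((current[0], current[1]))
    return results


if __name__ == "__main__":
    import sys

    # Usage: python3 script.py <path to StartLog> <slot name, e.g. slot1_3>
    with open(sys.argv[1], errors="replace") as log:
        for timestamp, messages in unactivated_claims(log, sys.argv[2]):
            print(f"{timestamp} claimed, never went Busy")
            for message in messages:
                print(f"    {message}")
```

The messages printed under each claim are the entries that differ from a normal job start, which is where a failed activation would be expected to show up.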

 - Jaime

> On Mar 21, 2024, at 8:59 AM, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
> 
> Hi all,
> 
> and another question/observation - we have noticed odd behaviour on one of our EPs [1]. The node seems to have collapsed into a black hole three weeks ago.
> I.e., all StarterLog.slot* activity stopped around March 1st [2]. However, the startd has been accepting and "starting" jobs all along [3], sending the jobs to their doom.
> 
> I have not yet found a smoking gun in the master or startd logs (unfortunately, our log replication does not reach back to the beginning of March).
> Has anybody observed something similar?
> 
> Cheers,
>  Thomas
> 
> 
> [1]
> condor-9.0.8-1.el7.x86_64
> condor-boinc-7.16.16-1.el7.x86_64
> condor-classads-9.0.8-1.el7.x86_64
> condor-externals-9.0.8-1.el7.x86_64
> condor-procd-9.0.8-1.el7.x86_64
> htcondor-ce-client-5.1.3-1.el7.noarch
> python2-condor-9.0.8-1.el7.x86_64
> python3-condor-9.0.8-1.el7.x86_64
> 
> 
> [2]
> [root@batch0653 ~]# ls -alltr /var/log/condor/StarterLog* | tail -n 5
> -rw-r--r-- 1 25411 1000  4992974 Mar  1 22:51 /var/log/condor/StarterLog.slot1_6
> -rw-r--r-- 1 25411 1000  1928326 Mar  1 23:36 /var/log/condor/StarterLog.slot1_3
> -rw-r--r-- 1 25411 1000  5323270 Mar  2 04:47 /var/log/condor/StarterLog.slot1_8
> -rw-r--r-- 1 25411 1000  5730429 Mar  2 05:56 /var/log/condor/StarterLog.slot1_7
> -rw-r--r-- 1 25411 1000  3578995 Mar  2 07:28 /var/log/condor/StarterLog.slot1_10
> 
> [root@batch0653 condor]# stat StarterLog.slot1_3
>  File: ‘StarterLog.slot1_3’
>  Size: 1928326   	Blocks: 3776       IO Block: 4096   regular file
> Device: 806h/2054d	Inode: 524483      Links: 1
> Access: (0644/-rw-r--r--)  Uid: (25411/ UNKNOWN)   Gid: ( 1000/ UNKNOWN)
> Access: 2024-03-21 14:05:38.397796356 +0100
> Modify: 2024-03-01 23:36:56.630725665 +0100
> Change: 2024-03-01 23:36:56.630725665 +0100
> Birth: -
> 
> [3]
> [root@batch0653 condor]#  grep "slot1_3" StartLog | grep "Owner -> Claimed"  | head -n 3
> 03/21/24 14:36:47 slot1_3: Changing state: Owner -> Claimed
> 03/21/24 14:37:13 slot1_3: Changing state: Owner -> Claimed
> 03/21/24 14:37:39 slot1_3: Changing state: Owner -> Claimed
> [root@batch0653 condor]# grep "slot1_3" StartLog | grep "Owner -> Claimed"  | wc -l
> 45
>