
[HTCondor-users] Starter Log not getting updated with jobs (nominally) started on the slot



Hi all,

and another question/observation - we have noticed odd behaviour on one of our EPs [1]. The node seems to have collapsed into a black hole three weeks ago, i.e., all StarterLog.slot* activity stopped around March 1st [2]. However, the startd has been accepting and "starting" jobs all along [3], sending them to their doom.

I have not yet found a smoking gun in the master or startd logs (unfortunately, our log replication does not reach back to the beginning of March).
Has anybody observed something similar?

Cheers,
  Thomas


[1]
condor-9.0.8-1.el7.x86_64
condor-boinc-7.16.16-1.el7.x86_64
condor-classads-9.0.8-1.el7.x86_64
condor-externals-9.0.8-1.el7.x86_64
condor-procd-9.0.8-1.el7.x86_64
htcondor-ce-client-5.1.3-1.el7.noarch
python2-condor-9.0.8-1.el7.x86_64
python3-condor-9.0.8-1.el7.x86_64


[2]
[root@batch0653 ~]# ls -alltr /var/log/condor/StarterLog* | tail -n 5
-rw-r--r-- 1 25411 1000 4992974 Mar 1 22:51 /var/log/condor/StarterLog.slot1_6
-rw-r--r-- 1 25411 1000 1928326 Mar 1 23:36 /var/log/condor/StarterLog.slot1_3
-rw-r--r-- 1 25411 1000 5323270 Mar 2 04:47 /var/log/condor/StarterLog.slot1_8
-rw-r--r-- 1 25411 1000 5730429 Mar 2 05:56 /var/log/condor/StarterLog.slot1_7
-rw-r--r-- 1 25411 1000 3578995 Mar 2 07:28 /var/log/condor/StarterLog.slot1_10

[root@batch0653 condor]# stat StarterLog.slot1_3
  File: ‘StarterLog.slot1_3’
  Size: 1928326   	Blocks: 3776       IO Block: 4096   regular file
Device: 806h/2054d	Inode: 524483      Links: 1
Access: (0644/-rw-r--r--)  Uid: (25411/ UNKNOWN)   Gid: ( 1000/ UNKNOWN)
Access: 2024-03-21 14:05:38.397796356 +0100
Modify: 2024-03-01 23:36:56.630725665 +0100
Change: 2024-03-01 23:36:56.630725665 +0100
 Birth: -

[3]
[root@batch0653 condor]# grep "slot1_3" StartLog | grep "Owner -> Claimed" | head -n 3
03/21/24 14:36:47 slot1_3: Changing state: Owner -> Claimed
03/21/24 14:37:13 slot1_3: Changing state: Owner -> Claimed
03/21/24 14:37:39 slot1_3: Changing state: Owner -> Claimed
[root@batch0653 condor]# grep "slot1_3" StartLog | grep "Owner -> Claimed" | wc -l
45
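
For anyone who wants to check other EPs for the same symptom, the comparison made in [2] and [3] could be scripted roughly as follows. This is only a sketch under some assumptions: the function name stale_slots and the age threshold are made up for illustration, and it assumes the per-slot starter logs and the StartLog sit in one directory with the file names shown above. It is also a coarse heuristic, since it counts every claim still present in the StartLog and does not check whether a claim happened after the StarterLog's last write.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of HTCondor): list slots whose StarterLog
# has not been written for a while although the StartLog records claims.
# Usage: stale_slots LOG_DIR MAX_AGE_MINUTES
# Output: one line per suspicious slot, "<slot> <number of claims in StartLog>"
stale_slots() {
    local log_dir=$1 max_age_min=$2
    local f slot claims

    # -mmin +N selects files last modified more than N minutes ago
    find "$log_dir" -maxdepth 1 -name 'StarterLog.slot*' -mmin +"$max_age_min" |
    sort |
    while read -r f; do
        # /var/log/condor/StarterLog.slot1_3 -> slot1_3
        slot=${f##*/StarterLog.}

        # Count "Owner -> Claimed" transitions logged for exactly this slot.
        # Rotated files (e.g. a ".old" suffix) yield a name that never
        # appears in the StartLog, so they drop out with zero claims.
        claims=$(grep -cF -- "${slot}: Changing state: Owner -> Claimed" \
                 "$log_dir/StartLog" 2>/dev/null || true)

        if [ "${claims:-0}" -gt 0 ]; then
            printf '%s %s\n' "$slot" "$claims"
        fi
    done
}
```

It would be called as, e.g., stale_slots /var/log/condor 1440 to flag slots whose StarterLog is older than a day. An idle slot with old claims in the StartLog would be flagged too, so any hit still needs a manual look.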

