
[HTCondor-users] Job Scheduling issue in 8.8.5 version



Hello Experts,

We are seeing an issue where one job in a batch remains idle even though resources are available in the cluster. This started after the update to 8.8.5; we never saw this behavior with version 8.5.8.

We are using scheduler-level splitting of slots.

# condor_config_val CLAIM_PARTITIONABLE_LEFTOVERS
true

Whenever this issue occurred, we noticed "Request was NOT accepted for claim" in the SchedLog. I believe this indicates that one failed attempt was made; another attempt followed roughly 21 minutes later, and this time the job started running.

# grep '2290171.0' /var/log/condor/SchedLog
08/27/21 00:22:30 (pid:9386) job_transforms for 2290171.0: 1 considered, 1 applied (SetTestTeam)
08/27/21 00:22:44 (pid:9386) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxxxxxxxx<xx.xx.84.175:9618?addrs=xx.xx.84.175-9618&noUDP&sock=7226_0371_3> for testuser1 2290171.0
08/27/21 00:22:44 (pid:9386) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxx<xx.xx.84.175:9618?addrs=xx.xx.84.175-9618&noUDP&sock=7226_0371_3> for testuser1, 2290171.0) deleted
08/27/21 00:43:40 (pid:9386) Starting add_shadow_birthdate(2290171.0)
08/27/21 00:43:40 (pid:9386) Started shadow for job 2290171.0 on slot1@xxxxxxxxxxxxxxxxxxxxxxx<xx.xx.84.31:9618?addrs=xx.xx.84.31-9618&noUDP&sock=56704_ce58_3> for testuser1, (shadow pid = 1946817)
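
To put a number on the retry delay, the gap between the rejected claim and the eventual shadow start can be computed from the two timestamps. This is only a minimal Python sketch: it assumes the "MM/DD/YY HH:MM:SS" line prefix seen in the excerpt above, the slot addresses are abbreviated to "slot1@...", and `stamp` and `retry_delay` are hypothetical helper names, not part of HTCondor.

```python
from datetime import datetime

# Two lines abbreviated from the SchedLog excerpt above (slot address shortened).
LOG = """\
08/27/21 00:22:44 (pid:9386) Request was NOT accepted for claim slot1@... for testuser1 2290171.0
08/27/21 00:43:40 (pid:9386) Started shadow for job 2290171.0 on slot1@... for testuser1, (shadow pid = 1946817)
"""


def stamp(line):
    """Parse the leading 'MM/DD/YY HH:MM:SS' timestamp (first 17 characters)."""
    return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")


def retry_delay(log_text):
    """Return the time between the first rejected claim and the shadow start.

    Returns None if either event is missing from the text.
    """
    rejected = started = None
    for line in log_text.splitlines():
        if rejected is None and "Request was NOT accepted" in line:
            rejected = stamp(line)
        elif "Started shadow for job" in line:
            started = stamp(line)
    if rejected is None or started is None:
        return None
    return started - rejected


delay = retry_delay(LOG)
print(delay)  # 0:20:56
```

For the job above this gives just under 21 minutes, which matches the delay described.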

What can we do to speed up job matchmaking after the first failed attempt?



Thanks & Regards,
Vikrant Aggarwal