
Re: [HTCondor-users] Shadow exit with status 108 (Condor Version 9.0.17)



You'll have to look at the ShadowLog on the submit node and the StartLog on the execute node to get details on why the attempts to claim failed. This should not be happening frequently in normal circumstances.

One explanation, which you hint at, is that the slot is being matched to another job before the schedd can finish fully claiming it. This can happen if the negotiator runs a new matchmaking cycle quickly (while the schedd is still trying to start additional jobs on a partitionable slot).

 - Jaime

On Dec 12, 2023, at 12:20 PM, Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello Experts,

108 JOB_NOT_STARTED Can't connect to startd or request refused

With scheduler level splitting, we are noticing that JobRunCount keeps increasing because of the 108 error. It happens often on various submit nodes. It doesn't have an impact on the job runtime, but is it expected to see this happening frequently?

It looks like some other job used the claim. Is that why this job was forced to delete the claim? Since the job hasn't started running yet, nothing conclusive can be determined from the worker node logs.


12/12/23 11:06:25 (pid:551974) job_transforms for 12104372.1: 2 considered, 2 applied (SetTeam,SetWaitForSec)
12/12/23 11:15:30 (pid:551974) match (slot1@xxxxxxxxxxxxxxxxxxx <xx.xx.140.191:9618?addrs=xx.xx.140.191-9618&alias=test155.example.com&noUDP&sock=startd_692440_d2e3> for testuser) switching to job 12104372.1
12/12/23 11:15:30 (pid:551974) Shadow pid 3126758 switching to job 12104372.1.
12/12/23 11:15:30 (pid:551974) Starting add_shadow_birthdate(12104372.1)
12/12/23 11:15:30 (pid:551974) Match record (slot1@xxxxxxxxxxxxxxxxxxx <xx.xx.140.191:9618?addrs=xx.xx.140.191-9618&alias=test155.example.com&noUDP&sock=startd_692440_d2e3> for testuser, 12104372.1) deleted
12/12/23 11:15:30 (pid:551974) Shadow pid 3126758 for job 12104372.1 exited with status 108
12/12/23 11:23:11 (pid:551974) Starting add_shadow_birthdate(12104372.1)
12/12/23 11:23:11 (pid:551974) Started shadow for job 12104372.1 on slot1@xxxxxxxxxxxxxxxxxxx <xx.xx.xx.xx:9618?addrs=xx.xx.140.126-9618&alias=test126.example.com&noUDP&sock=startd_3028316_784f> for testuser, (shadow pid = 3159624)
12/12/23 11:54:37 (pid:551974) Shadow pid 3159624 for job 12104372.1 exited with status 115
12/12/23 11:54:37 (pid:551974) Match record (slot1@xxxxxxxxxxxxxxxxxxx <xx.xx.xx.xx:9618?addrs=xx.xx.140.126-9618&alias=test126.example.com&noUDP&sock=startd_3028316_784f> for testuser, 12104372.1) deleted
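
To get a rough sense of how often this happens per job, log lines of the form shown above can be tallied with a short script. This is only a sketch: it assumes the "Shadow pid ... for job ... exited with status ..." wording seen in this excerpt, and the function name and the log path passed on the command line are hypothetical.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
#   "... Shadow pid 3126758 for job 12104372.1 exited with status 108"
# The wording is taken from the excerpt above and is an assumption
# about the log format in general.
SHADOW_EXIT = re.compile(
    r"Shadow pid (?P<pid>\d+) for job (?P<job>\d+\.\d+) "
    r"exited with status (?P<status>\d+)"
)


def count_shadow_exits(lines, status=108):
    """Return a Counter mapping job id -> number of shadow exits with `status`."""
    counts = Counter()
    for line in lines:
        match = SHADOW_EXIT.search(line)
        if match and int(match.group("status")) == status:
            counts[match.group("job")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_108.py <path to the schedd log>
    with open(sys.argv[1], errors="replace") as handle:
        for job, n in count_shadow_exits(handle).most_common():
            print(f"{job}\t{n}")
```

Jobs with a high count are the ones worth tracing through the ShadowLog and StartLog around the same timestamps.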



Thanks & Regards,
Vikrant Aggarwal