
Re: [HTCondor-users] Unexpected job preemption on pslots



On 6/20/2023 5:53 AM, Jan Behrend wrote:
This is what I could find on the schedd, which to me looks like a smoking gun:

Jun 19 06:54:21 msched condor_schedd[1005]: ERROR: Child pid 1211506 appears hung! Killing it hard.
Jun 19 06:54:21 msched condor_schedd[1005]: Shadow pid 1211506 successfully killed because the Shadow was hung.
Jun 19 06:54:21 msched condor_schedd[1005]: Shadow pid 1211506 for job 360.0 exited with status 4

Jun 19 08:57:23 msched condor_schedd[1005]: ERROR: Child pid 1216722 appears hung! Killing it hard.
Jun 19 08:57:23 msched condor_schedd[1005]: Shadow pid 1216722 successfully killed because the Shadow was hung.
Jun 19 08:57:23 msched condor_schedd[1005]: Shadow pid 1216722 for job 360.0 exited with status 4

Jun 19 11:09:23 msched condor_schedd[1005]: ERROR: Child pid 1221143 appears hung! Killing it hard.
Jun 19 11:09:23 msched condor_schedd[1005]: Shadow pid 1221143 successfully killed because the Shadow was hung.
Jun 19 11:09:23 msched condor_schedd[1005]: Shadow pid 1221143 for job 360.0 exited with status 4

Is this a local problem on the schedd machine running the shadow daemon?
At the same time I get this on the execution hosts:

Jun 19 06:54:21 pssproto04 condor_starter[24368]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 08:57:23 pssproto04 condor_starter[7887]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 11:09:23 pssproto04 condor_starter[24220]: Connection to shadow may be lost, will test by sending whoami request.
[...]


The time intervals look too regular for this to be network or resource related, I think.  Any ideas?

Hi Jan,

It may be helpful to look in the ShadowLog file to see what a hung shadow was attempting to do before it got killed.  For instance, if the schedd log says it is killing pid 1221143 because it appears hung, try doing a grep 1221143 `condor_config_val SHADOW_LOG`, since each line in the ShadowLog is prefaced with the pid of the shadow that wrote it.  You may also want to temporarily increase the logging level of the shadow by placing SHADOW_DEBUG = D_FULLDEBUG in the config file.
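A minimal sketch of those two steps, using the pid 1221143 from the schedd log above (adjust the pid and your local config layout as needed):

    # pull out everything shadow pid 1221143 logged before it was killed
    grep 1221143 `condor_config_val SHADOW_LOG`

    # temporarily raise the shadow log level in your local config, e.g.
    #   SHADOW_DEBUG = D_FULLDEBUG
    # then reconfig so newly spawned shadows pick it up
    condor_reconfig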

One reason I have seen shadows hang like the above is that they are reading from or writing to a shared filesystem that has become overloaded, or to an NFS automount that has gone stale.  The shadow is the process that writes the job event logs, and it is also the process that performs HTCondor file transfers.
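As a quick check (just a sketch, using job 360.0 from the excerpts above; the path argument is a placeholder you fill in), you can see where a job's event log lives and whether that path sits on NFS:

    # where does job 360.0 write its user/event log?
    condor_q 360.0 -af UserLog

    # is that path on NFS?  (GNU stat on Linux; %T prints the filesystem type)
    stat -f -c %T /path/printed/above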

Hope the above helps,
Todd