
Re: [HTCondor-users] Unexpected job preemption on pslots



Hello everyone,

On Tue, 2023-06-20 at 14:29 -0500, Todd Tannenbaum wrote:
It may be helpful to look in the ShadowLog file to see what a hung shadow was attempting to do before it got killed.  For instance, if the schedd log says it is killing pid 1221143 because it appears hung, try running grep 1221143 `condor_config_val ShadowLog`, since each line in the ShadowLog is prefaced with the pid of the shadow.  You may also want to (temporarily) increase the logging level of the shadow by placing SHADOW_DEBUG=D_FULLDEBUG in the config file.

Very good hint.  Once the cause was found, the issue was easily fixed ...

So here is the actual smoking gun:

Jun 21 07:14:32 msched condor_shadow[1334572]: ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.98.76.53:32077> (try 1 of 3): SECMAN:2006:AES not supported for UDP
Jun 21 07:14:37 msched condor_shadow[1334572]: ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.98.76.53:32077> (try 2 of 3): SECMAN:2006:AES not supported for UDP|SECMAN:2006:AES not supported for UDP
Jun 21 07:14:42 msched condor_shadow[1334572]: ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.98.76.53:32077> (try 3 of 3): SECMAN:2006:AES not supported for UDP|SECMAN:2006:AES not supported for UDP|SECMAN:2006:AES not supported for UDP
Jun 21 07:34:38 msched condor_shadow[1334572]: ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.98.76.53:32077> (try 1 of 3): SECMAN:2006:AES not supported for UDP
Jun 21 07:34:43 msched condor_shadow[1334572]: ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.98.76.53:32077> (try 2 of 3): SECMAN:2006:AES not supported for UDP|SECMAN:2006:AES not supported for UDP
Jun 21 07:34:48 msched condor_shadow[1334572]: ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.98.76.53:32077> (try 3 of 3): SECMAN:2006:AES not supported for UDP|SECMAN:2006:AES not supported for UDP|SECMAN:2006:AES not supported for UDP
Jun 21 07:54:44 msched condor_shadow[1334572]: ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.98.76.53:32077> (try 1 of 3): SECMAN:2006:AES not supported for UDP
Jun 21 07:54:49 msched condor_shadow[1334572]: ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.98.76.53:32077> (try 2 of 3): SECMAN:2006:AES not supported for UDP|SECMAN:2006:AES not supported for UDP
Jun 21 07:54:54 msched condor_shadow[1334572]: ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.98.76.53:32077> (try 3 of 3): SECMAN:2006:AES not supported for UDP|SECMAN:2006:AES not supported for UDP|SECMAN:2006:AES not supported for UDP
Jun 21 07:56:42 msched condor_schedd[1314004]: Shadow pid 1334572 successfully killed because the Shadow was hung.
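
For anyone chasing the same symptom, here is a minimal sketch of how the pattern above could be pulled out of a log automatically. It assumes the syslog-style line format shown in the excerpt; the function name summarize_hung_shadows and both regular expressions are my own illustration, not part of HTCondor, and a native ShadowLog or SchedLog may be formatted differently.

```python
import re
from collections import defaultdict

# Patterns modeled on the syslog lines quoted above (an assumption:
# other logging setups may format these messages differently).
FAILED_ALIVE = re.compile(
    r"condor_shadow\[(?P<pid>\d+)\]: ChildAliveMsg: "
    r"failed to send DC_CHILDALIVE.*\(try \d+ of \d+\)"
)
KILLED = re.compile(
    r"condor_schedd\[\d+\]: Shadow pid (?P<pid>\d+) "
    r"successfully killed because the Shadow was hung"
)


def summarize_hung_shadows(lines):
    """Map each shadow pid killed as hung to its count of failed keepalives."""
    failures = defaultdict(int)
    killed = set()
    for line in lines:
        match = FAILED_ALIVE.search(line)
        if match:
            failures[int(match.group("pid"))] += 1
            continue
        match = KILLED.search(line)
        if match:
            killed.add(int(match.group("pid")))
    # Report only shadows the schedd actually killed.
    return {pid: failures.get(pid, 0) for pid in sorted(killed)}


# Three lines copied from the excerpt above.
sample = [
    "Jun 21 07:14:32 msched condor_shadow[1334572]: ChildAliveMsg: failed to "
    "send DC_CHILDALIVE to parent daemon at <10.98.76.53:32077> (try 1 of 3): "
    "SECMAN:2006:AES not supported for UDP",
    "Jun 21 07:14:37 msched condor_shadow[1334572]: ChildAliveMsg: failed to "
    "send DC_CHILDALIVE to parent daemon at <10.98.76.53:32077> (try 2 of 3): "
    "SECMAN:2006:AES not supported for UDP|SECMAN:2006:AES not supported for UDP",
    "Jun 21 07:56:42 msched condor_schedd[1314004]: Shadow pid 1334572 "
    "successfully killed because the Shadow was hung.",
]

print(summarize_hung_shadows(sample))  # {1334572: 2}
```

A shadow that shows up here with a non-zero count was failing to send its keepalive before the schedd killed it, which is what happened in my case.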

This was because I messed with an option I was not supposed to mess with:
SEC_DEFAULT_CRYPTO_METHODS = AES

The manual says:
SEC_DEFAULT_CRYPTO_METHODS controls the default setting if no others are specified. [...] it is recommended to leave these settings untouched.

Removing the option fixed the issue.
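
To check whether a pool carries a similar override, something like the following sketch could help. It scans configuration text for explicitly set crypto-method knobs. The helper name find_crypto_overrides is hypothetical, and I am assuming that the related knobs follow the SEC_<context>_CRYPTO_METHODS naming pattern and that the input is plain "NAME = value" lines, for example the contents of a config file.

```python
import re

# Assumed naming pattern: SEC_<context>_CRYPTO_METHODS, e.g. the
# SEC_DEFAULT_CRYPTO_METHODS knob that caused the problem here.
CRYPTO_SETTING = re.compile(r"^\s*(SEC_\w+_CRYPTO_METHODS)\s*=\s*(.*?)\s*$")


def find_crypto_overrides(config_text):
    """Return (knob, value) pairs for crypto-method knobs set in the text."""
    found = []
    for line in config_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out settings
        match = CRYPTO_SETTING.match(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


example = """\
SHADOW_DEBUG = D_FULLDEBUG
SEC_DEFAULT_CRYPTO_METHODS = AES
# SEC_DEFAULT_CRYPTO_METHODS = AES
"""

print(find_crypto_overrides(example))
# [('SEC_DEFAULT_CRYPTO_METHODS', 'AES')]
```

An empty result means no such knob is set explicitly in the text that was scanned, so the defaults apply.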
Thanks to everyone who helped push me in the right direction!
Now we can finally burn some GPU cycles ...

Cheers, Jan

-- 
MAX-PLANCK-INSTITUT fuer Radioastronomie
Jan Behrend - Backend Development Group
----------------------------------------
Auf dem Huegel 69, D-53121 Bonn                                  
Tel: +49 (228) 525 248
http://www.mpifr-bonn.mpg.de
