
[HTCondor-users] HT Condor 8.6.13 on Windows. Zombie process after suspend



I have a zombie process left running on a Windows 10 machine after, apparently, a suspend. The StarterLog for this slot is below.
I call them "zombies" because condor cannot kill them via condor_rm; I have to kill the job manually on the machine.
One question is why jobs would be suspended at all when I have WANT_SUSPEND=FALSE in condor_config.local (to override the WANT_SUSPEND set by use POLICY:Desktop).
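
To make the intended override concrete, here is a minimal sketch of the setup described above. It is illustrative only and not copied from the files on this machine; in particular, placing the policy metaknob in condor_config is an assumption.

```
# condor_config (assumed location of the policy metaknob)
use POLICY : Desktop

# condor_config.local (read after condor_config, so this later
# definition is expected to replace the one from the metaknob)
WANT_SUSPEND = FALSE
```

Running `condor_config_val -v WANT_SUSPEND` on the execute machine should show the effective value and which file defined it, which would confirm whether the override is actually in effect.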

08/01/19 19:03:37 (pid:18072) Starting a VANILLA universe job with ID: 186.96
08/01/19 19:03:37 (pid:18072) Tracking process family by login "condor-slot3"
08/01/19 19:03:37 (pid:18072) IWD: D:\condor\execute\dir_18072
08/01/19 19:03:37 (pid:18072) Output file: D:\condor\execute\dir_18072\_condor_stdout
08/01/19 19:03:37 (pid:18072) Error file: D:\condor\execute\dir_18072\_condor_stderr
08/01/19 19:03:37 (pid:18072) Renice expr "10" evaluated to 10
08/01/19 19:03:37 (pid:18072) About to exec D:\condor\execute\dir_18072\condor_exec.bat PW_InfPlane_NotEliminated.xml 2019.000 X86_64_WINDOWS solver solver_error_PW_InfPlane_NotEliminated_mswin64_2019.000.xml.txt solver_output_PW_InfPlane_NotEliminated_mswin64_2019.000.xml.txt -runraytracingsolver -runinternalfesolver -runrayonsolver -runinternalacousticfesolver -runpemsolver -runcfdsolver -runparametervariation -runmontecarlo -runoptimization -va1file PW_InfPlane_NotEliminated_mswin64_2019.000.va1 PW_InfPlane_NotEliminated.xml
08/01/19 19:03:37 (pid:18072) Running job as user condor-slot3
08/01/19 19:03:37 (pid:18072) Executable is a batch file, running: "C:\WINDOWS\system32\cmd.exe" /Q /C "D:\condor\execute\dir_18072\condor_exec.bat"
08/01/19 19:03:37 (pid:18072) Create_Process succeeded, pid=41260
08/01/19 19:06:36 (pid:18072) Suspending all jobs.
08/01/19 19:07:09 (pid:18072) Continuing all jobs.
08/01/19 19:07:09 (pid:18072) Result of "signal_family" operation from ProcD: ERROR: No family with the given PID is registered
08/01/19 19:07:09 (pid:18072) error continuing family in VanillaProc::Continue()
08/01/19 19:07:09 (pid:18072) Result of "get_usage" operation from ProcD: ERROR: No family with the given PID is registered
08/01/19 19:07:09 (pid:18072) error getting family usage in VanillaProc::PublishUpdateAd() for pid 41260
08/01/19 19:07:09 (pid:18072) condor_write() failed: send() 129 bytes to <127.0.0.1:64203> returned -1, timeout=0, errno=10054 .
08/01/19 19:07:09 (pid:18072) Buf::write(): condor_write() failed
08/01/19 19:07:09 (pid:18072) Got SIGQUIT. Performing fast shutdown.
08/01/19 19:13:45 (pid:18072) Result of "get_usage" operation from ProcD: ERROR: No family with the given PID is registered
08/01/19 19:13:45 (pid:18072) error getting family usage in VanillaProc::PublishUpdateAd() for pid 41260
08/01/19 19:13:45 (pid:18072) Result of "get_usage" operation from ProcD: ERROR: No family with the given PID is registered
08/01/19 19:13:45 (pid:18072) error getting family usage in VanillaProc::PublishUpdateAd() for pid 41260
08/01/19 19:18:45 (pid:18072) Result of "get_usage" operation from ProcD: ERROR: No family with the given PID is registered
08/01/19 19:18:45 (pid:18072) error getting family usage in VanillaProc::PublishUpdateAd() for pid 41260
... etc ... etc.....

Andrew