
[Condor-users] Condor jobs leaving behind Windows Desktops and exhausting the Windows heap?



We've hit a strange problem on some of our machines in 2 different
pools this week. This could be a Condor bug, or it might not be. I
thought I'd report it and see whether anyone else has run into it.

The common factors among the machines where the problem has occurred
are: they're all running Windows XP SP2 64-bit, they're all running
Condor 6.8.6, and jobs are run under a set of domain accounts rather
than as the submitting users.

Otherwise it's different server hardware (Dell and HP) and different
numbers of slots (4 on the Dells and 8 on the HPs). The machines have
not necessarily been running the same type of jobs, and they definitely
haven't been running the same jobs from one particular user. They're at
different sites, so it's a pretty safe assumption that the job type,
duration and ownership have been random.

The symptoms are as follows:

Jobs start cycling through the machines very quickly. Only one slot at a
time runs a job, never all slots concurrently. A job lands on the
machine and then a few seconds later leaves the machine. The
StarterLog.vm# logs all say:

4/17 14:28:28 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
4/17 14:29:19 ******************************************************
4/17 14:29:19 ** condor_starter (CONDOR_STARTER) STARTING UP
4/17 14:29:19 ** d:\abc\condor\bin\condor_starter.exe
4/17 14:29:19 ** $CondorVersion: 6.8.6 Sep 13 2007 $
4/17 14:29:19 ** $CondorPlatform: INTEL-WINNT50 $
4/17 14:29:19 ** PID = 164580
4/17 14:29:19 ** Log last touched 4/17 14:28:28
4/17 14:29:19 ******************************************************
4/17 14:29:19 Using config source: \\sj-negotiator\condor\configs\condor_config
4/17 14:29:19 Using local config sources:
4/17 14:29:19    \\sj-negotiator\condor\configs/condor_config.basic
4/17 14:29:19    \\sj-negotiator\condor\configs/os/condor_config.WINNT52
4/17 14:29:19    \\sj-negotiator\condor\configs/site/condor_config.SJ
4/17 14:29:19    \\sj-negotiator\condor\configs/machine/condor_config.SJ-BS5450I-366
4/17 14:29:19    \\sj-negotiator\condor\configs/machine/condor_config.SJ-BS5450I-366.WINNT52
4/17 14:29:19    \\sj-negotiator\condor\configs/patch/condor_config.SJ-BS5450I-366
4/17 14:29:19    \\sj-negotiator\condor\configs/patch/condor_config.SJ-BS5450I-366.WINNT52
4/17 14:29:19 DaemonCore: Command Socket at <137.57.206.98:3750>
4/17 14:29:19 Setting resource limits not implemented!
4/17 14:29:19 Communicating with shadow <137.57.202.107:60503>
4/17 14:29:19 Submitting machine is "sj-schedd1.altera.com"
4/17 14:29:19 File transfer completed successfully.
4/17 14:29:20 Starting a VANILLA universe job with ID: 85808.44
4/17 14:29:20 IWD: d:/abc/condor/execute\dir_164580
4/17 14:29:20 Output file: d:/abc/condor/execute\dir_164580\wrapper.log
4/17 14:29:20 Error file: d:/abc/condor/execute\dir_164580\wrapper.err
4/17 14:29:20 Renice expr "((False =?= True) * 10)" evaluated to 0
4/17 14:29:20 About to exec C:\WINDOWS\system32\cmd.exe /Q /C condor_exec.bat /experiments/miotov/more_regtests/quartus_regtest/db/cut/armstrong_pll/conversion/locked_port
4/17 14:29:20 Create_Process succeeded, pid=54108
4/17 14:29:20 Process exited, pid=54108, status=128
4/17 14:29:20 Got SIGQUIT.  Performing fast shutdown.
4/17 14:29:20 ShutdownFast all jobs.

That is, all the jobs exit with status=128. If you watch the process
monitor it's pretty clear that the condor_starter.exe is the process
dying.
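
To see how widespread the status=128 exits are on a given slot, the
starter log can be scanned for its "Process exited" lines. What follows
is only a minimal sketch, assuming Python is available on an admin
machine and that the log lines look like the excerpt above; the function
name count_exit_statuses is made up for illustration and is not part of
Condor.

```python
import re
from collections import Counter

# Matches starter log lines such as:
#   4/17 14:29:20 Process exited, pid=54108, status=128
# (format assumed from the StarterLog excerpt in this message)
EXIT_RE = re.compile(r"Process exited, pid=(\d+), status=(-?\d+)")


def count_exit_statuses(log_lines):
    """Return a Counter mapping exit status -> number of exited job processes."""
    statuses = Counter()
    for line in log_lines:
        match = EXIT_RE.search(line)
        if match:
            statuses[int(match.group(2))] += 1
    return statuses
```

Passing it an open file works, since a file object iterates over its
lines, e.g. count_exit_statuses(open("StarterLog.vm1")). A count for
status 128 that dwarfs every other status would match the symptom
described here.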

If you try to access those network shares using Start -> Run you get a
pop up on the machine that says:

	\\sj-negotiator\condor
	Not enough server storage is available to process this command

And in the Event Log, right around the time the machine starts cycling
through jobs, there's an error entry for source Win32k that says:

	Failed to create desktop due to heap exhaustion.

If you use the dheapmon tool from Microsoft
(http://www-1.ibm.com/support/docview.wss?ratlid=cctocbody&rs=984&uid=swg21150076)
to look at the heap, you see 109 open desktops on the machine, the
majority belonging to Service-0x1-<somenumbers>\Default -- which I'm
guessing right now are defunct condor_starter processes.

The actual output from dheapmon is long, so I've included it at the end
of this email. We're seeing this happen in clusters of machines, 10 or
so physical boxes at a time. Nothing says this is definitely a Condor
issue, but the open desktops seem highly likely to be related to Condor,
since these machines are inaccessible to users through any other
mechanism and Condor is the only software we run that knows how to
create virtual desktops.

If anyone else has seen this before, I'd appreciate you sharing your
insight into the problem. If you'd like more information from me, let
me know.


- Ian

Here is the dheapmon output on a machine that's in a good state, happily
running jobs. It's a 4 slot machine running Windows XP SP2 64-bit:

I:\local_bin\dheapmon8.1\x64>dheapmon
Desktop Heap Information Monitor Tool (Version 8.1.2925.0)
Copyright (c) Microsoft Corporation.  All rights reserved.
-------------------------------------------------------------
  Session ID:    0 Total Desktop: ( 31520 KB -   17 desktops)

  WinStation\Desktop            Heap Size(KB)    Used Rate(%)
-------------------------------------------------------------
  WinSta0\Default                   20480              0.4
  WinSta0\Disconnect                   96              5.0
  WinSta0\Winlogon                    192              6.9
  Service-0x0-3e7$\Default            768              8.1
  Service-0x0-3e4$\Default            768              2.4
  Service-0x0-3e5$\Default            768              0.6
  SAWinSta\SADesktop                  768              0.6
  REXECD-372\default                  768              0.6
  RSHD-480\default                    768              0.6
  SNMPTrapdService-940\Default        768              0.6
  MKSRlogind-1448\default             768              0.6
  MKSTelnetd-1516\default             768              0.6
  MKSSecureSH-1496\default            768              0.6
  Service-0x0-a7be994$\Default        768              2.0
  Service-0x0-b6ca7c1$\Default        768              1.3
  Service-0x0-b6caf81$\Default        768              1.3
  Service-0x0-b6cc353$\Default        768              1.3
-------------------------------------------------------------

The first 13 desktops, up to and including the MKSSecureSH entry, belong
to system processes. The 4 "Service-0x0" entries at the bottom are
presumably the 4 Condor VMs.

On an 8 slot machine where things are bad dheapmon shows 95 extra
desktops:

D:\dheapmon8.1\x64>dheapmon
Desktop Heap Information Monitor Tool (Version 8.1.2925.0)
Copyright (c) Microsoft Corporation.  All rights reserved.
-------------------------------------------------------------
  Session ID:    0 Total Desktop: (102176 KB -  109 desktops)

  WinStation\Desktop            Heap Size(KB)    Used Rate(%)
-------------------------------------------------------------
  WinSta0\Default                   20480              1.1
  WinSta0\Disconnect                   96              5.0
  WinSta0\Winlogon                    192              4.4
  Service-0x0-3e7$\Default            768              7.3
  Service-0x0-3e4$\Default            768              0.6
  Service-0x0-3e5$\Default            768              1.3
  SAWinSta\SADesktop                  768              0.6
  REXECD-1336\default                 768              0.6
  RSHD-1352\default                   768              0.6
  SNMPTrapdService-1388\Default       768              0.6
  MKSRlogind-1872\default             768              0.6
  MKSSecureSH-1916\default            768              0.6
  MKSTelnetd-1948\default             768              0.6
  Service-0x1-19a20331$\Default       768              0.6
  Service-0x1-19a21377$\Default       768              0.6
  Service-0x1-19a225e9$\Default       768              0.6
  Service-0x1-19a230f6$\Default       768              0.6
  Service-0x1-19a247ba$\Default       768              0.6
  Service-0x1-19cb09d2$\Default       768              0.6
  Service-0x1-bb5f635a$\Default       768              0.6
  Service-0x1-c0110064$\Default       768              0.6
  Service-0x1-c0ab7778$\Default       768              0.6
  Service-0x1-c0b1c117$\Default       768              0.6
  Service-0x1-c0e16209$\Default       768              0.6
  Service-0x1-c1dc2ac0$\Default       768              0.6
  Service-0x1-c1f20def$\Default       768              0.6
  Service-0x1-c624321a$\Default       768              0.6
  Service-0x1-c67c47c6$\Default       768              0.6
  Service-0x1-c7068a3a$\Default       768              0.6
  Service-0x1-c7eb4686$\Default       768              0.6
  Service-0x1-c88d3789$\Default       768              0.6
  Service-0x1-c8b37ad4$\Default       768              0.6
  Service-0x1-c8e58093$\Default       768              0.6
  Service-0x1-cf0f5fb0$\Default       768              0.6
  Service-0x1-d29dbc1d$\Default       768              0.6
  Service-0x1-d3af0f1a$\Default       768              0.6
  Service-0x1-d40984da$\Default       768              0.6
  Service-0x1-d4c87043$\Default       768              0.6
  Service-0x1-d51b4178$\Default       768              0.6
  Service-0x1-dbfd6403$\Default       768              0.6
  Service-0x1-df44a93c$\Default       768              0.6
  Service-0x1-e0c832bb$\Default       768              0.6
  Service-0x1-e231b7d8$\Default       768              0.6
  Service-0x1-e26fbf58$\Default       768              0.6
  Service-0x1-e284e121$\Default       768              0.6
  Service-0x1-e287a58e$\Default       768              0.6
  Service-0x1-e9a13a8a$\Default       768              0.6
  Service-0x1-e9a5922f$\Default       768              0.6
  Service-0x1-e9a5f526$\Default       768              0.6
  Service-0x1-ea1e0f85$\Default       768              0.6
  Service-0x1-ea72c743$\Default       768              0.6
  Service-0x1-ea80d7b3$\Default       768              0.6
  Service-0x1-ea81dd3d$\Default       768              0.6
  Service-0x1-ea947ed5$\Default       768              0.6
  Service-0x1-eab4f86c$\Default       768              0.6
  Service-0x1-eb116b3e$\Default       768              0.6
  Service-0x1-eb2eeea1$\Default       768              0.6
  Service-0x1-eb70040f$\Default       768              0.6
  Service-0x1-ebad7ce5$\Default       768              0.6
  Service-0x1-ebb35777$\Default       768              0.6
  Service-0x1-ec9369f9$\Default       768              0.6
  Service-0x1-ecf5d773$\Default       768              0.6
  Service-0x1-ed2ddc4b$\Default       768              0.6
  Service-0x1-ed4c781f$\Default       768              0.6
  Service-0x1-ee5828b7$\Default       768              0.6
  Service-0x1-eea8811e$\Default       768              0.6
  Service-0x1-ef000be0$\Default       768              0.6
  Service-0x1-f0521103$\Default       768              0.6
  Service-0x1-f058a999$\Default       768              0.6
  Service-0x1-f058e04f$\Default       768              0.6
  Service-0x1-f0828bfc$\Default       768              0.6
  Service-0x1-f0851015$\Default       768              0.6
  Service-0x1-f0e6048b$\Default       768              0.6
  Service-0x1-f0f7b600$\Default       768              0.6
  Service-0x1-f0f7ba3a$\Default       768              0.6
  Service-0x1-f118c985$\Default       768              0.6
  Service-0x1-f1190b59$\Default       768              0.6
  Service-0x1-f13ad17c$\Default       768              0.6
  Service-0x1-f14e3800$\Default       768              0.6
  Service-0x1-f14e420c$\Default       768              0.9
  Service-0x1-f15114a6$\Default       768              0.6
  Service-0x1-f17366b4$\Default       768              0.6
  Service-0x1-f173943d$\Default       768              0.6
  Service-0x1-f1c44276$\Default       768              0.6
  Service-0x1-f4192ea9$\Default       768              0.6
  Service-0x1-f41c5940$\Default       768              0.6
  Service-0x1-f424405b$\Default       768              0.9
  Service-0x1-f50df4c5$\Default       768              0.9
  Service-0x1-fc86b4dd$\Default       768              0.9
  Service-0x1-fdf45a74$\Default       768              0.9
  Service-0x1-fecdef13$\Default       768              0.9
  Service-0x2-155e4ff3$\Default       768              0.9
  Service-0x2-15b35455$\Default       768              0.6
  Service-0x2-15b41796$\Default       768              0.6
  Service-0x2-15b46004$\Default       768              0.6
  Service-0x2-15b4caac$\Default       768              0.6
  Service-0x2-15b60751$\Default       768              0.6
  Service-0x2-15e769d7$\Default       768              0.6
  Service-0x2-18b64a49$\Default       768              0.6
  Service-0x2-1b85c0db$\Default       768              0.6
  Service-0x2-1b91d50f$\Default       768              0.6
  Service-0x2-1c94746f$\Default       768              0.6
  Service-0x2-1ca6120b$\Default       768              0.6
  Service-0x2-1cac8db0$\Default       768              0.6
  Service-0x2-1cc2ff41$\Default       768              0.6
  Service-0x2-1ebde145$\Default       768              0.6
  Service-0x2-1ece1746$\Default       768              0.9
  Service-0x2-1f5c513c$\Default       768              0.9
  Service-0x2-3e7cb559$\Default       768              2.0
-------------------------------------------------------------
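
The per-desktop heap sizes in both listings add up to the reported
totals: 20480 + 96 + 192 + 14 x 768 = 31520 KB on the good machine and
20480 + 96 + 192 + 106 x 768 = 102176 KB on the bad one. For comparing
many machines, a small script can do the same tally. This is a rough
sketch only, assuming Python and the dheapmon 8.1 text layout shown
above; the function names and the grouping by "Service-0x<n>" prefix are
illustrative choices, and that prefix alone does not tell Condor's
desktops apart from the system's own Service-0x0 ones.

```python
import re
from collections import defaultdict

# One table row: "<WinStation>\<Desktop>   <heap KB>   <used %>"
# (layout assumed from the dheapmon listings in this message)
ROW_RE = re.compile(r"^\s*(\S+)\\(\S+)\s+(\d+)\s+(\d+(?:\.\d+)?)\s*$")

# "Service-0x1-19a20331$" -> "Service-0x1"
SERVICE_RE = re.compile(r"(Service-0x[0-9a-fA-F]+)-")


def parse_dheapmon(text):
    """Return a list of (winstation, desktop, heap_kb, used_pct) rows."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            winsta, desktop, heap_kb, used = match.groups()
            rows.append((winsta, desktop, int(heap_kb), float(used)))
    return rows


def summarize(rows):
    """Group rows by window station family -> (desktop count, total heap KB)."""
    groups = defaultdict(lambda: [0, 0])
    for winsta, _desktop, heap_kb, _used in rows:
        match = SERVICE_RE.match(winsta)
        key = match.group(1) if match else winsta
        groups[key][0] += 1
        groups[key][1] += heap_kb
    return {key: tuple(value) for key, value in groups.items()}
```

Feeding it the saved output of dheapmon from each machine and watching
the "Service-0x1" and "Service-0x2" counts grow over time would show
whether desktops are being left behind.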

