
Re: [HTCondor-users] User process tree w/PPID=1 : not valid (runaway/breakaway), confirmed?



Hi,

What does condor_who return when run on the worker node?

Also, is this HTCondor cluster sandboxed by PID namespaces and/or cgroups?

You can check on the worker node with:

~]# condor_config_val USE_PID_NAMESPACES

and

~]# condor_config_val BASE_CGROUP

Cheers, Iain
________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Winnie Lacesso [Winnie.Lacesso@xxxxxxxxxxxxx]
Sent: 16 March 2016 09:44
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] User process tree w/PPID=1 : not valid (runaway/breakaway), confirmed?

Good morning!

I'm extremely new to htcondor, having managed pbs/torque/maui CREAM-CEs
for years. One WN converted to htcondor has a suspiciously (to me) high
load, & in looking at top & tracing PIDs back, 3 process trees owned by a
pool account with PPID=1 show up. They're all using (or trying to use)
100% of a CPU, thus interfering with legit jobs assigned by condor (or
however it's phrased) to that WN.

UID          PID    PPID  C STIME TTY          TIME CMD
cms457   2009833       1  0 Feb26 ?        00:00:05 ./combine -H ProfileLikelihood -t 10 -M HybridNew -m 650 -s 16 -d cut_based_X0ToHHTo2B2L2Nu_BDT_X0_650_VS_TT_DY_ll_m650_13TeV_MJJ-95-135_MVA-0p1_All.dat.root -n CutBased_X0ToHHTo2B2L2Nu_BDT_X0_650_VS_TT_DY_MJJ-95-135_MVA-0p1_All
cms457   2009838       1  0 Feb26 ?        00:00:05 ./combine -H ProfileLikelihood -t 10 -M HybridNew -m 650 -s 16 -d cut_based_X0ToHHTo2B2L2Nu_BDT_X0_650_VS_TT_DY_ll_m650_13TeV_MJJ-95-135_MVA-0p1_All.dat.root -n CutBased_X0ToHHTo2B2L2Nu_BDT_X0_650_VS_TT_DY_MJJ-95-135_MVA-0p1_All
cms457   2009858       1  0 Feb26 ?        00:00:04 ./combine -H ProfileLikelihood -t 10 -M HybridNew -m 650 -s 16 -d cut_based_X0ToHHTo2B2L2Nu_BDT_X0_650_VS_TT_DY_ll_m650_13TeV_MJJ-95-135_MVA-0p1_All.dat.root -n CutBased_X0ToHHTo2B2L2Nu_BDT_X0_650_VS_TT_DY_MJJ-95-135_MVA-0p1_All

root@sm09> pstree -lp 2009858
combine(2009858)---combine(2009924)---combine(2009932)---combine(2009938)---combine(2009947)---combine(2009955)---combine(2668426)---combine(2729346)
root@sm09> pstree -lp 2009833
combine(2009833)---combine(2090922)---combine(2102168)
root@sm09> pstree -lp 2009838
combine(2009838)---combine(3219772)---combine(3499009)---combine(3547343)

So based on years of pbs/torque/maui admin coupled with their PPID=1 (& that
they've been running since Feb26!!!), I think they're breakaway/runaway
process trees, not properly killed or exited by a previous legit htcondor
job, & so should be killed.
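
For spotting such candidates across a node, a minimal sketch follows. It
assumes the text comes from `ps -eo user,pid,ppid,args` (procps column
order), and the function name orphaned_by_user is hypothetical. PPID=1 is
only a hint on its own, so treat the result as a list to investigate, not a
kill list.

```python
def orphaned_by_user(ps_text, user):
    """Return PIDs owned by `user` whose parent is PID 1.

    `ps_text` is assumed to be the output of `ps -eo user,pid,ppid,args`,
    for example obtained with:
        subprocess.run(["ps", "-eo", "user,pid,ppid,args"],
                       capture_output=True, text=True).stdout
    """
    hits = []
    for line in ps_text.strip().splitlines():
        # Split into at most 4 fields so the command line stays in one piece.
        parts = line.split(None, 3)
        # Skip the header row and anything that is not a well-formed entry.
        if len(parts) < 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        owner, pid, ppid = parts[0], int(parts[1]), int(parts[2])
        if owner == user and ppid == 1:
            hits.append(pid)
    return hits
```

Each PID returned is the top of a reparented tree, so pstree -lp on it shows
what hangs underneath.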

My colleague who built the htcondor system here is 99.9% sure that a pool
account process tree with PPID=1 is breakaway/runaway but not 100% sure,
so recommended asking on this list.

If anyone has been a pbs/torque/maui admin & now does htcondor admin & has
built a translation table of "this way on pbs/torque/maui = howto on
htcondor" I'd be VERY grateful for a copy!

In particular, qstat in pbs/torque world gives the PID of the "start" of
any pool account job, so pstree -lp $pid shows the whole process tree.
eg:

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1984254.lcgce04.lhcbpil0 long     cream_190561598 4236 1  1    --  60:00 R 25:04    sm00

root@sm00> pstree -lp 4236
bash(4236)---1984254.lcgce04(4251)---CREAM190561598_(4256)---perl(4314)-+-perl(4316)
                                                                        `-sh(4315)---DIRAC_9ofzuZ_pi(4318)---python(4320)---python(4321)---python(6301)-+-Job127964102(7411)---python2.7(7412)-+-python(7444)-+-sh(7778)---python(7787)---python(7788)---bd2kstarmumu_eo(7812)
                                                                                                                                                        |                                      |              `-{python}(7445)
                                                                                                                                                        |                                      |-{python2.7}(7415)
                                                                                                                                                        |                                      `-{python2.7}(7443)
                                                                                                                                                        `-{python}(6434)

My colleague says he knows of no way (yet) to get that start-of-job PID in
htcondor. Does anyone on this list know how?
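
One possible approach, sketched under a stated assumption: on an execute
node each running job is normally started by a condor_starter process, so
the direct children of condor_starter would be the start-of-job PIDs. That
may not hold with wrappers or PID namespaces. The function name
job_root_pids is hypothetical, and the input is assumed to be the output of
`ps -eo pid,ppid,comm`.

```python
def job_root_pids(ps_text):
    """Map each condor_starter PID to the PIDs of its direct children.

    `ps_text` is assumed to be the output of `ps -eo pid,ppid,comm`.
    """
    rows = []
    for line in ps_text.strip().splitlines():
        parts = line.split(None, 2)
        # Skip the header row and malformed lines.
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        rows.append((int(parts[0]), int(parts[1]), parts[2]))

    # Every process whose command name starts with condor_starter.
    starters = {pid for pid, _ppid, comm in rows
                if comm.startswith("condor_starter")}

    # Direct children of a starter are taken as the top of a job's tree.
    roots = {}
    for pid, ppid, _comm in rows:
        if ppid in starters:
            roots.setdefault(ppid, []).append(pid)
    return roots
```

Feeding any of the returned child PIDs to pstree -lp would then show the
job's tree, much as with the SessID from qstat.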

Grateful for advice+pointers!
PS If my above questions are answered in some online
tutorial/documentation, a URL would be most welcome!

Winnie Lacesso / Bristol University Particle Physics Computing Systems
HH Wills Physics Laboratory, Tyndall Avenue, Bristol, BS8 1TL, UK
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/