Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] killing globus-job-managers

Date: Wed, 19 Jul 2006 16:51:01 -0500 (CDT)
From: Steven Timm <timm@xxxxxxxx>
Subject: Re: [Condor-users] killing globus-job-managers

On Wed, 19 Jul 2006, Michael Thomas wrote:

Once again I started seeing high loads on my gatekeeper due to a large
number of globus-job-manager processes.

I started to kill some of the older (> 1 day) g-j-m processes and saw an
immediate reduction in the system load, as I had expected.


Oftentimes you just need to find the right one that is hung and
then all the rest of them will clear out on their own.
This is especially true when you do ps auxwww and see some in state D,
waiting on nfs I/O.
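
A minimal sketch of how one might pick out the suspects from a process listing, rather than eyeballing ps output. It assumes the text comes from `ps -eo pid,stat,etime,args` (pid, state, elapsed time, command line) and that the processes show up with "globus-job-manager" in their command line; the function names and the one-day threshold are only illustrative, not anything Condor or Globus ships.

```python
def parse_etime(etime):
    """Convert ps 'etime' format [[dd-]hh:]mm:ss to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:          # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_suspects(ps_output, name="globus-job-manager", max_age=86400):
    """Return (pid, state, age_seconds) for matching processes that are
    in state D (uninterruptible I/O wait) or older than max_age seconds.

    ps_output is assumed to be the text of `ps -eo pid,stat,etime,args`,
    including its header line.
    """
    suspects = []
    for line in ps_output.splitlines()[1:]:   # skip the header line
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, stat, etime, args = fields
        if name not in args:
            continue
        age = parse_etime(etime)
        if stat.startswith("D") or age > max_age:
            suspects.append((int(pid), stat, age))
    return suspects
```

Feeding it the captured ps output gives a short list to inspect by hand before killing anything; the ones in state D are the first place to look for the single hung process that is holding up the rest.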


My question:  Is it ok to start arbitrarily killing some of these g-j-m
processes?  What effect will it have on the corresponding jobs?

I usually look to see if the corresponding job has already exited the queue. If so, there's no harm in killing it.

Even if the job hasn't exited, condor-g will restart another
jobmanager when it needs to.


Would it be better/equivalent to condor_rm some of the g-j-m jobs (which
are easily identified by their command:  data --dest-url=http://...)?
What effect will condor_rm'ing these jobs have for the user?

These are grid monitor jobs.  They should never under any circumstance
last more than one hour.  If they do, something is really wrong.

Cancelling them will have no effect on whether the user's jobs execute or not, just on what is reported to his condor-g client.

Some users are correctly setting
GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE to limit the number of g-j-m
processes.  Is there an equivalent setting that I can use on the
gatekeeper to limit the number of g-j-m processes launched by any given
user?

Condor-G should be doing the right thing even if that setting isn't
being used, and only running one per user at a time.
You can use the START_LOCAL_UNIVERSE setting
if you are using the managedfork job manager, which you must be,
given what you are saying here.   That controls how many can
start at once.

But if there are that many grid monitor jobs getting hung,
then there's some bad issue on the client machine that is sending them to
you. Those jobs don't hang on their own. It could be iptables or quota
or something.

Steve

--Mike


--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team
