
Re: [Condor-users] killing globus-job-managers

On Wed, 19 Jul 2006, Michael Thomas wrote:

Once again I started seeing high loads on my gatekeeper due to a large
number of globus-job-manager processes.

I started to kill some of the older (> 1 day) g-j-m processes and saw an
immediate reduction in the system load, as I had expected.

Often you just need to find the one that is hung, and then the rest
of them will clear out on their own.
This is especially true when you run ps auxwww and see some in state D,
waiting on NFS I/O.
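
A rough sketch of how that search could be scripted, in Python. It assumes
a procps-style ps that supports the "etimes" (elapsed seconds) column; the
function names and the one-day threshold are illustrative, not part of
Globus or Condor. It only lists candidates and kills nothing.

```python
import subprocess

ONE_DAY = 24 * 60 * 60  # threshold in seconds; an assumption, tune to taste


def list_processes():
    """Return raw ps output: pid, state, elapsed seconds, full command."""
    # Assumes procps-style ps; "etimes" is missing on some older systems.
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,etimes,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_stale_job_managers(ps_output, min_age=ONE_DAY,
                            name="globus-job-manager"):
    """Pick out job managers that are old or stuck in state D."""
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        # Skip the header row and anything that does not parse cleanly.
        if len(fields) < 4 or not fields[0].isdigit() or not fields[2].isdigit():
            continue
        pid, stat, age, args = int(fields[0]), fields[1], int(fields[2]), fields[3]
        if name not in args:
            continue
        # State D is uninterruptible sleep, typically blocked on I/O.
        if stat.startswith("D") or age >= min_age:
            stale.append((pid, stat, age, args))
    # D-state processes first, then oldest first.
    stale.sort(key=lambda p: (not p[1].startswith("D"), -p[2]))
    return stale

# Typical use: for p in find_stale_job_managers(list_processes()): print(p)
```

Sorting the D-state processes to the top reflects the point above: the hung
one is usually the one worth looking at first.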

My question:  Is it ok to start arbitrarily killing some of these g-j-m
processes?  What effect will it have on the corresponding jobs?

I usually look to see if the corresponding job has already exited the
queue. If so, there's no harm in killing it.
Even if the job hasn't exited, Condor-G will start another
jobmanager when it needs to.

Would it be better/equivalent to condor_rm some of the g-j-m jobs (which
are easily identified by their command:  data --dest-url=http://...)?
What effect will condor_rm'ing these jobs have for the user?

These are grid monitor jobs.  They should never, under any circumstances,
last more than one hour.  If they do, something is really wrong.
Cancelling them will have no effect on whether the user's jobs execute,
only on what is reported to his Condor-G client.
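
The one-hour rule can be checked mechanically. The Python sketch below
assumes you have already collected (job id, start time, command line)
tuples for the running jobs, for example from condor_q output; how they are
collected is left out. The function name and the tuple layout are
hypothetical. The "--dest-url=" marker is taken from the command line
quoted above.

```python
import time

GRID_MONITOR_MARKER = "--dest-url="  # seen in the grid monitor command line
MAX_RUNTIME = 60 * 60                # one hour, in seconds


def overdue_grid_monitors(jobs, now=None, limit=MAX_RUNTIME):
    """Return (job_id, runtime_seconds) for grid monitors past the limit.

    jobs is an iterable of (job_id, start_epoch, command_line) tuples;
    this layout is an assumption made for the example.
    """
    if now is None:
        now = time.time()
    overdue = []
    for job_id, start_epoch, command_line in jobs:
        # Anything without the marker is not treated as a grid monitor.
        if GRID_MONITOR_MARKER not in command_line:
            continue
        runtime = now - start_epoch
        if runtime > limit:
            overdue.append((job_id, int(runtime)))
    return overdue
```

Anything this returns is a candidate for condor_rm, and a hint to go look
at the submitting machine.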

Some users are correctly setting
processes.  Is there an equivalent setting that I can use on the
gatekeeper to limit the number of g-j-m processes launched by any given

Condor-G should be doing the right thing even if that setting isn't
being used, and only running one per user at a time.
You can use the START_LOCAL_UNIVERSE setting
if you are using the managedfork job manager, which you must be,
given what you are saying here.  That controls how many can
start at once.
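
As an illustration only, such a limit might look like the following in the
gatekeeper's Condor configuration. The value 20 is an arbitrary example,
and the expression assumes your Condor version provides the
TotalLocalJobsRunning attribute; check the manual for your release before
using it.

```
# Local Condor configuration on the gatekeeper (illustrative value).
# Local universe jobs start only while fewer than 20 are already running.
START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 20
```

The schedd has to reread its configuration (for example via
condor_reconfig) before a change like this takes effect.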

But if there are that many grid monitor jobs getting hung,
then there's some bad issue on the client machine that is sending them
to you.  Those jobs don't hang on their own.  Likely culprits are
iptables, a quota, or something similar.



Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team