
Re: [Condor-users] killing globus-job-managers



Steven Timm wrote:
> On Wed, 19 Jul 2006, Michael Thomas wrote:
> 
> 
>>Once again I started seeing high loads on my gatekeeper due to a large
>>number of globus-job-manager processes.
>>
>>I started to kill some of the older (> 1 day) g-j-m processes and saw an
>>immediate reduction in the system load, as I had expected.
> 
> 
> Oftentimes you just need to find the right one that is hung and
> then all the rest of them will clear out on their own.
> This is especially true when you do ps auxwww and see some in state D,
> waiting on NFS I/O.

Finding the one or two that are hung is not that easy when it appears
that most of them are hung.  pstree doesn't show any tree output, just a
flat list of unrelated globus-job-manager processes.  Even if I manage
to kill 50% of the supposedly hung g-j-m processes, the rest aren't able
to clear out on their own because there are so darned many across all
users, and more g-j-m processes keep coming back.

>>My question:  Is it ok to start arbitrarily killing some of these g-j-m
>>processes?  What effect will it have on the corresponding jobs?
> 
> 
> I usually look to see if the corresponding job has exited the queue 
> already.  If so, there's no harm in killing it.
> Even if the job hasn't exited, Condor-G will start another
> jobmanager when it needs to.

How do you find the corresponding job?  I didn't see anything obvious in
condor_q -l that would indicate which job they are attached to.  And in
many cases, there are more g-j-m jobs than user jobs.

>>Would it be better/equivalent to condor_rm some of the g-j-m jobs (which
>>are easily identified by their command:  data --dest-url=http://...)?
>>What effect will condor_rm'ing these jobs have for the user?
>>
> 
> These are grid monitor jobs.  They should never under any circumstances
> last more than one hour.  If they do, something is really wrong.

Then something is really wrong.
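
For what it's worth, here's a rough way I can check their ages (only a
sketch, assuming the grid monitor jobs in our gatekeeper's queue carry
the GridMonitorJob attribute that the START_LOCAL_UNIVERSE expression
below tests):

  # list grid monitor jobs with owner and queue time; QDate is seconds
  # since the epoch, so anything queued more than an hour ago is suspect
  condor_q -constraint 'GridMonitorJob =?= TRUE' \
           -format '%4d.' ClusterId -format '%-3d  ' ProcId \
           -format '%-10s  ' Owner -format '%d\n' QDate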

> Cancelling them will have no effect on whether the user's jobs execute
> or not, just on what is reported to his Condor-G client.

Then I think I want to be careful about killing them, as accurate
reporting is important for us.

>>Some users are correctly setting
>>GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE to limit the number of g-j-m
>>processes.  Is there an equivalent setting that I can use on the
>>gatekeeper to limit the number of g-j-m processes launched by any given
>>user?
>>
> 
> Condor-G should be doing the right thing even if that setting isn't
> being used, and only running one per user at a time.
> You can use the setting START_LOCAL_UNIVERSE
> if you are using the managedfork job manager, which you must be,
> given what you are saying here.   That controls how many can
> start at once.

I don't see how the OSG-recommended value for START_LOCAL_UNIVERSE will
limit the number of grid monitor jobs:

START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 20 || GridMonitorJob == TRUE

Is there some other value that should be here to limit it to one per user?
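
The only tightening I can see here, and this is only a guess on my part,
would be to drop the exemption so that grid monitor jobs count against
the same limit as everything else, something like:

  # untested guess: stop exempting grid monitor jobs from the throttle
  START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 20

but that still isn't a per-user limit, and I don't know what it would do
to legitimate grid monitor traffic.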

> But if there are that many grid monitor jobs getting hung,
> then there's some bad issue on the client machine that is sending them to 
> you.  Those jobs don't hang on their own.  iptables or quota
> or something.

If this is true, then that is really bad because it means that issues on
the client side can quite easily take down our gatekeeper.  But I
suspect that there is still some configuration problem on our
gatekeeper, because we see these extra g-j-m processes and grid monitor
jobs regardless of the user.  The problem is, I still don't understand
the operation of the grid monitor enough to diagnose this, and haven't
found any decent documentation describing it in detail.  It's quickly
getting frustrating.  :(

--Mike
