
Re: [Condor-users] killing globus-job-managers

I usually look to see if the corresponding job has exited the queue
already.  If so, there's no harm in killing it.
Even if the job hasn't exited, condor-g will restart another
jobmanager when it needs to.

How do you find the corresponding job?  I didn't see anything obvious in
condor_q -l that would indicate which job they are attached to.  And in
many cases, there are more g-j-m jobs than user jobs.

Look in /var/log/messages. For every job there are three gridinfo lines,
including one that says:

Jul 19 18:45:16 fngp-osg gridinfo[31672]: JMA 2006/07/19 18:45:16 GATEKEEPER_JM_ID 2006-07-19.18:45:12.0000031649.0000000000 has GRAM_SCRIPT_JOB_ID 450589 manager type managedfork

This ties the condor job id of the managedfork local-universe job, 450589,
to the process id of the globus-job-manager process, namely 31672.
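The pid/job-id mapping can be pulled out of such a line with sed; here is a
sketch that parses the sample line quoted above (on a live gatekeeper you
would feed it `grep gridinfo /var/log/messages` instead):

```shell
# Parse a gridinfo line: the bracketed number after "gridinfo" is the
# globus-job-manager PID, and GRAM_SCRIPT_JOB_ID is the condor job id.
# Sample line taken from the log excerpt above.
line='Jul 19 18:45:16 fngp-osg gridinfo[31672]: JMA 2006/07/19 18:45:16 GATEKEEPER_JM_ID 2006-07-19.18:45:12.0000031649.0000000000 has GRAM_SCRIPT_JOB_ID 450589 manager type managedfork'
pid=$(printf '%s\n' "$line" | sed -n 's/.*gridinfo\[\([0-9]*\)\].*/\1/p')
jobid=$(printf '%s\n' "$line" | sed -n 's/.*GRAM_SCRIPT_JOB_ID \([0-9][0-9]*\).*/\1/p')
echo "jobmanager pid=$pid condor job=$jobid"
```

With the pid in hand you can check the matching gram_job_mgr_<pid>.log
before deciding whether to kill the process.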

Also look at /home/<userid>/gram_job_mgr_31672.log;
that may give you some idea as to why the process isn't exiting.

Would it be better/equivalent to condor_rm some of the g-j-m jobs (which
are easily identified by their command:  data --dest-url=http://...)?
What effect will condor_rm'ing these jobs have for the user?
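As a sketch of that identification trick, here is how you might pick such
jobs out of a condor_q -l dump by the telltale "--dest-url=" argument
string (the ClassAd sample below is made up, ids and hostname included;
on a live gatekeeper you would pipe in condor_q -l instead):

```shell
# Grid monitor jobs carry "data --dest-url=..." in their arguments;
# grab the ClusterId line that precedes each match.
# Sample ClassAd output; the ids and hostname are hypothetical.
cat > /tmp/condor_q_l.txt <<'EOF'
ClusterId = 450601
Arguments = "data --dest-url=https://client.example.org:40001/"
ClusterId = 450602
Arguments = "/bin/hostname"
EOF
grep -B1 -- '--dest-url=' /tmp/condor_q_l.txt | grep '^ClusterId'
```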

These are grid monitor jobs.  They should never, under any circumstances,
last more than one hour.  If they do, something is really wrong.

Then something is really wrong.

Cancelling them will have no effect on whether the user's jobs execute
or not, just on what is reported to his condor-g client.

Then I think I want to be careful about killing them, as accurate
reporting is important for us.

Some users are correctly setting
processes.  Is there an equivalent setting that I can use on the
gatekeeper to limit the number of g-j-m processes launched by any given
user?

Condor-G should be doing the right thing even if that setting isn't
being used, and only running one per user at a time.
You can use the START_LOCAL_UNIVERSE setting
if you are using the managedfork job manager, which you must be,
given what you are saying here.  That controls how many local-universe
jobs can start at once.

I don't see how the OSG-recommended value for START_LOCAL_UNIVERSE will
limit the number of grid monitor jobs:

START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 20 || GridMonitorJob == TRUE

Is there some other value that should be here to limit it to one per user?

But if there are that many grid monitor jobs getting hung,
then there's some bad issue on the client machine that is sending them to
you.  Those jobs don't hang on their own.  Suspect iptables or quota
or something similar.

If this is true, then that is really bad because it means that issues on
the client side can quite easily take down our gatekeeper.  But I
suspect that there is still some configuration problem on our
gatekeeper, because we see these extra g-j-m processes and grid monitor
jobs regardless of the user.  The problem is, I still don't understand
the operation of the grid monitor enough to diagnose this, and haven't
found any decent documentation describing it in detail.  It's quickly
getting frustrating.  :(

The condor manual has gotten better, but it's still not perfect.
The grid_monitor.sh script is what gets submitted as the monitoring job.
You can submit it manually.  In the archives of the OSG lists there are
instructions on how to do that.

But check your iptables.  It is often the culprit in problems like these.