
Re: [Condor-users] killing globus-job-managers




I should also note that other fork jobs (which in the Open Science
Grid software are managed by Condor as local universe jobs) could be
hanging besides the grid_monitor.sh jobs I referred to in
my previous post.  The only way to know for sure what is hanging is
to find the pid of the condor_starter for the local universe job
and run pstree on it to see what it has spawned.
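
For example, something along these lines (12345 stands in for whatever
starter pid ps reports):

  ps -ef | grep condor_starter | grep -v grep   # find the starter pids
  pstree -p 12345                               # show what that starter spawned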

Also, there is no obvious way that I can see to limit users to one
local universe job each.  Local universe jobs are not affected by group
quotas as far as I know, but maybe the Condor gurus know something
that I don't.


Steve


------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team

On Wed, 19 Jul 2006, Steven Timm wrote:

I usually look to see whether the corresponding job has already left the
queue.  If so, there's no harm in killing it.
Even if the job hasn't exited, Condor-G will start another
jobmanager when it needs to.
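
Something along these lines, for example (<jobid> and <jm_pid> stand in
for the Condor job id and the globus-job-manager pid):

  condor_q <jobid>    # still in the queue?
  kill <jm_pid>       # if it has left the queue, killing the jobmanager is safe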

How do you find the corresponding job?  I didn't see anything obvious in
condor_q -l that would indicate which job they are attached to.  And in
many cases, there are more g-j-m jobs than user jobs.

Look in /var/log/messages.  For every job there are three gridinfo
lines, including one that says:
Jul 19 18:45:16 fngp-osg gridinfo[31672]: JMA 2006/07/19 18:45:16 GATEKEEPER_JM_ID 2006-07-19.18:45:12.0000031649.0000000000 has GRAM_SCRIPT_JOB_ID 450589 manager type managedfork

This ties the Condor job id of the managedfork local universe job,
450589, to the process id of the globus-job-manager process, namely 31672.
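
For example, you can pull those correlations out of the log in bulk, or
start from a known jobmanager pid (31672 is the pid from the line above):

  grep GRAM_SCRIPT_JOB_ID /var/log/messages
  grep 'gridinfo\[31672\]' /var/log/messages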

Also look at /home/<userid>/gram_job_mgr_31672.log;
that may give you some idea as to why the process isn't exiting.
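
For instance (again using 31672, the jobmanager pid from the example above):

  tail -n 50 /home/<userid>/gram_job_mgr_31672.log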


Would it be better/equivalent to condor_rm some of the g-j-m jobs (which
are easily identified by their command:  data --dest-url=http://...)?
What effect will condor_rm'ing these jobs have for the user?
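
For example, something like this lists each job's id, command, and
arguments (ClusterId, ProcId, Cmd, and Args are the standard job
attributes), which makes the grid monitor jobs easy to spot:

  condor_q -format "%d." ClusterId -format "%d " ProcId \
           -format "%s " Cmd -format "%s\n" Args | grep dest-url
  condor_rm <cluster>.<proc>    # to remove one of them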


These are grid monitor jobs.  They should never, under any circumstances,
last more than one hour.  If they do, something is really wrong.

Then something is really wrong.

Cancelling them will have no effect on whether the user's jobs execute
or not, just on what is reported to his Condor-G client.

Then I think I want to be careful about killing them, as accurate
reporting is important for us.

Some users are correctly setting
GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE to limit the number of g-j-m
processes.  Is there an equivalent setting that I can use on the
gatekeeper to limit the number of g-j-m processes launched by any given
user?
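
For reference, on the submitting side that setting is just a line in the
Condor-G configuration; the value here is only an illustration:

  GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 10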


Condor-G should be doing the right thing even if that setting isn't
being used, and only running one per user at a time.
If you are using the managedfork jobmanager, which you must be
given what you are describing here, you can use the setting
START_LOCAL_UNIVERSE.  That controls how many local universe jobs
can start at once.

I don't see how the OSG-recommended value for START_LOCAL_UNIVERSE will
limit the number of grid monitor jobs:

START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 20 || GridMonitorJob == TRUE

Is there some other value that should be here to limit it to one per user?

But if there are that many grid monitor jobs getting hung,
then there's some bad issue on the client machine that is sending them to
you.  Those jobs don't hang on their own; it's usually iptables, a quota,
or something like that.

If this is true, then that is really bad because it means that issues on
the client side can quite easily take down our gatekeeper.  But I
suspect that there is still some configuration problem on our
gatekeeper, because we see these extra g-j-m processes and grid monitor
jobs regardless of the user.  The problem is, I still don't understand
the operation of the grid monitor enough to diagnose this, and haven't
found any decent documentation describing it in detail.  It's quickly
getting frustrating.  :(


The Condor manual has gotten better, but it is still not perfect.
The grid_monitor.sh script is what gets submitted as the monitoring job.
You can submit it manually; in the archives of the OSG lists there are
instructions on how to do that.

But check your iptables.  It is often a culprit in problems like these.
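
For example, a couple of quick things to look at on the gatekeeper (the
iptables path and the userid will vary):

  /sbin/iptables -L -n    # rules that could interfere with the jobs' traffic?
  quota -u <userid>       # is the user over quota?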

Steve


--Mike
