
Re: [Condor-users] heavy loads



Hi Steve,

strace shows that quite a number of the perl globus-job-manager scripts are hanging in a read operation. Both lsof and /proc confirm that they are blocking while trying to read from a pipe (FIFO), but I am not sure how to figure out what's connected to the other end of the pipe.
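
One approach I may try (a rough sketch; the PID and inode below are
just placeholders) is to pull the pipe's inode out of /proc and then
ask lsof which other processes have that same inode open:

  $ ls -l /proc/12345/fd            # a hung perl PID; look for "3 -> pipe:[98765]"
  $ lsof 2>/dev/null | grep 98765   # everything holding either end of that pipe

Whatever else shows up in the second listing should be the process on
the far end of the FIFO.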

In any case, I've started to kill these hanging perl scripts to see if it helps clear things up.

Streaming is disabled, and the grid monitor is enabled.

Thanks for the tip,

--Mike

Steven Timm wrote:
There are two types of globus-job-manager processes under the version
of Globus that Condor refers to as "gt2".

One is the jobmanager-condor script, which stays alive as long as the
job is alive.

The second is a globus-job-manager-script perl script, which runs
once per minute and is forked off by the main jobmanager-condor.

Run pstree; I think you will see that several of them are children of
the one main process.  I've seen this happen in cases where there is
some problem with NFS and the globus-job-manager is stuck trying to
delete or undelete a hard link across NFS.  strace will tell you if
this is the case.  Oftentimes it is just one process that is hung
waiting for some NFS file, and once you kill that process the rest of
them will clear out.
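
For example (PIDs purely illustrative):

  $ pstree -p | grep globus-job-manager   # find the parent and its perl children
  $ strace -p 4321                        # attach to one suspect child

If strace prints a single read(), stat(), or unlink() and then just
sits there, that process is blocked in the kernel, most likely on NFS,
and is the one to kill first.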

As to why there is at least one globus-job-manager per Condor job,
there are several possible reasons.  Did you disable streaming?
You have to for the grid monitor to work.  Do you see
jobmanager-forks trying to start from your client node to your head node?
Those are needed to start up the grid monitor.  Is ENABLE_GRID_MONITOR
set to TRUE in your condor_config file?  It needs to be.
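
A quick way to check what the daemons actually see (this assumes
condor_config_val is on your PATH, which it is in a standard install):

  $ condor_config_val ENABLE_GRID_MONITOR
  TRUE

If it comes back undefined or FALSE, set it in condor_config and run
condor_reconfig on the submit machine.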

In the archives of this list there is a procedure for starting
up the grid monitor manually, to see if you have any problems
that are blocking the automatic Condor-G start.

Steve


------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team

On Thu, 29 Jun 2006, Michael Thomas wrote:

I forgot to mention that this is using Condor 6.7.18, and there are more than 1300 jobs in the queue right now (all but 200 are idle).

--Mike

Michael Thomas wrote:
While doing some stress testing on our 200-node cluster using Condor-G, we have noticed some extremely large loads on the cluster. The large load seems to be caused by 500+ globus-job-manager processes, sometimes with 2 or 3 globus-job-manager processes for each job.

condor_config contains the line:
GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 10
...but that seems to be ignored.
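
One thing I haven't ruled out is a config precedence problem, e.g. a
local config file overriding that line. A quick sanity check (again
assuming condor_config_val is on the PATH) would be:

  $ condor_config_val GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE

followed by a condor_reconfig in case the gridmanager started before
the setting was added.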

Why would we have multiple globus-job-managers for a single job, and what can we do to reduce the number of globus-job-manager processes so that our gatekeeper doesn't get quite so overloaded?

--Mike
