
Re: [Condor-users] heavy loads



Hi Steve,

strace shows that quite a number of the perl globus-job-manager scripts are hanging in a read operation. Both lsof and /proc confirm that they are blocking while trying to read from a pipe (FIFO), but I am not sure how to figure out what's connected to the other end of the pipe.
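
One approach I may try (a rough sketch; the PID and inode below are
just placeholders) is to pull the pipe's inode out of /proc and then
ask lsof which other processes have that same inode open:

  $ ls -l /proc/12345/fd            # a hung perl PID; look for "3 -> pipe:[98765]"
  $ lsof 2>/dev/null | grep 98765   # everything holding either end of that pipe

Whatever else shows up in the second listing should be the process on
the far end of the FIFO.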

In any case, I've started to kill these hanging perl scripts to see if it helps clear things up.

Streaming is disabled, and the grid monitor is enabled.

Thanks for the tip,

--Mike

Steven Timm wrote:
There are two types of globus-job-manager processes under the version
of Globus that Condor refers to as "gt2".

One is the jobmanager-condor script, which stays alive as long as the
job is alive.

The second is a globus-job-manager-script perl script, which runs
once per minute and is forked off by the main jobmanager-condor.

Run pstree; I think you will see that several of them are children of
the one main process.  I've seen this happen in cases where there is
some problem with NFS and the globus-job-manager is stuck trying to
delete or undelete a hard link across NFS.  strace will tell you if
this is the case.  Oftentimes it is just one process that is hung
waiting for some NFS file, and once you kill that process the rest of
them will clear out.
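
For example (PIDs purely illustrative):

  $ pstree -p | grep globus-job-manager   # find the parent and its perl children
  $ strace -p 4321                        # attach to one suspect child

If strace prints a single read(), stat(), or unlink() and then just
sits there, that process is blocked in the kernel, most likely on NFS,
and is the one to kill first.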

As to why there is at least one globus-job-manager per Condor job,
there are several possible reasons.  Did you disable streaming?
You have to for the grid monitor to work.  Do you see
jobmanager-forks trying to start from your client node to your head node?
Those are needed to start up the grid monitor.  Is ENABLE_GRID_MONITOR
set to TRUE in your condor_config file?  It needs to be.
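
A quick way to check what the daemons actually see (this assumes
condor_config_val is on your PATH, which it is in a standard install):

  $ condor_config_val ENABLE_GRID_MONITOR
  TRUE

If it comes back undefined or FALSE, set it in condor_config and run
condor_reconfig on the submit machine.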

In the archives of this list there is a procedure for starting
up the grid monitor manually, to see if you have any problems
that are blocking the automatic Condor-G start.

Steve


------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team

On Thu, 29 Jun 2006, Michael Thomas wrote:

I forgot to mention that this is using Condor 6.7.18, and there are more than 1300 jobs in the queue right now (all but 200 are idle).

--Mike

Michael Thomas wrote:
While doing some stress testing on our 200-node cluster using Condor-G, we have noticed some extremely large loads on the cluster. The large load seems to be caused by 500+ globus-job-manager processes, sometimes with 2 or 3 globus-job-manager processes for each job.

condor_config contains the line:
GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 10
...but that seems to be ignored.
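
One thing I haven't ruled out is a config precedence problem, e.g. a
local config file overriding that line. A quick sanity check (again
assuming condor_config_val is on the PATH) would be:

  $ condor_config_val GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE

followed by a condor_reconfig in case the gridmanager started before
the setting was added.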

Why would we have multiple globus-job-managers for a single job, and what can we do to reduce the number of globus-job-manager processes so that our gatekeeper doesn't get quite so overloaded?

--Mike
