
[Condor-users] gridmonitor.sh questions.



Hi all,

I'm using Condor-G (6.6 branch) to submit jobs via the globus universe to several machines running a mixture of Globus Toolkit version 2 and version 4, installed in front of a variety of local schedulers (PBS, Condor and SGE). Like many previous posters to this list, I have observed high load on the gatekeeper machines when I have many jobs running or queued. I think I understand how I'm supposed to deal with this (make sure Condor-G uses the gridmonitor on the remote resource), but I have some questions I'm hoping somebody can answer.

When I enable the gridmonitor and submit to a gt2 machine, the gridmonitor runs and the load on the gatekeeper is dramatically reduced. This is good. However, I have observed some odd behavior relating to fork jobs run using the globus-job-run command. Once the gridmonitor is running, a simple command such as globus-job-run xxx.xxx.xxx/jobmanager-fork /bin/date takes up to several minutes to exit (if no gridmonitor is running, it completes within 10 seconds). Question 1: Is this behavior expected when the gridmonitor is in use? Is the slowdown likely to be due to latency associated with polling the status of each job, or is it due to some built-in sleeping in the gridmonitor? Is this something that can be tuned?
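
For anyone who wants to reproduce the timing comparison, here is a minimal sketch of how the wall-clock time of a command could be measured over a few runs. The function name time_command and the repeat count are my own choices for illustration, not part of Condor or Globus; the idea is to run it once with the gridmonitor active and once without, and compare the numbers.

```python
import subprocess
import time


def time_command(argv, repeats=3):
    """Run argv `repeats` times; return the wall-clock seconds of each run."""
    durations = []
    for _ in range(repeats):
        start = time.monotonic()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.monotonic() - start)
    return durations


# Example call (placeholder contact string as in the message above;
# substitute a real gatekeeper host before running):
#   time_command(["globus-job-run", "xxx.xxx.xxx/jobmanager-fork", "/bin/date"])
```

The command's exit status is deliberately ignored, so a failed submission would still be timed; check the status separately if that matters.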

My second problem relates to gt4 machines where we submit jobs via the old-style pre-web-services interface (again using the globus universe). In this case the gridmonitor does not run. Part of the problem is that the gridmonitor expects to write log and lock files to $GLOBUS_LOCATION/tmp; it appears that under gt2 this directory is world-writable, but under gt4 it is writable only by the globus user by default. I have modified the gridmonitor to write its log and lock files to /tmp. The gridmonitor then runs, but it reports the status of jobs incorrectly. Question 2: Is the gridmonitor designed to be used in this (pre-web-services gt4) setup? Without it I see the same high load problems on the gatekeeper.
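
To check the permissions issue on a given gatekeeper, something like the following sketch could be run as the local account that grid jobs are mapped to. It only tests whether $GLOBUS_LOCATION/tmp exists and is writable by the current user; the function names are hypothetical, and it assumes GLOBUS_LOCATION is set in the environment unless a path is passed explicitly.

```python
import os


def writable_dir(path):
    """True if path is a directory the current user can create files in."""
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)


def gridmonitor_tmp_status(globus_location=None):
    """Return (tmp_path, is_writable), or None if no location is known."""
    location = globus_location or os.environ.get("GLOBUS_LOCATION")
    if not location:
        return None
    tmp_path = os.path.join(location, "tmp")
    return tmp_path, writable_dir(tmp_path)
```

A result of (path, False) on a gt4 host and (path, True) on a gt2 host would be consistent with the difference described above.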

I suspect the solution is to upgrade Condor and move to submission via the gt4 web services interface using the grid universe. To do this I will need a convincing case that the gt2 gatekeepers should be updated. Question 3: Does anybody have reliable statistics on how Condor-to-Globus submission scales over the various Globus interfaces (gt2, pre-web-services gt4 and web-services gt4), with and without the gridmonitor?

Cheers,

Andrew



--

Dr Andrew Walker

Department of Earth Sciences
University of Cambridge
Downing Street
Cambridge 
CB2 3EQ
UK

phone +44 (0)1223 333432