Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] schedd getting more than max_jobs_running running

Date: Wed, 25 Mar 2015 13:44:55 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] schedd getting more than max_jobs_running running

Hi Joe,

Re the below, it looks very much like a bug that was fixed back in January. See

  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4554

This bug fix has already appeared in the developer series with HTCondor v8.3.3 and above, and will also appear in the stable series starting with the upcoming HTCondor v8.2.8.

If you are averse to running HTCondor v8.3.x just yet, you have a few options to consider:

1. Wait until HTCondor v8.2.8 is released, which will likely be about 4 to 6 weeks (we typically hold back on stable releases to accumulate enough bug patches to make upgrading worthwhile).

OR

2. We can send you a link to download a v8.2.8 pre-release from the nightly build results.

OR

3. I think you can work around this problem immediately by placing the following config knob in the configuration of your central manager

   USE_RESOURCE_REQUEST_COUNTS = False

and doing a condor_reconfig. If you do this, the bad news is negotiation cycles will take longer, as you are effectively disabling protocol scalability improvements introduced in v8.1, and your negotiation cycles will run at the same (slower) speed that they did back with HTCondor v8.0.x. You will also want to remember to set this knob back to its default value of True (or remove it from your config) once you upgrade to v8.2.8 or above.


Hope the above helps,
Todd

On 3/25/2015 12:42 PM, Joe Boyd wrote:

Hi condor-users,

We have MAX_JOBS_RUNNING set to:

[root@fifebatch1 condor]# condor_config_val -v MAX_JOBS_RUNNING
MAX_JOBS_RUNNING = 10000
  # at: /etc/condor/config.d/02_gwms_schedds.config, line 9
  # raw: MAX_JOBS_RUNNING = 10000

[root@fifebatch1 condor]# ls -al /etc/condor/config.d/02_gwms_schedds.config
-rw-r--r-- 1 root root 3528 Feb 26 14:43 /etc/condor/config.d/02_gwms_schedds.config
[root@fifebatch1 condor]# grep MAX_JOBS_RUNNING /etc/condor/config.d/02_gwms_schedds.config
MAX_JOBS_RUNNING        = 10000

Grepping in our schedd log there are lines like:

SchedLog.20150324T161949:03/24/15 08:17:11 (pid:2002) Preempting 66 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 08:27:13 (pid:2002) Preempting 88 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 08:32:10 (pid:2002) Preempting 10 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 10:17:10 (pid:2002) Preempting 38 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 10:17:36 (pid:2002) Preempting 13 jobs due to MAX_JOBS_RUNNING change
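
[Editor's sketch, not part of the original message.] To gauge how many jobs are affected, the preemption messages quoted above could be tallied with a few lines of Python. This is a minimal sketch: the function name total_preempted and the regular expression are assumptions made for illustration, based only on the message text shown in this post, and the pattern tolerates the message being wrapped across two lines.

```python
import re

# Pattern for the schedd message quoted in this thread. \s+ lets the text
# match whether or not the line was wrapped after "jobs".
PREEMPT_RE = re.compile(r"Preempting (\d+) jobs\s+due to MAX_JOBS_RUNNING change")


def total_preempted(log_text: str) -> int:
    """Sum the job counts over every matching preemption message (hypothetical helper)."""
    return sum(int(count) for count in PREEMPT_RE.findall(log_text))


# The five log entries quoted in this post, with the grep filename prefix dropped.
sample = """\
03/24/15 08:17:11 (pid:2002) Preempting 66 jobs due to MAX_JOBS_RUNNING change
03/24/15 08:27:13 (pid:2002) Preempting 88 jobs due to MAX_JOBS_RUNNING change
03/24/15 08:32:10 (pid:2002) Preempting 10 jobs due to MAX_JOBS_RUNNING change
03/24/15 10:17:10 (pid:2002) Preempting 38 jobs due to MAX_JOBS_RUNNING change
03/24/15 10:17:36 (pid:2002) Preempting 13 jobs due to MAX_JOBS_RUNNING change
"""

print(total_preempted(sample))  # prints 215
```

For the five entries quoted here, the total is 215 preempted jobs.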

The manual at:

http://research.cs.wisc.edu/htcondor/manual/v8.3/3_3Configuration.html#21897


says:

Changing this setting to be below the current number of jobs that are
running will cause running jobs to be aborted until the number running
is within the limit.

My problem is that we are NOT changing the value (see config file
timestamp above).  We're managing with puppet but certainly not running
puppet every 25 seconds as the last two log lines above show so it can't
even be some craziness there.

I thought I remember reading somewhere that the schedd may in fact get
more than MAX_JOBS_RUNNING jobs started because of the way it works
which is fine with me but I thought then it just didn't run any more
until the number got below.  It seems to be running more than 10k and
then proceeding to kill them.

Am I wrong?

joe
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685
