Re: [HTCondor-users] schedd getting more than max_jobs_running running



Hi Joe,

Re the below, it looks very much like a bug that was fixed back in January. See
  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4554

This bug fix has already appeared in the developer series with HTCondor v8.3.3 and above, and will also appear in the stable series starting with the upcoming HTCondor v8.2.8.

If you are averse to running HTCondor v8.3.x just yet, you have a few options to consider:

1. Wait until HTCondor v8.2.8 is released, which will likely be in about 4 to 6 weeks (we typically hold back on stable releases to accumulate enough bug patches to make upgrading worthwhile).

OR

2. We can send you a link to download a v8.2.8 pre-release from the nightly build results.

OR

3. I think you can work around this problem immediately by setting the following config knob on your central manager
   USE_RESOURCE_REQUEST_COUNTS = False
and doing a condor_reconfig. The bad news is that negotiation cycles will take longer: this effectively disables the protocol scalability improvements introduced in v8.1, so your negotiation cycles will run at the same (slower) speed they did with HTCondor v8.0.x. Remember to set this knob back to its default value of True (or remove it from your config) once you upgrade to v8.2.8 or above.
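
For concreteness, option 3 might look like the sketch below. The drop-in file name 99_workaround.config is only an illustration (any file your central manager's configuration already reads will do), and the two commands are run on the central manager after the file is in place:

   # /etc/condor/config.d/99_workaround.config   (hypothetical file name)
   # Temporary workaround; remove after upgrading to v8.2.8 or above.
   USE_RESOURCE_REQUEST_COUNTS = False

   condor_reconfig
   condor_config_val -v USE_RESOURCE_REQUEST_COUNTS

The second command should report False along with the file the value came from, in the same way condor_config_val -v reports MAX_JOBS_RUNNING in the quoted message.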

Hope the above helps,
Todd

On 3/25/2015 12:42 PM, Joe Boyd wrote:
Hi condor-users,

We have MAX_JOBS_RUNNING set to:

[root@fifebatch1 condor]# condor_config_val -v MAX_JOBS_RUNNING
MAX_JOBS_RUNNING = 10000
  # at: /etc/condor/config.d/02_gwms_schedds.config, line 9
  # raw: MAX_JOBS_RUNNING = 10000

[root@fifebatch1 condor]# ls -al /etc/condor/config.d/02_gwms_schedds.config
-rw-r--r-- 1 root root 3528 Feb 26 14:43 /etc/condor/config.d/02_gwms_schedds.config
[root@fifebatch1 condor]# grep MAX_JOBS_RUNNING /etc/condor/config.d/02_gwms_schedds.config
MAX_JOBS_RUNNING        = 10000

Grepping in our schedd log there are lines like:

SchedLog.20150324T161949:03/24/15 08:17:11 (pid:2002) Preempting 66 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 08:27:13 (pid:2002) Preempting 88 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 08:32:10 (pid:2002) Preempting 10 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 10:17:10 (pid:2002) Preempting 38 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 10:17:36 (pid:2002) Preempting 13 jobs due to MAX_JOBS_RUNNING change

The manual at:

http://research.cs.wisc.edu/htcondor/manual/v8.3/3_3Configuration.html#21897


says:

Changing this setting to be below the current number of jobs that are
running will cause running jobs to be aborted until the number running
is within the limit.

My problem is that we are NOT changing the value (see config file
timestamp above).  We're managing with puppet but certainly not running
puppet every 25 seconds, as the last two log lines above would require,
so it can't even be some craziness there.

I thought I remembered reading somewhere that the schedd may in fact get
more than MAX_JOBS_RUNNING jobs started because of the way it works,
which is fine with me, but I thought it then just didn't run any more
until the number dropped below the limit.  It seems to be running more
than 10k and then proceeding to kill them.

Am I wrong?

joe

--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685