
Re: [HTCondor-users] schedd getting more than max_jobs_running running



Hi Todd,

Awesome thanks.  We'll plan on upgrading when 8.2.8 comes out.

We can try the setting you mention in option 3 for now. Looks like our neg cycles are quick right now:

03/25/15 13:46:00 ---------- Started Negotiation Cycle ----------
03/25/15 13:46:17 ---------- Finished Negotiation Cycle ----------
03/25/15 13:46:42 ---------- Started Negotiation Cycle ----------
03/25/15 13:46:58 ---------- Finished Negotiation Cycle ----------
03/25/15 13:47:18 ---------- Started Negotiation Cycle ----------
03/25/15 13:47:35 ---------- Finished Negotiation Cycle ----------
03/25/15 13:47:55 ---------- Started Negotiation Cycle ----------
03/25/15 13:48:12 ---------- Finished Negotiation Cycle ----------

We'll see what they are with the change.
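
To compare cycle times before and after the change, here is a minimal Python sketch that pulls durations out of log lines like the ones above. The function name cycle_durations and the embedded sample are illustrative assumptions, not part of HTCondor; the sketch relies only on the timestamp format visible in the excerpt.

```python
from datetime import datetime

# Sample lines copied from the negotiation cycle excerpt above.
LOG = """\
03/25/15 13:46:00 ---------- Started Negotiation Cycle ----------
03/25/15 13:46:17 ---------- Finished Negotiation Cycle ----------
03/25/15 13:46:42 ---------- Started Negotiation Cycle ----------
03/25/15 13:46:58 ---------- Finished Negotiation Cycle ----------
03/25/15 13:47:18 ---------- Started Negotiation Cycle ----------
03/25/15 13:47:35 ---------- Finished Negotiation Cycle ----------
03/25/15 13:47:55 ---------- Started Negotiation Cycle ----------
03/25/15 13:48:12 ---------- Finished Negotiation Cycle ----------
"""


def cycle_durations(lines):
    """Return the length in seconds of each completed negotiation cycle."""
    durations = []
    start = None
    for line in lines:
        parts = line.split(None, 2)  # date, time, rest of the message
        if len(parts) < 3:
            continue
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1],
                                      "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not begin with a timestamp
        if "Started Negotiation Cycle" in parts[2]:
            start = stamp
        elif "Finished Negotiation Cycle" in parts[2] and start is not None:
            durations.append(int((stamp - start).total_seconds()))
            start = None
    return durations


durations = cycle_durations(LOG.splitlines())
print(durations)
```

On the excerpt above this gives cycles of 16 to 17 seconds each, which is the baseline to compare against once USE_RESOURCE_REQUEST_COUNTS is set to False.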

joe


On 03/25/2015 01:44 PM, Todd Tannenbaum wrote:
Hi Joe,

Re the below, it looks very much like a bug that was fixed back in
January.  See
   https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4554

This bug fix has already appeared in the developer series with HTCondor
v8.3.3 and above, and will also appear in the stable series starting
with the upcoming HTCondor v8.2.8.

If you are averse to running HTCondor v8.3.x just yet, you have a few
options to consider:

1. Wait until HTCondor v8.2.8 is released, which will likely be about 4
to 6 weeks (we typically hold back on stable releases to accumulate
enough bug patches to make upgrading worthwhile).

OR

2. We can send you a link to download a v8.2.8 pre-release from the
nightly build results.

OR

3. I think you can work around this problem immediately by placing the
following config knob in your central manager
    USE_RESOURCE_REQUEST_COUNTS = False
and doing a condor_reconfig.  If you do this, the bad news is
negotiation cycles will take longer as you are effectively disabling
protocol scalability improvements introduced in v8.1, and your
negotiation cycles will run at the same (slower) speed that they did
back with HTCondor v8.0.x.  You will also want to remember to set this
knob back to its default value of True (or remove it from your config)
once you upgrade to v8.2.8 or above.

Hope the above helps,
Todd

On 3/25/2015 12:42 PM, Joe Boyd wrote:
Hi condor-users,

We have MAX_JOBS_RUNNING set to:

[root@fifebatch1 condor]# condor_config_val -v MAX_JOBS_RUNNING
MAX_JOBS_RUNNING = 10000
  # at: /etc/condor/config.d/02_gwms_schedds.config, line 9
  # raw: MAX_JOBS_RUNNING = 10000

[root@fifebatch1 condor]# ls -al /etc/condor/config.d/02_gwms_schedds.config
-rw-r--r-- 1 root root 3528 Feb 26 14:43 /etc/condor/config.d/02_gwms_schedds.config
[root@fifebatch1 condor]# grep MAX_JOBS_RUNNING /etc/condor/config.d/02_gwms_schedds.config
MAX_JOBS_RUNNING        = 10000

Grepping in our schedd log there are lines like:

SchedLog.20150324T161949:03/24/15 08:17:11 (pid:2002) Preempting 66 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 08:27:13 (pid:2002) Preempting 88 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 08:32:10 (pid:2002) Preempting 10 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 10:17:10 (pid:2002) Preempting 38 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 10:17:36 (pid:2002) Preempting 13 jobs due to MAX_JOBS_RUNNING change
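
To size up how often this happens, a short Python sketch along these lines can tally the preemption messages from the grep output. The function name preemption_events and the regular expression are illustrative assumptions based only on the log lines quoted above, not on any documented HTCondor log format.

```python
import re
from datetime import datetime

# Sample grep output copied from the schedd log lines above.
SCHEDLOG = """\
SchedLog.20150324T161949:03/24/15 08:17:11 (pid:2002) Preempting 66 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 08:27:13 (pid:2002) Preempting 88 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 08:32:10 (pid:2002) Preempting 10 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 10:17:10 (pid:2002) Preempting 38 jobs due to MAX_JOBS_RUNNING change
SchedLog.20150324T161949:03/24/15 10:17:36 (pid:2002) Preempting 13 jobs due to MAX_JOBS_RUNNING change
"""

# \s+ before "due" also tolerates a message wrapped onto the next line.
PATTERN = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:\d+\) "
    r"Preempting (\d+) jobs\s+due to MAX_JOBS_RUNNING change"
)


def preemption_events(text):
    """Return (timestamp, job_count) for every preemption message in text."""
    return [
        (datetime.strptime(stamp, "%m/%d/%y %H:%M:%S"), int(count))
        for stamp, count in PATTERN.findall(text)
    ]


events = preemption_events(SCHEDLOG)
total_preempted = sum(count for _, count in events)
# Seconds between the last two preemption messages.
last_gap = int((events[-1][0] - events[-2][0]).total_seconds())
print(len(events), total_preempted, last_gap)
```

For the five lines quoted above this reports 215 preempted jobs in total, with the last two messages 26 seconds apart.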

The manual at:

http://research.cs.wisc.edu/htcondor/manual/v8.3/3_3Configuration.html#21897

says:

Changing this setting to be below the current number of jobs that are
running will cause running jobs to be aborted until the number running
is within the limit.

My problem is that we are NOT changing the value (see the config file
timestamp above).  We manage the config with Puppet, but we are certainly
not running Puppet every 25 seconds, as the last two log lines above would
require, so it can't even be some craziness there.

I thought I remembered reading somewhere that the schedd may in fact get
more than MAX_JOBS_RUNNING jobs started because of the way it works,
which is fine with me, but I thought it then just didn't start any more
until the number dropped below the limit.  Instead, it seems to be running
more than 10k jobs and then proceeding to kill them.

Am I wrong?

joe