
Re: [HTCondor-users] Execute last DAGMan job as soon as possible



It turned out that we had changed the default prio factor to 10 (before the Condor default switched to 1000), so I changed all users' priority factors to 1000 and set the urgent group's priority to 1. That did shorten the time the jobs need to grab free slots, but it still takes 10-15 minutes. What's interesting is that after these ten minutes lots of slots are allocated to the group, so the group priority is obviously having some effect. There might be some unintentional claim / timeout setting behind all this, but I don't know what to look for.

My main gripe is this: why do the jobs wait for minutes when their machine rank is the highest in the pool, the group priority factor is the lowest, the job priority is also high, PRIORITY_HALFLIFE = 1 (so the amount of resources used should not matter), and there *are* free slots that get matched to other users?
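
For reference, here is a rough sketch of how PRIORITY_HALFLIFE enters the real user priority (RUP). It assumes the update rule described in the HTCondor manual's user priority section (RUP_new = beta * RUP_old + (1 - beta) * slots_in_use, with beta = 0.5 ** (dt / PRIORITY_HALFLIFE)); the function name and all numbers are hypothetical and only illustrate the arithmetic.

```python
# Sketch only: assumes the RUP update rule from the HTCondor manual.
# update_rup and every number below are illustrative, not taken from a real pool.

def update_rup(rup_old: float, slots_in_use: float, dt: float, halflife: float) -> float:
    """Blend the previous RUP with current usage; beta is the decay weight."""
    beta = 0.5 ** (dt / halflife)
    return beta * rup_old + (1.0 - beta) * slots_in_use

# Hypothetical user: old RUP of 500, currently holding 200 slots,
# re-evaluated after a 30 second interval.
rup_fast = update_rup(500.0, 200.0, dt=30.0, halflife=1.0)         # PRIORITY_HALFLIFE = 1
rup_default = update_rup(500.0, 200.0, dt=30.0, halflife=86400.0)  # one-day half-life

print(f"halflife=1s:     RUP = {rup_fast:.3f}")
print(f"halflife=86400s: RUP = {rup_default:.3f}")
```

Under that assumption, a one-second half-life makes past usage irrelevant almost immediately, but the RUP then tracks the number of slots held right now, so current usage would still count against a submitter.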

Cheers,
Szabolcs

On Wed, Nov 30, 2016 at 6:44 PM, Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:

I'd suggest double-checking your default prio factor value. As I recall it was only raised to 1,000 in the 8.4 release, and if you're using 8.2, it might only be 100 (if I'm remembering correctly), and if your jobs are racking up a lot of slot time across many machines the final job may still have a higher EUP even with a priority factor of 1.

Perhaps if you specify also a dummy accounting_group_user for the end jobs, they'd wind up in a different usage basket than the rest of the DAG and wouldn't be penalized for the large usage it incurred?

You can use condor_userprio to check on all this.

               -Michael Pelletier.

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Szabolcs Horvátth
Sent: Wednesday, November 30, 2016 12:18 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Execute last DAGMan job as soon as possible


Hi Michael,

Thanks for the tip! I tried setting up group accounting and it solved most of my problems, although it still takes more time to start the end jobs than I'd expect.

We have a negotiation cycle every 30 seconds, but it takes much longer (around 10-15 minutes) to match slots to these jobs, even though there are idle slots that get matched to other jobs.

Maybe there are some claims hanging on to these slots?


Cheers,

Szabolcs


On Tue, Nov 29, 2016 at 7:12 PM, Michael Pelletier <Michael.V.Pelletier@raytheon.com> wrote:

Hello,

It sounds like what you're looking for is accounting groups. You'd set up an accounting group with a very low priority factor, e.g. "group_urgent", and assign your final node to that group:

GROUP_NAMES = group_urgent
GROUP_PRIO_FACTOR_group_urgent = 1.0
GROUP_AUTOREGROUP = True

In your final DAG node which handles the post-processing, you'd set the following in the submit description for it:

Accounting_group = group_urgent


And then that final job would be very likely to be the first in line to get the next matching machine resource, because its effective user priority (real user priority times priority factor of 1.0) would be likely to be lower than other jobs using the default_prio_factor of 1000 (v8.4).
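
The arithmetic behind that ordering can be sketched as follows. This is only an illustration of "effective user priority = real user priority times priority factor", with lower values served first; the submitter names and real-priority values are made up, and the second case uses the default factor of 10 reported elsewhere in this thread.

```python
# Sketch only: EUP = RUP * priority factor, lower EUP is served first.
# Submitter names and RUP values are hypothetical.

def effective_priority(real_priority: float, factor: float) -> float:
    """Return the effective user priority for one submitter."""
    return real_priority * factor

# Pool with the 8.4 default factor of 1000 for ordinary users.
submitters = {
    "group_urgent.alice": effective_priority(40.0, 1.0),   # heavy usage, factor 1
    "bob": effective_priority(0.5, 1000.0),                # idle user, default factor
    "carol": effective_priority(2.0, 1000.0),
}
negotiation_order = sorted(submitters, key=submitters.get)
print(negotiation_order)

# Same urgent group, but ordinary users only carry a factor of 10.
legacy = {
    "group_urgent.alice": effective_priority(40.0, 1.0),
    "bob": effective_priority(0.5, 10.0),
}
legacy_order = sorted(legacy, key=legacy.get)
print(legacy_order)
```

In the first case the urgent group sorts ahead of everyone despite its larger usage; in the second, an idle user with a small default factor can still come out ahead, which is why the actual default factor in the pool is worth checking.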

The priority set in the submit "priority" setting only applies to jobs from the same owner. You'd use this, for example, if you had a pile of 10,000 runs waiting in the queue, but needed to get a few validation runs through before all of those 10,000 are finished.

               -Michael Pelletier.

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Szabolcs Horvátth
Sent: Tuesday, November 29, 2016 5:07 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Execute last DAGMan job as soon as possible


Hi,

What is the fastest way to start a job in a Condor pool where machine rank, user priority factor and job priority vary a lot?

We use DAGMan graphs where the last job depends on the execution of all previously submitted DAG jobs. This last job does some post-processing on the data generated by the DAG, and it can take some time, so it's not something that I'd like to execute on the scheduler machine. But it would be important to start this post-processing as fast as possible, regardless of the priority of the submitting user. I tried setting a high machine rank and a high job priority, but I still see lots of these jobs waiting while other jobs get started. The best solution would be to skip matchmaking altogether and execute the job right away, but I didn't find a reliable way to do that.

Cheers,

Szabolcs


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
