[HTCondor-users] (no subject)



Hello,

We have a scheduling policy use case at our site, and it is not clear
to me how best to implement it in HTCondor. I would like to ask the
experts out there for help.

We have a batch farm of ~3000 slots. The default maximum duration for
jobs is 12h, and we kill jobs exceeding that limit. Most of our users'
jobs take far less time, typically finishing within 2 or 3 hours at
most. However, sometimes there are special jobs that might last
longer than 12h. For that purpose we have implemented a "long"
AccountingGroup to which users can submit, and which allows jobs to
run for up to 48h. We have capped the number of slots that can be
running jobs in this "long" accounting group at 400. This is
currently a hard limit.

What we observe is that, even among jobs submitted to the "long"
AccountingGroup, 90% still complete in much less than 12h. Users
submit them to the long queue "just in case", to make sure that the
few jobs in the tail which might exceed the 12h limit are not killed.

The policy that we would like to implement is one that preempts jobs
which have been running for more than 12h, whenever the number of such
jobs exceeds the maximum we have configured for this type (400 in our
current case).

Let me give an example, since I am not sure my description of the
policy was all that clear. Imagine we start with a full farm running
3000 jobs, 700 of which have been running for >12h. I then submit 100
jobs, and would expect HTCondor to choose 100 of those 700 "long
running" jobs to be preempted so that my jobs can run. Ideally, I
would also like to tell HTCondor which of them to preempt first, for
instance those that have been running for the shortest time.
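
To make the selection rule concrete, here is a rough sketch of it in
plain Python. This is not HTCondor configuration and does not use any
HTCondor API; the names (RunningJob, choose_preemption_victims,
idle_demand) are made up for illustration, and it assumes that only
the jobs in excess of the 400 limit may be preempted.

```python
from dataclasses import dataclass

LONG_THRESHOLD_S = 12 * 3600   # jobs running longer than this count as "long"
MAX_LONG_JOBS = 400            # number of long-running jobs we want to protect


@dataclass(frozen=True)
class RunningJob:
    """Hypothetical record for a running job (not an HTCondor type)."""
    job_id: str
    runtime_s: int  # seconds the job has been running so far


def choose_preemption_victims(running, idle_demand,
                              threshold_s=LONG_THRESHOLD_S,
                              max_long=MAX_LONG_JOBS):
    """Return the jobs to preempt so that idle_demand new jobs can start.

    Only jobs running longer than threshold_s are candidates, and only
    as many as exceed max_long. Among the candidates, those with the
    shortest runtime are chosen first.
    """
    long_jobs = [j for j in running if j.runtime_s > threshold_s]
    excess = max(0, len(long_jobs) - max_long)
    n_victims = min(idle_demand, excess)
    # Shortest-running long jobs go first, so less work is thrown away.
    long_jobs.sort(key=lambda j: j.runtime_s)
    return long_jobs[:n_victims]
```

With the numbers above (700 long-running jobs, a limit of 400, and 100
newly submitted jobs) this would select 100 victims; with more than
300 new jobs it would stop at 300, leaving 400 long jobs untouched.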

What is the best way to do this in HTCondor?

Thanks very much,
Gonzalo