
Re: [HTCondor-users] jobs with disjoint requirements



Hello Don, and everyone,

One important thing to remember about HTCondor is that you no longer have "queues," in the usual Grid Engine sense. This was one of the concepts I had the most difficulty conveying to my users as we migrated from SGE to HTCondor - old habits died hard.

One of the numerous Grid Engine workarounds they'd written into their submission tools over the years was to submit only one job every few seconds, to give multiple submitters a chance to interleave their jobs in the single-file Grid Engine queue that had been set up years earlier. This usually meant dozens or hundreds of cores sat idle for hours on end, especially with short-running jobs, which is a pretty grim state of affairs when you've spent as much money as they had on the exec nodes.

Once they got the hang of the idea that job order and priority are recalculated for every job at every negotiation cycle, and after enough runs of "condor_userprio," I was able to get them to write submissions that put ten thousand jobs into the queue in a matter of seconds, rather than nearly fourteen hours with a "sleep 5" between each "qsub." The negotiator takes care of dividing up the resources fairly among the users, so nobody has to wait for someone else's 10,000 jobs to finish before their own jobs run, and they just don't have to worry about it anymore.
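
For instance, instead of a qsub loop, the whole batch can go into a single submit description with one "queue" statement. This is just a sketch, with a made-up worker script and file names:

# Hypothetical example: hand the schedd 10,000 independent jobs at once
# and let the negotiator fair-share the slots among the users.
executable = /path/to/worker.sh
arguments  = $(Process)
output     = worker_$(Process).out
error      = worker_$(Process).err
log        = worker.log
queue 10000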

If your own jobs are the only ones contending for the resources, I think what you may be after is accounting groups rather than a requirements expression.

For example, my test pool has seven slots. I create a submit description like so:

executable = /bin/sleep
arguments = 120
accounting_group_user = pelletm_batchA
queue 20
accounting_group_user = pelletm_batchB
queue 20

This creates 40 jobs: half are tied to the "batchA" accounting group user (processes .0 through .19) and half to "batchB" (.20 through .39).

Look what happened:

-- Submitter: condor1 : <138.127.79.182:54201> : condor1
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  27.0   pelletm        10/26 11:24   0+00:00:02 R  0   0.0  sleep 120
  27.1   pelletm        10/26 11:24   0+00:00:02 R  0   0.0  sleep 120
  27.2   pelletm        10/26 11:24   0+00:00:02 R  0   0.0  sleep 120
  27.3   pelletm        10/26 11:24   0+00:00:00 I  0   0.0  sleep 120
...
  27.19  pelletm        10/26 11:24   0+00:00:00 I  0   0.0  sleep 120
  27.20  pelletm        10/26 11:24   0+00:00:02 R  0   0.0  sleep 120
  27.21  pelletm        10/26 11:24   0+00:00:02 R  0   0.0  sleep 120
  27.22  pelletm        10/26 11:24   0+00:00:02 R  0   0.0  sleep 120
  27.23  pelletm        10/26 11:24   0+00:00:00 I  0   0.0  sleep 120
...

The negotiator assigned six of the pool's seven slots right off the bat, half to batchA and half to batchB. Here's the condor_userprio output a bit later, once the seventh slot had been claimed:

condor1$ condor_userprio
Last Priority Update: 10/26 11:26
                     Effective   Priority   Res   Total Usage  Time Since
User Name             Priority    Factor   In Use (wghted-hrs) Last Usage
------------------- ------------ --------- ------ ------------ ----------
pelletm_batchB@doma       502.41   1000.00      3         0.10      <now>
pelletm_batchA@doma       502.89   1000.00      4         0.12      <now>
------------------- ------------ --------- ------ ------------ ----------
Number of users: 2                              7         0.22    0+23:59

The accounting_group_user specified in the submit description resulted in two separate "users" for a single submission, and the resources will be fair-share divided between them by the negotiator. We'd expect the assignment of the seventh slot to oscillate back and forth between the two as the total usage figure comes to reflect the use of that odd slot.

If you want to divide up the resources unevenly, then you'd want to set up the pool's configuration with an accounting group with a different priority factor, and direct the jobs accordingly using the "accounting_group" submit value.
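
As a rough sketch of that approach - the group names and factors here are made up, and you'd tune them to the split you actually want - the central manager's configuration might look something like:

# condor_config on the central manager (hypothetical group names)
GROUP_NAMES = group_heavy, group_light
# A lower priority factor means a better effective priority, so
# group_heavy wins more of the contended slots than group_light.
GROUP_PRIO_FACTOR_group_heavy = 1.0
GROUP_PRIO_FACTOR_group_light = 10.0

and the submit descriptions would then carry something like:

accounting_group = group_heavy
accounting_group_user = pelletm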

With respect to large numbers of queued jobs: I have one group of users that submits about a quarter to half a million short-running jobs on a fairly regular basis. I gave them their own private scheduler so that people could still run condor_q against the main scheduler* without it timing out, but given that setup I don't generally find it a bad thing to have very large numbers of jobs queued. It makes things easier for the users, since they no longer have to leave a submission running for hours on end while empty slots wait for work, or write a DAG just to keep the number of idle jobs down. The highest peak I've seen was about 800,000 jobs waiting on a Friday evening. It brightens my weekend to think about every last scrap of available CPU power fully utilized all weekend long, and the toasty warm air flowing from the back of the systems.
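
If you go the separate-scheduler route, the users just aim their tools at that schedd by name - a quick sketch, with a made-up hostname:

condor_submit -name bigbatch-schedd.example.com jobs.sub
condor_q -name bigbatch-schedd.example.com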

---
*HTCondor defaults to running a scheduler on each member of the pool, but setting SCHEDD_HOST and having a single central scheduler was, in part, a concession to those same old Grid Engine habits, where there would have been panic in the hallways if the equivalent of "qstat" returned different results on different machines. That, plus the scale of the pool, the condor_shadow processes, and the network topology, meant that a beefy schedd host on the same network as the exec nodes works better for us.
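
For what it's worth, that central-scheduler arrangement boils down to one knob in the configuration of the submit/login machines (hostname made up here):

# condor_config on the login nodes: point condor_submit, condor_q,
# and friends at the single central schedd.
SCHEDD_HOST = schedd01.example.com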

 

Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681) NOTE NEW NUMBER
339.293.9149 cell
339.645.8614 fax

michael.v.pelletier@xxxxxxxxxxxx