
[HTCondor-users] A better condor_q when 100k+ jobs are submitted... ideas? ( was Re: memory leak in condor_q? )



On 2/7/2014 1:58 AM, Pek Daniel wrote:
> Thank you Todd! The reason why we need to measure condor_q -global
> performance is because our users use it extensively nowadays (more
> precisely the counterpart of it in the current system). They use it to
> poll their jobs, grepping around, etc. Even worse, we can't/don't want
> to tie a specific user to a specific submission node (achieving more
> efficient load balancing between schedds), which means if a user wants
> to find its job, s/he has to query with -global... 

Understood.  As we work in development to scale up to hundreds of thousands of submitted jobs, we are thinking about what information most users really want to see when they run "condor_q" alone.  Often they are just polling for aggregate information, such as how many of their jobs are running vs. idle, perhaps grouped by task or application, plus some "recent happenings" like details about the last 10 jobs to complete or change status.  It seems pretty rare that a user with thousands of submitted jobs really wants what condor_q currently displays by default, which is information about every single queued job.  Any additional thoughts you or anyone on htcondor-users has about what would be a better condor_q display when there are O(100k+) jobs submitted would be helpful.
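
A rough Python sketch of the kind of aggregate view described above: count jobs per status, grouped by one attribute. This is only an illustration, not how condor_q works or will work. The job ads here are hypothetical dictionaries reduced to two attributes, and grouping by "Cmd" (the job's executable) is an assumed stand-in for "task or application". The JobStatus codes 1 (idle), 2 (running) and 5 (held) are the standard HTCondor values.

```python
from collections import Counter, defaultdict

# Standard HTCondor JobStatus codes for the three states discussed here.
STATUS_NAMES = {1: "idle", 2: "running", 5: "held"}


def summarize(jobs, group_by="Cmd"):
    """Count jobs per status, grouped by one attribute of each job ad."""
    summary = defaultdict(Counter)
    for job in jobs:
        group = job.get(group_by, "(unknown)")
        status = STATUS_NAMES.get(job.get("JobStatus"), "other")
        summary[group][status] += 1
    return {group: dict(counts) for group, counts in summary.items()}


# Hypothetical job ads (made-up paths), reduced to the attributes used above.
jobs = [
    {"Cmd": "/home/user/simulate", "JobStatus": 2},
    {"Cmd": "/home/user/simulate", "JobStatus": 1},
    {"Cmd": "/home/user/simulate", "JobStatus": 1},
    {"Cmd": "/home/user/analyze", "JobStatus": 5},
]

for group, counts in summarize(jobs).items():
    print(group, counts)
```

With a summary like this, the size of the display depends on the number of groups rather than on the number of queued jobs.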

In HTCondor as it exists now, condor_q options like "-run", "-hold", or "-constraint" can select just the jobs of interest, instead of dumping out everything with condor_q and then using grep.  One way to query for aggregates today is "condor_status -submitter".  The aggregates it provides are unfortunately very spartan (just running, idle, held), and the user interface is not the right one (users don't think of running condor_status to look at job info), but it does show information for all submit machines (like condor_q -global).  See the example output below - note that when given a user, it shows aggregate info for each submit machine where that user has submitted jobs, and then totals across all submit machines:

[tannenba@submit-1 ~]$ condor_status -submitter chinn@xxxxxxxxxxxxxxx
Name                         Machine            RunningJobs IdleJobs HeldJobs

chinn@xxxxxxxxxxxxxxx        XXXXXX-fe01.rcac.X          10       54        0
chinn@xxxxxxxxxxxxxxx        XXXXXX-fe02.rcac.X           0       11        0
                           RunningJobs           IdleJobs           HeldJobs

chinn@xxxxxxxxxxxxxxx               10                 65                  0

               Total                 0                 65                  0
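
For scripts that poll this output, the per-machine rows can be picked out and summed with a few lines of Python. This is a minimal sketch that assumes the column layout shown above (name, machine, then three integer counts). The function names are made up for illustration, and the sample text is copied from the per-machine rows in the example.

```python
def parse_submitter_rows(text):
    """Return (name, machine, running, idle, held) for each per-machine row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Per-machine rows have exactly five fields, the last three numeric.
        # Header, per-user summary and Total lines do not match this shape.
        if len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
            name, machine = fields[0], fields[1]
            running, idle, held = (int(f) for f in fields[2:])
            rows.append((name, machine, running, idle, held))
    return rows


def sum_rows(rows):
    """Add up the running, idle and held counts across all rows."""
    return {
        "running": sum(r[2] for r in rows),
        "idle": sum(r[3] for r in rows),
        "held": sum(r[4] for r in rows),
    }


sample = """\
Name                         Machine            RunningJobs IdleJobs HeldJobs

chinn@xxxxxxxxxxxxxxx        XXXXXX-fe01.rcac.X          10       54        0
chinn@xxxxxxxxxxxxxxx        XXXXXX-fe02.rcac.X           0       11        0
                           RunningJobs           IdleJobs           HeldJobs

chinn@xxxxxxxxxxxxxxx               10                 65                  0
"""

rows = parse_submitter_rows(sample)
print(sum_rows(rows))
```

Summing the two per-machine rows gives 10 running, 65 idle and 0 held, which matches the per-user line in the output.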