
[HTCondor-users] high priority DAGMAN processing




Hello Everyone,

Can I make one dagman more important than the others, including everything else in the backlog of IDLE condor_dagman jobs, in a one-user environment?
NOTE: In operations this is a one-user environment, so user priority settings are not relevant.

Our operations environment is trying to run a shared resource configuration, with essentially 3 levels of data processing priority. 
HIGH    - calibration exposures. Critical because the results affect the data coming in behind them.
MEDIUM  - standard real-time data exposures.
LOW     - reprocessing. Data will not be arriving constantly, so as cores become available we can reprocess in the pipeline (new or better calibration, perhaps). Not critical, but over time there could be a large number of these less important condor_dagman reprocessing jobs.

I do not want to set up servers for these HIGH priority jobs that will sit idle the vast majority of the time.  Each of these dataset types can be identified in HTCondor by the dagman template used to create the dagman file that is submitted to start the process.  We did this partly because each dagman runs the same set of jobs, but at a different job priority level.  The job priorities within the dagman files are working beautifully, but I have not figured out how to move an important calibration dagman past all of the idle reprocessing condor_dagman jobs so that it becomes a running dagman that is executing jobs.

I have tweaked the system so that I am almost fully utilizing the available cores.  So I don't think I really need more jobs Running, unless it does not hurt to have the schedd manage a large number of running jobs?
I see that TotalSchedulerJobsRunning is my limiter for the number of Running condor_dagman jobs.  I am considering moving the schedd to a separate VM and experimenting with a larger value, but with regular exposures and reprocessing exposures, I will probably always have idle dagman jobs that I need to leapfrog over.

At this time I don't want to terminate or evict any running jobs or dagman sets.  We believe that we can meet the delivery time window requirement for the high priority jobs even if we have to wait for some of the running jobs to finish, because we have enough cores and enough short-running jobs that a core should become available quickly.  But we definitely can't wait for the 2000 idle sets in front of this one to finish.  I need these jobs to get the next available core.
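
To make the behavior I am after concrete, here is a minimal toy model in Python. It is not HTCondor code or configuration. The class name DagQueue, the max_running cap (standing in for the limit I see through TotalSchedulerJobsRunning), and the numeric priority levels are all made up for illustration.

```python
import heapq
from itertools import count

# Hypothetical priority levels for this sketch; larger numbers start first.
HIGH, MEDIUM, LOW = 2, 1, 0


class DagQueue:
    """Toy model: idle DAGs wait in priority order; running DAGs are never evicted."""

    def __init__(self, max_running):
        self.max_running = max_running  # stand-in for the cap on running dagmans
        self.running = []
        self._idle = []
        self._seq = count()  # tie-breaker: keeps submission order within a level

    def submit(self, name, priority):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._idle, (-priority, next(self._seq), name))
        self._fill()

    def finish(self, name):
        # A running DAG completes on its own; nothing is ever preempted.
        self.running.remove(name)
        self._fill()

    def _fill(self):
        # Start idle DAGs, highest priority first, while there is room.
        while self._idle and len(self.running) < self.max_running:
            _, _, name = heapq.heappop(self._idle)
            self.running.append(name)


q = DagQueue(max_running=2)
for i in range(4):
    q.submit(f"reprocess_{i}", LOW)   # backlog of low priority reprocessing
q.submit("calibration", HIGH)         # arrives last, behind the backlog
q.finish("reprocess_0")               # one running DAG finishes normally
print(q.running)
```

In this model the calibration set waits only for the next free slot and then starts ahead of the idle reprocessing sets, without anything that is already running being stopped. That is the ordering I would like the schedd to apply to idle condor_dagman jobs.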

How do I move the new important calibration dataset dagman to the front of the line and get it running?

                    Mary

Mary Romelfanger
Sr. Systems Software Engineer
.___.      
{o,o}      Phone 410-338-6708
/)__)     Cell      443-244-0191
-"-"-          mary@xxxxxxxxx

Space Telescope Science Institute
3700 San Martin Drive
Baltimore, MD 21218