
[Condor-users] Re: Pausing the Queue



I haven't seen anyone else suggest setting
MAX_JOBS_SUBMITTED = 0

That will make sure that no new condor_submits are accepted
into the queue, while letting the existing jobs drain out.  If you
also simultaneously do condor_off -peaceful on
all the condor_startds, the jobs that are already running get to
finish and the ones in the queue that are not yet executing won't start.
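
The sequence above can be sketched as a small shell script. This is only a sketch under stated assumptions, not a tested procedure: it assumes MAX_JOBS_SUBMITTED = 0 has already been added to the schedd's local configuration file, that the commands are run by a pool administrator, and that your release accepts "-daemon startd" (older releases may spell this "-subsystem startd"). The run, drain_pool and resume_pool helpers are hypothetical names introduced here. DRY_RUN defaults to 1, so the commands are printed rather than executed.

```shell
#!/bin/sh
# Dry-run by default: commands are only printed. Set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

drain_pool() {
    # Assumption: MAX_JOBS_SUBMITTED = 0 is already in the schedd's config
    # file; the reconfig makes the local schedd re-read its configuration.
    run condor_reconfig -daemon schedd
    # Let running jobs finish, but stop every startd from taking new ones.
    run condor_off -peaceful -all -daemon startd
}

resume_pool() {
    # Assumption: MAX_JOBS_SUBMITTED = 0 has been removed from the config.
    run condor_reconfig -daemon schedd
    run condor_on -all -daemon startd
}

drain_pool
```

After the maintenance window, calling resume_pool (again with DRY_RUN=0 on a real pool) would turn the startds back on and let the schedd accept submissions again.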

As others have said, MAX_JOBS_RUNNING = 0 will immediately
evict all jobs that are running. Fast, but not a way to be popular
with your users.

Steve


On Fri, 24 Jun 2011, Lans Carstensen wrote:

If the goal is outright eviction/return-jobs-to-idle across a schedd, setting MAX_JOBS_RUNNING=0 during your maintenance window is effective but not documented in the wiki.

Usually a combination of methods is needed, depending on your workflow. Stopping matchmaking by shutting down the negotiator may help you start to drain, depending on claim reuse. It may also be necessary to touch the startds in some manner, either to prevent them from picking up new jobs or to evict the jobs they already have. Telling a schedd to evict running jobs with MAX_JOBS_RUNNING=0 is kind of a last resort in your escalation, and it won't get your administratively held jobs confused with your errored-held jobs.

Timothy St. Clair wrote:
$ man condor_hold
-------------------------------------------
Examples
       To place on hold all jobs (of the user that issued the condor_hold
       command) that are not currently running:

       % condor_hold -constraint "JobStatus!=2"
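
For a maintenance window, the hold in the man page example needs a matching release afterwards. The sketch below pairs the two. It is an illustration under assumptions, not a tested procedure: the run, hold_idle_jobs and release_admin_held helpers are hypothetical names, and the release constraint assumes that HoldReasonCode 1 marks jobs held through condor_hold, so that jobs held because of errors are left alone. Jobs that users put on hold themselves carry the same code and would be released too. DRY_RUN defaults to 1, so the commands are printed rather than executed.

```shell
#!/bin/sh
# Dry-run by default: commands are only printed. Set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
# Note that the printed form loses the shell quoting around the constraint.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

hold_idle_jobs() {
    # JobStatus 2 means "running", so running jobs are left to finish.
    run condor_hold -constraint 'JobStatus != 2'
}

release_admin_held() {
    # JobStatus 5 means "held". Assumption: HoldReasonCode 1 identifies
    # holds placed with condor_hold rather than holds caused by errors.
    run condor_release -constraint 'JobStatus == 5 && HoldReasonCode == 1'
}

hold_idle_jobs
```

Calling release_admin_held after the maintenance work would return those jobs to the idle state so they can be matched again.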

Hope this helps,
Tim

On Fri, 2011-06-24 at 11:29 -0400, McGee, Kevin D. wrote:
We have a recurring problem; our grid is so successful that there are
typically several hundred jobs in the queue 24/7 and downtime is hard
to schedule.  We are still growing and upgrading our infrastructure,
so we need system downtime on a sporadic basis to change
configurations or bring new equipment online.  Because of the way our
application is architected, these changes go beyond adding or removing
compute nodes, the changes affect every copy of the application that
is running on the grid.  Is there a way to pause the job queue without
asking users to delete their jobs so that we can allow the jobs
running to finish with no new ones starting?  This would allow us to
wait for the grid to go idle, do our work and then resume job
submission.


Thanks,

- Kevin


Kevin McGee
A2C5 Radar Modeling and Simulation Section Supervisor
A2C Missile Defense Radar Engineering Group
Air & Missile Defense Department
-----------------------------------------
Email: Kevin.McGee@xxxxxxxxxx
-----------------------------------------
The Johns Hopkins University Applied Physics Laboratory
11100 Johns Hopkins Road
Laurel MD 20723-6099
-----------------------------------------
(240) 228-0710 / Washington DC
(443) 778-0710 / Baltimore MD


_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/


--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.