
Re: [HTCondor-users] Multicore Shutdown



Hi Laurence,

I fear that this is impossible to implement precisely -- the problem is
that HTCondor has no way of knowing how close to completion each job
is, and therefore how many cycles would be lost or gained by evicting
the job(s) vs. waiting a bit longer. That said, have a look at
"ExpectedMachineGracefulDrainingBadput" and the related machine ClassAd
attributes in the manual:

http://research.cs.wisc.edu/htcondor/manual/v8.4/12_Appendix_A.html#102405

As I understand it, this uses the retirement time remaining as an
estimate of the throughput lost by killing a job, and can probably be
used to implement what you want. However, this assumes that
retirement time is meaningful in your environment. The
"ExpectedMachineQuickDrainingBadput" attribute is a count of the
already-committed cpu-seconds that would be lost if jobs were evicted
right now.

Sorry this isn't more helpful, but these attributes are worth looking
at, and I think a good-enough solution may be found by combining them
with some knowledge of the typical runtimes of your jobs.
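
To make the trade-off concrete, here is a rough Python sketch of the
comparison, not anything HTCondor does itself. It assumes you have
already read the quick-draining badput from the machine ad, and that a
single "typical runtime" is a fair guess for every job; the function
names, the (elapsed seconds, cores) job tuples and that typical-runtime
figure are all made up for illustration.

```python
def expected_idle_waste(idle_cores, jobs, typical_runtime_s):
    """Estimate cpu-seconds wasted by waiting for all jobs to finish.

    jobs is a list of (elapsed_seconds, cores) tuples (hypothetical
    format). Remaining time per job is guessed as typical runtime minus
    elapsed time, floored at zero.
    """
    remaining = [max(typical_runtime_s - elapsed, 0.0) for elapsed, _ in jobs]
    if not remaining:
        return 0.0
    last = max(remaining)  # estimated time until the machine is empty
    # Cores that are idle now stay idle until the last job ends.
    waste = idle_cores * last
    # Cores freed by earlier-finishing jobs idle until the last job ends.
    for (_, cores), left in zip(jobs, remaining):
        waste += cores * (last - left)
    return waste


def should_evict_now(quick_badput_cpu_s, idle_cores, jobs, typical_runtime_s):
    """True if waiting is expected to waste more than evicting now."""
    waste = expected_idle_waste(idle_cores, jobs, typical_runtime_s)
    return waste > quick_badput_cpu_s
```

For example, with 6 idle cores and two single-core jobs that have run
for 3000 s and 600 s against a typical runtime of 3600 s, waiting is
estimated to waste 20400 cpu-seconds versus 3600 lost by evicting, so
the sketch says evict. The estimate is only as good as the runtime
guess, which is the part HTCondor cannot supply for you.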

Cheers,
Will

On 07/19/2016 03:41 PM, Laurence Field wrote:
> Hi,
> 
> We would like to optimize the shutdown of multicore startds. After a
> startd no longer accepts new jobs, there is a point where the time
> wasted by the idle slots exceeds the time lost by killing the running
> jobs. This is essentially: shut down the startd if sum idle time > sum
> time of running jobs. Does anyone know how this should be correctly
> expressed in the config?
> 
> Cheers,
> 
> Laurence