Re: [Condor-users] Testing systems services



Hi:

Thanks for the suggestions. I have used Hawkeye/Crondor and Ganglia for a while, but in this case I am less interested in the status of the pool itself than in its ability to service particular jobs. For now, I've settled on a set of scripts that schedule test versions of the types of jobs I want to ensure will always be able to run on our pool. If one fails, I get an email with some of the more relevant details.
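
For anyone wanting something similar, here is a minimal sketch of that kind of script. It is not the actual set of scripts described above, only an illustration under some assumptions: the function names, file names, sender address, and the default priority of -10 are all made up for the example, and it assumes condor_submit and condor_wait are on the PATH and that a successful job writes "Normal termination (return value 0)" to its user log.

```python
import os
import subprocess
from email.message import EmailMessage

# Assumption: text the Condor user log contains when a job exits with status 0.
SUCCESS_MARKER = "Normal termination (return value 0)"


def build_submit_description(name, executable, priority=-10):
    """Return the text of a submit description file for one test job."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"log = {name}.log",
        f"output = {name}.out",
        f"error = {name}.err",
        # Job priority only orders this job among the submitting user's own jobs.
        f"priority = {priority}",
        "notification = Never",
        "queue",
    ]
    return "\n".join(lines) + "\n"


def job_succeeded(log_text):
    """True if the user log records a normal exit with return value 0."""
    return SUCCESS_MARKER in log_text


def build_failure_email(name, details, recipient,
                        sender="condor-tests@localhost"):
    """Build (but do not send) the failure notification."""
    msg = EmailMessage()
    msg["Subject"] = f"Condor test job failed: {name}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(details)
    return msg


def run_test_job(name, executable, timeout_s=3600):
    """Submit one test job and wait for it; return (ok, details)."""
    log_file = f"{name}.log"
    if os.path.exists(log_file):
        os.remove(log_file)  # start from a clean log so old results are ignored

    submit_file = f"{name}.sub"
    with open(submit_file, "w") as fh:
        fh.write(build_submit_description(name, executable))

    submitted = subprocess.run(["condor_submit", submit_file],
                               capture_output=True, text=True)
    if submitted.returncode != 0:
        return False, "condor_submit failed:\n" + submitted.stderr

    # condor_wait exits non-zero if the job has not finished within the timeout.
    waited = subprocess.run(["condor_wait", "-wait", str(timeout_s), log_file],
                            capture_output=True, text=True)
    if waited.returncode != 0:
        return False, f"job did not finish within {timeout_s} seconds"

    with open(log_file) as fh:
        if job_succeeded(fh.read()):
            return True, "job completed normally"
    return False, f"job finished abnormally; see {log_file}"
```

A wrapper run once a day (from cron, say) could call run_test_job for each type of job and, when it returns False, pass the details to build_failure_email and hand the message to smtplib.SMTP("localhost").send_message, assuming a local mail relay is available.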

Best,
-B

On 2011-05-20, at 5:21 AM, Angel de Vicente wrote:

> Hi,
> 
> On 10/05/11 16:51, Burnett, Ben wrote:
>> Hi:
>> 
>> So I've been trying to manage a small Condor pool (~200 cores) over the last little while, and I've run into a small irritating issue, and wondered if others have experienced the same thing, or if they have solutions/ideas.
>> 
>> So I have configured the pool to do various helpful things, like accept GPU jobs, provide dynamic slots on some of the more capable machines, etc.  What I have found though, is that once I've set the configuration, I rarely revisit it.  This means that if it stops working, I won't know until someone complains.  This might contribute to a decreased workload, since if no one complains, then it does not need to be fixed; however, it is more generally the case that I do get complaints, and generally they arrive in my inbox near strict deadlines (not that anyone ever leaves things to the last minute :P).
>> 
>> Does anyone have a relatively simple system to continuously test their pool's services?  Ideally, I'd like the test jobs to run with very low priority, so as not to interfere with regular workloads, but I would like them to run at least once a day (or as often as practically possible), and keep track of the results (this could just be an email, or a log file).  Then, if one job fails, I'd like to be emailed about it.
>> 
>> I can think of a few approaches myself, but I thought I'd ask if anyone has already got something similar up and running.
> 
> Sorry to reply so late. Did you have a look at Hawkeye? http://www.cs.wisc.edu/condor/hawkeye/
> 
> It's been a while since I last used it, but it was very easy to use and out of the box you could check a lot of useful things like disk space, logged on users etc.
> 
> Cheers,
> Ángel de Vicente
> -- 
> http://www.iac.es/galeria/angelv/
> 
> High Performance Computing Support PostDoc
> Instituto de Astrofísica de Canarias
> 

--
Ben Burnett
Optimization Research Group
Department of Math & Computer Science
University of Lethbridge
http://optimization.cs.uleth.ca

"Everyone is entitled to their opinion; you're not entitled to your own fact."
- Michael Specter