
Re: [HTCondor-users] high-performance computing ?



On 7/01/2016 8:13 PM, Xavier Faure-Miller wrote:
HTCondor advertises itself as "HTC" software. How close is it now to also being HPC (high-performance computing)?
I am looking to execute many jobs, some of them running in less than one minute (possibly only a few seconds). Hence I would not want to wait more than a few seconds between the trigger and the actual start of execution.
Is HTCondor able to do that? What is the typical delay between the trigger and the actual start of the job on a distributed machine?
Hi Xavier,

I run similar jobs through my HTC cluster, and I don't think it will do what you are after as specified above. For that matter, neither will a PBS-based HPC cluster. I've also used distributed frameworks such as BOINC. The problem is that, no matter how efficient your submission system and scheduler are, for jobs that short the management and IO overhead is always larger than the run time.

What I have found works really well, though, relies on the fact that all my "small" jobs are non-linear with no inter-dependence: think GA population computations or Monte Carlo population samples with tens of thousands or even millions of individuals. So what I do is wrap them up into bundles of jobs, so that the computational load of each bundle is much higher than the management overhead. After that, HTC, PBS, etc. become very efficient. There is a tailoring process to balance the IO versus the computation, and a bit more work to wrap and unwrap the bundles as well.
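To make the bundling idea concrete, here is a minimal Python sketch. It is an illustration only, not the scripts actually used on the cluster: the names make_bundles, run_bundle and efficiency are hypothetical, and the 30-second per-job overhead in the usage note is an assumed figure rather than a measured one.

```python
from itertools import islice


def make_bundles(tasks, bundle_size):
    """Split an iterable of independent tasks into lists of at most bundle_size."""
    it = iter(tasks)
    while True:
        bundle = list(islice(it, bundle_size))
        if not bundle:
            return
        yield bundle


def run_bundle(bundle, evaluate):
    """Run every task of one bundle inside a single process and collect the results."""
    return [evaluate(task) for task in bundle]


def efficiency(task_seconds, overhead_seconds, bundle_size):
    """Fraction of a submitted job's wall time that is spent computing."""
    compute = task_seconds * bundle_size
    return compute / (compute + overhead_seconds)
```

Each bundle would then be submitted as one cluster job that calls run_bundle. With an assumed 30 s of scheduling and transfer overhead per job, efficiency(1.0, 30.0, 1) is about 3% for a single one-second task, while efficiency(1.0, 30.0, 300) is about 91% for a bundle of 300.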

Also consider that for my very fast jobs, of less than one second each, the internal IO of the individual process started to outweigh the computational cost, so investing in a machine with as many cores as I could afford, with SSDs or even RAM disks, made more sense. Then I used a simple OpenMP or batch system locally on the machine, adding work until the IO bus became saturated. Finally, there is simply a system overhead required to start processes and reap zombies at the end, which for really short jobs with many repetitions costs time and system resources.
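As a rough sketch of the single-machine approach, the following uses Python's standard multiprocessing.Pool as a stand-in for the local OpenMP or batch setup mentioned above. The worker processes are started once and reused, and tasks are handed out in chunks, so the process start-up cost is not paid per task. The helper names chunk_size_for, run_local and square, and the default of four chunks per worker, are assumptions made for illustration.

```python
import math
from multiprocessing import Pool


def chunk_size_for(n_tasks, n_workers, chunks_per_worker=4):
    """Pick a chunk size that gives each worker a few chunks of tasks."""
    return max(1, math.ceil(n_tasks / (n_workers * chunks_per_worker)))


def run_local(tasks, evaluate, n_workers):
    """Evaluate all tasks on one machine with a fixed pool of worker processes."""
    tasks = list(tasks)
    size = chunk_size_for(len(tasks), n_workers)
    with Pool(processes=n_workers) as pool:
        # Workers are reused for every chunk instead of one process per task.
        return pool.map(evaluate, tasks, chunksize=size)


def square(x):
    """Placeholder for a very short computation (hypothetical workload)."""
    return x * x


if __name__ == "__main__":
    results = run_local(range(1000), square, n_workers=2)
    print(len(results), results[:5])
```

In practice n_workers would be raised until the disks or IO bus, rather than the cores, become the limit.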

Hope that helps,
-pete
