I am a PhD student at the University of Liverpool whose research focuses on implementing numerical Bayesian techniques (specifically Particle Filters and SMC Samplers) at scale in distributed compute environments. These are iterative algorithms used to solve Bayesian inference problems. Some background on my work so far is given in the following two paragraphs.
My supervisors and I decided that the University's HTCondor pool would be a suitable target on which to implement the algorithm. To begin with, the algorithm worked by: (1) distributing work to machines in the pool, (2) waiting for all machines to finish, and (3) processing the outputs. This process was repeated for a set number of epochs and, as you can imagine, was extremely slow.
In the past month, my supervisors and I have made a number of changes to the structure of these algorithms to make them better suited to running on a Condor pool. My algorithm is now divided into two parts: a 'driver' program, which runs on the main Condor server, and a 'worker' program, which runs on machines in the pool. The workers now perform as many iterations as possible in a given amount of time (e.g. 30 minutes) before sending their work back to the driver, so that global synchronisation happens less frequently. Similarly, the driver program runs for a given, much longer, amount of time (e.g. 24 hours).
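As a rough illustration of the time-budgeted worker loop described above, here is a minimal Python sketch. It is not the real worker code: the names run_worker, step, state and budget_seconds are hypothetical, and step stands in for one iteration of the particle filter or SMC sampler.

```python
import time


def run_worker(step, state, budget_seconds, clock=time.monotonic):
    """Apply `step` to `state` repeatedly until the time budget is used up.

    `step` is a placeholder for a single iteration of the algorithm.
    `clock` is injectable only so the loop can be checked deterministically.
    Returns the final state and the number of iterations completed.
    """
    deadline = clock() + budget_seconds
    iterations = 0
    # The budget is checked before each iteration, so an iteration that
    # starts just before the deadline may finish slightly after it.
    while clock() < deadline:
        state = step(state)
        iterations += 1
    return state, iterations
```

A worker with a 30 minute budget would then call something like run_worker(step, initial_state, 30 * 60) and write the returned state to its output file for the driver to collect.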
Since modifying the algorithm to work this way, I've noticed the following:
I should note that I witnessed this behaviour when submitting some dummy code that:
I'm not sure why I'm seeing the behaviour described above, and wondered if anyone could provide insight into why this might be happening. Ian Smith (cc'ed) manages the University's Condor pool and can provide some more technical background.
Any help is much appreciated.