[HTCondor-users] Witnessing strange behaviour

Hi all,

I am a PhD student at the University of Liverpool whose research focuses on implementing numerical Bayesian techniques (specifically Particle Filters and SMC Samplers) at scale on distributed compute environments. These are iterative algorithms used to solve Bayesian inference problems. Some background on my work so far is given in the next two paragraphs.

My supervisors and I decided that the University's HTCondor pool would be a suitable target to implement the algorithm on. To begin with, the algorithm worked by: (1) distributing work to machines in the pool, (2) waiting for all machines to finish, (3) processing the outputs. This process repeated for a set number of epochs and, as you can imagine, was extremely slow.

In the past month, my supervisors and I have made a number of changes to the structure of these algorithms to make them more suitable to run on a Condor pool. My algorithm is now divided into two parts: a 'driver' program, which runs on the main Condor server, and a 'worker' program, which runs on machines in the pool. The workers now perform as many iterations as possible in a given amount of time (e.g. 30 mins) before sending their work back to the driver for less frequent global synchronisation. Similarly, the driver program runs for a (much longer) given amount of time (e.g. 24 hrs).
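
For illustration only, the time-budgeted worker loop described above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the actual code: `run_worker`, `step`, `state` and the `clock` parameter are hypothetical names, and the real worker may well be structured differently.

```python
import time


def run_worker(step, state, budget_seconds, max_iterations=None, clock=time.monotonic):
    """Apply step(state) repeatedly until the time budget or iteration cap is hit.

    All names here are hypothetical; `step` stands in for one iteration of
    the sampler, and `clock` can be swapped out to make the loop testable.
    """
    deadline = clock() + budget_seconds
    iterations = 0
    while clock() < deadline:
        # Optional cap, e.g. "at most 100 samples".
        if max_iterations is not None and iterations >= max_iterations:
            break
        state = step(state)
        iterations += 1
    # The final state is what would be sent back to the driver.
    return state, iterations


if __name__ == "__main__":
    # Toy step: increment a counter; stop after 30 minutes or 1000 iterations.
    final_state, n = run_worker(lambda s: s + 1, 0, budget_seconds=30 * 60,
                                max_iterations=1000)
    print(final_state, n)
```

The deadline is only checked between iterations, so a single long iteration can overrun the budget by up to one iteration's duration.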

Since modifying the algorithm to work this way, I've noticed the following:
  1. It can take a long time for jobs to start executing once they have been submitted (I've seen up to 2 hrs between the "submitted" message and the "executing" message)
  2. Jobs can get stuck in a loop where they repeatedly fail and migrate and eventually stop running

I should note that I witnessed this behaviour when submitting some dummy code that:
  1. Submits jobs to generate as many samples as possible in 5 minutes up to a max of 100 samples
  2. Processes the outputs from these jobs and splits them into N subsets
  3. Submits jobs to workers that read in a subset, wait for 5 minutes, then return the subset of samples
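
To make step 2 concrete, the split into N subsets could look something like the sketch below. This is only an assumption about how the split might be done (contiguous, near-equal chunks); `split_into_subsets` is a hypothetical name and the dummy code may partition the samples differently.

```python
def split_into_subsets(samples, n_subsets):
    """Split a list into n_subsets contiguous chunks of near-equal size.

    Hypothetical helper: the first (len(samples) % n_subsets) chunks get
    one extra element so that sizes differ by at most one.
    """
    if n_subsets <= 0:
        raise ValueError("n_subsets must be positive")
    base, extra = divmod(len(samples), n_subsets)
    subsets = []
    start = 0
    for i in range(n_subsets):
        size = base + (1 if i < extra else 0)
        subsets.append(samples[start:start + size])
        start += size
    return subsets


if __name__ == "__main__":
    # 10 samples split across 3 workers.
    print(split_into_subsets(list(range(10)), 3))
```

If there are fewer samples than subsets, some subsets come back empty, so a real driver would probably want to skip submitting jobs for those.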

I'm not sure why I'm seeing the behaviour described above, and wondered if anyone could provide insight into why this might be happening. Ian Smith (cc'ed) manages the University's Condor pool and can provide some more technical background.

Any help is much appreciated.

Many thanks,