
[Condor-users] Distributed Java Jobs (Dedicated and preemptible)


I'm new to Condor (though I've worked with PBS in LAM/MPI environments
in the past).

I am trying to add condor support to a Java-based package for
statistical inference and nonlinear optimization (via Monte Carlo
methods) that I've been working on. In my setting, I can exploit
dedicated JVMs, but can also tolerate periodic migration, addition,
or removal of JVM instances for a particular problem: as long as at
least one worker JVM is active, useful work can be done, given the
right communication upon JVM migration, addition, or removal.

For each unique job (particular inference/optimization problem), I
have a unique key (some integer, say). Then the Java/Condor
interactions I need to support are:

- (Manually) start a JVM instance, keyed to a particular job,
  running the job's manager.

- (Manually, though ideally programmatically) start helper JVM
  instances, which are able to connect - via sockets - to the
  manager JVM instance whose key matches theirs.

  Over the course of the computation, these sockets will also
  be used to let the helper JVM instances communicate (e.g.
  for swapping states in a parallelized Gibbs sampler or
  temperatures in parallel tempering).

- Receive notification when a JVM instance is being preempted,
  so the manager can extract critical data and its state can be
  serialized for later redistribution. (If critical data isn't
  extracted, removing a JVM causes all other workers to hang until it
  finishes migrating, which is no good when random machines' idle
  times are being used.)
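To make the second and third items concrete, here is a minimal,
self-contained sketch of what I have in mind. The class name, the
REGISTER/ACK wire protocol, and the in-process "manager" thread (a
stand-in for the real manager JVM) are all invented for illustration.
For the preemption part, my understanding is that Condor sends a
soft-kill signal before hard-killing a job; if that reaches the JVM
as SIGTERM, registered shutdown hooks get a chance to run:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;

public class HelperSketch {

    // Helper side: connect to the manager and identify ourselves with
    // the shared job key; returns the manager's acknowledgement line.
    static String handshake(String host, int port, int jobKey) throws IOException {
        try (Socket s = new Socket(host, port);
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            out.println("REGISTER " + jobKey);
            return in.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the manager JVM: accept one helper, echo its key back.
        ServerSocket server = new ServerSocket(0); // ephemeral port
        Thread manager = new Thread(() -> {
            try (Socket s = server.accept();
                 BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                String line = in.readLine();             // "REGISTER <key>"
                out.println("ACK " + line.split(" ")[1]);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        manager.start();

        // Preemption: if the soft kill arrives as SIGTERM, the JVM runs
        // shutdown hooks -- one place to serialize critical state.
        Runtime.getRuntime().addShutdownHook(new Thread(
                () -> System.err.println("vacating: serialize state here")));

        String ack = handshake("localhost", server.getLocalPort(), 42);
        manager.join();
        server.close();
        System.out.println(ack); // prints "ACK 42"
    }
}
```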

Naively, I could probably construct my own RDBMS-based solution (every
JVM registers itself with the job key it is working on, and the
manager periodically polls the database to bring new JVM instances
into the job). Is there anything better, or more idiomatic under
Condor? If so, where can I learn more about this?
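For what it's worth, the closest thing I've found so far is Condor's
java universe for launching the helper JVMs. A submit description
along these lines (file names, class names, host, port, and key are
placeholders for illustration) would queue ten helpers, each handed
the manager's coordinates on the command line -- is this the intended
approach?

```
universe    = java
executable  = Helper.class
jar_files   = inference.jar
arguments   = Helper manager.example.org 4711 42
output      = helper.$(Process).out
error       = helper.$(Process).err
log         = helper.log
queue 10
```

(In the java universe, as I read the manual, the first argument names
the main class, and condor_submit ships the listed class/jar files to
the execute machine.)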

Thanks much,