
[Condor-users] condor-mpi observation...



I have been using Condor (6.6) to run MPI jobs for a couple of weeks now, and I've noticed that it really only works well when I submit a single job at a time as a given user. The problem is that gathering machines in the pool to actually run the MPI job is done by the DedicatedScheduler user. Since DedicatedScheduler and the actual user each have their own user priorities, DedicatedScheduler can kick the real user's MPI jobs off while trying to secure machines for some other MPI job for the *same* user. This means that when I submit two jobs, and after my user priority has risen a bit, my two jobs start competing for resources. This has resulted in a stalemate several times, with DedicatedScheduler hanging on to resources seemingly indefinitely and neither job actually executing. This is a huge waste of resources.
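
To make the stalemate concrete, here is a toy sketch in Python. It is not Condor's actual negotiation or claiming logic; the pool size, the job names, and the round-robin claiming rule are all assumptions chosen only to show how two jobs that each hold part of what they need can block each other when nothing is released.

```python
# Toy model only: machines are handed out one at a time, round-robin,
# and a job never gives back a machine it already holds.
# Job names, pool size and the claiming rule are hypothetical.

def claim_round_robin(pool_size, requests):
    """Return (machines held per job, machines left free)."""
    held = {job: 0 for job in requests}
    free = pool_size
    while free > 0:
        progressed = False
        for job, needed in requests.items():
            if free > 0 and held[job] < needed:
                held[job] += 1
                free -= 1
                progressed = True
        if not progressed:
            # Every job already holds everything it asked for.
            break
    return held, free


def runnable(held, requests):
    """An MPI job can start only once it holds all the machines it needs."""
    return [job for job, needed in requests.items() if held[job] == needed]


if __name__ == "__main__":
    # Two MPI jobs from the same user, each needing 4 of the 6 machines.
    requests = {"mpi_job_a": 4, "mpi_job_b": 4}
    held, free = claim_round_robin(6, requests)
    print(held, free, runnable(held, requests))
```

In this model each job ends up holding 3 machines, none are free, and neither job can start. With only one job submitted, the same function gives that job all 4 machines and it runs, which matches what I see in practice.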

I'm wondering if this issue has been addressed, or if it will be addressed in future versions. Using the "real" user during the allocation cycle would seem to make more sense here and perhaps partially resolve this issue. I've "solved" this by not running more than one MPI job at a time, but this is far from optimal. Are allocations handled differently in 6.7?

rok