
Re: [Condor-users] Condor and Blast on multi-processor compute nodes

Miskell, Craig wrote:

Partial Solutions to date:
1) I initially took a hint from the Bologna Batch System short/long
running jobs config (thanks to whoever documented that) and implemented
an "ExclusiveVM vs Non-Exclusive VM" system.  Each node had an extra VM
configured which was an Exclusive VM (I also lied about the amount of
memory and number of CPUs in the config file so as to get appropriate
final numbers).  Via the START expression, when a job was marked
"exclusive" it could only run on the Exclusive VM, otherwise it could
only run on one of the non-exclusive VMs.  If an exclusive job was
running on a node, no non-exclusive jobs could start there either, and
vice-versa, if any non-exclusive jobs were running, no exclusive jobs
could start.  Blast jobs were submitted as exclusive jobs, and ran with
"-a X", thus using all available CPU.  Condor only thought one VM was in
use, but all CPU was being efficiently used.
But this breaks down in the presence of more than a few non-exclusive
jobs.  Unless all non-exclusive VMs on a node are vacated at the same
time, an exclusive job never gets a chance to run.  In practice, with
lots of jobs in the queue, non-exclusive jobs on the same node almost
never finish at the same time, so the exclusive jobs are locked out no
matter what user priority there might be (the START expression simply
never matches).
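For reference, the setup described above could be sketched roughly as
follows (a sketch only -- the ExclusiveJob attribute, the VM numbering,
and the use of STARTD_VM_EXPRS = State are illustrative, not the exact
config that was used):

```
# Sketch: a dual-CPU node advertised as three VMs -- vm1/vm2 are the
# non-exclusive VMs, vm3 is the exclusive one.
NUM_CPUS = 3

# Publish each VM's State into the other VMs' ads (vm1_State, ...)
STARTD_VM_EXPRS = State

# Exclusive jobs carry +ExclusiveJob = True in their submit file.
# An exclusive job may only start on vm3, and only when both
# non-exclusive VMs are idle; a non-exclusive job may only start on
# vm1/vm2, and only when vm3 is idle.
START = $(START) && \
  ( (VirtualMachineID == 3 && TARGET.ExclusiveJob =?= True && \
     vm1_State =?= "Unclaimed" && vm2_State =?= "Unclaimed") || \
    (VirtualMachineID  < 3 && TARGET.ExclusiveJob =!= True && \
     vm3_State =?= "Unclaimed") )
```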

So, I pondered allowing an exclusive job to start up if there was at
least one free non-exclusive VM, and never allow non-exclusive to start
if there is an exclusive job running.  This would oversubscribe the CPU
for a while until any non-exclusive jobs finished, and when the
exclusive job finishes, non-exclusive could still potentially be allowed
back on if Rank/priority allowed.  But this seems inefficient to me
(increases wall-clock time at the least).  Am I worried over nothing?

Yes, to avoid the starvation problem you mentioned above, you either need to temporarily oversubscribe, or preempt the non-exclusive jobs when an exclusive one starts. At least, those are the only options I can think of.
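The temporary-oversubscription option could be expressed as a variant of
the exclusive-VM START expression, something like the following sketch
(attribute names and VM numbering are illustrative: vm3 is the exclusive
VM, and exclusive jobs set +ExclusiveJob = True in the submit file):

```
STARTD_VM_EXPRS = State

# Sketch only.  An exclusive job may start on vm3 once at least one of
# vm1/vm2 is idle, briefly oversubscribing the CPUs until the remaining
# non-exclusive job finishes; non-exclusive jobs cannot start while vm3
# is claimed.
START = $(START) && \
  ( (VirtualMachineID == 3 && TARGET.ExclusiveJob =?= True && \
     (vm1_State =?= "Unclaimed" || vm2_State =?= "Unclaimed")) || \
    (VirtualMachineID  < 3 && TARGET.ExclusiveJob =!= True && \
     vm3_State =?= "Unclaimed") )
```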

2) Second attempt was to try and implement my option b) above.  I got rid
of the whole exclusive/non-exclusive vm idea.  Each blast job runs with
-a 1, and advertises an extra attribute, BlastDatabaseUsed, which is
the name of the blast database in use by that job.  Then on the
compute node I added:
STARTD_JOB_EXPRS = BlastDatabaseUsed
STARTD_VM_EXPRS = BlastDatabaseUsed
to the local config file.  So STARTD_JOB_EXPRS pushed BlastDatabaseUsed
for the current job into the VM classad for the VM it was running on,
and STARTD_VM_EXPRS made it available to all the other VMs.  With the
Start expression:
START = $(START) && \
       (TARGET.BlastDatabaseUsed=?=UNDEFINED || \
               ((vm1_BlastDatabaseUsed=?=UNDEFINED || \
                 vm1_BlastDatabaseUsed=?=TARGET.BlastDatabaseUsed) && \
                (vm2_BlastDatabaseUsed=?=UNDEFINED || \
                 vm2_BlastDatabaseUsed=?=TARGET.BlastDatabaseUsed)))
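For completeness, the submit side of this scheme would look something
like the sketch below (the wrapper script name and the database "nr"
are placeholders, not the actual files used):

```
# Hedged sketch of a submit file for scheme 2.
universe   = vanilla
executable = run_blast.sh
arguments  = -a 1 -d nr

# Advertised into the job ad, so STARTD_JOB_EXPRS can copy it into the
# classad of whichever VM runs the job.
+BlastDatabaseUsed = "nr"

queue
```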
I would have expected it to work.  But, there seems to be some sort of
delay between STARTD_JOB_EXPRS pushing into the VM classad and
STARTD_VM_EXPRS pushing that attribute into the other VMs classAds,
resulting in the start expression matching for a new database while
there was still a job running for the old one.  Nodes flip-flopped and
had two jobs running at the same time for different databases.  It
varied with each run, as would be expected from some kind of race
condition.

I can't think of any way to remove this race condition, because if the machines happen to get matched within the same negotiation cycle, the negotiator won't know about the changes to the startd ads until the next round anyway.

Another idea would be to have the job record somewhere on the machine what DB it is using and have a startd cron job that publishes this information into the machine ad as the "last DB used". You probably wouldn't want to use that as a hard requirement, as in your above example, but you could submit your jobs with a rank expression that prefers to run on a machine where the last DB used matches the DB to be used by the job. Since this "last DB used" would persist beyond the life of the jobs, it might be less vulnerable to the race condition, because it might change less often. However, the result would depend on how often jobs manage to land on a machine matching the DB they want, so it is, unfortunately, rather unpredictable.
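A sketch of what that might look like (the STARTD_CRON_* knob names
follow the usual convention, but the script path, period, and attribute
names here are invented for illustration):

```
# Machine config sketch: a startd cron job that republishes the last
# database used into the machine ad every minute.
STARTD_CRON_JOBLIST = LASTDB
STARTD_CRON_LASTDB_EXECUTABLE = /usr/local/bin/last_blast_db.sh
STARTD_CRON_LASTDB_PERIOD = 60s
# The script just prints the attribute, e.g.:
#   LastBlastDatabase = "nr"
# reading the value from wherever the job recorded it on the machine.

# Submit-file sketch: a soft preference, not a hard requirement.
rank = (LastBlastDatabase =?= "nr")
+BlastDatabaseUsed = "nr"
```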