Partial Solutions to date:
1) I initially took a hint from the Bologna Batch System short/long
running jobs config (thanks to whoever documented that) and implemented
an "ExclusiveVM vs Non-Exclusive VM" system. Each node had an extra VM
configured which was an Exclusive VM (I also lied about the amount of
memory and number of CPUS in the config file so as to get appropriate
final numbers). Via the START expression, when a job was marked
"exclusive" it could only run on the Exclusive VM, otherwise it could
only run on one of the non-exclusive VMs. If an exclusive job was
running on a node, no non-exclusive jobs could start there either, and
vice-versa, if any non-exclusive jobs were running, no exclusive jobs
could start. Blast jobs were submitted as exclusive jobs, and ran with
"-a X", thus using all available CPU. Condor only thought one VM was in
use, but all CPU was being efficiently used.
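For reference, the core of that setup looked roughly like this (simplified
from memory; IsExclusiveJob is an attribute name of my own invention, set
via "+IsExclusiveJob = True" in the exclusive jobs' submit files, and the
cross-VM lockout checks are omitted here):

```
# Hypothetical sketch for a dual-CPU node: lie and advertise three VMs,
# with VM3 acting as the Exclusive VM.  Exclusive jobs match only VM3;
# everything else matches VM1/VM2.
NUM_CPUS = 3
START = $(START) && \
  ( (TARGET.IsExclusiveJob =?= True && VirtualMachineID == 3) || \
    (TARGET.IsExclusiveJob =!= True && VirtualMachineID != 3) )
```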
But this breaks down in the presence of more than a few non-exclusive
jobs. Unless all non-exclusive VMs on a node are vacated at the same
time, an exclusive job never gets a chance to run - in practice, with
lots of jobs in a queue, non-exclusive jobs on the same node almost
never finish at the same time, so the exclusive jobs are locked out, no
matter what user priority there might be (the START expression simply
never matches).
So, I pondered allowing an exclusive job to start up if there was at
least one free non-exclusive VM, and never allowing a non-exclusive job
to start while an exclusive job was running. This would oversubscribe the CPU
for a while until any non-exclusive jobs finished, and when the
exclusive job finishes, non-exclusive could still potentially be allowed
back on if Rank/priority allowed. But this seems inefficient to me
(increases wall-clock time at the least). Am I worried over nothing?
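Roughly, the relaxed rule I have in mind would look something like this on
the Exclusive VM (reusing the hypothetical IsExclusiveJob attribute from
above, and assuming the other VMs' State attributes are visible across
VMs, which I would want to verify first):

```
# Sketch only: let an exclusive job start on VM3 as long as at least
# one non-exclusive VM is idle (Unclaimed), accepting temporary CPU
# oversubscription until the remaining non-exclusive jobs drain.
START = $(START) && \
  ( VirtualMachineID == 3 && TARGET.IsExclusiveJob =?= True && \
    (vm1_State =?= "Unclaimed" || vm2_State =?= "Unclaimed") )
```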
2) Second attempt was to try to implement my option b) above. I got rid
of the whole exclusive/non-exclusive vm idea. Each blast job runs with
-a 1, and advertises an extra attribute:
BlastDatabaseUsed
which is the name of the blast database in use by that job. Then on the
compute node I added:
STARTD_JOB_EXPRS = BlastDatabaseUsed
STARTD_VM_EXPRS = BlastDatabaseUsed
to the local config file. STARTD_JOB_EXPRS pushed BlastDatabaseUsed
for the current job into the ClassAd of the VM it was running on, and
STARTD_VM_EXPRS made it available to all the other VMs. With the
Start expression:
START = $(START) && \
  (TARGET.BlastDatabaseUsed =?= UNDEFINED || \
   ((vm1_BlastDatabaseUsed =?= UNDEFINED || \
     vm1_BlastDatabaseUsed =?= TARGET.BlastDatabaseUsed) && \
    (vm2_BlastDatabaseUsed =?= UNDEFINED || \
     vm2_BlastDatabaseUsed =?= TARGET.BlastDatabaseUsed)))
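For completeness, the submit side of this scheme is just a plain blastall
job that advertises the database (the "nt" database and the file names
here are only examples):

```
# Relevant fragment of the blast submit file: run single-threaded
# ("-a 1") and advertise which database this job will use.  The "+"
# prefix puts BlastDatabaseUsed into the job ClassAd, where the START
# expression above sees it as TARGET.BlastDatabaseUsed.
executable = blastall
arguments  = -p blastn -d nt -a 1 -i query.fa -o query.out
+BlastDatabaseUsed = "nt"
queue
```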
I would have expected it to work. But, there seems to be some sort of
delay between STARTD_JOB_EXPRS pushing the attribute into a VM's
ClassAd and STARTD_VM_EXPRS propagating it into the other VMs' ClassAds,
resulting in the start expression matching for a new database while
there was still a job running for the old one. Nodes flip-flopped and
had two jobs running at the same time for different databases. It
varied with each run, as would be expected from some kind of race
condition.