
[Condor-users] Condor and Blast on multi-processor compute nodes



Hi,
	I've been beating my head against this for days now with no
noticeable progress, so I'd be most appreciative of any suggestions that
might be made.
The problem (short):
	For best efficiency, I want to be able to configure Condor so
that for a given class of jobs (Bioinformatics blast searches), either
		a)  the job can have exclusive access to a given
physical (multi-processor) node, or
		b) multiple such jobs can run, but only if they are all
searching the same database (flatfile on disk)

The problem (long):
Previous testing of blast has shown that it is most efficient if the
blast database being searched fits into RAM and can be cached by the OS
as it is read.  Single processor nodes aren't an issue - only one Condor
VM, only one job at a time, and if the Blast db fits in memory, so much
the better.

For multi-processor servers it gets tricky.  Blast has a "-a" option
which controls how many threads run.  If the condor job can have
exclusive access to a node, then running with -a X, where X is the
number of processors, is generally the best way to run, although running
X simultaneous processes with -a 1 is also ok.  So I have two ways of
looking at this problem - 
	i) somehow reserve the entire physical node (all VMs) and run
with -a X, or
	ii) run all my blasts with -a 1, but somehow ensure that a given
node is only searching one DB at a time.  It would be ok for multiple
DBs to be searched simultaneously if their total size is < RAM, but
I believe that would increase complexity unnecessarily.
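
For concreteness, the "-a" option is just the thread count on the
(legacy NCBI) blastall command line; the file and database names below
are purely illustrative:

	# exclusive access to a 4-CPU node: one process, four threads
	blastall -p blastp -d nr -i query.fa -o query.out -a 4
	# shared node: one single-threaded process per job
	blastall -p blastp -d nr -i query.fa -o query.out -a 1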

Also in the mix is that the database being searched must be retrieved
from central storage, which for a 500MB+ database can take some time.
The previously searched DB is left on disk in transient storage in case
the next job can use it, so ideally I'd like to minimize churn (nodes
switching from one DB to another).  I know RANK can take care of that to
some degree, but I'll throw this in as another bit of the puzzle.

Partial Solutions to date:
1) I initially took a hint from the Bologna Batch System short/long
running jobs config (thanks to whoever documented that) and implemented
an "ExclusiveVM vs Non-Exclusive VM" system.  Each node had an extra VM
configured which was an Exclusive VM (I also lied about the amount of
memory and number of CPUs in the config file so as to get appropriate
final numbers).  Via the START expression, when a job was marked
"exclusive" it could only run on the Exclusive VM, otherwise it could
only run on one of the non-exclusive VMs.  If an exclusive job was
running on a node, no non-exclusive jobs could start there either, and
vice-versa, if any non-exclusive jobs were running, no exclusive jobs
could start.  Blast jobs were submitted as exclusive jobs, and ran with
"-a X", thus using all available CPU.  Condor only thought one VM was in
use, but all CPU was being efficiently used.  
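
In outline (a simplified sketch rather than the exact config I used, for
a dual-CPU node; the +ExclusiveJob attribute name is illustrative and is
set by the jobs in their submit files):

# 2 real CPUs plus one extra "exclusive" VM (vm3); memory overstated so
# the per-VM shares come out sensibly
NUM_CPUS = 3
MEMORY = 4096

# one way to let each VM see the others' State, as vm<N>_State
STARTD_VM_EXPRS = State

# exclusive jobs may only match vm3, and only while vm1/vm2 are idle;
# normal jobs may only match vm1/vm2, and only while vm3 is idle
START = $(START) && ( \
    (VirtualMachineID == 3 && TARGET.ExclusiveJob =?= True && \
        vm1_State =?= "Unclaimed" && vm2_State =?= "Unclaimed") || \
    (VirtualMachineID < 3 && TARGET.ExclusiveJob =!= True && \
        vm3_State =?= "Unclaimed") )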

But this breaks down in the presence of more than a few non-exclusive
jobs.  Unless all non-exclusive VMs on a node are vacated at the same
time, an exclusive job never gets a chance to run - in practice, with
lots of jobs in a queue, non-exclusive jobs on the same node almost
never finish at the same time, so the exclusive jobs are locked out, no
matter what user priority there might be (the START expression simply
never matches).

So, I pondered allowing an exclusive job to start up if there is at
least one free non-exclusive VM, and never allowing a non-exclusive job
to start if an exclusive job is running.  This would oversubscribe the CPU
for a while until any non-exclusive jobs finished, and when the
exclusive job finishes, non-exclusive could still potentially be allowed
back on if Rank/priority allowed.  But this seems inefficient to me
(increases wall-clock time at the least).  Am I worried over nothing?  
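
If I did go that way, the only change to the sketch above would be to
relax the exclusive condition from requiring both normal VMs to be idle
to requiring at least one, i.e. roughly (same illustrative names as
above):

START = $(START) && ( \
    (VirtualMachineID == 3 && TARGET.ExclusiveJob =?= True && \
        (vm1_State =?= "Unclaimed" || vm2_State =?= "Unclaimed")) || \
    (VirtualMachineID < 3 && TARGET.ExclusiveJob =!= True && \
        vm3_State =?= "Unclaimed") )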

2) Second attempt was to try and implement my option b) above.  I got rid
of the whole exclusive/non-exclusive VM idea.  Each blast job runs with
-a 1, and advertises an extra attribute:
BlastDatabaseUsed
which is the name of the blast database in use by that job.  Then on the
compute node I added:
STARTD_JOB_EXPRS = BlastDatabaseUsed
STARTD_VM_EXPRS = BlastDatabaseUsed
to the local config file.  So STARTD_JOB_EXPRS pushed BlastDatabaseUsed
for the current job into the VM classad for the VM it was running on,
and STARTD_VM_EXPRS made it available to all the other VMs.  With the
Start expression:
START = $(START) && \
        (TARGET.BlastDatabaseUsed=?=UNDEFINED || \
                ((vm1_BlastDatabaseUsed=?=UNDEFINED || \
                  vm1_BlastDatabaseUsed=?=TARGET.BlastDatabaseUsed) && \
                 (vm2_BlastDatabaseUsed=?=UNDEFINED || \
                  vm2_BlastDatabaseUsed=?=TARGET.BlastDatabaseUsed)))
I would have expected it to work.  But, there seems to be some sort of
delay between STARTD_JOB_EXPRS pushing into the VM ClassAd and
STARTD_VM_EXPRS pushing that attribute into the other VMs' ClassAds,
resulting in the start expression matching for a new database while
there was still a job running for the old one.  Nodes flip-flopped and
had two jobs running at the same time for different databases.  It
varied with each run, as would be expected from some kind of race
condition.
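
For completeness, the submit side of this is just a custom attribute in
each job's submit description file, along these lines (database and file
names illustrative):

universe   = vanilla
executable = blastall
arguments  = -p blastp -d nr -i query.fa -o query.out -a 1
# advertise which database this job searches, so STARTD_JOB_EXPRS can
# copy it into the ad of whichever VM runs the job
+BlastDatabaseUsed = "nr"
queue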

So, can anyone suggest a way to either fix one of the above techniques,
or some other way of looking at this problem?

Thanks very much for reading this far ;-)

Craig Miskell,
Technical Support,
AgResearch Invermay
03 489-9279
"The two rules for success:  1. Don't tell people everything you know."
  - Anon.