
[Condor-users] DAG POST mechanism


I need some improvements to the DAG "POST" process.  I need custom code
that can be attached to the POST command of each DAG node and that can
update the state of the DAG.  In particular, I need to track failure
rates, and if a certain condition is met, I need the POST process to
modify the template classads that are being used, as well as the DAG
classad.

My specific situation is that we have several "resource pools" which are
all visible from a single matchmaker instance.  At times certain pools
do not play nicely with the running DAGs, and jobs which are matched
there fail without exception.  Failures happen fast, and even with the
1 minute throttle I've put in, it is the classic "black hole" situation:
a pool full of resources that fail jobs quickly consumes a
disproportionate number of jobs (e.g. last night I had 80% of jobs
failing due to 2 pools out of 10).  Nodes then hit the "retry" limit,
and my DAG exits with nodes that have never run properly.

I would like to address this by having the POST process count successes
and failures associated with each pool; if the number of failures
exceeds some threshold, that pool will be explicitly excluded from any
job matching for nodes in that DAG.  I'd like to store the counts in the
DAG classad, and I would like the POST script to know which job classad
it is responding to (the one that just ran), and also which DAG classad
it is associated with.
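
To make the counting idea concrete, here is a minimal sketch of the
bookkeeping I have in mind.  Everything in it is an assumption for
illustration: the PoolTracker class, the max_failures threshold, and the
"PoolName" machine attribute are hypothetical names, not existing Condor
features.  How the POST script learns which pool ran the job, and where
the counts are persisted, are exactly the open questions above.

```python
from collections import Counter


class PoolTracker:
    """Counts job outcomes per pool and reports pools over a failure limit."""

    def __init__(self, max_failures=20):
        self.max_failures = max_failures
        self.successes = Counter()
        self.failures = Counter()

    def record(self, pool, exit_code):
        """Record one finished node; a non-zero exit code counts as a failure."""
        if exit_code == 0:
            self.successes[pool] += 1
        else:
            self.failures[pool] += 1

    def excluded_pools(self):
        """Pools whose failure count exceeds the threshold, sorted by name."""
        return sorted(p for p, n in self.failures.items() if n > self.max_failures)

    def requirements_clause(self, attr="PoolName"):
        """Build a ClassAd-style clause that rejects machines in excluded pools.

        `attr` is a hypothetical machine attribute that names the pool.
        """
        clauses = [f'(TARGET.{attr} =!= "{p}")' for p in self.excluded_pools()]
        return " && ".join(clauses) if clauses else "true"


tracker = PoolTracker(max_failures=2)
for _ in range(3):
    tracker.record("poolA", 1)   # three fast failures from one pool
tracker.record("poolB", 0)       # one success elsewhere
print(tracker.requirements_clause())  # (TARGET.PoolName =!= "poolA")
```

The resulting clause would be ANDed into the Requirements expression of
the remaining nodes in the DAG, which is the part that needs the POST
process to be able to modify the template classads.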

I can have job completions at a rate of 5-10 Hz, so Condor needs to be
able to handle this level of DAG classad updates.

It feels to me like a better way to deal with this particular situation
would be to have a permanent process running that does the DAG POST
processing.  It could even be multi-threaded with a worker pool to
automatically handle multiple POST requests at once.  The DAG POST
mechanism could then define a standard message passing interface (a
port and a message format), and DAGMan could simply communicate with
that one existing process, rather than fork a new process to handle
each completing DAG node.
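
As a rough sketch of what such a permanent process might look like: a
small TCP server with a fixed worker pool that accepts one JSON message
per line for each completed node and replies with the running counts for
that pool.  The message fields ("node", "pool", "exit_code"), the port
number, and the function names are all assumptions of mine; DAGMan has
no such interface today as far as I know.

```python
import json
import socket
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Shared per-pool counters, guarded by a lock because workers run concurrently.
successes = Counter()
failures = Counter()
lock = threading.Lock()


def handle_message(line):
    """Update the counters from one JSON message and return the new totals."""
    msg = json.loads(line)
    pool = msg["pool"]
    with lock:
        if msg["exit_code"] == 0:
            successes[pool] += 1
        else:
            failures[pool] += 1
        return {"pool": pool,
                "successes": successes[pool],
                "failures": failures[pool]}


def handle_connection(conn):
    """Serve one client: read newline-delimited JSON, answer each line."""
    with conn, conn.makefile("rwb") as stream:
        for raw in stream:
            reply = handle_message(raw.decode("utf-8"))
            stream.write((json.dumps(reply) + "\n").encode("utf-8"))
            stream.flush()


def serve(host="127.0.0.1", port=9999, workers=8):
    """Accept connections forever and hand each one to the worker pool."""
    with socket.create_server((host, port)) as listener, \
            ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            conn, _addr = listener.accept()
            pool.submit(handle_connection, conn)
```

Calling serve() starts the listener.  At 5-10 completions per second a
single long-lived process like this avoids a fork per node, and the
counts live in memory rather than in a file that every POST script
invocation would have to lock and rewrite.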

Any thoughts on this situation would be appreciated.


Ian Stokes-Rees, PhD                       W: http://abitibi.sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx               T: +1.617.432.5608 x75
NEBioGrid, Harvard Medical School          C: +1.617.331.5993
