Re: [Condor-users] Gracefully stopping DAGMAN



On Mar 26, 2010, at 10:46 AM, Craig Struble wrote:

> 
> On Mar 26, 2010, at 11:36 AM, R. Kent Wenger wrote:
> 
>> On Thu, 25 Mar 2010, Robert Mortensen wrote:
>> 
>>> I submit a DAG in which each node has a PRE and a POST script; there are no parent/child relationships, since each node is independent. The PRE script prepares the data for the node to use, and the POST script post-processes the data and marks the status of each node in a separate database. We have a script that allows our users to cancel the run (a run may have thousands of nodes and take several hours to complete). The question is: how can I stop the DAG but still have the POST script run for each node that has started running?
>>> 
>>> Currently, I put a "KILL" file in the directory the DAG is run from; the PRE scripts check for this file and exit with a non-zero result. This keeps nodes that have not yet run from being added to the queue. Then I condor_rm each of the idle and running nodes, which evicts them and runs their POST scripts (which is what I need). I then just wait for the DAG to finish. If there are a lot of unrun nodes, I must wait for all of their PRE scripts (which do nothing) to run, which is a waste and can take a while.
>>> 
>>> Basically, I need to signal DAGMan to stop running PRE scripts and submitting nodes, condor_rm all submitted nodes, and run any pending POST scripts. Is there any way to do this?
>>> 
>>> BTW, I'm running 7.4.1 on Windows.
>> 
>> Hmm.  I can't think of a fairly easy way to do exactly what you want to do.  If you condor_rm the DAGMan job, it will rm all of the node jobs, but it won't run any of the POST scripts.
>> 
>> I'm thinking that the real solution to this problem is to add a configuration knob to tell DAGMan exactly what you want it to do when you condor_rm it -- so you could tell it, for example, to remove jobs in the queue, but still go ahead and run the POST scripts.  How does that sound?
>> 
> 
> I like this idea. I recently developed a workflow that stages and unstages data to a web server. During the development, it would have been very handy to have a "do this on condor_rm" knob so that unstaging would occur when I stopped my DAG prematurely.


Yeah, that would work. It would need to run the POST script for any node whose PRE script ran successfully. I also noticed that the PRE scripts get way ahead of the node job submission, so a control for that would be good as well. This would of course need to work for sub-DAGs too!
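
For anyone following along, the "KILL" file workaround from the quoted message could be sketched roughly as below. This is only an illustration under assumptions, not the actual scripts in use: the function names, the `prepare` callback, and the idea that the PRE script's working directory is the directory the DAG is run from are all hypothetical. Only the "KILL" file name and the non-zero exit come from the thread.

```python
import os

# File name taken from the thread; where it lives is an assumption.
KILL_FILE_NAME = "KILL"


def kill_requested(run_dir):
    """Return True if the cancel marker file exists in run_dir."""
    return os.path.isfile(os.path.join(run_dir, KILL_FILE_NAME))


def pre_script(run_dir, prepare):
    """Return the exit code a PRE script would use.

    If a cancel has been requested, skip the data preparation and
    return non-zero so the node job is not submitted. Otherwise run
    the (hypothetical) prepare() callback and return 0.
    """
    if kill_requested(run_dir):
        return 1
    prepare()
    return 0


# In a real PRE script the last line might look like this
# (prepare_node_data is a hypothetical function):
#   sys.exit(pre_script(os.getcwd(), prepare_node_data))
```

Note that this only avoids the data preparation itself. DAGMan still has to launch one PRE script per remaining node, which is the delay described in the quoted message.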

Let me know if something like this is added in the future! (Of course I'll keep watching for announcements as well....)

Thanks,
Bob Mortensen