
Re: [HTCondor-users] define rescue dag file location



Hi Kent,

Thank you for your email. The way we have set up our pipeline with Condor DAGMan has been working pretty well for more than two years now. We do not use DAGMAN_ABORT_DUPLICATES. We also keep the logs from each DAGMan launch and job launch in their own dedicated directories.

Thanks

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of R. Kent Wenger
Sent: Monday, April 21, 2014 10:04 AM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] define rescue dag file location

On Mon, 21 Apr 2014, Shrivastava, Savita wrote:

> We have recently upgraded condor version 7.8.9 from 8.6.5. In version
> 7.8.9 version the -rescue option for condor_dagman command does not
> exist anymore and by default the rescue dag file is written in the
> directory where the original dag file reside.  With -rescue option I
> used to give a unique name and location to rescue dag file. Is there
> still a way that I can name and store each rescue dag file uniquely
> when I launch several dag instances at the same time using same dag file.

I assume you mean you upgraded from 7.6.5. :-)  Any reason you didn't upgrade to the 8.0 series?

Wow!  If you're launching several DAG instances of the same DAG file at the same time, I'm surprised that DAGMan is allowing it, and I'm really surprised that you aren't having other problems!  Are you setting DAGMAN_ABORT_DUPLICATES to False in your configuration?  If you aren't, the duplicate DAGMans should be aborting themselves; but setting DAGMAN_ABORT_DUPLICATES to False is a bad idea -- the documentation should make that clear!
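(For reference, this is how that knob looks in an HTCondor configuration file, e.g. condor_config.local; the value shown is the safe default, where a DAGMan that detects another instance already running on the same DAG file aborts itself.)

```
# Default is True: a duplicate DAGMan on the same DAG file aborts itself.
# Setting this to False disables that safety check -- not recommended.
DAGMAN_ABORT_DUPLICATES = True
```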

I'd highly recommend avoiding launching multiple instances on the same DAG file; in fact, if you're running a version earlier than 7.9.0, you should avoid having two instances of a DAG running that use the same submit files even if the actual DAG files are different.  (The real problem is that you don't want two or more DAGMans reading the same set of user log files, which will happen if they're using the same set of node job submit files.)

If you upgrade to 7.9.0 or later, you can handle this much more easily (because the DAGMan node job log file mechanism is improved).  In that case, if you copy your DAG file, or even make a symbolic link to another file name, you'll be okay, because DAGMan uses a single log file based on the DAG file name, instead of reading from the log file(s) specified in the node job submit files.
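As a rough sketch of that copy/symlink approach (the DAG file name pipeline.dag here is hypothetical; the condor_submit_dag lines are commented out since they depend on your installation):

```shell
# In 7.9.0+, DAGMan derives its single node-job log file from the DAG
# file name, so a copy or symlink gives each instance its own log.
touch pipeline.dag                    # stand-in for your real DAG file
ln -sf pipeline.dag pipeline_run2.dag
# condor_submit_dag pipeline.dag           (first instance)
# condor_submit_dag pipeline_run2.dag      (second instance, separate log)
ls -l pipeline_run2.dag
```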

At any rate, in the 7.7 series we removed the -rescue flag to condor_dagman, because we changed to an abbreviated rescue DAG that can only be parsed in conjunction with the original DAG file.
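A sketch of how that looks on disk, again assuming a hypothetical DAG file named pipeline.dag: after a failed run DAGMan writes pipeline.dag.rescue001 next to the original, a second failure writes pipeline.dag.rescue002, and so on, and re-submitting the original DAG picks up the newest rescue automatically.

```shell
# Rescue DAGs (if any) sit beside the original DAG file:
#   condor_submit_dag pipeline.dag    # re-run; newest rescue is applied
ls pipeline.dag.rescue* 2>/dev/null || echo "no rescue DAGs yet"
```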

Kent Wenger
CHTC Team

