
Re: [HTCondor-users] DAGMan low submission rate



Hi,

No, what I mean is that, when using the queue command, I don't know of a way to determine the progress of a workflow or which jobs failed at the end of it. With DAGMan, I can just kill the DAG and have the progress saved in the rescue DAG. Similarly, the rescue DAG is a much easier way to determine which jobs failed and need to be checked than parsing the log files or checking condor_history.
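
As a rough illustration of why the rescue DAG is convenient: assuming the rescue file marks completed nodes with `DONE <node>` lines and the original DAG declares nodes with `JOB <node> <submit file>` lines, a few lines of Python can list what is left to check. The sample DAG, the node names, and the helper function names below are all hypothetical.

```python
def node_names(dag_text):
    """Return node names declared with JOB lines in a DAG description."""
    names = []
    for line in dag_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0].upper() == "JOB":
            names.append(parts[1])
    return names


def done_nodes(rescue_text):
    """Return the set of nodes marked DONE in a rescue DAG (assumed format)."""
    done = set()
    for line in rescue_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper() == "DONE":
            done.add(parts[1])
    return done


def unfinished(dag_text, rescue_text):
    """List nodes from the DAG that the rescue file does not mark as DONE."""
    done = done_nodes(rescue_text)
    return [name for name in node_names(dag_text) if name not in done]


# Hypothetical three-node DAG and a matching rescue file.
dag = """JOB A a.sub
JOB B b.sub
JOB C c.sub
"""
rescue = """# illustrative rescue file
DONE A
DONE C
"""

print(unfinished(dag, rescue))
```

Anything in the returned list either failed or never ran, so it only narrows down which nodes to look at.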

Benedikt

On 27 June 2017 at 14:49, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

In 8.6, condor_submit has support for retrying a job that failed, just like DAGMan does. The only thing DAGMan can do there that condor_submit can't is pre and post scripts. Is that what you mean?

-tj

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Benedikt Riedel
Sent: Tuesday, June 27, 2017 2:42 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] DAGMan low submission rate

Hi,

By job monitoring, I mean which jobs in my set have failed, succeeded, etc. With 1.2M nodes, it can quickly become a pain to do all this outside of DAGMan.
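
For comparison, here is a sketch of the do-it-yourself route: tallying job states from a job event log. It assumes the classic text log format, where each event starts with a three-digit code and `(cluster.proc.subproc)`, with 000 for submission and 005 for termination followed by a `return value` line. The sample log and the function name are made up for illustration.

```python
import re
from collections import Counter

# Event header, e.g. "005 (12.001.000) 06/27 14:06:00 Job terminated."
EVENT = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")
# Detail line of a normal termination, e.g. "(return value 1)"
RETVAL = re.compile(r"\(return value (-?\d+)\)")


def tally(log_text):
    """Count jobs by their last known state: submitted, succeeded, failed."""
    status = {}
    current = None  # job whose termination details are being read
    for line in log_text.splitlines():
        header = EVENT.match(line)
        if header:
            code = header.group(1)
            job = f"{int(header.group(2))}.{int(header.group(3))}"
            current = job if code == "005" else None
            if code == "000":
                status[job] = "submitted"
            elif code == "005":
                # Treated as failed unless a zero return value follows.
                status[job] = "failed"
            continue
        if current:
            detail = RETVAL.search(line)
            if detail:
                ok = int(detail.group(1)) == 0
                status[current] = "succeeded" if ok else "failed"
                current = None
    return Counter(status.values())


# Made-up log excerpt in the assumed format.
sample = """000 (12.000.000) 06/27 14:00:00 Job submitted from host: <host>
...
000 (12.001.000) 06/27 14:00:00 Job submitted from host: <host>
...
000 (12.002.000) 06/27 14:00:01 Job submitted from host: <host>
...
005 (12.000.000) 06/27 14:05:00 Job terminated.
\t(1) Normal termination (return value 0)
...
005 (12.001.000) 06/27 14:06:00 Job terminated.
\t(1) Normal termination (return value 1)
...
"""

print(tally(sample))
```

Even this small version has to track state across lines, which is the kind of bookkeeping DAGMan otherwise handles.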

Benedikt

On 27 June 2017 at 14:33, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

Can you clarify what you mean by "job monitoring"?

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Benedikt Riedel
Sent: Tuesday, June 27, 2017 10:31 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] DAGMan low submission rate

Hi,

This might be tangential. DAGMan and late materialization are not compatible at the moment. With late materialization, users seemingly have to go off and write their own job monitoring code, whereas DAGMan provides job monitoring for free. Am I missing something? What is the best practice for monitoring large independent job sets when using late materialization?

As for Henning's user's issue, have you tried setting

DAGMAN_USER_LOG_SCAN_INTERVAL = 1

This appears to increase the submission rate.

Thanks,

Benedikt

On 27 June 2017 at 09:53, Greg Thain <gthain@xxxxxxxxxxx> wrote:

On 06/27/2017 01:30 AM, Henning Fehrmann wrote:

Hello,

One of our users started a DAG with 1.2M nodes which do not depend on
other nodes. It seems that on average 120 jobs are submitted per
minute. This number fluctuates strongly.


If there are no dependencies between nodes, and you aren't using other DAGMan features like pre/post scripts, would you consider upgrading to 8.7 to use the late materialization feature instead of DAGMan? It was designed exactly for this use case.

-greg



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--

Benedikt Riedel
Scientific Programmer
University of Chicago
Computation Institute

