Re: [HTCondor-users] Application specific scheduler



So Nick says he wants to migrate his home-grown distributed environment to HTCondor. As a new user, he considers three options. Miron says to use DAGMan. Miha asks why. Miron says because it manages job dependencies. Gabriel says DAGMan is the way to go, but he wonders "why, in case of failure, one has to restart the workflow rather than retry the failed jobs."

Kent Wenger from the CHTC team clarifies and says yes, we know it is a problem, gives the link, and has a name for it: this is issue #2831.

Let me stop here. Nick seems to be an experienced sysadmin / engineer. But the HTCondor list has 2,100 subscribers. How many of these subscribers know about DAGMan? Maybe they search and read about why, in case of failure, they have to resubmit all jobs from the beginning?

One has the documentation at:

http://research.cs.wisc.edu/htcondor/manual/v7.8/2_10DAGMan_Applications.html

How many of those who read those pages understand them well enough to apply them in practice? And how many of those who do are aware of issue #2831? There is no mention of it anywhere.

So I estimate, to be generous, that 100 people on the list know. That means there are 2,000 people reading here who don't understand how to use DAGMan properly directly from reading the documentation, and who probably don't understand how to benefit from this thread.

After 30 years of successfully using HTCondor - and it is a great tool for the very few who know how to use it - the goal should not be to add more and more features.

The goal should be to gain significantly more users. The trend now is to use web services APIs: REST, NEWT (the same thing), the Agave API, or even HTCondor APIs, so that the poor user can submit dependent jobs without even knowing there is a DAGMan somewhere.

I have a lot of respect for the people who made HTCondor. This is why now is the time to make a radical change in UX.

Thank you,

Miha





--- --- --- --- --- --- --- --- --- --- --- --- ---

Miha Ahronovitz

Principal Ahrono Associates

Web: http://www.ahrono.com/

Blog: http://my-inner-voice.blogspot.com/

c: 408 422 2757

emiha.ahronovitz@xxxxxxxxxx

tw: @myinnervoice

--- --- --- --- --- --- --- --- --- --- --- --- ---




On Sat, Jun 28, 2014 at 1:12 PM, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
On Sat, 28 Jun 2014, Gabriel Mateescu wrote:

If there is something that may need improvement in DAGMan,
it is that I do not understand why, in case of failure, one has
to restart the workflow rather than retry the failed jobs, possibly
on different execution nodes.

You're not the first person to ask for that capability:

  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2831,4
  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3403,4

I don't know exactly how #2831 will happen, but hopefully in the 8.3 series...

One question for #2831 is this:  how does the user notify DAGMan that a particular failed node should be retried?  (This is assuming that the user has done some kind of manual fix to whatever caused the node to fail.  If you just want to retry nodes without any kind of manual intervention, you can just specify retries in the DAG, although getting the retry to land on a different machine than the previous try is tricky.)
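
As an illustration of "specify retries in the DAG", here is a minimal Python sketch that writes out a DAG input file with a RETRY line for every node. The helper name build_dag, the node names, and the submit file names (a.sub, b.sub, c.sub) are made up for this example; only the JOB, PARENT ... CHILD, and RETRY keywords are DAGMan's own. It is a sketch under those assumptions, not a recommended tool.

```python
def build_dag(jobs, parents, retries=3):
    """Return the text of a DAGMan input file.

    jobs:    dict mapping node name -> submit description file
             (all names here are hypothetical, for illustration only)
    parents: dict mapping a child node -> list of its parent nodes
    retries: how many times DAGMan should retry each failed node
    """
    lines = []
    # One JOB line per node, pointing at its submit description file.
    for name, submit_file in jobs.items():
        lines.append(f"JOB {name} {submit_file}")
    # One PARENT ... CHILD line per child that has dependencies.
    for child, parent_list in parents.items():
        lines.append(f"PARENT {' '.join(parent_list)} CHILD {child}")
    # One RETRY line per node, so a failed node is rerun automatically.
    for name in jobs:
        lines.append(f"RETRY {name} {retries}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    dag_text = build_dag(
        jobs={"A": "a.sub", "B": "b.sub", "C": "c.sub"},
        parents={"C": ["A", "B"]},
    )
    print(dag_text)
```

The printed text could be saved to a file (say, workflow.dag, again a made-up name) and submitted with condor_submit_dag. Note that RETRY only reruns the node; it does not by itself steer the retry to a different machine, which is the tricky part mentioned above.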

Kent Wenger
CHTC Team

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/