
Re: [HTCondor-users] change to condor_submit - user feedback desired! (was Re: multiple condor_submit's - one cluster)




Hi David, Don, Klint, Ben, Dimitri, Carl, Brian, Lauren (hope I didn't omit anyone) -

Thanks much for the valuable feedback to date! People like you are precisely why open source works and why HTCondor will continue to improve!

Currently pondering the points folks made. In a few days I will distill this down into a proposed concrete plan of action and post back here. I think I will focus first on the details for the ability of condor_submit to scan the filesystem and do a submit for each file (option 1 in my original email), and let thoughts about condor_submit reading lines from stdin (option 2) mature a bit more... (yes I agree with Dimitri they are related, but want to start where I think the need is greatest...)

best regards,
Todd

On 2/10/2015 12:06 AM, David Champion wrote:
* On 09 Feb 2015, Lauren Michael wrote:
Hi All,

First, I strongly echo Ben's points, especially for keeping the submit file
as a record of the exact syntax for future reference by the user (to
understand what he/she did).

For example, the following (#2):
     ls data/*.csv | grep foo | condor_submit -submit_per_line input_line
employs skills and unix familiarity (grep, pipe) that most users I work
with largely do not have. To remember and use such a command, they'd end up
recording it in a document or perhaps a script. The greater the number of
the arguments in the command, the more this type of recording becomes true,
in my experience.

I agree with the spirit of this. My enthusiasm for adding this kind of
notation is actually aligned with it: to address this control issue,
it's become fairly common (I think) to write shell scripts -- or python,
etc -- that generate condor_submit files as output. That's another step
yet removed from the submit file's being a job record. That kind of
circumstantial complexity exists now, and the closer we can put it to
the submit file, the better off users are.
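
For readers who have not seen the generator-script pattern described above, here is a minimal sketch in Python. It is only an illustration, not anyone's actual script: the executable name analyze.sh, the data directory, and both helper function names are hypothetical. It emits a submit description with one "arguments"/"queue" pair per matching file, roughly what the ls | grep pipeline would select.

```python
from pathlib import Path


def matching_inputs(directory, pattern, keyword):
    """Rough equivalent of `ls directory/pattern | grep keyword`.

    Assumption: the keyword is checked against the file name only.
    """
    return sorted(
        p.as_posix() for p in Path(directory).glob(pattern) if keyword in p.name
    )


def build_submit_description(executable, input_files):
    """Return submit-description text that queues one job per input file."""
    lines = [f"executable = {executable}"]
    for path in input_files:
        # Each 'arguments' line applies to the 'queue' statement that follows it.
        lines.append(f"arguments = {path}")
        lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 'analyze.sh' and 'data' are placeholders for illustration only.
    inputs = matching_inputs("data", "*.csv", "foo")
    print(build_submit_description("analyze.sh", inputs), end="")
```

The generated text would then be saved and handed to condor_submit, which is exactly the extra step that moves the record of what was run away from the submit file itself.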


Stepping back, I believe there are multiple motivations emerging in this
thread, though I'll also point out that I *believe* they are all from
"advanced" users of HTCondor and unix (at least for the names I recognize
in this thread, probably excluding myself).

Here's an attempt at a summary of desired outcomes listed in this email
thread so far:
1. Provide users with an in-file alternative to $(Process) for cases when
the user has many similarly-named but non-numbered files, and lacks the
know-how/desire/time to convert such files to numbered filenames while
maintaining metadata about which file is which.
(not mentioned here yet, but I'm adding it now, as I interact with
countless non-advanced users facing this barrier and have otherwise
discussed it at length with people like Todd T, motivating a foreach-like
option.)

Another possibility occurs to me today: some kind of mapping declaration
might make it possible to translate ordered patterns to sequences of
control directives. But maybe that's just another color of light shining
on things we've already discussed (e.g. native control loops in c_s).

2. Create a simple syntax for executing #1 that doesn't require significant
unix/scripting experience.
3. If possible, allow advanced users to also intuitively use the solution
in a unix-y and/or scripting way.
4. Minimize performance/latency side effects.


Specifically commenting on syntax:
I also see David's point for not creating a universal name ("file"). Is
something *like* the following possible?:

queue foreach species in $(species).data

I'm also in favor of something like the above because I *think* "queue
foreach data/*.csv" effectively co-opts the wildcard and would keep the user
from specifying files using multiple wildcard instances (say, for
sub-directories). For example, what if I wanted to "queue foreach
*_data/*.data"?

I think this shouldn't be a problem so long as the C++ implementation
uses fnmatch().  (Sorry for the technical-speak.)  But I think there's
a good point here that limiting the subject of the for(each) to files
on the filesystem is... well, potentially limiting.  Looking to future
possibilities, I would prefer that the syntax explicitly state that it's
matching local filenames.
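
To make the fnmatch() point concrete, here is a small sketch using Python's standard fnmatch module as a stand-in. This is an assumption for illustration: nothing in this thread says how condor_submit would call the C fnmatch(), or with which flags. The candidate paths are made up. Python's version behaves like fnmatch() without FNM_PATHNAME, so "*" can also match across "/".

```python
import fnmatch

# Pattern with two wildcard instances, as in "queue foreach *_data/*.data".
pattern = "*_data/*.data"

# Hypothetical candidate paths, for illustration only.
candidates = [
    "mouse_data/run1.data",
    "human_data/run2.data",
    "mouse_data/notes.txt",
    "archive/old_data/run3.data",
    "data/run4.data",
]

# fnmatchcase compares the strings as given, without case normalization.
matches = [c for c in candidates if fnmatch.fnmatchcase(c, pattern)]

for m in matches:
    print(m)
```

The two *_data paths match as expected, but so does archive/old_data/run3.data, because "*" is allowed to span the directory separator here. A C++ implementation that passed FNM_PATHNAME would reject that third path, so the choice of flags changes which files end up queued.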


I am so excited that we're at the point of crowd-sourcing input for such a
feature!

+1!



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685