If I understand correctly what you want to do, I think a combination of category throttles and node priorities would do it. You could do something like this:
 Job D0 download.sub # single threaded
 Job P0 preprocess.sub # requires a lot of memory
 Job C0 calculate.sub # uses lots of cores
 Job R0 remove.sub # cleans up input files
 Job S0 summarize.sub # takes a while, mostly I/O bound
 VARS D0 id="<uuid0>"
 VARS P0 id="<uuid0>"
 VARS C0 id="<uuid0>"
 VARS R0 id="<uuid0>"
 VARS S0 id="<uuid0>"
 PARENT D0 CHILD P0
 PARENT P0 CHILD C0
 PARENT C0 CHILD R0 S0 # remove and summarize can run in parallel?
 MAXJOBS nfs_limit 10
 CATEGORY D0 nfs_limit
 CATEGORY P0 nfs_limit
 CATEGORY C0 nfs_limit
 CATEGORY R0 nfs_limit
 # S0 not here because it doesn't depend on downloaded files
 PRIORITY D0 10
 PRIORITY P0 100
 PRIORITY C0 1000
 PRIORITY R0 10000
 # Not sure about priority for summarize
If you do something like this, with one such set of nodes per input data set, your DAG should start out by submitting 10 download jobs, since that is all the nfs_limit throttle allows. When the first download job finishes, the corresponding preprocess job will be submitted before any more download jobs, because of its higher priority. Then, as you work your way along, calculate jobs will be favored over preprocess jobs, and remove jobs will be the most favored, so input files get cleaned up as soon as possible.
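
If you have many data sets, you probably don't want to write those lines by hand. Here is a rough sketch of a Python script that generates the DAG file, assuming one D/P/C/R/S set of nodes per id. The function name make_dag, the STAGES table, and the node numbering are just my own illustration, not anything DAGMan requires, and the priorities are the ones from the example above.

```python
import uuid

# (node prefix, submit file, priority or None, throttled by nfs_limit?)
# Priorities mirror the example DAG; summarize has none because I'm
# not sure what it should be.
STAGES = [
    ("D", "download.sub", 10, True),
    ("P", "preprocess.sub", 100, True),
    ("C", "calculate.sub", 1000, True),
    ("R", "remove.sub", 10000, True),
    ("S", "summarize.sub", None, False),
]


def make_dag(ids, max_jobs=10):
    """Return DAG file text with one five-node pipeline per id."""
    lines = []
    for i, uid in enumerate(ids):
        for prefix, sub, prio, throttled in STAGES:
            node = f"{prefix}{i}"
            lines.append(f"JOB {node} {sub}")
            lines.append(f'VARS {node} id="{uid}"')
            if throttled:
                lines.append(f"CATEGORY {node} nfs_limit")
            if prio is not None:
                lines.append(f"PRIORITY {node} {prio}")
        # download -> preprocess -> calculate -> (remove, summarize)
        lines.append(f"PARENT D{i} CHILD P{i}")
        lines.append(f"PARENT P{i} CHILD C{i}")
        lines.append(f"PARENT C{i} CHILD R{i} S{i}")
    # A single throttle shared by every node in the category.
    lines.append(f"MAXJOBS nfs_limit {max_jobs}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder ids; substitute whatever identifies your data sets.
    ids = [str(uuid.uuid4()) for _ in range(3)]
    print(make_dag(ids), end="")
```

Redirect the output to a file and hand that file to condor_submit_dag. Change max_jobs to however many data sets your NFS space can hold at once.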