
Re: [HTCondor-users] POST script user privileges in DAG



Hi Kent,

Thanks for your quick response.

> I guess one workaround could be this: instead of doing your file deleting in a POST script, add another node that does this. If the node job is submitted similarly to the process jobs, it should end up with an NFS id of nobody:nogroup, and thereby be able to delete the files. In this model, you'd have a delete node that would immediately follow each process node. If you do that, the delete node will only be run if the process node succeeds.

The submit node has a user:group of ubuntu:ubuntu (the login that came with the image I'm using), though other users are possible in the future, and this user:group does not have permission to delete the nobody:nogroup files. So if I instead went the POST-script route and had the POST script do the submitting, I would be specifying condor_submit as its executable then? My main concern with the extra-node approach is that the newly submitted node job would sit in the queue for a while before running, but I guess I could raise its priority; avoiding that delay was the main reason I wanted to do the cleanup in a POST script in the first place.
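
If I go with the extra-node route, I'm picturing something like this per dataset (just a sketch; delete.sub would wrap the rm, and the node name and priority value are made up):

Job DEL0 delete.sub     # submitted like the other node jobs, so it maps to nobody:nogroup and can delete the files
VARS DEL0 id="<uuid0>"
PARENT C0 CHILD DEL0    # only runs if C0 succeeds, like the POST script would
PRIORITY DEL0 50        # bump the node priority so the cleanup doesn't sit behind everything else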

> You might find category throttles useful for your workflows:
> http://research.cs.wisc.edu/htcondor/manual/v8.2/2_10DAGMan_Applications.html#SECTION003108400000000000000

I've looked into categories before but couldn't see a way to make them work here. I might be missing something, but it feels like CATEGORY throttles and concurrency LIMITS serve a similar function, just enforced at different levels (one at the DAG level, the other at the job level).
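
Just to lay out how I understand the two knobs (the limit name and numbers here are made up): the job-level one lives in the submit file, e.g.

concurrency_limits = download_server    # pool config would define DOWNLOAD_SERVER_LIMIT = 5

while the DAG-level one is per-node category lines plus a throttle in the DAG file, e.g.

CATEGORY D0 downloads
CATEGORY D1 downloads
MAXJOBS downloads 5

Both seem to cap how many downloads run at once, just enforced in different places.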

Maybe the challenges I'm facing would be more clear with this example:

Say I have 100 datasets to process, but my NFS can only hold about 10 at a time before filling up. I have 4 jobs I want to run on each dataset: download (D#) -> preprocess (P#) -> calculate (C#) -> summarize (S#).
My DAG would then look something like:

Job D0 download.sub   # single threaded
Job P0 preprocess.sub # requires a lot of memory
Job C0 calculate.sub # uses lots of cores
Job S0 summarize.sub # takes a while mostly I/O bound

SCRIPT POST C0 rm_download.sh "<uuid0>" $RETURN
VARS D0 id="<uuid0>"
VARS P0 id="<uuid0>"
VARS C0 id="<uuid0>"
VARS S0 id="<uuid0>"

PARENT D0 CHILD P0
PARENT P0 CHILD C0
PARENT C0 CHILD S0
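
where rm_download.sh is just a small cleanup wrapper, roughly along these lines (the NFS path below is a placeholder):

#!/bin/bash
# rm_download.sh <uuid> <node return value>
uuid="$1"
ret="$2"
if [ "$ret" -eq 0 ]; then
    rm -rf "/mnt/nfs/datasets/$uuid"   # delete the downloaded data for this dataset
fi
exit "$ret"   # the POST exit code decides node success, so pass the job's status through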


and then I repeat that four-node block 99 more times, up through D99/P99/C99/S99, each with a different UUID.
I have LIMITS on download.sub to keep from overloading the download server, but as soon as each download job finishes it releases its allocation on the LIMIT and the next download starts, and I don't see how using categories would change that. Ideally, I would want something that counts toward MAXJOBS when the download starts and doesn't release its allocation until the POST script has run. Putting preprocess/calculate into the same category wouldn't fix this either, since their resource requirements are higher, so it would just add another barrier to getting run?

The challenge I've had with this setup is that condor_submit_dag seems to submit all the download jobs (D0-D99) at once (so using -maxidle will not work), and then as each download finishes it submits the preprocess jobs (P0-P99). However, once the preprocess jobs finish, it rarely starts a calculate job because not enough resources are available, so it mostly just cycles between download and preprocess. The current workaround I'm thinking about is splitting my original DAG into 10 smaller DAG files.
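
e.g. something along these lines (file names made up; I haven't tried it yet):

# split the 100 chains into chunk_0.dag .. chunk_9.dag, 10 datasets each,
# then run the chunks back to back so only ~10 datasets sit on the NFS at a time
for i in $(seq 0 9); do
    condor_submit_dag chunk_${i}.dag
    condor_wait chunk_${i}.dag.dagman.log   # assuming the default <dag>.dagman.log event log name
done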

Best,
Ying