
Re: [HTCondor-users] POST script user privileges in DAG



On Thu, 5 Feb 2015, Ying W wrote:

I have an NFS (network) share that I use to share files between my various
condor workers, and my pipeline starts with a submit file that downloads
large datasets to NFS and eventually processes and deletes this data.
Something like this:
download (1 core, some memory) -> process (lots of cores and memory) -> POST
(remove downloaded files) -> calculate metrics on output files (few cores,
little memory). I don't want all of my workers doing the 'download' job with
none of them free for 'process' jobs, and I don't want to overload the space
I have. The process jobs are shared with other DAGs that work with smaller
datasets that are not deleted (so I don't want to hardcode a delete in
there).
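
A minimal DAG sketch of that pipeline might look like this (node and
submit-file names are made up):

    # pipeline.dag (sketch)
    JOB    download  download.sub
    JOB    process   process.sub
    JOB    metrics   metrics.sub
    PARENT download CHILD process
    PARENT process  CHILD metrics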

Since the NFS share could quickly reach capacity due to the size of the
input files, I created a POST script that removes the input files if
$RETURN is 0 (the job exited successfully). However, I suspect I am running
into a permission error, since the files are not being deleted (my submit
node has access to NFS, but the user does not have permission to remove a
file created by nobody:nogroup).
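
For concreteness, the hookup being described would be something like this in
the DAG file, with $RETURN handed to the script as its first argument (the
script name and dataset path are made up):

    SCRIPT POST process cleanup.sh $RETURN /nfs/data/run42

and cleanup.sh itself:

    #!/bin/sh
    # $1 is the node job's $RETURN value, $2 is the dataset directory
    if [ "$1" -eq 0 ]; then
        rm -rf "$2"    # only remove the inputs if the process job succeeded
    fi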

When files are created by my condor worker VMs, the user:group is
nobody:nogroup (I am running Ubuntu 12.04), while if the POST script creates
a file, the user:group is the same as the user:group that ran
condor_submit_dag. I was wondering whether it is possible to keep the
user:group the same when running the POST script, and whether there are any
tips on debugging PRE/POST scripts, since their stdout and stderr don't seem
to be captured.

Ah, yes, the "no stdout/stderr from PRE/POST scripts" issue has been around quite a while:
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=171,4
One workaround is for the POST script you specify in your DAG file to just be a wrapper that runs the "real" script and captures stdout and stderr.
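
Something along these lines would do it (a sketch only; the script names and
log path are made up):

    #!/bin/sh
    # post_wrapper.sh -- run the real POST script and capture its output,
    # since DAGMan itself does not save PRE/POST stdout/stderr.
    /path/to/real_post.sh "$@" > /nfs/logs/post_script.out 2>&1
    exit $?

Your SCRIPT POST line would then name post_wrapper.sh instead of the real
script; the exit status is passed through, so DAGMan still sees the real
script's success or failure.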

I think the difference is this: the POST script is forked directly from the DAGMan job on the submit machine, so it's not too surprising that the user:group is the same as for the user who submits the DAG. Is your HTCondor installed as root? If so, the download and process jobs should be running as that same user, but maybe they don't have the right NFS credentials to create the files as anything other than nobody:nogroup?
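
If it applies to your setup, one common cause of that ownership pattern is
squashing on the NFS export itself; an export along these lines (purely
illustrative, not necessarily your configuration) maps every client uid/gid
to the anonymous user, which on Ubuntu is nobody:nogroup:

    # /etc/exports on the NFS server (illustrative)
    /shared  *(rw,sync,all_squash,anonuid=65534,anongid=65534)

With the default root_squash only root is remapped; with all_squash every
user is.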

I guess one workaround could be this: instead of doing your file deletion in a POST script, add another node that does it. If that node's job is submitted similarly to the process jobs, it should end up with an NFS id of nobody:nogroup, and thereby be able to delete the files. In this model, you'd have a delete node that immediately follows each process node. If you do that, the delete node will only be run if the process node succeeds.
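
In DAG terms that might look like this (delete.sub being a made-up submit
file):

    JOB    delete   delete.sub
    PARENT process  CHILD delete

where delete.sub is an ordinary job, submitted like the process jobs so it
gets the same NFS identity, for example:

    universe    = vanilla
    executable  = /bin/rm
    arguments   = -rf /nfs/data/run42
    log         = delete.log
    queue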

As a side note, I am open to suggestions on better ways of creating
pipelines that involve downloading large datasets. Originally I was planning
on downloading these large datasets into scratch space on each worker VM;
however, it did not seem straightforward to force all of the jobs in a DAG
to run on the same worker VM, so instead I planned on downloading to NFS and
using the -maxidle option of condor_submit_dag to limit the number of
datasets on NFS (since too many idle jobs would be queued if only the
download jobs ran).
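
For example (the DAG file name and limit are made up):

    condor_submit_dag -maxidle 5 pipeline.dag

makes DAGMan stop submitting new node jobs whenever 5 or more of them are
idle.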

Yes, the "run multiple jobs on the same machine" issue is another old one:
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=572,4

You might find category throttles useful for your workflows:
http://research.cs.wisc.edu/htcondor/manual/v8.2/2_10DAGMan_Applications.html#SECTION003108400000000000000
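
For example, you could put the download nodes in a throttle category of
their own and cap how many run at once (the category name and limit are made
up):

    # in the DAG file, one CATEGORY line per download node
    CATEGORY download  DiskHungry
    MAXJOBS  DiskHungry 2

With that, at most 2 download jobs are submitted at a time, while the
process and metrics nodes remain unthrottled.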

Kent Wenger
CHTC Team