Dear list,
I have an NFS share that I use to share files between my various Condor
workers. My pipeline starts with a job that downloads large datasets to
NFS, and the data is eventually processed and deleted. Something like this:

  download (1 core, some memory) -> process (lots of cores and memory) -- POST (remove downloaded files) --> calculate metrics on output files (few cores, little memory)

I don't want all of my workers running 'download' jobs with none of them
free for 'process' jobs, since that would overload the space I have. The
process jobs are shared with other DAGs that work with smaller datasets
that are not deleted (so I don't want to hardcode a delete in there).
Since the NFS could quickly reach capacity due to the size of the input
files, I created a POST script that removes the input files if $RETURN
is 0 (the job exited successfully). However, the files are not being
deleted, and I suspect I am running into a permission error: my submit
node has access to the NFS, but the user does not have permission to
remove a file created by nobody:nogroup.
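
A minimal sketch of a POST script of this kind, written in Python. It is
not the exact script in use. It assumes the DAG passes $RETURN as the
first argument followed by the paths to remove. The script name, the
POST_LOG variable and the log file name are hypothetical. It writes its
own log file because the script's stdout/stderr do not seem to be
captured.

```python
#!/usr/bin/env python3
"""Hypothetical DAGMan POST script: remove input files if the job succeeded.

Assumed DAG line (names are illustrative only):
    SCRIPT POST process cleanup_post.py $RETURN /nfs/path/to/input
"""
import logging
import os
import sys


def cleanup(return_code, paths):
    """Delete paths when return_code is 0; return how many removals failed."""
    if return_code != 0:
        logging.info("job exited with %d; keeping input files", return_code)
        return 0
    failures = 0
    for path in paths:
        try:
            os.remove(path)
            logging.info("removed %s", path)
        except OSError as err:
            # A permission problem surfaces here (EACCES or EPERM).
            failures += 1
            logging.error("could not remove %s: %s", path, err)
    return failures


def main(argv):
    # Log to a file, since the script's own output may not be captured.
    logging.basicConfig(
        filename=os.environ.get("POST_LOG", "post_cleanup.log"),  # assumed name
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    logging.info("running as uid=%d gid=%d", os.getuid(), os.getgid())
    return_code = int(argv[1])
    cleanup(return_code, argv[2:])
    # A failed job still fails the node; cleanup errors are only logged.
    return 0 if return_code == 0 else 1


# Only act when arguments were actually passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

With this layout, a removal that fails for permission reasons leaves an
ERROR line in the log along with the uid/gid the script ran as, which
is the information needed to confirm or rule out the suspicion above.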
When files are created by my Condor worker VMs, the user:group is
nobody:nogroup (I am running Ubuntu 12.04), while if the POST script
creates a file, the user:group is the same as the user:group that ran
condor_submit_dag. I was wondering if it is possible to keep the
user:group the same when running the POST script, and if there are any
tips on debugging PRE/POST scripts, since stdout and stderr don't seem
to be captured.
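
One way to narrow down the permission question is to record, from inside
the POST script, who the script runs as and who owns the file and its
directory. On a POSIX filesystem, removing a file requires write and
search permission on the containing directory; if that directory has the
sticky bit set, the caller must additionally own the file or the
directory (or be root). The Python sketch below collects those facts.
The function names and the log path argument are hypothetical, and
os.access is only a hint on NFS, since it may not reflect ID mapping
done on the server.

```python
import os
import stat


def deletion_report(path):
    """Collect the facts that decide whether this process may unlink path."""
    parent = os.path.dirname(os.path.abspath(path))
    file_st = os.lstat(path)
    dir_st = os.stat(parent)
    return {
        "running_uid": os.getuid(),
        "running_gid": os.getgid(),
        "file_owner_uid": file_st.st_uid,
        "file_owner_gid": file_st.st_gid,
        "parent_dir": parent,
        "parent_dir_mode": stat.filemode(dir_st.st_mode),
        # Unlinking needs write and search permission on the directory.
        "parent_writable": os.access(parent, os.W_OK | os.X_OK),
        # With the sticky bit set, ownership of the file or the
        # directory also matters.
        "parent_sticky": bool(dir_st.st_mode & stat.S_ISVTX),
    }


def log_report(path, log_path):
    """Append the report for path to log_path (stdout may not be captured)."""
    with open(log_path, "a") as log:
        log.write(repr(deletion_report(path)) + "\n")
```

Calling log_report on one of the input files just before the delete
would show whether the POST script runs as the submitting user and
whether that user can write to the directory holding the files.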
As a side note, I am open to suggestions on better ways of creating
pipelines that involve downloading large datasets. Originally I was
planning on downloading these large datasets into scratch space on each
worker VM. However, it did not seem straightforward to force all of the
jobs in a DAG to run on the same worker VM, so instead I planned on
downloading to NFS and using the -maxidle option of condor_submit_dag to
limit the number of datasets on NFS (since too many idle jobs would be
queued if only the download jobs ran).