
[Condor-users] File upload issue in the parallel universe



We run quite a few MPI jobs under the parallel universe that live on a single physical machine, i.e. we use SMP over many cores, and we don't have a shared file system between the submit and execute nodes. Because of the vagaries of doing this under Condor, the inevitable shell script that wraps the MPI commands checks whether its particular process has the env var _CONDOR_PROCNO set to zero: if so, it proceeds with the job, while all the other nodes grabbed by Condor simply wait for the job to finish. This usually appears in the wrapper script as:

# Nodes other than node 0 have nothing to drive: reap any background children and exit.
if [ "$_CONDOR_PROCNO" -ne 0 ]
then
        wait
        exit 0
fi
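
After that guard, the copy of the wrapper running on node 0 carries on and launches the actual MPI run over the cores Condor has given us on this one machine; roughly (the binary name and the use of $_CONDOR_NPROCS as the rank count are illustrative, not our exact script):

# Only node 0 reaches this point; run the real job across the allocated cores.
mpirun -np "$_CONDOR_NPROCS" ./my_mpi_binary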

The problem is that the _CONDOR_PROCNO check only happens *after* Condor has uploaded all the necessary files from the submit host, which means every slot grabbed for the job gets its own scratch directory and its own copy of these files. This is a complete waste of I/O, as only the slot with _CONDOR_PROCNO set to zero actually needs the files. Is there a way to tell Condor to upload files selectively, e.g. only when _CONDOR_PROCNO == 0? I've failed to find one by reading the manual, though I may have missed it.
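
For reference, the relevant part of our submit description looks roughly like this (the executable, file names and machine count are made up for illustration); as far as I can tell, transfer_input_files is applied uniformly to every node of the parallel job:

universe                = parallel
executable              = mpi_wrapper.sh
machine_count           = 8
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = my_mpi_binary, input_data.tar.gz
queue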

Aloo

PS. I realise that I can work around this by putting all the necessary files on a web/ftp/whatever server, not asking Condor to upload any files, and then doing a wget/curl/whatever from within the MPI wrapper once the "if" clause above has run, roughly as sketched below. However, this feels hacky, and it's functionality that I feel should reside within Condor.
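
Something like this inside the wrapper, with the URL and file names made up purely for illustration:

# Only node 0 reaches this point; the other slots exited in the guard above.
# Fetch the inputs ourselves rather than having Condor copy them to every slot.
wget -q http://fileserver.example.org/jobdata/input_data.tar.gz
tar xzf input_data.tar.gz
# ...then launch mpirun exactly as before.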

