
Re: [HTCondor-users] Job Realtime output file




On Sat, 9 Mar 2013, Guillermo Marco Puche wrote:

I know those directives are SGE directives. From my point of view, if SGE handles the job, it must also be able to handle its own error and output logs.

The trouble here is that SGE is being handed, for a number of hard reasons, a Russian doll of scripts to execute. Your job is the innermost doll, while the -o and -e directives (and yes, you are overriding the defaults set by 'bosco') apply to the outermost one, so stdout and stderr are very likely already being diverted at the inner layers. If you really want to see streaming stdout from your job, your best option (until we have some out-of-the-box equivalent of the Condor 'standard universe' for 'grid' or 'vanilla' universe jobs, which would indeed come in handy for many other applications) is probably to set up some form of remote I/O yourself.

If you have at least outbound network connectivity from the worker nodes to the submit node, you could try using 'chirp' (a standalone incarnation of the HTCondor remote I/O protocol, which may eventually be "re-"integrated into the 'grid' universe as the remote I/O method of choice).

In its simplest form:

0) Grab and install 'cctools', and make it available on the submit
   and worker nodes.
   http://www.cse.nd.edu/~ccl/software/download.shtml
   (the site seems to be down right now)
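
   A minimal build sketch, assuming a source tarball fetched from the
   page above (the file name 'cctools-x.y.z-source.tar.gz' is a
   placeholder for whatever version is current):

   tar xzf cctools-x.y.z-source.tar.gz
   cd cctools-x.y.z-source
   ./configure --prefix=$HOME/cctools && make && make install
   export PATH=$HOME/cctools/bin:$PATH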

1) Start chirp_server on the submit node (it will bind to port
   9094 by default, use *no* authentication/authorisation, and
   write files into the current directory).
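
   For example (the directory name is just illustrative; see
   'chirp_server -h' for the options your build actually supports):

   mkdir -p ~/chirp_output && cd ~/chirp_output
   chirp_server &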

2) Run your payload on the worker nodes with
   ./payload | chirp_put -t -1 -b 4096 - submit_node.domain my_job_output.$$
   (the lone '-' tells chirp_put to read the data from stdin).
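
   If you also want stderr in the same stream, the usual shell
   redirection applies (same sketch, untested here):

   ./payload 2>&1 | chirp_put -t -1 -b 4096 - submit_node.domain my_job_output.$$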

You should then get a streaming update (with 4kB buffering, which is pretty much the minimum you can get by default from fstreams) of the stdout of your job(s) as 'my_job_output.script_PID' on submit_node.domain, in the directory from which you started chirp_server.
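
On the submit node you can then follow the file as it grows with any of the usual tools, for instance (the numeric suffix being whatever PID the wrapper shell assigned to $$):

   tail -f my_job_output.12345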

There are countless variations on this scheme (add authentication/authorisation, send the 'chirp_put' executable along with the job if you cannot install it on the worker nodes, use a different naming scheme, run the job via 'parrot', etc.), but it should serve your basic need in any environment.
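
As a sketch of the "send chirp_put along with the job" variation, the relevant pieces of a Condor/bosco submit description would look something like this ('wrapper.sh' is a hypothetical script that runs the payload and pipes it into the shipped chirp_put binary, as in step 2):

   executable              = wrapper.sh
   transfer_input_files    = chirp_put
   should_transfer_files   = YES
   when_to_transfer_output = ON_EXIT
   queue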

Does this still make sense?

Francesco Prelz
INFN-MI