[HTCondor-users] Suggestion: transfer on error
- Date: Fri, 06 Dec 2013 09:28:03 +0000
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: [HTCondor-users] Suggestion: transfer on error
A couple of suggestions; have these been raised before?
(1) I would find it really helpful if Condor could transfer
stdout/stderr files only if the job fails.
AFAICS, at the moment it collects them in local spool files, and then
either transfers them at the end (e.g. error = <FILENAME>) or streams
them while the job is running (error = <FILENAME>, stream_error = true).
I have a bunch of chatty jobs where I don't care about the stdout/stderr
if they are successful, but if they fail I currently see nothing more
than "job proc (X) failed with status 1." which means having to change
submission files and re-run them just to find out what went wrong. If
it's a transient failure, that makes it even more difficult to trace.
So ideally I'd like to have a flag which says "only transfer
stdout/stderr if the job fails".
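Until such a flag exists, the behaviour can be approximated inside the job
itself. The following is only a rough sketch, assuming the job can be
launched through a small shell wrapper; the function name
quiet_unless_failed is made up for illustration and is not part of Condor.
Condor would still transfer the error file, but it would be empty for
successful jobs.

```sh
#!/bin/sh
# Hypothetical helper (not an HTCondor feature): run a command with its
# stdout/stderr captured in a temporary file, and replay that output on
# stderr only if the command exits with a non-zero status.
quiet_unless_failed() {
    log=$(mktemp) || return 1

    # Capture both streams of the real job.
    "$@" >"$log" 2>&1
    status=$?

    # Replay the captured output only when the job failed.
    if [ "$status" -ne 0 ]; then
        cat "$log" 1>&2
    fi

    rm -f "$log"
    # Preserve the job's exit status so the failure is still reported.
    return "$status"
}
```

The submit file would then run the wrapper as the executable, with the real
program passed as arguments, e.g. quiet_unless_failed ./myjob arg1 arg2
(myjob being a placeholder name).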
(2) When I submit a DAG full of jobs, I cannot see any record in the
log files of which host a particular job ran on.
If I see it while it's running (condor_q -run -dag) then I can see the
host. But this is not recorded in *.dagman.out as far as I can see.
Is there any way to log this information? It would be really helpful,
for example, if a job fails when it is matched with one particular host
because an NFS mount is missing.
At the moment my best solution is to do "hostname 1>&2" at the top of
the job, and to transfer its stderr.
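For reference, that workaround looks roughly like the sketch below. The
function name run_with_host_tag and the "host:" prefix are made up for
illustration, and the fallback to uname -n is an assumption for machines
where the hostname command is not installed.

```sh
#!/bin/sh
# Hypothetical wrapper: write the execute host's name to stderr before
# running the real job, so the host ends up in the transferred error file.
run_with_host_tag() {
    # Same idea as "hostname 1>&2"; fall back to uname -n if needed.
    echo "host: $(hostname 2>/dev/null || uname -n)" 1>&2

    # Run the real job; its exit status becomes the function's status.
    "$@"
}
```

If a job then fails only on one particular machine, the first line of its
error file shows which host it was matched with.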