
[HTCondor-users] Suggestion: transfer on error



A couple of suggestions. Have these been raised before?

(1) I would find it really helpful if Condor could transfer stdout/stderr files only if the job fails.

AFAICS, at the moment it collects them in local spool files, and then either transfers them at the end (e.g. error = <FILENAME>) or streams them while the job is running (error = <FILENAME>, stream_error = true).

I have a bunch of chatty jobs whose stdout/stderr I don't care about when they succeed. If they fail, though, all I currently see is "job proc (X) failed with status 1.", so I have to change the submission files and re-run the jobs just to find out what went wrong. If the failure is transient, it is even harder to trace.

So ideally I'd like to have a flag which says "only transfer stdout/stderr if the job fails".
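
Until such a flag exists, one possible workaround is to wrap the job's executable so that the output is held back and only released on failure. The following is a minimal sketch, not anything built into Condor: it assumes a POSIX shell and mktemp on the execute host, `run_quiet` is a made-up name, and it merges stdout into stderr.

```shell
#!/bin/sh
# run_quiet (hypothetical helper): run a command, buffer its stdout and
# stderr in a temporary file, and replay the buffer on stderr only if
# the command exits with a non-zero status.
run_quiet() {
    buf=$(mktemp) || return 1
    "$@" >"$buf" 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
        cat "$buf" >&2
    fi
    rm -f "$buf"
    # Pass the command's exit status through unchanged.
    return "$status"
}

# Intended use inside a wrapper script: run_quiet ./real_job arg1 arg2
```

With this approach the error file is still transferred, but it stays empty for successful jobs, so it is only an approximation of the flag described above.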

(2) When I submit a DAG full of jobs, I cannot see any record in the log files of which host a particular job ran on.

If I look while the job is running (condor_q -run -dag), I can see the host. But this is not recorded in *.dagman.out as far as I can see.

Is there any way to log this information? It would be really helpful, for example, if a job fails when it is matched with one particular host because an NFS mount is missing.

At the moment my best solution is to do "hostname 1>&2" at the top of the job, and to transfer its stderr.
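
For reference, that workaround can be sketched as a small wrapper function. The name `log_host_and_run` and the `host=... started=...` line format are made up for illustration and are not anything Condor provides.

```shell
#!/bin/sh
# log_host_and_run (hypothetical helper): write the execute host's name
# and a UTC timestamp to stderr, then run the real job command.
log_host_and_run() {
    echo "host=$(hostname) started=$(date -u +%Y-%m-%dT%H:%M:%SZ)" >&2
    # The job's exit status becomes the function's exit status.
    "$@"
}

# Intended use inside a wrapper script: log_host_and_run ./real_job arg1 arg2
```

The host line only reaches the submit side if the job's stderr is transferred, which is the same limitation as the one-liner above.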

Thanks,

Brian.