
[Condor-users] Re: possible bugs on Condor 6.7.2?



Thanks to Derek and others who responded for all of the helpful
information.  I'm fairly new to Condor and this discussion has
clarified a lot.  I hope much of this ends up in the manual
as was suggested.

I still have a couple of questions.

1) Suppose I have a vanilla universe job whose partial output is
useful even if the job doesn't complete.  (For example, it might be
doing Monte Carlo integration and outputting partial results, and these
partial results could be included in a final average over all runs.)
Is there a way to ask Condor to provide me with output files even for
incomplete runs?  I know (now) that I could look in the spool
directory on the submit host, but I'd rather that all output appeared
in the submit directory on the submit host.  I guess I'd like it if
Condor could simulate (as closely as is easily possible) naively
running multiple jobs on the same host.
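
(For concreteness, here's a rough sketch of the kind of submit file I
have in mind.  The executable and file names are made up; the transfer
settings are the ones discussed in Derek's reply below:)

    universe                = vanilla
    executable              = mc_integrate
    output                  = mc.out
    error                   = mc.err
    log                     = mc.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    queue 10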

2) When a vanilla universe job writes to stdout, I see that I can
ask Condor to stream stdout to the submitting host.  But if the
job vacates a host and restarts elsewhere, the previous stdout
output is lost.  If there were a way to avoid this, it could be
a solution to problem 1.  For example, if Condor provided a macro
like $(Run) that counted consecutively over each run of a job,
then I could use it when naming the file that standard output
goes to.
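
(Concretely, I'm imagining something like the sketch below.  The
stream_output command and the $(Cluster) and $(Process) macros already
exist; $(Run) is the hypothetical piece:)

    # $(Run) is hypothetical -- no such macro exists in Condor today
    stream_output = True
    output        = mc.$(Cluster).$(Process).$(Run).out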

I believe both of the above questions make sense for the standard
universe too...

I'd still love an option that made Condor do periodic file transfers
for vanilla jobs.  (I'm thinking of: random job failure; hardware
failure; code that ignores SIGTERM and can't be changed; etc.)
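
(As a sketch of what I mean, a hypothetical submit command like the
one below would cover it.  To be clear, nothing like this exists as
far as I know:)

    when_to_transfer_output  = ON_EXIT_OR_EVICT
    # hypothetical knob (seconds between periodic transfers);
    # not a real Condor submit command:
    transfer_output_interval = 3600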

Thanks again for all of the help,

Dan

Derek Wright <wright@xxxxxxxxxxx> writes:

> ok, it's definitely still working.  i think you're just all confused
> about how this feature really works.
>
> when you use "ON_EXIT_OR_EVICT", the intermediary files do *NOT* get
> transferred back into the directory where you submitted the job from.
> instead, they're stored in the per-job subdirectory of spool i
> mentioned in a previous message.  in case it's not obvious, the naming
> convention of these per-job directories is as follows:
>
> cluster<#>.proc<#>.subproc<#>
>
> so, if the job you care about is job 735.0, the directory would be:
>
> spool/cluster735.proc0.subproc0
>
> if you submit a vanilla job that writes periodic output to a file,
> set "when_to_transfer_output = ON_EXIT_OR_EVICT" in your submit file,
> vacate the job, and inspect the spool directory, you'll see the files
> are getting written there, no problem.
>
> furthermore, if your test job opens this file to append data (instead
> of truncating it), everything works exactly as you'd expect.  future
> runs just append more data to the file.  you don't even have to worry
> about putting the file in transfer_input_files: if condor transferred
> it back for you while in ON_EXIT_OR_EVICT mode, it will automatically
> transfer it as input the next time your job runs.
>
> however, you'll only see these files back in the directory you
> submitted them from when the job *finally* exits.  just because
> they're not in your local directory doesn't mean they're not being
> transferred or that ON_EXIT_OR_EVICT doesn't work.
>
> probably the manual should be clearer about this, to avoid the
> confusion.
>
> -derek