
Re: [Condor-users] file transfer problems with vanilla job

Thanks for this, Derek, but I still can't seem to get it to work.
Perhaps I'm missing something fundamental here?

I have a very simple .bat file which I run:

C:\signal>type loop.bat
:start
time >> output.txt
sleep 15
goto start

and the .sub file looks like

C:\signal>type signal.sub
output = signal.out
log = signal.log
error = signal.err
transfer_input_files = output.txt
should_transfer_output_files = YES
when_to_transfer_output_files = ON_EXIT_OR_EVICT
executable = loop.bat
notification = Error

The configuration file is set up so that the job
gets a soft kill after 1 minute and a hard kill after a
further 10 minutes.
Since the soft kill isn't trapped, the job runs as expected for 11 minutes
before going to the idle state. The output in
c:\condor\execute\dir_???\output.txt looks fine but I can't
see this file anywhere in the spool area after it gets killed.
I just see:

C:\signal>dir c:\condor\spool
Volume in drive C is W2KS
Volume Serial Number is 373D-17DA

Directory of c:\condor\spool

02/11/2004  11:53       <DIR>          .
02/11/2004  11:53       <DIR>          ..
12/11/2004  12:48              334,623 Accountantnew.log
12/11/2004  12:48                2,682 job_queue.log
12/11/2004  12:38                   51 cluster161.ickpt.subproc0
12/11/2004  12:43              388,596 history
12/11/2004  12:38       <DIR>          cluster161.proc0.subproc0
12/11/2004  12:38       <DIR>          cluster161.proc0.subproc0.tmp

and the cluster161* directories are empty, so I can't imagine any state info is in there.

I've only tried this on a personal Condor pool under Win2K.
Would things be different on a "real" distributed one?
The version is 6.6.7.
Would using a .exe linked with the checkpointing library
(with an explicit call to it) work?



--On 12 November 2004 05:04 -0600 Derek Wright <wright@xxxxxxxxxxx> wrote:

On Fri, 12 Nov 2004 10:17:06 +0000 "Dr Ian C. Smith" wrote:

> just have your job periodically checkpoint itself.

Is there any point in doing this ? If files are only staged back if the
job runs to completion then only the results need to be saved just before
completion (if there's sufficient memory).

sorry if my message wasn't clear enough. i was trying to get the point across that the files are only copied back into the directory you *submitted* from once the job runs to completion. if when_to_transfer_output is set to "ON_EXIT_OR_EVICT", then any intermediate files written by the job are transferred back to the submit machine; they're just stored in a temporary spool location (instead of your initial submit directory). anything in this temporary spool directory is sent back with your job the next time it starts running.
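Since spooled files come back with the job when it restarts, a job can look for its own intermediate file at startup and carry on from there. Here is a minimal sketch of that idea in Python. The file name `state.json`, the `step` counter, and the helper names are assumptions made up for illustration; none of them come from Condor or from the job discussed in this thread.

```python
import json
import os

STATE_FILE = "state.json"  # hypothetical name, chosen for this sketch only


def load_state(path=STATE_FILE):
    """Return the saved state, or a fresh one if no file came back with the job."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def save_state(state, path=STATE_FILE):
    """Write to a temporary file first, then rename it over the old state file.

    If the job is killed in the middle of a write, the previous state file
    should still be intact rather than half-written.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)
```

A job would call `load_state()` once at startup and `save_state()` every so often in its main loop. How often is a trade-off between lost work and write overhead, and this sketch does not try to settle it.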

Saved state information cannot be transferred back if the job is

yes it can. if you use ON_EXIT_OR_EVICT, any files created by your job are transferred back (to the *spool* directory on the submit machine, not the directory you submitted from), even if the job is killed. the only exception is if the job is "hard-killed" (for example, condor_vacate -fast). in that case, it really is killed, nothing is transferred (for that run), and the job will restart with whatever spooled files are still sitting on the submit machine.

and is of no use once the job has run to completion.

true. jobs that do their own checkpointing might want to remove their checkpoint file after they write out their final results to their real output file(s) as the last step before they complete successfully. that way, condor won't needlessly transfer that final checkpoint file back for you. i'm honestly not sure if you'll still end up with the last spooled copy (if any) that's already sitting on the submit machine or not. if you do, there's no real additional cost for getting a copy of it, since it's already on the submit machine, and just needs to be copied out of spool and into your submit directory.
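The cleanup step described above could look roughly like this in Python. It is only a sketch: the function name `finish` and the file names `results.txt` and `state.json` are hypothetical and stand in for whatever the job really uses.

```python
import os


def finish(results_text, results_path="results.txt", checkpoint_path="state.json"):
    """Write the final output, then delete the checkpoint file.

    The order matters: the checkpoint is removed only after the results
    have been written, so a kill in between still leaves something to
    restart from.
    """
    with open(results_path, "w") as f:
        f.write(results_text)
        f.flush()
        os.fsync(f.fileno())
    # Nothing to clean up if the job never wrote a checkpoint.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
```

Calling `finish(...)` as the very last thing the job does means the checkpoint file is no longer in the job's working directory when it exits, so there is nothing extra there to transfer back.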

i hope this (finally) clarifies this feature.  i'll make sure all this
wisdom ends up in the manual in the near future.
