Re: [Condor-users] file transfer problems with vanilla job
- Date: Fri, 12 Nov 2004 12:59:43 +0000
- From: "Dr Ian C. Smith" <i.c.smith@xxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] file transfer problems with vanilla job
Thanks for this, Derek, but I still can't seem to get it to work
- perhaps I'm missing something fundamental here?
I have a very simple .bat file which I run:
time >> output.txt
and the .sub file looks like
output = signal.out
log = signal.log
error = signal.err
transfer_input_files = output.txt
should_transfer_output_files = YES
when_to_transfer_output_files = ON_EXIT_OR_EVICT
executable = loop.bat
notification = Error
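(For comparison - and I may be misreading the manual - the transfer commands as the Condor documentation spells them use slightly different names, so if mine are misspelled above that could be the whole problem. A sketch of the documented spelling, untested:)

```
# file-transfer commands as spelled in the Condor manual
# (note: no "output"/"files" suffixes on these two names)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
```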
The configuration file is set up so that the job
gets a soft kill after 1 minute and a hard kill after a
further 10 minutes.
Since the soft kill isn't trapped, the job runs as expected for 11 minutes
before going to the idle state. The output in
c:\condor\execute\dir_???\output.txt looks fine, but I can't
see this file anywhere in the spool area after the job gets killed.
I just see:
Volume in drive C is W2KS
Volume Serial Number is 373D-17DA
Directory of c:\condor\spool
02/11/2004 11:53 <DIR> .
02/11/2004 11:53 <DIR> ..
12/11/2004 12:48 334,623 Accountantnew.log
12/11/2004 12:48 2,682 job_queue.log
12/11/2004 12:38 51 cluster161.ickpt.subproc0
12/11/2004 12:43 388,596 history
12/11/2004 12:38 <DIR> cluster161.proc0.subproc0
12/11/2004 12:38 <DIR> cluster161.proc0.subproc0.tmp
and the cluster161* directories are empty.
I can't imagine any state info is in there?
I've only tried this on a personal Condor pool under Win2K -
would things be different on a "real" distributed one?
Version is 6.6.7.
Would using a .exe linked with the checkpointing library
(with an explicit call to it) work?
--On 12 November 2004 05:04 -0600 Derek Wright <wright@xxxxxxxxxxx> wrote:
On Fri, 12 Nov 2004 10:17:06 +0000 "Dr Ian C. Smith" wrote:
> just have your job periodically checkpoint itself.
> Is there any point in doing this? If files are only staged back if the
> job runs to completion, then only the results need to be saved just before
> completion (if there's sufficient memory).
sorry if my message wasn't clear enough. i was trying to get the
point across that the files are only copied back into the directory
you *submitted* from once the job runs to completion. if
when_to_transfer_output is set to "ON_EXIT_OR_EVICT", then any
intermediate files written by the job are transferred back to the
submit machine, they're just stored in a temporary spool location
(instead of your initial submit directory). anything in this
temporary spool directory is sent back with your job the next time it
runs.
> Saved state information cannot be transferred back if the job is evicted
yes it can. if you use ON_EXIT_OR_EVICT, any files created by your
job are transferred back (to the *spool* directory on the submit
machine, not the directory you submitted from), even if the job is
killed. the only exception is if the job is "hard-killed" (for
example, condor_vacate -fast). in that case, it really is killed,
nothing is transferred (for that run), and the job will restart with
whatever spooled files are still sitting on the submit machine.
> and is of no use once the job has run to completion.
true. jobs that do their own checkpointing might want to remove their
checkpoint file after they write out their final results to their real
output file(s) as the last step before they complete successfully.
that way, condor won't needlessly transfer that final checkpoint file
back for you. i'm honestly not sure if you'll still end up with the
last spooled copy (if any) that's already sitting on the submit
machine or not. if you do, there's no real additional cost for
getting a copy of it, since it's already on the submit machine, and
just needs to be copied out of spool and into your submit directory.
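a minimal sketch of that pattern (in Python just for illustration - the
file names are made up and nothing here is Condor-specific; the point is
only the ordering: resume from a checkpoint if one exists, checkpoint
periodically, write final results, then delete the checkpoint as the
very last step):

```python
# sketch of a self-checkpointing job that removes its checkpoint
# file before exiting, so ON_EXIT_OR_EVICT doesn't ship a stale
# checkpoint back after a successful run. file names are illustrative.
import json
import os

CHECKPOINT = "checkpoint.json"
OUTPUT = "output.txt"

def run(total_steps=10):
    # resume from a spooled checkpoint if one was restored for us
    start = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            start = json.load(f)["step"]

    acc = start
    for step in range(start, total_steps):
        acc += 1
        # periodic checkpoint: if the job is evicted, this file is
        # what would get spooled back to the submit machine
        with open(CHECKPOINT, "w") as f:
            json.dump({"step": step + 1}, f)

    # write the final results first...
    with open(OUTPUT, "w") as f:
        f.write("steps completed: %d\n" % acc)
    # ...then remove the checkpoint so it isn't transferred back
    os.remove(CHECKPOINT)

run()
```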
i hope this (finally) clarifies this feature. i'll make sure all this
wisdom ends up in the manual in the near future.
Condor-users mailing list