Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] file transfer problems with vanilla job

Date: Fri, 12 Nov 2004 12:59:43 +0000
From: "Dr Ian C. Smith" <i.c.smith@xxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] file transfer problems with vanilla job

Thanks for this Derek but I still can't seem to get it to work
- perhaps I'm missing something fundamental here ???

I have a very simple .bat file which I run:

C:\signal>type loop.bat
:start
time >> output.txt
sleep 15
goto start

and the .sub file looks like

C:\signal>type signal.sub
universe=vanilla
output = signal.out
log = signal.log
error = signal.err
transfer_input_files = output.txt
should_transfer_output_files = YES
when_to_transfer_output_files = ON_EXIT_OR_EVICT
executable = loop.bat
notification = Error
queue

The configuration file is setup so that the job
gets a soft kill after 1 minute and hard kill after a
further 10 minutes.
Since the soft kill isn't trapped the job runs as expected for 11 minutes
before going to the idle state. The output in
c:\condor\execute\dir_???\output.txt looks fine but I can't
see this file anywhere in the spool area after it get's killed.
I just see:

C:\signal>dir c:\condor\spool
Volume in drive C is W2KS
Volume Serial Number is 373D-17DA

Directory of c:\condor\spool

02/11/2004  11:53       <DIR>          .
02/11/2004  11:53       <DIR>          ..
12/11/2004  12:48              334,623 Accountantnew.log
12/11/2004  12:48                2,682 job_queue.log
12/11/2004  12:38                   51 cluster161.ickpt.subproc0
12/11/2004  12:43              388,596 history
12/11/2004  12:38       <DIR>          cluster161.proc0.subproc0
12/11/2004  12:38       <DIR>          cluster161.proc0.subproc0.tmp


and the cluster161* directories are empty.
I can't imagine any state info is in there ?

I've only tried this on a personal condor pool under Win2K -
would things be different on a "real" distributed one.
Version is 6.6.7.
Would using a .exe linked with the checkpointing library
(with an explicit call to it) work ???

cheers,

-ian.

--On 12 November 2004 05:04 -0600 Derek Wright <wright@xxxxxxxxxxx> wrote:

On Fri, 12 Nov 2004 10:17:06 +0000 "Dr Ian C. Smith" wrote:

> just have your job periodically checkpoint itself.
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Is there any point in doing this ? If files are only staged back if the
job runs to completion then only the results need to be saved just before
completion (if there's sufficient memory).


sorry if my message wasn't clear enough.  i was trying to get the
point across that the files are only copied back into the directory
you *submitted* from once the job runs to completion.  if
when_to_transfer_ouput is set to "ON_EXIT_OR_EVICT", then any
intermediary files written by the job are transfered back to the
submit machine, they're just stored in a temporary spool location
(instead of your initial submit directory).  anything in this
temporary spool directory is sent back with your job the next time it
starts running.

Saved state information cannot be transferred back if the jobs is
killed


yes it can.  if you use ON_EXIT_OR_EVICT, any files created by your
job are transferred back (to the *spool* directory on the submit
machine, not the directoroy you submitted from), even if the job is
killed.  the only exception is if the job is "hard-killed" (for
example, condor_vacate -fast).  in that case, it really is killed,
nothing is transfered (for that run), and the job will restart with
whatever spooled files are still sitting on the submit machine.

and is of no use once the job has run to completion.


true.  jobs that do their own checkpointing might want to remove their
checkpoint file after they write out their final results to their real
output file(s) as the last step before they complete successfully.
that way, condor won't needlessly transfer that final checkpoint file
back for you.  i'm honestly not sure if you'll still end up with the
last spooled copy (if any) that's already sitting on the submit
machine or not.  if you do, there's no real additional cost for
getting a copy of it, since it's already on the submit machine, and
just needs to be copied out of spool and into your submit directory.

i hope this (finally) clarifies this feature.  i'll make sure all this
wisdom ends up in the manual in the near future.

-derek
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users

Follow-Ups:
- Re: [Condor-users] file transfer problems with vanilla job
  - From: Derek Wright

References:
- Re: [Condor-users] file transfer problems with vanilla job
  - From: Derek Wright

Prev by Date: [Condor-users] condor init script and CONDOR_CONFIG
Next by Date: Re: [Condor-users] Retirement Clarification
Previous by thread: Re: [Condor-users] file transfer problems with vanilla job
Next by thread: Re: [Condor-users] file transfer problems with vanilla job
Index(es):
- Date
- Thread

Mailing List Archives

Public Access

Re: [Condor-users] file transfer problems with vanilla job