
RE: [Condor-users] file transfer problems with vanilla job



I have a very similar problem.
I have written to the mailing list, and no one seems to be aware that
condor_vacate_job does not work; in the version we use it doesn't even seem
to exist.

I wasted time writing a Fortran message-polling loop to catch the WM_CLOSE
message that Condor sends before killing a job, if it is configured to do so.
The code works, as tested by others and myself, but Condor did not transfer
the output files. (I don't know whether it was misconfigured, but I have a
feeling there are serious issues with Condor on the Windows platform.)
1. For starters, the Condor manual documents that in the vanilla universe a
job will restart from the beginning if it is vacated, so there is no point
in vacating the job. But a person on the mailing list (Matt, I think) said
he knew the person who wrote that section of the code and that it should
work. (As far as I tested, it doesn't. Maybe a version issue?)
2. I resorted to using a timing loop in my code to kill the job. The loop
should poll the time every 3 minutes, but I found that in some cases even
after 2-3 hours the code had not died, which implies the time hadn't been
polled for nearly 3 hours instead of every 3 minutes. I have been told that
a multitasking OS like Windows may be busy with other tasks, and I accept
that 3 minutes could stretch to several minutes, but I find 3 hours
unbelievable. My personal hunch, and I may be wrong, is that Condor simply
fails to detect whether the code is running once it has been running for
several hours (under 5 hours, according to my tests) and assumes it is
still running.
3. Now I am planning to run jobs for a few hours at a time.
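The self-termination idea in point 2 can be sketched like this (Python for
brevity rather than Fortran; the limit and the work function are
illustrative). Checking the clock directly between work steps avoids having
a separate polling timer that the scheduler can starve:

```python
import time

def run_with_time_limit(limit_seconds, work):
    """Do units of work until a wall-clock limit expires.

    Reading time.monotonic() between iterations is cheap, so the
    check happens as often as the loop turns; there is no separate
    3-minute polling interval for the OS to stretch out.
    """
    start = time.monotonic()
    steps_done = 0
    while time.monotonic() - start < limit_seconds:
        work(steps_done)
        steps_done += 1
    return steps_done
```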

It would be really useful if someone were kind enough to add a command such
as condor_transfer (or something similar) to transfer half-finished jobs in
the vanilla universe. It is ironic, since 90% of all desktops run Windows,
and Condor would revolutionise their usage if it had basic functionality
like this.

Anyway, you should not state which files to transfer back, as any changed
or created files will be transferred back automatically.
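In other words, a submit file along these lines (a sketch based on the one
quoted below) should be enough; the key is that no transfer_output_files
line is given:

```
Universe        = vanilla
Executable      = main
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files    = init
# No transfer_output_files line: any new or changed files come back.
Queue
```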

If you do need the source for the WM_CLOSE trapping system, I can give it
to you to use.

You may also find my earlier posts on this relevant:
http://lists.cs.wisc.edu/archive/condor-users/2004-October/msg00230.shtml

Sorry about the long letter. :-)

I hope there are many users with the same problem, so that the authors of
Condor may revamp the Windows version (hint, hint...). :-)

Thank you,
Alan




Alan Arokiam,
The Materials Modelling Group,
Materials Science and Engineering,
Department of Engineering,
The University of Liverpool,
Brownlow Hill,
Liverpool,
UK.
L69 3GH
Tel: 44-(0)151-794-4671
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-
> bounces@xxxxxxxxxxx] On Behalf Of De-Wei Yin
> Sent: 10 November 2004 06:56 PM
> To: condor-users@xxxxxxxxxxx
> Subject: [Condor-users] file transfer problems with vanilla job
> 
> I plan to run some very long simulations that can go on for months.  For
> performance reasons I use the Intel Fortran Compiler and Intel Math
> Kernel Library, therefore the jobs must be submitted vanilla.  The
> executable code has its own checkpointing mechanism.  I want the
> checkpoint file and other output files transferred back to the submit
> node whenever the job is preempted or vacated from the execute node, or
> if the job is removed from the job queue.  Condor also needs to be able
> to send the checkpoint file and a log file back as input when the job
> restarts.
> 
> My problem is that the output files are not coming back when the job is
> evicted from a node (by Condor or by me using condor_vacate or
> condor_hold) or when it is removed from the queue (by me using
> condor_rm), and if I do eventually get them to come back, I'm not sure
> how to tell Condor which ones to send back to use in restarting the job.
> 
> The submit node is in the same pool as the execute nodes (same CM, no
> flocking involved), and it does not share a common FILESYSTEM_DOMAIN
> with the execute nodes.  I am using Condor 6.6.6.
> 
> The program that I run basically uses four types of files:
> 
>   1. "init" file: Contains all the data required to start the job from
>      scratch.  If the "ckpt" file is present, that file is read and the
>      job continues from the last checkpoint;  if "ckpt" does not exist,
>      then the "init" file is read and the job starts from the beginning.
>   2. "ckpt" file: Contains the minimal data set needed to continue an
>      interrupted job, and is periodically overwritten with newer data
>      sets.  The first thing that the program does is search for this
>      file (using Fortran inquire(file="ckpt",exist=ex))
>   3. "rlog" file: The running log of the job, contains some data and
>      job status information not needed to restart the job.  When the job
>      starts from scratch using an "init" file, a new "rlog" file is
>      created;  when the job restarts from a "ckpt" file, new records are
>      appended to the existing "rlog" file.
>   4. "data" file: Written periodically during the job run and numbered
>      sequentially (not appended or overwritten).  Contains data to be
>      analyzed by other postprocessing tools.
> 
> Here is what I have in the submit file, which works fine if the job can
> run from start to finish without being evicted:
> 
>     Universe        = vanilla
>     Initialdir      = .
>     Executable      = ./main
>     Error           = ./condor.err
>     Log             = ./condor.log
>     should_transfer_files   = YES
>     when_to_transfer_output = ON_EXIT_OR_EVICT
>     transfer_input_files = init
>     Queue
> 
> When the job is evicted or removed, I expect to find the latest "ckpt"
> file (if one has already been written), an "rlog" file, and any number
> of "data" files.  Unfortunately nothing comes back and the job always
> restarts from scratch, and I cannot figure out why.
> 
> If the job is evicted, then when it restarts, it will need as input:
> the "ckpt" if it has already been created, the "init" file in case there
> is no "ckpt" file, and the "rlog" if it has been created so that new
> records can be appended to it.  How do I tell Condor that it needs to
> send back these files, especially the "ckpt" and "rlog" files, which
> might not yet exist if the job was interrupted early?  By the way, any
> numbered "data" files that do come back need not be returned to the
> execute node, since they are never needed as input.
> 
> I've tried adding "transfer_output_files" to the submit script, but ran
> into three problems:
> 
>   1. The "init" file becomes corrupted when it is sent to the execute
>      node (maybe a bug?).
>   2. Condor panics if it cannot find an output file explicitly listed in
>      "transfer_output_files" (e.g., when the "ckpt" file has not yet
>      been written).
>   3. An unknown number of "data" files are created, and those not
>      explicitly listed in "transfer_output_files" are lost.
> 
> I would really appreciate any help with this.  Thanks!
> 
> Dewey
> 
> --
> Mr. De-Wei Yin, MASc, PEng
> Dept of Chemical & Biological Engineering tel: +1 608 262-3370
> University of Wisconsin-Madison           fax: +1 608 262-5434
> 1415 Engineering Drive                    dyin at cae dot wisc dot edu
> Madison WI 53706-1691 USA                 www.engr.wisc.edu/groups/mtsm/
> 
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> http://lists.cs.wisc.edu/mailman/listinfo/condor-users