
Re: [Condor-users] file transfer problems with vanilla job



On Thu, 11 Nov 2004 00:43:40 -0000  "Alan Christy Arokiam" wrote:

> I have written to the mailing list and no one seems to be aware that
> condor_vacate_job does not work, and in the version we use it
> doesn't seem to even exist.

condor_vacate_job was added in 6.7.0.  if you're not using 6.7.x, you
won't have condor_vacate_job.  however, i suspect you're on windows,
and now that you mention it, i bet our windows installer doesn't
handle this.  condor_vacate_job is just a copy of condor_rm or
condor_hold, renamed to be "condor_vacate_job".  i'll make sure our
windows installer will get this right for future releases.  in the
meantime, you can just copy condor_rm.exe to "condor_vacate_job.exe"
and it should all work fine.

and by "does not work", what do you mean? ;) i'm sure it works to
vacate the job.  if the files aren't getting transferred back, that's
another story (if that's not working, it's really the fault of the
condor_starter and/or condor_shadow), but at least condor_vacate_job
will have done its part...

> 1. For starters it is documented in the condor manual that condor on
> the vanilla universe will restart a job from the beginning if it is
> vacated, 

true.

> so is of no point vacating the job.

not true.  if your job can checkpoint itself, there's a point to
vacating it, even if it's a vanilla job.

now, let's look at why:

when_to_transfer_output = ON_EXIT_OR_EVICT

in your submit file may not be working as you'd expect.

when condor decides to vacate your job (either because of
condor_vacate, condor_vacate_job, or because of the startd's policy
expressions on the machine where it is running), it's going to send
your job a signal.  on unix, you can specify the signal to send via
the "kill_sig" setting in your submit file.  the default is SIGTERM.
on windows, we send a WM_CLOSE.

so, if your application doesn't catch WM_CLOSE, handle it, and do some
cleanup to write out, flush, and close the checkpoint files, those
files won't necessarily have any data, and condor won't try to
transfer them back for you.
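
to make the above concrete, here is a minimal sketch (in python, unix
only) of a job that catches SIGTERM and writes its checkpoint before
exiting.  the file name "checkpoint.json", the "step" counter and the
function names are all made up for illustration; none of them are part
of condor.  a windows job would need to handle WM_CLOSE instead, which
this sketch does not attempt.

```python
import json
import os
import signal

CHECKPOINT = "checkpoint.json"  # hypothetical file name, not a condor convention
stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag, the main loop does the cleanup."""
    global stop_requested
    stop_requested = True


def save_checkpoint(state, path=CHECKPOINT):
    """Write, flush and close the checkpoint so it really contains data."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)  # never leave a half-written checkpoint behind


def load_checkpoint(path=CHECKPOINT):
    """Resume from a previous checkpoint if one was transferred back in."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"step": 0}


def run(total_steps, path=CHECKPOINT):
    # SIGTERM is the default vacate signal mentioned above
    signal.signal(signal.SIGTERM, request_stop)
    state = load_checkpoint(path)
    while state["step"] < total_steps and not stop_requested:
        state["step"] += 1  # stand-in for one unit of real work
    save_checkpoint(state, path)
    return state
```

the job would call something like run(1000000) as its main entry point,
and the checkpoint file would need to be among the files the job
transfers back for ON_EXIT_OR_EVICT to pick it up.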

> It would have been really useful if someone is kind enough to add
> the command condor_transfer (or something) to transfer half finished
> jobs for the vanilla universe. It is an irony since 90% of all
> desktops run windows and condor would revolutionise their usage if
> it had some basic functionality like this.

unfortunately, it's next to impossible for us to add such a thing.
where would we transfer these files back to on the submit machine?
how would we know to send them along with your job when it restarts?
if this happened more than once, how would we know which copy to
overwrite?  who would be responsible for cleaning out old data?

we can do a decent job of answering the above questions if we transfer
the files ourselves, once the job exits.  we transfer them back to a
(job-specific) subdirectory in the spool directory.  this spool
directory for your job is where we transfer files back out again if
the job restarts.  every time your job vacates, we transfer back to
another temporary directory, and if all the files transferred
successfully, we atomically "commit" these files to your job by moving
the temporary directory into the "real" location and removing the old
copy.  because of this semi-atomic commit, we only ever have 1 copy of
the data, so there's no big worry about garbage collection.
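
the commit step described above can be sketched roughly like this.
this is not condor's actual code, just an illustration in python of
the same idea: stage everything in a temporary directory, and only
swap it into place if every file arrived.  the function name
commit_output, the ".old" suffix and the dict of file contents
(standing in for the real network transfer) are all assumptions.

```python
import os
import shutil
import tempfile


def commit_output(job_dir, files):
    """Stage files in a temp dir, then swap it in as job_dir.

    files maps file names to bytes and stands in for the real transfer.
    If any write fails, the previous contents of job_dir are untouched.
    """
    job_dir = os.path.abspath(job_dir)
    parent = os.path.dirname(job_dir)
    os.makedirs(parent, exist_ok=True)

    # stage next to the real directory so the final rename stays on one filesystem
    staging = tempfile.mkdtemp(prefix="staging-", dir=parent)
    try:
        for name, data in files.items():
            with open(os.path.join(staging, name), "wb") as fh:
                fh.write(data)
                fh.flush()
                os.fsync(fh.fileno())
    except Exception:
        # incomplete transfer: throw the staging copy away, keep the old one
        shutil.rmtree(staging, ignore_errors=True)
        raise

    # "semi-atomic" commit: move the old copy aside, move the new one in,
    # then remove the old copy so only one copy of the data remains
    old = job_dir + ".old"
    had_old = os.path.isdir(job_dir)
    if had_old:
        os.rename(job_dir, old)
    os.rename(staging, job_dir)
    if had_old:
        shutil.rmtree(old)
```

there is a short window between the two renames where job_dir does not
exist, which is why "semi-atomic" is the honest description.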

if we left it up to users to issue a condor_transfer_files command
whenever they felt like it, we'd have all kinds of problems with all
of the above, not to mention what should happen at the remote site to
ensure some degree of file synchronization while we're trying to copy
all these files around.  i.e. we'd probably suspend your job for the
whole duration we were transferring files, to make sure your job didn't
try to write anything while we were copying them...  

and, such a command would be useless if ON_EXIT_OR_EVICT was working
right. ;)  i'm not saying there's a bug (though we can try more
testing on windows to be sure).  but, if there is, that's what we
should spend our development time on, not adding this new command.

-derek (condor team)