
RE: [Condor-users] Transferring files in a Vanilla universe on the job being killed.




> you've tested that this works outside of condor yes?

Yes, I did test it and it did work; I confirmed this by making the program
write to a file on receiving a close signal.

> 
> condor_vacate and condor_vacate_job exist for this very reason...

At least on CondorVersion 6.6.5 (May 4 2004) it does not work, and the
documentation explicitly says: "A job running under the vanilla universe is
killed, and Condor restarts the job from the beginning somewhere else."

Which doesn't help as intermediate files are not transferred!


> 
> see the periodic expressions in condor_submit docs - they allow such logic
> 

I had a quick read, but there is no sign of a command for the vanilla
universe that can initiate a transfer of intermediate files.
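For reference, the periodic expressions look roughly like the fragment
below. This is a hypothetical sketch, not taken from the manual verbatim;
the attribute values are illustrative, and nothing in it triggers a
transfer of intermediate files:

```
# Hypothetical vanilla-universe submit fragment; values are illustrative.
universe                = vanilla
executable              = my_job
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# Periodic policy expressions can hold, release, or remove a job...
periodic_hold = (JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600)
# ...but there is no periodic expression that ships intermediate files.
queue
```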

> Most vanilla jobs will be written in such a way that, if you didn't
> respond to the vacate in time, the outputs cannot be guaranteed -
> therefore the job, when restarted, should be restarted in exactly the
> same way as originally. This means the output from the previous run is
> not needed, so there is no need to waste time and bandwidth sending
> it...
> 
> If the job is designed to respond to the signal then the user should
> also know how to change the transfer files setting
> 

I do agree with what you say, but good code used in long-term computation
has other means of ensuring that a hard kill does not result in data
corruption while remaining platform independent. Testing for WM_CLOSE makes
the code Windows-specific, etc.


> as I said above - the behaviour of a restarting job is then undefined.
> not to mention that the end users may not wish to have their computer
> taking ages to come back to them fully since it's copying several megs
> of data across their network...
> 

One can make use of the Windows BITS service to transfer files without the
user even being aware of the transfer. This is useful in offices where a
worker uses their PC for ~8 hours a day, so Condor cannot run jobs, but
BITS can still move files in the background; on a 10 Mb/s LAN, BITS can
realistically transfer more than 10 GB of data per node over that 8-hour
span.
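As a rough sanity check on that figure (assuming a nominal, fully dedicated
10 Mbit/s link):

```python
# Rough sanity check: capacity of a nominal 10 Mbit/s LAN link over an
# 8-hour office day.
link_bits_per_s = 10 * 1_000_000   # 10 Mb/s, nominal
seconds = 8 * 3600                 # the 8-hour span
theoretical_gb = link_bits_per_s / 8 * seconds / 1e9
print(f"theoretical maximum: {theoretical_gb:.0f} GB")   # → 36 GB
# So moving "more than 10 GB" needs under a third of the nominal
# capacity, leaving headroom for BITS to throttle itself around the
# user's own traffic.
```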

> Making that the default behaviour would break most people's programs,
> since implementing a restarting program is considerably trickier than
> the restart-from-scratch approach. Not to mention designing a job
> which can happily restart from a non-deterministic position (what
> happens if one buffer flushed before being killed but another didn't).

The standard practice for running long-term jobs is the grandfather,
father, son file rotation. If the son file is corrupted by a kill during a
mid-file write, the father file is used as the next restart point, and so
on. This method is nearly always foolproof. Also, making file writes
occasional and fast reduces the probability of a kill landing mid-write.
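The rotation can be sketched in a few lines. This is a minimal,
illustrative version (the file names and the string-valued state are
invented for the example); a kill mid-write can only damage the newest
"son" file, while the previous generations remain intact as restart
points:

```python
import os
import tempfile
from typing import Optional

# Illustrative generation names, newest first.
GENERATIONS = ["ckpt.son", "ckpt.father", "ckpt.grandfather"]

def save_checkpoint(state: str, directory: str = ".") -> None:
    paths = [os.path.join(directory, name) for name in GENERATIONS]
    # Age the existing generations: father -> grandfather, son -> father.
    for older, newer in zip(reversed(paths), reversed(paths[:-1])):
        if os.path.exists(newer):
            os.replace(newer, older)
    # Write the new son atomically: a kill during the write leaves at
    # most a stray temp file, never a half-written checkpoint.
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(state)
    os.replace(tmp, paths[0])

def load_checkpoint(directory: str = ".") -> Optional[str]:
    # Restart from the newest generation that is still readable.
    for name in GENERATIONS:
        try:
            with open(os.path.join(directory, name)) as f:
                return f.read()
        except OSError:
            continue
    return None
```

If the son turns out to be unusable after a kill, the job simply falls
back to the father on restart, losing at most one checkpoint interval of
work.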

> 
> The sensible approach is to allow it to die gracefully.
> 
Yes!

> The real problem is that the process which sends the WM_CLOSE is flawed.
> 
> An alternative would be a simple, and OS independent, mechanism whereby
> the process could open up a socket on a defined port (one per VM) and
> listen for instruction to vacate... (supplying libraries for the
> common apps which do just this would not be hard and could, for
> instance in java's case, be themselves cross platform).
> 
> If you launch multiple child processes then it is up to you to have
> just one listen and inform the others.
> 
> This does open a can of security worms though...
> 
> Matt
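A minimal sketch of that listener idea, in the spirit of what Matt
describes (the port number and the "VACATE" message are invented for
illustration, and a real version would need the security issues he
mentions addressed):

```python
import socket
import threading

# Hypothetical per-VM port; in practice this would be agreed with the
# scheduler, one port per VM.
VACATE_PORT = 41000

def wait_for_vacate(on_vacate, port: int = VACATE_PORT) -> threading.Thread:
    """Start a background thread that calls on_vacate() when a peer
    connects and sends the line b"VACATE"."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", port))
    server.listen(1)

    def serve():
        conn, _ = server.accept()
        with conn:
            if conn.recv(64).strip() == b"VACATE":
                on_vacate()   # e.g. flush buffers, write a checkpoint
        server.close()

    t = threading.Thread(target=serve, daemon=True)
    t.start()
    return t
```

A job with multiple child processes would, as suggested above, have just
one process own the listener and relay the vacate instruction to the
others.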