
RE: [Condor-users] file transfer problems with vanilla job

> condor_vacate_job was added in 6.7.0.  

Thanks for clarifying this.

> and by "does not work", what do you mean? ;)

If the command doesn't exist, it would not work, would it? In my earlier
post about these problems I mentioned the version I use but was told it
works, so thanks for the clarification.

> not true.  if your job can checkpoint itself, there's a point to
> vacating it, even if it's a vanilla job.
> now, let's look at why:
> when_to_transfer_output = ON_EXIT_OR_EVICT
> in your submit file may not be working as you'd expect.
> when condor decides to vacate your job (either because of
> condor_vacate, condor_vacate_job, or because of the startd's policy
> expressions on the machine where it is running), it's going to send
> your job a signal.  on unix, you can specify the signal to send via
> the "kill_sig" setting in your submit file.  the default is SIGTERM.
> on windows, we send a WM_CLOSE.

As far as I/Ian have tested:

1. My application detects the Win32 WM_CLOSE message.
2. It gracefully terminates itself in about a minute or so.
3. It closes all files that are open.
4. The files get written to.
5. Even though there are 10 minutes before a hard kill, my app has
terminated, and all parameters are set to transfer files on eviction, the
files are not transferred.
6. My application checkpoints itself.
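
For reference, the shutdown pattern in points 1 to 4 and 6 can be sketched
roughly as below. This is only an illustration of the Unix SIGTERM case
mentioned in the quoted text, not my actual application (which reacts to
WM_CLOSE on Windows). The class name CheckpointingJob and the file name
state.json are made up for the example.

```python
import json
import os
import signal
import tempfile


class CheckpointingJob:
    """Toy long-running job that saves its state when asked to stop."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.step = 0
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Keep the signal handler tiny: only record that a stop was requested.
        self.stop_requested = True

    def write_checkpoint(self):
        # Write to a temporary file and rename it into place, so a kill in
        # the middle of the write never leaves a half-written checkpoint.
        directory = os.path.dirname(os.path.abspath(self.checkpoint_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as handle:
            json.dump({"step": self.step}, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, self.checkpoint_path)

    def run(self, max_steps):
        while self.step < max_steps and not self.stop_requested:
            self.step += 1  # stand-in for one unit of real work
        # All files are closed and the checkpoint is on disk before exit.
        self.write_checkpoint()
        return self.step


if __name__ == "__main__":
    job = CheckpointingJob("state.json")  # hypothetical file name
    signal.signal(signal.SIGTERM, job.request_stop)
    job.run(max_steps=10_000_000)
```

With something like this the checkpoint file is complete on disk well before
any hard kill, which is why I expected it to be transferred on eviction.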

I may be wrong, but if you want I can send you a precompiled binary of my
application so that you can test it yourself. I would be very grateful if
you could point out a fault in my code or, if there is a bug in Condor,
correct it.

> unfortunately, it's next to impossible for us to add such a thing.
> where would we transfer these files back to on the submit machine?

I cannot believe a person of your caliber is saying next to impossible :-)

It is dead simple.
This is how I would implement it:
1. On issue of condor_transfer job1 (a hypothetical command),
2. Condor puts all the processes it started to sleep (it may wait until all
I/O to files has ended).
3. It makes a copy of the files in its tmp directory.
4. It wakes the processes up.
5. It transfers the files from the tmp directory.
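
As a rough sketch of steps 2 to 4, assuming a Unix machine where SIGSTOP and
SIGCONT are available: none of this is Condor code, and the function names
are invented for illustration. Step 5, the actual transfer, is left to
whatever mechanism already moves files back to the submit machine.

```python
import os
import shutil
import signal
import tempfile


def suspend(pid):
    # Unix only: stop the process without terminating it.
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    # Unix only: let a stopped process continue.
    os.kill(pid, signal.SIGCONT)


def snapshot_job_files(pids, job_dir, suspend_fn=suspend, resume_fn=resume):
    """Suspend the job, copy its files to a temp directory, then resume it.

    Returns the path of the snapshot directory, which can then be
    transferred while the job keeps running.
    """
    snapshot_dir = tempfile.mkdtemp(prefix="job_snapshot_")
    stopped = []
    try:
        for pid in pids:
            suspend_fn(pid)
            stopped.append(pid)
        # Copy only regular files from the top level of the job directory.
        for name in os.listdir(job_dir):
            source = os.path.join(job_dir, name)
            if os.path.isfile(source):
                shutil.copy2(source, os.path.join(snapshot_dir, name))
    finally:
        # Always wake the processes up again, even if the copy failed.
        for pid in stopped:
            resume_fn(pid)
    return snapshot_dir
```

Suspending a process does not flush data it still holds in its own buffers,
so the application has to keep its files consistent on disk itself, which is
the part we users are happy to take care of.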

Besides, if you provide this function, many of us who write long-running
code can handle it ourselves. I have had programs running on Linux systems
non-stop for over 8-10 months with daily backups of files that were being
written to and read from, without a single corruption affecting their
usability. This was done with simple tricks, such as grandfather, father and
son files to provide redundancy, and cached timed writes, e.g. cache results
in RAM and write them out at a particular time (very efficient and fast),
then transfer the files to backup.
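
The grandfather, father and son trick amounts to something like the sketch
below. It is a simplified illustration, not the code I actually ran; the
function name and the file suffixes are made up.

```python
import os
import shutil

# Newest generation first.
GENERATIONS = ("son", "father", "grandfather")


def rotate_backups(data_file, backup_dir):
    """Keep three generations of backups of data_file in backup_dir."""
    os.makedirs(backup_dir, exist_ok=True)
    base = os.path.basename(data_file)
    paths = [os.path.join(backup_dir, f"{base}.{gen}") for gen in GENERATIONS]
    # Shift the oldest pair first: father -> grandfather, then son -> father.
    for newer, older in reversed(list(zip(paths, paths[1:]))):
        if os.path.exists(newer):
            os.replace(newer, older)
    # The current file becomes the new son.
    shutil.copy2(data_file, paths[0])
    return paths
```

If the newest copy turns out to be damaged, the father or grandfather copy
is still there to fall back on.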

Without this functionality Condor is basically crippled, and realistically
the people who use Condor are not going to be people who need point-and-click
solutions but people with some experience in long-term computing.

The only thing I and other users are asking for is at least a file transfer
mechanism that we can invoke ourselves. The problem of data
corruption/integrity is dead simple and we can handle that ourselves.
Besides, this functionality should be very simple for you to provide, as it
already exists in Condor.

Please, Condor team, hear our Christmas wish :-)

Thanks for the reply


Alan Arokiam,
The Materials Modelling Group,
Materials Science and Engineering,
Department of Engineering,
The University of Liverpool,
Brownlow Hill,
L69 3GH
Tel: 44-(0)151-794-4671