
[Condor-users] file transfer on Windows again



Dear All,

This kind of follows on from one of my earlier posts

http://lists.cs.wisc.edu/archive/condor-users/2004-July/msg00173.shtml

and one by Alan Arokiam (also at this institution).

http://lists.cs.wisc.edu/archive/condor-users/2004-October/msg00230.shtml

We're trying to get long-running jobs working on our Windows XP-based
Condor pool. The pool can only be used after office hours, and running jobs
will be soft killed and then eventually hard killed before the start
of the working day. Therefore the results need to be transferred back to
the submit host so that the calculations can be picked up again later.
Alan has written a Fortran 90 program to trap the WM_CLOSE signal sent by
Condor and exit the program gracefully. I've been experimenting with this
on a personal Condor pool with this config:

WANT_SUSPEND = FALSE
WANT_VACATE  = TRUE
START        = TRUE
SUSPEND      = FALSE
CONTINUE     = $(UWCS_CONTINUE)
PREEMPT      = ( $(ActivityTimer) > 60 )
KILL         = $(UWCS_KILL)      # 10 minutes

The job runs for 1 minute and then vacates and goes into the idle state
but no output is returned.  The output seems to be created OK in
the \condor\execute\dir_blah spool area though.

I've thrown everything I can think of into the .sub file:

should_transfer_output_files = YES
transfer_output_files = md.out
when_to_transfer_output_files = ON_EXIT_OR_EVICT

but still no output.
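
For comparison, here is a minimal sketch of the same three settings using
the submit command names given in the Condor manual, which are
should_transfer_files and when_to_transfer_output. The executable name is
a placeholder and the universe line is an assumption; neither comes from
the file above, and whether this changes the behaviour described here is
untested:

```
# Sketch only: md.exe is a hypothetical executable name
universe                = vanilla
executable              = md.exe
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_output_files   = md.out
queue
```

As far as the manual describes it, with ON_EXIT_OR_EVICT the files from an
evicted job are held in the job's spool directory on the submit machine
rather than being copied to the directory the job was submitted from, so
that may be the place to look for md.out after a vacate.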

When I try a simple .bat file with an infinite loop (and no trap) it runs
for 11 minutes and then goes to the idle state. This suggests to me
that the signal trap program works correctly and exits gracefully
when prompted. It seems as if Condor cleans up the environment before the
output is transferred back, or kills the daemons too soon.

Anyone had any luck with this kind of thing?

cheers,

-ian.


-----------------------------------
Dr Ian C. Smith,
e-Science team,
University of Liverpool
Computing Services Department