
[Condor-users] Transferring files in a Vanilla universe on the job being killed.



Dear All,

I am using a Condor pool running on Windows XP, in the vanilla universe. The pool terminates all jobs at 8.30 am every working day. I need my job to shut down before then so that the intermediate job states it has saved can be transferred back (my job saves auto-recovery information at intervals that I choose, independently of Condor checkpoints).

I read through the mailing list archive and came across this:

 

http://lists.cs.wisc.edu/archive/condor-users/2004-July/msg00173.shtml

 

So I wrote code that uses a Windows message queue to trap the WM_CLOSE Win32 message, and I poll this queue at suitable intervals to set a pointer that tells my application to shut down gracefully. I tested the application and it does shut itself down gracefully (an easy way to check is to click the X on the window in Windows).
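For readers who want to picture the polling pattern described above, here is a minimal, platform-neutral sketch in Python. It is not the code from this job. The names `run_with_graceful_stop`, `stop_requested`, `poll_every` and `state_path` are all hypothetical, and `stop_requested` is only a stand-in for whatever check the real application makes (on Windows, that would be the WM_CLOSE check on the message queue).

```python
import json
from typing import Callable


def run_with_graceful_stop(total_steps: int,
                           poll_every: int,
                           stop_requested: Callable[[], bool],
                           state_path: str) -> int:
    """Do up to total_steps units of work, polling for a stop request.

    stop_requested is a hypothetical stand-in for the real shutdown check.
    Recovery state is written before returning, whether the loop finished
    or was interrupted. Returns the number of steps completed.
    """
    step = 0
    while step < total_steps:
        step += 1  # placeholder for one unit of real work
        # Poll only every poll_every steps so the check stays cheap.
        if step % poll_every == 0 and stop_requested():
            break
    # Save the recovery information so a later run can resume from it.
    with open(state_path, "w") as fh:
        json.dump({"last_step": step}, fh)
    return step
```

The point of the design is that the work loop never blocks on the shutdown signal. It checks at intervals, and the recovery file is written on the same code path for both a normal finish and a requested stop.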

 

When I submit the job to the Condor queue it runs fine, but at 8.30 am the job is evicted and no files are transferred. The job remains in the queue and is run again later, yet still no files are transferred back.

 

The submission script is:

 

universe = vanilla

Requirements = (CSD_CONDOR_POOL == "MEBC") && (OpSys == "WINNT51")

executable = hellotest.exe

output = mdi.out

error = mdi.err

transfer_input_files = input.dat,iapn_c.dat,iapn_i.dat,iapn_m.dat,iapp_c.dat,iapp_i.dat,iapp_m.dat,rrelx.dat,rrely.dat,rrelz.dat

should_transfer_files = YES

when_to_transfer_output = ON_EXIT_OR_EVICT

log = mdi.log

notification = Error

queue

 

 

 

 

and a typical log is:

 

000 (074.000.000) 10/21 03:34:03 Job submitted from host: <xxx.xxx.xx.xxx:1027>

...

001 (074.000.000) 10/21 03:34:13 Job executing on host: <xxx.xxx.xxx.xx:1029>

...

006 (074.000.000) 10/21 03:34:21 Image size of job updated: 10476

...

006 (074.000.000) 10/21 03:54:21 Image size of job updated: 11168

...

006 (074.000.000) 10/21 04:14:21 Image size of job updated: 11176

...

004 (074.000.000) 10/21 08:31:00 Job was evicted.

            (0) Job was not checkpointed.

                        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage

                        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage

            0  -  Run Bytes Sent By Job

            1444002  -  Run Bytes Received By Job

...

001 (074.000.000) 10/21 17:30:22 Job executing on host: <xxx.xxx.xxx.xx:1029>

...

006 (074.000.000) 10/21 17:50:31 Image size of job updated: 11168

...

006 (074.000.000) 10/21 18:10:31 Image size of job updated: 11176

...

006 (074.000.000) 10/21 23:30:32 Image size of job updated: 11180

...

006 (074.000.000) 10/21 23:50:32 Image size of job updated: 11188

...

004 (074.000.000) 10/22 08:30:06 Job was evicted.

            (0) Job was not checkpointed.

                        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage

                        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage

            0  -  Run Bytes Sent By Job

            1444002  -  Run Bytes Received By Job
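The eviction events in a log like the one above can also be checked with a short script. The following is a rough sketch only, assuming the user log keeps the layout shown in this message (a three-digit event code, the job ID in parentheses, then date and time). The function name `evictions` and the regular expressions are my own and are not part of Condor.

```python
import re

# Event header, e.g. "004 (074.000.000) 10/21 08:31:00 Job was evicted."
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)
# Detail line, e.g. "    0  -  Run Bytes Sent By Job"
BYTES_SENT = re.compile(r"^\s*(\d+)\s+-\s+Run Bytes Sent By Job")


def evictions(log_text):
    """Return (date, time, bytes_sent) for each 004 (evicted) event.

    An eviction event with no "Run Bytes Sent By Job" line is skipped.
    """
    results = []
    current = None  # (date, time) of the eviction event being read, if any
    for line in log_text.splitlines():
        header = EVENT_HEADER.match(line.rstrip())
        if header:
            code, date, time = header.group(1), header.group(5), header.group(6)
            current = (date, time) if code == "004" else None
            continue
        sent = BYTES_SENT.match(line)
        if sent and current:
            results.append((current[0], current[1], int(sent.group(1))))
            current = None
    return results


sample = """\
001 (074.000.000) 10/21 03:34:13 Job executing on host: <xxx.xxx.xxx.xx:1029>
...
004 (074.000.000) 10/21 08:31:00 Job was evicted.
            (0) Job was not checkpointed.
                        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
            0  -  Run Bytes Sent By Job
            1444002  -  Run Bytes Received By Job
...
"""
print(evictions(sample))
```

A value of 0 for bytes sent at each eviction matches what the log shows: nothing was sent back by the job during that run.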

 

 

 

I am not the admin of the pool, so I can't change any settings, and the admin is not available at the moment. Any help would be appreciated.

 

PS: Basically, I need the intermediate files from my job to be transferred to my machine every day at 8.30 am.

 

Thank you,

Alan

 

 

Alan Arokiam,

The Materials Modelling Group,

Materials Science and Engineering,

Department of Engineering,

The University of Liverpool,

Brownlow Hill,

Liverpool,

UK.

L69 3GH

Tel: 44-(0)151-794-4671