
RE: [Condor-users] Transferring files in a Vanilla universe on the job being killed.




Alan,

Another method that might help you is the Bristol "enhancement" for Condor.  It's buried on the download page among the Condor-PVM downloads and the like.  From what I remember it's listed as being for version 6.4.x, but it works great with 6.6.x and 6.7.x.  It lets you use a remote file share in Condor under Windows.  It also creates a virtual desktop, so perhaps that might help you further with the WM_CLOSE bit, plus you'll be able to work with your files directly on a file share.

I just found this a week ago.  This enhancement is a godsend.  But I warn you: the Windows NT workstation editions (4.0, 2K, XP) only allow 10 connections, so you'll need your file share to be on a Windows server.  The documentation is very straightforward.



Scott

-----Original Message-----
From: Alan Christy Arokiam [mailto:alanca@xxxxxxxxxxxxxxx]
Sent: October 25, 2004 6:28 AM
To: 'matthew hope'; 'Condor-Users Mail List'
Subject: RE: [Condor-users] Transfering files in a Vanilla universe on
the jobbeing killed.


Dear Matt,
Thanks for the reply.
> 1) Do you die in time?
> You must respond* to the WM_CLOSE within the constraints of the KILL
> variable on the client
> If KILL evaluates immediately to true then you may never exit in time...
>
I don't know! How can I find out? My job does remain in the queue; otherwise
it just gets removed if it were still running, so it may detect it.

How do I find out whether KILL evaluates immediately to true? (The admin is
not available at the moment.) Also, I understand that the system may simply
not be set to transfer files. Can I check this?

> 2) are you a console app?
> Horrific hacky way condor seems to get the WM_CLOSE to you. It
> enumerates all the windows on its 'screen' and sends a WM_CLOSE to
> them. Therefore if you wish your app to receive this message you must
> somewhere create a form.
> Messy, unpleasant..but it works.

My app used to be a console application. Now it is a Single Document
Interface (SDI) Windows FORTRAN application. Basically it has two threads,
one for the message loop and the other for the computational loop. The old
console app is called as an external subroutine in the 2nd thread, and the
old app checks a global variable for the WM_CLOSE flag and kills itself. At
the moment it checks the flag once per loop, and each loop takes about 0.5s
to complete.
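
For anyone following along, the two-thread arrangement described above can be
sketched roughly as below. This is only an illustration in Python, not the
FORTRAN/Win32 code itself: a threading.Event stands in for the global
WM_CLOSE flag, and the names close_requested, save_state and compute_loop
are made up for the example.

```python
import threading
import time

# Stand-in for the global flag that the message-loop thread would set
# when WM_CLOSE arrives (hypothetical name, not part of Condor or Win32).
close_requested = threading.Event()


def save_state(step, log):
    # Placeholder for writing the auto-recovery files to disk.
    log.append(("saved", step))


def compute_loop(max_steps, log, step_time=0.01):
    """Run the computation, checking the flag once per iteration."""
    for step in range(max_steps):
        if close_requested.is_set():
            save_state(step, log)   # flush state before exiting
            return step             # number of steps completed
        time.sleep(step_time)       # one unit of work (~0.5 s in the real job)
    save_state(max_steps, log)
    return max_steps


if __name__ == "__main__":
    log = []
    worker = threading.Thread(target=compute_loop, args=(1000, log))
    worker.start()
    time.sleep(0.05)
    close_requested.set()           # what the WM_CLOSE handler would do
    worker.join()
    print(log)
```

The point of the sketch is that the worst-case delay between the close
request and the process exiting is one loop iteration plus the time needed
to save state, and that total has to fit inside whatever grace period the
pool allows before a hard kill.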


Suggestion:
Since I haven't seen a suggestion page for Condor, I would like to suggest
the following, for the vanilla universe at least. (If the function does
exist, please do correct me!)

1. Have a command for the user to kill the job and transfer back all files;
condor_rm is not enough. Long-running apps checkpoint themselves, and the
lack of such functionality is a pain. If such a command existed, my
submitting machine could use the Windows AT command to send the signal to
all my jobs at 8.30am.

2. Alternatively, have a run time in the submission script, and transfer back
all files on kill. E.g. from submission time keep the job in the queue,
whether it runs or not, for say 6 hours, then kill it if needed and transfer
back all files at the end of the 6 hours.
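
Until something like suggestion 2 exists, a job could approximate it by
exiting cleanly on its own shortly before the known eviction time, so that
the ordinary transfer on exit applies. A minimal sketch, in Python for
illustration only: the 08:30 deadline comes from this pool, while the
10-minute safety margin and the function names are assumptions.

```python
from datetime import datetime, timedelta


def seconds_until_deadline(now, hour=8, minute=30, margin_s=600):
    """Seconds left before (next deadline - margin).

    Weekends are ignored here; the pool in question only evicts on
    working days, so this errs on the side of stopping too often.
    """
    deadline = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if deadline <= now:
        deadline += timedelta(days=1)   # today's deadline has already passed
    return (deadline - now).total_seconds() - margin_s


def should_stop(now, loop_s=0.5, **kw):
    """True when another loop iteration might not finish in time."""
    return seconds_until_deadline(now, **kw) <= loop_s


if __name__ == "__main__":
    print(should_stop(datetime.now()))
```

The computational loop would call should_stop(datetime.now()) once per
iteration and save its state and exit when it returns True. Note that a job
which exits by itself counts as finished, so it would have to be resubmitted
with the saved state as input.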

I do really like Condor, but at the moment it seriously lacks very important
functionality in the vanilla universe (which is very common on university
machines running Windows). Correct me if I am wrong, but I believe there is
a serious flaw in the implementation of the vanilla universe: the default
behaviour should be to transfer files back to the user on a hard kill,
regardless of whether the job is running or not, as this would remove all
the cumbersome procedures users must follow. My vision of an ideal Condor
is one where a user specifies binary/input files and sends the job, and at
termination (by the user or by Condor) Condor simply transfers back all
files created/modified in its temporary directory. This would make Condor
transparent and simple to use, with no need for Win32 WM_CLOSE signal
trapping and so on. Implementing this in Condor should be relatively easy(?)

Thank you,
Alan



> I would very much like this behaviour to be changed (since it is
> possible from a processid to determine its windows handle (which
> everything has even if it doesn't have a window)) and from there send
> it the message direct.
>
> This may only be a problem if you use a script to fire up your
> application - perhaps there is explicit logic in the startd to send
> the message to its initial child process but not any descendants...
>
> It would appear few people use the windows signals (for example the
> bug in dagman's windows signalling that went un-noticed for ages) on
> vacation so the more feedback to this list of it working/not working
> the better since it would appear to need some more documentation at
> the least.
>
> Matt
>
> * this means your process exits.
>
> On Sun, 24 Oct 2004 20:26:17 +0100, Alan Christy Arokiam
> <alanca@xxxxxxxxxxxxxxx> wrote:
> >
> >
> >
> > Dear All,
> >
> > I am using a condor system running on Windows XP, vanilla universe. The
> > condor system terminates all jobs at 8.30 am, every working day, I have to
> > have the job terminate before then in order to transfer intermediate job
> > states saved by my job (my job saves auto recovery information at intervals
> > determined by me, it is independent of condor checkpoints).
> >
> > I had read through the mailing list and came across this:
> >
> >
> >
> > http://lists.cs.wisc.edu/archive/condor-users/2004-July/msg00173.shtml
> >
> >
> >
> > So I wrote a code with a windows messaging queue to trap the WM_CLOSE Win32
> > message, and polled this queue at suitable intervals to set a pointer to
> > gracefully kill my application. I tested this application and it does
> > gracefully kill itself ( an easy way is the X on the window in Windows).
> >
> >
> >
> > When I send the job to the condor queue it works fine, but at 8.30am the job
> > gets evicted and no files are transferred, and the job does remain in the
> > queue and is again submitted, yet no files are transferred back?
> >
> >
> >
> > The submission script is:
> >
> >
> >
> > universe = vanilla
> >
> > Requirements = (CSD_CONDOR_POOL == "MEBC") && (OpSys == "WINNT51")
> >
> > executable = hellotest.exe
> >
> > output = mdi.out
> >
> > error = mdi.err
> >
> > transfer_input_files = input.dat,iapn_c.dat,iapn_i.dat,iapn_m.dat,iapp_c.dat,iapp_i.dat,iapp_m.dat,rrelx.dat,rrely.dat,rrelz.dat
> >
> > should_transfer_files = YES
> >
> > when_to_transfer_output = ON_EXIT_OR_EVICT
> >
> > log = mdi.log
> >
> > notification = Error
> >
> > queue
> >
> > and a typical log is:
> >
> <snip>
> >
> >
> >
> > I am not the admin of the pool, so I can't change any settings as well, also
> > the admin is not available at the moment. Any help will be appreciated.
> >
> >
> >
> > PS basically I need intermediate files from my job to be transferred
> > everyday at 8.30am to my machine.
> >
> >
> >
> > Thank you,
> >
> > Alan
> >
> >
> >
> >
> >
> > Alan Arokiam,
> >
> > The Materials Modelling Group,
> >
> > Materials Science and Engineering,
> >
> > Department of Engineering,
> >
> > The University of Liverpool,
> >
> > Brownlow Hill,
> >
> > Liverpool,
> >
> > UK.
> >
> > L69 3GH
> >
> > Tel: 44-(0)151-794-4671
> >
> >
> > _______________________________________________
> > Condor-users mailing list
> > Condor-users@xxxxxxxxxxx
> > http://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >
> >
> >