
Re: [Condor-users] how to kill job when output dir removed ?




As Ian Chesal's response indicated, Condor 6.8 puts jobs on hold when there are errors transferring files. Earlier versions would instead keep trying to run the job over and over, often in the vain hope that the error was transient. You can now construct an automated policy for what should happen to the held jobs by configuring SYSTEM_PERIODIC_RELEASE or SYSTEM_PERIODIC_REMOVE (or the user can configure a policy with the corresponding job policy expressions, periodic_release and periodic_remove, in the submit file).

As Ian also pointed out, it would be nice to have more control over Condor's emailing, to avoid bombarding users. I have found that, for all practical purposes, any user operating at large scale simply must set notification = never in their submit file.
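
To make this concrete, here is a rough sketch of such a policy on the schedd side. It assumes you simply want held jobs removed once they have been on hold for an hour; the one-hour threshold is an arbitrary illustration, and the expression does not distinguish file transfer failures from other reasons for being held, so adjust it to your site before using it.

```
# condor_config on the submit machine (sketch, not a recommended default)
# JobStatus == 5 means the job is held; EnteredCurrentStatus is the time
# the job entered that state. The 3600-second threshold is an assumption.
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5) && \
                         ((CurrentTime - EnteredCurrentStatus) > 3600)
```

A user can get much the same effect for a single job from the submit file, which is also where notification is set. The executable name below is hypothetical:

```
# submit description file (sketch)
universe                = vanilla
# my_job is a hypothetical executable name
executable              = my_job
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# No email from Condor about this job
notification            = never

# Remove the job if it has been on hold for more than an hour (arbitrary)
periodic_remove         = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 3600)

queue
```

Use periodic_release (or SYSTEM_PERIODIC_RELEASE) in the same way if you would rather have the job retried than removed.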

I should also note that there was one case of file transfer errors not handled by 6.8's put-on-hold policy: a failure while writing the output to the submit machine (e.g. because the disk is full). This has been fixed in Condor 6.9.1, so jobs will go on hold in this case too. It was difficult to judge whether this was a bug fix or a change in behavior (i.e. suitable for 6.8 vs. 6.9). In the end, I decided to put it into 6.9.

--Dan

Dr Ian C. Smith wrote:
Hi,

I have a recurring problem here where our users submit
files through a web interface but then inadvertently
remove the directory the Condor input/output files
are sitting in without killing the job first. I've
tried all sorts of safeguards to prevent this but they
still seem to find a way of doing it (that's users for ya!).

Condor's "try, try and try again" strategy means that
it keeps attempting to write the output files in the hope
that the directory might reappear, deluging my inbox
with error messages in the process.

I can understand that this may have been put in to deal with flaky
NFS filesystems (although I see that Condor tries to avoid
these like the plague now) but is there any way of getting
Condor to just give up if it can't write the output files?
If not, can it be set up not to bombard me with e-mail warnings?

On a related point - if I specify that a particular output file
is to be transferred back from the execute host using

transfer_output_files =

and the file isn't there (usually because the executable
has bombed) it just seems to keep on trying in vain.
Is there any way to prevent this too?

regards,

-ian.

--------------------
Dr Ian C. Smith,
The University of Liverpool,
Computer Services Department