
Re: [Condor-users] Suppress Windows error dialogs popping up for crashing Condor jobs



Thanks for your input.  I still need a check like the one you describe, based on CPU usage and a kill time.  In particular I want to guard against infinite-loop bugs, and against developers popping up a message box on purpose.  They know they aren't supposed to pop up an error message explicitly in a job; they are supposed to call our wrapped versions of all API calls, which route anything that would produce a GUI through a switch that just logs when running in Condor mode.

My problem was an unexpected crash that made Windows produce an invalid memory access error message, and for some reason the programmatic approach of calling the Windows API function SetErrorMode was missing this one on Windows XP.  The registry key I listed fixed that, but it is a tough choice between telling the customer to edit their registry and spending additional dev/test/doc time on a configuration tool for them.  If there is a way to keep that dialog from appearing entirely from within the code on XP, I would love to know about it.

As for your idea of automatically killing a long-running job that is not using much CPU: do you implement this in Condor with a periodic remove?  Or do you implement it in a thread of the Condor job via its Python wrapper?


--Derrick

On Wed, May 18, 2011 at 9:30 AM, Michael O'Donnell <odonnellm@xxxxxxxx> wrote:
Derrick, I have run into similar problems, and generally this is handled in
the application. One thought is to check whether the developers can add a
switch that makes the program write the error to STDOUT and exit with an
error code instead of showing a popup message. I was working on a numerical
hydrologic model that was written by someone else in Fortran, and it had a
popup that required the user to click OK when the program completed
successfully (as if you would not know the program completed its analysis
successfully). Anyhow, I was able to change the underlying code so popups
did not occur. I would imagine this could be done in your case.

Most of the applications I run are wrapped inside a Python script,
which gives me a better programming language than something like
DOS batch files. VBS or something else could also be used. I had also
looked into SendKeys, but I had a difficult time getting it to work
because there was something different about the window station environment
(a popup occurs, but it does not actually exist), and although SendKeys
worked when running the application locally, it would not work when
executed via Condor.

A couple of other ideas: monitor the CPU usage of the exe's process. If it
falls below a threshold and remains there for a certain duration, then kill
it. You can also set a maximum runtime for a Condor job and kill the job if
this is exceeded. Although these methods work, in my opinion the best
method is to add a switch or something that allows error messages to be
sent to STDOUT instead of a popup. There may be a better way, but this is
what I did in the past.
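
For illustration, the two kill rules above (low CPU for too long, or total
runtime exceeded) could be sketched in a Python wrapper roughly as follows.
This is only a sketch under assumptions: the class name, the thresholds and
the idea of feeding it cumulative CPU seconds are made up for the example
and are not from any Condor API. How the wrapper obtains the job's CPU time
(for instance from a process-monitoring library) is left to the caller.

```python
class IdleWatchdog:
    """Decide when a job should be killed (hypothetical helper, not a Condor API)."""

    def __init__(self, min_cpu_fraction=0.02, idle_limit=300.0, max_runtime=3600.0):
        self.min_cpu_fraction = min_cpu_fraction  # below this the job counts as idle
        self.idle_limit = idle_limit              # seconds of idleness tolerated
        self.max_runtime = max_runtime            # hard wall-clock limit in seconds
        self._start = None
        self._last_t = None
        self._last_cpu = None
        self._idle_since = None

    def update(self, now, cpu_seconds):
        """Feed wall-clock time and cumulative CPU seconds.

        Returns a reason string if the job should be killed, else None.
        """
        if self._start is None:
            # First sample only establishes the baseline.
            self._start = self._last_t = now
            self._last_cpu = cpu_seconds
            return None
        elapsed = now - self._last_t
        if elapsed > 0:
            usage = (cpu_seconds - self._last_cpu) / elapsed
            if usage < self.min_cpu_fraction:
                if self._idle_since is None:
                    # The idle period began at the previous sample.
                    self._idle_since = self._last_t
            else:
                self._idle_since = None
            self._last_t, self._last_cpu = now, cpu_seconds
        if now - self._start > self.max_runtime:
            return "max runtime exceeded"
        if self._idle_since is not None and now - self._idle_since >= self.idle_limit:
            return "cpu idle too long"
        return None
```

The idle rule is the one that would catch a job stuck behind a dialog, since
a process waiting on a click uses almost no CPU. An infinite loop keeps the
CPU busy, so only the maximum-runtime rule would catch that case.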


mike





From: Derrick Karimi <derrick.karimi@xxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 05/18/2011 07:13 AM
Subject: [Condor-users] Suppress Windows error dialogs popping up for crashing Condor jobs
Sent by: condor-users-bounces@xxxxxxxxxxx



Hi,

I am working on fault tolerance on our system.  When our jobs run,
sometimes they crash.  I told the developers to fix the code, but they
told me to rerun the job because they can't reproduce the problem...I will
work on their attitude later.

My problem was Windows popping up various error reporting and crash
dialogs.  When the dialog pops up, the process won't exit until the user
clicks OK, and eventually Condor will restart the job.  The first process
is still holding resources and the second process keeps failing.  After
mucking with 4 different places in the registry and UI on XP, Vista and 7
(as well as every place in the UI where I could control error reporting,
and disabling the error reporting service), I was still seeing popups.  I
started using the Windows SetErrorMode function, which in practice only
worked for me on Windows 7 and Vista.  On XP I was still seeing an
"Application Error" popup saying the memory could not be "read", on a
simple null pointer dereference.
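
For reference, a minimal sketch of the SetErrorMode approach when the job is
started from a Python wrapper. The flag values come from the Windows SDK
headers; the function names and the choice of flags are assumptions made for
the example. It relies on a child process inheriting the error mode of its
parent by default. As described above, this alone was not enough on XP.

```python
import ctypes
import subprocess
import sys

# Flag values from the Windows SDK (winbase.h).
SEM_FAILCRITICALERRORS = 0x0001   # no critical-error message boxes
SEM_NOGPFAULTERRORBOX = 0x0002    # no general protection fault dialog
SEM_NOOPENFILEERRORBOX = 0x8000   # no "file not found" message boxes

QUIET_ERROR_MODE = (
    SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX | SEM_NOOPENFILEERRORBOX
)


def suppress_error_dialogs():
    """Set this process's error mode.

    Returns the previous mode on Windows, or None on other platforms.
    """
    if sys.platform != "win32":
        return None
    return ctypes.windll.kernel32.SetErrorMode(QUIET_ERROR_MODE)


def run_job(argv):
    """Run the job with the quiet error mode in place; return its exit code."""
    suppress_error_dialogs()
    # The child inherits the parent's error mode by default.
    return subprocess.call(argv)
```

A wrapper would then call something like run_job(["myjob.exe", "--condor"]),
where the executable name and switch are placeholders.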

Finally I came across the article
http://support.microsoft.com/kb/128642


which tells you to set in the registry:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Windows\ErrorMode = 2

This seems to suppress the failure dialog on the XP systems.
As a note: I am still not sure whether you also need to disable the
Dr. Watson debugger, but I did that on the way to finding this solution.
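
If editing the registry by hand is not an option, the same value could be
written from a small script. The following is only a sketch: it assumes the
value is a REG_DWORD, that the script runs with administrator rights, and
the constant and function names are invented for the example.

```python
import sys

# Key and value as given above, relative to HKEY_LOCAL_MACHINE.
ERROR_MODE_KEY = r"SYSTEM\CurrentControlSet\Control\Windows"
ERROR_MODE_VALUE = "ErrorMode"
NO_POPUPS = 2  # the setting used in this thread


def set_system_error_mode(mode=NO_POPUPS):
    """Write ErrorMode under HKLM. Returns False when not on Windows."""
    if sys.platform != "win32":
        return False
    import winreg  # Windows-only standard library module

    with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE, ERROR_MODE_KEY, 0, winreg.KEY_SET_VALUE
    ) as key:
        # Assumption: the value is stored as a REG_DWORD.
        winreg.SetValueEx(key, ERROR_MODE_VALUE, 0, winreg.REG_DWORD, mode)
    return True
```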

--Derrick
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/





