
[condor-users] Unexplained status=128



Hi all,

For some time now I have had a seemingly random problem on Windows: programs fail with status 128 on some nodes. I finally tracked it down to our main GUI running on the node. Here is what happens: we have a main GUI application that submits jobs to Condor and reports the results back to the user. The program executed on each node is a batch file that sets up the file shares and runs the non-GUI worker applications.

Everything runs fine unless the node on which Condor tries to run the worker programs also has the main GUI running. In that case the programs exit with status 128 (DLL not found), yet all the DLLs are there. I put a directory listing in the batch commands and verified that the file shares are mapped and all the needed DLLs are present. I also used "depends" to write the dependencies to a file, which I analysed; it confirms that all the DLLs are there when the programs run. The only explanation I can think of is that one of the DLLs fails to load while the program is running, because of its DllMain initialization or delayed initialization. However, it is not easy to find which one, since there are a lot of our own and third-party DLLs. For the time being I have changed the submit file to reschedule the job when status 128 occurs, using the "on_exit_remove" expression.
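
A minimal sketch of such a retry rule in a submit description file is shown here. The executable name run_worker.bat is hypothetical, and the exact expression in the real submit file may differ; the sketch only assumes that a job whose on_exit_remove expression evaluates to false stays in the queue and is run again.

```
# Hypothetical submit description file; run_worker.bat stands in for
# the batch file that maps the shares and starts the worker programs.
universe       = vanilla
executable     = run_worker.bat

# Leave the queue on any exit code except 128. When the expression is
# false (exit code 128), the job stays queued and is rescheduled.
# "=!=" also evaluates to true when ExitCode is undefined.
on_exit_remove = (ExitCode =!= 128)

queue
```

Note that a rule like this retries indefinitely if the job keeps landing on a node that has the GUI running, so it works around the failure rather than explaining it.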

Does anybody have any ideas on why this happens and how to tackle it? Or is there any way I could tell that the GUI is running on the node, or that a user is actively logged in even though the keyboard and mouse are not moving?

Thanks
BTB
