
RE: [condor-users] Unexplained status=128



You raise the question of whether a user is already logged into the
remote NT machine. Could that be the cause of the 128 errors? My
understanding is that NT can only handle one user at a time.

I have been struggling with similar 128 problems, but haven't been able to
track them down. (I am passing many DLLs found by dumpbin and loadtest...)
The most irritating thing is that my own submit machine shows the code 128
behavior, and I know it can run the jobs. My workaround has been to exclude
execute nodes that show the problem. The problem with resubmitting jobs that
exit with 128 is that the same nodes keep accepting jobs and running through
them quickly because they don't actually compute.
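
For reference, a minimal sketch of how the two workarounds discussed in this
thread might look in a submit description file. The host names are
hypothetical placeholders, and the expressions are an assumption about how
one could write this, not the exact lines either of us is using:

```
# Exclude execute nodes known to show the status 128 problem.
# (Hypothetical host names; substitute the affected machines.)
requirements = (Machine != "badnode1.example.com") && (Machine != "badnode2.example.com")

# Keep the job in the queue so it is rescheduled when it exits with
# code 128; remove it from the queue on any other normal exit.
on_exit_remove = (ExitBySignal == False) && (ExitCode != 128)
```

Used together, the requirements line keeps a rescheduled job from landing
back on a node that is already known to fail.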

brent

> -----Original Message-----
> From: Belay T Beshah [mailto:belay.beshah@xxxxxxxxxxxxxxxxxxxxxxxx]
> Sent: Thursday, October 23, 2003 6:03 PM
> To: condor-users@xxxxxxxxxxx
> Subject: [condor-users] Unexplained status=128
> 
> 
> Hi all,
> 
> For some time now I have had a seemingly random problem on the Windows
> platform: programs fail with status 128 on some nodes. I finally tracked
> it down to our main GUI running on the node. Here is what happens: we
> have a main GUI application that submits jobs to Condor and reports the
> results back to the user. The program executed on each node is a batch
> file that sets up the file shares and runs the non-GUI worker
> applications.
> 
> Everything runs fine until the node that Condor tries to run the worker
> programs on has the main GUI running. In that case the programs exit
> with status 128 (DLL not found), but the problem is that all the DLLs
> are there. I have put a directory listing in the batch commands and
> verified that the file shares are mapped and all the DLLs needed are
> there. I have also used "depends" to write the dependencies to a file,
> which I analysed, and it verifies that all the DLLs are there at run
> time. The only thing I can think of is that one of the DLLs could not
> be loaded when the program is running because of its DllMain
> initialization or delayed initialization. However, it is not easy to
> find which one, since there are a lot of our own and third-party DLLs.
> For the time being I have changed the submit configuration file to
> reschedule the job if status 128 happens, using the "On_exit_remove"
> criteria.
> 
> Does anybody have any ideas on why this happens and how to tackle this
> problem? Or is there any way I could know that the GUI is running on
> the node, or that the user is actively logged in even though the
> keyboard and mouse are not moving?
> 
> Thanks
> BTB
> 
> Condor Support Information:
> http://www.cs.wisc.edu/condor/condor-support/
> To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
> unsubscribe condor-users <your_email_address>
> 