
Re: [condor-users] Unexplained status=128




Griffith, Brent wrote:


You raise the question of whether or not a user is already logged into the
remote NT machine. Could that be the cause of 128 errors? My
understanding is that NT can only handle one user at a time.


I think Windows 2000 and XP can run Condor jobs even while a user is logged in; I haven't tested this on NT. I have had Condor using machines while a user was logged in. My main problem arises when the logged-in user is running the main GUI, which shares many DLLs with the worker executables: in that case I get error 128. It took me a while to track this down, because the failures looked random; a machine that had run the job fine a couple of hours earlier would suddenly return error 128. Maybe you could check your nodes to see whether the problem occurs when a DLL that your worker executables share is in use by another executable running at that time.
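To make the DLL check concrete, here is a rough Python sketch that compares two dependency listings and reports the DLL names they have in common. It assumes you have already saved text listings for the worker executable and the GUI (for example from dumpbin /dependents); the function names and the sample listings are made up for illustration, and the pattern simply picks out anything that looks like a .dll file name rather than parsing any particular tool's output format.

```python
import re

# Matches tokens that look like DLL file names, e.g. "KERNEL32.dll".
DLL_NAME = re.compile(r"[\w.\-]+\.dll\b", re.IGNORECASE)


def dll_names(listing):
    """Return the set of lower-cased DLL names mentioned in a text listing."""
    return {match.group(0).lower() for match in DLL_NAME.finditer(listing)}


def shared_dlls(worker_listing, gui_listing):
    """Return the sorted DLL names that appear in both listings."""
    return sorted(dll_names(worker_listing) & dll_names(gui_listing))


if __name__ == "__main__":
    # Hypothetical listings; replace with the saved output for your binaries.
    worker = "    KERNEL32.dll\n    solver.dll\n    MSVCRT.dll\n"
    gui = "    kernel32.dll\n    Solver.DLL\n    USER32.dll\n"
    for name in shared_dlls(worker, gui):
        print(name)
```

System DLLs such as kernel32.dll will always show up in the overlap, so the interesting entries are the application's own DLLs.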

I have been struggling with similar 128 problems, but haven't been able to
track them down. (I am passing many DLLs found by dumpbin and loadtest... )
The most irritating thing is that my own submit machine shows the code 128
behavior and I know it can run the jobs.

That is exactly what is bothering me, and the bad thing about it is the randomness. It would be nice if there were a way to know whether a user is logged into the machine. Even though it reduces the number of nodes I have for computation, resubmitting works, since the user never notices that this happened; as far as he is concerned, the job is completed. That is one of the things I like about Condor.

My workaround has been to exclude
execute nodes that show the problem. The problem with resubmitting jobs that
exit with 128 is that the same nodes keep accepting jobs and running through
them quickly because they don't actually compute.
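
One way to keep such an exclusion list up to date is to generate it from a record of which hosts returned 128. The Python sketch below is only an illustration: the (host, exit status) record format and the host names are assumptions, not something Condor produces in this form. It builds an expression on the Machine attribute that could be pasted into the requirements line of a submit file.

```python
def failing_hosts(records, bad_status=128):
    """Return the sorted, unique host names whose job exited with bad_status."""
    return sorted({host for host, status in records if status == bad_status})


def exclusion_requirements(hosts):
    """Build a ClassAd expression that rejects each listed machine."""
    if not hosts:
        # Nothing to exclude: an always-true expression.
        return "TRUE"
    return " && ".join('(Machine != "%s")' % host for host in hosts)


if __name__ == "__main__":
    # Hypothetical (host, exit status) pairs collected from earlier runs.
    records = [
        ("node1.example.org", 128),
        ("node2.example.org", 0),
        ("node1.example.org", 128),
        ("node3.example.org", 128),
    ]
    print("requirements = " + exclusion_requirements(failing_hosts(records)))
```

This only automates the manual exclusion described above; it does nothing about the underlying cause of the 128 exits, and a node excluded this way stays excluded until the list is regenerated.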


I think there is a solution for that. I have seen an option to configure the negotiator not to send the job straight back to the same node after a failure, but I can't remember which option it is. Maybe the Condor team will enlighten us on this.

BTB

