
Re: [condor-users] jobs don't start



On Thu, 31 Jul 2003 14:20:58 +0100, Alexander Klyubin <A.Kljubin@xxxxxxxxxxx> wrote:

Is C:\Condor\batch the same for all users of the Terminal Server machine (you submit the job from there, right)?

Actually, I'm the only one testing Condor right now. And yes, I submit the jobs from there.

Is it writable for the user under which the Condor service runs? (you said it is...)

Yes, I'm running the service under the same account that created this directory and the files in it.
This account also has administrative rights, which means that I can write, delete, rename, etc. every file and directory.


Greetings,
Thomas

Regards,
Alexander Klyubin

On 07/31/2003 02:02 PM, Thomas Bauer wrote:
Thomas - the Shadow Exceptions seem to be because it's losing contact
with the condor_starter on the other side. (The errno 10054 is a Winsock
"Connection reset by peer") - it would be helpful to see a StarterLog from
the machine 128.176.206.149.


Ok, here is a sample of the StarterLog of this machine:
-----------------------------------------------------------------
7/31 13:16:43 ******************************************************
7/31 13:16:43 ** condor_starter (CONDOR_STARTER) STARTING UP
7/31 13:16:43 ** $CondorVersion: 6.5.3 Jul 3 2003 $
7/31 13:16:43 ** $CondorPlatform: INTEL-WINNT40 $
7/31 13:16:43 ** PID = 1152
7/31 13:16:43 ******************************************************
7/31 13:16:43 Using config file: C:\Condor\condor_config
7/31 13:16:43 Using local config files: C:\Condor\condor_config.local
7/31 13:16:43 DaemonCore: Command Socket at <128.176.206.149:1361>
7/31 13:16:43 Setting resource limits not implemented!
7/31 13:16:43 Starter communicating with condor_shadow <128.176.206.141:3170>
7/31 13:16:43 Submitting machine is "WMTP01.UNI-MUENSTER.DE"
7/31 13:16:44 File transfer completed successfully.
7/31 13:16:45 Starting a VANILLA universe job with ID: 8.0
7/31 13:16:45 IWD: C:\Condor\execute\dir_1152
7/31 13:16:45 Output file: C:\Condor\execute\dir_1152\trapez.out
7/31 13:16:46 Error file: C:\Condor\execute\dir_1152\trapez.err
7/31 13:16:46 Renice expr "10" evaluated to 10
7/31 13:16:46 About to exec C:\Condor\execute\dir_1152\condor_exec.exe
7/31 13:16:46 Create_Process succeeded, pid=756
7/31 13:17:11 Job exited, pid=756, status=0
7/31 13:17:11 Failed to rename(\core,\core.8.0): errno 0 (No error)
7/31 13:17:11 ReliSock: put_file: TransmitFile() failed, errno=10054
7/31 13:17:11 ERROR "DoUpload: Failed to send file C:\Condor\execute\dir_1152\fort.19, exiting at 1371
" at line 1370 in file ..\src\condor_c++_util\file_transfer.C
7/31 13:17:11 ShutdownFast all jobs.
7/31 13:17:11 Error disabling account condor-reuse-vm1 (ACCESS DENIED)
-----------------------------------------------------------------


The same messages appear over and over again.

Greetings,
Thomas Bauer




If trapezregel.exe is just a normal executable that runs perfectly on any of those machines when you install it there manually, then maybe you should try explicitly telling Condor which files to transfer and when. Try adding the following to the submission file:

transfer_input_files = trapezregel.exe
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT

The above assumes that only the executable file needs to be sent, i.e. that the executable does not need anything else to run.
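
Merged into the submit file quoted further down in this thread, the result might look like the sketch below. It only combines the three suggested lines with the original file. The comments are added for illustration, and it has not been checked here whether the job needs further input files.

```
# Original submit file from this thread
universe = vanilla
executable = trapezregel.exe
output = trapez.out
error = trapez.err
log = trapez.log
requirements = OpSys == "WINNT50"

# Suggested additions: tell Condor explicitly which files to transfer and when
transfer_input_files = trapezregel.exe
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT

queue
```

The transfer settings are placed before "queue" because the commands have to appear ahead of the queue statement to apply to the job it submits.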

Regards,
Alexander Klyubin

On 07/31/2003 12:26 PM, Thomas Bauer wrote:
> Hi Alexander,
>
> my submission file is:
> --------------------------------------------------------------
> universe=vanilla
> executable=trapezregel.exe
> output=trapez.out
> error=trapez.err
> log=trapez.log
> requirements= OpSys=="WINNT50"
> queue
> ---------------------------------------------------------
>
> Thomas
>
> On Thu, 31 Jul 2003 11:29:48 +0100, Alexander Klyubin <A.Kljubin@xxxxxxxxxxx> wrote:
>
>> Hello Thomas,
>>
>> can you send your job submission file (the one you submit using condor_submit)?
>>
>>
>> Alexander Klyubin
>>
>> On 07/31/2003 11:23 AM, Thomas Bauer wrote:
>>
>>> Hi again,
>>>
>>> I checked the log files and found the following entries:
>>>
>>> ------------------ ShadowLog on the Condor-Server ------------------
>>> (...)
>>> 7/31 11:31:05 ******************************************************
>>> 7/31 11:31:05 ** condor_shadow (CONDOR_SHADOW) STARTING UP
>>> 7/31 11:31:05 ** $CondorVersion: 6.5.3 Jul 3 2003 $
>>> 7/31 11:31:05 ** $CondorPlatform: INTEL-WINNT40 $
>>> 7/31 11:31:05 ** PID = 2368
>>> 7/31 11:31:05 ******************************************************
>>> 7/31 11:31:05 Using config file: C:\Condor\condor_config
>>> 7/31 11:31:05 Using local config files: C:\Condor\condor_config.local
>>> 7/31 11:31:06 DaemonCore: Command Socket at <128.176.206.141:2317>
>>> 7/31 11:31:07 Initializing a VANILLA shadow
>>> 7/31 11:31:11 (7.0) (2368): Request to run on <128.176.206.149:1047> was ACCEPTED
>>> 7/31 11:31:17 (7.0) (2368): ReliSock: put_file: TransmitFile() failed, errno=10054
>>> 7/31 11:31:17 (7.0) (2368): ERROR "DoUpload: Failed to send file C:\Condor\spool\cluster7.ickpt.subproc0, exiting at 1371
>>> at line 1370 in file ..\src\condor_c++_util\file_transfer.C
>>> (...)
>>> -------------------------------------------------------------------
>>>
>>> There seems to be a problem sending the files to the client.
>>> The SchedLog says:
>>>
>>> ---------------------- SchedLog on Server ----------------------
>>> (...)
>>> 7/31 11:31:01 Started shadow for job 7.0 on "<128.176.206.149:1047>", (shadow pid = 2368)
>>> 7/31 11:31:02 Sent ad to central manager for tombauer@xxxxxxxxxxx muenster.de
>>> 7/31 11:31:19 DaemonCore: Command received via UDP from host <128.176.206.141:2330>
>>> 7/31 11:31:19 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
>>> 7/31 11:31:19 ERROR: Shadow exited with job exception code!
>>> (...)
>>> -------------------------------------------------------------------
>>>
>>> Greetings,
>>> Thomas Bauer
>>> PS: My setup is an Intel Windows2003 Server (Terminal Server) and 3 Intel Windows2000-Clients.
>>> When I submit a job, I log on to the Terminal Server, copy my files to its local hard disk, and start the job from there.
>>> Maybe there is a problem with Condor and Win2003 Server?
>>>
>>>
>>>
>>>
>>>
>>> On Thu, 31 Jul 2003 09:36:36 +0100, Alexander Klyubin <A.Kljubin@xxxxxxxxxxx> wrote:
>>>
>>>> I encountered similar problems when the job actually failed to start (due to a bug it contained). You may want to check the logs the job produces on the remote machine. You should also definitely check the StartLog and StarterLog of the remote machine.
>>>>
>>>> Moreover, if there is a firewall between the machines, or on the submission machine or a worker machine, this may also contribute to the problem.
>>>>
>>>> Regards,
>>>> Alexander Klyubin
>>>>
>>>> On 07/31/2003 09:31 AM, Thomas Bauer wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a WINNT-Server and some WINNT-Clients. When I submit a job, the job doesn't start on any machine.
>>>>> "condor_q -analyze" tells me "1 match, but prefer another specific job despite its worse user-priority".
>>>>> The log file of this job shows entries like these:
>>>>> -----------------------------------------------------------------
>>>>> 007 (023.000.000) 07/08 17:52:34 Shadow exception! Can no longer communicate with condor_starter on execute machine
>>>>> 0 - Run Bytes Sent By Job
>>>>> 528441 - Run Bytes Received By Job
>>>>> -----------------------------------------------------------------
>>>>>
>>>>> Does anybody know why these "Shadow exceptions" occur?
>>>>>
>>>>> Thomas
>>>>
>>>>
>>>>
>>>> Condor Support Information:
>>>> http://www.cs.wisc.edu/condor/condor-support/
>>>> To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
>>>> unsubscribe condor-users <your_email_address>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
--
==============================
Thomas Bauer
Universität Münster
Institut für Festkörpertheorie
Wilhelm-Klemm-Straße 10
48149 Münster
Germany
==============================