
Re: [Condor-users] Standard Universe blues...



Hi Alain, Todd and all the Condor team,

I am enclosing below a similar error message received from the same machine; this time it includes more information about its memory status.

-Guy


This is an automated email from the Condor system
on machine "L002W021.pubclass.ad.bgu.ac.il".  Do not reply.

"C:\Condor/bin/condor_startd.exe" on "L002W021.pubclass.ad.bgu.ac.il" died due to exception STACK_OVERFLOW.

Condor will automatically restart this process in 17 seconds.

*** Last 20 line(s) of file StartLog:
1/1 12:33:02 ** C:\Condor\bin\condor_startd.exe
1/1 12:33:02 ** $CondorVersion: 6.6.10 Jun 22 2005 $
1/1 12:33:02 ** $CondorPlatform: INTEL-WINNT50 $
1/1 12:33:02 ** PID = 904
1/1 12:33:02 ******************************************************
1/1 12:33:02 Using config file: C:\Condor\condor_config
1/1 12:33:02 Using local config files: C:\Condor/condor_config.local
1/1 12:33:02 DaemonCore: Command Socket at <132.72.69.42:4054>
1/1 12:33:02 "C:\Condor/bin/condor_starter.exe -classad" did not produce any output, ignoring
1/1 12:33:02 "C:\Condor/bin/condor_starter.pvm -classad" did not produce any output, ignoring
1/1 12:33:02 "C:\Condor/bin/condor_starter.std -classad" did not produce any output, ignoring
1/1 12:33:02 New machine resource allocated
1/1 12:33:07 no loadavg samples this minute, maybe thread died???
1/1 12:33:07 About to run initial benchmarks.
1/1 12:33:13 Completed initial benchmarks.
1/1 12:33:13 State change: IS_OWNER is false
1/1 12:33:13 Changing state: Owner -> Unclaimed
1/1 12:33:13 new Packet failed. out of memory
1/1 12:33:13 ERROR "new Packet failed. out of memory" at line 625 in file ..\src\condor_io\SafeMsg.C

1/1 12:33:13 Deleting Cronmgr
*** End of file StartLog

*** Last entry in core file core.STARTD.WIN32

============================
Exception code: C00000FD STACK_OVERFLOW
Fault address:  004503A7 01:0004F3A7 C:\Condor\bin\condor_startd.exe

Registers:
EAX:0000BB54
EBX:00000000
ECX:0011DD74
EDX:008AF718
ESI:00804638
EDI:00000000
CS:EIP:001B:004503A7
SS:ESP:0023:00120D6C  EBP:00120D80
DS:0023  ES:0023  FS:0038  GS:0000
Flags:00010202

Call stack:
Address   Frame     Logical addr  Module
004503A7  00120D80  0001:0004F3A7 C:\Condor\bin\condor_startd.exe
0041790A  00120E14  0001:0001690A C:\Condor\bin\condor_startd.exe
00408D61  00120F94  0001:00007D61 C:\Condor\bin\condor_startd.exe
00408768  001211C8  0001:00007768 C:\Condor\bin\condor_startd.exe
0043976E  001211E8  0001:0003876E C:\Condor\bin\condor_startd.exe
00436553  00121204  0001:00035553 C:\Condor\bin\condor_startd.exe
0044265C  0012FDC0  0001:0004165C C:\Condor\bin\condor_startd.exe
0044535A  0012FDF8  0001:0004435A C:\Condor\bin\condor_startd.exe
0043D556  0012FE30  0001:0003C556 C:\Condor\bin\condor_startd.exe
00444C3E  00489698  0001:00043C3E C:\Condor\bin\condor_startd.exe
*** End of file core.STARTD.WIN32



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: tel-zur@xxxxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor





Todd Tannenbaum wrote:

On Tue, Jan 03, 2006 at 12:59:31PM +0000, Angel de Vicente wrote:
Thanks for the suggestion. I would love to know how to get the standard universe
without the shadow/IO abilities. Anyone?


In the v6.7.x series, you can put
  want_remote_io=false
into your submit file.  Does this accomplish what you want?
From the condor_submit man page (again, ver 6.7.x):

want_remote_io = <True | False>

This option controls how a file is opened and manipulated in a standard universe job. If this option is true, which is the default, then the condor_shadow makes all decisions about how each and every file should be opened by the executing job. This entails a network round trip (or more) from the job to the condor_shadow and back again for every single open() in addition to other needed information about the file.

If set to false, then when the job queries the condor_shadow for the first time about how to open a file, the condor_shadow will inform the job to automatically perform all of its file manipulation on the local file system on the execute machine and any file remapping will be ignored. This means that there must be a shared file system (such as NFS or AFS) between the execute machine and the submit machine and that ALL paths that the job could open on the execute machine must be valid.

The ability of the standard universe job to checkpoint, possibly to a checkpoint server, is not affected by this attribute. However, when the job resumes it will be expecting the same file system conditions that were present when the job checkpointed.

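As a rough illustration only, a submit description file using this option might look like the sketch below. The executable name my_job and the log file name are hypothetical, and the sketch assumes the program was already relinked with condor_compile and that the submit and execute machines share a file system, as the man page text requires:

  # Hypothetical submit file; file names are placeholders
  universe       = standard
  # Assumed to have been relinked with condor_compile
  executable     = my_job
  # Do file I/O on the execute machine instead of via the condor_shadow
  want_remote_io = false
  log            = my_job.log
  queue

Every path the job opens would then have to be valid on whichever execute machine it lands on.
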
regards,
Todd


_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users