Re: [Condor-users] SCHEDD dying on multiple process submission



Ben,

There are actually two separate crash emails that I get, the first one
being MUCH more prevalent than the second. I've attached them both at
the bottom of this message.

Thanks for your help with all of this.

John
 

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ben Burnett
Sent: Wednesday, May 23, 2007 4:51 PM
To: 'Condor-Users Mail List'
Subject: Re: [Condor-users] SCHEDD dying on multiple process submission

Hi John:

Could you please send a stack trace from one of these dead processes?
You should have received some in the admin e-mail account.

-B
  

---------------------------------------------------------------------------------------------------------

This is an automated email from the Condor system on machine
"PCG6548.Ceg.Corp.Net".  Do not reply.

"C:\condor/bin/condor_schedd.exe" on "PCG6548.Ceg.Corp.Net" died due to
exception ACCESS_VIOLATION.
Condor will automatically restart this process in 17 seconds.

*** Last 20 line(s) of file SchedLog:
5/23 20:00:25 (pid:5980) Failed to execute C:\condor/bin/condor_shadow.pvm.exe, ignoring
5/23 20:00:25 (pid:5980) my_popen: CreateProcess failed
5/23 20:00:25 (pid:5980) Failed to execute C:\condor/bin/condor_shadow.std.exe, ignoring
5/23 20:00:26 (pid:5980) 796.0: JobLeaseDuration remaining: 1136
5/23 20:00:26 (pid:5980) Sent ad to central manager for e14600@xxxxxxxxxxxx
5/23 20:00:26 (pid:5980) Sent ad to 1 collectors for e14600@xxxxxxxxxxxx
5/23 20:00:26 (pid:5980) Calling Timer handler 0 (dc_touch_log_file)
5/23 20:00:26 (pid:5980) Return from Timer handler 0 (dc_touch_log_file)
5/23 20:00:26 (pid:5980) Calling Timer handler 1 (check_session_cache)
5/23 20:00:26 (pid:5980) Return from Timer handler 1 (check_session_cache)
5/23 20:00:26 (pid:5980) Calling Timer handler 2 (handle_cookie_refresh)
5/23 20:00:26 (pid:5980) Return from Timer handler 2 (handle_cookie_refresh)
5/23 20:00:26 (pid:5980) Calling Handler <<10.100.116.51:9618>>
5/23 20:00:26 (pid:5980) Return from Handler <<10.100.116.51:9618>>
5/23 20:00:26 (pid:5980) Calling Timer handler 3 (self_monitor)
5/23 20:00:26 (pid:5980) Return from Timer handler 3 (self_monitor)
5/23 20:00:26 (pid:5980) Calling Timer handler 5 (DaemonCore::SendAliveToParent)
5/23 20:00:26 (pid:5980) Return from Timer handler 5 (DaemonCore::SendAliveToParent)
5/23 20:00:26 (pid:5980) Calling Timer handler 12 (checkReconnectQueue)
5/23 20:00:26 (pid:5980) ClassAd from query has no JobId, ignoring
*** End of file SchedLog

*** Last entry in core file core.SCHEDD.WIN32

=============================
Exception code: C0000005 ACCESS_VIOLATION Fault address:  0047104F
01:0007004F C:\condor\bin\condor_schedd.exe

Registers:
EAX:0012F7F8
EBX:00000002
ECX:00000000
EDX:00583E60
ESI:00000000
EDI:0012F99C
CS:EIP:001B:0047104F
SS:ESP:0023:0012F7DC  EBP:0012F7E8
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010246

Call stack:
Address   Frame
0047104F  0012F7E8  AttrList::Lookup+C
00471647  0012F80C  AttrList::EvalFloat+25
0041E66D  0012F840  Scheduler::AddMrec+1F9
00418A27  0012F8E8  Scheduler::makeReconnectRecords+22E
004187BB  0012F9B0  Scheduler::checkReconnectQueue+303
00499ADA  0012F9EC  TimerManager::Timeout+14C
00481237  0012FE30  DaemonCore::Driver+212
00489DC6  0012FF68  dc_main+AF9
00489ED5  0012FF80  main+CE
004C7796  00000001  mainCRTStartup+C5

*** End of file core.SCHEDD.WIN32


---------------------------------------------------------------------------------------------------------

This is an automated email from the Condor system on machine
"PCG6548.Ceg.Corp.Net".  Do not reply.

"C:\condor/bin/condor_schedd.exe" on "PCG6548.Ceg.Corp.Net" exited with
status 4.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file SchedLog:
5/23 19:59:32 (pid:2616) Shadow pid 5436 for job 797.21 exited with status 100
5/23 19:59:32 (pid:2616) match (<10.100.116.77:4362>#1179336651#53) out of jobs (cluster id 797); relinquishing
5/23 19:59:32 (pid:2616) Sent RELEASE_CLAIM to startd at <10.100.116.77:4362>
5/23 19:59:32 (pid:2616) Match record (<10.100.116.77:4362>, 797, -1) deleted
5/23 19:59:32 (pid:2616) Calling Timer handler 836 (SelfDrainingQueue::timerHandler[job_is_finished_queue])
5/23 19:59:32 (pid:2616) Return from Timer handler 836 (SelfDrainingQueue::timerHandler[job_is_finished_queue])
5/23 19:59:32 (pid:2616) DaemonCore: Command received via TCP from host <10.100.116.77:2773>
5/23 19:59:32 (pid:2616) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
5/23 19:59:32 (pid:2616) Calling HandleReq <vacate_service> (0)
5/23 19:59:32 (pid:2616) Got VACATE_SERVICE from <10.100.116.77:2773>
5/23 19:59:32 (pid:2616) Return from HandleReq <vacate_service>
5/23 19:59:32 (pid:2616) DaemonCore: Command received via UDP from host <10.100.116.51:3614>
5/23 19:59:32 (pid:2616) DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
5/23 19:59:32 (pid:2616) Calling HandleReq <handle_nop()> (0)
5/23 19:59:32 (pid:2616) Return from HandleReq <handle_nop()>
5/23 19:59:33 (pid:2616) scheduler universe job (789.0) pid 2704 exited with status 4
5/23 19:59:33 (pid:2616) (789.0) Problem parsing user policy for job: The UNKNOWN (never set) OnExitRemove expression '' evaluated to UNDEFINED.  Putting job on hold.
5/23 19:59:33 (pid:2616) Job 789.0 put on hold: The UNKNOWN (never set) OnExitRemove expression '' evaluated to UNDEFINED
5/23 19:59:37 (pid:2616) ERROR "Unexpected pending status for fake message delivery.
" at line 4238 in file ..\src\condor_daemon_core.V6\daemon_core.C
*** End of file SchedLog


