
[condor-users] shadows keep dying problem



Hi,

This is starting to frustrate me now, so I'm hoping someone else will 
be able to help.  This problem has appeared from nowhere, and in 
addition no one else using the pool appears to be suffering from it!  

The symptoms are that my jobs are submitted and start to run.  A 
fraction of them complete OK, but the rest seem to lose contact and, 
after (usually) an hour, they get cleaned up and restarted, producing 
the SchedLog entries shown at the end of this message.

In addition, the ShadowLog is full of lines looking like:

11/27 09:34:54 (525.19) (2088): GlobalGroupMember: artsanybody

where the final 'artsanybody' is one of a selection of usernames from 
around the university.  We think part of the problem may be that the 
pool is quite busy, and so UDP packets may be getting lost.  Does 
anyone have any suggestions?  We're running Condor 6.4.7 on WinXP.
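
To gauge how much of the ShadowLog is taken up by these lines, a small
Python sketch like the one below could tally them per job. The line
layout in the regular expression is an assumption based only on the
single ShadowLog line quoted above, and `count_group_lookups` is a
hypothetical helper name, not part of Condor.

```python
import re
from collections import Counter

# Assumed layout, inferred from the one ShadowLog line quoted above:
#   "MM/DD HH:MM:SS (cluster.proc) (pid): message"
SHADOW_LINE = re.compile(
    r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) \((\d+\.\d+)\) \((\d+)\): (.*)$"
)


def count_group_lookups(lines):
    """Count GlobalGroupMember entries per job id (cluster.proc)."""
    counts = Counter()
    for line in lines:
        match = SHADOW_LINE.match(line.strip())
        if match and match.group(4).startswith("GlobalGroupMember:"):
            counts[match.group(2)] += 1
    return counts
```

Passing it the lines of a ShadowLog (for example from
`open(path).readlines()`, with `path` pointing at wherever the log
lives on the submit machine) gives a count per job, which makes it
easy to see whether the failing jobs are the ones producing the most
lookups.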

Thanks in advance,
Henry 

SchedLog:
========

11/27 09:22:22 DaemonCore: Command received via TCP from host <137.222.189.138:1701>
11/27 09:22:22 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
11/27 09:22:22 QMGR Connection closed
11/27 09:22:50 DaemonCore: Command received via TCP from host <137.222.189.138:1702>
11/27 09:22:50 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
11/27 09:22:50 QMGR Connection closed
11/27 09:22:57 ERROR: Child pid 2724 appears hung! Killing it hard.
11/27 09:22:57 DaemonCore: Command received via UDP from host <137.222.189.138:1704>
11/27 09:22:57 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
11/27 09:22:57 Shadow pid 2724 successfully killed because it was hung.
11/27 09:22:57 Shadow pid 2724 for job 525.23 exited with status 4
11/27 09:22:57 ERROR: Shadow exited with job exception code!
11/27 09:22:57 Match for cluster 525 has had 5 shadow exceptions, relinquishing.
11/27 09:22:57 Called send_vacate( <137.222.97.31:1037>, 443 )
11/27 09:22:57 Sent RELEASE_CLAIM to startd on <137.222.97.31:1037>
11/27 09:22:57 Match record (<137.222.97.31:1037>, 525, 23) deleted
11/27 09:22:57 Capability of deleted match: <137.222.97.31:1037>#2026321860
11/27 09:22:57 Entered delete_shadow_rec( 2724 )
11/27 09:22:57 Deleting shadow rec for PID 2724, job (525.23)
11/27 09:22:58 Entered check_zombie( 2724, 0x8a0ca4, st=2 )
11/27 09:22:58 Marked job 525.23 as IDLE
11/27 09:22:58 Exited check_zombie( 2724, 0x8a0ca4 )
11/27 09:22:58 Shadow does not have a match record, so did not remove it from the match
11/27 09:22:58 
11/27 09:22:58 ..................
11/27 09:22:58 .. Shadow Recs (10/10)
11/27 09:22:58 .. 2224, 525.25, F, <137.222.97.47:3224>, cur_hosts=1, status=2
11/27 09:22:58 .. 2516, 525.24, F, <137.222.97.36:1037>, cur_hosts=1, status=2
11/27 09:22:58 .. 1284, 525.15, F, <137.222.97.71:3835>, cur_hosts=1, status=2
11/27 09:22:58 .. 3656, 525.6, F, <137.222.97.84:1085>, cur_hosts=1, status=2
11/27 09:22:58 .. 1732, 525.11, F, <137.222.97.53:3491>, cur_hosts=1, status=2
11/27 09:22:58 .. 3204, 525.14, F, <137.222.97.22:1036>, cur_hosts=1, status=2
11/27 09:22:58 .. 3304, 527.7, F, <137.222.97.21:1032>, cur_hosts=1, status=2
11/27 09:22:58 .. 2480, 525.18, F, <137.222.97.91:1978>, cur_hosts=1, status=2
11/27 09:22:58 .. 2088, 525.19, F, <137.222.97.57:1323>, cur_hosts=1, status=2
11/27 09:22:58 .. 3496, 525.8, F, <137.222.97.101:4174>, cur_hosts=1, status=2
11/27 09:22:58 ..................
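
For anyone wanting to check their own SchedLog for the same pattern,
here is a minimal Python sketch that tallies shadow exits per job and
flags the ones that were killed as hung. It is not a Condor tool: the
regular expressions are assumptions based only on the message wording
in the excerpt above, and `join_wrapped` and `summarize` are
hypothetical helper names.

```python
import re
from collections import Counter

# Patterns assumed from the wording of the SchedLog excerpt above.
TIMESTAMP = re.compile(r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} ")
HUNG = re.compile(r"Child pid (\d+) appears hung")
EXITED = re.compile(
    r"Shadow pid (\d+) for job (\d+\.\d+) exited with status (\d+)"
)


def join_wrapped(lines):
    """Re-join entries that a mailer wrapped onto a continuation line."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if TIMESTAMP.match(line) or not entries:
            # A new log entry starts with a timestamp.
            entries.append(line)
        elif line.strip():
            # No timestamp: treat as the tail of the previous entry.
            entries[-1] = entries[-1].rstrip() + " " + line.strip()
    return entries


def summarize(lines):
    """Return (exits, hung_jobs).

    exits maps (job, status) to a count of shadow exits;
    hung_jobs maps job to a count of exits that followed a
    "appears hung" message for the same pid.
    """
    hung_pids = set()
    exits = Counter()
    hung_jobs = Counter()
    for entry in join_wrapped(lines):
        match = HUNG.search(entry)
        if match:
            hung_pids.add(match.group(1))
            continue
        match = EXITED.search(entry)
        if match:
            pid, job, status = match.groups()
            exits[(job, int(status))] += 1
            if pid in hung_pids:
                hung_jobs[job] += 1
    return exits, hung_jobs
```

Calling `summarize(open(path).readlines())` on a SchedLog (with `path`
set to wherever the log lives on the submit machine) returns the two
counters, so jobs whose shadows are repeatedly killed as hung stand
out from those that exit for other reasons.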

----------------------
Henry Knowles, Electrical & Electronic Engineering

