
[Condor-users] UDP Packet is missing frequently Collector unable to display it status.



Hi,

  Our Condor pool has 7 physical systems (execute nodes) and 2 central
managers (CMs). The CMs are on a different subnet from the execute nodes.
We have noticed that the status the Collector shows for some execute
machines is frequently incorrect.

I have copied some lines from the Start Log and the Collector Log below.

The execute node IP is 192.168.111.103.


--------------------------START LOG-----------------------------

10/30 11:54:55 Swap space: 1052216
10/30 11:54:55 29672988 kbytes available for "/vm/local.grid3/execute"
10/30 11:54:55 Looking up RESERVED_DISK parameter
10/30 11:54:55 Reserving 5120 kbytes for file system
10/30 11:54:55 Total execute space: 29667868
10/30 11:54:59 Trying to update collector <10.201.42.242:9618>
10/30 11:54:59 Attempting to send update via UDP to collector
scorpio.pesgrid.wipro.com <10.201.42.242:9618>
10/30 11:54:59 Trying to update collector <10.201.42.238:9618>
10/30 11:54:59 Attempting to send update via UDP to collector
grid8.pesgrid.wipro.com <10.201.42.238:9618>
10/30 11:54:59 Sent update to 2 collector(s)


In the CM config file, COLLECTOR_DEBUG = D_NETWORK.

For this particular machine, the fragment with
-- Fragmentation Header: last=0,seq=1,...
is the one that goes missing, so the machine status is not shown in
condor_status. When the Collector does receive
-- Fragmentation Header: last=0,seq=1,...
the status is shown properly.
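
One way to spot this kind of gap is to scan the Collector's D_NETWORK output for fragment sequence numbers that never arrive. The following is a minimal Python sketch, not part of Condor: the function name `missing_fragments` and the regular expressions are illustrative assumptions based only on the log format quoted in this mail. It groups fragments by sender address and port, which is a simplification (two messages sent from the same port would be merged).

```python
import re
from collections import defaultdict

# Patterns assumed from the log lines quoted in this mail.
RECV_RE = re.compile(r"RECV \d+ bytes at <[^>]+> from\s*<([^>]+)>")
FRAG_RE = re.compile(r"Fragmentation Header: last=(\d),seq=(\d+)")


def missing_fragments(log_text):
    """Return {sender: sorted missing seq numbers} for every sender
    whose final fragment (last=1) was seen in the log."""
    # RECV lines are wrapped in the mail; rejoin "from" with the address.
    text = re.sub(r"from\s*\n\s*<", "from <", log_text)

    seen = defaultdict(set)   # sender -> sequence numbers received
    last_seq = {}             # sender -> sequence number of the last fragment
    sender = None

    for line in text.splitlines():
        m = RECV_RE.search(line)
        if m:
            sender = m.group(1)
            continue
        m = FRAG_RE.search(line)
        if m and sender is not None:
            last, seq = int(m.group(1)), int(m.group(2))
            seen[sender].add(seq)
            if last:
                last_seq[sender] = seq

    # Any number in 0..last_seq that was never received is a lost fragment.
    return {s: sorted(set(range(last_seq[s] + 1)) - seen[s]) for s in last_seq}
```

Applied to the 11:54:59 fragments from 192.168.111.103:9607 in the Collector log, this should report sequence number 1 as the missing fragment.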

-------------COLLECTOR LOG--------------------------------------

10/30 11:54:59 RECV 1000 bytes at <10.201.42.238:9618> from
<192.168.111.103:9607>
10/30 11:54:59 Fragmentation Header: last=0,seq=0,len=975,data=[25]
10/30 11:54:59 	Frag [975 bytes]
10/30 11:54:59 found timed out msg: cur=1225347899, msg=1225347868
10/30 11:54:59 Deleting timeouted message:
10/30 11:54:59 ========================
ID: 107.111.168.192, 32067, 1225168961, 53261
len:975, lastNo:0, rcved:1, lastTime:1225347868
===================
10/30 11:54:59 found timed out msg: cur=1225347899, msg=1225347872
10/30 11:54:59 Deleting timeouted message:
10/30 11:54:59 ========================
ID: 105.111.168.192, 15299, 1224835749, 23392
len:975, lastNo:0, rcved:1, lastTime:1225347872
===================
10/30 11:54:59 RECV 1000 bytes at <10.201.42.238:9618> from
<192.168.111.103:9607>
10/30 11:54:59 Fragmentation Header: last=0,seq=2,len=975,data=[25]
10/30 11:54:59 	Frag [975 bytes]
10/30 11:54:59 RECV 1000 bytes at <10.201.42.238:9618> from
<192.168.111.103:9607>
10/30 11:54:59 Fragmentation Header: last=0,seq=3,len=975,data=[25]
10/30 11:54:59 	Frag [975 bytes]
10/30 11:54:59 RECV 381 bytes at <10.201.42.238:9618> from
<192.168.111.103:9607>
10/30 11:54:59 Fragmentation Header: last=1,seq=4,len=356,data=[25]
10/30 11:54:59 	Frag [356 bytes]
.
.
.
.
10/30 11:55:06 found timed out msg: cur=1225347906, msg=1225347892
10/30 11:55:06 ========================
ID: 103.111.168.192, 9064, 1225345367, 3981
len:975, lastNo:0, rcved:1, lastTime:1225347892
===================
.
.
.
10/30 11:55:52 RECV 1000 bytes at <10.201.42.238:9618> from
<192.168.111.103:9608>
10/30 11:55:52 Fragmentation Header: last=0,seq=0,len=975,data=[25]
10/30 11:55:52  Frag [975 bytes]
10/30 11:55:55 RECV 1000 bytes at <10.201.42.238:9618> from
<192.168.111.104:9617>
10/30 11:55:55 Fragmentation Header: last=0,seq=0,len=975,data=[25]
10/30 11:55:55  Frag [975 bytes]
10/30 11:55:55 found timed out msg: cur=1225347955, msg=1225347932
10/30 11:55:55 Deleting timeouted message:
10/30 11:55:55 ========================
ID: 105.111.168.192, 15299, 1224835749, 23396
len:975, lastNo:0, rcved:1, lastTime:1225347932

===================
10/30 11:55:59 RECV 1000 bytes at <10.201.42.238:9618> from
<192.168.111.103:9608>
10/30 11:55:59 Fragmentation Header: last=0,seq=0,len=975,data=[25]
10/30 11:55:59  Frag [975 bytes]
10/30 11:55:59 found timed out msg: cur=1225347959, msg=1225347928
10/30 11:55:59 Deleting timeouted message:
10/30 11:55:59 ========================
ID: 107.111.168.192, 32067, 1225168961, 53262
len:975, lastNo:0, rcved:1, lastTime:1225347928

===================
10/30 11:55:59 RECV 1000 bytes at <10.201.42.238:9618> from
<192.168.111.103:9608>
10/30 11:55:59 Fragmentation Header: last=0,seq=1,len=975,data=[25]
10/30 11:55:59  Frag [975 bytes]
10/30 11:55:59 RECV 1000 bytes at <10.201.42.238:9618> from
<192.168.111.103:9608>
10/30 11:55:59 Fragmentation Header: last=0,seq=2,len=975,data=[25]
10/30 11:55:59  Frag [975 bytes]
10/30 11:55:59 RECV 1000 bytes at <10.201.42.238:9618> from
<192.168.111.103:9608>
10/30 11:55:59 Fragmentation Header: last=0,seq=3,len=975,data=[25]
10/30 11:55:59  Frag [975 bytes]
10/30 11:55:59 RECV 383 bytes at <10.201.42.238:9618> from
<192.168.111.103:9608>
10/30 11:55:59 Fragmentation Header: last=1,seq=4,len=358,data=[25]
10/30 11:55:59  Frag [358 bytes]
10/30 11:55:59 long msg ready: 4258 bytes
10/30 11:55:59 4 bytes read from UDP[size=4258, passed=4]
10/30 11:55:59 4 bytes read from UDP[size=4258, passed=8]
10/30 11:55:59 4 bytes read from UDP[size=4258, passed=12]
10/30 11:55:59 4 bytes read from UDP[size=4258, passed=16]
10/30 11:55:59 4 bytes read from UDP[size=4258, passed=567]
10/30 11:55:59 4 bytes read from UDP[size=4258, passed=571]
10/30 11:55:59 SafeMsg::_longMsg::getPtr:
found delim = ^@ & length = 55
10/30 11:55:59 55 bytes read from UDP[size=4258, passed=980]
10/30 11:55:59 SafeMsg::_longMsg::getPtr:
found delim = ^@ & length = 178
10/30 11:55:59 178 bytes read from UDP[size=4258, passed=2054]
10/30 11:55:59 SafeMsg::_longMsg::getPtr:
found delim = ^@ & length = 37
10/30 11:55:59 37 bytes read from UDP[size=4258, passed=2952]
10/30 11:55:59 StartdAd     : Inserting ** "< grid3.pesgrid.wipro.com ,
192.168.111.103 >"
10/30 11:55:59 4 bytes read from UDP[size=4258, passed=3060]
10/30 11:55:59 4 bytes read from UDP[size=4258, passed=3064]
10/30 11:55:59 SafeMsg::_longMsg::getPtr:
found delim = ^@ & length = 21
10/30 11:55:59 21 bytes read from UDP[size=4258, passed=3906]
10/30 11:55:59 SafeMsg::_longMsg::getPtr:
found delim = ^@ & length = 4
10/30 11:55:59 4 bytes read from UDP[size=4258, passed=4258]
10/30 11:55:59 StartdPvtAd  : Inserting ** "< grid3.pesgrid.wipro.com ,
192.168.111.103 >"
10/30 11:56:02 RECV 1000 bytes at <10.201.42.238:9618> from
<192.168.111.105:9603>
10/30 11:56:02 Fragmentation Header: last=0,seq=0,len=975,data=[25]
10/30 11:56:02  Frag [975 bytes]
10/30 11:56:02 found timed out msg: cur=1225347962, msg=1225347839
10/30 11:56:02 Deleting timeouted message:
10/30 11:56:02 ========================
ID: 103.111.168.192, 9065, 1225345367, 6645
len:3283, lastNo:4, rcved:4, lastTime:1225347839

===================
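
The "timed out" records above can be read a little more easily with a small helper. This is a hedged Python sketch, not a Condor tool: the helper names `message_age` and `sender_ip` are hypothetical, and it assumes that the first field of the `ID:` line is the sender's IPv4 address printed in reversed byte order. That assumption is inferred only from the 192.168.111.x addresses seen elsewhere in this log, not from Condor documentation.

```python
import re

# Patterns assumed from the log lines quoted in this mail.
TIMEOUT_RE = re.compile(r"found timed out msg: cur=(\d+), msg=(\d+)")
ID_RE = re.compile(r"ID: (\d+\.\d+\.\d+\.\d+),")


def message_age(line):
    """Seconds between the message's timestamp and the current time
    recorded on a 'found timed out msg' line, or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)) - int(m.group(2))


def sender_ip(id_line):
    """Reverse the dotted quad on an 'ID:' line (assumed to be the
    sender address in reversed byte order), or None if no match."""
    m = ID_RE.search(id_line)
    if m is None:
        return None
    return ".".join(reversed(m.group(1).split(".")))
```

For example, `sender_ip("ID: 103.111.168.192, 9064, 1225345367, 3981")` gives `192.168.111.103`, the execute node in question, and `message_age` on the first timed-out line at 11:54:59 gives 31 seconds.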


We are facing problems because, after a restart, the machine is not
displayed correctly in the pool status. If that machine gets matched and
executes a job, the job status shows the job as running, but the machine
status shows no claimed or busy machine.

Is this due to the different subnets or some other network issue?

We are using Condor:
$CondorVersion: 7.0.3 Jun 20 2008 BuildID: 91405 $
$CondorPlatform: I386-LINUX_RHEL5 $.

by
Johnson





