
Re: [Condor-users] Problems with Condor under Windows



Hi,

Thanks for the tip.
Since the client machines have VMware Player installed (for other
purposes; the machines are primarily for student use), each one has
three network interfaces: two virtual ones created by VMware Player
and the real one.
I'm using NETWORK_INTERFACE = 0.0.0.0 to bind to all interfaces, but I
have also tried NETWORK_INTERFACE = 192.x.x.x, with no effect.
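
On a machine with several interfaces, one quick check is to ask the
operating system which local address it would use to reach the pool
master. The Python sketch below does that with a UDP socket (connect()
on a UDP socket sends no packets). The function names are hypothetical,
"127.0.0.1" stands in for the master's address, and the port number is
only a placeholder. It shows how to find a candidate address; it is not
a confirmed fix for the problem in this thread.

```python
import socket


def outbound_ipv4(peer_host: str, peer_port: int = 9618) -> str:
    """Return the local IPv4 address the OS would route through to reach peer_host."""
    # A UDP connect() only consults the routing table to choose a source
    # address; nothing is transmitted to the peer.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((peer_host, peer_port))
        return s.getsockname()[0]


def config_line(peer_host: str) -> str:
    """Format the detected address as a NETWORK_INTERFACE setting."""
    return f"NETWORK_INTERFACE = {outbound_ipv4(peer_host)}"


if __name__ == "__main__":
    # Placeholder peer: replace with the address of the pool master.
    print(config_line("127.0.0.1"))
```

If the printed address belongs to one of the virtual interfaces rather
than the real one, that would at least point towards a routing or
binding issue.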

BTW, I also tried switching the server to a Windows XP SP2 machine
(instead of the Linux one) and the problems persist, so I guess the
problem comes from the client side.
Next, I'll try a clean install on two clean Windows machines (one as
the server, the other as the client) with no antivirus and no VMware
Player, just XP SP2.

Patricio.


On 7/2/07, Michael McClenahan <michael.mcclenahan@xxxxxxxxxxxxxxx> wrote:
Try forcing it to bind to the correct NIC. We had issues when we had
multiple NICs in a machine.

NETWORK_INTERFACE = 192.x.x.x

Might not be the problem but it's very quick to check.

Mike


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Patricio
Domingues
Sent: 30 June 2007 10:20
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Problems with Condor under Windows

Hi,

I'm managing a small Condor pool with a Linux master (Ubuntu 6.10,
Condor version 6.8.4) and several Windows machines (all XP Service
Pack 2) acting as execute nodes.

All jobs are submitted from the Condor master machine.

The problem is that the execute nodes are rarely able to complete
tasks: they seem to enter an error state in which the message
"SafeMsg: sending small msg failed. errno: 0" followed by "PidWatcher
thread couldn't notify main thread (exited_pid=896)" is logged every
second, as follows:

== Execute Node  / MasterLog ==
6/30 09:52:14 SafeMsg: sending small msg failed. errno: 0
6/30 09:52:14 PidWatcher thread couldn't notify main thread (exited_pid=896)
6/30 09:52:15 SafeMsg: sending small msg failed. errno: 0
6/30 09:52:15 PidWatcher thread couldn't notify main thread (exited_pid=896)

Identical messages are also logged in the "StarterLog" and "StartLog".

The only way to recover the execute node is to kill Condor's
processes (condor_master, condor_starter and the like), and most of
the time the situation then repeats itself.
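
Log excerpts pasted into mail tend to get re-wrapped, which makes
repeated entries hard to spot. The Python sketch below splits such text
back into one entry per timestamp and counts how often each message
occurs. It assumes every entry starts with an "M/D HH:MM:SS" stamp, as
in the excerpts in this message; the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: each log entry begins with a stamp such as "6/30 09:52:14 ".
STAMP = re.compile(r"\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} ")

# Excerpt from the execute node's MasterLog, wrapped as it arrived by mail.
SAMPLE = """\
6/30 09:52:14 SafeMsg: sending small msg failed. errno: 0 6/30 09:52:14
PidWatcher thread couldn't notify main thread (exited_pid=896) 6/30
09:52:15 SafeMsg: sending small msg failed. errno: 0 6/30 09:52:15
PidWatcher thread couldn't notify main thread (exited_pid=896)
"""


def split_entries(text):
    """Undo mail wrapping and return one string per timestamped entry."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    starts = [m.start() for m in STAMP.finditer(flat)]
    ends = starts[1:] + [len(flat)]
    return [flat[a:b].strip() for a, b in zip(starts, ends)]


def count_messages(entries):
    """Count entries by message text, ignoring the leading timestamp."""
    return Counter(STAMP.sub("", entry, count=1) for entry in entries)


if __name__ == "__main__":
    for message, n in count_messages(split_entries(SAMPLE)).most_common():
        print(n, message)
```

Run over a whole log file, the counts give a quick picture of which
messages dominate once a node has entered the error state.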

On the Condor server, there are error reports in the "SchedLog":
== master / SchedLog ===
6/30 03:24:30 (pid:4841) Response problem from startd on <192.168.132.1:1083> (match <192.168.132.1:1083>#1183154665#1).
6/30 03:24:30 (pid:4841) Sent RELEASE_CLAIM to startd on <192.168.132.1:1083>
6/30 03:24:30 (pid:4841) Match record (<192.168.132.1:1083>, 1820, 0) deleted
6/30 03:24:30 (pid:4841) DaemonCore: Command received via UDP from host <192.168.132.1:3667>
6/30 03:24:30 (pid:4841) DaemonCore: received command 60014 (DC_INVALIDATE_KEY), calling handler (handle_invalidate_key())
6/30 03:24:30 (pid:4841) condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.


And in the server's "ShadowLog":

== master / ShadowLog ===
6/29 23:19:45 (1797.0) (5158): condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
6/29 23:19:45 (1797.0) (5158): Can no longer talk to condor_starter <192.168.136.3:1080>
6/29 23:19:45 (1797.0) (5158): Trying to reconnect to disconnected job
6/29 23:19:45 (1797.0) (5158): LastJobLeaseRenewal: 1183155191 Fri Jun 29 23:13:11 2007
6/29 23:19:45 (1797.0) (5158): JobLeaseDuration: 1200 seconds
6/29 23:19:45 (1797.0) (5158): JobLeaseDuration remaining: 806
6/29 23:19:45 (1797.0) (5158): Attempting to locate disconnected starter
6/29 23:20:05 (1797.0) (5158): condor_read(): timeout reading 5 bytes from <192.168.136.3:1080>.
6/29 23:20:25 (1797.0) (5158): condor_read(): timeout reading 5 bytes from <192.168.136.3:1080>.
6/29 23:20:25 (1797.0) (5158): locateStarter(): Failed to read reply ClassAd
==================

Each Windows client runs antivirus software (one runs AVG, the others
run Panda) and the Windows firewall. However, the problem persists
even with both of them disabled.
The machines also have VMware Player installed (I mention it because
VMware Player installs two network interfaces).

The Condor version is 6.8.4 on the master, and I experimented with
6.8.4, 6.8.5 and 6.9.2 (the latest unstable release) on the clients,
all with the same results.

Any help is more than welcome!
Thanks,
P. Domingues.