
[Condor-users] Problems with Condor under Windows



Hi,

I'm managing a small Condor pool with a Linux master (Ubuntu 6.10,
Condor version 6.8.4) and several Windows machines (all XP Service
Pack 2) acting as execution nodes.

All jobs are submitted on the Condor master machine.

The problem is that the execute nodes are rarely able to complete
tasks: they seem to enter an error state in which the message
"SafeMsg: sending small msg failed. errno: 0", followed by "PidWatcher
thread couldn't notify main thread (exited_pid=896)", is logged every
second, as follows:

== Execute Node  / MasterLog ==
6/30 09:52:14 SafeMsg: sending small msg failed. errno: 0
6/30 09:52:14 PidWatcher thread couldn't notify main thread (exited_pid=896)
6/30 09:52:15 SafeMsg: sending small msg failed. errno: 0
6/30 09:52:15 PidWatcher thread couldn't notify main thread (exited_pid=896)

Identical messages are also logged in the "StarterLog" and "StartLog".

The only way to recover the execute node is to kill the Condor
processes (condor_master, condor_starter and the like), and most of
the time the problem then recurs.

On the Condor master, errors are reported in the "SchedLog":
== master / SchedLog ===
6/30 03:24:30 (pid:4841) Response problem from startd on
<192.168.132.1:1083> (match <192.168.132.1:1083>#1183154665#1).
6/30 03:24:30 (pid:4841) Sent RELEASE_CLAIM to startd on <192.168.132.1:1083>
6/30 03:24:30 (pid:4841) Match record (<192.168.132.1:1083>, 1820, 0) deleted
6/30 03:24:30 (pid:4841) DaemonCore: Command received via UDP from
host <192.168.132.1:3667>
6/30 03:24:30 (pid:4841) DaemonCore: received command 60014
(DC_INVALIDATE_KEY), calling handler (handle_invalidate_key())
6/30 03:24:30 (pid:4841) condor_read(): recv() returned -1, errno =
104, assuming failure reading 5 bytes from unknown source.


And in the master's "ShadowLog":

== master / ShadowLog ===
6/29 23:19:45 (1797.0) (5158): condor_read(): recv() returned -1,
errno = 104, assuming failure reading 5 bytes from unknown source.
6/29 23:19:45 (1797.0) (5158): Can no longer talk to condor_starter
<192.168.136.3:1080>
6/29 23:19:45 (1797.0) (5158): Trying to reconnect to disconnected job
6/29 23:19:45 (1797.0) (5158): LastJobLeaseRenewal: 1183155191 Fri Jun
29 23:13:11 2007
6/29 23:19:45 (1797.0) (5158): JobLeaseDuration: 1200 seconds
6/29 23:19:45 (1797.0) (5158): JobLeaseDuration remaining: 806
6/29 23:19:45 (1797.0) (5158): Attempting to locate disconnected starter
6/29 23:20:05 (1797.0) (5158): condor_read(): timeout reading 5 bytes
from <192.168.136.3:1080>.
6/29 23:20:25 (1797.0) (5158): condor_read(): timeout reading 5 bytes
from <192.168.136.3:1080>.
6/29 23:20:25 (1797.0) (5158): locateStarter(): Failed to read reply ClassAd
==================
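
As a side check, the lease figures in the ShadowLog excerpt are
consistent with each other: the remaining lease is simply
JobLeaseDuration minus the time elapsed since LastJobLeaseRenewal.
The following is a minimal Python sketch of that arithmetic only. The
function name lease_remaining is hypothetical and not part of Condor,
and the timestamps are copied by hand from the log lines, assuming
both are in the same local time zone.

```python
from datetime import datetime


def lease_remaining(log_time, last_renewal, lease_duration_s):
    """Seconds of job lease left at log_time (hypothetical helper)."""
    elapsed = int((log_time - last_renewal).total_seconds())
    return lease_duration_s - elapsed


# Values copied from the ShadowLog excerpt (local time, year 2007).
log_time = datetime(2007, 6, 29, 23, 19, 45)      # time of the log entry
last_renewal = datetime(2007, 6, 29, 23, 13, 11)  # LastJobLeaseRenewal
remaining = lease_remaining(log_time, last_renewal, 1200)
print(remaining)
```

The result can be compared with the "JobLeaseDuration remaining" line
in the log.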

Each Windows client runs antivirus software (one runs AVG, the others
run Panda) and the Windows firewall. However, the problem persists
even with both of them disabled.
The machines also have VMware Player installed (I mention it because
VMware Player installs two network interfaces).

The Condor version is 6.8.4 on the master, and I have tried 6.8.4,
6.8.5 and 6.9.2 (the latest unstable release) on the clients, all
with the same results.

Any help is more than welcome!
Thanks,
P. Domingues.