
[Condor-users] Linux machines won't communicate with Windows pool



Hi,
I have a pool of 5 machines, 3 running Windows 2000, and 2 running
SUSE 9.3 Professional, one of which has two processors.
The Condor host is one of the Windows machines. The Windows machines
work as expected and can submit and run jobs.
All the machines can run condor_status and get the correct list of
machines in the pool.
All the machines can run condor_q and get their local queue.
When any of the machines runs condor_q -global, I get "-- Failed to
fetch ads from: <***.***.***.***:1057> :
alsp-condorslv.alspac.bris.ac.uk CEDAR:6001:Failed to connect to
<***.***.***.***:1057>" for each of the Linux machines that have jobs
in their local queues. This message is returned, even when condor_q
-global is run on the Linux machine with queued jobs.
The Linux machines can see the local queues of the Windows machines
with condor_q -global.
Jobs submitted by Windows machines, set to require a Linux execute
machine do not run, and running condor_q -analyze gives "3 are
rejected by your job's requirements, 3 match, match, but reject the
job for unknown reasons".
Jobs submitted by Linux machines, set to require either a Windows or
Linux execute machine do not run, and running condor_q -analyze gives
"3 are rejected by your job's requirements, 3 match, match, but reject
the job for unknown reasons".
hostallow_read and hostallow_write are set to "*" on all machines.
The log levels are set to the defaults on all the machines. The logs
on the Windows machines show "DaemonCore: Command received via TCP
from host <***.***.***.***:****>", both from the machine itself and
from the Condor host, for every command I give. The logs on the Linux
machines do not have these messages.
Looking at the negotiator log, for the Linux machines, I get "Can't
connect to <137.222.33.60:1028>:0, errno = 10060" and "getpeername
failed so connect must have failed" and "SECMAN:2003:TCP connection to
<137.222.33.60:1028> failed". A few times the condor host manages to
match a job with the Linux machines, but it fails to start with the
error "condor_read(): recv() returned -1, errno = 10054, assuming
failure. Failed to get reply from schedd".
Can anyone suggest anything that I can try to get the Linux machines
cooperating with the Windows ones?
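
In case it helps anyone reproduce the connection failures outside of
Condor, below is a minimal sketch of a plain TCP reachability check in
Python. It is not part of Condor; the function name check_tcp and the
WINSOCK_ERRORS table are my own illustrative names. The two codes in
the table are the standard Winsock meanings of the errno values quoted
above (10060 is a connect timeout, 10054 is a connection reset by the
peer). Which of these a blocked port actually produces will depend on
the network setup, so treat the result only as a hint.

```python
import socket

# Standard Winsock meanings of the two errno values seen in the logs.
WINSOCK_ERRORS = {
    10060: "WSAETIMEDOUT: connection attempt timed out",
    10054: "WSAECONNRESET: connection reset by peer",
}


def check_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection to host:port.

    Returns (True, "connected") on success, otherwise (False, reason).
    Hypothetical helper for illustration; it only tests reachability
    and does not speak any Condor protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No answer within the timeout (often a filtered port).
        return False, "timed out"
    except OSError as exc:
        # Refused, unreachable, name lookup failure, and so on.
        return False, f"{exc.__class__.__name__}: {exc}"
```

Run from the Condor host, something like check_tcp("137.222.33.60", 1028)
would show whether the address and port from the negotiator log can be
reached at all. Note that the port may differ after the daemons restart,
so it should be taken from a current log entry.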

Thanks,
Matthew Cattle