
RE: [Condor-users] Linux machines won't communicate with windows pool



Good problem description.  See comments below. 

> I have a pool of 5 machines, 3 running windows 2000, and two running
> suse 9.3 professional, one of which has two processors.
> The condor host is one of the windows machines. The windows machines
> work as expected and can submit and run jobs.
> All the machines can run condor_status and get the correct list of
> machines in the pool.
> All the machines can run condor_q and get their local queue.
> When any of the machines run condor_q -global, I get "-- Failed to
> fetch ads from: <***.***.***.***:1057> :
> alsp-condorslv.alspac.bris.ac.uk CEDAR:6001:Failed to connect to
> <***.***.***.***:1057>" for each of the Linux machines that have jobs
> in their local queues. This message is returned, even when condor_q
> -global is run on the Linux machine with queued jobs.
> The Linux machines can see the local queues of the windows machines
> with condor_q -global.
> Jobs submitted by windows machines, set to require a Linux execute
> machine do not run, and running condor_q -analyze gives "3 are
> rejected by your job's requirements, 3 match, match, but reject the
> job for unknown reasons".

What does the schedd log say regarding these jobs?

> Jobs submitted by Linux machines, set to require either a windows or
> Linux execute machine do not run, and running condor_q -analyze gives
> "3 are rejected by your job's requirements, 3 match, match, but reject
> the job for unknown reasons".

Either?  It's hard, though I suppose not impossible, to imagine an
executable that works on both Linux and Windows :-)  Anyway, what is
your Requirements expression (condor_q -l | grep Requirements), and
what does the schedd log say?

> hostallow_read and hostallow_write are set to "*" on all machines.

That's good.

> The log levels are set to the defaults on all the machines.

You'll want to make them more verbose.  The defaults are rarely enough
to diagnose thorny problems.  Please see 

http://docs.optena.com/display/CONDOR/How+To+Increase+Debugging+Messages


> The
> windows machines have "DaemonCore: Command received via TCP from host
> <***.***.***.***:****>" from the machine itself, and from the condor
> host for every command I give. The logs on the Linux machines do not
> have these messages.

I suppose this makes sense; it seems that your windows machines are
receiving and processing the commands, whereas the linux machines
aren't.

> Looking at the negotiator log, for the Linux machines, I get "Can't
> connect to <137.222.33.60:1028>:0, errno = 10060" 

10060 is the Winsock error code for "Connection timed out"
(WSAETIMEDOUT).

> and "getpeername
> failed so connect must have failed" and "SECMAN:2003:TCP connection to
> <137.222.33.60:1028> failed". A few times the condor host manages to
> match a job with the Linux machines, but it fails to start with the
> error "condor_read(): recv() returned -1, errno = 10054, assuming
> failure. Failed to get reply from schedd".

Huh.  10054 is the Winsock error code for "Connection reset by peer"
(WSAECONNRESET).  I guess that still doesn't tell us much.
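
One way to narrow this down is to try the same TCP connection by hand
from the condor host and see whether it connects, times out, or is
refused.  Below is a minimal Python sketch of such a probe.  The
function name probe_tcp and the 5 second timeout are my own choices for
illustration, not anything Condor provides, and the daemon ports in your
pool will differ from run to run.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try one TCP connection and classify the outcome as a short string."""
    try:
        # create_connection resolves the name and connects, honoring the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer at all within the timeout (compare Winsock 10060).
        return "timed out"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else: name resolution failure, unreachable network, ...
        return f"failed: {exc}"
```

For example, probe_tcp("137.222.33.60", 1028) run on the condor host
would repeat the connection the negotiator log complains about.  A
"timed out" result usually suggests that packets are being dropped
somewhere (a firewall is a common suspect), whereas "refused" suggests
the machine is reachable but no daemon is listening on that port.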

> Can anyone suggest anything that I can try to get the Linux machines
> cooperating with the windows ones?

Although I don't see much evidence for it, I'm going to toss this out
anyway: is your DNS setup configured well enough so that _all_ machines
can do both regular and reverse DNS lookups?  (That is, can all machines
translate names to IP addresses and vice versa for every machine in the
pool?)  Condor is sensitive to DNS issues.
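
If it helps, the forward and reverse lookups can be checked with a few
lines of Python run on each machine in turn.  This is only a sketch:
check_dns is a name I made up, and the comparison assumes you pass in
the same form of the name (short or fully qualified) that the reverse
lookup returns, so a "False" is a hint to look closer rather than proof
of a problem.

```python
import socket


def check_dns(name):
    """Return (forward_ip, reverse_name, consistent) for one host name.

    A lookup that fails is reported as None in the corresponding slot.
    """
    try:
        ip = socket.gethostbyname(name)           # name -> IP address
    except OSError:
        return None, None, False
    try:
        primary, aliases, _ = socket.gethostbyaddr(ip)   # IP address -> name
    except OSError:
        return ip, None, False
    # Host names are case-insensitive; accept the primary name or any alias.
    known = {n.lower() for n in [primary] + list(aliases)}
    return ip, primary, name.lower() in known
```

Calling check_dns("alsp-condorslv.alspac.bris.ac.uk") on every machine,
and likewise for each other host in the pool, should give the same IP
address everywhere and a reverse name that matches.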

Have you made other noteworthy config file changes?

Beyond that, I'd say increase your logging level, restart the pool for
good measure (condor_restart -all) and examine the logs for more hints.
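
Once the logs are more verbose they get long, so a small script to
tally the error codes can save some scrolling.  The sketch below assumes
the "errno = NNNNN" wording seen in the messages you quoted; the
function names are hypothetical, and the table only covers Winsock codes
(logs written on the Linux machines will carry POSIX errno values
instead, which this table does not describe).

```python
import re
from collections import Counter

# Winsock codes seen in this thread, plus "connection refused" for comparison.
WINSOCK_ERRORS = {
    10054: "WSAECONNRESET (connection reset by peer)",
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
}

# Matches fragments such as "errno = 10060" anywhere in a log line.
ERRNO_RE = re.compile(r"errno\s*=\s*(\d+)")


def summarize_errnos(log_text):
    """Count how often each errno value appears in the given log text."""
    return Counter(int(m.group(1)) for m in ERRNO_RE.finditer(log_text))


def describe(code):
    """Return a readable name for a Winsock code, if it is in the table."""
    return WINSOCK_ERRORS.get(code, "unknown code")
```

Feeding it the contents of the negotiator log from the condor host and
printing describe(code) next to each count gives a quick picture of
whether the failures are mostly timeouts or mostly resets.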

Mike Yoder
Principal Member of Technical Staff
Ask Mike: http://docs.optena.com
Direct  : +1.408.321.9000
Fax     : +1.408.321.9030
Mobile  : +1.408.497.7597
yoderm@xxxxxxxxxx

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
http://www.optena.com