
Re: [Condor-users] Connection problem



Matthieu,
How are the machines configured? As I understand it you have:
Linux = Running the condor_schedd
VMWare image of Windows = Running the condor_startd for executing jobs.
Perhaps check that the central manager is configured/running for this
pool. Can you share your configuration files for the two machines? My
guess here is that the negotiator and collector aren't running or
available for some reason, which would explain why you're getting:
11/24 12:26:39 (pid:8775) DaemonCore: received command 421 (RESCHEDULE),
calling handler (reschedule_negotiator)
11/24 12:26:39 (pid:8775) attempt to connect to <xx.xx.xx.xx:9618>
failed: Invalid argument (connect errno = 22).  Will keep trying for 20
total seconds (20 to go).
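If the schedd log is long, a small script can tally these failures by address. The following is only a rough Python sketch: the function name and the regular expression are my own, written against the wording of the lines above rather than any documented log format, so treat both as assumptions.

```python
import re
from collections import Counter

# Pattern guessed from the excerpt above; not an official Condor log format.
FAILED_CONNECT = re.compile(
    r"attempt to connect to <([^>]+)> failed: (.+?) \(connect errno = (\d+)\)"
)


def summarize_connect_failures(log_text):
    """Count failed connection attempts per (address, errno, reason)."""
    # Mail clients and pagers wrap long log lines, so collapse all
    # whitespace (including newlines) before matching.
    flat = " ".join(log_text.split())
    return Counter(
        (addr, int(code), reason)
        for addr, reason, code in FAILED_CONNECT.findall(flat)
    )
```

Feeding it the contents of the schedd log shows whether every failure points at the same address and errno, or whether several targets are involved.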
The reschedule command is sent after you submit to a schedd; the
schedd then tries to connect to the negotiator to tell it that jobs
are available to run. Seeing the configuration files would help,
as would the output of the following command, run from the Linux box
that runs the schedd:
condor_config_val -n <Hostname of central manager> DAEMON_LIST
I'm happy to help further if you need it.
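
To double-check reachability from the Linux box itself, here is a minimal Python sketch that makes one TCP connection to the collector port (9618, as in your log) and reports the error if it fails. The function name and defaults are mine. It only tests a plain TCP connect, much like telnet, and says nothing about whether the Condor daemons behind the port are healthy.

```python
import errno
import socket


def check_tcp(host, port=9618, timeout=5.0):
    """Try one TCP connection; return (True, "connected") or (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Name lookup failures and timeouts are also OSError subclasses;
        # for those, exc.errno may be None or a resolver-specific code.
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"{type(exc).__name__}: {exc} (errno {exc.errno}, {name})"
```

For example, print(check_tcp("<Hostname of central manager>")) run on the submit machine. If this succeeds while the schedd still logs errno 22, the address the schedd is using probably differs from the one you tested by hand.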

Warm regards,
Jason

--

===================================
Jason A. Stowe

Phone: 607.227.9686
jstowe@xxxxxxxxxxxxxxxxxx

Cycle Computing, LLC
http://www.cyclecomputing.com


On 11/24/06, Cargnelli, Matthieu <Matthieu.Cargnelli@xxxxxxxx> wrote:
Hi,

I'm trying to get a Windows Condor master/worker working with a Linux
box (which only submits jobs). The Windows box is actually running inside
VMWare.

The Linux box is running a schedd, and I'm able to run a condor_submit
command, which puts the job in the local queue, but the job doesn't
start. condor_status fails too, indicating that the collector doesn't
respond. Here is part of the schedd log:


11/24 12:25:01 (pid:8775) attempt to connect to <xx.xx.xx.xx:9618>
failed: Invalid argument (connect errno = 22).  Will keep trying for 20
total seconds (20 to go).

11/24 12:25:21 (pid:8775) attempt to connect to <xx.xx.xx.xx:9618>
failed: Invalid argument (connect errno = 22).
11/24 12:25:21 (pid:8775) ERROR: SECMAN:2003:TCP connection to
<xx.xx.xx.xx:9618> failed

11/24 12:25:21 (pid:8775) Failed to start non-blocking update to
<xx.xx.xx.xx:9618>.
11/24 12:26:00 (pid:8775) get_file: Zero-length file check failed!
11/24 12:26:00 (pid:8775) Failed to receive file from client in
SendSpoolFile.
11/24 12:26:39 (pid:8775) DaemonCore: Command received via UDP from host
<127.0.1.1:32788>
11/24 12:26:39 (pid:8775) DaemonCore: received command 421 (RESCHEDULE),
calling handler (reschedule_negotiator)
11/24 12:26:39 (pid:8775) attempt to connect to <xx.xx.xx.xx:9618>
failed: Invalid argument (connect errno = 22).  Will keep trying for 20
total seconds (20 to go).

The Linux box can ping the Windows box (by its full hostname), and the
Windows firewall is open for port 9618; I can connect to this
port via telnet.
I don't know what is wrong. Does anyone have a clue about this?

--
Matthieu Cargnelli
EADS CCR - Centre de Toulouse
Centreda 1
4, Avenue Didier Daurat
31700 BLAGNAC
Tel: (+33 5) 67.19.61.73



