
Re: [Condor-users] Condor Version 6.9.0 X86_64 - GCB clients fail to start



Hi All,

I can't believe what I'm seeing here. The little MPI cluster on which I'm
experimenting with GCB consists of 10 identical boxes, all equipped with
onboard Realtek Ethernet cards. The 10 machines boot from the network and
are all supplied with the same image. Last week I surprised one of the 10
boxes with an Intel 10/100 card to do a little performance benchmarking.
It just so happens that this box DOES start Condor (configured to use GCB)
correctly. All the other boxes, with the Realtek cards, fail.

Can somebody please explain how the heck this is possible? I knew Realtek
was crap, but this bad? I mean, Condor without GCB works like a charm on
these boxes, so I find it hard to believe there's something physically
wrong with them. Is this a Condor issue or a driver issue? I'm lost...
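
For anyone who wants to rule out the name-resolution angle first (the log message quoted further down points at /etc/hosts), here is a minimal sketch in Python. It only reports what the node's hostname resolves to and whether that is loopback-only. The function names are made up for illustration, and this is an assumption about what might matter here, not a description of how GCB actually determines its IP.

```python
import ipaddress
import socket


def resolved_addresses(hostname):
    """Return the sorted, de-duplicated IPv4 addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def only_loopback(addresses):
    """True if the list is empty or every address is a loopback address."""
    return all(ipaddress.ip_address(a).is_loopback for a in addresses)


if __name__ == "__main__":
    name = socket.gethostname()
    try:
        addrs = resolved_addresses(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        print(f"{name}: lookup failed ({exc})")
    else:
        print(f"{name} -> {addrs}")
        if only_loopback(addrs):
            print("hostname resolves to loopback only; check /etc/hosts")
```

Running it on the Intel box and on one Realtek box would show whether the two see their own hostname differently, even though they boot the same image.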

In case this is a Condor thingy, I attached 2 MasterLog files: one from
the machine with the Intel card, which starts successfully, and one from
one of the Realtek machines.
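
Since every line in the two logs differs in timestamp and pid, a plain diff is noisy. Below is a rough sketch for comparing them, assuming Python is on hand and that the lines carry the prefix format seen in the D_ALL excerpt quoted below. The helper names are hypothetical.

```python
import difflib
import re
import sys

# Matches prefixes such as "12/8 18:29:12 (fd:7) (pid:4559) " and the
# shorter "12/7 22:11:45 " form that appears without D_ALL.
PREFIX = re.compile(r"^\d+/\d+ \d+:\d+:\d+ (?:\(fd:\d+\) )?(?:\(pid:\d+\) )?")


def strip_prefix(line):
    """Drop the timestamp/fd/pid prefix and the trailing newline."""
    return PREFIX.sub("", line.rstrip("\n"))


def diff_logs(good_lines, bad_lines):
    """Unified diff of two logs after removing the per-line prefixes."""
    good = [strip_prefix(line) for line in good_lines]
    bad = [strip_prefix(line) for line in bad_lines]
    return list(difflib.unified_diff(good, bad, "intel", "realtek", lineterm=""))


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as good_file, open(sys.argv[2]) as bad_file:
        for out in diff_logs(good_file.readlines(), bad_file.readlines()):
            print(out)
```

It takes the two log files as arguments, the working machine's log first. Lines that embed host-specific values, such as lock paths, may still show up as differences.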

Kind Regards,

Cor

>> Cor Cornelisse <ccorneli@xxxxxxxx> wrote:
>>> 12/7 22:11:45 GCB: GCB_bind: _myIP failed
>>
>> The most likely cause is that your machine (the one with the
>> master) doesn't have any active IP addresses beyond loopback
>> (127.0.0.1).  That seems plausible on your laptop if you tried to
>> start Condor before attaching to a network.
>>
>> That doesn't explain why you would see that error message on your
>> execute nodes, which presumably are working fine.  To take a wild
>> guess, are you starting Condor in your init scripts?  If so, is
>> Condor possibly higher priority than initializing the network?
>> Having Condor start before the network is up is a recipe for
>> problems.
>>
>> If that's not the case for your execute nodes, you might want to
>> double check that you're not seeing a different error.
>>
>> --
>> Alan De Smet                              Condor Project Research
>> adesmet@xxxxxxxxxxx                 http://www.condorproject.org/
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at either
>> https://lists.cs.wisc.edu/archive/condor-users/
>> http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
>>
>
> Hi,
>
> I'm sure networking is up before Condor. I do start the service through
> init scripts, but to test your hypothesis I simply restarted the Condor
> service, which resulted in the same error. So the network is definitely
> up and running. I set the MasterLog debug option to D_ALL, which gives a
> little more debug information, but still not enough for me to understand
> what's going wrong (it looks like it's trying to bind to 0.0.0.0 :s)
>
> Anyone?
>
> 12/8 18:29:12 (fd:3) (pid:4559) Using config source: /opt/condor/etc/condor_config
> 12/8 18:29:12 (fd:3) (pid:4559) Using local config sources:
> 12/8 18:29:12 (fd:3) (pid:4559)    /var/condor/condor_config.local
> 12/8 18:29:12 (fd:5) (pid:4559) Attempting to lock /tmp/condor-lock.portal0.998036533202143/InstanceLock.
> 12/8 18:29:12 (fd:6) (pid:4559) Obtained lock on /tmp/condor-lock.portal0.998036533202143/InstanceLock.
> 12/8 18:29:12 (fd:6) (pid:4559) Setting up command socket
> 12/8 18:29:12 (fd:6) (pid:4559) CONDOR_INHERIT: is NULL
> 12/8 18:29:12 (fd:7) (pid:4559) GCB: GCB_socket(fd = 6, TCP)
> 12/8 18:29:12 (fd:7) (pid:4559) PRIV_CONDOR --> PRIV_ROOT at sock.C:526
> 12/8 18:29:12 (fd:7) (pid:4559) GCB: GCB_bind(6[GCB_SOCKET], <0.0.0.0:0>)
> 12/8 18:29:12 (fd:7) (pid:4559) GCB: GCB_bind: _myIP failed
> 12/8 18:29:12 (fd:7) (pid:4559) PRIV_ROOT --> PRIV_CONDOR at sock.C:532
> 12/8 18:29:12 (fd:7) (pid:4559) bind failed errno = 0
> 12/8 18:29:12 (fd:7) (pid:4559) Failed to bind to command ReliSock
> 12/8 18:29:12 (fd:7) (pid:4559) (Make sure your IP address is correct in /etc/hosts.)
> 12/8 18:29:12 (fd:7) (pid:4559) ERROR "BindAnyCommandPort failed" at line 6808 in file daemon_core.C
>
>
> --
> A lie told often enough becomes the truth.
>
> Lenin (1870 - 1924)


-- 
A lie told often enough becomes the truth.

Lenin (1870 - 1924)

Attachment: realtek_eth_machine.log
Description: Binary data

Attachment: intel_eth_machine.log
Description: Binary data