
A newbie problem with InfiniBand.


Hi all =)

 

I am a bit new to the forum, but I have been reading it for quite some time and the posts are very helpful. Thanks!

 

So I decided it was worth hopping on the InfiniBand wagon (it is clear why: the speed is awesome, and the price/performance has no match). BUT...

 

I have run into some problems setting up the infiniband fabric.

Some information about my setup: an HP c7000 enclosure with 4 x ProLiant BL685c Gen1 blades, each with an HP 4x DDR dual-port mezzanine HCA. I also have 2 x HP 4x DDR IB Switch Modules (each with 16 downlink ports and 8 physical CX4 interfaces).

I am running VMware ESXi 5.1.0

 

~ # esxcli system version get

   Product: VMware ESXi

   Version: 5.1.0

   Build: Releasebuild-799733

   Update: 0

 

So far so good. I have installed the needed drivers:

 

 

* Mellanox ESXi 5.0 driver ( esxcli software vib install -d /tmp/drivers/mlx4_en-mlnx-1.6.1.2-offline_bundle-471530.zip --no-sig-check )

* Mellanox OFED driver ( esxcli software vib install -d  /tmp/drivers/MLNX-OFED-ESX-1.8.1.0.zip --no-sig-check )

 

 

# esxcli software vib list | grep Mellanox

net-ib-cm                      1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

net-ib-core                    1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

net-ib-ipoib                   1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

net-ib-mad                     1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

net-ib-sa                      1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

net-ib-umad                    1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

net-memtrack                   2013.0131.1850-1OEM.500.0.0.472560    Mellanox         PartnerSupported  2014-03-18

net-mlx4-core                  1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

net-mlx4-en                    1.6.1.2-1OEM.500.0.0.406165           Mellanox         VMwareCertified   2014-03-18

net-mlx4-ib                    1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

scsi-ib-srp                    1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18
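
To double-check that the kernel modules actually loaded after installing the bundles (and rebooting), something like this should work; the grep patterns are just my guess at the relevant module names:

~ # esxcli system module list | grep mlx4

~ # vmkload_mod -l | grep -E "mlx4|ib_"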


 

After that I installed OpenSM ( esxcli software vib install -v /tmp/drivers/ib-opensm-3.3.15.x86_64.vib --no-sig-check ).

 

~ # esxcli software vib list | grep open

ib-opensm                      3.3.15                                Intel            VMwareAccepted    2014-03-18
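
To confirm the subnet manager side is actually alive, the only quick check I know of is to look for the opensm processes from the shell (my assumption is that this vib starts one instance per HCA port):

~ # ps | grep opensm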


I also configured OpenSM per adapter with a partitions.conf file (Default=0x7fff,ipoib,mtu=5:ALL=full;), putting this file in the /scratch/opensm/adapter_1_hca/ and /scratch/opensm/adapter_2_hca/ directories (on my hosts these show up as the GUID-named directories in the listing below; the commands I used are sketched after the listing):

 

/vmfs/volumes/530dc445-b2c469b5-adf0-0019bb3b460e/.locker/opensm # ls -la

drwxr-xr-x    1 root     root           560 Feb 28 09:59 .

drwxr-xr-x    1 root     root           980 Feb 28 09:59 ..

drwxr-xr-x    1 root     root           420 Mar 18 12:31 0x00237dffff94d87d

drwxr-xr-x    1 root     root           420 Mar 18 12:31 0x00237dffff94d87e

 

 

/vmfs/volumes/530dc445-b2c469b5-adf0-0019bb3b460e/.locker/opensm/0x00237dffff94d87d # cat partitions.conf

Default=0x7fff,ipoib,mtu=5:ALL=full;
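
For reference, this is roughly how I created those files (the GUID directory names are the ones from the listing above and are specific to my HCA ports, so they need adjusting per host):

~ # mkdir -p /scratch/opensm/0x00237dffff94d87d /scratch/opensm/0x00237dffff94d87e

~ # echo "Default=0x7fff,ipoib,mtu=5:ALL=full;" > /scratch/opensm/0x00237dffff94d87d/partitions.conf

~ # echo "Default=0x7fff,ipoib,mtu=5:ALL=full;" > /scratch/opensm/0x00237dffff94d87e/partitions.conf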

 

I have been following these two tutorials:

http://www.vladan.fr/homelab-storage-network-speedup/

http://www.bussink.ch/?p=1183

 

Now I can see the adapters:

 

~ # esxcli network nic list | grep Mellanox

vmnic_ib0  0000:047:00.0  ib_ipoib  Up    20000  Full    00:23:7d:94:d8:7d  1500  Mellanox Technologies MT25418 [ConnectX VPI - 10GigE / IB DDR, PCIe 2.0 2.5GT/s]

vmnic_ib1  0000:047:00.0  ib_ipoib  Up    20000  Full    00:23:7d:94:d8:7e  1500  Mellanox Technologies MT25418 [ConnectX VPI - 10GigE / IB DDR, PCIe 2.0 2.5GT/s]

 

Also, when I run ./ibstat I get this:

 

/opt/opensm/bin # ./ibstat

CA 'mlx4_0'

        CA type: MT25418

        Number of ports: 2

        Firmware version: 2.7.0

        Hardware version: a0

        Node GUID: 0x00237dffff94d87c

        System image GUID: 0x00237dffff94d87f

        Port 1:

                State: Active

                Physical state: LinkUp

                Rate: 20

                Base lid: 1

                LMC: 0

                SM lid: 6

                Capability mask: 0x0251086a

                Port GUID: 0x00237dffff94d87d

                Link layer: InfiniBand

        Port 2:

                State: Active

                Physical state: LinkUp

                Rate: 20

                Base lid: 5

                LMC: 0

                SM lid: 6

                Capability mask: 0x0251086a

                Port GUID: 0x00237dffff94d87e

                Link layer: InfiniBand

 

So everything seems to be working, except it is not.

When I try to ping from one host to the other I get this:

 

/opt/opensm/bin # ./ibping -S -dd

ibwarn: [15174] umad_init: umad_init

ibwarn: [15174] umad_open_port: ca (null) port 0

ibwarn: [15174] umad_get_cas_names: max 32

ibwarn: [15174] umad_get_cas_names: return 1 cas

ibwarn: [15174] resolve_ca_name: checking ca 'mlx4_0'

ibwarn: [15174] resolve_ca_port: checking ca 'mlx4_0'

ibwarn: [15174] umad_get_ca: ca_name mlx4_0

ibwarn: [15174] umad_get_ca: opened mlx4_0

ibwarn: [15174] resolve_ca_port: checking port 0

ibwarn: [15174] resolve_ca_port: checking port 1

ibwarn: [15174] resolve_ca_port: found active port 1

ibwarn: [15174] resolve_ca_name: found ca mlx4_0 with port 1 type 1

ibwarn: [15174] resolve_ca_name: found ca mlx4_0 with active port 1

ibwarn: [15174] umad_open_port: opening mlx4_0 port 1

ibwarn: [15174] dev_to_umad_id: mapped mlx4_0 1 to 0

ibwarn: [15174] umad_open_port: opened /dev/umad0 fd 3 portid 0

ibwarn: [15174] umad_register: fd 3 mgmt_class 3 mgmt_version 2 rmpp_version 1 method_mask (nil)

ibwarn: [15174] umad_register: fd 3 registered to use agent 0 qp 1

ibwarn: [15174] umad_register_oui: fd 3 mgmt_class 50 rmpp_version 0 oui 0x0145 method_mask 0xffd0cca0

ibwarn: [15174] umad_register_oui: fd 3 registered to use agent 1 qp 1 class 0x32 oui 0xffd0cc90

ibdebug: [15174] ibping_serv: starting to serve...

ibwarn: [15174] umad_recv: fd 3 umad 0x80579c0 timeout 4294967295

ibwarn: [15174] umad_recv: read returned 4294967232 > sizeof umad 64 + length 256 (Resource temporarily unavailable)

ibwarn: [15174] mad_receive_via: recv failed: Resource temporarily unavailable

ibdebug: [15174] ibping_serv: server out

 

For some reason I always get the "Resource temporarily unavailable" message. When I try ./ibping -L with the right LID or ./ibping -G with the right GUID, I always get this (a few extra checks I still plan to run are sketched after the output):

 

/opt/opensm/bin # ./ibping -G 0x001b78ffff34b9c6

ibwarn: [15237] _do_madrpc: recv failed: Resource temporarily unavailable

ibwarn: [15237] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 6)

ibwarn: [15237] ib_path_query_via: sa call path_query failed

./ibping: iberror: failed: can't resolve destination port 0x001b78ffff34b9c6
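
In case it helps, these are the extra checks I plan to run from /opt/opensm/bin, assuming the usual InfiniBand diagnostic tools are included in this bundle (so far I have only confirmed that ibstat and ibping are there):

/opt/opensm/bin # ./sminfo            # which LID/GUID the active subnet manager reports (ibstat says SM lid 6)

/opt/opensm/bin # ./ibhosts           # list the channel adapters the SM has discovered

/opt/opensm/bin # ./ibnetdiscover     # walk the whole fabric topology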

 

So I would really appreciate any help with getting one node to ping the other.

 

I am thinking that my problem might be the HP 4x IB switch, but it shouldn't be, because even through it I should at least get a point-to-point connection. The switch doesn't have an onboard subnet manager, but I am using OpenSM, so that also shouldn't be the problem.

I want to use the InfiniBand connection for virtual storage between the ProLiants, but first I need to verify that there is a connection. Any help or suggestions would be welcome.
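
For completeness, this is roughly the IP-level test I intend to try once the fabric behaves, using IPoIB on a vmkernel port (the vSwitch name, portgroup name, vmk number and IP addresses below are just placeholders for my lab):

~ # esxcli network vswitch standard add -v vSwitch_IB

~ # esxcli network vswitch standard uplink add -v vSwitch_IB -u vmnic_ib0

~ # esxcli network vswitch standard portgroup add -v vSwitch_IB -p IPoIB

~ # esxcli network ip interface add -i vmk1 -p IPoIB

~ # esxcli network ip interface ipv4 set -i vmk1 -t static -I 192.168.50.1 -N 255.255.255.0

The same on the second host with 192.168.50.2, and then a plain vmkping 192.168.50.2 should tell me whether IPoIB traffic actually passes.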

Thanks in advance.

 

Alex

