Hi all =)
I am a bit new to the forum, but I have been reading it for quite some time and the posts are very helpful. Thanks!
So I decided it was worth hopping on the InfiniBand wagon (it's easy to see why: the speed is great and the price-to-performance is hard to beat). BUT...
I have run into some problems setting up the InfiniBand fabric.
Some information about my setup: an HP c7000 enclosure with 4 x ProLiant BL685c Gen1 blades, each with an HP 4X DDR Dual Port Mezzanine HCA, plus 2 x HP 4X DDR IB Switch Modules (each with 16 downlink ports and 8 physical CX4 ports).
I am running VMware ESXi 5.1.0
~ # esxcli system version get
Product: VMware ESXi
Version: 5.1.0
Build: Releasebuild-799733
Update: 0
So far so good. I have installed the required drivers:
* Mellanox ESXi 5.0 driver ( esxcli software vib install -d /tmp/drivers/mlx4_en-mlnx-1.6.1.2-offline_bundle-471530.zip --no-sig-check )
* Mellanox OFED driver ( esxcli software vib install -d /tmp/drivers/MLNX-OFED-ESX-1.8.1.0.zip --no-sig-check )
# esxcli software vib list | grep Mellanox
net-ib-cm 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18
net-ib-core 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18
net-ib-ipoib 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18
net-ib-mad 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18
net-ib-sa 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18
net-ib-umad 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18
net-memtrack 2013.0131.1850-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18
net-mlx4-core 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18
net-mlx4-en 1.6.1.2-1OEM.500.0.0.406165 Mellanox VMwareCertified 2014-03-18
net-mlx4-ib 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18
scsi-ib-srp 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18
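If it helps with troubleshooting, I can also post the list of loaded kernel modules; I would grab it with something like this (the exact grep pattern is just my guess):
esxcli system module list | grep -i mlx
vmkload_mod -l | grep -i ib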
After that I installed OpenSM ( esxcli software vib install -v /tmp/drivers/ib-opensm-3.3.15.x86_64.vib --no-sig-check )
~ # esxcli software vib list | grep open
ib-opensm 3.3.15 Intel VMwareAccepted 2014-03-18
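As far as I can tell the subnet manager itself is running (the ibstat output further down shows an SM lid), but if it is useful I can double-check the opensm process on each host with something like:
ps | grep opensm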
I also configured OpenSM per adapter with a partitions.conf file (Default=0x7fff,ipoib,mtu=5:ALL=full;), putting the file into the /scratch/opensm/adapter_1_hca/ and /scratch/opensm/adapter_2_hca/ directories:
/vmfs/volumes/530dc445-b2c469b5-adf0-0019bb3b460e/.locker/opensm # ls -la
drwxr-xr-x 1 root root 560 Feb 28 09:59 .
drwxr-xr-x 1 root root 980 Feb 28 09:59 ..
drwxr-xr-x 1 root root 420 Mar 18 12:31 0x00237dffff94d87d
drwxr-xr-x 1 root root 420 Mar 18 12:31 0x00237dffff94d87e
/vmfs/volumes/530dc445-b2c469b5-adf0-0019bb3b460e/.locker/opensm/0x00237dffff94d87d # cat partitions.conf
Default=0x7fff,ipoib,mtu=5:ALL=full;
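For reference, recreating those files boils down to something like this (the directory names are taken from the listing above, and I am assuming /scratch points at that .locker path):
mkdir -p /scratch/opensm/0x00237dffff94d87d /scratch/opensm/0x00237dffff94d87e
echo 'Default=0x7fff,ipoib,mtu=5:ALL=full;' > /scratch/opensm/0x00237dffff94d87d/partitions.conf
echo 'Default=0x7fff,ipoib,mtu=5:ALL=full;' > /scratch/opensm/0x00237dffff94d87e/partitions.conf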
I have been following this tutorial:
http://www.vladan.fr/homelab-storage-network-speedup/
Now I can see the adapters:
~ # esxcli network nic list | grep Mellanox
vmnic_ib0 0000:047:00.0 ib_ipoib Up 20000 Full 00:23:7d:94:d8:7d 1500 Mellanox Technologies MT25418 [ConnectX VPI - 10GigE / IB DDR, PCIe 2.0 2.5GT/s]
vmnic_ib1 0000:047:00.0 ib_ipoib Up 20000 Full 00:23:7d:94:d8:7e 1500 Mellanox Technologies MT25418 [ConnectX VPI - 10GigE / IB DDR, PCIe 2.0 2.5GT/s]
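(If more detail on the uplinks would help, I can also post the output of esxcli network nic get -n vmnic_ib0 and -n vmnic_ib1; just say the word.)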
Also, when I run ./ibstat I get this:
/opt/opensm/bin # ./ibstat
CA 'mlx4_0'
CA type: MT25418
Number of ports: 2
Firmware version: 2.7.0
Hardware version: a0
Node GUID: 0x00237dffff94d87c
System image GUID: 0x00237dffff94d87f
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 1
LMC: 0
SM lid: 6
Capability mask: 0x0251086a
Port GUID: 0x00237dffff94d87d
Link layer: InfiniBand
Port 2:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 5
LMC: 0
SM lid: 6
Capability mask: 0x0251086a
Port GUID: 0x00237dffff94d87e
Link layer: InfiniBand
So everything seems to be working, except that it is not:
When I try to ping from one host to the other, I get this:
/opt/opensm/bin # ./ibping -S -dd
ibwarn: [15174] umad_init: umad_init
ibwarn: [15174] umad_open_port: ca (null) port 0
ibwarn: [15174] umad_get_cas_names: max 32
ibwarn: [15174] umad_get_cas_names: return 1 cas
ibwarn: [15174] resolve_ca_name: checking ca 'mlx4_0'
ibwarn: [15174] resolve_ca_port: checking ca 'mlx4_0'
ibwarn: [15174] umad_get_ca: ca_name mlx4_0
ibwarn: [15174] umad_get_ca: opened mlx4_0
ibwarn: [15174] resolve_ca_port: checking port 0
ibwarn: [15174] resolve_ca_port: checking port 1
ibwarn: [15174] resolve_ca_port: found active port 1
ibwarn: [15174] resolve_ca_name: found ca mlx4_0 with port 1 type 1
ibwarn: [15174] resolve_ca_name: found ca mlx4_0 with active port 1
ibwarn: [15174] umad_open_port: opening mlx4_0 port 1
ibwarn: [15174] dev_to_umad_id: mapped mlx4_0 1 to 0
ibwarn: [15174] umad_open_port: opened /dev/umad0 fd 3 portid 0
ibwarn: [15174] umad_register: fd 3 mgmt_class 3 mgmt_version 2 rmpp_version 1 method_mask (nil)
ibwarn: [15174] umad_register: fd 3 registered to use agent 0 qp 1
ibwarn: [15174] umad_register_oui: fd 3 mgmt_class 50 rmpp_version 0 oui 0x0145 method_mask 0xffd0cca0
ibwarn: [15174] umad_register_oui: fd 3 registered to use agent 1 qp 1 class 0x32 oui 0xffd0cc90
ibdebug: [15174] ibping_serv: starting to serve...
ibwarn: [15174] umad_recv: fd 3 umad 0x80579c0 timeout 4294967295
ibwarn: [15174] umad_recv: read returned 4294967232 > sizeof umad 64 + length 256 (Resource temporarily unavailable)
ibwarn: [15174] mad_receive_via: recv failed: Resource temporarily unavailable
ibdebug: [15174] ibping_serv: server out
For some reason I always get the "Resource temporarily unavailable" message. When I try ./ibping -L with the right LID, or ./ibping -G with the right GUID, I always get this:
/opt/opensm/bin # ./ibping -G 0x001b78ffff34b9c6
ibwarn: [15237] _do_madrpc: recv failed: Resource temporarily unavailable
ibwarn: [15237] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 6)
ibwarn: [15237] ib_path_query_via: sa call path_query failed
./ibping: iberror: failed: can't resolve destination port 0x001b78ffff34b9c6
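Just to be explicit about the test sequence, in case I am doing something wrong, this is what I run:
on blade A: ./ibping -S
on blade B: ./ibping -L <LID of blade A's port, taken from its ibstat> (or ./ibping -G <its port GUID>)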
So I would really appreciate any help with getting one node to ping the other.
I am thinking my problem might be the HP 4X IB switch module, but it shouldn't be, because even through it I should at least get a point-to-point connection. The switch doesn't have an onboard subnet manager, but I am running OpenSM on the hosts, so that also shouldn't be the problem.
I want to use the InfiniBand connection for virtual storage between the ProLiants, but first I need to verify that there is a working connection at all. Any help or suggestions would be very welcome.
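For what it's worth, once ibping works, this is roughly the IPoIB setup I plan to use for a quick vmkping test between two hosts (the vSwitch/portgroup/vmk names and the 192.168.50.x addresses are just placeholders I picked):
esxcli network vswitch standard add --vswitch-name=vSwitchIB
esxcli network vswitch standard uplink add --vswitch-name=vSwitchIB --uplink-name=vmnic_ib0
esxcli network vswitch standard portgroup add --vswitch-name=vSwitchIB --portgroup-name=IPoIB
esxcli network ip interface add --interface-name=vmk2 --portgroup-name=IPoIB
esxcli network ip interface ipv4 set --interface-name=vmk2 --type=static --ipv4=192.168.50.11 --netmask=255.255.255.0
vmkping 192.168.50.12 (the vmk address on the second host)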
Thanks in advance.
Alex