I ran ib_send_lat on each host with both client and server on the same host, so neither the cables nor the switch for the subnet is involved. The numbers all look reasonable. Any suggestions on what to test next to narrow down the cause of the spikes?
When you suggested a loopback test, did you mean testing the two cables? That is, take each cable, loop it back between two ports, and see if the link comes up? I do have an SM running on the switch.
BTW, I have updated both MLNX_OFED and the HCA firmware to the latest versions. Here is an example of the output from running hca_self_test.ofed:
---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 1
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... MLNX_OFED_LINUX-3.3-1.0.4.0 (OFED-3.3-1.0.4): modules
Host Driver RPM Check .................. PASS
Firmware on CA #0 HCA .................. v12.16.1006
Firmware Check on CA #0 (HCA) .......... NA
REASON: NO required fw version
Host Driver Initialization ............. PASS
Number of CA Ports Active .............. 1
Port State of Port #1 on CA #0 (HCA)..... UP 4X EDR (InfiniBand)
Error Counter Check on CA #0 (HCA)...... PASS
Kernel Syslog Check .................... PASS
Node GUID on CA #0 (HCA) ............... 7c:fe:90:03:00:29:26:b6
------------------ DONE ---------------------
I repeated the tests that I ran before and still observed spikes.
Host fs10
Server:
[root@fs10 ~]# ib_send_lat -a -c UD
************************************
* Waiting for client to connect... *
************************************
Max msg size in UD is MTU 4096
Changing to this MTU
---------------------------------------------------------------------------------------
Send Latency Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : UD Using SRQ : OFF
RX depth : 1000
Mtu : 4096[B]
Link type : IB
Max inline data : 188[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x03 QPN 0x0031 PSN 0xc0fb9e
remote address: LID 0x03 QPN 0x0030 PSN 0x3bd5ea
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]
2 1000 0.67 4.41 0.69
4 1000 0.67 4.77 0.69
8 1000 0.67 4.77 0.69
16 1000 0.67 4.28 0.69
32 1000 0.71 4.93 0.72
64 1000 0.71 5.22 0.72
128 1000 0.75 4.80 0.76
256 1000 1.06 4.20 1.08
512 1000 1.14 4.79 1.16
1024 1000 1.27 5.08 1.29
2048 1000 1.54 5.62 1.55
4096 1000 2.04 5.71 2.06
---------------------------------------------------------------------------------------
Client:
[root@fs10 ~]# ib_send_lat -a -c UD 192.168.12.150
Max msg size in UD is MTU 4096
Changing to this MTU
---------------------------------------------------------------------------------------
Send Latency Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : UD Using SRQ : OFF
TX depth : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 188[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x03 QPN 0x0030 PSN 0x3bd5ea
remote address: LID 0x03 QPN 0x0031 PSN 0xc0fb9e
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]
2 1000 0.67 5.92 0.68
4 1000 0.67 4.93 0.69
8 1000 0.67 4.80 0.68
16 1000 0.67 4.29 0.69
32 1000 0.70 4.95 0.72
64 1000 0.70 5.21 0.72
128 1000 0.75 4.81 0.76
256 1000 1.07 4.20 1.08
512 1000 1.14 4.80 1.16
1024 1000 1.27 5.08 1.29
2048 1000 1.53 5.63 1.55
4096 1000 2.04 5.70 2.06
---------------------------------------------------------------------------------------
Host fs11
Server:
[root@fs11 ~]# ib_send_lat -a -c UD
************************************
* Waiting for client to connect... *
************************************
Max msg size in UD is MTU 4096
Changing to this MTU
---------------------------------------------------------------------------------------
Send Latency Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : UD Using SRQ : OFF
RX depth : 1000
Mtu : 4096[B]
Link type : IB
Max inline data : 188[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x02 QPN 0x002d PSN 0xf82dfe
remote address: LID 0x02 QPN 0x002c PSN 0x49619e
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]
2 1000 0.68 2.60 0.69
4 1000 0.67 2.34 0.69
8 1000 0.67 2.06 0.69
16 1000 0.68 1.95 0.69
32 1000 0.71 1.83 0.72
64 1000 0.71 1.82 0.72
128 1000 0.75 1.91 0.76
256 1000 1.07 3.26 1.09
512 1000 1.14 2.50 1.15
1024 1000 1.28 2.71 1.30
2048 1000 1.54 2.83 1.56
4096 1000 2.05 2.76 2.07
---------------------------------------------------------------------------------------
Client:
[root@fs11 ~]# ib_send_lat -a -c UD 192.168.12.151
Max msg size in UD is MTU 4096
Changing to this MTU
---------------------------------------------------------------------------------------
Send Latency Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : UD Using SRQ : OFF
TX depth : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 188[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x02 QPN 0x002c PSN 0x49619e
remote address: LID 0x02 QPN 0x002d PSN 0xf82dfe
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]
2 1000 0.67 5.75 0.69
4 1000 0.67 2.37 0.69
8 1000 0.67 5.52 0.69
16 1000 0.67 1.86 0.69
32 1000 0.70 2.01 0.72
64 1000 0.71 1.85 0.72
128 1000 0.75 1.90 0.76
256 1000 1.06 5.13 1.08
512 1000 1.13 2.27 1.15
1024 1000 1.28 2.74 1.30
2048 1000 1.53 2.86 1.56
4096 1000 2.04 6.10 2.07
---------------------------------------------------------------------------------------
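To make runs easier to compare, the tables above could be reduced to a ratio of t_max to t_typical per message size. The following is only a rough sketch: the function names parse_rows and flag_spikes are made up for illustration, and the threshold of 3.0 is an arbitrary assumption, not a value recommended by the perftest tools. The two sample rows are copied from the fs11 server output.

```python
import re

# One data row of ib_send_lat output:
# #bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s*$")


def parse_rows(text):
    """Return (size, iterations, t_min, t_max, t_typical) for each data row.

    Header, separator, and address lines do not match ROW and are skipped.
    """
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            size, iters = int(m.group(1)), int(m.group(2))
            t_min, t_max, t_typ = (float(m.group(i)) for i in (3, 4, 5))
            rows.append((size, iters, t_min, t_max, t_typ))
    return rows


def flag_spikes(rows, threshold=3.0):
    """Return (size, ratio) where t_max / t_typical exceeds the threshold.

    The default threshold is an assumption chosen only for illustration.
    """
    flagged = []
    for size, _iters, _t_min, t_max, t_typ in rows:
        ratio = t_max / t_typ
        if ratio > threshold:
            flagged.append((size, ratio))
    return flagged


# Two rows taken from the fs11 server table.
SAMPLE = """\
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]
2 1000 0.68 2.60 0.69
4096 1000 2.05 2.76 2.07
"""

if __name__ == "__main__":
    for size, ratio in flag_spikes(parse_rows(SAMPLE)):
        print(f"{size} bytes: t_max is {ratio:.1f}x t_typical")
```

Saving the output of each run to a file and passing its contents to parse_rows would allow the same comparison across hosts and across OFED or firmware versions.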