Infiniband

Mellanox OFED Installation

The Mellanox OFED (OpenFabrics Enterprise Distribution) is a software stack that allows RDMA and kernel-bypass applications to use the adapter. Installing this package is required in order to make proper use of the hardware.

Download

You can download the Mellanox OFED from Nvidia's website: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/.

Automatic download script

Use the following script to download the latest Mellanox OFED driver of a particular release. I use this to get the latest version of the 4.9 LTS and 5.4 releases. You'll have to change the OS version to match what you need.

#!/bin/bash

function get_version() {
        # Pass me something like '4.9'
        Version="$1"

        curl 'https://downloaders.azurewebsites.net/downloaders/mlnx_ofed_downloader/helper.php' \
          -H 'Accept: application/json, text/javascript, */*' \
          -H 'Accept-Language: en-US,en;q=0.9' \
          -H 'Connection: keep-alive' \
          -H 'Content-Type: application/x-www-form-urlencoded' \
          -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46' \
          -H 'X-Requested-With: XMLHttpRequest' \
          --data 'action=get_versions' \
          --silent \
          --compressed
}

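A quick way to inspect what the helper endpoint returns (this assumes the response is JSON, as the Accept header above suggests):

get_version "4.9" | python3 -m json.tool
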
Building RPMs

The manual way

The install script requires the following packages to work:

# yum install perl-Term-ANSIColor tcsh tcl gcc-gfortran tk

To build with the --add-kernel-support option so that the packages target your kernel, the following packages will also be needed:

# yum install perl-File-Temp createrepo elfutils-libelf-devel \
  rpm-build lsof python36 python36-devel \
  kernel-devel kernel-rpm-macros \
  make gdb-headless gcc

Next, download the desired Mellanox package and extract it. You should see a mlnxofedinstall script. Invoke the script with the following arguments:

# ./mlnxofedinstall --distro rhel8.7 --add-kernel-support -k $(uname -r)

A bunch of RPMs should be generated once this is done. You should then be able to install these RPM packages on this system or copy them to other systems for installation.
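
For example, to install the rebuilt packages on another node, something along these lines should work. This is only a sketch: the -ext tarball name is an assumption about how mlnxofedinstall names its rebuilt output, and othernode is a placeholder.

# scp /tmp/MLNX_OFED_LINUX-*-ext.tgz othernode:/tmp/
# ssh othernode 'cd /tmp && tar -xzf MLNX_OFED_LINUX-*-ext.tgz && cd MLNX_OFED_LINUX-*-ext && ./mlnxofedinstall'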

After installing, run /etc/init.d/openibd restart to restart the driver. You will want to do this whenever you change any of the kernel module settings for them to take effect.

Using Docker

I have started using Docker to build the RPM files so that I can control the build environment and match the HPC environment the drivers target.

Define the following environment variables:

export Kernel="4.18.0-372.32.1.el8_6.x86_64"
export MLNXVersionNumber="4.9-5.1.0.0"
export MLNXVersionDistro="rhel8.6"

Create a Dockerfile to build the drivers in:

FROM rockylinux/rockylinux:8.6

ARG KERNEL_VERSION

RUN set -ex; \
        dnf -y install createrepo rpm-build elfutils-libelf-devel kernel-rpm-macros lsof \
        python36 python36-devel python2 python3-libxml2 gcc gcc-gfortran \
        perl pciutils tcsh expat glib2 tcl libstdc++ bc tk gtk2 atk cairo \
        numactl pkgconfig ethtool \
        kernel-$KERNEL_VERSION kernel-modules-$KERNEL_VERSION kernel-devel-$KERNEL_VERSION; \
        ln -s /usr/bin/python3 /usr/bin/python

# The rpmbuild step uses a spec file which looks for centos|rhel, but not rocky.
# We can fake the distro as CentOS by copying a CentOS os-release file here.
COPY os-release-centos /etc/os-release
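
For reference, os-release-centos can simply be a copy of /etc/os-release from a stock CentOS 8 image. The values below are typical and may differ slightly; the important part is that the file identifies the distro as CentOS, which is what the install scripts grep for.

NAME="CentOS Linux"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"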

Build the image:

docker build -t rcs-baseimage-mlx4 --build-arg KERNEL_VERSION=$Kernel .

Using the Dockerfile above, I then build the driver by running the following:

docker run --rm -ti \
	-v `pwd`/artifacts:/artifacts \
	-v `pwd`/tmp:/tmp \
	rcs-baseimage-mlx4 \
	bash -c "cd /tmp; \
		cat /etc/os-release; \
		wget -O - --progress=dot:giga http://local-server/MLNX/MLNX_OFED_LINUX-$MLNXVersionNumber-$MLNXVersionDistro-x86_64.tgz | tar -xzf - ; \
		cd MLNX_OFED_LINUX-$MLNXVersionNumber-$MLNXVersionDistro-x86_64; \
		sed -i 's/centos|/rocky|centos|/g' mlnxofedinstall; \
		./mlnxofedinstall --distro $MLNXVersionDistro --add-kernel-support -k $Kernel; \
		echo logs below.; \
		cat /tmp/MLNX_OFED_LINUX-$MLNXVersionNumber-*/*_logs/*.log; \
		echo ok"

Troubleshooting

Driver not loading

When trying to start the driver with /etc/init.d/openibd start, you get a bunch of fail messages.

# /etc/init.d/openibd  start
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_IN.UTF-8)
Module mlx5_ib belong to kernel-modules which is not a part[FAILED]_OFED, skipping...
Module mlx5_core belong to kernel-core which is not a part [FAILED]OFED, skipping...
Module ib_umad belong to kernel-modules which is not a part[FAILED]_OFED, skipping...
Module ib_uverbs belong to kernel-modules which is not a pa[FAILED]NX_OFED, skipping...
Module ib_ipoib belong to kernel-modules which is not a par[FAILED]X_OFED, skipping...
Loading HCA driver and Access Layer:                       [  OK  ]
Module rdma_cm belong to kernel-modules which is not a part[FAILED]_OFED, skipping...
Module rdma_ucm belong to kernel-modules which is not a par[FAILED]X_OFED, skipping...

This happens when the distro-provided OFED modules are being loaded. You likely forgot to install the OFED kernel modules package. To fix this, run:

# yum -y install mlnx-ofa_kernel-modules

You may then try restarting openibd again. You might also need to unload a few kernel drivers manually for this to work.

# /etc/init.d/openibd  start
Unloading rdma_cm                                         [FAILED]
rmmod: ERROR: Module rdma_cm is in use by: rpcrdma ib_srpt ib_isert
# rmmod ib_isert
# rmmod ib_srpt
# rmmod rpcrdma
# rmmod rdma_cm

# /etc/init.d/openibd  start
Unloading HCA driver:                                      [  OK  ]
Loading HCA driver and Access Layer:                       [  OK  ]

Rocky Linux 8.6 with 4.9 LTS driver build issues

Compiling the Mellanox 4.9 LTS driver for CentOS / Rocky Linux 8.6 results in an error. Someone else brought this issue up on Nvidia's forum but got no real answer. rpmbuild shows the following errors:

In file included from /tmp/MLNX_OFED_LINUX-4.9-4.1.7.0-4.18.0-372.32.1.el8_6.x86_64/mlnx_iso.36/OFED_topdir/BUILD/mlnx-ofa_kernel-4.9/obj/default/drivers/net/ethernet/mellanox/mlx4/alloc.c:36:
/tmp/MLNX_OFED_LINUX-4.9-4.1.7.0-4.18.0-372.32.1.el8_6.x86_64/mlnx_iso.36/OFED_topdir/BUILD/mlnx-ofa_kernel-4.9/obj/default/include/linux/mm.h:15:21: error: conflicting types for 'kvzalloc'
 static inline void *kvzalloc(unsigned long size,...) {
                     ^

To work around this issue, I add the following patch script, which I execute ~15 seconds after running the mlnxofedinstall script (after it has extracted the RPM source files to /tmp/MLNX_OFED_LINUX-$MLNXVersionNumber-$Kernel). This basically just injects the HAVE_KV*ALLOC flags into the mm.h header file so that these definitions aren't redefined again.

cd /tmp/MLNX_OFED_LINUX-$MLNXVersionNumber-$Kernel

while read i ; do
        echo "PATCHING FOR $i"
        find -iname mm.h -exec sh -c "sed -i \"3i$i\" {}" \;
done <<EOF
#define HAVE_KVMALLOC_ARRAY 1
#define HAVE_KVMALLOC_NODE 1
#define HAVE_KVMALLOC 1
#define HAVE_KVZALLOC_NODE 1
#define HAVE_KVZALLOC 1
#define HAVE_KVCALLOC 1
EOF
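
One way to automate the timing described above is to run the installer in the background, sleep, and then apply the patch loop while the build continues. This is only a sketch: the 15 second sleep is the rough delay mentioned above, and patch-mm-h.sh is a hypothetical name for the while/read loop saved as a script.

./mlnxofedinstall --distro $MLNXVersionDistro --add-kernel-support -k $Kernel &
installer_pid=$!
sleep 15                   # give the installer time to extract its RPM sources under /tmp
bash ./patch-mm-h.sh       # hypothetical script containing the patch loop above
wait $installer_pid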

Quick Usage

Here are some commands when working with Infiniband.

ofed_info -s: Shows the OFED version that is installed.

# ofed_info -s
MLNX_OFED_LINUX-5.1-2.3.7.1:

ibcheckwidth: Checks fabric links for reduced link width.

ibdev2netdev: Maps adapter ports to network devices.

# ibdev2netdev -v
0000:05:00.0 mlx4_0 (MT26438 - MT1008X01087) FALCON QDR      fw 2.9.1000 port 1 (ACTIVE) ==> ib0 (Down)
0000:05:00.0 mlx4_0 (MT26438 - MT1008X01087) FALCON QDR      fw 2.9.1000 port 2 (DOWN  ) ==> enp5s0d1 (Down)

ibdiagnet: Shows diagnostic information about the network. Example commands:

# ibdiagnet --pc -P all=1 -get_phy_info --extended_speed all --pm_per_lane \
     --get_cable_info --cable_info_disconnected --pm_pause_time 600 -o /tmp/ibdiagnet_ibm
# ibdiagnet -P symbol_error_counter=1

ibnetdiscover: Discovers the network topology. Pipe stdout to /dev/null to look only for errors.

# ibnetdiscover > /dev/null

ibqueryerrors: Looks for port errors.

# ibqueryerrors -s PortXmitWait,LinkErrorRecoveryCounter,PortRcvSwitchRelayErrors,\
LinkDownedCounter,PortXmitDiscards,VL15Dropped,PortRcvErrors,PortRcvRemotePhysicalErrors

ibstat: Shows port statistics.

ibswitches: Shows switches on the network.

ibv_devinfo: Shows Infiniband device information.

See also: https://docs.nvidia.com/networking/display/ConnectX6EN/Troubleshooting

IP over Infiniband (IPoIB)

On Linux, the IPoIB driver is ib_ipoib. Like all other kernel modules, its load-time options can be set in /etc/modprobe.d/ib_ipoib.conf.
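
For a quick sanity check of what the module is currently using, the queue size parameters can usually be read back from sysfs (the parameter names here match the recv_queue_size/send_queue_size options used further down this page):

# cat /sys/module/ib_ipoib/parameters/recv_queue_size
# cat /sys/module/ib_ipoib/parameters/send_queue_size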

Issues with dropped packets

I've been beating my head trying to determine why an infiniband adapter on one particular server is dropping packets:

# cat /sys/class/net/ib0/statistics/{tx,rx}_dropped
4069
141287

The ring buffer sizes appear to be smaller than on other servers. Other servers that have no issues have buffer sizes of 512/512 for rx/tx respectively.

# ethtool -g ib0
Ring parameters for ib0:
Pre-set maximums:
RX:		8192
RX Mini:	0
RX Jumbo:	0
TX:		8192
Current hardware settings:
RX:		256
RX Mini:	0
RX Jumbo:	0
TX:		128

Set the buffer size one time:

# ethtool -G ib0 rx 8192
# ethtool -G ib0 tx 4096

Or, set it in the kernel modprobe.d file:

# echo "options ib_ipoib recv_queue_size=8192 send_queue_size=4096 >> /etc/modprobe.d/ib_ipoib.conf

IPoIB Bonding

At CHGI, there is a GPFS filesystem where the storage nodes use four Infiniband links bonded as one interface. However, it's not working quite right: the bond randomly stops working.

dmesg shows this on the storage node:

ib2: ipoib_cm_handle_tx_wc: failed cm send event

Then, the bonded ib0 interface stops responding to the quorum node. It can still reach the other storage nodes and certain other hosts via the bonded link.

The fix when this happens is to toggle ib2 down and up (a short sketch of the commands follows the log below). This causes the bonded interface to make another link active, and the bond can then talk with the quorum node again. Here, you can see me toggling every Infiniband link until I hit ib2, which causes the bond to use ib3 as the active link.

[Mon Jul 27 10:25:09 2020] bonding: bond0: link status definitely down for interface ib0, disabling it         
[Mon Jul 27 10:25:11 2020] bonding: bond0: link status definitely up for interface ib0, 56000 Mbps full duplex.
[Mon Jul 27 10:25:15 2020] bonding: bond0: link status definitely down for interface ib1, disabling it         
[Mon Jul 27 10:25:17 2020] bonding: bond0: link status definitely up for interface ib1, 56000 Mbps full duplex.
[Mon Jul 27 10:25:24 2020] bonding: bond0: link status definitely down for interface ib2, disabling it         
[Mon Jul 27 10:25:24 2020] bonding: bond0: making interface ib3 the new active one.                            
[Mon Jul 27 10:25:26 2020] bonding: bond0: link status definitely up for interface ib2, 56000 Mbps full duplex.
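
For reference, the toggle itself is just a matter of downing and bringing back up the affected slave (ib2 in this example):

# ip link set ib2 down
# sleep 2
# ip link set ib2 up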

The quorum node only has one Infiniband link. When this occurred, pings from the storage node were able to reach the quorum node, and the quorum node did send a reply, but the reply never reached the storage node. Most likely the switch is getting confused and is sending the reply back on another interface (something to test if this happens again?).

[root@essio1 ~]# dmesg -T | grep -B 3 "making interface ib"                                                     
[Sun Feb 16 15:12:57 2020] device bond0 left promiscuous mode                                                   
[Sun Feb 16 15:12:57 2020] device ib0 left promiscuous mode                                                     
[Sun Feb 16 18:02:12 2020] bonding: bond0: link status definitely down for interface ib0, disabling it          
[Sun Feb 16 18:02:12 2020] bonding: bond0: making interface ib1 the new active one.                             
--                                                                                                              
[Mon May  4 12:08:56 2020] NOHZ: local_softirq_pending 08                                                       
[Mon May  4 12:08:56 2020] bonding: bond0: link status definitely up for interface ib0, 56000 Mbps full duplex. 
[Mon May  4 12:09:08 2020] bonding: bond0: link status definitely down for interface ib1, disabling it          
[Mon May  4 12:09:08 2020] bonding: bond0: making interface ib2 the new active one.                             
--                                                                                                              
[Mon Jul 27 10:25:15 2020] bonding: bond0: link status definitely down for interface ib1, disabling it          
[Mon Jul 27 10:25:17 2020] bonding: bond0: link status definitely up for interface ib1, 56000 Mbps full duplex. 
[Mon Jul 27 10:25:24 2020] bonding: bond0: link status definitely down for interface ib2, disabling it          
[Mon Jul 27 10:25:24 2020] bonding: bond0: making interface ib3 the new active one.

The failure always follows a "failed cm send event" error, but not every such error results in a failure, as you can see from the timestamps.

[root@essio1 ~]# dmesg -T | grep "failed cm send"                                                            
[Mon Feb 24 09:57:43 2020] ib1: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=304 vend_err 81)
[Tue Apr 14 14:05:10 2020] ib1: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=154 vend_err 81)
[Tue Apr 14 14:07:13 2020] ib1: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=262 vend_err 81)
[Tue Apr 14 14:09:15 2020] ib1: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=287 vend_err 81)
[Tue Apr 14 14:09:20 2020] ib1: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=223 vend_err 81)
[Tue Apr 14 14:09:45 2020] ib1: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=109 vend_err 81)
[Mon May  4 11:34:14 2020] ib1: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=64 vend_err 81) 
[Mon May  4 15:50:20 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=225 vend_err 81)
[Mon May  4 15:51:07 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=407 vend_err 81)
[Tue Jun 16 23:41:55 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=154 vend_err 81)
[Tue Jun 16 23:41:55 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=353 vend_err 81)
[Wed Jun 17 11:04:36 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=325 vend_err 81)
[Wed Jun 17 11:05:50 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=181 vend_err 81)
[Wed Jun 17 11:08:48 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=157 vend_err 81)
[Wed Jun 17 11:16:19 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=166 vend_err 81)
[Wed Jun 17 11:16:42 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=323 vend_err 81)
[Mon Jun 22 11:13:27 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=158 vend_err 81)
[Mon Jun 22 11:14:55 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=238 vend_err 81)
[Mon Jun 22 11:18:28 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=255 vend_err 81)
[Wed Jun 24 16:57:30 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=167 vend_err 81)
[Wed Jun 24 17:03:09 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=131 vend_err 81)
[Mon Jul  6 10:49:46 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=15 vend_err 81) 
[Mon Jul  6 11:08:34 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=72 vend_err 81) 
[Tue Jul 14 13:52:32 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=122 vend_err 81)
[Sun Jul 26 06:20:52 2020] ib2: ipoib_cm_handle_tx_wc: failed cm send event (status=12, wrid=269 vend_err 81)

See Also