mirror of
https://github.com/Motorhead1991/qemu.git
synced 2026-01-26 15:07:23 -07:00
hw/rdma: Remove deprecated pvrdma device and rdmacm-mux helper

The whole RDMA subsystem was deprecated in commit e9a54265f5
("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
released in v8.2.

Remove:
- PVRDMA device
- generated vmw_pvrdma/ directory from linux-headers
- rdmacm-mux tool from contrib/

Cc: Yuval Shaia <yuval.shaia.ml@gmail.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20240328130255.52257-2-philmd@linaro.org>
parent
a60e53fa8f
commit
1dfd42c426
51 changed files with 5 additions and 7977 deletions
@@ -365,15 +365,6 @@ recommending to switch to their stable counterparts:
- "Zve64f" should be replaced with "zve64f"
- "Zve64d" should be replaced with "zve64d"

``-device pvrdma`` and the rdma subsystem (since 8.2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The pvrdma device and the whole rdma subsystem are in a bad shape and
without active maintenance. The QEMU project intends to remove this
device and subsystem from the code base in a future release without
replacement unless somebody steps up and improves the situation.


Block device options
''''''''''''''''''''

@@ -925,6 +925,10 @@ contains native support for this feature and thus use of the option
ROM approach was obsolete. The native SeaBIOS support can be activated
by using ``-machine graphics=off``.

``pvrdma`` and the RDMA subsystem (removed in 9.1)
''''''''''''''''''''''''''''''''''''''''''''''''''

The 'pvrdma' device and the whole RDMA subsystem have been removed.

Related binaries
----------------

345
docs/pvrdma.txt
@@ -1,345 +0,0 @@
Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
===============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver as is, with no need for any special
guest modifications.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines as peers.

It does not require an RDMA HCA in the host; it can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit. Migration support is not implemented yet, but will be
possible with some HW assistance.

A project presentation accompanies this document:
- https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4730/original/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf



2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to include the pvrdma driver.

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
  https://github.com/linux-rdma/rdma-core.git


2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SR-IOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices;
each one requires a separate instance (rxe or SR-IOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux
kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux kernel but not loaded by default.
Install the user-level library (librxe) following the instructions from:
https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an ETH interface with rxe by running:
    rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as pvrdma backend.


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port; for Mellanox cards
it will be something like mlx5_6, which can be the backend.


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag, installing
the required RDMA libraries.



3. Usage
========


3.1 VM Memory settings
======================
Currently the device works only with memory-backed RAM,
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \


3.2 MAD Multiplexer
===================
MAD Multiplexer is a service that exposes a MAD-like interface for VMs in
order to overcome the limitation where only a single entity can register with
the MAD layer to send and receive RDMA-CM MAD packets.

To build rdmacm-mux run
    # make rdmacm-mux

Before running rdmacm-mux, make sure that both the ib_cm and rdma_cm kernel
modules aren't loaded, otherwise the rdmacm-mux service will fail to start.

The application accepts 3 command line arguments and exposes a UNIX socket
to pass control and data to it.
    -d rdma-device-name  Name of RDMA device to register with
    -s unix-socket-path  Path to unix socket to listen on (default /var/run/rdmacm-mux)
    -p rdma-device-port  Port number of RDMA device to register with (default 1)
The final UNIX socket file name is a concatenation of the 3 arguments, so
for example for device mlx5_0 on port 2 the socket /var/run/rdmacm-mux-mlx5_0-2
will be created.
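
The naming rule above can be sketched in a few lines of Python. This is only an
illustration: the helper name mux_socket_path is hypothetical, and the "-"
separator is inferred from the mlx5_0 example rather than taken from the
rdmacm-mux source.

```python
def mux_socket_path(device: str, port: int = 1,
                    base: str = "/var/run/rdmacm-mux") -> str:
    """Join socket path, device name and port as in the mlx5_0 example.

    The "-" separator is an assumption inferred from the documented example.
    """
    return f"{base}-{device}-{port}"


if __name__ == "__main__":
    print(mux_socket_path("mlx5_0", 2))
    print(mux_socket_path("rxe0"))
```

With the default port, device rxe0 yields the same path that the chardev in
the example of section 3.6 points at.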

pvrdma requires this service.

Please refer to contrib/rdmacm-mux for more details.


3.3 Service exposed by libvirt daemon
=====================================
The control over the RDMA device's GID table is done by updating the
device's Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second by
the first IPv6 address and the third by the IPv4 address. Other entries can
be added by adding more IP addresses. The reverse also holds, i.e.
whenever an address is removed, the corresponding GID entry is removed.
The process is done by the network and RDMA stacks. Whenever an address is
added, the ib_core driver is notified and calls the device driver's add_gid
function, which in turn updates the device.
To support this in the pvrdma device, the device hooks into the create_bind
and destroy_bind HW commands triggered by the pvrdma driver in the guest.

Whenever a change is made to the pvrdma port's GID table, a special QMP
message is sent to be processed by libvirt to update the address of the
backend Ethernet device.

pvrdma requires the libvirt service to be running.


3.4 PCI devices settings
========================
A RoCE device exposes two functions: Ethernet and RDMA.
To support this, the pvrdma device is composed of two PCI functions: an Ethernet
device of type vmxnet3 on PCI slot 0 and a PVRDMA device on PCI slot 1. The
Ethernet function can be used for other Ethernet purposes such as IP.


3.5 Device parameters
=====================
- netdev: Specifies the Ethernet device function name on the host, for
  example enp175s0f0. For a Soft-RoCE device (rxe) this would be the Ethernet
  device used to create it.
- ibdev: The IB device name on the host, for example rxe0, mlx5_0 etc.
- mad-chardev: The name of the MAD multiplexer char device.
- ibport: In case of a multi-port device (such as Mellanox's HCA) this
  specifies the port to use. If not set, 1 will be used.
- dev-caps-max-mr-size: The maximum size of MR.
- dev-caps-max-qp: Maximum number of QPs.
- dev-caps-max-cq: Maximum number of CQs.
- dev-caps-max-mr: Maximum number of MRs.
- dev-caps-max-pd: Maximum number of PDs.
- dev-caps-max-ah: Maximum number of AHs.

Notes:
- The first 3 parameters are mandatory settings; the rest have
  defaults.
- The parameters prefixed by dev-caps define upper limits, but the
  final values are adjusted by the backend device limitations.
- netdev can be extracted from ibdev's sysfs
  (/sys/class/infiniband/<ibdev>/device/net/)


3.6 Example
===========
Define bridge device with vmxnet3 network backend:
<interface type='bridge'>
  <mac address='56:b4:44:e9:62:dc'/>
  <source bridge='bridge1'/>
  <model type='vmxnet3'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
</interface>

Define pvrdma device:
<qemu:commandline>
  <qemu:arg value='-object'/>
  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
  <qemu:arg value='-numa'/>
  <qemu:arg value='node,memdev=mb1'/>
  <qemu:arg value='-chardev'/>
  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
</qemu:commandline>
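
As a rough illustration of how the parameters from section 3.5 combine into
the -device value used above, the following Python sketch assembles the
string. The helper name pvrdma_device_arg is hypothetical and not part of
QEMU or libvirt; only the parameter names come from this document.

```python
def pvrdma_device_arg(ibdev, netdev, mad_chardev, addr=None, **optional):
    """Build the value passed after -device for a pvrdma device.

    ibdev, netdev and mad_chardev are the three mandatory parameters.
    Optional keyword names use underscores and are converted to the
    dashed spelling, e.g. dev_caps_max_qp -> dev-caps-max-qp.
    """
    parts = ["pvrdma"]
    if addr is not None:
        parts.append(f"addr={addr}")
    parts += [f"ibdev={ibdev}", f"netdev={netdev}",
              f"mad-chardev={mad_chardev}"]
    for name, value in optional.items():
        parts.append(f"{name.replace('_', '-')}={value}")
    return ",".join(parts)


if __name__ == "__main__":
    print(pvrdma_device_arg("rxe0", "bridge0", "mads", addr="10.1"))
```

Called with the values from the example, it reproduces the
pvrdma,addr=10.1,... string shown in the qemu:commandline block.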



4. Implementation details
=========================


4.1 Overview
============
The device acts like a proxy between the Guest Driver and the host
ibdevice interface.
On configuration path:
 - For every hardware resource request (PD/QP/CQ/...) the pvrdma will request
   a resource from the backend interface, maintaining a 1-1 mapping
   between the guest and host.
On data path:
 - Every post_send/receive received from the guest will be converted into
   a post_send/receive for the backend. The buffer data will not be touched
   or copied, resulting in near bare-metal performance for large enough buffers.
 - Completions from the backend interface will result in completions for
   the pvrdma device.


4.2 PCI BARs
============
PCI Bars:
    BAR 0 - MSI-X
        MSI-X vectors:
            (0) Command - used when execution of a command is completed.
            (1) Async - not in use.
            (2) Completion - used when a completion event is placed in
                device's CQ ring.
    BAR 1 - Registers
        --------------------------------------------------------
        | VERSION | DSR | CTL | REQ | ERR | ICR | IMR | MAC |
        --------------------------------------------------------
            DSR - Address of driver/device shared memory used
                  for the command channel, used for passing:
                    - General info such as driver version
                    - Address of 'command' and 'response'
                    - Address of async ring
                    - Address of device's CQ ring
                    - Device capabilities
            CTL - Device control operations (activate, reset etc)
            IMR - Set interrupt mask
            REQ - Command execution register
            ERR - Operation status

    BAR 2 - UAR
        ---------------------------------------------------------
        | QP_NUM | SEND/RECV Flag || CQ_NUM | ARM/POLL Flag |
        ---------------------------------------------------------
            - Offset 0 used for QP operations (send and recv)
            - Offset 4 used for CQ operations (arm and poll)
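
A minimal Python sketch of how a QP doorbell value for the UAR could be packed
and unpacked. The offsets 0 and 4 come from the layout above, and the receive
flag in bit 31 comes from the post-receive flow in section 4.3.3. The 24-bit
handle mask and all names are assumptions made for illustration; this document
does not state the width of the QP number or the position of the send flag.

```python
UAR_QP_OFFSET = 0            # from the BAR 2 layout above
UAR_CQ_OFFSET = 4            # from the BAR 2 layout above
QP_RECV_BIT = 1 << 31        # "qp_recv_bit (31)" in the post-receive flow
HANDLE_MASK = 0x00FFFFFF     # assumption: low 24 bits carry the QP number


def recv_doorbell(qpn: int):
    """Return (UAR offset, 32-bit value) a guest would write for post-recv."""
    return UAR_QP_OFFSET, (qpn & HANDLE_MASK) | QP_RECV_BIT


def decode_qp_doorbell(value: int):
    """Device side: split a QP doorbell value into (qpn, is_recv)."""
    return value & HANDLE_MASK, bool(value & QP_RECV_BIT)


if __name__ == "__main__":
    offset, value = recv_doorbell(5)
    print(offset, hex(value), decode_qp_doorbell(value))
```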


4.3 Major flows
===============

4.3.1 Create CQ
===============
    - Guest driver
        - Allocates pages for CQ ring
        - Creates page directory (pdir) to hold CQ ring's pages
        - Initializes CQ ring
        - Initializes 'Create CQ' command object (cqe, pdir etc)
        - Copies the command to 'command' address
        - Writes 0 into REQ register
    - Device
        - Reads the request object from the 'command' address
        - Allocates CQ object and initializes CQ ring based on pdir
        - Creates the backend CQ
        - Writes operation status to ERR register
        - Posts command-interrupt to guest
    - Guest driver
        - Reads the HW response code from ERR register
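
The handshake above can be modelled with a toy Python class. This is a sketch
of the ordering only (command slot, REQ write, ERR status, command interrupt),
not the device's actual code: the class, the dictionary layout, and the use of
0 for success and 1 for failure in ERR are all assumptions.

```python
class ToyPvrdmaDevice:
    """Toy model of the command channel; all names are hypothetical."""

    def __init__(self):
        self.command = None      # stands in for the shared 'command' slot
        self.err = None          # stands in for the ERR register
        self.cqs = {}            # device-side CQ objects, keyed by handle
        self.interrupts = []     # interrupts posted to the guest

    def write_req(self, value):
        """Guest writes REQ: the device executes the pending command."""
        cmd = self.command
        if value != 0 or cmd is None or cmd.get("op") != "create_cq":
            self.err = 1         # assumption: non-zero means failure
        else:
            handle = len(self.cqs)
            # A real device would also map the ring via pdir and create
            # the backend CQ here.
            self.cqs[handle] = {"cqe": cmd["cqe"], "pdir": cmd["pdir"]}
            self.err = 0         # assumption: zero means success
        self.interrupts.append("command")


if __name__ == "__main__":
    dev = ToyPvrdmaDevice()
    dev.command = {"op": "create_cq", "cqe": 64, "pdir": 0x1000}  # guest side
    dev.write_req(0)                                              # guest side
    print(dev.err, dev.interrupts, dev.cqs)
```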

4.3.2 Create QP
===============
    - Guest driver
        - Allocates pages for send and receive rings
        - Creates page directory (pdir) to hold the ring's pages
        - Initializes 'Create QP' command object (max_send_wr,
          send_cq_handle, recv_cq_handle, pdir etc)
        - Copies the object to 'command' address
        - Writes 0 into REQ register
    - Device
        - Reads the request object from 'command' address
        - Allocates the QP object and initializes
            - Send and recv rings based on pdir
            - Send and recv ring state
        - Creates the backend QP
        - Writes the operation status to ERR register
        - Posts command-interrupt to guest
    - Guest driver
        - Reads the HW response code from ERR register

4.3.3 Post receive
==================
    - Guest driver
        - Initializes a wqe and places it on recv ring
        - Writes qpn|qp_recv_bit (31) to QP offset in UAR
    - Device
        - Extracts qpn from UAR
        - Walks through the ring and does the following for each wqe
            - Prepares the backend CQE context to be used when
              receiving completion from backend (wr_id, op_code, emu_cq_num)
            - For each sge prepares backend sge
            - Calls backend's post_recv

4.3.4 Process backend events
============================
    - Done by a dedicated thread used to process backend events;
      at initialization it is attached to the device and creates
      the communication channel.
    - Thread main loop:
        - Polls for completions
        - Extracts emu_cq_num, wr_id and op_code from context
        - Writes CQE to CQ ring
        - Writes CQ number to device CQ
        - Sends completion-interrupt to guest
        - Deallocates context
        - Acks the event to backend
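
The way the context saved at post-receive time is consumed by the event thread
can be sketched as follows. This is a single-threaded toy model in Python with
a fake backend that completes every request immediately; the function and
variable names are hypothetical, and only the context fields (wr_id, op_code,
emu_cq_num) are taken from the flows above.

```python
from collections import deque
from itertools import count

_ctx_ids = count(1)
contexts = {}                    # context id -> saved completion context
backend_completions = deque()    # stand-in for the backend completion queue


def post_recv(wqe, emu_cq_num):
    """Save the completion context for one wqe, then post to the fake backend."""
    ctx_id = next(_ctx_ids)
    contexts[ctx_id] = {
        "wr_id": wqe["wr_id"],
        "op_code": "recv",
        "emu_cq_num": emu_cq_num,
    }
    # A real backend completes asynchronously; this one completes at once.
    backend_completions.append(ctx_id)
    return ctx_id


def process_backend_events(cq_rings, interrupts):
    """One pass of the main loop: drain completions into the guest CQ rings."""
    while backend_completions:
        ctx_id = backend_completions.popleft()      # poll for a completion
        ctx = contexts.pop(ctx_id)                  # extract and free context
        cqe = {"wr_id": ctx["wr_id"], "op_code": ctx["op_code"]}
        cq_rings.setdefault(ctx["emu_cq_num"], []).append(cqe)  # write CQE
        interrupts.append(ctx["emu_cq_num"])        # completion interrupt


if __name__ == "__main__":
    rings, irqs = {}, []
    post_recv({"wr_id": 42}, emu_cq_num=3)
    process_backend_events(rings, irqs)
    print(rings, irqs)
```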



5. Limitations
==============
- The device is limited by the features of the VMware device API that the
  Guest Linux Driver implements.
- The memory registration mechanism requires mremap for every page in the buffer in order
  to map it to a contiguous virtual address range. Since this is not the data path
  it should not matter much. If the default max mr size is increased, be aware that
  memory registration can take up to 0.5 seconds for 1GB of memory.
- The device requires target page size to be the same as the host page size,
  otherwise it will fail to init.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is attached,
  so it can't work with huge pages. The limitation will be addressed in the future,
  however QEMU allocates Guest RAM with MADV_HUGEPAGE so if there are enough huge
  pages available, QEMU will use them. QEMU will fail to init if the requirements
  are not met.
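
To put the memory registration figure in perspective, the short Python
calculation below works out what 0.5 seconds per 1GB implies per page. It
assumes a 4 KiB page size and reads "1GB" as 2^30 bytes; neither value is
stated in this document.

```python
PAGE_SIZE = 4096          # assumption: 4 KiB pages
GIB = 1 << 30             # assumption: "1GB" means 2**30 bytes
REG_TIME_S = 0.5          # upper bound quoted in the limitation above


def pages_in(size_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages (and therefore mremap calls) needed for a buffer."""
    return -(-size_bytes // page_size)   # ceiling division


n_pages = pages_in(GIB)
per_page_us = REG_TIME_S / n_pages * 1e6

if __name__ == "__main__":
    print(n_pages, round(per_page_us, 2))
```

Under these assumptions a 1GB registration needs 262144 mremap calls, which
works out to roughly 1.9 microseconds per page.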



6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small buffers
the performance is affected; however for medium buffers it becomes close to
bare metal and from 1MB buffers and up it reaches bare metal performance.
(tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device)

All the above assumes no memory registration is done on data path.

@@ -39,7 +39,7 @@ can be accessed by following steps.

 .. code-block:: bash

-  ./configure --disable-rdma --disable-pvrdma --prefix=/usr \
+  ./configure --disable-rdma --prefix=/usr \
               --target-list="loongarch64-softmmu" \
               --disable-libiscsi --disable-libnfs --disable-libpmem \
               --disable-glusterfs --enable-libusb --enable-usb-redir \