qemu/include
Ankit Agrawal b64b7ed8bb qom: new object to associate device to NUMA node
NVIDIA GPU's support MIG (Mult-Instance GPUs) feature [1], which allows
partitioning of the GPU device resources (including device memory) into
several (upto 8) isolated instances. Each of the partitioned memory needs
a dedicated NUMA node to operate. The partitions are not fixed and they
can be created/deleted at runtime.

Unfortunately Linux OS does not provide a means to dynamically create/destroy
NUMA nodes and such feature implementation is not expected to be trivial. The
nodes that OS discovers at the boot time while parsing SRAT remains fixed. So
we utilize the Generic Initiator (GI) Affinity structures that allows
association between nodes and devices. Multiple GI structures per BDF is
possible, allowing creation of multiple nodes by exposing unique PXM in each
of these structures.

Implement the mechanism to build the GI affinity structures as Qemu currently
does not. Introduce a new acpi-generic-initiator object to allow host admin
link a device with an associated NUMA node. Qemu maintains this association
and use this object to build the requisite GI Affinity Structure.

When multiple NUMA nodes are associated with a device, it is required to
create those many number of acpi-generic-initiator objects, each representing
a unique device:node association.

Following is one of a decoded GI affinity structure in VM ACPI SRAT.
[0C8h 0200   1]                Subtable Type : 05 [Generic Initiator Affinity]
[0C9h 0201   1]                       Length : 20

[0CAh 0202   1]                    Reserved1 : 00
[0CBh 0203   1]           Device Handle Type : 01
[0CCh 0204   4]             Proximity Domain : 00000007
[0D0h 0208  16]                Device Handle : 00 00 20 00 00 00 00 00 00 00 00
00 00 00 00 00
[0E0h 0224   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[0E4h 0228   4]                    Reserved2 : 00000000

[0E8h 0232   1]                Subtable Type : 05 [Generic Initiator Affinity]
[0E9h 0233   1]                       Length : 20

An admin can provide a range of acpi-generic-initiator objects, each
associating a device (by providing the id through pci-dev argument)
to the desired NUMA node (using the node argument). Currently, only PCI
device is supported.

For the grace hopper system, create a range of 8 nodes and associate that
with the device using the acpi-generic-initiator object. While a configuration
of less than 8 nodes per device is allowed, such configuration will prevent
utilization of the feature to the fullest. The following sample creates 8
nodes per PCI device for a VM with 2 PCI devices and link them to the
respecitve PCI device using acpi-generic-initiator objects:

-numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
-numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
-numa node,nodeid=8 -numa node,nodeid=9 \
-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
-object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
-object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
-object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
-object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
-object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
-object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
-object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
-object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \

-numa node,nodeid=10 -numa node,nodeid=11 -numa node,nodeid=12 \
-numa node,nodeid=13 -numa node,nodeid=14 -numa node,nodeid=15 \
-numa node,nodeid=16 -numa node,nodeid=17 \
-device vfio-pci-nohotplug,host=0009:01:01.0,bus=pcie.0,addr=05.0,rombar=0,id=dev1 \
-object acpi-generic-initiator,id=gi8,pci-dev=dev1,node=10 \
-object acpi-generic-initiator,id=gi9,pci-dev=dev1,node=11 \
-object acpi-generic-initiator,id=gi10,pci-dev=dev1,node=12 \
-object acpi-generic-initiator,id=gi11,pci-dev=dev1,node=13 \
-object acpi-generic-initiator,id=gi12,pci-dev=dev1,node=14 \
-object acpi-generic-initiator,id=gi13,pci-dev=dev1,node=15 \
-object acpi-generic-initiator,id=gi14,pci-dev=dev1,node=16 \
-object acpi-generic-initiator,id=gi15,pci-dev=dev1,node=17 \

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu [1]
Cc: Jonathan Cameron <qemu-devel@nongnu.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Message-Id: <20240308145525.10886-2-ankita@nvidia.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12 17:56:55 -04:00
..
authz Prefer 'on' | 'off' over 'yes' | 'no' for bool options 2021-01-29 17:07:53 +00:00
block virtio: Re-enable notifications after drain 2024-02-07 21:51:03 +01:00
chardev chardev: use bool for fe_is_open 2024-01-12 13:23:48 +00:00
crypto crypto: Modify the qcrypto_block_create to support creation flags 2024-02-09 12:50:37 +00:00
disas disas: introduce show_opcodes 2024-03-06 12:35:51 +00:00
exec target-arm queue: 2024-03-05 13:54:54 +00:00
fpu fpu: Add conversions between bfloat16 and [u]int8 2023-09-16 14:57:15 +00:00
gdbstub gdbstub: Call gdbserver_fork() both in parent and in child 2024-03-06 12:35:19 +00:00
hw qom: new object to associate device to NUMA node 2024-03-12 17:56:55 -04:00
io io: Add generic pwritev/preadv interface 2024-03-01 15:42:04 +08:00
libdecnumber Replace config-time define HOST_WORDS_BIGENDIAN 2022-04-06 10:50:37 +02:00
migration migration/qemu-file: add utility methods for working with seekable channels 2024-03-01 15:42:04 +08:00
monitor monitor: add more *_locked() functions 2023-05-25 10:18:33 +02:00
net qapi: Move @String out of common.json to discourage reuse 2024-02-12 10:04:32 +01:00
qapi qerror: QERR_DEVICE_IN_USE is no longer used, drop 2024-03-09 18:56:37 +03:00
qemu plugins: remove non per_vcpu inline operation from API 2024-03-06 12:35:46 +00:00
qom include/qom/object.h: New OBJECT_DEFINE_SIMPLE_TYPE{, _WITH_INTERFACES} macros 2024-02-27 13:01:42 +00:00
scsi hw/ufs: Support for UFS logical unit 2023-09-07 14:01:29 -04:00
semihosting * util/log: re-allow switching away from stderr log file 2023-10-09 10:11:18 -04:00
standard-headers hw/virtio: Add support for VDPA network simulation devices 2024-03-12 17:56:55 -04:00
sysemu sysemu/xen-mapcache: Check Xen availability with CONFIG_XEN_IS_POSSIBLE 2024-03-09 18:51:45 +01:00
tcg tcg: Increase width of temp_subindex 2024-02-13 07:42:45 -10:00
ui include: Clean up includes 2024-01-30 21:20:20 +03:00
user {linux,bsd}-user: Introduce get_task_state() 2024-03-06 12:35:19 +00:00
elf.h util: spelling fixes 2023-08-31 19:47:43 +02:00
glib-compat.h compiler.h: replace QEMU_NORETURN with G_NORETURN 2022-04-21 17:03:51 +04:00
qemu-io.h Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
qemu-main.h ui/cocoa: Run qemu_init in the main thread 2022-09-23 14:36:33 +02:00