qemu/include/hw
Ankit Agrawal b64b7ed8bb qom: new object to associate device to NUMA node
NVIDIA GPU's support MIG (Mult-Instance GPUs) feature [1], which allows
partitioning of the GPU device resources (including device memory) into
several (upto 8) isolated instances. Each of the partitioned memory needs
a dedicated NUMA node to operate. The partitions are not fixed and they
can be created/deleted at runtime.

Unfortunately Linux OS does not provide a means to dynamically create/destroy
NUMA nodes and such feature implementation is not expected to be trivial. The
nodes that OS discovers at the boot time while parsing SRAT remains fixed. So
we utilize the Generic Initiator (GI) Affinity structures that allows
association between nodes and devices. Multiple GI structures per BDF is
possible, allowing creation of multiple nodes by exposing unique PXM in each
of these structures.

Implement the mechanism to build the GI affinity structures as Qemu currently
does not. Introduce a new acpi-generic-initiator object to allow host admin
link a device with an associated NUMA node. Qemu maintains this association
and use this object to build the requisite GI Affinity Structure.

When multiple NUMA nodes are associated with a device, it is required to
create those many number of acpi-generic-initiator objects, each representing
a unique device:node association.

Following is one of a decoded GI affinity structure in VM ACPI SRAT.
[0C8h 0200   1]                Subtable Type : 05 [Generic Initiator Affinity]
[0C9h 0201   1]                       Length : 20

[0CAh 0202   1]                    Reserved1 : 00
[0CBh 0203   1]           Device Handle Type : 01
[0CCh 0204   4]             Proximity Domain : 00000007
[0D0h 0208  16]                Device Handle : 00 00 20 00 00 00 00 00 00 00 00
00 00 00 00 00
[0E0h 0224   4]        Flags (decoded below) : 00000001
                                     Enabled : 1
[0E4h 0228   4]                    Reserved2 : 00000000

[0E8h 0232   1]                Subtable Type : 05 [Generic Initiator Affinity]
[0E9h 0233   1]                       Length : 20

An admin can provide a range of acpi-generic-initiator objects, each
associating a device (by providing the id through pci-dev argument)
to the desired NUMA node (using the node argument). Currently, only PCI
device is supported.

For the grace hopper system, create a range of 8 nodes and associate that
with the device using the acpi-generic-initiator object. While a configuration
of less than 8 nodes per device is allowed, such configuration will prevent
utilization of the feature to the fullest. The following sample creates 8
nodes per PCI device for a VM with 2 PCI devices and link them to the
respecitve PCI device using acpi-generic-initiator objects:

-numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
-numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
-numa node,nodeid=8 -numa node,nodeid=9 \
-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
-object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
-object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
-object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
-object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
-object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
-object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
-object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
-object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \

-numa node,nodeid=10 -numa node,nodeid=11 -numa node,nodeid=12 \
-numa node,nodeid=13 -numa node,nodeid=14 -numa node,nodeid=15 \
-numa node,nodeid=16 -numa node,nodeid=17 \
-device vfio-pci-nohotplug,host=0009:01:01.0,bus=pcie.0,addr=05.0,rombar=0,id=dev1 \
-object acpi-generic-initiator,id=gi8,pci-dev=dev1,node=10 \
-object acpi-generic-initiator,id=gi9,pci-dev=dev1,node=11 \
-object acpi-generic-initiator,id=gi10,pci-dev=dev1,node=12 \
-object acpi-generic-initiator,id=gi11,pci-dev=dev1,node=13 \
-object acpi-generic-initiator,id=gi12,pci-dev=dev1,node=14 \
-object acpi-generic-initiator,id=gi13,pci-dev=dev1,node=15 \
-object acpi-generic-initiator,id=gi14,pci-dev=dev1,node=16 \
-object acpi-generic-initiator,id=gi15,pci-dev=dev1,node=17 \

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu [1]
Cc: Jonathan Cameron <qemu-devel@nongnu.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Message-Id: <20240308145525.10886-2-ankita@nvidia.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12 17:56:55 -04:00
..
acpi qom: new object to associate device to NUMA node 2024-03-12 17:56:55 -04:00
adc hw/arm/npcm7xx: Declare QOM macros using OBJECT_DECLARE_SIMPLE_TYPE() 2023-01-12 17:15:09 +00:00
arm hw/arm: Connect STM32L4x5 GPIO to STM32L4x5 SoC 2024-03-07 12:19:25 +00:00
audio hw/audio/virtio-sound: return correct command response size 2024-03-12 17:56:55 -04:00
block hw/block/fdc-isa: Implement relocation and enabling/disabling for TYPE_ISA_FDC 2024-02-14 06:09:32 -05:00
char hw/sparc/grlib: split out the headers for each peripherals 2024-02-15 16:58:46 +01:00
core accel/tcg: Add tlb_fill_flags to CPUTLBEntryFull 2024-03-05 13:22:56 +00:00
cpu Use OBJECT_DECLARE_SIMPLE_TYPE when possible 2020-09-18 14:12:32 -04:00
cris hw/net/etraxfs-eth: use qemu_configure_nic_device() 2024-02-02 16:23:47 +00:00
cxl hw/pci-bridge/pxb-cxl: Drop RAS capability from host bridge. 2024-03-12 17:56:55 -04:00
display hw/arm: Introduce Raspberry PI 4 machine 2024-02-27 13:01:42 +00:00
dma hw/dma: Pass parent object to i8257_dma_init() 2024-02-15 16:58:46 +01:00
firmware Implement SMBIOS type 9 v2.6 2024-03-12 17:56:55 -04:00
fsi hw/fsi: Aspeed APB2OPB & On-chip peripheral bus 2024-02-01 08:33:18 +01:00
gpio hw/arm: Connect STM32L4x5 GPIO to STM32L4x5 SoC 2024-03-07 12:19:25 +00:00
hyperv vmbus: Print a warning when enabled without the recommended set of features 2024-03-08 14:18:56 +01:00
i2c hw/i2c: Implement Broadcom Serial Controller (BSC) 2024-03-05 13:22:55 +00:00
i386 hw/i386/pc: Inline pc_cmos_init() into pc_cmos_init_late() and remove it 2024-03-12 17:56:55 -04:00
ide ide, vl: turn -win2k-hack into a property on IDE devices 2024-02-28 00:23:39 +01:00
input hw/input/pckbd: Open-code i8042_setup_a20_line() wrapper 2024-02-22 12:47:35 +01:00
intc hw/intc/grlib_irqmp: implements multicore irq 2024-02-15 16:58:46 +01:00
ipack ipack: Rename ipack_bus_new_inplace() to ipack_bus_init() 2021-09-30 13:42:10 +01:00
ipmi Use OBJECT_DECLARE_SIMPLE_TYPE when possible 2020-09-18 14:12:32 -04:00
isa hw/isa/vt82c686: Bring back via_isa_set_irq() 2023-11-28 14:26:37 +01:00
loongarch loongarch: Change the UEFI loading mode to loongarch 2024-02-29 19:32:45 +08:00
m68k m68k: Clean up includes 2024-01-30 21:20:20 +03:00
mem include: Clean up includes 2024-01-30 21:20:20 +03:00
mips hw/mips: Inline 'bios.h' definitions 2024-01-05 16:20:15 +01:00
misc hw/arm: Connect STM32L4x5 GPIO to STM32L4x5 SoC 2024-03-07 12:19:25 +00:00
net hw/net/npcm_gmac.h: correct typos 2024-02-21 08:16:43 +03:00
nubus hw/nubus: add nubus-virtio-mmio device 2024-02-27 09:36:39 +01:00
nvram acpi: Clean up includes 2024-01-30 21:20:20 +03:00
openrisc hw/openrisc: Split re-usable boot time apis out to boot.c 2022-09-04 07:02:56 +01:00
pci pcie_sriov: Reset SR-IOV extended capability 2024-03-12 17:56:55 -04:00
pci-bridge hw/cxl: Add a switch mailbox CCI function 2023-11-07 03:39:11 -05:00
pci-host hw/ppc/ppc4xx_pci: Extract PCI host definitions to hw/pci-host/ppc4xx.h 2024-02-22 12:47:40 +01:00
ppc ppc/pnv: Implement the ChipTOD to Core transfer 2024-02-23 23:24:43 +10:00
rdma qapi: introduce x-query-rdma QMP command 2021-11-02 15:55:14 +00:00
remote include/hw/pci: Split pci_device.h off pci.h 2023-01-08 01:54:22 -05:00
riscv hw/riscv/virt.h: correct typos 2024-02-21 08:16:43 +03:00
rtc hw/rtc/sun4v-rtc: Relicense to GPLv2-or-later 2024-03-07 12:54:56 +00:00
rx hw/rx/rx62n: Only call qdev_get_gpio_in() when necessary 2024-02-15 16:58:46 +01:00
s390x s390x/pci: drive ISM reset from subsystem reset 2024-01-19 11:38:32 +01:00
scsi esp.c: keep track of the DRQ state during DMA 2024-02-13 19:37:28 +00:00
sd hw/sd: Introduce a "sd-card" SPI variant model 2023-09-01 11:40:04 +02:00
sensor hw/sensor: Add IC_DEVICE_ID to ISL voltage regulators 2022-07-14 16:24:38 +02:00
sh4 hw/intc/sh_intc: Inline and drop sh_intc_source() function 2021-10-30 18:39:37 +02:00
southbridge hw/isa/piix: Allow for optional PIT creation in PIIX3 2023-10-22 05:18:17 -04:00
sparc hw/sparc/grlib: split out the headers for each peripherals 2024-02-15 16:58:46 +01:00
ssi hw/ssi: Implement BCM2835 SPI Controller 2024-02-02 13:51:59 +00:00
timer hw/timer: Move HPET_INTCAP definition to "hpet.h" 2024-02-20 20:34:21 +03:00
tricore hw/tricore/testboard: Use qdev_new() instead of QOM basic API 2024-02-22 12:47:40 +01:00
usb include: Include headers where needed 2023-01-08 01:54:22 -05:00
vfio migration: convert to NotifierWithReturn 2024-02-28 11:31:28 +08:00
virtio hw/virtio: Add support for VDPA network simulation devices 2024-03-12 17:56:55 -04:00
watchdog hw/watchdog: Allwinner WDT emulation for system reset 2023-04-20 10:21:13 +01:00
xen hw/xen: Extract 'xen_igd.h' from 'xen_pt.h' 2024-03-09 18:51:45 +01:00
xtensa Include hw/irq.h a lot less 2019-08-16 13:31:52 +02:00
boards.h hw/core: Add machine_class_default_cpu_type() 2024-01-05 16:20:14 +01:00
clock.h include/: spelling fixes 2023-09-08 13:08:52 +03:00
elf_ops.h Revert "hw/elf_ops: Ignore loadable segments with zero size" 2024-02-09 17:52:20 +00:00
fw-path-provider.h Use DECLARE_*CHECKER* macros 2020-09-09 09:27:09 -04:00
hotplug.h pci: fix 'hotplugglable' property behavior 2023-03-07 12:38:59 -05:00
hw.h compiler.h: replace QEMU_NORETURN with G_NORETURN 2022-04-21 17:03:51 +04:00
irq.h hw/core/irq: remove unused 'qemu_irq_split' function 2022-04-21 11:37:04 +01:00
loader-fit.h nomaintainer: Fix Lesser GPL version number 2020-11-15 17:04:40 +01:00
loader.h hw/loader: Clean up global variable shadowing in rom_add_file() 2023-11-07 13:08:48 +01:00
nmi.h Use DECLARE_*CHECKER* macros 2020-09-09 09:27:09 -04:00
or-irq.h hw: Replace qemu_or_irq typedef by OrIRQState 2023-02-27 13:27:05 +00:00
pcmcia.h replace TABs with spaces 2023-03-20 12:43:50 +01:00
platform-bus.h nomaintainer: Fix Lesser GPL version number 2020-11-15 17:04:40 +01:00
ptimer.h ptimer: Rename PTIMER_POLICY_DEFAULT to PTIMER_POLICY_LEGACY 2022-05-19 16:19:03 +01:00
qdev-clock.h clock: Add ClockEvent parameter to callbacks 2021-03-08 17:20:01 +00:00
qdev-core.h oslib-posix: initialize backend memory objects in parallel 2024-02-06 08:15:22 +01:00
qdev-dma.h Supply missing header guards 2019-06-12 13:20:21 +02:00
qdev-properties-system.h qdev: Add a granule_mode property 2024-03-09 19:17:01 +01:00
qdev-properties.h qdev-properties: alias all object class properties 2023-12-21 22:49:28 +01:00
register.h hw/core/register: Add more 64-bit utilities 2021-09-01 11:59:12 +10:00
registerfields.h hw/registerfields: Add shared fields macros 2022-06-22 09:49:34 +02:00
resettable.h Use DECLARE_*CHECKER* macros 2020-09-09 09:27:09 -04:00
stream.h hw/core/stream: Rename StreamSlave as StreamSink 2020-12-10 12:15:04 -05:00
sysbus.h hw/sysbus: Remove now unused sysbus_address_space() 2024-02-26 18:40:21 +01:00
usb.h hw/usb: remove usb_bus_find 2024-02-27 09:37:21 +01:00
vmstate-if.h Use DECLARE_*CHECKER* macros 2020-09-09 09:27:09 -04:00