mirror of
https://github.com/Motorhead1991/qemu.git
synced 2025-08-05 00:33:55 -06:00
virtio,pc,pci: features, cleanups, fixes
vhost-user enabled on non-linux systems
beginning of nvme sriov support
bigger tx queue for vdpa
virtio iommu bypass
FADT flag to detect legacy keyboards
Fixes, cleanups all over the place

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----

iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmImipMPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpD5AH/jz73VVDE3dZTtsdEH/f2tuO8uosur9fIjHJ
nCMwBoosdDWmrWjrwxynmG6e+qIcOHEGdTInvS1TY2OTU+elNNTiR57pWiljXRsJ
2kNIXKp4dXaYI/bxmKUzKSoVscyWxL686ND4U8sZhuppSNrWpLmMUNgwqmYjQQLV
yd2JpIKgZYnzShPnJMDtF3ItcCHetY6jeB28WAclKywIEuCTmjulYCTaH5ujroG9
rykMaQIjoe/isdmCcBx05UuMxH61vf5L8pR06N6e3GO9T2/Y/hWuteVoEJaCQvNa
+zIyL2hOjGuMKr+icLo9c42s3yfwWNsRfz87wqdAY47yYSyc1wo=
=3NVe
-----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging

# gpg: Signature made Mon 07 Mar 2022 22:43:31 GMT
# gpg: using RSA key 5D09FD0871C8F85B94CA8A0D281F0DB8D28D5469
# gpg: issuer "mst@redhat.com"
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" [full]
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>" [full]
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469

* remotes/mst/tags/for_upstream: (47 commits)
  hw/acpi/microvm: turn on 8042 bit in FADT boot architecture flags if present
  tests/acpi: i386: update FACP table differences
  hw/acpi: add indication for i8042 in IA-PC boot flags of the FADT table
  tests/acpi: i386: allow FACP acpi table changes
  docs: vhost-user: add subsection for non-Linux platforms
  configure, meson: allow enabling vhost-user on all POSIX systems
  vhost: use wfd on functions setting vring call fd
  event_notifier: add event_notifier_get_wfd()
  pci: drop COMPAT_PROP_PCP for 2.0 machine types
  hw/smbios: Add table 4 parameter, "processor-id"
  x86: cleanup unused compat_apic_id_mode
  vhost-vsock: detach the virqueue element in case of error
  pc: add option to disable PS/2 mouse/keyboard
  acpi: pcihp: pcie: set power on cap on parent slot
  pci: expose TYPE_XIO3130_DOWNSTREAM name
  pci: show id info when pci BDF conflict
  hw/misc/pvpanic: Use standard headers instead
  headers: Add pvpanic.h
  pci-bridge/xio3130_downstream: Fix error handling
  pci-bridge/xio3130_upstream: Fix error handling
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

# Conflicts:
# docs/specs/index.rst
commit
9f0369efb0
66 changed files with 1229 additions and 174 deletions
@@ -324,6 +324,14 @@ machine is hardly emulated at all (e.g. neither the LCD nor the USB part had
been implemented), so there is not much value added by this board. Use the
``ref405ep`` machine instead.

``pc-i440fx-1.4`` up to ``pc-i440fx-1.7`` (since 7.0)
'''''''''''''''''''''''''''''''''''''''''''''''''''''

These old machine types are quite neglected nowadays and thus might have
various pitfalls with regard to live migration. Use a newer machine type
instead.


Backend options
---------------

@@ -38,6 +38,26 @@ conventions <backend_conventions>`.
*Master* and *slave* can be either a client (i.e. connecting) or
server (listening) in the socket communication.

Support for platforms other than Linux
--------------------------------------

While vhost-user was initially developed targeting Linux, nowadays it
is supported on any platform that provides the following features:

- A way to request shared memory represented by a file descriptor,
  so it can be passed over a UNIX domain socket and then mapped by the
  other process.

- AF_UNIX sockets with SCM_RIGHTS, so QEMU and the other process can
  exchange messages through them, including ancillary data when needed.

- Either eventfd or pipe/pipe2. On platforms where eventfd is not
  available, QEMU will automatically fall back to pipe2 or, as a last
  resort, pipe. Each file descriptor will be used for receiving or
  sending events by reading or writing (respectively) an 8-byte value
  from or to the corresponding descriptor. The 8-byte value itself has
  no meaning and should not be interpreted.
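
The event mechanism in the last item can be illustrated with a minimal Python sketch. It uses a plain pipe (the last-resort fallback named above) and is not QEMU's implementation; the helper names ``make_event_channel``, ``signal_event`` and ``wait_event`` are hypothetical and exist only for this illustration.

```python
import os
import struct

EVENT_SIZE = 8  # the text above specifies an 8-byte value


def make_event_channel():
    """Return (read_fd, write_fd) backed by pipe(), the last-resort fallback."""
    return os.pipe()


def signal_event(write_fd):
    """Send an event by writing one 8-byte value; its content carries no meaning."""
    os.write(write_fd, struct.pack("=Q", 1))


def wait_event(read_fd):
    """Receive an event by reading 8 bytes; the value is deliberately ignored."""
    data = b""
    while len(data) < EVENT_SIZE:
        chunk = os.read(read_fd, EVENT_SIZE - len(data))
        if not chunk:
            raise EOFError("event channel closed")
        data += chunk
    return len(data)
```

The reader only checks that 8 bytes arrived and never inspects them, which matches the rule that the value should not be interpreted.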

Message Specification
=====================

115
docs/pcie_sriov.txt
Normal file
@@ -0,0 +1,115 @@
PCI SR/IOV EMULATION SUPPORT
============================

Description
===========
SR/IOV (Single Root I/O Virtualization) is an optional extended capability
of a PCI Express device. It allows a single physical function (PF) to appear as multiple
virtual functions (VFs) for the main purpose of eliminating software
overhead in I/O from virtual machines.

QEMU now implements the basic common functionality to enable an emulated device
to support SR/IOV. No fully implemented device exists in QEMU yet, but a
proof-of-concept hack of the Intel igb can be found here:

git://github.com/knuto/qemu.git sriov_patches_v5

Implementation
==============
Implementing emulation of an SR/IOV capable device typically consists of
implementing support for two device classes: the "normal" physical device
(PF) and the virtual device (VF). From QEMU's perspective, the VFs are just
like other devices, except that some of their properties are derived from
the PF.

A virtual function is different from a physical function in that the BAR
space for all VFs is defined by the BAR registers in the PF's SR/IOV
capability. All VFs have the same BARs and BAR sizes.

The address of an access to one of these virtual BARs is then computed as

   <VF BAR start> + <VF number> * <BAR sz> + <offset>

From our emulation perspective this means that there is a separate call for
setting up a BAR for a VF.
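
As a worked example of the address formula above, here is a small Python sketch. The function name and the range check on the offset are assumptions made for illustration; they are not part of the QEMU API.

```python
def vf_bar_address(vf_bar_start, vf_number, bar_size, offset):
    """Compute <VF BAR start> + <VF number> * <BAR sz> + <offset>.

    Hypothetical helper: all VFs share the same BAR size, so VF n's copy
    of the BAR starts n * bar_size bytes after the start of the VF BAR.
    """
    if vf_number < 0:
        raise ValueError("VF number must be non-negative")
    if not 0 <= offset < bar_size:
        raise ValueError("offset must fall inside the BAR")
    return vf_bar_start + vf_number * bar_size + offset


# VF 2 with 16 KiB BARs: 0xfe000000 + 2 * 0x4000 + 0x10
print(hex(vf_bar_address(0xFE000000, 2, 0x4000, 0x10)))
```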

1) To enable SR/IOV support in the PF, it must be a PCI Express device so
   you would need to add a PCI Express capability in the normal PCI
   capability list. You might also want to add an ARI (Alternative
   Routing-ID Interpretation) capability to indicate that your device
   supports functions beyond its "own" function space (0-7),
   which is necessary to support more than 7 functions, or
   if functions extend beyond offset 7 because they are placed at an
   offset > 1 or have stride > 1.

   ...
   #include "hw/pci/pcie.h"
   #include "hw/pci/pcie_sriov.h"

   pci_your_pf_dev_realize( ... )
   {
      ...
      int ret = pcie_endpoint_cap_init(d, 0x70);
      ...
      pcie_ari_init(d, 0x100, 1);
      ...

      /* Add and initialize the SR/IOV capability */
      pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
                         vf_devid, initial_vfs, total_vfs,
                         fun_offset, stride);

      /* Set up individual VF BARs (parameters as for normal BARs) */
      pcie_sriov_pf_init_vf_bar( ... )
      ...
   }

   For cleanup, you simply call:

      pcie_sriov_pf_exit(device);

   which will delete all the virtual functions and associated resources.

2) Similarly, in the implementation of the virtual function, you need to
   make it a PCI Express device and add a similar set of capabilities
   except for the SR/IOV capability. Then you need to set up the VF BARs as
   subregions of the PF's SR/IOV VF BARs by calling
   pcie_sriov_vf_register_bar() instead of the normal pci_register_bar() call:

   pci_your_vf_dev_realize( ... )
   {
      ...
      int ret = pcie_endpoint_cap_init(d, 0x60);
      ...
      pcie_ari_init(d, 0x100, 1);
      ...
      memory_region_init(mr, ... )
      pcie_sriov_vf_register_bar(d, bar_nr, mr);
      ...
   }

Testing on Linux guest
======================
The easiest way is to use a device driver that supports sysfs-based SR/IOV
enabling. Support for this was added in kernel v3.8, so not all drivers
support it yet.

To enable 4 VFs for a device at 01:00.0:

   modprobe yourdriver
   echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

You should now see 4 VFs with lspci.
To turn SR/IOV off again (the standard requires you to turn it off before you can enable
another VF count, and the emulation enforces this):

   echo 0 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

Older drivers typically provide a max_vfs module parameter
to enable it at load time:

   modprobe yourdriver max_vfs=4

To disable the VFs again, simply unload the driver:

   rmmod yourdriver
200
docs/specs/acpi_erst.rst
Normal file
@@ -0,0 +1,200 @@
ACPI ERST DEVICE
================

The ACPI ERST device is utilized to support the ACPI Error Record
Serialization Table, ERST, functionality. This feature is designed for
storing error records in persistent storage for future reference
and/or debugging.

The ACPI specification[1], in Chapter "ACPI Platform Error Interfaces
(APEI)", and specifically subsection "Error Serialization", outlines a
method for storing error records into persistent storage.

The format of error records is described in the UEFI specification[2],
in Appendix N "Common Platform Error Record".

While the ACPI specification allows for an NVRAM "mode" (see
GET_ERROR_LOG_ADDRESS_RANGE_ATTRIBUTES) where non-volatile RAM is
exposed for direct access by the OS/guest, this device
implements the non-NVRAM "mode". This non-NVRAM "mode" is what is
implemented by most BIOSes (since flash memory requires programming
operations in order to update its contents). Furthermore, as of the
time of this writing, Linux only supports the non-NVRAM "mode".


Background/Motivation
---------------------

Linux uses the persistent storage filesystem, pstore, to record
information (e.g. dmesg tail) upon panics and shutdowns. Pstore is
independent of, and runs before, kdump. In certain scenarios (i.e.
hosts/guests with root filesystems on NFS/iSCSI where networking
software and/or hardware fails, and thus kdump fails), pstore may
contain information available for post-mortem debugging.

Two common storage backends for the pstore filesystem are ACPI ERST
and UEFI. Most BIOSes implement ACPI ERST. UEFI is not utilized in all
guests. With QEMU supporting ACPI ERST, it becomes a viable pstore
storage backend for virtual machines (as it is now for bare metal
machines).

Enabling support for ACPI ERST facilitates a consistent method to
capture kernel panic information in a wide range of guests: from
resource-constrained microvms to very large guests, and in particular,
in direct-boot environments (which would lack UEFI run-time services).

Note that Microsoft Windows also utilizes the ACPI ERST for certain
crash information, if available[3].


Configuration|Usage
-------------------

To use ACPI ERST, a memory-backend-file object and acpi-erst device
can be created, for example:

 qemu ... \
 -object memory-backend-file,id=erstnvram,mem-path=acpi-erst.backing,size=0x10000,share=on \
 -device acpi-erst,memdev=erstnvram

For proper operation, the ACPI ERST device needs a memory-backend-file
object with the following parameters:

 - id: The id of the memory-backend-file object is used to associate
   this memory with the acpi-erst device.
 - size: The size of the ACPI ERST backing storage. This parameter is
   required.
 - mem-path: The location of the ACPI ERST backing storage file. This
   parameter is also required.
 - share: The share=on parameter is required so that updates to the
   ERST backing store are written to the file.

The acpi-erst device itself takes the following parameters:

 - memdev: The object id of the memory-backend-file.
 - record_size: Specifies the size of the records (or slots) in the
   backend storage. Must be a power of two value greater than or
   equal to 4096 (PAGE_SIZE).
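
The record_size constraint stated above can be checked with a few lines of Python. This is a sketch of the rule as written in the list, not the device's own validation code; the function name is hypothetical.

```python
def is_valid_record_size(record_size, minimum=4096):
    """Return True if record_size is a power of two and at least `minimum`.

    A positive integer is a power of two exactly when it has a single
    bit set, i.e. when n & (n - 1) is zero.
    """
    return record_size >= minimum and (record_size & (record_size - 1)) == 0


for candidate in (2048, 4096, 6144, 8192):
    print(hex(candidate), is_valid_record_size(candidate))
```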


PCI Interface
-------------

The ERST device is a PCI device with two BARs, one for accessing the
programming registers, and the other for accessing the record exchange
buffer.

BAR0 contains the programming interface consisting of ACTION and VALUE
64-bit registers. By design, all ERST actions, operations and side
effects happen on the write to ACTION. Any data needed by the action must
be placed into VALUE prior to writing ACTION. Reading VALUE
simply returns the register contents, which may have been updated by a
previous ACTION.

BAR1 contains the 8KiB record exchange buffer, which is the
implemented maximum record size.


Backend Storage Format
----------------------

The backend storage is divided into fixed size "slots", 8KiB in
length, with each slot storing a single record. Not all slots need to
be occupied, and they need not be occupied in a contiguous fashion.
The ability to clear/erase specific records allows for the formation
of unoccupied slots.

Slot 0 contains a backend storage header that identifies the contents
as ERST and also facilitates efficient access to the records.
Depending upon the size of the backend storage, additional slots will
be designated to be a part of the slot 0 header. For example, at 8KiB,
the slot 0 header can accommodate 1021 records. Thus a storage size
of 8MiB (8KiB * 1024) requires an additional slot for use by the
header. In this scenario, slot 0 and slot 1 form the backend storage
header, and records can be stored starting at slot 2.
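
The arithmetic behind the 1021 figure and the extra header slot can be reproduced with a short Python sketch. It assumes a 0x18-byte fixed header followed by one 8-byte record_id per slot (header slots included), which is how the header layout in this document reads; the function names are hypothetical and this is not the device's actual code.

```python
HEADER_FIXED_BYTES = 0x18  # assumed: offset of record_id[0] in the header
RECORD_ID_BYTES = 8        # assumed: each record_id is a 64-bit value


def ids_in_first_slot(record_size=8192):
    """Number of record_id entries that fit in slot 0 after the fixed header."""
    return (record_size - HEADER_FIXED_BYTES) // RECORD_ID_BYTES


def header_slots(storage_size, record_size=8192):
    """Slots needed for the header when every slot has one record_id entry."""
    total_slots = storage_size // record_size
    header_bytes = HEADER_FIXED_BYTES + total_slots * RECORD_ID_BYTES
    return -(-header_bytes // record_size)  # ceiling division


print(ids_in_first_slot())              # entries available in an 8KiB slot 0
print(header_slots(8 * 1024 * 1024))    # 8MiB storage, i.e. 1024 slots
```

With 1024 slots the header needs 0x18 + 1024 * 8 = 8216 bytes, which no longer fits in a single 8KiB slot, so a second header slot is required.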

Below is an example layout of the backend storage format (for storage
size less than 8MiB). The size of the storage is a multiple of 8KiB,
and contains N slots to store records. The example below
shows two records (in CPER format) in the backend storage, while the
remaining slots are empty/available.

::

  Slot   Record
         <------------------ 8KiB -------------------->
         +--------------------------------------------+
    0    | storage header                             |
         +--------------------------------------------+
    1    | empty/available                            |
         +--------------------------------------------+
    2    | CPER                                       |
         +--------------------------------------------+
    3    | CPER                                       |
         +--------------------------------------------+
  ...    |                                            |
         +--------------------------------------------+
    N    | empty/available                            |
         +--------------------------------------------+

The storage header consists of some basic information and an array
of CPER record_id's to efficiently access records in the backend
storage.

All fields in the header are stored in little endian format.

::

  +--------------------------------------------+
  | magic                                      | 0x0000
  +--------------------------------------------+
  | record_offset        | record_size         | 0x0008
  +--------------------------------------------+
  | record_count         | reserved | version  | 0x0010
  +--------------------------------------------+
  | record_id[0]                               | 0x0018
  +--------------------------------------------+
  | record_id[1]                               | 0x0020
  +--------------------------------------------+
  | record_id[...]                             |
  +--------------------------------------------+
  | record_id[N]                               | 0x1FF8
  +--------------------------------------------+

The 'magic' field contains the value 0x524F545354535245.

The 'record_size' field contains the value 0x2000, 8KiB.

The 'record_offset' field points to the first record_id in the array,
0x0018.

The 'version' field contains 0x0100, the first version.

The 'record_count' field contains the number of valid records in the
backend storage.

The 'record_id' array fields are the 64-bit record identifiers of the
CPER record in the corresponding slot. Stated differently, the
location of a CPER record_id in the record_id[] array provides the
slot index for the corresponding record in the backend storage.

Note that, for example, with a backend storage less than 8MiB, slot 0
contains the header, so the record_id[0] will never contain a valid
CPER record_id. Instead slot 1 is the first available slot and thus
record_id[1] may contain a CPER record_id.

A 'record_id' of all 0s or all 1s indicates an invalid record (i.e. the
slot is available).
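
To make the header layout concrete, the following Python sketch decodes it with the standard ``struct`` module. One point is an assumption rather than something the diagram states outright: within each 8-byte row, the right-hand field is taken to sit at the lower byte offset (so record_size is at 0x08 and record_offset at 0x0C, version at 0x10 and record_count at 0x14). The function names are hypothetical.

```python
import struct

ERST_MAGIC = 0x524F545354535245
INVALID_IDS = (0x0, 0xFFFFFFFFFFFFFFFF)  # all 0s or all 1s: slot is available

# Assumed little-endian order: magic, record_size, record_offset,
# version, reserved, record_count (24 bytes, so record_id[0] is at 0x18).
_HEADER = struct.Struct("<QIIHHI")


def parse_header(buf):
    """Decode the fixed part of the storage header from the start of buf."""
    magic, record_size, record_offset, version, _reserved, record_count = (
        _HEADER.unpack_from(buf, 0)
    )
    if magic != ERST_MAGIC:
        raise ValueError("not an ERST backing store")
    return {
        "record_size": record_size,
        "record_offset": record_offset,
        "version": version,
        "record_count": record_count,
    }


def occupied_slots(buf, total_slots):
    """Return the slot indices whose record_id entry marks a valid record."""
    header = parse_header(buf)
    slots = []
    for slot in range(total_slots):
        offset = header["record_offset"] + 8 * slot
        (record_id,) = struct.unpack_from("<Q", buf, offset)
        if record_id not in INVALID_IDS:
            slots.append(slot)
    return slots
```

Because the position in the record_id[] array is the slot index, finding a record is a single array scan with no per-slot reads of the record data itself.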


References
----------

[1] "Advanced Configuration and Power Interface Specification",
    version 4.0, June 2009.

[2] "Unified Extensible Firmware Interface Specification",
    version 2.1, October 2008.

[3] "Windows Hardware Error Architecture", specifically
    "Error Record Persistence Mechanism".
@@ -18,4 +18,5 @@ guest hardware that is specific to QEMU.
   acpi_mem_hotplug
   acpi_pci_hotplug
   acpi_nvdimm
   acpi_erst
   sev-guest-firmware
@@ -65,6 +65,7 @@ PCI devices (other than virtio):
1b36:000f  mdpy (mdev sample device), linux/samples/vfio-mdev/mdpy.c
1b36:0010  PCIe NVMe device (-device nvme)
1b36:0011  PCI PVPanic device (-device pvpanic-pci)
1b36:0012  PCI ACPI ERST device (-device acpi-erst)

All these devices are documented in docs/specs.
