mirror of
https://github.com/Motorhead1991/qemu.git
synced 2025-07-27 04:13:53 -06:00
Merge tag 'for-upstream' of https://gitlab.com/bonzini/qemu into staging

* target/i386/kvm: support for reading RAPL MSRs using a helper program
* hpet: emulation improvements

# gpg: Signature made Tue 23 Jul 2024 03:19:58 AM AEST
# gpg:                using RSA key F13338574B662389866C7682BFFBD25F78C7AE83
# gpg:                issuer "pbonzini@redhat.com"
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" [full]
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>" [full]

* tag 'for-upstream' of https://gitlab.com/bonzini/qemu:
  hpet: avoid timer storms on periodic timers
  hpet: store full 64-bit target value of the counter
  hpet: accept 64-bit reads and writes
  hpet: place read-only bits directly in "new_val"
  hpet: remove unnecessary variable "index"
  hpet: ignore high bits of comparator in 32-bit mode
  hpet: fix and cleanup persistence of interrupt status
  Add support for RAPL MSRs in KVM/Qemu
  tools: build qemu-vmsr-helper
  qio: add support for SO_PEERCRED for socket channel
  target/i386: do not crash if microvm guest uses SGX CPUID leaves

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
commit 43f59bf765
23 changed files with 1995 additions and 185 deletions
@@ -34,3 +34,4 @@ guest hardware that is specific to QEMU.
    virt-ctlr
    vmcoreinfo
    vmgenid
+   rapl-msr
docs/specs/rapl-msr.rst (Normal file, 155 lines added)
@@ -0,0 +1,155 @@
================
RAPL MSR support
================

The RAPL (Running Average Power Limit) interface advertises the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).

The consumption is reported via MSRs (model-specific registers) such as
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64-bit
registers that represent the accumulated energy consumption in microjoules.

Thanks to the MSR filtering patch [#a]_, not all MSRs have to be handled by
KVM; some of them can now be handled by userspace (QEMU). With MSR filtering,
a list of MSRs is given to KVM when the VM is initialized, so that a callback
is put in place for accesses to them. This design relies solely on that
mechanism to handle the RAPL MSRs between guest and host.

At the moment the following MSRs are involved:

.. code:: C

  #define MSR_RAPL_POWER_UNIT             0x00000606
  #define MSR_PKG_POWER_LIMIT             0x00000610
  #define MSR_PKG_ENERGY_STATUS           0x00000611
  #define MSR_PKG_POWER_INFO              0x00000614

The ``*_POWER_UNIT``, ``*_POWER_LIMIT`` and ``*_POWER_INFO`` MSRs are part of
the RAPL spec. They specify the power limit of the package, provide the range
of its parameters (min power, max power, ...) and give the multiplier to apply
to the energy counter. These MSRs are populated once at start-up by reading
the host CPU MSRs and are given back to the guest 1:1 when requested.

MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of energy
consumed since the last time the register was cleared. Multiplying it by the
unit provided above gives the energy in microjoules. The counter only ever
increases, faster or slower depending on the consumption of the package, and
it is expected to overflow (wrap around) at some point.
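
The unit conversion and the wrap-around handling can be sketched as follows.
This is a minimal illustration, not QEMU's code: the function names are
hypothetical, and the bit layout (energy unit exponent in bits 12:8 of
MSR_RAPL_POWER_UNIT, a 32-bit counter field in MSR_PKG_ENERGY_STATUS) is an
assumption taken from the Intel SDM rather than from this document.

```c
#include <stdint.h>

/*
 * Assumption (Intel SDM): bits 12:8 of MSR_RAPL_POWER_UNIT hold the Energy
 * Status Unit (ESU); one counter increment is worth 1 / 2^ESU joules.
 */
static inline unsigned rapl_energy_shift(uint64_t power_unit_msr)
{
    return (unsigned)((power_unit_msr >> 8) & 0x1f);
}

/* Convert a number of raw counter increments to microjoules. */
static inline uint64_t rapl_raw_to_uj(uint64_t raw, unsigned esu)
{
    return (raw * 1000000ULL) >> esu;
}

/*
 * Difference between two samples of the counter. The arithmetic is done on
 * 32 bits (assumed counter width), so a single wrap-around between the two
 * samples still yields the right delta.
 */
static inline uint64_t rapl_delta(uint64_t before, uint64_t after)
{
    return (uint32_t)((uint32_t)after - (uint32_t)before);
}
```

With an exponent of 14, for example, one increment is worth about 61
microjoules, so 16384 increments correspond to one joule.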

Every core belonging to the same package that reads MSR_PKG_ENERGY_STATUS
(i.e. ``rdmsr 0x611``) retrieves the same value, which represents the energy
for the whole package. A core that belongs to PKG-0 cannot read the value of
PKG-1, and vice versa.

High level implementation
-------------------------

In order to update the value of the virtual MSR, a QEMU thread is created.
The thread is basically just an infinite loop that does the following:

1. Snapshot of the time metrics of all QEMU threads (time spent scheduled in
   userspace and system).

2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages on
   which the QEMU threads are running.

3. Sleep for 1 second. During this pause the vcpu and other non-vcpu threads
   do what they have to do, so the energy counter increases.

4. Repeat 1. and 2. and calculate the delta of every metric, representing the
   time spent scheduled for each QEMU thread *and* the energy spent by the
   packages during the pause.

5. Separate the vcpu threads from the non-vcpu threads.

6. Retrieve the topology of the Virtual Machine. This helps identify which
   vCPU is running on which virtual package.

7. The total energy spent by the non-vcpu threads is divided by the number
   of vcpu threads so that each vcpu thread gets an equal part of the
   energy spent by the QEMU workers.

8. Calculate the ratio of energy spent per vcpu thread.

9. Calculate the energy for each virtual package.

10. The virtual MSRs are updated for each virtual package. Each vCPU that
    belongs to the same package will return the same value when accessing the
    MSR.

11. Loop back to 1.
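
Steps 7 and 9 above can be sketched as follows. This is only an illustration
under stated assumptions: ``struct vcpu_sample`` and ``distribute_energy()``
are hypothetical names, not QEMU's actual data structures, and the per-vcpu
energy computed in step 8 is taken as an input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vcpu record; not QEMU's actual data structure. */
struct vcpu_sample {
    unsigned vpkg;       /* virtual package the vcpu belongs to */
    uint64_t energy_uj;  /* energy attributed to the vcpu thread (step 8) */
};

/*
 * Share the energy of the non-vcpu threads equally between the vcpus
 * (step 7), then sum the per-vcpu energy into each virtual package (step 9).
 * vpkg_energy_uj must have room for n_vpkg entries.
 */
static void distribute_energy(const struct vcpu_sample *vcpus, size_t n_vcpus,
                              uint64_t non_vcpu_energy_uj,
                              uint64_t *vpkg_energy_uj, size_t n_vpkg)
{
    size_t i;
    uint64_t share;

    for (i = 0; i < n_vpkg; i++) {
        vpkg_energy_uj[i] = 0;
    }
    if (n_vcpus == 0) {
        return;
    }

    /* Equal part of the worker energy for every vcpu thread. */
    share = non_vcpu_energy_uj / n_vcpus;

    for (i = 0; i < n_vcpus; i++) {
        if (vcpus[i].vpkg < n_vpkg) {
            vpkg_energy_uj[vcpus[i].vpkg] += vcpus[i].energy_uj + share;
        }
    }
}
```

The resulting per-package totals are what step 10 would add to the virtual
MSR_PKG_ENERGY_STATUS of each virtual package.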

Ratio calculation
-----------------

In Linux, a process has an execution time associated with it. The scheduler
accounts for this time in clock ticks. The number of clock ticks per second
can be obtained with ``sysconf(_SC_CLK_TCK)``; a typical value is 100. So a
core can run a process for at most 100 ticks per second. If a package has 4
cores, at most 400 ticks can be scheduled on all the cores of the package
during a period of 1 second.

The /proc/[pid]/stat [#b]_ file is a procfs file that gives the execution time
of the process whose process ID is [pid]. It reports the number of ticks the
process has been scheduled in userspace (utime) and in kernel space (stime).

By reading those metrics for a thread, one can calculate the ratio of time the
package has spent executing the thread.
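
A minimal sketch of extracting utime and stime from one line of a ``stat``
file is shown below. The function name ``parse_stat_ticks()`` is hypothetical
and this is not the parser QEMU uses; the field positions (utime is field 14,
stime is field 15) follow proc(5).

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract utime and stime (fields 14 and 15 of the proc(5) "stat" format)
 * from one line. Returns 0 on success, -1 if the line cannot be parsed.
 */
static int parse_stat_ticks(const char *line,
                            unsigned long long *utime,
                            unsigned long long *stime)
{
    /*
     * Field 2 (comm) is enclosed in parentheses and may itself contain
     * spaces or ')', so resume parsing after the last ')'.
     */
    const char *p = strrchr(line, ')');
    int n;

    if (!p) {
        return -1;
    }

    /* Skip fields 3 to 13, then read utime and stime. */
    n = sscanf(p + 1,
               " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %llu %llu",
               utime, stime);
    return n == 2 ? 0 : -1;
}
```

The per-thread files under /proc/[pid]/task/[tid]/stat use the same format, so
the same parsing applies when sampling individual threads.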

Example:

A 4-core package can schedule at most 400 ticks per second, with 100 ticks
per second per core. If a thread was scheduled for 100 ticks within one second
on this package, the thread has used 1/4 of the whole package. The energy
attributed to the thread on this package during that second is therefore 1/4
of the total energy spent by the package.
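
The same arithmetic can be written as a small function. This is a sketch
assuming a 1-second sampling interval; ``thread_energy_uj()`` and its
parameters are hypothetical names, and ``ticks_per_sec`` would typically come
from ``sysconf(_SC_CLK_TCK)``.

```c
#include <stdint.h>

/*
 * Energy attributed to one thread over a 1-second interval: the share of the
 * package's tick budget used by the thread, applied to the package energy.
 */
static uint64_t thread_energy_uj(uint64_t thread_ticks,
                                 unsigned cores_in_pkg,
                                 long ticks_per_sec,
                                 uint64_t pkg_energy_uj)
{
    /* Maximum number of ticks the package can schedule in one second. */
    uint64_t max_ticks = (uint64_t)cores_in_pkg * (uint64_t)ticks_per_sec;

    if (max_ticks == 0) {
        return 0;
    }
    /* Clamp so that the ratio never exceeds 1. */
    if (thread_ticks > max_ticks) {
        thread_ticks = max_ticks;
    }
    return pkg_energy_uj * thread_ticks / max_ticks;
}
```

For the example above, a thread scheduled for 100 ticks on a 4-core package
that spent 2,000,000 microjoules is attributed 500,000 microjoules.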

Usage
-----

Currently this feature only works on an Intel CPU that has the RAPL driver
loaded and available in sysfs. If not, QEMU fails at start-up.

This feature is activated with
``-accel kvm,rapl=true,rapl-helper-socket=/path/sock.sock``.

It is important that the socket path is the same as the one
:program:`qemu-vmsr-helper` is listening on.

qemu-vmsr-helper
----------------

qemu-vmsr-helper works very much like qemu-pr-helper. Instead of making
persistent reservations, qemu-vmsr-helper is there to work around the
mitigation for CVE-2020-8694, which removed user access to the RAPL MSR
attributes.

A socket connection is established between each QEMU process that has RAPL
MSR support activated and qemu-vmsr-helper. A systemd service and socket
activation unit are provided in
contrib/systemd/qemu-vmsr-helper.(service/socket).

The systemd socket uses mode 600, like contrib/systemd/qemu-pr-helper.socket.
The socket can be passed via SCM_RIGHTS by libvirt, or its permissions can be
changed (e.g. 660 and root:kvm on a Debian system). Libvirt could also start a
separate helper if needed. All in all, the policy is left to the user.

See the qemu-pr-helper documentation or manpage for further details.

Current Limitations
-------------------

- Works only on Intel host CPUs because AMD CPUs use different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
  moment.

References
----------

.. [#a] https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/
.. [#b] https://man7.org/linux/man-pages/man5/proc.5.html