With the default TCP stack configuration, it could be even 2 hours
before the connection times out due to the other side not being
reachable. However, in some cases, the application needs to be aware of
a connection issue much sooner.
This is the case, for example, for postcopy live migration. If there is
no traffic from the migration destination guest (server-side) to the
migration source guest (client-side), the destination keeps waiting for
pages indefinitely and does not switch to the postcopy-paused state.
This can happen, for example, if the destination QEMU instance is
started with the '-S' command line option and the machine is not started
yet, or if the machine is idle and produces no new page faults for
not-yet-migrated pages.
This patch introduces new inet socket parameters that control count,
idle period, and interval of TCP keep-alive packets before the
connection is considered broken. These parameters are available on
systems where the respective TCP socket options are defined, that
includes Linux, Windows, macOS, but not OpenBSD. Additionally, macOS
defines TCP_KEEPIDLE as TCP_KEEPALIVE instead, so the patch supplies its
own definition.
The default value for all is 0, which means the system configuration is
used.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Currently, the inet address parser cannot handle multiple options where
one is prefixed with the name of the other. For example, with the
'keep-alive-idle' option added, the current parser cannot parse
'127.0.0.1:5000,keep-alive-idle=60,keep-alive' correctly. Instead, it
fails with "error parsing 'keep-alive' flag '-idle=60,keep-alive'".
To resolve these issues, this patch rewrites the inet address parsing
using the QemuOpts parser, which the inet_parse_flag() function tries to
mimic. This new parser supports all previously supported options and on
top of that the 'numeric' flag is now also supported. The only
difference is, the new parser produces an error if an unknown option is
passed, instead of silently ignoring it.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Commit aec21d3175 (qapi: Add InetSocketAddress member keep-alive)
introduces the keep-alive flag, which enables the SO_KEEPALIVE socket
option, but only on client-side sockets. However, this option is also
useful for server-side sockets, so they can check if a client is still
reachable or drop the connection otherwise.
This patch enables the SO_KEEPALIVE socket option on passive server-side
sockets if the keep-alive flag is enabled. This socket option is then
inherited by active server-side sockets communicating with connected
clients.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
To get a listening socket, we need to first create a socket, try binding
it to a certain port, and lastly starting listening to it. Each of these
operations can fail due to various reasons, one of them being that the
requested address/port is already in use. In such case, the function
tries the same process with a new port number.
This patch refactors the port number loop, so the success path is no
longer buried inside the 'if' statements in the middle of the loop. Now,
the success path is not nested and ends at the end of the iteration
after successful socket creation, binding, and listening. In case any of
the operations fails, it either continues to the next iteration (and the
next port) or jumps out of the loop to handle the error and exits the
function.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
This is done in preparation for enabling the SO_KEEPALIVE support for
server sockets and adding settings for more TCP keep-alive socket
options.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Starting with z15 (or newer) we can execute mmio
instructions from userspace. On older platforms
where we don't have these instructions available
we can fallback to using system calls to access
the PCI mapped resources.
This patch adds helper functions for mmio reads
and writes for s390x.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
Acked-by: Thomas Huth <thuth@redhat.com>
Message-id: 20250430185012.2303-2-alifm@linux.ibm.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Emscripten does not support couroutine methods currently used by QEMU but
provides a coroutine implementation called "fiber". This commit introduces a
coroutine backend using fiber. Note that fiber does not support submitting
coroutines to other threads.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/006b683fd578ed6303a2dc8679094da9a7e6dfb4.1745820062.git.ktokunaga.mail@gmail.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Emscripten does not support partial unmapping of mmapped memory
regions[1]. This limitation prevents correct implementation of qemu_ram_mmap
and qemu_ram_munmap, which rely on partial unmap behavior.
As a workaround, this commit excludes mmap-alloc.c from the Emscripten
build. Instead, for Emscripten build, this modifies qemu_anon_ram_alloc to
use qemu_memalign in place of qemu_ram_mmap, and disable memory backends
that rely on mmap, such as memory-backend-file and memory-backend-shm.
[1] d4a74336f2/system/lib/libc/emscripten_mmap.c (L61)
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Link: https://lore.kernel.org/r/76834f933ee4f14eeb5289d21c59d306886e58e9.1745820062.git.ktokunaga.mail@gmail.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Although __builtin___clear_cache is used to flush the instruction cache for
a specified memory region, this operation doesn't apply to wasm, as its
memory isn't executable. Moreover, Emscripten does not support this builtin
and fails to compile it with the following error.
> fatal error: error in backend: llvm.clear_cache is not supported on wasm
To resolve this, this commit removes the call to __builtin___clear_cache for
Emscripten build.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/2926a798fa52a3a5b11c3df4edd1643d2b7cdcb9.1745820062.git.ktokunaga.mail@gmail.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mechanical change using gsed, then style manually adapted
to pass checkpatch.pl script.
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20250424194905.82506-4-philmd@linaro.org>
A few functions now end with a label. The next commit will clean them
up.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20250407082643.2310002-3-armbru@redhat.com>
[Straightforward conflict with commit 988ad4cceb (hw/loongarch/virt:
Fix cpuslot::cpu set at last in virt_cpu_plug()) resolved]
Split icount stuff from system/cpu-timers.h.
There are 17 files which only require icount.h, 7 that only
require cpu-timers.h, and 7 that require both.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Convert the existing includes with
sed -i ,exec/memory.h,system/memory.h,g
Move the include within cpu-all.h into a !CONFIG_USER_ONLY block.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
On ARM hosts with CTR_EL0.DIC and CTR_EL0.IDC set, this would only cause
an ISB to be executed during cache maintenance, which could lead to QEMU
executing TBs containing garbage instructions.
This seems to be because the ISB finishes executing instructions and
flushes the pipeline, but the ISB doesn't guarantee that writes from the
executed instructions are committed. If a small enough TB is created, it's
possible that the writes setting up the TB aren't committed by the time the
TB is executed.
This function is intended to be a port of the gcc implementation
(85b46d0795/libgcc/config/aarch64/sync-cache.c (L67))
which makes the first DSB unconditional, so we can fix the synchronization
issue by doing that as well.
Cc: qemu-stable@nongnu.org
Fixes: 664a79735e ("util: Specialize flush_idcache_range for aarch64")
Signed-off-by: Joe Komlodi <komlodi@google.com>
Message-id: 20250310203622.1827940-2-komlodi@google.com
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
aio_dispatch_handler() adds handlers to ctx->poll_aio_handlers if
polling should be enabled. If we call adjust_polling_time() for all
polling handlers before this, new polling handlers are still left at
poll->ns = 0 and polling is only actually enabled after the next event.
Move the adjust_polling_time() call after aio_dispatch_handler().
This fixes test-nested-aio-poll, which expects that polling becomes
effective the first time around.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250311141912.135657-1-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Adaptive polling has a big problem: It doesn't consider that an event
loop can wait for many different events that may have very different
typical latencies.
For example, think of a guest that tends to send a new I/O request soon
after the previous I/O request completes, but the storage on the host is
rather slow. In this case, getting the new request from guest quickly
means that polling is enabled, but the next thing is performing the I/O
request on the backend, which is slow and disables polling again for the
next guest request. This means that in such a scenario, polling could
help for every other event, but is only ever enabled when it can't
succeed.
In order to fix this, keep a separate AioPolledEvent for each
AioHandler. We will then know that the backend file descriptor always
has a high latency and isn't worth polling for, but we also know that
the guest is always fast and we should poll for it. This solves at least
half of the problem, we can now keep polling for those cases where it
makes sense and get the improved performance from it.
Since the event loop doesn't know which event will be next, we still do
some unnecessary polling while we're waiting for the slow disk. I made
some attempts to be more clever than just randomly growing and shrinking
the polling time, and even to let callers be explicit about when they
expect a new event, but so far this hasn't resulted in improved
performance or even caused performance regressions. For now, let's just
fix the part that is easy enough to fix, we can revisit the rest later.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250307221634.71951-6-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250307221634.71951-5-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As a preparation for having multiple adaptive polling states per
AioContext, move the 'ns' field into a separate struct.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250307221634.71951-4-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
iov_from_buf(), iov_to_buf(), iov_memset(), and iov_copy() asserts
that the given offset fits in the iov while tolerating the specified
number of bytes to operate with to be greater than the size of iov.
This is inconsistent so remove the assertions.
Asserting the offset fits in the iov makes sense if it is expected that
there are other operations that process the content before the offset
and the content is processed in order. Under this expectation, the
offset should point to the end of bytes that are previously processed
and fit in the iov. However, this expectation depends on the details of
the caller, and did not hold true at least one case and required code to
check iov_size(), which is added with commit 83ddb3dbba
("hw/net/net_tx_pkt: Fix overrun in update_sctp_checksum()").
Adding such a check is inefficient and error-prone. These functions
already tolerate the specified number of bytes to operate with to be
greater than the size of iov to avoid such checks so remove the
assertions to tolerate invalid offset as well. They return the number of
bytes they operated with so their callers can still check the returned
value to ensure there are sufficient space at the given offset.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Currently we call icount_start_warp_timer() from timerlist_rearm().
This produces incorrect behaviour, because timerlist_rearm() is
called, for instance, when a timer callback modifies its timer. We
cannot decide here to warp the timer forwards to the next timer
deadline merely because all_cpu_threads_idle() is true, because the
timer callback we were called from (or some other callback later in
the list of callbacks being invoked) may be about to raise a CPU
interrupt and move a CPU from idle to ready.
The only valid place to choose to warp the timer forward is from the
main loop, when we know we have no outstanding IO or timer callbacks
that might be about to wake up a CPU.
For Arm guests, this bug was mostly latent until the refactoring
commit f6fc36deef ("target/arm/helper: Implement
CNTHCTL_EL2.CNT[VP]MASK"), which exposed it because it refactored a
timer callback so that it happened to call timer_mod() first and
raise the interrupt second, when it had previously raised the
interrupt first and called timer_mod() afterwards.
This call seems to have originally derived from the
pre-record-and-replay icount code, which (as of e.g. commit
db1a49726c in 2010) in this location did a call to
qemu_notify_event(), necessary to get the icount code in the vCPU
round-robin thread to stop and recalculate the icount deadline when a
timer was reprogrammed from the IO thread. In current QEMU,
everything is done on the vCPU thread when we are in icount mode, so
there's no need to try to notify another thread here.
I suspect that the other reason why this call was doing icount timer
warping is that it pre-dates commit efab87cf79 from 2015, which
added a call to icount_start_warp_timer() to main_loop_wait(). Once
the call in timerlist_rearm() has been removed, if the timer
callbacks don't cause any CPU to be woken up then we will end up
calling icount_start_warp_timer() from main_loop_wait() when the rr
main loop code calls rr_wait_io_event().
Remove the incorrect call from timerlist_rearm().
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2703
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20250210135804.3526943-1-peter.maydell@linaro.org
- Merge "qemu/clang-tsa.h" within "qemu/compiler.h"
- Various cleanups around accelerators initialization code
(better user/system split)
- Various trivial cleanups in accel/tcg/,
Guard few TCG calls with tcg_enabled()
- Explicit disassemble_info endianness
- Improve dual-endianness support for MicroBlaze
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+qvnXhKRciHc/Wuy4+MsLN6twN4FAmfJw08ACgkQ4+MsLN6t
wN70whAAtfcdWtqseFfb6fvDtjflgxN51Ui0iaOECXUA18USKriGy34eBcMYMiM2
+eKgU7+jI6JGE4+burcgWUsPpFFF951/A8+lyIbFgO5yToTDmC+qNe4XfmMAIyXq
uf9Obr2c0Xk9luh4odb+jPAQodw/7G1fKgcCVIJNDCl/xEcPhS9eNpTaHwcVnkWI
K6KrxWXOsqG6+evJBPWYoXtOOyt0+JcwAsJoGhprwtGm3P9+jSVXsgeGsJVyZcna
f32JtjWL754O8XeMkOn4x6rt58VrCIMKI9xT7keDyuhTCq0Zki9RO2nMU2dSw5mN
AfL9hxqUy0Nijnyslg3ugujDfTePsNyLdwwH7n0mnoD72ELi6WnhDsmOThuEB3Rd
4/kdwTJfA/rlWk/GF1tbKW7AvQZokRARtzmL3V0HmGJu57lX+2JuszEdYBkqDEP7
GH1I10B2yANUm+C9y3X8qWOU7Ws433ebJeJoZuyfnbZ9Me+UfRmql/oS+V8ata2i
fArEItpldUFrWRyYLkTbXrh2dgyV9yJTEir/lzOzeAZZzyabTbjf2z9qnh976GGO
1QnDy5QA4f54kDBUZe7JK26TZsHPch7cgqXW6f8tRlJF7A9hxGK8d2TUV/lC3/vx
LUOlWNu03PhiruYmZEcWOsY3Jt9jRCF6lIryrnaJsqnVOVmMUMM=
=3TRh
-----END PGP SIGNATURE-----
Merge tag 'accel-cpus-20250306' of https://github.com/philmd/qemu into staging
Generic CPUs / accelerators patch queue
- Merge "qemu/clang-tsa.h" within "qemu/compiler.h"
- Various cleanups around accelerators initialization code
(better user/system split)
- Various trivial cleanups in accel/tcg/,
Guard few TCG calls with tcg_enabled()
- Explicit disassemble_info endianness
- Improve dual-endianness support for MicroBlaze
# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEE+qvnXhKRciHc/Wuy4+MsLN6twN4FAmfJw08ACgkQ4+MsLN6t
# wN70whAAtfcdWtqseFfb6fvDtjflgxN51Ui0iaOECXUA18USKriGy34eBcMYMiM2
# +eKgU7+jI6JGE4+burcgWUsPpFFF951/A8+lyIbFgO5yToTDmC+qNe4XfmMAIyXq
# uf9Obr2c0Xk9luh4odb+jPAQodw/7G1fKgcCVIJNDCl/xEcPhS9eNpTaHwcVnkWI
# K6KrxWXOsqG6+evJBPWYoXtOOyt0+JcwAsJoGhprwtGm3P9+jSVXsgeGsJVyZcna
# f32JtjWL754O8XeMkOn4x6rt58VrCIMKI9xT7keDyuhTCq0Zki9RO2nMU2dSw5mN
# AfL9hxqUy0Nijnyslg3ugujDfTePsNyLdwwH7n0mnoD72ELi6WnhDsmOThuEB3Rd
# 4/kdwTJfA/rlWk/GF1tbKW7AvQZokRARtzmL3V0HmGJu57lX+2JuszEdYBkqDEP7
# GH1I10B2yANUm+C9y3X8qWOU7Ws433ebJeJoZuyfnbZ9Me+UfRmql/oS+V8ata2i
# fArEItpldUFrWRyYLkTbXrh2dgyV9yJTEir/lzOzeAZZzyabTbjf2z9qnh976GGO
# 1QnDy5QA4f54kDBUZe7JK26TZsHPch7cgqXW6f8tRlJF7A9hxGK8d2TUV/lC3/vx
# LUOlWNu03PhiruYmZEcWOsY3Jt9jRCF6lIryrnaJsqnVOVmMUMM=
# =3TRh
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 06 Mar 2025 23:46:23 HKT
# gpg: using RSA key FAABE75E12917221DCFD6BB2E3E32C2CDEADC0DE
# gpg: Good signature from "Philippe Mathieu-Daudé (F4BUG) <f4bug@amsat.org>" [full]
# Primary key fingerprint: FAAB E75E 1291 7221 DCFD 6BB2 E3E3 2C2C DEAD C0DE
* tag 'accel-cpus-20250306' of https://github.com/philmd/qemu: (54 commits)
include: Poison TARGET_PHYS_ADDR_SPACE_BITS definition
system: Open-code qemu_init_arch_modules() using target_name()
target/i386: Mark WHPX APIC region as little-endian
target/alpha: Do not mix exception flags and FPCR bits
target/riscv: Convert misa_mxl_max using GLib macros
target/riscv: Declare RISCVCPUClass::misa_mxl_max as RISCVMXL
target/xtensa: Finalize config in xtensa_register_core()
target/sparc: Constify SPARCCPUClass::cpu_def
target/i386: Constify X86CPUModel uses
disas: Remove target_words_bigendian() call in initialize_debug_target()
target/xtensa: Set disassemble_info::endian value in disas_set_info()
target/sh4: Set disassemble_info::endian value in disas_set_info()
target/riscv: Set disassemble_info::endian value in disas_set_info()
target/ppc: Set disassemble_info::endian value in disas_set_info()
target/mips: Set disassemble_info::endian value in disas_set_info()
target/microblaze: Set disassemble_info::endian value in disas_set_info
target/arm: Set disassemble_info::endian value in disas_set_info()
target: Set disassemble_info::endian value for big-endian targets
target: Set disassemble_info::endian value for little-endian targets
target/mips: Fix possible MSA int overflow
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We already have "qemu/compiler.h" for compiler-specific arrangements,
automatically included by "qemu/osdep.h" for each source file. No
need to explicitly include a header for a Clang particularity.
Suggested-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20250117170201.91182-1-philmd@linaro.org>
Migration code wants to manage device data sending threads in one place.
QEMU has an existing thread pool implementation, however it is limited
to queuing AIO operations only and essentially has a 1:1 mapping between
the current AioContext and the AIO ThreadPool in use.
Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's
GThreadPool.
This brings a few new operations on a pool:
* thread_pool_wait() operation waits until all the submitted work requests
have finished.
* thread_pool_set_max_threads() explicitly sets the maximum thread count
in the pool.
* thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
in the pool to equal the number of still waiting in queue or unfinished work.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/b1efaebdbea7cb7068b8fb74148777012383e12b.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
These names conflict with ones used by future generic thread pool
equivalents.
Generic names should belong to the generic pool type, not specific (AIO)
type.
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/70f9e0fb4b01042258a1a57996c64d19779dc7f0.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
This function name conflicts with one used by a future generic thread pool
function and it was only used by one test anyway.
Update the trace event name in thread_pool_submit_aio() accordingly.
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/6830f07777f939edaf0a2d301c39adcaaf3817f0.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
VMApple contains an "aes" engine device that it uses to encrypt and
decrypt its nvram. It has trivial hard coded keys it uses for that
purpose.
Add device emulation for this device model.
Signed-off-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Phil Dennis-Jordan <phil@philjordan.eu>
Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Tested-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-ID: <20241223221645.29911-10-phil@philjordan.eu>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Features:
SR-IOV emulation for pci
virtio-mem-pci support for s390
interleave support for cxl
big endian support for vdpa svq
new QAPI events for vhost-user
Also vIOMMU reset order fixups are in.
Fixes, cleanups all over the place.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAme4b8sPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpHKcIAKPJsVqPdda2dJ7b7FdyRT0Q+uwezXqaGHd4
7Lzih1wsxYNkwIAyPtEb76/21qiS7BluqlUCfCB66R9xWjP5/KfvAFj4/r4AEduE
fxAgYzotNpv55zcRbcflMyvQ42WGiZZHC+o5Lp7vDXUP3pIyHrl0Ydh5WmcD+hwS
BjXvda58TirQpPJ7rUL+sSfLih17zQkkDcfv5/AgorDy1wK09RBKwMx/gq7wG8yJ
twy8eBY2CmfmFD7eTM+EKqBD2T0kwLEeLfS/F/tl5Fyg6lAiYgYtCbGLpAmWErsg
XZvfZmwqL7CNzWexGvPFnnLyqwC33WUP0k0kT88Y5wh3/h98blw=
=tej8
-----END PGP SIGNATURE-----
Merge tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu into staging
virtio,pc,pci: features, fixes, cleanups
Features:
SR-IOV emulation for pci
virtio-mem-pci support for s390
interleave support for cxl
big endian support for vdpa svq
new QAPI events for vhost-user
Also vIOMMU reset order fixups are in.
Fixes, cleanups all over the place.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# -----BEGIN PGP SIGNATURE-----
#
# iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAme4b8sPHG1zdEByZWRo
# YXQuY29tAAoJECgfDbjSjVRpHKcIAKPJsVqPdda2dJ7b7FdyRT0Q+uwezXqaGHd4
# 7Lzih1wsxYNkwIAyPtEb76/21qiS7BluqlUCfCB66R9xWjP5/KfvAFj4/r4AEduE
# fxAgYzotNpv55zcRbcflMyvQ42WGiZZHC+o5Lp7vDXUP3pIyHrl0Ydh5WmcD+hwS
# BjXvda58TirQpPJ7rUL+sSfLih17zQkkDcfv5/AgorDy1wK09RBKwMx/gq7wG8yJ
# twy8eBY2CmfmFD7eTM+EKqBD2T0kwLEeLfS/F/tl5Fyg6lAiYgYtCbGLpAmWErsg
# XZvfZmwqL7CNzWexGvPFnnLyqwC33WUP0k0kT88Y5wh3/h98blw=
# =tej8
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 21 Feb 2025 20:21:31 HKT
# gpg: using RSA key 5D09FD0871C8F85B94CA8A0D281F0DB8D28D5469
# gpg: issuer "mst@redhat.com"
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" [full]
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>" [full]
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu: (41 commits)
docs/devel/reset: Document reset expectations for DMA and IOMMU
hw/vfio/common: Add a trace point in vfio_reset_handler
hw/arm/smmuv3: Move reset to exit phase
hw/i386/intel-iommu: Migrate to 3-phase reset
hw/virtio/virtio-iommu: Migrate to 3-phase reset
vhost-user-snd: correct the calculation of config_size
net: vhost-user: add QAPI events to report connection state
hw/virtio/virtio-nsm: Respond with correct length
vdpa: Fix endian bugs in shadow virtqueue
MAINTAINERS: add more files to `vhost`
cryptodev/vhost: allocate CryptoDevBackendVhost using g_mem0()
vhost-iova-tree: Update documentation
vhost-iova-tree, svq: Implement GPA->IOVA & partial IOVA->HVA trees
vhost-iova-tree: Implement an IOVA-only tree
amd_iommu: Use correct bitmask to set capability BAR
amd_iommu: Use correct DTE field for interrupt passthrough
hw/virtio: reset virtio balloon stats on machine reset
mem/cxl_type3: support 3, 6, 12 and 16 interleave ways
hw/mem/cxl_type3: Ensure errp is set on realization failure
hw/mem/cxl_type3: Fix special_ops memory leak on msix_init_exclusive_bar() failure
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Creates and supports a GPA->IOVA tree and a partial IOVA->HVA tree by
splitting up guest-backed memory maps and host-only memory maps from the
full IOVA->HVA tree. That is, any guest-backed memory maps are now
stored in the GPA->IOVA tree and host-only memory maps stay in the
IOVA->HVA tree.
Also propagates the GPAs (in_addr/out_addr) of a VirtQueueElement to
vhost_svq_translate_addr() to translate GPAs to IOVAs via the GPA->IOVA
tree (when descriptors are backed by guest memory). For descriptors
backed by host-only memory, the existing partial SVQ IOVA->HVA tree is
used.
GPAs are unique in the guest's address space, ensuring unambiguous IOVA
translations. This avoids the issue where different GPAs map to the same
HVA, causing the original HVA->IOVA translation to potentially return an
IOVA associated with the wrong intended GPA.
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20250217144936.3589907-3-jonah.palmer@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Replace g_strdup_printf("%s", value) -> g_strdup(value)
to avoid unnecessary string formatting.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Depending on the configuration of the host and VM, a passthrough
device may generate recurring DMA mapping errors at runtime. In such
cases, reporting the issue once is sufficient.
We have already the warn/error_report_once() routines taking a format
and arguments. Using the same design pattern, add a new warning
variant taking an 'Error *' parameter.
Cc: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20250206131438.1505542-2-clg@redhat.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
- add a check-rust test to docker builds
- re-factor the qtest logic to be cleaner
- fix tests to not clock_step when no timers enabled
- roll-up log prefix into qtest_send
- cleaner error reporting when qtest_clock_set fails
- revert old deadlock fix now tests are updated
- only run full set of migration tests under HW acceleration
- support late attachment to user-mode gdbstubs
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEZoWumedRZ7yvyN81+9DbCVqeKkQFAmeqBSsACgkQ+9DbCVqe
KkQS/Af+K0hpdGc1msiuMsqmuESBvhoQniYZFLN1/pwe2KpG8i/+fq2fsCuxJhJ1
2TzPH7aj54p9MGCZf2k9JLhO22XldN+oezZMc1crhoWK0AtrWhnLs58I2oEPIsUo
NmGO6Zfm98ge89o2y8GCvd0QXAtUf+jduDKnW0mfnOnw+w/mky5KzWS7/1091VGW
42LSY4KnqgdLSqLyuLBOrgADEjB1ChWS4/bSC+kEYSGrmNQB+n1KeIzzlJBGpOr0
Z9yzmhMCm7TWdkFNPmnVfYH/7ZUNcpv6PtQSpkku4f6b/gybyvJBknHpM4i+Gpb5
87wSjljrCpdNm/9KFRjiJuUWdS/jCg==
=UF0n
-----END PGP SIGNATURE-----
Merge tag 'pull-10.0-testing-and-gdstub-updates-100225-1' of https://gitlab.com/stsquad/qemu into staging
testing and gdbstub updates:
- add a check-rust test to docker builds
- re-factor the qtest logic to be cleaner
- fix tests to not clock_step when no timers enabled
- roll-up log prefix into qtest_send
- cleaner error reporting when qtest_clock_set fails
- revert old deadlock fix now tests are updated
- only run full set of migration tests under HW acceleration
- support late attachment to user-mode gdbstubs
# -----BEGIN PGP SIGNATURE-----
#
# iQEzBAABCgAdFiEEZoWumedRZ7yvyN81+9DbCVqeKkQFAmeqBSsACgkQ+9DbCVqe
# KkQS/Af+K0hpdGc1msiuMsqmuESBvhoQniYZFLN1/pwe2KpG8i/+fq2fsCuxJhJ1
# 2TzPH7aj54p9MGCZf2k9JLhO22XldN+oezZMc1crhoWK0AtrWhnLs58I2oEPIsUo
# NmGO6Zfm98ge89o2y8GCvd0QXAtUf+jduDKnW0mfnOnw+w/mky5KzWS7/1091VGW
# 42LSY4KnqgdLSqLyuLBOrgADEjB1ChWS4/bSC+kEYSGrmNQB+n1KeIzzlJBGpOr0
# Z9yzmhMCm7TWdkFNPmnVfYH/7ZUNcpv6PtQSpkku4f6b/gybyvJBknHpM4i+Gpb5
# 87wSjljrCpdNm/9KFRjiJuUWdS/jCg==
# =UF0n
# -----END PGP SIGNATURE-----
# gpg: Signature made Mon 10 Feb 2025 08:54:51 EST
# gpg: using RSA key 6685AE99E75167BCAFC8DF35FBD0DB095A9E2A44
# gpg: Good signature from "Alex Bennée (Master Work Key) <alex.bennee@linaro.org>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 6685 AE99 E751 67BC AFC8 DF35 FBD0 DB09 5A9E 2A44
* tag 'pull-10.0-testing-and-gdstub-updates-100225-1' of https://gitlab.com/stsquad/qemu:
tests/tcg: Add late gdbstub attach test
docs/user: Document the %d placeholder and suspend=n QEMU_GDB features
gdbstub: Allow late attachment
osdep: Introduce qemu_kill_thread()
user: Introduce host_interrupt_signal
user: Introduce user/signal.h
gdbstub: Try unlinking the unix socket before binding
gdbstub: Allow the %d placeholder in the socket path
tests/qtest/migration: Pick smoke tests
tests/qtest/migration: Add --full option
Revert "util/timer: avoid deadlock when shutting down"
tests/qtest: tighten up the checks on clock_step
tests/qtest: rename qtest_send_prefix and roll-up into qtest_send
tests/qtest: simplify qtest_process_inbuf
tests/qtest: don't step clock at start of npcm7xx periodic IRQ test
tests/qtest: don't attempt to clock_step while waiting for virtio ISR
tests/docker: replicate the check-rust-tools-nightly CI job
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The general expectation is that header files should follow the same
file/path naming scheme as the corresponding source file. There are
various historical exceptions to this practice in QEMU, with one of
the most notable being the include/qapi/qmp/ directory. Most of the
headers there correspond to source files in qobject/.
This patch corrects most of that inconsistency by creating
include/qobject/ and moving the headers for qobject/ there.
This also fixes MAINTAINERS for include/qapi/qmp/dispatch.h:
scripts/get_maintainer.pl now reports "QAPI" instead of "No
maintainers found".
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com> #s390x
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20241118151235.2665921-2-armbru@redhat.com>
[Rebased]
Add a function for sending signals to individual threads. It does not make
sense on Windows, so do not provide an implementation, so that if someone
uses it by accident, they will get a linker error.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Warner Losh <imp@bsdimp.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Message-Id: <20250117001542.8290-6-iii@linux.ibm.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20250207153112.3939799-15-alex.bennee@linaro.org>
In case an emulated process execve()s another emulated process, bind()
will fail, because the socket already exists. So try deleting it. Use
the existing unix_listen() function which does this. Link qemu-user
with qemu-sockets.c and add the monitor_get_fd() stub.
Note that it is not possible to handle this in do_execv(): deleting
gdbserver_user_state.socket_path before safe_execve() is not correct,
because the latter may fail, and afterwards we may lose control.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20250117001542.8290-3-iii@linux.ibm.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20250207153112.3939799-12-alex.bennee@linaro.org>
This reverts commit bc02be4508.
Now we catch attempts to clock_step to the next timer when none are
enabled we can revert the previous attempt to prevent deadlock. As
long as a new target time is given we will move time forward even if
no timers will fire. This is desirable for tests which are checking
that nothing changes when things are disabled.
Previously most tests got away with it because --enable-slirp always
has a timer running while the test is active.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20250207153112.3939799-8-alex.bennee@linaro.org>
Create MAP_SHARED RAMBlocks by mmap'ing a file descriptor rather than using
MAP_ANON, so the memory can be accessed in another process by passing and
mmap'ing the fd. This will allow CPR to support memory-backend-ram and
memory-backend-shm objects, provided the user creates them with share=on.
Use memfd_create if available because it has no constraints. If not, use
POSIX shm_open. However, allocation on the opened fd may fail if the shm
mount size is too small, even if the system has free memory, so for backwards
compatibility fall back to qemu_anon_ram_alloc/MAP_ANON on failure.
For backwards compatibility on Windows, always use MAP_ANON. share=on has
no purpose there, but the syntax is accepted, and must continue to work.
Lastly, quietly fall back to MAP_ANON if the system does not support
qemu_ram_alloc_from_fd.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/1736967650-129648-5-git-send-email-steven.sistare@oracle.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Acked-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20250102181601.1421059-2-richard.henderson@linaro.org>
The next commit will remove "qemu/clang-tsa.h" of "exec/exec-all.h",
however the following files indirectly include it:
$ git grep -L qemu/clang-tsa.h $(git grep -wl TSA_NO_TSA)
block/create.c
include/block/block_int-common.h
tests/unit/test-bdrv-drain.c
tests/unit/test-block-iothread.c
util/qemu-thread-posix.c
Explicitly include it so we can process with the removal in the
next commit.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20241212185341.2857-4-philmd@linaro.org>
Headers in include/sysemu/ are not only related to system
*emulation*, they are also used by virtualization. Rename
as system/ which is clearer.
Files renamed manually then mechanical change using sed tool.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Lei Yang <leiyang@redhat.com>
Message-Id: <20241203172445.28576-1-philmd@linaro.org>
Currently -d guest_errors enables logging of different invalid actions
by the guest such as misusing hardware, accessing missing features or
invalid memory areas. The memory access logging can be quite verbose
which obscures the other messages enabled by this debug switch so
separate it by adding a new -d invalid_mem option to make it possible
to control it independently of other guest error logs.
Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-ID: <1bb0d0e91ba14aca13056df3b0a774f89cbf966c.1730549443.git.balaton@eik.bme.hu>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
The call to xgetbv() is passing the ecx value for cpuid function 0xD,
index 0. The xgetbv call thus returns false (OSXSAVE is bit 27, which is
well out of the range of CPUID[0xD,0].ECX) and eax is not modified. While
fixing it, cache the whole computation of supported XCR0 bits since it
will be used for more than just CPUID leaf 0xD.
Furthermore, unsupported subleafs of CPUID 0xD (including all those
corresponding to zero bits in host's XCR0) must be hidden; if OSXSAVE
is not set at all, the whole of CPUID leaf 0xD plus the XSAVE bit must
be hidden.
Finally, unconditionally drop XSTATE_BNDREGS_MASK and XSTATE_BNDCSR_MASK;
real hardware will only show them if the MPX bit is set in CPUID;
this is never the case for hvf_get_supported_cpuid() because QEMU's
Hypervisor.framework support does not handle the VMX fields related to
MPX (even in the unlikely possibility that the host has MPX enabled).
So hide those bits in the new cache_host_xcr0().
Cc: Phil Dennis-Jordan <lists@philjordan.eu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add support for probing RISC-V vector extension availability in
the backend. This information will be used when deciding whether
to use vector instructions in code generation.
Cache lg2(vlenb) for the backend. The storing of lg2(vlenb) means
we can convert all of the division into subtraction.
While the compiler doesn't support RISCV_HWPROBE_EXT_ZVE64X,
we use RISCV_HWPROBE_IMA_V instead. RISCV_HWPROBE_IMA_V is more
strictly constrainted than RISCV_HWPROBE_EXT_ZVE64X. At least in
current QEMU implemenation, the V vector extension depends on the
zve64d extension.
Signed-off-by: TANG Tiancheng <tangtiancheng.ttc@alibaba-inc.com>
Reviewed-by: Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
Tested-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Message-ID: <20241007025700.47259-2-zhiwei_liu@linux.alibaba.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Setting the SO_REUSEADDR property on a socket allows binding to a port
number that is in the TIMED_WAIT state. This is usually done on listener
sockets, to enable a server to restart itself without having to wait for
the completion of TIMED_WAIT on the port.
It is also possible, but highly unusual, to set it on client sockets. It
is rare to explicitly bind() a client socket, since it is almost always
fine to allow the kernel to auto-bind a client socket to a random free
port. Most systems will have many 10's of 1000's of free ports that
client sockets will be bound to.
eg on Linux
$ sysctl -a | grep local_port
net.ipv4.ip_local_port_range = 32768 60999
eg on OpenBSD
$ sysctl -a | grep net.inet.ip.port
net.inet.ip.portfirst=1024
net.inet.ip.portlast=49151
net.inet.ip.porthifirst=49152
net.inet.ip.porthilast=65535
A connected socket must have a unique set of value for
(protocol, localip, localport, remoteip, remoteport)
otherwise it is liable to get EADDRINUSE.
A client connection should trivially avoid EADDRINUSE if letting the
kernel auto-assign the 'localport' value, which QEMU always does.
When QEMU sets SO_REUSEADDR on a client socket on OpenBSD, however, it
upsets this situation.
The OpenBSD kernel appears to happily pick a 'localport' that is in the
TIMED_WAIT state, even if there are many other available local ports
available for use that are not in the TIMED_WAIT state.
A test program that just loops opening client sockets will start seeing
EADDRINUSE on OpenBSD when as few as 2000 ports are in TIMED_WAIT,
despite 10's of 1000's ports still being unused. This contrasts with
Linux which appears to avoid picking local ports in TIMED_WAIT state.
This problem on OpenBSD exhibits itself periodically with the migration
test failing with a message like[1]:
qemu-system-ppc64: Failed to connect to '127.0.0.1:24109': Address already in use
While I have not been able to reproduce the OpenBSD failure in my own
testing, given the scope of what QEMU tests do, it is entirely possible
that there could be a lot of ports in TIMED_WAIT state when the
migration test runs.
Removing SO_REUSEADDR from the client sockets should not affect normal
QEMU usage, and should improve reliability on OpenBSD.
This use of SO_REUSEADDR on client sockets is highly unusual, and
appears to have been present since the very start of the QEMU socket
helpers in 2008. The orignal commit has no comment about the use of
SO_REUSEADDR on the client, so is most likely just an 16 year old
copy+paste bug.
[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-10/msg03427.htmlhttps://lists.nongnu.org/archive/html/qemu-devel/2024-02/msg01572.html
Fixes: d247d25f18
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
socket_remote_address hasn't been used since it was added in
17c55decec ("sockets: add helpers for creating SocketAddress from a socket")
inet_connect hasn't been used since 2017's
8ecc2f9eab ("sheepdog: Use SocketAddress and socket_connect()")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>