A new parameter "-a" is added to "info migrate" to dump all info, while
when not specified it only dumps the important ones. When at it, reorg
everything to make it easier to read for human.
The general rule is:
- Put important things at the top
- Reuse a single line when things are very relevant, hence reducing lines
needed to show the results
- Remove almost useless ones (e.g. "normal_bytes", while we also have
both "page size" and "normal" pages)
- Regroup things, so that related fields will show together
- etc.
Before this change, it looks like (one example of a completed case):
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
send-switchover-start: on
clear-bitmap-shift: 18
Migration status: completed
total time: 122952 ms
downtime: 76 ms
setup: 15 ms
transferred ram: 130825923 kbytes
throughput: 8717.68 mbps
remaining ram: 0 kbytes
total ram: 16777992 kbytes
duplicate: 997263 pages
normal: 32622225 pages
normal bytes: 130488900 kbytes
dirty sync count: 10
page size: 4 kbytes
multifd bytes: 117134260 kbytes
pages-per-second: 169431
postcopy request count: 5835
precopy ram: 15 kbytes
postcopy ram: 13691151 kbytes
After this change, sample output (default, no "-a" specified):
Status: postcopy-active
Time (ms): total=40504, setup=14, down=145
RAM info:
Bandwidth (mbps): 6102.65
Sizes (KB): psize=4, total=16777992,
transferred=37673019, remain=2136404,
precopy=3, multifd=26108780, postcopy=11563855
Pages: normal=9394288, zero=600672, rate_per_sec=185875
Others: dirty_syncs=3, dirty_pages_rate=278378, postcopy_req=4078
Sample output when "-a" specified:
Status: active
Time (ms): total=3040, setup=4, exp_down=300
RAM info:
Throughput (mbps): 10.51
Sizes (KB): psize=4, total=4211528,
transferred=3979, remain=4206452,
precopy=3978, multifd=0, postcopy=0
Pages: normal=992, zero=277, rate_per_sec=320
Others: dirty_syncs=1
Globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
send-switchover-start: on
clear-bitmap-shift: 18
XBZRLE: size=67108864, transferred=0, pages=0, miss=188451
miss_rate=0.00, encode_rate=0.00, overflow=0
CPU Throttle (%): 0
Dirty-limit Throttle (us): 0
Dirty-limit Ring Full (us): 0
Postcopy Blocktime (ms): 0
Postcopy vCPU Blocktime: ...
Reviewed-by: Dr. David Alan Gilbert <dave@treblig.org>
Tested-by: Mario Casquero <mcasquer@redhat.com>
[peterx: print "," too in 1st line of RAM info]
Signed-off-by: Peter Xu <peterx@redhat.com>
With commit 82137e6c8c ("migration: enforce multifd and postcopy preempt to
be set before incoming"), and if postcopy preempt / multifd is enabled, one
cannot setup any capability because these checks would always fail.
(qemu) migrate_set_capability xbzrle off
Error: Postcopy preempt must be set before incoming starts
To fix it, check existing cap and only raise an error if the specific cap
changed.
Fixes: 82137e6c8c ("migration: enforce multifd and postcopy preempt to be set before incoming")
Reviewed-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
If zerocopy is enabled for multifd then QIO_CHANNEL_WRITE_FLAG_ZERO_COPY
flag is forced into all multifd channel write calls via p->write_flags
that was setup in multifd_nocomp_send_setup().
However, device state packets aren't compatible with zerocopy - the data
buffer isn't getting kept pinned until multifd channel flush.
Make sure to mask that QIO_CHANNEL_WRITE_FLAG_ZERO_COPY flag in a multifd
send thread if the data being sent is device state.
Fixes: 0525b91a0b ("migration/multifd: Device state transfer support - send side")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/3bd5f48578e29f3a78f41b1e4fbea3d4b2d9b136.1747403393.git.maciej.szmigiero@oracle.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Enable Multifd and Postcopy migration together.
The migration_ioc_process_incoming() routine checks
magic value sent on each channel and helps to properly
setup multifd and postcopy channels.
The Precopy and Multifd threads work during the initial
guest RAM transfer. When migration moves to the Postcopy
phase, the multifd threads cease to send data on multifd
channels and Postcopy threads on the destination
request/pull data from the source side.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Link: https://lore.kernel.org/r/20250512125124.147064-3-ppandit@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
During multifd migration, zero pages are written if
they are migrated more than once.
This may result in a migration thread hang issue when
multifd and postcopy are enabled together.
When postcopy is enabled, always write zero pages as and
when they are migrated.
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250512125124.147064-2-ppandit@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
I hit following error which testing migration in pure RoCE env:
"-incoming rdma:[::]:8089: RDMA ERROR: You only have RoCE / iWARP devices in your
systems and your management software has specified '[::]', but IPv6 over RoCE /
iWARP is not supported in Linux.#012'."
In our setup, we use rdma bind on ipv6 on target host, while connect from source
with ipv4, remove the qemu_rdma_broken_ipv6_kernel, migration just work
fine.
Checking the git history, the function was added since introducing of
rdma migration, which is more than 10 years ago. linux-rdma has
improved support on RoCE/iWARP for ipv6 over past years. There are a few fixes
back in 2016 seems related to the issue, eg:
aeb76df46d11 ("IB/core: Set routable RoCE gid type for ipv4/ipv6 networks")
other fixes back in 2018, eg:
052eac6eeb56 RDMA/cma: Update RoCE multicast routines to use net namespace
8d20a1f0ecd5 RDMA/cma: Fix rdma_cm raw IB path setting for RoCE
9327c7afdce3 RDMA/cma: Provide a function to set RoCE path record L2 parameters
5c181bda77f4 RDMA/cma: Set default GID type as RoCE when resolving RoCE route
3c7f67d1880d IB/cma: Fix default RoCE type setting
be1d325a3358 IB/core: Set RoCEv2 MGID according to spec
63a5f483af0e IB/cma: Set default gid type to RoCEv2
So remove the outdated function and it's usage.
Cc: Peter Xu <peterx@redhat.com>
Cc: Li Zhijian <lizhijian@fujitsu.com>
Cc: Yu Zhang <yu.zhang@ionos.com>
Cc: Fabiano Rosas <farosas@suse.de>
Cc: qemu-devel@nongnu.org
Cc: linux-rdma@vger.kernel.org
Cc: michael@flatgalaxy.com
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Tested-by: Li zhijian <lizhijian@fujitsu.com>
Reviewed-by: Michael Galaxy <mrgalaxy@nvidia.com>
Link: https://lore.kernel.org/r/20250402051306.6509-1-jinpu.wang@ionos.com
[peterx: some cosmetic changes]
Signed-off-by: Peter Xu <peterx@redhat.com>
The preempt mode postcopy has been introduced for a while. From latency
POV, it should always win the vanilla postcopy.
However there's one thing missing when preempt mode is enabled right now,
which is the spatial locality hint when there're page requests from the
destination side.
In vanilla postcopy, as long as a page request was unqueued, it will update
the PSS of the precopy background stream, so that after a page request the
background thread will move the pages after whatever was requested. It's
pretty much a natural behavior when there's only one channel anyway, and
one scanner to send the pages.
Preempt mode didn't follow that, because preempt mode has its own channel
and its own PSS (which doesn't linearly scan the guest memory, but
dedicated to resolve page requested from destination). So the page request
process and the background migration process are completely separate.
This patch adds the hint explicitly for preempt mode. With that, whenever
the preempt mode receives a page request on the source, it will service the
remote page fault in the return path, then it'll provide a hint to the
background thread so that we'll start sending the pages right after the
requested ones in the background, assuming the follow up pages have a
higher chance to be accessed later.
NOTE: since the background migration thread and return path thread run
completely concurrently, it doesn't always mean the hint will be applied
every single time. For example, it's possible that the return path thread
receives multiple page requests in a row without the background thread
getting the chance to consume one. In such case, the preempt thread only
provide the hint if the previous hint has been consumed. After all,
there's no point queuing hints when we only have one linear scanner.
This could measureably improve the simple sequential memory access pattern
during postcopy (when preempt is on). For random accesses, I can measure a
slight increase of remote page fault latency from ~500us -> ~600us, that
could be a trade-off to have such hint mechanism, and after all that's
still greatly improved comparing to vanilla postcopy on random (~10ms).
The patch is verified by our QE team in a video streaming test case, to
reduce the pause of the video from ~1min to a few seconds when switching
over to postcopy with preempt mode.
Reported-by: Xiaohui Li <xiaohli@redhat.com>
Tested-by: Xiaohui Li <xiaohli@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250424220705.195544-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Implement save_postcopy_prepare(), preparing for the enablement
of both multifd and postcopy.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250411114534.3370816-5-ppandit@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Add a savevm handler for a module to opt-in sending extra sections right
before postcopy starts, and before VM is stopped.
RAM will start to use this new savevm handler in the next patch to do flush
and sync for multifd pages.
Note that we choose to do it before VM stopped because the current only
potential user is not sensitive to VM status, so doing it before VM is
stopped is preferred to enlarge any postcopy downtime.
It is still a bit unfortunate that we need to introduce such a new savevm
handler just for the only use case, however it's so far the cleanest.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250411114534.3370816-4-ppandit@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The various logical migration channels don't have a
standardized way of advertising themselves and their
connections may be seen out of order by the migration
destination. When a new connection arrives, the incoming
migration currently make use of heuristics to determine
which channel it belongs to.
The next few patches will need to change how the multifd
and postcopy capabilities interact and that affects the
channel discovery heuristic.
Refactor the channel discovery heuristic to make it less
opaque and simplify the subsequent patches.
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250411114534.3370816-3-ppandit@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Move MULTIFD_ macros to the header file so that
they are accessible from other source files.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250411114534.3370816-2-ppandit@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
control_save_page() is for RDMA only, unfold it to make the code more
clear.
In addition:
- Similar to other branches style in ram_save_target_page(), involve RDMA
only if the condition 'migrate_rdma()' is true.
- Further simplify the code by removing the RAM_SAVE_CONTROL_NOT_SUPP.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-ID: <20250305062825.772629-6-lizhijian@fujitsu.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Since we have disabled RDMA + postcopy, it's safe to remove
the migration_in_postcopy() that follows the migrate_rdma().
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-ID: <20250305062825.772629-5-lizhijian@fujitsu.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
It's believed that RDMA + postcopy-ram has been broken for a while.
Rather than spending time re-enabling it, let's simply disable it as a
trade-off.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-ID: <20250305062825.772629-4-lizhijian@fujitsu.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Depending on the order of starting RDMA and setting capability,
they can be categorized into the following scenarios:
Source:
S1: [set capabilities] -> [Start RDMA outgoing]
Destination:
D1: [set capabilities] -> [Start RDMA incoming]
D2: [Start RDMA incoming] -> [set capabilities]
Previously, compatibility between RDMA and capabilities was verified only
in scenario D1, potentially causing migration failures in other situations.
For scenarios S1 and D1, we can seamlessly incorporate
migration_transport_compatible() to address compatibility between
channels and capabilities vs transport.
For scenario D2, ensure compatibility within migrate_caps_check().
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-ID: <20250305062825.772629-3-lizhijian@fujitsu.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
- Replace cpu_list() definition by CPUClass::list_cpus() callback
- Remove few MO_TE definitions on Hexagon / X86 targets
- Remove target_ulong uses in ARMMMUFaultInfo and ARM CPUWatchpoint
- Remove DEVICE_HOST_ENDIAN definition
- Evaluate TARGET_BIG_ENDIAN at compile time and use target_needs_bswap() more
- Rename target_words_bigendian() as target_big_endian()
- Convert target_name() and target_cpu_type() to TargetInfo API
- Constify QOM TypeInfo class_data/interfaces fields
- Get default_cpu_type calling machine_class_default_cpu_type()
- Correct various uses of GLibCompareDataFunc prototype
- Simplify ARM/Aarch64 gdb_get_core_xml_file() handling a bit
- Move device tree files in their own pc-bios/dtb/ subdir
- Correctly check strchrnul() symbol availability on macOS SDK
- Move target-agnostic methods out of cpu-target.c and accel-target.c
- Unmap canceled USB XHCI packet
- Use deposit/extract API in designware model
- Fix MIPS16e translation
- Few missing header fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+qvnXhKRciHc/Wuy4+MsLN6twN4FAmgLqb8ACgkQ4+MsLN6t
wN6nCQ//cmv1M+NsndhO5TAK8T1eUSXKlTZh932uro6ZgxKwN4p+j1Qo7bq3O9gu
qUMHNbcfQl8sHSytiXBoxCjLMCXC3u38iyz75WGXuPay06rs4wqmahqxL4tyno3l
1RviFts9xlLn+tJqqrAR6+pRdALld0TY+yXUjXgr4aK5pIRpLz9U/sIEoh7qbA5U
x0MTaceDG3A91OYo0TgrNbcMe1b9GqQZ+a4tbaP+oE37wbiKdyQ68LjrEbV08Y1O
qrFF4oxquV31QJcUiuII1W7hC6psGrMsUA1f1qDu7QvmybAZWNZNsR9T66X9jH5J
wXMShJmmXwxugohmuPPFnDshzJy90aFL6Jy2shrfqcG2v0W66ARY1ZnbJLCcfczt
073bnE2dnOVhd/ny37RrIJNJLLmYM0yFDeKuYtNNAzpK9fpA7Q2PI8QiqNacQ3Pa
TdEYrGlMk7OeNck8xJmJMY5rATthi1D4dIBv3rjQbUolQvPJe2Y9or0R2WL1jK5v
hhr6DY01iSPES3CravmUs/aB1HRMPi/nX45OmFR6frAB7xqWMreh81heBVuoTTK8
PuXtRQgRMRKwDeTxlc6p+zba4mIEYG8rqJtPFRgViNCJ1KsgSIowup3BNU05YuFn
NoPoRayMDVMgejVgJin3Mg2DCYvt/+MBmO4IoggWlFsXj59uUgA=
=DXnZ
-----END PGP SIGNATURE-----
Merge tag 'single-binary-20250425' of https://github.com/philmd/qemu into staging
Various patches loosely related to single binary work:
- Replace cpu_list() definition by CPUClass::list_cpus() callback
- Remove few MO_TE definitions on Hexagon / X86 targets
- Remove target_ulong uses in ARMMMUFaultInfo and ARM CPUWatchpoint
- Remove DEVICE_HOST_ENDIAN definition
- Evaluate TARGET_BIG_ENDIAN at compile time and use target_needs_bswap() more
- Rename target_words_bigendian() as target_big_endian()
- Convert target_name() and target_cpu_type() to TargetInfo API
- Constify QOM TypeInfo class_data/interfaces fields
- Get default_cpu_type calling machine_class_default_cpu_type()
- Correct various uses of GLibCompareDataFunc prototype
- Simplify ARM/Aarch64 gdb_get_core_xml_file() handling a bit
- Move device tree files in their own pc-bios/dtb/ subdir
- Correctly check strchrnul() symbol availability on macOS SDK
- Move target-agnostic methods out of cpu-target.c and accel-target.c
- Unmap canceled USB XHCI packet
- Use deposit/extract API in designware model
- Fix MIPS16e translation
- Few missing header fixes
# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEE+qvnXhKRciHc/Wuy4+MsLN6twN4FAmgLqb8ACgkQ4+MsLN6t
# wN6nCQ//cmv1M+NsndhO5TAK8T1eUSXKlTZh932uro6ZgxKwN4p+j1Qo7bq3O9gu
# qUMHNbcfQl8sHSytiXBoxCjLMCXC3u38iyz75WGXuPay06rs4wqmahqxL4tyno3l
# 1RviFts9xlLn+tJqqrAR6+pRdALld0TY+yXUjXgr4aK5pIRpLz9U/sIEoh7qbA5U
# x0MTaceDG3A91OYo0TgrNbcMe1b9GqQZ+a4tbaP+oE37wbiKdyQ68LjrEbV08Y1O
# qrFF4oxquV31QJcUiuII1W7hC6psGrMsUA1f1qDu7QvmybAZWNZNsR9T66X9jH5J
# wXMShJmmXwxugohmuPPFnDshzJy90aFL6Jy2shrfqcG2v0W66ARY1ZnbJLCcfczt
# 073bnE2dnOVhd/ny37RrIJNJLLmYM0yFDeKuYtNNAzpK9fpA7Q2PI8QiqNacQ3Pa
# TdEYrGlMk7OeNck8xJmJMY5rATthi1D4dIBv3rjQbUolQvPJe2Y9or0R2WL1jK5v
# hhr6DY01iSPES3CravmUs/aB1HRMPi/nX45OmFR6frAB7xqWMreh81heBVuoTTK8
# PuXtRQgRMRKwDeTxlc6p+zba4mIEYG8rqJtPFRgViNCJ1KsgSIowup3BNU05YuFn
# NoPoRayMDVMgejVgJin3Mg2DCYvt/+MBmO4IoggWlFsXj59uUgA=
# =DXnZ
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 25 Apr 2025 11:26:55 EDT
# gpg: using RSA key FAABE75E12917221DCFD6BB2E3E32C2CDEADC0DE
# gpg: Good signature from "Philippe Mathieu-Daudé (F4BUG) <f4bug@amsat.org>" [full]
# Primary key fingerprint: FAAB E75E 1291 7221 DCFD 6BB2 E3E3 2C2C DEAD C0DE
* tag 'single-binary-20250425' of https://github.com/philmd/qemu: (58 commits)
qemu: Convert target_name() to TargetInfo API
accel: Move target-agnostic code from accel-target.c -> accel-common.c
accel: Make AccelCPUClass structure target-agnostic
accel: Include missing 'qemu/accel.h' header in accel-internal.h
accel: Implement accel_init_ops_interfaces() for both system/user mode
cpus: Move target-agnostic methods out of cpu-target.c
cpus: Replace CPU_RESOLVING_TYPE -> target_cpu_type()
qemu: Introduce target_cpu_type()
qapi: Rename TargetInfo structure as QemuTargetInfo
hw/microblaze: Evaluate TARGET_BIG_ENDIAN at compile time
hw/mips: Evaluate TARGET_BIG_ENDIAN at compile time
target/xtensa: Evaluate TARGET_BIG_ENDIAN at compile time
target/mips: Check CPU endianness at runtime using env_is_bigendian()
accel/kvm: Use target_needs_bswap()
linux-user/elfload: Use target_needs_bswap()
target/hexagon: Include missing 'accel/tcg/getpc.h'
accel/tcg: Correct list of included headers in tcg-stub.c
system/kvm: make functions accessible from common code
meson: Use osdep_prefix for strchrnul()
meson: Share common C source prefixes
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Mechanical change using gsed, then style manually adapted
to pass checkpatch.pl script.
Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20250424194905.82506-4-philmd@linaro.org>
The migration core subsystem makes use of the VFIO migration API to
collect statistics on the number of bytes transferred. These services
are declared in "hw/vfio/vfio-common.h" which also contains VFIO
internal declarations. Move the migration declarations into a new
header file "hw/vfio/vfio-migration.h" to reduce the exposure of VFIO
internals.
While at it, use a 'vfio_migration_' prefix for these services.
To be noted, vfio_migration_add_bytes_transferred() is a VFIO
migration internal service which we will be moved in the subsequent
patches.
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: John Levon <john.levon@nutanix.com>
Reviewed-by: Avihai Horon <avihaih@nvidia.com>
Link: https://lore.kernel.org/qemu-devel/20250326075122.1299361-4-clg@redhat.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
A few functions now end with a label. The next commit will clean them
up.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20250407082643.2310002-3-armbru@redhat.com>
[Straightforward conflict with commit 988ad4cceb (hw/loongarch/virt:
Fix cpuslot::cpu set at last in virt_cpu_plug()) resolved]
Rename to migration_legacy_page_bits, to make it clear that
we cannot change the value without causing a migration break.
Move to page-vary.h and page-vary-target.c.
Define via TARGET_PAGE_BITS if not TARGET_PAGE_BITS_VARY.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Convert the existing includes with sed.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Convert the existing includes with sed.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Convert the existing includes with
sed -i ,exec/memory.h,system/memory.h,g
Move the include within cpu-all.h into a !CONFIG_USER_ONLY block.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
The SEEK_CUR case in qio_channel_block_seek was incorrectly using the
'whence' parameter instead of the 'offset' parameter when calculating the
new position.
Fixes: 65cf200a51 ("migration: introduce a QIOChannel impl for BlockDriverState VMState")
Signed-off-by: Marco Cavenati <Marco.Cavenati@eurecom.fr>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>
Message-ID: <20250326162230.3323199-1-Marco.Cavenati@eurecom.fr>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Define the cpr_is_incoming helper, to be used in several cpr fix patches.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <1741380954-341079-2-git-send-email-steven.sistare@oracle.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Address an error in RDMA-based migration by ensuring RDMA is prioritized
when saving pages in `ram_save_target_page()`.
Previously, the RDMA protocol's page-saving step was placed after other
protocols due to a refactoring in commit bc38dc2f5f. This led to migration
failures characterized by unknown control messages and state loading errors
destination:
(qemu) qemu-system-x86_64: Unknown control message QEMU FILE
qemu-system-x86_64: error while loading state section id 1(ram)
qemu-system-x86_64: load of migration failed: Operation not permitted
source:
(qemu) qemu-system-x86_64: RDMA is in an error state waiting migration to abort!
qemu-system-x86_64: failed to save SaveStateEntry with id(name): 1(ram): -1
qemu-system-x86_64: rdma migration: recv polling control error!
qemu-system-x86_64: warning: Early error. Sending error.
qemu-system-x86_64: warning: rdma migration: send polling control error
RDMA migration implemented its own protocol/method to send pages to
destination side, hand over to RDMA first to prevent pages being saved by
other protocol.
Fixes: bc38dc2f5f ("migration: refactor ram_save_target_page functions")
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-ID: <20250305062825.772629-2-lizhijian@fujitsu.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Unlike cpr-reboot mode, cpr-transfer mode cannot save volatile ram blocks
in the migration stream file and recreate them later, because the physical
memory for the blocks is pinned and registered for vfio. Add a blocker
for volatile ram blocks.
Also add a blocker for RAM_GUEST_MEMFD. Preserving guest_memfd may be
sufficient for CPR, but it has not been tested yet.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <1740667681-257312-1-git-send-email-steven.sistare@oracle.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
On the incoming migration side, QEMU uses a coroutine to load all the VM
states. Inside, it may reference MigrationState on global states like
migration capabilities, parameters, error state, shared mutexes and more.
However there's nothing yet to make sure MigrationState won't get
destroyed (e.g. after migration_shutdown()). Meanwhile there's also no API
available to remove the incoming coroutine in migration_shutdown(),
avoiding it to access the freed elements.
There's a bug report showing this can happen and crash dest QEMU when
migration is cancelled on source.
When it happens, the dest main thread is trying to cleanup everything:
#0 qemu_aio_coroutine_enter
#1 aio_dispatch_handler
#2 aio_poll
#3 monitor_cleanup
#4 qemu_cleanup
#5 qemu_default_main
Then it found the migration incoming coroutine, schedule it (even after
migration_shutdown()), causing crash:
#0 __pthread_kill_implementation
#1 __pthread_kill_internal
#2 __GI_raise
#3 __GI_abort
#4 __assert_fail_base
#5 __assert_fail
#6 qemu_mutex_lock_impl
#7 qemu_lockable_mutex_lock
#8 qemu_lockable_lock
#9 qemu_lockable_auto_lock
#10 migrate_set_error
#11 process_incoming_migration_co
#12 coroutine_trampoline
To fix it, take a refcount after an incoming setup is properly done when
qmp_migrate_incoming() succeeded the 1st time. As it's during a QMP
handler which needs BQL, it means the main loop is still alive (without
going into cleanups, which also needs BQL).
Releasing the refcount now only until the incoming migration coroutine
finished or failed. Hence the refcount is valid for both (1) setup phase
of incoming ports, mostly IO watches (e.g. qio_channel_add_watch_full()),
and (2) the incoming coroutine itself (process_incoming_migration_co()).
Note that we can't unref in migration_incoming_state_destroy(), because
both qmp_xen_load_devices_state() and load_snapshot() will use it without
an incoming migration. Those hold BQL so they're not prone to this issue.
PS: I suspect nobody uses Xen's command at all, as it didn't register yank,
hence AFAIU the command should crash on master when trying to unregister
yank in migration_incoming_state_destroy().. but that's another story.
Also note that in some incoming failure cases we may not always unref the
MigrationState refcount, which is a trade-off to keep things simple. We
could make it accurate, but it can be an overkill. Some examples:
- Unlike most of the rest protocols, socket_start_incoming_migration()
may create net listener after incoming port setup sucessfully.
It means we can't unref in migration_channel_process_incoming() as a
generic path because socket protocol might keep using MigrationState.
- For either socket or file, multiple IO watches might be created, it
means logically each IO watch needs to take one refcount for
MigrationState so as to be 100% accurate on ownership of refcount taken.
In general, we at least need per-protocol handling to make it accurate,
which can be an overkill if we know incoming failed after all. Add a short
comment to explain that when taking the refcount in qmp_migrate_incoming().
Bugzilla: https://issues.redhat.com/browse/RHEL-69775
Tested-by: Yan Fu <yafu@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250220132459.512610-1-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
This SaveVMHandler helps device provide its own asynchronous transmission
of the remaining data at the end of a precopy phase via multifd channels,
in parallel with the transfer done by save_live_complete_precopy handlers.
These threads are launched only when multifd device state transfer is
supported.
Management of these threads in done in the multifd migration code,
wrapping them in the generic thread pool.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/qemu-devel/eac74a4ca7edd8968bbf72aa07b9041c76364a16.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Since device state transfer via multifd channels requires multifd
channels with packets and is currently not compatible with multifd
compression add an appropriate query function so device can learn
whether it can actually make use of it.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/1ff0d98b85f470e5a33687406e877583b8fab74e.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
The newly introduced device state buffer can be used for either storing
VFIO's read() raw data, but already also possible to store generic device
states. After noticing that device states may not easily provide a max
buffer size (also the fact that RAM MultiFDPages_t after all also want to
have flexibility on managing offset[] array), it may not be a good idea to
stick with union on MultiFDSendData.. as it won't play well with such
flexibility.
Switch MultiFDSendData to a struct.
It won't consume a lot more space in reality, after all the real buffers
were already dynamically allocated, so it's so far only about the two
structs (pages, device_state) that will be duplicated, but they're small.
With this, we can remove the pretty hard to understand alloc size logic.
Because now we can allocate offset[] together with the SendData, and
properly free it when the SendData is freed.
[MSS: Make sure to clear possible device state payload before freeing
MultiFDSendData, remove placeholders for other patches not included]
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/qemu-devel/7b02baba8e6ddb23ef7c349d312b9b631db09d7e.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
This way if there are fields there that needs explicit disposal (like, for
example, some attached buffers) they will be handled appropriately.
Add a related assert to multifd_set_payload_type() in order to make sure
that this function is only used to fill a previously empty MultiFDSendData
with some payload, not the other way around.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/6755205f2b95abbed251f87061feee1c0e410836.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
multifd_send() function is currently not thread safe, make it thread safe
by holding a lock during its execution.
This way it will be possible to safely call it concurrently from multiple
threads.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/dd0f3bcc02ca96a7d523ca58ea69e495a33b453b.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Add a basic support for receiving device state via multifd channels -
channels that are shared with RAM transfers.
Depending whether MULTIFD_FLAG_DEVICE_STATE flag is present or not in the
packet header either device state (MultiFDPacketDeviceState_t) or RAM
data (existing MultiFDPacket_t) is read.
The received device state data is provided to
qemu_loadvm_load_state_buffer() function for processing in the
device's load_state_buffer handler.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/9b86f806c134e7815ecce0eee84f0e0e34aa0146.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Read packet header first so in the future we will be able to
differentiate between a RAM multifd packet and a device state multifd
packet.
Since these two are of different size we can't read the packet body until
we know which packet type it is.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/832ad055fe447561ac1ad565d61658660cb3f63f.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Some drivers might want to make use of auxiliary helper threads during VM
state loading, for example to make sure that their blocking (sync) I/O
operations don't block the rest of the migration process.
Add a migration core managed thread pool to facilitate this use case.
The migration core will wait for these threads to finish before
(re)starting the VM at destination.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/b09fd70369b6159c75847e69f235cb908b02570c.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
All callers to migration_incoming_state_destroy() other than
postcopy_ram_listen_thread() do this call with BQL held.
Since migration_incoming_state_destroy() ultimately calls "load_cleanup"
SaveVMHandlers and it will soon call BQL-sensitive code it makes sense
to always call that function under BQL rather than to have it deal with
both cases (with BQL and without BQL).
Add the necessary bql_lock() and bql_unlock() to
postcopy_ram_listen_thread().
qemu_loadvm_state_main() in postcopy_ram_listen_thread() could call
"load_state" SaveVMHandlers that are expecting BQL to be held.
In principle, the only devices that should be arriving on migration
channel serviced by postcopy_ram_listen_thread() are those that are
postcopiable and whose load handlers are safe to be called without BQL
being held.
But nothing currently prevents the source from sending data for "unsafe"
devices which would cause trouble there.
Add a TODO comment there so it's clear that it would be good to improve
handling of such (erroneous) case in the future.
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/21bb5ca337b1d5a802e697f553f37faf296b5ff4.1741193259.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
qemu_loadvm_load_state_buffer() and its load_state_buffer
SaveVMHandler allow providing device state buffer to explicitly
specified device via its idstr and instance id.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/71ca753286b87831ced4afd422e2e2bed071af25.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
This QEMU_VM_COMMAND sub-command and its switchover_start SaveVMHandler is
used to mark the switchover point in main migration stream.
It can be used to inform the destination that all pre-switchover main
migration stream data has been sent/received so it can start to process
post-switchover data that it might have received via other migration
channels like the multifd ones.
Add also the relevant MigrationState bit stream compatibility property and
its hw_compat entry.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Zhang Chen <zhangckid@gmail.com> # for the COLO part
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/311be6da85fc7e49a7598684d80aa631778dcbce.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
v2 changelog:
- Fix Mac (and possibly some other) build issues for two patches
- os: add an ability to lock memory on_fault
- memory: pass MemTxAttrs to memory_access_is_direct()
List of features:
- William's fix on ram hole punching when with file offset
- Daniil's patchset to introduce mem-lock=on-fault
- William's hugetlb hwpoison fix for size report & remap
- David's series to allow qemu debug writes to MMIOs
-----BEGIN PGP SIGNATURE-----
iIgEABYKADAWIQS5GE3CDMRX2s990ak7X8zN86vXBgUCZ6zcQBIccGV0ZXJ4QHJl
ZGhhdC5jb20ACgkQO1/MzfOr1wbL3wEAqx94NpB/tEEBj6WXE3uV9LqQ0GCTYmV+
MbM51Vep8ksA/35yFn3ltM2yoSnUf9WJW6LXEEKhQlwswI0vChQERgkE
=++O1
-----END PGP SIGNATURE-----
Merge tag 'mem-next-pull-request' of https://gitlab.com/peterx/qemu into staging
Memory pull request for 10.0
v2 changelog:
- Fix Mac (and possibly some other) build issues for two patches
- os: add an ability to lock memory on_fault
- memory: pass MemTxAttrs to memory_access_is_direct()
List of features:
- William's fix on ram hole punching when with file offset
- Daniil's patchset to introduce mem-lock=on-fault
- William's hugetlb hwpoison fix for size report & remap
- David's series to allow qemu debug writes to MMIOs
# -----BEGIN PGP SIGNATURE-----
#
# iIgEABYKADAWIQS5GE3CDMRX2s990ak7X8zN86vXBgUCZ6zcQBIccGV0ZXJ4QHJl
# ZGhhdC5jb20ACgkQO1/MzfOr1wbL3wEAqx94NpB/tEEBj6WXE3uV9LqQ0GCTYmV+
# MbM51Vep8ksA/35yFn3ltM2yoSnUf9WJW6LXEEKhQlwswI0vChQERgkE
# =++O1
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 13 Feb 2025 01:37:04 HKT
# gpg: using EDDSA key B9184DC20CC457DACF7DD1A93B5FCCCDF3ABD706
# gpg: issuer "peterx@redhat.com"
# gpg: Good signature from "Peter Xu <xzpeter@gmail.com>" [full]
# gpg: aka "Peter Xu <peterx@redhat.com>" [full]
# Primary key fingerprint: B918 4DC2 0CC4 57DA CF7D D1A9 3B5F CCCD F3AB D706
* tag 'mem-next-pull-request' of https://gitlab.com/peterx/qemu:
overcommit: introduce mem-lock=on-fault
system: introduce a new MlockState enum
system/vl: extract overcommit option parsing into a helper
os: add an ability to lock memory on_fault
system/physmem: poisoned memory discard on reboot
system/physmem: handle hugetlb correctly in qemu_ram_remap()
physmem: teach cpu_memory_rw_debug() to write to more memory regions
hmp: use cpu_get_phys_page_debug() in hmp_gva2gpa()
memory: pass MemTxAttrs to memory_access_is_direct()
physmem: disallow direct access to RAM DEVICE in address_space_write_rom()
physmem: factor out direct access check into memory_region_supports_direct_access()
physmem: factor out RAM/ROMD check in memory_access_is_direct()
physmem: factor out memory_region_is_ram_device() check in memory_access_is_direct()
system/physmem: take into account fd_offset for file fallocate
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
qmp_migrate guarantees that cpr_channel is not null for
MIG_MODE_CPR_TRANSFER when cpr_state_save is called:
qmp_migrate()
if (s->parameters.mode == MIG_MODE_CPR_TRANSFER && !cpr_channel) {
return;
}
cpr_state_save(cpr_channel)
but cpr_state_save checks for mode differently before using channel,
and Coverity cannot infer that they are equivalent in outgoing QEMU,
and warns that channel may be NULL:
cpr_state_save(channel)
MigMode mode = migrate_mode();
if (mode == MIG_MODE_CPR_TRANSFER) {
f = cpr_transfer_output(channel, errp);
To make Coverity happy, assert that channel != NULL in cpr_state_save.
Resolves: Coverity CID 1590980
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Message-ID: <1738788841-211843-1-git-send-email-steven.sistare@oracle.com>
[assert instead of using parameters.mode in cpr_state_save]
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The expected outcome from qmp_migrate_cancel() is that the source
migration goes to the terminal state
MIGRATION_STATUS_CANCELLED. Anything different from this is a bug when
cancelling.
Make sure there is never a state transition from an unspecified state
into FAILED. Code that sets FAILED, should always either make sure
that the old state is not CANCELLING or specify the old state.
Note that the destination is allowed to go into FAILED, so there's no
issue there.
(I don't think this is relevant as a backport because cancelling does
work, it just doesn't show the right state at the end)
Fixes: 3dde8fdbad ("migration: Merge precopy/postcopy on switchover start")
Fixes: d0edb8a173 ("migration: Create the postcopy preempt channel asynchronously")
Fixes: 8518278a6a ("migration: implementation of background snapshot thread")
Fixes: bf78a046b9 ("migration: refactor migrate_fd_connect failures")
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250213175927.19642-7-farosas@suse.de>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
After postcopy has started, it's not possible to recover the source
machine in case a migration error occurs because the destination has
already been changing the state of the machine. For that same reason,
it doesn't make sense to try to cancel the migration after postcopy
has started. Reject the cancel command during postcopy.
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250213175927.19642-6-farosas@suse.de>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
If the destination side fails at migration_ioc_process_incoming()
before starting the coroutine, it will report the error but QEMU will
not exit.
Set the migration state to FAILED and exit the process if
exit-on-error allows.
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2633
Reported-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250213175927.19642-5-farosas@suse.de>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Remove all instances of _fd_ from the migration generic code. These
functions have grown over time and the _fd_ part is now just
confusing.
migration_fd_error() -> migration_error() makes it a little
vague. Since it's only used for migration_connect() failures, change
it to migration_connect_set_error().
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250213175927.19642-4-farosas@suse.de>
Signed-off-by: Fabiano Rosas <farosas@suse.de>