Add a savevm handler for a module to opt-in sending extra sections right
before postcopy starts, and before VM is stopped.
RAM will start to use this new savevm handler in the next patch to do flush
and sync for multifd pages.
Note that we choose to do it before VM stopped because the current only
potential user is not sensitive to VM status, so doing it before VM is
stopped is preferred to enlarge any postcopy downtime.
It is still a bit unfortunate that we need to introduce such a new savevm
handler just for the only use case, however it's so far the cleanest.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250411114534.3370816-4-ppandit@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Rename to migration_legacy_page_bits, to make it clear that
we cannot change the value without causing a migration break.
Move to page-vary.h and page-vary-target.c.
Define via TARGET_PAGE_BITS if not TARGET_PAGE_BITS_VARY.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Convert the existing includes with
sed -i ,exec/memory.h,system/memory.h,g
Move the include within cpu-all.h into a !CONFIG_USER_ONLY block.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Unlike cpr-reboot mode, cpr-transfer mode cannot save volatile ram blocks
in the migration stream file and recreate them later, because the physical
memory for the blocks is pinned and registered for vfio. Add a blocker
for volatile ram blocks.
Also add a blocker for RAM_GUEST_MEMFD. Preserving guest_memfd may be
sufficient for CPR, but it has not been tested yet.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <1740667681-257312-1-git-send-email-steven.sistare@oracle.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
This SaveVMHandler helps device provide its own asynchronous transmission
of the remaining data at the end of a precopy phase via multifd channels,
in parallel with the transfer done by save_live_complete_precopy handlers.
These threads are launched only when multifd device state transfer is
supported.
Management of these threads in done in the multifd migration code,
wrapping them in the generic thread pool.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/qemu-devel/eac74a4ca7edd8968bbf72aa07b9041c76364a16.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Some drivers might want to make use of auxiliary helper threads during VM
state loading, for example to make sure that their blocking (sync) I/O
operations don't block the rest of the migration process.
Add a migration core managed thread pool to facilitate this use case.
The migration core will wait for these threads to finish before
(re)starting the VM at destination.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/b09fd70369b6159c75847e69f235cb908b02570c.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
All callers to migration_incoming_state_destroy() other than
postcopy_ram_listen_thread() do this call with BQL held.
Since migration_incoming_state_destroy() ultimately calls "load_cleanup"
SaveVMHandlers and it will soon call BQL-sensitive code it makes sense
to always call that function under BQL rather than to have it deal with
both cases (with BQL and without BQL).
Add the necessary bql_lock() and bql_unlock() to
postcopy_ram_listen_thread().
qemu_loadvm_state_main() in postcopy_ram_listen_thread() could call
"load_state" SaveVMHandlers that are expecting BQL to be held.
In principle, the only devices that should be arriving on migration
channel serviced by postcopy_ram_listen_thread() are those that are
postcopiable and whose load handlers are safe to be called without BQL
being held.
But nothing currently prevents the source from sending data for "unsafe"
devices which would cause trouble there.
Add a TODO comment there so it's clear that it would be good to improve
handling of such (erroneous) case in the future.
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/21bb5ca337b1d5a802e697f553f37faf296b5ff4.1741193259.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
qemu_loadvm_load_state_buffer() and its load_state_buffer
SaveVMHandler allow providing device state buffer to explicitly
specified device via its idstr and instance id.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/71ca753286b87831ced4afd422e2e2bed071af25.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
This QEMU_VM_COMMAND sub-command and its switchover_start SaveVMHandler is
used to mark the switchover point in main migration stream.
It can be used to inform the destination that all pre-switchover main
migration stream data has been sent/received so it can start to process
post-switchover data that it might have received via other migration
channels like the multifd ones.
Add also the relevant MigrationState bit stream compatibility property and
its hw_compat entry.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Zhang Chen <zhangckid@gmail.com> # for the COLO part
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/311be6da85fc7e49a7598684d80aa631778dcbce.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
We're currently only checking the QEMUFile error after
qemu_loadvm_state(). This was causing a TLS termination error from
multifd recv threads to be ignored.
Start checking the migration error as well to avoid missing further
errors.
Regarding compatibility concerning the TLS termination error that was
being ignored, for QEMUs <= 9.2 - if the old QEMU is being used as
migration source - the recently added migration property
multifd-tls-clean-termination needs to be set to OFF in the
*destination* machine.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Now qemu_savevm_state_complete_precopy() is never used in postcopy, clean
it up as in_postcopy==false now unconditionally.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-14-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Postcopy invokes qemu_savevm_state_complete_precopy() twice for a long
time, and that caused way too much confusions. Let's clean this up and
make postcopy easier to read.
It's actually fairly straightforward: postcopy starts with saving
non-postcopiable iterables, then later it saves again with non-iterable
only. Move these two calls out makes everything much easier to follow.
Otherwise it's very unclear what qemu_savevm_state_complete_precopy() did
in either of the calls.
No functional change intended.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-13-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Postcopy invokes qemu_savevm_state_complete_precopy() twice, that means
it'll invoke COMPLETE notify twice.. also twice the tracepoints that
marking precopy complete.
Move that notification (along with the tracepoint) out to the caller, so
that postcopy will only notify once right at the start of switchover phase
from precopy. When at it, rename it to suite the file now it locates.
For precopy, there should have no functional change except the tracepoint
has a name change.
For the other two users of qemu_savevm_state_complete_precopy(), namely:
qemu_savevm_state() and qemu_savevm_live_state(): the notifier shouldn't
matter because they're not precopy at all. Now in these two contexts (aka,
"savevm", and "colo") sometimes the precopy notifiers will still be
invoked, but that's outside the scope of this patch.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-12-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Do one shot cpu sync at qemu_savevm_state_complete_precopy_non_iterable(),
instead of coding it separately in two places.
Note that in the context of qemu_savevm_state_complete_precopy(), this
patch is also an optimization for postcopy path, in that we can avoid sync
cpu twice during switchover: before this patch, postcopy_start() invokes
twice on qemu_savevm_state_complete_precopy(), each of them will try to
sync CPU info. In reality, only one of them would be enough.
For background snapshot, there's no intended functional change.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-7-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
This parameter is only used by one caller, which is the genuine precopy
complete path (migration_completion_precopy).
The parameter was introduced in a1fbe750fd ("migration: Fix race of image
locking between src and dst") to make sure the inactivate will happen
before EOF to make sure dest will always be able to activate the disk
properly. However there's no limitation on how early we inactivate the
disk. For precopy completion path, we can always do that as long as VM is
stopped.
Move the disk inactivate there, then we can remove this inactivate_disk
parameter in the whole call stack, because all the rest users pass in false
always.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-6-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
QEMU machine has a property "suppress-vmdesc". When it is enabled, QEMU
will stop attaching JSON VM description at the end of the precopy migration
stream (postcopy is never affected because postcopy never attach that).
However even if it's suppressed by the user, the source QEMU will still
construct the JSON descriptions, which is a complete waste of CPU and
memory resources.
To avoid it, only create the JSON writer object if suppress-vmdesc is not
specified.
Luckily, vmstate_save() already supports vmdesc==NULL, so only a few spots
that are left to be prepared that vmdesc can be NULL now.
When at it, move the init / destroy of the JSON writer object to start /
end of the migration - the JSON writer object is a sub-struct of migration
state, and that looks like the only object that was dynamically allocated /
destroyed within migration process. Make it the same as the rest objects
that migration uses.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-3-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
should_send_vmdesc() has a hack inside (which was not reflected in the
function name) in that it tries to detect global postcopy state and that
will affect the value to be returned.
It's easier to keep the helper simple by only check the suppress-vmdesc
property. Then:
- On the sender side of its usage, there's already in_postcopy variable
that we can use: postcopy doesn't send vmdesc at all, so directly skip
everything for postcopy.
- On the recv side, when reaching vmdesc processing it must be precopy
code already, hence that hack check never used to work anyway.
No functional change intended, except a trivial side effect that QEMU
source will start to avoid running some JSON helper in postcopy path, but
that would only reduce the postcopy blackout window a bit, rather than any
other bad side effect.
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250114230746.3268797-2-peterx@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
This patch proposes a flag to maintain disk activation status globally. It
mostly rewrites disk activation mgmt for QEMU, including COLO and QMP
command xen_save_devices_state.
Backgrounds
===========
We have two problems on disk activations, one resolved, one not.
Problem 1: disk activation recover (for switchover interruptions)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When migration is either cancelled or failed during switchover, especially
when after the disks are inactivated, QEMU needs to remember re-activate
the disks again before vm starts.
It used to be done separately in two paths: one in qmp_migrate_cancel(),
the other one in the failure path of migration_completion().
It used to be fixed in different commits, all over the places in QEMU. So
these are the relevant changes I saw, I'm not sure if it's complete list:
- In 2016, commit fe904ea824 ("migration: regain control of images when
migration fails to complete")
- In 2017, commit 1d2acc3162 ("migration: re-active images while migration
been canceled after inactive them")
- In 2023, commit 6dab4c93ec ("migration: Attempt disk reactivation in
more failure scenarios")
Now since we have a slightly better picture maybe we can unify the
reactivation in a single path.
One side benefit of doing so is, we can move the disk operation outside QMP
command "migrate_cancel". It's possible that in the future we may want to
make "migrate_cancel" be OOB-compatible, while that requires the command
doesn't need BQL in the first place. This will already do that and make
migrate_cancel command lightweight.
Problem 2: disk invalidation on top of invalidated disks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is an unresolved bug for current QEMU. Link in "Resolves:" at the
end. It turns out besides the src switchover phase (problem 1 above), QEMU
also needs to remember block activation on destination.
Consider two continuous migration in a row, where the VM was always paused.
In that scenario, the disks are not activated even until migration
completed in the 1st round. When the 2nd round starts, if QEMU doesn't
know the status of the disks, it needs to try inactivate the disk again.
Here the issue is the block layer API bdrv_inactivate_all() will crash a
QEMU if invoked on already inactive disks for the 2nd migration. For
detail, see the bug link at the end.
Implementation
==============
This patch proposes to maintain disk activation with a global flag, so we
know:
- If we used to inactivate disks for migration, but migration got
cancelled, or failed, QEMU will know it should reactivate the disks.
- On incoming side, if the disks are never activated but then another
migration is triggered, QEMU should be able to tell that inactivate is
not needed for the 2nd migration.
We used to have disk_inactive, but it only solves the 1st issue, not the
2nd. Also, it's done in completely separate paths so it's extremely hard
to follow either how the flag changes, or the duration that the flag is
valid, and when we will reactivate the disks.
Convert the existing disk_inactive flag into that global flag (also invert
its naming), and maintain the disk activation status for the whole
lifecycle of qemu. That includes the incoming QEMU.
Put both of the error cases of source migration (failure, cancelled)
together into migration_iteration_finish(), which will be invoked for
either of the scenario. So from that part QEMU should behave the same as
before. However with such global maintenance on disk activation status, we
not only cleanup quite a few temporary paths that we try to maintain the
disk activation status (e.g. in postcopy code), meanwhile it fixes the
crash for problem 2 in one shot.
For freshly started QEMU, the flag is initialized to TRUE showing that the
QEMU owns the disks by default.
For incoming migrated QEMU, the flag will be initialized to FALSE once and
for all showing that the dest QEMU doesn't own the disks until switchover.
That is guaranteed by the "once" variable.
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2395
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-Id: <20241206230838.1111496-7-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Src QEMU sets block_inactive=true very early before the invalidation takes
place. It means if something wrong happened during setting the flag but
before reaching qemu_savevm_state_complete_precopy_non_iterable() where it
did the invalidation work, it'll make block_inactive flag inconsistent.
For example, think about when qemu_savevm_state_complete_precopy_iterable()
can fail: it will have block_inactive set to true even if all block drives
are active.
Fix that by only update the flag after the invalidation is done.
No Fixes for any commit, because it's not an issue if bdrv_activate_all()
is re-entrant upon all-active disks - false positive block_inactive can
bring nothing more than "trying to active the blocks but they're already
active". However let's still do it right to avoid the inconsistent flag
v.s. reality.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-Id: <20241206230838.1111496-6-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Postcopy never cared about late-block-active. However there's no mention
in the capability that it doesn't apply to postcopy.
Considering that we _assumed_ late activation is always good, do that too
for postcopy unconditionally, just like precopy. After this patch, we
should have unified the behavior across all.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-Id: <20241206230838.1111496-5-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Headers in include/sysemu/ are not only related to system
*emulation*, they are also used by virtualization. Rename
as system/ which is clearer.
Files renamed manually then mechanical change using sed tool.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Lei Yang <leiyang@redhat.com>
Message-Id: <20241203172445.28576-1-philmd@linaro.org>
After fixing the loadvm cleanup race the qemu_loadvm_state_cleanup()
is now being called twice in the postcopy listen thread.
Fixes: 4ce5622908 ("migration/multifd: Fix rb->receivedmap cleanup race")
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20241125191128.9120-1-farosas@suse.de
Signed-off-by: Peter Xu <peterx@redhat.com>
* Big cleanup of deprecated machines
* Power11 support for spapr
* XIVE improvements
* Goodbye to Cedric and David as ppc reviewers, thank you both o7
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEETkN92lZhb0MpsKeVZ7MCdqhiHK4FAmcoEicACgkQZ7MCdqhi
HK5M8Q//fz+ZkJndXkBjb1Oinx+q+eVtNm2JrvcWIsXyhG3K+6VxYPp69H+SRv/Z
TWuUqMQPxq8mhQvBJlDAttp/oaUEiOcCRvs/iUoBN12L4mVxXfdoT88TZ4frN3eP
8bePq+DW2N/7gpmsJm5CyEZPpcf9AjVHgLRp3KYFkOJ/14uzvuwnocU39gl+2IUh
MXHTedQgMNXaKorJXk1NVdM6NxMuVhOvwxAs6ya2gwhxyA5tteo5PiQOnDJWkejf
xg3RRsNzGYcs1Qg/3kFIf3RfEB0aYbPxROM8IfPaJWKN5KnMggj/JAkHyK1x/V3J
wml7+cB0doMt/yRiuYJhXpyrtOqpvjRWPA6RhxECWW2kwrovv8NAF8IrFnw9NvOQ
QC66ZaaFcbAcFrVT1e/iggU76d01II6m4OAgKcXw+FRHgps4VU9y83j7ApNnNUWN
IXp9hkzoHi5VwX0FrG4ELUr2iEf1HASMvM8EZ/0AxzWj5iNtQB8lFsrEdaGVXyIS
M5JaJeNjCn4koCyYaFSctH5eKtbzIwnGWnDcdTwaOuQ+9itBvY8O+HZalE6sAc5S
kLFZ7i/Ut/qxbY5pMumt8LKD4pR1SsOxFB8dJCmn/f/tvRGtIVsoY6btNe4M0+24
42MxZbWO6W379C32bwbtsPiGA+aLSgShjP4cWm9cgRjz4RJFnwg=
=vmIG
-----END PGP SIGNATURE-----
Merge tag 'pull-ppc-for-9.2-1-20241104' of https://gitlab.com/npiggin/qemu into staging
* Various bug fixes
* Big cleanup of deprecated machines
* Power11 support for spapr
* XIVE improvements
* Goodbye to Cedric and David as ppc reviewers, thank you both o7
# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCgAdFiEETkN92lZhb0MpsKeVZ7MCdqhiHK4FAmcoEicACgkQZ7MCdqhi
# HK5M8Q//fz+ZkJndXkBjb1Oinx+q+eVtNm2JrvcWIsXyhG3K+6VxYPp69H+SRv/Z
# TWuUqMQPxq8mhQvBJlDAttp/oaUEiOcCRvs/iUoBN12L4mVxXfdoT88TZ4frN3eP
# 8bePq+DW2N/7gpmsJm5CyEZPpcf9AjVHgLRp3KYFkOJ/14uzvuwnocU39gl+2IUh
# MXHTedQgMNXaKorJXk1NVdM6NxMuVhOvwxAs6ya2gwhxyA5tteo5PiQOnDJWkejf
# xg3RRsNzGYcs1Qg/3kFIf3RfEB0aYbPxROM8IfPaJWKN5KnMggj/JAkHyK1x/V3J
# wml7+cB0doMt/yRiuYJhXpyrtOqpvjRWPA6RhxECWW2kwrovv8NAF8IrFnw9NvOQ
# QC66ZaaFcbAcFrVT1e/iggU76d01II6m4OAgKcXw+FRHgps4VU9y83j7ApNnNUWN
# IXp9hkzoHi5VwX0FrG4ELUr2iEf1HASMvM8EZ/0AxzWj5iNtQB8lFsrEdaGVXyIS
# M5JaJeNjCn4koCyYaFSctH5eKtbzIwnGWnDcdTwaOuQ+9itBvY8O+HZalE6sAc5S
# kLFZ7i/Ut/qxbY5pMumt8LKD4pR1SsOxFB8dJCmn/f/tvRGtIVsoY6btNe4M0+24
# 42MxZbWO6W379C32bwbtsPiGA+aLSgShjP4cWm9cgRjz4RJFnwg=
# =vmIG
# -----END PGP SIGNATURE-----
# gpg: Signature made Mon 04 Nov 2024 00:15:35 GMT
# gpg: using RSA key 4E437DDA56616F4329B0A79567B30276A8621CAE
# gpg: Good signature from "Nicholas Piggin <npiggin@gmail.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 4E43 7DDA 5661 6F43 29B0 A795 67B3 0276 A862 1CAE
* tag 'pull-ppc-for-9.2-1-20241104' of https://gitlab.com/npiggin/qemu: (67 commits)
MAINTAINERS: Remove myself as reviewer
MAINTAINERS: Remove myself from XIVE
MAINTAINERS: Remove myself from the PowerNV machines
hw/ppc: Consolidate ppc440 initial mapping creation functions
hw/ppc: Consolidate e500 initial mapping creation functions
tests/qtest: Add XIVE tests for the powernv10 machine
pnv/xive2: TIMA CI ops using alternative offsets or byte lengths
pnv/xive2: TIMA support for 8-byte OS context push for PHYP
pnv/xive: Update PIPR when updating CPPR
pnv/xive: Add special handling for pool targets
ppc/xive2: Support "Pull Thread Context to Odd Thread Reporting Line"
ppc/xive2: Change context/ring specific functions to be generic
ppc/xive2: Support "Pull Thread Context to Register" operation
ppc/xive2: Allow 1-byte write of Target field in TIMA
ppc/xive2: Dump the VP-group and crowd tables with 'info pic'
ppc/xive2: Dump more NVP state with 'info pic'
pnv/xive2: Support for "OS LGS Push" TIMA operation
ppc/xive2: Support TIMA "Pull OS Context to Odd Thread Reporting Line"
pnv/xive2: Define OGEN field in the TIMA
pnv/xive: TIMA patch sets pre-req alignment and formatting changes
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Commit 1392617d35 intended to tag pseries-2.1 - 2.11 machines as
deprecated with reasons mentioned in its commit log.
Removing pseries-2.9 specific code with this patch for now.
While at it, also remove the pre-2.10 migration hacks which now become
obsolete.
Suggested-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Keep migration thread names together, so it's easier to see a list of all
possible migration threads.
Still two functional changes below besides the macro defintions:
- There's one dirty rate thread that we overlooked before, now we add
that too and name it as "mig/dirtyrate" following the old rules.
- The old name "mig/src/rp-thr" has "-thr" but it may not be useful if
it's a thread name anyway, while "rp" can be slightly hard to read.
Taking this chance to rename it to "mig/src/return", hopefully a better
name.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Acked-by: Hyman Huang <yong.huang@smartx.com>
Reviewed-by: Zhang Chen <chen.zhang@intel.com>
Link: https://lore.kernel.org/r/20241011153652.517440-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Fix a segmentation fault in multifd when rb->receivedmap is cleared
too early.
After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
multiple page faults"), multifd started using the rb->receivedmap
bitmap, which belongs to ram.c and is initialized and *freed* from the
ram SaveVMHandlers.
Multifd threads are live until migration_incoming_state_destroy(),
which is called after qemu_loadvm_state_cleanup(), leading to a crash
when accessing rb->receivedmap.
process_incoming_migration_co() ...
qemu_loadvm_state() multifd_nocomp_recv()
qemu_loadvm_state_cleanup() ramblock_recv_bitmap_set_offset()
rb->receivedmap = NULL set_bit_atomic(..., rb->receivedmap)
...
migration_incoming_state_destroy()
multifd_recv_cleanup()
multifd_recv_terminate_threads(NULL)
Move the loadvm cleanup into migration_incoming_state_destroy(), after
multifd_recv_cleanup() to ensure multifd threads have already exited
when rb->receivedmap is cleared.
Adjust the postcopy listen thread comment to indicate that we still
want to skip the cpu synchronization.
CC: qemu-stable@nongnu.org
Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240917185802.15619-3-farosas@suse.de
[peterx: added comment in migration_incoming_state_destroy()]
Signed-off-by: Peter Xu <peterx@redhat.com>
There are two qemu_loadvm_state_cleanup() calls that were introduced
when qemu_loadvm_state_setup() was still called before loading the
configuration section, so there was state to be cleaned up if the
header checks failed.
However, commit 9e14b84908 ("migration/savevm: load_header before
load_setup") has moved that configuration section part to
qemu_loadvm_state_header() which now happens before
qemu_loadvm_state_setup().
Remove the cleanup calls that are now misplaced.
Note that we didn't use Fixes because it's benign to cleanup() even if
setup() is not invoked. So this patch is not needed for stable, as it
falls into cleanup category.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240917185802.15619-2-farosas@suse.de
[peterx: added last paragraph of commit message]
Signed-off-by: Peter Xu <peterx@redhat.com>
An error path missed setting *errp, which can cause a NULL deref.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20240813050638.446172-11-npiggin@gmail.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20240813202329.1237572-19-alex.bennee@linaro.org>
This patch adds a migration state on src called "postcopy-recover-setup".
The new state will describe the intermediate step starting from when the
src QEMU received a postcopy recovery request, until the migration channels
are properly established, but before the recovery process take place.
The request came from Libvirt where Libvirt currently rely on the migration
state events to detect migration state changes. That works for most of the
migration process but except postcopy recovery failures at the beginning.
Currently postcopy recovery only has two major states:
- postcopy-paused: this is the state that both sides of QEMU will be in
for a long time as long as the migration channel was interrupted.
- postcopy-recover: this is the state where both sides of QEMU handshake
with each other, preparing for a continuation of postcopy which used to
be interrupted.
The issue here is when the recovery port is invalid, the src QEMU will take
the URI/channels, noticing the ports are not valid, and it'll silently keep
in the postcopy-paused state, with no event sent to Libvirt. In this case,
the only thing Libvirt can do is to poll the migration status with a proper
interval, however that's less optimal.
Considering that this is the only case where Libvirt won't get a
notification from QEMU on such events, let's add postcopy-recover-setup
state to mimic what we have with the "setup" state of a newly initialized
migration, describing the phase of connection establishment.
With that, postcopy recovery will have two paths to go now, and either path
will guarantee an event generated. Now the events will look like this
during a recovery process on src QEMU:
- Initially when the recovery is initiated on src, QEMU will go from
"postcopy-paused" -> "postcopy-recover-setup". Old QEMUs don't have
this event.
- Depending on whether the channel re-establishment is succeeded:
- In succeeded case, src QEMU will move from "postcopy-recover-setup"
to "postcopy-recover". Old QEMUs also have this event.
- In failure case, src QEMU will move from "postcopy-recover-setup" to
"postcopy-paused" again. Old QEMUs don't have this event.
This guarantees that Libvirt will always receive a notification for
recovery process properly.
One thing to mention is, such new status is only needed on src QEMU not
both. On dest QEMU, the state machine doesn't change. Hence the events
don't change either. It's done like so because dest QEMU may not have an
explicit point of setup start. E.g., it can happen that when dest QEMUs
doesn't use migrate-recover command to use a new URI/channel, but the old
URI/channels can be reused in recovery, in which case the old ports simply
can work again after the network routes are fixed up.
Add a new helper postcopy_is_paused() detecting whether postcopy is still
paused, taking RECOVER_SETUP into account too. When using it on both
src/dst, a slight change is done altogether to always wait for the
semaphore before checking the status, because for both sides a sem_post()
will be required for a recovery.
Cc: Jiri Denemark <jdenemar@redhat.com>
Cc: Prasad Pandit <ppandit@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Buglink: https://issues.redhat.com/browse/RHEL-38485
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The postcopy thread names on dest QEMU are slightly confusing, partly I'll
need to blame myself on 36f62f11e4 ("migration: Postcopy preemption
preparation on channel creation"). E.g., "fault-fast" reads like a fast
version of "fault-default", but it's actually the fast version of
"postcopy/listen".
Taking this chance, rename all the migration threads with proper rules.
Considering we only have 15 chars usable, prefix all threads with "mig/",
meanwhile identify src/dst threads properly this time. So now most thread
names will look like "mig/DIR/xxx", where DIR will be "src"/"dst", except
the bg-snapshot thread which doesn't have a direction.
For multifd threads, making them "mig/{src|dst}/{send|recv}_%d".
We used to have "live_migration" thread for a very long time, now it's
called "mig/src/main". We may hope to have "mig/dst/main" soon but not
yet.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Zhijian Li (Fujitsu) <lizhijian@fujitsu.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Functions that use an Error **errp parameter to return errors should
not also report them to the user, because reporting is the caller's
job. When the caller does, the error is reported twice. When it
doesn't (because it recovered from the error), there is no error to
report, i.e. the report is bogus.
qmp_xen_save_devices_state() and qmp_xen_load_devices_state() violate
this principle: they call qemu_save_device_state() and
qemu_loadvm_state(), which call error_report_err().
I wish I could clean this up now, but migration's error reporting is
too complicated (confused?) for me to mess with it.
Instead, I'm merely improving the error reported by
qmp_xen_load_devices_state() and qmp_xen_load_devices_state() to the
QMP core from
An IO error has occurred
to
saving Xen device state failed
and
loading Xen device state failed
respectively.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20240513141703.549874-6-armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Acked-by: Fabiano Rosas <farosas@suse.de>
Acked-by: Peter Xu <peterx@redhat.com>
The block migration has been considered obsolete since QEMU 8.2 in
favor of the more flexible storage migration provided by the
blockdev-mirror driver. Two releases have passed so now it's time to
remove it.
Deprecation commit 66db46ca83 ("migration: Deprecate block
migration").
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Address the comment added in commit 4629ed1e98
("qerror: Finally unused, clean up"), from 2015:
/*
* These macros will go away, please don't use
* in new code, and do not add new ones!
*/
Mechanical transformation using sed, manually
removing the definition in include/qapi/qmp/qerror.h.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20240312141343.3168265-10-armbru@redhat.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
[Straightforward conflict with commit aeaafb1e59 (migration: export
migration_is_running) resolved]
This will be useful to report errors at a higher level, mostly in VFIO
today.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20240320064911.545001-9-clg@redhat.com
[peterx: drop comment for ERRP_GUARD, per Markus]
Signed-off-by: Peter Xu <peterx@redhat.com>
The purpose is to record a potential error in the migration stream if
qemu_savevm_state_setup() fails. Most of the current .save_setup()
handlers can be modified to use the Error argument instead of managing
their own and calling locally error_report().
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Harsh Prateek Bora <harshpb@linux.ibm.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Cc: John Snow <jsnow@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20240320064911.545001-8-clg@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
This prepares ground for the changes coming next which add an Error**
argument to the .save_setup() handler. Callers of qemu_savevm_state_setup()
now handle the error and fail earlier setting the migration state from
MIGRATION_STATUS_SETUP to MIGRATION_STATUS_FAILED.
In qemu_savevm_state(), move the cleanup to preserve the error
reported by .save_setup() handlers.
Since the previous behavior was to ignore errors at this step of
migration, this change should be examined closely to check that
cleanups are still correctly done.
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20240320064911.545001-7-clg@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
This will prepare ground for future changes adding an Error** argument
to qemu_savevm_state_setup().
Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20240320064911.545001-6-clg@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Before loading the guest states, ensure that the preempt channel has been
ready to use, as some of the states (e.g. via virtio_load) might trigger
page faults that will be handled through the preempt channel. So yield to
the main thread in the case that the channel create event hasn't been
dispatched.
Cc: qemu-stable <qemu-stable@nongnu.org>
Fixes: 9358982744 ("migration: Send requested page directly in rp-return thread")
Originally-by: Lei Wang <lei4.wang@intel.com>
Link: https://lore.kernel.org/all/9aa5d1be-7801-40dd-83fd-f7e041ced249@intel.com/T/
Signed-off-by: Lei Wang <lei4.wang@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240405034056.23933-1-wei.w.wang@intel.com
[peterx: add a todo section, add Fixes and copy stable for 8.0+]
Signed-off-by: Peter Xu <peterx@redhat.com>
When commit bd2270608f ("migration/ram.c: add a notifier chain for
precopy") added PRECOPY_NOTIFY_SETUP notifiers at the end of
qemu_savevm_state_setup(), it didn't take into account a possible
error in the loop calling vmstate_save() or .save_setup() handlers.
Check ret value before calling the notifiers.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20240304122844.1888308-10-clg@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Commit 90697be889 ("live migration: Serialize vmstate saving in stage
2") introduced device serialization in qemu_savevm_state_iterate(). The
rationale behind it was to first complete migration of slower changing
block devices and only then migrate the RAM, to avoid sending fast
changing RAM pages over and over.
This commit was added a long time ago, and while it was useful back
then, it is not the case anymore:
1. Block migration is deprecated, see commit 66db46ca83 ("migration:
Deprecate block migration").
2. Today there are other iterative devices besides RAM and block, such
as VFIO, which are registered for migration after RAM. With current
serialization behavior, a fast changing device can block other
devices from sending their data, which may prevent migration from
converging in some cases.
The issue described in item 2 was observed in several VFIO migration
scenarios with switchover-ack capability enabled, where some workload on
the VM prevented RAM from ever reaching a hard zero, thus blocking VFIO
initial pre-copy data from being sent. Hence, destination could not ack
switchover and migration could not converge.
Fix that by not serializing iterative devices in
qemu_savevm_state_iterate().
Note that this still doesn't fully prevent device starvation. As
correctly pointed out by Peter [1], a fast changing device might
constantly consume all allocated bandwidth and block the following
devices. However, this scenario is more likely to happen only if
max-bandwidth is low.
[1] https://lore.kernel.org/qemu-devel/Zd6iw9dBhW6wKNxx@x1n/
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240304105339.20713-2-avihaih@nvidia.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Add a new migration capability 'mapped-ram'.
The core of the feature is to ensure that RAM pages are mapped
directly to offsets in the resulting migration file instead of being
streamed at arbitrary points.
The reasons why we'd want such behavior are:
- The resulting file will have a bounded size, since pages which are
dirtied multiple times will always go to a fixed location in the
file, rather than constantly being added to a sequential
stream. This eliminates cases where a VM with, say, 1G of RAM can
result in a migration file that's 10s of GBs, provided that the
workload constantly redirties memory.
- It paves the way to implement O_DIRECT-enabled save/restore of the
migration stream as the pages are ensured to be written at aligned
offsets.
- It allows the usage of multifd so we can write RAM pages to the
migration file in parallel.
For now, enabling the capability has no effect. The next couple of
patches implement the core functionality.
Acked-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240229153017.2221-8-farosas@suse.de
Signed-off-by: Peter Xu <peterx@redhat.com>
Now that the migration state reference counting is correct, further
wrap the bottom half dispatch process to avoid future issues.
Move BH creation and scheduling together and wrap the dispatch with an
intermediary function that will ensure we always keep the ref/unref
balanced.
Also move the responsibility of deleting the BH into the wrapper and
remove the now unnecessary pointers.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240119233922.32588-6-farosas@suse.de
Signed-off-by: Peter Xu <peterx@redhat.com>
We need to hold a reference to the current_migration object around
async calls to avoid it been freed while still in use. Even on this
load-side function, we might still use the MigrationState, e.g to
check for capabilities.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240119233922.32588-4-farosas@suse.de
Signed-off-by: Peter Xu <peterx@redhat.com>
- We lost Juan and Leo in the maintainers file
- Steven's suspend state fix
- Steven's fix for coverity on migrate_mode
- Avihai's migration cleanup series
-----BEGIN PGP SIGNATURE-----
iIgEABYKADAWIQS5GE3CDMRX2s990ak7X8zN86vXBgUCZZY0TxIccGV0ZXJ4QHJl
ZGhhdC5jb20ACgkQO1/MzfOr1wbSxgEAoM5g3wkc22lpAlRpU+hJUqT9NVOVQSK+
Fk7XJYTdSgABAKzykA6hAmU5Kj+yVI6jI874SVZbs2FWpFs4osvsKk4D
=sfuM
-----END PGP SIGNATURE-----
Merge tag 'migration-20240104-pull-request' of https://gitlab.com/peterx/qemu into staging
migration 1st pull for 9.0
- We lost Juan and Leo in the maintainers file
- Steven's suspend state fix
- Steven's fix for coverity on migrate_mode
- Avihai's migration cleanup series
# -----BEGIN PGP SIGNATURE-----
#
# iIgEABYKADAWIQS5GE3CDMRX2s990ak7X8zN86vXBgUCZZY0TxIccGV0ZXJ4QHJl
# ZGhhdC5jb20ACgkQO1/MzfOr1wbSxgEAoM5g3wkc22lpAlRpU+hJUqT9NVOVQSK+
# Fk7XJYTdSgABAKzykA6hAmU5Kj+yVI6jI874SVZbs2FWpFs4osvsKk4D
# =sfuM
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 04 Jan 2024 04:30:07 GMT
# gpg: using EDDSA key B9184DC20CC457DACF7DD1A93B5FCCCDF3ABD706
# gpg: issuer "peterx@redhat.com"
# gpg: Good signature from "Peter Xu <xzpeter@gmail.com>" [unknown]
# gpg: aka "Peter Xu <peterx@redhat.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: B918 4DC2 0CC4 57DA CF7D D1A9 3B5F CCCD F3AB D706
* tag 'migration-20240104-pull-request' of https://gitlab.com/peterx/qemu: (26 commits)
migration: fix coverity migrate_mode finding
migration/multifd: Remove unnecessary usage of local Error
migration: Remove unnecessary usage of local Error
migration: Fix migration_channel_read_peek() error path
migration/multifd: Remove error_setg() in migration_ioc_process_incoming()
migration/multifd: Fix leaking of Error in TLS error flow
migration/multifd: Simplify multifd_channel_connect() if else statement
migration/multifd: Fix error message in multifd_recv_initial_packet()
migration: Remove errp parameter in migration_fd_process_incoming()
migration: Refactor migration_incoming_setup()
migration: Remove nulling of hostname in migrate_init()
migration: Remove migrate_max_downtime() declaration
tests/qtest: postcopy migration with suspend
tests/qtest: precopy migration with suspend
tests/qtest: option to suspend during migration
tests/qtest: migration events
migration: preserve suspended for bg_migration
migration: preserve suspended for snapshot
migration: preserve suspended runstate
migration: propagate suspended runstate
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Restoring a snapshot can break a suspended guest. Snapshots suffer from
the same suspended-state issues that affect live migration, plus they must
handle an additional problematic scenario, which is that a running vm must
remain running if it loads a suspended snapshot.
To save, the existing vm_stop call now completely stops the suspended
state. Finish with vm_resume to leave the vm in the state it had prior
to the save, correctly restoring the suspended state.
To load, if the snapshot is not suspended, then vm_stop + vm_resume
correctly handles all states, and leaves the vm in the state it had prior
to the load. However, if the snapshot is suspended, restoration is
trickier. First, call vm_resume to restore the state to suspended so the
current state matches the saved state. Then, if the pre-load state is
running, call wakeup to resume running.
Prior to these changes, the vm_stop to RUN_STATE_SAVE_VM and
RUN_STATE_RESTORE_VM did not change runstate if the current state was
suspended, but now it does, so allow these transitions.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/1704312341-66640-8-git-send-email-steven.sistare@oracle.com
Signed-off-by: Peter Xu <peterx@redhat.com>