qemu/migration
Peter Xu f8c543e808 migration: Allow network to fail even during recovery
Normally the postcopy recover phase should only exist for a super short
period, that's the duration when QEMU is trying to recover from an
interrupted postcopy migration, during which handshake will be carried out
for continuing the procedure with state changes from PAUSED -> RECOVER ->
POSTCOPY_ACTIVE again.

Here RECOVER phase should be super small, that happens right after the
admin specified a new but working network link for QEMU to reconnect to
dest QEMU.

However there can still be case where the channel is broken in this small
RECOVER window.

If it happens, with current code there's no way the src QEMU can got kicked
out of RECOVER stage. No way either to retry the recover in another channel
when established.

This patch allows the RECOVER phase to fail itself too - we're mostly
ready, just some small things missing, e.g. properly kick the main
migration thread out when sleeping on rp_sem when we found that we're at
RECOVER stage.  When this happens, it fails the RECOVER itself, and
rollback to PAUSED stage.  Then the user can retry another round of
recovery.

To make it even stronger, teach QMP command migrate-pause to explicitly
kick src/dst QEMU out when needed, so even if for some reason the migration
thread didn't got kicked out already by a failing rethrn-path thread, the
admin can also kick it out.

This will be an super, super corner case, but still try to cover that.

One can try to test this with two proxy channels for migration:

  (a) socat unix-listen:/tmp/src.sock,reuseaddr,fork tcp:localhost:10000
  (b) socat tcp-listen:10000,reuseaddr,fork unix:/tmp/dst.sock

So the migration channel will be:

                      (a)          (b)
  src -> /tmp/src.sock -> tcp:10000 -> /tmp/dst.sock -> dst

Then to make QEMU hang at RECOVER stage, one can do below:

  (1) stop the postcopy using QMP command postcopy-pause
  (2) kill the 2nd proxy (b)
  (3) try to recover the postcopy using /tmp/src.sock on src
  (4) src QEMU will go into RECOVER stage but won't be able to continue
      from there, because the channel is actually broken at (b)

Before this patch, step (4) will make src QEMU stuck in RECOVER stage,
without a way to kick the QEMU out or continue the postcopy again.  After
this patch, (4) will quickly fail qemu and bounce back to PAUSED stage.

Admin can also kick QEMU from (4) into PAUSED when needed using
migrate-pause when needed.

After bouncing back to PAUSED stage, one can recover again.

Reported-by: Xiaohui Li <xiaohli@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2111332
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-ID: <20231017202633.296756-3-peterx@redhat.com>
2023-11-02 11:35:03 +01:00
..
block-dirty-bitmap.c migration: hold the BQL during setup 2023-10-17 09:25:13 +02:00
block.c qemu-file: Remove _noflush from qemu_file_transferred_noflush() 2023-10-31 08:44:33 +01:00
block.h migration: disable auto-converge during bulk block migration 2017-09-27 11:27:14 +01:00
channel-block.c io: follow coroutine AioContext in qio_channel_yield() 2023-09-07 20:32:11 -05:00
channel-block.h migration: introduce a QIOChannel impl for BlockDriverState VMState 2022-06-22 19:33:43 +01:00
channel.c migration: check magic value for deciding the mapping of channels 2023-02-06 19:22:57 +01:00
channel.h migration: check magic value for deciding the mapping of channels 2023-02-06 19:22:57 +01:00
colo-failover.c migration/colo: Improve an x-colo-lost-heartbeat error message 2023-02-23 14:10:17 +01:00
colo.c qemu-file: Make qemu_fflush() return errors 2023-10-31 08:44:33 +01:00
dirtyrate.c migration/dirtyrate: use QEMU_CLOCK_HOST to report start-time 2023-10-10 08:04:12 +08:00
dirtyrate.h migration/calc-dirty-rate: millisecond-granularity period 2023-10-10 08:03:50 +08:00
exec.c *: Add missing includes of qemu/error-report.h 2023-03-22 15:06:57 +00:00
exec.h migration: Export exec.c functions in its own file 2017-06-01 18:49:22 +02:00
fd.c bulk: Remove pointless QOM casts 2023-06-05 20:48:34 +02:00
fd.h migration: Fix fd protocol for incoming defer 2019-06-05 12:43:55 +02:00
file.c migration: file URI offset 2023-10-04 13:18:08 +02:00
file.h migration: file URI 2023-10-04 13:16:58 +02:00
global_state.c migration: never fail in global_state_store() 2023-06-02 01:03:19 +02:00
meson.build migration: file URI 2023-10-04 13:16:58 +02:00
migration-hmp-cmds.c migration: mode parameter 2023-11-01 16:13:58 +01:00
migration-stats.c migration: migration_rate_limit_reset() don't need the QEMUFile 2023-10-31 08:44:33 +01:00
migration-stats.h migration: Remove transferred atomic counter 2023-10-31 08:44:33 +01:00
migration.c migration: Allow network to fail even during recovery 2023-11-02 11:35:03 +01:00
migration.h migration: Allow network to fail even during recovery 2023-11-02 11:35:03 +01:00
multifd-zlib.c migration: spelling fixes 2023-07-25 17:13:20 +03:00
multifd-zstd.c migration: spelling fixes 2023-07-25 17:13:20 +03:00
multifd.c migration: Remove transferred atomic counter 2023-10-31 08:44:33 +01:00
multifd.h multifd: Add the ramblock to MultiFDRecvParams 2023-05-10 18:48:11 +02:00
options.c migration: mode parameter 2023-11-01 16:13:58 +01:00
options.h migration: mode parameter 2023-11-01 16:13:58 +01:00
page_cache.c migration: Fix cache_init()'s "Failed to allocate" error messages 2021-02-08 11:19:51 +00:00
page_cache.h migration: Clean up signed vs. unsigned XBZRLE cache-size 2021-02-08 11:19:51 +00:00
postcopy-ram.c migration: Fix race that dest preempt thread close too early 2023-09-27 13:58:02 -04:00
postcopy-ram.h migration: Allow postcopy_ram_supported_by_host() to report err 2023-04-27 10:18:25 +02:00
qemu-file.c migration: Refactor error handling in source return path 2023-11-02 11:35:03 +01:00
qemu-file.h migration: Refactor error handling in source return path 2023-11-02 11:35:03 +01:00
ram-compress.c migration: Rename ram_compressed_pages() to compress_ram_pages() 2023-10-30 17:41:55 +01:00
ram-compress.h migration: Rename ram_compressed_pages() to compress_ram_pages() 2023-10-30 17:41:55 +01:00
ram.c migration: Allow network to fail even during recovery 2023-11-02 11:35:03 +01:00
ram.h migration: Refactor error handling in source return path 2023-11-02 11:35:03 +01:00
rdma.c qemu-file: Make qemu_fflush() return errors 2023-10-31 08:44:33 +01:00
rdma.h migration/rdma: Remove qemu_ prefix from exported functions 2023-10-17 09:25:13 +02:00
savevm.c migration: Add tracepoints for downtime checkpoints 2023-11-01 16:13:58 +01:00
savevm.h migration: Add .save_prepare() handler to struct SaveVMHandlers 2023-09-11 08:34:06 +02:00
socket.c migration: Move migrate_use_zero_copy_send() to options.c 2023-04-24 15:01:46 +02:00
socket.h migration: Postcopy preemption preparation on channel creation 2022-07-20 12:15:08 +01:00
target.c migration: Add migration prefix to functions in target.c 2023-09-11 08:34:06 +02:00
threadinfo.c migration/multifd: Protect accesses to migration_threads 2023-07-26 10:55:56 +02:00
threadinfo.h migration/multifd: Protect accesses to migration_threads 2023-07-26 10:55:56 +02:00
tls.c migration: Drop unused parameter for migration_tls_client_create() 2023-05-03 11:24:20 +02:00
tls.h migration: Drop unused parameter for migration_tls_client_create() 2023-05-03 11:24:20 +02:00
trace-events migration: Refactor error handling in source return path 2023-11-02 11:35:03 +01:00
trace.h trace: switch position of headers to what Meson requires 2020-08-21 06:18:24 -04:00
vmstate-types.c Move CPU softfloat unions to cpu-float.h 2022-04-06 14:31:43 +02:00
vmstate.c qemu-file: Remove _noflush from qemu_file_transferred_noflush() 2023-10-31 08:44:33 +01:00
xbzrle.c migration/xbzrle: Use i386 host/cpuinfo.h 2023-05-23 16:51:18 -07:00
xbzrle.h migration/xbzrle: Use i386 host/cpuinfo.h 2023-05-23 16:51:18 -07:00
yank_functions.c bulk: Remove pointless QOM casts 2023-06-05 20:48:34 +02:00
yank_functions.h migration: Move the yank unregister of channel_close out 2021-07-26 12:45:03 +01:00