qemu/migration
Peter Xu d657a14de5 migration: Fix UAF for incoming migration on MigrationState
On the incoming migration side, QEMU uses a coroutine to load all the VM
states.  Inside, it may reference MigrationState on global states like
migration capabilities, parameters, error state, shared mutexes and more.

However there's nothing yet to make sure MigrationState won't get
destroyed (e.g. after migration_shutdown()).  Meanwhile there's also no API
available to remove the incoming coroutine in migration_shutdown(),
avoiding it to access the freed elements.

There's a bug report showing this can happen and crash dest QEMU when
migration is cancelled on source.

When it happens, the dest main thread is trying to cleanup everything:

  #0  qemu_aio_coroutine_enter
  #1  aio_dispatch_handler
  #2  aio_poll
  #3  monitor_cleanup
  #4  qemu_cleanup
  #5  qemu_default_main

Then it found the migration incoming coroutine, schedule it (even after
migration_shutdown()), causing crash:

  #0  __pthread_kill_implementation
  #1  __pthread_kill_internal
  #2  __GI_raise
  #3  __GI_abort
  #4  __assert_fail_base
  #5  __assert_fail
  #6  qemu_mutex_lock_impl
  #7  qemu_lockable_mutex_lock
  #8  qemu_lockable_lock
  #9  qemu_lockable_auto_lock
  #10 migrate_set_error
  #11 process_incoming_migration_co
  #12 coroutine_trampoline

To fix it, take a refcount after an incoming setup is properly done when
qmp_migrate_incoming() succeeded the 1st time.  As it's during a QMP
handler which needs BQL, it means the main loop is still alive (without
going into cleanups, which also needs BQL).

Releasing the refcount now only until the incoming migration coroutine
finished or failed.  Hence the refcount is valid for both (1) setup phase
of incoming ports, mostly IO watches (e.g. qio_channel_add_watch_full()),
and (2) the incoming coroutine itself (process_incoming_migration_co()).

Note that we can't unref in migration_incoming_state_destroy(), because
both qmp_xen_load_devices_state() and load_snapshot() will use it without
an incoming migration.  Those hold BQL so they're not prone to this issue.

PS: I suspect nobody uses Xen's command at all, as it didn't register yank,
hence AFAIU the command should crash on master when trying to unregister
yank in migration_incoming_state_destroy()..  but that's another story.

Also note that in some incoming failure cases we may not always unref the
MigrationState refcount, which is a trade-off to keep things simple.  We
could make it accurate, but it can be an overkill.  Some examples:

  - Unlike most of the rest protocols, socket_start_incoming_migration()
    may create net listener after incoming port setup sucessfully.
    It means we can't unref in migration_channel_process_incoming() as a
    generic path because socket protocol might keep using MigrationState.

  - For either socket or file, multiple IO watches might be created, it
    means logically each IO watch needs to take one refcount for
    MigrationState so as to be 100% accurate on ownership of refcount taken.

In general, we at least need per-protocol handling to make it accurate,
which can be an overkill if we know incoming failed after all.  Add a short
comment to explain that when taking the refcount in qmp_migrate_incoming().

Bugzilla: https://issues.redhat.com/browse/RHEL-69775
Tested-by: Yan Fu <yafu@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250220132459.512610-1-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
2025-03-10 12:09:24 -03:00
..
block-active.c migration/block-active: Remove global active flag 2025-02-06 14:26:51 +01:00
block-dirty-bitmap.c include: Rename sysemu/ -> system/ 2024-12-20 17:44:56 +01:00
channel-block.c io: follow coroutine AioContext in qio_channel_yield() 2023-09-07 20:32:11 -05:00
channel-block.h migration: introduce a QIOChannel impl for BlockDriverState VMState 2022-06-22 19:33:43 +01:00
channel.c migration: Fix hang after error in destination setup phase 2025-02-14 15:19:05 -03:00
channel.h migration: check magic value for deciding the mapping of channels 2023-02-06 19:22:57 +01:00
colo-failover.c migration/colo: Improve an x-colo-lost-heartbeat error message 2023-02-23 14:10:17 +01:00
colo-stubs.c migration/colo: make colo_incoming_co() return void 2024-05-22 17:34:31 -03:00
colo.c migration: Add MIG_CMD_SWITCHOVER_START and its load handler 2025-03-06 06:47:33 +01:00
cpr-transfer.c migration: cpr-transfer save and load 2025-01-29 11:43:05 -03:00
cpr.c migration: use parameters.mode in cpr_state_save 2025-02-14 15:19:06 -03:00
cpu-throttle.c include: Rename sysemu/ -> system/ 2024-12-20 17:44:56 +01:00
dirtyrate.c qapi: Move include/qapi/qmp/ to include/qobject/ 2025-02-10 15:33:16 +01:00
dirtyrate.h include: Rename sysemu/ -> system/ 2024-12-20 17:44:56 +01:00
exec.c migration: simplify exec migration functions 2024-03-04 07:12:40 +01:00
exec.h migration: convert exec backend to accept MigrateAddress. 2023-11-02 11:35:04 +01:00
fd.c migration: Allow pipes to keep working for fd migrations 2024-11-25 16:21:55 -05:00
fd.h migration: Revert mapped-ram multifd support to fd: URI 2024-03-22 12:12:08 -04:00
file.c migration/multifd: Pass in MultiFDPages_t to file_write_ramblock_iov 2024-09-03 16:24:35 -03:00
file.h migration/multifd: Pass in MultiFDPages_t to file_write_ramblock_iov 2024-09-03 16:24:35 -03:00
global_state.c include: Rename sysemu/ -> system/ 2024-12-20 17:44:56 +01:00
meson.build migration/multifd: Device state transfer support - send side 2025-03-06 06:47:33 +01:00
migration-hmp-cmds.c migration: Add MIG_CMD_SWITCHOVER_START and its load handler 2025-03-06 06:47:33 +01:00
migration-stats.c migration: migration_rate_limit_reset() don't need the QEMUFile 2023-10-31 08:44:33 +01:00
migration-stats.h migration: Remove transferred atomic counter 2023-10-31 08:44:33 +01:00
migration.c migration: Fix UAF for incoming migration on MigrationState 2025-03-10 12:09:24 -03:00
migration.h migration: Add thread pool of optional load threads 2025-03-06 06:47:33 +01:00
multifd-device-state.c migration: Add save_live_complete_precopy_thread handler 2025-03-06 06:47:33 +01:00
multifd-nocomp.c migration/multifd: Make MultiFDSendData a struct 2025-03-06 06:47:33 +01:00
multifd-qatzip.c multifd: bugfix for incorrect migration data with qatzip compression 2025-01-09 17:40:27 -03:00
multifd-qpl.c multifd: bugfix for incorrect migration data with QPL compression 2025-01-09 17:40:21 -03:00
multifd-uadk.c migration/multifd: Fix compile error caused by page_size usage 2025-01-09 17:37:50 -03:00
multifd-zero-page.c migration/multifd: Move pages accounting into multifd_send_zero_page_detect() 2024-09-03 16:24:35 -03:00
multifd-zlib.c migration/multifd: Make MultiFDMethods const 2024-09-03 16:24:36 -03:00
multifd-zstd.c migration/multifd: Fix loop conditions in multifd_zstd_send_prepare and multifd_zstd_recv 2024-09-18 14:27:24 -04:00
multifd.c migration/multifd: Make MultiFDSendData a struct 2025-03-06 06:47:33 +01:00
multifd.h migration/multifd: Make MultiFDSendData a struct 2025-03-06 06:47:33 +01:00
options.c migration: Add MIG_CMD_SWITCHOVER_START and its load handler 2025-03-06 06:47:33 +01:00
options.h migration: Use device_class_set_props_n 2024-12-19 19:33:37 +01:00
page_cache.c migration: Fix cache_init()'s "Failed to allocate" error messages 2021-02-08 11:19:51 +00:00
page_cache.h migration: Clean up signed vs. unsigned XBZRLE cache-size 2021-02-08 11:19:51 +00:00
postcopy-ram.c overcommit: introduce mem-lock=on-fault 2025-02-12 11:36:13 -05:00
postcopy-ram.h migration/postcopy: Add postcopy-recover-setup phase 2024-06-21 09:47:59 -03:00
qemu-file.c migration: SCM_RIGHTS for QEMUFile 2025-01-29 11:43:05 -03:00
qemu-file.h migration/qemu-file: Define g_autoptr() cleanup function for QEMUFile 2025-03-06 06:47:34 +01:00
ram.c migration: Set migration error outside of migrate_cancel 2025-02-14 15:19:05 -03:00
ram.h migration/ram: Move RAM_SAVE_FLAG* into ram.h 2025-01-09 17:38:15 -03:00
rdma.c migration: Change migrate_fd_ to migration_ 2025-02-14 15:19:05 -03:00
rdma.h migration/ram: Move RAM_SAVE_FLAG* into ram.h 2025-01-09 17:38:15 -03:00
savevm.c migration: Add save_live_complete_precopy_thread handler 2025-03-06 06:47:33 +01:00
savevm.h migration: Add thread pool of optional load threads 2025-03-06 06:47:33 +01:00
socket.c migration: Remove unused socket_send_channel_create_sync 2024-10-08 15:28:55 -04:00
socket.h migration: Remove unused socket_send_channel_create_sync 2024-10-08 15:28:55 -04:00
target.c migration: Add migration prefix to functions in target.c 2023-09-11 08:34:06 +02:00
threadinfo.c migration/multifd: Protect accesses to migration_threads 2023-07-26 10:55:56 +02:00
threadinfo.h migration/multifd: Protect accesses to migration_threads 2023-07-26 10:55:56 +02:00
tls.c migration/multifd: Terminate the TLS connection 2025-02-14 15:19:04 -03:00
tls.h migration/multifd: Terminate the TLS connection 2025-02-14 15:19:04 -03:00
trace-events migration: Add MIG_CMD_SWITCHOVER_START and its load handler 2025-03-06 06:47:33 +01:00
trace.h trace: switch position of headers to what Meson requires 2020-08-21 06:18:24 -04:00
vmstate-types.c migration: cpr-transfer mode 2025-01-29 11:56:24 -03:00
vmstate.c qapi: Move include/qapi/qmp/ to include/qobject/ 2025-02-10 15:33:16 +01:00
xbzrle.c migration/xbzrle: Use i386 host/cpuinfo.h 2023-05-23 16:51:18 -07:00
xbzrle.h migration/xbzrle: Use i386 host/cpuinfo.h 2023-05-23 16:51:18 -07:00
yank_functions.c migration/yank: Use channel features 2024-01-29 11:02:12 +08:00
yank_functions.h migration: Move the yank unregister of channel_close out 2021-07-26 12:45:03 +01:00