qemu/include/block
Kevin Wolf 984a32f17e file-posix: Support FUA writes
Until now, FUA was always emulated with a separate flush after the write
for file-posix. The overhead of processing a second request can reduce
performance significantly for a guest disk that has disabled the write
cache, especially if the host disk is already write through, too, and
the flush isn't actually doing anything.

Advertise support for REQ_FUA in write requests and implement it for
Linux AIO and io_uring using the RWF_DSYNC flag for write requests. The
thread pool still performs a separate fdatasync() call. This can be
improved later by using the pwritev2() syscall if available.
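
To illustrate the thread-pool improvement mentioned above, here is a
minimal sketch, not the QEMU implementation. write_fua() is a
hypothetical helper name. It assumes Linux with a glibc that declares
pwritev2() and RWF_DSYNC, and falls back to the write-then-fdatasync()
emulation otherwise:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for pwritev2() in glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a QEMU function): write 'len' bytes at 'offset'
 * so that the data is stable when the call returns.
 *
 * Returns the number of bytes written (which may be short, as with any
 * write call) or -1 with errno set.
 */
static ssize_t write_fua(int fd, const void *buf, size_t len, off_t offset)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    ssize_t ret;

    assert(buf != NULL || len == 0);

#ifdef RWF_DSYNC
    /* Single request: the write itself carries the data-sync semantics. */
    ret = pwritev2(fd, &iov, 1, offset, RWF_DSYNC);
    if (ret >= 0 || (errno != ENOSYS && errno != EOPNOTSUPP)) {
        return ret;
    }
    /* Syscall or flag not supported at runtime: use the emulation below. */
#endif

    /* Emulation: a plain write followed by a separate flush. */
    ret = pwritev(fd, &iov, 1, offset);
    if (ret < 0) {
        return ret;
    }
    if (fdatasync(fd) < 0) {
        return -1;
    }
    return ret;
}
```

The #ifdef only covers the user space headers. The errno check handles
the case where the headers know the flag but the running kernel does not.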

As an example, this is how fio numbers can be improved in some scenarios
with this patch (all using virtio-blk with cache=directsync on an nvme
block device for the VM, fio with ioengine=libaio,direct=1,sync=1):

                              | old           | with FUA support
------------------------------+---------------+-------------------
bs=4k, iodepth=1, numjobs=1   |  45.6k iops   |  56.1k iops
bs=4k, iodepth=1, numjobs=16  | 183.3k iops   | 236.0k iops
bs=4k, iodepth=16, numjobs=1  | 258.4k iops   | 311.1k iops

However, not all scenarios are clear wins. On another slower disk I saw
little to no improvement. In fact, in two corner case scenarios, I even
observed a regression, which I nevertheless consider acceptable:

1. On slow host disks in a write through cache mode, when the guest is
   using virtio-blk in a separate iothread so that polling can be
   enabled, and each completion is quickly followed up with a new
   request (so that polling gets it), it can happen that enabling FUA
   makes things slower - the additional very fast no-op flush we used to
   have gave the adaptive polling algorithm a success so that it kept
   polling. Without it, we only have the slow write request, which
   disables polling. This is a problem in the polling algorithm that
   will be fixed later in this series.

2. With a high queue depth, it can be beneficial to have flush requests
   for another reason: The optimisation in bdrv_co_flush() that flushes
   only once per write generation acts as a synchronisation mechanism
   that lets all requests complete at the same time. This can result in
   better batching and if the disk is very fast (I only saw this with a
   null_blk backend), this can make up for the overhead of the flush and
   improve throughput. In theory, we could optionally introduce a
   similar artificial latency in the normal completion path to achieve
   the same kind of completion batching. This is not implemented in this
   series.
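
The "flush only once per write generation" idea from point 2 can be
pictured with a toy model. This is a sketch under assumptions, not the
bdrv_co_flush() code: ToyDisk, toy_write() and toy_flush() are
hypothetical names, and the model is single-threaded, so it does not
show how concurrent flush requests wait for each other.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy state, loosely modelled on a write generation counter. */
typedef struct {
    uint64_t write_gen;       /* incremented for every completed write */
    uint64_t flushed_gen;     /* write_gen covered by the last real flush */
    unsigned flushes_issued;  /* how many flushes actually reached the disk */
} ToyDisk;

/* A completed write starts a new write generation. */
static void toy_write(ToyDisk *d)
{
    d->write_gen++;
}

/*
 * Flush request: only go to the disk if something was written since the
 * last flush. Returns true if a real flush was issued, false if skipped.
 */
static bool toy_flush(ToyDisk *d)
{
    uint64_t gen = d->write_gen;

    if (d->flushed_gen == gen) {
        return false;         /* nothing new to flush */
    }
    d->flushes_issued++;      /* stands in for the real flush operation */
    d->flushed_gen = gen;
    return true;
}
```

With several writes followed by several flush requests, only the first
flush does real work in this model; the others find the generation
already flushed and return immediately.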

Compatibility is not a concern for the kernel side of io_uring, which has
supported RWF_DSYNC from the start. However, io_uring_prep_writev2() is
not available before liburing 2.2.

Linux AIO started supporting it in Linux 4.13 and libaio 0.3.111. The
kernel is not a problem for any supported build platform, so it's not
necessary to add runtime checks. However, openSUSE is still stuck with
an older libaio version that would break the build.

We must detect the presence of the writev2 functions in the user space
libraries at build time to avoid build failures.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250307221634.71951-2-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-03-13 17:44:55 +01:00
accounting.h block: add accounting for zone append operation 2023-05-15 08:18:10 -04:00
aio-wait.h system/cpus: rename qemu_mutex_lock_iothread() to bql_lock() 2024-01-08 10:45:43 -05:00
aio.h thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio 2025-03-06 06:47:33 +01:00
aio_task.h block: Remove unused aio_task_pool_empty 2024-09-30 10:53:18 +03:00
block-common.h block: remove unused BLOCK_OP_TYPE_DATAPLANE 2025-02-06 14:51:10 +01:00
block-copy.h copy-before-write: allow specifying minimum cluster size 2024-09-30 10:52:41 +03:00
block-global-state.h block: Add blockdev-set-active QMP command 2025-02-06 14:26:51 +01:00
block-hmp-cmds.h include/block: Untangle inclusion loops 2023-01-20 07:24:28 +01:00
block-io.h block: remove outdated AioContext locking comments 2023-12-21 22:49:27 +01:00
block.h include/block: Untangle inclusion loops 2023-01-20 07:24:28 +01:00
block_backup.h include/block: Untangle inclusion loops 2023-01-20 07:24:28 +01:00
block_int-common.h qemu/compiler: Absorb 'clang-tsa.h' 2025-03-06 14:21:25 +01:00
block_int-global-state.h qapi: blockdev-backup: add discard-source parameter 2024-05-28 15:52:15 +03:00
block_int-io.h block: Mark bdrv_cow_child() and callers GRAPH_RDLOCK 2023-11-07 19:14:19 +01:00
block_int.h include/block: Untangle inclusion loops 2023-01-20 07:24:28 +01:00
blockjob.h Rename "QEMU global mutex" to "BQL" in comments and docs 2024-01-08 10:45:43 -05:00
blockjob_int.h block: Mark block_job_add_bdrv() GRAPH_WRLOCK 2023-11-07 19:14:19 +01:00
dirty-bitmap.h block: Mark bdrv_*_dirty_bitmap() and callers GRAPH_RDLOCK 2023-02-23 19:49:32 +01:00
export.h block/export: Add option to allow export of inactive nodes 2025-02-06 14:46:40 +01:00
fuse.h fuse: Allow exporting BDSs via FUSE 2020-12-11 17:52:39 +01:00
graph-lock.h qemu/compiler: Absorb 'clang-tsa.h' 2025-03-06 14:21:25 +01:00
nbd.h nbd/server: Allow users to adjust handshake limit in QMP 2025-02-11 13:45:47 -06:00
nvme.h hw/nvme: set error status code explicitly for misc commands 2025-02-26 12:40:35 +01:00
qapi.h block: Mark bdrv_get_parent_name() and callers GRAPH_RDLOCK 2023-10-12 16:31:33 +02:00
qdict.h qapi: Move include/qapi/qmp/ to include/qobject/ 2025-02-10 15:33:16 +01:00
raw-aio.h file-posix: Support FUA writes 2025-03-13 17:44:55 +01:00
replication.h replication: move include out of root directory 2021-05-26 14:49:46 +02:00
reqlist.h block/reqlist: add reqlist_wait_all() 2022-03-07 09:33:30 +01:00
snapshot.h block: remove AioContext locking 2023-12-21 22:49:27 +01:00
thread-pool.h thread-pool: Implement generic (non-AIO) pool support 2025-03-06 06:47:33 +01:00
throttle-groups.h block/throttle-groups: Use ThrottleDirection instread of bool is_write 2023-08-29 10:49:24 +02:00
ufs.h hw/ufs: Add temperature event notification support 2025-03-05 02:13:29 +01:00
write-threshold.h block: Clean up includes 2023-02-08 07:28:05 +01:00