Commit graph

1632 commits

Author SHA1 Message Date
Eric Blake
253b43a290 mirror: Drop redundant zero_target parameter
The two callers to a mirror job (drive-mirror and blockdev-mirror) set
zero_target precisely when sync mode == FULL, with the one exception
that drive-mirror skips zeroing the target if it was newly created and
reads as zero.  But given the previous patch, that exception is
equally captured by target_is_zero.

Meanwhile, there is another slight wrinkle, fortunately caught by
iotest 185: if the caller uses "sync":"top" but the source has no
backing file, the code in blockdev.c was changing sync to be FULL, but
only after it had set zero_target=false.  In mirror.c, prior to recent
patches, this didn't matter: the only places that inspected sync were
setting is_none_mode (both TOP and FULL had set that to false), and
mirror_start() setting base = mode == MIRROR_SYNC_MODE_TOP ?
bdrv_backing_chain_next(bs) : NULL.  But now that we are passing sync
around, the slammed sync mode would result in a new pre-zeroing pass
even when the user had passed "sync":"top" in an effort to skip
pre-zeroing.  Fortunately, the assignment of base when bs has no
backing chain still works out to NULL if we don't slam things.  So
with the forced change of sync ripped out of blockdev.c, the sync mode
is passed through the full callstack unmolested, and we can now
reliably reconstruct the same settings as what used to be passed in by
zero_target=false, without the redundant parameter.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20250509204341.3553601-24-eblake@redhat.com>
Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[eblake: Fix regression in iotest 185]
Signed-off-by: Eric Blake <eblake@redhat.com>
2025-05-14 20:10:12 -05:00
Eric Blake
d17a34bfb9 mirror: Allow QMP override to declare target already zero
QEMU has an optimization for a just-created drive-mirror destination
that is not possible for blockdev-mirror (which can't create the
destination) - any time we know the destination starts life as all
zeroes, we can skip a pre-zeroing pass on the destination.  Recent
patches have added an improved heuristic for detecting if a file
contains all zeroes, and we plan to use that heuristic in upcoming
patches.  But since a heuristic cannot quickly detect all scenarios,
and there may be cases where the caller is aware of information that
QEMU cannot learn quickly, it makes sense to have a way to tell QEMU
to assume facts about the destination that can make the mirror
operation faster.  Given our existing example of "qemu-img convert
--target-is-zero", it is time to expose this override in QMP for
blockdev-mirror as well.

This patch results in some slight redundancy between the older
s->zero_target (set any time mode==FULL and the destination image was
not just created - ie. clear if drive-mirror is asking to skip the
pre-zero pass) and the newly-introduced s->target_is_zero (in addition
to the QMP override, it is set when drive-mirror creates the
destination image); this will be cleaned up in the next patch.

There is also a subtlety that we must consider.  When drive-mirror is
passing target_is_zero on behalf of a just-created image, we know the
image is sparse (skipping the pre-zeroing keeps it that way), so it
doesn't matter whether the destination also has "discard":"unmap" and
"detect-zeroes":"unmap".  But now that we are letting the user set the
knob for target-is-zero, if the user passes a pre-existing file that
is fully allocated, it is fine to leave the file fully allocated under
"detect-zeroes":"on", but if the file is open with
"detect-zeroes":"unmap", we should really be trying harder to punch
holes in the destination for every region of zeroes copied from the
source.  The easiest way to do this is to still run the pre-zeroing
pass (turning the entire destination file sparse before populating
just the allocated portions of the source), even though that currently
results in double I/O to the portions of the file that are allocated.
A later patch will add further optimizations to reduce redundant
zeroing I/O during the mirror operation.

Since "target-is-zero":true is designed for optimizations, it is okay
to silently ignore the parameter rather than erroring if the user ever
sets the parameter in a scenario where the mirror job can't exploit it
(for example, when doing "sync":"top" instead of "sync":"full", we
can't pre-zero, so setting the parameter won't make a speed
difference).

Signed-off-by: Eric Blake <eblake@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20250509204341.3553601-23-eblake@redhat.com>
Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2025-05-14 16:55:10 -05:00
Eric Blake
5272609670 block: Add new bdrv_co_is_all_zeroes() function
There are some optimizations that require knowing if an image starts
out as reading all zeroes, such as making blockdev-mirror faster by
skipping the copying of source zeroes to the destination.  The
existing bdrv_co_is_zero_fast() is a good building block for answering
this question, but it tends to give an answer of 0 for a file we just
created via QMP 'blockdev-create' or similar (such as 'qemu-img create
-f raw').  Why?  Because file-posix.c insists on allocating a tiny
header to any file rather than leaving it 100% sparse, due to some
filesystems that are unable to answer alignment probes on a hole.  But
teaching file-posix.c to read the tiny header doesn't scale - the
problem of a small header is also visible when libvirt sets up an NBD
client to a just-created file on a migration destination host.

So, we need a wrapper function that handles a bit more complexity in a
common manner for all block devices - when the BDS is mostly a hole,
but has a small non-hole header, it is still worth the time to read
that header and check if it reads as all zeroes before giving up and
returning a pessimistic answer.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250509204341.3553601-19-eblake@redhat.com>
2025-05-14 16:08:23 -05:00
Eric Blake
c33159dec7 block: Expand block status mode from bool to flags
This patch is purely mechanical, changing bool want_zero into an
unsigned int for bitwise-or of flags.  As of this patch, all
implementations are unchanged (the old want_zero==true is now
mode==BDRV_WANT_PRECISE which is a superset of BDRV_WANT_ZERO); but
the callers in io.c that used to pass want_zero==false are now
prepared for future driver changes that can now distinguish bewteen
BDRV_WANT_ZERO vs. BDRV_WANT_ALLOCATED.  The next patch will actually
change the file-posix driver along those lines, now that we have
more-specific hints.

As for the background why this patch is useful: right now, the
file-posix driver recognizes that if allocation is being queried, the
entire image can be reported as allocated (there is no backing file to
refer to) - but this throws away information on whether the entire
image reads as zero (trivially true if lseek(SEEK_HOLE) at offset 0
returns -ENXIO, a bit more complicated to prove if the raw file was
created with 'qemu-img create' since we intentionally allocate a small
chunk of all-zero data to help with alignment probing).  Later patches
will add a generic algorithm for seeing if an entire file reads as
zeroes.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250509204341.3553601-16-eblake@redhat.com>
2025-05-14 15:33:34 -05:00
Raman Dzehtsiar
3d3911f16b blockdev-backup: Add error handling option for copy-before-write jobs
This patch extends the blockdev-backup QMP command to allow users to specify
how to behave when IO errors occur during copy-before-write operations.
Previously, the behavior was fixed and could not be controlled by the user.

The new 'on-cbw-error' option can be set to one of two values:
- 'break-guest-write': Forwards the IO error to the guest and triggers
  the on-source-error policy. This preserves snapshot integrity at the
  expense of guest IO operations.
- 'break-snapshot': Allows the guest OS to continue running normally,
  but invalidates the snapshot and aborts related jobs. This prioritizes
  guest operation over backup consistency.

This enhancement provides more flexibility for backup operations in different
environments where requirements for guest availability versus backup
consistency may vary.

The default behavior remains unchanged to maintain backward compatibility.

Signed-off-by: Raman Dzehtsiar <Raman.Dzehtsiar@gmail.com>
Message-ID: <20250414090025.828660-1-Raman.Dzehtsiar@gmail.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
[vsementsov: fix long lines]
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Tested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
2025-05-12 18:19:31 +03:00
Sunny Zhu
ed1aef1716 block: Remove unused callback function *bdrv_aio_pdiscard
The bytes type in *bdrv_aio_pdiscard should be int64_t rather than int.

There are no drivers implementing the *bdrv_aio_pdiscard() callback,
it appears to be an unused function. Therefore, we'll simply remove it
instead of fixing it.

Additionally, coroutine-based callbacks are preferred. If someone needs
to implement bdrv_aio_pdiscard, a coroutine-based version would be
straightforward to implement.

Signed-off-by: Sunny Zhu <sunnyzhyy@qq.com>
Message-ID: <tencent_7140D2E54157D98CF3D9E64B1A007A1A7906@qq.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-04-25 17:06:50 +02:00
Kevin Wolf
ee416407b3 aio-posix: Separate AioPolledEvent per AioHandler
Adaptive polling has a big problem: It doesn't consider that an event
loop can wait for many different events that may have very different
typical latencies.

For example, think of a guest that tends to send a new I/O request soon
after the previous I/O request completes, but the storage on the host is
rather slow. In this case, getting the new request from guest quickly
means that polling is enabled, but the next thing is performing the I/O
request on the backend, which is slow and disables polling again for the
next guest request. This means that in such a scenario, polling could
help for every other event, but is only ever enabled when it can't
succeed.

In order to fix this, keep a separate AioPolledEvent for each
AioHandler. We will then know that the backend file descriptor always
has a high latency and isn't worth polling for, but we also know that
the guest is always fast and we should poll for it. This solves at least
half of the problem, we can now keep polling for those cases where it
makes sense and get the improved performance from it.

Since the event loop doesn't know which event will be next, we still do
some unnecessary polling while we're waiting for the slow disk. I made
some attempts to be more clever than just randomly growing and shrinking
the polling time, and even to let callers be explicit about when they
expect a new event, but so far this hasn't resulted in improved
performance or even caused performance regressions. For now, let's just
fix the part that is easy enough to fix, we can revisit the rest later.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250307221634.71951-6-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-03-13 17:57:23 +01:00
Kevin Wolf
518db1013c aio: Create AioPolledEvent
As a preparation for having multiple adaptive polling states per
AioContext, move the 'ns' field into a separate struct.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250307221634.71951-4-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-03-13 17:57:23 +01:00
Kevin Wolf
984a32f17e file-posix: Support FUA writes
Until now, FUA was always emulated with a separate flush after the write
for file-posix. The overhead of processing a second request can reduce
performance significantly for a guest disk that has disabled the write
cache, especially if the host disk is already write through, too, and
the flush isn't actually doing anything.

Advertise support for REQ_FUA in write requests and implement it for
Linux AIO and io_uring using the RWF_DSYNC flag for write requests. The
thread pool still performs a separate fdatasync() call. This can be
improved later by using the pwritev2() syscall if available.

As an example, this is how fio numbers can be improved in some scenarios
with this patch (all using virtio-blk with cache=directsync on an nvme
block device for the VM, fio with ioengine=libaio,direct=1,sync=1):

                              | old           | with FUA support
------------------------------+---------------+-------------------
bs=4k, iodepth=1, numjobs=1   |  45.6k iops   |  56.1k iops
bs=4k, iodepth=1, numjobs=16  | 183.3k iops   | 236.0k iops
bs=4k, iodepth=16, numjobs=1  | 258.4k iops   | 311.1k iops

However, not all scenarios are clear wins. On another slower disk I saw
little to no improvment. In fact, in two corner case scenarios, I even
observed a regression, which I however consider acceptable:

1. On slow host disks in a write through cache mode, when the guest is
   using virtio-blk in a separate iothread so that polling can be
   enabled, and each completion is quickly followed up with a new
   request (so that polling gets it), it can happen that enabling FUA
   makes things slower - the additional very fast no-op flush we used to
   have gave the adaptive polling algorithm a success so that it kept
   polling. Without it, we only have the slow write request, which
   disables polling. This is a problem in the polling algorithm that
   will be fixed later in this series.

2. With a high queue depth, it can be beneficial to have flush requests
   for another reason: The optimisation in bdrv_co_flush() that flushes
   only once per write generation acts as a synchronisation mechanism
   that lets all requests complete at the same time. This can result in
   better batching and if the disk is very fast (I only saw this with a
   null_blk backend), this can make up for the overhead of the flush and
   improve throughput. In theory, we could optionally introduce a
   similar artificial latency in the normal completion path to achieve
   the same kind of completion batching. This is not implemented in this
   series.

Compatibility is not a concern for the kernel side of io_uring, it has
supported RWF_DSYNC from the start. However, io_uring_prep_writev2() is
not available before liburing 2.2.

Linux AIO started supporting it in Linux 4.13 and libaio 0.3.111. The
kernel is not a problem for any supported build platform, so it's not
necessary to add runtime checks. However, openSUSE is still stuck with
an older libaio version that would break the build.

We must detect the presence of the writev2 functions in the user space
libraries at build time to avoid build failures.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250307221634.71951-2-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-03-13 17:44:55 +01:00
Stefan Hajnoczi
98c7362b1e Generic CPUs / accelerators patch queue
- Merge "qemu/clang-tsa.h" within "qemu/compiler.h"
 - Various cleanups around accelerators initialization code
   (better user/system split)
 - Various trivial cleanups in accel/tcg/,
   Guard few TCG calls with tcg_enabled()
 - Explicit disassemble_info endianness
 - Improve dual-endianness support for MicroBlaze
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+qvnXhKRciHc/Wuy4+MsLN6twN4FAmfJw08ACgkQ4+MsLN6t
 wN70whAAtfcdWtqseFfb6fvDtjflgxN51Ui0iaOECXUA18USKriGy34eBcMYMiM2
 +eKgU7+jI6JGE4+burcgWUsPpFFF951/A8+lyIbFgO5yToTDmC+qNe4XfmMAIyXq
 uf9Obr2c0Xk9luh4odb+jPAQodw/7G1fKgcCVIJNDCl/xEcPhS9eNpTaHwcVnkWI
 K6KrxWXOsqG6+evJBPWYoXtOOyt0+JcwAsJoGhprwtGm3P9+jSVXsgeGsJVyZcna
 f32JtjWL754O8XeMkOn4x6rt58VrCIMKI9xT7keDyuhTCq0Zki9RO2nMU2dSw5mN
 AfL9hxqUy0Nijnyslg3ugujDfTePsNyLdwwH7n0mnoD72ELi6WnhDsmOThuEB3Rd
 4/kdwTJfA/rlWk/GF1tbKW7AvQZokRARtzmL3V0HmGJu57lX+2JuszEdYBkqDEP7
 GH1I10B2yANUm+C9y3X8qWOU7Ws433ebJeJoZuyfnbZ9Me+UfRmql/oS+V8ata2i
 fArEItpldUFrWRyYLkTbXrh2dgyV9yJTEir/lzOzeAZZzyabTbjf2z9qnh976GGO
 1QnDy5QA4f54kDBUZe7JK26TZsHPch7cgqXW6f8tRlJF7A9hxGK8d2TUV/lC3/vx
 LUOlWNu03PhiruYmZEcWOsY3Jt9jRCF6lIryrnaJsqnVOVmMUMM=
 =3TRh
 -----END PGP SIGNATURE-----

Merge tag 'accel-cpus-20250306' of https://github.com/philmd/qemu into staging

Generic CPUs / accelerators patch queue

- Merge "qemu/clang-tsa.h" within "qemu/compiler.h"
- Various cleanups around accelerators initialization code
  (better user/system split)
- Various trivial cleanups in accel/tcg/,
  Guard few TCG calls with tcg_enabled()
- Explicit disassemble_info endianness
- Improve dual-endianness support for MicroBlaze

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEE+qvnXhKRciHc/Wuy4+MsLN6twN4FAmfJw08ACgkQ4+MsLN6t
# wN70whAAtfcdWtqseFfb6fvDtjflgxN51Ui0iaOECXUA18USKriGy34eBcMYMiM2
# +eKgU7+jI6JGE4+burcgWUsPpFFF951/A8+lyIbFgO5yToTDmC+qNe4XfmMAIyXq
# uf9Obr2c0Xk9luh4odb+jPAQodw/7G1fKgcCVIJNDCl/xEcPhS9eNpTaHwcVnkWI
# K6KrxWXOsqG6+evJBPWYoXtOOyt0+JcwAsJoGhprwtGm3P9+jSVXsgeGsJVyZcna
# f32JtjWL754O8XeMkOn4x6rt58VrCIMKI9xT7keDyuhTCq0Zki9RO2nMU2dSw5mN
# AfL9hxqUy0Nijnyslg3ugujDfTePsNyLdwwH7n0mnoD72ELi6WnhDsmOThuEB3Rd
# 4/kdwTJfA/rlWk/GF1tbKW7AvQZokRARtzmL3V0HmGJu57lX+2JuszEdYBkqDEP7
# GH1I10B2yANUm+C9y3X8qWOU7Ws433ebJeJoZuyfnbZ9Me+UfRmql/oS+V8ata2i
# fArEItpldUFrWRyYLkTbXrh2dgyV9yJTEir/lzOzeAZZzyabTbjf2z9qnh976GGO
# 1QnDy5QA4f54kDBUZe7JK26TZsHPch7cgqXW6f8tRlJF7A9hxGK8d2TUV/lC3/vx
# LUOlWNu03PhiruYmZEcWOsY3Jt9jRCF6lIryrnaJsqnVOVmMUMM=
# =3TRh
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 06 Mar 2025 23:46:23 HKT
# gpg:                using RSA key FAABE75E12917221DCFD6BB2E3E32C2CDEADC0DE
# gpg: Good signature from "Philippe Mathieu-Daudé (F4BUG) <f4bug@amsat.org>" [full]
# Primary key fingerprint: FAAB E75E 1291 7221 DCFD  6BB2 E3E3 2C2C DEAD C0DE

* tag 'accel-cpus-20250306' of https://github.com/philmd/qemu: (54 commits)
  include: Poison TARGET_PHYS_ADDR_SPACE_BITS definition
  system: Open-code qemu_init_arch_modules() using target_name()
  target/i386: Mark WHPX APIC region as little-endian
  target/alpha: Do not mix exception flags and FPCR bits
  target/riscv: Convert misa_mxl_max using GLib macros
  target/riscv: Declare RISCVCPUClass::misa_mxl_max as RISCVMXL
  target/xtensa: Finalize config in xtensa_register_core()
  target/sparc: Constify SPARCCPUClass::cpu_def
  target/i386: Constify X86CPUModel uses
  disas: Remove target_words_bigendian() call in initialize_debug_target()
  target/xtensa: Set disassemble_info::endian value in disas_set_info()
  target/sh4: Set disassemble_info::endian value in disas_set_info()
  target/riscv: Set disassemble_info::endian value in disas_set_info()
  target/ppc: Set disassemble_info::endian value in disas_set_info()
  target/mips: Set disassemble_info::endian value in disas_set_info()
  target/microblaze: Set disassemble_info::endian value in disas_set_info
  target/arm: Set disassemble_info::endian value in disas_set_info()
  target: Set disassemble_info::endian value for big-endian targets
  target: Set disassemble_info::endian value for little-endian targets
  target/mips: Fix possible MSA int overflow
  ...

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2025-03-07 07:39:49 +08:00
Philippe Mathieu-Daudé
82c4d8a3b4 qemu/compiler: Absorb 'clang-tsa.h'
We already have "qemu/compiler.h" for compiler-specific arrangements,
automatically included by "qemu/osdep.h" for each source file. No
need to explicitly include a header for a Clang particularity.

Suggested-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20250117170201.91182-1-philmd@linaro.org>
2025-03-06 14:21:25 +01:00
Maciej S. Szmigiero
b5aa74968b thread-pool: Implement generic (non-AIO) pool support
Migration code wants to manage device data sending threads in one place.

QEMU has an existing thread pool implementation, however it is limited
to queuing AIO operations only and essentially has a 1:1 mapping between
the current AioContext and the AIO ThreadPool in use.

Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's
GThreadPool.

This brings a few new operations on a pool:
* thread_pool_wait() operation waits until all the submitted work requests
have finished.

* thread_pool_set_max_threads() explicitly sets the maximum thread count
in the pool.

* thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count
in the pool to equal the number of still waiting in queue or unfinished work.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/b1efaebdbea7cb7068b8fb74148777012383e12b.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-03-06 06:47:33 +01:00
Maciej S. Szmigiero
dc67daeed5 thread-pool: Rename AIO pool functions to *_aio() and data types to *Aio
These names conflict with ones used by future generic thread pool
equivalents.
Generic names should belong to the generic pool type, not specific (AIO)
type.

Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/70f9e0fb4b01042258a1a57996c64d19779dc7f0.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-03-06 06:47:33 +01:00
Maciej S. Szmigiero
03c6468a13 thread-pool: Remove thread_pool_submit() function
This function name conflicts with one used by a future generic thread pool
function and it was only used by one test anyway.

Update the trace event name in thread_pool_submit_aio() accordingly.

Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/6830f07777f939edaf0a2d301c39adcaaf3817f0.1741124640.git.maciej.szmigiero@oracle.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-03-06 06:47:33 +01:00
Keoseong Park
07b12aae50 hw/ufs: Add temperature event notification support
This patch introduces temperature event notification support to the UFS
emulation. It enables the emulated UFS device to generate
temperature-related events, including high and low temperature
notifications, in compliance with the UFS specification.

With this feature, UFS drivers can now handle temperature exception events
during testing and development within the emulated environment.
This enhances validation and debugging capabilities for thermal event
handling in UFS implementations.

Signed-off-by: Keoseong Park <keosung.park@samsung.com>
Reviewed-by: Jeuk Kim <jeuk20.kim@samsung.com>
Message-ID: <20250225064146epcms2p50889cb0066e2d4734f2386de325bcdf6@epcms2p5>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
2025-03-05 02:13:29 +01:00
Klaus Jensen
6fc39228ff hw/nvme: set error status code explicitly for misc commands
The nvme_aio_err() does not handle Verify, Compare, Copy and other misc
commands and defaults to setting the error status code to Internal
Device Error. For some of these commands, we know better, so set it
explicitly.

For the commands using the nvme_misc_cb() callback (Copy, Flush, ...),
if no status code has explicitly been set by the lower handlers, default
to Internal Device Error as previously.

Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2025-02-26 12:40:35 +01:00
Klaus Jensen
6ccca4b6bb hw/nvme: rework csi handling
The controller incorrectly allows a zoned namespace to be attached even
if CS.CSS is configured to only support the NVM command set for I/O
queues.

Rework handling of namespace command sets in general by attaching
supported namespaces when the controller is started instead of, like
now, statically when realized.

Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2025-02-26 12:40:32 +01:00
Klaus Jensen
d96a32de3f hw/nvme: be compliant wrt. dsm processing limits
The specification states that,

> The controller shall set all three processing limit fields (i.e., the
> DMRL, DMRSL and DMSL fields) to non-zero values or shall clear all
> three processing limit fields to 0h.

So, set the DMRL and DMSL fields in addition to DMRSL.

Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2025-02-25 12:55:21 +01:00
Klaus Jensen
b202fb549d nvme: fix iocs status code values
The status codes related to I/O Command Sets are in the wrong group.

Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2025-02-25 12:55:21 +01:00
Klaus Jensen
9cf6ec0659 hw/nvme: add knob for doorbell buffer config support
Add a 'dbcs' knob to allow Doorbell Buffer Config command to be
disabled.

Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2025-02-25 12:55:21 +01:00
Klaus Jensen
e7047adf1e hw/nvme: make oacs dynamic
Virtualization Management needs sriov-related parameters. Only report
support for the command when that conditions are true.

Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2025-02-25 12:55:21 +01:00
Stephen Bates
23a4b3ebc7 hw/nvme: Add OCP SMART / Health Information Extended Log Page
The Open Compute Project [1] includes a Datacenter NVMe
SSD Specification [2]. The most recent version of this specification
(as of November 2024) is 2.6.1. This specification layers on top of
the NVM Express specifications [3] to provide additional
functionality. A key part of of this is the 512 Byte OCP SMART / Health
Information Extended log page that is defined in Section 4.8.6 of the
specification.

We add a controller argument (ocp) that toggles on/off the SMART log
extended structure.  To accommodate different vendor specific specifications
like OCP, we add a multiplexing function (nvme_vendor_specific_log) which
will route to the different log functions based on arguments and log ids.
We only return the OCP extended SMART log when the command is 0xC0 and ocp
has been turned on in the nvme argumants.

Though we add the whole nvme SMART log extended structure, we only populate
the physical_media_units_{read,written}, log_page_version and
log_page_uuid.

This patch is based on work done by Joel but has been modified enough
that he requested a co-developed-by tag rather than a signed-off-by.

[1]: https://www.opencompute.org/
[2]: https://www.opencompute.org/documents/datacenter-nvme-ssd-specification-v2-6-1-pdf
[3]: https://nvmexpress.org/specifications/

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2025-02-25 12:27:21 +01:00
Eric Blake
ff12e6a5ff nbd/server: Allow users to adjust handshake limit in QMP
Although defaulting the handshake limit to 10 seconds was a nice QoI
change to weed out intentionally slow clients, it can interfere with
integration testing done with manual NBD_OPT commands over 'nbdsh
--opt-mode'.  Expose a QMP knob 'handshake-max-secs' to allow the user
to alter the timeout away from the default.

The parameter name here intentionally matches the spelling of the
constant added in commit fb1c2aaa98, and not the command-line spelling
added in the previous patch for qemu-nbd; that's because in QMP,
longer names serve as good self-documentation, and unlike the command
line, machines don't have problems generating longer spellings.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20250203222722.650694-6-eblake@redhat.com>
[eblake: s/max-secs/max-seconds/ in QMP]
Acked-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
2025-02-11 13:45:47 -06:00
Stefan Hajnoczi
f2ec48fefd Block layer patches
- Managing inactive nodes (enables QSD migration with shared storage)
 - Fix swapped values for BLOCK_IO_ERROR 'device' and 'qom-path'
 - vpc: Read images exported from Azure correctly
 - scripts/qemu-gdb: Support coroutine dumps in coredumps
 - Minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmek34IRHGt3b2xmQHJl
 ZGhhdC5jb20ACgkQfwmycsiPL9bDpxAAnTvwmdazAXG0g9GzqvrEB/+6rStjAsqE
 9MTWV4WxyN41d0RXxN8CYKb8CXSiTRyw6r3CSGNYEI2eShe9e934PriSkZm41HyX
 n9Yh5YxqGZqitzvPtx62Ii/1KG+PcjQbfHuK1p4+rlKa0yQ2eGlio1JIIrZrCkBZ
 ikZcQUrhIyD0XV8hTQ2+Ysa+ZN6itjnlTQIG3gS3m8f8WR7kyUXD8YFMQFJFyjVx
 NrAIpLnc/ln9+5PZR9tje8U7XEn2KCgI5pgGaQnrd0h0G1H4ig8ogzYYnKTLhjU/
 AmQpS8np8Tyg6S1UZTiekEq0VuAhThEQc5b3sGbmHWH/R2ABMStyf18oCBAkPzZ7
 s6h+3XzTKKY2Q5Q3ZG/ANkUJjTNBhdj1fcaARvbSWsqsuk5CWX/I3jzvgihFtCSs
 eGu+b/bLeW6P7hu4qPHBcgLHuB1Fc7Rd2t4BoIGM1wcO2CeC9DzUKOiIMZOEJIh0
 GGqCkEWDHgckDTakD4/vSqm0UDKt6FSlQC9ga/ILBY3IB5HpHoArY58selymy28i
 X7MgAvbjdsmNuUuXDZZOiObcFt3j8jlmwPJpPyzXPQIiPX1RXeBPRhVAEeZCKn6Z
 tfHr72SJdMeVOGXVTvOrJ2iW+4g03rPdmkDFCUhpOwo62RODq7ahvCIXsNf3nEFR
 rSB3T1M/8EM=
 =iQLP
 -----END PGP SIGNATURE-----

Merge tag 'for-upstream' of https://repo.or.cz/qemu/kevin into staging

Block layer patches

- Managing inactive nodes (enables QSD migration with shared storage)
- Fix swapped values for BLOCK_IO_ERROR 'device' and 'qom-path'
- vpc: Read images exported from Azure correctly
- scripts/qemu-gdb: Support coroutine dumps in coredumps
- Minor cleanups

# -----BEGIN PGP SIGNATURE-----
#
# iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmek34IRHGt3b2xmQHJl
# ZGhhdC5jb20ACgkQfwmycsiPL9bDpxAAnTvwmdazAXG0g9GzqvrEB/+6rStjAsqE
# 9MTWV4WxyN41d0RXxN8CYKb8CXSiTRyw6r3CSGNYEI2eShe9e934PriSkZm41HyX
# n9Yh5YxqGZqitzvPtx62Ii/1KG+PcjQbfHuK1p4+rlKa0yQ2eGlio1JIIrZrCkBZ
# ikZcQUrhIyD0XV8hTQ2+Ysa+ZN6itjnlTQIG3gS3m8f8WR7kyUXD8YFMQFJFyjVx
# NrAIpLnc/ln9+5PZR9tje8U7XEn2KCgI5pgGaQnrd0h0G1H4ig8ogzYYnKTLhjU/
# AmQpS8np8Tyg6S1UZTiekEq0VuAhThEQc5b3sGbmHWH/R2ABMStyf18oCBAkPzZ7
# s6h+3XzTKKY2Q5Q3ZG/ANkUJjTNBhdj1fcaARvbSWsqsuk5CWX/I3jzvgihFtCSs
# eGu+b/bLeW6P7hu4qPHBcgLHuB1Fc7Rd2t4BoIGM1wcO2CeC9DzUKOiIMZOEJIh0
# GGqCkEWDHgckDTakD4/vSqm0UDKt6FSlQC9ga/ILBY3IB5HpHoArY58selymy28i
# X7MgAvbjdsmNuUuXDZZOiObcFt3j8jlmwPJpPyzXPQIiPX1RXeBPRhVAEeZCKn6Z
# tfHr72SJdMeVOGXVTvOrJ2iW+4g03rPdmkDFCUhpOwo62RODq7ahvCIXsNf3nEFR
# rSB3T1M/8EM=
# =iQLP
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 06 Feb 2025 11:12:50 EST
# gpg:                using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg:                issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* tag 'for-upstream' of https://repo.or.cz/qemu/kevin: (25 commits)
  block: remove unused BLOCK_OP_TYPE_DATAPLANE
  iotests: Add (NBD-based) tests for inactive nodes
  iotests: Add qsd-migrate case
  iotests: Add filter_qtest()
  nbd/server: Support inactive nodes
  block/export: Add option to allow export of inactive nodes
  block: Drain nodes before inactivating them
  block/export: Don't ignore image activation error in blk_exp_add()
  block: Support inactive nodes in blk_insert_bs()
  block: Add blockdev-set-active QMP command
  block: Add option to create inactive nodes
  block: Fix crash on block_resize on inactive node
  block: Don't attach inactive child to active node
  migration/block-active: Remove global active flag
  block: Inactivate external snapshot overlays when necessary
  block: Allow inactivating already inactive nodes
  block: Add 'active' field to BlockDeviceInfo
  block-backend: Fix argument order when calling 'qapi_event_send_block_io_error()'
  scripts/qemu-gdb: Support coroutine dumps in coredumps
  scripts/qemu-gdb: Simplify fs_base fetching for coroutines
  ...

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2025-02-10 13:25:36 -05:00
Daniel P. Berrangé
407bc4bf90 qapi: Move include/qapi/qmp/ to include/qobject/
The general expectation is that header files should follow the same
file/path naming scheme as the corresponding source file. There are
various historical exceptions to this practice in QEMU, with one of
the most notable being the include/qapi/qmp/ directory. Most of the
headers there correspond to source files in qobject/.

This patch corrects most of that inconsistency by creating
include/qobject/ and moving the headers for qobject/ there.

This also fixes MAINTAINERS for include/qapi/qmp/dispatch.h:
scripts/get_maintainer.pl now reports "QAPI" instead of "No
maintainers found".

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com> #s390x
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20241118151235.2665921-2-armbru@redhat.com>
[Rebased]
2025-02-10 15:33:16 +01:00
Stefan Hajnoczi
fc4e394b28 block: remove unused BLOCK_OP_TYPE_DATAPLANE
BLOCK_OP_TYPE_DATAPLANE prevents BlockDriverState from being used by
virtio-blk/virtio-scsi with IOThread. Commit b112a65c52 ("block:
declare blockjobs and dataplane friends!") eliminated the main reason
for this blocker in 2014.

Nowadays the block layer supports I/O from multiple AioContexts, so
there is even less reason to block IOThread users. Any legitimate
reasons related to interference would probably also apply to
non-IOThread users.

The only remaining users are bdrv_op_unblock(BLOCK_OP_TYPE_DATAPLANE)
calls after bdrv_op_block_all(). If we remove BLOCK_OP_TYPE_DATAPLANE
their behavior doesn't change.

Existing bdrv_op_block_all() callers that don't explicitly unblock
BLOCK_OP_TYPE_DATAPLANE seem to do so simply because no one bothered to
rather than because it is necessary to keep BLOCK_OP_TYPE_DATAPLANE
blocked.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250203182529.269066-1-stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-02-06 14:51:10 +01:00
Kevin Wolf
1600ef01ab block/export: Add option to allow export of inactive nodes
Add an option in BlockExportOptions to allow creating an export on an
inactive node without activating the node. This mode needs to be
explicitly supported by the export type (so that it doesn't perform any
operations that are forbidden for inactive nodes), so this patch alone
doesn't allow this option to be successfully used yet.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250204211407.381505-13-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-02-06 14:46:40 +01:00
Kevin Wolf
8cd37207f8 block: Add blockdev-set-active QMP command
The system emulator tries to automatically activate and inactivate block
nodes at the right point during migration. However, there are still
cases where it's necessary that the user can do this manually.

Images are only activated on the destination VM of a migration when the
VM is actually resumed. If the VM was paused, this doesn't happen
automatically. The user may want to perform some operation on a block
device (e.g. taking a snapshot or starting a block job) without also
resuming the VM yet. This is an example where a manual command is
necessary.

Another example is VM migration when the image files are opened by an
external qemu-storage-daemon instance on each side. In this case, the
process that needs to hand over the images isn't even part of the
migration and can't know when the migration completes. Management tools
need a way to explicitly inactivate images on the source and activate
them on the destination.

This adds a new blockdev-set-active QMP command that lets the user
change the status of individual nodes (this is necessary in
qemu-storage-daemon because it could be serving multiple VMs and only
one of them migrates at a time). For convenience, operating on all
devices (like QEMU does automatically during migration) is offered as an
option, too, and can be used in the context of single VM.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250204211407.381505-9-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-02-06 14:26:51 +01:00
Kevin Wolf
faecd16fe5 block: Add option to create inactive nodes
In QEMU, nodes are automatically created inactive while expecting an
incoming migration (i.e. RUN_STATE_INMIGRATE). In qemu-storage-daemon,
the notion of runstates doesn't exist. It also wouldn't necessarily make
sense to introduce it because a single daemon can serve multiple VMs
that can be in different states.

Therefore, allow the user to explicitly open images as inactive with a
new option. The default is as before: Nodes are usually active, except
when created during RUN_STATE_INMIGRATE.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250204211407.381505-8-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-02-06 14:26:51 +01:00
Kevin Wolf
aec81049c2 block: Add 'active' field to BlockDeviceInfo
This allows querying from QMP (and also HMP) whether an image is
currently active or inactive (in the sense of BDRV_O_INACTIVE).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250204211407.381505-2-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-02-06 14:26:50 +01:00
Philippe Mathieu-Daudé
edf3bce969 include: Include missing 'qemu/clang-tsa.h' header
The next commit will remove "qemu/clang-tsa.h" of "exec/exec-all.h",
however the following files indirectly include it:

  $ git grep -L qemu/clang-tsa.h $(git grep -wl TSA_NO_TSA)
  block/create.c
  include/block/block_int-common.h
  tests/unit/test-bdrv-drain.c
  tests/unit/test-block-iothread.c
  util/qemu-thread-posix.c

Explicitly include it so we can process with the removal in the
next commit.

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20241212185341.2857-4-philmd@linaro.org>
2024-12-20 17:44:57 +01:00
Ayush Mishra
dbaa2936b3 hw/nvme: add NPDAL/NPDGL
Add the NPDGL and NPDAL fields to support large alignment and
granularities.

Signed-off-by: Ayush Mishra <ayush.m55@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Link: https://lore.kernel.org/r/20241001012833.3551820-1-ayush.m55@samsung.com
[k.jensen: renamed the enum values]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2024-11-04 19:09:45 +01:00
Arun Kumar
79e490058f hw/nvme: i/o cmd set independent namespace data structure
Add support for the I/O Command Set Independent Namespace Data
Structure (CNS 8h and 1fh).

Signed-off-by: Arun Kumar <arun.kka@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Link: https://lore.kernel.org/r/20240925004407.3521406-1-arun.kka@samsung.com
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2024-11-04 19:09:45 +01:00
Peter Maydell
51483f6c84 include: Move QemuLockCnt APIs to their own header
Currently the QemuLockCnt data structure and associated functions are
in the include/qemu/thread.h header.  Move them to their own
qemu/lockcnt.h.  The main reason for doing this is that it means we
can autogenerate the documentation comments into the docs/devel
documentation.

The copyright/author in the new header is drawn from lockcnt.c,
since the header changes were added in the same commit as
lockcnt.c; since neither thread.h nor lockcnt.c state an explicit
license, the standard default of GPL-2-or-later applies.

We include the new header (and the .c file, which was accidentally
omitted previously) in the "RCU" part of MAINTAINERS, since that
is where the lockcnt.rst documentation is categorized.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20240816132212.3602106-7-peter.maydell@linaro.org
2024-10-15 15:16:17 +01:00
Peter Maydell
718780d204 nvme queue
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmb7nokACgkQTeGvMW1P
 Del+2gf/YefiiYSL540C2QeYRwMFd6xFKKWYRRJoaARyLAoqInVdLiBql527Oov8
 rgDQq+D0XXP15CNDvfAZ59a36h1bAW79QCfKEUMSbP8GPeqb5pOSRfvYJSwnG1YX
 SC70vKLOrBhzxYiQYSOhLNKdbUM00OUyf2xibu0zk84UpkXtzSR4h/byFnQIHwEV
 /uUh4+cxY6eQK1Tfk/f66FEJLuJTchFOMVswYolDMezu2vJmToWHju/kpy2ugvaC
 +WEUEti8kL66/B399u1uwAad2OejC1Jf4qMFcFQJ9Cs9RV4HTC9byolceJE+1R0V
 CZt1SxvBNdK/ihs1iTjP7fInPqdYKw==
 =16tX
 -----END PGP SIGNATURE-----

Merge tag 'pull-nvme-20241001' of https://gitlab.com/birkelund/qemu into staging

nvme queue

# -----BEGIN PGP SIGNATURE-----
#
# iQEzBAABCgAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmb7nokACgkQTeGvMW1P
# Del+2gf/YefiiYSL540C2QeYRwMFd6xFKKWYRRJoaARyLAoqInVdLiBql527Oov8
# rgDQq+D0XXP15CNDvfAZ59a36h1bAW79QCfKEUMSbP8GPeqb5pOSRfvYJSwnG1YX
# SC70vKLOrBhzxYiQYSOhLNKdbUM00OUyf2xibu0zk84UpkXtzSR4h/byFnQIHwEV
# /uUh4+cxY6eQK1Tfk/f66FEJLuJTchFOMVswYolDMezu2vJmToWHju/kpy2ugvaC
# +WEUEti8kL66/B399u1uwAad2OejC1Jf4qMFcFQJ9Cs9RV4HTC9byolceJE+1R0V
# CZt1SxvBNdK/ihs1iTjP7fInPqdYKw==
# =16tX
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue 01 Oct 2024 08:02:33 BST
# gpg:                using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [full]
# gpg:                 aka "Klaus Jensen <k.jensen@samsung.com>" [full]
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468  4272 63D5 6FC5 E55D A838
#      Subkey fingerprint: 5228 33AA 75E2 DCE6 A247  66C0 4DE1 AF31 6D4F 0DE9

* tag 'pull-nvme-20241001' of https://gitlab.com/birkelund/qemu:
  hw/nvme: add atomic write support
  hw/nvme: add knob for CTRATT.MEM
  hw/nvme: support CTRATT.MEM
  hw/nvme: clear masked events from the aer queue
  hw/nvme: report id controller metadata sgl support

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2024-10-01 11:34:07 +01:00
Arun Kumar
a1ab67883d hw/nvme: support CTRATT.MEM
Indicate that 'MDTS and Size Limits Exclude Metadata (MEM)' in the
Controller Attributes (CTRATT) I/O Command Set Independent Identify
Controller Data Structure.

Signed-off-by: Arun Kumar <arun.kka@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
[k.jensen: updated commit message]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2024-09-30 12:45:17 +02:00
Dr. David Alan Gilbert
e84af3eb72 block: Remove unused aio_task_pool_empty
aio_task_pool_empty has been unused since it was added in
  6e9b225f73 ("block: introduce aio task pool")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <dave@treblig.org>
Message-Id: <20240917002007.330689-1-dave@treblig.org>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
2024-09-30 10:53:18 +03:00
Fiona Ebner
9484ad6c17 copy-before-write: allow specifying minimum cluster size
In the context of backup fleecing, discarding the source will not work
when the fleecing image has a larger granularity than the one used for
block-copy operations (can happen if the backup target has smaller
cluster size), because cbw_co_pdiscard_snapshot() will align down the
discard requests and thus effectively ignore then.

To make @discard-source work in such a scenario, allow specifying the
minimum cluster size used for block-copy operations and thus in
particular also the granularity for discard requests to the source.

The type 'size' (corresponding to uint64_t in C) is used in QAPI to
rule out negative inputs and for consistency with already existing
@cluster-size parameters. Since block_copy_calculate_cluster_size()
uses int64_t for its result, a check that the input is not too large
is added in block_copy_state_new() before calling it. The calculation
in block_copy_calculate_cluster_size() is done in the target int64_t
type.

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Acked-by: Markus Armbruster <armbru@redhat.com> (QAPI schema)
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-Id: <20240711120915.310243-2-f.ebner@proxmox.com>
[vsementsov: switch version to 9.2 in QAPI doc]
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
2024-09-30 10:52:41 +03:00
Yoochan Jeong
7c85332a2b hw/ufs: minor bug fixes related to ufs-test
Minor bugs and errors related to ufs-test are resolved. Some
permissions and code implementations that are not synchronized
with the ufs spec are edited.

Signed-off-by: Yoochan Jeong <yc01.jeong@samsung.com>
Reviewed-by: Jeuk Kim <jeuk20.kim@samsung.com>
Signed-off-by: Jeuk Kim <jeuk20.kim@samsung.com>
2024-09-06 18:04:16 +09:00
Eric Blake
c8a76dbd90 nbd/server: CVE-2024-7409: Cap default max-connections to 100
Allowing an unlimited number of clients to any web service is a recipe
for a rudimentary denial of service attack: the client merely needs to
open lots of sockets without closing them, until qemu no longer has
any more fds available to allocate.

For qemu-nbd, we default to allowing only 1 connection unless more are
explicitly asked for (-e or --shared); this was historically picked as
a nice default (without an explicit -t, a non-persistent qemu-nbd goes
away after a client disconnects, without needing any additional
follow-up commands), and we are not going to change that interface now
(besides, someday we want to point people towards qemu-storage-daemon
instead of qemu-nbd).

But for qemu proper, and the newer qemu-storage-daemon, the QMP
nbd-server-start command has historically had a default of unlimited
number of connections, in part because unlike qemu-nbd it is
inherently persistent until nbd-server-stop.  Allowing multiple client
sockets is particularly useful for clients that can take advantage of
MULTI_CONN (creating parallel sockets to increase throughput),
although known clients that do so (such as libnbd's nbdcopy) typically
use only 8 or 16 connections (the benefits of scaling diminish once
more sockets are competing for kernel attention).  Picking a number
large enough for typical use cases, but not unlimited, makes it
slightly harder for a malicious client to perform a denial of service
merely by opening lots of connections withot progressing through the
handshake.

This change does not eliminate CVE-2024-7409 on its own, but reduces
the chance for fd exhaustion or unlimited memory usage as an attack
surface.  On the other hand, by itself, it makes it more obvious that
with a finite limit, we have the problem of an unauthenticated client
holding 100 fds opened as a way to block out a legitimate client from
being able to connect; thus, later patches will further add timeouts
to reject clients that are not making progress.

This is an INTENTIONAL change in behavior, and will break any client
of nbd-server-start that was not passing an explicit max-connections
parameter, yet expects more than 100 simultaneous connections.  We are
not aware of any such client (as stated above, most clients aware of
MULTI_CONN get by just fine on 8 or 16 connections, and probably cope
with later connections failing by relying on the earlier connections;
libvirt has not yet been passing max-connections, but generally
creates NBD servers with the intent for a single client for the sake
of live storage migration; meanwhile, the KubeSAN project anticipates
a large cluster sharing multiple clients [up to 8 per node, and up to
100 nodes in a cluster], but it currently uses qemu-nbd with an
explicit --shared=0 rather than qemu-storage-daemon with
nbd-server-start).

We considered using a deprecation period (declare that omitting
max-parameters is deprecated, and make it mandatory in 3 releases -
then we don't need to pick an arbitrary default); that has zero risk
of breaking any apps that accidentally depended on more than 100
connections, and where such breakage might not be noticed under unit
testing but only under the larger loads of production usage.  But it
does not close the denial-of-service hole until far into the future,
and requires all apps to change to add the parameter even if 100 was
good enough.  It also has a drawback that any app (like libvirt) that
is accidentally relying on an unlimited default should seriously
consider their own CVE now, at which point they are going to change to
pass explicit max-connections sooner than waiting for 3 qemu releases.
Finally, if our changed default breaks an app, that app can always
pass in an explicit max-parameters with a larger value.

It is also intentional that the HMP interface to nbd-server-start is
not changed to expose max-connections (any client needing to fine-tune
things should be using QMP).

Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20240807174943.771624-12-eblake@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
[ericb: Expand commit message to summarize Dan's argument for why we
break corner-case back-compat behavior without a deprecation period]
Signed-off-by: Eric Blake <eblake@redhat.com>
2024-08-08 16:02:23 -05:00
Eric Blake
fb1c2aaa98 nbd/server: Plumb in new args to nbd_client_add()
Upcoming patches to fix a CVE need to track an opaque pointer passed
in by the owner of a client object, as well as request for a time
limit on how fast negotiation must complete.  Prepare for that by
changing the signature of nbd_client_new() and adding an accessor to
get at the opaque pointer, although for now the two servers
(qemu-nbd.c and blockdev-nbd.c) do not change behavior even though
they pass in a new default timeout value.

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20240807174943.771624-11-eblake@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
[eblake: s/LIMIT/MAX_SECS/ as suggested by Dan]
Signed-off-by: Eric Blake <eblake@redhat.com>
2024-08-08 15:05:27 -05:00
Kevin Wolf
7e17111646 block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked
Upstream clang 18 (and backports to clang 17 in Fedora and RHEL)
implemented support for __attribute__((cleanup())) in its Thread Safety
Analysis, so we can now actually have a proper implementation of
WITH_GRAPH_RDLOCK_GUARD() that understands when we acquire and when we
release the lock.

-Wthread-safety is now only enabled if the compiler is new enough to
understand this pattern. In theory, we could have used some #ifdefs to
keep the existing basic checks on old compilers, but as long as someone
runs a newer compiler (and our CI does), we will catch locking problems,
so it's probably not worth keeping multiple implementations for this.

The implementation can't use g_autoptr any more because the glib macros
define wrapper functions that don't have the right TSA attributes, so
the compiler would complain about them. Just use the cleanup attribute
directly instead.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20240627181245.281403-3-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-08-06 20:12:39 +02:00
Arun Kumar
d522aef88d hw/nvme: add cross namespace copy support
Extend copy command to copy user data across different namespaces via
support for specifying a namespace for each source range

Signed-off-by: Arun Kumar <arun.kka@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2024-07-22 14:36:15 +02:00
Minwoo Im
6471556500 hw/nvme: add Identify Endurance Group List
Commit 73064edfb8 ("hw/nvme: flexible data placement emulation")
intorudced NVMe FDP feature to nvme-subsys and nvme-ctrl with a
single endurance group #1 supported.  This means that controller should
return proper identify data to host with Identify Endurance Group List
(CNS 19h).  But, yes, only just for the endurance group #1.  This patch
allows host applications to ask for which endurance group is available
and utilize FDP through that endurance group.

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2024-07-11 17:05:37 +02:00
Paolo Bonzini
44b424dc4a block: remove separate bdrv_file_open callback
bdrv_file_open and bdrv_open are completely equivalent, they are
never checked except to see which one to invoke.  So merge them
into a single one.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-28 14:44:51 +02:00
Prasad Pandit
24687abf23 linux-aio: add IO_CMD_FDSYNC command support
Libaio defines IO_CMD_FDSYNC command to sync all outstanding
asynchronous I/O operations, by flushing out file data to the
disk storage. Enable linux-aio to submit such aio request.

When using aio=native without fdsync() support, QEMU creates
pthreads, and destroying these pthreads results in TLB flushes.
In a real-time guest environment, TLB flushes cause a latency
spike. This patch helps to avoid such spikes.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Message-ID: <20240425070412.37248-1-ppandit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-06-10 11:05:43 +02:00
Stefan Hajnoczi
e669e800fc aio: warn about iohandler_ctx special casing
The main loop has two AioContexts: qemu_aio_context and iohandler_ctx.
The main loop runs them both, but nested aio_poll() calls on
qemu_aio_context exclude iohandler_ctx.

Which one should qemu_get_current_aio_context() return when called from
the main loop? Document that it's always qemu_aio_context.

This has subtle effects on functions that use
qemu_get_current_aio_context(). For example, aio_co_reschedule_self()
does not work when moving from iohandler_ctx to qemu_aio_context because
qemu_get_current_aio_context() does not differentiate these two
AioContexts.

Document this in order to reduce the chance of future bugs.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20240506190622.56095-3-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-06-10 11:05:43 +02:00
Minwoo Im
5c079578d2 hw/ufs: Add support MCQ of UFSHCI 4.0
This patch adds support for MCQ defined in UFSHCI 4.0.  This patch
utilized the legacy I/O codes as much as possible to support MCQ.

MCQ operation & runtime register is placed at 0x1000 offset of UFSHCI
register statically with no spare space among four registers (48B):

	UfsMcqSqReg, UfsMcqSqIntReg, UfsMcqCqReg, UfsMcqCqIntReg

The maxinum number of queue is 32 as per spec, and the default
MAC(Multiple Active Commands) are 32 in the device.

Example:
	-device ufs,serial=foo,id=ufs0,mcq=true,mcq-maxq=8

Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
Reviewed-by: Jeuk Kim <jeuk20.kim@samsung.com>
Message-Id: <20240528023106.856777-3-minwoo.im@samsung.com>
Signed-off-by: Jeuk Kim <jeuk20.kim@samsung.com>
2024-06-03 16:20:42 +09:00
Minwoo Im
cdba3b901a hw/ufs: Update MCQ-related fields to block/ufs.h
This patch is a prep patch for the following MCQ support patch for
hw/ufs.  This patch updated minimal mandatory fields to support MCQ
based on UFSHCI 4.0.

Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
Reviewed-by: Jeuk Kim <jeuk20.kim@samsung.com>
Message-Id: <20240528023106.856777-2-minwoo.im@samsung.com>
Signed-off-by: Jeuk Kim <jeuk20.kim@samsung.com>
2024-06-03 16:20:42 +09:00
Vladimir Sementsov-Ogievskiy
0fd05c8d80 qapi: blockdev-backup: add discard-source parameter
Add a parameter that enables discard-after-copy. That is mostly useful
in "push backup with fleecing" scheme, when source is snapshot-access
format driver node, based on copy-before-write filter snapshot-access
API:

[guest]      [snapshot-access] ~~ blockdev-backup ~~> [backup target]
   |            |
   | root       | file
   v            v
[copy-before-write]
   |             |
   | file        | target
   v             v
[active disk]   [temp.img]

In this case discard-after-copy does two things:

 - discard data in temp.img to save disk space
 - avoid further copy-before-write operation in discarded area

Note that we have to declare WRITE permission on source in
copy-before-write filter, for discard to work. Still we can't take it
unconditionally, as it will break normal backup from RO source. So, we
have to add a parameter and pass it thorough bdrv_open flags.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Fiona Ebner <f.ebner@proxmox.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20240313152822.626493-5-vsementsov@yandex-team.ru>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
2024-05-28 15:52:15 +03:00