motorhead/qemu

mirror of https://github.com/Motorhead1991/qemu.git synced 2025-08-02 07:13:54 -06:00

Author	SHA1	Message	Date
Eric Blake	253b43a290	mirror: Drop redundant zero_target parameter The two callers to a mirror job (drive-mirror and blockdev-mirror) set zero_target precisely when sync mode == FULL, with the one exception that drive-mirror skips zeroing the target if it was newly created and reads as zero. But given the previous patch, that exception is equally captured by target_is_zero. Meanwhile, there is another slight wrinkle, fortunately caught by iotest 185: if the caller uses "sync":"top" but the source has no backing file, the code in blockdev.c was changing sync to be FULL, but only after it had set zero_target=false. In mirror.c, prior to recent patches, this didn't matter: the only places that inspected sync were setting is_none_mode (both TOP and FULL had set that to false), and mirror_start() setting base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL. But now that we are passing sync around, the slammed sync mode would result in a new pre-zeroing pass even when the user had passed "sync":"top" in an effort to skip pre-zeroing. Fortunately, the assignment of base when bs has no backing chain still works out to NULL if we don't slam things. So with the forced change of sync ripped out of blockdev.c, the sync mode is passed through the full callstack unmolested, and we can now reliably reconstruct the same settings as what used to be passed in by zero_target=false, without the redundant parameter. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250509204341.3553601-24-eblake@redhat.com> Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> [eblake: Fix regression in iotest 185] Signed-off-by: Eric Blake <eblake@redhat.com>	2025-05-14 20:10:12 -05:00
Eric Blake	d17a34bfb9	mirror: Allow QMP override to declare target already zero QEMU has an optimization for a just-created drive-mirror destination that is not possible for blockdev-mirror (which can't create the destination) - any time we know the destination starts life as all zeroes, we can skip a pre-zeroing pass on the destination. Recent patches have added an improved heuristic for detecting if a file contains all zeroes, and we plan to use that heuristic in upcoming patches. But since a heuristic cannot quickly detect all scenarios, and there may be cases where the caller is aware of information that QEMU cannot learn quickly, it makes sense to have a way to tell QEMU to assume facts about the destination that can make the mirror operation faster. Given our existing example of "qemu-img convert --target-is-zero", it is time to expose this override in QMP for blockdev-mirror as well. This patch results in some slight redundancy between the older s->zero_target (set any time mode==FULL and the destination image was not just created - ie. clear if drive-mirror is asking to skip the pre-zero pass) and the newly-introduced s->target_is_zero (in addition to the QMP override, it is set when drive-mirror creates the destination image); this will be cleaned up in the next patch. There is also a subtlety that we must consider. When drive-mirror is passing target_is_zero on behalf of a just-created image, we know the image is sparse (skipping the pre-zeroing keeps it that way), so it doesn't matter whether the destination also has "discard":"unmap" and "detect-zeroes":"unmap". But now that we are letting the user set the knob for target-is-zero, if the user passes a pre-existing file that is fully allocated, it is fine to leave the file fully allocated under "detect-zeroes":"on", but if the file is open with "detect-zeroes":"unmap", we should really be trying harder to punch holes in the destination for every region of zeroes copied from the source. 
The easiest way to do this is to still run the pre-zeroing pass (turning the entire destination file sparse before populating just the allocated portions of the source), even though that currently results in double I/O to the portions of the file that are allocated. A later patch will add further optimizations to reduce redundant zeroing I/O during the mirror operation. Since "target-is-zero":true is designed for optimizations, it is okay to silently ignore the parameter rather than erroring if the user ever sets the parameter in a scenario where the mirror job can't exploit it (for example, when doing "sync":"top" instead of "sync":"full", we can't pre-zero, so setting the parameter won't make a speed difference). Signed-off-by: Eric Blake <eblake@redhat.com> Acked-by: Markus Armbruster <armbru@redhat.com> Message-ID: <20250509204341.3553601-23-eblake@redhat.com> Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-05-14 16:55:10 -05:00
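The pre-zeroing rules described in the message above can be condensed into a small decision function. This is a minimal Python model read off the commit text only, not QEMU's C code; the function name `needs_prezero_pass` and its parameters are hypothetical.

```python
def needs_prezero_pass(sync: str, target_is_zero: bool, detect_zeroes: str) -> bool:
    """Return True if a mirror job would zero the destination before copying.

    Hypothetical model of the rules stated in the commit message; the real
    logic lives in QEMU's block/mirror.c and may consider more conditions.
    """
    if sync != "full":
        # e.g. "sync":"top": there is no pre-zeroing pass, so
        # "target-is-zero" is silently ignored.
        return False
    if not target_is_zero:
        # Destination contents are unknown, so zero it first.
        return True
    # Destination already reads as zero. A pre-zeroing pass is still run
    # under "detect-zeroes":"unmap" so that holes get punched.
    return detect_zeroes == "unmap"
```

With `"sync":"full"` and `"target-is-zero":true`, the model skips the pass for `"detect-zeroes":"on"` and keeps it for `"detect-zeroes":"unmap"`, which is the double-I/O case the message says a later patch will optimize.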
Eric Blake	5272609670	block: Add new bdrv_co_is_all_zeroes() function There are some optimizations that require knowing if an image starts out as reading all zeroes, such as making blockdev-mirror faster by skipping the copying of source zeroes to the destination. The existing bdrv_co_is_zero_fast() is a good building block for answering this question, but it tends to give an answer of 0 for a file we just created via QMP 'blockdev-create' or similar (such as 'qemu-img create -f raw'). Why? Because file-posix.c insists on allocating a tiny header to any file rather than leaving it 100% sparse, due to some filesystems that are unable to answer alignment probes on a hole. But teaching file-posix.c to read the tiny header doesn't scale - the problem of a small header is also visible when libvirt sets up an NBD client to a just-created file on a migration destination host. So, we need a wrapper function that handles a bit more complexity in a common manner for all block devices - when the BDS is mostly a hole, but has a small non-hole header, it is still worth the time to read that header and check if it reads as all zeroes before giving up and returning a pessimistic answer. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-19-eblake@redhat.com>	2025-05-14 16:08:23 -05:00
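The idea of tolerating a small non-hole header can be illustrated from user space. The sketch below is Python on Linux, not the `bdrv_co_is_all_zeroes()` implementation: the name `reads_as_all_zeroes`, the `MAX_HEADER` budget, and the choice to walk every data extent are assumptions made for illustration.

```python
import errno
import os

MAX_HEADER = 64 * 1024  # assumed cap on how much allocated data is worth reading


def reads_as_all_zeroes(path: str, max_header: int = MAX_HEADER) -> bool:
    """Report whether a file reads as zeroes, reading only small data extents."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        budget = max_header
        while offset < size:
            try:
                data_start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    return True  # nothing but a hole up to end of file
                raise
            data_end = os.lseek(fd, data_start, os.SEEK_HOLE)
            length = data_end - data_start
            if length > budget:
                return False  # too much data to check cheaply: pessimistic answer
            budget -= length
            if any(os.pread(fd, length, data_start)):
                return False  # found a non-zero byte
            offset = data_end
        return True
    finally:
        os.close(fd)
```

A file that is one large hole plus a few KiB of zero-filled header is reported as all zeroes, while a large allocated file gets the pessimistic answer without being read.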
Eric Blake	c33159dec7	block: Expand block status mode from bool to flags This patch is purely mechanical, changing bool want_zero into an unsigned int for bitwise-or of flags. As of this patch, all implementations are unchanged (the old want_zero==true is now mode==BDRV_WANT_PRECISE which is a superset of BDRV_WANT_ZERO); but the callers in io.c that used to pass want_zero==false are now prepared for future driver changes that can now distinguish between BDRV_WANT_ZERO vs. BDRV_WANT_ALLOCATED. The next patch will actually change the file-posix driver along those lines, now that we have more-specific hints. As for the background why this patch is useful: right now, the file-posix driver recognizes that if allocation is being queried, the entire image can be reported as allocated (there is no backing file to refer to) - but this throws away information on whether the entire image reads as zero (trivially true if lseek(SEEK_HOLE) at offset 0 returns -ENXIO, a bit more complicated to prove if the raw file was created with 'qemu-img create' since we intentionally allocate a small chunk of all-zero data to help with alignment probing). Later patches will add a generic algorithm for seeing if an entire file reads as zeroes. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-16-eblake@redhat.com>	2025-05-14 15:33:34 -05:00
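The move from a boolean to a flag set can be pictured with a small Python `IntFlag`. The member names mirror the constants named in the message, but the bit values, the exact composition of `WANT_PRECISE`, and the helper `wants_zero_info` are illustrative assumptions rather than a copy of QEMU's headers.

```python
from enum import IntFlag


class BlockStatusMode(IntFlag):
    # Bit values are assumptions for illustration only.
    WANT_ZERO = 0x1       # caller cares whether the range reads as zero
    WANT_ALLOCATED = 0x2  # caller cares whether the range is allocated
    # The old want_zero == true maps to a superset of WANT_ZERO.
    WANT_PRECISE = WANT_ZERO | WANT_ALLOCATED


def wants_zero_info(mode: BlockStatusMode) -> bool:
    """A driver can skip zero detection when the caller did not ask for it."""
    return bool(mode & BlockStatusMode.WANT_ZERO)
```

A driver that only receives `WANT_ALLOCATED` may answer "allocated" cheaply, while `WANT_ZERO` or `WANT_PRECISE` tells it that zero information is worth computing.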
Raman Dzehtsiar	3d3911f16b	blockdev-backup: Add error handling option for copy-before-write jobs This patch extends the blockdev-backup QMP command to allow users to specify how to behave when IO errors occur during copy-before-write operations. Previously, the behavior was fixed and could not be controlled by the user. The new 'on-cbw-error' option can be set to one of two values: - 'break-guest-write': Forwards the IO error to the guest and triggers the on-source-error policy. This preserves snapshot integrity at the expense of guest IO operations. - 'break-snapshot': Allows the guest OS to continue running normally, but invalidates the snapshot and aborts related jobs. This prioritizes guest operation over backup consistency. This enhancement provides more flexibility for backup operations in different environments where requirements for guest availability versus backup consistency may vary. The default behavior remains unchanged to maintain backward compatibility. Signed-off-by: Raman Dzehtsiar <Raman.Dzehtsiar@gmail.com> Message-ID: <20250414090025.828660-1-Raman.Dzehtsiar@gmail.com> Acked-by: Markus Armbruster <armbru@redhat.com> [vsementsov: fix long lines] Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Tested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>	2025-05-12 18:19:31 +03:00
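As a rough illustration of how a management tool might pass the new option, the Python sketch below builds a `blockdev-backup` QMP command. The helper name, the job id, and the node names are hypothetical; only the option name `on-cbw-error` and its two values come from the message above.

```python
import json

ON_CBW_ERROR_VALUES = {"break-guest-write", "break-snapshot"}


def blockdev_backup_cmd(device: str, target: str, on_cbw_error: str) -> dict:
    """Build a blockdev-backup command with an explicit CBW error policy."""
    if on_cbw_error not in ON_CBW_ERROR_VALUES:
        raise ValueError(f"unsupported on-cbw-error value: {on_cbw_error!r}")
    return {
        "execute": "blockdev-backup",
        "arguments": {
            "job-id": "backup0",  # hypothetical job id
            "device": device,     # source node name
            "target": target,     # destination node name
            "sync": "full",
            "on-cbw-error": on_cbw_error,
        },
    }


# Prefer guest availability over backup consistency.
print(json.dumps(blockdev_backup_cmd("drive0", "backup-target", "break-snapshot")))
```

Choosing `break-guest-write` instead keeps the snapshot intact and surfaces the I/O error to the guest.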
Sunny Zhu	ed1aef1716	block: Remove unused callback function bdrv_aio_pdiscard The bytes type in bdrv_aio_pdiscard should be int64_t rather than int. No driver implements the bdrv_aio_pdiscard() callback, so it appears to be unused. Therefore, we'll simply remove it instead of fixing it. Additionally, coroutine-based callbacks are preferred. If someone needs to implement bdrv_aio_pdiscard, a coroutine-based version would be straightforward to implement. Signed-off-by: Sunny Zhu <sunnyzhyy@qq.com> Message-ID: <tencent_7140D2E54157D98CF3D9E64B1A007A1A7906@qq.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-04-25 17:06:50 +02:00
Kevin Wolf	ee416407b3	aio-posix: Separate AioPolledEvent per AioHandler Adaptive polling has a big problem: It doesn't consider that an event loop can wait for many different events that may have very different typical latencies. For example, think of a guest that tends to send a new I/O request soon after the previous I/O request completes, but the storage on the host is rather slow. In this case, getting the new request from guest quickly means that polling is enabled, but the next thing is performing the I/O request on the backend, which is slow and disables polling again for the next guest request. This means that in such a scenario, polling could help for every other event, but is only ever enabled when it can't succeed. In order to fix this, keep a separate AioPolledEvent for each AioHandler. We will then know that the backend file descriptor always has a high latency and isn't worth polling for, but we also know that the guest is always fast and we should poll for it. This solves at least half of the problem, we can now keep polling for those cases where it makes sense and get the improved performance from it. Since the event loop doesn't know which event will be next, we still do some unnecessary polling while we're waiting for the slow disk. I made some attempts to be more clever than just randomly growing and shrinking the polling time, and even to let callers be explicit about when they expect a new event, but so far this hasn't resulted in improved performance or even caused performance regressions. For now, let's just fix the part that is easy enough to fix, we can revisit the rest later. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250307221634.71951-6-kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-03-13 17:57:23 +01:00
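The effect of keeping polling state per handler can be shown with a toy model. The Python sketch below is a deliberately simplified stand-in for the adaptive algorithm: `AioPolledEvent`, `AioHandler` and the `ns` field are names from the commit messages, while `POLL_MAX_NS`, `POLL_START_NS`, the doubling rule and `adjust_polling` are assumptions.

```python
from dataclasses import dataclass, field

POLL_MAX_NS = 32_000   # assumed upper bound on the polling window
POLL_START_NS = 4_000  # assumed window used when polling is (re-)enabled


@dataclass
class AioPolledEvent:
    ns: int = 0  # current polling window; 0 means polling is disabled


@dataclass
class AioHandler:
    name: str
    poll: AioPolledEvent = field(default_factory=AioPolledEvent)


def adjust_polling(poll: AioPolledEvent, wait_ns: int) -> None:
    """Grow or disable one handler's polling window based on its own latency."""
    if wait_ns <= poll.ns:
        return  # the event arrived within the window: keep it
    if wait_ns > POLL_MAX_NS:
        poll.ns = 0  # this event source is too slow to be worth polling for
    else:
        poll.ns = min(max(poll.ns * 2, POLL_START_NS), POLL_MAX_NS)


guest = AioHandler("guest-notifier")  # new requests arrive quickly
backend = AioHandler("backend-fd")    # slow host storage
for _ in range(4):
    adjust_polling(guest.poll, wait_ns=5_000)
    adjust_polling(backend.poll, wait_ns=500_000)
```

Because each handler owns its window, the slow backend no longer resets the window that the fast guest notifier has earned, which is the half of the problem the message says this patch solves.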
Kevin Wolf	518db1013c	aio: Create AioPolledEvent As a preparation for having multiple adaptive polling states per AioContext, move the 'ns' field into a separate struct. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250307221634.71951-4-kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-03-13 17:57:23 +01:00
Kevin Wolf	984a32f17e	file-posix: Support FUA writes Until now, FUA was always emulated with a separate flush after the write for file-posix. The overhead of processing a second request can reduce performance significantly for a guest disk that has disabled the write cache, especially if the host disk is already write through, too, and the flush isn't actually doing anything. Advertise support for REQ_FUA in write requests and implement it for Linux AIO and io_uring using the RWF_DSYNC flag for write requests. The thread pool still performs a separate fdatasync() call. This can be improved later by using the pwritev2() syscall if available. As an example, this is how fio numbers can be improved in some scenarios with this patch (all using virtio-blk with cache=directsync on an nvme block device for the VM, fio with ioengine=libaio,direct=1,sync=1):
                              | old           | with FUA support
------------------------------+---------------+-------------------
bs=4k, iodepth=1, numjobs=1   |  45.6k iops   |  56.1k iops
bs=4k, iodepth=1, numjobs=16  | 183.3k iops   | 236.0k iops
bs=4k, iodepth=16, numjobs=1  | 258.4k iops   | 311.1k iops
However, not all scenarios are clear wins. On another slower disk I saw little to no improvement. In fact, in two corner case scenarios, I even observed a regression, which I however consider acceptable: 1. On slow host disks in a write through cache mode, when the guest is using virtio-blk in a separate iothread so that polling can be enabled, and each completion is quickly followed up with a new request (so that polling gets it), it can happen that enabling FUA makes things slower - the additional very fast no-op flush we used to have gave the adaptive polling algorithm a success so that it kept polling. Without it, we only have the slow write request, which disables polling. This is a problem in the polling algorithm that will be fixed later in this series. 2.
With a high queue depth, it can be beneficial to have flush requests for another reason: The optimisation in bdrv_co_flush() that flushes only once per write generation acts as a synchronisation mechanism that lets all requests complete at the same time. This can result in better batching and if the disk is very fast (I only saw this with a null_blk backend), this can make up for the overhead of the flush and improve throughput. In theory, we could optionally introduce a similar artificial latency in the normal completion path to achieve the same kind of completion batching. This is not implemented in this series. Compatibility is not a concern for the kernel side of io_uring, it has supported RWF_DSYNC from the start. However, io_uring_prep_writev2() is not available before liburing 2.2. Linux AIO started supporting it in Linux 4.13 and libaio 0.3.111. The kernel is not a problem for any supported build platform, so it's not necessary to add runtime checks. However, openSUSE is still stuck with an older libaio version that would break the build. We must detect the presence of the writev2 functions in the user space libraries at build time to avoid build failures. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250307221634.71951-2-kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-03-13 17:44:55 +01:00
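The difference between a FUA-style write and the emulated write-plus-flush can be tried from user space. The Python sketch below assumes Linux and Python 3.7 or later, where `os.pwritev()` accepts the `os.RWF_DSYNC` flag; the helper name `write_fua` is hypothetical and this is not QEMU's file-posix code.

```python
import os


def write_fua(fd: int, data: bytes, offset: int) -> int:
    """Write data so that it is durable on return.

    Uses a single pwritev2()-style call with RWF_DSYNC when the platform
    offers it, and otherwise falls back to write + fdatasync().
    """
    if hasattr(os, "pwritev") and hasattr(os, "RWF_DSYNC"):
        # One request: the write itself carries the durability requirement.
        return os.pwritev(fd, [data], offset, os.RWF_DSYNC)
    # Two requests: the emulation that the commit message describes.
    written = os.pwrite(fd, data, offset)
    os.fdatasync(fd)
    return written
```

The fallback branch mirrors the build-time concern raised in the message: the flag has to be detected before it is used, because older user-space libraries may lack it.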
Stefan Hajnoczi	98c7362b1e	Generic CPUs / accelerators patch queue - Merge "qemu/clang-tsa.h" within "qemu/compiler.h" - Various cleanups around accelerators initialization code (better user/system split) - Various trivial cleanups in accel/tcg/, Guard few TCG calls with tcg_enabled() - Explicit disassemble_info endianness - Improve dual-endianness support for MicroBlaze Merge tag 'accel-cpus-20250306' of https://github.com/philmd/qemu into staging # gpg: Signature made Thu 06 Mar 2025 23:46:23 HKT # gpg: using RSA key FAABE75E12917221DCFD6BB2E3E32C2CDEADC0DE # gpg: Good signature from "Philippe Mathieu-Daudé (F4BUG) <f4bug@amsat.org>" [full] # Primary key fingerprint: FAAB E75E 1291 7221 DCFD 6BB2 E3E3 2C2C DEAD C0DE * tag 'accel-cpus-20250306' of https://github.com/philmd/qemu: (54 commits) include: Poison TARGET_PHYS_ADDR_SPACE_BITS definition system: Open-code qemu_init_arch_modules() using target_name() target/i386: Mark WHPX APIC region as little-endian target/alpha: Do not mix exception flags and FPCR bits target/riscv: Convert misa_mxl_max using GLib macros target/riscv: Declare RISCVCPUClass::misa_mxl_max as RISCVMXL target/xtensa: Finalize config in xtensa_register_core() target/sparc: Constify SPARCCPUClass::cpu_def target/i386: Constify X86CPUModel uses disas: Remove target_words_bigendian() call in initialize_debug_target() target/xtensa: Set disassemble_info::endian value in disas_set_info() target/sh4: Set disassemble_info::endian value in disas_set_info() target/riscv: Set disassemble_info::endian value in disas_set_info() target/ppc: Set disassemble_info::endian value in disas_set_info() target/mips: Set disassemble_info::endian value in disas_set_info() target/microblaze: Set disassemble_info::endian value in disas_set_info target/arm: Set disassemble_info::endian value in disas_set_info() target: Set disassemble_info::endian value for big-endian targets target: Set disassemble_info::endian value for little-endian targets target/mips: Fix possible MSA int overflow ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-03-07 07:39:49 +08:00
Philippe Mathieu-Daudé	82c4d8a3b4	qemu/compiler: Absorb 'clang-tsa.h' We already have "qemu/compiler.h" for compiler-specific arrangements, automatically included by "qemu/osdep.h" for each source file. No need to explicitly include a header for a Clang particularity. Suggested-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20250117170201.91182-1-philmd@linaro.org>	2025-03-06 14:21:25 +01:00
Maciej S. Szmigiero	b5aa74968b	thread-pool: Implement generic (non-AIO) pool support Migration code wants to manage device data sending threads in one place. QEMU has an existing thread pool implementation, however it is limited to queuing AIO operations only and essentially has a 1:1 mapping between the current AioContext and the AIO ThreadPool in use. Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's GThreadPool. This brings a few new operations on a pool: * thread_pool_wait() operation waits until all the submitted work requests have finished. * thread_pool_set_max_threads() explicitly sets the maximum thread count in the pool. * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count in the pool to equal the number of still waiting in queue or unfinished work. Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Link: https://lore.kernel.org/qemu-devel/b1efaebdbea7cb7068b8fb74148777012383e12b.1741124640.git.maciej.szmigiero@oracle.com Signed-off-by: Cédric Le Goater <clg@redhat.com>	2025-03-06 06:47:33 +01:00
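The "wrap an existing pool and add a wait-for-all operation" approach can be mimicked in a few lines. The sketch below is only an analogy in Python built on `concurrent.futures`; it does not use GLib's GThreadPool, and the class `GenericThreadPool` and its methods are hypothetical names, not QEMU's API.

```python
from concurrent import futures


class GenericThreadPool:
    """Thin wrapper that adds a wait-for-all operation to an executor."""

    def __init__(self, max_threads: int):
        self._executor = futures.ThreadPoolExecutor(max_workers=max_threads)
        self._pending = []

    def submit(self, func, *args):
        """Queue one work item; it is not tied to any event loop context."""
        self._pending.append(self._executor.submit(func, *args))

    def wait(self):
        """Block until all submitted work has finished, then return results."""
        futures.wait(self._pending)
        results = [f.result() for f in self._pending]
        self._pending.clear()
        return results

    def close(self):
        self._executor.shutdown()


pool = GenericThreadPool(max_threads=2)
for i in range(4):
    pool.submit(pow, i, 2)
squares = pool.wait()
pool.close()
```

The wrapper keeps the underlying pool's scheduling untouched and only tracks outstanding work, which is the same division of labour the message describes for the GThreadPool-based implementation.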
Maciej S. Szmigiero	dc67daeed5	thread-pool: Rename AIO pool functions to _aio() and data types to Aio These names conflict with ones used by future generic thread pool equivalents. Generic names should belong to the generic pool type, not specific (AIO) type. Acked-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Link: https://lore.kernel.org/qemu-devel/70f9e0fb4b01042258a1a57996c64d19779dc7f0.1741124640.git.maciej.szmigiero@oracle.com Signed-off-by: Cédric Le Goater <clg@redhat.com>	2025-03-06 06:47:33 +01:00
Maciej S. Szmigiero	03c6468a13	thread-pool: Remove thread_pool_submit() function This function name conflicts with one used by a future generic thread pool function and it was only used by one test anyway. Update the trace event name in thread_pool_submit_aio() accordingly. Acked-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Link: https://lore.kernel.org/qemu-devel/6830f07777f939edaf0a2d301c39adcaaf3817f0.1741124640.git.maciej.szmigiero@oracle.com Signed-off-by: Cédric Le Goater <clg@redhat.com>	2025-03-06 06:47:33 +01:00
Keoseong Park	07b12aae50	hw/ufs: Add temperature event notification support This patch introduces temperature event notification support to the UFS emulation. It enables the emulated UFS device to generate temperature-related events, including high and low temperature notifications, in compliance with the UFS specification. With this feature, UFS drivers can now handle temperature exception events during testing and development within the emulated environment. This enhances validation and debugging capabilities for thermal event handling in UFS implementations. Signed-off-by: Keoseong Park <keosung.park@samsung.com> Reviewed-by: Jeuk Kim <jeuk20.kim@samsung.com> Message-ID: <20250225064146epcms2p50889cb0066e2d4734f2386de325bcdf6@epcms2p5> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>	2025-03-05 02:13:29 +01:00
Klaus Jensen	6fc39228ff	hw/nvme: set error status code explicitly for misc commands The nvme_aio_err() does not handle Verify, Compare, Copy and other misc commands and defaults to setting the error status code to Internal Device Error. For some of these commands, we know better, so set it explicitly. For the commands using the nvme_misc_cb() callback (Copy, Flush, ...), if no status code has explicitly been set by the lower handlers, default to Internal Device Error as previously. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2025-02-26 12:40:35 +01:00
Klaus Jensen	6ccca4b6bb	hw/nvme: rework csi handling The controller incorrectly allows a zoned namespace to be attached even if CS.CSS is configured to only support the NVM command set for I/O queues. Rework handling of namespace command sets in general by attaching supported namespaces when the controller is started instead of, like now, statically when realized. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2025-02-26 12:40:32 +01:00
Klaus Jensen	d96a32de3f	hw/nvme: be compliant wrt. dsm processing limits The specification states that, > The controller shall set all three processing limit fields (i.e., the > DMRL, DMRSL and DMSL fields) to non-zero values or shall clear all > three processing limit fields to 0h. So, set the DMRL and DMSL fields in addition to DMRSL. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2025-02-25 12:55:21 +01:00
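The quoted requirement is an all-or-nothing rule over three fields, which is easy to state as a check. The Python function below is a hypothetical validator written for illustration; only the field names DMRL, DMRSL and DMSL come from the message.

```python
def dsm_limits_compliant(dmrl: int, dmrsl: int, dmsl: int) -> bool:
    """True if all three DSM processing limits are non-zero, or all are zero."""
    fields = (dmrl, dmrsl, dmsl)
    return all(f != 0 for f in fields) or all(f == 0 for f in fields)
```

Setting only DMRSL, as the device did before this change, fails the check; setting all three passes it.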
Klaus Jensen	b202fb549d	nvme: fix iocs status code values The status codes related to I/O Command Sets are in the wrong group. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2025-02-25 12:55:21 +01:00
Klaus Jensen	9cf6ec0659	hw/nvme: add knob for doorbell buffer config support Add a 'dbcs' knob to allow Doorbell Buffer Config command to be disabled. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2025-02-25 12:55:21 +01:00
Klaus Jensen	e7047adf1e	hw/nvme: make oacs dynamic Virtualization Management needs sriov-related parameters. Only report support for the command when those conditions are met. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2025-02-25 12:55:21 +01:00
Stephen Bates	23a4b3ebc7	hw/nvme: Add OCP SMART / Health Information Extended Log Page The Open Compute Project [1] includes a Datacenter NVMe SSD Specification [2]. The most recent version of this specification (as of November 2024) is 2.6.1. This specification layers on top of the NVM Express specifications [3] to provide additional functionality. A key part of this is the 512 Byte OCP SMART / Health Information Extended log page that is defined in Section 4.8.6 of the specification. We add a controller argument (ocp) that toggles on/off the SMART log extended structure. To accommodate different vendor specific specifications like OCP, we add a multiplexing function (nvme_vendor_specific_log) which will route to the different log functions based on arguments and log ids. We only return the OCP extended SMART log when the command is 0xC0 and ocp has been turned on in the nvme arguments. Though we add the whole nvme SMART log extended structure, we only populate the physical_media_units_{read,written}, log_page_version and log_page_uuid. This patch is based on work done by Joel but has been modified enough that he requested a co-developed-by tag rather than a signed-off-by. [1]: https://www.opencompute.org/ [2]: https://www.opencompute.org/documents/datacenter-nvme-ssd-specification-v2-6-1-pdf [3]: https://nvmexpress.org/specifications/ Signed-off-by: Stephen Bates <sbates@raithlin.com> Co-developed-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2025-02-25 12:27:21 +01:00
Eric Blake	ff12e6a5ff	nbd/server: Allow users to adjust handshake limit in QMP Although defaulting the handshake limit to 10 seconds was a nice QoI change to weed out intentionally slow clients, it can interfere with integration testing done with manual NBD_OPT commands over 'nbdsh --opt-mode'. Expose a QMP knob 'handshake-max-secs' to allow the user to alter the timeout away from the default. The parameter name here intentionally matches the spelling of the constant added in commit `fb1c2aaa98`, and not the command-line spelling added in the previous patch for qemu-nbd; that's because in QMP, longer names serve as good self-documentation, and unlike the command line, machines don't have problems generating longer spellings. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250203222722.650694-6-eblake@redhat.com> [eblake: s/max-secs/max-seconds/ in QMP] Acked-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>	2025-02-11 13:45:47 -06:00
Stefan Hajnoczi	f2ec48fefd	Block layer patches - Managing inactive nodes (enables QSD migration with shared storage) - Fix swapped values for BLOCK_IO_ERROR 'device' and 'qom-path' - vpc: Read images exported from Azure correctly - scripts/qemu-gdb: Support coroutine dumps in coredumps - Minor cleanups Merge tag 'for-upstream' of https://repo.or.cz/qemu/kevin into staging # gpg: Signature made Thu 06 Feb 2025 11:12:50 EST # gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6 # gpg: issuer "kwolf@redhat.com" # gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full] # Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6 * tag 'for-upstream' of https://repo.or.cz/qemu/kevin: (25 commits) block: remove unused BLOCK_OP_TYPE_DATAPLANE iotests: Add (NBD-based) tests for inactive nodes iotests: Add qsd-migrate case iotests: Add filter_qtest() nbd/server: Support inactive nodes block/export: Add option to allow export of inactive nodes block: Drain nodes before inactivating them block/export: Don't ignore image activation error in blk_exp_add() block: Support inactive nodes in blk_insert_bs() block: Add blockdev-set-active QMP command block: Add option to create inactive nodes block: Fix crash on block_resize on inactive node block: Don't attach inactive child to active node migration/block-active: Remove global active flag block: Inactivate external snapshot overlays when necessary block: Allow inactivating already inactive nodes block: Add 'active' field to BlockDeviceInfo block-backend: Fix argument order when calling 'qapi_event_send_block_io_error()' scripts/qemu-gdb: Support coroutine dumps in coredumps scripts/qemu-gdb: Simplify fs_base fetching for coroutines ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-02-10 13:25:36 -05:00
Daniel P. Berrangé	407bc4bf90	qapi: Move include/qapi/qmp/ to include/qobject/ The general expectation is that header files should follow the same file/path naming scheme as the corresponding source file. There are various historical exceptions to this practice in QEMU, with one of the most notable being the include/qapi/qmp/ directory. Most of the headers there correspond to source files in qobject/. This patch corrects most of that inconsistency by creating include/qobject/ and moving the headers for qobject/ there. This also fixes MAINTAINERS for include/qapi/qmp/dispatch.h: scripts/get_maintainer.pl now reports "QAPI" instead of "No maintainers found". Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Acked-by: Halil Pasic <pasic@linux.ibm.com> #s390x Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-ID: <20241118151235.2665921-2-armbru@redhat.com> [Rebased]	2025-02-10 15:33:16 +01:00
Stefan Hajnoczi	fc4e394b28	block: remove unused BLOCK_OP_TYPE_DATAPLANE BLOCK_OP_TYPE_DATAPLANE prevents BlockDriverState from being used by virtio-blk/virtio-scsi with IOThread. Commit `b112a65c52` ("block: declare blockjobs and dataplane friends!") eliminated the main reason for this blocker in 2014. Nowadays the block layer supports I/O from multiple AioContexts, so there is even less reason to block IOThread users. Any legitimate reasons related to interference would probably also apply to non-IOThread users. The only remaining users are bdrv_op_unblock(BLOCK_OP_TYPE_DATAPLANE) calls after bdrv_op_block_all(). If we remove BLOCK_OP_TYPE_DATAPLANE their behavior doesn't change. Existing bdrv_op_block_all() callers that don't explicitly unblock BLOCK_OP_TYPE_DATAPLANE seem to do so simply because no one bothered to rather than because it is necessary to keep BLOCK_OP_TYPE_DATAPLANE blocked. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250203182529.269066-1-stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-02-06 14:51:10 +01:00
Kevin Wolf	1600ef01ab	block/export: Add option to allow export of inactive nodes Add an option in BlockExportOptions to allow creating an export on an inactive node without activating the node. This mode needs to be explicitly supported by the export type (so that it doesn't perform any operations that are forbidden for inactive nodes), so this patch alone doesn't allow this option to be successfully used yet. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250204211407.381505-13-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-02-06 14:46:40 +01:00
Kevin Wolf	8cd37207f8	block: Add blockdev-set-active QMP command The system emulator tries to automatically activate and inactivate block nodes at the right point during migration. However, there are still cases where it's necessary that the user can do this manually. Images are only activated on the destination VM of a migration when the VM is actually resumed. If the VM was paused, this doesn't happen automatically. The user may want to perform some operation on a block device (e.g. taking a snapshot or starting a block job) without also resuming the VM yet. This is an example where a manual command is necessary. Another example is VM migration when the image files are opened by an external qemu-storage-daemon instance on each side. In this case, the process that needs to hand over the images isn't even part of the migration and can't know when the migration completes. Management tools need a way to explicitly inactivate images on the source and activate them on the destination. This adds a new blockdev-set-active QMP command that lets the user change the status of individual nodes (this is necessary in qemu-storage-daemon because it could be serving multiple VMs and only one of them migrates at a time). For convenience, operating on all devices (like QEMU does automatically during migration) is offered as an option, too, and can be used in the context of single VM. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250204211407.381505-9-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-02-06 14:26:51 +01:00
Kevin Wolf	faecd16fe5	block: Add option to create inactive nodes In QEMU, nodes are automatically created inactive while expecting an incoming migration (i.e. RUN_STATE_INMIGRATE). In qemu-storage-daemon, the notion of runstates doesn't exist. It also wouldn't necessarily make sense to introduce it because a single daemon can serve multiple VMs that can be in different states. Therefore, allow the user to explicitly open images as inactive with a new option. The default is as before: Nodes are usually active, except when created during RUN_STATE_INMIGRATE. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250204211407.381505-8-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-02-06 14:26:51 +01:00
Kevin Wolf	aec81049c2	block: Add 'active' field to BlockDeviceInfo This allows querying from QMP (and also HMP) whether an image is currently active or inactive (in the sense of BDRV_O_INACTIVE). Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250204211407.381505-2-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-02-06 14:26:50 +01:00
Philippe Mathieu-Daudé	edf3bce969	include: Include missing 'qemu/clang-tsa.h' header The next commit will remove "qemu/clang-tsa.h" from "exec/exec-all.h", however the following files indirectly include it: $ git grep -L qemu/clang-tsa.h $(git grep -wl TSA_NO_TSA) block/create.c include/block/block_int-common.h tests/unit/test-bdrv-drain.c tests/unit/test-block-iothread.c util/qemu-thread-posix.c Explicitly include it so we can proceed with the removal in the next commit. Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20241212185341.2857-4-philmd@linaro.org>	2024-12-20 17:44:57 +01:00
Ayush Mishra	dbaa2936b3	hw/nvme: add NPDAL/NPDGL Add the NPDGL and NPDAL fields to support large alignment and granularities. Signed-off-by: Ayush Mishra <ayush.m55@samsung.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Link: https://lore.kernel.org/r/20241001012833.3551820-1-ayush.m55@samsung.com [k.jensen: renamed the enum values] Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2024-11-04 19:09:45 +01:00
Arun Kumar	79e490058f	hw/nvme: i/o cmd set independent namespace data structure Add support for the I/O Command Set Independent Namespace Data Structure (CNS 8h and 1fh). Signed-off-by: Arun Kumar <arun.kka@samsung.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Link: https://lore.kernel.org/r/20240925004407.3521406-1-arun.kka@samsung.com Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2024-11-04 19:09:45 +01:00
Peter Maydell	51483f6c84	include: Move QemuLockCnt APIs to their own header Currently the QemuLockCnt data structure and associated functions are in the include/qemu/thread.h header. Move them to their own qemu/lockcnt.h. The main reason for doing this is that it means we can autogenerate the documentation comments into the docs/devel documentation. The copyright/author in the new header is drawn from lockcnt.c, since the header changes were added in the same commit as lockcnt.c; since neither thread.h nor lockcnt.c state an explicit license, the standard default of GPL-2-or-later applies. We include the new header (and the .c file, which was accidentally omitted previously) in the "RCU" part of MAINTAINERS, since that is where the lockcnt.rst documentation is categorized. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Message-id: 20240816132212.3602106-7-peter.maydell@linaro.org	2024-10-15 15:16:17 +01:00
Peter Maydell	718780d204	Merge tag 'pull-nvme-20241001' of https://gitlab.com/birkelund/qemu into staging nvme queue # gpg: Signature made Tue 01 Oct 2024 08:02:33 BST # gpg: using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9 # gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [full] # gpg: aka "Klaus Jensen <k.jensen@samsung.com>" [full] # Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468 4272 63D5 6FC5 E55D A838 # Subkey fingerprint: 5228 33AA 75E2 DCE6 A247 66C0 4DE1 AF31 6D4F 0DE9 * tag 'pull-nvme-20241001' of https://gitlab.com/birkelund/qemu: hw/nvme: add atomic write support hw/nvme: add knob for CTRATT.MEM hw/nvme: support CTRATT.MEM hw/nvme: clear masked events from the aer queue hw/nvme: report id controller metadata sgl support Signed-off-by: Peter Maydell <peter.maydell@linaro.org>	2024-10-01 11:34:07 +01:00
Arun Kumar	a1ab67883d	hw/nvme: support CTRATT.MEM Indicate that 'MDTS and Size Limits Exclude Metadata (MEM)' in the Controller Attributes (CTRATT) I/O Command Set Independent Identify Controller Data Structure. Signed-off-by: Arun Kumar <arun.kka@samsung.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> [k.jensen: updated commit message] Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2024-09-30 12:45:17 +02:00
Dr. David Alan Gilbert	e84af3eb72	block: Remove unused aio_task_pool_empty aio_task_pool_empty has been unused since it was added in `6e9b225f73` ("block: introduce aio task pool"). Remove it. Signed-off-by: Dr. David Alan Gilbert <dave@treblig.org> Message-Id: <20240917002007.330689-1-dave@treblig.org> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>	2024-09-30 10:53:18 +03:00
Fiona Ebner	9484ad6c17	copy-before-write: allow specifying minimum cluster size In the context of backup fleecing, discarding the source will not work when the fleecing image has a larger granularity than the one used for block-copy operations (can happen if the backup target has smaller cluster size), because cbw_co_pdiscard_snapshot() will align down the discard requests and thus effectively ignore them. To make @discard-source work in such a scenario, allow specifying the minimum cluster size used for block-copy operations and thus in particular also the granularity for discard requests to the source. The type 'size' (corresponding to uint64_t in C) is used in QAPI to rule out negative inputs and for consistency with already existing @cluster-size parameters. Since block_copy_calculate_cluster_size() uses int64_t for its result, a check that the input is not too large is added in block_copy_state_new() before calling it. The calculation in block_copy_calculate_cluster_size() is done in the target int64_t type. Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Acked-by: Markus Armbruster <armbru@redhat.com> (QAPI schema) Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-Id: <20240711120915.310243-2-f.ebner@proxmox.com> [vsementsov: switch version to 9.2 in QAPI doc] Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>	2024-09-30 10:52:41 +03:00
Yoochan Jeong	7c85332a2b	hw/ufs: minor bug fixes related to ufs-test Minor bugs and errors related to ufs-test are resolved. Some permissions and code implementations that are not synchronized with the ufs spec are edited. Signed-off-by: Yoochan Jeong <yc01.jeong@samsung.com> Reviewed-by: Jeuk Kim <jeuk20.kim@samsung.com> Signed-off-by: Jeuk Kim <jeuk20.kim@samsung.com>	2024-09-06 18:04:16 +09:00
Eric Blake	c8a76dbd90	nbd/server: CVE-2024-7409: Cap default max-connections to 100 Allowing an unlimited number of clients to any web service is a recipe for a rudimentary denial of service attack: the client merely needs to open lots of sockets without closing them, until qemu no longer has any more fds available to allocate. For qemu-nbd, we default to allowing only 1 connection unless more are explicitly asked for (-e or --shared); this was historically picked as a nice default (without an explicit -t, a non-persistent qemu-nbd goes away after a client disconnects, without needing any additional follow-up commands), and we are not going to change that interface now (besides, someday we want to point people towards qemu-storage-daemon instead of qemu-nbd). But for qemu proper, and the newer qemu-storage-daemon, the QMP nbd-server-start command has historically had a default of unlimited number of connections, in part because unlike qemu-nbd it is inherently persistent until nbd-server-stop. Allowing multiple client sockets is particularly useful for clients that can take advantage of MULTI_CONN (creating parallel sockets to increase throughput), although known clients that do so (such as libnbd's nbdcopy) typically use only 8 or 16 connections (the benefits of scaling diminish once more sockets are competing for kernel attention). Picking a number large enough for typical use cases, but not unlimited, makes it slightly harder for a malicious client to perform a denial of service merely by opening lots of connections without progressing through the handshake. This change does not eliminate CVE-2024-7409 on its own, but reduces the chance for fd exhaustion or unlimited memory usage as an attack surface.
On the other hand, by itself, it makes it more obvious that with a finite limit, we have the problem of an unauthenticated client holding 100 fds opened as a way to block out a legitimate client from being able to connect; thus, later patches will further add timeouts to reject clients that are not making progress. This is an INTENTIONAL change in behavior, and will break any client of nbd-server-start that was not passing an explicit max-connections parameter, yet expects more than 100 simultaneous connections. We are not aware of any such client (as stated above, most clients aware of MULTI_CONN get by just fine on 8 or 16 connections, and probably cope with later connections failing by relying on the earlier connections; libvirt has not yet been passing max-connections, but generally creates NBD servers with the intent for a single client for the sake of live storage migration; meanwhile, the KubeSAN project anticipates a large cluster sharing multiple clients [up to 8 per node, and up to 100 nodes in a cluster], but it currently uses qemu-nbd with an explicit --shared=0 rather than qemu-storage-daemon with nbd-server-start). We considered using a deprecation period (declare that omitting max-connections is deprecated, and make it mandatory in 3 releases - then we don't need to pick an arbitrary default); that has zero risk of breaking any apps that accidentally depended on more than 100 connections, and where such breakage might not be noticed under unit testing but only under the larger loads of production usage. But it does not close the denial-of-service hole until far into the future, and requires all apps to change to add the parameter even if 100 was good enough. It also has a drawback that any app (like libvirt) that is accidentally relying on an unlimited default should seriously consider their own CVE now, at which point they are going to change to pass explicit max-connections sooner than waiting for 3 qemu releases.
Finally, if our changed default breaks an app, that app can always pass in an explicit max-connections with a larger value. It is also intentional that the HMP interface to nbd-server-start is not changed to expose max-connections (any client needing to fine-tune things should be using QMP). Suggested-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20240807174943.771624-12-eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> [ericb: Expand commit message to summarize Dan's argument for why we break corner-case back-compat behavior without a deprecation period] Signed-off-by: Eric Blake <eblake@redhat.com>	2024-08-08 16:02:23 -05:00
Eric Blake	fb1c2aaa98	nbd/server: Plumb in new args to nbd_client_add() Upcoming patches to fix a CVE need to track an opaque pointer passed in by the owner of a client object, as well as request for a time limit on how fast negotiation must complete. Prepare for that by changing the signature of nbd_client_new() and adding an accessor to get at the opaque pointer, although for now the two servers (qemu-nbd.c and blockdev-nbd.c) do not change behavior even though they pass in a new default timeout value. Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20240807174943.771624-11-eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> [eblake: s/LIMIT/MAX_SECS/ as suggested by Dan] Signed-off-by: Eric Blake <eblake@redhat.com>	2024-08-08 15:05:27 -05:00
Kevin Wolf	7e17111646	block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked Upstream clang 18 (and backports to clang 17 in Fedora and RHEL) implemented support for __attribute__((cleanup())) in its Thread Safety Analysis, so we can now actually have a proper implementation of WITH_GRAPH_RDLOCK_GUARD() that understands when we acquire and when we release the lock. -Wthread-safety is now only enabled if the compiler is new enough to understand this pattern. In theory, we could have used some #ifdefs to keep the existing basic checks on old compilers, but as long as someone runs a newer compiler (and our CI does), we will catch locking problems, so it's probably not worth keeping multiple implementations for this. The implementation can't use g_autoptr any more because the glib macros define wrapper functions that don't have the right TSA attributes, so the compiler would complain about them. Just use the cleanup attribute directly instead. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20240627181245.281403-3-kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2024-08-06 20:12:39 +02:00
Arun Kumar	d522aef88d	hw/nvme: add cross namespace copy support Extend copy command to copy user data across different namespaces via support for specifying a namespace for each source range Signed-off-by: Arun Kumar <arun.kka@samsung.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2024-07-22 14:36:15 +02:00
Minwoo Im	6471556500	hw/nvme: add Identify Endurance Group List Commit `73064edfb8` ("hw/nvme: flexible data placement emulation") introduced the NVMe FDP feature to nvme-subsys and nvme-ctrl with a single endurance group #1 supported. This means that the controller should return proper identify data to the host with Identify Endurance Group List (CNS 19h), though only for endurance group #1. This patch allows host applications to ask which endurance group is available and to utilize FDP through that endurance group. Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Minwoo Im <minwoo.im@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2024-07-11 17:05:37 +02:00
Paolo Bonzini	44b424dc4a	block: remove separate bdrv_file_open callback bdrv_file_open and bdrv_open are completely equivalent, they are never checked except to see which one to invoke. So merge them into a single one. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-28 14:44:51 +02:00
Prasad Pandit	24687abf23	linux-aio: add IO_CMD_FDSYNC command support Libaio defines IO_CMD_FDSYNC command to sync all outstanding asynchronous I/O operations, by flushing out file data to the disk storage. Enable linux-aio to submit such aio request. When using aio=native without fdsync() support, QEMU creates pthreads, and destroying these pthreads results in TLB flushes. In a real-time guest environment, TLB flushes cause a latency spike. This patch helps to avoid such spikes. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Prasad Pandit <pjp@fedoraproject.org> Message-ID: <20240425070412.37248-1-ppandit@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2024-06-10 11:05:43 +02:00
Stefan Hajnoczi	e669e800fc	aio: warn about iohandler_ctx special casing The main loop has two AioContexts: qemu_aio_context and iohandler_ctx. The main loop runs them both, but nested aio_poll() calls on qemu_aio_context exclude iohandler_ctx. Which one should qemu_get_current_aio_context() return when called from the main loop? Document that it's always qemu_aio_context. This has subtle effects on functions that use qemu_get_current_aio_context(). For example, aio_co_reschedule_self() does not work when moving from iohandler_ctx to qemu_aio_context because qemu_get_current_aio_context() does not differentiate these two AioContexts. Document this in order to reduce the chance of future bugs. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20240506190622.56095-3-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2024-06-10 11:05:43 +02:00
Minwoo Im	5c079578d2	hw/ufs: Add support MCQ of UFSHCI 4.0 This patch adds support for MCQ defined in UFSHCI 4.0. This patch utilized the legacy I/O codes as much as possible to support MCQ. MCQ operation & runtime register is placed at 0x1000 offset of UFSHCI register statically with no spare space among four registers (48B): UfsMcqSqReg, UfsMcqSqIntReg, UfsMcqCqReg, UfsMcqCqIntReg The maximum number of queues is 32 as per spec, and the default MAC (Multiple Active Commands) is 32 in the device. Example: -device ufs,serial=foo,id=ufs0,mcq=true,mcq-maxq=8 Signed-off-by: Minwoo Im <minwoo.im@samsung.com> Reviewed-by: Jeuk Kim <jeuk20.kim@samsung.com> Message-Id: <20240528023106.856777-3-minwoo.im@samsung.com> Signed-off-by: Jeuk Kim <jeuk20.kim@samsung.com>	2024-06-03 16:20:42 +09:00
Minwoo Im	cdba3b901a	hw/ufs: Update MCQ-related fields to block/ufs.h This patch is a prep patch for the following MCQ support patch for hw/ufs. This patch updated minimal mandatory fields to support MCQ based on UFSHCI 4.0. Signed-off-by: Minwoo Im <minwoo.im@samsung.com> Reviewed-by: Jeuk Kim <jeuk20.kim@samsung.com> Message-Id: <20240528023106.856777-2-minwoo.im@samsung.com> Signed-off-by: Jeuk Kim <jeuk20.kim@samsung.com>	2024-06-03 16:20:42 +09:00
Vladimir Sementsov-Ogievskiy	0fd05c8d80	qapi: blockdev-backup: add discard-source parameter Add a parameter that enables discard-after-copy. That is mostly useful in "push backup with fleecing" scheme, when source is snapshot-access format driver node, based on copy-before-write filter snapshot-access API: [guest] [snapshot-access] ~~ blockdev-backup ~~> [backup target] \| \| \| root \| file v v [copy-before-write] \| \| \| file \| target v v [active disk] [temp.img] In this case discard-after-copy does two things: - discard data in temp.img to save disk space - avoid further copy-before-write operation in discarded area Note that we have to declare WRITE permission on source in copy-before-write filter, for discard to work. Still we can't take it unconditionally, as it will break normal backup from RO source. So, we have to add a parameter and pass it through bdrv_open flags. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Reviewed-by: Fiona Ebner <f.ebner@proxmox.com> Tested-by: Fiona Ebner <f.ebner@proxmox.com> Acked-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20240313152822.626493-5-vsementsov@yandex-team.ru> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>	2024-05-28 15:52:15 +03:00
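
The alignment problem behind the 9484ad6c17 entry (copy-before-write: allow specifying minimum cluster size), which builds on the discard-source parameter from 0fd05c8d80, comes down to a little arithmetic. The sketch below is a simplified Python model, not QEMU code: the helper names are hypothetical, and it assumes the filter shrinks each discard request inward to whole clusters, so a request smaller than one cluster ends up empty.

```python
KiB = 1024


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return -(-value // alignment) * alignment


def align_down(value: int, alignment: int) -> int:
    """Round value down to the previous multiple of alignment."""
    return value // alignment * alignment


def effective_discard(offset: int, length: int, cluster_size: int):
    """Return (offset, length) of the whole clusters covered by a discard
    request, or None if the request covers no complete cluster.

    Hypothetical model of the behaviour described in the commit message.
    """
    start = align_up(offset, cluster_size)
    end = align_down(offset + length, cluster_size)
    if end <= start:
        return None  # nothing left after alignment: the discard is ignored
    return start, end - start


# A 64 KiB discard against a 1 MiB granularity covers no whole cluster.
small_request = effective_discard(64 * KiB, 64 * KiB, 1024 * KiB)

# A request issued in 1 MiB units survives alignment unchanged.
matched_request = effective_discard(0, 1024 * KiB, 1024 * KiB)

print(small_request, matched_request)
```

In this model, raising the cluster size used for the discard requests to at least the larger granularity is what lets them take effect, which is the scenario the minimum-cluster-size option is described as addressing.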


1632 commits