Add an option in BlockExportOptions to allow creating an export on an
inactive node without activating the node. This mode needs to be
explicitly supported by the export type (so that it doesn't perform any
operations that are forbidden for inactive nodes), so this patch alone
doesn't allow this option to be successfully used yet.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250204211407.381505-13-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently, block exports can't handle inactive images correctly.
Incoming write requests would run into assertion failures. Make sure
that we return an error when creating an export can't activate the
image.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250204211407.381505-11-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Device models have a relatively complex way to set up their block
backends, in which blk_attach_dev() sets blk->disable_perm = true.
We want to support inactive images in exports, too, so that
qemu-storage-daemon can be used with migration. Because they don't use
blk_attach_dev(), they need another way to set this flag. The most
convenient is to do this automatically when an inactive node is attached
to a BlockBackend that can be inactivated.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250204211407.381505-10-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In order for block_resize to fail gracefully on an inactive node instead
of crashing with an assertion failure in bdrv_co_write_req_prepare()
(called from bdrv_co_truncate()), we need to check for inactive nodes
also when they are attached as a root node and make sure that
BLK_PERM_RESIZE isn't among the permissions allowed for inactive nodes.
To this effect, don't enumerate the permissions that are incompatible
with inactive nodes any more, but allow only BLK_PERM_CONSISTENT_READ
for them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250204211407.381505-7-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This allows querying from QMP (and also HMP) whether an image is
currently active or inactive (in the sense of BDRV_O_INACTIVE).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250204211407.381505-2-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 7452162ade introduced 'qom-path' argument to BLOCK_IO_ERROR
event but when the event is instantiated in 'send_qmp_error_event()' the
arguments for 'device' and 'qom_path' in
qapi_event_send_block_io_error() were reversed :
Generated code for sending event:
void qapi_event_send_block_io_error(const char *qom_path,
const char *device,
const char *node_name,
IoOperationType operation,
[...]
Call inside send_qmp_error_event():
qapi_event_send_block_io_error(blk_name(blk),
blk_get_attached_dev_path(blk),
bs ? bdrv_get_node_name(bs) : NULL, optype,
[...]
This results into reporting the QOM path as the device alias and vice
versa which in turn breaks libvirt, which expects the device alias being
either a valid alias or empty (which would make libvirt do the lookup by
node-name instead).
Cc: qemu-stable@nongnu.org
Fixes: 7452162ade ("qapi: add qom-path to BLOCK_IO_ERROR event")
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Message-ID: <09728d784888b38d7a8f09ee5e9e9c542c875e1e.1737973614.git.pkrempa@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
ASAN detected a leak when running the ahci-test
/ahci/io/dma/lba28/retry:
Direct leak of 35 byte(s) in 1 object(s) allocated from:
#0 in malloc
#1 in __vasprintf_internal
#2 in vasprintf
#3 in g_vasprintf
#4 in g_strdup_vprintf
#5 in g_strdup_printf
#6 in object_get_canonical_path ../qom/object.c:2096:19
#7 in blk_get_attached_dev_id_or_path ../block/block-backend.c:1033:12
#8 in blk_get_attached_dev_path ../block/block-backend.c:1047:12
#9 in send_qmp_error_event ../block/block-backend.c:2140:36
#10 in blk_error_action ../block/block-backend.c:2172:9
#11 in ide_handle_rw_error ../hw/ide/core.c:875:5
#12 in ide_dma_cb ../hw/ide/core.c:894:13
#13 in dma_complete ../system/dma-helpers.c:107:9
#14 in dma_blk_cb ../system/dma-helpers.c:129:9
#15 in blk_aio_complete ../block/block-backend.c:1552:9
#16 in blk_aio_write_entry ../block/block-backend.c:1619:5
#17 in coroutine_trampoline ../util/coroutine-ucontext.c:175:9
Plug the leak by freeing the device path string.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20241111145214.8261-1-farosas@suse.de>
[PMD: Use g_autofree]
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20241111170333.43833-3-philmd@linaro.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Expose the method docstring in the header, and mention
returned value must be free'd by caller.
Reported-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20241111170333.43833-2-philmd@linaro.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It was found that 'qemu-nbd' is not able to work with some disk images
exported from Azure. Looking at the 512b footer (which contains VPC
metadata):
00000000 63 6f 6e 65 63 74 69 78 00 00 00 02 00 01 00 00 |conectix........|
00000010 ff ff ff ff ff ff ff ff 2e c7 9b 96 77 61 00 00 |............wa..|
00000020 00 07 00 00 57 69 32 6b 00 00 00 01 40 00 00 00 |....Wi2k....@...|
00000030 00 00 00 01 40 00 00 00 28 a2 10 3f 00 00 00 02 |....@...(..?....|
00000040 ff ff e7 47 8c 54 df 94 bd 35 71 4c 94 5f e5 44 |...G.T...5qL._.D|
00000050 44 53 92 1a 00 00 00 00 00 00 00 00 00 00 00 00 |DS..............|
00000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
we can see that Azure uses a different 'Creator application' --
'wa\0\0' (offset 0x1c, likely reads as 'Windows Azure') and QEMU uses this
field to determine how it can get image size. Apparently, Azure uses 'new'
method, just like Hyper-V.
Overall, it seems that only VPC and old QEMUs need to be ignored as all new
creator apps seem to have reliable current_size. Invert the logic and make
'current_size' method the default to avoid adding every new creator app to
the list.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-ID: <20241212134504.1983757-3-vkuznets@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In preparation to making changes to the logic deciding whether CHS or
'current_size' need to be used in determining the image size, split off
vpc_ignore_current_size() helper.
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-ID: <20241212134504.1983757-2-vkuznets@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This error was discovered by fuzzing qemu-img.
When ph.ext_off has a sufficiently large value, the operation
le64_to_cpu(ph.ext_off) << BDRV_SECTOR_BITS in
parallels_read_format_extension() can cause an overflow in int64_t.
This overflow triggers the assert(ext_off > 0)
check in block/parallels-ext.c: parallels_read_format_extension(),
leading to a crash.
This commit adds a check to prevent overflow when shifting ph.ext_off
by BDRV_SECTOR_BITS, ensuring that the value remains within a valid range.
Reported-by: Leonid Reviakin <L.reviakin@fobos-nt.ru>
Signed-off-by: Denis Rastyogin <gerben@altlinux.org>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Message-ID: <20241212104212.513947-2-gerben@altlinux.org>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
create_long_filename() intentionally uses direntry_t->name[8+3] array
as a larger array. This works, but makes static code analysis tools
unhappy. The problem here is that a directory entry holding long file
name is significantly different from regular directory entry, and the
name is split into several parts within the entry, not just in regular
8+3 name field.
Treat the entry as array of bytes instead. This fixes the OOB access
from the compiler/tools PoV, but does not change the resulting code
in any way.
Keep the existing code style.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
This reverts commit 0cb3ff7c22.
The original code was right in that long name in LFN directory
entry uses other parts of the entry for the name too, not just
the original "name" field. So it is wrong to limit the offset
to be within the name field. Some other mechanism is needed
to fix the ubsan report and whole messy usage of bytes past the
given field.
Reported-by: Volker Rümelin <vr_qemu@t-online.de>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Found with test sbsaref introduced in [1].
[1] https://patchew.org/QEMU/20241203213629.2482806-1-pierrick.bouvier@linaro.org/
../block/vvfat.c:433:24: runtime error: index 14 out of bounds for type 'uint8_t [11]'
#0 0x56151a66b93a in create_long_filename ../block/vvfat.c:433
#1 0x56151a66f3d7 in create_short_and_long_name ../block/vvfat.c:725
#2 0x56151a670403 in read_directory ../block/vvfat.c:804
#3 0x56151a674432 in init_directories ../block/vvfat.c:964
#4 0x56151a67867b in vvfat_open ../block/vvfat.c:1258
#5 0x56151a3b8e19 in bdrv_open_driver ../block.c:1660
#6 0x56151a3bb666 in bdrv_open_common ../block.c:1985
#7 0x56151a3cadb9 in bdrv_open_inherit ../block.c:4153
#8 0x56151a3c8850 in bdrv_open_child_bs ../block.c:3731
#9 0x56151a3ca832 in bdrv_open_inherit ../block.c:4098
#10 0x56151a3cbe40 in bdrv_open ../block.c:4248
#11 0x56151a46344f in blk_new_open ../block/block-backend.c:457
#12 0x56151a388bd9 in blockdev_init ../blockdev.c:612
#13 0x56151a38ab2d in drive_new ../blockdev.c:1006
#14 0x5615190fca41 in drive_init_func ../system/vl.c:649
#15 0x56151aa796dd in qemu_opts_foreach ../util/qemu-option.c:1135
#16 0x5615190fd2b6 in configure_blockdev ../system/vl.c:708
#17 0x56151910a307 in qemu_create_early_backends ../system/vl.c:2004
#18 0x561519113fcf in qemu_init ../system/vl.c:3685
#19 0x56151a7e438e in main ../system/main.c:47
#20 0x7f72d1a46249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#21 0x7f72d1a46304 in __libc_start_main_impl ../csu/libc-start.c:360
#22 0x561517e98510 in _start (/home/user/.work/qemu/build/qemu-system-aarch64+0x3b9b510)
The offset used can easily go beyond entry->name size. It's probably a
bug, but I don't have the time to dive into vfat specifics for now.
This change solves the ubsan issue, and is functionally equivalent, as
anything written past the entry->name array would not be read anyway.
Signed-off-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
The next commit will remove "qemu/clang-tsa.h" of "exec/exec-all.h",
however the following files indirectly include it:
$ git grep -L qemu/clang-tsa.h $(git grep -wl TSA_NO_TSA)
block/create.c
include/block/block_int-common.h
tests/unit/test-bdrv-drain.c
tests/unit/test-block-iothread.c
util/qemu-thread-posix.c
Explicitly include it so we can process with the removal in the
next commit.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20241212185341.2857-4-philmd@linaro.org>
Headers in include/sysemu/ are not only related to system
*emulation*, they are also used by virtualization. Rename
as system/ which is clearer.
Files renamed manually then mechanical change using sed tool.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Lei Yang <leiyang@redhat.com>
Message-Id: <20241203172445.28576-1-philmd@linaro.org>
The libssh does not handle non-blocking mode in SFTP correctly. The
driver code already changes the mode to blocking for the SFTP
initialization, but for some reason changes to non-blocking mode.
This used to work accidentally until libssh in 0.11 branch merged
the patch to avoid infinite looping in case of network errors:
https://gitlab.com/libssh/libssh-mirror/-/merge_requests/498
Since then, the ssh driver in qemu fails to read files over SFTP
as the first SFTP messages exchanged after switching the session
to non-blocking mode return SSH_AGAIN, but that message is lost
int the SFTP internals and interpretted as SSH_ERROR, which is
returned to the caller:
https://gitlab.com/libssh/libssh-mirror/-/issues/280
This is indeed an issue in libssh that we should address in the
long term, but it will require more work on the internals. For
now, the SFTP is not supported in non-blocking mode.
Fixes: https://gitlab.com/libssh/libssh-mirror/-/issues/280
Signed-off-by: Jakub Jelen <jjelen@redhat.com>
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Message-ID: <20241113125526.2495731-1-rjones@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The sum "cluster_index + count" may overflow uint32_t.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Dmitry Frolov <frolov@swemel.ru>
Message-ID: <20241106080521.219255-2-frolov@swemel.ru>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
s->offset and s->size are only set at the end of the function and still
contain the old values when formatting the error message. Print the
parameters with the new values that we actually checked instead.
Fixes: 500e243420 ('raw-format: Split raw_read_options()')
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20240829185527.47152-1-kwolf@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We need something more reliable than "device" (which absent in modern
interfaces) and "node-name" (which may absent, and actually don't
specify the device, which is a source of error) to make a per-device
throttling for the event in the following commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-ID: <20241002151806.592469-2-vsementsov@yandex-team.ru>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Make the VDI SECTOR_SIZE define be a 64-bit constant; this matches
how we define BDRV_SECTOR_SIZE. The benefit is that it means that we
don't need to carefully cast to 64-bits when doing operations like
"n_sectors * SECTOR_SIZE" to avoid doing a 32x32->32 multiply, which
might overflow, and which Coverity and other static analysers tend to
warn about.
The specific potential overflow Coverity is highlighting is the one
at the end of vdi_co_pwritev() where we write out n_sectors sectors
to the block map. This is very unlikely to actually overflow, since
the block map has 4 bytes per block and the maximum number of blocks
in the image must fit into a 32-bit integer. So this commit is not
fixing a real-world bug.
An inspection of all the places currently using SECTOR_SIZE in the
file shows none which care about the change in its type, except for
one call to error_setg() which needs the format string adjusting.
Resolves: Coverity CID 1508076
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20241008164708.2966400-5-peter.maydell@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In compare_fingerprint() we effectively check whether the characters
in the fingerprint are valid hex digits twice: first we do so with
qemu_isxdigit(), but then the hex2decimal() function also has a code
path where it effectively detects an invalid digit and returns -1.
This causes Coverity to complain because it thinks that we might use
that -1 value in an expression where it would be an integer overflow.
Avoid the double-check of hex digit validity by testing the return
values from hex2decimal() rather than doing separate calls to
qemu_isxdigit().
Since this means we now use the illegal-character return value
from hex2decimal(), rewrite it from "-1" to "UINT_MAX", which
has the same effect since the return type is "unsigned" but
looks less confusing at the callsites when we detect it with
"c0 > 0xf".
Resolves: Coverity CID 1547813
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20241008164708.2966400-3-peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In the loop in qemu_gluster_parse_json() we do:
char *str = NULL;
for(...) {
str = g_strdup_printf(...);
...
if (various errors) {
goto out;
}
...
g_free(str);
str = NULL;
}
return 0;
out:
various cleanups;
g_free(str);
...
return -errno;
Coverity correctly complains that the assignment "str = NULL" at the
end of the loop is unnecessary, because we will either go back to the
top of the loop and overwrite it, or else we will exit the loop and
then exit the function without ever reading str again. The assignment
is there as defensive coding to ensure that str is only non-NULL if
it's a live allocation, so this is intentional.
We can make Coverity happier and simplify the code here by using
g_autofree, since we never need 'str' outside the loop.
Resolves: Coverity CID 1527385
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20241008164708.2966400-2-peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Parameter @id is no longer used, drop. Return a bool to indicate
success / failure, as recommended by qapi/error.h.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20241010150144.986655-4-armbru@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
According to https://marc.info/?l=fedora-devel-list&m=171934833215726
the GlusterFS development effectively ended. Thus mark it as deprecated
in QEMU, so we can remove it in a future release if the project does
not gain momentum again.
Acked-by: Niels de Vos <ndevos@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20241002082033.129022-1-thuth@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
blk_by_public last use was removed in 2017 by
c61791fc23 ("block: add aio_context field in ThrottleGroupMember")
blk_activate last use was removed earlier this year by
eef0bae3a7 ("migration: Remove block migration")
blk_add_insert_bs_notifier, blk_op_block_all, blk_op_unblock_all
last uses were removed in 2016 by
ef8875b549 ("virtio-scsi: Remove op blocker for dataplane")
blk_iostatus_disable last use was removed in 2016 by
66a0fae438 ("blockjob: Don't touch BDS iostatus")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
../block/block-copy.c:591:12: error: ‘ret’ may be used uninitialized [-Werror=maybe-uninitialized]
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
../block/stream.c:193:19: error: ‘unfiltered_bs’ may be used uninitialized [-Werror=maybe-uninitialized]
../block/stream.c:176:5: error: ‘len’ may be used uninitialized [-Werror=maybe-uninitialized]
trace/trace-block.h:906:9: error: ‘ret’ may be used uninitialized [-Werror=maybe-uninitialized]
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Acked-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org>
../block/mirror.c:404:5: error: ‘ret’ may be used uninitialized [-Werror=maybe-uninitialized]
../block/mirror.c:895:12: error: ‘ret’ may be used uninitialized [-Werror=maybe-uninitialized]
../block/mirror.c:578:12: error: ‘ret’ may be used uninitialized [-Werror=maybe-uninitialized]
Change a variable to int, as suggested by Manos: "bdrv_co_preadv()
which is int and is passed as an int argument to mirror_read_complete()"
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
../block/mirror.c:1066:22: error: ‘iostatus’ may be used uninitialized [-Werror=maybe-uninitialized]
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
aio_task_pool_empty has been unused since it was added in
6e9b225f73 ("block: introduce aio task pool")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <dave@treblig.org>
Message-Id: <20240917002007.330689-1-dave@treblig.org>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Allow overlapping request by removing the assert that made it
impossible. There are only two callers:
1. block_copy_task_create()
It already asserts the very same condition before calling
reqlist_init_req().
2. cbw_snapshot_read_lock()
There is no need to have read requests be non-overlapping in
copy-before-write when used for snapshot-access. In fact, there was no
protection against two callers of cbw_snapshot_read_lock() calling
reqlist_init_req() with overlapping ranges and this could lead to an
assertion failure [1].
In particular, with the reproducer script below [0], two
cbw_co_snapshot_block_status() callers could race, with the second
calling reqlist_init_req() before the first one finishes and removes
its conflicting request.
[0]:
> #!/bin/bash -e
> dd if=/dev/urandom of=/tmp/disk.raw bs=1M count=1024
> ./qemu-img create /tmp/fleecing.raw -f raw 1G
> (
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev raw,node-name=node0,file.driver=file,file.filename=/tmp/disk.raw \
> --blockdev raw,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.raw \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", "file": "node0", "target": "node1", "node-name": "node3" } }
> {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", "file": "node3", "node-name": "snap0" } }
> {"execute": "nbd-server-start", "arguments": {"addr": { "type": "unix", "data": { "path": "/tmp/nbd.socket" } } } }
> {"execute": "block-export-add", "arguments": {"id": "exp0", "node-name": "snap0", "type": "nbd", "name": "exp0"}}
> EOF
> ) &
> sleep 5
> while true; do
> ./qemu-nbd -d /dev/nbd0
> ./qemu-nbd -c /dev/nbd0 nbd:unix:/tmp/nbd.socket:exportname=exp0 -f raw -r
> nbdinfo --map 'nbd+unix:///exp0?socket=/tmp/nbd.socket'
> done
[1]:
> #5 0x000071e5f0088eb2 in __GI___assert_fail (...) at ./assert/assert.c:101
> #6 0x0000615285438017 in reqlist_init_req (...) at ../block/reqlist.c:23
> #7 0x00006152853e2d98 in cbw_snapshot_read_lock (...) at ../block/copy-before-write.c:237
> #8 0x00006152853e3068 in cbw_co_snapshot_block_status (...) at ../block/copy-before-write.c:304
> #9 0x00006152853f4d22 in bdrv_co_snapshot_block_status (...) at ../block/io.c:3726
> #10 0x000061528543a63e in snapshot_access_co_block_status (...) at ../block/snapshot-access.c:48
> #11 0x00006152853f1a0a in bdrv_co_do_block_status (...) at ../block/io.c:2474
> #12 0x00006152853f2016 in bdrv_co_common_block_status_above (...) at ../block/io.c:2652
> #13 0x00006152853f22cf in bdrv_co_block_status_above (...) at ../block/io.c:2732
> #14 0x00006152853d9a86 in blk_co_block_status_above (...) at ../block/block-backend.c:1473
> #15 0x000061528538da6c in blockstatus_to_extents (...) at ../nbd/server.c:2374
> #16 0x000061528538deb1 in nbd_co_send_block_status (...) at ../nbd/server.c:2481
> #17 0x000061528538f424 in nbd_handle_request (...) at ../nbd/server.c:2978
> #18 0x000061528538f906 in nbd_trip (...) at ../nbd/server.c:3121
> #19 0x00006152855a7caf in coroutine_trampoline (...) at ../util/coroutine-ucontext.c:175
Cc: qemu-stable@nongnu.org
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-Id: <20240712140716.517911-1-f.ebner@proxmox.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
In the context of backup fleecing, discarding the source will not work
when the fleecing image has a larger granularity than the one used for
block-copy operations (can happen if the backup target has smaller
cluster size), because cbw_co_pdiscard_snapshot() will align down the
discard requests and thus effectively ignore then.
To make @discard-source work in such a scenario, allow specifying the
minimum cluster size used for block-copy operations and thus in
particular also the granularity for discard requests to the source.
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Acked-by: Markus Armbruster <armbru@redhat.com> (QAPI schema)
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-Id: <20240711120915.310243-3-f.ebner@proxmox.com>
[vsementsov: switch version to 9.2 in QAPI doc]
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
In the context of backup fleecing, discarding the source will not work
when the fleecing image has a larger granularity than the one used for
block-copy operations (can happen if the backup target has smaller
cluster size), because cbw_co_pdiscard_snapshot() will align down the
discard requests and thus effectively ignore then.
To make @discard-source work in such a scenario, allow specifying the
minimum cluster size used for block-copy operations and thus in
particular also the granularity for discard requests to the source.
The type 'size' (corresponding to uint64_t in C) is used in QAPI to
rule out negative inputs and for consistency with already existing
@cluster-size parameters. Since block_copy_calculate_cluster_size()
uses int64_t for its result, a check that the input is not too large
is added in block_copy_state_new() before calling it. The calculation
in block_copy_calculate_cluster_size() is done in the target int64_t
type.
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Acked-by: Markus Armbruster <armbru@redhat.com> (QAPI schema)
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-Id: <20240711120915.310243-2-f.ebner@proxmox.com>
[vsementsov: switch version to 9.2 in QAPI doc]
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
This patch is part of a series that moves towards a consistent use of
g_assert_not_reached() rather than an ad hoc mix of different
assertion mechanisms.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-ID: <20240919044641.386068-17-pierrick.bouvier@linaro.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
This patch is part of a series that moves towards a consistent use of
g_assert_not_reached() rather than an ad hoc mix of different
assertion mechanisms.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-ID: <20240919044641.386068-8-pierrick.bouvier@linaro.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Since the "2 | 3+" expression can be simplified as "2+",
it is pointless to mention the GPLv3 license.
Add the corresponding SPDX identifier to remove all doubt.
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
QAPI's 'prefix' feature can make the connection between enumeration
type and its constants less than obvious. It's best used with
restraint.
QCryptoCipherAlgorithm has a 'prefix' that overrides the generated
enumeration constants' prefix to QCRYPTO_CIPHER_ALG.
We could simply drop 'prefix', but then the prefix becomes
QCRYPTO_CIPHER_ALGORITHM, which is rather long.
We could additionally rename the type to QCryptoCipherAlg, but I think
the abbreviation "alg" is less than clear.
Rename the type to QCryptoCipherAlgo instead. The prefix becomes
QCRYPTO_CIPHER_ALGO.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20240904111836.3273842-13-armbru@redhat.com>
QAPI's 'prefix' feature can make the connection between enumeration
type and its constants less than obvious. It's best used with
restraint.
QCryptoHashAlgorithm has a 'prefix' that overrides the generated
enumeration constants' prefix to QCRYPTO_HASH_ALG.
We could simply drop 'prefix', but then the prefix becomes
QCRYPTO_HASH_ALGORITHM, which is rather long.
We could additionally rename the type to QCryptoHashAlg, but I think
the abbreviation "alg" is less than clear.
Rename the type to QCryptoHashAlgo instead. The prefix becomes to
QCRYPTO_HASH_ALGO.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20240904111836.3273842-12-armbru@redhat.com>
[Conflicts with merge commit 7bbadc60b5 resolved]
Recent commit "qapi: Smarter camel_to_upper() to reduce need for
'prefix'" added two temporary 'prefix' to delay changing the generated
code.
Revert them. This improves QCryptoBlockFormat's generated enumeration
constant prefix from Q_CRYPTO_BLOCK_FORMAT to QCRYPTO_BLOCK_FORMAT,
and QCryptoBlockLUKSKeyslotState's from
Q_CRYPTO_BLOCKLUKS_KEYSLOT_STATE to QCRYPTO_BLOCK_LUKS_KEYSLOT_STATE.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20240904111836.3273842-6-armbru@redhat.com>
libblkio supports BLKIO_REQ_FUA with write zeros requests only since
version 1.4.0, so let's inform the block layer that the blkio driver
supports it only in this case. Otherwise we can have runtime errors
as reported in https://issues.redhat.com/browse/RHEL-32878
Fixes: fd66dbd424 ("blkio: add libblkio block driver")
Cc: qemu-stable@nongnu.org
Buglink: https://issues.redhat.com/browse/RHEL-32878
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-id: 20240808080545.40744-1-sgarzare@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Allowing an unlimited number of clients to any web service is a recipe
for a rudimentary denial of service attack: the client merely needs to
open lots of sockets without closing them, until qemu no longer has
any more fds available to allocate.
For qemu-nbd, we default to allowing only 1 connection unless more are
explicitly asked for (-e or --shared); this was historically picked as
a nice default (without an explicit -t, a non-persistent qemu-nbd goes
away after a client disconnects, without needing any additional
follow-up commands), and we are not going to change that interface now
(besides, someday we want to point people towards qemu-storage-daemon
instead of qemu-nbd).
But for qemu proper, and the newer qemu-storage-daemon, the QMP
nbd-server-start command has historically had a default of unlimited
number of connections, in part because unlike qemu-nbd it is
inherently persistent until nbd-server-stop. Allowing multiple client
sockets is particularly useful for clients that can take advantage of
MULTI_CONN (creating parallel sockets to increase throughput),
although known clients that do so (such as libnbd's nbdcopy) typically
use only 8 or 16 connections (the benefits of scaling diminish once
more sockets are competing for kernel attention). Picking a number
large enough for typical use cases, but not unlimited, makes it
slightly harder for a malicious client to perform a denial of service
merely by opening lots of connections withot progressing through the
handshake.
This change does not eliminate CVE-2024-7409 on its own, but reduces
the chance for fd exhaustion or unlimited memory usage as an attack
surface. On the other hand, by itself, it makes it more obvious that
with a finite limit, we have the problem of an unauthenticated client
holding 100 fds opened as a way to block out a legitimate client from
being able to connect; thus, later patches will further add timeouts
to reject clients that are not making progress.
This is an INTENTIONAL change in behavior, and will break any client
of nbd-server-start that was not passing an explicit max-connections
parameter, yet expects more than 100 simultaneous connections. We are
not aware of any such client (as stated above, most clients aware of
MULTI_CONN get by just fine on 8 or 16 connections, and probably cope
with later connections failing by relying on the earlier connections;
libvirt has not yet been passing max-connections, but generally
creates NBD servers with the intent for a single client for the sake
of live storage migration; meanwhile, the KubeSAN project anticipates
a large cluster sharing multiple clients [up to 8 per node, and up to
100 nodes in a cluster], but it currently uses qemu-nbd with an
explicit --shared=0 rather than qemu-storage-daemon with
nbd-server-start).
We considered using a deprecation period (declare that omitting
max-parameters is deprecated, and make it mandatory in 3 releases -
then we don't need to pick an arbitrary default); that has zero risk
of breaking any apps that accidentally depended on more than 100
connections, and where such breakage might not be noticed under unit
testing but only under the larger loads of production usage. But it
does not close the denial-of-service hole until far into the future,
and requires all apps to change to add the parameter even if 100 was
good enough. It also has a drawback that any app (like libvirt) that
is accidentally relying on an unlimited default should seriously
consider their own CVE now, at which point they are going to change to
pass explicit max-connections sooner than waiting for 3 qemu releases.
Finally, if our changed default breaks an app, that app can always
pass in an explicit max-parameters with a larger value.
It is also intentional that the HMP interface to nbd-server-start is
not changed to expose max-connections (any client needing to fine-tune
things should be using QMP).
Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20240807174943.771624-12-eblake@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
[ericb: Expand commit message to summarize Dan's argument for why we
break corner-case back-compat behavior without a deprecation period]
Signed-off-by: Eric Blake <eblake@redhat.com>
When reading with `read_cluster` we get the `mapping` with
`find_mapping_for_cluster` and then we call `open_file` for this
mapping.
The issue appear when its the same file, but a second cluster that is
not immediately after it, imagine clusters `500 -> 503`, this will give
us 2 mappings one has the range `500..501` and another `503..504`, both
point to the same file, but different offsets.
When we don't open the file since the path is the same, we won't assign
`s->current_mapping` and thus accessing way out of bound of the file.
From our example above, after `open_file` (that didn't open anything) we
will get the offset into the file with
`s->cluster_size*(cluster_num-s->current_mapping->begin)`, which will
give us `0x2000 * (504-500)`, which is out of bound for this mapping and
will produce some issues.
Signed-off-by: Amjad Alsharafi <amjadsharafi10@gmail.com>
Message-ID: <1f3ea115779abab62ba32c788073cdc99f9ad5dd.1721470238.git.amjadsharafi10@gmail.com>
[kwolf: Simplified the patch based on Amjad's analysis and input]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
How this `abort` was intended to check for was:
- if the `mapping->first_mapping_index` is not the same as
`first_mapping_index`, which **should** happen only in one case,
when we are handling the first mapping, in that case
`mapping->first_mapping_index == -1`, in all other cases, the other
mappings after the first should have the condition `true`.
- From above, we know that this is the first mapping, so if the offset
is not `0`, then abort, since this is an invalid state.
The issue was that `first_mapping_index` is not set if we are
checking from the middle, the variable `first_mapping_index` is
only set if we passed through the check `cluster_was_modified` with the
first mapping, and in the same function call we checked the other
mappings.
One approach is to go into the loop even if `cluster_was_modified`
is not true so that we will be able to set `first_mapping_index` for the
first mapping, but since `first_mapping_index` is only used here,
another approach is to just check manually for the
`mapping->first_mapping_index != -1` since we know that this is the
value for the only entry where `offset == 0` (i.e. first mapping).
Signed-off-by: Amjad Alsharafi <amjadsharafi10@gmail.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <b0fbca3ee208c565885838f6a7deeaeb23f4f9c2.1721470238.git.amjadsharafi10@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The field is marked as "the offset in the file (in clusters)", but it
was being used like this
`cluster_size*(nums)+mapping->info.file.offset`, which is incorrect.
Signed-off-by: Amjad Alsharafi <amjadsharafi10@gmail.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <72f19a7903886dda1aa78bcae0e17702ee939262.1721470238.git.amjadsharafi10@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Before this commit, the behavior when calling `commit_one_file` for
example with `offset=0x2000` (second cluster), what will happen is that
we won't fetch the next cluster from the fat, and instead use the first
cluster for the read operation.
This is due to off-by-one error here, where `i=0x2000 !< offset=0x2000`,
thus not fetching the next cluster.
Signed-off-by: Amjad Alsharafi <amjadsharafi10@gmail.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Tested-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <b97c1e1f1bc2f776061ae914f95d799d124fcd73.1721470238.git.amjadsharafi10@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The graph lock needs to be held when calling bdrv_co_pdiscard(). Fix
block_copy_task_entry() to take it for the call.
WITH_GRAPH_RDLOCK_GUARD() was implemented in a weak way because of
limitations in clang's Thread Safety Analysis at the time, so that it
only asserts that the lock is held (which allows calling functions that
require the lock), but we never deal with the unlocking (so even after
the scope of the guard, the compiler assumes that the lock is still
held). This is why the compiler didn't catch this locking error.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20240627181245.281403-2-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>