docs: create interop/ subdirectory

This is for the future interoperability & management guide.  It includes
the QAPI docs, including the automatically generated ones, other socket
protocols (vhost-user, VNC), and the qcow2 file format.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This commit is contained in:
Paolo Bonzini 2017-06-06 16:55:19 +02:00
parent 067b913619
commit d59157ea05
13 changed files with 50 additions and 39 deletions

View file

@ -1,228 +0,0 @@
= License =
Copyright (c) 2015 Denis Lunev
Copyright (c) 2015 Vladimir Sementsov-Ogievskiy
This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.
= Parallels Expandable Image File Format =
A Parallels expandable image file consists of three consecutive parts:
* header
* BAT
* data area
All numbers in a Parallels expandable image are stored in little-endian byte
order.
== Definitions ==
Sector A 512-byte data chunk.
Cluster A data chunk of the size specified in the image header.
Currently, the default size is 1MiB (2048 sectors). In previous
versions, cluster sizes of 63 sectors, 256 and 252 kilobytes were
used.
BAT Block Allocation Table, an entity that contains information for
guest-to-host I/O data address translation.
== Header ==
The header is placed at the start of an image and contains the following
fields:
Bytes:
0 - 15: magic
Must contain "WithoutFreeSpace" or "WithouFreSpacExt".
16 - 19: version
Must be 2.
20 - 23: heads
Disk geometry parameter for guest.
24 - 27: cylinders
Disk geometry parameter for guest.
28 - 31: tracks
Cluster size, in sectors.
32 - 35: nb_bat_entries
Disk size, in clusters (BAT size).
36 - 43: nb_sectors
Disk size, in sectors.
For "WithoutFreeSpace" images:
Only the lowest 4 bytes are used. The highest 4 bytes must be
cleared in this case.
For "WithouFreSpacExt" images, there are no such
restrictions.
44 - 47: in_use
Set to 0x746F6E59 when the image is opened by software in R/W
mode; set to 0x312e3276 when the image is closed.
A zero in this field means that the image was opened by an old
version of the software that doesn't support Format Extension
(see below).
Other values are not allowed.
48 - 51: data_off
An offset, in sectors, from the start of the file to the start of
the data area.
For "WithoutFreeSpace" images:
- If data_off is zero, the offset is calculated as the end of BAT
table plus some padding to ensure sector size alignment.
- If data_off is non-zero, the offset should be aligned to sector
size. However it is recommended to align it to cluster size for
newly created images.
For "WithouFreSpacExt" images:
data_off must be non-zero and aligned to cluster size.
52 - 55: flags
Miscellaneous flags.
Bit 0: Empty Image bit. If set, the image should be
considered clear.
Bits 1-31: Unused.
56 - 63: ext_off
Format Extension offset, an offset, in sectors, from the start of
the file to the start of the Format Extension Cluster.
ext_off must meet the same requirements as cluster offsets
defined by BAT entries (see below).
== BAT ==
BAT is placed immediately after the image header. In the file, BAT is a
contiguous array of 32-bit unsigned little-endian integers with
(bat_entries * 4) bytes size.
Each BAT entry contains an offset from the start of the file to the
corresponding cluster. The offset set in clusters for "WithouFreSpacExt" images
and in sectors for "WithoutFreeSpace" images.
If a BAT entry is zero, the corresponding cluster is not allocated and should
be considered as filled with zeroes.
Cluster offsets specified by BAT entries must meet the following requirements:
- the value must not be lower than data offset (provided by header.data_off
or calculated as specified above),
- the value must be lower than the desired file size,
- the value must be unique among all BAT entries,
- the result of (cluster offset - data offset) must be aligned to cluster
size.
== Data Area ==
The data area is an area from the data offset (provided by header.data_off or
calculated as specified above) to the end of the file. It represents a
contiguous array of clusters. Most of them are allocated by the BAT, some may
be allocated by the ext_off field in the header while other may be allocated by
extensions. All clusters allocated by ext_off and extensions should meet the
same requirements as clusters specified by BAT entries.
== Format Extension ==
The Format Extension is an area 1 cluster in size that provides additional
format features. This cluster is addressed by the ext_off field in the header.
The format of the Format Extension area is the following:
0 - 7: magic
Must be 0xAB234CEF23DCEA87
8 - 23: m_CheckSum
The MD5 checksum of the entire Header Extension cluster except
the first 24 bytes.
The above are followed by feature sections or "extensions". The last
extension must be "End of features" (see below).
Each feature section has the following format:
0 - 7: magic
The identifier of the feature:
0x0000000000000000 - End of features
0x20385FAE252CB34A - Dirty bitmap
8 - 15: flags
External flags for extension:
Bit 0: NECESSARY
If the software cannot load the extension (due to an
unknown magic number or error), the file should not be
changed. If this flag is unset and there is an error on
loading the extension, said extension should be dropped.
Bit 1: TRANSIT
If there is an unknown extension with this flag set,
said extension should be left as is.
If neither NECESSARY nor TRANSIT are set, the extension should be
dropped.
16 - 19: data_size
The size of the following feature data, in bytes.
20 - 23: unused32
Align header to 8 bytes boundary.
variable: data (data_size bytes)
The above is followed by padding to the next 8 bytes boundary, then the
next extension starts.
The last extension must be "End of features" with all the fields set to 0.
=== Dirty bitmaps feature ===
This feature provides a way of storing dirty bitmaps in the image. The fields
of its data area are:
0 - 7: size
The bitmap size, should be equal to disk size in sectors.
8 - 23: id
An identifier for backup consistency checking.
24 - 27: granularity
Bitmap granularity, in sectors. I.e., the number of sectors
corresponding to one bit of the bitmap. Granularity must be
a power of 2.
28 - 31: l1_size
The number of entries in the L1 table of the bitmap.
variable: l1 (64 * l1_size bytes)
L1 offset table (in bytes)
A dirty bitmap is stored using a one-level structure for the mapping to host
clusters - an L1 table.
Given an offset in bytes into the bitmap data, the offset in bytes into the
image file can be obtained as follows:
offset = l1_table[offset / cluster_size] + (offset % cluster_size)
If an L1 table entry is 0, the corresponding cluster of the bitmap is assumed
to be zero.
If an L1 table entry is 1, the corresponding cluster of the bitmap is assumed
to have all bits set.
If an L1 table entry is not 0 or 1, it allocates a cluster from the data area.

View file

@ -1,581 +0,0 @@
== General ==
A qcow2 image file is organized in units of constant size, which are called
(host) clusters. A cluster is the unit in which all allocations are done,
both for actual guest data and for image metadata.
Likewise, the virtual disk as seen by the guest is divided into (guest)
clusters of the same size.
All numbers in qcow2 are stored in Big Endian byte order.
== Header ==
The first cluster of a qcow2 image contains the file header:
Byte 0 - 3: magic
QCOW magic string ("QFI\xfb")
4 - 7: version
Version number (valid values are 2 and 3)
8 - 15: backing_file_offset
Offset into the image file at which the backing file name
is stored (NB: The string is not null terminated). 0 if the
image doesn't have a backing file.
16 - 19: backing_file_size
Length of the backing file name in bytes. Must not be
longer than 1023 bytes. Undefined if the image doesn't have
a backing file.
20 - 23: cluster_bits
Number of bits that are used for addressing an offset
within a cluster (1 << cluster_bits is the cluster size).
Must not be less than 9 (i.e. 512 byte clusters).
Note: qemu as of today has an implementation limit of 2 MB
as the maximum cluster size and won't be able to open images
with larger cluster sizes.
24 - 31: size
Virtual disk size in bytes
32 - 35: crypt_method
0 for no encryption
1 for AES encryption
36 - 39: l1_size
Number of entries in the active L1 table
40 - 47: l1_table_offset
Offset into the image file at which the active L1 table
starts. Must be aligned to a cluster boundary.
48 - 55: refcount_table_offset
Offset into the image file at which the refcount table
starts. Must be aligned to a cluster boundary.
56 - 59: refcount_table_clusters
Number of clusters that the refcount table occupies
60 - 63: nb_snapshots
Number of snapshots contained in the image
64 - 71: snapshots_offset
Offset into the image file at which the snapshot table
starts. Must be aligned to a cluster boundary.
If the version is 3 or higher, the header has the following additional fields.
For version 2, the values are assumed to be zero, unless specified otherwise
in the description of a field.
72 - 79: incompatible_features
Bitmask of incompatible features. An implementation must
fail to open an image if an unknown bit is set.
Bit 0: Dirty bit. If this bit is set then refcounts
may be inconsistent, make sure to scan L1/L2
tables to repair refcounts before accessing the
image.
Bit 1: Corrupt bit. If this bit is set then any data
structure may be corrupt and the image must not
be written to (unless for regaining
consistency).
Bits 2-63: Reserved (set to 0)
80 - 87: compatible_features
Bitmask of compatible features. An implementation can
safely ignore any unknown bits that are set.
Bit 0: Lazy refcounts bit. If this bit is set then
lazy refcount updates can be used. This means
marking the image file dirty and postponing
refcount metadata updates.
Bits 1-63: Reserved (set to 0)
88 - 95: autoclear_features
Bitmask of auto-clear features. An implementation may only
write to an image with unknown auto-clear features if it
clears the respective bits from this field first.
Bit 0: Bitmaps extension bit
This bit indicates consistency for the bitmaps
extension data.
It is an error if this bit is set without the
bitmaps extension present.
If the bitmaps extension is present but this
bit is unset, the bitmaps extension data must be
considered inconsistent.
Bits 1-63: Reserved (set to 0)
96 - 99: refcount_order
Describes the width of a reference count block entry (width
in bits: refcount_bits = 1 << refcount_order). For version 2
images, the order is always assumed to be 4
(i.e. refcount_bits = 16).
This value may not exceed 6 (i.e. refcount_bits = 64).
100 - 103: header_length
Length of the header structure in bytes. For version 2
images, the length is always assumed to be 72 bytes.
Directly after the image header, optional sections called header extensions can
be stored. Each extension has a structure like the following:
Byte 0 - 3: Header extension type:
0x00000000 - End of the header extension area
0xE2792ACA - Backing file format name
0x6803f857 - Feature name table
0x23852875 - Bitmaps extension
other - Unknown header extension, can be safely
ignored
4 - 7: Length of the header extension data
8 - n: Header extension data
n - m: Padding to round up the header extension size to the next
multiple of 8.
Unless stated otherwise, each header extension type shall appear at most once
in the same image.
If the image has a backing file then the backing file name should be stored in
the remaining space between the end of the header extension area and the end of
the first cluster. It is not allowed to store other data here, so that an
implementation can safely modify the header and add extensions without harming
data of compatible features that it doesn't support. Compatible features that
need space for additional data can use a header extension.
== Feature name table ==
The feature name table is an optional header extension that contains the name
for features used by the image. It can be used by applications that don't know
the respective feature (e.g. because the feature was introduced only later) to
display a useful error message.
The number of entries in the feature name table is determined by the length of
the header extension data. Each entry look like this:
Byte 0: Type of feature (select feature bitmap)
0: Incompatible feature
1: Compatible feature
2: Autoclear feature
1: Bit number within the selected feature bitmap (valid
values: 0-63)
2 - 47: Feature name (padded with zeros, but not necessarily null
terminated if it has full length)
== Bitmaps extension ==
The bitmaps extension is an optional header extension. It provides the ability
to store bitmaps related to a virtual disk. For now, there is only one bitmap
type: the dirty tracking bitmap, which tracks virtual disk changes from some
point in time.
The data of the extension should be considered consistent only if the
corresponding auto-clear feature bit is set, see autoclear_features above.
The fields of the bitmaps extension are:
Byte 0 - 3: nb_bitmaps
The number of bitmaps contained in the image. Must be
greater than or equal to 1.
Note: Qemu currently only supports up to 65535 bitmaps per
image.
4 - 7: Reserved, must be zero.
8 - 15: bitmap_directory_size
Size of the bitmap directory in bytes. It is the cumulative
size of all (nb_bitmaps) bitmap headers.
16 - 23: bitmap_directory_offset
Offset into the image file at which the bitmap directory
starts. Must be aligned to a cluster boundary.
== Host cluster management ==
qcow2 manages the allocation of host clusters by maintaining a reference count
for each host cluster. A refcount of 0 means that the cluster is free, 1 means
that it is used, and >= 2 means that it is used and any write access must
perform a COW (copy on write) operation.
The refcounts are managed in a two-level table. The first level is called
refcount table and has a variable size (which is stored in the header). The
refcount table can cover multiple clusters, however it needs to be contiguous
in the image file.
It contains pointers to the second level structures which are called refcount
blocks and are exactly one cluster in size.
Given a offset into the image file, the refcount of its cluster can be obtained
as follows:
refcount_block_entries = (cluster_size * 8 / refcount_bits)
refcount_block_index = (offset / cluster_size) % refcount_block_entries
refcount_table_index = (offset / cluster_size) / refcount_block_entries
refcount_block = load_cluster(refcount_table[refcount_table_index]);
return refcount_block[refcount_block_index];
Refcount table entry:
Bit 0 - 8: Reserved (set to 0)
9 - 63: Bits 9-63 of the offset into the image file at which the
refcount block starts. Must be aligned to a cluster
boundary.
If this is 0, the corresponding refcount block has not yet
been allocated. All refcounts managed by this refcount block
are 0.
Refcount block entry (x = refcount_bits - 1):
Bit 0 - x: Reference count of the cluster. If refcount_bits implies a
sub-byte width, note that bit 0 means the least significant
bit in this context.
== Cluster mapping ==
Just as for refcounts, qcow2 uses a two-level structure for the mapping of
guest clusters to host clusters. They are called L1 and L2 table.
The L1 table has a variable size (stored in the header) and may use multiple
clusters, however it must be contiguous in the image file. L2 tables are
exactly one cluster in size.
Given a offset into the virtual disk, the offset into the image file can be
obtained as follows:
l2_entries = (cluster_size / sizeof(uint64_t))
l2_index = (offset / cluster_size) % l2_entries
l1_index = (offset / cluster_size) / l2_entries
l2_table = load_cluster(l1_table[l1_index]);
cluster_offset = l2_table[l2_index];
return cluster_offset + (offset % cluster_size)
L1 table entry:
Bit 0 - 8: Reserved (set to 0)
9 - 55: Bits 9-55 of the offset into the image file at which the L2
table starts. Must be aligned to a cluster boundary. If the
offset is 0, the L2 table and all clusters described by this
L2 table are unallocated.
56 - 62: Reserved (set to 0)
63: 0 for an L2 table that is unused or requires COW, 1 if its
refcount is exactly one. This information is only accurate
in the active L1 table.
L2 table entry:
Bit 0 - 61: Cluster descriptor
62: 0 for standard clusters
1 for compressed clusters
63: 0 for a cluster that is unused or requires COW, 1 if its
refcount is exactly one. This information is only accurate
in L2 tables that are reachable from the active L1
table.
Standard Cluster Descriptor:
Bit 0: If set to 1, the cluster reads as all zeros. The host
cluster offset can be used to describe a preallocation,
but it won't be used for reading data from this cluster,
nor is data read from the backing file if the cluster is
unallocated.
With version 2, this is always 0.
1 - 8: Reserved (set to 0)
9 - 55: Bits 9-55 of host cluster offset. Must be aligned to a
cluster boundary. If the offset is 0, the cluster is
unallocated.
56 - 61: Reserved (set to 0)
Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
Bit 0 - x: Host cluster offset. This is usually _not_ aligned to a
cluster boundary!
x+1 - 61: Compressed size of the images in sectors of 512 bytes
If a cluster is unallocated, read requests shall read the data from the backing
file (except if bit 0 in the Standard Cluster Descriptor is set). If there is
no backing file or the backing file is smaller than the image, they shall read
zeros for all parts that are not covered by the backing file.
== Snapshots ==
qcow2 supports internal snapshots. Their basic principle of operation is to
switch the active L1 table, so that a different set of host clusters are
exposed to the guest.
When creating a snapshot, the L1 table should be copied and the refcount of all
L2 tables and clusters reachable from this L1 table must be increased, so that
a write causes a COW and isn't visible in other snapshots.
When loading a snapshot, bit 63 of all entries in the new active L1 table and
all L2 tables referenced by it must be reconstructed from the refcount table
as it doesn't need to be accurate in inactive L1 tables.
A directory of all snapshots is stored in the snapshot table, a contiguous area
in the image file, whose starting offset and length are given by the header
fields snapshots_offset and nb_snapshots. The entries of the snapshot table
have variable length, depending on the length of ID, name and extra data.
Snapshot table entry:
Byte 0 - 7: Offset into the image file at which the L1 table for the
snapshot starts. Must be aligned to a cluster boundary.
8 - 11: Number of entries in the L1 table of the snapshots
12 - 13: Length of the unique ID string describing the snapshot
14 - 15: Length of the name of the snapshot
16 - 19: Time at which the snapshot was taken in seconds since the
Epoch
20 - 23: Subsecond part of the time at which the snapshot was taken
in nanoseconds
24 - 31: Time that the guest was running until the snapshot was
taken in nanoseconds
32 - 35: Size of the VM state in bytes. 0 if no VM state is saved.
If there is VM state, it starts at the first cluster
described by first L1 table entry that doesn't describe a
regular guest cluster (i.e. VM state is stored like guest
disk content, except that it is stored at offsets that are
larger than the virtual disk presented to the guest)
36 - 39: Size of extra data in the table entry (used for future
extensions of the format)
variable: Extra data for future extensions. Unknown fields must be
ignored. Currently defined are (offset relative to snapshot
table entry):
Byte 40 - 47: Size of the VM state in bytes. 0 if no VM
state is saved. If this field is present,
the 32-bit value in bytes 32-35 is ignored.
Byte 48 - 55: Virtual disk size of the snapshot in bytes
Version 3 images must include extra data at least up to
byte 55.
variable: Unique ID string for the snapshot (not null terminated)
variable: Name of the snapshot (not null terminated)
variable: Padding to round up the snapshot table entry size to the
next multiple of 8.
== Bitmaps ==
As mentioned above, the bitmaps extension provides the ability to store bitmaps
related to a virtual disk. This section describes how these bitmaps are stored.
All stored bitmaps are related to the virtual disk stored in the same image, so
each bitmap size is equal to the virtual disk size.
Each bit of the bitmap is responsible for strictly defined range of the virtual
disk. For bit number bit_nr the corresponding range (in bytes) will be:
[bit_nr * bitmap_granularity .. (bit_nr + 1) * bitmap_granularity - 1]
Granularity is a property of the concrete bitmap, see below.
=== Bitmap directory ===
Each bitmap saved in the image is described in a bitmap directory entry. The
bitmap directory is a contiguous area in the image file, whose starting offset
and length are given by the header extension fields bitmap_directory_offset and
bitmap_directory_size. The entries of the bitmap directory have variable
length, depending on the lengths of the bitmap name and extra data. These
entries are also called bitmap headers.
Structure of a bitmap directory entry:
Byte 0 - 7: bitmap_table_offset
Offset into the image file at which the bitmap table
(described below) for the bitmap starts. Must be aligned to
a cluster boundary.
8 - 11: bitmap_table_size
Number of entries in the bitmap table of the bitmap.
12 - 15: flags
Bit
0: in_use
The bitmap was not saved correctly and may be
inconsistent.
1: auto
The bitmap must reflect all changes of the virtual
disk by any application that would write to this qcow2
file (including writes, snapshot switching, etc.). The
type of this bitmap must be 'dirty tracking bitmap'.
2: extra_data_compatible
This flags is meaningful when the extra data is
unknown to the software (currently any extra data is
unknown to Qemu).
If it is set, the bitmap may be used as expected, extra
data must be left as is.
If it is not set, the bitmap must not be used, but
both it and its extra data be left as is.
Bits 3 - 31 are reserved and must be 0.
16: type
This field describes the sort of the bitmap.
Values:
1: Dirty tracking bitmap
Values 0, 2 - 255 are reserved.
17: granularity_bits
Granularity bits. Valid values: 0 - 63.
Note: Qemu currently doesn't support granularity_bits
greater than 31.
Granularity is calculated as
granularity = 1 << granularity_bits
A bitmap's granularity is how many bytes of the image
accounts for one bit of the bitmap.
18 - 19: name_size
Size of the bitmap name. Must be non-zero.
Note: Qemu currently doesn't support values greater than
1023.
20 - 23: extra_data_size
Size of type-specific extra data.
For now, as no extra data is defined, extra_data_size is
reserved and should be zero. If it is non-zero the
behavior is defined by extra_data_compatible flag.
variable: extra_data
Extra data for the bitmap, occupying extra_data_size bytes.
Extra data must never contain references to clusters or in
some other way allocate additional clusters.
variable: name
The name of the bitmap (not null terminated), occupying
name_size bytes. Must be unique among all bitmap names
within the bitmaps extension.
variable: Padding to round up the bitmap directory entry size to the
next multiple of 8. All bytes of the padding must be zero.
=== Bitmap table ===
Each bitmap is stored using a one-level structure (as opposed to two-level
structures like for refcounts and guest clusters mapping) for the mapping of
bitmap data to host clusters. This structure is called the bitmap table.
Each bitmap table has a variable size (stored in the bitmap directory entry)
and may use multiple clusters, however, it must be contiguous in the image
file.
Structure of a bitmap table entry:
Bit 0: Reserved and must be zero if bits 9 - 55 are non-zero.
If bits 9 - 55 are zero:
0: Cluster should be read as all zeros.
1: Cluster should be read as all ones.
1 - 8: Reserved and must be zero.
9 - 55: Bits 9 - 55 of the host cluster offset. Must be aligned to
a cluster boundary. If the offset is 0, the cluster is
unallocated; in that case, bit 0 determines how this
cluster should be treated during reads.
56 - 63: Reserved and must be zero.
=== Bitmap data ===
As noted above, bitmap data is stored in separate clusters, described by the
bitmap table. Given an offset (in bytes) into the bitmap data, the offset into
the image file can be obtained as follows:
image_offset(bitmap_data_offset) =
bitmap_table[bitmap_data_offset / cluster_size] +
(bitmap_data_offset % cluster_size)
This offset is not defined if bits 9 - 55 of bitmap table entry are zero (see
above).
Given an offset byte_nr into the virtual disk and the bitmap's granularity, the
bit offset into the image file to the corresponding bit of the bitmap can be
calculated like this:
bit_offset(byte_nr) =
image_offset(byte_nr / granularity / 8) * 8 +
(byte_nr / granularity) % 8
If the size of the bitmap data is not a multiple of the cluster size then the
last cluster of the bitmap data contains some unused tail bits. These bits must
be zero.
=== Dirty tracking bitmaps ===
Bitmaps with 'type' field equal to one are dirty tracking bitmaps.
When the virtual disk is in use dirty tracking bitmap may be 'enabled' or
'disabled'. While the bitmap is 'enabled', all writes to the virtual disk
should be reflected in the bitmap. A set bit in the bitmap means that the
corresponding range of the virtual disk (see above) was written to while the
bitmap was 'enabled'. An unset bit means that this range was not written to.
The software doesn't have to sync the bitmap in the image file with its
representation in RAM after each write. Flag 'in_use' should be set while the
bitmap is not synced.
In the image file the 'enabled' state is reflected by the 'auto' flag. If this
flag is set, the software must consider the bitmap as 'enabled' and start
tracking virtual disk changes to this bitmap from the first write to the
virtual disk. If this flag is not set then the bitmap is disabled.

View file

@ -1,138 +0,0 @@
=Specification=
The file format looks like this:
+----------+----------+----------+-----+
| cluster0 | cluster1 | cluster2 | ... |
+----------+----------+----------+-----+
The first cluster begins with the '''header'''. The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file. A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''. L1 and L2 tables are composed of one or more contiguous clusters.
Normally the file size will be a multiple of the cluster size. If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written. Legitimate extra information should use space between the header and the first regular cluster.
All fields are little-endian.
==Header==
Header {
uint32_t magic; /* QED\0 */
uint32_t cluster_size; /* in bytes */
uint32_t table_size; /* for L1 and L2 tables, in clusters */
uint32_t header_size; /* in clusters */
uint64_t features; /* format feature bits */
uint64_t compat_features; /* compat feature bits */
uint64_t autoclear_features; /* self-resetting feature bits */
uint64_t l1_table_offset; /* in bytes */
uint64_t image_size; /* total logical image size, in bytes */
/* if (features & QED_F_BACKING_FILE) */
uint32_t backing_filename_offset; /* in bytes from start of header */
uint32_t backing_filename_size; /* in bytes */
}
Field descriptions:
* ''cluster_size'' must be a power of 2 in range [2^12, 2^26].
* ''table_size'' must be a power of 2 in range [1, 16].
* ''header_size'' is the number of clusters used by the header and any additional information stored before regular clusters.
* ''features'', ''compat_features'', and ''autoclear_features'' are file format extension bitmaps. They work as follows:
** An image with unknown ''features'' bits enabled must not be opened. File format changes that are not backwards-compatible must use ''features'' bits.
** An image with unknown ''compat_features'' bits enabled can be opened safely. The unknown features are simply ignored and represent backwards-compatible changes to the file format.
** An image with unknown ''autoclear_features'' bits enable can be opened safely after clearing the unknown bits. This allows for backwards-compatible changes to the file format which degrade gracefully and can be re-enabled again by a new program later.
* ''l1_table_offset'' is the offset of the first byte of the L1 table in the image file and must be a multiple of ''cluster_size''.
* ''image_size'' is the block device size seen by the guest and must be a multiple of 512 bytes.
* ''backing_filename_offset'' and ''backing_filename_size'' describe a string in (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints. The string must be stored within the first ''header_size'' clusters. The backing filename may be an absolute path or relative to the image file.
Feature bits:
* QED_F_BACKING_FILE = 0x01. The image uses a backing file.
* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use.
* QED_F_BACKING_FORMAT_NO_PROBE = 0x04. The backing file is a raw disk image and no file format autodetection should be attempted. This should be used to ensure that raw backing files are never detected as an image format if they happen to contain magic constants.
There are currently no defined ''compat_features'' or ''autoclear_features'' bits.
Fields predicated on a feature bit are only used when that feature is set. The fields always take up header space, regardless of whether or not the feature bit is set.
==Tables==
Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
Table {
uint64_t offsets[TABLE_NOFFSETS];
}
The tables are organized as follows:
+----------+
| L1 table |
+----------+
,------' | '------.
+----------+ | +----------+
| L2 table | ... | L2 table |
+----------+ +----------+
,------' | '------.
+----------+ | +----------+
| Data | ... | Data |
+----------+ +----------+
A table is made up of one or more contiguous clusters. The table_size header field determines table size for an image file. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
L1, L2, and data cluster offsets must be aligned to header.cluster_size. The following offsets have special meanings:
===L2 table offsets===
* 0 - unallocated. The L2 table is not yet allocated.
===Data cluster offsets===
* 0 - unallocated. The data cluster is not yet allocated.
* 1 - zero. The data cluster contents are all zeroes and no cluster is allocated.
Future format extensions may wish to store per-offset information. The least significant 12 bits of an offset are reserved for this purpose and must be set to zero. Image files with cluster_size > 2^12 will have more unused bits which should also be zeroed.
===Unallocated L2 tables and data clusters===
Reads to an unallocated area of the image file access the backing file. If there is no backing file, then zeroes are produced. The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes.
Writes to an unallocated area cause a new data clusters to be allocated, and a new L2 table if that is also unallocated. The new data cluster is populated with data from the backing file (or zeroes if no backing file) and the data being written.
===Zero data clusters===
Zero data clusters are a space-efficient way of storing zeroed regions of the image.
Reads to a zero data cluster produce zeroes. Note that the difference between an unallocated and a zero data cluster is that zero data clusters stop the reading of contents from the backing file.
Writes to a zero data cluster cause a new data cluster to be allocated. The new data cluster is populated with zeroes and the data being written.
===Logical offset translation===
Logical offsets are translated into cluster offsets as follows:
table_bits table_bits cluster_bits
<--------> <--------> <--------------->
+----------+----------+-----------------+
| L1 index | L2 index | byte offset |
+----------+----------+-----------------+
Structure of a logical offset
offset_mask = ~(cluster_size - 1) # mask for the image file byte offset
def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
l2_offset = l1_table[l1_index]
l2_table = load_table(l2_offset)
cluster_offset = l2_table[l2_index] & offset_mask
return cluster_offset + byte_offset
==Consistency checking==
This section is informational and included to provide background on the use of the QED_F_NEED_CHECK ''features'' bit.
The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure. A dirty image must be checked on open because its metadata may not be consistent.
Consistency check includes the following invariants:
# Each cluster is referenced once and only once. It is an inconsistency to have a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked if it has no references.
# Offsets must be within the image file size and must be ''cluster_size'' aligned.
# Table offsets must at least ''table_size'' * ''cluster_size'' bytes from the end of the image file so that there is space for the entire table.
The consistency check process starts by from ''l1_table_offset'' and scans all L2 tables. After the check completes with no other errors besides leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed.

View file

@ -1,620 +0,0 @@
Vhost-user Protocol
===================
Copyright (c) 2014 Virtual Open Systems Sarl.
This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.
===================
This protocol is aiming to complement the ioctl interface used to control the
vhost implementation in the Linux kernel. It implements the control plane needed
to establish virtqueue sharing with a user space process on the same host. It
uses communication over a Unix domain socket to share file descriptors in the
ancillary data of the message.
The protocol defines 2 sides of the communication, master and slave. Master is
the application that shares its virtqueues, in our case QEMU. Slave is the
consumer of the virtqueues.
In the current implementation QEMU is the Master, and the Slave is intended to
be a software Ethernet switch running in user space, such as Snabbswitch.
Master and slave can be either a client (i.e. connecting) or server (listening)
in the socket communication.
Message Specification
---------------------
Note that all numbers are in the machine native byte order. A vhost-user message
consists of 3 header fields and a payload:
------------------------------------
| request | flags | size | payload |
------------------------------------
* Request: 32-bit type of the request
* Flags: 32-bit bit field:
- Lower 2 bits are the version (currently 0x01)
- Bit 2 is the reply flag - needs to be sent on each reply from the slave
- Bit 3 is the need_reply flag - see VHOST_USER_PROTOCOL_F_REPLY_ACK for
details.
* Size - 32-bit size of the payload
Depending on the request type, payload can be:
* A single 64-bit integer
-------
| u64 |
-------
u64: a 64-bit unsigned integer
* A vring state description
---------------
| index | num |
---------------
Index: a 32-bit index
Num: a 32-bit number
* A vring address description
--------------------------------------------------------------
| index | flags | size | descriptor | used | available | log |
--------------------------------------------------------------
Index: a 32-bit vring index
Flags: a 32-bit vring flags
Descriptor: a 64-bit user address of the vring descriptor table
Used: a 64-bit user address of the vring used ring
Available: a 64-bit user address of the vring available ring
Log: a 64-bit guest address for logging
* Memory regions description
---------------------------------------------------
| num regions | padding | region0 | ... | region7 |
---------------------------------------------------
Num regions: a 32-bit number of regions
Padding: 32-bit
A region is:
-----------------------------------------------------
| guest address | size | user address | mmap offset |
-----------------------------------------------------
Guest address: a 64-bit guest address of the region
Size: a 64-bit size
User address: a 64-bit user address
mmap offset: 64-bit offset where region starts in the mapped memory
* Log description
---------------------------
| log size | log offset |
---------------------------
log size: size of area used for logging
log offset: offset from start of supplied file descriptor
where logging starts (i.e. where guest address 0 would be logged)
* An IOTLB message
---------------------------------------------------------
| iova | size | user address | permissions flags | type |
---------------------------------------------------------
IOVA: a 64-bit I/O virtual address programmed by the guest
Size: a 64-bit size
User address: a 64-bit user address
Permissions: a 8-bit value:
- 0: No access
- 1: Read access
- 2: Write access
- 3: Read/Write access
Type: a 8-bit IOTLB message type:
- 1: IOTLB miss
- 2: IOTLB update
- 3: IOTLB invalidate
- 4: IOTLB access fail
In QEMU the vhost-user message is implemented with the following struct:
typedef struct VhostUserMsg {
VhostUserRequest request;
uint32_t flags;
uint32_t size;
union {
uint64_t u64;
struct vhost_vring_state state;
struct vhost_vring_addr addr;
VhostUserMemory memory;
VhostUserLog log;
struct vhost_iotlb_msg iotlb;
};
} QEMU_PACKED VhostUserMsg;
Communication
-------------
The protocol for vhost-user is based on the existing implementation of vhost
for the Linux Kernel. Most messages that can be sent via the Unix domain socket
implementing vhost-user have an equivalent ioctl to the kernel implementation.
The communication consists of master sending message requests and slave sending
message replies. Most of the requests don't require replies. Here is a list of
the ones that do:
* VHOST_USER_GET_FEATURES
* VHOST_USER_GET_PROTOCOL_FEATURES
* VHOST_USER_GET_VRING_BASE
* VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
[ Also see the section on REPLY_ACK protocol extension. ]
There are several messages that the master sends with file descriptors passed
in the ancillary data:
* VHOST_USER_SET_MEM_TABLE
* VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
* VHOST_USER_SET_LOG_FD
* VHOST_USER_SET_VRING_KICK
* VHOST_USER_SET_VRING_CALL
* VHOST_USER_SET_VRING_ERR
* VHOST_USER_SET_SLAVE_REQ_FD
If Master is unable to send the full message or receives a wrong reply it will
close the connection. An optional reconnection mechanism can be implemented.
Any protocol extensions are gated by protocol feature bits,
which allows full backwards compatibility on both master
and slave.
As older slaves don't support negotiating protocol features,
a feature bit was dedicated for this purpose:
#define VHOST_USER_F_PROTOCOL_FEATURES 30
Starting and stopping rings
----------------------
Client must only process each ring when it is started.
Client must only pass data between the ring and the
backend, when the ring is enabled.
If ring is started but disabled, client must process the
ring without talking to the backend.
For example, for a networking device, in the disabled state
client must not supply any new RX packets, but must process
and discard any TX packets.
If VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated, the ring is initialized
in an enabled state.
If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is initialized
in a disabled state. Client must not pass data to/from the backend until ring is enabled by
VHOST_USER_SET_VRING_ENABLE with parameter 1, or after it has been disabled by
VHOST_USER_SET_VRING_ENABLE with parameter 0.
Each ring is initialized in a stopped state, client must not process it until
ring is started, or after it has been stopped.
Client must start ring upon receiving a kick (that is, detecting that file
descriptor is readable) on the descriptor specified by
VHOST_USER_SET_VRING_KICK, and stop ring upon receiving
VHOST_USER_GET_VRING_BASE.
While processing the rings (whether they are enabled or not), client must
support changing some configuration aspects on the fly.
Multiple queue support
----------------------
Multiple queue is treated as a protocol extension, hence the slave has to
implement protocol features first. The multiple queues feature is supported
only when the protocol feature VHOST_USER_PROTOCOL_F_MQ (bit 0) is set.
The max number of queues the slave supports can be queried with message
VHOST_USER_GET_PROTOCOL_FEATURES. Master should stop when the number of
requested queues is bigger than that.
As all queues share one connection, the master uses a unique index for each
queue in the sent message to identify a specified queue. One queue pair
is enabled initially. More queues are enabled dynamically, by sending
message VHOST_USER_SET_VRING_ENABLE.
Migration
---------
During live migration, the master may need to track the modifications
the slave makes to the memory mapped regions. The client should mark
the dirty pages in a log. Once it complies to this logging, it may
declare the VHOST_F_LOG_ALL vhost feature.
To start/stop logging of data/used ring writes, server may send messages
VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and VHOST_USER_SET_VRING_ADDR with
VHOST_VRING_F_LOG in ring's flags set to 1/0, respectively.
All the modifications to memory pointed by vring "descriptor" should
be marked. Modifications to "used" vring should be marked if
VHOST_VRING_F_LOG is part of ring's flags.
Dirty pages are of size:
#define VHOST_LOG_PAGE 0x1000
The log memory fd is provided in the ancillary data of
VHOST_USER_SET_LOG_BASE message when the slave has
VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature.
The size of the log is supplied as part of VhostUserMsg
which should be large enough to cover all known guest
addresses. Log starts at the supplied offset in the
supplied file descriptor.
The log covers from address 0 to the maximum of guest
regions. In pseudo-code, to mark page at "addr" as dirty:
page = addr / VHOST_LOG_PAGE
log[page / 8] |= 1 << page % 8
Where addr is the guest physical address.
Use atomic operations, as the log may be concurrently manipulated.
Note that when logging modifications to the used ring (when VHOST_VRING_F_LOG
is set for this ring), log_guest_addr should be used to calculate the log
offset: the write to first byte of the used ring is logged at this offset from
log start. Also note that this value might be outside the legal guest physical
address range (i.e. does not have to be covered by the VhostUserMemory table),
but the bit offset of the last byte of the ring must fall within
the size supplied by VhostUserLog.
VHOST_USER_SET_LOG_FD is an optional message with an eventfd in
ancillary data, it may be used to inform the master that the log has
been modified.
Once the source has finished migration, rings will be stopped by
the source. No further update must be done before rings are
restarted.
IOMMU support
-------------
When the VIRTIO_F_IOMMU_PLATFORM feature has been negotiated, the master
sends IOTLB entries update & invalidation by sending VHOST_USER_IOTLB_MSG
requests to the slave with a struct vhost_iotlb_msg as payload. For update
events, the iotlb payload has to be filled with the update message type (2),
the I/O virtual address, the size, the user virtual address, and the
permissions flags. Addresses and size must be within vhost memory regions set
via the VHOST_USER_SET_MEM_TABLE request. For invalidation events, the iotlb
payload has to be filled with the invalidation message type (3), the I/O virtual
address and the size. On success, the slave is expected to reply with a zero
payload, non-zero otherwise.
The slave relies on the slave communcation channel (see "Slave communication"
section below) to send IOTLB miss and access failure events, by sending
VHOST_USER_SLAVE_IOTLB_MSG requests to the master with a struct vhost_iotlb_msg
as payload. For miss events, the iotlb payload has to be filled with the miss
message type (1), the I/O virtual address and the permissions flags. For access
failure event, the iotlb payload has to be filled with the access failure
message type (4), the I/O virtual address and the permissions flags.
For synchronization purpose, the slave may rely on the reply-ack feature,
so the master may send a reply when operation is completed if the reply-ack
feature is negotiated and slaves requests a reply. For miss events, completed
operation means either master sent an update message containing the IOTLB entry
containing requested address and permission, or master sent nothing if the IOTLB
miss message is invalid (invalid IOVA or permission).
The master isn't expected to take the initiative to send IOTLB update messages,
as the slave sends IOTLB miss messages for the guest virtual memory areas it
needs to access.
Slave communication
-------------------
An optional communication channel is provided if the slave declares
VHOST_USER_PROTOCOL_F_SLAVE_REQ protocol feature, to allow the slave to make
requests to the master.
The fd is provided via VHOST_USER_SET_SLAVE_REQ_FD ancillary data.
A slave may then send VHOST_USER_SLAVE_* messages to the master
using this fd communication channel.
Protocol features
-----------------
#define VHOST_USER_PROTOCOL_F_MQ 0
#define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1
#define VHOST_USER_PROTOCOL_F_RARP 2
#define VHOST_USER_PROTOCOL_F_REPLY_ACK 3
#define VHOST_USER_PROTOCOL_F_MTU 4
#define VHOST_USER_PROTOCOL_F_SLAVE_REQ 5
Master message types
--------------------
* VHOST_USER_GET_FEATURES
Id: 1
Equivalent ioctl: VHOST_GET_FEATURES
Master payload: N/A
Slave payload: u64
Get from the underlying vhost implementation the features bitmask.
Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.
* VHOST_USER_SET_FEATURES
Id: 2
Ioctl: VHOST_SET_FEATURES
Master payload: u64
Enable features in the underlying vhost implementation using a bitmask.
Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.
* VHOST_USER_GET_PROTOCOL_FEATURES
Id: 15
Equivalent ioctl: VHOST_GET_FEATURES
Master payload: N/A
Slave payload: u64
Get the protocol feature bitmask from the underlying vhost implementation.
Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
VHOST_USER_GET_FEATURES.
Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
this message even before VHOST_USER_SET_FEATURES was called.
* VHOST_USER_SET_PROTOCOL_FEATURES
Id: 16
Ioctl: VHOST_SET_FEATURES
Master payload: u64
Enable protocol features in the underlying vhost implementation.
Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
VHOST_USER_GET_FEATURES.
Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
this message even before VHOST_USER_SET_FEATURES was called.
* VHOST_USER_SET_OWNER
Id: 3
Equivalent ioctl: VHOST_SET_OWNER
Master payload: N/A
Issued when a new connection is established. It sets the current Master
as an owner of the session. This can be used on the Slave as a
"session start" flag.
* VHOST_USER_RESET_OWNER
Id: 4
Master payload: N/A
This is no longer used. Used to be sent to request disabling
all rings, but some clients interpreted it to also discard
connection state (this interpretation would lead to bugs).
It is recommended that clients either ignore this message,
or use it to disable all rings.
* VHOST_USER_SET_MEM_TABLE
Id: 5
Equivalent ioctl: VHOST_SET_MEM_TABLE
Master payload: memory regions description
Sets the memory map regions on the slave so it can translate the vring
addresses. In the ancillary data there is an array of file descriptors
for each memory mapped region. The size and ordering of the fds matches
the number and ordering of memory regions.
* VHOST_USER_SET_LOG_BASE
Id: 6
Equivalent ioctl: VHOST_SET_LOG_BASE
Master payload: u64
Slave payload: N/A
Sets logging shared memory space.
When slave has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol
feature, the log memory fd is provided in the ancillary data of
VHOST_USER_SET_LOG_BASE message, the size and offset of shared
memory area provided in the message.
* VHOST_USER_SET_LOG_FD
Id: 7
Equivalent ioctl: VHOST_SET_LOG_FD
Master payload: N/A
Sets the logging file descriptor, which is passed as ancillary data.
* VHOST_USER_SET_VRING_NUM
Id: 8
Equivalent ioctl: VHOST_SET_VRING_NUM
Master payload: vring state description
Set the size of the queue.
* VHOST_USER_SET_VRING_ADDR
Id: 9
Equivalent ioctl: VHOST_SET_VRING_ADDR
Master payload: vring address description
Slave payload: N/A
Sets the addresses of the different aspects of the vring.
* VHOST_USER_SET_VRING_BASE
Id: 10
Equivalent ioctl: VHOST_SET_VRING_BASE
Master payload: vring state description
Sets the base offset in the available vring.
* VHOST_USER_GET_VRING_BASE
Id: 11
Equivalent ioctl: VHOST_USER_GET_VRING_BASE
Master payload: vring state description
Slave payload: vring state description
Get the available vring base offset.
* VHOST_USER_SET_VRING_KICK
Id: 12
Equivalent ioctl: VHOST_SET_VRING_KICK
Master payload: u64
Set the event file descriptor for adding buffers to the vring. It
is passed in the ancillary data.
Bits (0-7) of the payload contain the vring index. Bit 8 is the
invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data. This signals that polling should be used
instead of waiting for a kick.
* VHOST_USER_SET_VRING_CALL
Id: 13
Equivalent ioctl: VHOST_SET_VRING_CALL
Master payload: u64
Set the event file descriptor to signal when buffers are used. It
is passed in the ancillary data.
Bits (0-7) of the payload contain the vring index. Bit 8 is the
invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data. This signals that polling will be used
instead of waiting for the call.
* VHOST_USER_SET_VRING_ERR
Id: 14
Equivalent ioctl: VHOST_SET_VRING_ERR
Master payload: u64
Set the event file descriptor to signal when error occurs. It
is passed in the ancillary data.
Bits (0-7) of the payload contain the vring index. Bit 8 is the
invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data.
* VHOST_USER_GET_QUEUE_NUM
Id: 17
Equivalent ioctl: N/A
Master payload: N/A
Slave payload: u64
Query how many queues the backend supports. This request should be
sent only when VHOST_USER_PROTOCOL_F_MQ is set in queried protocol
features by VHOST_USER_GET_PROTOCOL_FEATURES.
* VHOST_USER_SET_VRING_ENABLE
Id: 18
Equivalent ioctl: N/A
Master payload: vring state description
Signal slave to enable or disable corresponding vring.
This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES
has been negotiated.
* VHOST_USER_SEND_RARP
Id: 19
Equivalent ioctl: N/A
Master payload: u64
Ask vhost user backend to broadcast a fake RARP to notify the migration
is terminated for guest that does not support GUEST_ANNOUNCE.
Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
VHOST_USER_GET_FEATURES and protocol feature bit VHOST_USER_PROTOCOL_F_RARP
is present in VHOST_USER_GET_PROTOCOL_FEATURES.
The first 6 bytes of the payload contain the mac address of the guest to
allow the vhost user backend to construct and broadcast the fake RARP.
* VHOST_USER_NET_SET_MTU
Id: 20
Equivalent ioctl: N/A
Master payload: u64
Set host MTU value exposed to the guest.
This request should be sent only when VIRTIO_NET_F_MTU feature has been
successfully negotiated, VHOST_USER_F_PROTOCOL_FEATURES is present in
VHOST_USER_GET_FEATURES and protocol feature bit
VHOST_USER_PROTOCOL_F_NET_MTU is present in
VHOST_USER_GET_PROTOCOL_FEATURES.
If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated, slave must respond
with zero in case the specified MTU is valid, or non-zero otherwise.
* VHOST_USER_SET_SLAVE_REQ_FD
Id: 21
Equivalent ioctl: N/A
Master payload: N/A
Set the socket file descriptor for slave initiated requests. It is passed
in the ancillary data.
This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES
has been negotiated, and protocol feature bit VHOST_USER_PROTOCOL_F_SLAVE_REQ
bit is present in VHOST_USER_GET_PROTOCOL_FEATURES.
If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated, slave must respond
with zero for success, non-zero otherwise.
* VHOST_USER_IOTLB_MSG
Id: 22
Equivalent ioctl: N/A (equivalent to VHOST_IOTLB_MSG message type)
Master payload: struct vhost_iotlb_msg
Slave payload: u64
Send IOTLB messages with struct vhost_iotlb_msg as payload.
Master sends such requests to update and invalidate entries in the device
IOTLB. The slave has to acknowledge the request with sending zero as u64
payload for success, non-zero otherwise.
This request should be send only when VIRTIO_F_IOMMU_PLATFORM feature
has been successfully negotiated.
Slave message types
-------------------
* VHOST_USER_SLAVE_IOTLB_MSG
Id: 1
Equivalent ioctl: N/A (equivalent to VHOST_IOTLB_MSG message type)
Slave payload: struct vhost_iotlb_msg
Master payload: N/A
Send IOTLB messages with struct vhost_iotlb_msg as payload.
Slave sends such requests to notify of an IOTLB miss, or an IOTLB
access failure. If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated,
and slave set the VHOST_USER_NEED_REPLY flag, master must respond with
zero when operation is successfully completed, or non-zero otherwise.
This request should be send only when VIRTIO_F_IOMMU_PLATFORM feature
has been successfully negotiated.
VHOST_USER_PROTOCOL_F_REPLY_ACK:
-------------------------------
The original vhost-user specification only demands replies for certain
commands. This differs from the vhost protocol implementation where commands
are sent over an ioctl() call and block until the client has completed.
With this protocol extension negotiated, the sender (QEMU) can set the
"need_reply" [Bit 3] flag to any command. This indicates that
the client MUST respond with a Payload VhostUserMsg indicating success or
failure. The payload should be set to zero on success or non-zero on failure,
unless the message already has an explicit reply body.
The response payload gives QEMU a deterministic indication of the result
of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In future, qemu could be taught to be more
resilient for selective requests.
For the message types that already solicit a reply from the client, the
presence of VHOST_USER_PROTOCOL_F_REPLY_ACK or need_reply bit being set brings
no behavioural change. (See the 'Communication' section for details.)