When processing a backlog scan for group interrupts, also take
into account crowd interrupts.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Michael Kowal <kowal@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
The blk/index in some paths may refer to an NVP or an NVGC. When it
is not known ahead of time, use the nvx_ prefix to prevent confusion.
[npiggin: split out of larger fix patch and reworded]
Signed-off-by: Glenn Miles <milesg@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
XIVE crowd sizes are encoded into a 2-bit field as follows:
0: 0b00
2: 0b01
4: 0b10
16: 0b11
A crowd size of 8 is not supported.
If an END is defined with the 'crowd' bit set, then a target can be
running on different blocks. It means that some bits from the block
VP are masked when looking for a match. It is similar to groups, but
on the block instead of the VP index.
Most of the changes are due to passing the extra argument 'crowd' all
the way to the function checking for matches.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Glenn Miles <milesg@linux.vnet.ibm.com>
Signed-off-by: Michael Kowal <kowal@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Add support for the NVPG and NVC BARs. Access to the BAR pages will
cause backlog counter operations to either increment or decriment
the counter.
Also added qtests for the same.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Michael Kowal <kowal@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Add XIVE2 tests for group interrupts and group interrupts that have
been backlogged.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Michael Kowal <kowal@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
When the hypervisor or OS pushes a new value to the CPPR, if the LSMFB
value is lower than the new CPPR value, there could be a pending group
interrupt in the backlog, so it needs to be scanned.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Michael Kowal <kowal@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
When pushing an OS context, we were already checking if there was a
pending interrupt in the IPB and sending a notification if needed. We
also need to check if there is a pending group interrupt stored in the
NVG table. To avoid useless backlog scans, we only scan if the NVP
belongs to a group.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Michael Kowal <kowal@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
When a group interrupt cannot be delivered, we need to:
- increment the backlog counter for the group in the NVG table
(if the END is configured to keep a backlog).
- start a broadcast operation to set the LSMFB field on matching CPUs
which can't take the interrupt now because they're running at too
high a priority.
[npiggin: squash in fixes from milesg]
[milesg: only load the NVP if the END is !ignore]
[milesg: always broadcast backlog, not only when there are precluded VPs]
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Michael Kowal <kowal@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
If an END has the 'i' bit set (ignore), then it targets a group of
VPs. The size of the group depends on the VP index of the target
(first 0 found when looking at the least significant bits of the
index) so a mask is applied on the VP index of a running thread to
know if we have a match.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Michael Kowal <kowal@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
The NSR has a (so far unused) grouping level field. When a interrupt
is presented, that field tells the hypervisor or OS if the interrupt
is for an individual VP or for a VP-group/crowd. This patch reworks
the presentation API to allow to set/unset the level when
raising/accepting an interrupt.
It also renames xive_tctx_ipb_update() to xive_tctx_pipr_update() as
the IPB is only used for VP-specific target, whereas the PIPR always
needs to be updated.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Michael Kowal <kowal@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Rename to follow the convention of the other function names.
Signed-off-by: Michael Kowal <kowal@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
If the 'H' attribute is set on the NVP structure, the hardware
automatically saves and restores some attributes from the TIMA in the
NVP structure.
The group-specific attributes LSMFB, LGS and T have an extra flag to
individually control what is saved/restored.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Michael Kowal <kowal@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
The default PNOR image is erased and not recognised by skiboot, so NVRAM
gets disabled. This change adds a tiny pnor file that is a proper FFS
image with a formatted NVRAM partition. This is recognised by skiboot and
will persist across machine reboots.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
The BMC HIOMAP PNOR access protocol has certain limits on PNOR addresses
and sizes. Add some sanity checks for these so we don't get strange
behaviour.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
skiboot has a bug that does not handle ISA FW access correctly for IDSEL
devices > 0, and the current PNOR default address and size puts 64MB in
device 0 and 64MB in device 1, which causes skiboot to hit this bug and
breaks PNOR accesses.
Move the PNOR address down to 0 for now, so a 256MB PNOR can be accessed
via device 0.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
LPC FW address space is a 256MB (28-bit) region to one of 16-devices
that are selected with the IDSEL register. Implement this by making
the ISA FW address space 4GB, and move the 256MB OPB alias within
that space according to IDSEL.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
If nothing responds to an LPC access, the LPC host controller should
set an IRQSTAT error. Model this behaviour.
skiboot uses this error to "probe" LPC accesses, among other things to
determine if a SuperIO chip is present. After this change it recognizes
there is no SuperIO present and does not keep trying to access it.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
The LPC model has only supported serirqs (ISA device IRQs), however
there are internal sources that can raise other interrupts. Update the
device to handle these interrupt sources.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Linux power management code accesses these registers for pstate
management. Wire up a very simple implementation.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
After OCC fixes in QEMU pnv model and skiboot (since they have suffered
some bitrot), Linux will start performing PM SPR accesses. This is a
very simple implementation that makes it a bit happier.
Thanks,
Nick
The OCC is an On Chip Controller that handles various thermal and power
management. It is a PPC405 microcontroller that runs its own firmware
which is out of scope of the powernv machine model. Some dynamic
behaviour and interfaces that are important for host CPU testing can be
implemented with a much simpler state machine.
This change adds a 100ms timer that ticks through a simple state machine
that looks for "OCC command requests" coming from host firmware, and
responds to them.
For now the powercap command is implemented because that is used by
OPAL and exported to Linux and is easy to test.
$ F=/sys/firmware/opal/powercap/system-powercap/powercap-current
$ cat $F
100
$ echo 50 | sudo tee $F
50
$ cat $F
50
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
OCC pstate frequencies are in kHz, so the OCC data was 3-4MHz. Upgrade
to GHz. Make each pstate have a different frequency.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
The HOMER is a region of memory used by host and firmware and
microconrollers. It has very little logic by itself, just some BAR
registers. Users of this memory should operate on it rather than
have HOMER implement them with MMIO registers, which is not the
right model.
This change switches the implementation of HOMER from MMIO to RAM,
and moves the OCC register implementation to in-memory structure
accesses performed by the OCC model.
This has the downside that access to unimplemented regions of HOMER
are no longer flagged. Perhaps that could be done by adding a memory
region for HOMER, and ram subregions under that for each implemented
part. But for now this takes the simpler approach.
Note: This brings some data structure definitions from skiboot, which
does not match QEMU coding style but is not changed to make comparisons
and updates simpler.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Use defines for the OCCMISC register bits, and add a comment about the
IRQ request bit, which QEMU may not model quite correctly.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Put HOMER memory region base and size into the class, to allow more
code-reuse between different machines in later changes.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
The commit to fix the OCC common area sensor mappings didn't update the
register offsets to match.
Before this change, skiboot reports:
[ 0.347100086,3] OCC: Chip 0 sensor data invalid
Afterward, there is no error and the sensor_groups directory appears
under /sys/firmware/opal/.
The SLW_IMAGE_BASE address looks like a workaround to intercept firmware
memory accesses, but that does not seem to be required now (and would
have been broken by the OCC common area region mapping change anyway).
So it can be removed.
Fixes: 3a1b70b66b ("ppc/pnv: Fix OCC common area region mapping")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
HOMER memory implements some dummy registers that return a nonsense
value to satisfy skiboot accesses caused by "SLW" init and register
save/restore programming that has never worked under QEMU:
[ 0.265000943,3] SLW: Failed to set HRMOR for CPU 0,RC=0x1
[ 0.265356988,3] Disabling deep stop states
To simplify a later change to implement HOMER as a RAM area, make
these return zero, which has the same result.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
The HOMER OCC registers seem to have bitrotted and fail for various
reasons on powernv8, 9, and 10.
The major problems are that POWER8 has the wrong version value and its
pstate ordering is incorrect. POWER9/10 have not set the OCC state to
active. Non-zero chips are also set to OCC slaves for POWER9/10.
Unfortunately skiboot has also bitrotted and requires fixes that are
not yet in the bios files to run. With a patched skiboot, before this
change, powernv9/10 report:
[ 0.262050394,3] OCC: Chip: 0: OCC not active
[ 0.262128603,3] OCC: Initialization on all chips did not complete(timed out)
powernv8 reports:
[ 0.173572100,3] OCC: Unknown OCC-OPAL interface version.
[ 0.173812059,3] OCC: Initialization on all chips did not complete(timed out)
After this patch, all report:
[ 0.176815668,5] OCC: All Chip Rdy after 0 ms
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Each non-core chiplet on a chip has a "pervasive chiplet" unit and its
xscom register set. This adds support for PHB4/5.
skiboot reads the CPLT_CONF1 register in __phb4/5_get_max_link_width(),
which shows up as unimplemented xscom reads. Set a value in PCI CONF1
register's link-width field to demonstrate skiboot doing something
interesting with it.
In the bigger picture, it might be better to model the pervasive
chiplet type as parent that each non-core chiplet model derives from.
For now this is enough to get the PHB registers implemented and working
for skiboot, and provides a second example (after the N1 chiplet) that
will help if the design is reworked as such.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
This adds TPM pass through API.
Also, moves SLOF from github to gitlab.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
This skiboot firmware importantly contains updates for HOMER/OCC bugs.
These subsystems have bitrotted in QEMU and skiboot and this update
allows new QEMU models to be exercised.
Power11 support is also added. This model is not yet merged in QEMU,
but firmware support will make development and testing simpler.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
The ref405ep machine is scheduled for removal in QEMU 10.0. Keep the
405 CPU implementation for a while because it is theoretically
possible to model the power management (OCC) co-processor found on the
IBM POWER [8-11] processors.
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Message-ID: <20250204080649.836155-4-clg@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
The ref405ep machine is the only PPC 405 machine. Drop all support by
removing the SoC and associated devices as-well as the machine.
Link: https://lore.kernel.org/qemu-devel/20250110141800.1587589-3-clg@redhat.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Message-ID: <20250204080649.836155-3-clg@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Since we are about to remove all support for PPC 405, start by
removing the tests referring to the ref405ep machine.
Link: https://lore.kernel.org/qemu-devel/20250110141800.1587589-2-clg@redhat.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Message-ID: <20250204080649.836155-2-clg@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
iQEzBAABCAAdFiEEIV1G9IJGaJ7HfzVi7wSWWzmNYhEFAmfO1zkACgkQ7wSWWzmN
YhET+wf+PkaGeFTNUrOtWpl35fSMKlmOVbb1fkPfuhVBmeY2Vh1EIN3OjqnzdV0F
wxpuk+wwmFiuV1n6RNuMHQ0nz1mhgsSlZh93N5rArC/PUr3iViaT0cb82RjwxhaI
RODBhhy7V9WxEhT9hR8sCP2ky2mrKgcYbjiIEw+IvFZOVQa58rMr2h/cbAb/iH4l
7T9Wba03JBqOS6qgzSFZOMxvqnYdVjhqXN8M6W9ngRJOjPEAkTB6Evwep6anRjcM
mCUOgkf2sgQwKve8pYAeTMkzXFctvTc/qCU4ZbN8XcoKVVxe2jllGQqdOpMskPEf
slOuINeW5M0K5gyjsb/huqcOTfDI2A==
=/Y0+
-----END PGP SIGNATURE-----
Merge tag 'net-pull-request' of https://github.com/jasowang/qemu into staging
# -----BEGIN PGP SIGNATURE-----
#
# iQEzBAABCAAdFiEEIV1G9IJGaJ7HfzVi7wSWWzmNYhEFAmfO1zkACgkQ7wSWWzmN
# YhET+wf+PkaGeFTNUrOtWpl35fSMKlmOVbb1fkPfuhVBmeY2Vh1EIN3OjqnzdV0F
# wxpuk+wwmFiuV1n6RNuMHQ0nz1mhgsSlZh93N5rArC/PUr3iViaT0cb82RjwxhaI
# RODBhhy7V9WxEhT9hR8sCP2ky2mrKgcYbjiIEw+IvFZOVQa58rMr2h/cbAb/iH4l
# 7T9Wba03JBqOS6qgzSFZOMxvqnYdVjhqXN8M6W9ngRJOjPEAkTB6Evwep6anRjcM
# mCUOgkf2sgQwKve8pYAeTMkzXFctvTc/qCU4ZbN8XcoKVVxe2jllGQqdOpMskPEf
# slOuINeW5M0K5gyjsb/huqcOTfDI2A==
# =/Y0+
# -----END PGP SIGNATURE-----
# gpg: Signature made Mon 10 Mar 2025 20:12:41 HKT
# gpg: using RSA key 215D46F48246689EC77F3562EF04965B398D6211
# gpg: Good signature from "Jason Wang (Jason Wang on RedHat) <jasowang@redhat.com>" [full]
# Primary key fingerprint: 215D 46F4 8246 689E C77F 3562 EF04 965B 398D 6211
* tag 'net-pull-request' of https://github.com/jasowang/qemu:
tap-linux: Open ipvtap and macvtap
Revert "hw/net/net_tx_pkt: Fix overrun in update_sctp_checksum()"
util/iov: Do not assert offset is in iov
net: move backend cleanup to NIC cleanup
net: parameterize the removing client from nc list
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
- update and expand aarch64 GPU tests
- fix build dependence for plugins
- update libvirt-ci to vulkan-tools
- allow plugin tests to run on non-POSIX systems
- tweak test/vm times
- mark test-vma as linux only
- various compiler fixes for tcg tests
- add gitlab build unit tracker
- error out early on stalled RME tests
- compile core plugin code once
- update MAINTAINERS
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEZoWumedRZ7yvyN81+9DbCVqeKkQFAmfOwjcACgkQ+9DbCVqe
KkRkcwgAlRTflCqYlZxdlo4CiOSXaHAFr8yfwWq138LJQRQH530JZnz1lZtxTbEM
pXT7ixnuJQDMybCQJmvUlK5UTUkZhGS3VuAR1VeM2J8/3VXYzf5sFjZ7yko9mA8S
2FX8vdfbko8/J31+lKccA0tpbHyi2AbMR+mO8xj6KZQoePwmHoRmhgH7p7LE35YO
8ytaOjMwTKF5fReVK+tlcrVJHFMdGsGNwtsnB2FhhVjI56fQqyM5hGXfOING2Fx3
iZH3rjzfDB4SWbBqaRsMgH9RXjuB9Eo4v0Qs5ve5SjDyzRJk+/CbbBJ4oRt9hurJ
bA+CYZuNLGBf8Z/mUeYMavA7rxT5rw==
=qobU
-----END PGP SIGNATURE-----
Merge tag 'pull-10.0-for-softfreeze-100325-3' of https://gitlab.com/stsquad/qemu into staging
functional and tcg tests, plugins and MAINTAINERS
- update and expand aarch64 GPU tests
- fix build dependence for plugins
- update libvirt-ci to vulkan-tools
- allow plugin tests to run on non-POSIX systems
- tweak test/vm times
- mark test-vma as linux only
- various compiler fixes for tcg tests
- add gitlab build unit tracker
- error out early on stalled RME tests
- compile core plugin code once
- update MAINTAINERS
# -----BEGIN PGP SIGNATURE-----
#
# iQEzBAABCgAdFiEEZoWumedRZ7yvyN81+9DbCVqeKkQFAmfOwjcACgkQ+9DbCVqe
# KkRkcwgAlRTflCqYlZxdlo4CiOSXaHAFr8yfwWq138LJQRQH530JZnz1lZtxTbEM
# pXT7ixnuJQDMybCQJmvUlK5UTUkZhGS3VuAR1VeM2J8/3VXYzf5sFjZ7yko9mA8S
# 2FX8vdfbko8/J31+lKccA0tpbHyi2AbMR+mO8xj6KZQoePwmHoRmhgH7p7LE35YO
# 8ytaOjMwTKF5fReVK+tlcrVJHFMdGsGNwtsnB2FhhVjI56fQqyM5hGXfOING2Fx3
# iZH3rjzfDB4SWbBqaRsMgH9RXjuB9Eo4v0Qs5ve5SjDyzRJk+/CbbBJ4oRt9hurJ
# bA+CYZuNLGBf8Z/mUeYMavA7rxT5rw==
# =qobU
# -----END PGP SIGNATURE-----
# gpg: Signature made Mon 10 Mar 2025 18:43:03 HKT
# gpg: using RSA key 6685AE99E75167BCAFC8DF35FBD0DB095A9E2A44
# gpg: Good signature from "Alex Bennée (Master Work Key) <alex.bennee@linaro.org>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 6685 AE99 E751 67BC AFC8 DF35 FBD0 DB09 5A9E 2A44
* tag 'pull-10.0-for-softfreeze-100325-3' of https://gitlab.com/stsquad/qemu: (31 commits)
MAINTAINERS: remove widely sanctioned entities
plugins/core: make a single build unit
plugins/api: build only once
plugins/api: split out time control helpers
plugins/api: split out the vaddr/hwaddr helpers
plugins/api: split out binary path/start/end/entry code
plugins/loader: compile loader only once
plugins/plugin.h: include queue.h
plugins/api: clean-up the includes
include/qemu: plugin-memory.h doesn't need cpu-defs.h
plugins/loader: populate target_name with target_name()
plugins/api: use qemu_target_page_mask() to get value
tests/functional: add boot error detection for RME tests
gitlab: add a new build_unit job to track build size
tests/tcg: Suppress compiler false-positive warning on sha1.c
tests/tcg: enable -fwrapv for test-i386-bmi
tests/tcg: fix constraints in test-i386-adcox
tests/tcg: add message to _Static_assert in test-avx
tests/tcg: mark test-vma as a linux-only test
tests/vm: bump timeout for shutdown
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Address an error in RDMA-based migration by ensuring RDMA is prioritized
when saving pages in `ram_save_target_page()`.
Previously, the RDMA protocol's page-saving step was placed after other
protocols due to a refactoring in commit bc38dc2f5f. This led to migration
failures characterized by unknown control messages and state loading errors
destination:
(qemu) qemu-system-x86_64: Unknown control message QEMU FILE
qemu-system-x86_64: error while loading state section id 1(ram)
qemu-system-x86_64: load of migration failed: Operation not permitted
source:
(qemu) qemu-system-x86_64: RDMA is in an error state waiting migration to abort!
qemu-system-x86_64: failed to save SaveStateEntry with id(name): 1(ram): -1
qemu-system-x86_64: rdma migration: recv polling control error!
qemu-system-x86_64: warning: Early error. Sending error.
qemu-system-x86_64: warning: rdma migration: send polling control error
RDMA migration implemented its own protocol/method to send pages to
destination side, hand over to RDMA first to prevent pages being saved by
other protocol.
Fixes: bc38dc2f5f ("migration: refactor ram_save_target_page functions")
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-ID: <20250305062825.772629-2-lizhijian@fujitsu.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Unlike cpr-reboot mode, cpr-transfer mode cannot save volatile ram blocks
in the migration stream file and recreate them later, because the physical
memory for the blocks is pinned and registered for vfio. Add a blocker
for volatile ram blocks.
Also add a blocker for RAM_GUEST_MEMFD. Preserving guest_memfd may be
sufficient for CPR, but it has not been tested yet.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <1740667681-257312-1-git-send-email-steven.sistare@oracle.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
On the incoming migration side, QEMU uses a coroutine to load all the VM
states. Inside, it may reference MigrationState on global states like
migration capabilities, parameters, error state, shared mutexes and more.
However there's nothing yet to make sure MigrationState won't get
destroyed (e.g. after migration_shutdown()). Meanwhile there's also no API
available to remove the incoming coroutine in migration_shutdown(),
avoiding it to access the freed elements.
There's a bug report showing this can happen and crash dest QEMU when
migration is cancelled on source.
When it happens, the dest main thread is trying to cleanup everything:
#0 qemu_aio_coroutine_enter
#1 aio_dispatch_handler
#2 aio_poll
#3 monitor_cleanup
#4 qemu_cleanup
#5 qemu_default_main
Then it found the migration incoming coroutine, schedule it (even after
migration_shutdown()), causing crash:
#0 __pthread_kill_implementation
#1 __pthread_kill_internal
#2 __GI_raise
#3 __GI_abort
#4 __assert_fail_base
#5 __assert_fail
#6 qemu_mutex_lock_impl
#7 qemu_lockable_mutex_lock
#8 qemu_lockable_lock
#9 qemu_lockable_auto_lock
#10 migrate_set_error
#11 process_incoming_migration_co
#12 coroutine_trampoline
To fix it, take a refcount after an incoming setup is properly done when
qmp_migrate_incoming() succeeded the 1st time. As it's during a QMP
handler which needs BQL, it means the main loop is still alive (without
going into cleanups, which also needs BQL).
Releasing the refcount now only until the incoming migration coroutine
finished or failed. Hence the refcount is valid for both (1) setup phase
of incoming ports, mostly IO watches (e.g. qio_channel_add_watch_full()),
and (2) the incoming coroutine itself (process_incoming_migration_co()).
Note that we can't unref in migration_incoming_state_destroy(), because
both qmp_xen_load_devices_state() and load_snapshot() will use it without
an incoming migration. Those hold BQL so they're not prone to this issue.
PS: I suspect nobody uses Xen's command at all, as it didn't register yank,
hence AFAIU the command should crash on master when trying to unregister
yank in migration_incoming_state_destroy().. but that's another story.
Also note that in some incoming failure cases we may not always unref the
MigrationState refcount, which is a trade-off to keep things simple. We
could make it accurate, but it can be an overkill. Some examples:
- Unlike most of the rest protocols, socket_start_incoming_migration()
may create net listener after incoming port setup sucessfully.
It means we can't unref in migration_channel_process_incoming() as a
generic path because socket protocol might keep using MigrationState.
- For either socket or file, multiple IO watches might be created, it
means logically each IO watch needs to take one refcount for
MigrationState so as to be 100% accurate on ownership of refcount taken.
In general, we at least need per-protocol handling to make it accurate,
which can be an overkill if we know incoming failed after all. Add a short
comment to explain that when taking the refcount in qmp_migrate_incoming().
Bugzilla: https://issues.redhat.com/browse/RHEL-69775
Tested-by: Yan Fu <yafu@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250220132459.512610-1-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
On IOREQ_TYPE_INVALIDATE we need to invalidate the mapcache for regular
mappings. Since recently we started reusing the mapcache also to keep
track of grants mappings. However, there is no need to remove grant
mappings on IOREQ_TYPE_INVALIDATE requests, we shouldn't do that. So
remove the function call.
Fixes: 9ecdd4bf08 (xen: mapcache: Add support for grant mappings)
Cc: qemu-stable@nongnu.org
Reported-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@amd.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@amd.com>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
Message-Id: <20250206194915.3357743-2-edgar.iglesias@gmail.com>
Signed-off-by: Anthony PERARD <anthony.perard@vates.tech>
Block devices don't work in PV Grub (0.9x) if there is no mode specified. It
complains: "Error ENOENT when reading the mode"
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20250207143724.30792-2-dwmw2@infradead.org>
Signed-off-by: Anthony PERARD <anthony.perard@vates.tech>
In PVH dom0, when passthrough a device to domU, QEMU code
xen_pt_realize->xc_physdev_map_pirq wants to use gsi, but in current codes
the gsi number is got from file /sys/bus/pci/devices/<sbdf>/irq, that is
wrong, because irq is not equal with gsi, they are in different spaces, so
pirq mapping fails.
To solve above problem, use new interface of Xen, xc_pcidev_get_gsi to get
gsi and use xc_physdev_map_pirq_gsi to map pirq when dom0 is PVH.
Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
Acked-by: Anthony PERARD <anthony@xenproject.org>
Reviewed-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Message-Id: <20241106061418.3655304-1-Jiqian.Chen@amd.com>
Signed-off-by: Anthony PERARD <anthony.perard@vates.tech>
The following organisations appear on the US sanctions list:
Yadro: https://sanctionssearch.ofac.treas.gov/Details.aspx?id=41125
ISPRAS: https://sanctionssearch.ofac.treas.gov/Details.aspx?id=50890
As a result maintainers interacting with such entities would face
legal risk in a number of jurisdictions. To reduce the risk of
inadvertent non-compliance remove entries from these organisations
from the MAINTAINERS file.
Mark the pcf8574 system as orphaned until someone volunteers to step
up as a maintainer. Add myself as a second reviewer to record/replay
so I can help with what odd fixes I can.
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20250304222439.2035603-32-alex.bennee@linaro.org>
Trim through the includes and remove everything not needed for the
core. Only include tcg-op-common.h to remove the need to
TARGET_LONG_BITS and move the build unit into the common set.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20250304222439.2035603-31-alex.bennee@linaro.org>
Now all the softmmu/user-mode stuff has been split out we can build
this compilation unit only once.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20250304222439.2035603-30-alex.bennee@linaro.org>
These are only usable in system mode where we control the timer. For
user-mode make them NOPs.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20250304222439.2035603-29-alex.bennee@linaro.org>
These only work for system-mode and are NOPs for user-mode.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20250304222439.2035603-28-alex.bennee@linaro.org>
To move the main api.c to a single build compilation object we need to
start splitting out user and system specific code. As we need to grob
around host headers we move these particular helpers into the *-user
mode directories.
The binary/start/end/entry helpers are all NOPs for system mode.
While using the plugin-api.c.inc trick means we build for both
linux-user and bsd-user the BSD user-mode command line is still
missing -plugin. This can be enabled once we have reliable check-tcg
tests working for the BSDs.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Warner Losh <imp@bsdimp.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20250304222439.2035603-27-alex.bennee@linaro.org>