
vsock: Enable live migrations (snapshot-restore) #936

Open
jemoreira wants to merge 2 commits into rust-vmm:main from jemoreira:suspend_resume

Conversation

@jemoreira
Contributor

@jemoreira jemoreira commented Feb 11, 2026

Summary of the PR

The vsock device only accesses virtqueues when they are ready.

The most common case where this matters is after receiving a VHOST_USER_GET_VRING_BASE message from the client (because of a suspend operation, for example): if one of the host-side applications or sibling VMs sends us a message, it must be kept queued until the virtqueues are set to ready again with a kick.

Enable live migrations (snapshot-restore) of the vsock device.

The virtio spec mandates that when restoring from a snapshot (such as when the VM was migrated to a different host) the device must send a TRANSPORT RESET event via the event virtqueue to the driver. The driver in turn, upon receiving the event, must drop all existing connections while keeping all listeners, read the CID again from the device configuration space, and update the listeners with the (potentially different) CID.

On the device side, care must be taken to ensure no new connections are established until that transport reset event is handled; otherwise the guest will silently drop them and the peer may not know about it. Given that the driver must read the CID from the configuration space both during normal boot and after a reset, that is the best signal for the device to "activate" and start processing packets from the host and sibling VMs. Feature negotiation also happens during both normal boot and restore, but in the case of restore it happens before the transport reset and therefore can't be used as a reliable signal to "activate" the device.

The device doesn't need to save/load any state for this. When loading a snapshot the device simply notes that it must send a TRANSPORT RESET event to the driver as soon as possible, which it then does when it receives a kick on the event vring. Independently of whether the device is started to restore a previous VM state or brand new (the device actually doesn't know until `set_device_state_fd` is called), the device always starts in an "inactive" state, meaning it will drop any packets coming from any source. As mentioned before, the device "activates" once the driver has read the configuration space.

Unlike other VMMs, QEMU takes ownership of the event vring and handles sending the TRANSPORT RESET event itself. This change attempts to handle both approaches by sending the TRANSPORT RESET event if the vring is kicked, but doesn't treat that as a precondition to activate the device, instead choosing to activate unconditionally once the driver reads the configuration space.

There is a race in this implementation that occurs when the snapshot was taken before the guest driver read the configuration and the driver then reads it immediately upon restore, before the transport reset event was handled. While this could be mitigated by waiting for the reset event to be sent AND for the config to be read AFTER the event was sent, QEMU's decision to take over the event queue makes that approach unviable. Hopefully it will be very unlikely that a snapshot is taken before the driver is fully initialized, as that would be of very little value in practice.
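The "inactive until the config is read" behavior described above can be sketched roughly as follows. These names are illustrative only, not the PR's actual types:

```rust
// Hedged sketch of the activation gating described in the PR summary;
// `ActivationGate` and its methods are invented names.
use std::sync::atomic::{AtomicBool, Ordering};

pub struct ActivationGate {
    // The device starts "inactive": packets from the host or sibling VMs
    // are dropped until the driver (re)reads the configuration space.
    active: AtomicBool,
}

impl ActivationGate {
    pub fn new() -> Self {
        ActivationGate { active: AtomicBool::new(false) }
    }

    // Called when the driver reads the configuration space, which happens
    // both during normal boot and after handling a transport reset.
    pub fn on_config_read(&self) {
        self.active.store(true, Ordering::Release);
    }

    // Packets are only forwarded to the guest once the gate is open.
    pub fn accepts_packets(&self) -> bool {
        self.active.load(Ordering::Acquire)
    }
}
```

This mirrors the described race: if a config read arrives before the transport reset event was delivered, the gate opens anyway.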

Requirements

Before submitting your PR, please make sure you addressed the following
requirements:

  • All commits in this PR have Signed-Off-By trailers (with
    git commit -s), and the commit message has max 60 characters for the
    summary and max 75 characters for each description line.
  • All added/changed functionality has a corresponding unit/integration
    test.
  • All added/changed public-facing functionality has entries in the "Upcoming
    Release" section of CHANGELOG.md (if no such section exists, please create one).
  • Any newly added unsafe code is properly documented.

@jemoreira jemoreira changed the title from "Suspend resume" to "vsock: Enable live migrations (snapshot-restore)" Feb 11, 2026
@jemoreira jemoreira marked this pull request as draft February 12, 2026 00:07
@jemoreira
Contributor Author

Converted to draft to address the test failures

@jemoreira jemoreira force-pushed the suspend_resume branch 2 times, most recently from d8fce24 to 2bcfd05 on February 13, 2026 00:17
@jemoreira jemoreira marked this pull request as ready for review February 13, 2026 00:17
Collaborator

@dorindabassey dorindabassey left a comment


Thank you for this PR, just left a few comments.

@jemoreira
Contributor Author

Hello, just a friendly reminder that this PR is ready for review.


// Queue mask to select vrings.
const QUEUE_MASK: u64 = 0b11;
const QUEUE_MASK: u64 = 0b111;
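For context, one plausible reading of this diff, under the assumption (not confirmed by the PR) that the mask selects which device_event indices correspond to vrings:

```rust
// Assumed semantics only: with the RX, TX and event queues at indices
// 0, 1 and 2, widening the mask from 0b11 to 0b111 lets the event
// queue's kick be dispatched like the other two.
const RX_QUEUE_EVENT: u16 = 0;
const TX_QUEUE_EVENT: u16 = 1;
const EVT_QUEUE_EVENT: u16 = 2;
const QUEUE_MASK: u64 = 0b111;

// Hypothetical helper: does this event index name one of the vrings
// the backend processes itself?
fn is_vring_event(device_event: u16) -> bool {
    QUEUE_MASK & (1u64 << device_event) != 0
}
```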
Member


I'm wondering if this should be controlled by a param in the CLI. IIRC in QEMU the event queue is handled by the VMM.

Did you test this with QEMU?

Contributor Author


I didn't test with QEMU, only with CrosVm. I was very surprised to learn that QEMU takes ownership of the third queue. Luckily no cmdline parameter is needed, instead the device always activates after the config is read. This unfortunately deprives us of the ability to avoid a particular race condition (see the details in the commit message), but it should be very unlikely to trigger anyways.

I have now tested that the device works well with QEMU, however when I tried taking a snapshot and restoring it failed with an error saying that the device doesn't support it, even though it advertises the DEVICE_STATE protocol feature. I suspect the issue is in the qemu frontend, not in the backend.

Member


I didn't test with QEMU, only with CrosVm. I was very surprised to learn that QEMU takes ownership of the third queue.

IIRC it was already done that way for vhost-vsock, so when we introduced vhost-user-vsock we shared the same code. The VMM knows perfectly well when a snapshot/migration is starting, so why is it strange that QEMU does it?

Luckily no cmdline parameter is needed, instead the device always activates after the config is read. This unfortunately deprives us of the ability to avoid a particular race condition (see the details in the commit message), but it should be very unlikely to trigger anyways.

mmm, so why not have a cmdline parameter to enable this and avoid the race entirely?

I have now tested that the device works well with QEMU, however when I tried taking a snapshot and restoring it failed with an error saying that the device doesn't support it, even though it advertises the DEVICE_STATE protocol feature. I suspect the issue is in the qemu frontend, not in the backend.

yeah, the frontend needs to enable that in some way. Annoying...

Contributor Author


why is it strange that QEMU does it?

Because the queue is called "event", not "transport_reset". If other events are added to the virtio-vsock spec then having QEMU handle that queue will be a problem. It also seems to me that it doesn't match the intent of vhost-user, where reading and writing the queues is the responsibility of the backend, not the frontend; but that's just my (probably very uninformed) opinion.

mmm, so why not have a cmdline parameter to enable this and avoid the race entirely?

Because the probability of hitting that race is extremely low, possibly 0. I don't know for sure that it's valid to attempt to take a snapshot of an uninitialized driver.

On the other hand, having an extra flag that changes the behavior based on what the VMM does is a significant burden for the users. We know what QEMU and CrosVm do, but what about other VMMs? I doubt this behavior is documented anywhere in an accessible manner, so it's possible that the only way to know for sure is to look at the VMM's source code. The user could also just try it with and without the flag and see what happens, but for VMMs that share the event queue with the backend both combinations will work most of the time.

So, between a very unlikely race and some usability issues, I chose the former, but I'm fine either way. Just let me know your preference.

Member


why is it strange that QEMU does it?

Because the queue is called "event", not "transport_reset". If other events are added to the virtio-vsock spec then having QEMU handle that queue will be a problem. It also seems to me that it doesn't match the intent of vhost-user, where reading and writing the queues is the responsibility of the backend, not the frontend; but that's just my (probably very uninformed) opinion.

As I mentioned, this comes from vhost (in-kernel) device implementation.

mmm, so why not have a cmdline parameter to enable this and avoid the race entirely?

Because the probability of hitting that race is extremely low, possibly 0. I don't know for sure that it's valid to attempt to take a snapshot of an uninitialized driver.

On the other hand, having an extra flag that changes the behavior based on what the VMM does is a significant burden for the users. We know what QEMU and CrosVm do, but what about other VMMs? I doubt this behavior is documented anywhere in an accessible manner, so it's possible that the only way to know for sure is to look at the VMM's source code. The user could also just try it with and without the flag and see what happens, but for VMMs that share the event queue with the backend both combinations will work most of the time.

So, between a very unlikely race and some usability issues, I chose the former, but I'm fine either way. Just let me know your preference.

I think we need to find another way, for example, can we discover if the event queue is offloaded to the device from the frontend or not (e.g. check if the device called set_vring_* etc.) and do this only if it's set up?

Comment on lines +434 to +436
// The last byte of the config is read when the driver is initializing or after it has
// processed a transport reset event. Either way, no transport reset will be pending
// after this.
Member


mmm, so if a driver decides not to read the configuration, the device will not be activated? Does this conform to the spec?

Also, why do we need to wait for this, and not, for example, for feature negotiation?

Contributor Author

@jemoreira jemoreira Mar 3, 2026


The spec says the driver needs to read the CID from the config space as part of the device initialization process. When restoring from a snapshot, the spec says the driver MUST read the config space again because its CID may have changed, and must drop any existing connections while keeping any listeners, now associated with the new CID. The device informs the driver that a restore/migration occurred by sending the transport reset event.

Because the driver may eventually silently drop any existing connections, the device should ensure no new connections are established until after the drop happens or it knows for sure it won't happen. The only action common to a regular boot and a restore is the driver reading the config. The features are also read, but in the case of a restore this is done from the frontend side, not the driver, and happens before the transport reset, so it can't be used as the signal to activate.

I've updated the commit message with this information too.
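A hypothetical sketch of the "last byte of the config is read" check this thread refers to. The 8-byte size comes from the virtio-vsock spec's struct virtio_vsock_config (a single le64 guest_cid); the function name is invented:

```rust
// The vsock config space is assumed here to be only the 8-byte guest CID,
// per struct virtio_vsock_config in the virtio-vsock spec.
const CONFIG_SPACE_SIZE: u32 = 8; // le64 guest_cid

// A config read that covers the final byte signals that the driver has
// (re)read its CID, which is the activation signal discussed above.
fn read_covers_last_byte(offset: u32, len: u32) -> bool {
    offset + len >= CONFIG_SPACE_SIZE
}
```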

Member


Thanks for the info, this helps the review a lot.

let cid_map = self.cid_map.read().unwrap();
if cid_map.contains_key(&dst_cid) {
let (sibling_raw_pkts_queue, sibling_groups_set, sibling_event_fd) =
let (sibling_raw_pkts_queue_opt, sibling_groups_set, sibling_event_fd) =
Member


Why do we need this change? Can it be split into another commit?

I hope these things are mentioned in the commit description.

Copy link
Copy Markdown
Contributor Author

@jemoreira jemoreira Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to prevent packets from sibling VMs from being delivered to the guest driver before a potential transport reset event is processed (just like it's done for packets from the host). If any of these packets were allowed through, a connection could be established that the transport reset would destroy later, but the sibling VM wouldn't know about it. I don't know if just dropping the packet is the best course of action, but that's what the code already does for packets addressed to unknown CIDs and I just wanted to treat this case as if the CID is not reachable yet.

In fact, my first option was to delay adding the whole tuple to cid_map until activation, but the presence of the CID in the map is also used to check something in or near the main function, so I opted for just this instead.

I can split it into its own commit, but it doesn't make a lot of sense on its own. Let me know if you still prefer it separately.
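The gating described here can be sketched with invented types standing in for the real ones: a sibling whose raw-packets queue option is still `None` (not yet activated) is treated like an unreachable CID and the packet is dropped.

```rust
// Illustrative sketch only; the real code uses the PR's
// `sibling_raw_pkts_queue_opt` inside a shared cid_map.
use std::collections::{HashMap, VecDeque};

type RawPkt = Vec<u8>;

// Maps a sibling's CID to its raw-packets queue; `None` means the sibling
// backend is registered but hasn't been activated yet.
fn forward_to_sibling(
    cid_map: &mut HashMap<u64, Option<VecDeque<RawPkt>>>,
    dst_cid: u64,
    pkt: RawPkt,
) -> bool {
    match cid_map.get_mut(&dst_cid) {
        // Active sibling: queue the packet for delivery.
        Some(Some(queue)) => {
            queue.push_back(pkt);
            true
        }
        // Inactive sibling or unknown CID: drop the packet.
        Some(None) | None => false,
    }
}
```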


/// Sends a TRANSPORT_RESET event to the guest driver. Returns true if it was able to send it,
/// false if there were no buffers available in the vring.
pub fn reset_transport(&mut self, vring: &VringRwLock, event_idx: bool) -> Result<bool> {
Member


So, after a reset, what will be the state of this device: can it be re-used or does it need to be stopped?

I'm asking because I don't see any reset of connection etc.

So, yes please describe this a bit better in the commit description and please add a section in the documentation to explain how this live migration is supposed to work.

Contributor Author


This isn't a "device reset"; the spec calls it a "transport reset", but effectively it just tells the driver that all connections it knew about are gone. Those connections only existed on the driver side, as the device has just been started from scratch when this event is sent.

I've added more details to the commit message. By documentation, do you mean the README.md file? I'm not sure what to add there beyond the fact that the device supports VM migrations as this is usually handled by the VMM itself (not the user directly) and the processes/protocols are described in the virtio and vhost-user specs.
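For reference, the event payload itself is tiny: per the virtio spec, struct virtio_vsock_event is a single le32 id, and VIRTIO_VSOCK_EVENT_TRANSPORT_RESET is id 0. A minimal sketch (not the PR's actual helper):

```rust
// struct virtio_vsock_event { le32 id; } with TRANSPORT_RESET = 0,
// per the virtio spec's vsock device section.
const VIRTIO_VSOCK_EVENT_TRANSPORT_RESET: u32 = 0;

// Bytes the device copies into the (at least 4-byte) buffer the driver
// queued on the event virtqueue before marking it used.
fn transport_reset_event_bytes() -> [u8; 4] {
    VIRTIO_VSOCK_EVENT_TRANSPORT_RESET.to_le_bytes()
}
```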

Member


This isn't a "device reset"; the spec calls it a "transport reset", but effectively it just tells the driver that all connections it knew about are gone. Those connections only existed on the driver side, as the device has just been started from scratch when this event is sent.

I'm a bit lost here, especially where this is called.

So we have:

            EVT_QUEUE_EVENT => {
                let reset_pending = &mut *self.transport_reset_pending.lock().unwrap();
                if *reset_pending {
                    thread.reset_transport(vring_evt, evt_idx)?;
                    *reset_pending = false;
                }
            }

IIUC this means that it's called when the driver fills the event queue. Why?
I mean, why will the guest fill up that queue again after a snapshot?

What happens if the guest fills the event queue before we set self.transport_reset_pending?

In QEMU we use it in a different way: when the migration is completed (post-load), we use one of the buffers that the driver already queued in the event_queue.

It's not clear to me at all why, after a snapshot/migration, a driver needs to fill the event queue again with free descriptors.

I've added more details to the commit message. By documentation, do you mean the README.md file? I'm not sure what to add there beyond the fact that the device supports VM migrations as this is usually handled by the VMM itself (not the user directly) and the processes/protocols are described in the virtio and vhost-user specs.

Yep, a new section in the README.md would be nice, especially to explain how it is supposed to interact with the driver, because I'm still quite confused about it, so I'm pretty sure in the future we'll forget about this.

Contributor Author


Added the section in the README file.

The most common case where this matters is after receiving a
VHOST_USER_GET_VRING_BASE message from the client, if one of the host
side applications or sibling VMs sends us a message it must be kept
queued until the virtqueues are set to ready again with a kick.

Signed-off-by: Jorge E. Moreira <jemoreira@google.com>
Member

@stefano-garzarella stefano-garzarella left a comment


I'm still really confused, especially on why we do this on EVT_QUEUE_EVENT. So a section in the readme that explains the snapshot/migration support better would be nice to have.

In addition, I think a parameter to enable this only with some VMMs would be better IMO because I still think what QEMU is doing makes more sense.

Note: I'll be off and with limited internet access from March 7 to March 29, if others want to merge, I'm not against, but I'd like to have a section in the readme with a clear design on how this is going to work.



Member


In the commit there is something to be fixed:
the device must send a TRANSPORT RESET event via the event virtqueue to the device. -> to the driver
The device in turn, upon receiving the event -> The driver in turn

Member


The device doesn't need to save/load any state for this. When loading a
snapshot the device simply notes that it must send a TRANSPORT RESET
event to the driver as soon as possible, which it then does when it
receives a kick on the event vring.

So to recap, this is unclear to me: why will the driver send a kick to the event vring?
Is that in the spec or in the implementation?

Contributor Author


In the commit there is something to be fixed

Oops! too many d-words :). Fixed it.

Why will the driver send a kick to the event vring?

I can't reply to the other comment on this same topic for some reason, so I'll reply here too:

I mean, why will the guest fill up that queue again after a snapshot?

What happens if the guest fills the event queue before we set self.transport_reset_pending?

The driver probably doesn't (necessarily) push more buffers or send a kick after a restore; the frontend does (or should; at least CrosVm does).

I'll admit I haven't looked at QEMU's source in depth, but I'd imagine that if the rx queue has buffers in it, a kick is sent to the device after restore to activate that queue; otherwise the device won't be able to send data to the driver until the driver pushes more buffers, which it has no reason to do since the ones it already pushed are just sitting there.

The backend could avoid waiting for the kick if it stored the queue state in its saved state, but there is still the question of when to send the buffer: Immediately after loading its state and replying to the "check" call? Wouldn't that be too soon, for example if it doesn't yet have the call fd?

Contributor Author


I went looking at the QEMU source code after all, or to be precise I asked Gemini to look for me. It seems QEMU does not in fact send that kick like crosvm does and instead depends on the driver to kick the queues when responding to the transport event OR on the device implementation to store the state of the queue. The virtio spec doesn't mention that the driver should kick the queues on transport reset though. In any case this approach seems to work well with the vhost backend, but the vhost-user backend is marked by qemu as unmigratable.

Other vhost-user devices, like vhost-user-blk appear to have a "kick right away" logic similar to what crosvm does for vhost-user-vsock. So, maybe it would be a good idea to add it in vhost-user-vsock too.

Member


In the commit there is something to be fixed

Oops! too many d-words :). Fixed it.

Why will the driver send a kick to the event vring?

I can't reply to the other comment on this same topic for some reason, so I'll reply here too:

I mean, why will the guest fill up that queue again after a snapshot?

What happens if the guest fills the event queue before we set self.transport_reset_pending?

The driver probably doesn't (necessarily) push more buffers or send a kick after a restore; the frontend does (or should; at least CrosVm does).

Sorry, why should a frontend inject a kick? This seems more like a hack to me.

A kick should mean: "there is something new in the avail ring, please process it". What new stuff is in the vring after that kick?

I'll admit I haven't looked at QEMU's source in depth, but I'd imagine that if the rx queue has buffers in it, a kick is sent to the device after restore to activate that queue; otherwise the device won't be able to send data to the driver until the driver pushes more buffers, which it has no reason to do since the ones it already pushed are just sitting there.

The device IMO should know that it is starting with a driver already initialized and should not wait for a kick, but start to process the queue just after starting.

That said, for a TX queue it may make sense, but we are not using it in that sense in this case, since no new buffers will be there. Also we are not processing anything from the guest, so I'm still really confused why a kick is needed if the device doesn't need to process anything from the guest; this seems more like a frontend -> backend notification. Again, this seems purely a hack, and should not be enabled by default IMHO.

The backend could avoid waiting for the kick if it stored the queue state in its saved state, but there is still the question of when to send the buffer: Immediately after loading its state and replying to the "check" call? Wouldn't that be too soon, for example if it doesn't yet have the call fd?

What about VHOST_USER_SET_VRING_ENABLE event? (maybe we need to extend the

The [virtio
spec](https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4950001)
mandates that when restoring from a snapshot (such as when the VM was
migrated to a different host) the device must send a TRANSPORT RESET
event via the event virtqueue to the driver. The driver in turn, upon
receiving the event, must drop all existing connections while keeping
all listeners and read the CID again from the device configuration
space and update the listeners with the (potentially different) CID.

On the device side, care must be taken to ensure no new connections are
established until that transport reset event is handled, otherwise the
guest will silently drop them and the peer may not know about it. Given
that the driver must read the CID from the configuration space both
during normal boot and after a reset, that's the best signal for the
device to "activate" and start processing packets from the host and
sibling VMs. Feature negotiation also happens during both normal boot
and restore, but in the case of restore it happens before the transport
reset and therefore can't be used as a reliable signal to "activate"
the device.

The device doesn't need to save/load any state for this. When loading a
snapshot the device simply notes that it must send a TRANSPORT RESET
event to the driver as soon as possible, which it then does when it
receives a kick on the event vring. Independently of whether the device
is started to restore a previous VM state or brand new (the device
actually doesn't know until `set_device_state_fd` is called), the
device always starts in an "inactive" state, meaning it will drop any
packets coming from any source. As mentioned before, the device
"activates" once the driver has read the configuration space.

Unlike other VMMs, QEMU takes ownership of the event vring and handles
sending the TRANSPORT RESET event itself. This change attempts to
handle both approaches by sending the TRANSPORT RESET event if the
vring is kicked, but doesn't treat that as a precondition to activate
the device, instead choosing to activate unconditionally once the
driver reads the configuration space.

There is a race in this implementation that occurs when the snapshot
was taken before the guest driver read the configuration and the driver
then reads it immediately upon restore, before the transport reset
event was handled. While this could be mitigated by waiting for the
reset event to be sent AND for the config to be read AFTER the event
was sent, QEMU's decision to take over the event queue makes that
approach unviable. Hopefully it will be very unlikely that a snapshot
is taken before the driver is fully initialized, as that would be of
very little value in practice.

Signed-off-by: Jorge E. Moreira <jemoreira@google.com>
@jemoreira
Contributor Author

Note: I'll be off and with limited internet access from March 7 to March 29, if others want to merge, I'm not against, but I'd like to have a section in the readme with a clear design on how this is going to work.

I am also going to be on vacation during that same period, we can resume after the 29th if you're still not convinced about this. I tried to address every comment you left with the latest push, except the command line flag because I think everyone is just going to run it in "qemu mode", see that it works with other VMMs too and never use it.


## Live migration

This device implementation advertises support for live migrations by offering the VHOST_USER_PROTOCOL_F_DEVICE_STATE protocol feature; however, this doesn't work with QEMU yet, as QEMU marks its vsock frontend as "unmigratable". This feature does work with CrosVm and potentially other virtual machine managers.
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, for example, if this feature is not negotiated, we can skip all of this, no?


The state saving flow is trivial as the device doesn't save any state as mentioned.

The state loading flow is a bit more complicated because the virtio-vsock spec mandates that the device must send a VIRTIO_VSOCK_EVENT_TRANSPORT_RESET event to the driver. During a restore the backend is started no differently than during a regular boot. When the frontend sends the VHOST_USER_SET_DEVICE_STATE_FD command with LOAD direction the backend doesn't load anything, but it takes note that a transport reset event needs to be sent to the driver via the event vring when possible. In order to make sure this event is sent when the queue is ready, the backend waits for the event queue to be kicked before sending the event. While these kicks usually come from the driver, this particular one is actually sent by the vhost-user frontend. This implementation depends on the frontend to kick all queues with pending buffers after a restore because the driver is unlikely to do so as it probably did it before the snapshot was taken.
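
For reference, the payload the backend eventually writes into the guest buffer is minimal: the virtio spec defines the event virtqueue entry as a single little-endian 32-bit event id, with VIRTIO_VSOCK_EVENT_TRANSPORT_RESET being 0. A sketch (the helper name is illustrative):

```rust
// virtio-vsock event virtqueue payload: struct virtio_vsock_event { le32 id; }
const VIRTIO_VSOCK_EVENT_TRANSPORT_RESET: u32 = 0;

// Serialize the event as it would be written into the guest-supplied
// descriptor on the event virtqueue (illustrative helper).
fn transport_reset_event_bytes() -> [u8; 4] {
    VIRTIO_VSOCK_EVENT_TRANSPORT_RESET.to_le_bytes()
}
```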
Member

In order to make sure this event is sent when the queue is ready, the backend waits for the event queue to be kicked before sending the event. While these kicks usually come from the driver, this particular one is actually sent by the vhost-user frontend.

This is the hack IMO. The kick means: "hey, I'm the driver and there is a new available buffer for you in the virtqueue", while here we are using it for something completely different, and something not even in the spec. If this is what we want from the frontend (and I don't understand why we want this), we should put it into the spec and not guess here that we will receive that event.

return Vec::new();
}

if offset + size == buf.len() {
Member

Instead of doing this, can we add a set_enabled callback to the VhostUserBackend trait and do this in that callback?

Contributor Author

When would that set_enabled callback be called? This has to happen after the config is read because that's the only signal (according to the spec) the driver gives the device that the transport reset happened.

Member

Also another question, why a driver should read the config space after a live migration?

For the driver should be transparent, no?

Member

Ah, okay, here we are acking that the driver reset it. I need to reread it again, all of those assumptions are not clear to me and also not in the VIRTIO spec at all.

Should we extend it?

Contributor Author

The driver reading the config after a transport reset is in the spec. It's the second to last sentence of the last paragraph of the Device Events section. It's very easy to miss, but it is there.

Comment on lines +372 to +376
let reset_pending = &mut *self.transport_reset_pending.lock().unwrap();
if *reset_pending {
    thread.reset_transport(vring_evt, evt_idx)?;
    *reset_pending = false;
}
Member

Just to be clear, this should not be here IMO, but in some other place, because here we should handle events coming from the driver (or backends like the unix socket, etc.).
This should be done in response to some vhost-user message, like set_enabled, etc., not to a random kick injected by the frontend.

Contributor Author

This is the only place where the backend implementation is given access to the vrings.

The spec clearly says that "the back-end must start a ring upon receiving a kick", so without this kick the vring will not be ready and any attempt to write to it will simply fail with the NoReady error.

Member

I see, but maybe we need to find something else. Or put this behavior into the spec because we are really implementing something custom for crosvm. (I'll try to find something better when I'm back)

Also, why not do these checks in any case? I mean, for every event?

What is confusing me is that we are using a kick to do something else. If we need a notification mechanism between frontend and backend, we should add a new message, but we should not reuse the kick which should be only used by the driver to notify the device.

@stefano-garzarella
Member

Note: I'll be off and with limited internet access from March 7 to March 29, if others want to merge, I'm not against, but I'd like to have a section in the readme with a clear design on how this is going to work.

I am also going to be on vacation during that same period, we can resume after the 29th if you're still not convinced about this. I tried to address every comment you left with the latest push, except the command line flag because I think everyone is just going to run it in "qemu mode", see that it works with other VMMs too and never use it.

Agree on that, I pointed out what is unclear to me. I think we can find a solution, but I don't like the kick injection TBH and I think we should avoid that. Let's continue when we are back ;-)

Enjoy your time off!

@jemoreira
Contributor Author

Hi, hope you had a good time off. Since it's been a while I'll try to summarize and list your concerns, please let me know if I miss anything:

Why not check that a reset is pending when handling every event?

Good question.

The device won't be "active", while a reset is pending, so we won't receive any backend or sibling VM events.

As for the RX and TX queues, we may receive those events, but it's not guaranteed that we will, and we can't depend on them to send the reset event any more than we can depend on receiving one for the event queue. If we do receive those events while the reset is pending, they will be handled correctly because the device is "inactive": there is nothing to send to the driver (other than the reset) and any messages received will simply be dropped.
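
A minimal sketch of that routing rule, with hypothetical names (`Router`, `handle_host_packet`): while inactive, packets from the host or sibling VMs are dropped; while the vring hasn't been started by a kick, they are queued (as described in the commit message); otherwise they are delivered:

```rust
use std::collections::VecDeque;

// Illustrative routing for packets arriving from host apps or sibling VMs.
struct Router {
    active: bool,           // driver has read the CID from config space
    rx_vring_started: bool, // RX vring started by a kick
    pending: VecDeque<Vec<u8>>,
}

impl Router {
    // Returns the packet if it can be written to the RX vring right away.
    fn handle_host_packet(&mut self, pkt: Vec<u8>) -> Option<Vec<u8>> {
        if !self.active {
            // Transport reset not yet acknowledged by the driver: drop.
            return None;
        }
        if !self.rx_vring_started {
            // Keep queued until the vring is started again by a kick.
            self.pending.push_back(pkt);
            return None;
        }
        Some(pkt)
    }
}
```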

What about reacting to the VHOST_USER_SET_VRING_ENABLE message instead of a kick.

This could work, but the vhost-user spec mandates that the queue remain in "stopped" state until a kick is received, so I'm not comfortable accessing the vring after only that event.

One option would be for the backend process to save the vring states and just know after a restore that there are buffers available in the vring. The problem is that just having available buffers doesn't guarantee that the backend can write to them. (Unless I missed something) the vhost-user spec doesn't mandate the order in which the frontend needs to send messages to the backend, so the load state message may arrive before the file descriptor to notify the driver that buffers were used, or before the guest memory has been shared with the backend. We could wait for all those messages to arrive before calling the callback on VhostUserBackend, but given that many other backends already work well without storing the vring state, instead just depending on the frontend to do it for them, I'd prefer not to add this major change for all of them just because of vsock if it can be avoided.

Another option would be to add a new message to the vhost-user protocol to start the vring, but that seems overkill to me. It would be much simpler in my opinion to interpret kicks as "there are buffers available in this vring" as opposed to "the driver just added new buffers to the vring" (more on this below).

Don't send the transport reset event to the driver from handle_event(EVENT_QUEUE_EVT, ...). Do it from another (to be implemented) callback, like set_enabled.

I don't have a strong opinion about this. I am interpreting the SOME_QUEUE_EVT events in the handle_event callback as "buffers are available in this vring" as opposed to "the driver says they just added new buffers to the vring". It's just semantics, but with my interpretation it makes sense to send the event from that location as it opens the possibility to have that callback called with that event in scenarios other than receiving an event from the driver. By "other scenarios" I don't just mean receiving a kick originated in the frontend, but any other approach we decide to implement.

The backend process shouldn't depend on receiving a kick from the frontend after a restore (the driver won't necessarily send one and shouldn't be modified to do so).
If we need a notification mechanism between frontend and backend, we should add a new message, but we should not reuse the kick which should be only used by the driver to notify the device.

As with the callback, I interpreted the kick event as "buffers are available in the vring". The backend doesn't need to know who sent this event, it only cares about what it can do after receiving it, so I don't see why the driver would be the only one that can write to the kick file descriptor.

That being said, I don't really care if that restriction is in place or not, I just need a way for the backend process to know that it can write to buffers in the event queue. Without the backend process storing and loading vring states (plus checking many other conditions) the only way available today in the vhost-user spec is that kick file descriptor.

I don't oppose extending the vhost-user spec with a new message for this notification to originate from the frontend, I just think it's unnecessary as the kick already exists and using for this purpose doesn't cause any issues.

Alternatively, put this behavior into the spec because we are really implementing something custom for crosvm.

This is obviously my preferred approach.

This behavior is not exclusive to crosvm though; QEMU already writes to the kick fd on restore for other devices (e.g. vhost-user-blk). I don't think this behavior could be avoided (with the current vhost-user spec) in most devices that support asynchronously sending messages from device to driver, as the backend needs to know that buffers are available in the vring, but the driver won't notify it again after the restore. The vsock device, with a queue used exclusively (for now) for the restore feature that QEMU can take over without the backend being involved, is an exception.

This message ended up being longer than I expected, I hope I exposed my position clearly. Looking forward to your reply.

@stefano-garzarella
Member

Hi, hope you had a good time off. Since it's been a while I'll try to summarize and list your concerns, please let me know if I miss anything:

Hi, yeah it was, I hope the same for you and sorry for the delay, but my inbox is in a bad shape :-(
BTW thanks for restarting the conversation!

Why not check that a reset is pending when handling every event?

Good question.

The device won't be "active", while a reset is pending, so we won't receive any backend or sibling VM events.

As for the RX and TX queues, we may receive those events, but it's not guaranteed that we will, and we can't depend on them to send the reset event any more than we can depend on receiving one for the event queue. If we do receive those events while the reset is pending, they will be handled correctly because the device is "inactive": there is nothing to send to the driver (other than the reset) and any messages received will simply be dropped.

Another good point of not doing what I asked is that in QEMU case, where QEMU is handling the event queue, that path (event on the event_queue) is never taken.

What about reacting to the VHOST_USER_SET_VRING_ENABLE message instead of a kick.

This could work, but the vhost-user spec mandates that the queue remain in "stopped" state until a kick is received, so I'm not comfortable accessing the vring after only that event.

One option would be for the backend process to save the vring states and just know after a restore that there are buffers available in the vring. The problem is that just having available buffers doesn't guarantee that the backend can write to them. (Unless I missed something) the vhost-user spec doesn't mandate the order in which the frontend needs to send messages to the backend, so the load state message may arrive before the file descriptor to notify the driver that buffers were used, or before the guest memory has been shared with the backend. We could wait for all those messages to arrive before calling the callback on VhostUserBackend, but given that many other backends already work well without storing the vring state, instead just depending on the frontend to do it for them, I'd prefer not to add this major change for all of them just because of vsock if it can be avoided.

Another option would be to add a new message to the vhost-user protocol to start the vring, but that seems overkill to me. It would be much simpler in my opinion to interpret kicks as "there are buffers available in this vring" as opposed to "the driver just added new buffers to the vring" (more on this below).

Ack, I'm on the same page now.

Don't send the transport reset event to the driver from handle_event(EVENT_QUEUE_EVT, ...). Do it from another (to be implemented) callback, like set_enabled.

I don't have a strong opinion about this. I am interpreting the SOME_QUEUE_EVT events in the handle_event callback as "buffers are available in this vring" as opposed to "the driver says they just added new buffers to the vring". It's just semantics, but with my interpretation it makes sense to send the event from that location as it opens the possibility to have that callback called with that event in scenarios other than receiving an event from the driver. By "other scenarios" I don't just mean receiving a kick originated in the frontend, but any other approach we decide to implement.

The backend process shouldn't depend on receiving a kick from the frontend after a restore (the driver won't necessarily send one and shouldn't be modified to do so).
If we need a notification mechanism between frontend and backend, we should add a new message, but we should not reuse the kick which should be only used by the driver to notify the device.

As with the callback, I interpreted the kick event as "buffers are available in the vring". The backend doesn't need to know who sent this event, it only cares about what it can do after receiving it, so I don't see why the driver would be the only one that can write to the kick file descriptor.

That being said, I don't really care if that restriction is in place or not, I just need a way for the backend process to know that it can write to buffers in the event queue. Without the backend process storing and loading vring states (plus checking many other conditions) the only way available today in the vhost-user spec is that kick file descriptor.

I don't oppose extending the vhost-user spec with a new message for this notification to originate from the frontend, I just think it's unnecessary as the kick already exists and using for this purpose doesn't cause any issues.

Yeah, but the kick_fd IMO should only be used for guest notifications. For example, we have VHOST_USER_VRING_KICK for frontend -> backend out-of-band notifications. (I don't mean to use it, since it requires a feature to be negotiated, etc.)

Alternatively, put this behavior into the spec because we are really implementing something custom for crosvm.

This is obviously my preferred approach.

Okay, let's go in this direction. Let's clarify that in the spec. That way, we can also get the other vhost-user spec maintainers' input on this point.
Honestly, I don't like using kick_fd for this and would prefer a new message, but I understand your point. If this is the common way to do this, let's document it in the spec and implement it also for vhost-user-vsock.

This behavior is not exclusive to crosvm though; QEMU already writes to the kick fd on restore for other devices (e.g. vhost-user-blk). I don't think this behavior could be avoided (with the current vhost-user spec) in most devices that support asynchronously sending messages from device to driver, as the backend needs to know that buffers are available in the vring, but the driver won't notify it again after the restore. The vsock device, with a queue used exclusively (for now) for the restore feature that QEMU can take over without the backend being involved, is an exception.

I see, but again it looks strange to me that the frontend is injecting that info, overwriting an fd that should be used only by the hypervisor to forward guest events. I really don't understand why the frontend can't use a normal message on the control path to inform the backend that the migration/snapshot is completed and the device can be restarted.

For example, we are already using VHOST_USER_SET_DEVICE_STATE_FD. Can we use the fd passed with that message to understand when to restart the device? In the spec we have:

This FD must behave as follows:
   For the writing end, it must allow writing the whole back-end state sequentially. Closing the file descriptor signals the end of transfer.
   For the reading end, it must allow reading the whole back-end state sequentially. The end of file signals the end of the transfer.

@XanClic and @GermanG worked on live-migration support for virtiofsd, so maybe they can help us here.

This message ended up being longer than I expected, I hope I exposed my position clearly. Looking forward to your reply.

Don't worry at all, thanks for the great discussion so far!

@jemoreira
Contributor Author

I see, but again it looks strange to me that the fronted is injecting that info overwriting a fd that should be used only by the hypervisor to forward guest events. I really don't understand why the frontend can't use a normal message on the control path to inform the backend that the migration/snapshot is completed and the device can be restarted.

For example, we are already using VHOST_USER_SET_DEVICE_STATE_FD. Can we use the fd passed with that message to understand when to restart the device?

Maybe in theory, but in practice this is not possible because before the snapshot is taken the frontend (typically|always?) sends the VHOST_USER_GET_VRING_BASE message for every vring. So those vrings are in stopped state after the device state is restored and need to be started again. The vring only transitions to "started" when the kick fd becomes readable or the VHOST_USER_VRING_KICK message is received.

I don't know if it's possible to take a snapshot of the backend without stopping the vrings first, but it seems unlikely to me. If this is not possible, then the kick is unavoidable (both VHOST_USER_VRING_KICK and any new message would require negotiation and therefore be optional and not required to support state migration) at least for vrings that still have unprocessed buffers (empty vrings will get kicked by the driver when it adds new buffers). This kick must come from the frontend as the driver (likely) already kicked those queues and doesn't know about the migration.

I had not taken the GET_VRING_BASE message into account when I suggested in previous messages the backend could save the state to avoid the need for the kick, now I know that idea won't work either.
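
The vring lifecycle being argued here can be reduced to a tiny state machine (illustrative only; the real state is tracked by the frontend and the vhost-user-backend crate): GET_VRING_BASE leaves a ring in "stopped" state, and only a kick, whether from the driver or injected by the frontend after a restore, moves it back to "started".

```rust
// Illustrative vring lifecycle per the vhost-user spec discussion above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum VringState {
    Stopped,
    Started,
}

struct Vring {
    state: VringState,
}

impl Vring {
    // The frontend fetches the ring index before taking a snapshot;
    // the spec leaves the ring in "stopped" state afterwards.
    fn get_vring_base(&mut self) {
        self.state = VringState::Stopped;
    }

    // Any kick (driver, or frontend-injected after a restore) starts it.
    fn kick(&mut self) {
        self.state = VringState::Started;
    }

    // The backend may only write to a started ring.
    fn can_write(&self) -> bool {
        self.state == VringState::Started
    }
}
```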

PS: I've been using this as the authoritative source for the vhost-user spec, let me know if that's not the correct place.

@jemoreira
Contributor Author

Also, where can I upload the updates to the spec for review?

And do you want to wait for the spec to be updated or would it be ok to merge this change now?

@stefano-garzarella
Member

This kick must come from the frontend as the driver (likely) already kicked those queues and doesn't know about the migration.

If this is really what we expect from the frontend, we should document that.
For me this is still confusing, because in theory the frontend can even forget about the kick_fd after setting it (e.g. wiring it between KVM and vhost-user backend), but I'm not against that.

PS: I've been using this as the authoritative source for the vhost-user spec, let me know if that's not the correct place.

yep, this is the rendered one, the source document lives in the QEMU source code: https://gitlab.com/qemu-project/qemu in docs/interop/vhost-user.rst

Also, where can I upload the updates to the spec for review?

Since it is in the QEMU source code, we need to follow https://qemu-project.gitlab.io/qemu/devel/submitting-a-patch.html

Here some examples: https://lore.kernel.org/qemu-devel/?q=vhost-user.rst

Feel free to ask any kind of help on that.

And do you want to wait for the spec to be updated or would it be ok to merge this change now?

I'm not strict on that, but would be nice to first reach an agreement (that for example the frontend is required to write the kick_fd after the migration to re-start the queues) before merging this. When we are all aligned, we can merge this even if the spec changes are in flight.

BTW thank you very much for the work!
