vsock: Enable live migrations (snapshot-restore)#936
vsock: Enable live migrations (snapshot-restore)#936jemoreira wants to merge 2 commits intorust-vmm:mainfrom
Conversation
ef051b4 to
ba8a093
Compare
|
Converted to draft to address the test failures |
d8fce24 to
2bcfd05
Compare
2bcfd05 to
04ec2db
Compare
dorindabassey
left a comment
There was a problem hiding this comment.
Thank you for this PR, just left a few comments.
04ec2db to
f2d4a4a
Compare
|
Hello, just a friendly reminder that this PR is ready for review. |
|
|
||
| // Queue mask to select vrings. | ||
| const QUEUE_MASK: u64 = 0b11; | ||
| const QUEUE_MASK: u64 = 0b111; |
There was a problem hiding this comment.
I'm wondering if this should be controlled by a param in the CLI. IIRC in QEMU the event queue is handled by the vmm.
Did you test this with QEMU?
There was a problem hiding this comment.
I didn't test with QEMU, only with CrosVm. I was very surprised to learn that QEMU takes ownership of the third queue. Luckily no cmdline parameter is needed, instead the device always activates after the config is read. This unfortunately deprives us of the ability to avoid a particular race condition (see the details in the commit message), but it should be very unlikely to trigger anyways.
I have now tested that the device works well with QEMU, however when I tried taking a snapshot and restoring it failed with an error saying that the device doesn't support it, even though it advertises the DEVICE_STATE protocol feature. I suspect the issue is in the qemu frontend, not in the backend.
There was a problem hiding this comment.
I didn't test with QEMU, only with CrosVm. I was very surprised to learn that QEMU takes ownership of the third queue.
IIRC this was already in that way for vhost-vsock, so when we introduced vhost-user-vsock we shared the same code. The VMM knows perfectly when a snapshot/migration is starting, so why it's strage that QEMU does it?
Luckily no cmdline parameter is needed, instead the device always activates after the config is read. This unfortunately deprives us of the ability to avoid a particular race condition (see the details in the commit message), but it should be very unlikely to trigger anyways.
mmm, so why not have a cmdline parameter to enabled this and avoid the race at all?
I have now tested that the device works well with QEMU, however when I tried taking a snapshot and restoring it failed with an error saying that the device doesn't support it, even though it advertises the DEVICE_STATE protocol feature. I suspect the issue is in the qemu frontend, not in the backend.
yeah, the fronted needs to enable that in some way. Annoying...
There was a problem hiding this comment.
why it's strage that QEMU does it?
Because the queue is called "event", not "transport_reset". If other events are added to the virtio-vsock spec then having qemu handling that queue will be a problem. It also seems to me that it doesn't match the intent of vhost-user, where reading and writing the queues is the responsibility of the backend, not the frontend; but that's just my (probably very uninformed) opinion.
mmm, so why not have a cmdline parameter to enabled this and avoid the race at all?
Because the probability of hitting that race is extremely low, possibly 0. I don't know for sure that it's valid to attempt to take a snapshot of an uninitialized driver.
On the other hand, having the extra flag that changes the behavior based on what the VMM does is a significant burden for the users. So far what QEMU and CrosVm do, but what about other VMMs? I doubt this behavior is document anywhere in an accessible manner, so it's possible that the only way to know for sure is to look at the VMM's source code. The user could also just try it with and without the flag and see what happens, but for VMMs that share the event queue with the backend both combinations will work most of the time.
So I between a very unlikely race and some usability issues I chose the former, but I'm fine either way. Just let me know your preference.
There was a problem hiding this comment.
why it's strage that QEMU does it?
Because the queue is called "event", not "transport_reset". If other events are added to the virtio-vsock spec then having qemu handling that queue will be a problem. It also seems to me that it doesn't match the intent of vhost-user, where reading and writing the queues is the responsibility of the backend, not the frontend; but that's just my (probably very uninformed) opinion.
As I mentioned, this comes from vhost (in-kernel) device implementation.
mmm, so why not have a cmdline parameter to enabled this and avoid the race at all?
Because the probability of hitting that race is extremely low, possibly 0. I don't know for sure that it's valid to attempt to take a snapshot of an uninitialized driver.
On the other hand, having the extra flag that changes the behavior based on what the VMM does is a significant burden for the users. So far what QEMU and CrosVm do, but what about other VMMs? I doubt this behavior is document anywhere in an accessible manner, so it's possible that the only way to know for sure is to look at the VMM's source code. The user could also just try it with and without the flag and see what happens, but for VMMs that share the event queue with the backend both combinations will work most of the time.
So I between a very unlikely race and some usability issues I chose the former, but I'm fine either way. Just let me know your preference.
I think we need to find another way, for example, can we discover if the event queue is offloaded to the device from the frontend or not (e.g. check if the device called set_vring_* etc.) and do this only if it's set up?
vhost-device-vsock/src/vhu_vsock.rs
Outdated
| // The last byte of the config is read when the driver is initializing or after it has | ||
| // processed a transport reset event. Either way, no transport reset will be pending | ||
| // after this. |
There was a problem hiding this comment.
mmm, so if a driver decides to not read the configuration, the device will not be activated? Is this conform to the spec?
Also, why we need to wait this, and not for example the feature negotiation?
There was a problem hiding this comment.
The spec says the driver needs to read the CID from the config space as part of device initialization process. When restoring from a snapshot, the spec says the driver MUST read the config space again because its CID may have changed and drop any existing connections, but keep any listeners, now associated to the new CID. The device informs the driver that a restore/migration occurred by sending the transport reset event.
Because the driver may eventually silently drop any existing connections the device should ensure no new connections are established until after the drop happens or it knows for sure it won't happen. The only common action that happens during a regular boot and a restore is the driver reading the config. The features are also read, but in the case of the restore this is done from the frontend side, not the driver, and happens before the transport reset, so it can't be used as the signal to activate.
I've updated the commit message with this information too.
There was a problem hiding this comment.
Thanks for the info, this helps the review a lot.
| let cid_map = self.cid_map.read().unwrap(); | ||
| if cid_map.contains_key(&dst_cid) { | ||
| let (sibling_raw_pkts_queue, sibling_groups_set, sibling_event_fd) = | ||
| let (sibling_raw_pkts_queue_opt, sibling_groups_set, sibling_event_fd) = |
There was a problem hiding this comment.
Why we need this change? Can be split in another commit?
I hope these things are mentioned in the commit description.
There was a problem hiding this comment.
This is to prevent packets from sibling VMs to be delivered to the guest driver before a potential transport reset event is processed (just like it's done for packages from the host). If any of these packets are allowed through a connection could be established that the transport reset would destroy later, but the sibling VM won't know about this. I don't know if just dropping the packet is the best course of action, but that's what the code already does for packets addressed to unknown CIDs and I just wanted to treat this case as if the CID is not reachable yet.
In fact, my first option was to delay adding the whole tuple to cid_map until activation, but the presence of the CID in the map is also used to check something in or near the main function, so I opted for just this instead.
I can split it into its own commit, but it doesn't make a lot of sense on its own. Let me know if you still prefer it separately.
|
|
||
| /// Sends a TRANSPORT_RESET event to the guest driver. Returns true if it was able to send it, | ||
| /// false if there were no buffers available in the vring. | ||
| pub fn reset_transport(&mut self, vring: &VringRwLock, event_idx: bool) -> Result<bool> { |
There was a problem hiding this comment.
So, after a reset what will be the state of this device, can it be re-used or need to be stopped ?
I'm asking because I don't see any reset of connection etc.
So, yes please describe this a bit better in the commit description and please add a section in the documentation to explain how this live migration is supposed to work.
There was a problem hiding this comment.
This isn't a "device reset", the spec calls it a "transport reset", but effectively it just tells the driver that all connections it new about are gone. Those connections only existed in the driver side as the device has just been started from scratch when this event is sent.
I've added more details to the commit message. By documentation, do you mean the README.md file? I'm not sure what to add there beyond the fact that the device supports VM migrations as this is usually handled by the VMM itself (not the user directly) and the processes/protocols are described in the virtio and vhost-user specs.
There was a problem hiding this comment.
This isn't a "device reset", the spec calls it a "transport reset", but effectively it just tells the driver that all connections it new about are gone. Those connections only existed in the driver side as the device has just been started from scratch when this event is sent.
I'm a bit lost here, especially where this is called.
So we have:
EVT_QUEUE_EVENT => {
let reset_pending = &mut *self.transport_reset_pending.lock().unwrap();
if *reset_pending {
thread.reset_transport(vring_evt, evt_idx)?;
*reset_pending = false;
}
}
IIUC this means that it's called when the driver fills the event queue. Why this?
I mean, why the guest will fill up that queue again after a snapshot?
What happen if the guest fills the event queue before we set self.transport_reset_pending?
In QEMU we use it in a different way: when the migration is completed (post-load), we use one of the buffer that the driver already queued in the event_queue.
Here for me is not clear at all, why after a snapshot/migration a driver needs to fill the event queue again with free descriptors.
I've added more details to the commit message. By documentation, do you mean the README.md file? I'm not sure what to add there beyond the fact that the device supports VM migrations as this is usually handled by the VMM itself (not the user directly) and the processes/protocols are described in the virtio and vhost-user specs.
Yep a new section in the README.md will be nice especially to explain how it supposed to interact with the driver, because I'm still a lot confused about it, so I'm pretty sure in the future we forgot about this.
There was a problem hiding this comment.
Added the section in the README file.
The most common case where this matters is after receiving a VHOST_USER_GET_VRING_BASE message from the client, if one of the host side applications or sister VMs sends us a message it must be kept queued until the virtqueues are set to ready again with a kick. Signed-off-by: Jorge E. Moreira <jemoreira@google.com>
f2d4a4a to
ed40b69
Compare
stefano-garzarella
left a comment
There was a problem hiding this comment.
I'm still really confused, especially on why we do this on EVT_QUEUE_EVENT. So a section in the readme that explain better all the snapshot/migration support will be nice to have.
In addition, I think a parameter to enable this only with some VMMs would be better IMO because I still think what QEMU is doing makes more sense.
Note: I'll be off and with limited internet access from March 7 to March 29, if others want to merge, I'm not against, but I'd like to have a section in the readme with a clear design on how this is going to work.
|
|
||
| // Queue mask to select vrings. | ||
| const QUEUE_MASK: u64 = 0b11; | ||
| const QUEUE_MASK: u64 = 0b111; |
There was a problem hiding this comment.
I didn't test with QEMU, only with CrosVm. I was very surprised to learn that QEMU takes ownership of the third queue.
IIRC this was already in that way for vhost-vsock, so when we introduced vhost-user-vsock we shared the same code. The VMM knows perfectly when a snapshot/migration is starting, so why it's strage that QEMU does it?
Luckily no cmdline parameter is needed, instead the device always activates after the config is read. This unfortunately deprives us of the ability to avoid a particular race condition (see the details in the commit message), but it should be very unlikely to trigger anyways.
mmm, so why not have a cmdline parameter to enabled this and avoid the race at all?
I have now tested that the device works well with QEMU, however when I tried taking a snapshot and restoring it failed with an error saying that the device doesn't support it, even though it advertises the DEVICE_STATE protocol feature. I suspect the issue is in the qemu frontend, not in the backend.
yeah, the fronted needs to enable that in some way. Annoying...
vhost-device-vsock/src/vhu_vsock.rs
Outdated
| // The last byte of the config is read when the driver is initializing or after it has | ||
| // processed a transport reset event. Either way, no transport reset will be pending | ||
| // after this. |
There was a problem hiding this comment.
Thanks for the info, this helps the review a lot.
|
|
||
| /// Sends a TRANSPORT_RESET event to the guest driver. Returns true if it was able to send it, | ||
| /// false if there were no buffers available in the vring. | ||
| pub fn reset_transport(&mut self, vring: &VringRwLock, event_idx: bool) -> Result<bool> { |
There was a problem hiding this comment.
This isn't a "device reset", the spec calls it a "transport reset", but effectively it just tells the driver that all connections it new about are gone. Those connections only existed in the driver side as the device has just been started from scratch when this event is sent.
I'm a bit lost here, especially where this is called.
So we have:
EVT_QUEUE_EVENT => {
let reset_pending = &mut *self.transport_reset_pending.lock().unwrap();
if *reset_pending {
thread.reset_transport(vring_evt, evt_idx)?;
*reset_pending = false;
}
}
IIUC this means that it's called when the driver fills the event queue. Why this?
I mean, why the guest will fill up that queue again after a snapshot?
What happen if the guest fills the event queue before we set self.transport_reset_pending?
In QEMU we use it in a different way: when the migration is completed (post-load), we use one of the buffer that the driver already queued in the event_queue.
Here for me is not clear at all, why after a snapshot/migration a driver needs to fill the event queue again with free descriptors.
I've added more details to the commit message. By documentation, do you mean the README.md file? I'm not sure what to add there beyond the fact that the device supports VM migrations as this is usually handled by the VMM itself (not the user directly) and the processes/protocols are described in the virtio and vhost-user specs.
Yep a new section in the README.md will be nice especially to explain how it supposed to interact with the driver, because I'm still a lot confused about it, so I'm pretty sure in the future we forgot about this.
There was a problem hiding this comment.
In the commit there some something to be fixed:
the device must send a TRANSPORT RESET event via the event virtqueue to the device. -> to the driver
The device in turn, upon receiving the event -> The driver in turn
There was a problem hiding this comment.
The device doesn't need to save/load any state for this. When loading a
snapshot the device simply notes that it must send a TRANSPORT RESET
event to the driver as soon as possible, which it then does when it
receives a kick on the event vring.
So to recap, this is unclear to me, why the driver will send a kick to the event vring ?
Is that in the spec or in the implementation?
There was a problem hiding this comment.
In the commit there some something to be fixed
Oops! too many d-words :). Fixed it.
Why the driver will send a kick to the event vring?
I can't reply to the other comment on this same topic for some reason, so I'll reply here too:
I mean, why the guest will fill up that queue again after a snapshot?
What happen if the guest fills the event queue before we set self.transport_reset_pending?
The driver probably doesn't (necessarily) push more buffers or send a kick after a restore, the fronted does (or should, at least CrosVm does).
I'll admit I haven't looked at QEMU's source in depth, but I'd image that if the rx queue has buffers in it a kick is sent to the device after restore to activate that queue, otherwise the device won't be able to send data to the driver until the driver pushes more buffers, which it has no reason to do since the ones it already pushed are just sitting there.
The backend could avoid waiting for the kick if it stored the queue state in its saved state, but there is still the question of when to send the buffer: Immediately after loading its state and replying to the "check" call? Wouldn't that be too soon, for example if it doesn't yet have call fd?
There was a problem hiding this comment.
I went looking at the QEMU source code after all, or to be precise I asked Gemini to look for me. It seems QEMU does not in fact send that kick like crosvm does and instead depends on the driver to kick the queues when responding to the transport event OR on the device implementation to store the state of the queue. The virtio spec doesn't mention that the driver should kick the queues on transport reset though. In any case this approach seems to work well with the vhost backend, but the vhost-user backend is marked by qemu as unmigratable.
Other vhost-user devices, like vhost-user-blk appear to have a "kick right away" logic similar to what crosvm does for vhost-user-vsock. So, maybe it would be a good idea to add it in vhost-user-vsock too.
There was a problem hiding this comment.
In the commit there some something to be fixed
Oops! too many d-words :). Fixed it.
Why the driver will send a kick to the event vring?
I can't reply to the other comment on this same topic for some reason, so I'll reply here too:
I mean, why the guest will fill up that queue again after a snapshot?
What happen if the guest fills the event queue before we set self.transport_reset_pending?
The driver probably doesn't (necessarily) push more buffers or send a kick after a restore, the fronted does (or should, at least CrosVm does).
Sorry, why a frontend should inject a kick? This seems more an hack to me?
A kick should means: "there is something new in the avail ring, please process it". What new stuff are in the vring after that kick?
I'll admit I haven't looked at QEMU's source in depth, but I'd image that if the rx queue has buffers in it a kick is sent to the device after restore to activate that queue, otherwise the device won't be able to send data to the driver until the driver pushes more buffers, which it has no reason to do since the ones it already pushed are just sitting there.
The device IMO should know that is starting with a driver already initialized and should not wait for a kick, but start to process the queue just after starting.
That said, for a TX queue may have sense, but we are not using at all in that sense in this case, since no new buffers will be there. Also we are not processing anything from the guest, so I'm still really confused why a kick is needed if the device doesn't need to process anything from the guest, but this seems more a frontend -> backend notification. Again, this seems purely an hack, and should not be enabled by default IMHO.
The backend could avoid waiting for the kick if it stored the queue state in its saved state, but there is still the question of when to send the buffer: Immediately after loading its state and replying to the "check" call? Wouldn't that be too soon, for example if it doesn't yet have call fd?
What about VHOST_USER_SET_VRING_ENABLE event? (maybe we need to extend the
The [virtio spec](https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4950001) mandates that when restoring from a snapshot (such as when the VM was migrated to a different host) the device must send a TRANSPORT RESET event via the event virtqueue to the driver. The driver in turn, upon receiving the event, must drop all existing connections while keeping all listeners and read the CID again from the device configuration space and update the listeners with the (potentially different) CID. On the device side care must be taken to ensure no new connections are established until that transport reset event is handled otherwise the guest will silently drop them and the peer may not know about it. Given that the driver must read the CID from the configuration space during normal boot and after doing a reset that's the best signal for the device to "activate" and start processing packets from the host and sibling VMs. Feature negotiation also happens during both normal boot and restore, but in the case of restore it happens before the transport reset and therefore can't be used as reliable signal to "activate" the device. The device doesn't need to save/load any state for this. When loading a snapshot the device simply notes that it must send a TRANSPORT RESET event to the driver as soon as possible, which it then does when it receives a kick on the event vring. Independently of whether the device is started to restore a previous VM state or brand new (the device actually doesn't know until it `set_device_state_fd` is called), the device always starts in "inactive" state, meaning it will drop any packets coming from any source. As mentioned before the device "activates" once the driver has read the configuration space. Unlike other VMMs, QEMU takes ownership of the event vring and handles sending of the TRANSPORT RESET event itself. This change attempts to handle both approaches by sending the TRANSPORT RESET event if the vring is kicked, but doesn't treat it as precondition to activate the device, instead choosing to activate unconditionally once the driver reads the configuration space. There is a race in this implementation, that occurs when the snapshot was taken before the guest driver read the configuration and then it reads immediately upon restore, before the transport reset event was handled. While this could be mitigated by waiting for the reset event to be sent AND the config to be read AFTER the event was sent, QEMU's decision to take over the event queue makes that approach unviable. Hopefully, it will be very unlikely a snapshot will be taken before the driver is fully initialized as that would be of very little value in practice. Signed-off-by: Jorge E. Moreira <jemoreira@google.com>
ed40b69 to
c6725ee
Compare
I am also going to be on vacation during that same period, we can resume after the 29th if you're still not convinced about this. I tried to address every comment you left with the latest push, except the command line flag because I think everyone is just going to run it in "qemu mode", see that it works with other VMMs too and never use it. |
|
|
||
| ## Live migration | ||
|
|
||
| This device implementation advertises support for live migrations by offering the VHOST_USER_PROTOCOL_F_DEVICE_STATE protocol feature, however this doesn't work with Qemu yet as it marks its vsock frontend as "unmigratable". This feature does work with CrosVm and potentially other virtual machine managers. |
There was a problem hiding this comment.
Okay, for example, if this feature is not negotiated, we can skip all of this, no?
|
|
||
| The state saving flow is trivial as the device doesn't save any state as mentioned. | ||
|
|
||
| The state loading flow is a bit more complicated because the virtio-vsock spec mandates that the device must send a VIRTIO_VSOCK_EVENT_TRANSPORT_RESET event to the driver. During a restore the backend is started no differently than during a regular boot. When the frontend sends the VHOST_USER_SET_DEVICE_STATE_FD command with LOAD direction the backend doesn't load anything, but it takes note that a transport reset event needs to be sent to the driver via the event vring when possible. In order to make sure this event is sent when the queue is ready, the backend waits for the event queue to be kicked before sending the event. While these kicks usually come from the driver, this particular one is actually sent by the vhost-user frontend. This implementation depends on the frontend to kick all queues with pending buffers after a restore because the driver is unlikely to do so as it probably did it before the snapshot was taken. |
There was a problem hiding this comment.
In order to make sure this event is sent when the queue is ready, the backend waits for the event queue to be kicked before sending the event. While these kicks usually come from the driver, this particular one is actually sent by the vhost-user frontend.
This is the hack IMO. The kick means: "hey, I'm the driver and there is a new available buffer for you in the virtqueue", while here we are using for something completely different, and even not in the spec. If this is what we want from the frontend (and I don't understand why we want this), we should put into the spec and not guess here that we will receive that event.
| return Vec::new(); | ||
| } | ||
|
|
||
| if offset + size == buf.len() { |
There was a problem hiding this comment.
Instead of doing this, can we add a set_enabled callback to the VhostUserBackend trait and do this in that callback?
There was a problem hiding this comment.
When would that set_enabled callback be called? This has to happen after the config is read because that's the only signal (according to the spec) the driver gives the device that the transport reset happened.
There was a problem hiding this comment.
Also another question, why a driver should read the config space after a live migration?
For the driver should be transparent, no?
There was a problem hiding this comment.
Ah, okay, here we are acking that the driver reset it. I need to reread it again, all of those assumptions are not clear to me and also not in the VIRTIO spec at all.
Should we extend it?
There was a problem hiding this comment.
The driver reading the config after a transport reset is in the spec. It's the second to last sentence of the last paragraph of the Device Events section. It's very easy to miss, but it is there.
| let reset_pending = &mut *self.transport_reset_pending.lock().unwrap(); | ||
| if *reset_pending { | ||
| thread.reset_transport(vring_evt, evt_idx)?; | ||
| *reset_pending = false; | ||
| } |
There was a problem hiding this comment.
Just to be clear, this should not be here IMO, but in some other place, because here we should handle events coming from the driver (or backends like unix socket, etc.).
This should be done as response of some vhost_user message, like the set_enabled, etc. not to a random kick injected by the frontend.
There was a problem hiding this comment.
This is the only place where the backend implementation is given access to the vrings.
The spec clearly says that "the back-end must start a ring upon receiving a kick", so without this kick the vring will not be ready and any attempt to write to it will simply fail with the NoReady error.
There was a problem hiding this comment.
I see, but maybe we need to find something else. Or put this behavior into the spec because we are really implementing something custom for crosvm. (I'll try to find something better when I'm back)
Also, why not doing this checks in any case? I mean for every event?
What is confusing me is that we are using a kick to do something else. If we need a notification mechanism between frontend and backend, we should add a new message, but we should not reuse the kick which should be only used by the driver to notify the device.
Agree on that, I pointed out what is unclear to me. I think we can find a solution, but I don't like the kick injection TBH and I think we should avoid that. Let's continue when we are back ;-) Enjoy your time off! |
|
Hi, hope you had a good time off. Since it's been a while I'll try to summarize and list your concerns, please let me know if I miss anything:
Good question. The device won't be "active", while a reset is pending, so we won't receive any backend or sibling VM events. As for the RX and TX queues, we may receive those events, but it's not guaranteed that we will and we can't depend on them to send the reset event any more that we can depend on receiving one for the event queue. If we do receive those events while the reset is pending, they will be handled correctly because the device is "inactive" and therefore there is nothing to send to the driver (other than the reset) and any messages received will simply be dropped.
This could work, but the vhost-user spec mandates that the queue state be in "stopped" state until a kick is received, so I'm not comfortable accessing the vring after only that event. One option would be for the backend process to save the vring states and just know after a restore that there are buffers available in the vring. The problem is that just having available buffers doesn't guarantee that the backend can write to them. (Unless I missed something) the vhost-user spec doesn't mandate the order in which the frontend needs to send messages to the backend, so the load state message may arrive before the file descriptor to notify the driver that buffers were used or before the guest memory has been shared with the backend. We could wait for all those messages to arrive before calling the calback on VhostUserBackend, but given that many other backends already work well without storing the vring state and instead just depend on the fronted to do it for them, I'd prefer not to add this major change for all of them just because of vsock if it can be avoided. Another option would be to add a new message to the vhost-user protocol to start the vring, but that seems overkill to me. It would be much simpler in my opinion to interpret kicks as "there are buffers available in this vring" as opposed to "the driver just added new buffers to the vring" (more on this below).
I don't have a strong opinion about this. I am interpreting the
As with the callback, I interpreted the kick event as "buffers are available in the vring". The backend doesn't need to know who sent this event, it only cares about what it can do after receiving it, so I don't see why the driver would be the only one that can write to the kick file descriptor. That being said, I don't really care if that restriction is in place or not, I just need a way for the backend process to know that it can write to buffers in the event queue. Without the backend process storing and loading vring states (plus checking many other conditions) the only way available today in the vhost-user spec is that kick file descriptor. I don't oppose extending the vhost-user spec with a new message for this notification to originate from the frontend, I just think it's unnecessary as the kick already exists and using for this purpose doesn't cause any issues.
This is obviously my preferred approach. This behavior is not exclusive to Crosvm though, QEMU already writes to the kick fd on restore for other devices (e.g vhost-user-blk). I don't think this behavior could be avoided (with the current vhost-user spec) in most devices that support asynchronously sending messages from device to driver as the backend needs to know that buffers are available in the vring, but the driver won't notify it again after the restore. The vsock device having a queue exclusively (for now) for the restore feature that QEMU can take over without the backend being involved is an exception. This message ended up being longer than I expected, I hope I exposed my position clearly. Looking forward to your reply. |
Hi, yeah it was, I hope the same for you and sorry for the delay, but my inbox is in a bad shape :-(
Another good point of not doing what I asked is that in QEMU case, where QEMU is handling the event queue, that path (event on the event_queue) is never taken.
Ack, I'm on the same page now.
Yeah, but the kick_fd IMO should only be used for guest notifications. For example, we have VHOST_USER_VRING_KICK for frontend -> backend out-of-bound notifications. (I don't mean to use it since it requires a features to be used, etc.)
Okay, let's go in this direction. Let's clarify that in the spec. That way, we can also get the other vhost-user specs maintainers' input on this point.
I see, but again it looks strange to me that the fronted is injecting that info overwriting a fd that should be used only by the hypervisor to forward guest events. I really don't understand why the frontend can't use a normal message on the control path to inform the backend that the migration/snapshot is completed and the device can be restarted. For example, we are already using @XanClic and @GermanG worked on live-migration support for virtiofsd, so maybe they can help us here.
Don't worry at all, thanks for the great discussion so far! |
Maybe in theory, but in practice this is not possible because before the snapshot is taken the frontend (typically|always?) sends the I don't know if it's possible to take a snapshot of the backend without stopping the vrings first, but it seems unlikely to me. If this is not possible, then the kick is unavoidable (both I had not taken the PS: I've been using this as the authoritative source for the vhost-user spec, let me know if that's not the correct place. |
|
Also, where can I upload the updates to the spec for review? And do you want to wait for the spec to be updated or would it be ok to merge this change now? |
If this is really what we expect from the frontend, we should document that.
yep, this is the rendered one, the source document lives in the QEMU source code: https://gitlab.com/qemu-project/qemu in
Since it is in the QEMU source code, we need to follow https://qemu-project.gitlab.io/qemu/devel/submitting-a-patch.html Here some examples: https://lore.kernel.org/qemu-devel/?q=vhost-user.rst Feel free to ask any kind of help on that.
I'm not strict on that, but would be nice to first reach an agreement (that for example the frontend is required to write the kick_fd after the migration to re-start the queues) before merging this. When we are all aligned, we can merge this even if the spec changes are in flight. BTW thank you very much for the work! |
Summary of the PR
The vsock device only accesses virtqueues when they are ready.
The most common case where this matters is after receiving a VHOST_USER_GET_VRING_BASE message from the client (because of a suspend operation, for example), if one of the host side applications or sister VMs sends us a message it must be kept queued until the virtqueues are set to ready again with a kick.
Enable live migrations (snapshot-restore) of the vsock device.
The virtio spec mandates that when restoring from a snapshot (such as when the VM was migrated to a different host) the device must send a TRANSPORT RESET event via the event virtqueue to the device. The device in turn, upon receiving the event, must drop all existing connections while keeping all listeners and read the CID again from the device configuration space and update the listeners with the (potentially different) CID.
On the device side care must be taken to ensure no new connections are established until that transport reset event is handled otherwise the guest will silently drop them and the peer may not know about it. Given that the driver must read the CID from the configuration space during normal boot and after doing a reset that's the best signal for the device to "activate" and start processing packets from the host and sibling VMs. Feature negotiation also happens during both normal boot and restore, but in the case of restore it happens before the transport reset and therefore can't be used as reliable signal to "activate" the device.
The device doesn't need to save/load any state for this. When loading a snapshot the device simply notes that it must send a TRANSPORT RESET event to the driver as soon as possible, which it then does when it receives a kick on the event vring. Independently of whether the device is started to restore a previous VM state or brand new (the device actually doesn't know until it
set_device_state_fdis called), the device always starts in "inactive" state, meaning it will drop any packets coming from any source. As mentioned before the device "activates" once the driver has read the configuration space.Unlike other VMMs, QEMU takes ownership of the event vring and handles sending of the TRANSPORT RESET event itself. This change attempts to handle both approaches by sending the TRANSPORT RESET event if the vring is kicked, but doesn't treat it as precondition to activate the device, instead choosing to activate unconditionally once the driver reads the configuration space.
There is a race in this implementation, that occurs when the snapshot was taken before the guest driver read the configuration and then it reads immediately upon restore, before the transport reset event was handled. While this could be mitigated by waiting for the reset event to be sent AND the config to be read AFTER the event was sent, QEMU's decision to take over the event queue makes that approach unviable. Hopefully, it will be very unlikely a snapshot will be taken before the driver is fully initialized as that would be of very little value in practice.
Requirements
Before submitting your PR, please make sure you addressed the following
requirements:
git commit -s), and the commit message has max 60 characters for thesummary and max 75 characters for each description line.
test.
Release" section of CHANGELOG.md (if no such section exists, please create one).
unsafecode is properly documented.