
Commit c6725ee

vsock: Enable live migrations (snapshot-restore)
The [virtio spec](https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4950001) mandates that when restoring from a snapshot (such as when the VM was migrated to a different host) the device must send a TRANSPORT RESET event to the driver via the event virtqueue. The driver, in turn, upon receiving the event, must drop all existing connections while keeping all listeners, read the CID again from the device configuration space, and update the listeners with the (potentially different) CID. On the device side, care must be taken to ensure no new connections are established until that transport reset event has been handled; otherwise the guest will silently drop them and the peer may not know about it.

Given that the driver must read the CID from the configuration space both during normal boot and after a reset, that read is the best signal for the device to "activate" and start processing packets from the host and sibling VMs. Feature negotiation also happens during both normal boot and restore, but in the restore case it happens before the transport reset and therefore can't be used as a reliable signal to "activate" the device.

The device doesn't need to save or load any state for this. When loading a snapshot, the device simply notes that it must send a TRANSPORT RESET event to the driver as soon as possible, which it then does when it receives a kick on the event vring. Independently of whether the device is started to restore a previous VM state or brand new (the device doesn't actually know until `set_device_state_fd` is called), the device always starts in the "inactive" state, meaning it drops any packets coming from any source. As mentioned above, the device "activates" once the driver has read the configuration space.

Unlike other VMMs, QEMU takes ownership of the event vring and handles sending the TRANSPORT RESET event itself.
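Concretely, the payload the device places in an event-queue buffer is tiny: per the virtio spec, `struct virtio_vsock_event` is a single le32 event id, and VIRTIO_VSOCK_EVENT_TRANSPORT_RESET is 0. A minimal sketch of the wire encoding (illustrative, not this crate's code):

```rust
// Sketch of the TRANSPORT_RESET event payload the device must place in a
// buffer on the event virtqueue. `struct virtio_vsock_event` is one le32
// event id; VIRTIO_VSOCK_EVENT_TRANSPORT_RESET == 0.
const VIRTIO_VSOCK_EVENT_TRANSPORT_RESET: u32 = 0;

/// Encode the event exactly as it appears on the wire: one little-endian u32.
fn encode_transport_reset_event() -> [u8; 4] {
    VIRTIO_VSOCK_EVENT_TRANSPORT_RESET.to_le_bytes()
}

fn main() {
    let payload = encode_transport_reset_event();
    assert_eq!(payload, [0u8; 4]);
}
```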
This change attempts to handle both approaches by sending the TRANSPORT RESET event if the event vring is kicked, but without treating that as a precondition to activate the device, instead activating unconditionally once the driver reads the configuration space. There is a race in this implementation: if the snapshot was taken before the guest driver read the configuration, and upon restore the driver reads it before the transport reset event has been handled, the device activates too early. While this could be mitigated by waiting for the reset event to be sent AND for the config to be read AFTER the event was sent, QEMU's decision to take over the event queue makes that approach unviable. Hopefully it will be very unlikely that a snapshot is taken before the driver is fully initialized, as such a snapshot would be of very little value in practice.

Signed-off-by: Jorge E. Moreira <jemoreira@google.com>
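The "inactive until config read" gating described above can be sketched as follows. This is a simplified illustration with hypothetical names (`Device`, `on_config_read`, `deliver` are not this crate's API): the device starts inactive for both clean boots and restores, drops packets from any source while inactive, and activates when the driver reads the configuration space.

```rust
// Illustrative sketch of activation gating; not the crate's actual code.
use std::sync::atomic::{AtomicBool, Ordering};

struct Device {
    // false at start, for both a clean boot and a restore.
    active: AtomicBool,
}

impl Device {
    fn new() -> Self {
        Self { active: AtomicBool::new(false) }
    }

    // Called when the driver reads the device configuration space.
    fn on_config_read(&self) {
        self.active.store(true, Ordering::Release);
    }

    // Packets from the host or sibling VMs are dropped until activation.
    fn deliver(&self, _pkt: &[u8]) -> Result<(), &'static str> {
        if self.active.load(Ordering::Acquire) {
            Ok(()) // forward to the guest
        } else {
            Err("dropped: device inactive")
        }
    }
}

fn main() {
    let dev = Device::new();
    assert!(dev.deliver(b"hello").is_err()); // dropped before config read
    dev.on_config_read();
    assert!(dev.deliver(b"hello").is_ok()); // forwarded after activation
}
```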
1 parent 62a6cf7 commit c6725ee

File tree

5 files changed: +189 −42 lines changed

vhost-device-vsock/CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,8 @@
 
 ### Added
 
+- [#936](https://github.com/rust-vmm/vhost-device/pull/936) vsock: Enable live migrations (snapshot-restore)
+
 ### Changed
 
 ### Fixed

vhost-device-vsock/README.md

Lines changed: 12 additions & 0 deletions
@@ -129,6 +129,18 @@ In this configuration:
 - The host must have vsock support (e.g., `vsock_loopback` kernel module loaded)
 - For testing, you can load the module with: `modprobe vsock_loopback`
 
+## Live migration
+
+This device implementation advertises support for live migration by offering the VHOST_USER_PROTOCOL_F_DEVICE_STATE protocol feature. This doesn't work with QEMU yet, however, as QEMU marks its vsock frontend as "unmigratable". The feature does work with crosvm and potentially other virtual machine managers.
+
+The device itself doesn't save or restore any state during a live migration. It relies instead on the frontend to save the vrings' state and the negotiated features. It also expects the frontend to "kick" the queues that have pending buffers in them, since the driver probably kicked those queues before the migration and won't do it again.
+
+The state saving flow is trivial, as the device saves no state.
+
+The state loading flow is a bit more complicated because the virtio-vsock spec mandates that the device send a VIRTIO_VSOCK_EVENT_TRANSPORT_RESET event to the driver. During a restore the backend is started no differently than during a regular boot. When the frontend sends the VHOST_USER_SET_DEVICE_STATE_FD command with the LOAD direction, the backend doesn't load anything, but it takes note that a transport reset event needs to be sent to the driver via the event vring when possible. To make sure this event is sent when the queue is ready, the backend waits for the event queue to be kicked before sending it. While these kicks usually come from the driver, this particular one is actually sent by the vhost-user frontend. This implementation depends on the frontend kicking all queues with pending buffers after a restore, because the driver is unlikely to do so, having probably done it before the snapshot was taken.
+
+In response to the transport reset event, the driver drops any existing connections and reads the configuration space again. To prevent the driver from dropping new connections established after the restore, the backend doesn't forward any packets from outside the VM to the driver until the driver has read the configuration space. In fact, because the backend doesn't know at start time whether this is a restore or a clean boot, it always waits until the driver has read the configuration space before forwarding packets between the outside world and the driver.
+
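The restore flow described above can be summarized in a small sketch (illustrative names, not the crate's actual types): a LOAD state transfer only flags a pending reset, and the event is emitted the first time the event queue is kicked afterwards.

```rust
// Simplified sketch of the restore flow; `Backend`, `set_device_state`, and
// `on_event_queue_kick` are illustrative stand-ins for the real handlers.
#[derive(Clone, Copy, PartialEq)]
enum Direction {
    Save,
    Load,
}

struct Backend {
    transport_reset_pending: bool,
}

impl Backend {
    fn new() -> Self {
        Self { transport_reset_pending: false }
    }

    // Analogous to handling VHOST_USER_SET_DEVICE_STATE_FD: no state is
    // actually transferred; a LOAD merely flags the pending reset.
    fn set_device_state(&mut self, direction: Direction) {
        if direction == Direction::Load {
            self.transport_reset_pending = true;
        }
    }

    // Analogous to the event-queue kick handler: returns the event id to
    // place in a guest buffer, if any (0 == VIRTIO_VSOCK_EVENT_TRANSPORT_RESET).
    fn on_event_queue_kick(&mut self) -> Option<u32> {
        if self.transport_reset_pending {
            self.transport_reset_pending = false;
            Some(0)
        } else {
            None
        }
    }
}

fn main() {
    let mut b = Backend::new();
    assert_eq!(b.on_event_queue_kick(), None); // clean boot: nothing to send
    b.set_device_state(Direction::Load);
    assert_eq!(b.on_event_queue_kick(), Some(0)); // restore: send reset once
    assert_eq!(b.on_event_queue_kick(), None); // and only once
}
```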
 ## Usage
 
 Run the vhost-device-vsock device with unix domain socket backend:

vhost-device-vsock/src/thread_backend.rs

Lines changed: 31 additions & 6 deletions
@@ -332,7 +332,7 @@ impl VsockThreadBackend {
         if dst_cid != VSOCK_HOST_CID {
             let cid_map = self.cid_map.read().unwrap();
             if cid_map.contains_key(&dst_cid) {
-                let (sibling_raw_pkts_queue, sibling_groups_set, sibling_event_fd) =
+                let (sibling_raw_pkts_queue_opt, sibling_groups_set, sibling_event_fd) =
                     cid_map.get(&dst_cid).unwrap();
 
                 if self
@@ -345,11 +345,18 @@
                     return Ok(());
                 }
 
-                sibling_raw_pkts_queue
-                    .write()
-                    .unwrap()
-                    .push_back(RawVsockPacket::from_vsock_packet(pkt)?);
-                let _ = sibling_event_fd.write(1);
+                match sibling_raw_pkts_queue_opt {
+                    Some(queue) => {
+                        queue
+                            .write()
+                            .unwrap()
+                            .push_back(RawVsockPacket::from_vsock_packet(pkt)?);
+                        let _ = sibling_event_fd.write(1);
+                    }
+                    None => {
+                        info!("vsock: dropping packet for cid: {dst_cid:?} due to inactive device");
+                    }
+                }
             } else {
                 warn!("vsock: dropping packet for unknown cid: {dst_cid:?}");
             }
@@ -525,6 +532,7 @@ mod tests {
     #[cfg(feature = "backend_vsock")]
     use crate::vhu_vsock::VsockProxyInfo;
     use crate::vhu_vsock::{BackendType, VhostUserVsockBackend, VsockConfig, VSOCK_OP_RW};
+    use vhost_user_backend::VhostUserBackend;
 
     const DATA_LEN: usize = 16;
     const CONN_TX_BUF_SIZE: u32 = 64 * 1024;
@@ -698,11 +706,28 @@
         // SAFETY: Safe as hdr_raw and data_raw are guaranteed to be valid.
         let mut packet = unsafe { VsockPacket::new(hdr_raw, Some(data_raw)).unwrap() };
 
+        packet.set_type(VSOCK_TYPE_STREAM);
+        packet.set_src_cid(CID);
+        packet.set_dst_cid(SIBLING_CID);
+        packet.set_dst_port(SIBLING_LISTENING_PORT);
+        packet.set_op(VSOCK_OP_RW);
+        packet.set_len(DATA_LEN as u32);
+        packet
+            .data_slice()
+            .unwrap()
+            .copy_from(&[0x01u8, 0x12u8, 0x23u8, 0x34u8]);
+
+        // The packet will be dropped silently because the thread won't activate until the config
+        // is read.
+        vtp.send_pkt(&packet).unwrap();
         assert_eq!(
             vtp.recv_raw_pkt(&mut packet).unwrap_err().to_string(),
             Error::EmptyRawPktsQueue.to_string()
         );
 
+        sibling_backend.get_config(0, 8);
+        sibling2_backend.get_config(0, 8);
+
         packet.set_type(VSOCK_TYPE_STREAM);
         packet.set_src_cid(CID);
         packet.set_dst_cid(SIBLING_CID);

vhost-device-vsock/src/vhu_vsock.rs

Lines changed: 61 additions & 7 deletions
@@ -2,14 +2,18 @@
 
 use std::{
     collections::{HashMap, HashSet},
+    fs::File,
     io::Result as IoResult,
     path::PathBuf,
     sync::{Arc, Mutex, RwLock},
 };
 
 use log::warn;
 use thiserror::Error as ThisError;
-use vhost::vhost_user::message::{VhostUserProtocolFeatures, VhostUserVirtioFeatures};
+use vhost::vhost_user::message::{
+    VhostTransferStateDirection, VhostTransferStatePhase, VhostUserProtocolFeatures,
+    VhostUserVirtioFeatures,
+};
 use vhost_user_backend::{VhostUserBackend, VringRwLock};
 use virtio_bindings::bindings::{
     virtio_config::{VIRTIO_F_NOTIFY_ON_EMPTY, VIRTIO_F_VERSION_1},
@@ -24,8 +28,14 @@
 
 use crate::{thread_backend::RawPktsQ, vhu_vsock_thread::*};
 
-pub(crate) type CidMap =
-    HashMap<u64, (Arc<RwLock<RawPktsQ>>, Arc<RwLock<HashSet<String>>>, EventFd)>;
+pub(crate) type CidMap = HashMap<
+    u64,
+    (
+        Option<Arc<RwLock<RawPktsQ>>>,
+        Arc<RwLock<HashSet<String>>>,
+        EventFd,
+    ),
+>;
 
 const NUM_QUEUES: usize = 3;
 
@@ -72,8 +82,11 @@
 /// data
 pub(crate) const VSOCK_FLAGS_SHUTDOWN_SEND: u32 = 2;
 
+/// Vsock events - `VSOCK_EVENT_TRANSPORT_RESET`: Communication has been interrupted
+pub(crate) const VSOCK_EVENT_TRANSPORT_RESET: u32 = 0;
+
 // Queue mask to select vrings.
-const QUEUE_MASK: u64 = 0b11;
+const QUEUE_MASK: u64 = 0b111;
 
 pub(crate) type Result<T> = std::result::Result<T, Error>;
 
@@ -141,6 +154,8 @@
     EmptyRawPktsQueue,
     #[error("CID already in use by another vsock device")]
     CidAlreadyInUse,
+    #[error("Failed to write to event virtqueue")]
+    EventQueueWrite,
 }
 
 impl std::convert::From<Error> for std::io::Error {
@@ -261,6 +276,7 @@ pub(crate) struct VhostUserVsockBackend {
     queues_per_thread: Vec<u64>,
     exit_consumer: EventConsumer,
     exit_notifier: EventNotifier,
+    transport_reset_pending: Arc<Mutex<bool>>,
 }
 
 impl VhostUserVsockBackend {
@@ -286,6 +302,7 @@
             queues_per_thread,
             exit_consumer,
             exit_notifier,
+            transport_reset_pending: Arc::new(Mutex::new(false)),
         })
     }
 }
@@ -310,7 +327,9 @@ impl VhostUserBackend for VhostUserVsockBackend {
     }
 
     fn protocol_features(&self) -> VhostUserProtocolFeatures {
-        VhostUserProtocolFeatures::MQ | VhostUserProtocolFeatures::CONFIG
+        VhostUserProtocolFeatures::MQ
+            | VhostUserProtocolFeatures::CONFIG
+            | VhostUserProtocolFeatures::DEVICE_STATE
     }
 
     fn set_event_idx(&self, enabled: bool) {
@@ -335,6 +354,7 @@
     ) -> IoResult<()> {
         let vring_rx = &vrings[0];
         let vring_tx = &vrings[1];
+        let vring_evt = &vrings[2];
 
         if evset != EventSet::IN {
             return Err(Error::HandleEventNotEpollIn.into());
@@ -349,7 +369,11 @@
                 thread.process_tx(vring_tx, evt_idx)?;
             }
             EVT_QUEUE_EVENT => {
-                warn!("Received an unexpected EVT_QUEUE_EVENT");
+                let reset_pending = &mut *self.transport_reset_pending.lock().unwrap();
+                if *reset_pending {
+                    thread.reset_transport(vring_evt, evt_idx)?;
+                    *reset_pending = false;
+                }
             }
             BACKEND_EVENT => {
                 thread.process_backend_evt(evset);
@@ -389,6 +413,15 @@
             return Vec::new();
         }
 
+        if offset + size == buf.len() {
+            // The last byte of the config is read when the driver is initializing or after it has
+            // processed a transport reset event. Either way, no transport reset will be pending
+            // after this. Activate all threads once it's known a reset event is not pending.
+            for thread in self.threads.iter() {
+                thread.lock().unwrap().activate();
+            }
+        }
+
         buf[offset..offset + size].to_vec()
     }
 
@@ -401,6 +434,23 @@
         let notifier = self.exit_notifier.try_clone().ok()?;
         Some((consumer, notifier))
     }
+
+    fn set_device_state_fd(
+        &self,
+        direction: VhostTransferStateDirection,
+        _phase: VhostTransferStatePhase,
+        _file: File,
+    ) -> std::result::Result<Option<File>, std::io::Error> {
+        if let VhostTransferStateDirection::LOAD = direction {
+            *self.transport_reset_pending.lock().unwrap() = true;
+        }
+        Ok(None)
+    }
+
+    fn check_device_state(&self) -> std::result::Result<(), std::io::Error> {
+        // We had nothing to read/write to the fd, so always return Ok.
+        Ok(())
+    }
 }
 
 #[cfg(test)]
@@ -436,17 +486,20 @@ mod tests {
         let vrings = [
             VringRwLock::new(mem.clone(), 0x1000).unwrap(),
             VringRwLock::new(mem.clone(), 0x2000).unwrap(),
+            VringRwLock::new(mem.clone(), 0x1000).unwrap(),
        ];
         vrings[0].set_queue_info(0x100, 0x200, 0x300).unwrap();
         vrings[0].set_queue_ready(true);
         vrings[1].set_queue_info(0x1100, 0x1200, 0x1300).unwrap();
         vrings[1].set_queue_ready(true);
+        vrings[2].set_queue_info(0x2100, 0x2200, 0x2300).unwrap();
+        vrings[2].set_queue_ready(true);
 
         backend.update_memory(mem).unwrap();
 
         let queues_per_thread = backend.queues_per_thread();
         assert_eq!(queues_per_thread.len(), 1);
-        assert_eq!(queues_per_thread[0], 0b11);
+        assert_eq!(queues_per_thread[0], 0b111);
 
         let config = backend.get_config(0, 8);
         assert_eq!(config.len(), 8);
@@ -569,6 +622,7 @@
         let vrings = [
             VringRwLock::new(mem.clone(), 0x1000).unwrap(),
             VringRwLock::new(mem.clone(), 0x2000).unwrap(),
+            VringRwLock::new(mem.clone(), 0x1000).unwrap(),
         ];
 
         backend.update_memory(mem).unwrap();
