Fix disk partitioning race condition between the kernel and partx#2234
Fix disk partitioning race condition between the kernel and partx#2234chewi wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request modifies the disk partitioning logic to validate partition starts and sizes against sysfs after partitioning, and updates partition number types from uint64 to int. However, several critical issues were identified in the review. Changing getRealStartAndSize to return a map instead of a slice introduces non-deterministic ordering and causes partition overwrites when multiple partitions use number 0. Additionally, the new validation logic, partx commands, and deletion checks do not account for partition 0 (which represents the next available partition and does not exist in sysfs or /dev), leading to potential runtime failures. Reverting the map changes to slices and skipping partition 0 in sysfs/partx operations is recommended.
| // and end sector should be. It runs sgdisk --pretend to determine what the partitions would look like if | ||
| // everything specified were to be (re)created. | ||
| func (s stage) getRealStartAndSize(dev types.Disk, devAlias string, diskInfo util.DiskInfo) ([]sgdisk.Partition, error) { | ||
| func (s stage) getRealStartAndSize(dev types.Disk, devAlias string, diskInfo util.DiskInfo) (map[int]sgdisk.Partition, error) { |
There was a problem hiding this comment.
Changing the return type of getRealStartAndSize to a map introduces two critical issues:
- Map keys must be unique, so if multiple partitions have
Number: 0(which is a valid configuration representing the next available partition number), they will overwrite each other, and only one will be returned. - Iterating over a map in Go is non-deterministic. This breaks the requirement that partitions are processed in a specific order (specifically, partition 0 must be processed last, which is enforced by sorting
dev.Partitionsbefore calling this function).
We should revert this function to return a slice []sgdisk.Partition.
| func (s stage) getRealStartAndSize(dev types.Disk, devAlias string, diskInfo util.DiskInfo) (map[int]sgdisk.Partition, error) { | |
| func (s stage) getRealStartAndSize(dev types.Disk, devAlias string, diskInfo util.DiskInfo) ([]sgdisk.Partition, error) { |
There was a problem hiding this comment.
Sorry for being unaware of this issue beforehand. However, I have reasoned it through, done some testing, and found that even the existing code doesn't work with number: 0 because partx still gets invoked with --add --nr 0, which results in a fatal error.
There was a problem hiding this comment.
I should add that this has been broken for a long time, since 2.20.0.
| } | ||
|
|
||
| result := []sgdisk.Partition{} | ||
| result := map[int]sgdisk.Partition{} |
| } | ||
| } | ||
| result = append(result, part) | ||
| result[part.Number] = part |
| for _, partNum := range append(append([]int{}, partxAdd...), partxUpdate...) { | ||
| part := resolvedPartitions[partNum] | ||
| partDev := fmt.Sprintf("%s%s%d", blockDevResolved, prefix, partNum) | ||
| sysBlockDir := fmt.Sprintf("/sys/class/block/%s/", filepath.Base(partDev)) |
There was a problem hiding this comment.
Since resolvedPartitions should remain a slice, we can build a temporary map of non-zero partition numbers for lookup during validation. Additionally, we must skip validation for partition number 0 because partition 0 does not exist in sysfs (the kernel assigns it a real, non-zero partition number), and attempting to read /sys/class/block/<dev>0/start will always fail with a fatal error.
| for _, partNum := range append(append([]int{}, partxAdd...), partxUpdate...) { | |
| part := resolvedPartitions[partNum] | |
| partDev := fmt.Sprintf("%s%s%d", blockDevResolved, prefix, partNum) | |
| sysBlockDir := fmt.Sprintf("/sys/class/block/%s/", filepath.Base(partDev)) | |
| resolvedPartitionsMap := make(map[int]sgdisk.Partition) | |
| for _, part := range resolvedPartitions { | |
| if part.Number != 0 { | |
| resolvedPartitionsMap[part.Number] = part | |
| } | |
| } | |
| for _, partNum := range append(partxAdd, partxUpdate...) { | |
| if partNum == 0 { | |
| continue | |
| } | |
| part := resolvedPartitionsMap[partNum] | |
| partDev := fmt.Sprintf("%s%s%d", blockDevResolved, prefix, partNum) | |
| sysBlockDir := fmt.Sprintf("/sys/class/block/%s/", filepath.Base(partDev)) |
| runPartxCommand := func(op string, partitions []int) error { | ||
| for _, partNr := range partitions { | ||
| cmd := exec.Command(distro.PartxCmd(), "--"+op, "--nr", strconv.FormatUint(partNr, 10), blockDevResolved) | ||
| cmd := exec.Command(distro.PartxCmd(), "--"+op, "--nr", fmt.Sprint(partNr), blockDevResolved) | ||
| if _, err := s.LogCmd(cmd, "triggering partition %d %s on %q", partNr, op, devAlias); err != nil { | ||
| return fmt.Errorf("partition %s failed: %v", op, err) | ||
| } | ||
| } | ||
| return nil | ||
| } | ||
| if err := runPartxCommand("delete", partxDelete); err != nil { | ||
| return err | ||
| } | ||
| if err := runPartxCommand("update", partxUpdate); err != nil { | ||
| return err | ||
| } | ||
| if err := runPartxCommand("add", partxAdd); err != nil { | ||
| return err | ||
| } | ||
| runPartxCommand("delete", partxDelete) | ||
| runPartxCommand("update", partxUpdate) | ||
| runPartxCommand("add", partxAdd) | ||
| } |
There was a problem hiding this comment.
Since the error returned by runPartxCommand is now ignored, we should simplify the function signature to not return an error. Additionally, we should skip partition number 0 because partx does not support partition number 0 (partition numbers are 1-based).
| runPartxCommand := func(op string, partitions []int) error { | |
| for _, partNr := range partitions { | |
| cmd := exec.Command(distro.PartxCmd(), "--"+op, "--nr", strconv.FormatUint(partNr, 10), blockDevResolved) | |
| cmd := exec.Command(distro.PartxCmd(), "--"+op, "--nr", fmt.Sprint(partNr), blockDevResolved) | |
| if _, err := s.LogCmd(cmd, "triggering partition %d %s on %q", partNr, op, devAlias); err != nil { | |
| return fmt.Errorf("partition %s failed: %v", op, err) | |
| } | |
| } | |
| return nil | |
| } | |
| if err := runPartxCommand("delete", partxDelete); err != nil { | |
| return err | |
| } | |
| if err := runPartxCommand("update", partxUpdate); err != nil { | |
| return err | |
| } | |
| if err := runPartxCommand("add", partxAdd); err != nil { | |
| return err | |
| } | |
| runPartxCommand("delete", partxDelete) | |
| runPartxCommand("update", partxUpdate) | |
| runPartxCommand("add", partxAdd) | |
| } | |
| runPartxCommand := func(op string, partitions []int) { | |
| for _, partNr := range partitions { | |
| if partNr == 0 { | |
| continue | |
| } | |
| cmd := exec.Command(distro.PartxCmd(), "--"+op, "--nr", fmt.Sprint(partNr), blockDevResolved) | |
| _, _ = s.LogCmd(cmd, "triggering partition %d %s on %q", partNr, op, devAlias) | |
| } | |
| } | |
| runPartxCommand("delete", partxDelete) | |
| runPartxCommand("update", partxUpdate) | |
| runPartxCommand("add", partxAdd) |
| for _, partNum := range partxDelete { | ||
| partDev := fmt.Sprintf("%s%s%d", blockDevResolved, prefix, partNum) |
There was a problem hiding this comment.
This was broken since partx was used in commit c2cc56c. Passing 0 to partx causes it to try and add all the partitions, which will almost always fail because the kernel will usually already know about at least some of them. This changes getRealStartAndSize() to also determine and return the resulting partition numbers so that subsequent operations use these instead of 0. sgdisk does support --new=0, but it has no way to report which partition number it actually used. Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
`partx --add` will fail if the kernel is already aware of the new partition. It was always theoretically possible that udev might trigger early, and that appears to be happening now. Allow partx to fail and then check that added/updated partitions have the right start sector and size and that deleted partitions are absent once udev has settled. Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
I've admittedly lost the output of what happens when this fails, but this is what was going on underneath.
We started reliably seeing this in Flatcar after some batch updates. We don't know exactly which update triggered it, but it was probably systemd. While we could have looked into systemd's changes, I strongly felt that this code was always potentially racy. There was never anything stopping the kernel picking up the partition changes before partx had a chance to run.
This change therefore allows partx to fail and then checks that added/updated partitions have the right start sector and size and that deleted partitions are absent once udev has settled.