Skip to content

Fix disk partitioning race condition between the kernel and partx#2234

Draft
chewi wants to merge 2 commits into
coreos:mainfrom
chewi:partx-race
Draft

Fix disk partitioning race condition between the kernel and partx#2234
chewi wants to merge 2 commits into
coreos:mainfrom
chewi:partx-race

Conversation

@chewi
Copy link
Copy Markdown
Contributor

@chewi chewi commented Jun 1, 2026

I've admittedly lost the output of what happens when this fails, but this is what was going on underneath.

$ partx --add --nr 1 --verbose /dev/nvme0n1
partition: none, disk: /dev/nvme0n1, lower: 1, upper: 1
/dev/nvme0n1: partition table type 'gpt' detected
range recount: max partno=9, lower=1, upper=1
partx: /dev/nvme0n1: adding partition #1 failed: Device or resource busy
partx: /dev/nvme0n1: error adding partition 1

We started reliably seeing this in Flatcar after some batch updates. We don't know exactly which update triggered it, but it was probably systemd. While we could have looked into systemd's changes, I strongly felt that this code was always potentially racy. There was never anything stopping the kernel picking up the partition changes before partx had a chance to run.

This change therefore allows partx to fail and then checks that added/updated partitions have the right start sector and size and that deleted partitions are absent once udev has settled.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the disk partitioning logic to validate partition starts and sizes against sysfs after partitioning, and updates partition number types from uint64 to int. However, several critical issues were identified in the review. Changing getRealStartAndSize to return a map instead of a slice introduces non-deterministic ordering and causes partition overwrites when multiple partitions use number 0. Additionally, the new validation logic, partx commands, and deletion checks do not account for partition 0 (which represents the next available partition and does not exist in sysfs or /dev), leading to potential runtime failures. Reverting the map changes to slices and skipping partition 0 in sysfs/partx operations is recommended.

// and end sector should be. It runs sgdisk --pretend to determine what the partitions would look like if
// everything specified were to be (re)created.
func (s stage) getRealStartAndSize(dev types.Disk, devAlias string, diskInfo util.DiskInfo) ([]sgdisk.Partition, error) {
func (s stage) getRealStartAndSize(dev types.Disk, devAlias string, diskInfo util.DiskInfo) (map[int]sgdisk.Partition, error) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Changing the return type of getRealStartAndSize to a map introduces two critical issues:

  1. Map keys must be unique, so if multiple partitions have Number: 0 (which is a valid configuration representing the next available partition number), they will overwrite each other, and only one will be returned.
  2. Iterating over a map in Go is non-deterministic. This breaks the requirement that partitions are processed in a specific order (specifically, partition 0 must be processed last, which is enforced by sorting dev.Partitions before calling this function).

We should revert this function to return a slice []sgdisk.Partition.

Suggested change
func (s stage) getRealStartAndSize(dev types.Disk, devAlias string, diskInfo util.DiskInfo) (map[int]sgdisk.Partition, error) {
func (s stage) getRealStartAndSize(dev types.Disk, devAlias string, diskInfo util.DiskInfo) ([]sgdisk.Partition, error) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being unaware of this issue beforehand. However, I have reasoned it through, done some testing, and found that even the existing code doesn't work with number: 0 because partx still gets invoked with --add --nr 0, which results in a fatal error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should add that this has been broken for a long time, since 2.20.0.

}

result := []sgdisk.Partition{}
result := map[int]sgdisk.Partition{}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Revert the result initialization to a slice to preserve duplicate partition numbers (like 0) and maintain deterministic ordering.

Suggested change
result := map[int]sgdisk.Partition{}
result := []sgdisk.Partition{}

}
}
result = append(result, part)
result[part.Number] = part
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Revert the map assignment to a slice append.

Suggested change
result[part.Number] = part
result = append(result, part)

Comment on lines +583 to +586
for _, partNum := range append(append([]int{}, partxAdd...), partxUpdate...) {
part := resolvedPartitions[partNum]
partDev := fmt.Sprintf("%s%s%d", blockDevResolved, prefix, partNum)
sysBlockDir := fmt.Sprintf("/sys/class/block/%s/", filepath.Base(partDev))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Since resolvedPartitions should remain a slice, we can build a temporary map of non-zero partition numbers for lookup during validation. Additionally, we must skip validation for partition number 0 because partition 0 does not exist in sysfs (the kernel assigns it a real, non-zero partition number), and attempting to read /sys/class/block/<dev>0/start will always fail with a fatal error.

Suggested change
for _, partNum := range append(append([]int{}, partxAdd...), partxUpdate...) {
part := resolvedPartitions[partNum]
partDev := fmt.Sprintf("%s%s%d", blockDevResolved, prefix, partNum)
sysBlockDir := fmt.Sprintf("/sys/class/block/%s/", filepath.Base(partDev))
resolvedPartitionsMap := make(map[int]sgdisk.Partition)
for _, part := range resolvedPartitions {
if part.Number != 0 {
resolvedPartitionsMap[part.Number] = part
}
}
for _, partNum := range append(partxAdd, partxUpdate...) {
if partNum == 0 {
continue
}
part := resolvedPartitionsMap[partNum]
partDev := fmt.Sprintf("%s%s%d", blockDevResolved, prefix, partNum)
sysBlockDir := fmt.Sprintf("/sys/class/block/%s/", filepath.Base(partDev))

Comment on lines 561 to 573
runPartxCommand := func(op string, partitions []int) error {
for _, partNr := range partitions {
cmd := exec.Command(distro.PartxCmd(), "--"+op, "--nr", strconv.FormatUint(partNr, 10), blockDevResolved)
cmd := exec.Command(distro.PartxCmd(), "--"+op, "--nr", fmt.Sprint(partNr), blockDevResolved)
if _, err := s.LogCmd(cmd, "triggering partition %d %s on %q", partNr, op, devAlias); err != nil {
return fmt.Errorf("partition %s failed: %v", op, err)
}
}
return nil
}
if err := runPartxCommand("delete", partxDelete); err != nil {
return err
}
if err := runPartxCommand("update", partxUpdate); err != nil {
return err
}
if err := runPartxCommand("add", partxAdd); err != nil {
return err
}
runPartxCommand("delete", partxDelete)
runPartxCommand("update", partxUpdate)
runPartxCommand("add", partxAdd)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since the error returned by runPartxCommand is now ignored, we should simplify the function signature to not return an error. Additionally, we should skip partition number 0 because partx does not support partition number 0 (partition numbers are 1-based).

Suggested change
runPartxCommand := func(op string, partitions []int) error {
for _, partNr := range partitions {
cmd := exec.Command(distro.PartxCmd(), "--"+op, "--nr", strconv.FormatUint(partNr, 10), blockDevResolved)
cmd := exec.Command(distro.PartxCmd(), "--"+op, "--nr", fmt.Sprint(partNr), blockDevResolved)
if _, err := s.LogCmd(cmd, "triggering partition %d %s on %q", partNr, op, devAlias); err != nil {
return fmt.Errorf("partition %s failed: %v", op, err)
}
}
return nil
}
if err := runPartxCommand("delete", partxDelete); err != nil {
return err
}
if err := runPartxCommand("update", partxUpdate); err != nil {
return err
}
if err := runPartxCommand("add", partxAdd); err != nil {
return err
}
runPartxCommand("delete", partxDelete)
runPartxCommand("update", partxUpdate)
runPartxCommand("add", partxAdd)
}
runPartxCommand := func(op string, partitions []int) {
for _, partNr := range partitions {
if partNr == 0 {
continue
}
cmd := exec.Command(distro.PartxCmd(), "--"+op, "--nr", fmt.Sprint(partNr), blockDevResolved)
_, _ = s.LogCmd(cmd, "triggering partition %d %s on %q", partNr, op, devAlias)
}
}
runPartxCommand("delete", partxDelete)
runPartxCommand("update", partxUpdate)
runPartxCommand("add", partxAdd)

Comment on lines +623 to +624
for _, partNum := range partxDelete {
partDev := fmt.Sprintf("%s%s%d", blockDevResolved, prefix, partNum)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Skip partition number 0 when checking for deleted partitions, as /dev/sda0 (or similar) is not a valid device node and does not exist.

	for _, partNum := range partxDelete {
		if partNum == 0 {
			continue
		}
		partDev := fmt.Sprintf("%s%s%d", blockDevResolved, prefix, partNum)

@chewi chewi marked this pull request as draft June 1, 2026 18:51
chewi added 2 commits June 2, 2026 16:04
This was broken since partx was used in commit c2cc56c. Passing 0 to
partx causes it to try and add all the partitions, which will almost
always fail because the kernel will usually already know about at least
some of them.

This changes getRealStartAndSize() to also determine and return the
resulting partition numbers so that subsequent operations use these
instead of 0.

sgdisk does support --new=0, but it has no way to report which partition
number it actually used.

Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
`partx --add` will fail if the kernel is already aware of the new
partition. It was always theoretically possible that udev might trigger
early, and that appears to be happening now.

Allow partx to fail and then check that added/updated partitions have
the right start sector and size and that deleted partitions are absent
once udev has settled.

Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant