diff --git a/kernel/build/patches/0001-net-macb-flush-PCIe-posted-write-after-TSTART-doorbe.patch b/kernel/build/patches/0001-net-macb-flush-PCIe-posted-write-after-TSTART-doorbe.patch index 70c428268..5b395776b 100644 --- a/kernel/build/patches/0001-net-macb-flush-PCIe-posted-write-after-TSTART-doorbe.patch +++ b/kernel/build/patches/0001-net-macb-flush-PCIe-posted-write-after-TSTART-doorbe.patch @@ -1,81 +1,91 @@ -From 3106d546d494f2f52ec832e7f7d04f534286e254 Mon Sep 17 00:00:00 2001 -Message-ID: <3106d546d494f2f52ec832e7f7d04f534286e254.1777064117.git.lukasz@raczylo.com> -In-Reply-To: -References: +From 0ee595ef700d4f8dee3efe3b992f31ad8ee9e7af Mon Sep 17 00:00:00 2001 From: Lukasz Raczylo -Date: Fri, 24 Apr 2026 21:50:55 +0100 -Subject: [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after - TSTART doorbell +Date: Fri, 15 May 2026 13:46:37 +0100 +Subject: [PATCH 1/3] net: macb: flush PCIe posted write after TSTART doorbell + (PCIe-only) macb_start_xmit() and macb_tx_restart() kick transmission by -OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb instances -(BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we -have in front of us), writes to NCR are posted PCIe writes: they -are not guaranteed to reach the device before the issuing CPU -returns. If the TSTART doorbell does not reach the MAC, no TX -begins, no TCOMP completion arrives, and the ring remains -quiescent without any kernel-visible indication. +OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb +instances (BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is +the case I have in front of me), writes to NCR are posted PCIe +writes: they are not guaranteed to reach the device before the +issuing CPU returns. If the TSTART doorbell does not reach the +MAC, no TX begins, no TCOMP completion arrives, and the ring +remains quiescent without any kernel-visible indication. -Note that the raspberrypi/linux vendor fork carries a local patch -around the TSTART site (a queue->tx_pending breadcrumb that is -promoted to queue->txubr_pending by the next TCOMP interrupt, -triggering macb_tx_restart()). That workaround makes the loss -recoverable under traffic, but it cannot help if TCOMP itself is -not raised because no TX started -- which is exactly the case we -are targeting here. The handshake is not present in mainline. +Add a read-back of NCR after each TSTART write. The read is an +architected PCIe read barrier for earlier posted writes on the +same path; it ensures the doorbell has reached the MAC before +the function returns. -Add a read-back of NCR after each TSTART write in macb_start_xmit() -and macb_tx_restart(). The read is an architected PCIe read -barrier for earlier posted writes on the same path; it ensures the -doorbell has reached the MAC before the functions return. - -We do not yet have direct hardware evidence that TSTART is being -lost on the RP1 path (that would require a PCIe protocol analyser, -or at minimum a before/after counter on queue->tx_stall_last_tail -with and without this patch applied in isolation). This patch is -one of a three-patch series ("candidate fixes for silent TX stall -on BCM2712/RP1"); see the cover letter for context. We have -verified the series compiles and applies cleanly against mainline -HEAD and against raspberrypi/linux rpi-6.18.y @ f2f68e79f16f; -runtime verification is pending. +The cost is one non-posted PCIe read per TSTART. To avoid +imposing this on SoC-integrated macb variants (Atmel, Microchip, +SiFive, Xilinx), where NCR is on-chip MMIO and no fabric +posted-write concern exists, gate the readback behind a new +MACB_CAPS_PCIE_POSTED_WRITES capability set only on +raspberrypi_rp1_config. Link: https://github.com/cilium/cilium/issues/43198 Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 Signed-off-by: Lukasz Raczylo --- - drivers/net/ethernet/cadence/macb_main.c | 12 ++++++++++++ - 1 file changed, 12 insertions(+) + drivers/net/ethernet/cadence/macb.h | 4 ++++ + drivers/net/ethernet/cadence/macb_main.c | 13 ++++++++++++- + 2 files changed, 16 insertions(+), 1 deletion(-) +diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h +index 0830c4897..bc2225956 100644 +--- a/drivers/net/ethernet/cadence/macb.h ++++ b/drivers/net/ethernet/cadence/macb.h +@@ -769,6 +769,10 @@ + #define MACB_CAPS_NEED_TSUCLK 0x00000400 + #define MACB_CAPS_QUEUE_DISABLE 0x00000800 + #define MACB_CAPS_QBV 0x00001000 ++/* Register writes are posted on the parent fabric and need a non-posted ++ * read-back to guarantee delivery. Currently set only on RP1. ++ */ ++#define MACB_CAPS_PCIE_POSTED_WRITES 0x00002000 + #define MACB_CAPS_PCS 0x01000000 + #define MACB_CAPS_HIGH_SPEED 0x02000000 + #define MACB_CAPS_CLK_HW_CHG 0x04000000 diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c -index a12aa2124..b6cca55ad 100644 +index 17d4a3e03..fa80236dd 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c -@@ -1922,6 +1922,13 @@ static void macb_tx_restart(struct macb_queue *queue) +@@ -1807,6 +1807,13 @@ static void macb_tx_restart(struct macb_queue *queue) spin_lock(&bp->lock); macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART)); -+ /* -+ * Flush the PCIe posted-write queue so the TSTART doorbell -+ * reliably reaches the MAC. Without this, the write can sit -+ * in the fabric and the MAC never advances, causing a silent -+ * TX stall. ++ /* On PCIe-attached parts, flush the posted-write queue so the ++ * TSTART doorbell reliably reaches the MAC. Without this the ++ * write can sit in the fabric and the MAC never advances, ++ * causing a silent TX stall. + */ -+ (void)macb_readl(bp, NCR); ++ if (bp->caps & MACB_CAPS_PCIE_POSTED_WRITES) ++ (void)macb_readl(bp, NCR); spin_unlock(&bp->lock); out_tx_ptr_unlock: -@@ -2560,6 +2567,11 @@ static netdev_tx_t macb_start_xmit(struct sk_buff *skb, struct net_device *dev) +@@ -2481,6 +2488,9 @@ static netdev_tx_t macb_start_xmit(struct sk_buff *skb, struct net_device *dev) + spin_lock(&bp->lock); - macb_tx_lpi_wake(bp); macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART)); -+ /* -+ * Flush the PCIe posted-write queue; see the comment in -+ * macb_tx_restart() for the reasoning. -+ */ -+ (void)macb_readl(bp, NCR); ++ /* Flush PCIe posted-write queue; see comment in macb_tx_restart(). */ ++ if (bp->caps & MACB_CAPS_PCIE_POSTED_WRITES) ++ (void)macb_readl(bp, NCR); spin_unlock(&bp->lock); if (CIRC_SPACE(queue->tx_head, queue->tx_tail, bp->tx_ring_size) < 1) +@@ -5474,7 +5484,8 @@ static const struct macb_config versal_config = { + static const struct macb_config raspberrypi_rp1_config = { + .caps = MACB_CAPS_GIGABIT_MODE_AVAILABLE | MACB_CAPS_CLK_HW_CHG | + MACB_CAPS_JUMBO | +- MACB_CAPS_GEM_HAS_PTP, ++ MACB_CAPS_GEM_HAS_PTP | ++ MACB_CAPS_PCIE_POSTED_WRITES, + .dma_burst_length = 16, + .clk_init = macb_clk_init, + .init = macb_init, -- -2.53.0 +2.54.0 diff --git a/kernel/build/patches/0002-net-macb-insert-PCIe-read-barrier-before-TX-completi.patch b/kernel/build/patches/0002-net-macb-insert-PCIe-read-barrier-before-TX-completi.patch new file mode 100644 index 000000000..fe018bbe3 --- /dev/null +++ b/kernel/build/patches/0002-net-macb-insert-PCIe-read-barrier-before-TX-completi.patch @@ -0,0 +1,60 @@ +From a27adeab1b08fac9ff3978d745caa536a458430b Mon Sep 17 00:00:00 2001 +From: Lukasz Raczylo +Date: Fri, 15 May 2026 13:47:20 +0100 +Subject: [PATCH 2/3] net: macb: insert PCIe read barrier before TX completion + descriptor check + +macb_tx_poll() runs with TCOMP masked, drains the TX ring, then +calls napi_complete_done() and re-enables TCOMP via IER. An +existing comment in the function notes that completions raised +while TCOMP is masked do not re-fire on IER re-enable, and +mitigates this by calling macb_tx_complete_pending(), which +inspects driver-visible ring state (descriptor->ctrl, after +rmb()) and reschedules NAPI if a completion is observable in +memory. + +On PCIe-attached parts (BCM2712 + RP1 PCIe south bridge on +Raspberry Pi 5 is the case I have in front of me), the +descriptor DMA write that sets TX_USED may not have retired to +system memory at the point macb_tx_complete_pending() runs. The +rmb() synchronises the CPU view of earlier CPU writes; it is +not sufficient to retire an in-flight peripheral DMA write. + +Add a side-effect-free MMIO read between the IER write and the +macb_tx_complete_pending() check. The read functions as an +architected PCIe read barrier for earlier peripheral-originated +DMA writes on the same path, so any in-flight TX_USED update +retires to system memory before the descriptor read. + +The register chosen is IMR (the read-only interrupt mask +mirror); reading it has no side effects on either read-clear or +W1C ISR silicon (it is not the ISR). + +Link: https://github.com/cilium/cilium/issues/43198 +Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 +Signed-off-by: Lukasz Raczylo +--- + drivers/net/ethernet/cadence/macb_main.c | 7 +++++++ + 1 file changed, 7 insertions(+) + +diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c +index fa80236dd..23120fc7c 100644 +--- a/drivers/net/ethernet/cadence/macb_main.c ++++ b/drivers/net/ethernet/cadence/macb_main.c +@@ -1868,6 +1868,13 @@ static int macb_tx_poll(struct napi_struct *napi, int budget) + * actions if an interrupt is raised just after enabling them, + * but this should be harmless. + */ ++ /* PCIe read barrier: flush any in-flight peripheral DMA ++ * writes (descriptor TX_USED updates) so the subsequent ++ * macb_tx_complete_pending() check observes them. IMR is ++ * the read-only interrupt mask mirror; the read has no ++ * side effects on either read-clear or W1C ISR silicon. ++ */ ++ (void)queue_readl(queue, IMR); + if (macb_tx_complete_pending(queue)) { + queue_writel(queue, IDR, MACB_BIT(TCOMP)); + if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE) +-- +2.54.0 + diff --git a/kernel/build/patches/0002-net-macb-re-check-ISR-after-IER-re-enable-in-macb_tx.patch b/kernel/build/patches/0002-net-macb-re-check-ISR-after-IER-re-enable-in-macb_tx.patch deleted file mode 100644 index 91948ea60..000000000 --- a/kernel/build/patches/0002-net-macb-re-check-ISR-after-IER-re-enable-in-macb_tx.patch +++ /dev/null @@ -1,106 +0,0 @@ -From 18740f04225a1778e6938ab5ecfe82087d66eb27 Mon Sep 17 00:00:00 2001 -Message-ID: <18740f04225a1778e6938ab5ecfe82087d66eb27.1777064117.git.lukasz@raczylo.com> -In-Reply-To: -References: -From: Lukasz Raczylo -Date: Fri, 24 Apr 2026 21:52:05 +0100 -Subject: [RFC PATCH net-next 2/3] net: macb: re-check ISR after IER re-enable - in macb_tx_poll - -macb_tx_poll() runs with TCOMP masked, drains the TX ring, then -calls napi_complete_done() and re-enables TCOMP via IER. An -existing comment in the function notes: - - /* Packet completions only seem to propagate to raise - * interrupts when interrupts are enabled at the time, so if - * packets were sent while interrupts were disabled, - * they will not cause another interrupt to be generated when - * interrupts are re-enabled. - */ - -and mitigates this by calling macb_tx_complete_pending(), which -inspects driver-visible ring state (descriptor->ctrl, after rmb()) -and reschedules NAPI if a completion is observable in memory. - -On PCIe-attached parts (BCM2712 + RP1 on Raspberry Pi 5 is the -setup we have in front of us), the descriptor DMA write that sets -TX_USED may not have retired to system memory at the point -macb_tx_complete_pending() runs. The rmb() synchronises the CPU -view of earlier CPU writes; it is not sufficient to retire an -in-flight peripheral DMA write. Under that ordering the in-memory -descriptor can still read TX_USED=0 when the hardware has in fact -completed the frame; the check returns false; NAPI exits; the -quirk above prevents the re-enabled IER from re-firing; the ring -goes quiescent. - -Add an explicit ISR read after the IER write. The MMIO read -serves two independent purposes: - - (1) It is an architected PCIe read barrier for earlier - peripheral-originated DMA writes on the same path, so a - subsequent macb_tx_complete_pending() observes any TX_USED - write that was in flight at the time of the barrier. - - (2) It samples the hardware ISR directly, so a TCOMP bit that - the hardware set while TCOMP was masked is visible here, - independently of whether the descriptor DMA has retired. - -If either signal indicates pending work, reschedule NAPI via the -same path as the existing check. - -This patch addresses one of three candidate races for the silent -TX stall described in the cover letter. Whether it is sufficient -by itself, or whether it requires the PCIe posted-write flush in -patch 1/3 to cover the observed behaviour, we have not yet -verified at runtime. - -Link: https://github.com/cilium/cilium/issues/43198 -Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 -Signed-off-by: Lukasz Raczylo ---- - drivers/net/ethernet/cadence/macb_main.c | 28 +++++++++++++++--------- - 1 file changed, 18 insertions(+), 10 deletions(-) - -diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c -index b6cca55ad..ea231b1c5 100644 ---- a/drivers/net/ethernet/cadence/macb_main.c -+++ b/drivers/net/ethernet/cadence/macb_main.c -@@ -1973,17 +1973,25 @@ static int macb_tx_poll(struct napi_struct *napi, int budget) - if (work_done < budget && napi_complete_done(napi, work_done)) { - queue_writel(queue, IER, MACB_BIT(TCOMP)); - -- /* Packet completions only seem to propagate to raise -- * interrupts when interrupts are enabled at the time, so if -- * packets were sent while interrupts were disabled, -- * they will not cause another interrupt to be generated when -- * interrupts are re-enabled. -- * Check for this case here to avoid losing a wakeup. This can -- * potentially race with the interrupt handler doing the same -- * actions if an interrupt is raised just after enabling them, -- * but this should be harmless. -+ /* -+ * TCOMP events that fire while the interrupt is masked do -+ * not re-fire when IER is re-enabled. Catch this two ways -+ * to avoid losing a wakeup: -+ * -+ * (1) Read ISR -- catches completions the hardware flagged -+ * but that we did not see as an interrupt. The MMIO -+ * read doubles as a PCIe read barrier, flushing any -+ * in-flight descriptor TX_USED DMA writes into memory. -+ * (2) macb_tx_complete_pending() inspects the ring after -+ * that flush, catching a descriptor whose TX_USED is -+ * now visible as a result of the barrier. -+ * -+ * This can race with the interrupt handler taking the same -+ * path if an interrupt fires just after the IER write; -+ * rescheduling NAPI in that case is harmless. - */ -- if (macb_tx_complete_pending(queue)) { -+ if ((queue_readl(queue, ISR) & MACB_BIT(TCOMP)) || -+ macb_tx_complete_pending(queue)) { - queue_writel(queue, IDR, MACB_BIT(TCOMP)); - macb_queue_isr_clear(bp, queue, MACB_BIT(TCOMP)); - netdev_vdbg(bp->dev, "TX poll: packets pending, reschedule\n"); --- -2.53.0 - diff --git a/kernel/build/patches/0003-net-macb-add-TX-stall-watchdog-as-defence-in-depth-s.patch b/kernel/build/patches/0003-net-macb-add-TX-stall-watchdog-as-defence-in-depth-s.patch deleted file mode 100644 index 8679eb4e3..000000000 --- a/kernel/build/patches/0003-net-macb-add-TX-stall-watchdog-as-defence-in-depth-s.patch +++ /dev/null @@ -1,158 +0,0 @@ -From c0469642f42ada85d91a8a685eb7c0e04cb99131 Mon Sep 17 00:00:00 2001 -Message-ID: -In-Reply-To: -References: -From: Lukasz Raczylo -Date: Fri, 24 Apr 2026 21:52:06 +0100 -Subject: [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as - defence-in-depth safety net - -Patches 1/3 and 2/3 address two candidate races that could lead -to a TCOMP completion being missed on PCIe-attached macb -instances. This patch adds a defence-in-depth safety net, in -case a further race remains that we have not identified. - -The watchdog is a per-queue delayed_work that runs once per -second. It snapshots queue->tx_tail; if the ring is non-empty -(queue->tx_head != queue->tx_tail) and tx_tail has not advanced -since the previous tick, it calls macb_tx_restart(). - -No new recovery logic is introduced. macb_tx_restart() already -exists in this file, is correctly locked (tx_ptr_lock, bp->lock), -and verifies that the hardware's TBQP is behind the driver's -head index before re-asserting TSTART. On a healthy ring it is -a no-op at the hardware level; the watchdog only supplies the -missing trigger. - -On a healthy queue the per-tick cost is one spin_lock_irqsave() -/ spin_unlock_irqrestore() and one branch. The delayed_work is -only scheduled between macb_open() and macb_close(), and is -cancelled synchronously on close. - -Context for submission: on our 24-node Raspberry Pi 5 fleet, -before this series, an out-of-band user-space watchdog -(monitoring tx_packets from /sys/class/net/.../statistics and -toggling the link down/up when it froze) was required to keep -nodes usable. We include this kernel-side watchdog as a cleaner -in-kernel equivalent for any residual stall that patches 1 and -2 do not cover. We are willing to drop this patch if the view -is that 1 and 2 should stand alone. - -Link: https://github.com/cilium/cilium/issues/43198 -Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 -Signed-off-by: Lukasz Raczylo ---- - drivers/net/ethernet/cadence/macb.h | 5 ++ - drivers/net/ethernet/cadence/macb_main.c | 59 ++++++++++++++++++++++++ - 2 files changed, 64 insertions(+) - -diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h -index 2de56017e..9115f2b47 100644 ---- a/drivers/net/ethernet/cadence/macb.h -+++ b/drivers/net/ethernet/cadence/macb.h -@@ -1278,6 +1278,11 @@ struct macb_queue { - dma_addr_t tx_ring_dma; - struct work_struct tx_error_task; - bool txubr_pending; -+ -+ /* TX stall watchdog -- see macb_tx_stall_watchdog() in macb_main.c */ -+ struct delayed_work tx_stall_watchdog_work; -+ unsigned int tx_stall_last_tail; -+ - struct napi_struct napi_tx; - - dma_addr_t rx_ring_dma; -diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c -index ea231b1c5..ea2306ef7 100644 ---- a/drivers/net/ethernet/cadence/macb_main.c -+++ b/drivers/net/ethernet/cadence/macb_main.c -@@ -2002,6 +2002,59 @@ static int macb_tx_poll(struct napi_struct *napi, int budget) - return work_done; - } - -+#define MACB_TX_STALL_INTERVAL_MS 1000 -+ -+/* -+ * TX stall watchdog. -+ * -+ * Defence-in-depth against lost TCOMP interrupts. macb already has a -+ * recovery chain (tx_pending -> txubr_pending -> macb_tx_restart()) -+ * that fires on TCOMP; if TCOMP itself is lost the TX ring stalls -+ * silently until something else kicks TSTART. This watchdog runs -+ * once per second per queue, snapshots tx_tail, and calls -+ * macb_tx_restart() if the ring is non-empty and tx_tail has not -+ * advanced since the previous tick. -+ * -+ * macb_tx_restart() already checks the hardware's TBQP against the -+ * driver's head index before re-asserting TSTART, so on a healthy -+ * ring this is a no-op at the hardware level. The watchdog only -+ * adds the missing trigger. -+ */ -+static void macb_tx_stall_watchdog(struct work_struct *work) -+{ -+ struct macb_queue *queue = container_of(to_delayed_work(work), -+ struct macb_queue, -+ tx_stall_watchdog_work); -+ struct macb *bp = queue->bp; -+ unsigned int cur_tail, cur_head; -+ bool stalled = false; -+ unsigned long flags; -+ -+ if (!netif_running(bp->dev)) -+ return; -+ -+ spin_lock_irqsave(&queue->tx_ptr_lock, flags); -+ cur_tail = queue->tx_tail; -+ cur_head = queue->tx_head; -+ if (cur_head != cur_tail && -+ cur_tail == queue->tx_stall_last_tail) -+ stalled = true; -+ else -+ queue->tx_stall_last_tail = cur_tail; -+ spin_unlock_irqrestore(&queue->tx_ptr_lock, flags); -+ -+ if (stalled) { -+ netdev_warn_once(bp->dev, -+ "TX stall detected on queue %u (tail=%u head=%u); re-kicking TSTART\n", -+ (unsigned int)(queue - bp->queues), -+ cur_tail, cur_head); -+ macb_tx_restart(queue); -+ } -+ -+ schedule_delayed_work(&queue->tx_stall_watchdog_work, -+ msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS)); -+} -+ - static void macb_hresp_error_task(struct work_struct *work) - { - struct macb *bp = from_work(bp, work, hresp_err_bh_work); -@@ -3190,6 +3243,9 @@ static int macb_open(struct net_device *dev) - for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) { - napi_enable(&queue->napi_rx); - napi_enable(&queue->napi_tx); -+ queue->tx_stall_last_tail = queue->tx_tail; -+ schedule_delayed_work(&queue->tx_stall_watchdog_work, -+ msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS)); - } - - macb_init_hw(bp); -@@ -3240,6 +3296,7 @@ static int macb_close(struct net_device *dev) - for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) { - napi_disable(&queue->napi_rx); - napi_disable(&queue->napi_tx); -+ cancel_delayed_work_sync(&queue->tx_stall_watchdog_work); - netdev_tx_reset_queue(netdev_get_tx_queue(dev, q)); - } - -@@ -4802,6 +4859,8 @@ static int macb_init_dflt(struct platform_device *pdev) - } - - INIT_WORK(&queue->tx_error_task, macb_tx_error_task); -+ INIT_DELAYED_WORK(&queue->tx_stall_watchdog_work, -+ macb_tx_stall_watchdog); - q++; - } - --- -2.53.0 - diff --git a/kernel/build/patches/0003-net-macb-add-TX-stall-watchdog-to-recover-from-lost-.patch b/kernel/build/patches/0003-net-macb-add-TX-stall-watchdog-to-recover-from-lost-.patch new file mode 100644 index 000000000..10e6ace07 --- /dev/null +++ b/kernel/build/patches/0003-net-macb-add-TX-stall-watchdog-to-recover-from-lost-.patch @@ -0,0 +1,181 @@ +From 3666d28a1a4ad8d21dd7ad2d4d654dc4fd719ec7 Mon Sep 17 00:00:00 2001 +From: Lukasz Raczylo +Date: Fri, 15 May 2026 13:49:03 +0100 +Subject: [PATCH 3/3] net: macb: add TX stall watchdog to recover from lost + TCOMP interrupts + +On PCIe-attached macb instances (BCM2712 + RP1 PCIe south bridge +on Raspberry Pi 5 is the case I have in front of me), a TCOMP +interrupt can be lost: the TSTART doorbell can be lost in the +posted-write fabric (addressed by an earlier patch), or the +descriptor TX_USED DMA write can be observed late by the driver +(also addressed earlier). When that happens the TX ring stalls +silently until something else kicks TSTART. + +Add a per-queue delayed_work that runs once per second. It +detects forward progress on the TX completion path via a per-queue +bool tx_stall_tail_moved that macb_tx_complete() sets when tx_tail +advances and the watchdog clears on each tick. If the ring is +non-empty and the flag is unset when the tick runs, the watchdog +calls the existing macb_tx_restart() to re-assert TSTART. + +The bool form sidesteps any concern about ring-index aliasing +between ticks and is the form suggested by Phil Elwell when +reviewing the same series anchored against raspberrypi/linux +rpi-6.18.y (PR #7340, merged 2026-05-08). + +A netif_carrier_ok() gate skips the check when there is no +carrier (no completion is possible without a link), eliminating +a boot-time false positive where queue->tx_head can advance from +kernel-queued packets between macb_open() and link autoneg +completion. + +The stall warn is wrapped in printk_ratelimit() so operators can +count occurrences across the lifetime of the netdev while bounding +log noise. + +Link: https://github.com/cilium/cilium/issues/43198 +Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 +Link: https://github.com/raspberrypi/linux/pull/7340 +Signed-off-by: Lukasz Raczylo +--- + drivers/net/ethernet/cadence/macb.h | 10 ++++ + drivers/net/ethernet/cadence/macb_main.c | 73 ++++++++++++++++++++++++ + 2 files changed, 83 insertions(+) + +diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h +index bc2225956..a24bf5ba2 100644 +--- a/drivers/net/ethernet/cadence/macb.h ++++ b/drivers/net/ethernet/cadence/macb.h +@@ -1264,6 +1264,16 @@ struct macb_queue { + dma_addr_t tx_ring_dma; + struct work_struct tx_error_task; + bool txubr_pending; ++ ++ /* TX stall watchdog -- see macb_tx_stall_watchdog() in macb_main.c. ++ * tx_stall_tail_moved is set by macb_tx_complete() when tx_tail ++ * advances and cleared by the watchdog tick on each pass (both ++ * under tx_ptr_lock). Using a bool sidesteps any ring-index ++ * aliasing concern between ticks. ++ */ ++ struct delayed_work tx_stall_watchdog_work; ++ bool tx_stall_tail_moved; ++ + struct napi_struct napi_tx; + + dma_addr_t rx_ring_dma; +diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c +index 23120fc7c..4feb475f7 100644 +--- a/drivers/net/ethernet/cadence/macb_main.c ++++ b/drivers/net/ethernet/cadence/macb_main.c +@@ -1371,6 +1371,8 @@ static int macb_tx_complete(struct macb_queue *queue, int budget) + packets, bytes); + + queue->tx_tail = tail; ++ if (packets) ++ queue->tx_stall_tail_moved = true; + if (__netif_subqueue_stopped(bp->dev, queue_index) && + CIRC_CNT(queue->tx_head, queue->tx_tail, + bp->tx_ring_size) <= MACB_TX_WAKEUP_THRESH(bp)) +@@ -1887,6 +1889,71 @@ static int macb_tx_poll(struct napi_struct *napi, int budget) + return work_done; + } + ++#define MACB_TX_STALL_INTERVAL_MS 1000 ++ ++/* ++ * TX stall watchdog. ++ * ++ * Recovers from lost TCOMP interrupts on PCIe-attached macb ++ * instances. macb already has a recovery chain ++ * (txubr_pending -> macb_tx_restart()) that fires on TCOMP; if ++ * TCOMP itself is lost the TX ring stalls silently until something ++ * else kicks TSTART. This watchdog runs once per second per queue ++ * and calls macb_tx_restart() if the ring is non-empty and ++ * tx_tail has not advanced since the previous tick. ++ * ++ * Movement is tracked via the tx_stall_tail_moved boolean rather ++ * than a tx_tail snapshot, sidestepping any ring-index aliasing ++ * concern. The bool is set by macb_tx_complete() when tx_tail ++ * advances and cleared here on each tick; both writes are under ++ * tx_ptr_lock so no atomic is required. ++ * ++ * macb_tx_restart() already checks the hardware's TBQP against ++ * the driver's head index before re-asserting TSTART, so on a ++ * healthy ring this is a no-op at the hardware level. The ++ * watchdog only supplies the missing trigger. ++ */ ++static void macb_tx_stall_watchdog(struct work_struct *work) ++{ ++ struct macb_queue *queue = container_of(to_delayed_work(work), ++ struct macb_queue, ++ tx_stall_watchdog_work); ++ struct macb *bp = queue->bp; ++ unsigned int cur_tail, cur_head; ++ bool stalled = false; ++ unsigned long flags; ++ ++ if (!netif_running(bp->dev)) ++ return; ++ ++ /* No carrier => no completion is possible. Skip the check ++ * but keep the watchdog ticking for when carrier comes up. ++ */ ++ if (!netif_carrier_ok(bp->dev)) ++ goto reschedule; ++ ++ spin_lock_irqsave(&queue->tx_ptr_lock, flags); ++ cur_tail = queue->tx_tail; ++ cur_head = queue->tx_head; ++ if (cur_head != cur_tail && !queue->tx_stall_tail_moved) ++ stalled = true; ++ queue->tx_stall_tail_moved = false; ++ spin_unlock_irqrestore(&queue->tx_ptr_lock, flags); ++ ++ if (stalled) { ++ if (printk_ratelimit()) ++ netdev_warn(bp->dev, ++ "TX stall detected on queue %u (tail=%u head=%u); re-kicking TSTART\n", ++ (unsigned int)(queue - bp->queues), ++ cur_tail, cur_head); ++ macb_tx_restart(queue); ++ } ++ ++reschedule: ++ schedule_delayed_work(&queue->tx_stall_watchdog_work, ++ msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS)); ++} ++ + static void macb_hresp_error_task(struct work_struct *work) + { + struct macb *bp = from_work(bp, work, hresp_err_bh_work); +@@ -3109,6 +3176,9 @@ static int macb_open(struct net_device *dev) + for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) { + napi_enable(&queue->napi_rx); + napi_enable(&queue->napi_tx); ++ queue->tx_stall_tail_moved = true; ++ schedule_delayed_work(&queue->tx_stall_watchdog_work, ++ msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS)); + } + + macb_init_hw(bp); +@@ -3155,6 +3225,7 @@ static int macb_close(struct net_device *dev) + for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) { + napi_disable(&queue->napi_rx); + napi_disable(&queue->napi_tx); ++ cancel_delayed_work_sync(&queue->tx_stall_watchdog_work); + netdev_tx_reset_queue(netdev_get_tx_queue(dev, q)); + } + +@@ -4685,6 +4756,8 @@ static int macb_init(struct platform_device *pdev) + } + + INIT_WORK(&queue->tx_error_task, macb_tx_error_task); ++ INIT_DELAYED_WORK(&queue->tx_stall_watchdog_work, ++ macb_tx_stall_watchdog); + q++; + } + +-- +2.54.0 + diff --git a/kernel/build/patches/README.md b/kernel/build/patches/README.md index 73441e2a9..2ee36c7dc 100644 --- a/kernel/build/patches/README.md +++ b/kernel/build/patches/README.md @@ -1,7 +1,7 @@ | Patch file | Description | Upstream status | Link | |----------------------------------------------------------------------------|--------------------------------------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `0001-net-macb-flush-PCIe-posted-write-after-TSTART-doorbe.patch` | macb: flush PCIe posted write after TSTART doorbell (silent TX stall on BCM2712/RP1) | RFC submitted to netdev | [lore thread](https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/) | -| `0002-net-macb-re-check-ISR-after-IER-re-enable-in-macb_tx.patch` | macb: re-check ISR after IER re-enable in `macb_tx_poll()` to catch TCOMP raised inside the IDR/IER mask window | RFC submitted to netdev | [lore thread](https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/) | -| `0003-net-macb-add-TX-stall-watchdog-as-defence-in-depth-s.patch` | macb: per-queue `delayed_work` watchdog that calls `macb_tx_restart()` if `tx_tail` hasn't advanced for ≥ 1 s | RFC submitted to netdev | [lore thread](https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/) | +| `0001-net-macb-flush-PCIe-posted-write-after-TSTART-doorbe.patch` | macb: flush PCIe posted write after TSTART doorbell — gated behind new `MACB_CAPS_PCIE_POSTED_WRITES` cap (RP1-only) so SoC-integrated variants don't pay the readback cost | v2 submitted to netdev | [v2 thread](https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/) | +| `0002-net-macb-insert-PCIe-read-barrier-before-TX-completi.patch` | macb: insert non-destructive PCIe read barrier (`queue_readl(queue, IMR)`) before `macb_tx_complete_pending()` in `macb_tx_poll()`. Replaces the v1 ISR-read form which was destructive on read-clear silicon (RP1) — that read silently consumed RCOMP / ROVR / TXUBR bits, causing silent RX-completion loss at moderate-to-heavy load | v2 submitted to netdev | [v2 thread](https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/) | +| `0003-net-macb-add-TX-stall-watchdog-to-recover-from-lost-.patch` | macb: per-queue `delayed_work` watchdog that calls `macb_tx_restart()` if tx_tail hasn't advanced. v2 uses a `bool tx_stall_tail_moved` flag (pelwell-suggested form) instead of a tx_tail snapshot, gates the check on `netif_carrier_ok()` to eliminate a boot-time false positive, and wraps the stall-warn in `if (printk_ratelimit()) netdev_warn(...)` so events stay observable while bounded | v2 submitted to netdev | [v2 thread](https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/) · [v2 patch 3 build-fix](https://lore.kernel.org/netdev/20260515095336.92237-1-lukasz@raczylo.com/T/) | | `0004-PCI-prevent-shrink-bridge-window.patch` | PCI: prevent `adjust_bridge_window()` from shrinking a bridge window below the size required by `pbus_size_mem()` — fixes large-BAR / eGPU resource starvation | Merged to mainline v6.19, candidate for 6.18.y stable backport | [lore patch](https://patch.msgid.link/20260219153951.68869-1-ilpo.jarvinen@linux.intel.com) | | `0005-PCI-fix-premature-removal-realloc-head.patch` | PCI: reorder checks in `reassign_resources_sorted()` so entries aren't dropped from `realloc_head` before being processed — preserves `add_size` for deeper PCI hierarchy levels | Merged to mainline v6.19, candidate for 6.18.y stable backport | [lore patch](https://patch.msgid.link/20260313084551.1934-1-ilpo.jarvinen@linux.intel.com) |