feat: add transparent-tunnel CNI mode (Linux) #4319
alam-tahmid wants to merge 1 commit into Azure:master from
Conversation
Pull request overview
This PR introduces GlobalPodSecurity configuration plumbing to carry a new boolean knob from CNI network config into endpoint metadata, and updates default CNI conflists to surface the option (defaulting to false).
Changes:
- Added `globalPodSecurity` to the CNI `NetworkConfig` (JSON) and plumbed it into `network.EndpointInfo`.
- Extended `network` endpoint-related structs to carry `GlobalPodSecurity`.
- Added unit tests for config unmarshalling and `createEpInfo` propagation, and updated the default Linux/Windows conflists to include the flag (set to `false`).
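For orientation, a minimal sketch of what the plumbing listed above might look like; the field and JSON key names come from this summary, but the type layouts and helper name are assumptions rather than the PR's exact code:

```go
// cni/netconfig.go (sketch): the new JSON knob on the CNI network config.
type NetworkConfig struct {
	// ... existing fields ...
	GlobalPodSecurity bool `json:"globalPodSecurity"` // absent in JSON => false
}

// network/endpoint.go (sketch): carried on endpoint metadata.
type EndpointInfo struct {
	// ... existing fields ...
	GlobalPodSecurity bool
}

// cni/network/network.go (sketch): createEpInfo is where the PR copies the
// flag across; a stand-in helper is shown here instead of its real signature.
func newEndpointInfoFromConfig(nwCfg *NetworkConfig) *EndpointInfo {
	return &EndpointInfo{GlobalPodSecurity: nwCfg.GlobalPodSecurity}
}
```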
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `network/endpoint.go` | Adds `GlobalPodSecurity` fields to the `endpoint` and `EndpointInfo` structs. |
| `cni/network/network.go` | Wires `NetworkConfig.GlobalPodSecurity` into the generated `EndpointInfo`. |
| `cni/network/network_test.go` | Adds coverage to ensure `createEpInfo` propagates the flag into `EndpointInfo`. |
| `cni/netconfig.go` | Adds `GlobalPodSecurity` to the CNI JSON config (`globalPodSecurity`). |
| `cni/netconfig_test.go` | Adds JSON unmarshal tests for `globalPodSecurity` defaulting/values. |
| `cni/azure-windows.conflist` | Adds `"globalPodSecurity": false` to the default Windows conflist. |
| `cni/azure-linux.conflist` | Adds `"globalPodSecurity": false` to the default Linux conflist. |
Force-pushed from b7af504 to 0d1ce9f
QxBytes left a comment:
Assuming your iptables changes will be going into the existing transparent client endpoint code (unless it's a new scenario)? Also assuming this change affects nodesubnet (no Azure CNS present)?
Force-pushed from 0d1ce9f to a951bf0
Force-pushed from e051e3d to ecbdd88
QxBytes left a comment:
Just a note that for transparent-vlan I needed to run something for rp_filter. If you already tested and it works, then it should be fine, but just bringing it to your attention in case something pops up in the future.
Also, ExecuteRawCommand should be fine since you control the command input, but just noting it for future reference.
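In case it helps later, a minimal, hypothetical sketch of relaxing reverse-path filtering on an interface (not part of this PR; the interface name and the mode value needed would depend on the actual routing setup):

```go
package netutil

import (
	"fmt"
	"os"
)

// setRPFilter writes the rp_filter mode for one interface:
// 0 = off, 1 = strict, 2 = loose. Setups with asymmetric paths (as mentioned
// for transparent-vlan) may need loose mode so return traffic isn't dropped.
func setRPFilter(iface string, mode int) error {
	path := fmt.Sprintf("/proc/sys/net/ipv4/conf/%s/rp_filter", iface)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", mode)), 0o644)
}
```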
Force-pushed from ecbdd88 to 6323fa7
```go
// 2. Fwmark MARK rule — append so it comes after RETURN rules.
markMatch := "-i " + hostVeth
markTarget := "MARK --set-mark " + markStr
if err := client.iptablesClient.AppendIptableRule(
	iptables.V4, iptables.Mangle, iptables.Prerouting, markMatch, markTarget,
); err != nil {
	return errors.Wrap(err, "failed to append GPS fwmark MARK rule")
}
logger.Info("GPS: added fwmark MARK rule",
	zap.String("veth", hostVeth), zap.String("mark", markStr))
```
Why do we need an iptables rule to mark the packet? Can we not add an ip rule to look up a custom routing table based on the pod CIDR?
Something like this:
```sh
ip rule add from 10.9.255.0/24 table 100
ip route add default via 10.9.255.1 dev eth1 table 100
```
I actually got this idea about policy routing initially from you and went through a POC to validate it.
In NodeSubnet, pods and the node share the same VNet subnet (e.g., 10.224.0.0/16) — there's no distinct pod CIDR that I know of. A pod might be 10.224.0.5 while the node is 10.224.0.4, all in the same range. So `ip rule add from <subnet> table 100` would also match node traffic (kubelet, API server health checks, etc.), which we don't want routed through VFP.
The fwmark approach lets us use interface-based matching — iptables matches on the pod's veth interface (`-i <hostVeth>`) to identify only pod-originated traffic, then stamps it with a mark. The ip rule then routes only marked packets to the custom table. This is the only reliable way to distinguish "packet came from a pod" vs "packet came from the node" when they share the same subnet.
The service CIDR RETURN rules (comment #2) are also tied to this — they prevent service-bound pod traffic from getting marked, so it still goes through kube-proxy DNAT first.
An ip rule from approach would work in overlay mode (where pods have a separate CIDR like 10.244.0.0/24), but not in NodeSubnet which is our target here.
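For illustration, a rough sketch of the mark-based policy routing that pairs with the MARK rule, using the vishvananda/netlink API that this PR's cleanup path already uses; the constant values, package name, and gateway/interface parameters are assumptions, not the PR's code:

```go
package tunnel

import (
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

const (
	fwmark     = 0x2000 // assumed value; the PR calls this transparentTunnelFwmark
	routeTable = 200    // assumed value; the PR calls this transparentTunnelRouteTable
)

// addMarkPolicyRoute installs "ip rule add fwmark <mark> table <table>" plus a
// default route in that table via the node's gateway on the primary NIC, so
// only packets stamped by the mangle PREROUTING MARK rule take this path.
func addMarkPolicyRoute(gw net.IP, primary netlink.Link) error {
	rule := netlink.NewRule()
	rule.Mark = fwmark
	rule.Table = routeTable
	rule.Family = unix.AF_INET
	if err := netlink.RuleAdd(rule); err != nil {
		return err
	}

	_, defaultDst, _ := net.ParseCIDR("0.0.0.0/0")
	return netlink.RouteAdd(&netlink.Route{
		LinkIndex: primary.Attrs().Index,
		Dst:       defaultDst,
		Gw:        gw,
		Table:     routeTable,
	})
}
```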
Can you update this in a comment on why we prefer iptables mark over ip rule?
Also, what happens after service translation, does it get hit by the mark?
> Can you update this in a comment on why we prefer iptables mark over ip rule?
> Also, what happens after service translation, does it get hit by the mark?
Added an in-code comment explaining the fwmark vs ip-rule-from rationale (NodeSubnet has no distinct pod CIDR, so from would match node traffic too).
post-DNAT — no, it does not get re-hit by the MARK rule. Here's the flow:
- Pod sends packet to ClusterIP (e.g. 10.0.0.10:53 → CoreDNS)
- Packet enters host via pod's veth → hits mangle PREROUTING → the RETURN rule matches -d serviceCIDR → exits chain unmarked
- Then hits nat PREROUTING → kube-proxy DNAT rewrites destination to real pod IP (e.g. 10.224.0.8)
- Kernel makes routing decision — destination pod is on same node, so packet goes through the bridge to the destination pod's veth
PREROUTING only fires once per packet — when it first arrives on an interface. After DNAT, the packet is routed locally through the bridge and goes through the FORWARD chain, which has no MARK rules. So the post-DNAT packet is never seen by our mangle PREROUTING rules.
Without the RETURN rules (an issue I hit during the POC), the service-bound packet would get marked, policy-routed out eth0 to the gateway, sent back by VFP, and re-enter the pod's veth, creating a second conntrack entry whose reply tuple collides with the original DNAT entry — the result was ~50% UDP packet loss (DNS failures).
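A sketch of how those service CIDR RETURN rules might look, mirroring the API from the quoted MARK-rule snippet (serviceCidrs is an assumed variable here, and this is not the PR's exact code); they must be in place before the MARK rule is appended so ClusterIP-bound traffic leaves the chain unmarked:

```go
// 1. Service CIDR RETURN rules: installed ahead of the fwmark MARK rule so
//    that pod traffic destined for ClusterIPs exits mangle PREROUTING
//    unmarked and still hits kube-proxy's nat PREROUTING DNAT first.
for _, cidr := range serviceCidrs { // e.g. []string{"10.0.0.0/16"} (assumed)
	match := "-i " + hostVeth + " -d " + cidr
	if err := client.iptablesClient.AppendIptableRule(
		iptables.V4, iptables.Mangle, iptables.Prerouting, match, "RETURN",
	); err != nil {
		return errors.Wrap(err, "failed to append service CIDR RETURN rule")
	}
}
```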
Here the packet doesn't leave the VM, so how will the NSG be enforced?
Force-pushed from 4a137f5 to 7af0b43
```go
match := "-i " + hostVeth + " -d " + cidr
if err := client.iptablesClient.DeleteIptableRule(
	iptables.V4, iptables.Mangle, iptables.Prerouting, match, "RETURN",
); err != nil {
	logger.Error("transparent-tunnel: failed to delete service CIDR RETURN rule",
		zap.String("cidr", cidr), zap.Error(err))
}
}
```
These calls should be idempotent: if the iptables rule is already deleted, it should not throw an error, but it should throw an error for other failures.
Fixed. DeleteIptableRule runs `iptables -D`, which returns an error if the rule is already gone; we log it but don't propagate (void function), which is safe.
But what if the iptables cmd returns an error for other cases? The function should return an error for everything except the already-deleted case.
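One possible shape for that, sketched with plain os/exec rather than the PR's wrappers: probe with `iptables -C` so a missing rule becomes a no-op, and surface every other failure to the caller (function and package names here are illustrative):

```go
package tunnel

import (
	"fmt"
	"os/exec"
)

// deleteRuleIfPresent removes a mangle PREROUTING rule idempotently: if the
// rule is already gone (iptables -C exits non-zero) it returns nil, otherwise
// any failure from the actual delete is returned to the caller.
// Note: for simplicity this treats every non-zero -C exit as "rule missing",
// which also swallows e.g. a missing chain; a stricter check could inspect
// the exit code or stderr.
func deleteRuleIfPresent(ruleSpec ...string) error {
	check := append([]string{"-t", "mangle", "-C", "PREROUTING"}, ruleSpec...)
	if err := exec.Command("iptables", check...).Run(); err != nil {
		return nil // rule not present: already deleted, not an error
	}
	del := append([]string{"-t", "mangle", "-D", "PREROUTING"}, ruleSpec...)
	if out, err := exec.Command("iptables", del...).CombinedOutput(); err != nil {
		return fmt.Errorf("iptables -D failed: %w: %s", err, out)
	}
	return nil
}
```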
```go
out, _ := client.plClient.ExecuteCommand(context.TODO(), "iptables", "-t", "mangle", "-S", "PREROUTING")
markCount := 0
for _, line := range strings.Split(out, "\n") {
	if strings.Contains(line, "--set-xmark "+hexMark+"/") {
		markCount++
	}
}

if markCount == 0 {
	rule := vishnetlink.NewRule()
	rule.Mark = transparentTunnelFwmark
	rule.Table = transparentTunnelRouteTable
	rule.Family = unix.AF_INET
	_ = client.nlPolicyRoute.RuleDel(rule)

	_, defaultDst, _ := net.ParseCIDR("0.0.0.0/0")
	route := &vishnetlink.Route{
		Dst:   defaultDst,
		Table: transparentTunnelRouteTable,
	}
	_ = client.nlPolicyRoute.RouteDel(route)
}
```
Should throw an error except for already-deleted cases.
Fixed. Updated netlink RuleDel/RouteDel to check for syscall.ENOENT and syscall.ESRCH (the "not found" errors netlink returns) and suppress only those. Any other real error (permission denied, etc.) is now logged.
Not just log: it should return the error so it is surfaced to containerd. There would be an impact if these rules/routes are not removed, right?
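For reference, a sketch of the pattern being asked for: suppress only the not-found errnos mentioned above and return anything else so the failure reaches the CNI DEL result (and containerd). Names are illustrative, not the PR's code:

```go
package tunnel

import (
	"errors"
	"fmt"
	"syscall"

	"github.com/vishvananda/netlink"
)

// isNotFound reports whether a netlink delete failed only because the object
// was already gone (ENOENT for routes, ESRCH for rules).
func isNotFound(err error) bool {
	return errors.Is(err, syscall.ENOENT) || errors.Is(err, syscall.ESRCH)
}

// teardownPolicyRoute removes the fwmark ip rule and the custom table's
// default route. "Already deleted" counts as success; any other error is
// returned so the caller can fail the CNI DEL.
func teardownPolicyRoute(rule *netlink.Rule, route *netlink.Route) error {
	if err := netlink.RuleDel(rule); err != nil && !isNotFound(err) {
		return fmt.Errorf("deleting fwmark ip rule: %w", err)
	}
	if err := netlink.RouteDel(route); err != nil && !isNotFound(err) {
		return fmt.Errorf("deleting default route in custom table: %w", err)
	}
	return nil
}
```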
Force-pushed from 7af0b43 to 8aa723b
Reason for Change:
Add a `transparent-tunnel` CNI mode that forces same-node pod-to-pod traffic through the host's physical interface (and therefore through VFP) so Azure NSG rules are enforced on intra-node communication. This implements GPS (GlobalPodSecurity) for Linux using iptables fwmark-based policy routing, including:
- ClusterIP UDP traffic (service CIDR RETURN rules inserted before MARK rule)
- pod creates; ref-counted cleanup using real `iptables -S` output patterns on delete)
Issue Fixed:
Requirements:
Notes:
This is PR 1 of 2 for the GPS feature: `transparent-tunnel` mode + Linux iptables/ip-rule implementation. The mode is opt-in via conflist: `"mode": "transparent-tunnel"`. No behavioral change to the existing `transparent` or other modes.
Replaces the previous `GlobalPodSecurity: true` boolean flag approach with a dedicated CNI mode that uses Go struct embedding (zero code copy from `TransparentEndpointClient`).
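For illustration, a rough sketch of the struct-embedding shape the description refers to; the type and method names are assumptions (only TransparentEndpointClient is named here), and addTunnelRules is a hypothetical helper:

```go
// TransparentTunnelEndpointClient reuses TransparentEndpointClient wholesale
// via embedding and only layers the tunnel-specific behavior (fwmark iptables
// rules plus policy routing) on top, so no code is copied.
type TransparentTunnelEndpointClient struct {
	TransparentEndpointClient // embedded: inherits the transparent-mode plumbing
}

// AddEndpointRules (assumed method name): run the embedded client's rules
// first, then add the tunnel-specific mark/policy-routing rules.
func (c *TransparentTunnelEndpointClient) AddEndpointRules(epInfo *EndpointInfo) error {
	if err := c.TransparentEndpointClient.AddEndpointRules(epInfo); err != nil {
		return err
	}
	return c.addTunnelRules(epInfo) // hypothetical helper for MARK + ip rule setup
}
```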