feat: add transparent-tunnel CNI mode (Linux) #4319

Open
alam-tahmid wants to merge 1 commit into Azure:master from alam-tahmid:tahmidalam/gps-config-plumbing

Conversation

Contributor

@alam-tahmid alam-tahmid commented Apr 7, 2026

Reason for Change:
 Add transparent-tunnel CNI mode that forces same-node pod-to-pod traffic through the host's
physical interface (and therefore through VFP) so Azure NSG rules are enforced on intra-node
communication. This implements GPS (GlobalPodSecurity) for Linux using iptables fwmark-based
policy routing, including:

  • A fix for a conntrack tuple collision bug that caused ~50% DNS packet loss on same-node
    ClusterIP UDP traffic (service CIDR RETURN rules inserted before MARK rule)
  • Race-safe shared ip rule management (tolerates "File exists" to avoid TOCTOU on concurrent
    pod creates; ref-counted cleanup using real iptables -S output patterns on delete)
  • Nil gateway early-exit to prevent partial setup that would black-hole all marked traffic
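
The rule ordering described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `buildMangleRules`, the veth name `azv1234`, the service CIDR, and the mark value `3` are all assumptions made for the example. It only builds the iptables argument lists, and it appends every rule in order, whereas the PR inserts the RETURN rules ahead of an appended MARK rule.

```go
package main

import (
	"fmt"
	"strings"
)

// buildMangleRules returns iptables arguments for one pod's host-side veth.
// Service-CIDR RETURN rules come first so service-bound traffic leaves the
// chain before the final MARK rule can tag it.
// (Hypothetical helper; names and values are illustrative only.)
func buildMangleRules(hostVeth string, serviceCIDRs []string, mark int) [][]string {
	rules := make([][]string, 0, len(serviceCIDRs)+1)
	for _, cidr := range serviceCIDRs {
		rules = append(rules, []string{
			"-t", "mangle", "-A", "PREROUTING", "-i", hostVeth, "-d", cidr, "-j", "RETURN",
		})
	}
	rules = append(rules, []string{
		"-t", "mangle", "-A", "PREROUTING", "-i", hostVeth,
		"-j", "MARK", "--set-mark", fmt.Sprintf("0x%x", mark),
	})
	return rules
}

func main() {
	for _, r := range buildMangleRules("azv1234", []string{"10.0.0.0/16"}, 3) {
		fmt.Println("iptables " + strings.Join(r, " "))
	}
}
```

Keeping the ordering in one pure function makes it easy to assert in a unit test that no MARK rule can precede a RETURN rule.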

Issue Fixed:

Requirements:

Notes:

This is PR 1 of 2 for the GPS feature:

  1. This PR: transparent-tunnel mode + Linux iptables/ip-rule implementation
  2. PR 2: Windows /32 host route implementation (separate PR)

The mode is opt-in via conflist: "mode": "transparent-tunnel". No behavioral change to existing
transparent or other modes.
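
As a rough sketch of the opt-in check, assuming a trimmed-down config struct: `netConf` and `usesTransparentTunnel` are hypothetical names, and a real conflist wraps plugin entries in a `plugins` array, which this example skips by parsing a single plugin entry.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const modeTransparentTunnel = "transparent-tunnel"

// netConf is a hypothetical, trimmed-down stand-in for the plugin's config.
type netConf struct {
	Name string `json:"name"`
	Mode string `json:"mode"`
}

// usesTransparentTunnel reports whether the config opts in to the new mode.
// Any other mode value (or none) leaves existing behavior untouched.
func usesTransparentTunnel(raw []byte) (bool, error) {
	var c netConf
	if err := json.Unmarshal(raw, &c); err != nil {
		return false, fmt.Errorf("parse network config: %w", err)
	}
	return c.Mode == modeTransparentTunnel, nil
}

func main() {
	conf := []byte(`{"name": "azure", "mode": "transparent-tunnel"}`)
	on, err := usesTransparentTunnel(conf)
	fmt.Println(on, err)
}
```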

Replaces the previous GlobalPodSecurity: true boolean flag approach with a dedicated CNI mode
that uses Go struct embedding (zero code copy from TransparentEndpointClient).
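
The embedding idea can be illustrated with a small sketch. The type and method names below (`baseClient`, `tunnelClient`, `AddEndpoints`, `AddEndpointRules`) are stand-ins chosen for the example and are not taken from the repository; the point is only that the embedding type inherits methods it does not redefine and overrides the ones it does.

```go
package main

import "fmt"

// baseClient stands in for the existing transparent-mode client (hypothetical).
type baseClient struct {
	hostVeth string
}

func (c *baseClient) AddEndpoints() string {
	return "veth pair for " + c.hostVeth
}

func (c *baseClient) AddEndpointRules() string {
	return "base routes"
}

// tunnelClient embeds baseClient: AddEndpoints is promoted unchanged, and only
// AddEndpointRules is overridden to layer extra behavior on top of the base.
type tunnelClient struct {
	*baseClient
}

func (c *tunnelClient) AddEndpointRules() string {
	return c.baseClient.AddEndpointRules() + " + fwmark policy routing"
}

func main() {
	c := &tunnelClient{&baseClient{hostVeth: "azv1234"}}
	fmt.Println(c.AddEndpoints())
	fmt.Println(c.AddEndpointRules())
}
```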

Copilot AI review requested due to automatic review settings April 7, 2026 16:52
@alam-tahmid alam-tahmid requested review from a team as code owners April 7, 2026 16:52
@alam-tahmid alam-tahmid requested a review from jpayne3506 April 7, 2026 16:52
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces GlobalPodSecurity configuration plumbing to carry a new boolean knob from CNI network config into endpoint metadata, and updates default CNI conflists to surface the option (defaulting to false).

Changes:

  • Added globalPodSecurity to CNI NetworkConfig (JSON) and plumbed it into network.EndpointInfo.
  • Extended network endpoint-related structs to carry GlobalPodSecurity.
  • Added unit tests for config unmarshalling and createEpInfo propagation, and updated default Linux/Windows conflists to include the flag (set to false).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| network/endpoint.go | Adds GlobalPodSecurity fields to endpoint and EndpointInfo structs. |
| cni/network/network.go | Wires NetworkConfig.GlobalPodSecurity into generated EndpointInfo. |
| cni/network/network_test.go | Adds coverage to ensure createEpInfo propagates the flag into EndpointInfo. |
| cni/netconfig.go | Adds GlobalPodSecurity to CNI JSON config (globalPodSecurity). |
| cni/netconfig_test.go | Adds JSON unmarshal tests for globalPodSecurity defaulting/values. |
| cni/azure-windows.conflist | Adds "globalPodSecurity": false to default Windows conflist. |
| cni/azure-linux.conflist | Adds "globalPodSecurity": false to default Linux conflist. |

Comment thread network/endpoint.go
@alam-tahmid alam-tahmid force-pushed the tahmidalam/gps-config-plumbing branch from b7af504 to 0d1ce9f Compare April 7, 2026 17:13
Contributor

@QxBytes QxBytes left a comment

I'm assuming your iptables changes will go in the existing transparent endpoint client code (unless it's a new scenario)? I'm also assuming this change affects NodeSubnet (no Azure CNS present)?

Comment thread cni/network/network.go Outdated
Comment thread cni/azure-linux.conflist Outdated
Comment thread network/endpoint.go Outdated
Comment thread cni/netconfig_test.go Outdated
@alam-tahmid alam-tahmid force-pushed the tahmidalam/gps-config-plumbing branch from 0d1ce9f to a951bf0 Compare April 16, 2026 17:58
@alam-tahmid alam-tahmid changed the title from "feat: add GlobalPodSecurity config plumbing" to "feat: add transparent-tunnel CNI mode for GPS VFP enforcement (Linux)" Apr 16, 2026
@alam-tahmid alam-tahmid force-pushed the tahmidalam/gps-config-plumbing branch 2 times, most recently from e051e3d to ecbdd88 Compare April 16, 2026 21:57
QxBytes previously approved these changes Apr 17, 2026
Contributor

@QxBytes QxBytes left a comment

Just a note that for transparent-vlan I needed to run something for rp_filter. If you've already tested this and it works, it should be fine; I'm just bringing it to your attention in case something pops up in the future.

Also, the ExecuteRawCommand usage should be fine since you control the command input, but keep it in mind for future reference.

Comment thread network/transparent_tunnel_endpointclient_linux.go Outdated
QxBytes previously approved these changes Apr 22, 2026
Comment thread network/transparent_tunnel_endpointclient_linux.go Outdated
Comment thread network/transparent_tunnel_endpointclient_linux.go
Comment on lines +132 to +141
// 2. Fwmark MARK rule: append so it comes after RETURN rules.
markMatch := "-i " + hostVeth
markTarget := "MARK --set-mark " + markStr
if err := client.iptablesClient.AppendIptableRule(
	iptables.V4, iptables.Mangle, iptables.Prerouting, markMatch, markTarget,
); err != nil {
	return errors.Wrap(err, "failed to append GPS fwmark MARK rule")
}
logger.Info("GPS: added fwmark MARK rule",
	zap.String("veth", hostVeth), zap.String("mark", markStr))
Member

Why do we need an iptables rule to mark the packet? Can we not add an ip rule that looks up a custom routing table based on the pod CIDR?

something like this:


ip rule add from 10.9.255.0/24 table 100
ip route add default via 10.9.255.1 dev eth1 table 100

Contributor Author

I actually got this idea about policy routing initially from you and went through a POC to validate it.

In NodeSubnet, pods and the node share the same VNet subnet (e.g., 10.224.0.0/16), so there's no distinct pod CIDR that I know of. A pod might be 10.224.0.5 while the node is 10.224.0.4, all in the same range. A source-based rule (ip rule add from, pointing at table 100) would therefore also match node traffic (kubelet, API server health checks, etc.), which we don't want routed through VFP.

The fwmark approach lets us use interface-based matching: iptables matches on the pod's veth interface (-i with the host-side veth name) to identify only pod-originated traffic, then stamps it with a mark. The ip rule then routes only marked packets to the custom table. This is the only reliable way to distinguish "packet came from a pod" from "packet came from the node" when they share the same subnet.

The service CIDR RETURN rules (comment #2) are also tied to this: they prevent service-bound pod traffic from getting marked, so it still goes through kube-proxy DNAT first.

An ip-rule-from approach would work in overlay mode (where pods have a separate CIDR like 10.244.0.0/24), but not in NodeSubnet, which is our target here.
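
To make the shared-subnet problem concrete, here is a small sketch using the example addresses from this comment. `matchesSourceRule` is a hypothetical helper that only mimics the source-prefix test of a from-based ip rule; it does not touch the kernel.

```go
package main

import (
	"fmt"
	"net"
)

// matchesSourceRule reports whether a source-CIDR ip rule would select src.
// (Hypothetical helper that models only the prefix match.)
func matchesSourceRule(ruleCIDR, src string) (bool, error) {
	_, ipNet, err := net.ParseCIDR(ruleCIDR)
	if err != nil {
		return false, err
	}
	ip := net.ParseIP(src)
	if ip == nil {
		return false, fmt.Errorf("invalid IP %q", src)
	}
	return ipNet.Contains(ip), nil
}

func main() {
	// 10.224.0.5 is the example pod, 10.224.0.4 the example node.
	for _, src := range []string{"10.224.0.5", "10.224.0.4"} {
		ok, _ := matchesSourceRule("10.224.0.0/16", src)
		fmt.Printf("%s matched: %v\n", src, ok)
	}
}
```

Both addresses match the same prefix, so a source-based rule has no way to tell pod traffic from node traffic in this layout.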

Member

Can you update this in a comment on why we prefer the iptables mark over an ip rule?

So what happens after service translation, does it get hit by the mark?

Contributor Author

> Can you update this in a comment on why we prefer the iptables mark over an ip rule?
>
> So what happens after service translation, does it get hit by the mark?

Added an in-code comment explaining the fwmark vs ip-rule-from rationale (NodeSubnet has no distinct pod CIDR, so a from match would catch node traffic too).

Post-DNAT: no, the packet does not get re-hit by the MARK rule. Here's the flow:

  1. Pod sends a packet to a ClusterIP (e.g. 10.0.0.10:53 → CoreDNS)
  2. Packet enters the host via the pod's veth → hits mangle PREROUTING → the RETURN rule matches -d serviceCIDR → exits the chain unmarked
  3. It then hits nat PREROUTING → kube-proxy DNAT rewrites the destination to the real pod IP (e.g. 10.224.0.8)
  4. Kernel makes the routing decision: the destination pod is on the same node, so the packet goes through the bridge to the destination pod's veth

PREROUTING only fires once per packet, when it first arrives on an interface. After DNAT, the packet is routed locally through the bridge and goes through the FORWARD chain, which has no MARK rules. So the post-DNAT packet is never seen by our mangle PREROUTING rules.

Without the RETURN rules (which I hit during the POC), the service-bound packet would get marked, policy-routed out eth0 to the gateway, and sent back by VFP; it then re-enters the pod's veth, creating a second conntrack entry whose reply tuple collides with the original DNAT entry. The result was ~50% UDP packet loss (DNS failures).

Member

Here the packet doesn't leave the VM, so how will the NSG be enforced?

Comment thread network/transparent_tunnel_endpointclient_linux.go Outdated
@alam-tahmid alam-tahmid force-pushed the tahmidalam/gps-config-plumbing branch 3 times, most recently from 4a137f5 to 7af0b43 Compare April 22, 2026 21:50
Comment thread network/transparent_tunnel_endpointclient_linux.go
Comment on lines +244 to +251
	match := "-i " + hostVeth + " -d " + cidr
	if err := client.iptablesClient.DeleteIptableRule(
		iptables.V4, iptables.Mangle, iptables.Prerouting, match, "RETURN",
	); err != nil {
		logger.Error("transparent-tunnel: failed to delete service CIDR RETURN rule",
			zap.String("cidr", cidr), zap.Error(err))
	}
}
Member

These calls should be idempotent. If the iptables rule is already deleted, the call should not return an error, but it should return an error for any other failure.

Contributor Author

Fixed. DeleteIptableRule runs iptables -D, which returns an error if the rule is already gone. We log it but don't propagate it (void function), which is safe.

Member

But what if the iptables command returns an error for other cases? The function should return an error for every failure except the already-deleted case.

Comment on lines +273 to +294
out, _ := client.plClient.ExecuteCommand(context.TODO(), "iptables", "-t", "mangle", "-S", "PREROUTING")
markCount := 0
for _, line := range strings.Split(out, "\n") {
	if strings.Contains(line, "--set-xmark "+hexMark+"/") {
		markCount++
	}
}

if markCount == 0 {
	rule := vishnetlink.NewRule()
	rule.Mark = transparentTunnelFwmark
	rule.Table = transparentTunnelRouteTable
	rule.Family = unix.AF_INET
	_ = client.nlPolicyRoute.RuleDel(rule)

	_, defaultDst, _ := net.ParseCIDR("0.0.0.0/0")
	route := &vishnetlink.Route{
		Dst:   defaultDst,
		Table: transparentTunnelRouteTable,
	}
	_ = client.nlPolicyRoute.RouteDel(route)
}
Member

This should return an error except in the already-deleted case.

Contributor Author

Fixed. Updated netlink RuleDel/RouteDel to check for syscall.ENOENT and syscall.ESRCH (the "not found" errors netlink returns) and suppress only those. Any other real error (permission denied, etc.) is now logged.

Member

It should not just log; it should return the error so that it is returned to containerd. There would be an impact if these rules/routes are not removed, right?

@alam-tahmid alam-tahmid force-pushed the tahmidalam/gps-config-plumbing branch from 7af0b43 to 8aa723b Compare April 23, 2026 20:26
@tamilmani1989 tamilmani1989 changed the title from "feat: add transparent-tunnel CNI mode for GPS VFP enforcement (Linux)" to "feat: add transparent-tunnel CNI mode (Linux)" Apr 27, 2026