feat: add transparent-tunnel CNI mode (Linux) #4319
alam-tahmid wants to merge 1 commit into Azure:master from
Conversation
Pull request overview
This PR introduces GlobalPodSecurity configuration plumbing to carry a new boolean knob from CNI network config into endpoint metadata, and updates default CNI conflists to surface the option (defaulting to false).
Changes:
- Added `globalPodSecurity` to the CNI `NetworkConfig` (JSON) and plumbed it into `network.EndpointInfo`.
- Extended `network` endpoint-related structs to carry `GlobalPodSecurity`.
- Added unit tests for config unmarshalling and `createEpInfo` propagation, and updated the default Linux/Windows conflists to include the flag (set to `false`).
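For orientation, a minimal sketch of what the plumbing listed above might look like; the field and JSON key names come from this summary, but the type layouts and helper name are assumptions rather than the PR's exact code:

```go
// cni/netconfig.go (sketch): the new JSON knob on the CNI network config.
type NetworkConfig struct {
	// ... existing fields ...
	GlobalPodSecurity bool `json:"globalPodSecurity"` // absent in JSON => false
}

// network/endpoint.go (sketch): carried on endpoint metadata.
type EndpointInfo struct {
	// ... existing fields ...
	GlobalPodSecurity bool
}

// cni/network/network.go (sketch): createEpInfo is where the PR copies the
// flag across; a stand-in helper is shown here instead of its real signature.
func newEndpointInfoFromConfig(nwCfg *NetworkConfig) *EndpointInfo {
	return &EndpointInfo{GlobalPodSecurity: nwCfg.GlobalPodSecurity}
}
```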
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `network/endpoint.go` | Adds `GlobalPodSecurity` fields to the `endpoint` and `EndpointInfo` structs. |
| `cni/network/network.go` | Wires `NetworkConfig.GlobalPodSecurity` into the generated `EndpointInfo`. |
| `cni/network/network_test.go` | Adds coverage to ensure `createEpInfo` propagates the flag into `EndpointInfo`. |
| `cni/netconfig.go` | Adds `GlobalPodSecurity` to the CNI JSON config (`globalPodSecurity`). |
| `cni/netconfig_test.go` | Adds JSON unmarshal tests for `globalPodSecurity` defaulting/values. |
| `cni/azure-windows.conflist` | Adds `"globalPodSecurity": false` to the default Windows conflist. |
| `cni/azure-linux.conflist` | Adds `"globalPodSecurity": false` to the default Linux conflist. |
Force-pushed from b7af504 to 0d1ce9f
QxBytes left a comment:
Assuming your iptables changes will be going into the existing transparent client endpoint code (unless it's a new scenario)? Also assuming this change affects nodesubnet (no Azure CNS present)?
Force-pushed from 0d1ce9f to a951bf0
Force-pushed from e051e3d to ecbdd88
QxBytes left a comment:
Just a note that for transparent-vlan I needed to run something for rp_filter. If you already tested and it works, then it should be fine, but just bringing it to your attention in case something pops up in the future.
Also, ExecuteRawCommand should be fine since you control the command input, but just noting it for future reference.
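In case it helps later, a minimal, hypothetical sketch of relaxing reverse-path filtering on an interface (not part of this PR; the interface name and the mode value needed would depend on the actual routing setup):

```go
package netutil

import (
	"fmt"
	"os"
)

// setRPFilter writes the rp_filter mode for one interface:
// 0 = off, 1 = strict, 2 = loose. Setups with asymmetric paths (as mentioned
// for transparent-vlan) may need loose mode so return traffic isn't dropped.
func setRPFilter(iface string, mode int) error {
	path := fmt.Sprintf("/proc/sys/net/ipv4/conf/%s/rp_filter", iface)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", mode)), 0o644)
}
```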
Force-pushed from ecbdd88 to 6323fa7
```go
// 2. Fwmark MARK rule — append so it comes after RETURN rules.
markMatch := "-i " + hostVeth
markTarget := "MARK --set-mark " + markStr
if err := client.iptablesClient.AppendIptableRule(
	iptables.V4, iptables.Mangle, iptables.Prerouting, markMatch, markTarget,
); err != nil {
	return errors.Wrap(err, "failed to append GPS fwmark MARK rule")
}
logger.Info("GPS: added fwmark MARK rule",
	zap.String("veth", hostVeth), zap.String("mark", markStr))
```
Why do we need an iptables rule to mark the packet? Can we not add an ip rule to look up a custom routing table based on the pod CIDR?
Something like this:
```sh
ip rule add from 10.9.255.0/24 table 100
ip route add default via 10.9.255.1 dev eth1 table 100
```
I actually got this idea about policy routing initially from you and went through a POC to validate it.
In NodeSubnet, pods and the node share the same VNet subnet (e.g., 10.224.0.0/16) — there's no distinct pod CIDR that I know of. A pod might be 10.224.0.5 while the node is 10.224.0.4, all in the same range. So `ip rule add from <subnet> table 100` would also match node traffic (kubelet, API server health checks, etc.), which we don't want routed through VFP.
The fwmark approach lets us use interface-based matching — iptables matches on the pod's veth interface (`-i <hostVeth>`) to identify only pod-originated traffic, then stamps it with a mark. The ip rule then routes only marked packets to the custom table. This is the only reliable way to distinguish "packet came from a pod" vs "packet came from the node" when they share the same subnet.
The service CIDR RETURN rules (comment #2) are also tied to this — they prevent service-bound pod traffic from getting marked, so it still goes through kube-proxy DNAT first.
An ip rule from approach would work in overlay mode (where pods have a separate CIDR like 10.244.0.0/24), but not in NodeSubnet which is our target here.
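For illustration, a rough sketch of the mark-based policy routing that pairs with the MARK rule, using the vishvananda/netlink API that this PR's cleanup path already uses; the constant values, package name, and gateway/interface parameters are assumptions, not the PR's code:

```go
package tunnel

import (
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

const (
	fwmark     = 0x2000 // assumed value; the PR calls this transparentTunnelFwmark
	routeTable = 200    // assumed value; the PR calls this transparentTunnelRouteTable
)

// addMarkPolicyRoute installs "ip rule add fwmark <mark> table <table>" plus a
// default route in that table via the node's gateway on the primary NIC, so
// only packets stamped by the mangle PREROUTING MARK rule take this path.
func addMarkPolicyRoute(gw net.IP, primary netlink.Link) error {
	rule := netlink.NewRule()
	rule.Mark = fwmark
	rule.Table = routeTable
	rule.Family = unix.AF_INET
	if err := netlink.RuleAdd(rule); err != nil {
		return err
	}

	_, defaultDst, _ := net.ParseCIDR("0.0.0.0/0")
	return netlink.RouteAdd(&netlink.Route{
		LinkIndex: primary.Attrs().Index,
		Dst:       defaultDst,
		Gw:        gw,
		Table:     routeTable,
	})
}
```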
Can you update this in a comment on why we prefer iptables mark over ip rule?
Also, what happens after service translation, does it get hit by the mark?
> Can you update this in a comment on why we prefer iptables mark over ip rule?
> Also, what happens after service translation, does it get hit by the mark?
Added an in-code comment explaining the fwmark vs ip-rule-from rationale (NodeSubnet has no distinct pod CIDR, so from would match node traffic too).
post-DNAT — no, it does not get re-hit by the MARK rule. Here's the flow:
- Pod sends packet to ClusterIP (e.g. 10.0.0.10:53 → CoreDNS)
- Packet enters host via pod's veth → hits mangle PREROUTING → the RETURN rule matches -d serviceCIDR → exits chain unmarked
- Then hits nat PREROUTING → kube-proxy DNAT rewrites destination to real pod IP (e.g. 10.224.0.8)
- Kernel makes routing decision — destination pod is on same node, so packet goes through the bridge to the destination pod's veth
PREROUTING only fires once per packet — when it first arrives on an interface. After DNAT, the packet is routed locally through the bridge and goes through the FORWARD chain, which has no MARK rules. So the post-DNAT packet is never seen by our mangle PREROUTING rules.
Without the RETURN rules (an issue I hit during the POC), the service-bound packet would get marked, policy-routed out eth0 to the gateway, sent back by VFP, and re-enter the pod's veth, creating a second conntrack entry whose reply tuple collides with the original DNAT entry — the result was ~50% UDP packet loss (DNS failures).
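A sketch of how those service CIDR RETURN rules might look, mirroring the API from the quoted MARK-rule snippet (serviceCidrs is an assumed variable here, and this is not the PR's exact code); they must be in place before the MARK rule is appended so ClusterIP-bound traffic leaves the chain unmarked:

```go
// 1. Service CIDR RETURN rules: installed ahead of the fwmark MARK rule so
//    that pod traffic destined for ClusterIPs exits mangle PREROUTING
//    unmarked and still hits kube-proxy's nat PREROUTING DNAT first.
for _, cidr := range serviceCidrs { // e.g. []string{"10.0.0.0/16"} (assumed)
	match := "-i " + hostVeth + " -d " + cidr
	if err := client.iptablesClient.AppendIptableRule(
		iptables.V4, iptables.Mangle, iptables.Prerouting, match, "RETURN",
	); err != nil {
		return errors.Wrap(err, "failed to append service CIDR RETURN rule")
	}
}
```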
Here the packet doesn't leave the VM, so how will the NSG be enforced?
Force-pushed from 4a137f5 to 7af0b43
```go
match := "-i " + hostVeth + " -d " + cidr
if err := client.iptablesClient.DeleteIptableRule(
	iptables.V4, iptables.Mangle, iptables.Prerouting, match, "RETURN",
); err != nil {
	logger.Error("transparent-tunnel: failed to delete service CIDR RETURN rule",
		zap.String("cidr", cidr), zap.Error(err))
}
}
```
These calls should be idempotent: if the iptables rule is already deleted, it should not throw an error, but it should throw an error for other failures.
Fixed. DeleteIptableRule runs `iptables -D`, which returns an error if the rule is already gone; we log it but don't propagate (void function), which is safe.
But what if the iptables cmd returns an error for other cases? The function should return an error for everything except the already-deleted case.
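One possible shape for that, sketched with plain os/exec rather than the PR's wrappers: probe with `iptables -C` so a missing rule becomes a no-op, and surface every other failure to the caller (function and package names here are illustrative):

```go
package tunnel

import (
	"fmt"
	"os/exec"
)

// deleteRuleIfPresent removes a mangle PREROUTING rule idempotently: if the
// rule is already gone (iptables -C exits non-zero) it returns nil, otherwise
// any failure from the actual delete is returned to the caller.
// Note: for simplicity this treats every non-zero -C exit as "rule missing",
// which also swallows e.g. a missing chain; a stricter check could inspect
// the exit code or stderr.
func deleteRuleIfPresent(ruleSpec ...string) error {
	check := append([]string{"-t", "mangle", "-C", "PREROUTING"}, ruleSpec...)
	if err := exec.Command("iptables", check...).Run(); err != nil {
		return nil // rule not present: already deleted, not an error
	}
	del := append([]string{"-t", "mangle", "-D", "PREROUTING"}, ruleSpec...)
	if out, err := exec.Command("iptables", del...).CombinedOutput(); err != nil {
		return fmt.Errorf("iptables -D failed: %w: %s", err, out)
	}
	return nil
}
```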
```go
out, _ := client.plClient.ExecuteCommand(context.TODO(), "iptables", "-t", "mangle", "-S", "PREROUTING")
markCount := 0
for _, line := range strings.Split(out, "\n") {
	if strings.Contains(line, "--set-xmark "+hexMark+"/") {
		markCount++
	}
}

if markCount == 0 {
	rule := vishnetlink.NewRule()
	rule.Mark = transparentTunnelFwmark
	rule.Table = transparentTunnelRouteTable
	rule.Family = unix.AF_INET
	_ = client.nlPolicyRoute.RuleDel(rule)

	_, defaultDst, _ := net.ParseCIDR("0.0.0.0/0")
	route := &vishnetlink.Route{
		Dst:   defaultDst,
		Table: transparentTunnelRouteTable,
	}
	_ = client.nlPolicyRoute.RouteDel(route)
}
```
Should throw an error except for already-deleted cases.
Fixed. Updated netlink RuleDel/RouteDel to check for syscall.ENOENT and syscall.ESRCH (the "not found" errors netlink returns) and suppress only those. Any other real error (permission denied, etc.) is now logged.
Not just log: it should return the error so it is surfaced to containerd. There would be an impact if these rules/routes are not removed, right?
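For reference, a sketch of the pattern being asked for: suppress only the not-found errnos mentioned above and return anything else so the failure reaches the CNI DEL result (and containerd). Names are illustrative, not the PR's code:

```go
package tunnel

import (
	"errors"
	"fmt"
	"syscall"

	"github.com/vishvananda/netlink"
)

// isNotFound reports whether a netlink delete failed only because the object
// was already gone (ENOENT for routes, ESRCH for rules).
func isNotFound(err error) bool {
	return errors.Is(err, syscall.ENOENT) || errors.Is(err, syscall.ESRCH)
}

// teardownPolicyRoute removes the fwmark ip rule and the custom table's
// default route. "Already deleted" counts as success; any other error is
// returned so the caller can fail the CNI DEL.
func teardownPolicyRoute(rule *netlink.Rule, route *netlink.Route) error {
	if err := netlink.RuleDel(rule); err != nil && !isNotFound(err) {
		return fmt.Errorf("deleting fwmark ip rule: %w", err)
	}
	if err := netlink.RouteDel(route); err != nil && !isNotFound(err) {
		return fmt.Errorf("deleting default route in custom table: %w", err)
	}
	return nil
}
```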
Force-pushed from 7af0b43 to 8aa723b
Reason for Change:
Add a `transparent-tunnel` CNI mode that forces same-node pod-to-pod traffic through the host's physical interface (and therefore through VFP) so Azure NSG rules are enforced on intra-node communication. This implements GPS (GlobalPodSecurity) for Linux using iptables fwmark-based policy routing, including:
- ClusterIP UDP traffic (service CIDR RETURN rules inserted before MARK rule)
- pod creates; ref-counted cleanup using real `iptables -S` output patterns on delete)
Issue Fixed:
Requirements:
Notes:
This is PR 1 of 2 for the GPS feature: `transparent-tunnel` mode + Linux iptables/ip-rule implementation. The mode is opt-in via conflist: `"mode": "transparent-tunnel"`. No behavioral change to the existing `transparent` or other modes.
Replaces the previous `GlobalPodSecurity: true` boolean flag approach with a dedicated CNI mode that uses Go struct embedding (zero code copy from `TransparentEndpointClient`).
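For illustration, a rough sketch of the struct-embedding shape the description refers to; the type and method names are assumptions (only TransparentEndpointClient is named here), and addTunnelRules is a hypothetical helper:

```go
// TransparentTunnelEndpointClient reuses TransparentEndpointClient wholesale
// via embedding and only layers the tunnel-specific behavior (fwmark iptables
// rules plus policy routing) on top, so no code is copied.
type TransparentTunnelEndpointClient struct {
	TransparentEndpointClient // embedded: inherits the transparent-mode plumbing
}

// AddEndpointRules (assumed method name): run the embedded client's rules
// first, then add the tunnel-specific mark/policy-routing rules.
func (c *TransparentTunnelEndpointClient) AddEndpointRules(epInfo *EndpointInfo) error {
	if err := c.TransparentEndpointClient.AddEndpointRules(epInfo); err != nil {
		return err
	}
	return c.addTunnelRules(epInfo) // hypothetical helper for MARK + ip rule setup
}
```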