[Socket Handler]: Adds HTTP/2 PING keep-alive to detect broken connections in pool.#5788
Conversation
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
| keepAlivePingDelayInfo?.SetValue(socketHttpHandler, TimeSpan.FromSeconds(30)); | ||
|
|
||
| PropertyInfo keepAlivePingTimeoutInfo = socketHandlerType.GetProperty("KeepAlivePingTimeout"); | ||
| keepAlivePingTimeoutInfo?.SetValue(socketHttpHandler, TimeSpan.FromSeconds(20)); |
There was a problem hiding this comment.
How are the values determined?
There was a problem hiding this comment.
HOw about customization through ENV variables?
There was a problem hiding this comment.
Yeah - definitely do not hard-code - and to provide value they will likely have to be more aggressive
There was a problem hiding this comment.
Cross-SDK data to support this thread:
The Go SDK (which motivated this PR "observed across multiple Go SDK customers") already has this fix since October 2023 (PR #21468) with significantly more aggressive values:
| Setting | .NET (this PR) | Go SDK (production since 2023) |
|---|---|---|
| Ping delay | 30s | 10s |
| Ping timeout | 20s | 5s |
| Total detection time | ~50s | ~15s |
That's 3.3 slower broken-connection detection in .NET. During that 50s window, every request routed to the broken connection fails.
The Go values were chosen after real customer impact analysis (golang/go#59690) and have been stable in production for 2.5 years. If the goal is to match the customer-facing behavior that resolved the Go SDK issue, aligning closer to those values (or making them ENV-configurable as @kirankumarkolli suggested) would be prudent.
AI-generated review may be incorrect. Agree? resolve the conversation. Disagree? reply with your reasoning.
There was a problem hiding this comment.
Added environment variables for setting this value and referred the rust sdk config for default values
https://github.com/Azure/azure-sdk-for-rust/blob/955b4c4803f6a96f082f3ecd394fd4ae58bc8cd6/sdk/cosmos/azure_data_cosmos_driver/src/options/connection_pool.rs#L254-L255
|
@sdkReviewAgent |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
| if (keepAlivePingPolicyInfo != null) | ||
| { | ||
| Type pingPolicyType = keepAlivePingPolicyInfo.PropertyType; | ||
| object alwaysValue = Enum.ToObject(pingPolicyType, 1); |
There was a problem hiding this comment.
🟢 Suggestion — Resilience: Prefer Enum.Parse by name over Enum.ToObject by ordinal
object alwaysValue = Enum.ToObject(pingPolicyType, 1);The hardcoded 1 is correct today (HttpKeepAlivePingPolicy.Always = 1), but it's an ordinal assumption with no compile-time validation. If Microsoft ever inserted a new enum member before Always, this would silently apply the wrong policy.
Consider using name-based lookup instead:
object alwaysValue = Enum.Parse(pingPolicyType, "Always");This is self-documenting, resilient to ordinal changes, and equally correct. The performance difference is irrelevant — this runs once at client startup. Enum.Parse will throw ArgumentException if the name changes (surfacing the problem immediately), whereas Enum.ToObject would silently apply the wrong policy.
|
✅ Review complete (23:26) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
| { | ||
| int pingDelayInSeconds = ConfigurationManager.GetEnvironmentVariable<int>( | ||
| ConfigurationManager.Http2KeepAlivePingDelayInSeconds, | ||
| defaultValue: 1); |
There was a problem hiding this comment.
Defaults seems very aggressive.
Lets align with eventual RUST expected values.
There was a problem hiding this comment.
Seems like there are the RUST defaults.
There was a problem hiding this comment.
The set defaults are what have been set in Rust as well https://github.com/Azure/azure-sdk-for-rust/blob/955b4c4803f6a96f082f3ecd394fd4ae58bc8cd6/sdk/cosmos/azure_data_cosmos_driver/src/options/connection_pool.rs#L254-L255
| Assert.IsFalse(clientHandler.ServerCertificateCustomValidationCallback.Invoke(new HttpRequestMessage(), x509Certificate2, x509Chain, sslPolicyErrors)); | ||
| } | ||
|
|
||
| [TestMethod] |
There was a problem hiding this comment.
Possible concurrency issue with other tests?
| // KeepAlivePingDelay/Timeout/Policy are available on SocketsHttpHandler in .NET 5.0+. | ||
| try | ||
| { | ||
| int pingDelayInSeconds = ConfigurationManager.GetEnvironmentVariable<int>( |
There was a problem hiding this comment.
🟡 Recommendation — Resilience: Add Math.Max clamping on env var overrides to prevent silent feature disablement
int pingDelayInSeconds = ConfigurationManager.GetEnvironmentVariable<int>(
ConfigurationManager.Http2KeepAlivePingDelayInSeconds,
defaultValue: 1);
int pingTimeoutInSeconds = ConfigurationManager.GetEnvironmentVariable<int>(
ConfigurationManager.Http2KeepAlivePingTimeoutInSeconds,
defaultValue: 2);Both KeepAlivePingDelay and KeepAlivePingTimeout on SocketsHttpHandler require values strictly greater than TimeSpan.Zero (the setter throws ArgumentOutOfRangeException for <= TimeSpan.Zero). If a user sets AZURE_COSMOS_HTTP2_KEEPALIVE_PING_DELAY_IN_SECONDS=0 or a negative value:
GetEnvironmentVariable<int>parses successfully (valid int)TimeSpan.FromSeconds(0)→TimeSpan.ZeroSetValuethrowsArgumentOutOfRangeException- Caught by
catch (Exception ex)→ all three settings skipped (delay, timeout, AND policy) - Handler has .NET defaults:
KeepAlivePingDelay = Timeout.InfiniteTimeSpan→ pings disabled entirely
The feature this PR adds is silently defeated with only a trace warning. This is especially subtle because the delay failure also prevents the KeepAlivePingPolicy = Always from being set — so even if only the timeout env var is invalid, the policy reverts to WithActiveRequests (which doesn't ping idle connections, defeating the purpose).
The codebase already has an established pattern for this in ConfigurationManager.cs:159-166:
// Existing pattern
return Math.Max(
ConfigurationManager.GetEnvironmentVariable(..., defaultValue: ...),
MinMaxRetriesInLocalRegionWhenRemoteRegionPreferred);Suggested fix:
int pingDelayInSeconds = Math.Max(1, ConfigurationManager.GetEnvironmentVariable<int>(
ConfigurationManager.Http2KeepAlivePingDelayInSeconds,
defaultValue: 1));
int pingTimeoutInSeconds = Math.Max(1, ConfigurationManager.GetEnvironmentVariable<int>(
ConfigurationManager.Http2KeepAlivePingTimeoutInSeconds,
defaultValue: 2));This guarantees values are always ≥ 1 second, which is the minimum SocketsHttpHandler accepts, and follows the existing clamping convention.
|
✅ Review complete (27:01) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
| // reset) can remain in the pool indefinitely, causing persistent request failures | ||
| // that only resolve after application restart. | ||
| // KeepAlivePingDelay/Timeout/Policy are available on SocketsHttpHandler in .NET 5.0+. | ||
| try |
There was a problem hiding this comment.
🟡 Recommendation — Resilience: Separate try-catch blocks for independent property sets
The single try/catch block (lines 191–220) groups env var parsing, KeepAlivePingDelay, KeepAlivePingTimeout, and KeepAlivePingPolicy together. If any one operation fails — including a FormatException from a non-numeric env var string, or a TargetInvocationException from the delay setter — the entire block is skipped, including the critical KeepAlivePingPolicy = Always.
The sibling EnableMultipleHttp2Connections just above (lines 176–184) uses its own isolated try-catch, which is the established pattern in this method. Each logical property set is independent and should fail independently.
Concrete failure scenario: A user sets AZURE_COSMOS_HTTP2_KEEPALIVE_PING_DELAY_IN_SECONDS=fast (non-numeric). Convert.ChangeType throws before any property is set → the catch fires → delay, timeout, AND policy are all skipped → the handler has .NET defaults (KeepAlivePingDelay = Timeout.InfiniteTimeSpan, KeepAlivePingPolicy = WithActiveRequests) → PING keep-alive is completely disabled. A bad delay value shouldn't prevent the timeout and policy from being configured with their defaults.
Suggested structure:
// Block 1: Parse env vars (with defaults)
int pingDelayInSeconds = /* ... defaultValue: 1 */;
int pingTimeoutInSeconds = /* ... defaultValue: 2 */;
// Block 2: Set delay (independent)
try { keepAlivePingDelayInfo?.SetValue(...); }
catch (Exception ex) { DefaultTrace.TraceWarning(...); }
// Block 3: Set timeout (independent)
try { keepAlivePingTimeoutInfo?.SetValue(...); }
catch (Exception ex) { DefaultTrace.TraceWarning(...); }
// Block 4: Set policy (independent)
try { /* Enum.ToObject + SetValue */ }
catch (Exception ex) { DefaultTrace.TraceWarning(...); }This way, a failure in one property still allows the others to be applied — all partial states are safe since the .NET defaults are reasonable fallbacks.
|
✅ Review complete (30:35) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
Pull Request Template
Description
Configures HTTP/2 PING keep-alive on the SDK-managed SocketsHttpHandler to detect and evict broken connections from the pool.
Without this, a broken HTTP/2 connection can linger in the pool indefinitely, causing persistent 500 errors that only resolve after application restart.
This was observed across multiple Go SDK customers using HTTP/2 and applies to the .NET SDK in thin client mode which explicitly uses HTTP/2.
Settings added:
KeepAlivePingDelay │ 30s │ Send PING after 30s of inactivity
KeepAlivePingTimeout │ 20s │ Mark connection dead if no PONG within 20s
KeepAlivePingPolicy │ Always │ Ping idle connections too — critical for detecting broken pooled connections
Type of change
Please delete options that are not relevant.
Closing issues
To automatically close an issue: closes #IssueNumber