Skip to content

[Socket Handler]: Adds HTTP/2 PING keep-alive to detect broken connections in pool.#5788

Merged
kirankumarkolli merged 4 commits intomasterfrom
users/aavasthy/http2-keepalive-ping
Apr 23, 2026
Merged

[Socket Handler]: Adds HTTP/2 PING keep-alive to detect broken connections in pool.#5788
kirankumarkolli merged 4 commits intomasterfrom
users/aavasthy/http2-keepalive-ping

Conversation

@aavasthy
Copy link
Copy Markdown
Contributor

Pull Request Template

Description

Configures HTTP/2 PING keep-alive on the SDK-managed SocketsHttpHandler to detect and evict broken connections from the pool.

Without this, a broken HTTP/2 connection can linger in the pool indefinitely, causing persistent 500 errors that only resolve after application restart.
This was observed across multiple Go SDK customers using HTTP/2 and applies to the .NET SDK in thin client mode which explicitly uses HTTP/2.

Settings added:
KeepAlivePingDelay │ 30s │ Send PING after 30s of inactivity
KeepAlivePingTimeout │ 20s │ Mark connection dead if no PONG within 20s
KeepAlivePingPolicy │ Always │ Ping idle connections too — critical for detecting broken pooled connections

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Closing issues

To automatically close an issue: closes #IssueNumber

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good!

@aavasthy aavasthy changed the title Add keepalive ping mechanism for http2 [Socket Handler]: Adds HTTP/2 PING keep-alive to detect broken connections in pool. Apr 15, 2026
@aavasthy aavasthy self-assigned this Apr 15, 2026
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

keepAlivePingDelayInfo?.SetValue(socketHttpHandler, TimeSpan.FromSeconds(30));

PropertyInfo keepAlivePingTimeoutInfo = socketHandlerType.GetProperty("KeepAlivePingTimeout");
keepAlivePingTimeoutInfo?.SetValue(socketHttpHandler, TimeSpan.FromSeconds(20));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are the values determined?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HOw about customization through ENV variables?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - definitely do not hard-code - and to provide value they will likely have to be more aggressive

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cross-SDK data to support this thread:

The Go SDK (which motivated this PR "observed across multiple Go SDK customers") already has this fix since October 2023 (PR #21468) with significantly more aggressive values:

Setting .NET (this PR) Go SDK (production since 2023)
Ping delay 30s 10s
Ping timeout 20s 5s
Total detection time ~50s ~15s

That's 3.3 slower broken-connection detection in .NET. During that 50s window, every request routed to the broken connection fails.

The Go values were chosen after real customer impact analysis (golang/go#59690) and have been stable in production for 2.5 years. If the goal is to match the customer-facing behavior that resolved the Go SDK issue, aligning closer to those values (or making them ENV-configurable as @kirankumarkolli suggested) would be prudent.


AI-generated review may be incorrect. Agree? resolve the conversation. Disagree? reply with your reasoning.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xinlian12
Copy link
Copy Markdown
Member

@sdkReviewAgent

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

if (keepAlivePingPolicyInfo != null)
{
Type pingPolicyType = keepAlivePingPolicyInfo.PropertyType;
object alwaysValue = Enum.ToObject(pingPolicyType, 1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Suggestion — Resilience: Prefer Enum.Parse by name over Enum.ToObject by ordinal

object alwaysValue = Enum.ToObject(pingPolicyType, 1);

The hardcoded 1 is correct today (HttpKeepAlivePingPolicy.Always = 1), but it's an ordinal assumption with no compile-time validation. If Microsoft ever inserted a new enum member before Always, this would silently apply the wrong policy.

Consider using name-based lookup instead:

object alwaysValue = Enum.Parse(pingPolicyType, "Always");

This is self-documenting, resilient to ordinal changes, and equally correct. The performance difference is irrelevant — this runs once at client startup. Enum.Parse will throw ArgumentException if the name changes (surfacing the problem immediately), whereas Enum.ToObject would silently apply the wrong policy.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12
Copy link
Copy Markdown
Member

Review complete (23:26)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

{
int pingDelayInSeconds = ConfigurationManager.GetEnvironmentVariable<int>(
ConfigurationManager.Http2KeepAlivePingDelayInSeconds,
defaultValue: 1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaults seems very aggressive.
Lets align with eventual RUST expected values.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like there are the RUST defaults.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert.IsFalse(clientHandler.ServerCertificateCustomValidationCallback.Invoke(new HttpRequestMessage(), x509Certificate2, x509Chain, sslPolicyErrors));
}

[TestMethod]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible concurrency issue with other tests?

// KeepAlivePingDelay/Timeout/Policy are available on SocketsHttpHandler in .NET 5.0+.
try
{
int pingDelayInSeconds = ConfigurationManager.GetEnvironmentVariable<int>(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation — Resilience: Add Math.Max clamping on env var overrides to prevent silent feature disablement

int pingDelayInSeconds = ConfigurationManager.GetEnvironmentVariable<int>(
    ConfigurationManager.Http2KeepAlivePingDelayInSeconds,
    defaultValue: 1);

int pingTimeoutInSeconds = ConfigurationManager.GetEnvironmentVariable<int>(
    ConfigurationManager.Http2KeepAlivePingTimeoutInSeconds,
    defaultValue: 2);

Both KeepAlivePingDelay and KeepAlivePingTimeout on SocketsHttpHandler require values strictly greater than TimeSpan.Zero (the setter throws ArgumentOutOfRangeException for <= TimeSpan.Zero). If a user sets AZURE_COSMOS_HTTP2_KEEPALIVE_PING_DELAY_IN_SECONDS=0 or a negative value:

  1. GetEnvironmentVariable<int> parses successfully (valid int)
  2. TimeSpan.FromSeconds(0)TimeSpan.Zero
  3. SetValue throws ArgumentOutOfRangeException
  4. Caught by catch (Exception ex)all three settings skipped (delay, timeout, AND policy)
  5. Handler has .NET defaults: KeepAlivePingDelay = Timeout.InfiniteTimeSpanpings disabled entirely

The feature this PR adds is silently defeated with only a trace warning. This is especially subtle because the delay failure also prevents the KeepAlivePingPolicy = Always from being set — so even if only the timeout env var is invalid, the policy reverts to WithActiveRequests (which doesn't ping idle connections, defeating the purpose).

The codebase already has an established pattern for this in ConfigurationManager.cs:159-166:

// Existing pattern
return Math.Max(
    ConfigurationManager.GetEnvironmentVariable(..., defaultValue: ...),
    MinMaxRetriesInLocalRegionWhenRemoteRegionPreferred);

Suggested fix:

int pingDelayInSeconds = Math.Max(1, ConfigurationManager.GetEnvironmentVariable<int>(
    ConfigurationManager.Http2KeepAlivePingDelayInSeconds,
    defaultValue: 1));

int pingTimeoutInSeconds = Math.Max(1, ConfigurationManager.GetEnvironmentVariable<int>(
    ConfigurationManager.Http2KeepAlivePingTimeoutInSeconds,
    defaultValue: 2));

This guarantees values are always ≥ 1 second, which is the minimum SocketsHttpHandler accepts, and follows the existing clamping convention.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12
Copy link
Copy Markdown
Member

Review complete (27:01)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

// reset) can remain in the pool indefinitely, causing persistent request failures
// that only resolve after application restart.
// KeepAlivePingDelay/Timeout/Policy are available on SocketsHttpHandler in .NET 5.0+.
try
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation — Resilience: Separate try-catch blocks for independent property sets

The single try/catch block (lines 191–220) groups env var parsing, KeepAlivePingDelay, KeepAlivePingTimeout, and KeepAlivePingPolicy together. If any one operation fails — including a FormatException from a non-numeric env var string, or a TargetInvocationException from the delay setter — the entire block is skipped, including the critical KeepAlivePingPolicy = Always.

The sibling EnableMultipleHttp2Connections just above (lines 176–184) uses its own isolated try-catch, which is the established pattern in this method. Each logical property set is independent and should fail independently.

Concrete failure scenario: A user sets AZURE_COSMOS_HTTP2_KEEPALIVE_PING_DELAY_IN_SECONDS=fast (non-numeric). Convert.ChangeType throws before any property is set → the catch fires → delay, timeout, AND policy are all skipped → the handler has .NET defaults (KeepAlivePingDelay = Timeout.InfiniteTimeSpan, KeepAlivePingPolicy = WithActiveRequests) → PING keep-alive is completely disabled. A bad delay value shouldn't prevent the timeout and policy from being configured with their defaults.

Suggested structure:

// Block 1: Parse env vars (with defaults)
int pingDelayInSeconds = /* ... defaultValue: 1 */;
int pingTimeoutInSeconds = /* ... defaultValue: 2 */;

// Block 2: Set delay (independent)
try { keepAlivePingDelayInfo?.SetValue(...); }
catch (Exception ex) { DefaultTrace.TraceWarning(...); }

// Block 3: Set timeout (independent)
try { keepAlivePingTimeoutInfo?.SetValue(...); }
catch (Exception ex) { DefaultTrace.TraceWarning(...); }

// Block 4: Set policy (independent)
try { /* Enum.ToObject + SetValue */ }
catch (Exception ex) { DefaultTrace.TraceWarning(...); }

This way, a failure in one property still allows the others to be applied — all partial states are safe since the .NET defaults are reasonable fallbacks.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12
Copy link
Copy Markdown
Member

Review complete (30:35)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@aavasthy aavasthy added the auto-merge Enables automation to merge PRs label Apr 19, 2026
@microsoft-github-policy-service microsoft-github-policy-service Bot enabled auto-merge (squash) April 19, 2026 06:19
This was referenced Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge Enables automation to merge PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants