Skip to content

feat(revenue-analytics): Resolve posthog_person_distinct_id from child Stripe objects#53595

Merged
arthurdedeus merged 8 commits intomasterfrom
feat/coalesce-distinct-id-from-child-objects
Apr 8, 2026
Merged

feat(revenue-analytics): Resolve posthog_person_distinct_id from child Stripe objects#53595
arthurdedeus merged 8 commits intomasterfrom
feat/coalesce-distinct-id-from-child-objects

Conversation

@arthurdedeus
Copy link
Copy Markdown
Contributor

@arthurdedeus arthurdedeus commented Apr 7, 2026

Problem

The initial solution to ask our users to update their Stripe customers at every charge/subscription didn't sound great. Let's simplify this for them by coalescing the metadata field from customers and their child objects (subscriptions, charges and invoices) to build the metadata column in customer_revenue_view.

This way the set up will be to simply pass in metadata at Stripe customer, subscription/charge/invoice creation, without the need to update the customer (see the docs PR if this is still unclear).

Part of #52270

PS: I'm not super concerned about the performance impact this can have when managed-viewsets flag is off (i.e. we do all computation at query time), as the goal is to move everyone to the new architecture soon. If we start seeing complaints, we can start migrating people over.

Changes

  • Enrich the customer revenue view's metadata field with posthog_person_distinct_id resolved from child Stripe objects (subscription, charge, invoice) when the customer doesn't have it directly
  • Pick the freshest child by created timestamp; customer's own value always takes priority
  • Inject posthog_person_distinct_id_source (e.g. subscription::sub_123) for debuggability

How did you test this code?

New snapshot tests for all child schema combinations, integration test with Nullable columns via RevenueAnalyticsTopCustomersQueryRunner, and full revenue analytics + persons join suite passes.

Publish to changelog?

No

Docs update

Already written here

LLM context

Co-authored with Claude Code (Opus 4.6).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

🎭 Playwright report · View test results →

⚠️ 1 flaky test:

  • Materialize view pane (chromium)

These issues are not necessarily caused by your changes.
Annoyed by this comment? Help fix flakies and failures and it'll disappear!

@tests-posthog
Copy link
Copy Markdown
Contributor

tests-posthog bot commented Apr 7, 2026

Query snapshots: Backend query snapshots updated

Changes: 1 snapshots (1 modified, 0 added, 0 deleted)

What this means:

  • Query snapshots have been automatically updated to match current output
  • These changes reflect modifications to database queries or schema

Next steps:

  • Review the query changes to ensure they're intentional
  • If unexpected, investigate what caused the query to change

Review snapshot changes →

@arthurdedeus arthurdedeus self-assigned this Apr 7, 2026
@arthurdedeus arthurdedeus force-pushed the fix/revenue-analytics-person-join branch from fc3a1c3 to ee07485 Compare April 7, 2026 17:58
@arthurdedeus arthurdedeus force-pushed the feat/coalesce-distinct-id-from-child-objects branch from fbf0fad to a71e13f Compare April 7, 2026 18:01
@arthurdedeus arthurdedeus force-pushed the fix/revenue-analytics-person-join branch from ee07485 to c457be8 Compare April 7, 2026 18:07
@arthurdedeus arthurdedeus force-pushed the feat/coalesce-distinct-id-from-child-objects branch from a71e13f to 7b2b74d Compare April 7, 2026 18:08
@tests-posthog
Copy link
Copy Markdown
Contributor

tests-posthog bot commented Apr 7, 2026

Query snapshots: Backend query snapshots updated

Changes: 1 snapshots (1 modified, 0 added, 0 deleted)

What this means:

  • Query snapshots have been automatically updated to match current output
  • These changes reflect modifications to database queries or schema

Next steps:

  • Review the query changes to ensure they're intentional
  • If unexpected, investigate what caused the query to change

Review snapshot changes →

@arthurdedeus arthurdedeus marked this pull request as ready for review April 8, 2026 12:15
@arthurdedeus arthurdedeus requested a review from a team as a code owner April 8, 2026 12:15
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 8, 2026

Vulnerabilities

No security concerns identified. The f-string SQL interpolation in _build_child_union_leg uses only the internal Stripe table name obtained from the system schema (not user input), and a nosemgrep annotation is correctly in place.

Comments Outside Diff (1)

  1. products/revenue_analytics/backend/views/sources/test/stripe/test_stripe_customer_metadata_resolution.py, line 220-224 (link)

    P2 Integration test assertions are too weak to validate the feature

    The test sets up three distinct customer scenarios (no metadata, direct metadata, no children) and several child objects specifically to verify metadata resolution. However, the only assertion is assertIsNotNone(response), which confirms the query runs without error but says nothing about whether the resolution logic is correct.

    Given the complexity of the new resolution logic, this test should verify actual outcomes. For example:

    def test_top_customers_query_works_with_nullable_metadata_columns(self):
        response = self._run_top_customers_query()
        self.assertIsNotNone(response)
        results = {row[0]: row for row in response.results}  # keyed by customer id
    
        # cus_no_meta should get alice_distinct resolved from the subscription
        alice_row = results["cus_no_meta"]
        self.assertIn("alice_distinct", str(alice_row))
    
        # cus_with_meta should keep bob_distinct (customer-level wins)
        bob_row = results["cus_with_meta"]
        self.assertIn("bob_distinct", str(bob_row))

    Without asserting the actual resolved values, a regression in the _ENRICHED_METADATA_EXPR logic (e.g., the if branches being swapped) would go undetected.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: products/revenue_analytics/backend/views/sources/test/stripe/test_stripe_customer_metadata_resolution.py
    Line: 220-224
    
    Comment:
    **Integration test assertions are too weak to validate the feature**
    
    The test sets up three distinct customer scenarios (no metadata, direct metadata, no children) and several child objects specifically to verify metadata resolution. However, the only assertion is `assertIsNotNone(response)`, which confirms the query runs without error but says nothing about whether the resolution logic is correct.
    
    Given the complexity of the new resolution logic, this test should verify actual outcomes. For example:
    
    ```python
    def test_top_customers_query_works_with_nullable_metadata_columns(self):
        response = self._run_top_customers_query()
        self.assertIsNotNone(response)
        results = {row[0]: row for row in response.results}  # keyed by customer id
    
        # cus_no_meta should get alice_distinct resolved from the subscription
        alice_row = results["cus_no_meta"]
        self.assertIn("alice_distinct", str(alice_row))
    
        # cus_with_meta should keep bob_distinct (customer-level wins)
        bob_row = results["cus_with_meta"]
        self.assertIn("bob_distinct", str(bob_row))
    ```
    
    Without asserting the actual resolved values, a regression in the `_ENRICHED_METADATA_EXPR` logic (e.g., the `if` branches being swapped) would go undetected.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: products/revenue_analytics/backend/views/sources/test/stripe/test_stripe_customer_metadata_resolution.py
Line: 220-224

Comment:
**Integration test assertions are too weak to validate the feature**

The test sets up three distinct customer scenarios (no metadata, direct metadata, no children) and several child objects specifically to verify metadata resolution. However, the only assertion is `assertIsNotNone(response)`, which confirms the query runs without error but says nothing about whether the resolution logic is correct.

Given the complexity of the new resolution logic, this test should verify actual outcomes. For example:

```python
def test_top_customers_query_works_with_nullable_metadata_columns(self):
    response = self._run_top_customers_query()
    self.assertIsNotNone(response)
    results = {row[0]: row for row in response.results}  # keyed by customer id

    # cus_no_meta should get alice_distinct resolved from the subscription
    alice_row = results["cus_no_meta"]
    self.assertIn("alice_distinct", str(alice_row))

    # cus_with_meta should keep bob_distinct (customer-level wins)
    bob_row = results["cus_with_meta"]
    self.assertIn("bob_distinct", str(bob_row))
```

Without asserting the actual resolved values, a regression in the `_ENRICHED_METADATA_EXPR` logic (e.g., the `if` branches being swapped) would go undetected.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: products/revenue_analytics/backend/views/sources/stripe/customer.py
Line: 280-291

Comment:
**Two independent `argMax` calls may return values from different rows on `created_at` ties**

`argMax(distinct_id, created_at)` and `argMax(source_ref, created_at)` are evaluated independently. ClickHouse makes no guarantee that, when two rows share the same maximum `created_at`, both calls will pick values from the *same* row. In that (admittedly rare) edge case, `resolved_distinct_id` could come from a subscription and `resolved_source` might point to a charge — giving a misleading debug annotation.

A simple way to make them consistent is to use a single `argMax` on a tuple:

```python
ast.Alias(
    alias="resolved_pair",
    expr=ast.Call(
        name="argMax",
        args=[
            ast.Tuple(exprs=[ast.Field(chain=["distinct_id"]), ast.Field(chain=["source_ref"])]),
            ast.Field(chain=["created_at"]),
        ],
    ),
),
```

then unpack `resolved_pair.1` / `resolved_pair.2` in the outer select. Alternatively, since `source_ref` is already derived from the same row as `distinct_id` (same `id`), a composite string like `concat(distinct_id, '|', source_ref)` and a single `argMax` would also work.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(revenue-analytics): Fix semgrep and ..." | Re-trigger Greptile

Comment on lines +280 to +291
ast.Field(chain=["customer_id"]),
ast.Alias(
alias="resolved_distinct_id",
expr=ast.Call(name="argMax", args=[ast.Field(chain=["distinct_id"]), ast.Field(chain=["created_at"])]),
),
ast.Alias(
alias="resolved_source",
expr=ast.Call(name="argMax", args=[ast.Field(chain=["source_ref"]), ast.Field(chain=["created_at"])]),
),
],
select_from=ast.JoinExpr(table=union_query, alias="child_meta"),
group_by=[ast.Field(chain=["customer_id"])],
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Two independent argMax calls may return values from different rows on created_at ties

argMax(distinct_id, created_at) and argMax(source_ref, created_at) are evaluated independently. ClickHouse makes no guarantee that, when two rows share the same maximum created_at, both calls will pick values from the same row. In that (admittedly rare) edge case, resolved_distinct_id could come from a subscription and resolved_source might point to a charge — giving a misleading debug annotation.

A simple way to make them consistent is to use a single argMax on a tuple:

ast.Alias(
    alias="resolved_pair",
    expr=ast.Call(
        name="argMax",
        args=[
            ast.Tuple(exprs=[ast.Field(chain=["distinct_id"]), ast.Field(chain=["source_ref"])]),
            ast.Field(chain=["created_at"]),
        ],
    ),
),

then unpack resolved_pair.1 / resolved_pair.2 in the outer select. Alternatively, since source_ref is already derived from the same row as distinct_id (same id), a composite string like concat(distinct_id, '|', source_ref) and a single argMax would also work.

Prompt To Fix With AI
This is a comment left during a code review.
Path: products/revenue_analytics/backend/views/sources/stripe/customer.py
Line: 280-291

Comment:
**Two independent `argMax` calls may return values from different rows on `created_at` ties**

`argMax(distinct_id, created_at)` and `argMax(source_ref, created_at)` are evaluated independently. ClickHouse makes no guarantee that, when two rows share the same maximum `created_at`, both calls will pick values from the *same* row. In that (admittedly rare) edge case, `resolved_distinct_id` could come from a subscription and `resolved_source` might point to a charge — giving a misleading debug annotation.

A simple way to make them consistent is to use a single `argMax` on a tuple:

```python
ast.Alias(
    alias="resolved_pair",
    expr=ast.Call(
        name="argMax",
        args=[
            ast.Tuple(exprs=[ast.Field(chain=["distinct_id"]), ast.Field(chain=["source_ref"])]),
            ast.Field(chain=["created_at"]),
        ],
    ),
),
```

then unpack `resolved_pair.1` / `resolved_pair.2` in the outer select. Alternatively, since `source_ref` is already derived from the same row as `distinct_id` (same `id`), a composite string like `concat(distinct_id, '|', source_ref)` and a single `argMax` would also work.

How can I resolve this? If you propose a fix, please make it concise.

Base automatically changed from fix/revenue-analytics-person-join to master April 8, 2026 20:31
@arthurdedeus arthurdedeus force-pushed the feat/coalesce-distinct-id-from-child-objects branch from d147dda to 16f548f Compare April 8, 2026 21:20
@arthurdedeus arthurdedeus force-pushed the feat/coalesce-distinct-id-from-child-objects branch 2 times, most recently from 23ef8aa to e74ed68 Compare April 8, 2026 21:30
@tests-posthog
Copy link
Copy Markdown
Contributor

tests-posthog bot commented Apr 8, 2026

⏭️ Skipped snapshot commit because branch advanced to e74ed68 while workflow was testing 16f548f.

The new commit will trigger its own snapshot update workflow.

If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:

  • Merge master into your branch, or
  • Push an empty commit: git commit --allow-empty -m 'trigger CI' && git push

@tests-posthog
Copy link
Copy Markdown
Contributor

tests-posthog bot commented Apr 8, 2026

Query snapshots: Backend query snapshots updated

Changes: 2 snapshots (2 modified, 0 added, 0 deleted)

What this means:

  • Query snapshots have been automatically updated to match current output
  • These changes reflect modifications to database queries or schema

Next steps:

  • Review the query changes to ensure they're intentional
  • If unexpected, investigate what caused the query to change

Review snapshot changes →

@arthurdedeus arthurdedeus merged commit 3630c3a into master Apr 8, 2026
198 checks passed
@arthurdedeus arthurdedeus deleted the feat/coalesce-distinct-id-from-child-objects branch April 8, 2026 22:29
rnegron pushed a commit that referenced this pull request Apr 9, 2026
…d Stripe objects (#53595)

Co-authored-by: tests-posthog[bot] <250237707+tests-posthog[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants