20 commits
252ddbf
feat(dbt): add valid_email adapter dispatch macro
pmbrull May 9, 2026
5408c35
fix(dbt): make trino valid_email regex case-insensitive via (?i) flag
pmbrull May 9, 2026
435f60d
feat(dbt): add days_between adapter dispatch macro
pmbrull May 9, 2026
111aeef
refactor(dbt): use valid_email macro in customers staging
pmbrull May 9, 2026
2f8b6da
refactor(dbt): replace :: casts with CAST in staging models
pmbrull May 9, 2026
8f20f2b
refactor(dbt): replace :: casts with CAST in intermediate models
pmbrull May 9, 2026
1211987
refactor(dbt): use days_between macro and CAST in dim_customers
pmbrull May 9, 2026
7f98ddf
refactor(dbt): replace :: casts with CAST in marts models
pmbrull May 9, 2026
c6d7f8b
feat(dbt): add starburst profile output (env-driven)
pmbrull May 9, 2026
bc24a87
feat(dbt): make sources target-aware (postgres vs trino catalog)
pmbrull May 9, 2026
e398820
feat(demo): add export-raw-to-seeds.sh helper
pmbrull May 9, 2026
9096c16
fix(demo): use docker exec in export script (no host psql required)
pmbrull May 9, 2026
8adec33
feat(demo): add raw Jaffle Shop seed CSVs for Starburst target
pmbrull May 9, 2026
d6177e6
feat(dbt): declare seed column types for Iceberg loading
pmbrull May 9, 2026
504c857
feat(dbt): configure seeds (gated to trino target only)
pmbrull May 9, 2026
b3d66f5
feat(make): add demo-dbt-starburst targets
pmbrull May 9, 2026
832a207
docs(demo): document Starburst dbt usage and env vars
pmbrull May 9, 2026
5492b70
fix(dbt): macroize Postgres-specific extract() calls for Trino portab…
pmbrull May 9, 2026
f62cc5b
starburst dbt example
pmbrull May 10, 2026
c9cad4c
data
pmbrull May 11, 2026
24 changes: 23 additions & 1 deletion Makefile
@@ -178,7 +178,8 @@ release: ## Create a GitHub Release (usage: make release [B=branch])
.PHONY: build-all test-all test-integration install-cli \
lint lint-python lint-rust lint-typescript lint-java lint-n8n \
format format-python format-rust format-typescript format-java format-n8n \
install-hooks install-local install-dbt demo-database demo-database-stop demo-dbt demo-gdpr demo-n8n
install-hooks install-local install-dbt install-dbt-starburst demo-database demo-database-stop demo-dbt \
demo-export-seeds demo-dbt-starburst-seed demo-dbt-starburst demo-gdpr demo-n8n

install-local: ## Install Python SDK locally in editable mode (for development)
@echo "Installing Python SDK (editable, all extras)..."
@@ -190,6 +191,11 @@ install-dbt: ## Install dbt-postgres for the demo database
pip install dbt-postgres
@echo "dbt-postgres installed"

install-dbt-starburst: ## Install dbt-trino for the Starburst target
@echo "Installing dbt-trino..."
pip install 'dbt-trino>=1.7,<2.0'
@echo "dbt-trino installed"

demo-database: ## Start the demo Jaffle Shop database (PostgreSQL + Metabase)
@echo "Starting demo database..."
cd cookbook/resources/demo-database/docker && docker-compose up -d
@@ -212,6 +218,22 @@ demo-dbt: ## Run dbt models against the demo database
@echo ""
@echo "dbt models and tests completed"

demo-export-seeds: ## Export PG raw_* tables to dbt seed CSVs (one-time setup for Starburst)
@echo "Exporting raw tables from PG to dbt seeds..."
bash cookbook/resources/demo-database/scripts/export-raw-to-seeds.sh

demo-dbt-starburst-seed: ## Load seed CSVs into the Starburst Iceberg catalog
@echo "Seeding Iceberg with raw Jaffle Shop data..."
@echo "Required env: STARBURST_HOST, STARBURST_USER, STARBURST_PASSWORD"
cd cookbook/resources/demo-database/dbt && DBT_PROFILES_DIR=$$(pwd) dbt seed --target starburst

demo-dbt-starburst: ## Run + test dbt models against Starburst
@echo "Running dbt against Starburst..."
@echo "Required env: STARBURST_HOST, STARBURST_USER, STARBURST_PASSWORD"
cd cookbook/resources/demo-database/dbt && DBT_PROFILES_DIR=$$(pwd) dbt run --target starburst
@echo ""
cd cookbook/resources/demo-database/dbt && DBT_PROFILES_DIR=$$(pwd) dbt test --target starburst

demo-gdpr: ## Start the GDPR DSAR compliance demo
@echo "Installing dependencies..."
@cd cookbook/gdpr-dsar-compliance && npm install --silent
132 changes: 132 additions & 0 deletions cookbook/resources/demo-database/README.md
@@ -241,6 +241,138 @@ After ingestion, you should see:
| Lineage Analysis | Trace lineage from `raw_stripe.payments` to `fct_monthly_revenue` |
| MCP Integration | Query metadata via Claude/LLM for impact analysis |

## Running against Starburst (Iceberg)

The same dbt project can target a remote Starburst instance with an Iceberg catalog.
The raw Jaffle Shop data ships as dbt seeds (CSVs in `dbt/seeds/`) and is loaded
on demand.

### Prerequisites

- A Starburst Galaxy account (or self-hosted Starburst Enterprise) with an Iceberg catalog configured
- An object-storage bucket for the Iceberg data (S3 / GCS / ADLS) — Galaxy is compute-only; you bring the storage
- `make install-dbt-starburst` (installs `dbt-trino>=1.7,<2.0`)

### Galaxy quick start

#### 1. Object storage (S3 example)

Galaxy supports a fixed list of AWS regions — `eu-west-3` (Paris) is **not** one of them. Pick a Galaxy-supported region close to you (`eu-west-1` Ireland is closest to Paris):

```bash
BUCKET=<globally-unique-name>
REGION=eu-west-1

aws s3api create-bucket --bucket "$BUCKET" --region "$REGION" \
--create-bucket-configuration LocationConstraint="$REGION"
```

#### 2. IAM user with long-lived keys

Galaxy needs **permanent access keys**, not STS assume-role / SSO credentials. Create a dedicated IAM user scoped to the bucket:

```bash
cat > /tmp/galaxy-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{"Effect": "Allow", "Action": ["s3:ListBucket","s3:GetBucketLocation"], "Resource": "arn:aws:s3:::${BUCKET}"},
{"Effect": "Allow", "Action": ["s3:GetObject","s3:PutObject","s3:DeleteObject"], "Resource": "arn:aws:s3:::${BUCKET}/*"}
]
}
EOF

aws iam create-user --user-name galaxy-${BUCKET}
aws iam put-user-policy --user-name galaxy-${BUCKET} \
--policy-name ${BUCKET}-bucket --policy-document file:///tmp/galaxy-policy.json
aws iam create-access-key --user-name galaxy-${BUCKET}
# capture AccessKeyId + SecretAccessKey — the secret is shown only once
```
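The `create-access-key` call returns JSON, and the secret is shown only once, so it pays to capture both fields programmatically rather than copy by hand. A sketch, using a fabricated sample of the output shape rather than a live call:

```shell
# Parse AccessKeyId / SecretAccessKey out of `aws iam create-access-key` JSON.
# The JSON below is a fabricated sample mirroring the real output shape;
# in practice, pipe the actual command output into the same parser.
json='{"AccessKey":{"AccessKeyId":"AKIAEXAMPLE","SecretAccessKey":"secretEXAMPLE","Status":"Active"}}'

access_key=$(printf '%s' "$json" | python3 -c \
  'import json,sys; print(json.load(sys.stdin)["AccessKey"]["AccessKeyId"])')
secret_key=$(printf '%s' "$json" | python3 -c \
  'import json,sys; print(json.load(sys.stdin)["AccessKey"]["SecretAccessKey"])')

echo "$access_key"
```

Both values then go straight into the Galaxy catalog form in the next step.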

#### 3. Galaxy catalog + cluster

In Galaxy UI:

- **Catalogs → Create catalog → Amazon S3 / Iceberg.** Name the catalog `iceberg`, point it at your bucket and the bucket's region, and paste the IAM access key/secret. Use the Galaxy-managed metastore (default).
- **Clusters → Create cluster.** Pick the **Free** size, the same region as the bucket, auto-suspend at 1 minute, and attach the `iceberg` catalog.

#### 4. Find the connection details

In Galaxy → **Clusters → \<your-cluster\> → Connect with → Trino CLI**. Note two values exactly as shown:

- **Host** — Galaxy's pattern is `<account>-<cluster>.trino.galaxy.starburst.io`. Example: `collatetest-free-cluster-ireland.trino.galaxy.starburst.io`. **Not** `<cluster>.<account>.galaxy.starburst.io`, and **not** the Galaxy console URL `<account>.galaxy.starburst.io` — that one returns HTTP 405 to Trino traffic.
- **`--user` value** — Galaxy puts the role into the username, e.g. `pmbrull@getcollate.io/accountadmin`. Copy it verbatim.

#### 5. Personal Access Token (the password)

Galaxy's "API tokens" come in two flavors. They are **not JWTs** — they're opaque secrets used as the password over HTTP Basic auth (LDAP method in dbt-trino).

- **Personal user PAT:** Galaxy → click your avatar → *Personal access tokens* → Create. Username will be `<email>/<role>`. This is the simplest path for development.
- **Service-account token:** Galaxy → Admin → Service accounts → Create. Username will be the random ID Galaxy assigns, e.g. `i94xdO6k7bNPWloY`. Better for CI / production but the SA must be granted the role and cluster privileges separately.

The username and password must belong to the **same identity**. A personal-user PAT will not authenticate as a service account, and vice versa — you'll see HTTP 401 (`access_denied`) or 404 (`User not found`).
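Since the two username shapes differ, a local heuristic can catch a mismatched username/token pair before dbt ever connects. A sketch; the patterns are assumptions based on the examples above:

```shell
# Heuristic only: classify a Galaxy username by shape, assuming personal
# users look like <email>/<role> and service accounts are bare opaque IDs.
classify_identity() {
  case "$1" in
    *@*/*) echo "personal user + role (pair with a personal PAT)" ;;
    *)     echo "service account id (pair with a service-account token)" ;;
  esac
}

classify_identity 'pmbrull@getcollate.io/accountadmin'
classify_identity 'i94xdO6k7bNPWloY'
```

If the classification disagrees with the kind of token you created, fix that before reading any 401/404 tea leaves.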

#### 6. Set env vars and run

```bash
export STARBURST_METHOD=ldap # default; explicit for clarity
export STARBURST_HOST=collatetest-free-cluster-ireland.trino.galaxy.starburst.io
export STARBURST_USER='pmbrull@getcollate.io/accountadmin' # quote because of the /
export STARBURST_CATALOG=iceberg

# Keep the PAT out of shell history:
read -rs STARBURST_PASSWORD && export STARBURST_PASSWORD

cd cookbook/resources/demo-database/dbt
DBT_PROFILES_DIR=$(pwd) dbt debug --target starburst
# expect: All checks passed!
```

Then load and build:

```bash
make demo-dbt-starburst-seed # load CSVs into iceberg.raw_* schemas
make demo-dbt-starburst # build staging -> intermediate -> marts
```

### Generating the seed CSVs (one-time per data version)

The CSVs are generated from the local PostgreSQL demo (so the same row data
loaded by `init.sql` is what lands in Starburst):

```bash
make demo-database # start PG
make demo-export-seeds # PG raw_* tables -> dbt/seeds/raw_*/*.csv
git add cookbook/resources/demo-database/dbt/seeds && git commit
```

CSVs are versioned in git — re-running `demo-export-seeds` is only needed when
`init.sql` changes.

### Authenticating with a real JWT (alternative)

If your Starburst is wired to an external IdP (Okta, Auth0, Azure AD) that issues real JWTs (token starts with `eyJ...` and has three dot-separated segments), use JWT auth instead:

```bash
export STARBURST_METHOD=jwt
export STARBURST_USER=your.user@org # cosmetic — identity comes from the token's `sub`
export STARBURST_JWT_TOKEN=eyJhbGciOi...
unset STARBURST_PASSWORD
```

Galaxy's built-in API tokens are **not** JWTs — don't use this path with them.
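A quick local test tells the two token kinds apart without contacting any server. The token below is a fabricated placeholder, not a real credential:

```shell
# A real JWT has three dot-separated segments, and the first segment
# base64-decodes to a JSON header containing "alg". Opaque PATs fail this.
looks_like_jwt() {
  [ "$(printf '%s' "$1" | awk -F. '{print NF}')" -eq 3 ] &&
    printf '%s' "$1" | cut -d. -f1 | base64 -d 2>/dev/null | grep -q '"alg"'
}

# Fabricated sample token (header {"alg":"HS256","typ":"JWT"}):
if looks_like_jwt 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJkZW1vIn0.c2ln'; then
  echo "use STARBURST_METHOD=jwt"
else
  echo "not a JWT -- stick with ldap + PAT"
fi
```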

### Troubleshooting

| Symptom | Cause |
|---|---|
| `error 405` from `/v1/statement` | `STARBURST_HOST` points at the Galaxy console (`<account>.galaxy.starburst.io`) instead of the cluster (`<account>-<cluster>.trino.galaxy.starburst.io`). |
| TLS `SSLV3_ALERT_HANDSHAKE_FAILURE` | Same — wrong host. The wildcard cert doesn't cover the made-up name. macOS LibreSSL can also produce this for unrelated reasons; if `dbt debug` succeeds, ignore curl-side failures. |
| `error 401: access_denied` | Username and password belong to different identities (e.g., personal email + SA token), or the role lacks `Use cluster` / `Use catalog` privileges. |
| `error 404: User not found` | Username format wrong. Copy the exact `--user` from Galaxy's *Connect with* tab. |
| `Incorrect S3 access credentials` (Galaxy catalog wizard) | IAM keys with whitespace from the paste, region mismatch between the catalog form and the actual bucket location, or keys still propagating (~30–60s). |
| `Unsupported aws regions: [eu-west-3]` | Galaxy doesn't run in that region. Recreate the bucket in a supported region (`eu-west-1` is the closest to Paris). |
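The first two rows are both host mistakes, so a pre-flight check on `STARBURST_HOST` is cheap. A sketch assuming Galaxy's endpoint naming from step 4:

```shell
# Pre-flight: catch the most common misconfiguration before dbt runs.
# Assumes Galaxy cluster endpoints end in .trino.galaxy.starburst.io.
check_host() {
  case "$1" in
    *.trino.galaxy.starburst.io) echo "ok: looks like a cluster endpoint" ;;
    *.galaxy.starburst.io)       echo "warning: console URL -- expect HTTP 405" ;;
    *)                           echo "note: self-hosted or custom domain" ;;
  esac
}

check_host 'collatetest-free-cluster-ireland.trino.galaxy.starburst.io'
check_host 'collatetest.galaxy.starburst.io'
```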

## Cleanup

```bash
15 changes: 15 additions & 0 deletions cookbook/resources/demo-database/dbt/dbt_project.yml
@@ -39,3 +39,18 @@ models:
        +schema: marts_finance
      marketing:
        +schema: marts_marketing

seeds:
  jaffle_shop:
    +enabled: "{{ target.type == 'trino' }}"
    +quote_columns: false
    raw_jaffle_shop:
      +schema: raw_jaffle_shop
    raw_stripe:
      +schema: raw_stripe
    raw_inventory:
      +schema: raw_inventory
    raw_marketing:
      +schema: raw_marketing
    raw_support:
      +schema: raw_support
12 changes: 12 additions & 0 deletions cookbook/resources/demo-database/dbt/macros/day_of_week_iso.sql
@@ -0,0 +1,12 @@
{# Returns the ISO day-of-week as integer: 1 = Monday ... 7 = Sunday. #}
{% macro day_of_week_iso(date_col) %}
{{ return(adapter.dispatch('day_of_week_iso')(date_col)) }}
{% endmacro %}

{% macro default__day_of_week_iso(date_col) %}
extract(isodow from {{ date_col }})
{% endmacro %}

{% macro trino__day_of_week_iso(date_col) %}
day_of_week({{ date_col }})
{% endmacro %}
12 changes: 12 additions & 0 deletions cookbook/resources/demo-database/dbt/macros/days_between.sql
@@ -0,0 +1,12 @@
{# Returns the number of whole days from start_date to end_date as integer. #}
{% macro days_between(start_date, end_date) %}
{{ return(adapter.dispatch('days_between')(start_date, end_date)) }}
{% endmacro %}

{% macro default__days_between(start_date, end_date) %}
({{ end_date }} - {{ start_date }})
{% endmacro %}

{% macro trino__days_between(start_date, end_date) %}
date_diff('day', {{ start_date }}, {{ end_date }})
{% endmacro %}
12 changes: 12 additions & 0 deletions cookbook/resources/demo-database/dbt/macros/hours_between.sql
@@ -0,0 +1,12 @@
{# Returns the number of fractional hours from start_ts to end_ts as a numeric. #}
{% macro hours_between(start_ts, end_ts) %}
{{ return(adapter.dispatch('hours_between')(start_ts, end_ts)) }}
{% endmacro %}

{% macro default__hours_between(start_ts, end_ts) %}
extract(epoch from ({{ end_ts }} - {{ start_ts }})) / 3600
{% endmacro %}

{% macro trino__hours_between(start_ts, end_ts) %}
date_diff('second', {{ start_ts }}, {{ end_ts }}) / 3600.0
{% endmacro %}
12 changes: 12 additions & 0 deletions cookbook/resources/demo-database/dbt/macros/valid_email.sql
@@ -0,0 +1,12 @@
{# Returns a boolean expression: is the given column a syntactically valid email? #}
{% macro valid_email(col) %}
{{ return(adapter.dispatch('valid_email')(col)) }}
{% endmacro %}

{% macro default__valid_email(col) %}
{{ col }} ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
{% endmacro %}

{% macro trino__valid_email(col) %}
regexp_like({{ col }}, '(?i)^[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}$')
{% endmacro %}
@@ -35,12 +35,12 @@ select
c.budget - coalesce(a.total_spend, 0) as remaining_budget,
case
when coalesce(a.total_impressions, 0) > 0
then round(a.total_clicks::decimal / a.total_impressions * 100, 2)
then round(CAST(a.total_clicks AS decimal) / a.total_impressions * 100, 2)
else 0
end as overall_ctr,
case
when coalesce(a.total_clicks, 0) > 0
then round(a.total_conversions::decimal / a.total_clicks * 100, 2)
then round(CAST(a.total_conversions AS decimal) / a.total_clicks * 100, 2)
else 0
end as overall_conversion_rate,
case
@@ -62,10 +62,10 @@ select
-- Derived metrics
case
when co.first_order_date is not null
then co.last_order_date - co.first_order_date
then {{ days_between('co.first_order_date', 'co.last_order_date') }}
else null
end as days_as_customer,
current_date - coalesce(co.last_order_date, c.created_at::date) as days_since_last_order,
{{ days_between("coalesce(co.last_order_date, CAST(c.created_at AS date))", "current_date") }} as days_since_last_order,

-- Customer segments
case
@@ -15,7 +15,7 @@ select

-- Review metrics
review_count,
round(avg_rating::decimal, 2) as avg_rating,
round(CAST(avg_rating AS decimal), 2) as avg_rating,
positive_reviews,
negative_reviews,
total_helpful_votes,
@@ -29,7 +29,7 @@ select
-- Derived metrics
case
when review_count > 0
then round(positive_reviews::decimal / review_count * 100, 2)
then round(CAST(positive_reviews AS decimal) / review_count * 100, 2)
else null
end as positive_review_pct,

@@ -34,10 +34,10 @@ select
last_payment_at,

-- Date dimensions for easy filtering
date_trunc('week', order_date)::date as order_week,
date_trunc('month', order_date)::date as order_month,
date_trunc('quarter', order_date)::date as order_quarter,
extract(dow from order_date) as day_of_week,
CAST(date_trunc('week', order_date) AS date) as order_week,
CAST(date_trunc('month', order_date) AS date) as order_month,
CAST(date_trunc('quarter', order_date) AS date) as order_quarter,
{{ day_of_week_iso('order_date') }} as day_of_week,
extract(hour from created_at) as order_hour

from {{ ref('int_orders__enriched') }}
@@ -29,7 +29,7 @@ select
-- Coupon usage
sum(case when used_coupon then 1 else 0 end) as orders_with_coupon,
round(
sum(case when used_coupon then 1 else 0 end)::decimal / count(*) * 100,
CAST(sum(case when used_coupon then 1 else 0 end) AS decimal) / count(*) * 100,
2
) as coupon_usage_rate,

@@ -52,7 +52,7 @@ select
case
when lag(total_orders) over (order by order_month) > 0
then round(
(total_orders - lag(total_orders) over (order by order_month))::decimal
CAST(total_orders - lag(total_orders) over (order by order_month) AS decimal)
/ lag(total_orders) over (order by order_month) * 100,
2
)
@@ -24,13 +24,13 @@ select
-- Calculated metrics
case
when sum(total_impressions) > 0
then round(sum(total_clicks)::decimal / sum(total_impressions) * 100, 2)
then round(CAST(sum(total_clicks) AS decimal) / sum(total_impressions) * 100, 2)
else 0
end as overall_ctr,

case
when sum(total_clicks) > 0
then round(sum(total_conversions)::decimal / sum(total_clicks) * 100, 2)
then round(CAST(sum(total_conversions) AS decimal) / sum(total_clicks) * 100, 2)
else 0
end as overall_conversion_rate,

10 changes: 5 additions & 5 deletions cookbook/resources/demo-database/dbt/models/staging/_sources.yml
@@ -2,7 +2,7 @@ version: 2

sources:
- name: jaffle_shop
database: jaffle_shop
database: "{{ 'jaffle_shop' if target.type == 'postgres' else target.database }}"
schema: raw_jaffle_shop
description: Core transactional data from the Jaffle Shop application
tables:
Expand Down Expand Up @@ -31,7 +31,7 @@ sources:
description: Line items for each order

- name: stripe
database: jaffle_shop
database: "{{ 'jaffle_shop' if target.type == 'postgres' else target.database }}"
schema: raw_stripe
description: Payment data from Stripe
tables:
@@ -47,7 +47,7 @@ sources:
description: Refund transactions

- name: inventory
database: jaffle_shop
database: "{{ 'jaffle_shop' if target.type == 'postgres' else target.database }}"
schema: raw_inventory
description: Product and inventory management data
tables:
@@ -59,7 +59,7 @@ sources:
description: Current inventory levels

- name: marketing
database: jaffle_shop
database: "{{ 'jaffle_shop' if target.type == 'postgres' else target.database }}"
schema: raw_marketing
description: Marketing and campaign data
tables:
@@ -73,7 +73,7 @@ sources:
description: User event tracking

- name: support
database: jaffle_shop
database: "{{ 'jaffle_shop' if target.type == 'postgres' else target.database }}"
schema: raw_support
description: Customer support data
tables: