This pipeline converts the Maternity Services Monthly Statistics into the format required by the NHS England Central Metrics Library (CML).
The source data is the Official Statistics about NHS-funded maternity services in England, drawn from the Maternity Services Data Set (MSDS). It covers activity at the booking appointment, during pregnancy, during and after birth, and information on pregnancy outcomes.
The pipeline takes the MSDS data in a tidy (long) format, applies a series of configurable transformations, and produces two output tables matching the CML schema: a metric table and a dimensions table.
- Python >= 3.10
- Java 8 or 11 (required by PySpark — ensure `JAVA_HOME` is set)
- Poetry for dependency management
If you open this repository in GitHub Codespaces, the relevant packages should install automatically - it may take a few minutes!
Install the dependencies:

```bash
poetry install
```

Create the virtual environment and run the script:

```bash
eval $(poetry env activate)
python create_cml_tables.py
```

Run the tests with:

```bash
pytest
```

All pipeline parameters are defined in `config.yaml`. You should update this file for each run.
Key fields:
| Field | Description |
|---|---|
| `publication_date` | Publication date for the output (used as a literal column value) |
| `last_ingest_timestamp` | Timestamp of the last data ingest |
| `path_to_source_data` | Path to the input CSV file |
| `output_dir` | Directory where output CSVs are written |
| `log_dir` | Directory where log files are written |
| `dimensions` | List of all dimension names present in the source data |
| `dimension_creation_exclusions` | Dimensions to exclude from the output dimensions table (e.g. `mbrrace_grouping`, which is handled as a separate column) |
| `processing_funcs` | Ordered list of transformation functions to apply, each with a `name` and `params` |
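Putting the fields together, a `config.yaml` might look like the sketch below. The values, file paths, and the `params` key names are illustrative assumptions, not a copy of a real config:

```yaml
publication_date: "01/12/2026"
last_ingest_timestamp: "15/12/2026"
path_to_source_data: "data_in/msds_monthly.csv"   # hypothetical path
output_dir: "data_out/"
log_dir: "logs/"
dimensions:
  - EthnicCategoryMotherGroup
  - AgeAtBookingMotherGroup
dimension_creation_exclusions:
  - mbrrace_grouping
processing_funcs:
  - name: replace_col_values
    params:                                       # params key names are assumptions
      replace_dict: {"ALL": "england"}
      col_name: "Org_Code"
```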
`config.yaml` uses YAML anchors (`&`) and aliases (`*`) to avoid repeating values:

```yaml
publication_date: &publication_date "01/12/2026"
last_ingest_timestamp: &last_ingest_timestamp "15/12/2026"
```

These can then be referenced elsewhere in the file with `*publication_date` and `*last_ingest_timestamp`.
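An alias can then feed the anchored value into a later step's `params`. This is a hypothetical entry and the `params` key names are assumptions:

```yaml
processing_funcs:
  - name: add_lit_col
    params:
      col_name: "publication_date"
      value: *publication_date   # resolves to the anchored "01/12/2026"
```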
The pipeline expects a tidy (long) format CSV, where each row represents a single metric value for a given dimension/attribute combination:
| Org_Code | Org_Level | Dimension | Attribute | Final_value | ReportingPeriodStartDate | ReportingPeriodEndDate |
|---|---|---|---|---|---|---|
| RXX | Trust | EthnicCategoryMotherGroup | EthnicWhite | 82 | 01/04/2026 | 30/06/2026 |
| RXX | Trust | AgeAtBookingMotherGroup | Age25to29 | 54 | 01/04/2026 | 30/06/2026 |
| ALL | England | EthnicCategoryMotherGroup | EthnicWhite | 79 | 01/04/2026 | 30/06/2026 |
The `Dimension` column identifies which dimension the row belongs to, and the attribute value (e.g. `EthnicWhite`) sits in the `Attribute` column.
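To make the long format concrete, here is a plain-Python sketch (illustrative values only, not pipeline code) showing that grouping rows by `Dimension` recovers the attributes observed for each dimension:

```python
# Toy rows in the tidy (long) shape shown above.
rows = [
    {"Org_Code": "RXX", "Dimension": "EthnicCategoryMotherGroup", "Attribute": "EthnicWhite", "Final_value": 82},
    {"Org_Code": "RXX", "Dimension": "AgeAtBookingMotherGroup", "Attribute": "Age25to29", "Final_value": 54},
    {"Org_Code": "ALL", "Dimension": "EthnicCategoryMotherGroup", "Attribute": "EthnicWhite", "Final_value": 79},
]

# Each row carries exactly one dimension/attribute pair, so a group-by
# on Dimension collects the attribute values for that dimension.
attributes_by_dimension = {}
for row in rows:
    attributes_by_dimension.setdefault(row["Dimension"], set()).add(row["Attribute"])

print(attributes_by_dimension["AgeAtBookingMotherGroup"])  # {'Age25to29'}
```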
Two CSVs are written to `data_out/`:
One row per data point, containing the numeric value and metadata:
| datapoint_id | metric_id | metric_dimension_id | location_id | location_type | metric_value | reporting_period_start_datetime | last_record_timestamp | publication_date | last_ingest_timestamp | additional_metric_values |
|---|---|---|---|---|---|---|---|---|---|---|
One row per data point, one column per dimension. Each dimension column defaults to `all_<dimension>` unless the data point belongs to that dimension:
| datapoint_id | metric_dimension_id | dimension_cohort_id | EthnicCategoryMotherGroup | AgeAtBookingMotherGroup | ... |
|---|---|---|---|---|---|
The `dimension_cohort_id` is a `|`-separated concatenation of all dimension column values and links the metric and dimensions tables together.
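As a plain-Python illustration (not pipeline code), the cohort id for a data point in the `EthnicCategoryMotherGroup` dimension might be built like this, with every other dimension column holding its `all_<dimension>` default:

```python
# Column order matters: the cohort id concatenates the dimension
# column values with "|" in a fixed order.
dimension_cols = ["EthnicCategoryMotherGroup", "AgeAtBookingMotherGroup"]

row = {
    "EthnicCategoryMotherGroup": "EthnicWhite",               # this data point's dimension
    "AgeAtBookingMotherGroup": "all_AgeAtBookingMotherGroup", # default for the rest
}

dimension_cohort_id = "|".join(row[col] for col in dimension_cols)
print(dimension_cohort_id)  # EthnicWhite|all_AgeAtBookingMotherGroup
```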
The transformation logic is defined as an ordered sequence in `config.yaml` under `processing_funcs`. Each entry maps to a registered function in the `cml_conversion_helpers` library via `PROCESSING_FUNC_REGISTRY`.
The steps applied in this pipeline are:
1. `move_attributes_to_new_dimension` — moves MBRRACE grouping values (e.g. `"Group 1. Level 3 NICU & NS"`) out of `Org_Code` into a new `mbrrace_grouping` column
2. `replace_col_values` — replaces `"ALL"` in `Org_Code` with `"england"`
3. `rename_cols` — renames source columns to CML schema names (`Org_Code` → `location_id`, etc.)
4. `cast_date_col_to_timestamp` — casts date string columns to timestamps
5. `create_uuid_col` — generates a unique `datapoint_id` per row
6. `concat_cols` — builds `metric_id` by concatenating `Dimension` and `Count_Of`
7. `add_lit_col` + `cast_date_col_to_timestamp` — adds `publication_date` and `last_ingest_timestamp` as typed columns
8. `add_lit_col` — adds `additional_metric_values` as a null column
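The registry-driven dispatch can be pictured as a loop like the following sketch. This is plain Python with toy stand-in functions, not the real `cml_conversion_helpers` helpers or their signatures:

```python
# Toy registry: the entry here is a stand-in, NOT the real helper function.
PROCESSING_FUNC_REGISTRY = {
    "replace_col_values": lambda data, replace_dict: [replace_dict.get(v, v) for v in data],
}

# Ordered steps, shaped like entries under processing_funcs in config.yaml.
processing_funcs = [
    {"name": "replace_col_values", "params": {"replace_dict": {"ALL": "england"}}},
]

# Each configured step is looked up by name and applied in order.
data = ["RXX", "ALL"]
for step in processing_funcs:
    func = PROCESSING_FUNC_REGISTRY[step["name"]]
    data = func(data, **step["params"])

print(data)  # ['RXX', 'england']
```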
You don't have to use this config-driven approach if you don't want to; you can simply add your own PySpark code to `create_cml_tables.py`.
After these steps, `create_dimension_table` builds the per-dimension columns and `dimension_cohort_id`, and a final `concat_cols` call builds `metric_dimension_id`.

See the `cml_conversion_helpers` API reference below for full details on each function.
```
├── create_cml_tables.py       <- Entry point — runs the full pipeline
├── config.yaml                <- Pipeline parameters and processing steps
│
├── src/
│   └── msds_monthly_to_cml/
│       ├── data_ingestion/
│       │   ├── get_data.py        <- Utilities for fetching source data
│       │   └── reading_data.py    <- Loads CSV into a Spark DataFrame
│       ├── data_exports/
│       │   └── write_csv.py       <- Saves Spark DataFrames as named CSVs
│       └── utils/
│           ├── file_paths.py      <- Loads config.yaml
│           ├── logging_config.py  <- Configures file and console logging
│           └── spark.py           <- Creates and configures a SparkSession
│
├── tests/
│   ├── conftest.py            <- Shared pytest fixtures (SparkSession)
│   └── unittests/
│       └── test_spark.py
│
├── data_in/     <- Place source CSV here (not committed)
├── data_out/    <- Output CSVs written here (not committed)
└── logs/        <- Log files written here (not committed)
```
The transformation functions used in this pipeline are provided by the `cml_conversion_helpers` package. The key functions are documented below.

All functions are available via `PROCESSING_FUNC_REGISTRY` for config-driven use.
Moves specified values from one column into a new dimension column. Rows whose `source_col_name` value is in `attributes_to_move` have that value placed into `new_col_name`, and `source_col_name` is replaced with `source_col_fill_value`. All other rows get `new_col_fill_value` in `new_col_name`.
```python
df = processing.move_attributes_to_new_dimension(
    df,
    source_col_name="Org_Code",
    source_col_fill_value="england",
    new_col_name="mbrrace_grouping",
    new_col_fill_value="no_mbrrace_grouping_filter",
    attributes_to_move=["Group 1. Level 3 NICU & NS", "Group 2. Level 3 NICU"]
)
```

Renames columns according to a mapping. Unmapped columns are left unchanged.
```python
df = processing.rename_cols(df, {"Org_Code": "location_id", "Final_value": "metric_value"})
```

Replaces values in a column using a mapping dictionary.
```python
df = processing.replace_col_values(df, {"ALL": "england"}, "Org_Code")
```

Concatenates multiple columns into a new column.
```python
df = processing.concat_cols(df, "metric_id", ["Dimension", "Count_Of"], sep="_")
```

Adds a column containing a truncated UUID string (hyphens removed).
```python
df = processing.create_uuid_col(df, "datapoint_id", length=32)
```

Casts a string date column to a timestamp (default format: `dd/MM/yyyy`).
```python
df = processing.cast_date_col_to_timestamp(df, "reporting_period_start_datetime")
```

Adds a new column populated with a constant value. Use `None` (Python) or `null` (YAML) for null.
```python
df = processing.add_lit_col(df, "publication_date", "01/12/2026")
df = processing.add_lit_col(df, "additional_metric_values", None)
```

Drops specified columns from a DataFrame.
```python
df = processing.drop_cols(df, ["unwanted_col_a", "unwanted_col_b"])
```

Main entry point for building the dimensions table. Creates one column per dimension (populated with the attribute value for matching rows, `all_<dimension>` otherwise) and a `dimension_cohort_id` column.
```python
df = dimension_cohorts.create_dimension_table(
    df,
    dimension_cols=config["dimensions"],
    dimensions_to_exclude=config["dimension_creation_exclusions"]
)
```

Extracts the list of distinct values from a dimension column — useful when you want to derive the dimension list from the data rather than hard-coding it in config.
```python
dimensions = dimension_cohorts.get_dimension_list_from_col(df, "Dimension")
```

You can add your own functions to `PROCESSING_FUNC_REGISTRY` using the `@register` decorator:
```python
from cml_conversion_helpers.processing.processing import register

@register
def my_custom_transform(df, some_param):
    # your logic here
    return df
```
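Once registered, the custom function can be referenced from `config.yaml` like any built-in step. `my_custom_transform` and `some_param` here are the hypothetical names from the snippet above, and the `params` shape is an assumption:

```yaml
processing_funcs:
  - name: my_custom_transform
    params:
      some_param: "example_value"
```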