
Connections Guide

Back to Main README

Connections define integrations with AWS services and data sources in SageMaker Unified Studio projects. This guide covers how to create, configure, and use connections in your data applications.


Overview

Connections enable your workflows to interact with:

  • Data Storage - S3, Redshift, RDS
  • Compute Engines - Spark (Glue/EMR), Athena
  • ML Services - SageMaker, MLflow
  • Orchestration - MWAA, Amazon MWAA Serverless

Two ways to create connections:

  1. Manifest-based (recommended) - Define in manifest.yaml for automated creation
  2. Console-based - Create manually in SMUS portal

Default Connections

Every SMUS project includes these connections automatically:

Connection Name            Type       Purpose
default.s3_shared          S3         Project S3 bucket
project.workflow_mwaa      MWAA       MWAA environment (if enabled)
project.athena             Athena     Athena workgroup
project.default_lakehouse  Lakehouse  Lakehouse connection

Creating Connections via Manifest

Define connections in your bundle manifest under bootstrap.connections:

stages:
  test:
    domain:
      name: my-domain
      region: us-east-1
    project:
      name: test-project
    bootstrap:
      environments:
        - EnvironmentConfigurationName: 'OnDemand Workflows'
      
      connections:
        - name: s3-raw-data
          type: S3
          properties:
            s3Uri: "s3://raw-data-bucket/incoming/"
        
        - name: spark-etl
          type: SPARK_GLUE
          properties:
            glueVersion: "4.0"
            workerType: "G.2X"
            numberOfWorkers: 10

Benefits:

  • Automated creation during deployment
  • Version controlled in Git
  • Consistent across environments
  • Repeatable infrastructure

Connection Types

S3 - Object Storage

Access S3 buckets for data storage and retrieval.

- name: s3-data-lake
  type: S3
  properties:
    s3Uri: "s3://my-data-bucket/data/"

Properties:

  • s3Uri (required): S3 bucket URI with optional prefix

Use cases:

  • Raw data ingestion
  • Processed data storage
  • Model artifacts
  • Backup and archival

Example workflow usage:

tasks:
  upload_data:
    operator: "airflow.providers.amazon.aws.transfers.local_to_s3.LocalFilesystemToS3Operator"
    filename: "/tmp/data.csv"
    dest_bucket: "${proj.connection.s3_data_lake.bucket}"
    dest_key: "raw/data.csv"

IAM - Identity and Access Management

Configure IAM settings for Glue lineage synchronization.

- name: iam-lineage-sync
  type: IAM
  properties:
    glueLineageSyncEnabled: true

Properties:

  • glueLineageSyncEnabled (required): Enable Glue lineage sync (true/false)

Use cases:

  • Data lineage tracking
  • Governance and compliance
  • Impact analysis

SPARK_GLUE - Spark on AWS Glue

Run Spark jobs on AWS Glue serverless compute.

- name: spark-processing
  type: SPARK_GLUE
  properties:
    glueVersion: "4.0"
    workerType: "G.1X"
    numberOfWorkers: 5
    maxRetries: 1

Properties:

  • glueVersion (required): Glue version ("4.0", "5.0")
  • workerType (required): Worker type
    • "G.1X" - 4 vCPU, 16 GB memory
    • "G.2X" - 8 vCPU, 32 GB memory
    • "G.4X" - 16 vCPU, 64 GB memory
    • "G.8X" - 32 vCPU, 128 GB memory
  • numberOfWorkers (required): Number of workers (2-100)
  • maxRetries (optional): Maximum retries (default: 0)

Use cases:

  • ETL transformations
  • Data quality checks
  • Feature engineering
  • Large-scale data processing

Example workflow usage:

tasks:
  transform_data:
    operator: "airflow.providers.amazon.aws.operators.glue.GlueJobOperator"
    job_name: "customer-data-transform"
    script_location: "s3://${proj.s3.root}/scripts/transform.py"
    glue_version: "${proj.connection.spark_processing.glueVersion}"
    num_of_dpus: "${proj.connection.spark_processing.numberOfWorkers}"
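
For reference, a minimal sketch of what the transform.py script above might contain, using the standard Glue job boilerplate (the S3 paths and transformation are placeholders, not values supplied by the connection):

# transform.py - minimal Glue job sketch; adapt paths and logic to your data
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job initialization
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Example transformation: read raw CSV, write back as Parquet
df = glue_context.spark_session.read.csv("s3://raw-data-bucket/incoming/", header=True)
df.write.mode("overwrite").parquet("s3://raw-data-bucket/processed/")

job.commit()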

ATHENA - SQL Query Engine

Execute SQL queries on data lakes using Amazon Athena.

- name: athena-analytics
  type: ATHENA
  properties:
    workgroupName: "primary"

Properties:

  • workgroupName (required): Athena workgroup name

Use cases:

  • Ad-hoc SQL queries
  • Data validation
  • Reporting and analytics
  • Data exploration

Example workflow usage ({{ ds }} is Airflow's built-in macro for the run's logical date in YYYY-MM-DD form, so the query validates the current run's partition):

tasks:
  validate_data:
    operator: "airflow.providers.amazon.aws.operators.athena.AthenaOperator"
    query: |
      SELECT COUNT(*) as record_count 
      FROM ${proj.connection.athena.database}.processed_data
      WHERE date = '{{ ds }}'
    output_location: "s3://${proj.s3.root}/query-results/"
    workgroup: "${proj.connection.athena_analytics.workgroupName}"

REDSHIFT - Data Warehouse

Connect to Amazon Redshift for data warehousing operations.

- name: redshift-warehouse
  type: REDSHIFT
  properties:
    storage:
      clusterName: "analytics-cluster"
      # OR for serverless:
      # workgroupName: "analytics-workgroup"
    databaseName: "analytics"
    host: "analytics-cluster.abc123.us-east-1.redshift.amazonaws.com"
    port: 5439

Properties:

  • storage.clusterName (required for provisioned): Redshift cluster name
  • storage.workgroupName (required for serverless): Redshift serverless workgroup
  • databaseName (required): Database name
  • host (required): Redshift endpoint hostname
  • port (required): Port number (typically 5439)

Use cases:

  • Data warehousing
  • Complex analytics
  • Business intelligence
  • Historical data analysis

Example workflow usage:

tasks:
  load_to_redshift:
    operator: "airflow.providers.amazon.aws.transfers.s3_to_redshift.S3ToRedshiftOperator"
    redshift_conn_id: "${proj.connection.redshift_warehouse.id}"
    table: "customer_data"
    s3_bucket: "${proj.s3.root}"
    s3_key: "processed/customers.csv"
    copy_options: ["CSV", "IGNOREHEADER 1"]

SPARK_EMR - Spark on EMR

Run Spark jobs on Amazon EMR clusters or EMR Serverless.

- name: spark-emr-processing
  type: SPARK_EMR
  properties:
    computeArn: "arn:aws:emr-serverless:us-east-1:123456789012:/applications/00abc123def456"
    runtimeRole: "arn:aws:iam::123456789012:role/EMRServerlessExecutionRole"

Properties:

  • computeArn (required): EMR compute ARN
    • EMR Serverless: arn:aws:emr-serverless:REGION:ACCOUNT:/applications/APP_ID
    • EMR Cluster: arn:aws:elasticmapreduce:REGION:ACCOUNT:cluster/CLUSTER_ID
  • runtimeRole (required): IAM role ARN for execution

Use cases:

  • Large-scale Spark processing
  • Machine learning with Spark MLlib
  • Graph processing
  • Stream processing

Example workflow usage:

tasks:
  run_spark_job:
    operator: "airflow.providers.amazon.aws.operators.emr.EmrServerlessStartJobOperator"
    application_id: "${proj.connection.spark_emr_processing.applicationId}"
    execution_role_arn: "${proj.connection.spark_emr_processing.runtimeRole}"
    job_driver:
      sparkSubmit:
        entryPoint: "s3://${proj.s3.root}/scripts/process.py"
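
A minimal sketch of the process.py entry point referenced above (plain PySpark; the paths and aggregation are placeholders):

# process.py - minimal PySpark entry point sketch; hypothetical paths
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("process").getOrCreate()

# Example: aggregate raw events and write the result back to S3
df = spark.read.parquet("s3://raw-data-bucket/events/")
counts = df.groupBy("customer_id").count()
counts.write.mode("overwrite").parquet("s3://raw-data-bucket/aggregates/")

spark.stop()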

MLFLOW - ML Experiment Tracking

Track machine learning experiments using MLflow.

- name: mlflow-experiments
  type: MLFLOW
  properties:
    trackingServerName: "ml-tracking-server"
    trackingServerArn: "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/ml-tracking-server"

Properties:

  • trackingServerName (required): MLflow tracking server name
  • trackingServerArn (required): MLflow tracking server ARN

Use cases:

  • Experiment tracking
  • Model versioning
  • Hyperparameter tuning
  • Model comparison

Example workflow usage:

tasks:
  train_model:
    operator: "airflow.operators.python.PythonOperator"
    python_callable: "train_model"
    op_kwargs:
      mlflow_tracking_uri: "${proj.connection.mlflow_experiments.trackingUri}"
      experiment_name: "customer-churn"
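
The train_model callable itself is not defined in this guide; a minimal sketch of what it might look like, assuming the mlflow package is installed on the Airflow workers (model training elided):

# Hypothetical train_model callable for the PythonOperator above
import mlflow

def train_model(mlflow_tracking_uri, experiment_name, **kwargs):
    # Point the client at the project's tracking server and experiment
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        # ... fit your model here ...
        mlflow.log_param("model_type", "logistic_regression")  # placeholder values
        mlflow.log_metric("auc", 0.91)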

WORKFLOWS_MWAA - Apache Airflow

Connect to Amazon Managed Workflows for Apache Airflow (MWAA).

- name: mwaa-workflows
  type: WORKFLOWS_MWAA
  properties:
    mwaaEnvironmentName: "production-airflow-env"

Properties:

  • mwaaEnvironmentName (required): MWAA environment name

Use cases:

  • Complex workflow orchestration
  • Multi-step data pipelines
  • Scheduled batch processing
  • Cross-service coordination

Example workflow usage:

content:
  workflows:
    - workflowName: data_pipeline
      connectionName: mwaa-workflows
      engine: MWAA
      triggerPostDeployment: true

WORKFLOWS_SERVERLESS - Amazon MWAA Serverless

Use serverless Airflow workflows (no MWAA environment required).

- name: serverless-workflows
  type: WORKFLOWS_SERVERLESS
  properties: {}

Properties:

  • No properties required (empty structure)

Use cases:

  • Simple workflows
  • Cost-optimized orchestration
  • Quick prototyping
  • Lightweight pipelines

Example workflow usage:

content:
  workflows:
    - workflowName: simple_etl
      engine: airflow-serverless
      triggerPostDeployment: true

Complete Example

Full manifest with multiple connection types:

applicationName: CustomerSegmentationModel

stages:
  test:
    domain:
      name: my-domain
      region: us-east-1
    project:
      name: test-data-platform
    
    bootstrap:
      environments:
        - EnvironmentConfigurationName: 'OnDemand Workflows'
      
      connections:
        # Storage
        - name: s3-raw-data
          type: S3
          properties:
            s3Uri: "s3://raw-data-bucket/incoming/"
        
        # Compute
        - name: spark-etl
          type: SPARK_GLUE
          properties:
            glueVersion: "4.0"
            workerType: "G.2X"
            numberOfWorkers: 10
        
        - name: spark-emr
          type: SPARK_EMR
          properties:
            computeArn: "arn:aws:emr-serverless:us-east-1:123456789012:/applications/00abc123def456"
            runtimeRole: "arn:aws:iam::123456789012:role/EMRServerlessExecutionRole"
        
        # Analytics
        - name: athena-queries
          type: ATHENA
          properties:
            workgroupName: "analytics-workgroup"
        
        - name: redshift-dw
          type: REDSHIFT
          properties:
            storage:
              clusterName: "analytics-cluster"
            databaseName: "analytics"
            host: "analytics-cluster.abc123.us-east-1.redshift.amazonaws.com"
            port: 5439
        
        # ML
        - name: ml-tracking
          type: MLFLOW
          properties:
            trackingServerName: "ml-experiments-server"
            trackingServerArn: "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/ml-experiments-server"
        
        # Orchestration
        - name: airflow-orchestration
          type: WORKFLOWS_MWAA
          properties:
            mwaaEnvironmentName: "production-airflow-env"
        
        - name: serverless-workflows
          type: WORKFLOWS_SERVERLESS
          properties: {}
        
        # Governance
        - name: iam-lineage
          type: IAM
          properties:
            glueLineageSyncEnabled: true

content:
  storage:
    - name: workflows
      connectionName: default.s3_shared
      include: ['workflows/']

Using Connections in Workflows

Reference connections in your workflow files using variable substitution:

Basic Connection Reference

tasks:
  my_task:
    operator: "airflow.providers.amazon.aws.operators.s3.S3ListOperator"
    bucket: "${proj.connection.s3_raw_data.bucket}"
    prefix: "data/"

Connection Properties

Access specific connection properties:

# S3 connection
bucket: "${proj.connection.s3_data_lake.bucket}"
prefix: "${proj.connection.s3_data_lake.prefix}"

# Spark Glue connection
glue_version: "${proj.connection.spark_etl.glueVersion}"
worker_type: "${proj.connection.spark_etl.workerType}"
num_workers: "${proj.connection.spark_etl.numberOfWorkers}"

# Athena connection
workgroup: "${proj.connection.athena_analytics.workgroupName}"

# Redshift connection
host: "${proj.connection.redshift_dw.host}"
port: "${proj.connection.redshift_dw.port}"
database: "${proj.connection.redshift_dw.databaseName}"

Connection ID

Use connection ID for Airflow operators:

tasks:
  transfer_data:
    operator: "airflow.providers.amazon.aws.transfers.s3_to_redshift.S3ToRedshiftOperator"
    redshift_conn_id: "${proj.connection.redshift_dw.id}"
    s3_bucket: "${proj.s3.root}"
    s3_key: "data/customers.csv"
    table: "customers"

See more: Substitutions and Variables Guide


Creating Connections via Console

For existing projects or manual setup:

  1. Navigate to SageMaker Unified Studio console
  2. Select your domain and project
  3. Go to Connections tab
  4. Click Create connection
  5. Select connection type
  6. Configure properties
  7. Click Create

When to use console:

  • Existing projects without manifest
  • One-off connections
  • Testing connection configurations
  • Quick prototyping

When to use manifest:

  • New project setup
  • Automated deployments
  • CI/CD pipelines
  • Consistent environments

Best Practices

Connection Naming

Use descriptive, consistent names:

  • Good: s3-raw-data, spark-etl, athena-analytics
  • Avoid: conn1, my-connection, test

Connection Organization

Group connections by purpose:

connections:
  # Storage
  - name: s3-raw-data
  - name: s3-processed-data
  
  # Compute
  - name: spark-etl
  - name: spark-ml
  
  # Analytics
  - name: athena-queries
  - name: redshift-warehouse

Environment-Specific Connections

Use different connections per environment:

stages:
  test:
    bootstrap:
      connections:
        - name: redshift-warehouse
          type: REDSHIFT
          properties:
            storage:
              clusterName: "test-cluster"

  prod:
    bootstrap:
      connections:
        - name: redshift-warehouse
          type: REDSHIFT
          properties:
            storage:
              clusterName: "prod-cluster"

Connection Documentation

Document connection requirements in your README:

## Required Connections

- `s3-raw-data` - S3 bucket for raw data ingestion
- `spark-etl` - Glue Spark for ETL processing
- `athena-analytics` - Athena for SQL queries

Testing Connections

Validate connections after creation:

# Describe project to see connections
aws-smus-cicd-cli describe --manifest manifest.yaml --targets test --connect

# Test workflow that uses connections
aws-smus-cicd-cli run --targets test --workflow test_connections

Next Steps


Questions? See Manifest Reference - Connections for complete schema details.