Skip to content

BerkeleyLibrary/mokelumne

Repository files navigation

mokelumne

UC Berkeley Library's Airflow installation, libraries, and Dags.

Consult Running Airflow in Docker for more information. The compose file in the initial commit reflects the composed file linked from those docs.

Dependencies

Dependencies are declared in pyproject.toml and pinned in requirements.txt with hashes for supply-chain security. The Dockerfile installs from requirements.txt using plain pip — no additional tooling is needed at build time.

When adding or changing dependencies, regenerate the pins with uv:

uv pip compile pyproject.toml --extra test -c constraints.txt \
  --no-emit-package python-tind-client --generate-hashes \
  -o requirements.txt

constraints.txt contains upper bounds derived from the base Airflow Docker image to prevent version conflicts with pre-installed packages. Regenerate it when bumping AIRFLOW_VERSION:

docker run --rm --entrypoint python apache/airflow:<version> -m pip freeze

python-tind-client is excluded from requirements.txt because pip cannot verify hashes on git-sourced packages. It is installed separately in the Dockerfile.

Development

Spin up the application using Docker Compose. There are a number of dependencies (Postgres, Keycloak, and Redis) as well as Airflow components (api/web, processor, scheduler, triggerer), so it's hardly lightweight. For now you'll need to sequence startup so that core services are setup before the ones that depend on them:

# Mint a short term API key for AWS Bedrock. You'll add it to
# AWS_BEARER_TOKEN_BEDROCK in the .env file created in the next step.
# (This will probably change in the future.)

# Generate `.env` with unique secrets specific to your local machine
docker compose run \
  --entrypoint /opt/airflow/scripts/setup_dev.py \
  --no-deps \
  --rm \
  airflow-init

# Start the stack
docker compose up --detach

# Open airflow in the browser (see below for test credentials)
open http://localhost:8080/

Airflow keeps ephemeral information in Redis but persists virtually all state (audit logs, user records, etc.) in Postgres. Thus if you want to start from a clean slate, you'll need to cleanup volumes when downing your stack:

docker compose down -v --remove-orphans

Configuration

Airflow's configuration is propagated by environment variables defined upstream by Airflow and that are defined and handled in the base image. You can use a .env file or pass them.

Important environment variables for our build/environment:

Variable Purpose Usage
AIRFLOW_UID Set the uid the container runs as. AIRFLOW_UID="40093"
AIRFLOW_VERSION Sets the Airflow release version. Used to identify the base Airflow image and to define it as a Python constraint AIRFLOW_VERSION="3.1.7"
AIRFLOW_IMAGE_NAME Sets an alternate base image for Airflow, e.g. for slim images AIRFLOW_IMAGE_NAME="apache/airflow:slim-latest"
AIRFLOW__CORE__FERNET_KEY Fernet encryption key used to encrypt Airflow secrets AIRFLOW__CORE__FERNET_KEY="somebase64value="
AIRFLOW__API_AUTH__JWT_SECRET Secret key used to sign JWT tokens for Airflow's API authentication. The default value used in development and testing should be replaced in production. AIRFLOW__API_AUTH__JWT_SECRET="some32bytesecret"
OIDC_CLIENT_SECRET Client secret for OIDC authentication. Used by the Airflow webserver to authenticate OIDC token requests. In development, also used by keycloak-config-cli to configure the client secret. This should match Keycloak configuration in development and testing, and CalNet in production. OIDC_CLIENT_SECRET="some32charactersecret"
OIDC_NAME Name appended to the OIDC login button OIDC_NAME="keycloak"
OIDC_CLIENT_ID Client ID specified in the OIDC provider. OIDC_CLIENT_ID="mokelumne"
OIDC_WELL_KNOWN URL for the OIDC provider's well-known configuration. Used by the Airflow webserver to fetch the OIDC provider's public key for validating OIDC tokens in development and testing. Dev should be configured to point at keycloak's well known and prod points to CAS OIDC well known OIDC_WELL_KNOWN="http://keycloak:8180/realms/berkeley-local/.well-known/openid-configuration"
OIDC_ADMIN_GROUP Name of the OIDC group whose members should be mapped to the "Admin" role in Airflow. Used by keycloak-config-cli to configure group membership for the 'testadmin' user and by the Airflow webserver to map OIDC groups to Airflow roles in development and testing. For simplicity this should match the what we use for prod OIDC_ADMIN_GROUP="cn=edu:berkeley:org:libr:mokelumne:admins,ou=campus groups,dc=berkeley,dc=edu"
OIDC_USER_GROUP Similar to admin group. This group is for users in both admin and user roles. OIDC_USER_GROUP="cn=edu:berkeley:org:libr:mokelumne:users,ou=campus groups,dc=berkeley,dc=edu"
TIND_API_KEY API key for TIND access TIND_API_KEY="..."
TIND_API_URL URL for TIND access TIND_API_URL="https://digicoll.lib.berkeley.edu/api/v1"
MOKELUMNE_TIND_DOWNLOAD_DIR Path for downloaded image cache MOKELUMNE_TIND_DOWNLOAD_DIR="/some/path/to/download/to"
LANGFUSE_HOST Host for Langfuse LANGFUSE_HOST="https://us.cloud.langfuse.com"
LANGFUSE_SECRET_KEY sets langfuse secret key LANGFUSE_SECRET_KEY="sk-lf-blah-blah-blah"
LANGFUSE_PUBLIC_KEY sets langfuse public key LANGFUSE_PUBLIC_KEY="pk-lf-blah-blah-blah"
AWS_ENDPOINT_URL AWS endpoint (don't forget the https://!) AWS_ENDPOINT_URL="https://bedrock-runtime.us-west-1.amazonaws.com"
AWS_DEFAULT_REGION The AWS region to use; you probably want us-west-1. AWS_DEFAULT_REGION=us-west-1
AWS_BEARER_TOKEN_BEDROCK The IAM credential to use to access AWS. Use a short-term API key.
The key will expire after AWS console logout or 12 hours (whichever comes first).
Make sure that your region for the key matches the region above.
AWS_BEARER_TOKEN_BEDROCK="bedrock-api-key-blah-blah-blah"
AWS_MODEL_ID The model to use. Make sure it's supported on the ARN. AWS_MODEL_ID="us.anthropic.claude-haiku-4-5-20251001-v1:0"
AWS_MODEL_LABEL A human friendly label for the model. Will eventually be displayed in the Tind record. AWS_MODEL_LABEL="Claude Haiku 4.5"
AWS_MODEL_PROVIDER The provider for the model. AWS_MODEL_PROVIDER=anthropic
MOKELUMNE_PUBLIC_STORAGE Path for public assets MOKELUMNE_PUBLIC_STORAGE=/opt/airflow/public
MOKELUMNE_PUBLIC_URL URL to access public assets - must end in / MOKELUMNE_PUBLIC_URL=https://mokelumne-assets.ucblib.org/

Note: The AIRFLOW_UID example in example.env maps to the reserved uid for the airflow user in lap/workflow.

Dev Keycloak credentials

The keycloak-config-cli container will create a user/pass of admin/admin for the master realm. It will also create the following users with roles mapped from calnet groups in the berkeley-local realm:

Username Password Role(s)
testadmin testadmin Admin/User
testuser testuser User
testpublic testpublic Public

About

UC Berkeley Library's Airflow installation, libraries, and Dags

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors