3 changes: 1 addition & 2 deletions README.md
@@ -66,14 +66,13 @@ Important environment variables for our build/environment:
| `AIRFLOW_IMAGE_NAME` | Sets an alternate base image for Airflow, e.g. for `slim` images | `AIRFLOW_IMAGE_NAME="apache/airflow:slim-latest"` |
| `AIRFLOW__CORE__FERNET_KEY` | [Fernet](https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html) encryption key used to encrypt Airflow secrets | `AIRFLOW__CORE__FERNET_KEY="somebase64value="` |
| `AIRFLOW__API_AUTH__JWT_SECRET` | Secret key used to sign JWT tokens for Airflow's API authentication. The default value used in development and testing should be replaced in production. | `AIRFLOW__API_AUTH__JWT_SECRET="some32bytesecret"` |
+| `AIRFLOW_CONN_TIND_DEFAULT` | Airflow connection json string for TIND access.<br>Note: the Connection params listed in the example are all needed! | `AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'` |
Member

Depending on the result of the above, ensure this is kept up to date with what we end up with.

Member

@anarchivist, Apr 15, 2026

I'm wondering whether we want this in here at all, or whether we should instead have a section of the README on connections.

Member

Good point. We should also probably separate out Variables, if/when we get around to that refactor.

| `OIDC_CLIENT_SECRET` | Client secret for OIDC authentication. Used by the Airflow webserver to authenticate OIDC token requests. In development, also used by `keycloak-config-cli` to configure the client secret. This should match Keycloak configuration in development and testing, and CalNet in production. | `OIDC_CLIENT_SECRET="some32charactersecret"` |
| `OIDC_NAME` | Name appended to the OIDC login button | `OIDC_NAME="keycloak"` |
| `OIDC_CLIENT_ID` | Client ID specified in the OIDC provider. | `OIDC_CLIENT_ID="mokelumne"` |
| `OIDC_WELL_KNOWN` | URL for the OIDC provider's well-known configuration. Used by the Airflow webserver to fetch the OIDC provider's public key for validating OIDC tokens in development and testing. Dev should be configured to point at keycloak's well known and prod points to CAS OIDC well known | `OIDC_WELL_KNOWN="http://keycloak:8180/realms/berkeley-local/.well-known/openid-configuration"` |
| `OIDC_ADMIN_GROUP` | Name of the OIDC group whose members should be mapped to the "Admin" role in Airflow. Used by keycloak-config-cli to configure group membership for the 'testadmin' user and by the Airflow webserver to map OIDC groups to Airflow roles in development and testing. For simplicity this should match the what we use for prod | `OIDC_ADMIN_GROUP="cn=edu:berkeley:org:libr:mokelumne:admins,ou=campus groups,dc=berkeley,dc=edu"` |
| `OIDC_USER_GROUP` | Similar to admin group. This group is for users in both admin and user roles.| `OIDC_USER_GROUP="cn=edu:berkeley:org:libr:mokelumne:users,ou=campus groups,dc=berkeley,dc=edu"` |
-| `TIND_API_KEY` | API key for TIND access | `TIND_API_KEY="..."` |
-| `TIND_API_URL` | URL for TIND access | `TIND_API_URL="https://digicoll.lib.berkeley.edu/api/v1"` |
| `MOKELUMNE_TIND_DOWNLOAD_DIR` | Path for downloaded image cache | `MOKELUMNE_TIND_DOWNLOAD_DIR="/some/path/to/download/to"` |
|`LANGFUSE_HOST`|Host for Langfuse|`LANGFUSE_HOST="https://us.cloud.langfuse.com"`|
|`LANGFUSE_SECRET_KEY`|sets langfuse secret key|`LANGFUSE_SECRET_KEY="sk-lf-blah-blah-blah"`|
5 changes: 1 addition & 4 deletions example.env
@@ -4,6 +4,7 @@
AIRFLOW__API_AUTH__JWT_SECRET=
# @see https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html#generating-fernet-key
AIRFLOW__CORE__FERNET_KEY=
+AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'
Member

Airflow HTTP connections shouldn't have path or protocol information in host.

Suggested change
-AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'
+AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "digicoll.lib.berkeley.edu","schema": "https"}'

I wonder if we should include a base_url value somewhere in extra?

Member

Actually:

> Host (optional)
> Specify the entire url or the base of the url for the service.

…that is, per the docs, exactly how it should be specified.

Member

Okay. Reading from the same docs:

> Note that all components of the URI should be URL-encoded.

... which means we actually want something like this?

Suggested change
-AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'
+AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https%3A%2F%2Fdigicoll.lib.berkeley.edu%2Fapi%2Fv1","schema": "https"}'
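
For reference, the encoded host value in the suggestion above can be reproduced with Python's standard library. This is only a sketch of the encoding step itself; the variable names are illustrative, and whether the JSON form of the connection needs encoding at all is a question for the Airflow docs rather than something this snippet settles.

```python
from urllib.parse import quote, unquote

raw_host = "https://digicoll.lib.berkeley.edu/api/v1"

# safe="" means ":" and "/" are percent-encoded too; by default quote()
# leaves "/" untouched.
encoded_host = quote(raw_host, safe="")
print(encoded_host)

# Decoding restores the original value.
decoded_host = unquote(encoded_host)
print(decoded_host)
```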

Member

You're right, that looks more correct. Sorry for missing that bit.

Member

Given that we're extracting api_url from conn.host, and that includes the https:// prefix, why include the schema at all? (Note that ALL fields are optional, anyway. We have a lot of flexibility in how we define this.)

Member

@danschmidt5189, Apr 16, 2026

Aside: do we want to support using this with the HttpOperator? In that case we'd need to rethink this a bit, as the current method doesn't pass the API token properly (it needs to be set in an Authorization: Token {token} header).

If I'm understanding the docs/code right this is where hooks would be useful. There are a few moving parts, so bear with me for a second.

HttpHook has an auth_type property which can be used to augment requests with authorization info before they're sent off. Ignore the Any type hint; it's meant to be a subclass of requests.auth.AuthBase (see New Forms of Authentication). The HttpHook source shows that when login is set, auth_type is instantiated with login and password as arguments, so a custom TindTokenAuth(requests.auth.AuthBase) would allow us to tunnel the login from the Connection into the request's Authorization header.

Now, obviously you can't embed the auth_type in the Connection itself, but you can pass it to an HttpOperator (along with a bunch of other useful stuff like a paginator). Here's how that might look:

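from airflow.providers.http.operators.http import HttpOperator

from mokelumne.tind import TindTokenAuth  # proposed class, not in the repo yet

# Value of the environment variable, written as a Python string for illustration.
AIRFLOW_CONN_TIND_DEFAULT="""{
    "conn_type": "http",
    "login": "{api_key}",
    "host": "digicoll.lib.berkeley.edu%2Fapi%2Fv1",
    "schema": "https"
}"""

tind_get_record = HttpOperator(
    task_id="tind_get_record",
    http_conn_id="tind_default",  # HttpOperator's connection argument is http_conn_id
    auth_type=TindTokenAuth,
    method="GET",
    endpoint="record/{record_id}",
    data={"of": "xm"},
)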

This has pros and cons:

  • Pro: Simply by storing the API token in login, we make it possible to reuse our connection with the built-in HttpOperator.
  • Mixed: It's slightly odd to store the token in login (though this does make for a cleaner URL).
  • Mixed: We still have to jump through hoops to specify https:// and the API's base path.
  • Con: We'd have to pass auth_type=TindTokenAuth everywhere.

A custom TindHook could address those issues:

  • We get full control over where the token is passed in the connection string.
  • We also get full control over how https:// and the API base path are presented (e.g. the Connection.host could be just the TIND host name).
  • It binds the tind:// Connection type to TindTokenAuth, so you don't have to pass that to operators.

It also adds a few features unique to hooks:

  • We can customize how this Connection type is presented in the UI.
  • The Hook class can provide a factory method for producing TINDClients.

This is really more food for thought than anything else. I just wanted to spell out the trade-offs I mentioned on the call, which I didn't know enough about at the time to comment on fully.
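
To make the TindTokenAuth idea above concrete, here is a minimal sketch. It assumes the behaviour described in the comment (the hook calls `auth_type(login, password)` and TIND expects an `Authorization: Token {token}` header). The class does not exist in the repository, and the fallback `AuthBase` stand-in is only there so the sketch runs without `requests` installed.

```python
try:
    from requests.auth import AuthBase
except ImportError:
    class AuthBase:  # stand-in so the sketch runs without requests
        """Minimal placeholder for requests.auth.AuthBase."""


class TindTokenAuth(AuthBase):
    """Hypothetical auth class that sends the Connection login as a TIND token."""

    def __init__(self, login, password=None):
        # Assumption: the API token is stored in the Connection's login field;
        # password is accepted only to match the (login, password) call shape.
        self.token = login

    def __call__(self, request):
        # requests calls this with the prepared request before sending it.
        request.headers["Authorization"] = f"Token {self.token}"
        return request
```

With plain `requests`, an instance would be passed as `auth=TindTokenAuth("my-token")`. How HttpHook wires `auth_type` in should be checked against the installed provider version.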

AIRFLOW_UID=49003

# Set KeyCloak's logging level. "DEBUG" can be useful when
@@ -19,10 +20,6 @@ OIDC_NAME="keycloak"
OIDC_USER_GROUP="cn=edu:berkeley:org:libr:mokelumne:users,ou=campus groups,dc=berkeley,dc=edu"
OIDC_WELL_KNOWN="http://keycloak:8180/realms/berkeley-local/.well-known/openid-configuration"

-# TBD
-TIND_API_KEY=
-TIND_API_URL=
-
LANGFUSE_HOST=https://us.cloud.langfuse.com
LANGFUSE_SECRET_KEY=sk-lf-blah-blah-blah
LANGFUSE_PUBLIC_KEY=pk-lf-blah-blah-blah
16 changes: 13 additions & 3 deletions mokelumne/util/fetch_tind.py
@@ -2,16 +2,24 @@

import csv

+from airflow.sdk import Connection
+
from tind_client import TINDClient

from mokelumne.util.storage import run_dir, record_dir


 class FetchTind:
     """Helper methods for fetching items from TIND using TINDClient."""
-    def __init__(self, _run_id: str):
+
+    def __init__(self, _run_id: str, conn_id: str = "tind_default"):
         self.run_id = _run_id
-        self.client = TINDClient(default_storage_dir=str(run_dir(_run_id)))
+        conn = Connection.get(conn_id)
Comment on lines +15 to +17
Member

Would it make more sense to just have conn_id instead be a conn argument that expects an airflow.sdk.Connection?

Member

That doesn't seem to be how most Airflow operators are written.

Member

Are we writing/using operators here? As far as I can tell we're just implementing the connection and not, say, a TindOperator.

Member

I suppose. I could see it either way. I lean slightly towards conn_id just because it feels more familiar having just dealt with the email stuff. Not going to be upset if we go with conn instead.

+        self.client = TINDClient(
+            api_url=conn.host or "",
Member

This specific suggestion depends on how we want to handle the API base URL.

Suggested change
-api_url=conn.host or "",
+api_url=f"{conn.schema}://{conn.host}/api/v1",

Member

I like this better, and I do believe it should fail if the host isn't specified.

Member

Revised suggestion:

Suggested change
-api_url=conn.host or "",
+api_url=conn.host,
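
If the goal is for a missing host to fail loudly rather than fall back to an empty string, one option is an explicit check before constructing the client. The sketch below is only an illustration: `FakeConn` stands in for an Airflow Connection so the snippet runs on its own, and `tind_api_url` is a hypothetical helper, not something in the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeConn:
    """Stand-in for an Airflow Connection (illustration only)."""

    host: Optional[str] = None
    password: Optional[str] = None


def tind_api_url(conn, conn_id: str = "tind_default") -> str:
    """Return the API URL stored in conn.host, or raise if it is missing."""
    if not conn.host:
        raise ValueError(f"Connection {conn_id!r} has no host configured")
    return conn.host


print(tind_api_url(FakeConn(host="https://digicoll.lib.berkeley.edu/api/v1")))
```

This keeps the revised suggestion's behaviour (the full URL lives in `host`) while turning a misconfigured connection into an immediate, descriptive error.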

+            api_key=conn.password or "",
+            default_storage_dir=str(run_dir(_run_id)),
+        )

     def get_ids(self, tind_query: str) -> list[str]:
         """Return the TIND IDs that match a given query."""
@@ -36,5 +44,7 @@ def save_tind_ids_file(self, ids: list[str]) -> None:

     def write_query_results_to_xml(self, tind_query: str, file_name: str = "") -> int:
         """Download the XML results of a search query from TIND."""
-        records_written = self.client.write_search_results_to_file(tind_query, file_name)
+        records_written = self.client.write_search_results_to_file(
+            tind_query, file_name
+        )
         return int(records_written)