Skip to content

AP-659 Migrate TIND connection details to an Airflow Connection#37

Open
jason-raitz wants to merge 1 commit intomainfrom
AP-659_migrate_tind_to_airflow_connection
Open

AP-659 Migrate TIND connection details to an Airflow Connection#37
jason-raitz wants to merge 1 commit intomainfrom
AP-659_migrate_tind_to_airflow_connection

Conversation

@jason-raitz
Copy link
Copy Markdown
Contributor

This refactors the fetch_tind util to use an airflow connection to instantiate the tind client. For now that connection is a json string in the env, but that can easily be moved to a secret.

Bonus for this intermediate step is that by setting the env key to AIRFLOW_CONN_TIND_DEFAULT, airflow will automatically load the json string there if you ask for a Connection.get('tind_default')!

refactors fetchtind util to use airflow connection to instantiate
the tind client. For now that connection is a json string in the env,
but that can easily be moved to a secret.
Copy link
Copy Markdown
Member

@anarchivist anarchivist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+wc. generally looks good; these are mostly nitpicky things about how Airflow connections behave. we should decide on a pattern for handling base URL info for APIs.

Comment on lines +15 to +17
def __init__(self, _run_id: str, conn_id: str = "tind_default"):
self.run_id = _run_id
self.client = TINDClient(default_storage_dir=str(run_dir(_run_id)))
conn = Connection.get(conn_id)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to just have conn_id instead be a conn argument that expects an airflow.sdk.Connection?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't seem to be how most Airflow operators are written.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we writing/using operators here? As far as I can tell we're just implementing the connection and not, say, a TindOperator.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose. I could see it either way. I lean slightly towards conn_id just because it feels more familiar having just dealt with the email stuff. Not going to be upset if we go with conn instead.

Comment thread example.env
AIRFLOW__API_AUTH__JWT_SECRET=
# @see https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html#generating-fernet-key
AIRFLOW__CORE__FERNET_KEY=
AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Airflow HTTP connections shouldn't have path or protocol information in host.

Suggested change
AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'
AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "digicoll.lib.berkeley.edu","schema": "https"}'

I wonder if we should include a base_url value somewhere in extra?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually

Host (optional)
Specify the entire url or the base of the url for the service.

…that is, per the docs, exactly how it should be specified.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Reading from the same docs:

Note that all components of the URI should be URL-encoded.

... which means we actually want something like this?

Suggested change
AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'
AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https%3A%2F%2Fdigicoll.lib.berkeley.edu%2Fapi%2Fv1","schema": "https"}'

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, that looks more correct. Sorry for missing that bit.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we're extracting api_url from conn.host, and that includes the https:// prefix, why include the schema at all? (Note that ALL fields are optional, anyway. We have a lot of flexibility in how we define this.)

Copy link
Copy Markdown
Member

@danschmidt5189 danschmidt5189 Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside — Do we want to support using this with the HttpOperator? In that case we'd need to rethink this a bit, as the current method doesn't pass the API token properly (it needs to be set in an Authorization: Token {token} header).

If I'm understanding the docs/code right this is where hooks would be useful. There are a few moving parts, so bear with me for a second.

HttpHook has an auth_type property which can be used to augment requests with authorization info before they're sent off. Ignore the Any typehint — it's meant to be a subclass of requests.auth.AuthBase (see New Forms of Authentication). The HttpHook source shows that when login is set, auth_type is instantiated with login and password as arguments, so a custom TindTokenAuth(requests.auth.AuthBase) we allow us to tunnel the login from the Connection into the request's Auth header.

Now, obviously you can't embed the auth_type in the Connection itself, but you can pass it to an HttpOperator (along with a bunch of other useful stuff like a paginator). Here's how that might look:

from mokelumne.tind import TindTokenAuth

AIRFLOW_CONN_TIND_DEFAULT="""{
    "conn_type": "http",
    "login": "{api_key}",
    "host": "digicoll.lib.berkeley.edu%2Fapi%2Fv1",
    "schema": "https"
}"""

tind_get_record = HttpOperator(
    conn_id="tind_default",
    auth_type=TindTokenAuth,
    method="GET",
    endpoint="record/{record_id}",
    data={"of": "xm"},
)

This has pros and cons:

  • Pro: Simply by storing the API token in login, we make it possible to reuse our connection with the built-in HttpOperator.
  • Mixed: It's slightly odd to store the token in login (though this does make for a cleaner URL).
  • Mixed: We still have to jump through hoops to specify https:// and the API's base path.
  • Con: We'd have to pass auth_type=TindTokenAuth everywhere.

A custom TindHook could address those issues:

  • We get full control over where the token is passed in the connection string.
  • We also get full control over how https:// and the API base path are presented (e.g. the Connection.host could be just the TIND host name).
  • It binds the tind:// Connection type to TindTokenAuth, so you don't have to pass that to operators.

It also adds a few features unique to hooks:

  • We can customize how this Connection type is presented in the UI.
  • The Hook class can provide a factory method for producing TINDClients.

Really more food for thought than anything else. Just wanted to put into detail / words what I mentioned on the call about trade-offs, but didn't know enough at the time to fully comment on.

self.client = TINDClient(default_storage_dir=str(run_dir(_run_id)))
conn = Connection.get(conn_id)
self.client = TINDClient(
api_url=conn.host or "",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this specific suggestion depends on how we want to handle the API base URL.

Suggested change
api_url=conn.host or "",
api_url=f"{conn.schema}://{conn.host}/api/v1",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this better, and I do believe it should fail if the host isn't specified.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised suggestion:

Suggested change
api_url=conn.host or "",
api_url=conn.host,

Copy link
Copy Markdown
Member

@awilfox awilfox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clear and straightforward. Seems good to me, but I'll wait for the other comments from maría to be resolved before approving.

Comment thread example.env
AIRFLOW__API_AUTH__JWT_SECRET=
# @see https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html#generating-fernet-key
AIRFLOW__CORE__FERNET_KEY=
AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually

Host (optional)
Specify the entire url or the base of the url for the service.

…that is, per the docs, exactly how it should be specified.

Comment thread README.md
| `AIRFLOW_IMAGE_NAME` | Sets an alternate base image for Airflow, e.g. for `slim` images | `AIRFLOW_IMAGE_NAME="apache/airflow:slim-latest"` |
| `AIRFLOW__CORE__FERNET_KEY` | [Fernet](https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html) encryption key used to encrypt Airflow secrets | `AIRFLOW__CORE__FERNET_KEY="somebase64value="` |
| `AIRFLOW__API_AUTH__JWT_SECRET` | Secret key used to sign JWT tokens for Airflow's API authentication. The default value used in development and testing should be replaced in production. | `AIRFLOW__API_AUTH__JWT_SECRET="some32bytesecret"` |
| `AIRFLOW_CONN_TIND_DEFAULT` | Airflow connection json string for TIND access.<br>Note: the Connection params listed in the example are all needed! | `AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'` |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on the result of the above, ensure this is kept up to date with what we end up with.

Copy link
Copy Markdown
Member

@anarchivist anarchivist Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we want this in here at all and rather have a section of the README on connections.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We should also probably separate out Variables, if/when we get around to that refactor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants