AP-659 Migrate TIND connection details to an Airflow Connection#37
AP-659 Migrate TIND connection details to an Airflow Connection#37jason-raitz wants to merge 1 commit intomainfrom
Conversation
anarchivist
left a comment
There was a problem hiding this comment.
r+wc. generally looks good; these are mostly nitpicky things about how Airflow connections behave. we should decide on a pattern for handling base URL info for APIs.
| def __init__(self, _run_id: str, conn_id: str = "tind_default"): | ||
| self.run_id = _run_id | ||
| self.client = TINDClient(default_storage_dir=str(run_dir(_run_id))) | ||
| conn = Connection.get(conn_id) |
There was a problem hiding this comment.
Would it make more sense to just have conn_id instead be a conn argument that expects an airflow.sdk.Connection?
There was a problem hiding this comment.
That doesn't seem to be how most Airflow operators are written.
There was a problem hiding this comment.
Are we writing/using operators here? As far as I can tell we're just implementing the connection and not, say, a TindOperator.
There was a problem hiding this comment.
I suppose. I could see it either way. I lean slightly towards conn_id just because it feels more familiar having just dealt with the email stuff. Not going to be upset if we go with conn instead.
| AIRFLOW__API_AUTH__JWT_SECRET= | ||
| # @see https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html#generating-fernet-key | ||
| AIRFLOW__CORE__FERNET_KEY= | ||
| AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}' |
There was a problem hiding this comment.
Airflow HTTP connections shouldn't have path or protocol information in host.
| AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}' | |
| AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "digicoll.lib.berkeley.edu","schema": "https"}' |
I wonder if we should include a base_url value somewhere in extra?
There was a problem hiding this comment.
Actually…
Host(optional)
Specify the entire url or the base of the url for the service.
…that is, per the docs, exactly how it should be specified.
There was a problem hiding this comment.
Okay. Reading from the same docs:
Note that all components of the URI should be URL-encoded.
... which means we actually want something like this?
| AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}' | |
| AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https%3A%2F%2Fdigicoll.lib.berkeley.edu%2Fapi%2Fv1","schema": "https"}' |
There was a problem hiding this comment.
You're right, that looks more correct. Sorry for missing that bit.
There was a problem hiding this comment.
Given that we're extracting api_url from conn.host, and that includes the https:// prefix, why include the schema at all? (Note that ALL fields are optional, anyway. We have a lot of flexibility in how we define this.)
There was a problem hiding this comment.
Aside — Do we want to support using this with the HttpOperator? In that case we'd need to rethink this a bit, as the current method doesn't pass the API token properly (it needs to be set in an Authorization: Token {token} header).
If I'm understanding the docs/code right this is where hooks would be useful. There are a few moving parts, so bear with me for a second.
HttpHook has an auth_type property which can be used to augment requests with authorization info before they're sent off. Ignore the Any typehint — it's meant to be a subclass of requests.auth.AuthBase (see New Forms of Authentication). The HttpHook source shows that when login is set, auth_type is instantiated with login and password as arguments, so a custom TindTokenAuth(requests.auth.AuthBase) we allow us to tunnel the login from the Connection into the request's Auth header.
Now, obviously you can't embed the auth_type in the Connection itself, but you can pass it to an HttpOperator (along with a bunch of other useful stuff like a paginator). Here's how that might look:
from mokelumne.tind import TindTokenAuth
AIRFLOW_CONN_TIND_DEFAULT="""{
"conn_type": "http",
"login": "{api_key}",
"host": "digicoll.lib.berkeley.edu%2Fapi%2Fv1",
"schema": "https"
}"""
tind_get_record = HttpOperator(
conn_id="tind_default",
auth_type=TindTokenAuth,
method="GET",
endpoint="record/{record_id}",
data={"of": "xm"},
)This has pros and cons:
- Pro: Simply by storing the API token in
login, we make it possible to reuse our connection with the built-in HttpOperator. - Mixed: It's slightly odd to store the token in
login(though this does make for a cleaner URL). - Mixed: We still have to jump through hoops to specify
https://and the API's base path. - Con: We'd have to pass
auth_type=TindTokenAutheverywhere.
A custom TindHook could address those issues:
- We get full control over where the token is passed in the connection string.
- We also get full control over how
https://and the API base path are presented (e.g. the Connection.host could be just the TIND host name). - It binds the
tind://Connection type to TindTokenAuth, so you don't have to pass that to operators.
It also adds a few features unique to hooks:
- We can customize how this Connection type is presented in the UI.
- The Hook class can provide a factory method for producing TINDClients.
Really more food for thought than anything else. Just wanted to put into detail / words what I mentioned on the call about trade-offs, but didn't know enough at the time to fully comment on.
| self.client = TINDClient(default_storage_dir=str(run_dir(_run_id))) | ||
| conn = Connection.get(conn_id) | ||
| self.client = TINDClient( | ||
| api_url=conn.host or "", |
There was a problem hiding this comment.
this specific suggestion depends on how we want to handle the API base URL.
| api_url=conn.host or "", | |
| api_url=f"{conn.schema}://{conn.host}/api/v1", |
There was a problem hiding this comment.
I like this better, and I do believe it should fail if the host isn't specified.
There was a problem hiding this comment.
Revised suggestion:
| api_url=conn.host or "", | |
| api_url=conn.host, |
awilfox
left a comment
There was a problem hiding this comment.
Clear and straightforward. Seems good to me, but I'll wait for the other comments from maría to be resolved before approving.
| AIRFLOW__API_AUTH__JWT_SECRET= | ||
| # @see https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html#generating-fernet-key | ||
| AIRFLOW__CORE__FERNET_KEY= | ||
| AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}' |
There was a problem hiding this comment.
Actually…
Host(optional)
Specify the entire url or the base of the url for the service.
…that is, per the docs, exactly how it should be specified.
| | `AIRFLOW_IMAGE_NAME` | Sets an alternate base image for Airflow, e.g. for `slim` images | `AIRFLOW_IMAGE_NAME="apache/airflow:slim-latest"` | | ||
| | `AIRFLOW__CORE__FERNET_KEY` | [Fernet](https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html) encryption key used to encrypt Airflow secrets | `AIRFLOW__CORE__FERNET_KEY="somebase64value="` | | ||
| | `AIRFLOW__API_AUTH__JWT_SECRET` | Secret key used to sign JWT tokens for Airflow's API authentication. The default value used in development and testing should be replaced in production. | `AIRFLOW__API_AUTH__JWT_SECRET="some32bytesecret"` | | ||
| | `AIRFLOW_CONN_TIND_DEFAULT` | Airflow connection json string for TIND access.<br>Note: the Connection params listed in the example are all needed! | `AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'` | |
There was a problem hiding this comment.
Depending on the result of the above, ensure this is kept up to date with what we end up with.
There was a problem hiding this comment.
I'm wondering if we want this in here at all and rather have a section of the README on connections.
There was a problem hiding this comment.
Good point. We should also probably separate out Variables, if/when we get around to that refactor.
This refactors the fetch_tind util to use an airflow connection to instantiate the tind client. For now that connection is a json string in the env, but that can easily be moved to a secret.
Bonus for this intermediate step is that by setting the env key to
AIRFLOW_CONN_TIND_DEFAULT, airflow will automatically load the json string there if you ask for aConnection.get('tind_default')!