-
Notifications
You must be signed in to change notification settings - Fork 0
AP-659 Migrate TIND connection details to an Airflow Connection #37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -4,6 +4,7 @@ | |||||||||
| AIRFLOW__API_AUTH__JWT_SECRET= | ||||||||||
| # @see https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html#generating-fernet-key | ||||||||||
| AIRFLOW__CORE__FERNET_KEY= | ||||||||||
| AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}' | ||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Airflow HTTP connections shouldn't have path or protocol information in
Suggested change
I wonder if we should include a
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually…
…that is, per the docs, exactly how it should be specified.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay. Reading from the same docs:
... which means we actually want something like this?
Suggested change
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You're right, that looks more correct. Sorry for missing that bit.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given that we're extracting
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Aside — Do we want to support using this with the HttpOperator? In that case we'd need to rethink this a bit, as the current method doesn't pass the API token properly (it needs to be set in an If I'm understanding the docs/code right this is where hooks would be useful. There are a few moving parts, so bear with me for a second. HttpHook has an auth_type property which can be used to augment requests with authorization info before they're sent off. Ignore the Now, obviously you can't embed the auth_type in the Connection itself, but you can pass it to an HttpOperator (along with a bunch of other useful stuff like a paginator). Here's how that might look: from mokelumne.tind import TindTokenAuth
AIRFLOW_CONN_TIND_DEFAULT="""{
"conn_type": "http",
"login": "{api_key}",
"host": "digicoll.lib.berkeley.edu%2Fapi%2Fv1",
"schema": "https"
}"""
tind_get_record = HttpOperator(
conn_id="tind_default",
auth_type=TindTokenAuth,
method="GET",
endpoint="record/{record_id}",
data={"of": "xm"},
)This has pros and cons:
A custom TindHook could address those issues:
It also adds a few features unique to hooks:
Really more food for thought than anything else. Just wanted to put into detail / words what I mentioned on the call about trade-offs, but didn't know enough at the time to fully comment on. |
||||||||||
| AIRFLOW_UID=49003 | ||||||||||
|
|
||||||||||
| # Set KeyCloak's logging level. "DEBUG" can be useful when | ||||||||||
|
|
@@ -19,10 +20,6 @@ OIDC_NAME="keycloak" | |||||||||
| OIDC_USER_GROUP="cn=edu:berkeley:org:libr:mokelumne:users,ou=campus groups,dc=berkeley,dc=edu" | ||||||||||
| OIDC_WELL_KNOWN="http://keycloak:8180/realms/berkeley-local/.well-known/openid-configuration" | ||||||||||
|
|
||||||||||
| # TBD | ||||||||||
| TIND_API_KEY= | ||||||||||
| TIND_API_URL= | ||||||||||
|
|
||||||||||
| LANGFUSE_HOST=https://us.cloud.langfuse.com | ||||||||||
| LANGFUSE_SECRET_KEY=sk-lf-blah-blah-blah | ||||||||||
| LANGFUSE_PUBLIC_KEY=pk-lf-blah-blah-blah | ||||||||||
|
|
||||||||||
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -2,16 +2,24 @@ | |||||||||
|
|
||||||||||
| import csv | ||||||||||
|
|
||||||||||
| from airflow.sdk import Connection | ||||||||||
|
|
||||||||||
| from tind_client import TINDClient | ||||||||||
|
|
||||||||||
| from mokelumne.util.storage import run_dir, record_dir | ||||||||||
|
|
||||||||||
|
|
||||||||||
| class FetchTind: | ||||||||||
| """Helper methods for fetching items from TIND using TINDClient.""" | ||||||||||
| def __init__(self, _run_id: str): | ||||||||||
|
|
||||||||||
| def __init__(self, _run_id: str, conn_id: str = "tind_default"): | ||||||||||
| self.run_id = _run_id | ||||||||||
| self.client = TINDClient(default_storage_dir=str(run_dir(_run_id))) | ||||||||||
| conn = Connection.get(conn_id) | ||||||||||
|
Comment on lines
+15
to
+17
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would it make more sense to just have
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That doesn't seem to be how most Airflow operators are written.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are we writing/using operators here? As far as I can tell we're just implementing the connection and not, say, a
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suppose. I could see it either way. I lean slightly towards |
||||||||||
| self.client = TINDClient( | ||||||||||
| api_url=conn.host or "", | ||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this specific suggestion depends on how we want to handle the API base URL.
Suggested change
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I like this better, and I do believe it should fail if the host isn't specified.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Revised suggestion:
Suggested change
|
||||||||||
| api_key=conn.password or "", | ||||||||||
| default_storage_dir=str(run_dir(_run_id)), | ||||||||||
| ) | ||||||||||
|
|
||||||||||
| def get_ids(self, tind_query: str) -> list[str]: | ||||||||||
| """Return the TIND IDs that match a given query.""" | ||||||||||
|
|
@@ -36,5 +44,7 @@ def save_tind_ids_file(self, ids: list[str]) -> None: | |||||||||
|
|
||||||||||
| def write_query_results_to_xml(self, tind_query: str, file_name: str = "") -> int: | ||||||||||
| """Download the XML results of a search query from TIND.""" | ||||||||||
| records_written = self.client.write_search_results_to_file(tind_query, file_name) | ||||||||||
| records_written = self.client.write_search_results_to_file( | ||||||||||
| tind_query, file_name | ||||||||||
| ) | ||||||||||
| return int(records_written) | ||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on the result of the above, ensure this is kept up to date with what we end up with.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we want this in here at all and rather have a section of the README on connections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. We should also probably separate out
Variables, if/when we get around to that refactor.