Encryption for REST catalog #13225
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
(Bump to remove staleness) |
d007aa2 to
889cbbf
Compare
|
Given #7770 is merged, curious for thoughts on this PR. |
|
Could you elaborate on the API additions? I think it would help to have some more description of the general direction of this or |
|
@smaheshwar-pltr Could you please resolve the conflicts? |
|
@huaxingao @smaheshwar-pltr Our team has a person who works on encryption with the REST catalog. If @smaheshwar-pltr does not object, we can follow up on this patch. |
| return encryptionManager; | ||
| } | ||
|
|
||
| private void encryptionPropsFromMetadata(TableMetadata metadata) { |
Is this method applied to the TableMetadata that is fetched directly from the REST catalog, and not to the one read from the metadata.json file? Both are possible, but the former must override (and check) the latter, to protect against key removal and other attacks.
Yes, this method will always be applied to the metadata field of a LoadTableResponse object received directly from the REST catalog. Its only usage within this class is as such, and you can check the constructor usages within RESTSessionCatalog to confirm that the metadata coming in from there is as such too.
Question (not sure if there have been discussions here or if your team have thoughts): we want the key ID to come from the REST catalog service directly for security reasons.
It's typical for REST catalogs to provide metadata that corresponds to the metadata file in storage and not modify it apart from that. Given this, would it be preferable to have this field returned within the LoadTableResponse itself, to encourage catalogs to track it explicitly?
The concrete proposal here might be: ENCRYPTION_TABLE_KEY and ENCRYPTION_DEK_LENGTH become properties on the LoadTableResponse's config (mentioned in the REST spec here).
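To make the proposal above concrete, here is a rough sketch of how a client could resolve the two encryption properties when the catalog returns them in the response config. This is illustration only, not the code in this PR: `EncryptionProps` and `resolve` are hypothetical names, the property strings are assumed to mirror ENCRYPTION_TABLE_KEY and ENCRYPTION_DEK_LENGTH, and failing on a mismatch is just one possible policy for the "override (and check)" behavior discussed earlier.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: trusts the catalog-provided config over metadata.json. */
final class EncryptionProps {
  // Assumed property names, mirroring ENCRYPTION_TABLE_KEY and ENCRYPTION_DEK_LENGTH.
  static final String TABLE_KEY = "encryption.key-id";
  static final String DEK_LENGTH = "encryption.data-key-length";

  private EncryptionProps() {}

  /**
   * @param catalogConfig values returned directly by the catalog (treated as trusted)
   * @param fileProps properties read from metadata.json in storage (treated as untrusted)
   * @return the encryption properties to use, taken only from the catalog
   */
  static Map<String, String> resolve(
      Map<String, String> catalogConfig, Map<String, String> fileProps) {
    Map<String, String> resolved = new HashMap<>();
    for (String name : new String[] {TABLE_KEY, DEK_LENGTH}) {
      String trusted = catalogConfig.get(name);
      if (trusted == null) {
        // The catalog does not track this property; never fall back to the file value.
        continue;
      }

      String untrusted = fileProps.get(name);
      if (untrusted != null && !untrusted.equals(trusted)) {
        // The file disagrees with the catalog: treat it as possible tampering.
        throw new SecurityException("Mismatch for " + name + " between catalog and metadata file");
      }

      // A value missing from the file (e.g. key removal) is overridden by the catalog value.
      resolved.put(name, trusted);
    }
    return resolved;
  }
}
```

With this policy, a key ID stripped from the file in storage does not downgrade the table to unencrypted, because the file value is never used on its own.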
Well I see two scenarios when thinking about this:
1. metadata.json is something that both the server and the clients can read (although clients wouldn't need to, given they get the metadata with the LoadTableResponse)
2. metadata.json can only be accessed on the server side and clients are not given FS credentials (either vended or not) to reach it
For case (1) I totally agree: we can't rely on metadata.json alone to store these encryption properties. The catalog should store them separately too, and eventually populate the properties in the LoadTableResponse it creates (i.e. do the override logic referred to by @ggershinsky).
For case (2) I'm not 100% sure, but I'm still leaning toward the catalog taking on this responsibility.
Either way, on the client side there's not much we can do other than recommend that clients consider only the metadata from the LoadTableResponse. The rest (no pun intended) is for the server side to decide and will be implementation-specific. For the code snippet above, this is irrelevant IMHO.
Let me know your thoughts.
Afaik, HMS is not optimized for JSON storage. But maybe someone in the community will take on storing the full metadata object there, to improve table security. Having only the table properties is barely sufficient. I think we should recommend REST catalogs for encrypted tables.
It's not. But would storing a hash suffice? If so we could generate the hash of the whole JSON content and store it via an additional (Hive) table property. Then during table loading we can verify that the TableMetadata we just read in from a potentially untrusted storage (and yet metadata.json is not encrypted) is original or has been tampered with.
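A minimal sketch of the hash idea, under stated assumptions: SHA-256 is used as the digest, the hash is taken over the exact text of metadata.json, and the expected value comes from a trusted catalog-side table property. `MetadataHash`, `sha256Hex`, `matches`, and the property name `metadata-json-sha256` are all hypothetical, not existing Iceberg or Hive names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: detects tampering of metadata.json via a stored hash. */
final class MetadataHash {
  // Hypothetical table property that would hold the expected hash in the catalog.
  static final String HASH_PROPERTY = "metadata-json-sha256";

  private MetadataHash() {}

  /** Returns the lowercase hex SHA-256 of the metadata JSON text (UTF-8 encoded). */
  static String sha256Hex(String metadataJson) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(metadataJson.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 must be available on every Java platform implementation.
      throw new IllegalStateException("SHA-256 is not available", e);
    }
  }

  /** Returns true if the JSON read from storage matches the hash stored by the catalog. */
  static boolean matches(String metadataJson, String expectedHex) {
    byte[] actual = sha256Hex(metadataJson).getBytes(StandardCharsets.US_ASCII);
    byte[] expected = expectedHex.getBytes(StandardCharsets.US_ASCII);
    return MessageDigest.isEqual(actual, expected);
  }
}
```

On commit, the writer would store `sha256Hex(json)` under the property; on load, the reader would reject the table if `matches` returns false. Note that this only detects modification. It does not hide the contents of metadata.json.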
One aspect of having an encrypted metadata.json is when the table schema is also considered a sensitive piece of information. I haven't found this in the discussions but do you know if this has ever been considered @ggershinsky ?
would storing a hash suffice? If so we could generate the hash of the whole JSON content and store it via an additional (Hive) table property. Then during table loading we can verify that the TableMetadata we just read in from a potentially untrusted storage (and yet metadata.json is not encrypted) is original or has been tampered with.
I think it's a good idea
One aspect of having an encrypted metadata.json is when the table schema is also considered a sensitive piece of information. I haven't found this in the discussions but do you know if this has ever been considered
Not sure. Though, it should be possible to have a REST implementation that hides the metadata.json file from the storage.
I think it's a good idea
Sounds good, I can take this on and will produce a PR shortly.
Not sure. Though, it should be possible to have a REST implementation that hides the metadata.json file from the storage.
Yes, with REST that's true. I meant it in a general sense: e.g. it's not currently possible to hide the schema of an encrypted table with the Hive catalog. It may just be one more thing to note/document as a limitation of encryption with the Hive catalog. I merely wanted to highlight this.
|
Also, it would be good to refactor (if possible) the code common to this PR and #13066, so that other catalogs will be able to reuse it. |
Great catch @ggershinsky, addressed in 3c055ce. |
|
@huaxingao @singhpk234 @ggershinsky, my apologies for the delay here, I was caught up in other work. I believe all comments have now been addressed or responded to, PTAL! 🙏 |
|
Thanks for the update on this PR. I reviewed the recent discussion and this looks close. I have one suggestion that might help with merge confidence: Could we make the security/behavior contract explicit in RESTOperationsBuilder Javadocs (or nearby API docs), especially for:
If possible, adding one focused negative test (or equivalent assertion) for custom/unsafe table-ops behavior would also make this easier to reason about for downstream users. Overall, appreciate the work here. This is an important capability for REST catalog users. |
|
@singhpk234 Do you have more comments? Are you OK with deferring the remote scan planning test to a follow-up? #13225 (comment) |
|
Friendly ping @singhpk234 🙏 |
|
Friendly ping, @singhpk234! |
|
Apologies, I lost track of this PR :( One thing I am unclear on is how the REST catalog gives creds for encryption. Is it implicitly assumed that the storage config will also have creds for encryption, e.g. for a KMS? What happens in a case such as Vault? How does the REST server give those creds back? |
| } | ||
|
|
||
| @Override | ||
| public EncryptionManager encryption() { |
Can this be called in a multi-threaded fashion? In that case we might create more than one instance.
Thanks for flagging - I'm mirroring here what we did for Hive
cc @ggershinsky. I think this is fine, as I believe that TableOperations are inherently not thread-safe:
For example, BaseMetastoreTableOperations has unsynchronized mutable fields such as currentMetadata
and this very RESTTableOperations class does too:
I don't think concurrent use of a single table operations class is supported. LMKWYT!
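For readers following the thread-safety question: if concurrent calls to a lazily initialized getter were a concern, one option is to guard the initialization with a lock so that at most one instance is ever created. The sketch below shows that general technique only. `LazyHolder` is a hypothetical class, and this is not necessarily what the PR or the Hive integration does.

```java
import java.util.function.Supplier;

/** Hypothetical holder: creates its value at most once, even under concurrent access. */
final class LazyHolder<T> {
  private final Supplier<T> factory;
  private T instance; // guarded by "this"

  LazyHolder(Supplier<T> factory) {
    this.factory = factory;
  }

  /** Returns the single instance, creating it on first use. Assumes the factory never returns null. */
  synchronized T get() {
    if (instance == null) {
      instance = factory.get();
    }
    return instance;
  }
}
```

An unsynchronized version of `get()` is simpler and is fine when each operations object is confined to one thread, which is the assumption argued for above.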
Thanks, @singhpk234 - 2092454. |
Thanks for flagging this @singhpk234, it's a great point. This PR integrates with the existing
I think that per-table KMS credential vending is valuable - I can imagine REST servers wanting to follow the principle of least privilege and subscope KMS credentials where that's possible. I do think there's a discussion to be had - the AWS KMS client doesn't pick up the
On this PR: I'd prefer to split PRs and discussions (we've already discussed a lot here) - in my opinion, it makes sense to ship catalog-property KMS configuration first because I think this should be supported anyway (in the same way that you can initialise a
(+ cc @ggershinsky as I know you folks have been thinking about REST integration too, would love to hear your thoughts on this)
Edit: I've put up #16194 |
…cherry-pick # Conflicts: # spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/sql/TestTableEncryption.java # spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/sql/TestTableEncryption.java
|
Basically, the lifecycle of a KMS client object is driven by these events:
Also, it's possible to run a thread in the KMS client that would asynchronously handle the credentials. So the mechanism seems to be quite flexible. But if it doesn't cover important use cases, we can consider changes. |
|
Thanks for the response @smaheshwar-pltr @ggershinsky!
Well, that being said, I don't want to block the progress of this PR, and I appreciate all the work, but I would really like to conclude on this before we mark the REST catalog as ready to support encryption. |
|
Started a dev thread to discuss this: https://lists.apache.org/thread/z48t5wgx778j17pzto9kqxwysw4ysxxo Also, please feel free to move forward with this PR! For now, let's not conclude that the REST catalog is ready to support encryption. |
|
Thanks, I've responded to the mailing list thread here: https://lists.apache.org/thread/t9sj6nlxgxyl9k5cbmf70gnmrhgxz1xg. I think there was some confusion, as this PR does not implement REST KMS credential vending, which is a larger issue that definitely needs more discussion if we want to support it. |
|
@huaxingao, all comments here have now been addressed. Might we be able to get this in for the upcoming release? |
|
+1 to including this PR in 1.11. The patch does not handle credentials (delegating this to KMS clients) - instead, it provides scaffolding for table encryption in the REST catalog client, similar to the HMS catalog client (and other catalogs in the future). |
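To illustrate what "delegating this to KMS clients" can look like, here is a toy sketch, not the Iceberg interface and not code from this PR. `ToyKmsClient`, `InMemoryKms`, and the `toy.kms.*` property names are hypothetical. The catalog client only passes properties through; the KMS client decides how to obtain key access. Here the master key is read straight from the properties, which a real client would never do.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.Key;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical stand-in for a pluggable KMS client. */
interface ToyKmsClient {
  void initialize(Map<String, String> properties);

  byte[] wrapKey(byte[] dataKey, String masterKeyId);

  byte[] unwrapKey(byte[] wrappedKey, String masterKeyId);
}

/** Toy implementation: holds one AES master key in memory, for illustration only. */
final class InMemoryKms implements ToyKmsClient {
  private String masterKeyId;
  private SecretKeySpec masterKey;

  @Override
  public void initialize(Map<String, String> properties) {
    // Hypothetical property names; key material must be 16, 24, or 32 bytes for AES.
    this.masterKeyId = properties.get("toy.kms.key-id");
    byte[] material = properties.get("toy.kms.key-material").getBytes(StandardCharsets.UTF_8);
    this.masterKey = new SecretKeySpec(material, "AES");
  }

  @Override
  public byte[] wrapKey(byte[] dataKey, String keyId) {
    checkKeyId(keyId);
    try {
      Cipher cipher = Cipher.getInstance("AESWrap");
      cipher.init(Cipher.WRAP_MODE, masterKey);
      return cipher.wrap(new SecretKeySpec(dataKey, "AES"));
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Failed to wrap data key", e);
    }
  }

  @Override
  public byte[] unwrapKey(byte[] wrappedKey, String keyId) {
    checkKeyId(keyId);
    try {
      Cipher cipher = Cipher.getInstance("AESWrap");
      cipher.init(Cipher.UNWRAP_MODE, masterKey);
      Key key = cipher.unwrap(wrappedKey, "AES", Cipher.SECRET_KEY);
      return key.getEncoded();
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Failed to unwrap data key", e);
    }
  }

  private void checkKeyId(String keyId) {
    if (masterKeyId == null || !masterKeyId.equals(keyId)) {
      throw new IllegalArgumentException("Unknown master key: " + keyId);
    }
  }
}
```

The point of the sketch is the boundary: credentials and key access live entirely behind the KMS client, so the catalog client code stays the same regardless of which KMS is plugged in.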
|
I agree with Gidon's comment above. Friendly ping @huaxingao (sorry to pester) in case you have thoughts! |
|
Agreed with @ggershinsky. Let's have this in and address the other questions later, as they don't seem to be that tightly connected. This PR has been dragging on for quite a while now. |
This PR implements client-side support for REST catalog encryption. With it, clients interacting with a REST catalog can read and write encrypted data.
It is similar to #13066, which integrates encryption with the Hive catalog.
cc @rdblue @RussellSpitzer @ggershinsky