-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29544: Fix Vectorized Parquet reading Struct columns with all fields null #6408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 2 commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
31 changes: 31 additions & 0 deletions
31
ql/src/test/queries/clientpositive/parquet_struct_with_null_vectorization.q
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| -- SORT_QUERY_RESULTS | ||
| SET hive.vectorized.execution.enabled=true; | ||
| set hive.vectorized.execution.reduce.enabled=true; | ||
| SET hive.fetch.task.conversion=none; | ||
|
|
||
| CREATE TABLE test_parquet_struct_nulls ( | ||
| id INT, | ||
| st_prim STRUCT<x:INT, y:INT> | ||
| ) STORED AS PARQUET; | ||
|
|
||
| INSERT INTO test_parquet_struct_nulls VALUES | ||
| (1, named_struct('x', CAST(NULL AS INT), 'y', CAST(NULL AS INT))), | ||
| (2, if(1=0, named_struct('x', 1, 'y', 1), null)), | ||
| (3, named_struct('x', 3, 'y', CAST(NULL AS INT))), | ||
| (4, named_struct('x', 4, 'y', 4)); | ||
|
|
||
| -- Test A: Full table scan to check JSON representation | ||
| SELECT * FROM test_parquet_struct_nulls; | ||
|
|
||
| -- Test B: Verify IS NULL evaluates correctly | ||
| SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NULL; | ||
|
|
||
| -- Test C: Verify IS NOT NULL evaluates correctly | ||
| SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL; | ||
|
|
||
| -- Test D: Verify field-level null evaluation inside a valid struct | ||
| SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL AND st_prim.x IS NULL; | ||
|
|
||
| -- Validate without vectorization | ||
| SET hive.vectorized.execution.enabled=true; | ||
|
||
| SELECT * FROM test_parquet_struct_nulls; | ||
85 changes: 85 additions & 0 deletions
85
ql/src/test/results/clientpositive/llap/parquet_struct_with_null_vectorization.q.out
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,85 @@ | ||
| PREHOOK: query: CREATE TABLE test_parquet_struct_nulls ( | ||
| id INT, | ||
| st_prim STRUCT<x:INT, y:INT> | ||
| ) STORED AS PARQUET | ||
| PREHOOK: type: CREATETABLE | ||
| PREHOOK: Output: database:default | ||
| PREHOOK: Output: default@test_parquet_struct_nulls | ||
| POSTHOOK: query: CREATE TABLE test_parquet_struct_nulls ( | ||
| id INT, | ||
| st_prim STRUCT<x:INT, y:INT> | ||
| ) STORED AS PARQUET | ||
| POSTHOOK: type: CREATETABLE | ||
| POSTHOOK: Output: database:default | ||
| POSTHOOK: Output: default@test_parquet_struct_nulls | ||
| PREHOOK: query: INSERT INTO test_parquet_struct_nulls VALUES | ||
| (1, named_struct('x', CAST(NULL AS INT), 'y', CAST(NULL AS INT))), | ||
| (2, if(1=0, named_struct('x', 1, 'y', 1), null)), | ||
| (3, named_struct('x', 3, 'y', CAST(NULL AS INT))), | ||
| (4, named_struct('x', 4, 'y', 4)) | ||
| PREHOOK: type: QUERY | ||
| PREHOOK: Input: _dummy_database@_dummy_table | ||
| PREHOOK: Output: default@test_parquet_struct_nulls | ||
| POSTHOOK: query: INSERT INTO test_parquet_struct_nulls VALUES | ||
| (1, named_struct('x', CAST(NULL AS INT), 'y', CAST(NULL AS INT))), | ||
| (2, if(1=0, named_struct('x', 1, 'y', 1), null)), | ||
| (3, named_struct('x', 3, 'y', CAST(NULL AS INT))), | ||
| (4, named_struct('x', 4, 'y', 4)) | ||
| POSTHOOK: type: QUERY | ||
| POSTHOOK: Input: _dummy_database@_dummy_table | ||
| POSTHOOK: Output: default@test_parquet_struct_nulls | ||
| POSTHOOK: Lineage: test_parquet_struct_nulls.id SCRIPT [] | ||
| POSTHOOK: Lineage: test_parquet_struct_nulls.st_prim SCRIPT [] | ||
| PREHOOK: query: SELECT * FROM test_parquet_struct_nulls | ||
| PREHOOK: type: QUERY | ||
| PREHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| POSTHOOK: query: SELECT * FROM test_parquet_struct_nulls | ||
| POSTHOOK: type: QUERY | ||
| POSTHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| 1 {"x":null,"y":null} | ||
| 2 NULL | ||
| 3 {"x":3,"y":null} | ||
| 4 {"x":4,"y":4} | ||
| PREHOOK: query: SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NULL | ||
| PREHOOK: type: QUERY | ||
| PREHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| POSTHOOK: query: SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NULL | ||
| POSTHOOK: type: QUERY | ||
| POSTHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| 2 | ||
| PREHOOK: query: SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL | ||
| PREHOOK: type: QUERY | ||
| PREHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| POSTHOOK: query: SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL | ||
| POSTHOOK: type: QUERY | ||
| POSTHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| 1 | ||
| 3 | ||
| 4 | ||
| PREHOOK: query: SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL AND st_prim.x IS NULL | ||
| PREHOOK: type: QUERY | ||
| PREHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| POSTHOOK: query: SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL AND st_prim.x IS NULL | ||
| POSTHOOK: type: QUERY | ||
| POSTHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| 1 | ||
| PREHOOK: query: SELECT * FROM test_parquet_struct_nulls | ||
| PREHOOK: type: QUERY | ||
| PREHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| POSTHOOK: query: SELECT * FROM test_parquet_struct_nulls | ||
| POSTHOOK: type: QUERY | ||
| POSTHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| 1 {"x":null,"y":null} | ||
| 2 NULL | ||
| 3 {"x":3,"y":null} | ||
| 4 {"x":4,"y":4} |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the same problem applies to another complex types? better to have more test cases and maybe even rename the qfile to
parquet_complex_..., maybeparquet_complex_types_null_vectorization.qto clearly distinguish from the already existingparquet_complex_types_vectorization.qe.g.:
LIST:
or MAP ("same" as struct but not fixed schema)
or even nested struct to validate the logic on deeper recursive callpaths:
and nested data can contain NULL on different levels where you patch is actually hit I assume, e.g.:
I would appreciate a lot if this patch could validate all of these
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Struct value turning to
Null, if sub fields areNULLwas due to specific code inVectorizedStructReaderhive/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedStructColumnReader.java
Lines 50 to 55 in 13f3208
It used to check, if all fields have
null, it used to setnull, that I fixed.I checked the LIST & MAP. They have another bug :-) In LIST & MAP,
If you insert
NULLin them it defaults to0eg.
It outputs
Disabling vectorization gives correct
This seems some different bug, like every
NULLis treated as 0.Will it be ok, if we chase this in a different ticket. I believe it is some where it is returning default value of int instead
NULL, some check is wrong which I have to debug.Regarding nested, vectorization is disabled so, that doesn't kick in:
https://issues.apache.org/jira/browse/HIVE-19016
For map we already have a test: https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/parquet_map_null_vectorization.q
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the Map there is test but the output is wrong, it is testing for
nullbut the output is having 0,https://github.com/apache/hive/blame/13f3208c01fec2d20108302efc3fd033d1d76a19/ql/src/test/results/clientpositive/llap/parquet_map_null_vectorization.q.out#L153-L158
It should have been
The inserts were
Let me know if you are ok chasing this separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for clarifying the different cases
the issue with MAPs is indeed another one, it's fine to be handled separately