
HIVE-29544: Fix Vectorized Parquet reading Struct columns with all fields null #6408

Merged
ayushtkn merged 4 commits into apache:master from ayushtkn:parquet_struct
Apr 10, 2026

Conversation

Member

@ayushtkn ayushtkn commented Apr 4, 2026

Fix Vectorized Parquet reading Struct columns with all fields null

What changes were proposed in this pull request?

During vectorized Parquet reads, a struct column shouldn't be read as null just because all of its fields are null; the column should be read as null only if the struct value itself is null.

Why are the changes needed?

To keep behaviour consistent between vectorized and non-vectorized reads.

Does this PR introduce any user-facing change?

Yes. Vectorized reads of struct columns whose fields are all null no longer return the column as null.

How was this patch tested?

UT

@ayushtkn ayushtkn changed the title WIP HIVE-29544: Fix Vectorized Parquet reading Struct columns with all fields null Apr 4, 2026
@ayushtkn ayushtkn requested a review from abstractdog April 4, 2026 15:38
Contributor

@abstractdog abstractdog left a comment


@ayushtkn: this is a very valuable and important correctness patch; I've left some comments.

I also ran the qtest without the patch, and I can confirm the patch works.

I tried to pick up the definition-level concept (again), and I more or less understand this patch. I'm inclined to accept it if you cover the rest of the use cases I mentioned.

Contributor


does the same problem apply to other complex types? better to have more test cases, and maybe even rename the qfile to parquet_complex_..., maybe parquet_complex_types_null_vectorization.q, to clearly distinguish it from the already existing parquet_complex_types_vectorization.q

e.g.:
LIST:

NULL vs. [null, null] or [1, null]

or MAP ("same" as struct but not fixed schema)

NULL vs. { "a": 1, "b": null } and  { "a": 1, "b": null, "c": null }

or even nested struct to validate the logic on deeper recursive callpaths:

CREATE TABLE test_parquet_nested_struct_nulls (
    id INT,
    st_prim STRUCT<x:INT, y:INT>,
    st_nested STRUCT<x:INT, y:STRUCT<v:INT, w:INT>>
) STORED AS PARQUET;

and nested data can contain NULL at different levels where your patch is actually hit, I assume, e.g.:

NULL
{x: 1, y: NULL}
{x: 1, y: {v: 2, w: NULL}}
{x: 1, y: {v: 2, w: 3}}

I would appreciate it a lot if this patch could validate all of these

Member Author

@ayushtkn ayushtkn Apr 8, 2026


The struct value turning to NULL when its sub-fields are NULL was due to this code in VectorizedStructReader:

for (int j = 0; j < vectors[i].isNull.length; j++) {
  structColumnVector.isNull[j] =
      (i == 0) ? vectors[i].isNull[j] : structColumnVector.isNull[j] && vectors[i].isNull[j];
}
structColumnVector.noNulls =
    (i == 0) ? vectors[i].noNulls : structColumnVector.noNulls && vectors[i].noNulls;

It checked whether all fields were null and, if so, marked the struct itself null; that's what I fixed.
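To make the old vs. fixed behaviour concrete, here is a minimal, self-contained sketch in plain Java. Simple boolean arrays (true = row is null, matching the `isNull` convention in the snippet above) stand in for Hive's ColumnVector classes; the class and method names are mine, not Hive's:

```java
import java.util.Arrays;

public class StructNullMergeSketch {

  // Old behaviour (as in the quoted snippet): AND the child null masks
  // together, so a struct row whose fields are all null collapses to NULL.
  static boolean[] mergeOld(boolean[][] childIsNull) {
    int rows = childIsNull[0].length;
    boolean[] structIsNull = new boolean[rows];
    for (int i = 0; i < childIsNull.length; i++) {
      for (int j = 0; j < rows; j++) {
        structIsNull[j] =
            (i == 0) ? childIsNull[i][j] : structIsNull[j] && childIsNull[i][j];
      }
    }
    return structIsNull;
  }

  // Fixed behaviour: child nulls never null out the struct itself; the
  // struct's nullness comes only from whether the struct value is null
  // (derived from its own definition level in the real reader).
  static boolean[] mergeFixed(boolean[][] childIsNull, boolean[] structItselfNull) {
    return structItselfNull.clone();
  }

  public static void main(String[] args) {
    // Row 0: {x: null, y: null} — struct present, all fields null.
    // Row 1: {x: 1, y: 2}       — struct present, no nulls.
    boolean[][] children = { {true, false}, {true, false} };
    boolean[] structItselfNull = { false, false };

    System.out.println(Arrays.toString(mergeOld(children)));
    // old: [true, false] — row 0 wrongly collapses to NULL
    System.out.println(Arrays.toString(mergeFixed(children, structItselfNull)));
    // fixed: [false, false] — row 0 stays a struct with null fields
  }
}
```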

I checked LIST & MAP. They have another bug :-) In LIST & MAP, if you insert a NULL element, it defaults to 0.

e.g.

CREATE TABLE test_parquet_array_nulls (
    id INT,
    arr_prim ARRAY<INT>
) STORED AS PARQUET;

INSERT INTO test_parquet_array_nulls VALUES
    -- 1: Array exists, but all elements inside are NULL
    (1, array(CAST(NULL AS INT), CAST(NULL AS INT))),

    -- 2: The Array itself is strictly NULL
    (2, if(1=0, array(1, 2), null)),

    -- 3: Array exists, containing a mix of valid and NULL elements
    (3, array(3, CAST(NULL AS INT))),

    -- 4: Array exists, all elements are valid
    (4, array(4, 5));

SELECT * FROM test_parquet_array_nulls ORDER BY id;

It outputs:

+------------------------------+------------------------------------+
| test_parquet_array_nulls.id  | test_parquet_array_nulls.arr_prim  |
+------------------------------+------------------------------------+
| 1                            | [0,0]                              |
| 2                            | NULL                               |
| 3                            | [3,0]                              |
| 4                            | [4,5]                              |
+------------------------------+------------------------------------+

Disabling vectorization gives the correct output:

+------------------------------+------------------------------------+
| test_parquet_array_nulls.id  | test_parquet_array_nulls.arr_prim  |
+------------------------------+------------------------------------+
| 1                            | [null,null]                        |
| 2                            | NULL                               |
| 3                            | [3,null]                           |
| 4                            | [4,5]                              |
+------------------------------+------------------------------------+

This seems to be a different bug, where every NULL element is treated as 0.

Would it be OK to chase this in a different ticket? I believe somewhere it is returning the default int value instead of NULL; some check is wrong, which I have to debug.
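A hypothetical illustration of that suspected failure mode in plain Java (names are mine; this is not Hive's actual reader code): the value slot for a NULL element is never written and so holds the Java default 0, and a consumer that ignores the per-element null mask surfaces that 0 instead of null:

```java
public class ListNullSketch {

  // Simplified element column: a values array plus a parallel null mask,
  // mirroring the values/isNull pair in vectorized readers.
  static String render(long[] values, boolean[] isNull, boolean honourNulls) {
    StringBuilder sb = new StringBuilder("[");
    for (int i = 0; i < values.length; i++) {
      if (i > 0) sb.append(",");
      // Buggy path: honourNulls == false, the mask is ignored and the
      // unwritten slot's default value 0 leaks through.
      sb.append(honourNulls && isNull[i] ? "null" : Long.toString(values[i]));
    }
    return sb.append("]").toString();
  }

  public static void main(String[] args) {
    long[] values = {3, 0};           // slot for the null element holds default 0
    boolean[] isNull = {false, true}; // second element is actually NULL

    System.out.println(render(values, isNull, false)); // buggy:   [3,0]
    System.out.println(render(values, isNull, true));  // correct: [3,null]
  }
}
```

This matches the observed symptom above: `array(3, CAST(NULL AS INT))` reads back as `[3,0]` under vectorization but `[3,null]` without it.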

Regarding nested structs, vectorization is disabled there, so that path doesn't kick in:
https://issues.apache.org/jira/browse/HIVE-19016

For map we already have a test: https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/parquet_map_null_vectorization.q

Member Author


For MAP there is already a test, but its expected output is wrong: it is testing for null, yet the output has 0:
https://github.com/apache/hive/blame/13f3208c01fec2d20108302efc3fd033d1d76a19/ql/src/test/results/clientpositive/llap/parquet_map_null_vectorization.q.out#L153-L158

It should have been:

+-----+---------------+
| id  |    intmap     |
+-----+---------------+
| 1   | {1:null,2:3}  |
| 2   | NULL          |
+-----+---------------+

The inserts were

insert into parquet_map_type_int SELECT 1, MAP(1, null, 2, 3)
insert into parquet_map_type_int (id) VALUES (2)

Let me know if you are ok chasing this separately

Contributor


thanks for clarifying the different cases
the issue with MAPs is indeed another one, it's fine to be handled separately

Contributor

@abstractdog abstractdog left a comment


one more thing left in the qfile

Contributor



-- Validate withou vectorization
-- Validate without vectorization
SET hive.vectorized.execution.enabled=true;
Contributor


SET hive.vectorized.execution.enabled=false;

Member Author


Fixed!

@sonarqubecloud

@ayushtkn ayushtkn merged commit 3dec709 into apache:master Apr 10, 2026
2 checks passed


3 participants