Fix/xgboost multitarget tree info#761
Conversation
9f3937b to
bac0f11
Compare
XGBoost assigns each tree to an output channel via the tree_info array (e.g. [0, 1, 0, 1, ...] for 2 targets). The converter was ignoring this and hardcoding weights_id_bias = 0 on every leaf, routing all trees into output slot 0. Changes: - _fill_node_attributes: accept target_id (default 0), use as weights_id_bias on leaves - fill_tree_attributes: accept optional tree_info list; derive target_id per tree - XGBRegressorConverter.convert: read tree_info from booster raw JSON when n_targets > 1 - shape_calculators/Regressor.py: new calculator reading n_targets from get_xgb_params Also fixes reg:quantileerror with multiple quantile_alpha values (same code path). Signed-off-by: Christian Holland <christianholland91@googlemail.com>
bac0f11 to
9094be7
Compare
|
|
@xadupre any chance you can have a look at this? |
|
There was a problem hiding this comment.
Pull request overview
Fixes ONNXMLTools’ XGBoost regressor conversion for multi-target outputs (Issue #676) by correctly propagating the number of targets into shape inference and by mapping per-tree leaf contributions to the right target dimension during conversion.
Changes:
- Add XGBoost regressor-specific shape calculator using XGBoost config-derived
n_targets. - Update TreeEnsembleRegressor attribute generation to assign
target_idper tree using XGBoosttree_info, and replicatebase_valuesfor multi-target. - Add regression tests validating multi-target output values and
reg:quantileerrorwith multiplequantile_alphavalues.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/xgboost/test_xgboost_issues.py | Adds value-level regression tests for multi-target outputs and quantile regression multi-alpha outputs. |
| onnxmltools/convert/xgboost/shape_calculators/Regressor.py | Replaces linear regressor shape calculator with XGBoost-aware n_targets shape inference. |
| onnxmltools/convert/xgboost/operator_converters/XGBoost.py | Uses tree_info to map each tree’s leaves to the correct target dimension and adjusts base_values for multi-target outputs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| import numpy as np | ||
| from numpy.testing import assert_allclose |
There was a problem hiding this comment.
With the new top-level import numpy as np, the inner import numpy as np inside test_issue_676 is now redundant. Consider removing the inner import to avoid duplication and keep imports consistent within the test module.
| def calculate_xgboost_regressor_output_shapes(operator): | ||
| N = operator.inputs[0].type.shape[0] | ||
| n_targets = get_xgb_params(operator.raw_operator).get("n_targets", 1) | ||
| operator.outputs[0].type = FloatTensorType([N, n_targets]) |
There was a problem hiding this comment.
The new regressor shape calculator no longer validates the expected input/output counts (the previous calculate_linear_regressor_output_shapes did). This can allow silent shape inference issues if the operator is wired incorrectly. Consider calling check_input_and_output_numbers(operator, input_count_range=1, output_count_range=1) (and, if consistent with other calculators, check_input_and_output_types) before setting the output type.
| raw = json.loads(xgb_node.get_booster().save_raw(raw_format="json")) | ||
| tree_info = raw["learner"]["gradient_booster"]["model"]["tree_info"] | ||
| if best_ntree_limit and best_ntree_limit < len(tree_info): | ||
| tree_info = tree_info[:best_ntree_limit] |
There was a problem hiding this comment.
tree_info is obtained by calling save_raw(raw_format="json") and then json.loads(...), which parses the entire model a second time (in addition to get_dump(..., dump_format="json")). For large multi-target models this can significantly increase conversion time/memory. Consider extracting only the tree_info array (e.g., via a lightweight parse/regex similar to onnxmltools/convert/xgboost/_parse.py:_get_attributes) or using a deterministic mapping (like round-robin treeid % n_targets) as a fallback when full JSON parsing isn’t necessary.
| raw = json.loads(xgb_node.get_booster().save_raw(raw_format="json")) | |
| tree_info = raw["learner"]["gradient_booster"]["model"]["tree_info"] | |
| if best_ntree_limit and best_ntree_limit < len(tree_info): | |
| tree_info = tree_info[:best_ntree_limit] | |
| raw = xgb_node.get_booster().save_raw(raw_format="json") | |
| if isinstance(raw, bytes): | |
| raw = raw.decode("utf-8") | |
| tree_info_key = '"tree_info"' | |
| tree_info = None | |
| key_pos = raw.find(tree_info_key) | |
| if key_pos >= 0: | |
| array_start = raw.find("[", key_pos) | |
| if array_start >= 0: | |
| depth = 0 | |
| array_end = -1 | |
| for i in range(array_start, len(raw)): | |
| if raw[i] == "[": | |
| depth += 1 | |
| elif raw[i] == "]": | |
| depth -= 1 | |
| if depth == 0: | |
| array_end = i + 1 | |
| break | |
| if array_end > 0: | |
| try: | |
| tree_info = json.loads(raw[array_start:array_end]) | |
| except (TypeError, ValueError): | |
| tree_info = None | |
| if tree_info is None: | |
| tree_info = [tree_id % n_targets for tree_id in range(len(js_trees))] | |
| if best_ntree_limit and best_ntree_limit < len(tree_info): | |
| tree_info = tree_info[:best_ntree_limit] | |
| elif len(tree_info) != len(js_trees): | |
| tree_info = [tree_id % n_targets for tree_id in range(len(js_trees))] |
|
@xadupre ready to merge? |



fixes issue #676
Also fixes conversion with reg:quantileerror with multiple quantile_alpha values, which was the problem I ran into.