SynapseML version
1.1.0
System information
- Language version: Scala 2.12
- Spark version: 3.5.6
- Spark platform: Synapse (3.5.6-scala2.12-java17-python3-ubuntu)
Describe the problem
Hi team. My LightGBM classifier training job keeps failing with an unexplained crash. After checking the full log of the YARN application, I experimented with different settings of executionMode, matrixType, useSingleDatasetMode, maxBin, and many others. I even tried using only a single categorical feature. The result is always the same: with executors=1 everything works, but with more than one executor, training fails.
```
26/01/16 20:12:17 INFO StreamingPartitionTask: Calling NetworkInit on local port 12401 with value 10.244.9.118:12403,10.244.12.46:12405,10.244.2.35:12401,10.244.5.206:12404,10.244.19.244:12402
[LightGBM] [Info] Trying to bind port 12401...
[LightGBM] [Info] Binding port 12401 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 260 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 338 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 439 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 570 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 741 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 963 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 1251 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 1626 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 2113 milliseconds
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 3
[LightGBM] [Info] Connected to rank 4
[LightGBM] [Info] Local rank: 2, total number of machines: 5
26/01/16 20:12:26 INFO StreamingPartitionTask: NetworkInit succeeded. LightGBM task listening on: 12401
26/01/16 20:12:26 INFO StreamingPartitionTask: Waiting for all data prep to be done, task 8351, partition 2
26/01/16 20:12:26 INFO StreamingPartitionTask: Getting final training Dataset for partition 2.
26/01/16 20:12:26 INFO StreamingPartitionTask: Getting final validation Dataset for partition 2.
26/01/16 20:12:26 INFO StreamingPartitionTask: Creating LightGBM Booster for partition 2, task 8351
[LightGBM] [Info] Number of positive: 239805, number of negative: 655618
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.263171
[LightGBM] [Info] Total Bins 19583
[LightGBM] [Info] Number of data points in the train set: 179083, number of used features: 53
26/01/16 20:12:27 INFO StreamingPartitionTask: Beginning training on LightGBM Booster for task 8351, partition 2
26/01/16 20:12:27 INFO StreamingPartitionTask: LightGBM task starting iteration 0
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.267812 -> initscore=-1.005752
[LightGBM] [Info] Start training from score -1.005752
A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x00007629255834db, pid=6, tid=42
JRE version: OpenJDK Runtime Environment Temurin-17.0.15+6 (17.0.15+6) (build 17.0.15+6)
Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.15+6 (17.0.15+6, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
Problematic frame:
C [lib_lightgbm.so+0x3a54db] LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0xf7b
Core dump will be written. Default location: /opt/spark/work-dir/core.6
An error report file with more information is saved as:
/crashlogs/hs_err_pid6.log
If you would like to submit a bug report, please visit:
https://github.com/adoptium/adoptium-support/issues
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.
root@k8s0:/home/user# client_loop: send disconnect: Connection reset by peer
```
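For what it's worth, the connection retries in the log back off geometrically (each wait is roughly 1.3x the previous one, 200 → 260 → … → 2113 ms), and NetworkInit eventually succeeded, so the network setup itself recovered; the SIGSEGV only happens later, at the first training iteration. A small stand-alone sketch (my own reconstruction, not LightGBM code) that reproduces the wait sequence from the log:

```python
def retry_waits(initial_ms: int = 200, factor: float = 1.3, retries: int = 10):
    """Reproduce the geometric connection-retry backoff seen in the log lines."""
    waits = [initial_ms]
    for _ in range(retries - 1):
        # Each wait is ~1.3x the previous one, truncated to whole milliseconds.
        waits.append(int(waits[-1] * factor))
    return waits

print(retry_waits())
# Matches the log: [200, 260, 338, 439, 570, 741, 963, 1251, 1626, 2113]
```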
The crash report hs_err_pid6.log is also attached.
Code to reproduce issue
```python
from typing import Dict, List

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from synapse.ml.lightgbm import LightGBMClassifier


def _apply_lgbm_params(lgbm: LightGBMClassifier, params: Dict[str, object]) -> LightGBMClassifier:
    setters = {
        "learning_rate": "setLearningRate",
        "num_leaves": "setNumLeaves",
        "n_estimators": "setNumIterations",
        "min_data_in_leaf": "setMinDataInLeaf",
        "feature_fraction": "setFeatureFraction",
        "feature_fraction_bynode": "setFeatureFractionByNode",
        "bagging_fraction": "setBaggingFraction",
        "bagging_freq": "setBaggingFreq",
        "lambda_l1": "setLambdaL1",
        "lambda_l2": "setLambdaL2",
        "max_cat_threshold": "setMaxCatThreshold",
        "cat_l2": "setCatL2",
        "cat_smooth": "setCatSmooth",
        "min_data_per_group": "setMinDataPerGroup",
        "max_depth": "setMaxDepth",
        "objective": "setObjective",
        "random_state": "setSeed",
    }
    for key, setter in setters.items():
        if key not in params:
            continue
        if hasattr(lgbm, setter):
            getattr(lgbm, setter)(params[key])
    return lgbm


def build_pipeline(
    numeric_cols: List[str],
    categorical_cols: List[str],
    num_tasks: int,
) -> Pipeline:
    indexers = [
        StringIndexer(
            inputCol=col,
            outputCol=f"{col}_idx",
            handleInvalid="keep",
            stringOrderType="alphabetAsc",
        )
        for col in categorical_cols
    ]
    indexed_categorical_cols = [f"{c}_idx" for c in categorical_cols]
    feature_cols = numeric_cols + indexed_categorical_cols
    categorical_slot_indexes = [feature_cols.index(c_name) for c_name in indexed_categorical_cols]
    assembler = VectorAssembler(
        inputCols=feature_cols,
        outputCol="features",
        handleInvalid="keep",
    )
    lgbm = (
        LightGBMClassifier(passThroughArgs="force_col_wise=true")
        .setLabelCol(LABEL_COL)
        .setFeaturesCol("features")
        .setPredictionCol("prediction")
        .setProbabilityCol("probability")
        .setRawPredictionCol("rawPrediction")
        .setFeaturesShapCol("shap_values")
        .setObjective("binary")
        .setLearningRate(0.05)
        .setNumLeaves(31)
        .setNumIterations(200)
        .setFeatureFraction(1.0)
        .setBaggingFraction(1.0)
        .setBaggingFreq(0)
        .setMinDataInLeaf(100)
        .setCategoricalSlotIndexes(categorical_slot_indexes)
        .setValidationIndicatorCol("is_validation")
        .setEarlyStoppingRound(100)
        .setMetric("binary_logloss")
        .setIsProvideTrainingMetric(True)
        .setNumThreads(1)
        .setNumTasks(num_tasks)  # number of executors
        .setMaxBin(15)
        .setBinSampleCount(50000)
        .setMatrixType("sparse")
        .setVerbosity(2)
        .setExecutionMode("bulk")
        .setUseSingleDatasetMode(True)
        .setTimeout(1200.0)
    )
    lgbm_params = _shrink_lgbm_params(_load_shaving_lgbm_params())
    lgbm = _apply_lgbm_params(lgbm, lgbm_params)
    return Pipeline(stages=indexers + [assembler, lgbm])
```
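For reference, the `_apply_lgbm_params` dispatch can be exercised without Spark by pointing it at any object exposing the same setter names; this stand-alone sketch (the dummy class and values are mine, not from the actual job) shows that known keys are applied and keys without a matching setter are silently skipped:

```python
class FakeClassifier:
    """Minimal stand-in exposing two of the setter names used by _apply_lgbm_params."""
    def __init__(self):
        self.applied = {}
    def setLearningRate(self, v):
        self.applied["learning_rate"] = v
    def setNumLeaves(self, v):
        self.applied["num_leaves"] = v

def apply_params(model, params, setters):
    # Same dispatch as _apply_lgbm_params: map snake_case keys to setter names,
    # skip params the model has no setter for.
    for key, setter in setters.items():
        if key in params and hasattr(model, setter):
            getattr(model, setter)(params[key])
    return model

setters = {"learning_rate": "setLearningRate", "num_leaves": "setNumLeaves", "cat_l2": "setCatL2"}
m = apply_params(FakeClassifier(), {"learning_rate": 0.05, "num_leaves": 31, "cat_l2": 10.0}, setters)
print(m.applied)  # cat_l2 is dropped: FakeClassifier has no setCatL2
```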
Other info / logs
No response
What component(s) does this bug affect?
What language(s) does this bug affect?
What integration(s) does this bug affect?