SynapseML version
1.1.0
System information
- Language version: Scala 2.12
- Spark version: 3.5.6
- Spark platform: Synapse (3.5.6-scala2.12-java17-python3-ubuntu)
Describe the problem
Hi team. My LightGBM classifier training job keeps failing with an unexplained crash. After checking the full log of the YARN application, I experimented with different settings of executionMode, matrixType, useSingleDatasetMode, maxBin, and many others. I even tried using only a single categorical feature. The result is always the same: with executors=1 everything works, but with more than one executor, training fails.
```
26/01/16 20:12:17 INFO StreamingPartitionTask: Calling NetworkInit on local port 12401 with value 10.244.9.118:12403,10.244.12.46:12405,10.244.2.35:12401,10.244.5.206:12404,10.244.19.244:12402
[LightGBM] [Info] Trying to bind port 12401...
[LightGBM] [Info] Binding port 12401 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 260 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 338 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 439 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 570 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 741 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 963 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 1251 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 1626 milliseconds
[LightGBM] [Warning] Connecting to rank 3 failed, waiting for 2113 milliseconds
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 3
[LightGBM] [Info] Connected to rank 4
[LightGBM] [Info] Local rank: 2, total number of machines: 5
26/01/16 20:12:26 INFO StreamingPartitionTask: NetworkInit succeeded. LightGBM task listening on: 12401
26/01/16 20:12:26 INFO StreamingPartitionTask: Waiting for all data prep to be done, task 8351, partition 2
26/01/16 20:12:26 INFO StreamingPartitionTask: Getting final training Dataset for partition 2.
26/01/16 20:12:26 INFO StreamingPartitionTask: Getting final validation Dataset for partition 2.
26/01/16 20:12:26 INFO StreamingPartitionTask: Creating LightGBM Booster for partition 2, task 8351
[LightGBM] [Info] Number of positive: 239805, number of negative: 655618
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.263171
[LightGBM] [Info] Total Bins 19583
[LightGBM] [Info] Number of data points in the train set: 179083, number of used features: 53
26/01/16 20:12:27 INFO StreamingPartitionTask: Beginning training on LightGBM Booster for task 8351, partition 2
26/01/16 20:12:27 INFO StreamingPartitionTask: LightGBM task starting iteration 0
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.267812 -> initscore=-1.005752
[LightGBM] [Info] Start training from score -1.005752
A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x00007629255834db, pid=6, tid=42
JRE version: OpenJDK Runtime Environment Temurin-17.0.15+6 (17.0.15+6) (build 17.0.15+6)
Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.15+6 (17.0.15+6, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
Problematic frame:
C [lib_lightgbm.so+0x3a54db] LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0xf7b
Core dump will be written. Default location: /opt/spark/work-dir/core.6
An error report file with more information is saved as:
/crashlogs/hs_err_pid6.log
If you would like to submit a bug report, please visit:
https://github.com/adoptium/adoptium-support/issues
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.
root@k8s0:/home/user# client_loop: send disconnect: Connection reset by peer
```
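For what it's worth, the connection retries in the log back off geometrically (each wait is roughly 1.3x the previous one, 200 → 260 → … → 2113 ms), and NetworkInit eventually succeeded, so the network setup itself recovered; the SIGSEGV only happens later, at the first training iteration. A small stand-alone sketch (my own reconstruction, not LightGBM code) that reproduces the wait sequence from the log:

```python
def retry_waits(initial_ms: int = 200, factor: float = 1.3, retries: int = 10):
    """Reproduce the geometric connection-retry backoff seen in the log lines."""
    waits = [initial_ms]
    for _ in range(retries - 1):
        # Each wait is ~1.3x the previous one, truncated to whole milliseconds.
        waits.append(int(waits[-1] * factor))
    return waits

print(retry_waits())
# Matches the log: [200, 260, 338, 439, 570, 741, 963, 1251, 1626, 2113]
```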
The crash report hs_err_pid6.log is also attached.
Code to reproduce issue
```python
from typing import Dict, List

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from synapse.ml.lightgbm import LightGBMClassifier


def _apply_lgbm_params(lgbm: LightGBMClassifier, params: Dict[str, object]) -> LightGBMClassifier:
    setters = {
        "learning_rate": "setLearningRate",
        "num_leaves": "setNumLeaves",
        "n_estimators": "setNumIterations",
        "min_data_in_leaf": "setMinDataInLeaf",
        "feature_fraction": "setFeatureFraction",
        "feature_fraction_bynode": "setFeatureFractionByNode",
        "bagging_fraction": "setBaggingFraction",
        "bagging_freq": "setBaggingFreq",
        "lambda_l1": "setLambdaL1",
        "lambda_l2": "setLambdaL2",
        "max_cat_threshold": "setMaxCatThreshold",
        "cat_l2": "setCatL2",
        "cat_smooth": "setCatSmooth",
        "min_data_per_group": "setMinDataPerGroup",
        "max_depth": "setMaxDepth",
        "objective": "setObjective",
        "random_state": "setSeed",
    }
    for key, setter in setters.items():
        if key not in params:
            continue
        if hasattr(lgbm, setter):
            getattr(lgbm, setter)(params[key])
    return lgbm


def build_pipeline(
    numeric_cols: List[str],
    categorical_cols: List[str],
    num_tasks: int,
) -> Pipeline:
    indexers = [
        StringIndexer(
            inputCol=col,
            outputCol=f"{col}_idx",
            handleInvalid="keep",
            stringOrderType="alphabetAsc",
        )
        for col in categorical_cols
    ]
    indexed_categorical_cols = [f"{c}_idx" for c in categorical_cols]
    feature_cols = numeric_cols + indexed_categorical_cols
    categorical_slot_indexes = [feature_cols.index(c_name) for c_name in indexed_categorical_cols]
    assembler = VectorAssembler(
        inputCols=feature_cols,
        outputCol="features",
        handleInvalid="keep",
    )
    lgbm = (
        LightGBMClassifier(passThroughArgs="force_col_wise=true")
        .setLabelCol(LABEL_COL)
        .setFeaturesCol("features")
        .setPredictionCol("prediction")
        .setProbabilityCol("probability")
        .setRawPredictionCol("rawPrediction")
        .setFeaturesShapCol("shap_values")
        .setObjective("binary")
        .setLearningRate(0.05)
        .setNumLeaves(31)
        .setNumIterations(200)
        .setFeatureFraction(1.0)
        .setBaggingFraction(1.0)
        .setBaggingFreq(0)
        .setMinDataInLeaf(100)
        .setCategoricalSlotIndexes(categorical_slot_indexes)
        .setValidationIndicatorCol("is_validation")
        .setEarlyStoppingRound(100)
        .setMetric("binary_logloss")
        .setIsProvideTrainingMetric(True)
        .setNumThreads(1)
        .setNumTasks(num_tasks)  # number of executors
        .setMaxBin(15)
        .setBinSampleCount(50000)
        .setMatrixType("sparse")
        .setVerbosity(2)
        .setExecutionMode("bulk")
        .setUseSingleDatasetMode(True)
        .setTimeout(1200.0)
    )
    lgbm_params = _shrink_lgbm_params(_load_shaving_lgbm_params())
    lgbm = _apply_lgbm_params(lgbm, lgbm_params)
    return Pipeline(stages=indexers + [assembler, lgbm])
```
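For reference, the `_apply_lgbm_params` dispatch can be exercised without Spark by pointing it at any object exposing the same setter names; this stand-alone sketch (the dummy class and values are mine, not from the actual job) shows that known keys are applied and keys without a matching setter are silently skipped:

```python
class FakeClassifier:
    """Minimal stand-in exposing two of the setter names used by _apply_lgbm_params."""
    def __init__(self):
        self.applied = {}
    def setLearningRate(self, v):
        self.applied["learning_rate"] = v
    def setNumLeaves(self, v):
        self.applied["num_leaves"] = v

def apply_params(model, params, setters):
    # Same dispatch as _apply_lgbm_params: map snake_case keys to setter names,
    # skip params the model has no setter for.
    for key, setter in setters.items():
        if key in params and hasattr(model, setter):
            getattr(model, setter)(params[key])
    return model

setters = {"learning_rate": "setLearningRate", "num_leaves": "setNumLeaves", "cat_l2": "setCatL2"}
m = apply_params(FakeClassifier(), {"learning_rate": 0.05, "num_leaves": 31, "cat_l2": 10.0}, setters)
print(m.applied)  # cat_l2 is dropped: FakeClassifier has no setCatL2
```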
Other info / logs
No response
What component(s) does this bug affect?
What language(s) does this bug affect?
What integration(s) does this bug affect?