System Info
Optimum Habana version: v1.15.0
SynapseAI version: 1.19.0-2427ed8
Gaudi PyTorch container version: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:1.19.0-561
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
- Run the Gaudi PyTorch container:
docker run -it --runtime habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --entrypoint /bin/bash vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:1.19.0-561
- Install the required dependencies for the question answering examples directory (in https://github.com/huggingface/optimum-habana/tree/main/examples/question-answering).
- Run the example, but change the model to Llama 3.1 8B, increase the batch size and max sequence length:
python ../gaudi_spawn.py \
--world_size 8 --use_deepspeed run_qa.py \
--model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 512 \
--doc_stride 128 \
--output_dir /tmp/squad_output/ \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--throughput_warmup_steps 3 \
--max_train_samples 45080 \
--deepspeed ../../tests/configs/deepspeed_zero_2.json \
--sdp_on_bf16
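The referenced `deepspeed_zero_2.json` is not reproduced here; for context, a minimal DeepSpeed ZeRO stage-2 config of the usual shape (the values below are illustrative, not the actual contents of that file) looks like:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true
  }
}
```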
This throws an out-of-memory error on Gaudi 3 with 8 HPUs.
Expected behavior
This should run to completion, as an 8B model with a 512 sequence length and a batch size of >= 64 can run on other accelerators with similar per-device memory.
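To back up the expectation above, here is a rough per-device memory estimate for the model states in this run. The byte counts are the usual textbook figures for mixed-precision Adam training under ZeRO-2 (bf16 weights replicated, bf16 gradients and fp32 optimizer states partitioned); real usage also adds activations, fragmentation, and framework overhead, so this is a lower bound, not a measurement:

```python
# Rough per-device memory estimate for fine-tuning an 8B-parameter model
# with DeepSpeed ZeRO-2 across 8 devices. Assumed byte counts: 2 bytes/param
# for bf16 weights and gradients, 12 bytes/param for fp32 master weights
# plus Adam momentum and variance.
GB = 1024 ** 3
params = 8e9
devices = 8

weights = params * 2                    # bf16 weights, replicated under ZeRO-2
grads = params * 2 / devices            # bf16 gradients, partitioned
optim = params * (4 + 4 + 4) / devices  # fp32 master copy + Adam m and v, partitioned

total_gb = (weights + grads + optim) / GB
print(f"model states per device: ~{total_gb:.0f} GB")  # ~28 GB
```

Even with generous headroom for activations at batch size 16 and sequence length 512, this sits well below the per-device HBM of a Gaudi 3, which is why the OOM is unexpected.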