Can't run Llama 8B with higher sequence length and batch size for Multi GPU finetuning on Gaudi 3 #1687

@ajscalers

System Info

Optimum habana version: v1.15.0
Synapse AI version: 1.19.0-2427ed8
Gaudi pytorch container version: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:1.19.0-561

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Run the Gaudi pytorch container: docker run -it --runtime habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --entrypoint /bin/bash vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:1.19.0-561
  2. Install the required dependencies for the question answering examples directory (in https://github.com/huggingface/optimum-habana/tree/main/examples/question-answering).
  3. Run the example, but change the model to Llama 3.1 8B and increase the batch size and max sequence length:
python ../gaudi_spawn.py \
  --world_size 8 --use_deepspeed run_qa.py \
  --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
  --gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 8 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 512 \
  --doc_stride 128 \
  --output_dir /tmp/squad_output/ \
  --use_habana \
  --use_lazy_mode \
  --use_hpu_graphs_for_inference \
  --throughput_warmup_steps 3 \
  --max_train_samples 45080 \
  --deepspeed ../../tests/configs/deepspeed_zero_2.json \
  --sdp_on_bf16

This throws an out-of-memory error on Gaudi 3 with 8 devices (`--world_size 8`).

Expected behavior

This should run to completion: an 8B model with a sequence length of 512 and a batch size of 64 or more runs on other GPUs with similar per-GPU memory.
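
To put rough numbers on that expectation, here is a back-of-the-envelope sketch of the per-device model-state memory under ZeRO stage 2. It rests on assumptions that are not confirmed by the config file above: bf16 parameters and gradients (2 bytes each), Adam-style optimizer states at 12 bytes per parameter (fp32 master weights, momentum, variance), and a round 8e9 parameter count. The function name `zero2_model_state_bytes` is made up for this illustration. Activations, temporary buffers, and fragmentation are not included, and those grow with batch size and sequence length.

```python
def zero2_model_state_bytes(num_params, world_size,
                            param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Rough per-device model-state memory under ZeRO stage 2.

    Assumption: parameters are replicated on every device, while
    gradients and optimizer states are sharded across `world_size`.
    Activation memory is NOT included.
    """
    replicated = param_bytes * num_params
    sharded = (grad_bytes + optim_bytes) * num_params / world_size
    return replicated + sharded


num_params = 8e9        # assumed round figure for an "8B" model
world_size = 8          # matches --world_size 8 in the command above
per_device_batch = 16   # matches --per_device_train_batch_size 16

state_gb = zero2_model_state_bytes(num_params, world_size) / 1e9
print(f"global batch size: {per_device_batch * world_size}")
print(f"estimated model states per device: {state_gb:.1f} GB (excluding activations)")
```

If the estimate holds, model states alone should fit with room to spare, which would point at activation memory or graph workspace as the part to investigate.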

Labels: bug