System Info
Optimum Habana version: v1.15.0
SynapseAI version: 1.19.0-2427ed8
Gaudi PyTorch container version: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:1.19.0-561
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
- Run the Gaudi PyTorch container:
docker run -it --runtime habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --entrypoint /bin/bash vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:1.19.0-561
- Install the required dependencies for the question answering examples directory (in https://github.com/huggingface/optimum-habana/tree/main/examples/question-answering).
- Run the example, but change the model to Llama 3.1 8B, increase the batch size and max sequence length:
python ../gaudi_spawn.py \
--world_size 8 --use_deepspeed run_qa.py \
--model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 512 \
--doc_stride 128 \
--output_dir /tmp/squad_output/ \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--throughput_warmup_steps 3 \
--max_train_samples 45080 \
--deepspeed ../../tests/configs/deepspeed_zero_2.json \
--sdp_on_bf16
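The referenced `deepspeed_zero_2.json` is not reproduced here; for context, a minimal DeepSpeed ZeRO stage-2 config of the usual shape (the values below are illustrative, not the actual contents of that file) looks like:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true
  }
}
```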
This throws an out-of-memory error on Gaudi 3 with 8 HPUs.
Expected behavior
This should run to completion, as an 8B model with a 512 sequence length and a batch size of >= 64 can run on other accelerators with similar per-device memory.
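To back up the expectation above, here is a rough per-device memory estimate for the model states in this run. The byte counts are the usual textbook figures for mixed-precision Adam training under ZeRO-2 (bf16 weights replicated, bf16 gradients and fp32 optimizer states partitioned); real usage also adds activations, fragmentation, and framework overhead, so this is a lower bound, not a measurement:

```python
# Rough per-device memory estimate for fine-tuning an 8B-parameter model
# with DeepSpeed ZeRO-2 across 8 devices. Assumed byte counts: 2 bytes/param
# for bf16 weights and gradients, 12 bytes/param for fp32 master weights
# plus Adam momentum and variance.
GB = 1024 ** 3
params = 8e9
devices = 8

weights = params * 2                    # bf16 weights, replicated under ZeRO-2
grads = params * 2 / devices            # bf16 gradients, partitioned
optim = params * (4 + 4 + 4) / devices  # fp32 master copy + Adam m and v, partitioned

total_gb = (weights + grads + optim) / GB
print(f"model states per device: ~{total_gb:.0f} GB")  # ~28 GB
```

Even with generous headroom for activations at batch size 16 and sequence length 512, this sits well below the per-device HBM of a Gaudi 3, which is why the OOM is unexpected.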