diff --git a/tutorials/building_dp_lora_for_llms.ipynb b/tutorials/building_dp_lora_for_llms.ipynb new file mode 100644 index 00000000..efd4c148 --- /dev/null +++ b/tutorials/building_dp_lora_for_llms.ipynb @@ -0,0 +1,642 @@ +{ + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.12.13", + "mimetype": "text/x-python", + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "pygments_lexer": "ipython3", + "nbconvert_exporter": "python", + "file_extension": ".py" + } + }, + "nbformat_minor": 5, + "nbformat": 4, + "cells": [ + { + "id": "030976a9-a267-4011-9350-b569517bba9e", + "cell_type": "markdown", + "source": "# Differentially Private LoRA Fine-Tuning of Causal Language Models\n\nWalk through how to fine-tune a GPT-style causal language model on a small generation task with differential privacy, using parameter-efficient LoRA adapters. We compare a non-private LoRA baseline against the DP-LoRA variant and report BLEU, perplexity, peak memory, and throughput for each.\n\nTracking: [meta-pytorch/opacus#827](https://github.com/meta-pytorch/opacus/issues/827)\n\n## What you'll get out of this notebook\n\n1. A working recipe for combining `opacus`, `peft`, and HuggingFace `transformers` on a causal LM\n2. The three device/training-mode ordering patterns that prevent silent corruption (see [#820](https://github.com/meta-pytorch/opacus/issues/820))\n3. A concrete sense of the utility cost of DP at a fixed privacy budget on this task\n4. Honest notes on what does and does not work yet for full-finetune DP on GPT-2 (out of scope for this tutorial, recipe documented at the end)\n\n## Prerequisites\n\n- Familiarity with the standard DP-SGD setup (`PrivacyEngine`, `make_private_with_epsilon`). 
The opacus [Building text classifier tutorial](https://github.com/meta-pytorch/opacus/blob/main/tutorials/building_text_classifier.ipynb) is a good warm-up.\n- Some prior exposure to LoRA. The [original paper](https://arxiv.org/abs/2106.09685) is short and worth reading.\n\n## Environment and runtime\n\n- Single GPU is sufficient (Kaggle T4 used here)\n- Roughly 15 minutes end-to-end on T4 at the settings below (most of it is the two 2000-step training runs)\n- Pinned versions: `opacus>=1.6.0`, `peft>=0.18,<0.19`, `transformers>=5.0`", + "metadata": {} + }, + { + "id": "aae09095-c9ac-47f1-ad5e-7cd7f82cc796", + "cell_type": "markdown", + "source": "## Why combine DP-SGD with LoRA?\n\nDP-SGD clips each per-sample gradient to a fixed L2 norm and adds calibrated Gaussian noise to their sum at each step. The noise is added to every trainable coordinate with a standard deviation set by the clipping threshold, so its total norm grows with the number of trainable parameters while the clipped signal does not. Training fewer parameters therefore improves the signal-to-noise ratio of each update. LoRA is a natural fit: it constrains updates to a small low-rank subspace alongside the frozen base weights. In our experiments below, LoRA trains roughly **0.47%** of the GPT-2-small parameters and still reaches a respectable BLEU on the E2E NLG benchmark, both with and without DP.\n\nThe combination also lines up cleanly with how practitioners deploy DP today. Sensitive training data warrants a real privacy guarantee; production training budgets warrant parameter-efficient methods. DP-LoRA is the intersection.\n\n## The task: E2E NLG\n\n[E2E NLG](https://arxiv.org/abs/1706.09254) is a small structured-data-to-text generation task originally from the 2017 E2E challenge. Inputs are slot-value meaning representations such as `name[The Vaults], eatType[pub], priceRange[more than \u00a330]`; outputs are short natural-language descriptions. 
The dataset is small enough to fine-tune in minutes on a single GPU but rich enough that BLEU separates a trained model from chance.\n\nIt is also the standard benchmark used in the [DiSK paper](https://arxiv.org/abs/2410.03883) and adjacent DP-NLP work, which makes results here comparable to the published literature.", + "metadata": {} + }, + { + "id": "f32db472-05e3-4d05-85c8-04a435f38d7a", + "cell_type": "markdown", + "source": "## 1. Install pinned versions\n\n`peft>=0.18` requires `transformers>=5.0`. The `opacus>=1.6.0` pin is for the version we developed against; older opacus may also work but the integration patterns below assume 1.6+.", + "metadata": {} + }, + { + "id": "b867e9f8-4a0a-45e2-92d4-833d10ade49a", + "cell_type": "code", + "source": "!pip install --quiet \\\n 'opacus>=1.6.0' \\\n 'peft>=0.18,<0.19' \\\n 'transformers>=5.0' \\\n 'accelerate' \\\n 'datasets' \\\n 'evaluate' \\\n 'sacrebleu'\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:03:25.369909Z", + "iopub.execute_input": "2026-06-25T05:03:25.370213Z", + "iopub.status.idle": "2026-06-25T05:03:32.694121Z", + "shell.execute_reply.started": "2026-06-25T05:03:25.370152Z", + "shell.execute_reply": "2026-06-25T05:03:32.693437Z" + } + }, + "outputs": [ + { + "name": "stdout", + "text": "\u001b[2K \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m308.9/308.9 kB\u001b[0m \u001b[31m8.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m\n\u001b[2K \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m84.1/84.1 
kB\u001b[0m \u001b[31m5.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[2K \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m100.8/100.8 kB\u001b[0m \u001b[31m6.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[2K \u001b[90m\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u001b[0m \u001b[32m12.2/12.2 MB\u001b[0m \u001b[31m94.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\ndask-cuda 26.2.0 requires cuda-core==0.3.*, but you have cuda-core 1.0.1 which is incompatible.\ndask-cuda 26.2.0 requires numba-cuda<0.23.0,>=0.22.1, but you have numba-cuda 0.30.2 which is incompatible.\ndistributed-ucxx-cu12 0.48.0 requires numba-cuda[cu12]<0.23.0,>=0.22.1, but you have numba-cuda 0.30.2 which is incompatible.\ncuml-cu12 26.2.0 requires numba<0.62.0,>=0.60.0, but you have numba 0.65.1 which is incompatible.\ncuml-cu12 26.2.0 requires numba-cuda[cu12]<0.23.0,>=0.22.1, but you have numba-cuda 0.30.2 which is incompatible.\nucxx-cu12 0.48.0 requires numba-cuda[cu12]<0.23.0,>=0.22.1, but you have numba-cuda 0.30.2 which is incompatible.\ncudf-cu12 26.2.1 requires numba<0.62.0,>=0.60.0, but you have numba 0.65.1 which is incompatible.\ncudf-cu12 26.2.1 requires numba-cuda[cu12]<0.23.0,>=0.22.2, but you have numba-cuda 0.30.2 which is incompatible.\u001b[0m\u001b[31m\n\u001b[0m", + "output_type": "stream" + } + ], + "execution_count": 1 + }, + { 
+ "id": "53811f9b-03b6-4182-9b60-eda22cf16c3e", + "cell_type": "markdown", + "source": "## 2. Sanity-check the environment\n\nIf CUDA reports unavailable, enable a GPU accelerator before proceeding. The assertions guard against pip silently resolving to incompatible versions.", + "metadata": {} + }, + { + "id": "24d0dc48-ca4a-4a08-9f0f-04489678b2b0", + "cell_type": "code", + "source": "import torch\nimport opacus\nimport peft\nimport transformers\nimport datasets\n\nprint(f'torch = {torch.__version__} (CUDA: {torch.cuda.is_available()})')\nprint(f'opacus = {opacus.__version__}')\nprint(f'peft = {peft.__version__}')\nprint(f'transformers = {transformers.__version__}')\nprint(f'datasets = {datasets.__version__}')\n\nif torch.cuda.is_available():\n print(f'device = {torch.cuda.get_device_name(0)}')\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:03:32.696070Z", + "iopub.execute_input": "2026-06-25T05:03:32.696571Z", + "iopub.status.idle": "2026-06-25T05:03:58.410666Z", + "shell.execute_reply.started": "2026-06-25T05:03:32.696539Z", + "shell.execute_reply": "2026-06-25T05:03:58.409866Z" + } + }, + "outputs": [ + { + "name": "stdout", + "text": "torch = 2.10.0+cu128 (CUDA: True)\nopacus = 1.6.0\npeft = 0.18.1\ntransformers = 5.0.0\ndatasets = 4.8.5\ndevice = Tesla T4\n", + "output_type": "stream" + } + ], + "execution_count": 2 + }, + { + "id": "f5efd8b6-da80-47ec-83f5-69ba1737ef6b", + "cell_type": "markdown", + "source": "## 3. 
Imports and device\n\nStandard imports plus a seed for partial reproducibility (note: data-loader shuffling and DP noise are still stochastic across runs).", + "metadata": {} + }, + { + "id": "61c13326-77d0-412c-99b3-4299a37861fe", + "cell_type": "code", + "source": "import math\nimport time\nfrom dataclasses import dataclass, field\nfrom typing import Optional\n\nimport torch.nn as nn\nimport torch.optim as optim\nfrom torch.utils.data import DataLoader, Dataset\n\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom peft import LoraConfig, get_peft_model, TaskType\nfrom opacus import PrivacyEngine\nfrom datasets import load_dataset\nimport evaluate\n\ntorch.manual_seed(42)\ndevice = 'cuda' if torch.cuda.is_available() else 'cpu'\nprint(f'Using device: {device}')\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:03:58.411491Z", + "iopub.execute_input": "2026-06-25T05:03:58.412418Z", + "iopub.status.idle": "2026-06-25T05:03:58.797034Z", + "shell.execute_reply.started": "2026-06-25T05:03:58.412388Z", + "shell.execute_reply": "2026-06-25T05:03:58.796360Z" + } + }, + "outputs": [ + { + "name": "stdout", + "text": "Using device: cuda\n", + "output_type": "stream" + } + ], + "execution_count": 3 + }, + { + "id": "41906fc2-5089-4db4-929b-0d012da5782e", + "cell_type": "markdown", + "source": "## 4. Load E2E NLG\n\nThe HuggingFace Hub mirrors for E2E NLG (`tuetschek/e2e_nlg`, `GEM/e2e_nlg`) both ship Python loading scripts, which `datasets>=3.0` no longer supports. The dataset is just two CSVs in the upstream repo, so we load them directly with `pandas` and wrap as a `DatasetDict`. Schema after rename: `meaning_representation` (input), `target` (single human reference).\n\nThe dev split contains multiple references per MR represented as multiple rows; we treat each row as an independent training example here. 
A higher-fidelity BLEU evaluation would aggregate references per MR; see the follow-up section at the end.", + "metadata": {} + }, + { + "id": "4872760f-fba5-4ddb-9a38-0cf25ea57fcd", + "cell_type": "code", + "source": "import pandas as pd\nfrom datasets import Dataset, DatasetDict\n\nE2E_TRAIN_URL = 'https://raw.githubusercontent.com/tuetschek/e2e-dataset/master/trainset.csv'\nE2E_DEV_URL = 'https://raw.githubusercontent.com/tuetschek/e2e-dataset/master/devset.csv'\n\ndf_train = pd.read_csv(E2E_TRAIN_URL).rename(columns={'mr': 'meaning_representation', 'ref': 'target'})\ndf_val = pd.read_csv(E2E_DEV_URL).rename(columns={'mr': 'meaning_representation', 'ref': 'target'})\n\nds = DatasetDict({\n 'train': Dataset.from_pandas(df_train),\n 'validation': Dataset.from_pandas(df_val),\n})\nprint(ds)\nprint()\nprint('--- Sample (train) ---')\nprint('MR: ', ds['train'][0]['meaning_representation'])\nprint('Target:', ds['train'][0]['target'])\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:03:58.797877Z", + "iopub.execute_input": "2026-06-25T05:03:58.798266Z", + "iopub.status.idle": "2026-06-25T05:03:59.974754Z", + "shell.execute_reply.started": "2026-06-25T05:03:58.798217Z", + "shell.execute_reply": "2026-06-25T05:03:59.973944Z" + } + }, + "outputs": [ + { + "name": "stdout", + "text": "DatasetDict({\n train: Dataset({\n features: ['meaning_representation', 'target'],\n num_rows: 42061\n })\n validation: Dataset({\n features: ['meaning_representation', 'target'],\n num_rows: 4672\n })\n})\n\n--- Sample (train) ---\nMR: name[The Vaults], eatType[pub], priceRange[more than \u00a330], customer rating[5 out of 5], near[Caf\u00e9 Adriatic]\nTarget: The Vaults pub near Caf\u00e9 Adriatic has a 5 star rating. 
Prices start at \u00a330.\n", + "output_type": "stream" + } + ], + "execution_count": 4 + }, + { + "id": "e6b0ef2d-8768-48d4-8135-0071a87b2b2f", + "cell_type": "markdown", + "source": "### Subsample validation for fast eval\n\nUse the full train split (about 42K rows). Subsample the validation split to 200 rows so the BLEU sweep across configurations stays under a couple of minutes.", + "metadata": {} + }, + { + "id": "b609a723-4a89-4879-b0f6-70e9c633a16f", + "cell_type": "code", + "source": "# Pass 2b: use the full E2E NLG train set (~42K) for production. Validation\n# stays subsampled (200) to keep BLEU eval fast \u2014 full eval is a Phase 3 polish.\nVAL_EVAL_SIZE = 200\n\nds_train_small = ds['train'] # full train\nds_val_small = ds['validation'].shuffle(seed=42).select(range(VAL_EVAL_SIZE))\nprint(f'Production train: {len(ds_train_small)}')\nprint(f'Production val: {len(ds_val_small)}')\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:03:59.975870Z", + "iopub.execute_input": "2026-06-25T05:03:59.976137Z", + "iopub.status.idle": "2026-06-25T05:03:59.999053Z", + "shell.execute_reply.started": "2026-06-25T05:03:59.976115Z", + "shell.execute_reply": "2026-06-25T05:03:59.998199Z" + } + }, + "outputs": [ + { + "name": "stdout", + "text": "Production train: 42061\nProduction val: 200\n", + "output_type": "stream" + } + ], + "execution_count": 5 + }, + { + "id": "e1826ccf-ff09-4068-8f59-f263d0276f44", + "cell_type": "markdown", + "source": "## 5. Tokenize for causal-LM training\n\nFormat each row as `'{MR} -> {target}'` and tokenize with GPT-2's tokenizer to a fixed length. We set `labels = input_ids` (predict every position, including the prompt portion). This trains the model slightly differently from a more typical \"loss only on the target\" setup, but it sidesteps an interaction between `-100`-masked labels, padding, and opacus's per-sample-gradient tracking. 
See the safety-patterns section below for the full rationale.", + "metadata": {} + }, + { + "id": "752764a9-5d34-481e-806c-dd64cdf749b1", + "cell_type": "code", + "source": "MODEL_NAME = 'gpt2'\nMAX_SEQ_LEN = 128\n\ntokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)\ntokenizer.pad_token = tokenizer.eos_token # GPT-2 has no pad token by default\n\nPROMPT_TEMPLATE = '{mr} ->'\n\ndef tokenize_example(example):\n \"\"\"Format as 'MR -> target', tokenize, pad to fixed length.\n\n NOTE: labels = input_ids (no -100 prompt masking). The model trains on the\n entire sequence, not just the target portion. This is slightly less\n sample-efficient than target-only training, but it sidesteps an opacus\n per-sample-gradient shape-mismatch issue triggered by -100 + padding + LoRA.\n Phase 2 can revisit this with a custom collator if target-only training\n significantly improves BLEU.\n \"\"\"\n prompt = PROMPT_TEMPLATE.format(mr=example['meaning_representation'])\n target = ' ' + example['target'] + tokenizer.eos_token\n full_text = prompt + target\n\n enc = tokenizer(\n full_text,\n max_length=MAX_SEQ_LEN,\n padding='max_length',\n truncation=True,\n )\n return {\n 'input_ids': enc['input_ids'],\n 'attention_mask': enc['attention_mask'],\n 'labels': enc['input_ids'],\n }\n\nds_train_tok = ds_train_small.map(tokenize_example, remove_columns=ds_train_small.column_names)\nds_val_tok = ds_val_small.map(tokenize_example, remove_columns=ds_val_small.column_names)\nds_train_tok.set_format('torch')\nds_val_tok.set_format('torch')\nprint(f'Tokenized train: {ds_train_tok}')\nprint(f'Tokenized val: {ds_val_tok}')\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:03:59.999990Z", + "iopub.execute_input": "2026-06-25T05:04:00.000283Z", + "iopub.status.idle": "2026-06-25T05:04:15.837954Z", + "shell.execute_reply.started": "2026-06-25T05:04:00.000259Z", + "shell.execute_reply": "2026-06-25T05:04:15.837202Z" + } + }, + "outputs": [ + { + "name": 
"stderr", + "text": "Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n06/25/2026 05:04:00:WARNING:Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n", + "output_type": "stream" + }, + { + "output_type": "display_data", + "data": { + "text/plain": "config.json: 0%| | 0.00/665 [00:00=0.18`), accelerate-style lazy device handling can leave parts of the model on CPU when opacus walks `add_hooks()`. The symptoms are subtle: training loss looks reasonable, the privacy accountant ticks, but LoRA weights never update. We confirmed this empirically across three independent setups (CPU bisect across peft 0.13.2 \u2192 0.18.1, Kaggle T4, RTX 5090) in [opacus#820](https://github.com/meta-pytorch/opacus/issues/820). The safe order is:\n\n```python\nmodel = AutoModelForCausalLM.from_pretrained(MODEL_NAME)\nmodel = model.to(device) # \u2190 move base model to CUDA FIRST\nmodel = get_peft_model(model, config) # then apply LoRA (LoRA params land on CUDA too)\n# then PrivacyEngine.make_private(...)\n```\n\n### Pattern B: `model.train()` before `make_private_with_epsilon()`\n\nopacus's `ModuleValidator` (1.6.0+) raises `IllegalModuleConfigurationError(\"Model needs to be in training mode\")` if the model is not in train mode when `make_private_with_epsilon` runs. `get_peft_model()` puts the model in eval mode by default, so an explicit `model.train()` call between PEFT wrapping and opacus wrapping is required.\n\n### Pattern C: `poisson_sampling=False`\n\nopacus's default Poisson sampling occasionally produces an empty batch (probability around `e^(-batch_size)` per step). GPT-2's forward pass calls `attention_mask.view(batch_size, -1)`, which fails for a zero-element tensor because the `-1` dimension is ambiguous. 
Setting `poisson_sampling=False` keeps the shuffled, fixed-size batches of the original `DataLoader`, which gives deterministic batch sizes and no empty-batch edge case. One caveat: opacus's accountants assume Poisson subsampling, as in the original DP-SGD analysis ([Abadi 2016](https://arxiv.org/abs/1607.00133)), so with fixed-size batches the reported \u03b5 is an approximation rather than an exact guarantee.", + "metadata": {} + }, + { + "id": "1480727a-4654-4d56-8f97-dca30f314488", + "cell_type": "markdown", + "source": "## 7. Run configuration\n\nEncapsulate the per-run hyperparameters in a small dataclass so the same training loop can drive all configurations.", + "metadata": {} + }, + { + "id": "34467136-6ac4-4ab5-be13-9e664a81990c", + "cell_type": "code", + "source": "@dataclass\nclass RunConfig:\n    name: str\n    lora: bool\n    dp: bool\n    lr: float = 1e-4\n    batch_size: int = 8\n    max_steps: int = 50  # SCAFFOLD; production will be much larger\n    target_epsilon: float = 8.0\n    target_delta: float = 1e-5\n    max_grad_norm: float = 1.0\n    lora_r: int = 16\n    lora_alpha: int = 32\n\n    @property\n    def is_full(self) -> bool:\n        return not self.lora\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:04:15.839919Z", + "iopub.execute_input": "2026-06-25T05:04:15.840563Z", + "iopub.status.idle": "2026-06-25T05:04:15.846749Z", + "shell.execute_reply.started": "2026-06-25T05:04:15.840539Z", + "shell.execute_reply": "2026-06-25T05:04:15.846129Z" + } + }, + "outputs": [], + "execution_count": 7 + }, + { + "id": "89710869-b183-4f08-98e7-5c990a8269b7", + "cell_type": "markdown", + "source": "## 8. Build the model for a given configuration\n\nSingle entry point that handles base-model loading, optional LoRA wrapping, and the Pattern A ordering. 
The `cfg.dp and not cfg.lora` branch applies `ModuleValidator.fix()` to swap GPT-2's `transformers.Conv1D` modules for `nn.Linear`; this is required when the DP-full path is enabled (which we do not enable in this tutorial \u2014 see the follow-up section).", + "metadata": {} + }, + { + "id": "92e8c600-f245-4869-a6d6-60130f067f86", + "cell_type": "code", + "source": "def build_model_for_config(cfg: RunConfig):\n    try:\n        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, dtype=torch.float32)\n    except TypeError:\n        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)\n\n    # DP-full path: swap GPT-2's transformers.Conv1D modules with nn.Linear so opacus's\n    # per-sample-gradient hooks attach correctly. opacus has a registered fix for Conv1D.\n    # No-op for the LoRA path (LoRA wraps Conv1D with its own A/B Linear modules, which\n    # opacus already handles).\n    if cfg.dp and not cfg.lora:\n        from opacus.validators import ModuleValidator\n        model = ModuleValidator.fix(model)\n        print(f'[{cfg.name}] Applied ModuleValidator.fix() (Conv1D -> nn.Linear)')\n\n    model = model.to(device)  # <-- BEFORE PEFT, per #820\n\n    if cfg.lora:\n        lora_config = LoraConfig(\n            task_type=TaskType.CAUSAL_LM,\n            inference_mode=False,\n            r=cfg.lora_r,\n            lora_alpha=cfg.lora_alpha,\n            lora_dropout=0.0,\n            target_modules=['c_attn'],\n        )\n        model = get_peft_model(model, lora_config)\n\n    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n    total = sum(p.numel() for p in model.parameters())\n    print(f'[{cfg.name}] trainable={trainable:,} ({100*trainable/total:.2f}% of {total:,})')\n    return model\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:04:15.847644Z", + "iopub.execute_input": "2026-06-25T05:04:15.847825Z", + "iopub.status.idle": "2026-06-25T05:04:15.868813Z", + "shell.execute_reply.started": "2026-06-25T05:04:15.847807Z", + "shell.execute_reply": "2026-06-25T05:04:15.867984Z" + } + }, + "outputs": [], + 
"execution_count": 8 + }, + { + "id": "2851ac76-8873-4d92-a60f-2df2d44bb26c", + "cell_type": "markdown", + "source": "## 9. The shared training function\n\nSame loop for all configurations. The `cfg.dp` block adds the `PrivacyEngine`, applies the three safety patterns, and uses `poisson_sampling=False` per Pattern C. Memory and throughput are tracked across the run; the final epsilon is read from the accountant at the end. The per-step skip on empty batches is a defensive guard that should never actually fire with `poisson_sampling=False`.", + "metadata": {} + }, + { + "id": "bc61f230-31f6-478a-a6db-2d71944c2d7a", + "cell_type": "code", + "source": "def train_one_run(cfg: RunConfig, train_ds, val_ds):\n print(f'\\n=== Training: {cfg.name} ===')\n model = build_model_for_config(cfg)\n model.train() # required: opacus validator (>=1.6.0) checks model.training before make_private_with_epsilon\n\n optimizer = optim.AdamW(\n [p for p in model.parameters() if p.requires_grad],\n lr=cfg.lr,\n )\n train_loader = DataLoader(train_ds, batch_size=cfg.batch_size, shuffle=True)\n\n privacy_engine = None\n if cfg.dp:\n privacy_engine = PrivacyEngine(accountant='rdp')\n grad_sample_mode = 'functorch' if not cfg.lora else 'hooks'\n print(f'[{cfg.name}] grad_sample_mode={grad_sample_mode}')\n # poisson_sampling=False uses uniform-without-replacement (deterministic batch sizes).\n # Default Poisson sampling can produce 0-sample batches that GPT-2's forward pass\n # cannot handle (reshape of (0, -1) is ambiguous). 
Uniform sampling is a documented\n # opacus mode and a standard DP-SGD variant \u2014 see opacus.PrivacyEngine docs.\n model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(\n module=model,\n optimizer=optimizer,\n data_loader=train_loader,\n target_epsilon=cfg.target_epsilon,\n target_delta=cfg.target_delta,\n epochs=max(1, cfg.max_steps // (len(train_ds) // cfg.batch_size + 1)),\n max_grad_norm=cfg.max_grad_norm,\n grad_sample_mode=grad_sample_mode,\n poisson_sampling=False,\n )\n print(f'[{cfg.name}] noise_multiplier={optimizer.noise_multiplier:.4f}')\n\n if torch.cuda.is_available():\n torch.cuda.reset_peak_memory_stats()\n t_start = time.perf_counter()\n tokens_processed = 0\n\n model.train()\n step = 0\n loss_history = []\n while step < cfg.max_steps:\n for batch in train_loader:\n if step >= cfg.max_steps:\n break\n # Defensive: skip any empty batches (shouldn't happen with poisson_sampling=False,\n # but harmless guard)\n if batch['input_ids'].numel() == 0:\n continue\n batch = {k: v.to(device) for k, v in batch.items()}\n outputs = model(**batch)\n loss = outputs.loss\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n loss_history.append(loss.item())\n tokens_processed += batch['input_ids'].numel()\n if step % 100 == 0:\n print(f' step {step:4d} loss={loss.item():.4f}')\n step += 1\n\n t_total = time.perf_counter() - t_start\n peak_mem_gb = (torch.cuda.max_memory_allocated() / 1024**3) if torch.cuda.is_available() else 0.0\n final_epsilon = privacy_engine.get_epsilon(delta=cfg.target_delta) if cfg.dp else float('inf')\n\n return {\n 'config': cfg,\n 'model': model,\n 'mean_loss': sum(loss_history) / len(loss_history),\n 'final_loss': loss_history[-1],\n 'tokens_per_sec': tokens_processed / t_total,\n 'peak_mem_gb': peak_mem_gb,\n 'wall_clock_sec': t_total,\n 'epsilon': final_epsilon,\n 'steps_completed': step,\n }\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:04:15.869872Z", + 
"iopub.execute_input": "2026-06-25T05:04:15.870222Z", + "iopub.status.idle": "2026-06-25T05:04:15.884098Z", + "shell.execute_reply.started": "2026-06-25T05:04:15.870157Z", + "shell.execute_reply": "2026-06-25T05:04:15.883346Z" + } + }, + "outputs": [], + "execution_count": 9 + }, + { + "id": "d7b42a57-a1a3-4cc5-bb3f-92844d1f4cc3", + "cell_type": "markdown", + "source": "## 10. Evaluation: BLEU plus perplexity\n\nBLEU is computed by greedy generation from the prompt portion of each validation example (decoded against the held-out target). Perplexity uses the standard `exp(mean(eval_loss))` formulation. Both are reported per configuration.", + "metadata": {} + }, + { + "id": "233725f1-1e7e-4fdf-8f90-e6d466ca4380", + "cell_type": "code", + "source": "bleu_metric = evaluate.load('sacrebleu')\n\n# Re-tokenize to recover prompt boundary at eval time (since labels no longer carry it)\ndef _prompt_token_len(mr_text):\n prompt = PROMPT_TEMPLATE.format(mr=mr_text)\n return len(tokenizer(prompt, add_special_tokens=False)['input_ids'])\n\ndef evaluate_run(model, val_ds_raw, val_ds_tok, max_eval_examples=50, max_new_tokens=40):\n \"\"\"Compute BLEU on generated text + perplexity on val loss.\n val_ds_raw provides the raw MR + target for prompt boundary lookup.\n \"\"\"\n gen_model = model._module if hasattr(model, '_module') else model\n gen_model.eval()\n\n preds, refs = [], []\n val_losses = []\n\n n = min(max_eval_examples, len(val_ds_tok))\n with torch.no_grad():\n for i in range(n):\n ex_tok = val_ds_tok[i]\n ex_raw = val_ds_raw[i]\n input_ids = ex_tok['input_ids'].unsqueeze(0).to(device)\n attention_mask = ex_tok['attention_mask'].unsqueeze(0).to(device)\n labels = ex_tok['labels'].unsqueeze(0).to(device)\n\n # Perplexity\n out = gen_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)\n val_losses.append(out.loss.item())\n\n # Generate from the prompt portion only\n prompt_len = _prompt_token_len(ex_raw['meaning_representation'])\n prompt_ids = 
input_ids[0, :prompt_len].unsqueeze(0)\n gen = gen_model.generate(\n prompt_ids,\n max_new_tokens=max_new_tokens,\n do_sample=False,\n pad_token_id=tokenizer.pad_token_id,\n )\n pred = tokenizer.decode(gen[0, prompt_len:], skip_special_tokens=True).strip()\n ref = ex_raw['target']\n preds.append(pred)\n refs.append([ref])\n\n bleu = bleu_metric.compute(predictions=preds, references=refs)['score']\n perplexity = math.exp(sum(val_losses) / len(val_losses))\n return {'bleu': bleu, 'perplexity': perplexity, 'n_eval': len(preds)}\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:04:15.885621Z", + "iopub.execute_input": "2026-06-25T05:04:15.886388Z", + "iopub.status.idle": "2026-06-25T05:04:16.429782Z", + "shell.execute_reply.started": "2026-06-25T05:04:15.886353Z", + "shell.execute_reply": "2026-06-25T05:04:16.429204Z" + } + }, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": "Downloading builder script: 0.00B [00:00, ?B/s]", + "application/vnd.jupyter.widget-view+json": { + "version_major": 2, + "version_minor": 0, + "model_id": "df2941a5bfdc4067bfd0d920422c9936" + } + }, + "metadata": {} + } + ], + "execution_count": 10 + }, + { + "id": "74cf7106-b186-49c5-97e1-616ffa2d38a7", + "cell_type": "markdown", + "source": "## 11. Configurations to compare\n\nTwo configurations, both LoRA-based, identical in every respect except whether the `PrivacyEngine` is attached. The `lr=5e-4` for the DP variant is intentionally higher than the non-DP `1e-4`: DP-SGD's per-step update is dampened by gradient clipping and Gaussian noise, so a higher learning rate is needed to make comparable progress. 
Empirically, `lr=1e-4` produced BLEU near zero for the DP variant in our early runs.\n\nDP-full (no LoRA) is intentionally left out of this tutorial; see the follow-up section for what is needed to enable it.", + "metadata": {} + }, + { + "id": "354e8b90-6d1f-4692-a1f6-119fe244d531", + "cell_type": "code", + "source": "# Pass 2b retry: DP-LoRA lr bumped 1e-4 -> 5e-4. The 1e-4 setting (same as non-DP)\n# produced BLEU \u2248 0.06 despite the model reaching PPL \u2248 6.66 \u2014 classic DP-SGD\n# symptom that the effective per-step update is too small under noise+clipping.\n# Higher lr compensates. If 5e-4 still doesn't break BLEU, try 1e-3 or a small sweep.\nCONFIGS = [\n RunConfig(name='non-DP LoRA', lora=True, dp=False, lr=1e-4, batch_size=8, max_steps=2000),\n RunConfig(name='DP LoRA', lora=True, dp=True, lr=5e-4, batch_size=8, max_steps=2000, target_epsilon=8.0),\n # RunConfig(name='DP full', lora=False, dp=True, ...), # deferred \u2014 see note in #11\n]\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:04:16.430602Z", + "iopub.execute_input": "2026-06-25T05:04:16.431078Z", + "iopub.status.idle": "2026-06-25T05:04:16.435597Z", + "shell.execute_reply.started": "2026-06-25T05:04:16.431042Z", + "shell.execute_reply": "2026-06-25T05:04:16.434679Z" + } + }, + "outputs": [], + "execution_count": 11 + }, + { + "id": "deb8125b-846d-444e-9f7c-58562c337fc0", + "cell_type": "markdown", + "source": "## 12. Run both configurations\n\nSequential. 
Memory is freed between runs to avoid cross-config peak memory accounting confusion.", + "metadata": {} + }, + { + "id": "801025a3-c181-4014-81aa-05b7f5f6485c", + "cell_type": "code", + "source": "results = []\nfor cfg in CONFIGS:\n    train_result = train_one_run(cfg, ds_train_tok, ds_val_tok)\n    # Pop the model so `results` holds no reference to it and it can actually be freed.\n    model = train_result.pop('model')\n    eval_result = evaluate_run(model, ds_val_small, ds_val_tok)\n    results.append({**train_result, **eval_result})\n    print(f'[{cfg.name}] BLEU={eval_result[\"bleu\"]:.2f} PPL={eval_result[\"perplexity\"]:.2f}')\n    del model\n    if torch.cuda.is_available():\n        torch.cuda.empty_cache()\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:04:16.437266Z", + "iopub.execute_input": "2026-06-25T05:04:16.437476Z", + "iopub.status.idle": "2026-06-25T05:19:36.890955Z", + "shell.execute_reply.started": "2026-06-25T05:04:16.437455Z", + "shell.execute_reply": "2026-06-25T05:19:36.890080Z" + } + }, + "outputs": [ + { + "name": "stdout", + "text": "\n=== Training: non-DP LoRA ===\n", + "output_type": "stream" + }, + { + "output_type": "display_data", + "data": { + "text/plain": "model.safetensors: 0%| | 0.00/548M [00:006} | {\"PPL\":>7} | {\"tok/s\":>8} | {\"peak GB\":>7} | {\"sec\":>6} | {\"\u03b5\":>6}')\nprint('-' * 75)\nfor r in results:\n    print(f'{r[\"config\"].name:<14} | {r[\"bleu\"]:>6.2f} | {r[\"perplexity\"]:>7.2f} | '\n          f'{r[\"tokens_per_sec\"]:>8.0f} | {r[\"peak_mem_gb\"]:>7.2f} | '\n          f'{r[\"wall_clock_sec\"]:>6.1f} | {fmt_eps(r[\"epsilon\"]):>6}')\n", + "metadata": { + "trusted": true, + "execution": { + "iopub.status.busy": "2026-06-25T05:19:36.892096Z", + "iopub.execute_input": "2026-06-25T05:19:36.892454Z", + "iopub.status.idle": "2026-06-25T05:19:36.898199Z", + "shell.execute_reply.started": "2026-06-25T05:19:36.892412Z", + "shell.execute_reply": "2026-06-25T05:19:36.897527Z" + } + }, + "outputs": [ + { + "name": "stdout", + "text": "Config | BLEU | PPL | tok/s | peak GB | sec | 
\u03b5\n---------------------------------------------------------------------------\nnon-DP LoRA | 24.72 | 1.60 | 4679 | 2.17 | 437.7 | no DP\nDP LoRA | 18.24 | 2.71 | 4599 | 2.65 | 445.3 | 7.08\n", + "output_type": "stream" + } + ], + "execution_count": 13 + }, + { + "id": "5cb5ce93-ea84-4616-b7de-e9d91e1cdf70", + "cell_type": "markdown", + "source": [ + "## 14. Reading the results\n", + "\n", + "Sample run on Kaggle T4 (your numbers will vary slightly with random seeds, hardware, and library minor versions):\n", + "\n", + "| Config | BLEU | PPL | tok/s | Peak GB | Time | \u03b5 |\n", + "|---|---|---|---|---|---|---|\n", + "| non-DP LoRA | 24.72 | 1.60 | 4679 | 2.17 | 438 s | no DP |\n", + "| DP LoRA | 18.24 | 2.71 | 4599 | 2.65 | 445 s | 7.08 |\n", + "\n", + "The non-DP LoRA baseline scores BLEU 24.72 on E2E NLG after 2000 steps while training only 589K of GPT-2-small's 125M parameters (about 0.47%). That is the parameter-efficiency story that motivates LoRA in the first place: most of the useful signal can be captured in a small low-rank perturbation, and the rest of the network does not need to move.\n", + "\n", + "DP costs roughly 26% relative BLEU at \u03b5 \u2248 7 here, going from 24.72 down to 18.24. Perplexity moves from 1.60 to 2.71. The model still clearly learns the conditional generation task; the noise from DP-SGD shifts utility downward but does not collapse it. The 26% figure is in the ballpark of what comparable papers report at similar privacy budgets on this task.\n", + "\n", + "Memory overhead from opacus is modest at LoRA's parameter scale. About 22% more peak GPU memory (2.17 GB to 2.65 GB) covers the per-sample-gradient accumulation. This stays comfortably within a T4's 16 GB envelope and would not constrain a typical fine-tuning workflow.\n", + "\n", + "Throughput is essentially unchanged. 
Both configurations hit roughly 4600 tokens per second; opacus is not a throughput bottleneck for LoRA at this size.\n", + "\n", + "**Learning rate matters more for DP.** The 5\u00d7 higher lr for the DP variant is not optional. Leave both configurations at `lr=1e-4` and the DP variant produces BLEU near zero despite reaching reasonable perplexity. The intuition: gradient clipping and Gaussian noise both shrink the effective per-step update, so the optimizer needs a larger learning rate to make comparable progress per step. This is consistent with the DP-NLP literature and worth surfacing explicitly in any production setup." + ], + "metadata": {} + }, + { + "id": "75c954a4-b452-4d1b-8455-f52c0a1837cb", + "cell_type": "markdown", + "source": "## 15. When to reach for DP-LoRA\n\nA practical heuristic, given the numbers above:\n\nUse DP-LoRA when the training data is sensitive enough to warrant a real privacy guarantee and the downstream task tolerates a meaningful utility drop relative to a non-DP baseline. The 26% relative BLEU cost we observe is in the ballpark reported elsewhere in the DP-NLP literature at comparable \u03b5; it is the price of admission for the formal privacy guarantee.\n\nConsider full DP fine-tuning instead of DP-LoRA when LoRA's restricted parameter subspace is the bottleneck (not the privacy budget) and you have compute headroom to handle opacus's per-sample-gradient memory overhead at the full model scale. The full-finetune path on GPT-2 specifically also needs the engineering steps in the follow-up section below.", + "metadata": {} + }, + { + "id": "cc75870e-390c-4cce-a18e-8675a0610f49", + "cell_type": "markdown", + "source": "## 16. Follow-up work\n\n### Enabling DP-full fine-tuning on GPT-2\n\nDP fine-tuning of GPT-2 with all 125M parameters trainable is out of scope for this tutorial. 
The straightforward setup runs into a per-sample-gradient shape mismatch in opacus's `clip_and_accumulate` step that neither `ModuleValidator.fix()` (Conv1D \u2192 Linear swap) nor `grad_sample_mode='functorch'` (vmap-based per-sample grads) resolved in our testing. The most likely root cause is GPT-2's tied input embedding and output projection (`transformer.wte.weight` is the same tensor as `lm_head.weight`); opacus's hook-based accumulation appears to double-count or miscount across the two module sites.\n\nA reasonable engineering recipe to restore DP-full as a follow-up PR:\n\n1. Explicitly untie the embedding weight: `model.lm_head.weight = nn.Parameter(model.transformer.wte.weight.data.clone())`\n2. Apply `opacus.validators.ModuleValidator.fix(model)` to swap `transformers.Conv1D` modules for `nn.Linear`\n3. Use `grad_sample_mode='functorch'` for additional robustness against custom module types\n4. Reduce `batch_size` to fit T4 memory (functorch's per-sample-grad path is 3 to 4\u00d7 heavier than the default hooks path for GPT-2-full)\n\n### Other polishes worth doing\n\n- Full validation set (about 4K examples) for higher-confidence BLEU\n- Multi-reference BLEU using all human references per MR (the E2E NLG release includes 5 to 8 refs per input)\n- 2 to 3 seeds per configuration to quantify variance\n- Small HP sweep around `lr`, `max_grad_norm`, and `target_epsilon` to pick \"reasonable\" settings rigorously\n- Loss-only-on-target labels (with `-100` masking on the prompt portion) once the opacus + LoRA per-sample-gradient interaction with masked labels is resolved\n\n### References\n\n- [opacus#820](https://github.com/meta-pytorch/opacus/issues/820) \u2014 the device-placement-ordering issue that motivates Pattern A\n- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)\n- [Deep Learning with Differential Privacy (Abadi et al. 
2016)](https://arxiv.org/abs/1607.00133)\n- [E2E NLG Challenge dataset](https://arxiv.org/abs/1706.09254) and the upstream [GitHub repo](https://github.com/tuetschek/e2e-dataset)\n- [DiSK: Differentially Private Optimizer with Simplified Kalman Filter (Zhang et al. 2024)](https://arxiv.org/abs/2410.03883), which uses E2E NLG as part of its benchmark suite", + "metadata": {} + } + ] +} \ No newline at end of file