Reproducing and extending the Self-Instruct framework by fine-tuning Qwen/Qwen2.5-0.5B-Instruct using QLoRA (4-bit) on an 82K instruction dataset, evaluated against OpenAI Davinci on 252 expert-written tasks.
- Overview
- How Self-Instruct Works
- Project Pipeline
- Setup & Installation
- Repository Structure
- Training Configuration
- Evaluation Results
- Key Observations
- Citation
Self-Instruct is a framework for improving a language model's instruction-following ability using its own generated data — without extensive human annotation. Starting from a small seed of 175 human-written tasks, it bootstraps a large instruction dataset using an LLM.
This project:
- Bootstraps 500 new machine-generated instructions (25 batches × 20 instructions) using llama-v3p1-8b-instruct via the Fireworks API
- Fine-tunes Qwen/Qwen2.5-0.5B-Instruct on the 82K Self-Instruct dataset using QLoRA (4-bit NF4 quantization)
- Evaluates the fine-tuned model on the 252 expert-written Human Eval benchmark from the original Self-Instruct paper
- Compares model outputs against OpenAI Davinci predictions using BLEU, ROUGE-L, and BERTScore-F1
The Self-Instruct process is an iterative bootstrapping algorithm. Starting from a seed set of manually-written tasks, it prompts an LLM to generate new instructions and corresponding input-output instances. Low-quality or duplicate generations are filtered out, and the resulting data is merged back into the pool. This cycle repeats, growing the instruction dataset with each pass.
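
The loop described above can be sketched in a few lines. This is a minimal illustration, not the code in bootstrap_instructions.py: the `generate_batch` callable stands in for the LLM call, and the Jaccard token overlap, the 0.7 threshold, and the 8 in-context examples are all assumptions chosen for the example.

```python
import random


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between lowercase token sets (a stand-in similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def bootstrap(seed_tasks, generate_batch, num_batches=25, max_overlap=0.7, rng=None):
    """Grow an instruction pool: sample examples, generate, filter, merge, repeat."""
    rng = rng or random.Random(0)
    pool = list(seed_tasks)
    for _ in range(num_batches):
        # Each batch is prompted with examples drawn from the current pool,
        # so earlier generations can seed later ones.
        examples = rng.sample(pool, k=min(8, len(pool)))
        for candidate in generate_batch(examples):
            candidate = candidate.strip()
            if not candidate:
                continue  # drop empty generations
            if any(token_overlap(candidate, kept) > max_overlap for kept in pool):
                continue  # drop near-duplicates of anything already in the pool
            pool.append(candidate)
    return pool


def make_fake_generator():
    """Hypothetical stand-in for the LLM: one new task, one duplicate, one blank."""
    state = {"n": 0}

    def generate_batch(examples):
        state["n"] += 1
        return [f"Invented task number {state['n']}", examples[0], "  "]

    return generate_batch


seeds = [
    "Translate the sentence to French",
    "Classify the sentiment of the review",
    "Write a haiku about autumn",
]
pool = bootstrap(seeds, make_fake_generator(), num_batches=5)
print(len(pool))
```

With the fake generator, only the one genuinely new instruction per batch survives the filter, so the pool grows by exactly one entry per pass.
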
The pipeline for generating instruction data from a language model itself.
Seed Tasks (175)
│
▼
[Step 1] Instruction Bootstrapping
- 25 batches × 20 instructions/batch
- Model: llama-v3p1-8b-instruct (Fireworks API)
- Iterative: each batch merges with previous seeds
- Output: 500 new instructions → merged with 82K dataset
│
▼
[Step 2] Instance Generation
- generate_instances.py on non-classification tasks
- 388 generation tasks + 1890 classification tasks
- Total final dataset: 2,275 valid examples
│
▼
[Step 3] QLoRA Fine-Tuning
- Base: Qwen/Qwen2.5-0.5B-Instruct
- 4-bit NF4 quantization (BitsAndBytes)
- Training subset: 1,000 randomly sampled examples
- Platform: Google Colab (T4 GPU)
│
▼
[Step 4] Evaluation on Human Eval (252 tasks)
- Metrics: BLEU, ROUGE-L, BERTScore-F1
- Comparison: QLoRA predictions vs. Davinci predictions
- Python 3.10+
- Google Colab (recommended) or a GPU with ≥ 8GB VRAM
- A Fireworks AI API key (for instruction generation)
pip install transformers peft trl bitsandbytes datasets accelerate
pip install evaluate rouge_score bert_score matplotlib

import os
os.environ["FIREWORKS_API_KEY"] = "your_fireworks_api_key_here"

All steps are contained in Runner.ipynb. Execute cells in order:
- Cell 1 — Set Fireworks API key
- Cell 2 — Bootstrap instructions (25 batches via Fireworks LLM)
- Cell 3 — Filter classification tasks
- Cell 4 — Generate instances for generation tasks
- Cell 5 — Merge and clean final dataset
- Cell 6 — QLoRA fine-tuning on Qwen2.5-0.5B-Instruct
- Cell 7 — Evaluate on 252 Human Eval tasks + visualize
Self-Instruct/
├── Runner.ipynb # Main notebook: full pipeline end-to-end
├── README.md
│
├── data/
│ ├── seed_tasks.jsonl # 175 human-written seed instructions
│ ├── gpt3-generations/
│ │ └── batch_221203/
│ │ └── all_instances_82K.jsonl # Full 82K Self-Instruct dataset
│ └── finetuning/
│ └── self_instruct_221203/ # GPT3 fine-tuning format (prompt + completion)
│
├── self_instruct/
│ ├── bootstrap_instructions.py # Instruction generation script
│ └── generate_instances.py # Instance input/output generation
│
├── human_eval/
│ ├── user_oriented_instructions.jsonl # 252 expert-written evaluation tasks
│ ├── README.md
│ └── predictions/
│ ├── davinci-self-instruct_predictions.jsonl
│ └── qwen05b-qlora-ft_eval_predictions.jsonl
│
├── qwen05b-qlora-ft/
│ └── adapter/ # Saved LoRA adapter weights
│
└── docs/
└── pipeline.JPG # Self-Instruct pipeline diagram
| Property | Value |
|---|---|
| Source dataset | Self-Instruct 82K (all_instances_82K.jsonl) |
| Training subset | 1,000 randomly sampled examples (seed=42) |
| Train / Val split | 950 / 50 (95% / 5%) |
| Classification tasks | 1,890 examples |
| Generation tasks | 388 examples |
| Final cleaned dataset | 2,275 valid examples |
| Avg. input length | 23.2 words |
| Avg. output length | 10.4 words |
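
The sampling and split figures in the table can be reproduced with a seeded sampler. This is a sketch under assumptions: the placeholder `dataset` list and the `sample_and_split` helper are hypothetical, and the notebook may use a different sampling routine, so the exact examples selected will not match.

```python
import random


def sample_and_split(examples, n_samples=1000, val_fraction=0.05, seed=42):
    """Draw a seeded random subset, then hold out a fraction for validation."""
    rng = random.Random(seed)
    subset = rng.sample(examples, k=n_samples)
    n_val = round(n_samples * val_fraction)
    return subset[n_val:], subset[:n_val]


# Placeholder records standing in for the instruction examples.
dataset = [{"id": i} for i in range(5000)]
train, val = sample_and_split(dataset)
print(len(train), len(val))
```
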
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-0.5B-Instruct |
| Quantization | 4-bit NF4 (BitsAndBytes) |
| Compute dtype | float16 |
| Framework | HuggingFace Transformers + PEFT + TRL |
| Hyperparameter | Value |
|---|---|
| Rank (r) | 8 |
| LoRA Alpha | 16 |
| LoRA Dropout | 0.1 |
| Bias | none |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Task type | CAUSAL_LM |
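
The model and LoRA tables above map onto library config objects roughly as follows. This is a sketch, not the notebook's code: the dictionary names and the `build_configs` helper are hypothetical, and the library imports are deferred so the settings can be inspected without a GPU stack installed.

```python
# Settings copied from the tables above.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "float16",
}

LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "bias": "none",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}


def build_configs():
    """Build the BitsAndBytes and PEFT config objects from the settings above."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant = dict(QUANT_KWARGS)
    # Resolve the dtype name to the torch dtype object.
    quant["bnb_4bit_compute_dtype"] = getattr(torch, quant["bnb_4bit_compute_dtype"])
    return BitsAndBytesConfig(**quant), LoraConfig(**LORA_KWARGS)


# With standard LoRA scaling, the adapter update is multiplied by alpha / r.
lora_scale = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(lora_scale)
```

With alpha 16 and rank 8 the adapter update is scaled by 2.
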
| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Learning rate | 2e-4 |
| Per-device batch size | 1 |
| Gradient accumulation steps | 4 (effective batch size = 4) |
| Optimizer | paged_adamw_8bit |
| Mixed precision | fp16 |
| Max sequence length | 1024 (SFTTrainer default) |
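
The training hyperparameters can be expressed as keyword arguments for `transformers.TrainingArguments`. This is a sketch: the `output_dir` value and the `build_training_args` helper are assumptions, and the maximum sequence length is left out because the argument name has differed between TRL releases. The step count is an estimate based on the 950-example training split.

```python
import math

# Settings copied from the table above; output_dir is an assumed path.
TRAINING_KWARGS = {
    "output_dir": "qwen05b-qlora-ft",
    "num_train_epochs": 1,
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "optim": "paged_adamw_8bit",
    "fp16": True,
}


def build_training_args():
    """Build the Trainer arguments; import deferred so the dict is usable alone."""
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)


# Effective batch size on a single GPU: per-device batch × accumulation steps.
effective_batch = (
    TRAINING_KWARGS["per_device_train_batch_size"]
    * TRAINING_KWARGS["gradient_accumulation_steps"]
)

# Approximate optimizer steps for one epoch over the 950 training examples.
approx_steps = math.ceil(950 / effective_batch)
print(effective_batch, approx_steps)
```

The real step count may be 237 or 238, depending on how the trainer handles the final partial accumulation window.
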
The model was fine-tuned using the Qwen ChatML format:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{instruction}
Input:
{input}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>
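
A small helper makes the template above concrete. This is an illustrative sketch rather than the notebook's formatting function. The `to_chatml` name is hypothetical, and omitting the `Input:` block when an example has no input is an assumption.

```python
SYSTEM_PROMPT = "You are a helpful assistant."


def to_chatml(instruction: str, output: str, input_text: str = "") -> str:
    """Render one training example in the ChatML layout shown above."""
    user = instruction.strip()
    if input_text.strip():
        # Assumption: the "Input:" block is only added when an input exists.
        user += "\nInput:\n" + input_text.strip()
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{output.strip()}<|im_end|>"
    )


example = to_chatml(
    instruction="Classify the sentiment of the sentence.",
    input_text="I loved this film.",
    output="Positive",
)
print(example)
```
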
The fine-tuned Qwen2.5-0.5B-QLoRA model was evaluated on all 252 expert-written, user-oriented tasks from the original Self-Instruct Human Eval benchmark. Model predictions were compared against OpenAI Davinci's outputs using three automatic metrics.
| Metric | Score |
|---|---|
| BLEU | 0.0696 |
| ROUGE-L | 0.1961 |
| BERTScore-F1 | 0.8358 |
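Note on interpretation: These metrics treat Davinci's outputs as the reference, so they measure agreement with Davinci rather than correctness. The low BLEU and ROUGE-L scores reflect lexical divergence: the fine-tuned model's responses are worded differently from Davinci's. The BERTScore-F1 of 0.8358 suggests the responses are semantically close to Davinci's even when the surface wording differs. If the score was computed without baseline rescaling, note that such values tend to fall in a narrow high range, so the absolute number should be read with caution.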
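
To see why overlap metrics punish paraphrasing, consider a simplified ROUGE-L based on the longest common subsequence (LCS) of tokens. This sketch uses plain whitespace tokenization and is not the rouge_score package used for the reported numbers; the function names and example sentences are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L F1 over lowercase whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
paraphrase = "a feline rested on the rug"
print(rouge_l_f1(reference, reference))
print(rouge_l_f1(reference, paraphrase))
```

The paraphrase shares only the subsequence "on the" with the reference, so it scores about 0.33 despite describing the same scene, while an exact copy scores 1.0.
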
Average BLEU, ROUGE-L, and BERTScore-F1 across 252 evaluation examples.
Per-example scores sorted by BERTScore-F1 (descending), with one panel each for BLEU, ROUGE-L, and BERTScore-F1. BLEU and ROUGE-L are sparse and skewed, consistent with open-ended generation tasks where exact n-gram overlap is rare. BERTScore remains consistently high across examples.
BERTScore is the most informative metric here. BLEU and ROUGE-L measure n-gram overlap, which heavily penalizes paraphrasing. Since this is a 0.5B-parameter model fine-tuned on a 1,000-example subset for a single epoch, lexical divergence from Davinci is expected. The BERTScore-F1 of 0.836 suggests the model's outputs stay semantically close to Davinci's.
Efficient fine-tuning works on modest hardware. QLoRA with rank-8 LoRA applied to all attention projections (q, k, v, o) yields a usable instruction-following model in a single epoch on a T4 GPU in Google Colab, with no expensive hardware required.
Bootstrapping amplifies seed diversity. The iterative generation pipeline extended the 175 seed tasks with 500 machine-generated instructions, with each batch merged into the pool that seeds the next. This progressively diversifies the task pool without manual curation.
The BLEU/ROUGE gap highlights a known limitation of reference-based evaluation for open-ended instruction following. Future work could use LLM-as-judge evaluation (e.g., GPT-4 win-rate) for more meaningful quality assessment.
If you use this work or the original Self-Instruct framework, please cite:
@misc{selfinstruct,
title={Self-Instruct: Aligning Language Model with Self Generated Instructions},
author={Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A. and Khashabi, Daniel and Hajishirzi, Hannaneh},
journal={arXiv preprint arXiv:2212.10560},
year={2022}
}

This project extends the Self-Instruct framework using Qwen/Qwen2.5-0.5B-Instruct as the base model with QLoRA for efficient 4-bit fine-tuning. The instruction-generation pipeline was reproduced using the Fireworks AI API (llama-v3p1-8b-instruct) and evaluated on the 252-task Human Eval benchmark.




