A 110K Parameter Autoregressive Character-Level Language Model built from absolute scratch.
Welcome to the Nano-Llama Engine. This repository is not a wrapper around an existing model. It is a complete, deep-dive architectural recreation of a Generative Pre-trained Transformer (GPT), built manually from the ground up to understand the underlying calculus and matrix mathematics of Large Language Models.
This model implements, at a microscopic scale, the core mechanics of modern open-weight LLMs such as Llama 3:
- Rotary Positional Embeddings (RoPE)
- SwiGLU Activation Functions
- Multi-Head Causal Self-Attention
- RMSNorm (Root Mean Square Normalization)
- KV-Caching & Autoregressive Inference
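To make the first item in the list above concrete, here is a minimal NumPy sketch of rotary positional embeddings. It is an illustration under stated assumptions, not the code from this repository: the function name `rope`, the `base` of 10000, and the choice to pair adjacent channels (some implementations pair the first and second halves of the vector instead) are all assumptions.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate channel pairs of x (seq_len, head_dim) by position-dependent angles."""
    x = np.asarray(x, dtype=float)
    seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "head_dim must be even"
    # One rotation frequency per pair of channels (assumed pairing: 0-1, 2-3, ...).
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)   # (head_dim/2,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]      # (seq_len, head_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin   # 2D rotation, first component
    out[:, 1::2] = x_even * sin + x_odd * cos   # 2D rotation, second component
    return out

# Place the same query and key vector at every position, then compare scores.
rng = np.random.default_rng(0)
q_vec, k_vec = rng.normal(size=8), rng.normal(size=8)
q = rope(np.tile(q_vec, (6, 1)))
k = rope(np.tile(k_vec, (6, 1)))
scores = q @ k.T
print(np.isclose(scores[2, 1], scores[5, 4]))  # same offset between positions
```

Because each pair of channels is only rotated, vector norms are unchanged, and the query-key dot product depends on the offset between the two positions rather than on their absolute values.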
This repository is structured educationally into 6 distinct volumes, showing the evolution from raw math to a productionized API.
The fundamental linear algebra and multivariate calculus. We build the Transformer block (Self-Attention, SwiGLU, RMSNorm) using pure NumPy. The primary goal is to derive backpropagation and gradient flow for these mechanisms by hand, without relying on automatic differentiation.
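As an illustration of what deriving a backward pass by hand looks like, the sketch below implements RMSNorm forward and backward in NumPy and checks the analytic gradient against finite differences. The function names, the `eps` value, and the toy loss are assumptions made for this example; the repository's own code may be organized differently.

```python
import numpy as np

def rmsnorm_forward(x, gain, eps=1e-6):
    """y = gain * x / sqrt(mean(x^2) + eps), normalizing over the last axis."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    x_hat = x / rms
    return gain * x_hat, (x_hat, rms, gain)

def rmsnorm_backward(dy, cache):
    """Hand-derived gradients of the loss with respect to x and gain."""
    x_hat, rms, gain = cache
    d_gain = np.sum(dy * x_hat, axis=0)   # gain is shared across rows
    d_xhat = dy * gain
    # Second term: every x_j also influences the shared rms denominator.
    dx = (d_xhat - x_hat * np.mean(d_xhat * x_hat, axis=-1, keepdims=True)) / rms
    return dx, d_gain

def numerical_grad(loss_fn, arr, h=1e-5):
    """Central finite differences, perturbing arr in place one entry at a time."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        original = arr[idx]
        arr[idx] = original + h
        loss_plus = loss_fn()
        arr[idx] = original - h
        loss_minus = loss_fn()
        arr[idx] = original
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gain = rng.normal(size=8)
w = rng.normal(size=(4, 8))   # arbitrary weights that define a scalar toy loss

def loss():
    y, _ = rmsnorm_forward(x, gain)
    return float(np.sum(y * w))

_, cache = rmsnorm_forward(x, gain)
dx, d_gain = rmsnorm_backward(w, cache)   # dL/dy equals w for this loss
max_err_x = np.max(np.abs(dx - numerical_grad(loss, x)))
max_err_gain = np.max(np.abs(d_gain - numerical_grad(loss, gain)))
print(max_err_x, max_err_gain)
```

A gradient check like this is a common way to validate a manual derivation before stacking more layers on top of it.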
Scaling the architecture with GPU acceleration.
We take the mathematical intuition proven in Volume 1 and translate the exact same architecture into PyTorch's nn.Module. Here we introduce the Adam optimizer, batched data processing, and GPU tensors.
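For readers who want to see what the Adam optimizer does on each step, here is a small NumPy sketch of the standard update rule with bias correction. It is not PyTorch's `torch.optim.Adam` implementation; the function name `adam_step`, the `state` dictionary, and the toy quadratic loss are assumptions chosen for illustration.

```python
import numpy as np

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; state holds the step count and both moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy problem: minimize ||w - target||^2, whose gradient is 2 * (w - target).
target = np.array([1.0, -2.0, 3.0])
w = np.zeros(3)
state = {"t": 0, "m": np.zeros(3), "v": np.zeros(3)}
initial_loss = float(np.sum((w - target) ** 2))
first_step = None
for step in range(2000):
    grad = 2 * (w - target)
    w = adam_step(w, grad, state, lr=0.05)
    if step == 0:
        first_step = w.copy()
final_loss = float(np.sum((w - target) ** 2))
print(first_step, final_loss)
```

On the very first step the bias-corrected moments cancel the gradient's magnitude, so every parameter moves by roughly `lr` in the direction that reduces the loss, regardless of how large its gradient is.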
Giving the Automaton a voice. We write the generation loop. We transition the model from "training mode" into a fully autonomous text generator, handling token decoding and context window sliding so the model can generate text autoregressively.
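The generation loop described above can be sketched in plain Python as follows. The trained network is replaced by a hypothetical stand-in, `toy_model`, that always predicts the next token id in a fixed cycle; `generate`, `block_size`, and the other names are assumptions for illustration rather than the repository's actual API.

```python
import random

def generate(next_token_probs, prompt_ids, max_new_tokens, block_size, rng):
    """Sample tokens one at a time, feeding each one back in as input."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        context = ids[-block_size:]            # slide the context window
        probs = next_token_probs(context)      # distribution over the vocabulary
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        ids.append(next_id)
    return ids

VOCAB_SIZE = 4
seen_context_lengths = []

def toy_model(context):
    """Hypothetical stand-in for the network: always predicts (last id + 1) mod 4."""
    seen_context_lengths.append(len(context))
    probs = [0.0] * VOCAB_SIZE
    probs[(context[-1] + 1) % VOCAB_SIZE] = 1.0
    return probs

out = generate(toy_model, prompt_ids=[0], max_new_tokens=6,
               block_size=3, rng=random.Random(0))
print(out)
```

Slicing with `ids[-block_size:]` keeps only the most recent tokens, so the model never receives more context than it was trained to handle, however long the generated text becomes.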
Training the brain. We build a dynamic character-level vocabulary and a persistent training loop. The model is trained on a 1 MB dataset of William Shakespeare's works, learning entirely from scratch to spell English words and reproduce basic grammatical patterns.
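A character-level vocabulary of the kind described above can be built in a few lines. This is a minimal sketch on a short sample string; the names `stoi`, `itos`, `encode`, `decode`, and `block_size` are assumptions for illustration, and the real training script reads a much larger corpus.

```python
text = "to be, or not to be"

# The vocabulary is derived from the data: every distinct character gets an id.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

# Next-character prediction: the target is the input shifted left by one position.
ids = encode(text)
block_size = 8
x = ids[0:block_size]
y = ids[1:block_size + 1]
print(decode(x), "->", decode(y))
```

Each position in `x` is paired with the character that follows it in `y`, which is the training signal an autoregressive character model learns from.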
Visualizing the neural network. A custom Flask API and glassmorphic Web UI. Instead of just printing text to a terminal, this interface graphs the model's softmax probabilities for the next character in real time as it generates text.
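The values such an interface would plot are next-character probabilities obtained by applying softmax to the model's output logits. Here is a dependency-free sketch of that step, using made-up logits and a three-character vocabulary; the helper names and the `temperature` parameter are assumptions, not the showcase app's actual code.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                          # subtract the max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, vocab, k=3):
    """Return the k most likely characters with their probabilities."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

vocab = ["a", "b", "c"]
logits = [2.0, 1.0, 0.1]       # made-up scores for illustration
probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)
for ch, p in top_k(probs, vocab, k=2):
    print(f"{ch!r}: {p:.3f}")
```

Returning only the top few candidates keeps the payload small enough to send to a browser on every generated character.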
Containerizing the engine. We wrap the neural network in a production-ready FastAPI server and containerize it using Docker. This demonstrates the ability to transition raw research math into a scalable, deployable cloud architecture.
If you want to see the Neural Network calculate probabilities live in your browser:
- Clone this repository.
- Install the requirements:
pip install torch flask
- Run the Flask Server:
python volume_5_showcase/app.py
- Open your web browser and navigate to
http://127.0.0.1:5000
(Note: The trained shakespeare_gpt.pth weights are not included in this repository due to GitHub file size limits. To use the UI, you must first run python volume_4_shakespeare_scale/training_loop.py to train your own local weights!)