The following Mermaid diagram shows the overall architecture and data flow of DiffSynth-Engine, a high-performance inference engine for diffusion models.
```mermaid
graph TB
%% Input Layer
A[User Input: Prompt, Image, Parameters] --> B[Configuration]
B --> B1[Pipeline Config<br/>FluxPipelineConfig/SDXLPipelineConfig/etc.]
%% Model Fetching & Loading
B1 --> C[Model Fetching System]
C --> C1[fetch_model<br/>HuggingFace/CivitAI/ModelScope]
C1 --> C2[State Dict Loading<br/>SafeTensors/GGUF]
C2 --> C3[Model Conversion<br/>Diffusers → DiffSynth Format]
%% Pipeline Factory
C3 --> D{Pipeline Type}
D -->|Flux| E[FluxImagePipeline]
D -->|SDXL| F[SDXLImagePipeline]
D -->|SD| G[SDImagePipeline]
D -->|Wan Video| H[WanVideoPipeline]
D -->|Qwen Image| I[QwenImagePipeline]
%% Main Pipeline Flow (using Flux as example)
E --> J[Model Initialization]
%% Text Processing
J --> K[Text Processing]
K --> K1[CLIPTokenizer + T5TokenizerFast]
K1 --> K2[FluxTextEncoder1 + FluxTextEncoder2]
K2 --> K3[Text Embeddings]
%% Image Processing (if img2img)
J --> L[Image Processing]
L --> L1[Image Preprocessing]
L1 --> L2[FluxVAEEncoder]
L2 --> L3[Latent Space]
%% Noise & Sampling Setup
J --> M[Noise & Sampling Setup]
M --> M1[Noise Generation + Dynamic Shifting]
M --> M2[RecifitedFlowScheduler → Timesteps]
M --> M3[FlowMatchEulerSampler → Strategy]
%% Core Denoising Loop
K3 --> N[Core Denoising Loop]
L3 --> N
M1 --> N
M2 --> N
M3 --> N
N --> N1[FluxDiT Transformer<br/>+ ControlNet/IP-Adapter]
N1 --> N2[Noise Prediction]
N2 --> N3[Sampler Step]
N3 --> N4{More Steps?}
N4 -->|Yes| N1
N4 -->|No| O[Final Latents]
%% Image Decoding
O --> P[FluxVAEDecoder]
P --> Q[Generated Image]
%% Performance Optimizations
R[Performance Features]
R --> R1[Memory Management<br/>CPU/GPU Offloading<br/>Sequential Offloading]
R --> R2[Parallel Processing<br/>Tensor/Sequence Parallel<br/>CFG Parallel]
R --> R3[Quantization<br/>FP8/GGUF Support<br/>Model Compilation]
R1 --> J
R2 --> J
R3 --> J
%% Model Customization
S[Model Customization]
S --> S1[LoRA Support<br/>Fused/Unfused Loading]
S --> S2[Conditioning<br/>IP-Adapter/Redux]
S --> S3[Control<br/>ControlNet/Inpainting]
S1 --> N1
S2 --> N1
S3 --> N1
%% Tools & Extensions
T[Tools & Extensions]
T --> T1[FluxInpaintingTool]
T --> T2[FluxOutpaintingTool]
T --> T3[FluxReferenceTools]
T --> T4[FluxReplaceTool]
T --> E
%% Algorithm Foundation
U[Algorithm Foundation]
U --> U1[Noise Schedulers<br/>Beta/DDIM/Exponential/Karras]
U --> U2[Samplers<br/>Euler/DPM++/DDPM/FlowMatch]
U1 --> M
U2 --> M
style A fill:#e1f5fe
style Q fill:#c8e6c9
style N1 fill:#fff3e0
style E fill:#f3e5f5
style C fill:#fce4ec
style R fill:#e8f5e8
style S fill:#fff8e1
```
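The core denoising loop in the diagram can be sketched as a flow-matching Euler sampler. This is a minimal stand-in, assuming a plain Euler update over a linear sigma schedule; the real engine drives the FluxDiT transformer over latent tensors with conditioning and CFG:

```python
import numpy as np

def flow_match_euler_sample(velocity_model, noise, num_steps=20):
    """Euler integration of a rectified-flow ODE from pure noise (sigma=1)
    down to data (sigma=0).

    velocity_model(x, sigma) predicts v = noise - x0; this mirrors the
    predict-then-step structure of the denoising loop above, minus
    conditioning, CFG, and ControlNet/IP-Adapter inputs.
    """
    # Linear sigma schedule from 1.0 down to 0.0 (the scheduler's job).
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)
    x = noise
    for i in range(num_steps):
        v = velocity_model(x, sigmas[i])          # noise/velocity prediction
        x = x + (sigmas[i + 1] - sigmas[i]) * v   # sampler step
    return x

# With an oracle velocity pointing from the target toward the noise, the
# sampler recovers the target, regardless of step count.
target = np.array([0.25, -1.5, 3.0])
noise = np.random.default_rng(0).standard_normal(3)
oracle = lambda x, sigma: noise - target
result = flow_match_euler_sample(oracle, noise)
```

Note the scheduler/sampler split, matching the diagram: the scheduler produces the sigma/timestep sequence, while the sampler defines the update rule applied between consecutive sigmas.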
The DiffSynth-Engine follows a modular architecture with these key components:
- Pipelines
  - FluxImagePipeline: Primary image generation pipeline using Flux models
  - SDXLImagePipeline: Stable Diffusion XL pipeline
  - SDImagePipeline: Standard Stable Diffusion pipeline
  - WanVideoPipeline: Video generation pipeline
  - QwenImagePipeline: Qwen image generation pipeline
- Text processing
  - Tokenizers: CLIPTokenizer and T5TokenizerFast for text preprocessing
  - Text Encoders: CLIP and T5 models for text embedding generation
  - Prompt Encoding: Converts text prompts to numerical embeddings
- Image processing
  - VAE Encoder: Encodes images to latent space representation
  - VAE Decoder: Decodes latents back to pixel space
  - Preprocessing: Image normalization and format conversion
- Noise and sampling
  - Schedulers: Define noise schedules (Beta, DDIM, Exponential, etc.)
  - Samplers: Implement sampling strategies (Euler, DPM++, DDPM, etc.)
  - Timestep Management: Controls the denoising process progression
- Denoising core
  - DiT (Diffusion Transformer): Main neural network for noise prediction
  - Attention Mechanisms: Self-attention and cross-attention layers
  - ControlNet Integration: Optional conditioning for guided generation
- Model customization
  - LoRA Support: Low-rank adaptation for model customization
  - IP-Adapter & Redux: Image-based conditioning
- Performance
  - Parallel Processing: Multi-GPU and distributed inference
  - Memory Management: CPU/GPU offloading and optimization
  - Quantization: FP8 and other precision optimizations
- Model management
  - State Dict Handling: Loading and converting model weights
  - Device Management: GPU/CPU memory allocation
  - Model Lifecycle: Loading, offloading, and cleanup
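One concrete piece of the scheduler machinery is the "Dynamic Shifting" step from the sampling setup: before sampling, the sigma schedule is warped toward the high-noise end, which matters for large images. A sketch of the standard rectified-flow time shift used by Flux-style schedulers; the shift value here is illustrative, not the engine's default:

```python
def shift_sigma(sigma, shift=3.0):
    """Rectified-flow time shift: warps sigma in [0, 1] toward the
    high-noise end while keeping the endpoints fixed.

    For shift > 1, intermediate sigmas are pushed toward 1.0, so more of
    the step budget is spent in the high-noise regime.
    """
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)

# Applying the shift to a uniform schedule raises every interior point.
schedule = [i / 10 for i in range(11)]
shifted = [shift_sigma(s) for s in schedule]
```

In resolution-dependent ("dynamic") variants, the shift value itself is derived from the image's latent sequence length, so larger images get a stronger warp.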
The engine supports multiple diffusion model families (Flux, SD, SDXL, Wan, Qwen) behind a unified interface, with extensive optimization features for high-performance inference.
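The fused LoRA loading mentioned under LoRA support amounts to merging the low-rank update directly into the base weight once, so inference runs at full speed with no adapter overhead. A minimal NumPy sketch; the matrix names and scale handling are illustrative, not DiffSynth-Engine's internals:

```python
import numpy as np

def fuse_lora(weight, lora_down, lora_up, scale=1.0):
    """Merge a LoRA update into a base weight matrix: W' = W + scale * (up @ down)."""
    return weight + scale * (lora_up @ lora_down)

rng = np.random.default_rng(42)
w = rng.standard_normal((64, 64))
down = rng.standard_normal((4, 64))   # rank-4 down-projection
up = rng.standard_normal((64, 4))     # rank-4 up-projection
fused = fuse_lora(w, down, up, scale=0.8)

# The fused layer produces the same output as the base weight plus the
# adapter applied separately at runtime.
x = rng.standard_normal(64)
y_fused = fused @ x
y_split = w @ x + 0.8 * (up @ (down @ x))
```

Unfused loading instead keeps `up`/`down` separate and adds their contribution during the forward pass, which makes switching or re-weighting LoRAs cheap at the cost of extra compute per step.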