llama-agent builds on llama.cpp's inference engine and adds an agentic tool-use loop on top.
- Single binary, zero dependencies: no Python, no Node.js, just download and run
- Single process: inference and agent loop in one process, no IPC overhead
- Same model cache: uses your existing llama.cpp models, no separate download or setup
- Light harness: one simple loop with a handful of built-in tools, optimized for small local models
- 100% local: offline, no API costs, your code stays on your machine
- No hidden telemetry: zero tracking, zero phone-home, no usage events, no error reports sent anywhere
- API server: `llama-agent-server` exposes the agent via HTTP API with SSE streaming
**Note:** Gemma 4 is Google's latest open model family (Apache 2.0), built for agentic use with native tool calling and multimodal input. The E4B variant (4.5B effective params, ~5 GB quantized) runs comfortably on an 8 GB laptop and brings full vision capabilities to llama-agent. The model can read and analyze images, screenshots, diagrams, and documents.
```bash
# make sure llama-agent is installed
brew install gary149/llama-agent/llama-agent

# launch with gemma 4 vision (~5 GB, runs on 8 GB machines)
llama-agent -hf unsloth/gemma-4-E4B-it-GGUF:UD-Q4_K_XL

# if you have 16 GB+ RAM, use the bigger MoE variant instead
llama-agent -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL
```

With vision enabled, the agent can process hundreds of images in a single session, classify animals by family, read text from screenshots, and analyze UI layouts. All locally, all with a 4B model.
| Variant | Effective Params | GGUF Size | Vision | Best for |
|---|---|---|---|---|
| E4B | 4.5B | ~5 GB | Yes | Laptops, on-device |
| 26B-A4B | 3.8B active (MoE) | ~16 GB | Yes | 16 GB+ machines |
| 31B | 30.7B | ~20 GB | Yes | 32 GB+ machines |
- Quick Start
- Available Tools
- Commands
- Skills
- AGENTS.md Support
- MCP Server Support
- Permission System
- Session Persistence
- Context Compaction
- HTTP API Server
```bash
# Install (macOS / Linux)
brew install gary149/llama-agent/llama-agent

# Run (downloads model automatically)
llama-agent -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL
```

Or download pre-built binaries from GitHub Releases.
Build from source

```bash
# Build CLI agent
cmake -B build
cmake --build build --target llama-agent

# Run
./build/bin/llama-agent -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL

# Or with a local model
./build/bin/llama-agent -m model.gguf
```

Add to PATH for global access:

```bash
# For zsh:
echo "export PATH=\"\$PATH:$(pwd)/build/bin\"" >> ~/.zshrc

# For bash:
echo "export PATH=\"\$PATH:$(pwd)/build/bin\"" >> ~/.bashrc
```

Build the HTTP API server:

```bash
cmake -B build -DLLAMA_HTTPLIB=ON
cmake --build build --target llama-agent-server
./build/bin/llama-agent-server -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL --port 8081
```
| Model | Command |
|---|---|
| GLM-4.7-Flash | `-hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL` |
Optimized settings for GLM-4.7-Flash
Use these parameters (recommended by Unsloth):
```bash
llama-agent -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
  --jinja --ctx-size 16384 --flash-attn on --fit on \
  --temp 0.7 --top-p 1.0 --min-p 0.01 --repeat-penalty 1.0
```

| Flag | Purpose |
|---|---|
| `--flash-attn on` | Up to 1.48x speedup at batch size 1 (PR #19092) |
| `--fit on` | Auto-optimizes GPU/CPU memory allocation |
| `--repeat-penalty 1.0` | Prevents output degradation (Unsloth recommendation) |
**Note:** Flash attention has a known issue with KV quantization on very long prompts (~79k+ tokens). On Pascal GPUs (GTX 10xx), flash attention may reduce performance.
The agent can use these tools to interact with your codebase and system.
| Tool | Description |
|---|---|
| `bash` | Execute shell commands (output keeps the tail, so errors at the end are preserved) |
| `read` | Read file contents with line numbers (supports images with vision models) |
| `write` | Create or overwrite files |
| `edit` | Search and replace in files |
| `glob` | Find files matching a pattern |
| `update_plan` | Track and display task progress for multi-step operations |
Interactive commands available during a session. Type these directly in the chat.
| Command | Description |
|---|---|
| `/exit` | Exit the agent |
| `/clear` | Clear conversation history |
| `/tools` | List available tools |
| `/skills` | List available skills |
| `/agents` | List discovered AGENTS.md files |
| `/stats` | Show token usage and timing statistics |
| `/compact` | Manually compact conversation context |
| `!<cmd>` | Run a shell command and share the output with the LLM |
| `!!<cmd>` | Run a shell command without sharing output with the LLM |
```text
> Find all TODO comments in src/
[Tool: bash] grep -r "TODO" src/
Found 5 TODO comments...

> Read the main.cpp file
[Tool: read] main.cpp
1| #include <iostream>
2| int main() {
...

> Fix the bug on line 42
[Tool: edit] main.cpp
Replaced "old code" with "fixed code"
```
Skills are reusable prompt modules that extend the agent's capabilities. They follow the agentskills.io specification.
| Flag | Description |
|---|---|
| `--no-skills` | Disable skill discovery |
| `--skills-path PATH` | Add custom skills directory |
Skills are discovered from:

- `./.llama-agent/skills/` - Project-local skills
- `./.agents/skills/` - Project-local skills (alternative path)
- `~/.llama-agent/skills/` - User-global skills
- `~/.agents/skills/` - User-global skills (alternative path)
- Custom paths via `--skills-path`
Creating a skill
Skills are directories containing a SKILL.md file with YAML frontmatter:
```bash
mkdir -p ~/.llama-agent/skills/code-review
cat > ~/.llama-agent/skills/code-review/SKILL.md << 'EOF'
---
name: code-review
description: Review code for bugs, security issues, and improvements. Use when asked to review code or a PR.
---

# Code Review Instructions

When reviewing code:
1. Run `git diff` to see changes
2. Read modified files for context
3. Check for bugs, security issues, style problems
4. Provide specific feedback with file:line references
EOF
```

Skill Structure
```text
skill-name/
├── SKILL.md         # Required - YAML frontmatter + instructions
├── scripts/         # Optional - executable scripts
├── references/      # Optional - additional documentation
└── assets/          # Optional - templates, data files
```
SKILL.md Format
```yaml
---
name: skill-name            # Required: 1-64 chars, lowercase+numbers+hyphens
description: What and when  # Required: 1-1024 chars, triggers activation
license: MIT                # Optional
compatibility: python3      # Optional: environment requirements
metadata:                   # Optional: custom key-value pairs
  author: someone
---
Markdown instructions for the agent...
```

How Skills Work
- Discovery: At startup, the agent scans skill directories and loads metadata (name/description)
- Activation: When your request matches a skill's description, the agent reads the full `SKILL.md`
- Execution: The agent follows the skill's instructions, optionally running scripts from `scripts/`
This "progressive disclosure" keeps context lean: only activated skills consume tokens.
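The cheap discovery step is what makes this work: only the frontmatter is parsed until a skill activates. A minimal sketch of that extraction, assuming simple one-line `key: value` frontmatter (the file here is a throwaway fixture):

```shell
# Write a fixture skill file, then pull out only the metadata the agent
# would load at discovery time (everything between the two "---" lines
# that starts with "name:" or "description:").
skill_dir=$(mktemp -d)
cat > "$skill_dir/SKILL.md" << 'EOF'
---
name: code-review
description: Review code for bugs, security issues, and improvements.
---
# Code Review Instructions
...
EOF
meta=$(sed -n '2,/^---$/{ /^name:/p; /^description:/p; }' "$skill_dir/SKILL.md")
echo "$meta"
```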
The agent automatically discovers and loads AGENTS.md files for project-specific guidance.
| Flag | Description |
|---|---|
| `--no-agents-md` | Disable AGENTS.md discovery |
Files are discovered from the working directory up to the git root, plus a global `~/.llama-agent/AGENTS.md`.
Creating an AGENTS.md file
Create an AGENTS.md file in your repository root:
```markdown
# Project Guidelines

## Build & Test
- Build: `cmake -B build && cmake --build build`
- Test: `ctest --test-dir build`

## Code Style
- Use 4-space indentation
- Follow Google C++ style guide

## PR Guidelines
- Include tests for new features
- Update documentation
```

Search Locations (in precedence order)
- `./AGENTS.md` - Current working directory (highest precedence)
- `../AGENTS.md`, `../../AGENTS.md`, ... - Parent directories up to git root
- `~/.llama-agent/AGENTS.md` - Global user preferences (lowest precedence)
Monorepo Support
In monorepos, you can have nested AGENTS.md files:
```text
repo/
├── AGENTS.md                 # General project guidance
├── packages/
│   ├── frontend/
│   │   └── AGENTS.md         # Frontend-specific guidance (takes precedence)
│   └── backend/
│       └── AGENTS.md         # Backend-specific guidance
```

When working in `packages/frontend/`, both files are loaded with the frontend one taking precedence.
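The upward walk can be sketched as follows, over a throwaway fixture tree; the assumption that discovery stops at the first `.git` directory matches the description above:

```shell
# Throwaway fixture tree standing in for a real monorepo checkout
root=$(mktemp -d)
mkdir -p "$root/repo/.git" "$root/repo/packages/frontend"
touch "$root/repo/AGENTS.md" "$root/repo/packages/frontend/AGENTS.md"

# Walk upward from the working directory, collecting AGENTS.md files,
# and stop once the directory containing .git is reached
dir="$root/repo/packages/frontend"
found=""
while :; do
  if [ -f "$dir/AGENTS.md" ]; then found="$found ${dir#$root}/AGENTS.md"; fi
  if [ -d "$dir/.git" ]; then break; fi
  if [ "$dir" = "/" ]; then break; fi
  dir=$(dirname "$dir")
done
found="${found# }"
echo "$found"
```

Note that the most deeply nested file is found first, which is one way the "nearest file wins" precedence could be implemented.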
The agent supports Model Context Protocol (MCP) servers, allowing you to extend its capabilities with external tools.
**Note:** MCP servers using HTTPS (like HuggingFace) require SSL support. If you see `'https' scheme is not supported`, rebuild with:

```bash
cmake -B build -DLLAMA_BUILD_LIBRESSL=ON
cmake --build build -t llama-agent -j
```
Create an `mcp.json` file in your working directory or at `~/.llama-agent/mcp.json`:

```json
{
  "servers": {
    "gradio": {
      "command": "npx",
      "args": ["mcp-remote", "https://example.hf.space/gradio_api/mcp/", "--transport", "streamable-http"],
      "timeout": 120000
    }
  }
}
```

Use `/tools` to see all available tools including MCP tools. Use `--no-mcp` to skip MCP server loading entirely.
MCP configuration details
Config Options
| Field | Description | Default |
|---|---|---|
| `command` | Executable to run (required) | - |
| `args` | Command line arguments | `[]` |
| `env` | Environment variables | `{}` |
| `timeout` | Tool call timeout in ms | `60000` |
| `enabled` | Enable/disable the server | `true` |
Config values support environment variable substitution using `${VAR_NAME}` syntax.
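A sketch of what that substitution amounts to, emulated here with `sed` (the agent's actual expansion rules are internal and may handle more cases):

```shell
# A config value containing ${HF_TOKEN} gets the variable's value spliced in
export HF_TOKEN=abc123
config='{"env":{"TOKEN":"${HF_TOKEN}"}}'
expanded=$(printf '%s' "$config" | sed "s/\${HF_TOKEN}/$HF_TOKEN/")
echo "$expanded"
```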
Transport
Only stdio transport is supported natively. The agent spawns the server process and communicates via stdin/stdout using JSON-RPC 2.0.
For HTTP-based MCP servers (like Gradio endpoints), use a bridge such as `mcp-remote`.
Tool Naming
MCP tools are registered with qualified names: `mcp__<server>__<tool>`. For example, a `read_file` tool from a server named `filesystem` becomes `mcp__filesystem__read_file`.
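The qualified name is plain concatenation of server and tool name; a tiny sketch:

```shell
# Derive a qualified MCP tool name from a server name and a tool name
qualify() { printf 'mcp__%s__%s\n' "$1" "$2"; }

qualify filesystem read_file   # prints mcp__filesystem__read_file
```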
The agent asks for confirmation before:
- Running shell commands
- Writing or editing files
- Accessing files outside the working directory
When prompted, respond with `y` (yes), `n` (no), `a` (always allow), or `d` (always deny).
| Flag | Description |
|---|---|
| `--yolo` | Skip all permission prompts (dangerous!) |
| `--max-iterations N` | Max agent iterations (default: unlimited) |
- Sensitive file blocking: Automatically blocks access to `.env`, `*.key`, `*.pem`, and credentials files
- External directory warnings: Prompts before accessing files outside the project
- Dangerous command detection: Warns for `rm -rf`, `sudo`, `curl | bash`, etc.
- Doom-loop detection: Detects and blocks repeated identical tool calls
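A naive version of such a dangerous-command check might look like this (illustrative patterns only; the agent's real detector is internal and more thorough):

```shell
# Return success (0) if the command string matches a known-dangerous pattern
is_dangerous() {
  case "$1" in
    *'rm -rf'*|*sudo*|*curl*'|'*bash*) return 0 ;;
    *) return 1 ;;
  esac
}

is_dangerous 'rm -rf build/' && echo "blocked: rm -rf build/"
is_dangerous 'ls -la' || echo "allowed: ls -la"
```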
**Caution:** YOLO mode is extremely dangerous. The agent will execute any command without confirmation, including destructive operations like `rm -rf`. This is especially risky with smaller models that have weaker instruction-following and may hallucinate unsafe commands. Only use this flag if you fully trust the model and understand the risks.
Conversations are automatically saved to disk as append-only JSONL files, so you can resume where you left off.
Sessions are stored at `~/.llama-agent/sessions/`, organized by working directory. Each run creates a new session file.
| Flag | Description |
|---|---|
| `--resume` | Resume the most recent session for the current directory |
| `--session <path>` | Use a specific session file (creates or resumes) |
| `--no-session` | Disable session persistence |
```bash
# Start a session (auto-saved)
llama-agent -hf model

# Resume where you left off
llama-agent -hf model --resume

# Works with piped input too
echo "hello" | llama-agent -hf model
echo "what did I say?" | llama-agent -hf model --resume

# Explicit session file
llama-agent -hf model --session ~/my-session.jsonl
```

The `/clear` command resets both the conversation and the session file.
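One plausible way `--resume` could pick the most recent session file, sketched against a throwaway directory (the actual selection logic is internal to llama-agent):

```shell
# Simulate a per-directory session store with two session files,
# created a second apart so modification times differ
sessions=$(mktemp -d)
touch "$sessions/sess_00000001.jsonl"
sleep 1
touch "$sessions/sess_00000002.jsonl"

# Newest-first listing by mtime; the first entry is the resume candidate
latest=$(ls -t "$sessions"/*.jsonl | head -1)
echo "resuming: $(basename "$latest")"
```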
Long conversations automatically trigger context compaction to stay within the model's context window. When the prompt approaches the context limit, the agent summarizes older messages using the model itself and replaces them with a structured summary. This allows arbitrarily long sessions without losing important context.
How it works:
- After each completion, the agent checks whether prompt tokens exceed ~75% of the context window
- If so, it finds a safe cut point at a turn boundary (never splitting a tool call from its result)
- Older messages are serialized and sent to the model with a summarization prompt
- The summary replaces the old messages, preserving goals, progress, key decisions, and next steps
- If the context overflows entirely, the agent compacts and retries automatically
Compaction is enabled by default. The summary is iteratively updated on subsequent compactions, so context accumulates rather than being lost.
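The trigger condition can be sketched with simple arithmetic, assuming the ~75% threshold mentioned above:

```shell
# Compaction trigger check: compact when prompt tokens exceed ~75% of the
# context window (example numbers; real values come from the running session)
ctx_size=16384
prompt_tokens=13000
threshold=$(( ctx_size * 75 / 100 ))   # 12288 for a 16384-token window

if [ "$prompt_tokens" -gt "$threshold" ]; then
  echo "compact: $prompt_tokens tokens > $threshold"
fi
```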
llama-agent-server exposes the agent via HTTP API with Server-Sent Events (SSE) streaming.
```bash
# Build & run
cmake -B build -DLLAMA_HTTPLIB=ON
cmake --build build --target llama-agent-server
./build/bin/llama-agent-server -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL --port 8081
```

```bash
# Create a session
curl -X POST http://localhost:8081/v1/agent/session \
  -H "Content-Type: application/json" \
  -d '{"yolo": true}'
# Returns: {"session_id": "sess_00000001"}

# Send a message (streaming response)
curl -N http://localhost:8081/v1/agent/session/sess_00000001/chat \
  -H "Content-Type: application/json" \
  -d '{"content": "List files in the current directory"}'
```

API endpoints reference
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/v1/agent/session` | POST | Create a new session |
| `/v1/agent/session/:id` | GET | Get session info |
| `/v1/agent/session/:id/chat` | POST | Send message (SSE streaming) |
| `/v1/agent/session/:id/messages` | GET | Get conversation history |
| `/v1/agent/session/:id/permissions` | GET | Get pending permission requests |
| `/v1/agent/permission/:id` | POST | Respond to permission request |
| `/v1/agent/sessions` | GET | List all sessions |
| `/v1/agent/tools` | GET | List available tools |
| `/v1/agent/session/:id/stats` | GET | Get session token stats |
Session Options

- `yolo` (boolean): Skip permission prompts
- `max_iterations` (int): Max agent iterations (default: 0 = unlimited)
- `working_dir` (string): Working directory for tools
SSE event types
| Event | Description |
|---|---|
| `iteration_start` | New agent iteration starting |
| `reasoning_delta` | Streaming model reasoning/thinking |
| `text_delta` | Streaming response text |
| `tool_start` | Tool execution beginning |
| `tool_result` | Tool execution completed |
| `permission_required` | Permission needed (non-yolo mode) |
| `permission_resolved` | Permission granted/denied |
| `compaction_completed` | Context compaction finished |
| `completed` | Agent finished with stats |
| `error` | Error occurred |
Example SSE Stream
```text
event: iteration_start
data: {"iteration":1,"max_iterations":0}

event: reasoning_delta
data: {"content":"Let me list the files..."}

event: tool_start
data: {"name":"bash","args":"{\"command\":\"ls\"}"}

event: tool_result
data: {"name":"bash","success":true,"output":"file1.txt\nfile2.cpp","duration_ms":45}

event: text_delta
data: {"content":"Here are the files:"}

event: completed
data: {"reason":"completed","stats":{"input_tokens":1500,"output_tokens":200}}
```
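A stream in this shape can be post-processed with standard tools; for example, extracting `text_delta` payloads with `awk` (sample events inlined here; in practice you would pipe `curl -N` output in):

```shell
# Two sample SSE events, as they would arrive on the wire
stream='event: text_delta
data: {"content":"Here are the files:"}
event: completed
data: {"reason":"completed"}'

# Remember when a text_delta event header was seen, then print the JSON
# payload of the following data: line (dropping the 6-char "data: " prefix)
deltas=$(printf '%s\n' "$stream" |
  awk '/^event: text_delta$/ {want=1; next}
       /^data: / && want     {print substr($0, 7); want=0}')
echo "$deltas"
```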
Permission flow & session management
Permission Flow
When `yolo: false`, dangerous operations require permission:

```text
event: permission_required
data: {"request_id":"perm_abc123","tool":"bash","details":"rm -rf temp/","dangerous":true}
```

Respond via API:

```bash
curl -X POST http://localhost:8081/v1/agent/permission/perm_abc123 \
  -H "Content-Type: application/json" \
  -d '{"allow": true, "scope": "session"}'
```

Scopes: `once`, `session`, `always`
Concurrent Sessions
The server supports multiple concurrent sessions, each with its own conversation history and permission state.
```bash
# List all sessions
curl http://localhost:8081/v1/agent/sessions

# Delete a session
curl -X POST http://localhost:8081/v1/agent/session/sess_00000001/delete
```

Light harness inspired by Pi by Mario Zechner.
MIT - see LICENSE