2 changes: 1 addition & 1 deletion .gitignore
@@ -47,7 +47,7 @@ CLAUDE_USER_SETTINGS.md
.DS_Store

docs/mkdocs/site/
docs/mkdocs/docs/notebooks/.ipynb_checkpoints/
.ipynb_checkpoints/

# Ignore automatically generated stub files (*.pyi)
**/*.pyi
139 changes: 113 additions & 26 deletions CLAUDE.md
@@ -6,31 +6,44 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

ArcticDB is a high-performance, serverless DataFrame database for the Python Data Science ecosystem. It provides a Python API backed by a C++ data-processing and compression engine, supporting S3, LMDB, Azure Blob Storage, and MongoDB backends.

## Claude-Maintained Documentation
## Documentation

Technical documentation in `docs/claude/` is **owned and maintained by Claude**. Consult these documents when working on related areas.
### User-Facing Documentation (`docs/mkdocs/docs/`)

**New features must include documentation:**

- **Tutorials** (`tutorials/`): Step-by-step guides for features (e.g., `sql_queries.md`)
- **API Reference** (`api/`): Auto-generated from docstrings via mkdocstrings
- **Technical docs** (`technical/`): Architecture and implementation details

When adding a new feature:

1. **Add/update docstrings** in the Python code (NumPy format)
2. **Create a tutorial** if the feature has multiple use cases or nuances
3. **Update `mkdocs.yml`** nav section to include new pages
4. **Build docs locally** to verify: `cd docs/mkdocs && mkdocs serve`

Documentation checklist:
- [ ] Public API has complete docstrings (Parameters, Returns, Raises, Examples)
- [ ] Complex features have a tutorial with code examples
- [ ] Edge cases and limitations are documented
- [ ] When to use feature A vs feature B is explained (if applicable)
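
As a rough illustration of the checklist above, a NumPy-format docstring with all four sections might look like the following. `rolling_mean` is a hypothetical helper invented for this sketch; it is not part of the ArcticDB API.

```python
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    """
    Compute the mean of each full window of consecutive values.

    Hypothetical helper, used only to illustrate docstring layout.

    Parameters
    ----------
    values : Sequence[float]
        Input values, in order.
    window : int
        Number of consecutive values per window. Must be at least 1.

    Returns
    -------
    List[float]
        One mean per full window; empty if ``len(values) < window``.

    Raises
    ------
    ValueError
        If ``window`` is less than 1.

    Examples
    --------
    >>> rolling_mean([1.0, 2.0, 3.0, 4.0], window=2)
    [1.5, 2.5, 3.5]
    """
    if window < 1:
        raise ValueError(f"window must be >= 1, got {window}")
    # One output per starting index that still leaves a full window.
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

The `Raises` section and the short-input case in `Returns` are where the "edge cases and limitations" item from the checklist would typically be covered.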

### When to Read/Update Documentation
### Claude-Maintained Technical Docs (`docs/claude/`)

Technical documentation in `docs/claude/` is **owned and maintained by Claude**. Consult these documents when working on related areas.

- **Read** the relevant doc when starting work in an area (e.g., read `CACHING.md` before modifying version map cache)
- **Update** the doc only when making changes to that area
- Do NOT proactively read or update docs for unrelated areas

### Documentation Style

Keep documentation **high-level and terse**:
- Reference `file_path:ClassName:method_name` instead of copying code
- Use tables and bullet points over code blocks
- Keep conceptual diagrams; remove implementation details
- Avoid duplicating what's already in source code

### Documentation Index
Keep documentation **high-level and terse**: reference `file_path:ClassName:method_name` instead of copying code; use tables and bullet points over code blocks; avoid duplicating what's already in source code.

| Area | Document |
|------|----------|
| Architecture | [docs/claude/ARCHITECTURE.md](docs/claude/ARCHITECTURE.md) |
| C++ modules | [docs/claude/cpp/](docs/claude/cpp/) (CACHING, VERSIONING, STORAGE_BACKENDS, ENTITY, CODEC, COLUMN_STORE, PIPELINE, PROCESSING, STREAM, ASYNC, PYTHON_BINDINGS) |
| Python modules | [docs/claude/python/](docs/claude/python/) (ARCTIC_CLASS, LIBRARY_API, NATIVE_VERSION_STORE, QUERY_PROCESSING, NORMALIZATION, ADAPTERS, TOOLBOX) |
| C++ modules | [docs/claude/cpp/](docs/claude/cpp/) (CACHING, VERSIONING, STORAGE_BACKENDS, ENTITY, CODEC, COLUMN_STORE, PIPELINE, PROCESSING, STREAM, ASYNC, PYTHON_BINDINGS, C_BINDINGS, ARROW) |
| Python modules | [docs/claude/python/](docs/claude/python/) (ARCTIC_CLASS, LIBRARY_API, NATIVE_VERSION_STORE, QUERY_PROCESSING, NORMALIZATION, ADAPTERS, TOOLBOX, DUCKDB) |

## User-Specific Settings

@@ -72,6 +85,11 @@ git submodule update --init --recursive
ARCTICDB_PROTOC_VERS=4 CMAKE_BUILD_PARALLEL_LEVEL=16 ARCTIC_CMAKE_PRESET=linux-debug pip install -ve .
```

To install packages which aren't available internally, use the following custom index:
```bash
pip install -i https://repo.prod.m/artifactory/api/pypi/external-pypi/simple/ hypothesis==6.72.4
```

### Building a Wheel

```bash
@@ -146,26 +164,31 @@ cpp/out/<preset>-build/arcticdb/test_unit_arcticdb --gtest_filter="TestSuite.Tes
## Running Python Tests

```bash
# Run all tests
python -m pytest python/tests
# Run all tests (use -n for parallel execution via pytest-xdist)
python -m pytest -n 8 python/tests

# Run a single test file
python -m pytest python/tests/unit/arcticdb/test_arctic.py

# Run a specific test
python -m pytest python/tests/unit/arcticdb/test_arctic.py::test_function_name

# Run tests in a subdirectory in parallel
python -m pytest -n 8 python/tests/unit/arcticdb/version_store/duckdb/
```

## Benchmarking

**IMPORTANT: Always use a release build for benchmarking.** Debug builds have 10-30x overhead from disabled optimizations, assertions, and unoptimized template instantiation (e.g. sparrow/Arrow type system). Use `ARCTIC_CMAKE_PRESET=linux-release` for both C++ and Python benchmarks.

### C++ Benchmarks (Google Benchmark)

```bash
cmake -DTEST=ON --preset <preset> cpp
cmake --build cpp/out/<preset>-build --target benchmarks
cmake -DTEST=ON --preset linux-release cpp
cmake --build cpp/out/linux-release-build --target benchmarks

# Run specific benchmarks
cpp/out/<preset>-build/arcticdb/benchmarks --benchmark_filter=<regex>
cpp/out/linux-release-build/arcticdb/benchmarks --benchmark_filter=<regex> --benchmark_time_unit=ms
```

Benchmark sources are in `cpp/arcticdb/*/test/benchmark_*.cpp`.
@@ -174,31 +197,95 @@ Benchmark sources are in `cpp/arcticdb/*/test/benchmark_*.cpp`.

ASV benchmarks live in `python/benchmarks/`. Running them requires `asv` and `virtualenv` to be installed.

**Ensure the active virtualenv has a release build installed** before running ASV benchmarks:
```bash
cd python
python -m asv run -v --show-stderr HEAD^! # Benchmark current commit
python -m asv run -v --show-stderr --bench <regex> # Run subset matching regex
python -m asv run --python=$(which python) -v # Use current env (faster)
ARCTICDB_PROTOC_VERS=4 CMAKE_BUILD_PARALLEL_LEVEL=16 ARCTIC_CMAKE_PRESET=linux-release pip install -ve .
```

**First-time setup** — register the machine (one-off):
```bash
asv machine --yes
```

**Run from the repo root** (not `python/`):
```bash
# Run a specific benchmark suite against the current environment (fastest — no rebuild)
asv run --python=$(which python) -v --show-stderr --bench BasicFunctions

# Run all benchmarks
asv run --python=$(which python) -v --show-stderr

# Run benchmarks matching a regex
asv run --python=$(which python) -v --show-stderr --bench "QueryBuilder|Resample"
```

Note: `--python=$(which python)` uses the active virtualenv directly, avoiding a full wheel build. Do **not** combine this with a commit range (`HEAD^!`) — they are mutually exclusive.
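
The constraint in the note above can be enforced mechanically. The following is a minimal sketch, assuming a small wrapper is wanted around `asv run`; `build_asv_command` is a hypothetical helper, and it uses `sys.executable` as a stand-in for `$(which python)`, which is normally the same interpreter inside an active virtualenv. Only flags already shown above are used.

```python
import shlex
import sys
from typing import List, Optional


def build_asv_command(
    bench: Optional[str] = None,
    commit_range: Optional[str] = None,
    use_current_env: bool = True,
) -> List[str]:
    """Build an `asv run` argument list (hypothetical helper)."""
    # Per the note above, the current-environment mode and a commit
    # range are mutually exclusive, so reject the combination early.
    if use_current_env and commit_range is not None:
        raise ValueError("--python=<interpreter> cannot be combined with a commit range")
    cmd = ["asv", "run", "-v", "--show-stderr"]
    if use_current_env:
        cmd.append(f"--python={sys.executable}")
    if bench is not None:
        cmd += ["--bench", bench]
    if commit_range is not None:
        cmd.append(commit_range)
    return cmd


if __name__ == "__main__":
    # Print a shell-quoted command line for a regex-selected subset.
    print(shlex.join(build_asv_command(bench="QueryBuilder|Resample")))
```

Returning an argument list rather than a string keeps the regex (which contains `|`) safe from shell interpretation if the list is later passed to `subprocess.run`.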

**Available benchmark suites**: `BasicFunctions`, `Arrow`, `QueryBuilder`, `Resample`, `ModificationFunctions`, `ListSymbols`, `ListVersions`, `ListSnapshots`, `VersionChain`, `RecursiveNormalizer`, `FinalizeStagedData`, `SQLQueries`, `SQLStreamingMemory`, `SQLLargeGroupBy`, `SQLFilteringMemory`, `SQLWideTableDateRange`, `LazyReadThroughput`, `LazyReadWithOptions`, `LazyReadWithClauses`, `ChunkedOutputDownstream`.

By default only LMDB storage is tested. Set `ARCTICDB_STORAGE_AWS_S3=1` with appropriate credentials to include S3. Set `ARCTICDB_SLOW_TESTS=1` for additional slow benchmarks.

See: [ASV Benchmarks Wiki](https://github.com/man-group/ArcticDB/wiki/Dev:-ASV-Benchmarks)

## Key Development Guidelines

### Test-Driven Development

**Every code change must be accompanied by a failing test that the change fixes.** This ensures:
- The bug or missing feature is properly understood before fixing
- The fix actually addresses the issue
- Regressions are caught if the code is modified later

When fixing a bug or adding a feature:
1. Write a test that demonstrates the bug or missing functionality
2. Verify the test fails
3. Implement the fix
4. Verify the test passes
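
The four steps above can be sketched as follows. Everything here is hypothetical: `normalize_symbol` and its "buggy" predecessor are invented for illustration and are not ArcticDB functions. The point is only the ordering, with the failing expectation written before the fix.

```python
def normalize_symbol_buggy(name: str) -> str:
    # Behaviour before the fix: lower-cases but forgets to strip whitespace.
    return name.lower()


def normalize_symbol(name: str) -> str:
    # Step 3: the fix, written only after the check below was seen to fail.
    return name.strip().lower()


def check_strips_whitespace(fn) -> bool:
    # Steps 1-2: the expectation that demonstrates the bug.
    return fn("  AAPL ") == "aapl"


def test_normalize_symbol_strips_whitespace():
    # Step 4: under pytest, this is the test that should now pass.
    assert check_strips_whitespace(normalize_symbol)
```

In a real change only the test and the fixed function would be committed; the buggy version is kept here so that the failing state is visible side by side.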

### Git Workflow

**Always confirm with the developer before committing and pushing changes upstream.** Do not assume that passing tests means the changes are ready for review. The developer may want to:
- Review the implementation approach
- Make additional changes or refinements
- Squash or reorganize commits
- Add to the commit message or PR description

Wait for explicit confirmation like "commit and push" or "looks good, push it" before pushing to remote.

### Branch Work Logs

When working on a feature branch, maintain a work log in `docs/claude/plans/<branch-name>/branch-work-log.md`. Update it at the end of each task with a few bullet points summarizing what was done. This provides continuity across sessions and helps with PR descriptions.
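
One possible way to keep such a log consistent is a small append-only helper. This is a sketch under stated assumptions: `append_work_log` is a hypothetical function, and the dated `##` heading per entry is an illustrative layout choice, not something the guidance above prescribes. Only the path `docs/claude/plans/<branch-name>/branch-work-log.md` comes from the text.

```python
from datetime import date
from pathlib import Path
from typing import Iterable


def append_work_log(repo_root: Path, branch: str, bullets: Iterable[str]) -> Path:
    """Append a dated entry to the branch work log (hypothetical helper)."""
    log_path = repo_root / "docs" / "claude" / "plans" / branch / "branch-work-log.md"
    # Create the per-branch directory on first use.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    lines = [f"## {date.today().isoformat()}", ""]
    lines += [f"- {bullet}" for bullet in bullets]
    lines.append("")
    # Open in append mode so earlier entries are never rewritten.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
    return log_path
```

Note that a branch name containing `/` (for example `feature/foo`) would produce nested directories with this sketch; whether that is desirable is left open.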

### Backwards Compatibility

- Data written by newer clients should be readable by older clients - document breaking changes clearly
- API changes affecting V1 or V2 public APIs must be highlighted in PR descriptions

### Code Style

Code style is enforced by `./build_tooling/format.py`. **Always run the formatter after making code changes:**
Code style is enforced by `./build_tooling/format.py`. **Always run the formatter after making code changes, but only on files changed on the branch:**

```bash
# Format all code
python ./build_tooling/format.py --in-place --type all
# Format only files changed on the branch
git diff --name-only origin/master..HEAD -- '*.py' | xargs -r -n1 python ./build_tooling/format.py --in-place --type python --file
git diff --name-only origin/master..HEAD -- '*.cpp' '*.hpp' | xargs -r -n1 python ./build_tooling/format.py --in-place --type cpp --file
```
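
For environments where `xargs -r` is unavailable, the same selection logic could be expressed in Python. This is a sketch, not part of the repository: `formatter_commands` and `changed_files_on_branch` are hypothetical helpers, and the extension-to-type mapping is simply read off the two shell commands above. The `format.py` flags are the ones already shown.

```python
import subprocess
from pathlib import PurePosixPath
from typing import Dict, List

# Extension-to-type mapping taken from the two shell commands above.
FORMAT_TYPES: Dict[str, str] = {".py": "python", ".cpp": "cpp", ".hpp": "cpp"}


def formatter_commands(changed_files: List[str]) -> List[List[str]]:
    """Return one format.py invocation per changed file with a known extension."""
    commands = []
    for path in changed_files:
        file_type = FORMAT_TYPES.get(PurePosixPath(path).suffix)
        if file_type is None:
            continue  # not a file type covered by this sketch
        commands.append(
            ["python", "./build_tooling/format.py", "--in-place",
             "--type", file_type, "--file", path]
        )
    return commands


def changed_files_on_branch(base: str = "origin/master") -> List[str]:
    """List files changed between `base` and HEAD using git."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}..HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]
```

A caller would run each command from `formatter_commands(changed_files_on_branch())` with `subprocess.run(cmd, check=True)`. Like the shell version, this sketch does not skip files deleted on the branch.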


## Code Review

When reviewing changes on a branch before submitting upstream, see **[docs/claude/skills/code-review.md](docs/claude/skills/code-review.md)** for detailed instructions covering:

- C++ memory safety (Rule of Five, Arrow C Data Interface, RAII)
- Python code quality (exception handling, duplicate code, state management)
- Test coverage analysis (happy path, error handling, edge cases, parameter coverage)
- Error handling review (fail fast, helpful messages, exception types)
- Type handling (numeric, temporal, string, complex types)
- Documentation and performance considerations

Use sub-agents to review in parallel. Write findings to `docs/claude/plans/` for tracking.


### Git Commits

- Do not add "Generated with AI" or "Co-Authored-By" lines to commit messages
72 changes: 72 additions & 0 deletions cpp/arcticdb/CMakeLists.txt
@@ -416,6 +416,7 @@ set(arcticdb_srcs
util/type_traits.hpp
util/variant.hpp
version/de_dup_map.hpp
version/lazy_read_helpers.hpp
version/op_log.hpp
version/schema_checks.hpp
version/snapshot.hpp
@@ -571,6 +572,7 @@ set(arcticdb_srcs
util/format_date.cpp
version/key_block.hpp
version/key_block.cpp
version/lazy_read_helpers.cpp
version/local_versioned_engine.cpp
version/schema_checks.cpp
version/op_log.cpp
@@ -881,6 +883,42 @@ target_compile_definitions(arcticdb_core PUBLIC PCRE2_CODE_UNIT_WIDTH=0 ENTT_ID_

GENERATE_EXPORT_HEADER(arcticdb_core)

## C API shared library (language bindings) ##
# arcticdb_core_static includes pybind11 code that references Python symbols.
# Link against libpython to resolve them (they are never called through the C API path,
# but static constructors in the core library reference them during dlopen).
find_package(Python3 COMPONENTS Development QUIET)

add_library(arcticdb_c SHARED bindings/arcticdb_c.cpp)

target_link_libraries(arcticdb_c
PRIVATE
arcticdb_core_static
${arcticdb_core_libraries}
${AWSSDK_LINK_LIBRARIES}
arcticdb_core_static
${AWSSDK_LINK_LIBRARIES}
)

if(Python3_FOUND)
target_link_libraries(arcticdb_c PRIVATE Python3::Python)
endif()

target_include_directories(arcticdb_c PRIVATE
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}>
$<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}>
$<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}/../>
$<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}/../proto/arcticc/pb2/proto/>
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/..>
${arcticdb_core_includes}
)

if(NOT ${ARCTICDB_USING_CONDA})
target_include_directories(arcticdb_c PRIVATE ${THIRD_PARTY_INCLUDE_DIRS})
endif()

target_compile_definitions(arcticdb_c PRIVATE PCRE2_CODE_UNIT_WIDTH=0 ENTT_ID_TYPE=std::uint64_t ARCTICDB_C_BUILDING)

## Core python bindings, private only ##
set(arcticdb_python_srcs
async/python_bindings.cpp
@@ -1006,6 +1044,7 @@ if(${TEST})
arrow/test/arrow_test_utils.cpp
arrow/test/test_arrow_read.cpp
arrow/test/test_arrow_write.cpp
arrow/test/test_lazy_record_batch_iterator.cpp
async/test/test_async.cpp
codec/test/test_codec.cpp
codec/test/test_encode_field_collection.cpp
@@ -1091,6 +1130,7 @@ if(${TEST})
util/test/input_frame_utils.hpp
util/test/segment_generation_utils.hpp
util/test/segment_generation_utils.cpp
version/test/test_lazy_read_helpers.cpp
version/test/test_append.cpp
version/test/test_key_block.cpp
version/test/test_sort_index.cpp
@@ -1197,6 +1237,7 @@ if(${TEST})
arrow/test/arrow_test_utils.cpp
arrow/test/benchmark_arrow_reads.cpp
arrow/test/benchmark_arrow_writes.cpp
arrow/test/benchmark_lazy_iterator.cpp
column_store/test/benchmark_chunked_buffer.cpp
column_store/test/benchmark_column.cpp
column_store/test/benchmark_memory_segment.cpp
@@ -1322,4 +1363,35 @@ if(${TEST})
${BASE_PCH}
)
endif()

## C API smoke tests ##
# Tests link against arcticdb_c (the shared library under test) plus sparrow
# (for ArrowArray/ArrowSchema type definitions). The executable linker requires
# all transitive dependencies to be resolvable, hence Python and AWS.
set(C_API_TEST_LIBS
arcticdb_c
sparrow::sparrow
Python::Python
${AWSSDK_LINK_LIBRARIES}
)

add_executable(test_c_api_smoke bindings/test_c_api_smoke.cpp)
target_link_libraries(test_c_api_smoke PRIVATE ${C_API_TEST_LIBS})
target_include_directories(test_c_api_smoke PRIVATE
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}>
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/..>
)

add_executable(test_c_api_stream_smoke bindings/test_c_api_stream_smoke.cpp)
target_link_libraries(test_c_api_stream_smoke
PRIVATE
${C_API_TEST_LIBS}
GTest::gtest
GTest::gtest_main
)
target_include_directories(test_c_api_stream_smoke PRIVATE
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}>
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/..>
)
gtest_discover_tests(test_c_api_stream_smoke PROPERTIES DISCOVERY_TIMEOUT 60)
endif()