A leakage-aware diagnostic study of whether LOB reconstruction quality transfers to downstream mid-price trend prediction under chronological evaluation.
This repository is an independent diagnostic PoW built around LOBench-style processed A-share data and uses LOBench as an upstream reference; it is not a fork, full reproduction, or official extension of LOBench.
Headline result: naive window-level random splitting produces materially higher macro-F1 under full near-neighbor exposure, but the tuned raw-window logistic gain almost disappears under blocked random splitting with embargo.
The project emphasizes evaluation validity, reproducible experiment design, and conservative interpretation rather than model scale or trading profitability.
Financial LOB windows are highly overlapping. If train/test splits are randomized at the window level, downstream prediction metrics may be inflated by near-neighbor exposure. This project uses LOBench-style processed A-share data to test reconstruction-prediction alignment under stricter split protocols.
- Split protocol matters. Naive random window splitting creates full train/test near-neighbor exposure and materially higher macro-F1, while blocked random with embargo stays close to chronological performance.
- Reconstruction quality is not a standalone downstream proxy. The best full-window reconstruction variant is pca@128, but the best validation-selected frozen-head predictor is last_snapshot_repeat@40.
- The claim is intentionally narrow. The evidence is limited to sz000001, trend5, and one stride-4 subset. This is a diagnostic PoW, not a full LOBench reproduction, SOTA claim, trading PnL study, or general market prediction claim.
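To make the near-neighbor exposure argument concrete, here is a minimal sketch that measures how many test windows share at least one snapshot with a train window. It assumes windows are identified by their start index in snapshots, uses the window length (100) and stride (4) from the setup, and runs on a synthetic index range rather than the real data. The function name `overlap_risk` is hypothetical and is not claimed to match the repository's audit code.

```python
import bisect
import random

WINDOW = 100  # window length in snapshots
STRIDE = 4    # sample stride in snapshots

# Windows [s, s + WINDOW) overlap iff their starts differ by less than WINDOW,
# so each window overlaps this many neighbors on each side.
NEIGHBORS_PER_SIDE = (WINDOW - 1) // STRIDE  # 24 for window=100, stride=4


def overlap_risk(train_starts, test_starts, window=WINDOW):
    """Fraction of test windows that overlap at least one train window."""
    if not test_starts:
        return 0.0
    train_sorted = sorted(train_starts)
    exposed = 0
    for s in test_starts:
        # Only the nearest train start on each side can be the closest one.
        i = bisect.bisect_left(train_sorted, s)
        nearest = [train_sorted[j] for j in (i - 1, i) if 0 <= j < len(train_sorted)]
        if any(abs(s - t) < window for t in nearest):
            exposed += 1
    return exposed / len(test_starts)


# Synthetic example: 2000 window starts at stride 4 (not the real dataset).
starts = [i * STRIDE for i in range(2000)]

# Naive window-level random split: shuffle windows, then cut 85/15.
shuffled = starts[:]
random.Random(0).shuffle(shuffled)
naive_train, naive_test = shuffled[:1700], shuffled[1700:]

# Plain chronological cut at the same ratio, with no purge at the boundary.
chrono_train, chrono_test = starts[:1700], starts[1700:]

print("naive random overlap risk :", overlap_risk(naive_train, naive_test))
print("chronological overlap risk:", overlap_risk(chrono_train, chrono_test))
```

In this toy setting, nearly every test window in the naive split has an overlapping train neighbor, while an unpurged chronological cut exposes only the windows right after the boundary.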
| Field | Setting |
|---|---|
| Dataset source | External LOBench A-share processed data |
| Symbol | sz000001 |
| Label | trend5 |
| Window | 100 |
| Feature dimension | 40 |
| Sample stride | 4 |
| Conservative baseline split | Boundary-purged chronological 70/15/15 |
| Split diagnostics | random_window_naive, random_block_purged, chronological_no_purge |
| Samples | 7952 |
| Train / val / test | 5600 / 1200 / 1152 |
| Data policy | External data, generated tensors, checkpoints, and latent arrays are not committed |
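As an illustration of what a boundary-purged chronological split can look like, the sketch below cuts window indices 70/15/15 in time order and drops the windows just after each boundary that would still overlap the earlier split. The purge rule here is an assumption chosen for clarity; the function name is hypothetical, and the sketch does not attempt to reproduce the repository's exact purge rule or its train/val/test counts.

```python
def purged_chronological_split(n_samples, window=100, stride=4,
                               fractions=(0.70, 0.15, 0.15)):
    """Return (train, val, test) lists of window indices in time order.

    Assumed purge rule: after each boundary, drop the first `purge` windows
    of the later split, because they share snapshots with the last window
    of the earlier split.
    """
    purge = (window - 1) // stride  # windows overlapping a boundary window
    n_train = round(n_samples * fractions[0])
    n_val = round(n_samples * fractions[1])

    train = list(range(0, n_train))
    val = list(range(n_train + purge, n_train + n_val))
    test = list(range(n_train + n_val + purge, n_samples))
    return train, val, test


# Synthetic example with 1000 windows (not the real sample count).
train, val, test = purged_chronological_split(1000)
print(len(train), len(val), len(test))

# With the purge, the first val window starts at least `window` snapshots
# after the last train window starts, so they share no snapshot.
assert (val[0] - train[-1]) * 4 >= 100
assert (test[0] - val[-1]) * 4 >= 100
```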
This split-protocol diagnostic asks whether random-split gains come from legitimate temporal distribution mixing or from overlapping-window near-neighbor exposure.
Caption: Naive random window-level splitting has full train/test overlap and k5 near-neighbor exposure in this subset, while blocked random with embargo removes that exposure. The performance panel shows that naive random also raises test macro-F1.
Caption: Most of the naive-random macro-F1 gain disappears when switching from naive window-level randomization to blocked random with embargo, especially for the tuned raw-window logistic control.
Interpretation:
- Naive random split is an optimistic diagnostic protocol, not a recommended evaluation protocol.
- random_window_naive improves tuned raw-window logistic test macro-F1 by 0.0583 over chronological.
- random_block_purged improves tuned raw-window logistic test macro-F1 by only 0.0004 over chronological.
- The naive-vs-blocked gap is therefore the main evidence that the naive-random gain is mostly near-neighbor exposure in this subset.
Note: this is a lightweight protocol rerun. Its absolute metrics should be interpreted through within-step contrasts rather than as replacements for the transfer-audit headline metrics. chronological_no_purge is a no-extra-purge diagnostic on the existing kept sample universe; it does not restore boundary-dropped samples.
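The blocked random protocol can be sketched as follows: randomize at the level of contiguous blocks of windows instead of individual windows, then embargo the train windows adjacent to each test block. This is a minimal sketch under stated assumptions: the block size, the embargo width, the train/test-only layout (no validation split), and the function name are all illustrative choices, not the repository's random_block_purged implementation.

```python
import random


def blocked_random_split(n_samples, block_size, embargo,
                         test_fraction=0.15, seed=0):
    """Randomly assign contiguous blocks of window indices to test,
    then drop train windows within `embargo` positions of any test block."""
    rng = random.Random(seed)
    blocks = [list(range(s, min(s + block_size, n_samples)))
              for s in range(0, n_samples, block_size)]
    rng.shuffle(blocks)

    n_test_blocks = max(1, round(len(blocks) * test_fraction))
    test_blocks = blocks[:n_test_blocks]
    test = sorted(i for block in test_blocks for i in block)
    test_set = set(test)

    # Embargo zone: indices just before and just after each test block.
    embargoed = set()
    for block in test_blocks:
        lo, hi = block[0], block[-1]
        embargoed.update(range(max(0, lo - embargo), lo))
        embargoed.update(range(hi + 1, min(n_samples, hi + 1 + embargo)))

    train = [i for i in range(n_samples)
             if i not in test_set and i not in embargoed]
    return train, test


# Synthetic example: window=100 and stride=4 give 24 overlapping neighbors
# per side, so an embargo of 24 indices removes all train/test overlap.
embargo = (100 - 1) // 4
train, test = blocked_random_split(2000, block_size=100, embargo=embargo)
print(len(train), len(test))
```

Because every surviving train index is more than 24 positions away from every test index, no train window shares a snapshot with a test window, which is the property that separates this protocol from naive window-level shuffling.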
| Evidence | Result | Caveat |
|---|---|---|
| Prediction baseline | Step 5 logistic regression, test macro-F1 0.3972 | Fixed-C raw-window baseline |
| Tuned raw-window logistic control | test macro-F1 0.3904, selected C=0.1 | Selected by validation macro-F1 |
| Validation-selected frozen latent head | last_snapshot_repeat@40, test macro-F1 0.4355 | Selected by validation macro-F1, not test |
| Paired bootstrap, validation-selected latent vs tuned raw | macro-F1 delta 0.0452, 95% CI [0.0082, 0.0799], fraction_delta_gt_0=0.9930 | Descriptive, not fully pre-registered |
| Best reconstruction variant | pca@128, test normalized MSE 0.1838 | Not the best prediction variant across all variants |
| Rank sensitivity | excluding last_snapshot_repeat@40 makes pca@128 both reconstruction-best and prediction-best | Broad rank-mismatch claim is partial |
| Split integrity audit | naive random overlap risk 1.0000; blocked random overlap risk 0.0000 | Random split result is diagnostic |
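For readers unfamiliar with the paired bootstrap row, the following sketch shows one common way to compute a macro-F1 delta with a percentile interval: resample test indices with replacement, score both predictors on the same resample, and collect the differences. It is a generic illustration on toy labels, written with the standard library only. The function names, the number of resamples, and the percentile indexing are assumptions, and the repository's own procedure may differ. Note also that i.i.d. resampling ignores the temporal dependence between overlapping windows, which is one reason to read such intervals as descriptive.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no support and no
    predictions contributes 0.0."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap_delta(y_true, pred_a, pred_b, labels, n_boot=1000, seed=0):
    """Percentile 95% interval and fraction of resamples with delta > 0
    for macro_f1(pred_a) - macro_f1(pred_b) on shared resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same indices for both
        yt = [y_true[i] for i in idx]
        f1_a = macro_f1(yt, [pred_a[i] for i in idx], labels)
        f1_b = macro_f1(yt, [pred_b[i] for i in idx], labels)
        deltas.append(f1_a - f1_b)
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[min(n_boot - 1, int(0.975 * n_boot))]
    fraction_delta_gt_0 = sum(d > 0 for d in deltas) / n_boot
    return lo, hi, fraction_delta_gt_0


# Toy three-class example (made-up labels, not project results).
y_true = [0, 1, 2] * 20
pred_a = list(y_true)        # a perfect predictor
pred_b = [0] * len(y_true)   # always predicts class 0
print(paired_bootstrap_delta(y_true, pred_a, pred_b, labels=[0, 1, 2], n_boot=200))
```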
Caption: Step 8 fair-transfer visualization. Step 9 shows the same latent variant is selected by validation macro-F1, so the plotted best-frozen-latent comparison matches the validation-selected comparison in this run.
Caption: Reconstruction rank and frozen-head prediction rank do not perfectly align across all variants, with the strongest mismatch driven by last_snapshot_repeat@40.
Caption: Rank mismatch weakens after excluding last_snapshot_repeat@40, so the mismatch claim is partial rather than general.
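The rank-sensitivity point can be illustrated with a small Spearman rank-correlation check, computed once over all variants and once with a single variant excluded. The variant names and metric values below are made-up placeholders chosen only to show the mechanics; they are not the project's results, and the helper names are hypothetical.

```python
def ranks(values, higher_is_better):
    """Rank 1 = best. Assumes no ties in `values`."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out


def spearman(rank_a, rank_b):
    """Spearman correlation for tie-free ranks: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


def rank_alignment(recon_mse, pred_f1):
    """Correlation between reconstruction rank (lower MSE is better)
    and prediction rank (higher macro-F1 is better)."""
    return spearman(ranks(recon_mse, higher_is_better=False),
                    ranks(pred_f1, higher_is_better=True))


# Placeholder variants: variant_d reconstructs worst but predicts best.
names = ["variant_a", "variant_b", "variant_c", "variant_d"]
recon_mse = [0.20, 0.30, 0.40, 0.90]
pred_f1 = [0.41, 0.40, 0.39, 0.44]

print("all variants        :", rank_alignment(recon_mse, pred_f1))
print("excluding variant_d :", rank_alignment(recon_mse[:3], pred_f1[:3]))
```

In this toy case a single outlying variant flips the correlation from perfectly aligned to slightly negative, which is the kind of sensitivity that makes a broad rank-mismatch claim only partial.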
| File | Purpose |
|---|---|
| technical_memo.md | Final interpretation and limitations |
| protocol_contrasts.csv | Split-protocol contrast |
| split_integrity_audit.csv | Overlap and near-neighbor exposure audit |
| fair_transfer_comparison.csv | Validation-selected transfer comparison |
Supporting docs and full artifact references are listed in artifact_index.md.
Reproduction commands are collected in reproduction_guide.md. The pipeline requires the external processed A-share dataset locally. Raw data, generated tensors, checkpoints, and latent arrays are not committed.
| Boundary | Current status |
|---|---|
| Symbol coverage | One symbol, sz000001 |
| Horizon coverage | One label horizon, trend5 |
| Sampling protocol | One stride-4 subset |
| Split protocol | Chronological, naive random, blocked random, and no-purge diagnostics on the same subset |
| Multi-symbol robustness | Not evaluated |
| Multi-horizon robustness | Not evaluated |
| Regime / failure-case diagnostics | Deferred to future work as Step 11; not part of the current evidence chain |
| Trading PnL | Not evaluated |
| Best frozen latent head | Validation-selected in Step 9; candidate set fixed by earlier steps |
| Bootstrap comparison | Descriptive, not fully pre-registered confirmatory evidence |
The next natural extension is a regime and failure-case diagnostic layer, deferred here as Step 11 rather than included in the current evidence chain. It should slice errors by interpretable market-state variables such as spread, volatility, liquidity, imbalance validity, and top-of-book movement, then test whether the reconstruction-prediction mismatch is concentrated in specific regimes. That extension should be treated as explanatory analysis, not as new primary evidence, until repeated across additional symbols and horizons.
- src/data/: data loading, field mapping, labels, subset construction.
- src/models/: prediction and reconstruction baseline models.
- src/analysis/: metrics and diagnostic utilities.
- scripts/: runnable stage entry points.
- docs/: protocol, artifact index, and reproduction notes.
- results/: committed result and audit artifacts.
- figures/: plots and visual diagnostics.




