145 changes: 145 additions & 0 deletions ts/examples/snipsBench/README.md
@@ -0,0 +1,145 @@
<!-- Copyright (c) Microsoft Corporation.
Licensed under the MIT License. -->

# snips-bench

Benchmarking the **action-grammar** engine on the [SNIPS](data/SOURCE.md)
natural-language-understanding task (intent + slot filling), and using it to ask
a focused question:

> Does cheap, finite-state "parsing" — coarse POS / noun-phrase chunking —
> improve grammar-based slot filling?

The short answer, with evidence below: **no, not as a slot-boundary signal.**
The lever that matters is _carrier-phrase coverage_, which is obtained far more
cheaply by **inducing the grammar from data** than by hand-authoring it — and
induction also subsumes the boundary logic that POS typing was meant to provide.

## Run

```bash
pnpm --filter snips-bench build
node dist/main.js test # M2 (hand-authored) + M3 (induced), minFreq=2
node dist/main.js test 1 # M3 induced with minFreq=1
```

Every run first proves the harness is sound: a scorer self-test (hand-computed
P/R/F1) and an **oracle** (gold-as-pred) that must score 100/100. All reported
numbers are CoNLL entity-level slot F1, evaluated per intent on its gold test
subset (intent given) and micro-pooled across intents.
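
The scoring described above can be sketched as follows. This is a minimal illustration, not the actual `src/score.ts`: the names `bioToSpans`, `countMatches`, `prf`, and `pool` are hypothetical, and the sketch assumes a span counts as correct only when label, start, and end all match exactly.

```typescript
// Hypothetical helper names; the real scorer's API is not shown in this README.
type Span = { label: string; start: number; end: number }; // end is exclusive
type Counts = { tp: number; pred: number; gold: number };

// Convert a BIO tag sequence into labeled spans. An I- tag that does not
// continue a span of the same label starts a new span.
function bioToSpans(tags: string[]): Span[] {
  const spans: Span[] = [];
  let current: Span | null = null;
  for (let i = 0; i < tags.length; i++) {
    const tag = tags[i];
    const label = tag === "O" ? "" : tag.slice(2);
    if (tag.slice(0, 2) === "I-" && current !== null && current.label === label) {
      current.end = i + 1; // extend the open span
    } else {
      if (current !== null) spans.push(current);
      current = tag === "O" ? null : { label, start: i, end: i + 1 };
    }
  }
  if (current !== null) spans.push(current);
  return spans;
}

// Entity-level matching: a predicted span is a true positive only on exact match.
function countMatches(gold: Span[], pred: Span[]): Counts {
  const key = (s: Span) => `${s.label}:${s.start}:${s.end}`;
  const goldKeys = new Set(gold.map(key));
  const tp = pred.filter((s) => goldKeys.has(key(s))).length;
  return { tp, pred: pred.length, gold: gold.length };
}

// Micro-pooling: sum the raw counts first, then compute P/R/F1 once.
function pool(all: Counts[]): Counts {
  return all.reduce(
    (a, c) => ({ tp: a.tp + c.tp, pred: a.pred + c.pred, gold: a.gold + c.gold }),
    { tp: 0, pred: 0, gold: 0 },
  );
}

function prf(c: Counts): { p: number; r: number; f1: number } {
  const p = c.pred === 0 ? 0 : c.tp / c.pred;
  const r = c.gold === 0 ? 0 : c.tp / c.gold;
  const f1 = p + r === 0 ? 0 : (2 * p * r) / (p + r);
  return { p, r, f1 };
}

// Oracle check: scoring gold against itself must give F1 = 1.
const goldTags = ["O", "B-artist", "I-artist", "O", "B-playlist"];
const oracle = prf(countMatches(bioToSpans(goldTags), bioToSpans(goldTags)));
console.log(oracle.f1);
```

Pooling counts before dividing is what makes the result a micro-average: intents with more gold spans weigh more than they would under a per-intent mean.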

## The three arms

The experiment holds the grammar fixed and varies only the _slot wildcard type_:

| arm | slot capture | boundary rule |
| ------------- | --------------------------- | ------------------------------------------------------------------------------------ |
| `wildcard` | greedy, unbounded | stops at the next literal anchor in the grammar |
| `NP` | greedy, per-token validated | stops at the first **function word** (strict content/function NP) |
| `title-aware` | `wildcard` capture, refined | keep leading determiners + medial glue; trim **trailing** glue + structural keywords |

`NP` is a first-class entity type registered on the engine (`npEntity.ts`),
backed by a closed-class lexicon + suffix tagger (`pos.ts`). `title-aware` is a
positional post-pass (`refine.ts`) — the rule a position-aware, engine-integrated
bounded wildcard would enforce.
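
To make the three boundary rules concrete, here is a rough sketch over plain token arrays. It does not use the engine, `npEntity.ts`, or `refine.ts`; the function names are hypothetical, the function-word list is a tiny illustrative stand-in for the real lexicon, and the wildcard is simplified to stop at the first occurrence of a single anchor token.

```typescript
// Illustrative closed-class list only; the actual lexicon in pos.ts is larger.
const FUNCTION_WORDS = new Set([
  "the", "a", "an", "of", "to", "by", "in", "on", "for", "is", "this", "my", "and",
]);

// `wildcard` arm (simplified): capture until the next literal anchor token.
function captureWildcard(tokens: string[], start: number, anchor: string | null): string[] {
  const out: string[] = [];
  for (let i = start; i < tokens.length && tokens[i] !== anchor; i++) {
    out.push(tokens[i]);
  }
  return out;
}

// `NP` arm: validate token by token and stop at the first function word.
function captureNP(tokens: string[], start: number): string[] {
  const out: string[] = [];
  for (let i = start; i < tokens.length && !FUNCTION_WORDS.has(tokens[i]); i++) {
    out.push(tokens[i]);
  }
  return out;
}

// `title-aware` arm: start from the wildcard capture and trim only from the
// end, so leading determiners and medial glue survive. At least one token is kept.
function refineTitleAware(captured: string[], structuralKeywords: Set<string>): string[] {
  let end = captured.length;
  while (
    end > 1 &&
    (FUNCTION_WORDS.has(captured[end - 1]) || structuralKeywords.has(captured[end - 1]))
  ) {
    end--;
  }
  return captured.slice(0, end);
}

const utterance = "add don't drink the water to my playlist".split(" ");
console.log(captureWildcard(utterance, 1, "to")); // the full title
console.log(captureNP(utterance, 1)); // cut short at "the"
```

On a title such as "this is selena", `captureNP` returns nothing at all because the first token is already a function word, which is the recall failure the results below quantify.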

## Results

### Hand-authored grammars (M2)

| arm | pooled slot F1 | precision | recall |
| ------------- | -------------- | --------- | ------ |
| `wildcard` | **23.4** | 47.4 | 15.5 |
| `NP` | 11.6 | 60.5 | 6.4 |
| `title-aware` | 23.5 | 47.8 | 15.6 |

### Induced grammars (M3, template induction from the train split)

| arm | pooled slot F1 | precision | recall |
| ------------- | -------------- | --------- | ------ |
| `wildcard` | **35.3** | 42.7 | 30.1 |
| `NP` | 13.5 | 55.9 | 7.7 |
| `title-aware` | 34.0 | 41.9 | 28.5 |

Per-intent coverage jumps sharply under induction (SearchCreativeWork 66→93%,
RateBook 22→91%, BookRestaurant 6.5→51%, GetWeather 54→78%), and induction
**learns slot labels from carrier context** — GetWeather, whose city/state/country
labels need world knowledge a hand grammar can't supply, rises from F1 4.0 → 26.7.

### Coverage / threshold sweep (induced, `wildcard`)

| minFreq | pooled F1 | recall |
| ------- | --------- | ------ |
| 1 | **37.3** | 33.0 |
| 2 | 35.3 | 30.1 |
| 3 | 32.7 | 26.4 |
| 5 | 28.6 | 21.9 |

F1 tracks recall, which tracks how many carrier-phrase templates are kept. The
`wildcard > title-aware > NP` ordering is identical at every threshold.
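
A minimal sketch of template induction with a frequency threshold, assuming well-formed BIO tags and ignoring the per-intent grouping. It is not the code in `src/induce.ts`; `delexicalize` and `induceTemplates` are hypothetical names, and the `{label}` placeholder syntax is chosen only for illustration.

```typescript
type Example = { tokens: string[]; tags: string[] };

// Replace each slot span with a {label} placeholder, keeping carrier words.
function delexicalize(ex: Example): string {
  const out: string[] = [];
  for (let i = 0; i < ex.tokens.length; i++) {
    const tag = ex.tags[i];
    if (tag === "O") {
      out.push(ex.tokens[i]);
    } else if (tag.slice(0, 2) === "B-") {
      out.push(`{${tag.slice(2)}}`);
    }
    // I- tokens fold into the placeholder emitted for their B- tag.
  }
  return out.join(" ");
}

// Keep every template seen at least minFreq times in the training data.
function induceTemplates(train: Example[], minFreq: number): string[] {
  const counts = new Map<string, number>();
  for (const ex of train) {
    const template = delexicalize(ex);
    counts.set(template, (counts.get(template) ?? 0) + 1);
  }
  const kept: string[] = [];
  counts.forEach((n, template) => {
    if (n >= minFreq) kept.push(template);
  });
  return kept;
}

const train: Example[] = [
  { tokens: ["add", "selena", "to", "workout", "mix"], tags: ["O", "B-artist", "O", "B-playlist", "I-playlist"] },
  { tokens: ["add", "the", "beatles", "to", "chill"], tags: ["O", "B-artist", "I-artist", "O", "B-playlist"] },
  { tokens: ["put", "selena", "on", "chill", "playlist"], tags: ["O", "B-artist", "O", "B-playlist", "O"] },
];
console.log(induceTemplates(train, 2));
console.log(induceTemplates(train, 1));
```

Raising `minFreq` drops the singleton `put {artist} on {playlist} playlist` template in this toy set, which mirrors how the sweep trades kept templates for recall.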

## What this shows

1. **POS / NP typing does not help slot boundaries here.** Strict `NP` has the
_highest precision_ in every setting (cleaner boundaries when it fires) but
_collapses recall_ — real slot values are titles and names full of function
words ("this is selena", "don't drink the water"), which a content/function
rule chokes on. It is a large net loss.

2. **Title-aware refinement is neutral-to-negative.** It is meant to recover
NP's precision without the recall cost, but precision barely moves (47.8 vs.
47.4 for `wildcard` in M2) and it only helps title-heavy slots; in the induced
regime it slightly _hurts_ (F1 34.0 vs. 35.3), because the learned templates
already encode the trailing-keyword / boundary variation it tries to hard-code
(e.g. both `… {playlist}` and `… {playlist} playlist` are learned as separate
templates).

3. **The boundary work is done by anchors, not syntax.** In a hand grammar the
literal carrier words ("to", "by", "out of") already pin boundaries; in an
induced grammar the learned carrier phrases do. POS typing is redundant with
anchors, which is why it never wins.

4. **Induction beats hand-authoring decisively** (pooled `wildcard` F1 23.4 →
35.3 at minFreq=2, about +12) and is cheaper. Each delexicalized template
generalizes over slot _fillers_, so even rarely-seen carrier phrases are
useful (minFreq=1 is best). The remaining ceiling is **recall**: exact
carrier-phrase matching is brittle, so gold spans in test utterances whose
phrasing wasn't seen in train (about 67% of gold spans even at minFreq=1,
where recall is 33.0) are not recovered.

## Implication: toward a lightweight fast translation model

The bottleneck is **generalizing surface variation to templates**, not syntax.
A lightweight, fast NL→action model should therefore invest in:

- **Soft/fuzzy carrier-phrase matching** (paraphrase, optional glue, word order)
to lift the recall ceiling that exact templates hit — this is where the real
headroom is, not POS features.
- **Open-vocabulary slot filling by context**, with the slot _label_ inferred
from the carrier phrase (as induction already does) rather than from a
gazetteer.
- Treating hand grammars as a _cold-start_ only; **induce from logged
interactions** and let coverage compound.

## Layout

```
data/             SNIPS BIO split (see SOURCE.md)
src/data.ts       loader, BIO ↔ spans
src/score.ts      CoNLL slot F1 + intent accuracy (self-tested)
src/pos.ts        coarse POS: closed-class lexicon + suffix tagger
src/npEntity.ts   NP + Num entity types registered on the engine
src/refine.ts     title-aware positional boundary refinement
src/grammar.ts    hand-authored grammar templates (7 intents)
src/induce.ts     grammar induction by delexicalization
src/runner.ts     compile → matchNFA → extract slots → recover spans
src/main.ts       harness: oracle/self-test + M2/M3 scoreboards
src/debug.ts      dev bisection harness
```

### Engine notes

- A wildcard compiles to a greedy self-loop; a _typed_ wildcard validates
per-token, so its capture has no length cap and ends only at the first token
that fails validation (a function word, for `NP`). `MAX_ENTITY_LOOKAHEAD`
caps only the whole-span optimization, not total capture length.
- The match result exposes slot **values**, not token spans; `runner.recoverSpans`
recovers spans by token-subsequence alignment.
- **Bug found:** an optional _rule-reference_ (`to <Owner>? grime …`) silently
fails to match (consumes tokens, never accepts); an inline optional group
`(my|the)?` works. Worth a minimal-repro report against actionGrammar.
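
Span recovery from a slot value can be sketched as a search for the value's tokens inside the utterance. This is an assumption-laden illustration, not the real `runner.recoverSpans`: `findSpan` is a hypothetical name, and the sketch assumes the value appears as a contiguous, identically tokenized run.

```typescript
// Return the [start, end) token span of the first contiguous occurrence of
// `value` in `tokens` at or after `from`, or null if it does not occur.
function findSpan(tokens: string[], value: string, from = 0): [number, number] | null {
  const valueTokens = value.split(/\s+/).filter((t) => t.length > 0);
  if (valueTokens.length === 0) return null;
  for (let s = from; s + valueTokens.length <= tokens.length; s++) {
    let matches = true;
    for (let k = 0; k < valueTokens.length; k++) {
      if (tokens[s + k] !== valueTokens[k]) {
        matches = false;
        break;
      }
    }
    if (matches) return [s, s + valueTokens.length];
  }
  return null;
}

const utteranceTokens = "play the best of the beatles".split(" ");
console.log(findSpan(utteranceTokens, "the beatles"));
```

The `from` argument lets a caller resolve slots left to right, so a value that repeats earlier in the utterance is not matched twice.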
50 changes: 50 additions & 0 deletions ts/examples/snipsBench/data/SOURCE.md
@@ -0,0 +1,50 @@
<!-- Copyright (c) Microsoft Corporation.
Licensed under the MIT License. -->

# SNIPS dataset provenance

This directory vendors the **SNIPS 2017** natural-language-understanding benchmark
in the BIO-tagged, train/valid/test split popularized by:

> Goo et al., "Slot-Gated Modeling for Joint Slot Filling and Intent Prediction",
> NAACL-HLT 2018.

It is the split that published SNIPS slot-F1 / intent-accuracy numbers are
most commonly reported against.

## Origin

- Underlying data: **SNIPS Voice Platform** "2017-06-custom-intent-engines"
benchmark, originally released by Snips for research/benchmarking
(https://github.com/snipsco/nlu-benchmark).
- Preprocessed BIO split mirrored from the JointBERT repository
(https://github.com/monologg/JointBERT, `data/snips/`), which is the
widely-used copy of the Goo et al. preprocessing. JointBERT's `dev` split is
vendored here as `valid/`.

Retrieved 2026-06-05.

## Layout

```
data/<split>/
  seq.in    # one utterance per line, space-tokenized
  seq.out   # aligned BIO slot tags, space-separated (same token count per line)
  label     # one intent label per line
```

| split | utterances |
| ----- | ---------- |
| train | 13,084 |
| valid | 700 |
| test | 700 |

- 7 intents: AddToPlaylist, BookRestaurant, GetWeather, PlayMusic, RateBook,
SearchCreativeWork, SearchScreeningEvent.
- 39 slot types.
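
Reading one split comes down to zipping the three files line by line. The sketch below works on file contents passed in as strings so that it stays independent of the filesystem; `parseSplit` and the `Utterance` type are hypothetical names, not the loader in `src/data.ts`.

```typescript
type Utterance = { tokens: string[]; tags: string[]; intent: string };

// Zip seq.in, seq.out and label contents into aligned utterances, checking
// the two invariants stated above: equal line counts and equal token counts.
function parseSplit(seqIn: string, seqOut: string, label: string): Utterance[] {
  const lines = (s: string): string[] => s.trim().split(/\r?\n/);
  const ins = lines(seqIn);
  const outs = lines(seqOut);
  const labels = lines(label);
  if (ins.length !== outs.length || ins.length !== labels.length) {
    throw new Error("line count mismatch between seq.in, seq.out and label");
  }
  return ins.map((line, i) => {
    const tokens = line.trim().split(/\s+/);
    const tags = outs[i].trim().split(/\s+/);
    if (tokens.length !== tags.length) {
      throw new Error(`token/tag count mismatch on line ${i + 1}`);
    }
    return { tokens, tags, intent: labels[i].trim() };
  });
}

const sample = parseSplit(
  "play selena\nrate this book five stars\n",
  "O B-artist\nO B-object_select B-object_type B-rating_value B-rating_unit\n",
  "PlayMusic\nRateBook\n",
);
console.log(sample.length, sample[0].intent);
```

Failing fast on a count mismatch keeps a misaligned line from silently shifting every tag after it.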

## Licensing note

The SNIPS benchmark data was released by Snips for research and benchmarking.
This copy is included solely to make the action-grammar benchmark reproducible.
Refer to the upstream repositories above for the authoritative terms governing
the underlying data.