From ea51d9b4b1de7b0501042cd4c095f8a4d5da3f90 Mon Sep 17 00:00:00 2001
From: Yuri S Villas Boas
Date: Sat, 28 Feb 2026 16:30:53 -0300
Subject: [PATCH 01/11] Formosa as BIP

Mnemonic *sentences* instead of words proposed as forwards- and
backwards-compatible expansion to BIP39, itself as Bitcoin Improvement
Proposal.
---
 bip.mediawiki | 224 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 224 insertions(+)
 create mode 100644 bip.mediawiki

diff --git a/bip.mediawiki b/bip.mediawiki
new file mode 100644
index 0000000000..819f49842e
--- /dev/null
+++ b/bip.mediawiki
@@ -0,0 +1,224 @@
+
+  BIP: ?
+  Layer: Applications
+  Title: Formosa --- Themed mnemonic sentences for generating deterministic keys
+  Author: Yuri S Villas Boas 
+          André Fidencio Gonçalves 
+  Comments-Summary: No comments yet.
+  Comments-URI: https://github.com/bitcoin/bips/wiki/Comments:BIP-formosa
+  Status: Draft
+  Type: Standards Track
+  Created: 2021-12-10
+  License: BSD-2-Clause
+  Requires: BIP-0032, BIP-0039
+  Post-History: https://www.toptal.com/cryptocurrency/formosa-crypto-wallet-management
+
+
+==Abstract==
+
+This BIP describes an expansion of BIP-0039 for the generation of deterministic
+wallets. Where BIP-0039 uses a flat list of unrelated words, Formosa organizes
+mnemonic words into themed sentences with syntactic structure and semantic
+coherence, substantially improving memorability while retaining all properties
+of the original scheme.
+
+It consists of two parts: generating the mnemonic and converting it into a
+binary seed. This seed can be later used to generate deterministic wallets using
+BIP-0032 or similar methods.
+
+Full forward and backward compatibility with BIP-0039 is maintained: seed
+derivation internally converts any Formosa mnemonic back to its equivalent
+BIP-0039 representation, so existing keys and addresses are preserved.
+
+==Copyright==
+
+This BIP is licensed under the BSD 2-clause license.
+
+==Motivation==
+
+A mnemonic code or sentence is superior for human interaction compared to the
+handling of raw binary or hexadecimal representations of a wallet seed. The
+sentence could be written on paper or spoken over the telephone.
+
+However, human memory is an associative process: information is more readily
+retained when it can be linked to existing knowledge through semantic
+associations, visual imagery, and narrative context. A BIP-0039 mnemonic is a
+sequence of unrelated words with no syntactic or semantic relationship, making
+it difficult to form the mental associations that aid long-term retention.
+
+Formosa builds upon BIP-0039 by organizing mnemonic words into themed sentences
+with syntactic roles (e.g., subject, adjective, object, location). Each sentence
+draws vocabulary from a coherent semantic domain --- medieval fantasy, science
+fiction, nature, finance, or any custom theme --- enabling the user to form vivid
+mental images that reduce memorization effort per bit of entropy.
+
+This guide is meant to be a way to transport computer-generated randomness with
+a human-readable transcription. It's not a way to process user-created
+sentences (also known as brainwallets) into a wallet seed.
+
+==Generating the mnemonic==
+
+The mnemonic must encode entropy in a multiple of 32 bits. With more entropy
+security is improved but the sentence length increases. We refer to the
+initial entropy length as ENT. The allowed size of ENT is 128-256 bits.
+
+First, an initial entropy of ENT bits is generated. A checksum is generated by
+taking the first ENT / 32 bits of its SHA256 hash. This checksum is
+appended to the end of the initial entropy. Next, these concatenated bits
+are split into groups of 33 bits, which we call '''sentences'''. Each sentence is
+further subdivided into variable-length bit fields, one per syntactic category,
+whose lengths are defined by the active theme. Each bit field encodes an index
+into the corresponding category's word list. Finally, we convert these indices
+into words and use the joined words as a mnemonic sentence.
+
+BIP-0039 is a special case where each sentence contains three 11-bit fields
+indexing a single 2048-word list (3 x 11 = 33).
+
+The following table describes the relation between the initial entropy
+length (ENT), the checksum length (CS), the number of 33-bit sentences (S),
+and the length of the generated mnemonic sentence (MS) in words. The word
+count assumes a 6-word theme; for BIP-0039 (3 words per sentence), divide by 2.
+
+
+CS = ENT / 32
+S  = (ENT + CS) / 33
+
+|  ENT  | CS | ENT+CS |  S  | MS (6-word) | MS (BIP-0039) |
++-------+----+--------+-----+-------------+---------------+
+|  128  |  4 |   132  |  4  |     24      |      12       |
+|  160  |  5 |   165  |  5  |     30      |      15       |
+|  192  |  6 |   198  |  6  |     36      |      18       |
+|  224  |  7 |   231  |  7  |     42      |      21       |
+|  256  |  8 |   264  |  8  |     48      |      24       |
+
+
+For each 33-bit sentence, the word selection algorithm proceeds as follows:
+
+# Initialize an empty sentence array with one slot per category.
+# For each category in the theme's ''filling order'':
+## Extract BIT_LENGTH bits from the current position in the bit stream.
+## Interpret them as an unsigned integer index.
+## If the category is ''led by'' another category, look up the appropriate sub-list from the leading category's mapping using the already-selected leading word. Otherwise, use the category's total word list.
+## Select the word at the computed index from the resolved word list.
+## Place the word into the sentence array at the position given by the theme's ''natural order''.
+# Output the words in natural order.
+
+==Themes==
+
+The Formosa equivalent to a BIP-0039 wordlist is a '''theme'''. A theme is a JSON
+document that defines syntactic categories, their word lists, bit-widths, and
+optional semantic restrictions between categories. The sum of all category
+bit-widths in a theme MUST equal 33.
+
+An ideal theme has the following characteristics:
+
+a) specific semantic scope (memory block)
+   - the entire vocabulary should adhere to a single coherent topic, enabling
+     the user to form a unified mental scene
+
+b) concrete imagery
+   - categories should consist of elements easily associated with mental images.
+     Prefer concrete nouns and tangible adjectives over abstract terms
+
+c) sorted wordlists
+   - the wordlist is sorted which allows for more efficient lookup of the code words
+     (i.e. implementations can use binary search instead of linear search)
+
+d) first-letters uniqueness
+   - the wordlist is created in such a way that it's enough to type the first two
+     letters to unambiguously identify the word
+
+The first-letters uniqueness property yields higher information density than
+BIP-0039. In BIP-0039, four characters are needed to identify each word,
+encoding 11 bits per 4 characters = 2.75 bits/character. In Formosa, two
+characters suffice per word. The achievable density depends on the theme's
+category bit-widths:
+
+
+| List size | Bits | Chars to identify | Density (bits/char) |
++-----------+------+-------------------+---------------------+
+|   2048    |  11  |        4          |   2.75 (BIP-0039)   |
+|    32     |   5  |        2          |   2.50              |
+|    64     |   6  |        2          |   3.00              |
+|   128     |   7  |        2          |   3.50              |
+
+
+As an example, the ''nationalities'' theme uses four 7-bit nationality
+categories (128 entries each) and one 5-bit profession category (32 entries),
+yielding 33 bits per 5-word sentence. A user typing only the first two
+characters of each word types 10 characters to encode 33 bits, achieving an
+information density of 33 / 10 = 3.30 bits/character --- a 20% improvement
+over BIP-0039's 2.75 bits/character
+
+e) semantic restrictions (optional)
+   - themes may define restrictions between categories so that the available word list
+     for one category changes depending on the word selected in a leading category,
+     producing more semantically coherent sentences. Restriction relationships MUST
+     be acyclic
+
+The wordlist can contain native characters, but they must be encoded in UTF-8
+using Normalization Form Compatibility Decomposition (NFKD).
+
+==From mnemonic to seed==
+
+A user may decide to protect their mnemonic with a passphrase. If a passphrase is not
+present, an empty string "" is used instead.
+
+To ensure forward and backward compatibility with BIP-0039, seed derivation first
+converts any Formosa mnemonic back to its equivalent BIP-0039 mnemonic by extracting
+the underlying entropy and re-encoding it using the BIP-0039 English word list. This
+guarantees that the same entropy always produces the same seed, keys, and addresses
+regardless of which theme was used.
+
+To create a binary seed from the resulting BIP-0039 mnemonic, we use the PBKDF2 function
+with a mnemonic sentence (in UTF-8 NFKD) used as the password and the string "mnemonic" +
+passphrase (again in UTF-8 NFKD) used as the salt. The iteration count is set to 2048 and
+HMAC-SHA512 is used as the pseudo-random function. The length of the derived key is 512
+bits (= 64 bytes).
+
+This seed can be later used to generate deterministic wallets using BIP-0032 or
+similar methods.
+
+The conversion of the mnemonic sentence to a binary seed is completely independent
+from generating the sentence. This results in a rather simple code; there are no
+constraints on sentence structure and clients are free to implement their own
+themes or even whole sentence generators, allowing for flexibility in wordlists
+for typo detection or other purposes.
+
+Although using a mnemonic not generated by the algorithm described in "Generating the
+mnemonic" section is possible, this is not advised and software must compute a
+checksum for the mnemonic sentence using a wordlist and issue a warning if it is
+invalid.
+
+The described method also provides plausible deniability, because every passphrase
+generates a valid seed (and thus a deterministic wallet) but only the correct one
+will make the desired wallet available.
+
+==Standard themes==
+
+The reference implementation ships with standard themes listed at the link below.
+Since BIP-0039 is a valid Formosa theme, all existing BIP-0039 mnemonics work
+without modification.
+
+It is '''strongly discouraged''' to use non-standard custom themes for generating
+mnemonic sentences, as the user assumes responsibility for ensuring the theme file
+remains available and structurally valid. Users with proper training in security
+protocols who understand these risks may benefit from custom themes through higher
+memorization efficiency or an additional layer of obscurity.
+
+* [[https://github.com/Yuri-SVB/formosa/tree/master/src/mnemonic/themes|Standard Formosa Themes]]
+
+==Test vectors==
+
+The test vectors include input entropy, mnemonic and seed. The
+passphrase "TREZOR" is used for all vectors. Since Formosa converts back to
+BIP-0039 before seed derivation, the same test vectors apply to all themes
+given the same underlying entropy.
+
+https://github.com/Yuri-SVB/formosa/blob/master/vectors.json
+
+==Reference Implementation==
+
+Reference implementation including themes is available from
+
+https://github.com/Yuri-SVB/formosa

From 3166be94192a8f7653fcb38e3e2bdf99c960bc79 Mon Sep 17 00:00:00 2001
From: Yuri S Villas Boas
Date: Mon, 23 Mar 2026 18:41:12 -0300
Subject: [PATCH 02/11] Update bip.mediawiki

Co-authored-by: Mark "Murch" Erhardt
---
 bip.mediawiki | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/bip.mediawiki b/bip.mediawiki
index 819f49842e..fc373e548d 100644
--- a/bip.mediawiki
+++ b/bip.mediawiki
@@ -2,16 +2,16 @@
   BIP: ?
   Layer: Applications
   Title: Formosa --- Themed mnemonic sentences for generating deterministic keys
-  Author: Yuri S Villas Boas
-          André Fidencio Gonçalves
-  Comments-Summary: No comments yet.
-  Comments-URI: https://github.com/bitcoin/bips/wiki/Comments:BIP-formosa
+  Authors: Yuri S Villas Boas
+           André Fidencio Gonçalves
   Status: Draft
-  Type: Standards Track
-  Created: 2021-12-10
+  Type: Specification
+  Assigned: ?
   License: BSD-2-Clause
-  Requires: BIP-0032, BIP-0039
-  Post-History: https://www.toptal.com/cryptocurrency/formosa-crypto-wallet-management
+  Requires: 32, 39
+  Discussion: https://gnusha.org/pi/bitcoindev/jQqInjh7VTC5byefTzENidJjigvRqf5Y7UvbrWjKPJykvhdlLETeglGE3zoAiVAxUyAXU8uWHsHEjJ0MHqqPTy4prgaIhgMyIrD9c6ZUuE0=@pm.me/#t
+              https://gnusha.org/pi/bitcoindev/F4cs-RJRQYBXhjoS9fc_cUc93yLrkQS5DNQAeFRHrLEQ5bScCjKSnaqN-IcXb16fxqO053muqFCx8_GzzKN5XCGCIHD9Ir1_baI5voKYfOo=@pm.me/
+              https://www.toptal.com/cryptocurrency/formosa-crypto-wallet-management
 
 
 ==Abstract==

From 738dac9c1671600f40b48933741161c824f671a2 Mon Sep 17 00:00:00 2001
From: Yuri S Villas Boas
Date: Mon, 23 Mar 2026 18:43:36 -0300
Subject: [PATCH 03/11] Update bip.mediawiki

Satisfying requirement of title in fewer than 50 characters.
---
 bip.mediawiki | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/bip.mediawiki b/bip.mediawiki
index fc373e548d..f22a20263e 100644
--- a/bip.mediawiki
+++ b/bip.mediawiki
@@ -1,7 +1,7 @@
 
   BIP: ?
   Layer: Applications
-  Title: Formosa --- Themed mnemonic sentences for generating deterministic keys
+  Title: Encoding seed as themed mnemonic sentences
   Authors: Yuri S Villas Boas 
            André Fidencio Gonçalves 
   Status: Draft

From f5b0a1e94218b0f23e6e72eeb4162ea1478dfd65 Mon Sep 17 00:00:00 2001
From: Yuri-SVB 
Date: Sun, 26 Apr 2026 19:52:39 -0300
Subject: [PATCH 04/11] Formosa: address PR #2108 review feedback

Restructure the draft to follow BIP-3 conventions and resolve the issues
raised by reviewers in https://github.com/bitcoin/bips/pull/2108:

- Introduce explicit Specification section with a Terminology subsection
  that distinguishes 'word', 'category', 'theme', 'sentence' and
  'mnemonic' / 'mnemonic story', removing the ambiguity of using
  'sentence' at two different scales.
- Replace the unclear 'if the category is led by another category'
  wording with an explicit LED_BY field description and a step-by-step
  algorithm that covers both the leaderless and led cases.
- Reflow the theme-property list (previously a/b/c/d/e split by an
  intervening paragraph) into a single numbered list so it renders as a
  list rather than as code blocks.
- Add a dedicated Rationale section covering the 33-bit sentence size,
  themed sentences, free-form theme schema, the LED_BY mechanism, the
  re-encoding-through-BIP-39 design, and why custom themes are
  discouraged.
- Add a dedicated Backwards Compatibility section describing
  compatibility at the mnemonic, entropy, and seed levels.
- Add a worked Example section showing a 128-bit entropy being encoded
  into a 4-sentence mnemonic story under a small illustrative theme,
  including bit splitting, FILLING_ORDER vs NATURAL_ORDER, and the
  LED_BY lookup.
- Tighten the Abstract and Motivation; clarify that BIP-39 is itself a
  Formosa theme.
---
 bip.mediawiki | 373 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 248 insertions(+), 125 deletions(-)
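
The sentence-building steps specified in this patch can be sketched in
Python as follows. This is only an illustrative sketch, not the reference
implementation: BIT_LENGTH, FILLING_ORDER, NATURAL_ORDER and LED_BY are the
draft's field names and the values mirror the draft's Example theme, but the
dict layout, the function names and the synthetic placeholder wordlists
(e.g. "subject70") are assumptions made for illustration.

```python
import hashlib

# Hypothetical in-memory theme mirroring the draft's Example section.
BIT_LENGTH = {"VERB": 6, "SUBJECT": 7, "ADJECTIVE": 5, "OBJECT": 7, "PLACE": 8}
FILLING_ORDER = ["SUBJECT", "VERB", "OBJECT", "ADJECTIVE", "PLACE"]
NATURAL_ORDER = ["SUBJECT", "VERB", "ADJECTIVE", "OBJECT", "PLACE"]
LED_BY = {"ADJECTIVE": "OBJECT"}  # dependent category -> leading category

assert sum(BIT_LENGTH.values()) == 33  # the draft's MUST on bit-widths


def resolve_wordlist(category, chosen):
    """Return the 2^BIT_LENGTH-entry list for a category (placeholder words)."""
    size = 2 ** BIT_LENGTH[category]
    leader = LED_BY.get(category)
    if leader is None:
        # Leaderless case: one flat wordlist.
        return [f"{category.lower()}{i}" for i in range(size)]
    # Led case: the sub-list depends on the word already chosen for the leader.
    return [f"{category.lower()}{i}_for_{chosen[leader]}" for i in range(size)]


def entropy_to_bits(entropy):
    """Entropy bits followed by the first ENT/32 bits of SHA-256(entropy)."""
    ent = len(entropy) * 8
    if ent % 32 != 0 or not 128 <= ent <= 256:
        raise ValueError("ENT must be a multiple of 32 between 128 and 256")
    digest = hashlib.sha256(entropy).digest()
    ent_bits = bin(int.from_bytes(entropy, "big"))[2:].zfill(ent)
    digest_bits = bin(int.from_bytes(digest, "big"))[2:].zfill(256)
    return ent_bits + digest_bits[: ent // 32]


def encode_block(block):
    """Turn one 33-bit block (a string of '0'/'1') into one sentence."""
    if len(block) != 33:
        raise ValueError("a sentence encodes exactly 33 bits")
    chosen, pos = {}, 0
    for category in FILLING_ORDER:
        width = BIT_LENGTH[category]
        index = int(block[pos:pos + width], 2)  # unsigned, big-endian
        pos += width
        chosen[category] = resolve_wordlist(category, chosen)[index]
    return [chosen[category] for category in NATURAL_ORDER]


def encode(entropy):
    """Encode entropy as a list of sentences (the 'mnemonic story')."""
    bits = entropy_to_bits(entropy)
    return [encode_block(bits[i:i + 33]) for i in range(0, len(bits), 33)]
```

Calling encode(bytes.fromhex("8d968174c2cd0e2c6f47e7d61bd5a341")) returns
four 5-word sentences. Because OBJECT precedes ADJECTIVE in FILLING_ORDER,
the object word is already known when the adjective sub-list is resolved.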

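The final seed-derivation step (PBKDF2 over the re-encoded BIP-0039
mnemonic) can be sketched with the Python standard library as shown here.
The function name bip39_seed is an assumption for illustration. The sketch
deliberately omits the decode and re-encode steps, which need the theme file
and the BIP-0039 English wordlist, and takes an already re-encoded BIP-0039
mnemonic string as input.

```python
import hashlib
import unicodedata


def bip39_seed(bip39_mnemonic, passphrase=""):
    """Derive the 512-bit seed from a BIP-0039 mnemonic string.

    Parameters follow the draft text: PBKDF2 with HMAC-SHA512, 2048
    iterations, password = mnemonic (UTF-8 NFKD),
    salt = "mnemonic" + passphrase (UTF-8 NFKD), 64-byte output.
    """
    password = unicodedata.normalize("NFKD", bip39_mnemonic).encode("utf-8")
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, dklen=64)
```

Since the input is always the BIP-0039 form, two Formosa mnemonics that
decode to the same entropy under different themes reach this function with
the same string and so derive the same seed.
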
diff --git a/bip.mediawiki b/bip.mediawiki
index f22a20263e..51ef238088 100644
--- a/bip.mediawiki
+++ b/bip.mediawiki
@@ -16,19 +16,18 @@
 
 ==Abstract==
 
-This BIP describes an expansion of BIP-0039 for the generation of deterministic
-wallets. Where BIP-0039 uses a flat list of unrelated words, Formosa organizes
-mnemonic words into themed sentences with syntactic structure and semantic
-coherence, substantially improving memorability while retaining all properties
-of the original scheme.
-
-It consists of two parts: generating the mnemonic and converting it into a
-binary seed. This seed can be later used to generate deterministic wallets using
-BIP-0032 or similar methods.
-
-Full forward and backward compatibility with BIP-0039 is maintained: seed
-derivation internally converts any Formosa mnemonic back to its equivalent
-BIP-0039 representation, so existing keys and addresses are preserved.
+This BIP describes Formosa, an expansion of BIP-0039 for the generation of
+deterministic wallets. Where BIP-0039 maps each 11 bits of entropy to one word
+drawn from a single 2048-word list, Formosa maps each 33 bits of entropy to a
+short ''themed sentence'' built from several smaller, syntactically-typed
+wordlists. The sentences carry grammatical structure and semantic coherence,
+substantially improving memorability while retaining all cryptographic
+properties of the original scheme.
+
+The proposal is fully forward- and backward-compatible with BIP-0039: BIP-0039
+is itself a Formosa theme, and seed derivation re-encodes any Formosa mnemonic
+through the BIP-0039 English wordlist before applying PBKDF2, so existing keys
+and addresses are preserved.
 
 ==Copyright==
 
@@ -36,48 +35,78 @@ This BIP is licensed under the BSD 2-clause license.
 
 ==Motivation==
 
-A mnemonic code or sentence is superior for human interaction compared to the
-handling of raw binary or hexadecimal representations of a wallet seed. The
-sentence could be written on paper or spoken over the telephone.
+A mnemonic is superior for human interaction compared to handling raw binary or
+hexadecimal representations of a wallet seed. It can be written on paper or
+spoken over the telephone.
 
-However, human memory is an associative process: information is more readily
-retained when it can be linked to existing knowledge through semantic
-associations, visual imagery, and narrative context. A BIP-0039 mnemonic is a
-sequence of unrelated words with no syntactic or semantic relationship, making
-it difficult to form the mental associations that aid long-term retention.
+However, human memory is associative: information is more readily retained when
+it can be linked to existing knowledge through semantic associations, visual
+imagery, and narrative context. A BIP-0039 mnemonic is a sequence of unrelated
+words with no syntactic or semantic relationship, making it difficult to form
+the mental associations that aid long-term retention.
 
 Formosa builds upon BIP-0039 by organizing mnemonic words into themed sentences
-with syntactic roles (e.g., subject, adjective, object, location). Each sentence
-draws vocabulary from a coherent semantic domain --- medieval fantasy, science
-fiction, nature, finance, or any custom theme --- enabling the user to form vivid
-mental images that reduce memorization effort per bit of entropy.
+with syntactic roles (e.g., subject, verb, adjective, object, place). Each
+sentence draws vocabulary from a coherent semantic domain --- medieval fantasy,
+science fiction, nature, finance, or any custom theme --- enabling the user to
+form vivid mental images that reduce memorization effort per bit of entropy.
 
 This guide is meant to be a way to transport computer-generated randomness with
-a human-readable transcription. It's not a way to process user-created
+a human-readable transcription. It is not a way to process user-created
 sentences (also known as brainwallets) into a wallet seed.
 
-==Generating the mnemonic==
+==Specification==
+
+===Terminology===
+
+To avoid the ambiguity of using the word "sentence" at two different scales,
+this document fixes the following vocabulary:
+
+* '''word''': a single token drawn from a category's wordlist (e.g. ''dragon'').
+* '''category''': a syntactic role (e.g. ''SUBJECT'', ''VERB'', ''PLACE'') with its own wordlist and a fixed bit-width.
+* '''theme''': the full set of categories, wordlists, bit-widths, ordering rules and constraints that defines one Formosa dialect. A theme is the Formosa equivalent of a BIP-0039 wordlist.
+* '''sentence''': the words selected from one theme by encoding a single 33-bit block of entropy. A sentence is the Formosa equivalent of three consecutive BIP-0039 words.
+* '''mnemonic''' (or '''mnemonic story'''): the ordered concatenation of all sentences that together encode the entropy plus checksum.
+
+Wherever BIP-0039 speaks of a "mnemonic sentence" composed of words, Formosa
+speaks of a "mnemonic" (or, informally, a "mnemonic story") composed of
+sentences.
+
+===Theme structure===
+
+A theme is a JSON document that defines:
+
+# An ordered list of '''categories'''. For each category:
+#* a wordlist;
+#* a '''BIT_LENGTH''', i.e. the number of bits this category encodes (the wordlist MUST contain exactly 2^BIT_LENGTH entries);
+#* an optional '''LED_BY''' field naming another category. When present, the wordlist of this category is not a single flat list but a mapping from each word of the leading category to a sub-list of 2^BIT_LENGTH entries.
+# A '''FILLING_ORDER''': the order in which categories consume bits from the entropy stream.
+# A '''NATURAL_ORDER''': the order in which the selected words are spoken or written.
+
+The sum of all BIT_LENGTH values in a theme MUST equal 33.
+
+The '''LED_BY''' relation MUST be acyclic and a leading category MUST appear
+before its dependent category in FILLING_ORDER, so that the
+leader's word is already known when the dependent category is filled.
+
+Wordlist entries MAY contain native characters; they MUST be encoded in UTF-8
+using Normalization Form Compatibility Decomposition (NFKD).
+
+===Generating the mnemonic===
 
 The mnemonic must encode entropy in a multiple of 32 bits. With more entropy
-security is improved but the sentence length increases. We refer to the
+security is improved but the mnemonic length increases. We refer to the
 initial entropy length as ENT. The allowed size of ENT is 128-256 bits.
 
 First, an initial entropy of ENT bits is generated. A checksum is generated by
 taking the first ENT / 32 bits of its SHA256 hash. This checksum is
-appended to the end of the initial entropy. Next, these concatenated bits
-are split into groups of 33 bits, which we call '''sentences'''. Each sentence is
-further subdivided into variable-length bit fields, one per syntactic category,
-whose lengths are defined by the active theme. Each bit field encodes an index
-into the corresponding category's word list. Finally, we convert these indices
-into words and use the joined words as a mnemonic sentence.
-
-BIP-0039 is a special case where each sentence contains three 11-bit fields
-indexing a single 2048-word list (3 x 11 = 33).
+appended to the end of the initial entropy. The concatenated bits are then
+split into groups of 33 bits; each group encodes one '''sentence'''.
 
-The following table describes the relation between the initial entropy
-length (ENT), the checksum length (CS), the number of 33-bit sentences (S),
-and the length of the generated mnemonic sentence (MS) in words. The word
-count assumes a 6-word theme; for BIP-0039 (3 words per sentence), divide by 2.
+The following table describes the relation between the initial entropy length
+(ENT), the checksum length (CS), the number of 33-bit sentences (S), and the
+length of the mnemonic (MS) in words. The word count assumes a 6-word theme;
+for BIP-0039 (3 words per sentence), divide by 2.
 
 
 CS = ENT / 32
@@ -92,47 +121,65 @@ S  = (ENT + CS) / 33
 |  256  |  8 |   264  |  8  |     48      |      24       |
 
 
-For each 33-bit sentence, the word selection algorithm proceeds as follows:
+For each 33-bit block, the sentence is built as follows:
 
-# Initialize an empty sentence array with one slot per category.
-# For each category in the theme's ''filling order'':
-## Extract BIT_LENGTH bits from the current position in the bit stream.
-## Interpret them as an unsigned integer index.
-## If the category is ''led by'' another category, look up the appropriate sub-list from the leading category's mapping using the already-selected leading word. Otherwise, use the category's total word list.
-## Select the word at the computed index from the resolved word list.
-## Place the word into the sentence array at the position given by the theme's ''natural order''.
-# Output the words in natural order.
+# Initialize an empty array with one slot per category in the theme.
+# For each category C in the theme's FILLING_ORDER:
+## Read the next C.BIT_LENGTH bits from the block and interpret them as an unsigned big-endian integer i.
+## Resolve C's wordlist:
+##* if C has no LED_BY field, use C's flat wordlist;
+##* if C has LED_BY = L, look up the word already chosen for L in C's mapping and use the corresponding sub-list of 2^C.BIT_LENGTH entries.
+## Select the word at index i from the resolved wordlist and place it in the slot of C.
+# Emit the slots in NATURAL_ORDER; the resulting word sequence is the sentence.
 
-==Themes==
+The mnemonic is the concatenation, in order, of the sentences produced from
+all 33-bit blocks.
+
+BIP-0039 is a special case: a single category named ''WORD'' with
+BIT_LENGTH = 11, a 2048-entry wordlist, no LED_BY
+relation, and trivial FILLING_ORDER = NATURAL_ORDER =
+[WORD, WORD, WORD].
 
-The Formosa equivalent to a BIP-0039 wordlist is a '''theme'''. A theme is a JSON
-document that defines syntactic categories, their word lists, bit-widths, and
-optional semantic restrictions between categories. The sum of all category
-bit-widths in a theme MUST equal 33.
+===From mnemonic to seed===
 
-An ideal theme has the following characteristics:
+A user may protect their mnemonic with a passphrase. If a passphrase is not
+present, an empty string "" is used instead.
 
-a) specific semantic scope (memory block)
-   - the entire vocabulary should adhere to a single coherent topic, enabling
-     the user to form a unified mental scene
+To ensure forward and backward compatibility with BIP-0039, seed derivation
+proceeds in two steps:
 
-b) concrete imagery
-   - categories should consist of elements easily associated with mental images.
-     Prefer concrete nouns and tangible adjectives over abstract terms
+# '''Decode''' the Formosa mnemonic against its theme to recover the original entropy and checksum. Verify the checksum; if it does not match, software MUST issue a warning.
+# '''Re-encode''' the entropy as a BIP-0039 mnemonic using the BIP-0039 English wordlist.
 
-c) sorted wordlists
-   - the wordlist is sorted which allows for more efficient lookup of the code words
-     (i.e. implementations can use binary search instead of linear search)
+A binary seed is then produced from the BIP-0039 mnemonic exactly as in
+BIP-0039: PBKDF2 with the BIP-0039 mnemonic (UTF-8 NFKD) as password and the
+string "mnemonic" + passphrase (UTF-8 NFKD) as salt, with 2048
+iterations of HMAC-SHA512, producing a 512-bit key.
 
-d) first-letters uniqueness
-   - the wordlist is created in such a way that it's enough to type the first two
-     letters to unambiguously identify the word
+The same entropy therefore always yields the same seed, keys and addresses,
+regardless of which Formosa theme was used for the mnemonic.
 
-The first-letters uniqueness property yields higher information density than
-BIP-0039. In BIP-0039, four characters are needed to identify each word,
-encoding 11 bits per 4 characters = 2.75 bits/character. In Formosa, two
-characters suffice per word. The achievable density depends on the theme's
-category bit-widths:
+The decoding step MUST use the same theme that was used for encoding;
+implementations SHOULD detect the active theme by attempting to parse the
+mnemonic against each known theme and selecting the one whose words and
+checksum match.
+
+==Themes==
+
+A theme is the Formosa equivalent of a BIP-0039 wordlist. Theme designers
+SHOULD aim for the following properties:
+
+# '''Specific semantic scope'''. The whole vocabulary should adhere to a single coherent topic, so the user can form a unified mental scene per sentence.
+# '''Concrete imagery'''. Categories should consist of elements easily associated with mental images. Concrete nouns and tangible adjectives are preferred over abstract terms.
+# '''Sorted wordlists'''. Wordlists should be sorted to allow binary-search lookup.
+# '''First-letters uniqueness'''. Wordlists should be constructed so that a short prefix (e.g. the first two letters) uniquely identifies each word.
+# '''Optional semantic restrictions'''. Themes MAY use the LED_BY mechanism so that the wordlist available for one category depends on the word chosen in a leading category, producing more semantically coherent sentences. Restriction relations MUST be acyclic.
+
+The first-letters-uniqueness property yields higher information density than
+BIP-0039. In BIP-0039 four characters are needed to identify each word,
+encoding 11 bits per 4 characters = 2.75 bits/character. In a Formosa theme
+with smaller wordlists, two characters typically suffice per word. The
+achievable density depends on the bit-width of each category:
 
 | List size | Bits | Chars to identify | Density (bits/char) |
@@ -143,77 +190,153 @@ category bit-widths:
 |   128     |   7  |        2          |   3.50              |
 
 
-As an example, the ''nationalities'' theme uses four 7-bit nationality
-categories (128 entries each) and one 5-bit profession category (32 entries),
-yielding 33 bits per 5-word sentence. A user typing only the first two
-characters of each word types 10 characters to encode 33 bits, achieving an
-information density of 33 / 10 = 3.30 bits/character --- a 20% improvement
-over BIP-0039's 2.75 bits/character
+For example, a ''nationalities'' theme using four 7-bit nationality categories
+(128 entries each) and one 5-bit profession category (32 entries) yields 33
+bits per 5-word sentence. A user typing only the first two characters of each
+word types 10 characters to encode 33 bits, achieving 33 / 10 = 3.30
+bits/character --- a 20% improvement over BIP-0039.
+
+==Rationale==
+
+'''Why 33-bit sentences?''' BIP-0039 uses an 11-bit word and a checksum of
+ENT/32 bits, which means valid concatenated
+lengths are always multiples of 33 bits. A 33-bit Formosa
+sentence is therefore the smallest unit that lets any theme map
+losslessly onto the same entropy + checksum boundaries used by BIP-0039,
+which is what enables full backward compatibility.
+
+'''Why themed sentences?''' Cognitive-psychology research on mnemonic
+techniques (the method of loci, peg systems, story mnemonics) consistently
+shows that vivid, syntactically-structured imagery is recalled more reliably
+than disconnected lists. A themed sentence engages this machinery directly:
+"a ''brave knight slays the green dragon in the castle''" is easier to recall
+than three unrelated BIP-0039 words encoding the same 33 bits.
+
+'''Why a free-form theme schema rather than a fixed grammar?''' Different
+languages, cultures and use-cases benefit from different syntactic templates
+and vocabulary. Encoding the structure as data (categories, bit-widths,
+filling/natural orders, optional LED_BY mapping) rather than as
+hard-coded logic keeps the specification small while letting communities
+contribute themes without protocol changes.
+
+'''Why the LED_BY mechanism?''' Semantic restrictions (a ''dragon'' can be
+''ancient'' but not ''retired'') make sentences sound natural and far easier
+to memorize. Encoding such restrictions as an explicit acyclic
+leader/dependent relation, evaluated at fill time, lets themes express
+constraints without sacrificing the bijection between entropy and mnemonic:
+each leader's chosen word selects a sub-list of exactly
+2^BIT_LENGTH entries, so every bit pattern still decodes to
+exactly one word.
+
+'''Why re-encode through BIP-0039 for seed derivation?''' Re-encoding makes
+the seed a function of the entropy alone, not of the theme. This guarantees
+that:
+
+* a user can switch themes (or fall back to BIP-0039) without losing access to existing wallets;
+* a Formosa-aware wallet and a BIP-0039-only wallet derive the same keys from the same entropy;
+* the security analysis of BIP-0039 (PBKDF2 parameters, salt construction) carries over unchanged.
+
+'''Why discourage custom themes?''' A mnemonic is only useful if the theme
+that produced it is still available at recovery time. Standard themes shipped
+by reference implementations enjoy that guarantee; one-off custom themes do
+not, and the user assumes responsibility for preserving the theme file.
+
+==Backwards Compatibility==
+
+Formosa is a strict superset of BIP-0039. Compatibility is achieved on three
+levels:
+
+# '''Mnemonic level.''' BIP-0039 itself is expressible as a Formosa theme (one category, 11-bit wordlist of 2048 entries, three repetitions per 33-bit sentence). Existing BIP-0039 mnemonics are therefore valid Formosa mnemonics under the BIP-0039 theme without any change.
+# '''Entropy level.''' Encoding and decoding are bijective with respect to entropy: the same 128-256 bits encode under any theme to a different mnemonic but back to the same entropy. +# '''Seed level.''' Because seed derivation re-encodes the recovered entropy through the BIP-0039 English wordlist before PBKDF2, the resulting seed --- and therefore all derived BIP-0032 keys and addresses --- is identical to what BIP-0039 would have produced for the same entropy. A user can move between Formosa-aware and BIP-0039-only wallets without losing funds. + +Wallets that do not implement Formosa continue to operate exactly as before; +they cannot decode non-BIP-0039 themes but are not affected by their +existence. + +==Example== + +The following worked example illustrates the encoding of a 128-bit entropy +under a small hypothetical theme. The theme has 5 categories, with the +following BIT_LENGTHs and ordering: -e) semantic restrictions (optional) - - themes may define restrictions between categories so that the available word list - for one category changes depending on the word selected in a leading category, - producing more semantically coherent sentences. Restriction relationships MUST - be acyclic +
+Categories       : VERB(6) SUBJECT(7) ADJECTIVE(5) OBJECT(7) PLACE(8)
+Sum of bit widths: 6 + 7 + 5 + 7 + 8 = 33
+FILLING_ORDER    : [SUBJECT, VERB, OBJECT, ADJECTIVE, PLACE]
+NATURAL_ORDER    : [SUBJECT, VERB, ADJECTIVE, OBJECT, PLACE]
+LED_BY           : ADJECTIVE LED_BY OBJECT
+
-The wordlist can contain native characters, but they must be encoded in UTF-8 -using Normalization Form Compatibility Decomposition (NFKD). +Take the entropy (hex): -==From mnemonic to seed== +
+ENT (128 bits) = 0x8d96 8174 c2cd 0e2c 6f47 e7d6 1bd5 a341
+
+ +The SHA-256 of this entropy starts with the byte 0xA0; the +first ENT/32 = 4 bits of that byte are 1010, which +is the checksum. Appending the checksum yields 132 bits, split into +S = 4 blocks of 33 bits. The first 33-bit block is: + +
+1000 1101 1001 0110 1000 0001 0111 0100 1
+                                       ^
+binary block (33 bits): 100011011 00101101 0000001 0111010 01
+                        SUBJECT   OBJECT   ADJ.    PLACE   VERB padding...
+
-A user may decide to protect their mnemonic with a passphrase. If a passphrase is not -present, an empty string "" is used instead. +(Bit assignment follows FILLING_ORDER; the table above is purely +illustrative of how the bit string is consumed.) -To ensure forward and backward compatibility with BIP-0039, seed derivation first -converts any Formosa mnemonic back to its equivalent BIP-0039 mnemonic by extracting -the underlying entropy and re-encoding it using the BIP-0039 English word list. This -guarantees that the same entropy always produces the same seed, keys, and addresses -regardless of which theme was used. +For each category in FILLING_ORDER: -To create a binary seed from the resulting BIP-0039 mnemonic, we use the PBKDF2 function -with a mnemonic sentence (in UTF-8 NFKD) used as the password and the string "mnemonic" + -passphrase (again in UTF-8 NFKD) used as the salt. The iteration count is set to 2048 and -HMAC-SHA512 is used as the pseudo-random function. The length of the derived key is 512 -bits (= 64 bytes). +# SUBJECT: read 7 bits → index i_S; pick word from the SUBJECT wordlist (e.g. ''knight''). +# VERB: read 6 bits → index i_V; pick word from the VERB wordlist (e.g. ''slays''). +# OBJECT: read 7 bits → index i_O; pick word from the OBJECT wordlist (e.g. ''dragon''). +# ADJECTIVE: read 5 bits → index i_A; because ADJECTIVE LED_BY OBJECT, look up the sub-list keyed by ''dragon'' and pick at index i_A (e.g. ''ancient''). +# PLACE: read 8 bits → index i_P; pick word from the PLACE wordlist (e.g. ''castle''). -This seed can be later used to generate deterministic wallets using BIP-0032 or -similar methods. +Emitting the slots in NATURAL_ORDER yields the sentence: -The conversion of the mnemonic sentence to a binary seed is completely independent -from generating the sentence. 
This results in a rather simple code; there are no -constraints on sentence structure and clients are free to implement their own -themes or even whole sentence generators, allowing for flexibility in wordlists -for typo detection or other purposes. +
+knight slays ancient dragon castle
+
-Although using a mnemonic not generated by the algorithm described in "Generating the -mnemonic" section is possible, this is not advised and software must compute a -checksum for the mnemonic sentence using a wordlist and issue a warning if it is -invalid. +Repeating this procedure for the remaining three 33-bit blocks produces a +4-sentence mnemonic story (24 words total for ENT=128). Decoding inverts the +process: each word is mapped back to its index in the resolved wordlist, the +indices are concatenated in FILLING_ORDER, and the resulting +132-bit string is split into 128 entropy bits + 4 checksum bits and +verified. -The described method also provides plausible deniability, because every passphrase -generates a valid seed (and thus a deterministic wallet) but only the correct one -will make the desired wallet available. +For seed derivation, the recovered entropy 0x8d96…a341 is +re-encoded with the BIP-0039 English wordlist into the standard 12-word +BIP-0039 mnemonic, which is then passed to PBKDF2 exactly as specified by +BIP-0039. The resulting seed is identical to the seed a pure BIP-0039 wallet +would derive from the same entropy. ==Standard themes== -The reference implementation ships with standard themes listed at the link below. -Since BIP-0039 is a valid Formosa theme, all existing BIP-0039 mnemonics work -without modification. +The reference implementation ships with standard themes listed at the link +below. Since BIP-0039 is a valid Formosa theme, all existing BIP-0039 +mnemonics work without modification. -It is '''strongly discouraged''' to use non-standard custom themes for generating -mnemonic sentences, as the user assumes responsibility for ensuring the theme file -remains available and structurally valid. Users with proper training in security -protocols who understand these risks may benefit from custom themes through higher -memorization efficiency or an additional layer of obscurity. 
+It is '''strongly discouraged''' to use non-standard custom themes for +generating mnemonic sentences, as the user assumes responsibility for +ensuring the theme file remains available and structurally valid. Users with +proper training in security protocols who understand these risks may benefit +from custom themes through higher memorization efficiency or an additional +layer of obscurity. * [[https://github.com/Yuri-SVB/formosa/tree/master/src/mnemonic/themes|Standard Formosa Themes]] ==Test vectors== -The test vectors include input entropy, mnemonic and seed. The -passphrase "TREZOR" is used for all vectors. Since Formosa converts back to -BIP-0039 before seed derivation, the same test vectors apply to all themes -given the same underlying entropy. +The test vectors include input entropy, mnemonic and seed. The passphrase +"TREZOR" is used for all vectors. Since Formosa converts back to BIP-0039 +before seed derivation, the same seed test vectors apply to all themes given +the same underlying entropy. https://github.com/Yuri-SVB/formosa/blob/master/vectors.json From ac185147e0f8badaf99dcadc1d03d6bdeb5a42d1 Mon Sep 17 00:00:00 2001 From: Yuri-SVB Date: Sun, 26 Apr 2026 20:00:58 -0300 Subject: [PATCH 05/11] Formosa: spell out abbreviated table labels Reviewer on PR #2108 asked for no abbreviations in table labels. Replace: - ENT / CS / S / MS column headers with 'Initial entropy bits', 'Checksum bits', 'Total bits', 'Number of sentences', 'Mnemonic words (6-word theme)' and 'Mnemonic words (BIP-0039)'. - 'List size / Bits / Chars to identify / Density (bits/char)' with 'Wordlist size / Bits per word / Characters to identify / Density (bits per character)'. - ADJ. with ADJECTIVE in the example bit-assignment diagram, and the surrounding narrative ENT/MS uses with the spelled-out forms. The accompanying formulas now use the expanded names too, so the algorithm description and the table column headers stay consistent. 
--- bip.mediawiki | 73 ++++++++++++++++++++++++++------------------------- 1 file changed, 37 insertions(+), 36 deletions(-) diff --git a/bip.mediawiki b/bip.mediawiki index 51ef238088..a6d763fdec 100644 --- a/bip.mediawiki +++ b/bip.mediawiki @@ -95,30 +95,31 @@ using Normalization Form Compatibility Decomposition (NFKD). ===Generating the mnemonic=== The mnemonic must encode entropy in a multiple of 32 bits. With more entropy -security is improved but the mnemonic length increases. We refer to the -initial entropy length as ENT. The allowed size of ENT is 128-256 bits. +security is improved but the mnemonic length increases. The allowed initial +entropy size is 128-256 bits. -First, an initial entropy of ENT bits is generated. A checksum is generated by -taking the first ENT / 32 bits of its SHA256 hash. This checksum is -appended to the end of the initial entropy. The concatenated bits are then -split into groups of 33 bits; each group encodes one '''sentence'''. +First, an initial entropy is generated. A checksum is generated by taking the +first (initial entropy bits) / 32 bits of its SHA256 hash. This +checksum is appended to the end of the initial entropy. The concatenated bits +are then split into groups of 33 bits; each group encodes one +'''sentence'''. -The following table describes the relation between the initial entropy length -(ENT), the checksum length (CS), the number of 33-bit sentences (S), and the -length of the mnemonic (MS) in words. The word count assumes a 6-word theme; -for BIP-0039 (3 words per sentence), divide by 2. +The following table describes the relation between the initial entropy length, +the checksum length, the number of 33-bit sentences, and the length of the +mnemonic in words. The word count assumes a 6-word theme; for BIP-0039 (3 +words per sentence), divide by 2.
-CS = ENT / 32
-S  = (ENT + CS) / 33
-
-|  ENT  | CS | ENT+CS |  S  | MS (6-word) | MS (BIP-0039) |
-+-------+----+--------+-----+-------------+---------------+
-|  128  |  4 |   132  |  4  |     24      |      12       |
-|  160  |  5 |   165  |  5  |     30      |      15       |
-|  192  |  6 |   198  |  6  |     36      |      18       |
-|  224  |  7 |   231  |  7  |     42      |      21       |
-|  256  |  8 |   264  |  8  |     48      |      24       |
+checksum bits         = (initial entropy bits) / 32
+number of sentences   = (initial entropy bits + checksum bits) / 33
+
+| Initial entropy bits | Checksum bits | Total bits | Number of sentences | Mnemonic words (6-word theme) | Mnemonic words (BIP-0039) |
++----------------------+---------------+------------+---------------------+-------------------------------+---------------------------+
+|         128          |       4       |    132     |          4          |              24               |            12             |
+|         160          |       5       |    165     |          5          |              30               |            15             |
+|         192          |       6       |    198     |          6          |              36               |            18             |
+|         224          |       7       |    231     |          7          |              42               |            21             |
+|         256          |       8       |    264     |          8          |              48               |            24             |
 
For each 33-bit block, the sentence is built as follows: @@ -182,12 +183,12 @@ with smaller wordlists, two characters typically suffice per word. The achievable density depends on the bit-width of each category:
-| List size | Bits | Chars to identify | Density (bits/char) |
-+-----------+------+-------------------+---------------------+
-|   2048    |  11  |        4          |   2.75 (BIP-0039)   |
-|    32     |   5  |        2          |   2.50              |
-|    64     |   6  |        2          |   3.00              |
-|   128     |   7  |        2          |   3.50              |
+| Wordlist size | Bits per word | Characters to identify | Density (bits per character) |
++---------------+---------------+------------------------+------------------------------+
+|     2048      |      11       |           4            |     2.75 (BIP-0039)          |
+|       32      |       5       |           2            |     2.50                     |
+|       64      |       6       |           2            |     3.00                     |
+|      128      |       7       |           2            |     3.50                     |
 
For example, a ''nationalities'' theme using four 7-bit nationality categories @@ -199,7 +200,7 @@ bits/character --- a 20% improvement over BIP-0039. ==Rationale== '''Why 33-bit sentences?''' BIP-0039 uses an 11-bit word and a checksum that -is a multiple of ENT/32 bits, which means valid concatenated +is one bit per 32 bits of initial entropy, which means valid concatenated lengths are always multiples of 33 bits. Choosing 33 bits as the Formosa sentence size is therefore the smallest unit that lets any theme map losslessly onto the same entropy + checksum boundaries used by BIP-0039, @@ -268,22 +269,22 @@ NATURAL_ORDER : [SUBJECT, VERB, ADJECTIVE, OBJECT, PLACE] LED_BY : ADJECTIVE LED_BY OBJECT
-Take the entropy (hex): +Take the initial entropy (hex):
-ENT (128 bits) = 0x8d96 8174 c2cd 0e2c 6f47 e7d6 1bd5 a341
+initial entropy (128 bits) = 0x8d96 8174 c2cd 0e2c 6f47 e7d6 1bd5 a341
 
The SHA-256 of this entropy starts with the byte 0xA0; the -first ENT/32 = 4 bits of that byte are 1010, which -is the checksum. Appending the checksum yields 132 bits, split into -S = 4 blocks of 33 bits. The first 33-bit block is: +first 128 / 32 = 4 bits of that byte are 1010, +which is the checksum. Appending the checksum yields 132 bits, split into +132 / 33 = 4 blocks of 33 bits. The first 33-bit block is:
 1000 1101 1001 0110 1000 0001 0111 0100 1
-                                       ^
+
 binary block (33 bits): 100011011 00101101 0000001 0111010 01
-                        SUBJECT   OBJECT   ADJ.    PLACE   VERB padding...
+                        SUBJECT   OBJECT   ADJECTIVE PLACE  VERB (truncated)
 
(Bit assignment follows FILLING_ORDER; the table above is purely @@ -304,8 +305,8 @@ knight slays ancient dragon castle Repeating this procedure for the remaining three 33-bit blocks produces a -4-sentence mnemonic story (24 words total for ENT=128). Decoding inverts the -process: each word is mapped back to its index in the resolved wordlist, the +4-sentence mnemonic story (24 words total for an initial entropy of 128 +bits). Decoding inverts the process: each word is mapped back to its index in the resolved wordlist, the indices are concatenated in FILLING_ORDER, and the resulting 132-bit string is split into 128 entropy bits + 4 checksum bits and verified. From 621fa450427d7f7164990bd09146688b45ca7166 Mon Sep 17 00:00:00 2001 From: Yuri-SVB Date: Sun, 26 Apr 2026 20:18:17 -0300 Subject: [PATCH 06/11] Formosa: rebuild Example on the real medieval_fantasy theme MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace the previous hypothetical 5-category example with one that mirrors the medieval_fantasy theme actually shipped at https://github.com/Yuri-SVB/formosa/tree/master/src/mnemonic/themes, including: - the real 6 categories with their actual BIT_LENGTHs (VERB=5, SUBJECT=6, OBJECT=6, ADJECTIVE=5, WILDCARD=6, PLACE=5, summing to 33); - the real FILLING_ORDER and NATURAL_ORDER; - the real lead tree (VERB → SUBJECT; SUBJECT → OBJECT and WILDCARD; OBJECT → ADJECTIVE; WILDCARD → PLACE), showing that a single leader can have several dependent categories; - a 33-bit block whose decoded indices (28, 32, 63, 27, 46, 29) pick existing words and existing sub-list entries: VERB[28] =unveil, SUBJECT_under_unveil[32]=king, OBJECT_under_king[63] =wine, ADJECTIVE_under_wine[27]=sweet, WILDCARD_under_king[46] =queen, PLACE_under_queen[29]=throne_room, yielding the sentence 'king unveil sweet wine queen throne_room'. 
This keeps the worked example faithful to the reference implementation rather than to a fabricated theme, so that anyone can reproduce the encoding by parsing medieval_fantasy.json. --- bip.mediawiki | 99 ++++++++++++++++++++++++++++----------------------- 1 file changed, 55 insertions(+), 44 deletions(-) diff --git a/bip.mediawiki b/bip.mediawiki index a6d763fdec..7a6d4bb134 100644 --- a/bip.mediawiki +++ b/bip.mediawiki @@ -257,65 +257,76 @@ existence. ==Example== -The following worked example illustrates the encoding of a 128-bit entropy -under a small hypothetical theme. The theme has 5 categories, with the -following BIT_LENGTHs and ordering: +The following worked example illustrates one sentence under the standard +''medieval_fantasy'' theme shipped with the reference implementation. The +theme has 6 categories with the following bit widths, filling order, natural +order and lead relations:
-Categories       : VERB(6) SUBJECT(7) ADJECTIVE(5) OBJECT(7) PLACE(8)
-Sum of bit widths: 6 + 7 + 5 + 7 + 8 = 33
-FILLING_ORDER    : [SUBJECT, VERB, OBJECT, ADJECTIVE, PLACE]
-NATURAL_ORDER    : [SUBJECT, VERB, ADJECTIVE, OBJECT, PLACE]
-LED_BY           : ADJECTIVE LED_BY OBJECT
+Category   | BIT_LENGTH | LED_BY    | LEADS
+-----------+------------+-----------+--------------------
+VERB       |     5      |  (root)   | SUBJECT
+SUBJECT    |     6      |  VERB     | OBJECT, WILDCARD
+OBJECT     |     6      |  SUBJECT  | ADJECTIVE
+ADJECTIVE  |     5      |  OBJECT   | (none)
+WILDCARD   |     6      |  SUBJECT  | PLACE
+PLACE      |     5      |  WILDCARD | (none)
+-----------+------------+-----------+--------------------
+Sum of bit widths: 5 + 6 + 6 + 5 + 6 + 5 = 33
+
+FILLING_ORDER : [VERB, SUBJECT, OBJECT, ADJECTIVE, WILDCARD, PLACE]
+NATURAL_ORDER : [SUBJECT, VERB, ADJECTIVE, OBJECT, WILDCARD, PLACE]
 
-Take the initial entropy (hex): +Note that the lead relations form a tree rooted at VERB: each +non-root category's wordlist is a sub-list selected by the word already +chosen in its leader. For instance, the wordlist for OBJECT +depends on which SUBJECT was selected, and the wordlist for +ADJECTIVE depends on which OBJECT was selected. -
-initial entropy (128 bits) = 0x8d96 8174 c2cd 0e2c 6f47 e7d6 1bd5 a341
-
- -The SHA-256 of this entropy starts with the byte 0xA0; the -first 128 / 32 = 4 bits of that byte are 1010, -which is the checksum. Appending the checksum yields 132 bits, split into -132 / 33 = 4 blocks of 33 bits. The first 33-bit block is: +Take the following 33-bit block as the first sentence to encode:
-1000 1101 1001 0110 1000 0001 0111 0100 1
-
-binary block (33 bits): 100011011 00101101 0000001 0111010 01
-                        SUBJECT   OBJECT   ADJECTIVE PLACE  VERB (truncated)
+binary block (33 bits): 11100 100000 111111 11011 101110 11101
+                        VERB  SUBJECT OBJECT ADJ.  WILDCARD PLACE
+                        (5)   (6)     (6)    (5)   (6)      (5)
 
-(Bit assignment follows FILLING_ORDER; the table above is purely -illustrative of how the bit string is consumed.) - -For each category in FILLING_ORDER: +Bits are consumed in FILLING_ORDER: -# SUBJECT: read 7 bits → index i_S; pick word from the SUBJECT wordlist (e.g. ''knight''). -# VERB: read 6 bits → index i_V; pick word from the VERB wordlist (e.g. ''slays''). -# OBJECT: read 7 bits → index i_O; pick word from the OBJECT wordlist (e.g. ''dragon''). -# ADJECTIVE: read 5 bits → index i_A; because ADJECTIVE LED_BY OBJECT, look up the sub-list keyed by ''dragon'' and pick at index i_A (e.g. ''ancient''). -# PLACE: read 8 bits → index i_P; pick word from the PLACE wordlist (e.g. ''castle''). +# VERB: read 5 bits = 11100 = 28; the VERB wordlist has 32 (= 2^5) entries, so index 28 selects ''unveil''. +# SUBJECT: read 6 bits = 100000 = 32; because SUBJECT LED_BY VERB, look up the sub-list keyed by ''unveil'' (a list of 64 = 2^6 entries) and pick index 32 → ''king''. +# OBJECT: read 6 bits = 111111 = 63; because OBJECT LED_BY SUBJECT, look up the sub-list keyed by ''king'' (64 entries) and pick index 63 → ''wine''. +# ADJECTIVE: read 5 bits = 11011 = 27; because ADJECTIVE LED_BY OBJECT, look up the sub-list keyed by ''wine'' (32 entries) and pick index 27 → ''sweet''. +# WILDCARD: read 6 bits = 101110 = 46; because WILDCARD LED_BY SUBJECT, look up the sub-list keyed by ''king'' (64 entries) and pick index 46 → ''queen''. +# PLACE: read 5 bits = 11101 = 29; because PLACE LED_BY WILDCARD, look up the sub-list keyed by ''queen'' (32 entries) and pick index 29 → ''throne_room''. -Emitting the slots in NATURAL_ORDER yields the sentence: +Emitting the selected words in NATURAL_ORDER +([SUBJECT, VERB, ADJECTIVE, OBJECT, WILDCARD, PLACE]) yields the sentence:
-knight slays ancient dragon castle
+king unveil sweet wine queen throne_room
 
-Repeating this procedure for the remaining three 33-bit blocks produces a -4-sentence mnemonic story (24 words total for an initial entropy of 128 -bits). Decoding inverts the process: each word is mapped back to its index in the resolved wordlist, the -indices are concatenated in FILLING_ORDER, and the resulting -132-bit string is split into 128 entropy bits + 4 checksum bits and -verified. - -For seed derivation, the recovered entropy 0x8d96…a341 is -re-encoded with the BIP-0039 English wordlist into the standard 12-word -BIP-0039 mnemonic, which is then passed to PBKDF2 exactly as specified by -BIP-0039. The resulting seed is identical to the seed a pure BIP-0039 wallet -would derive from the same entropy. +Read with the implicit articles supplied by the theme this becomes +"the ''king'' ''unveil''(s) ''sweet'' ''wine'' (to the) ''queen'' (in the) +''throne_room''" --- a vivid scene that encodes 33 bits of entropy. + +For an initial entropy of 128 bits, the procedure above is repeated for each +of the four 33-bit blocks (128 entropy bits + 4 checksum bits = 132 = 4 × 33), +producing a 4-sentence mnemonic story of 24 words. + +Decoding inverts the process: each word is mapped back to its index in the +resolved sub-list (using the already-decoded leader to pick the right +sub-list), the indices are concatenated in FILLING_ORDER, and +the resulting 132-bit string is split into 128 entropy bits and 4 checksum +bits, which are verified against SHA-256 of the entropy. + +For seed derivation, the recovered entropy is re-encoded with the BIP-0039 +English wordlist into the standard 12-word BIP-0039 mnemonic, which is then +passed to PBKDF2 exactly as specified by BIP-0039. The resulting seed is +identical to the seed a pure BIP-0039 wallet would derive from the same +entropy. 
==Standard themes== From 2d87a3cbe5b72b8b03aa38e6e39210d98ba07a3a Mon Sep 17 00:00:00 2001 From: Yuri-SVB Date: Sun, 26 Apr 2026 20:35:04 -0300 Subject: [PATCH 07/11] Formosa: explain LED_BY as a primitive next-word predictor Add a paragraph to the LED_BY rationale clarifying that a Formosa theme behaves as a primitive language model (next-word predictor): each LED_BY relation skews the conditional distribution over the next word so that probability mass falls only on the 2^BIT_LENGTH words compatible with the already- chosen leader, and zero elsewhere. The theme designer plays the role of training data, hand-curating which combinations are semantically coherent. This framing makes explicit why themes produce sentences that 'sound right' while still covering all 2^33 bit patterns of a sentence. --- bip.mediawiki | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/bip.mediawiki b/bip.mediawiki index 7a6d4bb134..d658b27cf3 100644 --- a/bip.mediawiki +++ b/bip.mediawiki @@ -229,6 +229,20 @@ each leader's chosen word selects a sub-list of exactly 2^BIT_LENGTH entries, so every bit pattern still decodes to exactly one word. +A Formosa theme works as a primitive language-model. Where a generic +language model assigns each candidate next word a probability conditioned +on the words that came before, a theme assigns probability +1/2^BIT_LENGTH uniformly to the words that are ''compatible'' +with the already-chosen leader(s) and probability 0 to all +others.The role of the theme designer is exactly the role of training data: +by curating which adjectives can describe ''wine'' or which places a ''queen'' +may occupy, the designer sculpts a probability distribution that +broadly excludes nonsensical combinations. The result is that a 33-bit block +likely decodes into a phrase the predictor judges semantically coherent, while +still covering all 2^33 bit patterns. 
The bijection with entropy +is preserved because the support of the distribution at each step has size +exactly 2^BIT_LENGTH, never more and never less. + '''Why re-encode through BIP-0039 for seed derivation?''' Re-encoding makes the seed a function of the entropy alone, not of the theme. This guarantees that: From 000a7401d98bb217cb51b078aa5e13ea36227ddf Mon Sep 17 00:00:00 2001 From: Yuri-SVB Date: Sun, 26 Apr 2026 21:36:48 -0300 Subject: [PATCH 08/11] Cite the companion project Mooncake (https://github.com/T3-Infosec/mooncake) which builds on this property by rendering each Formosa category as an on-screen table whose rows and columns are permuted per input session. Combined with the randomized-indexation property, an attacker watching only the screen still learns nothing without also recovering the press sequence. Add a Rationale paragraph explaining a further benefit of splitting the vocabulary into several short wordlists (32-128 entries each): such tables fit on a mobile-device screen and admit input via on-screen lookup, which a single 2048-word list does not. The randomized indexation: - defeats pure key-logging (keystrokes alone don't reveal words; the attacker also needs the session permutation), - raises the bar for shoulder surfing (same as key-logging: only keys AND session's permutation suffice. Either alone is uniformative). This gives an operational, security-focused argument for the many-small-lists design that complements the existing memorization and information-density arguments. Formosa: document Mooncake's volume-key input on mobile Add a paragraph to the Mooncake rationale describing the proposed mobile input mechanism: reuse of the volume-up / volume-down keys as a two-button binary selector. 
Because every Formosa category is sized 2^BIT_LENGTH and the on-screen table is laid out in rows, sub-rows and columns whose counts are powers of two, narrowing to a single cell takes exactly BIT_LENGTH presses (5 for a 32-entry category, 6 for 64, 7 for 128). The per-category press count is invariant therefore uninformative, and equal to the bits of entropy encoded, and the 'one bit per press' bound matches the existing side-channel argument. Add three concrete reasons why volume-key input on mobile resists visual shoulder surfing better than an on-screen keyboard: - Subtler input motions: a single finger pressing a side rocker, much harder to read from a distance than multi-finger taps on a glass keyboard. - Easy occlusion with the second hand: both volume keys are on one edge of the device, so the free hand (or the holding hand's thumb) can cover them without obscuring the screen for the user. - Pocket input via headphone volume buttons: because the protocol is purely binary, headphone volume controls are sufficient, letting the user keep the buttons in a pocket while operating it by feel and removing the input motion from the observer's field of view entirely. --- bip.mediawiki | 49 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 49 insertions(+) diff --git a/bip.mediawiki b/bip.mediawiki index d658b27cf3..42d26fa348 100644 --- a/bip.mediawiki +++ b/bip.mediawiki @@ -256,6 +256,55 @@ that produced it is still available at recovery time. Standard themes shipped by reference implementations enjoy that guarantee; one-off custom themes do not, and the user assumes responsibility for preserving the theme file. +'''Why many small wordlists rather than one 2048-word list?''' Beyond the +memorization and information-density benefits already discussed, splitting +the vocabulary into several short, syntactically-typed wordlists (32, 64 +or 128 entries each) enables interaction patterns that a single 2048-word +list does not. 
A 32- or 64-entry table fits comfortably on a mobile-device +screen with legible typography, so a user can input a Formosa mnemonic by +selecting cells from compact lookup tables instead of typing each word. + +This in turn enables the companion project '''Mooncake''' +(https://github.com/T3-Infosec/mooncake), which renders +each Formosa category as an on-screen table. The words themselves stay in +their alphabetical positions in the table (so the user can locate them +visually); what is randomized per input session is the '''indexation''', +i.e. the labels (numbers or short codes) that the user must type to +designate a given cell. The user therefore enters a sequence of session- +specific indexes rather than the words themselves. The security properties +of mnemonic input are improved on two fronts: + +* '''Keylogging is no longer sufficient.''' A keylogger captures only the sequence of indexes typed; without the per-session indexation map, that sequence cannot be inverted to the underlying words. Recovery of the mnemonic requires both the keystrokes and the random indexation that was active at input time. +* '''Shoulder surfing requires compromising two channels.''' An attacker who only watches the keyboard sees the same indexes a keylogger would, and an attacker who only watches the screen sees only the (always alphabetical) wordlists with their session-specific labels. To recover the mnemonic the attacker must capture both the typed indexes ''and'' the indexation displayed during that same session. + +On mobile devices, where there is no convenient hardware keyboard, +Mooncake's proposed input mechanism reuses the '''volume keys''' as a +two-button binary selector: each press of volume-up / volume-down chooses +between two halves of the table. 
Because every Formosa category has a +wordlist of size 2^BIT_LENGTH and the on-screen table is laid +out in rows, sub-rows and columns whose counts are themselves powers of +two, narrowing down to a single cell takes exactly BIT_LENGTH +binary presses --- 5 presses for a 32-entry category, 6 for a 64-entry +category, 7 for a 128-entry category. The number of presses per category +is therefore constant, deterministic, and equal to the bits of entropy that +category encodes; number of presses are invariant, hence uninformative. +This also keeps the per-press observation bound ("one bit per press") aligned +with the side-channel argument above: a shoulder-surfer who sees only the +volume-key presses captures the same indexation-relative bits a keylogger would. + +The volume-key channel further raises the bar against shoulder +surfing in ways that a keyboard cannot match: + +* '''Subtler input motions.''' Pressing a volume rocker involves a small movement of a single finger against the side of the device, far less conspicuous than the multi-finger tapping pattern of a keyboard. An observer trying to read the input visually has much less motion to work with. +* '''Easy occlusion with the second hand.''' Because both volume keys live on one edge of the device, the user can hold the phone in one hand and cover the volume rocker with the other (or with the same hand's thumb), occluding the input from any line-of-sight observer without obscuring the screen for the user. +* '''Pocket input via headphone controls.''' Many wired and wireless headphones expose volume-up / volume-down buttons. Mooncake's binary protocol means those headphone buttons are sufficient to drive the entire input flow, so the user can keep them in a pocket or bag and operate the volume buttons by feel, removing the input motion from the observer's field of view entirely. 
Combined with the randomized-indexation property, an attacker who only sees the screen still learns nothing about the chosen words without also recovering the press sequence. + +These properties depend on the small wordlists Formosa uses; a single +2048-entry list would hardly fit a typical dektop screen and not at all +those of a mobile device. Mooncake therefore provides a concrete operational +reason to prefer many small, syntactically-typed wordlists, complementing +the cognitive arguments above. + ==Backwards Compatibility== Formosa is a strict superset of BIP-0039. Compatibility is achieved on three From 38c7dfd7541cceb05c99c9392e679e53c7091a81 Mon Sep 17 00:00:00 2001 From: Yuri S Villas Boas Date: Tue, 28 Apr 2026 14:18:14 -0300 Subject: [PATCH 09/11] Update bip.mediawiki Fixed typo from "dektop" to "desktop" Fixed agreement of number from "Those of a mobile device" to "Those of mobile devices" --- bip.mediawiki | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bip.mediawiki b/bip.mediawiki index 42d26fa348..d406c00cba 100644 --- a/bip.mediawiki +++ b/bip.mediawiki @@ -300,8 +300,8 @@ surfing in ways that a keyboard cannot match: * '''Pocket input via headphone controls.''' Many wired and wireless headphones expose volume-up / volume-down buttons. Mooncake's binary protocol means those headphone buttons are sufficient to drive the entire input flow, so the user can keep them in a pocket or bag and operate the volume buttons by feel, removing the input motion from the observer's field of view entirely. Combined with the randomized-indexation property, an attacker who only sees the screen still learns nothing about the chosen words without also recovering the press sequence. These properties depend on the small wordlists Formosa uses; a single -2048-entry list would hardly fit a typical dektop screen and not at all -those of a mobile device. 
Mooncake therefore provides a concrete operational +2048-entry list would hardly fit a typical desktop screen and not at all +those of mobile devices. Mooncake therefore provides a concrete operational reason to prefer many small, syntactically-typed wordlists, complementing the cognitive arguments above. From 923faa48805b3d4e83836e767332740a7945cad9 Mon Sep 17 00:00:00 2001 From: Yuri S Villas Boas Date: Wed, 29 Apr 2026 19:46:25 -0300 Subject: [PATCH 10/11] Update bip.mediawiki MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Substituted triple hyphen for — Co-authored-by: Murch --- bip.mediawiki | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bip.mediawiki b/bip.mediawiki index d406c00cba..052a9fad88 100644 --- a/bip.mediawiki +++ b/bip.mediawiki @@ -47,8 +47,8 @@ the mental associations that aid long-term retention. Formosa builds upon BIP-0039 by organizing mnemonic words into themed sentences with syntactic roles (e.g., subject, verb, adjective, object, place). Each -sentence draws vocabulary from a coherent semantic domain --- medieval fantasy, -science fiction, nature, finance, or any custom theme --- enabling the user to +sentence draws vocabulary from a coherent semantic domain — medieval fantasy, +science fiction, nature, finance, or any custom theme — enabling the user to form vivid mental images that reduce memorization effort per bit of entropy. This guide is meant to be a way to transport computer-generated randomness with From 08df954e5ff367756b136edc55f44fc66d525ab6 Mon Sep 17 00:00:00 2001 From: Yuri S Villas Boas Date: Wed, 29 Apr 2026 19:51:01 -0300 Subject: [PATCH 11/11] Update bip.mediawiki Updated title to mention Formosa and be more self-explanatory. Co-authored-by: Murch --- bip.mediawiki | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bip.mediawiki b/bip.mediawiki index 052a9fad88..620534b797 100644 --- a/bip.mediawiki +++ b/bip.mediawiki @@ -1,7 +1,7 @@
   BIP: ?
   Layer: Applications
-  Title: Encoding seed as themed mnemonic sentences
+  Title: Formosa—Seed encoding per themed mnemonic stories
   Authors: Yuri S Villas Boas 
            André Fidencio Gonçalves 
   Status: Draft