Merged
Changes from 15 commits
80 changes: 54 additions & 26 deletions src/v-st-ext.adoc
@@ -21,14 +21,17 @@ Each hart supporting a vector extension defines two parameters:

. [#norm:elen]#The maximum size in bits of a vector element that any operation can produce or consume, _ELEN_ {ge} 8, which
must be a power of 2.#
. [#norm:vlen]#The number of bits in a single vector register, _VLEN_ {ge} ELEN, which must be a power of 2, and must be no greater than 2^16^.#
. [#norm:vlen]#The number of bits in a single vector register, _VLEN_ {ge} 8, which must be a power of 2, and must be no greater than 2^16^.#

Standard vector extensions (<<sec-vector-extensions>>) and
architecture profiles may set further constraints on _ELEN_ and _VLEN_.

NOTE: Future extensions may allow ELEN {gt} VLEN by holding one
element using bits from multiple vector registers, but this
extension does not include this option.
NOTE: Following the ratification of the V extension, this specification
has been revised to admit the possibility of future extensions that
allow ELEN > VLEN, wherein one element is held using bits from
multiple vector registers.
These relaxations have no impact on implementations with ELEN {le}
VLEN or on existing software that assumes ELEN {le} VLEN.

NOTE: The upper limit on VLEN allows software to know that indices
will fit into 16 bits (largest VLMAX of 65,536 occurs for LMUL=8 and
@@ -280,15 +283,18 @@ register-resident vectors.
Implementations must provide fractional LMUL settings that allow the
narrowest supported type to occupy a fraction of a vector register
corresponding to the ratio of the narrowest supported type's width to
that of the largest supported type's width. In general, the
requirement is to support LMUL {ge} SEW~MIN~/ELEN, where SEW~MIN~ is
the narrowest supported SEW value and ELEN is the widest supported SEW
value. In the standard extensions, SEW~MIN~=8. For
the smaller of VLEN and the largest supported type's width.
In general, the requirement is to support LMUL {ge} SEW~MIN~/min(ELEN, VLEN),
where SEW~MIN~ is the narrowest supported SEW value, ELEN is the widest
supported SEW value and VLEN is the number of bits in a vector register.
In the standard extensions, SEW~MIN~=8. For
standard vector extensions with ELEN=32, fractional LMULs of 1/2 and
1/4 must be supported. For standard vector extensions with ELEN=64,
1/4 must be supported. For standard vector extensions with ELEN=64 and ELEN {le} VLEN,
fractional LMULs of 1/2, 1/4, and 1/8 must be supported.
For a vector extension with SEW~MIN~=8, ELEN=64 and VLEN=32, fractional LMULs of 1/2 and 1/4
must be supported.
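
As an illustrative sketch (not part of the specification text), the bound LMUL >= SEW_MIN/min(ELEN, VLEN) can be evaluated for the cases named above. The function name `required_fractional_lmuls` is hypothetical, and the candidate list assumes the three fractional LMUL settings 1/2, 1/4 and 1/8 mentioned in this paragraph.

```python
from fractions import Fraction


def required_fractional_lmuls(sew_min: int, elen: int, vlen: int) -> list:
    """Fractional LMULs that satisfy LMUL >= SEW_MIN / min(ELEN, VLEN)."""
    bound = Fraction(sew_min, min(elen, vlen))
    # Assumption: only 1/2, 1/4 and 1/8 are candidate fractional settings.
    candidates = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8)]
    return [lmul for lmul in candidates if lmul >= bound]


if __name__ == "__main__":
    for sew_min, elen, vlen in [(8, 32, 128), (8, 64, 128), (8, 64, 32)]:
        lmuls = required_fractional_lmuls(sew_min, elen, vlen)
        print(f"ELEN={elen} VLEN={vlen}:", ", ".join(str(l) for l in lmuls))
```

With ELEN=64 and VLEN=32 the bound is 8/32, so 1/8 drops out of the required set, which matches the last sentence of the paragraph.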

NOTE: When LMUL < SEW~MIN~/ELEN, there is no guarantee
NOTE: When LMUL < SEW~MIN~/min(ELEN, VLEN), there is no guarantee
an implementation would have enough bits in the fractional vector
register to store at least one element, as VLEN=ELEN is a
valid implementation choice. For example, with VLEN=ELEN=32,
@@ -297,20 +303,20 @@ storage in a vector register.

[[norm:vtype_sew_val]]
For a given supported fractional LMUL setting, implementations must support
SEW settings between SEW~MIN~ and LMUL * ELEN, inclusive.
SEW settings between SEW~MIN~ and LMUL * min(ELEN, VLEN), inclusive.
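
A minimal sketch of this rule, assuming SEW values are powers of two starting at SEW_MIN; the helper name `required_sews` is hypothetical. An empty result corresponds to a setting with LMUL < SEW_MIN/min(ELEN, VLEN), where no SEW fits.

```python
from fractions import Fraction


def required_sews(lmul: Fraction, sew_min: int, elen: int, vlen: int) -> list:
    """SEW values from SEW_MIN up to LMUL * min(ELEN, VLEN), inclusive."""
    upper = lmul * min(elen, vlen)
    sews = []
    sew = sew_min
    while sew <= upper:
        sews.append(sew)
        sew *= 2  # assumption: SEW doubles at each step (8, 16, 32, 64, ...)
    return sews


if __name__ == "__main__":
    print(required_sews(Fraction(1, 2), 8, 64, 128))
    print(required_sews(Fraction(1, 4), 8, 64, 32))
    print(required_sews(Fraction(1, 8), 8, 64, 32))
```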

[[norm:vtype_lmul_fval_rsv]]
The use of `vtype` encodings with LMUL < SEW~MIN~/ELEN is
The use of `vtype` encodings with LMUL < SEW~MIN~/min(ELEN, VLEN) is
__reserved__, but implementations can set `vill` if they do not
support these configurations.

NOTE: Requiring all implementations to set `vill` in this case would
prohibit future use of this case in an extension, so to allow for a
future definition of LMUL<SEW~MIN~/ELEN behavior, we
future definition of LMUL<SEW~MIN~/min(ELEN, VLEN) behavior, we
consider the use of this case to be __reserved__.

NOTE: It is recommended that assemblers provide a warning (not an
error) if a `vsetvli` instruction attempts to write an LMUL < SEW~MIN~/ELEN.
error) if a `vsetvli` instruction attempts to write an LMUL < SEW~MIN~/min(ELEN, VLEN).

[[norm:lmul]]
LMUL is set by the signed `vlmul` field in `vtype` (i.e., LMUL =
@@ -776,6 +782,12 @@ lowest-numbered vector register and moving to the
next-highest-numbered vector register in the group once each vector
register is filled.

If a vector extension supports EEW > VLEN, one element can span multiple
vector registers, in which case the least-significant bits of the element
are held in the lowest-numbered vector register.
Instructions that access vector register groups with EMUL < EEW/VLEN are
reserved.
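
The placement rule can be sketched as follows. This is an illustration under stated assumptions, not normative text: the helper names `split_element` and `is_reserved` are hypothetical, and EEW is assumed to be a multiple of VLEN.

```python
from fractions import Fraction


def split_element(value: int, eew: int, vlen: int, base_reg: int) -> dict:
    """Map register number -> VLEN-bit chunk of one EEW-bit element (EEW > VLEN).

    The least-significant bits go to the lowest-numbered register.
    """
    if eew % vlen != 0:
        raise ValueError("sketch assumes EEW is a multiple of VLEN")
    mask = (1 << vlen) - 1
    return {
        base_reg + i: (value >> (i * vlen)) & mask
        for i in range(eew // vlen)
    }


def is_reserved(emul: Fraction, eew: int, vlen: int) -> bool:
    """True when the register group is too small to hold a single element."""
    return emul < Fraction(eew, vlen)


if __name__ == "__main__":
    chunks = split_element(0x1122334455667788, eew=64, vlen=32, base_reg=4)
    for reg, chunk in chunks.items():
        print(f"v{reg}: {chunk:#010x}")
    print(is_reserved(Fraction(1), eew=64, vlen=32))
```

For VLEN=32 and EEW=64 a group of two registers is the smallest that can hold one element, so EMUL=1 falls in the reserved range.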

----
LMUL > 1 examples

@@ -834,6 +846,14 @@ register is filled.
v4*n+1 7 6 5 4
v4*n+2 B A 9 8
v4*n+3 F E D C

VLEN=32b, SEW=64b, LMUL=4

Byte 3 2 1 0
v4*n 0
v4*n+1
v4*n+2 1
v4*n+3
----

[[sec-mapping-mixed]]
@@ -1033,6 +1053,8 @@ LMUL=8 is reserved as this would imply a result EMUL=16.
Widened scalar values, e.g., input and output to a widening reduction
operation, are held in the first element of a vector register and
have EMUL=1.
If a vector extension supports EEW > VLEN, EEW-wide widened scalar
values are held in a vector register group with EMUL = EEW/VLEN.

==== Vector Masking

@@ -1618,7 +1640,7 @@ vse64.v vs3, (rs1), vm # 64-bit unit-stride store
Additional unit-stride mask load and store instructions are
provided to transfer mask values to/from memory. These
operate similarly to unmasked byte loads or stores (EEW=8), except that
the effective vector length is ``evl``=ceil(``vl``/8) (i.e. EMUL=1),
the effective vector length is ``evl`` = ceil(``vl``/8) (i.e. EMUL=1),
and the destination register is always written with a tail-agnostic
policy.
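
For illustration, the effective vector length of the mask loads and stores is the number of bytes needed to hold `vl` mask bits. The function name `mask_evl` is hypothetical.

```python
def mask_evl(vl: int) -> int:
    """evl = ceil(vl / 8), computed with integer arithmetic."""
    return (vl + 7) // 8


if __name__ == "__main__":
    for vl in (0, 1, 8, 9, 17):
        print(f"vl={vl} -> evl={mask_evl(vl)}")
```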

@@ -2069,8 +2091,9 @@ handlers, and OS context switches. Software can determine the number
of bytes transferred by reading the `vlenb` register.

[[norm:vector_ls_seg_wholereg_eew]]
The load instructions have an EEW encoded in the `mew` and `width`
The load instructions have the element width encoded in the `mew` and `width`
fields following the pattern of regular unit-stride loads.
EEW is computed as EEW=min(VLEN, EEW_encoded).
Collaborator

EEW is used as a hint to the microarchitecture and doesn't affect the architectural behavior of these instructions. I don't think it should change, in case some funny trick is used by a microarchitecture that knows EEW > VLEN?

Member

@aswaterman aswaterman Mar 17, 2026

I think this is a clean definition that doesn't actually preclude funny tricks. I suggested it because it's important we don't preclude vtype-unaware register spill/fill code from e.g. moving v1 to v2 when SEW > VLEN. Without this definition, the reference to v1 would make the instruction reserved.

A uarch is free to ignore this dictum in its trickery and represent the destination with whatever EEW it wants, e.g. it could compute its internal-representation EEW as a function of VLEN, EEW_encoded, and the register specifiers involved.

Contributor Author

@DmitryUtyansky DmitryUtyansky Mar 17, 2026

I deleted the paragraph that was left there by oversight after the earlier edit introducing "EEW=min(VLEN, EEW_encoded)." ("In implementations supporting ELEN > VLEN, element size can exceed the number of bits available in a vector register or a vector register group. In that case, the whole vector register load instructions
still operate on the specified number of vector register(s), using the
least-significant bits of the element.")

It's either one or the other: EEW as min(...), as @aswaterman suggested, or "use LSBs of an unchanged bigger EEW", as it was originally.

The benefit of having EEW defined through min(...) is that the subsequent formula evl=NFIELDS*VLEN/EEW works without giving a fractional evl.

Thinking more about this, the formula with min(VLEN, EEW_encoded) breaks, for example, vl2re64.v with VLEN=32 and EEW=64: a perfectly valid "load pair of vregs with EEW=64" is now switched to EEW=32 (potentially misleading all those "microarch hints").
I have changed the formula to "EEW=min(VLEN*NFIELDS, EEW_encoded)", limiting EEW to whatever fits into the requested group. The same logic applies to the whole register move: I have changed the formula there as well, to EEW = min(VLEN*EMUL, SEW), factoring in EMUL.

@aswaterman, @kasanovic, please tell me what you think of the current wording.

Member

Good catch. I think this scheme hangs together. I'll chat with @kasanovic about this topic later this week and try to reach a conclusion.
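
The trade-off discussed in this thread can be sketched numerically. This is only an illustration of the three candidate rules for the whole register loads with VLEN=32 and an encoded EEW of 64; the function names are hypothetical and none of them is claimed to be the final wording.

```python
from fractions import Fraction


def evl(nfields: int, vlen: int, eew: int) -> Fraction:
    """evl = NFIELDS * VLEN / EEW, kept exact to expose fractional results."""
    return Fraction(nfields * vlen, eew)


def eew_encoded_only(vlen: int, nfields: int, eew_encoded: int) -> int:
    """Original scheme: use the encoded EEW unchanged."""
    return eew_encoded


def eew_min_vlen(vlen: int, nfields: int, eew_encoded: int) -> int:
    """EEW = min(VLEN, EEW_encoded)."""
    return min(vlen, eew_encoded)


def eew_min_group(vlen: int, nfields: int, eew_encoded: int) -> int:
    """EEW = min(VLEN * NFIELDS, EEW_encoded)."""
    return min(vlen * nfields, eew_encoded)


if __name__ == "__main__":
    VLEN, ENCODED = 32, 64
    for name, nfields in (("vl1re64.v", 1), ("vl2re64.v", 2)):
        for rule in (eew_encoded_only, eew_min_vlen, eew_min_group):
            eew = rule(VLEN, nfields, ENCODED)
            print(f"{name} {rule.__name__}: EEW={eew} evl={evl(nfields, VLEN, eew)}")
```

Under these assumptions the unclamped rule yields a fractional evl for vl1re64.v, min(VLEN, EEW_encoded) lowers vl2re64.v to EEW=32, and min(VLEN*NFIELDS, EEW_encoded) keeps EEW=64 for vl2re64.v while still giving an integer evl in both cases.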


NOTE: Because in-register byte layouts are identical to in-memory byte
layouts, the same data is written to the destination register group
@@ -2080,6 +2103,12 @@ The full set of EEW variants is provided so that the encoded EEW can be used
as a hint to indicate the destination register group will next be accessed
with this EEW, which aids implementations that rearrange data internally.

In vector extensions supporting ELEN > VLEN, element size can exceed
the number of bits available in a vector register or a vector register
group. In that case, the whole vector register load instructions
still operate on the specified number of vector register(s), using the
least-significant bits of the element.

The vector whole register store instructions are encoded similar to
unmasked unit-stride store of elements with EEW=8.

@@ -3898,12 +3927,10 @@ destination format is converted to the destination format's largest finite value
=== Vector Reduction Operations

[#norm:vreduction_scalar_def]#Vector reduction operations take a vector register group of elements
and a scalar held in element 0 of a vector register, and perform a
and a scalar held in element 0 of a vector register group, and perform a
reduction using some binary operator, to produce a scalar result in
element 0 of a vector register.# [#norm:vreduction_scalar_disregard_LMUL]#The scalar input and output operands
are held in element 0 of a single vector register, not a vector
register group, so any vector register can be the scalar source or
destination of a vector reduction regardless of LMUL setting.#
element 0 of a vector register group.# [#norm:vreduction_scalar_disregard_LMUL]#The scalar input and output operands
are held in element 0 of a group with EMUL = ceil(EEW/VLEN), regardless of LMUL setting.#
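
As a small sketch of the EMUL = ceil(EEW/VLEN) rule for scalar operands, the helper below lists the registers occupied by element 0. The names `scalar_emul` and `scalar_registers` are hypothetical.

```python
def scalar_emul(eew: int, vlen: int) -> int:
    """EMUL = ceil(EEW / VLEN); this is 1 whenever EEW <= VLEN."""
    return -(-eew // vlen)


def scalar_registers(base_reg: int, eew: int, vlen: int) -> list:
    """Register numbers holding element 0, lowest-numbered first."""
    return [base_reg + i for i in range(scalar_emul(eew, vlen))]


if __name__ == "__main__":
    print(scalar_emul(64, 128), scalar_registers(8, 64, 128))
    print(scalar_emul(64, 32), scalar_registers(8, 64, 32))
```

For implementations with EEW <= VLEN the result is always a single register, so the scalar operand occupies one register as before.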

[#norm:vreduction_vd_overlap_vs]#The destination vector register can overlap the source operands,
including the mask register.#
@@ -4500,7 +4527,7 @@ around within the vector registers.

The integer scalar read/write instructions transfer a single
value between a scalar `x` register and element 0 of a vector
register. [#norm:vmv-x-s_vmv-s-x_ignoreLMUL]#The instructions ignore LMUL and vector register groups.#
register group with EMUL = ceil(EEW/VLEN). [#norm:vmv-x-s_vmv-s-x_ignoreLMUL]#The instructions ignore LMUL; EMUL is computed as ceil(EEW/VLEN).#

----
vmv.x.s rd, vs2 # x[rd] = vs2[0] (vs1=0)
@@ -4515,7 +4542,8 @@ ignored. If SEW < XLEN, the value is sign-extended to XLEN bits.#
NOTE: [#norm:vmv-x-s_vstartgevl_vl0]#`vmv.x.s` performs its operation even if `vstart` {ge} `vl` or `vl`=0.#

[#norm:vmv-s-x_op]#The `vmv.s.x` instruction copies the scalar integer register to element 0 of
the destination vector register. If SEW < XLEN, the least-significant bits
the destination vector register group with EMUL = ceil(EEW/VLEN).
If SEW < XLEN, the least-significant bits
are copied and the upper XLEN-SEW bits are ignored. If SEW > XLEN, the value
is sign-extended to SEW bits. The other elements in the destination vector
register ( 0 < index < VLEN/SEW) are treated as tail elements using the current tail agnostic/undisturbed policy.# [#norm:vmv-s-x_vstart_ge_vl]#If `vstart` {ge} `vl`, no
@@ -4532,7 +4560,7 @@ and `vmv.s.x` are reserved.#

The floating-point scalar read/write instructions transfer a single
value between a scalar `f` register and element 0 of a vector
register. [#norm:vfmv-f-s_vfmv-s-f_ignoreLMUL]#The instructions ignore LMUL and vector register groups.#
register group with EMUL = ceil(EEW/VLEN). [#norm:vfmv-f-s_vfmv-s-f_ignoreLMUL]#The instructions ignore LMUL; EMUL is computed as ceil(EEW/VLEN).#

----
vfmv.f.s rd, vs2 # f[rd] = vs2[0] (rs1=0)
@@ -4875,8 +4903,8 @@ e q r d c b v a # v11 destination after vrgather using viota.m under mask
[#norm:vmv-nr-r_op]#The `vmv<nr>r.v` instructions copy whole vector registers (i.e., all
VLEN bits) and can copy whole vector register groups. The `nr` value
in the opcode is the number of individual vector registers, NREG, to
copy. The instructions operate as if EEW=SEW, EMUL = NREG, effective
length `evl`= EMUL * VLEN/SEW.#
copy. The instructions operate as if EEW = min(VLEN, SEW), EMUL = NREG, and effective
length `evl` = EMUL * VLEN/EEW.#
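
A minimal sketch of the parameters as worded in this revision of the diff, EEW = min(VLEN, SEW), EMUL = NREG and evl = EMUL * VLEN/EEW. The function name `whole_reg_move_params` is hypothetical, and VLEN and SEW are assumed to be powers of two so the division is exact.

```python
def whole_reg_move_params(nreg: int, vlen: int, sew: int) -> tuple:
    """Return (EEW, EMUL, evl) for a whole register move of NREG registers."""
    eew = min(vlen, sew)       # clamp the element width to one register
    emul = nreg                # the group is exactly the registers copied
    evl = emul * vlen // eew   # exact because EEW divides VLEN here
    return eew, emul, evl


if __name__ == "__main__":
    print(whole_reg_move_params(nreg=2, vlen=128, sew=32))
    print(whole_reg_move_params(nreg=2, vlen=32, sew=64))
```

With SEW > VLEN the clamp keeps evl an integer; without it, EMUL * VLEN/SEW would be fractional for a single-register move.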

NOTE: These instructions are intended to aid compilers to shuffle
vector registers without needing to know or change `vl`.