Changes from 50 commits
* `57ccc48` Sticky Events (kegsay, Sep 16, 2025)
* `94b1a87` Remove prev_batch (kegsay, Sep 16, 2025)
* `50d76e6` Update 4354-sticky-events.md (kegsay, Sep 16, 2025)
* `3baf0d8` Update 4354-sticky-events.md (kegsay, Sep 16, 2025)
* `29e9bf7` Update 4354-sticky-events.md (kegsay, Sep 16, 2025)
* `b6e8159` Syntax (kegsay, Sep 17, 2025)
* `33ec282` Update proposals/4354-sticky-events.md (kegsay, Sep 18, 2025)
* `7725f74` Update proposals/4354-sticky-events.md (kegsay, Sep 18, 2025)
* `192c6b4` Update 4354-sticky-events.md (kegsay, Sep 18, 2025)
* `97c9c5b` Update 4354-sticky-events.md (kegsay, Sep 19, 2025)
* `8d101fd` Update 4354-sticky-events.md (kegsay, Sep 19, 2025)
* `c75e19c` Update 4354-sticky-events.md (kegsay, Sep 19, 2025)
* `c925a4c` Update 4354-sticky-events.md (kegsay, Sep 19, 2025)
* `6524be2` Update 4354-sticky-events.md (kegsay, Sep 22, 2025)
* `d14448c` Update 4354-sticky-events.md (kegsay, Sep 22, 2025)
* `ce37b02` Update 4354-sticky-events.md (kegsay, Sep 22, 2025)
* `caf3fcd` Update 4354-sticky-events.md (kegsay, Sep 23, 2025)
* `ba01efd` Update 4354-sticky-events.md (kegsay, Sep 23, 2025)
* `06d7aa5` Update 4354-sticky-events.md (kegsay, Sep 24, 2025)
* `b44ccaa` Update 4354-sticky-events.md (kegsay, Sep 25, 2025)
* `81cf728` Update 4354-sticky-events.md (kegsay, Sep 25, 2025)
* `eced090` Update 4354-sticky-events.md (kegsay, Sep 25, 2025)
* `cec1815` Update 4354-sticky-events.md (kegsay, Sep 26, 2025)
* `b94096a` Update 4354-sticky-events.md (kegsay, Sep 26, 2025)
* `b9ed93f` Move around k:v map bits to Addendum (kegsay, Sep 26, 2025)
* `b135726` Update 4354-sticky-events.md (kegsay, Sep 26, 2025)
* `8f0e3ce` Update proposals/4354-sticky-events.md (kegsay, Sep 27, 2025)
* `3c26e3b` Update 4354-sticky-events.md (kegsay, Sep 29, 2025)
* `71e83cb` Update 4354-sticky-events.md (kegsay, Oct 1, 2025)
* `b2eab83` Apply suggestions from code review (kegsay, Oct 1, 2025)
* `99ee9f8` Update 4354-sticky-events.md (kegsay, Oct 1, 2025)
* `3ff65a5` Update 4354-sticky-events.md (kegsay, Oct 1, 2025)
* `865746c` Update 4354-sticky-events.md (kegsay, Oct 2, 2025)
* `240d650` Update proposals/4354-sticky-events.md (kegsay, Oct 2, 2025)
* `434794d` Update 4354-sticky-events.md (kegsay, Oct 7, 2025)
* `6f94547` Update 4354-sticky-events.md (kegsay, Oct 8, 2025)
* `0d5e4d8` Apply suggestions from code review (kegsay, Oct 29, 2025)
* `7e54063` Redaction and last-to-expire commentary (kegsay, Oct 29, 2025)
* `da7c7c7` vdh comments (kegsay, Oct 29, 2025)
* `50c1910` more vdh (kegsay, Oct 29, 2025)
* `732a72b` More vdh comments (kegsay, Oct 31, 2025)
* `e5c1635` Pagination (kegsay, Oct 31, 2025)
* `331484d` Replace 'companion endpoint pagination' with MSC3885-style 'subtoken … (reivilibre, Dec 16, 2025)
* `4340903` Describe interaction of sticky events with /sync RoomFilter (reivilibre, Dec 19, 2025)
* `41deb2d` Describe 'join room' behaviour for sync, as well as 'all joined rooms… (reivilibre, Apr 1, 2026)
* `082a157` MUST send in order -> SHOULD (best-effort) but no need to guarantee (reivilibre, Apr 1, 2026)
* `8fbd13d` Simplify by picking a lane for /sync deduplication (reivilibre, Apr 1, 2026)
* `8c491f3` Unify on deduplication also for sliding sync (reivilibre, Apr 1, 2026)
* `5c6bd89` Policy servers and similar spam checkers disable the stickiness (reivilibre, Apr 1, 2026)
* `ad1203d` Sliding sync: fix to 'only for interested rooms, regardless of top-N … (reivilibre, Apr 7, 2026)
331 changes: 331 additions & 0 deletions proposals/4354-sticky-events.md
# MSC4354: Sticky Events

MatrixRTC currently depends on [MSC3757](https://github.com/matrix-org/matrix-spec-proposals/pull/3757)
for sending per-user, per-device state. MatrixRTC wants to be able to share temporary state with all
users in a room to indicate whether a given client is in the call or not.

The concerns with MSC3757 and using it for MatrixRTC are mainly:

1. In order to ensure other users are unable to modify each other’s state, it proposes using
string packing for authorization which feels wrong, given the structured nature of events.
2. Allowing unprivileged users to send arbitrary amounts of state into the room is a potential
abuse vector, as these states can pile up and can never be cleaned up as the DAG is append-only.
3. State resolution can cause rollbacks. These rollbacks may inadvertently affect per-user per-device state.

Other proposals have similar problems, such as live location sharing, which uses state events when it
really just wants per-user last-write-wins behaviour.

There currently exists no good communication primitive in Matrix to send this kind of data. EDUs are
almost the right primitive, but:

* They can’t be sent by clients (there is no concept of EDUs in the Client-Server API!
[MSC2477](https://github.com/matrix-org/matrix-spec-proposals/pull/2477) tries to change that)
* They aren’t extensible.
* They do not guarantee delivery. Each EDU type has slightly different persistence/delivery guarantees,
all of which currently fall short of guaranteeing delivery.

This proposal adds such a primitive, called Sticky Events, which provides the following guarantees:

* Eventual delivery (with timeouts) and convergence.
* Access control tied to the joined members in the room.
* Extensible, able to be sent by clients.

This new primitive can be used to implement MatrixRTC participation, live location sharing, among other functionality.

## Proposal

Message events can be annotated with a new top-level `sticky` key, which MUST contain a `duration_ms`
property: the number of milliseconds for which the event is sticky. The presence of `sticky.duration_ms`
with a valid value makes the event “sticky”[^stickyobj]. Valid values are integers in the range 0-3600000 (1 hour).

```json
{
  "type": "m.rtc.member",
  "sticky": {
    "duration_ms": 600000
  },
  "sender": "@alice:example.com",
  "room_id": "!foo",
  "origin_server_ts": 1757920344000,
  "content": { ... }
}
```

This key can be set by clients in the CS API by a new query parameter `stick_duration_ms`, which is
added to the following endpoints:

* `PUT /_matrix/client/v3/rooms/{roomId}/send/{eventType}/{txnId}`
* `PUT /_matrix/client/v3/rooms/{roomId}/state/{eventType}/{stateKey}`
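
As an illustration, a client-side helper for constructing such a request might look like this. The endpoint path and the `stick_duration_ms` query parameter come from this proposal; the helper name and example homeserver URL are hypothetical, and actually sending the request (with authentication) is left out:

```python
from urllib.parse import quote

def build_sticky_send_request(homeserver: str, room_id: str, event_type: str,
                              txn_id: str, content: dict,
                              stick_duration_ms: int) -> dict:
    """Build (but do not send) a PUT /send request carrying stick_duration_ms."""
    path = (f"/_matrix/client/v3/rooms/{quote(room_id, safe='')}"
            f"/send/{quote(event_type, safe='')}/{quote(txn_id, safe='')}")
    return {
        "method": "PUT",
        "url": homeserver + path,
        # The new query parameter proposed by this MSC:
        "params": {"stick_duration_ms": stick_duration_ms},
        "json": content,
    }
```

The returned dict maps directly onto the arguments of most HTTP client libraries.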

To calculate if any sticky event is still sticky:

* Calculate the start time:
  * The start time is `min(now, origin_server_ts)`. This ensures that malicious origin timestamps cannot
    specify start times in the future.
  * If the event is pushed via `/send`, servers MAY use the current time as the start time. This minimises
    the risk of clock skew causing the start time to be too far in the past. See “Potential issues > Time”.
* Calculate the end time as `start_time + min(stick_duration_ms, 3600000)`.
* If the end time is in the future, the event remains sticky.

Sticky events are like normal message events and are authorised using normal PDU checks. They have the
following _additional_ properties:

* They are eagerly synchronised with all other servers.[^partial]
* They must appear in the `/sync` response.[^sync]
* The soft-failure checks MUST be re-evaluated when the membership state changes for a user with unexpired sticky events.[^softfail]

To implement these properties, servers MUST:

* Attempt to send all sticky events to all joined servers, whilst respecting per-server backoff times.
A large volume of events to send MUST NOT cause sticky events to be dropped from the send queue on the server.
* Ensure all sticky events are delivered to clients via `/sync` in a new section of the sync response,
regardless of whether the sticky event falls within the timeline limit of the request.
* When a new server joins the room, the server MUST attempt delivery of all sticky events immediately.
* Remember sticky events per-user, per-room such that the soft-failure checks can be re-evaluated.

When an event loses its stickiness, these properties disappear with the stickiness. Servers SHOULD NOT
eagerly synchronise such events anymore, nor send them down `/sync`, nor re-evaluate their soft-failure status.
Note: policy servers and other similar antispam techniques still apply to these events.

The new sync section looks like:

```json
{
  "rooms": {
    "join": {
      "!726s6s6q:example.com": {
        "account_data": { ... },
        "ephemeral": { ... },
        "state": { ... },
        "timeline": { ... },
        "sticky": {
          "events": [
            {
              "sender": "@bob:example.com",
              "type": "m.foo",
              "sticky": {
                "duration_ms": 300000
              },
              "origin_server_ts": 1757920344000,
              "content": { ... }
            },
            {
              "sender": "@alice:example.com",
              "type": "m.foo",
              "sticky": {
                "duration_ms": 300000
              },
              "origin_server_ts": 1757920311020,
              "content": { ... }
            }
          ]
        }
      }
    }
  }
}
```

> **Contributor:** If a room is newly-joined, is the server meant to send down all the pre-existing sticky events? (If so, that is going to get fiddly because we need to track that backlog in the sync token so we can send pages of 50 sticky events down at a time.) This question also applies to the equivalent sliding sync extension.
>
> **Member Author:** All unexpired sticky events for that room, yes. It should be functionally the same as how we send all pre-existing room state to the client when they join a room, so the transition to join should:
>
> * inspect the sync token to see its event stream position
> * load all unexpired sticky events in that room < that stream position
> * include them in the response
>
> Am I missing something here?
>
> **Contributor:** The trouble is that it's difficult to paginate to include only (e.g.) 50 sticky events in one response, if we do it directly like that. What happens if there are 1000 active sticky events? We can decide to send them all down to the client at once, no matter what, but I don't know if that's reasonable or not.
>
> **Member Author:** I would personally use "It should be functionally the same as how we send all pre-existing room state to the client when they join a room" as a guide here. When you join a room with 1000 current state events, do we paginate it? No. So why should we do this with sticky events?
>
> **Contributor:** I guess I'm happy to go along with that, though I suppose a difference is that for state events, only users with PL ≥ 50 can send new ones by default (other than their membership, where only one can be active at a time), whereas sticky events can be sent by anyone. We possibly want to consider the abuse considerations a bit more (...should we have an 'Abuse Considerations' section alongside the Security Considerations one?). If some user decides to spam my room with 1k, 10k, 100k, 1M or more sticky events and we have to push this down the pipe to clients, should I have any tooling to deal with that? For example:
>
> * Make sticky events become un-sticky if the sender leaves|gets banned|gets kicked (?)?
> * Make sticky events become un-sticky when redacted?
>
> **Member Author:** The point of treating it semantically the same as room state is to piggyback off any existing protections, rather than having a patchwork of protections. A malicious actor can join 100k users to the same room today. We protect against this by rate limiting, policy servers, and setting the join rules to invite. Sticky events would benefit from the same protection measures (they are rate limited using the same mechanism, checked by policy servers in the same way, and the act of stopping bad users from joining stops them being able to send sticky events in the first place). Do we really need an extra safeguard here?
>
> As an aside, we must also consider the whole protocol, not parts in isolation. For example, we do not rate limit messages. When considered in part, this is fine as each /sync response is bounded by timeline_limit. In practice though, the whole app behaviour is not really any different because most apps automatically and repeatedly ask for older /messages if the viewport is not full, allowing a similar amount of bandwidth to be consumed by just sending 100k junk events.
>
> **Contributor:** Hm, I find it hard to resonate with that: the main protection against state event spam is that users can't actually bloat the state space with PL0. (This doesn't apply to sticky events, right?) I don't think policy servers currently have any protection mechanism against either state or sticky events; once a policy server rule had been written it would be too late to enforce it for the room that was already polluted. (With a slight benefit for sticky events given that they expire, so the storm would 'blow over' so to speak.)
>
> We don't send the full state set down to syncing clients; see for example lazy-loading room memberships where we only include 'relevant' state. This is another aspect that's new with sticky events: clients don't get any control and the server is actually being required to send the entire set down. Applying rate limits to sticky event (or indeed, any kind of event) spam over federation is difficult. However, the protection for timeline events is that clients don't have to see them all; typical clients only request enough history to fill the visible client window and the federation protocol doesn't require you to have all of the timeline. This is again a situation that changes with sticky events (in fact, that's the motivating reason for them to exist).
>
> I am finding it hard to know what the right course is here fwiw (and trusting our existing safeguards might indeed be the right call), but it does feel to me that sticky events can be a potential unprivileged abuse vector with novel repercussions in several of these axes.
>
> **Member Author:** "users can't actually bloat the state space with PL0" ..but they can, by joining new users they control. "We don't send the full state set down to syncing clients; see for example lazy-loading room memberships where we only include 'relevant' state." ..but we do. We explicitly send all membership deltas between two sync tokens even with lazy loading enabled. "once a policy server rule had been written it would be too late to enforce it for the room that was already polluted" ..which is true in basically all "reactive" scenarios, e.g. changing the join rule to invite-only, banning users, etc.
>
> I don't dispute that it allows more bandwidth to be consumed over the CS link. I don't think adding a patchwork of bandwidth protections is the right call: it just further reinforces the whole "whoops, the protocol grew organically, so there's one way to do this, another way to do that, and this thing isn't possible because it was done before we did things this way". If we are serious about having bandwidth control on the CS link then we should have an MSC that talks about it. The fact we don't is perhaps the biggest indication that this isn't a problem in practice. Having a holistic view allows us to address other elephants in the room (to-device messages, device lists, invites, etc., all of which are unbounded and attacker controlled), in addition to the whole-protocol view that apps typically do follow-up requests which can also consume lots of bandwidth (e.g. threaded operations which defer to thread-specific endpoints, /messages and backfill).
>
> I'm also fairly bullish on this because of the 1 hour limit on sticky events. This means you need a continuous active attack for it to consume lots of bandwidth, which is a high bar compared to basically every other payload in the CS link, so I do find it quite bizarre to focus on bandwidth concerns quite so much for this MSC, yet neglect it in all the others (see: organic growth).
>
> **Contributor:** (cross-linking another conversation about shotgun bandwidth optimizations vs holistic approach, #4186 (comment))

Over Simplified Sliding Sync, Sticky Events have their own extension `sticky_events`, which has the following response shape:

```json
{
  "rooms": {
    "!726s6s6q:example.com": {
      "events": [
        {
          "sender": "@bob:example.com",
          "type": "m.foo",
          "sticky": {
            "duration_ms": 300000
          },
          "origin_server_ts": 1757920344000,
          "content": { ... }
        }
      ]
    }
  }
}
```

Sticky events MAY be sent in the timeline section of the `/sync` response, regardless of whether
or not they exceed the timeline limit[^ordering]. If a sticky event is in the timeline, it MAY be
omitted from the `sticky.events` section. This minimises duplication in the `/sync` response JSON.

Servers SHOULD rate limit sticky events over federation. If the rate limit kicks in, servers MUST
return a non-2xx status code from `/send` such that the sending server *retries the request* in order
to guarantee that the sticky event is eventually delivered. Servers MUST NOT silently drop sticky events
and return 200 OK from `/send`, as this breaks the eventual delivery guarantee.
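
The required sending-server behaviour can be sketched as follows. This is an illustrative model only: the function names and queue shapes are assumptions, and `send_txn` stands in for issuing a federation `/send` transaction and returning its HTTP status code; a real server would also apply per-destination backoff between attempts:

```python
def deliver_sticky(queue: list[dict], send_txn, max_attempts: int = 5) -> list[dict]:
    """Attempt delivery of queued sticky events to one destination.

    Rate-limited (non-2xx) events are retried; anything still undelivered is
    returned so it stays queued for later -- it is never silently dropped.
    """
    still_pending = []
    for event in queue:
        for _attempt in range(max_attempts):
            status = send_txn(event)  # stand-in for the federation /send call
            if 200 <= status < 300:
                break  # delivered
            # A real server would back off per-destination before retrying.
        else:
            still_pending.append(event)  # keep queued; never drop
    return still_pending
```

The key property is the `else` branch: exhausting retries re-queues the event rather than discarding it, which preserves the eventual-delivery guarantee.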

These messages may be combined with [MSC4140: Delayed Events](https://github.com/matrix-org/matrix-spec-proposals/pull/4140)
to provide heartbeat semantics (e.g. as required for MatrixRTC). Note that the sticky duration in this proposal
is distinct from that of delayed events: the purpose of the sticky duration in this proposal is to ensure sticky events are cleaned up.

### Implementing a map

MatrixRTC relies on a per-user, per-device map of RTC member events. To implement this, this MSC proposes
a standardised mechanism for determining keys on sticky events: the `content.sticky_key` property.

```json
{
  "type": "m.rtc.member",
  "sticky": {
    "duration_ms": 300000
  },
  "sender": "@alice:example.com",
  "room_id": "!foo",
  "origin_server_ts": 1757920344000,
  "content": {
    "sticky_key": "LAPTOPXX123",
    ...
  }
}
```

`content.sticky_key` is ignored server-side[^encryption] and is purely informational. Clients which
receive a sticky event with a sticky key SHOULD keep a map with keys determined by the 4-tuple
`(room_id, sender, type, content.sticky_key)` to track the current values in the map. Nothing stops
users from sending multiple events with the same `sticky_key`. To deterministically tie-break, clients which
implement this behaviour MUST:

- pick the one with the highest `origin_server_ts`,
- tie-break on the one with the highest lexicographical event ID (A < Z).

When overwriting keys, clients SHOULD use the same sticky duration as the previous sticky event to avoid clients diverging.
Divergence can happen when a client sends a sticky event with key K with a long timeout, then overwrites it with the same key K’
with a short timeout. If the sticky event K’ fails to be sent to all servers before the short timeout is hit,
some clients will believe the state is K and others will have no state. This will only resolve once the long timeout is hit.
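
The client-side map and tie-break rules above can be sketched as follows (function names are illustrative; only the keying and tie-break come from this proposal):

```python
def wins(candidate: dict, incumbent: dict) -> bool:
    """Deterministic tie-break: highest origin_server_ts, then highest event ID."""
    if candidate["origin_server_ts"] != incumbent["origin_server_ts"]:
        return candidate["origin_server_ts"] > incumbent["origin_server_ts"]
    return candidate["event_id"] > incumbent["event_id"]

def update_map(sticky_map: dict, event: dict) -> None:
    """Insert `event` keyed by the 4-tuple (room_id, sender, type, sticky_key)."""
    key = (event["room_id"], event["sender"], event["type"],
           event["content"].get("sticky_key"))
    current = sticky_map.get(key)
    if current is None or wins(event, current):
        sticky_map[key] = event
```

Because `wins` is a total order over `(origin_server_ts, event_id)`, all clients applying the same set of events converge on the same map regardless of arrival order.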

Note that encrypted sticky events will encrypt some parts of the 4-tuple. An encrypted sticky event only exposes the room ID and sender to the server:

```json
{
  "content": {
    "algorithm": "m.megolm.v1.aes-sha2",
    "ciphertext": "AwgCEqABubgx7p8AThCNreFNHqo2XJCG8cMUxwVepsuXAfrIKpdo8UjxyAsA50IOYK6T5cDL4s/OaiUQdyrSGoK5uFnn52vrjMI/+rr8isPzl7+NK3hk1Tm5QEKgqbDJROI7/8rX7I/dK2SfqN08ZUEhatAVxznUeDUH3kJkn+8Onx5E0PmQLSzPokFEi0Z0Zp1RgASX27kGVDl1D4E0vb9EzVMRW1PrbdVkFlGIFM8FE8j3yhNWaWE342eaj24NqnnWJ5VG9l2kT/hlNwUenoGJFMzozjaUlyjRIMpQXqbodjgyQkGacTEdhBuwAQ",
    "device_id": "AAvTvsyf5F",
    "sender_key": "KVMNIv/HyP0QMT11EQW0X8qB7U817CUbqrZZCsDgeFE",
    "session_id": "c4+O+eXPf0qze1bUlH4Etf6ifzpbG3YeDEreTVm+JZU"
  },
  "origin_server_ts": 1757948616527,
  "sender": "@alice:example.com",
  "type": "m.room.encrypted",
  "sticky": {
    "duration_ms": 600000
  },
  "event_id": "$lsFIWE9JcIMWUrY3ZTOKAxT_lIddFWLdK6mqwLxBchk",
  "room_id": "!ffCSThQTiVQJiqvZjY:matrix.org"
}
```

The decrypted event would contain the `type` and `content.sticky_key`.

## Potential issues

### Time

Servers that can’t maintain correct clock frequency may expire sticky events at a slightly slower or faster rate
than other servers. As the maximum timeout is relatively low, the total deviation is also reasonably low,
making this less problematic. The alternative of explicitly sending an expiration event would likely cause
more deviation (due to retries) than clock drift does.

Servers with significant clock skew may set `origin_server_ts` too far in the past or future. If the value
is too far in the past this will cause sticky events to expire quicker than they should, or to always be
treated as expired. If the value is too far in the future, this has no effect as it is bounded by the current time.
As such, this proposal relies somewhat on NTP to ensure clocks over federation are roughly in sync.
As a consequence of this, the sticky duration SHOULD NOT be set to below 5 minutes.[^ttl]

### Encryption
> **Member:** What is the strategy regarding mixed-content sticky chains, like a clear event replacing an encrypted one? Can we disable mixed content, so that only an encrypted event can replace an encrypted sticky event? Or at least have a way to discard such a sticky event? If not, it would be like allowing clear edits of encrypted messages without showing a big red warning.
>
> **Member:** FWIW there is prior art / similar rules for edits: the validity rules for replacement events. Maybe we could add similar rules, to be consistent? Things like:
>
> * The replacement and original events must have the same type.
> * If the original event was encrypted, the replacement should be too.
>
> **Contributor:** I guess this is referring to the 'Addendum: Implementing an ephemeral map'? I wonder if the formalisation of that addendum would be better-suited to another MSC?


Encrypted sticky events reduce reliability: for a sticky event to be visible to the end user, *both* the
sending client must think the receiver is joined (so we encrypt for their devices) and the receiving
server must think the sender is joined (so it passes auth checks). Unencrypted events only strictly
require the receiving server to think the sender is joined.

The lack of historical room key sharing may make some encrypted sticky events undecryptable when new users join the room.

### Spam

Servers may send every event as a sticky event, causing a higher amount of events to be sent eagerly over federation
and to be sent down `/sync` to clients. The former is already an issue as servers can simply `/send` many events.
The latter is a new abuse vector, as up until this point the `timeline_limit` would restrict the amount of events
that arrive on client devices (only state events are unbounded and setting state is a privileged operation).
This proposal has the following protections in place:

* All sticky events expire, with a hard limit of 1 hour. The hard limit ensures that servers cannot set years-long expiry times.
This ensures that the data in the `/sync` response can go down and not grow unbounded.
* All sticky events are subject to normal PDU checks, meaning that the sender must be authorised to send events into the room.
* Servers sending lots of sticky events may be asked to try again later as a form of rate-limiting.
Due to data expiring, subsequent requests will gradually have less data.

## Alternatives

### Use state events

We could do [MSC3757](https://github.com/matrix-org/matrix-spec-proposals/pull/3757), but for the
reasons mentioned at the start we don’t really want to do so.

### Make stickiness persistent not ephemeral

There are arguments that, at least for some use cases, we don’t want these sticky events to time out.
However, that opens the possibility of bloating the `/sync` response with sticky events.

Suggestions for minimizing that have been to have a hard limit on the number of sticky events a user can have per room,
instead of a timeout. However, this has two drawbacks: a) you still may end up with substantial bloat as stale data doesn’t
automatically get reaped (even if the amount of bloat is limited), and b) what do clients do if there are already too many
sticky events? The latter is tricky, as deleting the oldest may not be what the user wants if it happens to be not-stale data,
and asking the user what data it wants to delete vs keep is unergonomic.

Non-expiring sticky events could be added later if the above issues are resolved.

### Have a dedicated ‘ephemeral user state’ section

Early prototypes of this proposal devised a key-value map with timeouts maintained over EDUs rather than PDUs.
This early proposal had much the same feature set as this proposal but with one major difference: equivocation.
Servers could broadcast different values for the same key to different servers, causing the map to not converge:
the Byzantine Broadcast problem. Matrix already has a data structure to agree on shared state: the room DAG.
As such, this led from the prototype to the current proposal. By putting the data into the DAG, other servers
can talk to each other via it to see if they have been told different values. When combined with a simple
conflict resolution algorithm (which works because there is [no need for coordination](https://arxiv.org/abs/1901.01930)),
this provides a way for clients to agree on the same values. Note that in practice this needs servers to *eagerly*
share forward extremities so servers aren’t reliant on unrelated events being sent in order to check for equivocation.
Currently, there is no mechanism for servers to express “these are my latest events, what are yours?” without actually sending another event.

## Security Considerations

Servers may equivocate over federation and send different events to different servers in an attempt to cause
the key-value map maintained by clients to not converge. Alternatively, servers may fail to send sticky events
to their own clients to produce the same outcome. Federation equivocation is mitigated by the events being
persisted in the DAG, as servers can talk to each other to fetch all events. There is no way to protect against
dropped updates for the latter scenario.

## Unstable Prefix

- The `stick_duration_ms` query param is `msc4354_stick_duration_ms`.
- The `sticky` key in the PDU is `msc4354_sticky`.
- The `/sync` response section is `msc4354_sticky_events`.
- The sticky key in the `content` of the PDU is `msc4354_sticky_key`.

[^stickyobj]: The presence of the `sticky` object alone is insufficient.
[^partial]: Over federation, servers are not required to send all timeline events to every other server.
Servers mostly lazy-load timeline events, and will rely on clients hitting `/messages`, which in turn
hits `/backfill` to request events from federated servers.
[^sync]: Normal timeline events do not always appear in the sync response if the event is more than `timeline_limit` events away.
[^softfail]: Not all servers will agree on soft-failure status due to the check considering the “current state” of the room.
To ensure all servers agree on which events are sticky, we need to re-evaluate this rule when the current room state changes.
This becomes particularly important when room state is rolled back. For example, if Charlie sends some sticky event E and
then Bob kicks Charlie, but concurrently Alice kicks Bob then whether or not a receiving server would accept E would depend
on whether they saw “Alice kicks Bob” or “Bob kicks Charlie”. If they saw “Alice kicks Bob” then E would be accepted. If they
saw “Bob kicks Charlie” then E would be rejected, and would need to be rolled back when they see “Alice kicks Bob”.
[^ordering]: Sticky events expose gaps in the timeline which cannot be expressed using the current sync API. If sync used
something like [stitched ordering](https://codeberg.org/andybalaam/stitched-order)
or [MSC3871](https://github.com/matrix-org/matrix-spec-proposals/pull/3871) then sticky events could be inserted straight
into the timeline without any additional section, hence “MAY” would enable this behaviour in the future.
[^encryption]: Previous versions of this proposal had the key be at the top level of the event JSON so servers could
implement map-like semantics on clients’ behalf.
thus leak metadata. As a result, the key now falls within the encrypted `content` payload, and clients are expected to
implement the map-like semantics should they wish to.
[^ttl]: Earlier designs had servers inject a new `unsigned.ttl_ms` field into the PDU to say how many milliseconds were left.
This was problematic because it would have to be modified every time the server attempted delivery of the event to another server.
Furthermore, it didn’t really add any more protection because it assumed servers honestly set the value.
Malicious servers could set the TTL to anywhere between 0 and `sticky.duration_ms`, ensuring maximum divergence
on whether or not an event was sticky. In contrast, `origin_server_ts` is a consistent reference point
that all servers are guaranteed to see, limiting the ability of malicious servers to cause divergence, as all
servers approximately track NTP.

> **Member:** "This was problematic because it would have to be modified every time the server attempted delivery of the event to another server." Doesn't the spec require that today with the `age` field?
>
> **Member Author:** Yeah, but not over federation. I mostly added this because Erik seemed to think this was a downside in his earlier proposal: "Also having a short expiry makes retries over federation annoying (as they are for events with age), since you need to mutate the contents before retrying a request". Do you want me to add anything to this?