diff --git a/inbox/jingle-rtt-sync.xml b/inbox/jingle-rtt-sync.xml
new file mode 100644
index 000000000..5ee087094
--- /dev/null
+++ b/inbox/jingle-rtt-sync.xml
@@ -0,0 +1,483 @@
+ + +%ents; +]> + + +
+ Jingle Synchronized Real-Time Text + This specification defines a Jingle application extension for negotiating real-time text as part of the same conversational session as audio and video. + &LEGALNOTICE; + xxxx + ProtoXEP + Standards Track + Standards + Council + + XEP-0166 + XEP-0167 + XEP-0176 + XEP-0301 + RFC 4103 + RFC 8865 + + + + jingle-rtt-sync + + jingle + rtt + accessibility + webrtc + + + Edward + Tie + info@tiedragon.com + + + 0.0.2 + 2026-05-30 + et +

Document initial browser implementation test results.

+
+ + 0.0.1 + 2026-05-30 + et +

Initial ProtoXEP submission.

+
+
+ + +

Real-time text is already defined for XMPP by &xep0301;. Jingle is already used to negotiate real-time audio and video sessions, most commonly using &xep0167; and &xep0176;. However, when a client establishes a Jingle audio-video call and sends real-time text as ordinary XMPP messages outside the Jingle session, the user experience can look like one conversation while the protocol state is split into two unrelated paths.

+

This specification defines a way to negotiate real-time text as a Jingle content in the same session as audio and video. The text content can be human-typed RTT, captions, ASR output, interpreter text, translation text or transcript text. The goal is Total Conversation: audio, video and text presented as one conversational unit.

+

The motivating implementation problem is simple: a call can exist, text can exist, and yet the text might not be part of the negotiated Jingle session. In that case the receiver cannot reliably treat the text as synchronized conversational media.

+
+ + +

This specification is designed to meet the following requirements.

+
  1. Enable a Jingle initiator to offer real-time text in the same session as audio and video.
  2. Enable a responder to accept or reject real-time text independently from audio and video.
  3. Define a first-class Jingle content for text, for example with content name text or rtt.
  4. Allow endpoints to identify the text purpose, source and language.
  5. Allow endpoints to indicate whether the text is synchronized to a media clock, a session clock, the call session only, or not synchronized.
  6. Allow fallback to &xep0301; when synchronized Jingle text is not supported.
  7. Prevent clients from silently presenting fallback RTT as synchronized text.
+ +

Implementations can support different levels without falsely claiming full synchronization.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Level | Name | Minimum capability | User-visible promise |
|-------|------|--------------------|----------------------|
| 0 | XEP-0301 fallback | Ordinary in-band RTT outside Jingle | Live text, not media synchronized |
| 1 | Jingle co-session text | Text is negotiated by the same Jingle session but does not share a media clock | Belongs to the call, limited synchronization |
| 2 | Session-clock text | Text has timestamps relative to a shared call or session clock | Call-synchronized text |
| 3 | Media-clock text | RTP/T.140 or equivalent media-clock timing with audio/video correlation | Strict synchronized Total Conversation |
+

An implementation MUST NOT advertise a higher level than it can actually deliver. In particular, a WebRTC data channel that is merely opened during a call is Level 1 unless it can demonstrate a shared session clock or media clock.
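The level rule above can be sketched as a small classifier. This is a minimal illustration, not part of the protocol: the `TextCapabilities` type and its field names are hypothetical and only mirror the "Minimum capability" column of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextCapabilities:
    """Hypothetical summary of what an endpoint can deliver for one text stream."""
    in_jingle_session: bool     # text is negotiated as a content of the Jingle session
    shared_session_clock: bool  # timestamps are relative to a shared call/session clock
    shared_media_clock: bool    # RTP/T.140-style timing correlated with audio/video


def conformance_level(caps: TextCapabilities) -> int:
    """Return the highest level (0-3) that the stated capabilities justify."""
    if not caps.in_jingle_session:
        return 0  # XEP-0301 fallback: live text outside Jingle
    if caps.shared_media_clock:
        return 3  # media-clock text
    if caps.shared_session_clock:
        return 2  # session-clock text
    return 1      # co-session text only


# A data channel that is merely opened during a call has no shared clock.
datachannel_only = TextCapabilities(
    in_jingle_session=True, shared_session_clock=False, shared_media_clock=False
)
print(conformance_level(datachannel_only))  # 1
```

Deriving the advertised level from capabilities, rather than setting it by hand, is one way to avoid claiming more than the implementation delivers.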

+
+
+ + +
+ +
RTT
+
Real-Time Text, transmitted while it is being typed or created.
+
+ +
Total Conversation
+
A conversation containing simultaneous audio, video and real-time text.
+
+ +
Jingle content
+
A named component inside a Jingle session, such as audio, video or text.
+
+ +
Conversation group
+
A set of Jingle contents intended to be presented as one synchronized conversational unit.
+
+
+
+ + + +

An initiator offers audio, video and text contents in one Jingle session. The receiver accepts all three contents and presents them as a single Total Conversation.

+ content audio -> RTP audio
+ content video -> RTP video or signing
+ content text -> RTP T.140 or WebRTC datachannel T.140
+]]>
+
+ +

A participant starts an audio-video call and later adds captions, ASR or typed text by sending a Jingle content-add action for the text content.

+
+ +

If the peer does not support this specification, a client can fall back to &xep0301;. The fallback MUST be visible to the user when synchronized text is required.

+
+
+ + +

A Total Conversation call SHOULD contain three Jingle contents:

+ ... + ... + ... +]]> +

The text content is not an ordinary XMPP message stream. It is part of the Jingle session and is described by this extension.

+

The binding key is the Jingle sid plus the content name and the sync-group. A client MUST NOT infer synchronization only from the peer JID, because a user can have multiple simultaneous sessions, devices or fallback chat streams with the same peer.
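A minimal sketch of the binding rule follows, assuming a client keeps an in-memory registry of negotiated text streams. The `BindingKey` and `TextBindings` names, the `stream_id` values and the example JID are hypothetical. The point illustrated is that lookups use the sid, content name and sync-group, never the peer JID alone.

```python
from typing import Dict, NamedTuple, Optional


class BindingKey(NamedTuple):
    sid: str           # Jingle session id
    content_name: str  # name of the text content, e.g. "text"
    sync_group: str    # value of the sync-group attribute


class TextBindings:
    """Registry of text streams keyed by (sid, content name, sync-group)."""

    def __init__(self) -> None:
        self._streams: Dict[BindingKey, str] = {}

    def register(self, key: BindingKey, stream_id: str) -> None:
        self._streams[key] = stream_id

    def lookup(self, sid: str, content_name: str, sync_group: str) -> Optional[str]:
        # The peer JID is deliberately not a parameter: it cannot identify a session.
        return self._streams.get(BindingKey(sid, content_name, sync_group))


# Two simultaneous sessions with the same peer (say, juliet@example.com) stay distinct.
bindings = TextBindings()
bindings.register(BindingKey("sid-A", "text", "call-1"), "stream-1")
bindings.register(BindingKey("sid-B", "text", "call-2"), "stream-2")
print(bindings.lookup("sid-B", "text", "call-2"))  # stream-2
```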

+
+ + +

An entity supporting this specification MUST advertise the following feature:

+ +]]> +

If the entity supports RTP/T.140, it SHOULD advertise:

+ +]]> +

If the entity supports WebRTC datachannel T.140, it SHOULD advertise:

+ +]]> +

If the entity supports fallback to &xep0301;, it SHOULD also advertise the normal XEP-0301 feature.
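As an illustration of how a client might act on these features, the sketch below picks a text path from the set of feature strings returned by service discovery. The function name and the returned labels are hypothetical. The feature strings for this specification are the ones requested in the registrar section, and `urn:xmpp:rtt:0` is the XEP-0301 feature. Preferring RTP/T.140 over the datachannel profile is an assumption based on RTP/T.140 being described as the preferred strict synchronization profile.

```python
from typing import Iterable

NS_RTT_SYNC = "urn:xmpp:jingle:apps:rtt-sync:0"
NS_RTP_T140 = "urn:xmpp:jingle:apps:rtt-sync:rtp-t140:0"
NS_DC_T140 = "urn:xmpp:jingle:apps:rtt-sync:dc-t140:0"
NS_XEP_0301 = "urn:xmpp:rtt:0"


def choose_text_path(peer_features: Iterable[str]) -> str:
    """Pick a text path from the peer's advertised disco features."""
    features = set(peer_features)
    if NS_RTT_SYNC in features:
        if NS_RTP_T140 in features:
            return "jingle:rtp-t140"
        if NS_DC_T140 in features:
            return "jingle:dc-t140"
        # Profile features are only SHOULD-level, so their absence proves nothing.
        return "jingle:profile-unknown"
    if NS_XEP_0301 in features:
        return "xep-0301-fallback"
    return "no-text"


print(choose_text_path([NS_RTT_SYNC, NS_DC_T140, NS_XEP_0301]))  # jingle:dc-t140
print(choose_text_path([NS_XEP_0301]))                           # xep-0301-fallback
```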

+
+ + +

This specification defines an rtt-sync element qualified by the urn:xmpp:jingle:apps:rtt-sync:0 namespace.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Attribute | Required | Values | Meaning |
|-----------|----------|--------|---------|
| role | yes | conversation, caption, transcript, translation, interpreter | Purpose of the text stream |
| source | no | human, asr, captioner, interpreter, translation, system | Origin of the text |
| lang | no | BCP 47 language tag | Language of the text |
| sync-group | yes | token | Group shared by audio, video and text contents |
| sync-reference | no | content name | Content this text is synchronized with, usually audio |
| sync-mode | yes | media-clock, session-clock, co-session, none | Synchronization model |
| max-skew | no | milliseconds | Maximum target presentation difference |
| finality | no | partial, final, mixed | Whether text can change |
+ +]]> +
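The attribute table can be turned into a small builder that refuses invalid combinations. The sketch below uses Python's standard `xml.etree.ElementTree`. The function name is hypothetical, and treating max-skew as a non-negative integer and rejecting unknown attributes are assumptions, since the table only says "milliseconds".

```python
import xml.etree.ElementTree as ET
from typing import Dict

NS = "urn:xmpp:jingle:apps:rtt-sync:0"
REQUIRED = ("role", "sync-group", "sync-mode")
ENUMERATED = {
    "role": {"conversation", "caption", "transcript", "translation", "interpreter"},
    "source": {"human", "asr", "captioner", "interpreter", "translation", "system"},
    "sync-mode": {"media-clock", "session-clock", "co-session", "none"},
    "finality": {"partial", "final", "mixed"},
}
KNOWN = set(ENUMERATED) | {"lang", "sync-group", "sync-reference", "max-skew"}


def build_rtt_sync(attrs: Dict[str, str]) -> ET.Element:
    """Validate attributes against the table and return an rtt-sync element."""
    for name in REQUIRED:
        if not attrs.get(name):
            raise ValueError(f"missing required attribute: {name}")
    for name, value in attrs.items():
        if name not in KNOWN:
            raise ValueError(f"unknown attribute: {name}")
        allowed = ENUMERATED.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"invalid value for {name}: {value}")
    skew = attrs.get("max-skew")
    if skew is not None and not (skew.isascii() and skew.isdigit()):
        # Assumption: max-skew is a non-negative integer number of milliseconds.
        raise ValueError("max-skew must be a non-negative integer")
    return ET.Element(f"{{{NS}}}rtt-sync", dict(attrs))


element = build_rtt_sync({
    "role": "caption",
    "source": "asr",
    "lang": "en",
    "sync-group": "call-1",
    "sync-reference": "audio",
    "sync-mode": "co-session",
})
print(element.get("sync-mode"))  # co-session
```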
+ + +

The RTP/T.140 profile is the preferred profile when strict synchronization with audio and video is required. The initiator offers a Jingle RTP content with media='text' and payload types for t140 and optionally red.

+ + + + + + + + + + + + + + + + + + + + + + + + + + +]]> +

When sync-mode='media-clock' is negotiated, endpoints SHOULD use the same RTCP CNAME for audio, video and text RTP streams belonging to the same endpoint. Receivers SHOULD use RTP/RTCP timing to align text with audio or video where possible. If timing information is unavailable, the receiver MAY fall back to session arrival time and SHOULD indicate reduced synchronization quality.
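The alignment step can be illustrated with the usual RTCP sender report mapping, in which each report pairs an RTP timestamp with a sender wall-clock time. The sketch below is an assumption about how a receiver might compute text-to-audio skew, not a procedure defined by this document. The function names and numeric values are hypothetical, the 1000 Hz text clock follows RFC 4103, and 48000 Hz is a typical audio RTP clock rate. The comparison is only meaningful when both streams come from the same sender clock, which is why a shared CNAME matters.

```python
def rtp_to_sender_time(rtp_ts: int, sr_rtp_ts: int, sr_wallclock_s: float,
                       clock_rate: int) -> float:
    """Map an RTP timestamp to sender wall-clock seconds via the latest sender report."""
    delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if delta >= 0x80000000:  # interpret as signed to survive 32-bit wraparound
        delta -= 0x100000000
    return sr_wallclock_s + delta / clock_rate


def within_max_skew(skew_ms: float, max_skew_ms: float) -> bool:
    """True if the absolute skew does not exceed the negotiated max-skew."""
    return abs(skew_ms) <= max_skew_ms


# Hypothetical values: both sender reports were taken at sender time 100.0 s.
audio_time = rtp_to_sender_time(1_008_000, 960_000, 100.0, clock_rate=48_000)
text_time = rtp_to_sender_time(6_040, 5_000, 100.0, clock_rate=1_000)
skew_ms = (text_time - audio_time) * 1000.0  # about 40 ms: text is later than audio
print(within_max_skew(skew_ms, max_skew_ms=50))  # True
```

If no sender report has been received yet, the mapping is unavailable, which corresponds to the case where the receiver falls back to arrival time and indicates reduced synchronization quality.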

+
+ + +

The datachannel profile supports browser/WebRTC deployments using T.140 over a reliable, ordered data channel. This profile is useful when a WebRTC implementation naturally uses data channels for RTT. However, data channels do not automatically share the RTP media clock, so the declared sync-mode MUST reflect the timing the implementation can actually provide. In particular, it MUST NOT be media-clock unless the required timing relationship with audio or video exists.

+ + + + + + + + +]]> +

The exact Jingle mapping for WebRTC data channel negotiation should be aligned with the relevant Jingle data channel signalling specification. This document does not attempt to replace that signalling.

+
+ + +

If the responder does not support urn:xmpp:jingle:apps:rtt-sync:0, the initiator MAY fall back to &xep0301;. Fallback MUST be explicit in the user interface when synchronization is required.

+ + + +]]> +

Fallback is a state transition, not just a transport choice. If a Jingle text content is rejected but audio and video are accepted, the call MAY continue without synchronized text. If fallback RTT is started for the same conversation, it SHOULD be bound to the Jingle sid and shown as fallback rather than synchronized captions.
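The state transition described above could be modelled as follows. This is a sketch with hypothetical names (`TextState`, `resolve_text_state`, `must_warn_user`). It captures the three outcomes only: synchronized text in the Jingle session, XEP-0301 fallback bound to the same sid, or no text at all.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TextState(Enum):
    SYNCHRONIZED = "synchronized"  # Jingle text content accepted
    FALLBACK = "fallback"          # XEP-0301 RTT running alongside the call
    UNAVAILABLE = "unavailable"    # call continues without text


@dataclass(frozen=True)
class TextStatus:
    state: TextState
    bound_sid: Optional[str]  # Jingle sid the text is shown under, if any


def resolve_text_state(sid: str, text_accepted: bool, fallback_started: bool) -> TextStatus:
    """Derive the text state once the responder has answered the offer."""
    if text_accepted:
        return TextStatus(TextState.SYNCHRONIZED, sid)
    if fallback_started:
        # Fallback RTT is bound to the sid but never relabelled as synchronized.
        return TextStatus(TextState.FALLBACK, sid)
    return TextStatus(TextState.UNAVAILABLE, None)


def must_warn_user(status: TextStatus, sync_required: bool) -> bool:
    """Fallback has to be explicit in the UI when synchronization is required."""
    return sync_required and status.state is TextState.FALLBACK


status = resolve_text_state("sid-A", text_accepted=False, fallback_started=True)
print(status.state.value, must_warn_user(status, sync_required=True))  # fallback True
```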

+
+ + + +
  1. A sender that offers synchronized RTT MUST include an rtt-sync element.
  2. A sender MUST identify whether the stream is conversation text, caption text, transcript text, interpreter text or translation text.
  3. A sender SHOULD include a language tag when known.
  4. A sender MUST NOT label ASR text as human captioning.
  5. A sender MUST route Jingle text for the negotiated content through the negotiated Jingle transport, not through an unrelated ordinary chat message path.
+
+ +
  1. A receiver MUST treat a Jingle synchronized RTT content as part of the call, not as normal chat.
  2. A receiver SHOULD use the negotiated sync-mode to determine presentation.
  3. A receiver MUST bind incoming synchronized text to the Jingle sid and content name before presenting it as part of a call.
  4. A receiver SHOULD detect duplicate text received through both Jingle text and XEP-0301 fallback and avoid showing it twice.
  5. A receiver SHOULD expose diagnostics when RTT is present in chat but absent from the Jingle session.
+
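The duplicate-detection rule for receivers leaves the matching method open. One possible heuristic, sketched below, suppresses a final text line if the same normalized text for the same sid was already shown via the other path within a short window. The class name, the path labels and the five-second window are all assumptions.

```python
from typing import Dict, Tuple


class DuplicateFilter:
    """Suppress text that arrives via both Jingle text and XEP-0301 fallback."""

    def __init__(self, window_seconds: float = 5.0) -> None:
        self.window_seconds = window_seconds
        # (sid, normalized text) -> (path it was last shown from, time shown)
        self._seen: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def should_show(self, sid: str, text: str, path: str, now: float) -> bool:
        key = (sid, " ".join(text.split()).casefold())
        previous = self._seen.get(key)
        if previous is not None:
            prev_path, prev_time = previous
            # Only a copy from the *other* path counts as a duplicate; the same
            # text typed twice on one path is legitimate and is shown again.
            if prev_path != path and now - prev_time <= self.window_seconds:
                return False
        self._seen[key] = (path, now)
        return True


dedupe = DuplicateFilter()
print(dedupe.should_show("sid-A", "Hello there", "jingle", now=10.0))     # True
print(dedupe.should_show("sid-A", "hello  there", "xep-0301", now=11.0))  # False
```

A production client would also need to expire old entries. That is omitted here for brevity.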
+
+ + +

A user interface SHOULD distinguish at least these cases: live text, live captions, AI captions, human captions, translation and unsynchronized fallback.
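One way to derive these user-visible cases from the negotiated role and source values is sketched below. The mapping is an assumption made for illustration: the document lists the cases to distinguish but does not define how attribute values map onto them, and the function name and label strings are hypothetical.

```python
from typing import Optional


def ui_label(role: str, source: Optional[str], is_fallback: bool) -> str:
    """Map stream metadata to one of the user-visible cases."""
    if is_fallback:
        return "unsynchronized fallback"
    if role == "translation":
        return "translation"
    if role == "caption":
        if source == "asr":
            return "AI captions"
        if source in ("human", "captioner"):
            return "human captions"
        return "live captions"  # source unknown: make no claim about its origin
    return "live text"


print(ui_label("caption", "asr", is_fallback=False))         # AI captions
print(ui_label("conversation", "human", is_fallback=False))  # live text
```

Checking the fallback flag first keeps fallback RTT from ever being labelled as captions, whatever metadata accompanies it.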

+

During call setup, a client SHOULD expose whether synchronized text was negotiated, whether live text fallback is active or whether text is unavailable in the call.

+ +
+ + +

This specification is motivated by accessibility and Total Conversation use cases. A client MUST make it possible for a deaf or hard-of-hearing user to distinguish between typed text, human captions, AI or ASR captions and translated text where this information is known.

+

A client SHOULD visibly indicate late captions, uncertain ASR captions or unsynchronized fallback text. A client SHOULD allow users to prefer synchronized captions over lowest-latency captions, or lowest-latency captions over strict synchronization.
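The user preference described above can be sketched as a scheduling decision. Given when a caption arrived and when the audio it belongs to is played out, the function below either holds the caption until playout (synchronized preference) or shows it at once (lowest-latency preference), and flags captions that exceed the skew budget as late. All names and default values are hypothetical. `max_skew_ms` stands in for the negotiated max-skew, and `max_hold_ms` is an assumed cap on how long a caption may be held back.

```python
from typing import Tuple


def schedule_caption(arrival_ms: int, audio_playout_ms: int, prefer_sync: bool,
                     max_skew_ms: int = 200, max_hold_ms: int = 1000) -> Tuple[int, bool]:
    """Return (display time in ms, late flag) for one caption."""
    # A caption is late if it arrives more than max_skew_ms after its audio plays.
    late = arrival_ms - audio_playout_ms > max_skew_ms
    if prefer_sync and arrival_ms < audio_playout_ms:
        # Hold the caption until the audio plays, but never longer than max_hold_ms.
        display_ms = min(audio_playout_ms, arrival_ms + max_hold_ms)
    else:
        display_ms = arrival_ms  # lowest latency, or nothing left to wait for
    return display_ms, late


print(schedule_caption(1000, 1300, prefer_sync=True))   # (1300, False)
print(schedule_caption(1000, 1300, prefer_sync=False))  # (1000, False)
print(schedule_caption(2000, 1500, prefer_sync=True))   # (2000, True)
```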

+
+ + +

Text content MUST support Unicode. Language tags SHOULD use BCP 47. Clients SHOULD support multiple simultaneous text streams where translation or interpreter text is provided in addition to original captions.

+
+ + +

Synchronized RTT and captions can contain highly sensitive conversation content. Implementations SHOULD use end-to-end encrypted signalling and encrypted media where available.

+

For RTP/T.140, implementations SHOULD use SRTP or an equivalent encrypted RTP transport, authenticate the sender of the text stream and protect against injection of false captions. Implementations SHOULD prevent downgrade attacks from synchronized RTT to unsynchronized fallback without user indication.

+

Clients SHOULD avoid misrepresenting AI captions as human or verified text.

+
+ + +

Real-time text can reveal text before the sender considers it final. Captions can reveal speech content to captioning, relay or ASR services. A client SHOULD obtain user consent before sending typed RTT and before sending audio to ASR or captioning services.

+

A client SHOULD NOT store partial captions or partial RTT as a final transcript unless this has been explicitly enabled. A client SHOULD indicate when a third-party captioning, ASR, relay or interpreting service is active.

+
+ + +

This document makes no direct IANA request unless future revisions define new SDP attributes or new media types. The RTP/T.140 profile uses existing text/t140 and text/red media formats.

+
+ + +

This specification requests registration of the following namespace:

+ urn:xmpp:jingle:apps:rtt-sync:0 +

The following service discovery features are requested:

+ urn:xmpp:jingle:apps:rtt-sync:0
+ urn:xmpp:jingle:apps:rtt-sync:rtp-t140:0
+ urn:xmpp:jingle:apps:rtt-sync:dc-t140:0
+
+ + +

This document does not replace &xep0301;. XEP-0301 remains appropriate for chat-oriented real-time text and as a fallback. The distinction is that this specification binds text to a Jingle session when an implementation needs Total Conversation semantics.

+

RTP/T.140 is the preferred strict synchronization profile. WebRTC datachannel T.140 is useful for browser deployments, but MUST NOT be described as media-clock synchronized unless the implementation can provide the required timing relationship.

+
+ + +

An experimental browser implementation has tested the WebRTC datachannel profile at Level 1. Two browser sessions negotiated one Jingle audio-video session plus a text content using urn:xmpp:jingle:apps:rtt-sync:0, opened a reliable ordered data channel labelled rtt, exchanged live RTT updates, and delivered final text bound to the Jingle session. The client presented the call as live text synchronized with the call session.

+

The same implementation retained &xep0301; fallback for peers that do not negotiate the Jingle text content, so ordinary live text remains available without being presented as synchronized call media.

+
+ + +

The following schema is an initial sketch.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +]]> +
+ + +
  1. Should this be a new Jingle application format or an extension to &xep0167;?
  2. Should RTP/T.140 be mandatory-to-implement for strict synchronization?
  3. Which existing Jingle datachannel signalling elements should be used for the WebRTC datachannel profile?
  4. Should emergency-service profiles have stricter requirements?
  5. Should multiparty RTT support be included here or deferred to a separate specification?
+
+