feat: voice-activity streaming mode & inner-vad for speech-to-text module#1160

Open
IgorSwat wants to merge 11 commits into main from @is/vad-streaming

Contributor

@IgorSwat IgorSwat commented May 20, 2026

Description

This PR introduces changes focused on the voice-activity-detection (VAD) module and its use within the library:

  • Native-side VAD streaming - introduces a continuous voice-activity-detection mechanism with a user-friendly callback system. Example usage from the demo app:
  await model.stream({
    onSpeechBegin: () => {...},
    onSpeechEnd: () => {...},
    options: {...},
  });
  • VAD x STT integration - adds an option to use voice-activity detection within the speech-to-text module, significantly improving the effective performance of the STT.
  • Demo apps - adds a new screen to the speech demo app, VoiceActivityDetectionScreen, and changes the behavior of SpeechToTextScreen by adding a toggle that switches the VAD submodule for STT on or off.
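
The callback flow from the first bullet can be pictured as a small state machine that turns per-frame voice decisions into begin/end events. The following is only an illustrative sketch, not the code from this PR: `SpeechEventTracker`, its constructor parameters, and the idea of feeding it one boolean per frame are all hypothetical names and assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical helper (not the PR's implementation): converts frame-level
// voice / no-voice decisions into onSpeechBegin / onSpeechEnd callbacks.
class SpeechEventTracker {
public:
  SpeechEventTracker(std::size_t minSpeechFrames, std::size_t minSilenceFrames,
                     std::function<void()> onSpeechBegin,
                     std::function<void()> onSpeechEnd)
      : minSpeechFrames_(minSpeechFrames),
        minSilenceFrames_(minSilenceFrames),
        onSpeechBegin_(std::move(onSpeechBegin)),
        onSpeechEnd_(std::move(onSpeechEnd)) {
    // Both thresholds must be at least one frame.
    assert(minSpeechFrames_ > 0 && minSilenceFrames_ > 0);
  }

  // Feed one frame-level decision (true = voice detected in this frame).
  void push(bool voiced) {
    if (voiced) {
      silenceRun_ = 0;
      ++speechRun_;
      // Report speech only after enough consecutive voiced frames.
      if (!speaking_ && speechRun_ >= minSpeechFrames_) {
        speaking_ = true;
        if (onSpeechBegin_) onSpeechBegin_();
      }
    } else {
      speechRun_ = 0;
      ++silenceRun_;
      // Report the end only after enough consecutive silent frames.
      if (speaking_ && silenceRun_ >= minSilenceFrames_) {
        speaking_ = false;
        if (onSpeechEnd_) onSpeechEnd_();
      }
    }
  }

  bool speaking() const { return speaking_; }

private:
  std::size_t minSpeechFrames_;
  std::size_t minSilenceFrames_;
  std::function<void()> onSpeechBegin_;
  std::function<void()> onSpeechEnd_;
  std::size_t speechRun_ = 0;
  std::size_t silenceRun_ = 0;
  bool speaking_ = false;
};
```

Requiring a run of consecutive frames in each direction keeps short noise bursts from firing `onSpeechBegin` and short pauses from firing `onSpeechEnd`. Whether the PR's native code debounces events this way is an assumption.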

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

  • To test the VAD streaming: run the VoiceActivityDetectionScreen within the Speech demo app.
  • To test the VAD & STT integration: run the SpeechToTextScreen within the Speech demo app, with VAD toggle on.

Screenshots

Related issues

#1118

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

@IgorSwat IgorSwat requested review from chmjkb and msluszniak May 20, 2026 13:09
@IgorSwat IgorSwat force-pushed the @is/vad-streaming branch from 694fe4f to 1c2411e Compare May 20, 2026 13:15
@IgorSwat IgorSwat changed the base branch from main to @is/speech-to-text-ultimate May 20, 2026 13:26
Comment on lines +24 to +26
inline constexpr size_t kMinSpeechDuration = 25; // 250 ms
inline constexpr size_t kMinSilenceDuration = 10; // 100 ms
inline constexpr size_t kSpeechPad = 3; // 30 ms
Member

Why are these in tens of ms while the other constants above are in milliseconds?

Contributor Author

All the constants in this file are in 10s of ms.

Member

Then why are these named kWindowSizeMs and kHopLengthMs? This Ms suffix almost screams to me: "This value is in milliseconds". If not, then I would be very surprised ;p

Contributor Author

I mean, constants like kWindowSizeMs are indeed in milliseconds, but the other ones like kModelInputMin are in tens of milliseconds.

Member

@msluszniak msluszniak May 21, 2026

Yeah, exactly, and that was my original question: "Why are these in tens of ms while the other constants above are in milliseconds?" Why is that? Can't we unify this?
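
One possible way to unify the units raised in this thread is to state every duration in milliseconds and derive frame counts in a single place. The sketch below is hedged: it assumes one VAD frame spans 10 ms (implied by "tens of ms" above), and `kFrameDuration` and `toFrames` are hypothetical names, not identifiers from the PR. The three values are the ones from the reviewed lines (250 ms, 100 ms, 30 ms).

```cpp
#include <chrono>
#include <cstddef>

// Assumption: one VAD frame covers 10 ms of audio.
inline constexpr std::chrono::milliseconds kFrameDuration{10};

// Durations expressed in milliseconds; the type itself carries the unit.
inline constexpr std::chrono::milliseconds kMinSpeechDuration{250};
inline constexpr std::chrono::milliseconds kMinSilenceDuration{100};
inline constexpr std::chrono::milliseconds kSpeechPad{30};

// Converts a duration into whole frames; any remainder is truncated.
constexpr std::size_t toFrames(std::chrono::milliseconds duration) {
  return static_cast<std::size_t>(duration / kFrameDuration);
}

// The derived frame counts match the values currently hard-coded in the diff.
static_assert(toFrames(kMinSpeechDuration) == 25);
static_assert(toFrames(kMinSilenceDuration) == 10);
static_assert(toFrames(kSpeechPad) == 3);
```

With this layout the unit lives in the type instead of a trailing comment, and code that needs frame counts calls `toFrames` at the point of use. The snippet relies on C++17 (`inline constexpr` variables and message-less `static_assert`).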

@IgorSwat IgorSwat force-pushed the @is/speech-to-text-ultimate branch from 02113ff to 6bba141 Compare May 20, 2026 15:46
Comment thread apps/speech/screens/SpeechToTextScreen.tsx
Comment thread apps/speech/screens/VoiceActivityDetectionScreen.tsx
Base automatically changed from @is/speech-to-text-ultimate to main May 21, 2026 08:20
@IgorSwat IgorSwat force-pushed the @is/vad-streaming branch from 1c2411e to 0ea858d Compare May 21, 2026 08:55
@msluszniak msluszniak added the feature PRs that implement a new feature label May 21, 2026
3 participants