Uses ffmpeg to extract the audio track from an mp4 file, performs speech recognition with the Vosk API and a Vosk model, and returns the recognized text. A utility for calculating metrics against a reference text is included.
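The recognition step described above could look roughly like the sketch below. It is not the code of transcript_mp4.py: it assumes the audio has already been extracted to a mono 16-bit PCM WAV file and that the model lives in ./model. The helper names `join_results` and `transcribe_wav` and the chunk size are hypothetical; `Model`, `KaldiRecognizer`, `AcceptWaveform`, `Result` and `FinalResult` come from the `vosk` Python package.

```python
import json
import wave


def join_results(json_results):
    """Concatenate the "text" fields of Vosk JSON result strings."""
    parts = (json.loads(r).get("text", "") for r in json_results)
    return " ".join(p for p in parts if p)


def transcribe_wav(wav_path, model_dir="model", chunk_frames=4000):
    """Recognize speech in a mono 16-bit PCM WAV file with a Vosk model."""
    # Imported lazily so the helper above works without vosk installed.
    from vosk import Model, KaldiRecognizer

    results = []
    with wave.open(str(wav_path), "rb") as wf:
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        while True:
            data = wf.readframes(chunk_frames)
            if not data:
                break
            # AcceptWaveform returns a true value once an utterance is complete.
            if rec.AcceptWaveform(data):
                results.append(rec.Result())
        # Flush whatever the recognizer still holds in its buffer.
        results.append(rec.FinalResult())
    return join_results(results)
```

Feeding the file in chunks keeps memory use flat for long recordings; each completed utterance arrives as a JSON string whose `text` field holds the recognized words.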
- Python 3.6+
- ffmpeg
This application uses ffmpeg to convert .mp4 to .wav. The ffmpeg package must be installed on your system for the MPEG-4 video-to-text conversion to work.
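For illustration, the .mp4 to .wav conversion can be driven from Python as in the sketch below. This is an assumption about how such a step might be written, not the project's actual code: the function names are hypothetical, and the 16 kHz mono 16-bit PCM output format is an assumed choice. The ffmpeg options used (`-y`, `-i`, `-vn`, `-ac`, `-ar`, `-acodec`) are standard.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(mp4_path, wav_path, sample_rate=16000):
    """Return the ffmpeg argument list that extracts a mono PCM WAV track."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(mp4_path),       # input video
        "-vn",                     # drop the video stream
        "-ac", "1",                # downmix to a single channel
        "-ar", str(sample_rate),   # resample (16 kHz is an assumption)
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        str(wav_path),
    ]


def mp4_to_wav(mp4_path, wav_path=None, sample_rate=16000):
    """Run ffmpeg and return the path of the WAV file it wrote."""
    mp4_path = Path(mp4_path)
    wav_path = Path(wav_path) if wav_path else mp4_path.with_suffix(".wav")
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_ffmpeg_command(mp4_path, wav_path, sample_rate), check=True)
    return wav_path
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.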
pip install -r requirements.txt
Download a model for your language from https://alphacephei.com/vosk/models and put it into the ./model directory.
Example:
wget https://alphacephei.com/vosk/models/vosk-model-ru-0.22.zip
unzip vosk-model-ru-0.22.zip
mv vosk-model-ru-0.22 model
Transcribe a video:
python transcript_mp4.py some_video.mp4
Calculate metrics against a reference text:
python wer.py <hypothesis_text_file> <reference_text_file>
We have an ideal transcript in Russian (./samples/ideal.txt) for this video: https://vod-video.rbc.ru/archive/2021/12/02/den1118.folder/telecast_576p.mp4
We have also made a transcript of the same video with a Vosk model (./samples/test.txt).
So, we can run the calculation:
python wer.py samples/test.txt samples/ideal.txt
Result:
WER (Words Error Rate): 0.14775815217391305
MER (Match Error Rate): 0.14164767176815368
WIL (Word Information Lost): 0.22181904843819444
WIP (Word Information Preserved): 0.7781809515618056
Hits: 2636
Substitutions: 270
Deletions: 38
Insertions: 127
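The four scores above can be reproduced from the hit, substitution, deletion and insertion counts alone. The sketch below assumes the standard definitions of these metrics; how wer.py computes them internally is not shown here, and `word_metrics` is a hypothetical helper name.

```python
def word_metrics(hits, substitutions, deletions, insertions):
    """Compute WER, MER, WIL and WIP from word alignment counts."""
    errors = substitutions + deletions + insertions
    ref_words = hits + substitutions + deletions   # words in the reference
    hyp_words = hits + substitutions + insertions  # words in the hypothesis
    wip = (hits / ref_words) * (hits / hyp_words)
    return {
        "wer": errors / ref_words,       # errors per reference word
        "mer": errors / (hits + errors), # errors per aligned word pair
        "wil": 1 - wip,
        "wip": wip,
    }


metrics = word_metrics(hits=2636, substitutions=270, deletions=38, insertions=127)
for name, value in metrics.items():
    print(f"{name.upper()}: {value}")
```

With the counts above there are 435 errors against 2944 reference words, so WER = 435 / 2944 ≈ 0.1478, and MER = 435 / 3071 ≈ 0.1416, which agrees with the reported output.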