whisper.m


Automatic speech recognition in MATLAB/Octave based on the excellent whisper.cpp from Georgi Gerganov and models from OpenAI's Whisper.

Installation

First, clone the repository with submodules:

git clone --recurse-submodules https://github.com/gllmflndn/whisper.m.git

MATLAB

Then compile the MEX file using make in a Terminal:

make

The Accelerate and Metal frameworks will be used on macOS. On Windows, use MSYS2 and MinGW-w64 (see MATLAB Support).

GNU Octave

If compiling for Octave, execute the following instead from a Terminal:

make MEXBIN="mkoctfile --mex" MEXEXT=mex MEXOPT=""

Usage

To run whisper.m on a pre-recorded audio file (mono, 16 kHz) called input.wav:

w = whisper('small');
[segments,tokens] = w.transcribe('input.wav',...
                                 'print_realtime', true,...
                                 'print_progress', false);
whisper.display_tokens(tokens);

Pre-trained models will be downloaded automatically from Hugging Face when needed and stored in a models directory. Model options are tiny, tiny.en, base, base.en, small, small.en, medium, medium.en and large.
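For example, to use the English-only base model instead, with the same calls as above (the model file is fetched on first use and cached in the models directory):

w = whisper('base.en');                    % downloads ggml base.en model on first use
[segments, tokens] = w.transcribe('input.wav', 'print_progress', false);
whisper.display_tokens(tokens);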

Another example to record audio data and run whisper.m:

Fs = 16000;
nbits = 16;
nchannels = 1;
id = 1; % see audiodevinfo to select the audio device (listing sketch below)
rec = audiorecorder(Fs, nbits, nchannels, id);

recDuration = 10;
disp('Begin speaking.')
recordblocking(rec, recDuration);
disp('End of recording.')
y = getaudiodata(rec);

w = whisper('small');
[segments,tokens] = w.transcribe(y', 'print_progress', false);
whisper.display_tokens(tokens);
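The example above records from the audio device with id = 1. To find the right ID on your machine, you can list the available input devices with audiodevinfo (a short sketch; device names and IDs differ per system):

info = audiodevinfo;                       % query available audio devices
for k = 1:numel(info.input)
    fprintf('%d: %s\n', info.input(k).ID, info.input(k).Name);
end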

To extract the audio track from a video as 16 kHz mono audio, you can use ffmpeg:

ffmpeg -i video.mp4 -f wav -ar 16000 -ac 1 -vn audio.wav
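Alternatively, audio that is already on disk can be converted to 16 kHz mono directly in MATLAB/Octave (a sketch; resample requires the Signal Processing Toolbox in MATLAB or the signal package in Octave, and the input file name is hypothetical):

[y, Fs] = audioread('stereo_44k.wav');     % hypothetical input recording
y = mean(y, 2);                            % mix down to mono
y = resample(y, 16000, Fs);                % resample to 16 kHz
audiowrite('audio.wav', y, 16000);         % write the 16 kHz mono file

The resampled signal could likewise be passed straight to w.transcribe as a row vector, as in the recording example above.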

There is also a demo that uses an audio file shipped with whisper.cpp:

>> whisper.demo()
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 1
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   73.62 MB
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB
whisper_init_state: compute buffer (conv)   =   11.17 MB
whisper_init_state: compute buffer (encode) =   61.76 MB
whisper_init_state: compute buffer (cross)  =    3.67 MB
whisper_init_state: compute buffer (decode) =   18.82 MB

And so my fellow Americans ask not what your country can do for you ask what you can do for your country