Code snippets showing how to record I2S audio and store as .wav file on ESP32 with SD card, how to transcribe pre-recorded audio via Deepgram SpeechToText API, how to generate audio from text via TextToSpeech API from OpenAI a/o Google TTS. Triggering ESP32 actions via Voice.

kaloprojects/KALO-ESP32-Voice-Assistant

Summary

Code snippets showing how to record I2S audio and store it as a .wav file on an ESP32 with SD card, how to transcribe pre-recorded audio via the Deepgram STT (SpeechToText) API, and how to generate audio from text via the TTS (TextToSpeech) APIs from OpenAI and/or Google TTS. Also covers triggering ESP32 actions via voice.

The repository contains the demo main sketch 'KALO_ESP32_Voice_Assistant.ino', demonstrating different use cases of my libraries 'lib_audio_recording.ino' and 'lib_audio_transcription.ino'.

Features

Explore the demo use case examples (1-6) in the main sketch. Summary:

  • Recording and playing audio work offline; an online connection is needed for STT, TTS and streaming services
  • Recording voice audio with variable length (recording as long as a button is pressed), storing as .wav file (with 44 byte header) on SD card
  • Replay your recorded audio (using Schreibfaul1 <audio.h> library)
  • Playing Audio streams (e.g. playing music via radio streams with <audio.h> library)
  • STT (SpeechToText), using Deepgram API service (registration needed)
  • TTS (TextToSpeech), supporting 6 multilingual voices via OpenAI API (registration needed)
  • TTS (TextToSpeech), using Google TTS API (no registration needed)
  • Triggering ESP actions via voice (e.g. triggering GPIO LED pins, addressing dedicated voices by calling their name, playing music on request)
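Voice triggering comes down to mapping the transcribed sentence to an action. The following is a minimal sketch of that idea in plain C++, assuming simple keyword matching; the `VoiceAction` enum, the keywords and the function names are hypothetical and not taken from the repository's sketches.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical action set; the names are illustrative, not taken from the sketch.
enum class VoiceAction { None, LedOn, LedOff, PlayRadio };

// Lower-case copy so that matching ignores capitalisation in the transcript.
std::string toLower(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return text;
}

// True if `keyword` occurs anywhere in `text` (both expected in lower case).
bool contains(const std::string& text, const std::string& keyword) {
    return text.find(keyword) != std::string::npos;
}

// Map a transcribed sentence to an action by simple keyword matching.
VoiceAction parseVoiceCommand(const std::string& transcript) {
    const std::string t = toLower(transcript);
    if (contains(t, "radio") || contains(t, "music")) return VoiceAction::PlayRadio;
    if (contains(t, "led") || contains(t, "light")) {
        if (contains(t, " off")) return VoiceAction::LedOff;  // checked before " on"
        if (contains(t, " on"))  return VoiceAction::LedOn;
    }
    return VoiceAction::None;
}
```

On the ESP32, the returned action would then drive e.g. a `digitalWrite()` on the LED pin or start a radio stream. Substring matching is deliberately crude (for instance, " on" also matches " one"), so a real sketch may want stricter word matching.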

Hardware

  • ESP32 development board (e.g. ESP32-WROOM-32), connected to Wifi
  • I2S digital microphone, e.g. INMP441 [I2S pins 22, 33, 35]
  • I2S audio amplifier, e.g. MAX98357A [I2S pins 25,26,27] with speaker
  • Micro SD Card [VSPI Default pins 5,18,19,23]
  • RGB LED (status indicator) and Analog Poti (audio volume)
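Since three peripherals share the GPIO header, it can help to verify that no pin is assigned twice after changing the defaults. A small host-side sketch, using only the pin groups listed above; which pin carries which I2S or SPI signal is defined in the .ino headers and is not assumed here, and `pinsAreUnique` is a hypothetical helper name.

```cpp
#include <set>
#include <vector>

// Pin groups as listed in the Hardware section. Which pin carries which
// signal is set in the .ino headers and is deliberately not assumed here.
const std::vector<int> MIC_PINS = {22, 33, 35};     // INMP441 (I2S in)
const std::vector<int> AMP_PINS = {25, 26, 27};     // MAX98357A (I2S out)
const std::vector<int> SD_PINS  = {5, 18, 19, 23};  // Micro SD card (VSPI)

// Returns true if no GPIO number appears more than once across all groups.
bool pinsAreUnique(const std::vector<std::vector<int>>& groups) {
    std::set<int> seen;
    for (const auto& group : groups)
        for (int pin : group)
            if (!seen.insert(pin).second) return false;  // duplicate found
    return true;
}
```

The RGB LED and poti pins are not listed in this README, so they would have to be added to the check by hand.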

Installation & Customizing

  • Required: Arduino IDE with ESP32 library 3.0.x (based on ESP-IDF 5.1). The older 2.x ESP framework fails because the new I2S driver is missing
  • Required (for playing audio on ESP32): AUDIO.H library ESP32-audioI2S.zip. Install the latest zip (3.0.11g from July 18, 2024 or newer)
  • Copy all 3 .ino files of 'KALO-ESP32-Voice-Assistant' into same folder (it is one sketch, split into 3 Arduino IDE tabs)
  • Update your pin assignments in the header of all 3 .ino files
  • Insert your credentials (ssid, password, OpenAI API key, Deepgram API key)
  • Define your favorite recording settings (SAMPLE_RATE, BITS_PER_SAMPLE, GAIN_BOOSTER_I2S) in lib_audio_recording.ino header
  • Define your language settings (Google TTS in KALO_ESP32_Voice_Assistant.ino, Deepgram STT in lib_audio_transcription.ino header)
  • Toggle DEBUG flag to true (displaying Serial.print details) or false (for final usage)
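To illustrate how SAMPLE_RATE and BITS_PER_SAMPLE end up in the 44 byte .wav header, here is a sketch of the canonical PCM header layout in plain C++. The constant values, `NUM_CHANNELS` and the helper names (`makeWavHeader`, `putLE16`, `putLE32`) are assumptions for illustration; the actual code in lib_audio_recording.ino may differ.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Example values only; the real settings live in the lib_audio_recording.ino header.
constexpr uint32_t SAMPLE_RATE     = 16000;
constexpr uint16_t BITS_PER_SAMPLE = 16;
constexpr uint16_t NUM_CHANNELS    = 1;  // assumption: one mono microphone

// WAV stores all numeric fields in little-endian byte order.
static void putLE16(uint8_t* p, uint16_t v) {
    p[0] = static_cast<uint8_t>(v & 0xFF);
    p[1] = static_cast<uint8_t>(v >> 8);
}
static void putLE32(uint8_t* p, uint32_t v) {
    for (int i = 0; i < 4; ++i) p[i] = static_cast<uint8_t>((v >> (8 * i)) & 0xFF);
}

// Build the canonical 44 byte PCM header for `dataBytes` bytes of audio data.
std::array<uint8_t, 44> makeWavHeader(uint32_t dataBytes,
                                      uint32_t sampleRate = SAMPLE_RATE,
                                      uint16_t bits = BITS_PER_SAMPLE,
                                      uint16_t channels = NUM_CHANNELS) {
    assert(bits == 8 || bits == 16);
    const uint16_t blockAlign = static_cast<uint16_t>(channels * (bits / 8));
    std::array<uint8_t, 44> h{};
    std::memcpy(&h[0], "RIFF", 4);
    putLE32(&h[4], 36 + dataBytes);           // file size minus the first 8 bytes
    std::memcpy(&h[8], "WAVE", 4);
    std::memcpy(&h[12], "fmt ", 4);
    putLE32(&h[16], 16);                      // size of the fmt chunk for PCM
    putLE16(&h[20], 1);                       // audio format 1 = uncompressed PCM
    putLE16(&h[22], channels);
    putLE32(&h[24], sampleRate);
    putLE32(&h[28], sampleRate * blockAlign); // bytes per second
    putLE16(&h[32], blockAlign);              // bytes per sample frame
    putLE16(&h[34], bits);
    std::memcpy(&h[36], "data", 4);
    putLE32(&h[40], dataBytes);               // size of the audio data that follows
    return h;
}
```

Because the recording length is only known once the button is released, a common approach is to write a placeholder header first and rewrite it with the final data size after recording stops.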

Known issues

  • WifiClientSecure connection is not fully reliable (presumably a RAM heap issue in the WifiClientSecure.h library); it freezes on rare occasions (e.g. after 10 mins)

Updates

  • 2024-07-22: Misc. enhancements, STT connection reliability improved further, code cleaned up
  • 2024-07-14: Updated version:
    • WifiClientSecure connection reliability improved (still not perfect)
    • STT Deepgram response faster (average total response time for e.g. a 5 sec voice recording is now ~2.5 sec). Recommendation: it's worth trying 8 kHz/8 bit once, STT response ~1 sec (Note: using complete sentences instead of single words improves recognition quality)
    • User language settings (STT & TTS) added, bug fixes etc.
  • 2024-07-08: First drop, already working, not yet fully cleaned up (just posted this drop at some folks' request)
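One plausible contributor to the faster 8 kHz/8 bit response is simply the amount of data that has to be uploaded. The arithmetic below is a sketch assuming mono PCM; the 16 kHz/16 bit comparison setting is an example of mine, not a measurement from this project.

```cpp
#include <cstdint>

constexpr uint32_t WAV_HEADER_BYTES = 44;

// Size in bytes of a mono PCM .wav file of the given length.
constexpr uint32_t wavFileBytes(uint32_t sampleRate, uint32_t bitsPerSample, uint32_t seconds) {
    return WAV_HEADER_BYTES + sampleRate * (bitsPerSample / 8) * seconds;
}

// 5 sec recording at 8 kHz/8 bit: roughly 40 KB to upload.
static_assert(wavFileBytes(8000, 8, 5) == 40044, "8 kHz/8 bit, 5 sec");
// 5 sec recording at 16 kHz/16 bit (example setting): four times as much.
static_assert(wavFileBytes(16000, 16, 5) == 160044, "16 kHz/16 bit, 5 sec");
```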

Next steps

  • Code cleanup, regular updates .. ongoing
  • Review & improve reliability of WifiClientSecure connection .. ongoing
  • Fixing 'Play 8bit audio' issue - Done (2024-07-18), latest AUDIO.H (since 2024-07-18) supports 8bit wav format
  • Adding more use case examples in main sketch
  • Including SpeechGen.IO TTS API call (hundreds of additional voices). Coded already, but unfortunately it has failed since the ESP 3.x framework update
  • Including an OpenAI API library with demo code, using an ESP32 as Voice ChatGPT device

. . .

Demo Videos

Short video clip, presenting Recording & SpeechToText & TextToSpeech (without OpenAI, so the ESP32 is not 'answering', just parroting my voice). Workflow:

  • Recording user voice, storing audio .wav file (8 kHz/8 bit) to SD card,
  • STT: transcribe pre-recorded voice via Deepgram API,
  • TTS: repeat spoken sentence with Google TTS voice (and/or triggering e.g. LED via voice):
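The STT step of this workflow can be sketched as two string helpers: one building the HTTP request header for Deepgram's pre-recorded endpoint (`POST /v1/listen` with an `Authorization: Token` header), one pulling the first transcript out of the JSON response. This is a simplified illustration, not the code from lib_audio_transcription.ino: the function names are hypothetical, `Connection: close` is an assumption, and the parser is a naive string search that expects `"transcript":"` without whitespace.

```cpp
#include <cstddef>
#include <string>

// Build the HTTP header that precedes the raw .wav bytes in the request body.
std::string buildDeepgramRequestHeader(const std::string& apiKey,
                                       const std::string& language,
                                       std::size_t wavBytes) {
    return "POST /v1/listen?language=" + language + " HTTP/1.1\r\n"
           "Host: api.deepgram.com\r\n"
           "Authorization: Token " + apiKey + "\r\n"
           "Content-Type: audio/wav\r\n"
           "Content-Length: " + std::to_string(wavBytes) + "\r\n"
           "Connection: close\r\n\r\n";
}

// Return the first "transcript" value found in the response, or "" if none.
std::string extractTranscript(const std::string& json) {
    const std::string key = "\"transcript\":\"";
    const std::size_t start = json.find(key);
    if (start == std::string::npos) return "";
    std::string out;
    for (std::size_t i = start + key.size(); i < json.size(); ++i) {
        if (json[i] == '\\' && i + 1 < json.size()) {  // keep escaped char, e.g. \"
            out += json[++i];
            continue;
        }
        if (json[i] == '"') break;                     // closing quote of the value
        out += json[i];
    }
    return out;
}
```

On the ESP32, the header and the .wav file content would be written to a WifiClientSecure connection on port 443, and the extracted transcript passed on to TTS or to the voice trigger logic.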

Video Screenshot


Featured video from other users & friends:
@techiesms: using my Deepgram transcription STT library in his IoT projects:
https://www.youtube.com/watch?v=j0EEFXmikvk-
