In this guide, we’ll explore how to use the Seeed XIAO ESP32S3 Sense to capture voice input using its built-in microphone and convert it into text using the ElevenLabs Speech-to-Text API. Powered by the dual-core ESP32-S3 and on-chip PSRAM, this tiny board makes it easy to prototype voice-interactive applications with minimal hardware.
Test the Mic – Record and Save Audio Locally
Before diving into live transcription, let’s verify that the XIAO ESP32S3 Sense’s microphone works as expected. In this section, we’ll use I2S to record a short WAV audio clip and save it to an SD card.
What This Code Does
- Captures 10 seconds of mono audio at 16 kHz
- Saves it as a proper .wav file to the SD card
- Uses built-in PSRAM for buffering
- Prepares a valid WAV header for playback on any device
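The buffering and header steps above can be sketched in portable C++ that also compiles inside an Arduino sketch. This is a minimal illustration, not the sketch's actual code: the names `pcmByteCount`, `putLE`, and `buildWavHeader` are hypothetical, and the layout assumed is the canonical 44-byte RIFF/WAVE header for uncompressed PCM.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Size of a canonical PCM WAV header in bytes.
constexpr uint32_t kWavHeaderSize = 44;

// Bytes of raw PCM needed for a recording of the given length.
constexpr uint32_t pcmByteCount(uint32_t seconds, uint32_t sampleRate,
                                uint16_t bitsPerSample, uint16_t channels) {
  return seconds * sampleRate * (bitsPerSample / 8) * channels;
}

// Write the low `bytes` bytes of `value` into `dst`, little-endian.
static void putLE(uint8_t *dst, uint32_t value, int bytes) {
  for (int i = 0; i < bytes; ++i) dst[i] = (value >> (8 * i)) & 0xFF;
}

// Fill `out` (at least 44 bytes) with a RIFF/WAVE header for PCM audio.
void buildWavHeader(uint8_t *out, uint32_t dataBytes, uint32_t sampleRate,
                    uint16_t bitsPerSample, uint16_t channels) {
  assert(out != nullptr);
  const uint16_t blockAlign = channels * (bitsPerSample / 8);
  const uint32_t byteRate = sampleRate * blockAlign;
  std::memcpy(out + 0, "RIFF", 4);
  putLE(out + 4, dataBytes + kWavHeaderSize - 8, 4);  // RIFF chunk size
  std::memcpy(out + 8, "WAVE", 4);
  std::memcpy(out + 12, "fmt ", 4);
  putLE(out + 16, 16, 4);            // fmt chunk size for PCM
  putLE(out + 20, 1, 2);             // audio format 1 = uncompressed PCM
  putLE(out + 22, channels, 2);
  putLE(out + 24, sampleRate, 4);
  putLE(out + 28, byteRate, 4);
  putLE(out + 32, blockAlign, 2);
  putLE(out + 34, bitsPerSample, 2);
  std::memcpy(out + 36, "data", 4);
  putLE(out + 40, dataBytes, 4);     // size of the PCM payload that follows
}
```

For the settings listed above, `pcmByteCount(10, 16000, 16, 1)` is 320,000 bytes, which is why the recording is buffered in PSRAM rather than internal RAM. On the ESP32 Arduino core, a buffer of that size would typically be allocated with `ps_malloc`.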
Recording Output
- File: /arduino_rec.wav
- Format: 16-bit PCM, mono, 16 kHz
- Can be played using VLC, Audacity, etc.
To use ElevenLabs' Speech-to-Text API, you’ll need an API key. Here's how to generate one:
- Go to https://elevenlabs.io and sign in or create a free account.
- Once logged in, navigate to your Account Settings or API section from the dashboard.
- Find the API Key area and click “Create New Key” (give it a name like “XIAO_STT_Test”).
- Copy the generated API key and save it in a safe place; you’ll use it in the Arduino sketch later.
With our audio successfully recorded and saved to the SD card as a .wav file, it's time to bring in the power of ElevenLabs. In this section, we'll walk through how to send the recorded file to the ElevenLabs Speech-to-Text API and receive back a transcription.
What This Code Does
- Connects your XIAO ESP32S3 to WiFi
- Records 5 seconds of 16-bit mono audio at 16 kHz using the built-in mic
- Saves the file to an SD card with a valid .wav header
- Finds the latest .wav file recorded
- Sends the file using a multipart HTTP POST request to the ElevenLabs STT API
- Parses the JSON response and prints the transcribed text to Serial
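The multipart upload step can be sketched as two small helpers that build the text around the file bytes. This is an illustration under assumptions, not the sketch's actual code: `MultipartParts`, `buildMultipart`, and `multipartContentLength` are hypothetical names, and the form field names `model_id` and `file` (and a model ID such as `scribe_v1`) reflect the ElevenLabs API documentation as understood at the time of writing, so check the current reference before relying on them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Text that goes before and after the raw file bytes in the request body.
struct MultipartParts {
  std::string head;
  std::string tail;
};

// Build the multipart/form-data framing for one text field plus one file.
// Field names are assumptions taken from the ElevenLabs STT documentation.
MultipartParts buildMultipart(const std::string &boundary,
                              const std::string &modelId,
                              const std::string &fileName) {
  assert(!boundary.empty());
  MultipartParts p;
  p.head = "--" + boundary + "\r\n";
  p.head += "Content-Disposition: form-data; name=\"model_id\"\r\n\r\n";
  p.head += modelId + "\r\n";
  p.head += "--" + boundary + "\r\n";
  p.head += "Content-Disposition: form-data; name=\"file\"; filename=\"" +
            fileName + "\"\r\n";
  p.head += "Content-Type: audio/wav\r\n\r\n";
  p.tail = "\r\n--" + boundary + "--\r\n";  // closing boundary
  return p;
}

// Value for the Content-Length header: framing text plus the file itself.
size_t multipartContentLength(const MultipartParts &p, size_t fileBytes) {
  return p.head.size() + fileBytes + p.tail.size();
}
```

In a sketch, the request would send `head`, then stream the .wav file from the SD card in chunks, then send `tail`, so the whole file never has to sit in RAM at once. The request headers would include `Content-Type: multipart/form-data; boundary=<boundary>`, the computed `Content-Length`, and the API key (assumed here to go in an `xi-api-key` header).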
Output
- Once the recording is saved to the SD card, the XIAO ESP32S3 sends the .wav file to ElevenLabs' Speech-to-Text API.
- Upon success, the transcription result is displayed in the Arduino Serial Monitor.
- The transcription itself is done by the ElevenLabs service rather than on the board: the API handles multilingual input and returns the text together with transcription metadata such as word-level timestamps and confidence scores.
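Pulling the transcript out of the JSON response can be sketched as follows. This is a deliberately small stand-in for a real JSON library (on Arduino, ArduinoJson would be the usual choice), and the function names are hypothetical. It assumes the response is a JSON object with a top-level string field named `text`, and that per-word objects nested inside it may reuse the same key, which is why the search is limited to depth 1.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Read the JSON string whose opening quote is at json[i]; leaves i just past
// the closing quote. Decodes \" \\ \/ \n \r \t; other escapes are not decoded.
static std::string readString(const std::string &json, size_t &i) {
  std::string out;
  ++i;  // skip opening quote
  while (i < json.size() && json[i] != '"') {
    if (json[i] == '\\' && i + 1 < json.size()) {
      const char e = json[++i];
      if (e == 'n') out += '\n';
      else if (e == 'r') out += '\r';
      else if (e == 't') out += '\t';
      else if (e == 'u') out += "\\u";  // \uXXXX is copied through as-is
      else out += e;
    } else {
      out += json[i];
    }
    ++i;
  }
  if (i < json.size()) ++i;  // skip closing quote
  return out;
}

static void skipSpace(const std::string &json, size_t &i) {
  while (i < json.size() && std::isspace(static_cast<unsigned char>(json[i]))) ++i;
}

// Return the value of the top-level string field `key`, or "" if not found.
std::string topLevelString(const std::string &json, const std::string &key) {
  assert(!key.empty());
  int depth = 0;
  size_t i = 0;
  while (i < json.size()) {
    const char c = json[i];
    if (c == '{' || c == '[') { ++depth; ++i; }
    else if (c == '}' || c == ']') { --depth; ++i; }
    else if (c == '"') {
      const std::string s = readString(json, i);
      skipSpace(json, i);
      const bool isKey = i < json.size() && json[i] == ':';
      if (isKey && depth == 1 && s == key) {
        ++i;  // skip ':'
        skipSpace(json, i);
        if (i < json.size() && json[i] == '"') return readString(json, i);
        return "";
      }
    } else {
      ++i;
    }
  }
  return "";
}
```

In a sketch, the HTTP response body would be passed in as a string and the result printed with `Serial.println(topLevelString(body, "text").c_str())`.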