In December 2023, I participated in a hands-on workshop titled "Renesas Voice UI Solutions," organized in Thailand by the Thai Embedded Systems Association (TESA) and AVNET, and received a Renesas Voice-RA6E1 board as a result. With the long Songkran holiday at hand, I decided to experiment with the Voice-RA6E1 board. However, the Renesas development process using e2 studio is somewhat complex, and there is a risk of erasing the license key programmed into flash memory if you stray from the sample code. While researching further, I came across news about four genuine Arduino boards that support Cyberon's Speech Recognition Engine. Since I happen to have an Arduino Nano RP2040 Connect board, I decided to give it a try.
To experiment with sample code using Cyberon's Speech Recognition Engine, the first step is to retrieve the serial number from the board. Here's how to retrieve the serial number for the Arduino Nano RP2040 board:
- Install the Cyberon_DSpotterSDK_Maker_RP2040 library.
- Open the GetSerialNumber.ino sample code.
- Build and upload the code to the board.
- Check the serial number printed from the Serial monitor.
This Cyberon_DSpotterSDK_Maker_RP2040 library provides the necessary functions for interacting with Cyberon's Speech Recognition Engine on the Arduino Nano RP2040 board. You can install the library using the Arduino Library Manager or by manually downloading and adding the library files to your project. For the other three Arduino boards, the included library must be chosen according to the board type: Cyberon_DSpotterSDK_Maker_33BLE (Nano 33 BLE Sense), Cyberon_DSpotterSDK_Maker_PortentaH7 (Portenta H7), and Cyberon_DSpotterSDK_Maker_NiclaVision (Nicla Vision).
Then, I visited Cyberon's DSpotterSDK Maker website and registered for a trial license. Upon successful registration of the retrieved serial number, I received a license text to be entered into the CybLicense.h file. This license text is locked to the specific board used for registration. With the trial license text in hand, I tried out the sample code VR_LEDControl, which includes a pre-trained English language model. The trigger phrase is "Hey Arduino", and the voice commands to control the three-color LED are "LED red", "LED green", "LED blue", and "LED off". The sample code uses the LED on pin 13 of the board to indicate the current state (waiting for a trigger or for a command).
The speech recognition models are provided in binary format and stored in two files: Model_L0.h (smaller file, lower accuracy) and Model_L1.h (larger file, higher accuracy). Due to its ARM Cortex-M0+ processor, the Arduino Nano RP2040 board is limited to using the Model_L0 to minimize processing load. The other three boards, however, can choose between Model_L0 and Model_L1. A list of commands and model codes is provided in the Info.txt file located in the /data subfolder.
The sample code appears straightforward. The callback function receives event information, command codes, and parameters obtained from detecting trigger words or commands.
void VRCallback(int nFlag, int nID, int nScore, int nSG, int nEnergy) {
    if (nFlag == DSpotterSDKHL::InitSuccess) {
        // code to handle SDK init success event
    } else if (nFlag == DSpotterSDKHL::GetResult) {
        switch (nID) {
            // code to handle voice commands
        }
    } else if (nFlag == DSpotterSDKHL::ChangeStage) {
        switch (nID) {
            // code to handle state changes: trigger <-> command
        }
    } else if (nFlag == DSpotterSDKHL::GetError) {
        // code to handle SDK error
    } else if (nFlag == DSpotterSDKHL::LostRecordFrame) {
        // code to handle voice streaming issues
    }
}
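The switch on nID dispatches on the command codes listed in the model's Info.txt file. As a rough illustration of that dispatch, separated into a pure function for clarity, the mapping might look like the sketch below. The ID values and the function name are hypothetical; the real codes depend on the generated model.

```cpp
#include <string>

// Hypothetical command IDs -- the real values come from the model's Info.txt.
enum Command {
    CMD_LED_RED   = 10001,
    CMD_LED_GREEN = 10002,
    CMD_LED_BLUE  = 10003,
    CMD_LED_OFF   = 10004
};

// Map a recognized command ID to the LED color it should set.
// Returns "unknown" for IDs not present in the model.
std::string commandToColor(int nID) {
    switch (nID) {
        case CMD_LED_RED:   return "red";
        case CMD_LED_GREEN: return "green";
        case CMD_LED_BLUE:  return "blue";
        case CMD_LED_OFF:   return "off";
        default:            return "unknown";
    }
}
```

Keeping the mapping in a separate function makes the callback body short and lets the LED logic be tested without the SDK.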
To assess the processing performance, I profiled the main code section, g_oDSpotterSDKHL.DoVR(). The maximum time observed, representing the processing time for the audio data in the buffer, was 24 milliseconds. Counting the processing cycles within one second, I observed roughly 3-4 cycles, for less than 100 milliseconds of processing time per second. This indicates a utilization ratio of no more than 10%, leaving ample time for other tasks. Considering that the RP2040's ARM Cortex-M0+ cores run at 133 MHz, the performance is quite impressive. Switching to a higher-performance processor such as a Cortex-M4 or M7 would likely enable even more demanding signal processing.
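The utilization estimate follows directly from those measurements: at most 4 cycles per second at 24 ms each gives under 100 ms of processing per second. The arithmetic, spelled out:

```cpp
// Worst-case CPU utilization of DoVR(), from the profiling numbers above.
constexpr double kMaxCycleMs   = 24.0;  // longest observed DoVR() call
constexpr double kCyclesPerSec = 4.0;   // observed upper bound on calls per second

constexpr double busyMsPerSecond = kMaxCycleMs * kCyclesPerSec;  // 96 ms busy per second
constexpr double utilization     = busyMsPerSecond / 1000.0;     // under 10% of one core
```

Even this worst case leaves over 900 ms per second free for application code.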
Compared to speech recognition through Google Assistant or other cloud-based solutions, the Cyberon engine offers several advantages for developing home appliance products:
- Offline operation: Eliminates the complexity of network connectivity.
- Microcontroller-oriented: Enables processing on microcontrollers, making it suitable for resource-constrained devices.
- Multilingual support: Supports up to 40 languages/dialects, catering to a global audience.
Language is a crucial aspect of developing products for smart home scenarios, so I wanted to try this feature in my native tongue, Thai.
To facilitate Thai commands, a new model must be created from user-selected phrases. The sample code provides the following steps for creating a custom language model:
- Register email and board: Enter email and board information on the model settings website and accept the license agreement.
- Select Project Language: Choose the language to be used for the project.
- Define Trigger and Command Phrases: Provide the phrases that will serve as triggers and commands for voice interactions.
- Generate Model: Upon confirmation, the generated model file will be sent to the registered email address.
The generated language model can then be integrated into the product's firmware to enable voice recognition and command execution in Thai.
Replacing the CybLicense.h and Model_L0.h files in the original sample code with the files received via email allows the VR_LEDControl code to function immediately. However, the newly created model introduces a 20-second delay between each trigger detection, as evident from the messages reported through the Serial monitor. This is one of the limitations of any custom model created with the trial license.
Based on my experiment with sequential voice commands, the recognition of Thai commands is not yet fully accurate, possibly due to the use of the lower-accuracy L0 model. The command "green light on" (substituting for the "LED green" command) was not detected at all, while the other commands were recognized correctly. The trigger phrase "hello" (substituting for the "Hey Arduino" trigger) had to be spoken slowly to be detected. Interestingly, when I played the phrases using Google Translate's text-to-speech instead of my own voice, recognition accuracy was higher.
3rd step: add online feature
The Arduino Nano RP2040 Connect board includes a u-blox NINA-W102 module (essentially an ESP32) for connecting to WiFi and Bluetooth wireless networks. Adding online features is relatively straightforward; I chose to experiment with the MQTT protocol due to its minimal overhead. Since I always use PlatformIO as my development tool, I started by adding the WiFiNINA library (replacing the standard WiFi library), PubSubClient, and ArduinoJson. The code modifications covered two functions:
- Uplink function: When a voice command is recognized, the status is reported as JSON to the MQTT broker.
- Downlink function: JSON commands for controlling the LED can be used via MQTT.
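The two functions above amount to building and parsing small JSON payloads. A minimal sketch of that payload handling is shown below; it uses plain snprintf and string scanning instead of ArduinoJson so it stays self-contained, and the field name "led" and the message shape are my own choices, not part of any SDK.

```cpp
#include <cstdio>
#include <cstring>
#include <string>

// Uplink: build a status payload such as {"event":"command","led":"red"}
// to publish to the MQTT broker after a voice command is recognized.
std::string buildStatusJson(const char* led) {
    char buf[64];
    std::snprintf(buf, sizeof(buf),
                  "{\"event\":\"command\",\"led\":\"%s\"}", led);
    return std::string(buf);
}

// Downlink: extract the "led" value from a payload such as {"led":"blue"}.
// A real sketch would use ArduinoJson's deserializeJson() instead of this
// hand-rolled scan; returns "" if the key is missing or malformed.
std::string parseLedCommand(const std::string& json) {
    const char* key = "\"led\":\"";
    std::size_t pos = json.find(key);
    if (pos == std::string::npos) return "";
    pos += std::strlen(key);
    std::size_t end = json.find('"', pos);
    if (end == std::string::npos) return "";
    return json.substr(pos, end - pos);
}
```

On the board, buildStatusJson() would feed PubSubClient's publish(), and parseLedCommand() would run inside the MQTT message callback.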
When I added the MQTT protocol code to the sample VR_LEDControl code, I encountered a barrage of "lost recording frame" messages, preventing voice command detection. Debugging revealed that the common practice of periodically checking the connection to the MQTT broker was causing delays that resulted in missing parts of the audio data. Consequently, I opted to move the WiFi/MQTT connection code to the setup() function, which immediately resolved the issue.
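Moving the connection into setup() works for a demo, but a long-running device eventually needs to reconnect. An alternative that keeps the check in loop() without starving the audio pipeline is to rate-limit reconnect attempts; a sketch of that guard logic, with an interval value of my own choosing:

```cpp
// Decide whether to attempt an MQTT reconnect. Attempts are spaced out so a
// blocking connect() cannot delay DoVR() on every pass through loop(), which
// is what caused the "lost recording frame" errors.
bool shouldAttemptReconnect(bool connected, unsigned long nowMs,
                            unsigned long lastAttemptMs,
                            unsigned long intervalMs = 5000) {
    if (connected) return false;                   // nothing to do
    return (nowMs - lastAttemptMs) >= intervalMs;  // rate-limit attempts
}
```

In loop(), this guard would wrap the PubSubClient connect() call, with lastAttemptMs updated on each attempt; DoVR() still runs on every iteration.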
The Speech Recognition Engine developed in collaboration between Arduino and Cyberon is impressive in terms of its simple code and convenient online tools. However, the 20-second delay in detecting trigger phrases feels like a subtle push to purchase the $9 license to unlock this limitation. Additionally, this purchase is locked to the board, not the user. Therefore, anyone planning to use it for a project should be aware of this condition, as it could lead to budget overruns if the number of devices needs to be expanded. For businesses, choosing hardware that has a pre-existing agreement with Cyberon, such as Renesas, seems like a much better option. So, my current plan is to continue my development effort with the Renesas Voice RA6E1 board.