As the field of machine learning matures and microprocessor chips grow more capable, more and more machine learning algorithms are being deployed directly on microprocessors. In particular, lightweight machine learning frameworks adapted to edge devices, such as TensorFlow Lite [1], allow microprocessor chips to sidestep the heavy computational demands of traditional deep learning deployments.
In this work, I implement a simple speech recognition application on a consumer-grade microprocessor chip with a built-in microphone sensor. The microphone picks up the human voice signal, and a trained speech model recognizes simple words. The model distinguishes four states: the user says "yes", the user says "no", there is no sound ("silence"), or there is an unknown sound. The prediction result triggers a different LED response on the chip. The overall implementation follows Chapter 8 of the TinyML book [2].
Methods:
- Dataset: In this project, I trained the model on the Speech Commands dataset [3], which contains 65,000 one-second audio clips covering 30 short words (a loading sketch is given after this list).
- Data Pre-Processing: To better extract the speech-related features in each audio segment, I converted the raw waveform into a Mel spectrogram (a pre-processing sketch is given after this list). Humans do not perceive frequencies on a linear scale; we are better at detecting differences between lower frequencies than between higher ones. For example, we can easily tell 500 Hz from 1,000 Hz, but we struggle to tell 10,000 Hz from 10,500 Hz, even though the distance between the two pairs is the same. The Mel scale is a non-linear transformation of the frequency scale, constructed so that sounds equally spaced on the Mel scale are also perceived as equally spaced by human listeners. A Mel spectrogram is simply a spectrogram whose y-axis uses the Mel scale.
- Model Design: The input signal has now been processed into a normalized spectrogram. Convolutional neural networks can effectively extract local information from an image into a structured representation, so I applied a CNN-based architecture to extract information from the spectrograms and then passed the resulting high-level representations to a fully-connected layer, which outputs the predictions based on the representations the model has learned (a model sketch is given after this list).
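As a minimal sketch of the data loading step, the Speech Commands dataset is also published through TensorFlow Datasets; the snippet below only illustrates what the raw data looks like and is not the exact input pipeline used for training.

```python
# Minimal sketch: inspecting the Speech Commands dataset via TensorFlow Datasets.
# The split and pipeline here are illustrative; the book's script builds its own.
import tensorflow_datasets as tfds

train_ds = tfds.load("speech_commands", split="train")

for example in train_ds.take(1):
    # "audio": 1-D waveform sampled at 16 kHz; "label": integer word id
    audio, label = example["audio"], example["label"]
    print(audio.shape, label.numpy())
```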
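Below is a rough sketch of the Mel-spectrogram conversion using librosa; the on-device feature extractor in the TinyML example uses a dedicated audio frontend instead, and the frame and filter-bank parameters here are assumptions for illustration.

```python
# Sketch: converting a 1-second waveform into a normalized log-Mel spectrogram.
# Parameter choices (n_fft, hop_length, n_mels) are illustrative assumptions.
import librosa

def wav_to_log_mel(path, sr=16000, n_mels=40):
    y, _ = librosa.load(path, sr=sr)                       # load and resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, hop_length=320, n_mels=n_mels
    )                                                      # power spectrogram on the Mel scale
    log_mel = librosa.power_to_db(mel)                     # compress the dynamic range
    # Normalize to roughly [0, 1] before feeding it to the CNN
    return (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-6)
```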
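The sketch below shows a small Keras CNN in the spirit of the model described above: one convolutional layer over the spectrogram followed by a fully-connected softmax over the four classes. The exact layer sizes and input shape are assumptions and may differ from the book's model.

```python
# Sketch: a compact CNN over the normalized spectrogram, ending in a
# fully-connected softmax over the four classes (silence, unknown, "yes", "no").
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(49, 40, 1)),                 # ~49 time frames x 40 Mel bins (assumed)
    tf.keras.layers.Conv2D(8, (10, 8), strides=(2, 2),
                           padding="same", activation="relu"),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation="softmax"),    # one output per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```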
After training, the model is able to distinguish "yes" and "no" in human speech. When the board "hears" a "yes", the blue LED turns on; a red LED turns on when the board "hears" a "no".
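To run on the microcontroller, the trained model has to be converted into a TensorFlow Lite model and embedded in the board firmware. The snippet below is a simplified sketch of that conversion; the quantization settings are an assumption rather than the exact recipe from the book.

```python
# Sketch: converting the trained Keras model (from the model sketch above) to a
# TensorFlow Lite flatbuffer for deployment on the microcontroller.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)   # `model` from the sketch above
converter.optimizations = [tf.lite.Optimize.DEFAULT]          # post-training quantization
tflite_model = converter.convert()

with open("micro_speech.tflite", "wb") as f:
    f.write(tflite_model)
# The flatbuffer can then be embedded as a C array (e.g. via `xxd -i`) in the
# firmware that maps each prediction to the corresponding LED response.
```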
References:
[1] TensorFlow Lite. https://www.tensorflow.org/lite
[2] Warden, P., & Situnayake, D. (2019). TinyML. O'Reilly Media, Inc.
[3] Speech Commands dataset. https://www.tensorflow.org/datasets/catalog/speech_commands