Article

Embedded Implementation of Real-Time Voice Command Recognition on PIC Microcontroller

1
Innov’COM Laboratory, National Engineering School of Carthage, Ariana 2035, Tunisia
2
Networked Objects, Control, and Communication Systems (NOCCS), ENISo, University of Sousse, Sousse 4011, Tunisia
3
Electrical Engineering Department, National School of Engineers of Monastir, Monastir 5000, Tunisia
4
College of Engineering and IT, University of Dubai, Dubai 14143, United Arab Emirates
5
Laboratory of Advanced Systems (LSA), Polytechnic School of Tunis, Al Marsa 2078, Tunisia
*
Author to whom correspondence should be addressed.
Automation 2025, 6(4), 79; https://doi.org/10.3390/automation6040079
Submission received: 26 July 2025 / Revised: 17 October 2025 / Accepted: 24 October 2025 / Published: 28 November 2025

Abstract

This paper describes a real-time voice command recognition system for resource-constrained embedded devices, specifically a PIC microcontroller. While most existing voice-control solutions rely on high-performance processing platforms or cloud computation, the system described here performs all processing locally on a low-power device. Sound is captured through a low-cost MEMS microphone, segmented into short audio frames, and time-domain features, namely the Zero-Crossing Rate (ZCR) and Short-Time Energy (STE), are extracted. These features were chosen for their low power draw, computational efficiency, and suitability for real-time processing on a microcontroller. For this experimental system, a small vocabulary of four command words (i.e., “ON”, “OFF”, “LEFT”, and “RIGHT”) was used to emulate practical voice-command interfaces. The main contribution is the combination of lightweight signal-processing techniques with embedded neural network inference, completing a classification cycle in real time (under 50 ms). Classification accuracy above 90% was demonstrated using confusion matrices and timing analysis of the classifier’s performance across vocabularies of varying complexity. This method is well suited to IoT and portable embedded applications, offering a low-latency alternative to more complex and resource-intensive classification architectures.

Graphical Abstract

1. Introduction

Today, voice recognition technologies can be found in modern applications such as smart homes [1,2], wearable devices [3], robots [4], and car interfaces [5]. They provide an easy, hands-free way to control devices, and their growing popularity is driven by improvements in speech recognition [6,7]. However, traditional solutions often involve high-performance processors or cloud-based computation, which can be problematic for small, battery-powered embedded devices in terms of latency, privacy, and connectivity [8,9].
There are many challenges in implementing voice command recognition on microcontrollers, including limited CPU cycles, memory, and power budget [10,11]. While previous researchers have achieved embedded speech processing using DSP solutions on ARM Cortex-M platforms [12,13], an efficient method for real-time command recognition has remained elusive. More recent studies have reduced the computational burden by using lightweight audio features such as MFCCs [14,15] or simple time-domain features like signal energy and zero-crossing rate (ZCR) [16,17]. Some researchers have reduced processing loads further by using simple classifiers such as SVMs [18] or small neural networks [19,20].
PIC microcontrollers now include DSP blocks and floating/fixed-point arithmetic, enabling advanced on-chip signal processing [21]. Some studies have demonstrated small neural networks running on PICs for limited-vocabulary recognition [22,23]; however, real-time performance for multiple commands has not been fully demonstrated.
In this work, we propose a real-time voice command recognition system using a PIC microcontroller. Audio is captured via a low-cost MEMS microphone, and simple time-domain features (ZCR and STE) are extracted. These features feed a small multilayer perceptron (MLP) neural network that classifies four basic commands: “ON”, “OFF”, “LEFT”, and “RIGHT” (See Figure 1).
The system design focuses on lightweight, adaptable, and resource-efficient keyword detection with advanced latency management to demonstrate robustness across acoustic conditions. The neural network is trained offline and quantized to fixed point to fit the constraints of the PIC platform. Our experimental results show >90% accuracy, latency under 50 ms, and low power consumption, verified through confusion matrices, measured timing, and measured power. Although consumer devices support complex voice commands, our work shows that real-time voice command recognition can be executed even by ultra-low-power microcontrollers. The presented approach is well suited to IoT devices, embedded robotics, and smart home use cases where latency, memory, and power constraints rule out cloud or higher-performance solutions. In this way, the present work establishes a baseline framework for lightweight on-device voice command recognition that can extend to larger vocabularies or more complex command sets.
The main contributions included in this work are the achievement of the following goals:
Design, implement, and evaluate a voice-controlled system on a PIC microcontroller.
Structure a keyword-spotting framework that is lightweight, versatile, and deployable in resource-limited environments.
Ensure low-latency communication between system components to keep the system functional and responsive.
Use progressive latency management to adapt to changes in ambient noise.
Test and verify that the system works as expected.
Provide a modular, expandable design that supports future enhancements and added functionality without constraining other aspects of the system.
This paper is structured as follows: Section 2 reviews the literature relevant to voice command recognition systems and their embedded implementations. Section 3 describes the methodology, covering dataset preparation, feature extraction, and model training. Section 4 details the experimental setup and reports the results of testing the system on embedded hardware. Section 5 evaluates the performance of the system, including accuracy, inference time, and power consumption, while identifying its present limitations. Section 6 provides a broader discussion of the findings, implementation challenges, and opportunities for improvement. Finally, Section 7 gives conclusions and future work.

2. Literature Review

This section looks at the history and current developments in real-time voice command recognition, with special reference to the implementation of this technology in PIC microcontrollers. Subjects of relevance are system architecture, algorithms, signal processing techniques, hardware constraints, and optimization methods. Voice command recognition has gained an important place in embedded systems, especially applications that require real-time processing and low resource environments.

2.1. Voice Recognition Technology’s Progression

Voice recognition technologies have evolved from being dependent on powerful cloud-based resources that are not very portable to lighter and more efficient options that can be embedded. Traditional systems required a combination of algorithms, along with the computational expense of running them on powerful servers or processors. In recent years, advancements in technology have started to allow lighter-weight voice recognition to happen directly onboard the device with embedded devices [24], as there is an increasing demand for hands-free control options, especially in assistive technology, industrial automation, and smart homes.
Contemporary systems rely on signal capture, preprocessing, feature extraction, and classification, followed by command execution [25]. Higher-end systems use deep learning models that provide reliable recognition; however, these fall short on resource-constrained devices like microcontrollers unless they are heavily optimized.

2.2. Role of Microcontrollers in Voice Recognition

Microcontrollers are valued in embedded applications for their low power consumption, real-time processing, and cost efficiency. The PIC family (e.g., PIC16F877A, PIC18F45K22) exemplifies devices capable of real-time operation despite their limited processing power and memory [26]. In this context, microcontrollers are chosen to implement voice-controlled systems where simplicity, reliability, and resource efficiency are critical. The challenge lies in adapting standard voice recognition pipelines to fit these constraints, which often requires optimization at both algorithmic and hardware levels [27].

2.3. Embedded Systems Feature Extraction Techniques

Feature extraction, which converts raw audio data into a simpler, more tractable representation for later classification, is an essential part of voice recognition. Mel Frequency Cepstral Coefficients (MFCC) are a widely used method for capturing the timbral information of speech and are popular due to their high accuracy and efficiency [28]. Linear Predictive Coding (LPC), which models the human vocal tract, is an efficient technique well suited to low-power devices. The Fast Fourier Transform (FFT) is also important, transforming time-domain data to the frequency domain and enabling more in-depth spectral analysis.
Lightweight and real-time techniques are important in embedded systems and microcontrollers, as they have limited processing capabilities. Recent work has focused on improving MFCC for microcontroller platforms by optimizing the filter bank, making the frame sizes smaller while retaining accuracy and ensuring real-time performance [29].

2.4. Classification of Voice Commands

Once the command characteristics are captured, classification methods determine the spoken command. KNN is simple to use and efficient with small datasets. Decision trees are light in memory and quick in execution. SVMs may need more resources, but generalize better. ANNs incur more CPU resources but can model complex patterns. Fast models are preferred for the real-time requirements of embedded systems [30,31].

2.5. Experimental Implementations on PIC Microcontrollers

Multiple studies have experimentally verified the voice recognition abilities of PIC microcontrollers under laboratory conditions. For example, a PIC16F877A was tested with an ISD1820 module and used to categorize five voice commands for a home automation project in relatively quiet ambient surroundings [32]. A PIC18F4550 was also shown to perform simple MFCC-based recognition from external memory, achieving 85% accuracy [33]. These studies highlight the value of external speech processing modules, both to reduce the load on the processor and to enable real-time operation. The trials also confirmed that several variables affect recognition performance, including ambient noise level, the number of voice commands, and the sampling rate. To improve efficiency, a lightweight feature extraction method (such as a simplified MFCC or FFT) can be used to obtain features from voice commands while maintaining accurate recognition and the processing and memory efficiency the PIC microcontroller requires. In conclusion, despite limited resources, PIC microcontrollers can be practically used for voice command recognition when combined with optimized algorithms and modules.
Table 1 is a comparative summary of selected embedded speech recognition systems, listing the microcontroller, voice module, feature extraction and classification methods, real-time capability, and the primary goal of each study. Unlike many systems that rely on more complex feature extraction methods such as MFCC, the work presented here combines a low-complexity neural network classifier, operating in real time on a low-power PIC microcontroller, with only the most basic time-domain features: energy and zero-crossing rate (ZCR).
The PIC microcontroller was chosen as a low-cost, low-power, and widely available component typically used in embedded systems. ARM or Raspberry Pi-class solutions deliver more resources but at a higher cost and power consumption. Thus, the PIC microcontroller is more representative of the resource-constrained context in which TinyML solutions can be implemented more effectively. In comparison to prior work in Table 1, our system can be seen to improve upon previous designs. For example, compared to [26,27], our system has implemented full real-time voice command recognition rather than simply basic triggering/playback. Or, unlike [28,29], our system can perform recognition using internal PIC microcontroller resources without additional external components such as a DSP or high-performance processing board. From a processing efficiency perspective, [30,31] require FFT or complex hybrid rules, while our approach uses lightweight ZCR and STE features and can execute recognition in <50 ms. Finally, unlike [33], where static lookup is applied, our design features a quantized neural network classifier that has the potential to generalize recognition performance across different commands. In summary, these contributions reflect the novelty of the proposal for real-time TinyML-based voice recognition using ultra-low-power microcontrollers as well as the implications of this research.

3. Methodology

This section outlines the design and implementation of the real-time voice command recognition system on a constrained PIC microcontroller. The audio signal is captured through a simple microphone, and low-complexity features, signal energy and zero-crossing rate, are used to characterize the speech signal. These features are input to a small neural network, which is trained offline and implemented with fixed-point logic to accommodate the microcontroller's limitations. The system performs command recognition entirely in the embedded software, enabling rapid, low-power recognition of four basic voice commands. This approach allows real-time operation without sacrificing accuracy in an embedded context.

3.1. System Architecture

The proposed system enables the PIC microcontroller to perform real-time voice command recognition. The system consists of two phases: offline training and embedded inference, as illustrated in Figure 2.
During offline training, a small quantized Multi-Layer Perceptron (MLP) is trained using a dataset of labeled voice commands. While common feature extraction techniques such as MFCC, LPC, and FFT are widely used in the literature, they were not employed in this work due to their high computational demand. Instead, the system relies on Zero-Crossing Rate (ZCR) and Short-Time Energy (STE), lightweight features that enable real-time execution on resource-constrained devices such as the PIC microcontroller.
In real-time operation, analog voice signals are captured by a low-cost MEMS microphone at 16 kHz and digitized by the PIC’s ADC. The signals are framed and windowed, then the ZCR and STE features are computed. The resulting feature vectors are fed into the quantized MLP model deployed on the PIC, which classifies the commands. Finally, the recognized command is translated into a control signal for the actuation system, such as turning lights or motors on/off.
The functional architecture developed for real-time voice command recognition on a PIC microcontroller is summarized in Table 2, which presents the five main processing blocks and their interactions; each block handles one significant step of the recognition pipeline.
Role and Novelty of Motor Control in the Proposed System: The motor control component in the proposed system is not mandatory for the TinyML model or the voice recognition flow itself, but it serves as a practical demonstration of the system’s capability to translate recognized voice commands into real-world actions. By integrating motor or actuator control, the system showcases how voice commands can directly interact with embedded hardware, which is critical for applications such as smart home automation, robotics, or IoT devices. The novelty lies in combining ultra-low-power and real-time voice recognition with direct actuation, demonstrating that a resource-constrained PIC microcontroller can reliably handle both classification and control tasks simultaneously. This integration emphasizes the efficiency, low latency, and modularity of the proposed system, highlighting its applicability to embedded systems requiring hands-free control of physical devices.

3.2. Flowchart of the Microcontroller Software

The flowchart of the microcontroller software is depicted in Figure 3 and executes as a continuous real-time loop. After initializing the components (ADC, timer, GPIO, and MLP weights), the MCU samples one second of audio, preprocesses the signal, and extracts features. These features are classified by the quantized MLP, which activates the control signals to the actuators. The loop then returns to sample the next audio segment.
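The loop described above can be sketched in portable C. The decision rule at the end is a placeholder standing in for the quantized MLP, and all function names and thresholds are illustrative, not taken from the actual firmware:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One pass through the recognition loop of Figure 3, written as
   host-runnable C. The final decision rule is a placeholder for the
   quantized MLP; names and thresholds are illustrative only. */

enum command { CMD_ON, CMD_OFF, CMD_LEFT, CMD_RIGHT, CMD_NONE };

enum command run_one_cycle(const int16_t *samples, size_t n)
{
    /* 1. Preprocess: remove the DC offset introduced by the ADC bias. */
    long mean = 0;
    for (size_t i = 0; i < n; ++i) mean += samples[i];
    mean /= (long)n;

    /* 2. Extract features: zero-crossings and average energy. */
    unsigned zc = 0;
    unsigned long energy = 0;
    int prev = samples[0] - (int)mean;
    for (size_t i = 1; i < n; ++i) {
        int s = samples[i] - (int)mean;
        if ((s > 0 && prev <= 0) || (s <= 0 && prev > 0)) ++zc;
        energy += (unsigned long)((long)s * s);
        prev = s;
    }
    energy /= n;

    /* 3. Classify: stand-in rule where the MLP inference would run. */
    if (energy < 16) return CMD_NONE;          /* silence gate */
    return (zc > n / 4) ? CMD_LEFT : CMD_ON;   /* illustrative only */
}
```

On the target, the returned command would drive the actuator outputs before the loop resumes sampling.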

3.3. Voice Command Processing Pipeline on Microcontroller

This section describes the whole data processing pipeline running on the microcontroller. It discusses the acquisition and preprocessing of the raw audio signal, and extraction of pertinent features from the voice input.

3.3.1. Audio Capture and Preprocessing

The complete process of recording and digitizing the voice signal and preparing it for feature extraction is described in this subsection.
As shown in Table 3, the system uses a MEMS microphone followed by ADC conversion, windowing, framing, and denoising, constituting an efficient signal preprocessing sequence for real-time voice command recognition on microcontroller-based platforms.

3.3.2. Feature Extraction

In this work, the implemented features on the PIC microcontroller are Zero-Crossing Rate (ZCR) and Short-Time Energy (STE). These features were selected because they are computationally simple, require minimal memory, and are suitable for real-time execution on resource-constrained embedded systems. While MFCC, LPC, and FFT are commonly used in embedded voice recognition systems, they were not used here due to the limited processing capability of the PIC microcontroller. ZCR and STE provide sufficient discriminative power for the small vocabulary of four commands (‘ON’, ‘OFF’, ‘LEFT’, ‘RIGHT’) and allow for rapid inference below 50 ms per command.
$$\mathrm{ZCR} = \frac{1}{2N}\sum_{n=1}^{N}\left|\operatorname{sign}(x_n) - \operatorname{sign}(x_{n-1})\right| \tag{1}$$
where
  • $x_n$ represents the value of the audio signal at the n-th sample.
  • $\operatorname{sign}(x)$ is the sign function, defined as
$$\operatorname{sign}(x) = \begin{cases} +1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases} \tag{2}$$
This expression counts the number of sign changes between adjacent samples, which is equivalent to the number of zero-crossings and then normalizes it by the total number of samples in the frame. The division by 2 N ensures the ZCR value is constrained to the range of [0, 1].
In addition, the Short-Time Energy (STE) computes the energy of each frame of signal, indicating voiced regions.
$$\mathrm{STE} = \sum_{n=1}^{N} x_n^2 \tag{3}$$
where
  • $x_n$ is the n-th sample in the frame,
  • $N$ is the total number of samples in the frame.
This equation is essentially summing the squared amplitude for every sample in that frame and returning a value proportional to the energy of that segment.
Equations (1) and (3) show that both features can be computed from the input signal over short overlapping frames. Their simplicity makes them suitable for real-time execution on a PIC microcontroller, which has limited memory and processing power.
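Both features can be implemented in integer-only arithmetic, which matters on a PIC without a floating-point unit. The sketch below follows Equations (1) and (3) directly; the thousandths scaling for ZCR is an assumption of this example, not a detail from the paper:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Integer-only ZCR and STE following Eqs. (1) and (3). ZCR is
   returned in thousandths so no floating point is needed. */
static int sgn(int x) { return (x > 0) - (x < 0); }

unsigned zcr_milli(const int16_t *x, size_t n)
{
    unsigned acc = 0;
    for (size_t i = 1; i < n; ++i)
        acc += (unsigned)abs(sgn(x[i]) - sgn(x[i - 1]));
    return (unsigned)((1000UL * acc) / (2UL * n));   /* (1/2N) Σ|Δsign| */
}

unsigned long ste(const int16_t *x, size_t n)
{
    unsigned long e = 0;
    for (size_t i = 0; i < n; ++i)
        e += (unsigned long)((long)x[i] * x[i]);     /* Σ x_n²  (Eq. 3) */
    return e;
}
```

A fully alternating frame such as {1, −1, 1, −1} yields a ZCR near the maximum, while a constant frame yields zero, matching the intuition that unvoiced sounds cross zero often and voiced sounds do not.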

3.3.3. Frame Processing and Feature Vector Construction

The audio input is divided into overlapping frames (typically 20–30 ms in length with 50% overlap) to extract these properties in real time. The ZCR and STE are computed and combined into a two-dimensional feature vector for every frame. Let the feature vector extracted from the i-th frame be represented as in Equation (4).
$$f_i = [\mathrm{ZCR}_i,\ \mathrm{STE}_i] \tag{4}$$
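The framing step can be sketched as follows: slide a window with a hop of half the frame length over the buffer and emit one [ZCR_i, STE_i] pair per frame, as in Equation (4). The helper names and thousandths scaling are illustrative assumptions of this sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int sgn(int x) { return (x > 0) - (x < 0); }

static uint32_t zcr_milli(const int16_t *x, size_t n)
{
    unsigned acc = 0;
    for (size_t i = 1; i < n; ++i)
        acc += (unsigned)abs(sgn(x[i]) - sgn(x[i - 1]));
    return (uint32_t)((1000UL * acc) / (2UL * n));
}

static uint32_t ste(const int16_t *x, size_t n)
{
    uint32_t e = 0;
    for (size_t i = 0; i < n; ++i)
        e += (uint32_t)((int32_t)x[i] * x[i]);
    return e;
}

/* Emit one [ZCR, STE] pair per 50%-overlapping frame (Eq. (4)).
   Returns the number of frames produced. */
size_t make_feature_vectors(const int16_t *x, size_t n, size_t frame_len,
                            uint32_t out[][2], size_t max_frames)
{
    size_t hop = frame_len / 2;                  /* 50% overlap */
    size_t count = 0;
    for (size_t s = 0; s + frame_len <= n && count < max_frames; s += hop) {
        out[count][0] = zcr_milli(x + s, frame_len);
        out[count][1] = ste(x + s, frame_len);
        ++count;
    }
    return count;
}
```

With a 20 ms frame at 16 kHz (320 samples) and a 160-sample hop, a one-second recording produces roughly 99 feature vectors.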

3.3.4. Neural Network Classifier

This section describes the structure and training of the neural network used for classification. A small Multi-Layer Perceptron (MLP) architecture was trained offline and later quantized for efficient fixed-point inference on the target embedded PIC microcontroller. The model parameters were exported as C arrays for deployment. The architecture and deployment details are summarized in Table 4.
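A minimal sketch of what fixed-point MLP inference with weights exported as C arrays can look like is shown below. The layer sizes, Q8.8 format, and weight values are illustrative placeholders, not the trained model of Table 4:

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point MLP inference with Q8.8 weights exported as C arrays.
   Layer sizes and weights below are illustrative placeholders. */
#define N_IN  2
#define N_HID 2
#define N_OUT 4
#define QBITS 8                        /* Q8.8: real value = raw / 256 */

static const int16_t w1[N_HID][N_IN] = { {256, 0}, {0, 256} }; /* identity */
static const int16_t b1[N_HID]       = { 0, 0 };
static const int16_t w2[N_OUT][N_HID] = {
    { 256, 0 }, { -256, 0 }, { 0, 256 }, { 0, -256 }
};
static const int16_t b2[N_OUT] = { 0, 0, 0, 0 };

static int16_t relu_q(int32_t v)       /* rescale, clamp, ReLU */
{
    v >>= QBITS;
    if (v < 0) return 0;
    if (v > INT16_MAX) return INT16_MAX;
    return (int16_t)v;
}

int mlp_classify(const int16_t in[N_IN])   /* returns class index */
{
    int16_t h[N_HID];
    for (int j = 0; j < N_HID; ++j) {
        int32_t acc = (int32_t)b1[j] << QBITS;
        for (int i = 0; i < N_IN; ++i)
            acc += (int32_t)w1[j][i] * in[i];
        h[j] = relu_q(acc);
    }
    int best = 0;
    int32_t best_v = INT32_MIN;
    for (int k = 0; k < N_OUT; ++k) {   /* argmax over output layer */
        int32_t acc = (int32_t)b2[k] << QBITS;
        for (int j = 0; j < N_HID; ++j)
            acc += (int32_t)w2[k][j] * h[j];
        if (acc > best_v) { best_v = acc; best = k; }
    }
    return best;   /* e.g., index into {ON, OFF, LEFT, RIGHT} */
}
```

Because products of two Q8.8 values carry a factor of 2^16, accumulating in 32 bits and shifting right by QBITS after each layer keeps intermediate values in range without any floating-point operations.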

4. Experimental Results

This section reviews and evaluates the proposed voice command recognition system. It describes the embedded software architecture and hardware platform, deployed on a PIC microcontroller, that processes command input at run time. The section also presents performance measures, including classification accuracy, confusion matrices, inference time, and total power consumption, drawn from the observations and tests performed. These results show the system's usefulness and viability for embedded voice control applications.

4.1. Hardware Description

This section presents the physical setup and embedded software architecture used to implement the real-time voice command recognition system.

4.1.1. Hardware Setup

The system is implemented using the PIC18F4550 microcontroller controlling two power arms, each comprising an IR2112 high- and low-side driver and two IRG4PC50KD IGBT transistors. The microcontroller generates PWM control signals that drive the inputs of the IR2112, which in turn switches the IGBTs to produce the desired voltage waveform; protective resistors are placed on the gate lines. The circuit is powered with 5 V for the microcontroller and a separate, higher DC voltage for the power stage (see Figure 4).

4.1.2. Command-to-Action Mapping

Table 5 shows the mapping of voice commands to motor actions. The PIC16F877A microcontroller takes a voice command and generates the appropriate signals on PORTD, which drives the L293D driver to energize the motors. For instance, the command “ON” energizes both motors to move forward, while “LEFT” and “RIGHT” rotate the system in opposite directions.
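The mapping can be expressed as a simple lookup from command to PORTD bit pattern. The bit values below are hypothetical placeholders, since the actual patterns in Table 5 depend on how the L293D inputs are wired:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative command-to-PORTD mapping for the L293D driver.
   Bit patterns are hypothetical; real values depend on the wiring. */
enum command { CMD_ON, CMD_OFF, CMD_LEFT, CMD_RIGHT };

uint8_t command_to_portd(enum command c)
{
    switch (c) {
    case CMD_ON:    return 0x05;  /* both motors forward */
    case CMD_LEFT:  return 0x04;  /* right motor only: rotate left */
    case CMD_RIGHT: return 0x01;  /* left motor only: rotate right */
    case CMD_OFF:
    default:        return 0x00;  /* both motors stopped */
    }
}
```

On the target, the returned byte would simply be written to the PORTD register after each classification.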

4.2. Dataset and Training Procedure

The dataset used for training and testing consisted of recorded utterances of the four command words “ON”, “OFF”, “LEFT”, and “RIGHT”. Samples were collected from multiple speakers under varying acoustic conditions to support generalization. The audio data was separated into 1 s clips and processed offline to extract the time-domain features, energy and zero-crossing rate. The Multi-Layer Perceptron (MLP) model was trained and then quantized to fixed point to minimize its memory footprint for deployment on the PIC microcontroller. Training used standard supervised learning with cross-entropy loss and early stopping to avoid overfitting.
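One way to perform the offline fixed-point quantization step is round-to-nearest with saturation into a Q8.8 integer, as sketched below. The Q8.8 format is an assumption of this example; the exact format used by the authors is not specified:

```c
#include <assert.h>
#include <stdint.h>

/* Quantize a trained floating-point weight to Q8.8 (value * 256),
   with round-to-nearest and saturation to the int16 range, before
   exporting the weights as C arrays. Format is illustrative. */
int16_t quantize_q8_8(double w)
{
    double scaled = w * 256.0;
    long r = (scaled >= 0.0) ? (long)(scaled + 0.5)
                             : (long)(scaled - 0.5);
    if (r > INT16_MAX) r = INT16_MAX;   /* saturate out-of-range weights */
    if (r < INT16_MIN) r = INT16_MIN;
    return (int16_t)r;
}
```

Saturation matters here: a single out-of-range weight that wrapped around instead of clamping could flip the sign of a neuron's contribution and silently corrupt inference.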

4.3. Classification Performance

This section addresses the validity and reliability of the voice command recognition system. The section includes results, including per-command accuracy rates and a confusion matrix, to provide clarity into the successes and failures of classification. The evaluation demonstrated that the system was able to recognize commands with a high degree of accuracy while also showing the common misclassifications to provide a review of the systems practical capabilities.

4.3.1. Accuracy per Command

A summary of individual command categorization accuracy can be seen in Figure 5. While there was strong recognition of all of the commands, “ON” and “OFF” did slightly better compared to the directional commands. This may be due to their distinct phonemes.

4.3.2. Confusion Matrix

Table 6 presents the classification results from testing the system on the four target commands in the confusion matrix. The matrix shows that most commands were recognized accurately with high confidence. The commands “LEFT” and “RIGHT” were the most likely to be confused, probably because of the phonetic similarity between the two commands.
Figure 6 shows the confusion matrix for the four voice commands “ON”, “OFF”, “LEFT”, and “RIGHT”. Correct classifications appear on the diagonal of the matrix as true positive counts, and the off-diagonal entries indicate misclassifications. The overall performance of the model is good, with only two misclassifications between the phonetically similar “ON” and “OFF”, which have very close acoustic levels based on measurement. The “LEFT” and “RIGHT” classifications were quite distinct, according to their acoustic structure.

4.3.3. Inference Time and Power Consumption Measurements

The inference time per command on the PIC microcontroller remained below 50 milliseconds, ensuring real-time responsiveness to the user. The fixed-point implementation enabled efficient computation with low latency, achieving performance comparable to the original implementation. Furthermore, power consumption remained minimal, with less than 10 mA drawn during active recognition. The combination of low latency and low power consumption makes this approach suitable for battery-powered embedded devices (see Table 7).
Figure 7a shows the measured inference time for each voice command, including the observed range across multiple cycles. Figure 7b presents the corresponding power consumption in milliamperes for each command. Displaying the data in two separate subplots clearly differentiates computational latency from energy usage, providing an accurate and standardized representation of the system’s performance for real-time embedded applications.

4.4. Real-Time System Integration and Response Evaluation

In this section, we present the configuration of the real-time voice command recognition system with the embedded hardware and consider the response of the system to voice commands by collecting and analyzing audio and electrical control signals.

4.4.1. Audio Signal Acquisition and Visualization

To enhance the accuracy of voice command recognition under various acoustic environments, the captured audio signal is further processed using real-time digital filtering on the microcontroller. In this study, a band-pass filter is applied to isolate the frequencies associated with human speech (typically 300 Hz to 3400 Hz) and to reduce background noise and interference (see Table 8).
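A lightweight way to approximate the 300–3400 Hz band-pass stage is to cascade a first-order high-pass (~300 Hz) with a first-order low-pass (~3400 Hz). The sketch below uses floating point for clarity and one-pole coefficients computed for fs = 16 kHz; it is a stand-in under those assumptions, not the filter actually used on the microcontroller, where a fixed-point version would be preferred:

```c
#include <assert.h>

/* Crude band-pass: a one-pole tracker removes sub-300 Hz content
   (high-pass), then a one-pole smoother attenuates content above
   ~3400 Hz (low-pass). Coefficients assume fs = 16 kHz. */
typedef struct { float lp_slow, lp_fast; } bp_state;

float bandpass_step(bp_state *s, float x)
{
    const float a_hp = 0.105f;   /* one-pole coeff, fc ~ 300 Hz  */
    const float a_lp = 0.572f;   /* one-pole coeff, fc ~ 3400 Hz */

    s->lp_slow += a_hp * (x - s->lp_slow);   /* track low-frequency drift */
    float hp = x - s->lp_slow;               /* subtract it: high-pass    */
    s->lp_fast += a_lp * (hp - s->lp_fast);  /* smooth: low-pass          */
    return s->lp_fast;
}
```

Feeding a constant (DC) input drives the output toward zero, which is exactly the behavior needed to reject microphone bias and low-frequency rumble before feature extraction.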
Figure 8 shows sample audio waveforms of the spoken command “LEFT”. The top figure displays the raw unfiltered audio signal, which includes background noise, while the bottom figure shows the signal after band-pass filtering. Filtering effectively reduces background noise, resulting in a clearer speech component and improved recognition performance.

4.4.2. Voltage Signal and Control Output

The microcontroller issues PWM signals as input to the power drivers which modulate the output voltage to the load. To verify the hardware behavior for different voice commands, the output voltage waveform for each command was measured using an oscilloscope.
Table 9 displays an overview of all voice command PWM duty cycles and their relationship to output voltage.
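The relationship underlying Table 9 is that the average output voltage of a PWM-driven stage is the duty cycle times the DC bus voltage, V_out ≈ D · V_dc. A minimal integer-arithmetic sketch (bus voltage and units are illustrative):

```c
#include <assert.h>

/* Average output voltage for a PWM duty cycle: V_out = D * V_dc.
   Millivolt/percent units keep the computation in integer math. */
unsigned vout_mv(unsigned duty_pct, unsigned vdc_mv)
{
    if (duty_pct > 100) duty_pct = 100;   /* clamp invalid duty cycles */
    return (unsigned)((unsigned long)duty_pct * vdc_mv / 100UL);
}
```

For example, a 50% duty cycle on a 12 V bus yields an average of 6 V at the load, which is the kind of correspondence the oscilloscope measurements verify per command.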
Figure 9 shows the raw and filtered audio data that provided voice command signals for voice command recognition, as well as the PWM voltage control signals for the voice commands “ON,” “OFF,” “LEFT,” and “RIGHT.” This figure shows the pre-processing and real-time control signals from the embedded system.

4.4.3. Precision, Recall, and F1-Score Analysis

To provide a more comprehensive evaluation of the voice command recognition system, we calculated precision, recall, and F1-score for each of the four commands (“ON”, “OFF”, “LEFT”, and “RIGHT”). These metrics complement the accuracy and confusion matrix by quantifying the model’s ability to correctly identify each command while accounting for false positives and false negatives.
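The per-command metrics follow directly from the confusion matrix: precision = TP / (column sum), recall = TP / (row sum), and F1 is their harmonic mean. A small integer sketch (values scaled by 1000; matrix layout with rows as true class and columns as predicted class is an assumption of this example):

```c
#include <assert.h>

/* Per-class precision/recall/F1 in thousandths, from a confusion
   matrix with rows = true class, columns = predicted class. */
#define N_CLS 4

unsigned precision_milli(unsigned m[N_CLS][N_CLS], int c)
{
    unsigned tp = m[c][c], col = 0;
    for (int r = 0; r < N_CLS; ++r) col += m[r][c];  /* all predicted c */
    return col ? 1000U * tp / col : 0;
}

unsigned recall_milli(unsigned m[N_CLS][N_CLS], int c)
{
    unsigned tp = m[c][c], row = 0;
    for (int k = 0; k < N_CLS; ++k) row += m[c][k];  /* all true c */
    return row ? 1000U * tp / row : 0;
}

unsigned f1_milli(unsigned m[N_CLS][N_CLS], int c)
{
    unsigned p = precision_milli(m, c);
    unsigned r = recall_milli(m, c);
    return (p + r) ? 2U * p * r / (p + r) : 0;       /* harmonic mean */
}
```

Applied to a matrix where "ON" and "OFF" each absorb one of the other's utterances, both commands score 0.90 on all three metrics, while a perfectly recognized command scores 1.0.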
Table 10 presents the precision, recall, and F1-score for each command based on the experimental results obtained from the PIC microcontroller implementation. The metrics confirm that the classifier performs consistently across different commands, with slightly lower performance for phonetically similar commands (“LEFT” and “RIGHT”), which is consistent with the confusion matrix analysis.
Figure 10 shows the raw audio signal, filtered audio, and corresponding PWM voltage output for each command. It illustrates the real-time mapping of recognized voice commands to actuator control, confirming that correct classification reliably triggers the expected voltage output even in the presence of minor misclassifications.

4.5. Comparison with Previous TinyML Voice Recognition Works

To better evaluate the performance of the proposed system, a comparison with previous TinyML voice recognition implementations on PIC and ARM Cortex-M microcontrollers was conducted. Unlike many prior works that rely on computationally intensive features such as MFCC or large neural network models, this study utilizes simple time-domain features—Zero-Crossing Rate (ZCR) and Short-Time Energy (STE)—to achieve lightweight and fast inference suitable for real-time embedded applications.
Table 11 shows that the proposed system achieves comparable or better accuracy than previous works while significantly reducing inference time and power consumption, demonstrating the feasibility of running TinyML voice recognition on ultra-low-power devices. This comparison highlights the novelty of the proposed approach: efficient real-time performance, low computational cost, and the integration of motor control outputs, which were not commonly addressed in previous TinyML voice recognition works.
The advantages presented in Table 11 result from both the hardware configuration and the proposed methodological improvements. The PIC18F4550 microcontroller provides a hardware advantage due to its integrated ADC module, low instruction cycle latency, and optimized memory architecture, which together contribute to faster inference and reduced computational overhead. However, the main performance improvement comes from the proposed algorithmic approach, which combines ZCR and STE features. These features are computationally efficient and particularly suitable for short command recognition, allowing the model to maintain accuracy while minimizing processing time.
Additionally, the fixed-point implementation of the neural network further reduces energy consumption by avoiding floating-point operations, which are more demanding in embedded systems. The results demonstrate that the proposed system achieves a significantly lower power consumption (~8 mA) compared with ARM Cortex-M (12–15 mA) and Raspberry Pi-based systems (>200 mA), confirming the suitability of the design for low-power, real-time embedded applications.
Figure 11 presents a bar chart comparing the classification accuracy (left axis) and inference time (right axis) of previous TinyML voice recognition systems and the proposed PIC-based system. The chart highlights the trade-off between accuracy, latency, and power consumption: the proposed system maintains high accuracy while achieving significantly lower inference time and power consumption, confirming its suitability for ultra-low-power, real-time embedded applications.

5. Performance and Limitations

The voice command recognition system achieves a good compromise between accuracy, computational complexity, and power consumption, the key parameters for embedded real-time applications. Experimental results show more than 90% correct identification of four simple commands under good acoustic conditions. Performance declines under more difficult conditions, such as unfamiliar speaker accents or noisy environments.
Although basic features such as Zero-Crossing Rate and Short-Time Energy are computationally efficient, they limit how completely the system can represent speech signals at the level of phoneme or word classes. This efficiency leaves room for error, especially between phonetically similar commands such as “LEFT” and “RIGHT.”
Beyond noise, speaker variability, and the limits of the feature representation, the fixed-point quantized, compact MLP architecture was chosen to minimize memory footprint and CPU inference time, which inherently restricts model complexity and expressiveness. In practice, the tightest constraints on robustness and accuracy arose from background noise, the limited feature representation, and variation among speaker characteristics.

6. Discussion

This section critically reviews voice command recognition systems developed on embedded microcontroller platforms, in comparison with the studies summarized in Table 1. The variety in the literature reflects differences in hardware choices, extracted features, classifiers, and real-time capability, illustrating the trade-offs still being negotiated in the field.
A few studies [24,25] provide broad reviews of voice processing techniques, focusing on algorithmic advances such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), and recurrent neural networks; these and many similar works include no practical implementation on microcontroller platforms. In contrast, implementations on PIC microcontrollers [26,27,33] range from simple hardware-triggered systems [26] and a limited voice playback module [27] to more advanced modular recognition systems relying on LPC features and lookup tables [33]. These implementations demonstrate that PIC-based voice recognition is feasible, but they lack real-time, low-latency command recognition.
By comparison, platforms such as the Raspberry Pi [29], STM32 [31], and ARM Cortex-M [28] support more complex feature extraction (e.g., MFCC combined with zero-crossing rate) and classification (e.g., support vector machines and deep neural networks). These systems are suitable for real-time applications, but at the cost of higher hardware complexity, power consumption, and implementation effort. Some works adopt hybrid or partial real-time schemes to balance performance expectations against the limits of embedded processors.
This work advances the field by implementing lightweight voice recognition on a PIC microcontroller, combining simple yet effective time-domain features (signal energy and zero-crossing rate) with a compact neural network classifier executed entirely in fixed point. Unlike other PIC-based approaches, it achieves more than 90% accuracy with inference times under 50 ms, meeting the power and timing requirements for real-time processing in IoT and mobile devices.
This distinction highlights the central trade-off of embedded speech recognition design: more complex feature extraction and classifier models usually improve accuracy but often require resources that ultra-low-power microcontrollers cannot provide. Our approach, which emphasizes practical deployability and low computational complexity, shows that effective real-time voice command recognition is achievable without additional DSP modules or cloud-supported services.

7. Conclusions and Future Works

This paper presented a practical and efficient approach for real-time voice command recognition on a resource-limited PIC microcontroller. Using simple time-domain features (short-time energy and zero-crossing rate) and a small fixed-point neural network, the system classified four common voice commands with over 90% accuracy. The low inference latency (<50 ms) and very low power consumption make the solution suitable for embedded IoT and portable applications. The current system was designed to validate real-time command recognition on a lightweight device and does not use a wake word; a two-stage architecture, in which a lightweight wake-word detector activates the command classifier, could lower false activation rates and improve robustness in noisy settings. Future work will focus on vocabulary expansion and on improving noise robustness through better preprocessing, richer feature extraction, or a small deep learning model, without adding significant complexity. Integration with wireless communication modules could further enable remote control applications.

Author Contributions

M.S. conceived the idea and prepared the initial draft; S.H. validated the results and prepared the final draft; formal analysis, S.H.; writing, M.S. and S.H.; supervision, A.G. and K.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available upon request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, Y.; Gan, Y.; Song, Y.; Liu, J. What influences the perceived trust of a voice-enabled smart home system: An empirical study. Sensors 2021, 21, 2037. [Google Scholar] [CrossRef] [PubMed]
  2. Venkatraman, S.; Overmars, A.; Thong, M. Smart home automation—Use cases of a secure and integrated voice-control system. Systems 2021, 9, 77. [Google Scholar] [CrossRef]
  3. Velasco-Álvarez, F.; Fernández-Rodríguez, Á.; Ron-Angevin, R. Brain-computer interface (BCI)-generated speech to control domotic devices. Neurocomputing 2022, 509, 121–136. [Google Scholar] [CrossRef]
  4. Yu, C.; Zhang, H.; Shangguan, Z.; Hei, X.; Cangelosi, A.; Tapus, A. Speech-Driven Robot Face Action Generation with Deep Generative Model for Social Robots. In Lecture Notes in Computer Science, Proceedings of the 14th International Conference, ICSR 2022, Florence, Italy, 13–16 December 2022, Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13817 LNAI, pp. 61–74. [Google Scholar] [CrossRef]
  5. Renuka, M.; Kondekar, P.; Mulani, A.O. Raspberry pi based voice operated Robot. Int. J. Recent Eng. Res. Dev. 2017, 2, 69–76. [Google Scholar]
  6. Saryazdi, R.; DeSantis, D.; Johnson, E.K.; Chambers, C.G. The Use of Disfluency Cues in Spoken Language Processing: Insights from Aging. Psychol. Aging 2021, 36, 928–942. [Google Scholar] [CrossRef]
  7. Alluhaidan, A.S.; Saidani, O.; Jahangir, R.; Nauman, M.A.; Neffati, O.S. Speech Emotion Recognition through Hybrid Features and Convolutional Neural Network. Appl. Sci. 2023, 13, 4750. [Google Scholar] [CrossRef]
  8. Lee, S.H.; Park, J.; Yang, K.; Min, J.; Choi, J. Accuracy of Cloud-Based Speech Recognition Open Application Programming Interface for Medical Terms of Korean. J. Korean Med. Sci. 2022, 37, e144. [Google Scholar] [CrossRef]
  9. Talebi, S.M.S.; Sani, A.A.; Saroiu, S.; Wolman, A. MegaMind: A platform for security & privacy extensions for voice assistants. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys 2021), Madison, WI, USA, 24 June–2 July 2021; Association for Computing Machinery, Inc.: New York, NY, USA, 2021; pp. 109–121. [Google Scholar] [CrossRef]
  10. Trabelsi, R.; Nouri, K.; Ammari, I. Enhancing traffic sign recognition through Daubechies discrete wavelet transform and convolutional neural networks. In Proceedings of the 2023 IEEE International Conference on Advanced Systems and Emergent Technologies (IC ASET), Hammamet, Tunisia, 29 April–1 May 2023; pp. 1–6. [Google Scholar]
  11. Manor, E.; Greenberg, S. Using HW/SW Codesign for Deep Neural Network Hardware Accelerator Targeting Low-Resources Embedded Processors. IEEE Access 2022, 10, 22274–22287. [Google Scholar] [CrossRef]
  12. Snider, R.K.; Casebeer, C.N.; Weber, R.J. An open computational platform for low-latency real-time audio signal processing using field programmable gate arrays. J. Acoust. Soc. Am. 2018, 143 (Suppl. 3), 1737. [Google Scholar] [CrossRef]
  13. Shome, N.; Sarkar, A.; Ghosh, A.K.; Laskar, R.H.; Kashyap, R. Speaker Recognition through Deep Learning Techniques. Period. Polytech. Electr. Eng. Comput. Sci. 2023, 67, 300–336. [Google Scholar] [CrossRef]
  14. Meister, H.; Walger, M.; Lang-Roth, R.; Müller, V. Voice fundamental frequency differences and speech recognition with noise and speech maskers in cochlear implant recipients. J. Acoust. Soc. Am. 2020, 147, EL19–EL24. [Google Scholar] [CrossRef]
  15. Fariselli, M.; Rusci, M.; Cambonie, J.; Flamand, E. Integer-Only Approximated MFCC for Ultra-Low-Power Audio NN Processing on Multi-Core MCUs. In Proceedings of the 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), Virtual, 3–7 May 2021. [Google Scholar] [CrossRef]
  16. Constantinescu, C.; Brad, R. An Overview on Sound Features in Time and Frequency Domain. Int. J. Adv. Stat. ITC Econ. Life Sci. 2023, 13, 45–58. [Google Scholar] [CrossRef]
  17. Tsujikawa, M.; Kajikawa, Y. Low-Complexity and Accurate Noise Suppression Based on an a Priori SNR Model for Robust Speech Recognition on Embedded Systems and Its Evaluation in a Car Environment. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2023, E106.A, 1224–1233. [Google Scholar] [CrossRef]
  18. Jana, D.K.; Bhunia, P.; Adhikary, S.D.; Mishra, A. Analyzing salient features and classification of wine type based on quality through various neural network and support vector machine classifiers. Results Control Optim. 2023, 11, 100219. [Google Scholar] [CrossRef]
  19. Prasetyo, E.; Purbaningtyas, R.; Adityo, R.D.; Suciati, N.; Fatichah, C. Combining MobileNetV1 and Depthwise Separable convolution bottleneck with Expansion for classifying the freshness of fish eyes. Inf. Process. Agric. 2022, 9, 485–496. [Google Scholar] [CrossRef]
  20. Yi, L.; Wu, Y.; Tolba, A.; Li, T.; Ren, S.; Ding, J. SA-MLP-Mixer: A Compact All-MLP Deep Neural Net Architecture for UAV Navigation in Indoor Environments. IEEE Internet Things J. 2024, 11, 21359–21371. [Google Scholar] [CrossRef]
  21. Rahman, M.; Nicolici, N. Estimating Word Lengths for Fixed-Point DSP Implementations Using Polynomial Chaos Expansions. Electronics 2025, 14, 365. [Google Scholar] [CrossRef]
  22. Mishra, J.; Malche, T.; Hirawat, A. Embedded Intelligence for Smart Home Using TinyML Approach to Keyword Spotting. Eng. Proc. 2024, 82, 30. [Google Scholar] [CrossRef]
  23. Di Leo, K.; Biagetti, G.; Falaschetti, L.; Crippa, P. Microcontroller Implementation of LSTM Neural Networks for Dynamic Hand Gesture Recognition. Sensors 2025, 25, 3831. [Google Scholar] [CrossRef] [PubMed]
  24. Yang, C. Design of Smart Home Control System Based on Wireless Voice Sensor. J. Sens. 2021, 2021, 8254478. [Google Scholar] [CrossRef]
  25. Deshmukh, A.M. Comparison of Hidden Markov Model and Recurrent Neural Network in Automatic Speech Recognition. Eur. J. Eng. Res. Sci. 2020, 5, 958–965. [Google Scholar] [CrossRef]
  26. Mazumdar, D.; Raulkar, J.; Vaidya, P.; Gajare, A.; Shinde, D.M. A Survey Paper on Refrigeration Monitoring Systems using PIC Microcontroller, PT100 Temperature Sensor and FT811 Display Driver. Int. J. Eng. Technol. Manag. Sci. 2023, 7, 121–126. [Google Scholar] [CrossRef]
  27. Manhas, M.; Sanduja, D.; Aggarwal, N.; Vashisth, R. Design and Implementation of Artificial Intelligence (AI) Based Home Automation. In Proceedings of the IEEE International Conference on Signal Processing, Computing and Control (ISPCC), Online, 7–9 October 2021; pp. 122–126. [Google Scholar] [CrossRef]
  28. Nantasri, P.; Phaisangittisagul, E.; Karnjana, J.; Boonkla, S.; Keerativittayanun, S.; Rugchatjaroen, A.; Usanavasin, S.; Shinozaki, T. A Light-Weight Artificial Neural Network for Speech Emotion Recognition using Average Values of MFCCs and Their Derivatives. In Proceedings of the 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Phuket, Thailand, 24–27 June 2020; pp. 41–44. [Google Scholar] [CrossRef]
  29. Efat, M.I.A.; Hossain, M.S.; Aditya, S.; Setu, J.H.; Imtiaz-Ud-Din, K.M. Identifying Optimised Speaker Identification Model using Hybrid GRU-CNN Feature Extraction Technique. Int. J. Comput. Vis. Robot. 2022, 12, 662–685. [Google Scholar] [CrossRef]
  30. Triwiyanto, T.; Yulianto, E.; Luthfiyah, S.; Musvika, S.D.; Maghfiroh, A.M.; Mak’RUf, M.R.; Titisari, D.; Ichwan, S. Hand Exoskeleton Development Based on Voice Recognition Using Embedded Machine Learning on Raspberry Pi. J. Biomim. Biomater. Biomed. Eng. 2022, 55, 81–92. [Google Scholar] [CrossRef]
  31. Park, J.; Noh, H.; Nam, H.; Lee, W.-C.; Park, H.-J. A Low-Latency Streaming On-Device Automatic Speech Recognition System Using a CNN Acoustic Model on FPGA and a Language Model on Smartphone. Electronics 2022, 11, 1831. [Google Scholar] [CrossRef]
  32. Torad, M.A.; Bouallegue, B.; Ahmed, A.M. A Voice-Controlled Smart Home Automation System using Artificial Intelligence and Internet of Things. Telkomnika (Telecommun. Comput. Electron. Control) 2022, 20, 808–816. [Google Scholar] [CrossRef]
  33. Swamidason, I.T.J.; Tatiparthi, S.; Arul Xavier, V.M.; Devadass, C.S.C. Exploration of Diverse Intelligent Approaches in Speech Recognition Systems. Int. J. Speech Technol. 2023, 26, 1–10. [Google Scholar] [CrossRef]
Figure 1. Functional diagram of real-time voice command recognition system on a PIC microcontroller with commands “ON,” “OFF,” “LEFT,” and “RIGHT”.
Figure 2. Architecture of real-time voice command recognition system on PIC microcontroller.
Figure 3. Flowchart of the real-time microcontroller software for voice command recognition.
Figure 4. Circuit diagram of a PIC16F877A-based DC motor driver system for real-time voice command control (Proteus simulation).
Figure 5. Command-wise classification accuracy.
Figure 6. Confusion matrix of voice command classification.
Figure 7. Inference time and power consumption per command for the PIC-based voice recognition system.
Figure 8. Audio signal before and after real-time filtering.
Figure 9. Audio and voltage control signals for voice commands.
Figure 10. Audio and voltage signal visualization for real-time command execution.
Figure 11. TinyML voice recognition: accuracy, inference time, and power consumption.
Table 1. Comparison of embedded voice recognition systems in the literature.
| Ref. | Microcontroller | Voice Module | Feature Extraction | Classifier/Processing | Real-Time Capability | Objectives of Study |
|---|---|---|---|---|---|---|
| [24] | Not specified | No | General overview | General techniques | No | Survey voice recognition methods and architectures |
| [25] | Not specified | No | Survey on MFCC and LPC | None | No | Review and compare feature extraction techniques |
| [26] | PIC16F877A | No | No | Hardware interrupt only | No | Develop basic voice-triggered hardware system |
| [27] | PIC16F877A | ISD1820 | None (pre-recorded audio) | Hardware playback | Partial | Enable audio playback and limited voice recognition |
| [28] | ARM Cortex-M | External DSP | MFCC | Deep Neural Network | No | Develop accurate speech recognition with deep learning on embedded platforms |
| [29] | Raspberry Pi | USB Mic | MFCC | SVM | Yes | Implement voice recognition on affordable hardware |
| [30] | Arduino Uno | No | FFT | Decision Tree | Yes | Simplify voice command recognition for low-cost microcontrollers |
| [31] | STM32F103 | I2S Microphone | MFCC + ZCR | Rule-based hybrid | Partial | Combine multiple features for better recognition |
| [33] | PIC18F4550 | HM2007 | LPC | Lookup Table | Yes | Design modular voice command recognition |
| This Work | PIC18F4550 | Analog Mic + ADC | ZCR + STE (time-domain) | Lightweight fixed-point MLP | Yes (Real-Time) | Implement fully embedded real-time voice command recognition system with low resources and scalability |
Table 2. Functional architecture of real-time voice command recognition on PIC microcontroller.
| No. | Component | Role | Input | Output |
|---|---|---|---|---|
| 1 | Speech Dataset (Offline) | Provides labeled commands for model training | Recorded audio | Feature vectors |
| 2 | Feature Extraction (ZCR/STE) | Converts audio into numerical features | Audio signals | Feature vectors |
| 3 | MLP Classifier | Classifies features into commands | Feature vectors | Recognized command |
| 4 | PIC Microcontroller | Runs inference and issues control signals | Real-time audio and features | Control signal |
| 5 | Actuation System | Executes the recognized command | Control signal | Physical action (light, motor) |
| 6 | Microphone | Captures spoken commands | Human voice | Real-time audio |
Table 3. Signal acquisition and preprocessing modules for real-time voice command recognition on PIC microcontroller.
| Module | Description |
|---|---|
| Microphone | An inexpensive analog MEMS microphone captures the speech; the analog signal is sampled at 8 kHz, which is sufficient for short command recognition. |
| ADC | The integrated ADC of the PIC microcontroller converts the analog input to digital samples. |
| Framing and Windowing | The audio signal is split into 20 ms frames with 50% overlap, and a Hamming window is applied to each frame to reduce edge effects and artifacts. |
| Signal Denoising | Lightweight denoising, such as normalization or low-pass filtering, removes background noise at low computational cost, an important constraint on embedded devices. |
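The framing parameters above (20 ms frames, 50% overlap, Hamming window, 8 kHz sampling) translate to 160-sample frames with an 80-sample hop. A minimal C sketch with our own illustrative names; on the MCU the window would be a precomputed Q15 table in ROM rather than computed at run time:

```c
#include <stdint.h>
#include <stddef.h>

#define FRAME_LEN 160  /* 20 ms at 8 kHz */
#define FRAME_HOP 80   /* 50% overlap */

/* Number of full frames obtainable from nsamples with the given hop. */
size_t frame_count(size_t nsamples) {
    if (nsamples < FRAME_LEN) return 0;
    return 1 + (nsamples - FRAME_LEN) / FRAME_HOP;
}

/* Copy frame k from the sample buffer and apply a precomputed Q15
   Hamming window (w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)), scaled by 2^15). */
void extract_frame(const int16_t *x, size_t k,
                   const int16_t *win_q15, int16_t *out) {
    const int16_t *src = x + k * FRAME_HOP;
    for (size_t i = 0; i < FRAME_LEN; i++)
        out[i] = (int16_t)(((int32_t)src[i] * win_q15[i]) >> 15);
}
```

One second of audio at 8 kHz thus yields 99 overlapping frames, each feeding the ZCR/STE feature computation.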
Table 4. Neural network classifier: architecture and deployment.
| Element | Details |
|---|---|
| Type | Feedforward MLP |
| Input | 200 features |
| Hidden Layer | 10 neurons, ReLU activation |
| Output Layer | 4 neurons, softmax (“ON”, “OFF”, “LEFT”, “RIGHT”) |
| Training | Offline (Python 3.9.9, TensorFlow 2.20.0/PyTorch 2.7.0), local or Google Speech dataset |
| Quantization | Fixed-point (Q15), exported as C arrays for PIC deployment |
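A 200-10-4 MLP of this kind can be executed in Q15 with one generic dense-layer routine; at inference time an argmax over the four outputs suffices, since softmax is monotonic and does not change the winning class. The sketch below is illustrative (names and the exact shift/saturation policy are our assumptions, not the paper's firmware):

```c
#include <stdint.h>
#include <stddef.h>

typedef int16_t q15_t;

/* One dense layer in Q15: y = act(W x + b).
   Products accumulate in 32-bit Q30, then shift back to Q15. */
void dense_q15(const q15_t *w, const q15_t *b, const q15_t *x,
               q15_t *y, size_t n_in, size_t n_out, int use_relu) {
    for (size_t o = 0; o < n_out; o++) {
        int32_t acc = (int32_t)b[o] << 15;           /* bias in Q30 */
        for (size_t i = 0; i < n_in; i++)
            acc += (int32_t)w[o * n_in + i] * x[i];  /* Q30 */
        int32_t v = acc >> 15;                       /* back to Q15 */
        if (use_relu && v < 0) v = 0;                /* ReLU */
        if (v > 32767) v = 32767;                    /* saturate */
        if (v < -32768) v = -32768;
        y[o] = (q15_t)v;
    }
}

/* Winning command index over the output neurons (replaces softmax). */
size_t argmax_q15(const q15_t *y, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (y[i] > y[best]) best = i;
    return best;
}
```

The full forward pass is then two `dense_q15` calls (200→10 with ReLU, 10→4 without) followed by `argmax_q15`, roughly 2000 multiply-accumulates per inference.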
Table 5. Mapping of Voice Commands to System Actions.
| Voice Command | Serial Input (ASCII) | Microcontroller Action | Motor Driver Response (L293D) | System Output |
|---|---|---|---|---|
| ON | “ON” | Sets PORTD to logic HIGH | Activates Motor A and B | Motors start running |
| OFF | “OFF” | Sets PORTD to logic LOW | Deactivates Motor A and B | Motors stop |
| LEFT | “LEFT” | Motor A: FORWARD; Motor B: REVERSE | Motor A turns forward; Motor B turns backward | Robot turns LEFT |
| RIGHT | “RIGHT” | Motor A: REVERSE; Motor B: FORWARD | Motor A turns backward; Motor B turns forward | Robot turns RIGHT |
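The command-to-driver mapping reduces to writing a small bit pattern to PORTD. The bit layout below is hypothetical (the paper does not specify which PORTD pins drive which L293D inputs); it only illustrates the idea:

```c
#include <stdint.h>

/* Recognized command indices (order of the classifier's output neurons). */
enum { CMD_ON = 0, CMD_OFF = 1, CMD_LEFT = 2, CMD_RIGHT = 3 };

/* Hypothetical PORTD bit layout for the L293D driver:
   bit0 = Motor A forward, bit1 = Motor A reverse,
   bit2 = Motor B forward, bit3 = Motor B reverse. */
uint8_t portd_for_command(int cmd) {
    switch (cmd) {
    case CMD_ON:    return 0x05; /* A forward | B forward: run */
    case CMD_LEFT:  return 0x09; /* A forward | B reverse: turn left */
    case CMD_RIGHT: return 0x06; /* A reverse | B forward: turn right */
    case CMD_OFF:
    default:        return 0x00; /* all low: motors stop */
    }
}
```

On the target, the classifier's argmax result would be passed straight to this lookup and latched onto the port register.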
Table 6. Confusion matrix showing the number of samples classified as each command.
| Actual \ Predicted | ON | OFF | LEFT | RIGHT |
|---|---|---|---|---|
| ON | 95 | 1 | 2 | 2 |
| OFF | 2 | 94 | 1 | 3 |
| LEFT | 1 | 2 | 91 | 6 |
| RIGHT | 0 | 3 | 5 | 92 |
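As a consistency check, the diagonal of Table 6 gives 95 + 94 + 91 + 92 = 372 correct samples out of 400, i.e., 93% overall accuracy, in line with the reported figure of over 90%. In code:

```c
#include <stddef.h>

/* Confusion matrix from Table 6: rows = actual, cols = predicted. */
static const int CM[4][4] = {
    {95,  1,  2,  2},  /* ON    */
    { 2, 94,  1,  3},  /* OFF   */
    { 1,  2, 91,  6},  /* LEFT  */
    { 0,  3,  5, 92},  /* RIGHT */
};

/* Overall accuracy in percent (integer division; exact here: 372/400). */
int overall_accuracy_pct(const int m[4][4]) {
    int correct = 0, total = 0;
    for (size_t i = 0; i < 4; i++)
        for (size_t j = 0; j < 4; j++) {
            total += m[i][j];
            if (i == j) correct += m[i][j];
        }
    return 100 * correct / total;
}
```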
Table 7. Inference Time and Power Consumption Metrics.
| Metric | Mean Value | Range | Unit |
|---|---|---|---|
| Inference Time per Command | 45 | 40–50 | milliseconds |
| Power Consumption during Recognition | 8 | 7–10 | milliamperes |
Table 8. Audio Signal Filtering Parameters.
| Parameter | Value |
|---|---|
| Sampling Rate | 8 kHz |
| Filter Type | Bandpass FIR |
| Passband Frequency Range | 300 Hz–3400 Hz |
| Target Commands | “ON,” “OFF,” “LEFT,” “RIGHT” |
| Noise Reduction | Background noise suppressed |
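At run time, a bandpass FIR with these parameters reduces to a fixed convolution with precomputed Q15 taps; the coefficient design itself (e.g., a windowed-sinc bandpass for 300–3400 Hz at 8 kHz) is done offline. The routine below sketches only the application step, with illustrative names:

```c
#include <stdint.h>
#include <stddef.h>

/* Apply an N-tap FIR filter (direct form): y[n] = sum_k h[k] * x[n-k].
   Coefficients h[] are Q15; samples before x[0] are taken as zero. */
void fir_apply_q15(const int16_t *h, size_t ntaps,
                   const int16_t *x, int16_t *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int32_t acc = 0;
        for (size_t k = 0; k < ntaps && k <= i; k++)
            acc += (int32_t)h[k] * x[i - k];  /* Q30 accumulate */
        acc >>= 15;                           /* back to Q15 scale */
        if (acc > 32767) acc = 32767;         /* saturate */
        if (acc < -32768) acc = -32768;
        y[i] = (int16_t)acc;
    }
}
```

In a streaming implementation the same arithmetic would run over a circular sample history, one output per new ADC sample.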
Table 9. PWM and Voltage Response per Voice Command.
| Voice Command | PWM Duty Cycle (%) | Description of Output Signal |
|---|---|---|
| ON | 80 | High voltage output indicating system activation |
| OFF | 0 | No voltage output indicating system deactivation |
| LEFT | 40 | Moderate voltage output corresponding to left command |
| RIGHT | 60 | Intermediate voltage output corresponding to right command |
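A duty-cycle percentage from Table 9 maps to a PWM compare value by simple scaling. A sketch, assuming a hypothetical 1000-tick timer period (the paper does not state the actual timer configuration):

```c
#include <stdint.h>

#define PWM_PERIOD_TICKS 1000  /* assumed timer period, not from the paper */

/* Duty cycles from Table 9, indexed as ON, OFF, LEFT, RIGHT. */
static const uint8_t DUTY_PCT[4] = {80, 0, 40, 60};

/* Convert a duty-cycle percentage to a compare-register tick count. */
uint16_t pwm_ticks(uint8_t duty_pct) {
    return (uint16_t)((uint32_t)duty_pct * PWM_PERIOD_TICKS / 100);
}
```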
Table 10. Precision, recall, and F1-score for voice command classification.
| Command | Precision (%) | Recall (%) | F1-Score (%) | Description |
|---|---|---|---|---|
| ON | 97 | 95 | 96 | High accuracy for activation command; few false positives. |
| OFF | 95 | 94 | 94.5 | Reliable detection of deactivation command. |
| LEFT | 90 | 91 | 90.5 | Slightly lower due to phonetic similarity with “RIGHT”. |
| RIGHT | 88 | 92 | 90 | Lower precision; occasional misclassification as “LEFT”. |
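The F1 values in Table 10 follow from the standard harmonic mean F1 = 2PR/(P + R); for example, 2·97·95/192 ≈ 96.0 for “ON”. A one-line check:

```c
/* Harmonic mean of precision and recall, both given in percent. */
float f1_score(float precision, float recall) {
    return 2.0f * precision * recall / (precision + recall);
}
```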
Table 11. Power consumption comparison of TinyML voice recognition systems.
| Work | Microcontroller | Features | Average Power Consumption (mA) | Remarks |
|---|---|---|---|---|
| This Work | PIC18F4550 | ZCR + STE | 8 mA | Lowest power due to efficient time-domain processing |
| [27] | PIC16F877A | MFCC | 10–12 mA | Higher ADC overhead; limited optimization |
| [28] | ARM Cortex-M | MFCC | 12–15 mA | Moderate power; higher due to MFCC computation |
| [29] | Raspberry Pi | MFCC + SVM | >200 mA | Not suitable for low-power embedded applications |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Shili, M.; Hammedi, S.; Gawanmeh, A.; Nouri, K. Embedded Implementation of Real-Time Voice Command Recognition on PIC Microcontroller. Automation 2025, 6, 79. https://doi.org/10.3390/automation6040079

