Steering a Robotic Wheelchair Based on Voice Recognition System Using Convolutional Neural Networks

Many wheelchair users depend on others to control the movement of their wheelchairs, which significantly influences their independence and quality of life. Smart wheelchairs offer users a degree of self-dependence and the freedom to drive their own vehicles. In this work, we designed and implemented a low-cost software and hardware method to steer a robotic wheelchair. As part of this method, we developed our own Android mobile app based on Flutter software. A convolutional neural network (CNN) with a network-in-network (NIN) structure, integrated with a voice recognition model, was also developed and configured to build the mobile app. The software and hardware components communicate over an offline Wi-Fi network hotspot. Five voice commands (yes, no, left, right, and stop) guided and controlled the wheelchair through the Raspberry Pi and DC motor drives. The overall system was evaluated on a trained and validated English speech corpus of isolated words spoken by native Arabic speakers to assess the performance of the Android OS application. The maneuverability performance of indoor and outdoor navigation was also evaluated in terms of accuracy. The results indicated approximately 87.2% accuracy in the correct prediction of the voice commands. Additionally, in the real-time performance test, the root-mean-square deviation (RMSD) values between the planned and actual nodes for indoor/outdoor maneuvering were $1.721 \times 10^{-5}$ and $1.743 \times 10^{-5}$, respectively.


Introduction
Many patients still depend on others to help them move their wheelchairs, and patients with limited mobility still face significant challenges when using wheelchairs in public and in other places [1]. Statistics also indicate that 9-10% of patients who were trained to operate power wheelchairs could not use them for daily activities, and 40% of limited mobility patients reported that it was almost impossible to steer and maneuver a wheelchair [2]. Moreover, it was reported that approximately half of the 40% of patients with impaired mobility could not control a powered wheelchair [3]. Furthermore, the same study determined that over 10% of patients that use traditional power wheelchairs not equipped with any sensors have accidents after 4 months [3]. However, using an electric wheelchair equipped with an automatic navigation and sensor system, such as a smart wheelchair, would be beneficial in addressing a significant challenge for several patients. The smart wheelchair is an electric wheelchair equipped with a computer and sensors designed to facilitate the efficient and effortless movement of patients [4][5][6][7]. These wheelchairs are considered safer and more comfortable than conventional wheelchairs because they introduce new control options, which include navigation systems (GPS) and other technologies, such as saving places on the user's map [8,9].
Various sensors can be used in smart wheelchairs, such as ultrasound, laser, infrared, and input cameras. These wheelchairs adopt computers that process input data from the sensors and produce a command that is sent to the motor to spin the wheels of the chair [10]. One of the most important developments in this field is the introduction of the joystick control system. This system drives the wheelchair via an intelligent control unit [11]. However, patients with impaired upper extremities cannot operate the joystick flexibly and smoothly. This can lead to fatal accidents when situations require rapid action in motion. Therefore, the conventional joystick system needs to be replaced with advanced technologies [12]. The human-computer interface (HCI) is a method for controlling a wheelchair using a signal or a combination of different signals, such as electroencephalogram (EEG), electrooculogram (EOG), and electromyogram (EMG) [13][14][15]. Brain-computer interfaces (BCIs) are among the most researched HCIs; they translate brain signals into actions to control a device [16]. EEG-based BCIs have some limitations, including low spatial resolution and a low signal-to-noise ratio (SNR). A hybrid BCI (hBCI) that combines EEG with EOG exhibited improved accuracy and speed. Despite these HCI results, some limitations to the application of BCI systems still exist. EEG devices are relatively expensive, and bio-potential signals are affected by artifacts. Furthermore, although hBCI can address some of these challenges, it is not efficient and flexible in its simultaneous control of speed and direction [17][18][19].
Speech is the most important mode of human communication. By employing a microphone sensor, speech can be used to interact with a computer and serve as a potential method for human-computer interaction (HCI). These sensors are being used in quantifiable voice recognition research, which has applications in a variety of areas, such as controlling wheelchairs and health-related applications. Consequently, the development of smart or intelligent wheelchairs based on voice recognition techniques has increased significantly [20]. For instance, Aktar et al. [21] developed an intelligent wheelchair system using a voice recognition technique with a GPS tracking model. The voice commands were converted into hexadecimal numeral data to control the wheelchair in three different speed stages via a Wi-Fi module. The system also used an infrared radiation (IR) sensor to detect obstacles and a mobile app to detect the location of the patient. Similarly, Raiyan et al. [22] developed an automated wheelchair system based on the Arduino and the Easy VR3 speech recognition module. The authors claim that the implemented system is less expensive and does not require any wearable sensor or complex signal processing. In an advanced study, an adaptive neuro-fuzzy controller was designed to drive a powered wheelchair. The system implementation was based on real-time control signals generated by the voice command classification unit. The proposed system used a wireless sensor network to track the wheelchair [23]. Despite the highly sophisticated approaches presented by researchers in this area, high cost and limited accuracy in distinguishing, classifying, and identifying the patient's voice remain the most critical challenges.
To overcome the lack of accuracy in distinguishing and classifying patients' speech, many researchers have used the convolutional neural network (CNN) technique [24,25]. This technique relies on converting voice commands into spectrogram images before they are fed into a CNN. This method has proven helpful in improving the accuracy of speech recognition. In this context, Huang et al. [26] proposed a method to analyze CNNs for speech recognition. In this method, the localized filters learned in the convolutional layer were visualized to examine what the network learns automatically. The authors claim that CNNs have advantages over fully connected networks in four domains: distant speech recognition, noise robustness, low-footprint models, and channel-mismatched training-test conditions. In addition, Korvel et al. [27] analyzed 2D feature spaces for voice recognition based on CNN. The analysis evaluated the feature maps on a Lithuanian word recognition task. The results showed that the highest rate of word recognition was achieved using spectral analysis. Moreover, the Mel scale and spectral linear cepstra and chroma are outperformed by cepstral feature spaces.
The driving of smart wheelchairs using voice recognition technologies with CNN has attracted many researchers [28]. For instance, Sutikno et al. [24] proposed a voice control method for wheelchairs using long short-term memory (LSTM) and CNN. This method used SoX (Sound eXchange) and Sound Recorder Pro to achieve the objective. The accuracy level of this method was above 97.80%. In another study, Ali et al. [29] designed an algorithm for smart wheelchairs using CNN to help people with disabilities detect buses and bus doors. The method was implemented based on accurate localization information and ran on a CPU for fast detection. However, the use of CNN in smartphones is still under development because of the complex calculations required to achieve high-accuracy predictions [30].
This paper develops a new, powerful, low-cost system based on voice recognition and CNN approaches to drive a wheelchair for disabled users. The method proposes the use of a network-in-network (NIN) structure for mobile applications [31]. The system uses a smartphone to create an interactive user interface through which the user sends voice commands via the mobile application to the system's motherboard. A mobile application, voice recognition model, and CNN model were developed and implemented to achieve the main goal of this study. In addition, all safety issues were considered during driving and maneuvering at indoor and outdoor locations. Results showed that the implemented system was robust in time response and executed all orders accurately without noticeable time delay.
The paper is organized as follows: Section 2 illustrates the materials and methods used in this study. Section 3 addresses the experimental procedure. Section 4 shows the results of the study. Section 5 discusses the results. Section 6 concludes this study. Finally, Section 7 shows future work.

Figure 1 illustrates the system architecture of the proposed system. This system is divided into two stages. The first stage is the set of hardware devices used to control the movement of the wheelchair reliably. These devices include a standard wheelchair, an Android smartphone (Huawei Y9; CPU: octa-core, 4 × 2.2 GHz), DC electric motors, batteries, a relay module, a Raspberry Pi 4, and an emergency push button in case of an abnormal system response. The second stage focuses on the software development of the mobile application, voice recognition model, and CNN model. The software was designed and implemented to control the wheelchair using the five voice commands mentioned in Table 1. The main components for controlling the chair were connected via offline Wi-Fi.

Materials and Methods
In this work, the mobile app was built based on Flutter software [32,33]. The design process includes creating a user flow diagram for each screen, creating and drawing wireframes, selecting design patterns and color palettes, creating mock-ups, creating an animated app prototype, and designing final mock-ups to prepare the final screens for coding. Once installed, the app appears in the application list, and when it is opened, it displays the list of words on which our model was trained. After the user permits the application to use the microphone, it listens for these words and highlights each recognized word in the interface, as shown in Figure 1.

Voice Recognition Model Development
Each audio signal is subjected to feature extraction to create a map that shows how the signal changes in frequency over time. Mel frequency cepstral coefficients (MFCC), which are widely used in speech analysis systems, were used to extract this information [34]. The initial step in feature extraction is to pre-emphasize the signal by passing it through a one-coefficient digital filter (a finite impulse response (FIR) filter) to prevent numerical instability. In its standard first-order form, this filter is:

$$y(n) = x(n) - \beta\, x(n-1) \qquad (1)$$

where x(n) is the original voice signal, y(n) is the output of the filter, n is the sample index, and β is a constant such that 0 < β ≤ 1.
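The one-coefficient FIR pre-emphasis step can be illustrated with a short sketch. It assumes the common first-order form y(n) = x(n) − β·x(n−1); the function name `pre_emphasis` and the default β = 0.97 are illustrative choices, not values taken from this work.

```python
import numpy as np

def pre_emphasis(x, beta=0.97):
    """Apply y(n) = x(n) - beta * x(n-1); the first sample is passed through unchanged.

    beta = 0.97 is a commonly used value and is an assumption here.
    """
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        return x
    return np.concatenate(([x[0]], x[1:] - beta * x[:-1]))

# A constant (zero-frequency) signal is attenuated after the first sample,
# which is the high-pass behaviour expected from pre-emphasis.
constant = np.ones(4)
emphasized = pre_emphasis(constant, beta=0.5)
print(emphasized)
```

With β close to 1 the filter suppresses slowly varying components more strongly, so the choice of β trades low-frequency attenuation against preservation of the original spectrum.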
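To keep the samples in frames and reduce signal discontinuities at the frame edges, framing and windowing with a window w(n) are employed. The equations in this section follow the standard MFCC formulation that matches the symbols defined here. For a Hamming-type window:

$$w(n) = (1-\alpha) - \alpha \cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (2)$$

where α is a constant and N is the number of samples in a frame. For spectral analysis, the fast Fourier transform (FFT) is applied to calculate the magnitude spectrum of each frame as:

$$X[k] = \sum_{n=0}^{N-1} y(n)\, w(n)\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1 \qquad (3)$$

The spectrum is then processed by a bank of filters according to MFCC, where the Mel filter bank can be written as:

$$H_m[k] = \begin{cases} 0, & k < f[m-1] \\ \dfrac{k - f[m-1]}{f[m] - f[m-1]}, & f[m-1] \le k \le f[m] \\ \dfrac{f[m+1] - k}{f[m+1] - f[m]}, & f[m] < k \le f[m+1] \\ 0, & k > f[m+1] \end{cases} \qquad (4)$$

for m = 1, ..., M. If $f_l$ and $f_h$ are the lowest and highest frequencies of the filter bank in hertz, then the boundary points f[m] can be written as:

$$f[m] = \frac{N}{F_s}\, B^{-1}\!\left( B(f_l) + m\, \frac{B(f_h) - B(f_l)}{M+1} \right) \qquad (5)$$

where N is the size of the FFT, $F_s$ is the sampling frequency, M is the number of filters, and B is the Mel scale, which is given by:

$$B(f) = 1125 \ln\!\left(1 + \frac{f}{700}\right) \qquad (6)$$

To eliminate noise and spectral estimation errors, we applied an approximate homomorphic transform as:

$$S[m] = \ln\!\left( \sum_{k=0}^{N-1} |X[k]|^2\, H_m[k] \right), \quad 1 \le m \le M \qquad (7)$$

The logarithmic energy operation $\log(\sum |\cdot|^2)$ and the discrete cosine transform (DCT) are used in the final step of MFCC processing. The DCT strongly decorrelates the log filter-bank energies, and the resulting cepstral coefficients can be given as:

$$c[n] = \sum_{m=1}^{M} S[m] \cos\!\left( \frac{\pi n \left(m - \tfrac{1}{2}\right)}{M} \right), \quad 0 \le n < M \qquad (8)$$

To obtain the feature map, we take the first and second time derivatives of (8) across frames, where t is the frame index:

$$\Delta c_t[n] = \frac{c_{t+1}[n] - c_{t-1}[n]}{2}, \qquad \Delta^2 c_t[n] = \frac{\Delta c_{t+1}[n] - \Delta c_{t-1}[n]}{2} \qquad (9)$$

This is applied to all recordings that have been made; the database was thus created and used by the CNN.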
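The framing, windowing, FFT, Mel filter bank, log-energy, DCT, and derivative steps can be put together in a compact NumPy sketch. This is an illustration under stated assumptions rather than the implementation used in this work: the frame length (400 samples), hop (160 samples), FFT size (512), number of filters (26), number of coefficients (13), window constant α = 0.46, and the Mel constants 1125 and 700 are conventional textbook values, and all function names are hypothetical. Only the 20 kHz sample rate is taken from the corpus description.

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale B(f); the constants 1125 and 700 are a common convention (assumption)
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(b):
    # Inverse Mel scale B^{-1}(b)
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, f_low=0.0, f_high=None):
    """Triangular filters whose boundary points are equally spaced on the Mel scale."""
    f_high = fs / 2.0 if f_high is None else f_high
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((n_fft / fs) * mel_to_hz(mel_pts)).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):           # rising edge
            bank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right + 1):      # falling edge
            bank[m - 1, k] = (right - k) / max(right - centre, 1)
    return bank

def mfcc(signal, fs, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13, alpha=0.46):
    """Return an (n_frames, n_ceps) array of cepstral coefficients."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    n = np.arange(frame_len)
    window = (1 - alpha) - alpha * np.cos(2 * np.pi * n / (frame_len - 1))
    frames = signal[idx] * window
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2
    # Small offset avoids log(0) for silent frames
    log_energy = np.log(power @ mel_filterbank(n_filters, n_fft, fs).T + 1e-10)
    m = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (m + 0.5) / n_filters)
    return log_energy @ basis.T

def delta(feat):
    """Central difference across frames, with edge frames repeated."""
    padded = np.pad(feat, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

fs = 20000                                   # sample rate of the corpus
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440.0 * t)         # one second of a synthetic test tone
coeffs = mfcc(tone, fs)
feature_map = np.hstack([coeffs, delta(coeffs), delta(delta(coeffs))])
print(feature_map.shape)                     # (123, 39)
```

Stacking the static, first-derivative, and second-derivative coefficients gives one two-dimensional map per utterance, which is the kind of input a CNN can consume.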

CNN Implementation Model
Here, we adopted the network-in-network (NIN) structure as the foundational architecture for mobile application development [35,36]. NIN is a CNN technique that does not include fully connected (FC) layers and, in addition, can accept images of any size as inputs to the network by employing global pooling rather than fixed-size pools. This is useful for mobile applications because users may adjust the balance between speed and accuracy without affecting the network weights.
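The role of global pooling can be seen in a small sketch: averaging each feature map over its full spatial extent yields one value per channel regardless of the input size, which is why no fixed-size fully connected layer is needed. The function name and the array shapes below are illustrative assumptions.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse an array of shape (channels, height, width) to one value per channel."""
    return feature_maps.mean(axis=(1, 2))

# Two inputs of different spatial size produce outputs of the same length,
# here five values, one per hypothetical command class.
small = np.random.default_rng(0).random((5, 8, 8))
large = np.random.default_rng(1).random((5, 32, 20))
print(global_average_pool(small).shape, global_average_pool(large).shape)
```

Because the output length depends only on the number of channels, a smaller input can be used for faster inference and a larger one for higher accuracy with the same weights.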
To run the CNN efficiently on the smartphone, we adopt a multi-threading technique. The smartphone has four CPU cores, which makes it easy to divide a kernel matrix into four sub-matrices along the rows. Four generic matrix multiplication (GEMM) operations are then carried out in parallel to obtain the output feature maps of the target convolution layer. Our method adopted cascaded cross channel parametric pooling (CCCPP) to compensate for the elimination of the FC layers. Our CNN model therefore consists of input and output layers, twelve convolution layers, and two consecutive layers, as shown in Figure 2.
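The row-wise split of the kernel matrix can be sketched in Python as follows. The sketch assumes the convolution has already been lowered to a matrix product (kernel matrix times a matrix of unrolled input patches); the function name, matrix sizes, and the use of a thread pool are illustrative and do not reproduce the mobile implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_gemm(kernel, patches, n_threads=4):
    """Multiply kernel (out_channels x k) by patches (k x n_pixels) in row blocks.

    Each block of kernel rows is multiplied independently, so the partial
    results can be stacked to recover the full output feature maps.
    """
    blocks = np.array_split(kernel, n_threads, axis=0)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(lambda block: block @ patches, blocks))
    return np.vstack(parts)

rng = np.random.default_rng(0)
kernel = rng.standard_normal((16, 27))     # hypothetical: 16 filters of size 3x3x3
patches = rng.standard_normal((27, 100))   # hypothetical: 100 output positions
output = parallel_gemm(kernel, patches)
print(output.shape)                        # (16, 100)
```

Splitting along the rows keeps each thread's work independent, because every output channel depends on exactly one row of the kernel matrix.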

DC Motor Control Drive
The drive wheels are powered by motors at the rear and front ends of the chair. The rear motors drive the rear wheels, which move the chair forward, and the front wheels (freewheels) follow the different chair movements. The two motors were connected to the driver via four power lines. The motor speed was predefined at approximately 1 km/h. To move forward or backward, both wheels turn clockwise or anti-clockwise, respectively. To turn right or left, however, one motor is placed in free gear while the other moves forward. For example, to turn left, the left wheel uses the free gear and the right one moves forward, thereby causing the wheelchair to turn toward the side opposite the driven wheel. The movement table of the wheelchair is presented in Table 2.
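The movement rules described in this paragraph can be written as a small lookup table. The sketch below is only an illustration: Table 2 is not reproduced here, the names are hypothetical, and the "stop" entry (both wheels released) is an assumption.

```python
from enum import Enum

class Wheel(Enum):
    FORWARD = "clockwise"
    REVERSE = "anti-clockwise"
    FREE = "free gear"

# (left wheel, right wheel) state for each motion
MOVES = {
    "forward": (Wheel.FORWARD, Wheel.FORWARD),
    "backward": (Wheel.REVERSE, Wheel.REVERSE),
    "left": (Wheel.FREE, Wheel.FORWARD),    # left wheel free, right wheel drives
    "right": (Wheel.FORWARD, Wheel.FREE),   # right wheel free, left wheel drives
    "stop": (Wheel.FREE, Wheel.FREE),       # assumption: both wheels released
}

def wheel_states(motion):
    """Return the (left, right) wheel states for a motion name."""
    return MOVES[motion.lower()]

left_state, right_state = wheel_states("left")
print(left_state.value, "/", right_state.value)
```

Keeping the mapping in one table makes it easy to check each motion against the movement table before wiring it to the motor driver.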
All wheelchair movements are controlled by a relay module. The relay module provides four relays that are rated for 15-20 mA at 5 Vdc. Each relay has a normally closed (NC) and a normally open (NO) contact and is controlled by a corresponding pin of the microcontroller. The relays are optically isolated, and each motor is controlled by two relays: one relay is switched on while the other remains in its rest (off) position, connected to ground, so that the motor turns clockwise or in the opposite direction depending on which relay is on. The command is sent by the microcontroller program, and the relay coil operates at 5 V. Figure 2 presents the complete electronic circuit diagram for the wheelchair movement. In this diagram, the polarity across the load can be reversed by the four-relay module. The DC motor terminals are connected to the common poles of the two relays. The normally open terminals are connected to the positive terminal, whereas the normally closed terminals of both relays are connected to a current driver circuit (ULN2033) to protect the pins of the controller from any abrupt sinking current. The current driver circuit can support approximately 500 mA, which is sufficient for the relay module. Furthermore, a diode is connected across each relay to protect against voltage spikes when the supply is disconnected.

Figure 3 illustrates the mechanical assembly of the wheelchair. This wheelchair was purchased from the market, and no mechanical modifications were made to the basic design of the original chair. In our proposed design, an electro-mechanical motor was attached directly to the frame of the wheelchair. A wheelchair's maneuverability depends on the position of the drive wheels, which significantly affects the space required for the chair to turn and the way the chair moves in narrow spaces. Owing to their tight turning radius (20-26 in) and their ability to turn through 360 degrees within a small circumference, mid-wheel drives are the most maneuverable, making them excellent indoor wheelchairs. Table 3 summarizes the hardware specifications of all parts used in this work. Figure 4 presents a flowchart of the complete wheelchair system.
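The two-relay polarity switching used for each motor can be modeled with a few lines of logic. This sketch assumes that an energized relay connects its motor terminal to the positive rail through the normally open contact and that a released relay leaves the terminal on the ground side through the normally closed contact; which sign corresponds to clockwise rotation is also an assumption, and the function name is hypothetical.

```python
def motor_direction(relay_a_on, relay_b_on):
    """Return +1 or -1 for the two rotation directions and 0 when the motor is not driven.

    Each motor terminal sits at the positive rail (1) when its relay is
    energized and at the ground side (0) when the relay is released.
    """
    terminal_a = 1 if relay_a_on else 0
    terminal_b = 1 if relay_b_on else 0
    return terminal_a - terminal_b

for a in (False, True):
    for b in (False, True):
        print(f"relay A on={a}, relay B on={b} -> direction {motor_direction(a, b)}")
```

In this model the motor is driven only when exactly one relay is energized; with both relays in the same state the two terminals sit at the same potential and no current flows through the motor.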

Experimental Procedure
We evaluated our system on an English speech corpus of isolated words, which was recorded at the Health and Basic Sciences Research Center, Majmaah University. The corpus contains a total of 2000 utterances of five words produced by 10 native Arabic speakers and was recorded at a sample rate of 20 kHz with 16-bit resolution. The data set was then augmented by creating additional speech signals. The additional data set contains 2000 utterances generated by changing the pitch, speed, and dynamic range, adding noise, and shifting the signal forward and backward in time. The new data set (original and augmented), which contains 4000 utterances, is divided into two parts: a training set (training and validation) with 80% of the samples (3200) and a test set with the remaining 20% of the samples (800).
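Two of the augmentation steps (additive noise and time shifting) and the 80/20 split can be sketched with NumPy. The function names, the 20 dB signal-to-noise ratio, the shift length, and the random seed are illustrative assumptions; pitch, speed, and dynamic range changes are not shown.

```python
import numpy as np

def add_noise(x, rng, snr_db=20.0):
    """Add white Gaussian noise at the given signal-to-noise ratio (assumed value)."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

def time_shift(x, shift):
    """Shift forward (shift > 0) or backward (shift < 0), padding with zeros."""
    y = np.zeros_like(x)
    if shift > 0:
        y[shift:] = x[:-shift]
    elif shift < 0:
        y[:shift] = x[-shift:]
    else:
        y[:] = x
    return y

def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into training and test parts."""
    order = np.random.default_rng(seed).permutation(n_samples)
    cut = int(round(train_fraction * n_samples))
    return order[:cut], order[cut:]

rng = np.random.default_rng(0)
utterance = np.sin(2 * np.pi * 200.0 * np.arange(2000) / 20000)  # synthetic stand-in
noisy = add_noise(utterance, rng)
shifted = time_shift(utterance, 100)

train_idx, test_idx = split_indices(4000)
print(len(train_idx), len(test_idx))  # 3200 800
```

Shuffling before the split keeps original and augmented utterances mixed across both sets; whether augmented copies of a test utterance should be excluded from training is a design choice that this sketch does not address.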
To evaluate the accuracy and the quality of prediction of the proposed system, we calculate the F-score as:

$$F = \frac{2PR}{P + R}$$

where P and R represent precision and recall, respectively, and are given by:

$$P = \frac{T_P}{T_P + F_P}, \qquad R = \frac{T_P}{T_P + F_N}$$

Here, $T_P$ is the number of true positives, $F_P$ is the number of false positives, and $F_N$ is the number of false negatives.
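Precision, recall, and F-score can be computed per command directly from a confusion matrix. The sketch below uses a small hypothetical two-class matrix rather than the values in Table 5, and it assumes that rows hold the true classes and columns the predicted classes.

```python
import numpy as np

def per_class_scores(confusion):
    """Return (precision, recall, f_score) arrays from a confusion matrix.

    confusion[i, j] counts samples of true class i predicted as class j.
    Classes with no predictions or no samples would divide by zero and
    are not handled in this sketch.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp   # predicted as the class but belonging to another
    fn = confusion.sum(axis=1) - tp   # belonging to the class but predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical counts for two classes
example = [[8, 2],
           [1, 9]]
precision, recall, f_score = per_class_scores(example)
print(precision, recall, f_score)
```

For the first class in this example, the true positives are 8, the false positives 1, and the false negatives 2, so the F-score equals 16/19.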
To evaluate the correct prediction of each voice command during classification, the percentage difference (%d) was used. In its standard symmetric form, it is:

$$\%d = \frac{|V_1 - V_2|}{(V_1 + V_2)/2} \times 100$$

where $V_1$ and $V_2$ represent the first and second observations during the comparison process, respectively. The method also evaluates the real-time performance of indoor/outdoor navigation. In this test (Video S1), the user controlled the wheelchair via voice commands along a path around and inside the mosque located at the coordinates 24.893374, 46.614728.
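The percentage difference between two observations can be computed as in the sketch below, which assumes the common symmetric definition (absolute difference divided by the mean of the two values). The prediction ratios used in the example are hypothetical and are not results from this study.

```python
def percentage_difference(v1, v2):
    """Symmetric percentage difference between two observations."""
    return abs(v1 - v2) / ((v1 + v2) / 2.0) * 100.0

# Hypothetical prediction ratios for the spoken command and a competing command
spoken, competing = 90.0, 10.0
print(percentage_difference(spoken, competing))
```

Under this definition the value is bounded by 200%, which is reached when one of the two observations is zero, so values above 150% correspond to a very large gap between the two prediction ratios.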

Results
In this work, audio was recorded and the model was trained on five words, and the application performance was tested until it reached the required prediction ratio. These words were chosen mainly because they are easy to pronounce and widely used in the Arab countries and because they differ markedly from one another in their points of articulation. Figures 5-9 illustrate the voice commands "Yes", "No", "Left", "Right", and "Stop". Each figure includes the sound waveform (a), the two-dimensional long-term spectrum with frequency band (b), the spectrogram (c), and the voice command prediction ratio in the mobile app (d). Table 4 summarizes the resizing and normalization phases for each voice command. The program also displays the predicted weight of the spoken word for the user. In every case, a single voice command receives more weight than the other words, which indicates that an incorrect decision is unlikely during the classification process.

Based on the previous results, the confusion matrix was calculated as shown in Table 5. The accuracy for the voice command "yes" was approximately 87.2% of the true predictions for the five voice commands. Regarding the classification tasks, we adopted the terms true positives, true negatives, false positives, and false negatives. Tables 6 and 7 present the calculations of the voice-command prediction ratio, accuracy, and precision. As an example of the percentage difference between one command and the others, the comparison is shown for "STOP" against the other commands. This indicates only a slight possibility of making an incorrect choice during the classification. The difference between the percentages of true and false predictions is markedly high, reaching more than 150%, as presented in Table 7, which indicates a negligible probability of making wrong predictions.
The real-time performance of indoor/outdoor navigation was evaluated in public places, as shown in Figure 10, which presents the planned route navigation versus the actual route (outbound navigation). Table 8 presents the coordinate nodes of the planned and actual paths while navigating. The root-mean-square deviation (RMSD) was adopted to represent the differences between the planned and actual nodes of this experiment. The RMSD was $1.721 \times 10^{-5}$ and $1.743 \times 10^{-5}$ for the latitude and longitude coordinates, respectively.
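The RMSD between planned and actual nodes can be computed separately for latitude and longitude, as in the following sketch. The coordinates below are hypothetical placeholders near the test site and are not the nodes listed in Table 8.

```python
import numpy as np

def rmsd(planned, actual):
    """Root-mean-square deviation per coordinate column (latitude, longitude)."""
    planned = np.asarray(planned, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((planned - actual) ** 2, axis=0))

# Hypothetical planned and actual nodes as (latitude, longitude) pairs
planned_nodes = [[24.89330, 46.61470], [24.89340, 46.61480], [24.89350, 46.61490]]
actual_nodes = [[24.89331, 46.61472], [24.89342, 46.61479], [24.89349, 46.61492]]
lat_rmsd, lon_rmsd = rmsd(planned_nodes, actual_nodes)
print(lat_rmsd, lon_rmsd)
```

Because the inputs are in decimal degrees, the resulting RMSD is also in degrees; converting it to meters would require an additional assumption about the local scale of latitude and longitude.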

Discussion
The objective of this study was to design and implement a low-cost and powerful system to drive a powered wheelchair using a built-in voice recognition app on a smartphone. This design is intended to give disabled people substantial independence and, consequently, to improve their quality of life. The proposed design of the smart wheelchair increases the capabilities of the conventional joystick-controlled design by introducing novel smart control systems, such as voice recognition technology and GPS navigation systems. Owing to the significant advancements in smartphones, accompanied by advanced voice recognition technology and the use of wireless headphones, voice recognition technology for controlling wheelchairs has become widely adopted [4,29,30].
In general, the proposed system is characterized by the ease of installing the proposed electrical and electronic circuits, along with low economic cost and low energy consumption. Figure 4 shows a simple structure of the electronic circuit connection inside the installed protection case. The design used is highly effective and low cost in terms of the materials and techniques used and their ability to be configured, customized, and subsequently transferred to the end-user. The average response time for processing a single task is approximately 0.5 s, which is sufficient to avoid accidents. All programs and applications in this smart wheelchair can operate offline without requiring access to the Internet. In addition, the proposed program works under conditions of external noise with high accuracy.
This study investigated the robustness of the voice recognition model by examining the percentage difference between true and false predictions. The experimental results exhibited a significantly high difference between the percentage values of different categories, which indicates a very low probability of wrong predictions. According to Table 7, the difference between the true and false predictions was more than 150%. The second experiment was adopted to evaluate the performance of indoor and outdoor navigation. The user controlled the chair via voice commands, and the RMSD was employed to represent the errors in navigation.
In general, speech recognition technology on Android has become widely used in recent years. In this regard, many free or commercially licensed software packages are available on the market and are suitable for our proposed model, such as Google Cloud Speech API, Kaldi, HTK, and CMUSphinx [37][38][39]. However, wheelchairs require more studies in terms of statics, motion, and moment of inertia. Such studies would make the system more suitable for different users. In addition, the current voice recognition model did not implement a speaker identification algorithm. Identifying the speaker could improve the safety of wheelchair users by accepting instructions only from the authorized person.
Comparing our study with others in terms of efficacy, reliability, and cost, we believe that our design has overcome many complexities. For example, a recent study conducted by Abdulghani et al. implemented and tested an adaptive neuro-fuzzy controller to track powered wheelchairs based on voice recognition. To achieve robust accuracy, that design needs a wireless network in which the wheelchair is considered a node. Furthermore, the controller depends on real-time data obtained from obstacle avoidance sensors and a voice recognition classifier to function appropriately and efficiently [28]. A different study used an eye and voice-controlled human-machine interface technique to drive a wheelchair. In this technique, the authors combined a voice-controlled mode with a web camera, which captured real-time images, to achieve congenial and reliable controller performance [22].

Conclusions
In this study, a low-cost and robust method for designing a voice-controlled wheelchair was developed and implemented using an Android smartphone app that connects to the microcontrollers via an offline Wi-Fi hotspot. The hardware used in this design consisted of an Android smartphone (Huawei Y9; CPU: octa-core, 4 × 2.2 GHz), DC electric motors, batteries, a relay module, a Raspberry Pi 4, and an emergency push button in case of an abnormal system response. The system controlled the wheelchair via a mobile app that was built based on Flutter software. A built-in voice recognition model was developed in combination with the CNN model to train and classify five voice commands (yes, no, left, right, and stop).
The experimental procedure was designed and implemented with a total of 2000 utterances of five words that were created by 10 native Arabic speakers. The maneuverability, accuracy, and performance of indoor and outdoor navigation were evaluated in the presence of various disturbances. Normalized confusion matrix, accuracy, precision, recall, and F-score of all voice commands were calculated. Results obtained from real experiments demonstrated that the accuracy of voice recognition commands and wheelchair maneuvers was high. Moreover, the calculated RMSD between the planned and actual nodes at indoor/outdoor maneuvering was shown to be accurate. Importantly, the implemented prototype has many benefits, including its simplicity, low cost, self-sufficiency, and safety. In addition, the system has an emergency push button feature to ensure the safety of the disabled individual and the system.

Future Work
The system can be extended with GPS location technology, which the user could use to create their own path, i.e., to build a manual map. By introducing ultrasound sensors for safety purposes, the system would intervene and ignore the user's command if the chair came near an obstacle that could lead to an accident. In addition, we will be able to investigate users' preference for a voice-controlled interface versus a brain-controlled interface. Moreover, a speaker identification algorithm can be added to the voice recognition model to ensure the safety of the disabled person by accepting commands only from a specific user.