A Multiplier-Free Convolution Neural Network Hardware Accelerator for Real-Time Bearing Condition Detection of CNC Machinery

In various industrial domains, machinery plays a pivotal role, with bearing failure standing out as the most prevalent cause of malfunction, contributing to approximately 41% to 44% of all operational breakdowns. To address this issue, this research employs a lightweight neural network, boasting a mere 8.69 K parameters, tailored for implementation on an FPGA (field-programmable gate array). By integrating an incremental network quantization approach and fixed-point operation techniques, substantial memory savings amounting to 63.49% are realized compared to conventional 32-bit floating-point operations. Moreover, when executed on an FPGA, this work facilitates real-time bearing condition detection at an impressive rate of 48,000 samples per second while operating on a minimal power budget of just 342 mW. Remarkably, this system achieves an accuracy level of 95.12%, showcasing its effectiveness in predictive maintenance and the prevention of costly machinery failures.


Introduction
With the development of automated production, electric motors have been used in various industrial fields.Due to the different applications, the motor can be operated in harsh, fast, and overloaded environments.In this case, the motor parts may quickly fail, and the failure of the parts can cause considerable losses and result in safety risks for the operator.Therefore, fault diagnosis technology is one of the most important techniques in the modern industrial field.The fault types of motors and drive systems can be divided into stator faults, electrical faults of the rotor, and mechanical faults of the rotor [1], whereby bearing faults are the most common causes.Bearing faults account for 41% to 44% of all faults [2][3][4][5], and bearing faults cause noise, vibrations, and even system failures.If the fault can be diagnosed in time, then the loss due to the faults can be reduced.Also, the reliability and safety of the machine can be improved.The main fault conditions of the bearing include ball faults, outer race faults, and inner race faults.
Currently, the bearing fault diagnosis methods can be divided into model-driven methods [6] and data-driven methods [7].The model-driven approach must understand its physical principles and construct mathematical models accordingly.The data-driven method can be divided into four steps to diagnose bearing fault types: data acquisition, feature extraction, feature selection, and classification [8].Also, the fault model can be built from the signals detected by the accelerometers.When dealing with non-linear and complex systems, the data-driven approach is a better choice because these systems are difficult to describe with accurate mathematics.More specifically, this paper adopts the Although many existing excellent works discussed the weight pruning and quantization process, their usage for bearing fault detection hardware implementation gained little attention.On the other hand, despite that the computing time can be sped up by implementing neural network model to hardware, and that memory usage can be reduced by the technology of weight pruning and quantization, the complex data-preprocessing process still significantly influences the fault-detecting time.Therefore, reducing the complexity of the data-preprocessing process is also an important goal of this paper.More specifically, the main goal of this work is to propose a hardware solution for the accurate bearing fault diagnosis.In addition, the proposed method will achieve real-time processing and minimize the hardware resources.
The main contribution of this work can be summarized as follows. 1.
In this work, a CNN-based hardware accelerator has been implemented for bearing faults, which can achieve real-time processing in industrial environments.

2.
An improved incremental network quantization (INQ) method has been proposed to reduce the memory usage of the proposed hardware accelerator.

3.
The complex multiplier operations have been removed from the proposed method to realize a multiplier-free accelerator.4.
The power consumption of the proposed accelerator can be reduced to 342 mW with a 140 MHz clock frequency.

5.
The accuracy of the proposed accelerator can achieve 95.12% with only 8.69 K parameters.
The organization of the rest of the paper can be summarized as follows.Various of different type fault detection methods will be explored in Section 2. Next, the proposed method will be introduced in Section 3.More specifically, Section 3.1 describes the software architecture of our proposed model, and Section 3.2 provides the details of hardware implementation.The experimental results are then shown in Section 4. Finally, Section 5 will conclude this work.

Exploring Different Fault Diagnosis Methods
In this section, various machine-learning-based methods for fault diagnosis will be discussed to explore the research background.
(1) K-Nearest Neighbor (K-NN) K-NN is an algorithm for classification and regression.The output is a classified ethnic group.The majority vote of the neighbors classifies the unclassified data.K indicates how many similar training data were selected.In [20], 30 features are extracted from the vibration signal, 24 of which are standard features for diagnosing rotating machine faults, and 6 are extracted by electromyography (EMG).Relief algorithm, Chi-squared, and information gain rank thirty features.The top 10 features are selected and fed into K-NN and random forest for classification, using five different vibration datasets for testing.The experiment results show that the features extracted by EMG are useful for diagnosing faults on rotating machines.
(2) Support Vector Machine (SVM) The SVM is a supervised learning model that can find decision boundaries in the hyperplane to maximize both boundaries.When faced with non-linear responsibilities, kernel tricks can be added to map the original data in high-dimensional space.Many studies used SVM for bearing fault detection and achieved excellent results.In [21], a one-class v-SVM is used to detect abnormal vibration signals.If abnormal, a sensitivity test is used to select the fault frequency band through envelope analysis.The one-class v-SVM is highly input-dependent, and small changes to the input can result in reduced accuracy.
(3) Principal Component Analysis (PCA) PCA is a method of statistical analysis and simplification of data.In [22], a solution for the unknown signal is proposed.The self-organizing map and principal component analysis (SOM-PCA) method analyzes the residual signal of the unknown signal.After the characteristic frequency of the faulty component has been separated, the SOM model classifies the four bearing states (normal condition, infield fault, field fault, and ball fault).This method can effectively distinguish the bearing condition.
(4) Neural Network (NN) In [23], the feedforward multi-layer perceptron (MLP) neural network consists of three layers: an input layer, a hidden layer, and an output layer.After using the Laplace wavelet transform and genetic algorithm to preprocess the bearing vibration signal, the time and frequency domain features are extracted as inputs for the DNN.The features are the root mean square, standard deviation, kurtosis, wavelet spectrum frequency peak to the shaft rotational frequency ratio, and wavelet power spectrum maximum amplitude to the overall amplitude ratio.The genetic algorithm aims to reduce the number of hidden-layer nodes to improve the classification speed.Finally, the bearing health status is classified into four bearing categories.The Laplace wavelet transform is very sensitive to noise, and when it encounters a signal with a sizeable periodic component, it is easy to generate errors.In [24], the vibration signal characteristics are extracted from the frequency domain to reduce the size of the data and then fed into the DNN model for fault classification.This approach reduces the network architecture and shortens the calculation time.The Fourier transform is useful for smoothing signals and may not reflect features in non-uniform signals.
Moreover, in [25], a quantum-inspired differential evolution (MSIQDE) algorithm is proposed based on the improvement of the multi-strategy to optimize the DBN connection weights and improve accuracy.Moreover, the CNN-based method has become mainstream for fault diagnosis technology because of its high accuracy.In [26], the discrete wavelet transform (DWT) splits the original signal into different scales, uses CNN for feature extraction, and finally performs fault classification using the softmax classifier, which performs better than other wavelet-based methods.However, the input is the four components of the signal.The parameters of the first layer are 32 × 32 × 4 × 32, and there are many parameters.Therefore, the chip area of hardware circuits and power consumption are unfavorable.In [27], the original signal is randomly sampled, and the input signal is converted into a two-dimensional image and then fed to the LeNet-5 CNN for classification.The accuracy of the three experimental conditions is more than 99%.
More recently, the study cited as [28] introduced a CNN model specifically developed for detecting bearing faults, utilizing both vibration and acoustic signals.In this study, the short-time Fourier transform (STFT) method was applied to transform these signals into a time frequency representation.The findings from this research showed that the proposed model achieved notably high accuracy in classification tasks.In contrast, the study referred to as [29] went beyond signals produced by CNC machines (like vibration, current, and acoustic signals) and introduced an innovative approach to fault detection.This approach involved the use of event-based cameras and vision sensing techniques to develop a contactless fault detection method.The results of their experiments showed high accuracy under specific conditions, also emphasizing that the effectiveness of this camera-based solution is greatly influenced by lighting conditions.
However, to enhance the accuracy, many existing works tend to select complex preprocessing methods, taking several important features and then using neural networks to classify them.Complex pretreatment will greatly burden hardware implementation, so this paper uses the method of [27].The pretreatment stage is relatively simple, and the accuracy is not worse than other methods.

The Proposed Methodology
In this research, we focus on developing a real-time hardware solution to identify faults in CNC machinery.Our approach employs a CNN as the predictive mechanism for fault detection.The CNN model primarily uses images as input, which, in our case, are derived from sensors attached to CNC machines.The essence of real-time processing in our project is the system's ability to complete the fault prediction before the sensors produce a subsequent image.For example, if an input image is made up of 4096 data points and the sensor outputs every × nanoseconds, then the model's prediction time per image should be less than 4096× nanoseconds.In simpler terms, the model must process each image swiftly to prevent delay accumulation, thereby maintaining real-time efficiency.To achieve this, we simplify complex data preprocessing steps in our model.Besides streamlining preprocessing, reducing the computational delay in the CNN is essential, particularly in processes like convolution.Typically, a CNN with more parameters yields higher accuracy but also longer prediction times.Therefore, balancing the CNN's architecture size and its accuracy is crucial.On the hardware side, we prioritize efficiency; adders are used instead of multipliers to decrease computation time, power consumption, and chip area in our design.Moreover, real-time performance, accuracy, and hardware cost are our prime considerations.We have endeavored to reduce memory usage and hardware overhead by minimizing the bits needed for data representation.However, this reduction can significantly affect prediction accuracy.This paper will also explore the trade-off between accuracy and hardware cost.Detailed explanations of our methodology will be discussed in the following subsections.First of all, we reduce the data preprocessing process, and this section will introduce the details of the dataset we use in this paper and the dataset preprocessing process.As aforementioned, in this work, we use the CWRU [9] dataset to train a CNN model to detect the bearing faults.The test equipment at CWRU includes a two-horsepower (HP) motor, torque sensor and encoder, a force meter, and control electronics.Svenska Kullagerfabriken (SKF) (Gothenburg, Sweden) bearings and NTN equivalent bearings are used as test bearings to support the motor shaft.The test bearings generate a single point of failure by discharge processing.SKF bearings produce 0.007-, 0.014-, and 0.021-inch single-point failures, and NTN equivalent bearings produce 0.028-and 0.04-inch single-point failures.
The defective bearing is installed in the electric motor, and the electric motor runs at a constant speed.Each motor loads 0-3 HP (approx.1730-1797 r.p.m).Bearing vibration data are collected by three accelerometers mounted on the motor support substrate, the drive-end motor housing, and the fan-end motor housing.Subsequently, the collected vibration signals are processed in MATLAB.Each file contains multiple accelerometer data.Moreover, there are four bearing datasets in the CWRU datasets: normal reference, 12 K drive bearing failure, 48 K drive bearing failure, and 12 K fan-end bearing failure.There are three types of fault data: inner race, ball fault, and outer race.The position of the outer race fault concerning the bearing load area directly affects the vibration response, so the outer race fault data contain data at three different locations: three o'clock position (orthogonal), six o'clock position (center), and twelve o'clock position (opposite).
The vibration dataset is used to train the CNN model in this work.More specifically, the vibration signals are converted into images sequentially.The bearing status is classified into ten categories according to the diameter of damage and type of fault, and if the vibration signal can be classified into one of ten categories, then the CNN network outputs the appropriate category.The label and corresponding bearing status are shown in Table 1.The CWRU bearing vibration dataset for bearing fault detection consists of a total of three accelerometer data, respectively: drive-end accelerometer data (DE), fan-end accelerometer data (FE), and basic accelerometer data (BA).In this paper, data from fan-end (FE) accelerometers are used.To convert a vibration signal to a gray-scale image through normalization, the normalization requires a time domain signal's maximum and minimum values and performing multiplication and division operations, as shown in Equation ( 1).When implementing CNN hardware, it takes 4096 cycles to find the minimum and maximum values of the vibration signal.The preprocessing of the vibration signal will take a long time and require memory and computing resources to process.Therefore, this paper directly arranges the values of vibration signals into 64 × 64 images.
Note that, in order to avoid waiting time, this study does not make a normalization for the vibration signal.The vibration signal is converted directly as an image, and each image size is 64 × 64.All images are divided into training images in the proportion of 64:16:20, verified, and tested.There are 8306 images, of which 5317 are training images, 1329 are validated images, and 1660 are test images.Training, validation, and test images are randomly selected from all images, not from each category.Therefore, the number of training, validation, and test images in each category is not exactly 64:16:20.The number of training, validation, and test images in each category is shown in Table 2.The proposed CNN network architecture for bearing condition detection is shown in Figure 1.After arranging the vibration signals into images, the bearing conditions are classified using the proposed CNN network.The proposed CNN architecture includes the convolutional layers (Conv1 to Conv4), the max pooling layers, the global average pooling layer, and the fully connected layer.If the global average pooling layer is used in the network instead of the convolution results being output directly to the fully connected (FC) layer, then the parameters of the neural network are greatly reduced, and overfitting is prevented.The Conv includes a zero-padding layer, a convolution layer, and an activation layer.

Architecture of the Proposed CNN Model
The proposed CNN network architecture for bearing condition detection is shown in Figure 1.After arranging the vibration signals into images, the bearing conditions are classified using the proposed CNN network.The proposed CNN architecture includes the convolutional layers (Conv1 to Conv4), the max pooling layers, the global average pooling layer, and the fully connected layer.If the global average pooling layer is used in the network instead of the convolution results being output directly to the fully connected (FC) layer, then the parameters of the neural network are greatly reduced, and overfitting is prevented.The Conv includes a zero-padding layer, a convolution layer, and an activation layer.3.With a 1 × 5 kernel size, we achieve an accuracy rate of 97.16%, leading to the adoption of this size for our study.Eight distinct architectures are trained to ascertain the optimal number of layers and the number of filters within each layer.Out of these, half feature three convolutional layers, while the remaining half have four convolutional layers.All architectures employ a kernel size of 1 × 5 and leverage global average pooling, leading to outputs for ten classes.As illustrated in Table 4, the accuracy of using three convolutional layers is on par with that of four convolutional layers.However, the parameter count for the three-layer setup is not necessarily lower.For instance, Architecture 3, comprising three convolutional layers, has filters distributions of 32 in Conv1, 64 in Conv2, and 128 in Conv3.It achieves an accuracy rate of 97.34% with a parameter count of 13,530.In comparison, Architecture 6, with its four convolutional layers, allocates 8 filters to Conv1, 16 to Conv2, and 32 to Conv3 and Conv4.Despite its accuracy being marginally lower by 0.18% compared to Architecture 3, it requires 4840 fewer parameters.Based on this, our study selects Architecture 6.  3.With a 1 × 5 kernel size, we achieve an accuracy rate of 97.16%, leading to the adoption of this size for our study.Eight distinct architectures are trained to ascertain the optimal number of layers and the number of filters within each layer.Out of these, half feature three convolutional layers, while the remaining half have four convolutional layers.All architectures employ a kernel size of 1 × 5 and leverage global average pooling, leading to outputs for ten classes.As illustrated in Table 4, the accuracy of using three convolutional layers is on par with that of four convolutional layers.However, the parameter count for the three-layer setup is not necessarily lower.For instance, Architecture 3, comprising three convolutional layers, has filters distributions of 32 in Conv1, 64 in Conv2, and 128 in Conv3.It achieves an accuracy rate of 97.34% with a parameter count of 13,530.In comparison, Architecture 6, with its four convolutional layers, allocates 8 filters to Conv1, 16 to Conv2, and 32 to Conv3 and Conv4.Despite its accuracy being marginally lower by 0.18% compared to Architecture 3, it requires 4840 fewer parameters.Based on this, our study selects Architecture 6.Table 5 compares prior works focusing on CWRU bearing fault detection but employing varying network architectures and preprocessing techniques.This study takes into account hardware implementation; therefore, it adopts preprocessing methods that are easy to implement and aim to minimize the architecture to achieve parameter lightness.In Table 5, the network parameters of the proposed model are more than 30-fold fewer than other architectures.In contrast, the accuracy of the proposed model is only slightly lower by 2.54% compared to the others.To assess our trained model's ability to classify accurately, we create a confusion matrix for the proposed CNN architecture, as shown in Figure 2.This matrix uses the x-axis to display predicted categories across ten classes and the y-axis to indicate the actual labels of the input images.Each value in the matrix reflects the likelihood of a specific outcome.For example, a value at the intersection of the actual label 0 and predicted label 0 indicates the probability that an image, truly belonging to class 0, is correctly predicted as class 0. High values along the diagonal of the matrix, nearing 1, suggest strong classification accuracy.Notably, all diagonal values being close to 1 in Figure 2 demonstrates that our CNN model has a high level of classification precision.

Hardware Implementation
Upon successfully training the CNN model in a software environment, our next step involves starting to implement this model into a hardware accelerator.This section delves into strategies for minimizing power consumption and chip area while enhancing computational speed.A significant factor in the circuit's power consumption is attributed to off-chip memory access.Frequent read and write operations on the same memory device substantially limit the energy efficiency of the circuit.Consequently, quantizing the CNN network becomes imperative to lessen memory access time and power consumption, thereby boosting the speed of computations.

Hardware Implementation
Upon successfully training the CNN model in a software environment, our next step involves starting to implement this model into a hardware accelerator.This section delves into strategies for minimizing power consumption and chip area while enhancing computational speed.A significant factor in the circuit's power consumption is attributed to off-chip memory access.Frequent read and write operations on the same memory device substantially limit the energy efficiency of the circuit.Consequently, quantizing the CNN network becomes imperative to lessen memory access time and power consumption, thereby boosting the speed of computations.
The entire design process proceeds from training the CNN network model to the hardware implementation, as shown in Figure 3.The first step is to convert 4096 vibration signals per image into 64 × 64 images using the signal-to-image conversion method mentioned in [27].The second step is to train the CNN network architecture and consider power consumption and computing speed in the hardware implementation.Therefore, the network structure is kept lightweight to reduce network complexity and the number of parameters, but high accuracy is still required.
The third step involves parameter quantization.This research draws inspiration from the incremental network quantization method [19] with further enhancements.In [19], the quantization process is divided into four stages, initially quantizing 50% of all model parameters, followed sequentially by 75%, 87.5%, and 100%.During the retraining process, ref. [19] utilizes unquantized parameters for adjustments to compensate for the accuracy loss caused by quantization.On the other hand, in this work, all parameters are used for retraining.Notably, this work does not employ pruning since its effects are insignificant.
In the fourth step, the values are represented in fixed-point format after quantizing all parameters.However, TensorFlow is used when training the proposed CNN, and floating-point computations are applied.Therefore, during hardware implementation, as fixed-point computations can lead to accuracy loss, it is essential to evaluate the proposed CNN in Python to assess the precision loss when choosing the minimum bit-width implementation.After confirming the precision loss and determining the bit-width, we proceed to write the Verilog register transfer level (RTL) code to implement the functionality of The entire design process proceeds from training the CNN network model to the hardware implementation, as shown in Figure 3.The first step is to convert 4096 vibration signals per image into 64 × 64 images using the signal-to-image conversion method mentioned in [27].The second step is to train the CNN network architecture and consider power consumption and computing speed in the hardware implementation.Therefore, the network structure is kept lightweight to reduce network complexity and the number of parameters, but high accuracy is still required.The third step involves parameter quantization.This research draws inspiration from the incremental network quantization method [19] with further enhancements.In [19], the quantization process is divided into four stages, initially quantizing 50% of all model parameters, followed sequentially by 75%, 87.5%, and 100%.During the retraining process, ref. [19] utilizes unquantized parameters for adjustments to compensate for the accuracy loss caused by quantization.On the other hand, in this work, all parameters are used for retraining.Notably, this work does not employ pruning since its effects are insignificant.
In the fourth step, the values are represented in fixed-point format after quantizing all parameters.However, TensorFlow is used when training the proposed CNN, and floatingpoint computations are applied.Therefore, during hardware implementation, as fixed-point computations can lead to accuracy loss, it is essential to evaluate the proposed CNN in Python to assess the precision loss when choosing the minimum bit-width implementation.After confirming the precision loss and determining the bit-width, we proceed to write the Verilog register transfer level (RTL) code to implement the functionality of each component in the proposed CNN.Subsequently, we synthesize the code using Xilinx Vivado and obtain the implementation results.
The proposed CNN network hardware block diagram is shown in Figure 4.The entire calculation process can be divided into three blocks: Conv, max pooling, and FC.The Conv is responsible for the zero-padding, dynamic fixed-point adjustment, convolution calculations, and activation function.The image size output after the convolution operation is smaller, so the input image needs to be zero-padded to maintain the same input and output image size.Dynamic fixed-point adjustments are used in the architecture, and each layer's integer and decimal bits are different.In this study, the whole number of integers and decimal places is adjusted in the Conv, and the entire convolution operation is carried out only after adjustment.The convolution contains six adders, five of which are used for the kernel calculation of the convolution operation, and one is for the partial sum accumulation.The activation function chooses to use ReLU because ReLU is very simple.The max pooling block is responsible for performing max pooling and controlling whether the output is stored to Ofmap_RAM, and when four layers of convolution are complete, the results are output to the FC block.The FC block is responsible for global average pooling and fully connected operations and compares ten output values, outputting the maximum label.There are four convolution layers and four max pooling layers in the proposed CNN network.The max pooling results from the first to the third layer are transmitted to Ofmap_RAM1 or Ofmap_RAM2, and the max pooling results in the fourth layer are transferred to the FC block for operations.The global average operation and the fully connected layer operation are performed in the FC block.After four maximum pooling layers, only 4 × 4 images remain in the output.Therefore, the global average operation can be calculated by shifting four bits to the right instead of division.When the fully connected layer is calculated, the label with the maximum value is output.ConvKernel_ROM only stores the weight of the convolution layer.The weight of the fully connected layer is stored in the FCWeight_ROM, which makes it more convenient to take the value when performing fully connected layer operations.The memory management method of the ping-pong architecture consists of two memory banks, where one memory provides the value, and the other memory stores the result of the operations.Since the CNN network proposed in this design has the most significant output values on the first and second layers, the number of first-and second-layer output values determines the memory size.
The quantization process is divided into four stages, initially quantizing 50% of all model parameters, followed sequentially by 75%, 87.5%, and 100%.The quantization process is shown in Algorithm 1. First, random numbers produce a quantization order of parameters for each layer and quantify the parameters of that layer based on the quantization order and the quantization phase.The model is quantified once it documents the accuracy of the quantization, and then the model, maximum precision, and quantization order to be trained and quantized is entered into Algorithm 2 for retraining and quantization.If the accuracy of the output of Algorithm 2 is not higher than before, then the quantization method is replaced.If all three quantization methods are already in use, then the next phase of quantization will take place.The quantization process is divided into four stages, initially quantizing 50% of all model parameters, followed sequentially by 75%, 87.5%, and 100%.The quantization process is shown in Algorithm 1. First, random numbers produce a quantization order of parameters for each layer and quantify the parameters of that layer based on the quantization order and the quantization phase.The model is quantified once it documents the accuracy of the quantization, and then the model, maximum precision, and quantization order to be trained and quantized is entered into Algorithm 2 for retraining and quantization.If the accuracy of the output of Algorithm 2 is not higher than before, then the quantization method is replaced.If all three quantization methods are already in use, then the next phase of quantization will take place.
This study employs three distinct methods in the quantization process, as detailed in Algorithm 2. The first method involves quantizing after each training epoch.The second method quantifies after each training batch.Meanwhile, the third approach activates quantization once the verification accuracy surpasses a predetermined threshold.If the conditions for quantization are met, then the quantization model and the quantization order are transmitted to Algorithm 3. If the quantization accuracy surpasses the peak accuracy, then the model will be preserved.While all three methods are applied, the first and third methods often yield higher training accuracies at the expense of lower verification accuracies.Conversely, the second method effectively narrows the training and verification accuracy gap.This study employs three distinct methods in the quantization process, as detailed in Algorithm 2. The first method involves quantizing after each training epoch.The second method quantifies after each training batch.Meanwhile, the third approach activates quantization once the verification accuracy surpasses a predetermined threshold.If the conditions for quantization are met, then the quantization model and the quantization order are transmitted to Algorithm 3. If the quantization accuracy surpasses the peak accuracy, then the model will be preserved.While all three methods are applied, the first and third methods often yield higher training accuracies at the expense of lower verification accuracies.Conversely, the second method effectively narrows the training and verification accuracy gap.
In the proposed CNN model, there are 8680 parameters, excluding biases.Throughout the training process, 32-bit floating-point numbers facilitate the calculations of the neural network model.Without quantization, storing all these parameters necessitates 277,760 bits of memory.Models that are neither compressed nor quantized lead to increased chip area and greater power demands.Consequently, we employ the improved incremental network quantization method to quantize the parameters, with each parameter being represented using just four bits.This results in savings of 243,040 bits at the cost of losing 1.62% accuracy.Table 6 exhibits the accuracy achieved at varying quantization levels.The remaining unquantized parameters can counteract the losses from quantization.Therefore, even after quantizing 87.5% of the parameters, the verification accuracy remains unchanged.

Algorithm 3. Quantization network
Input : Net , orderM Output : quantized_network Net Target = 0, ±2 0 , ±2 −1 , ±2 −2 , ±2 −3 , ±2 −4 , ±2 −5 , ±2 −6 for i in range (number of Net layers) for j in range(orderM) To demonstrate the reliability of our approach, we conduct a series of experiments for cross-validation analysis.In this experiment, we adopt a shuffle-split strategy that randomly samples elements from datasets in each iteration.More specifically, we resample our dataset ten times, generating unique training and testing sets labeled from v1 to v10.For each set, we retrain our model.The outcomes of these experiments are presented in Table 7.As observed in the table, our proposed network consistently achieves an accuracy rate above 91% in all ten instances of resampling.Moreover, sensor data are continuously fed into the operational circuit in the hardware configuration.Consequently, a dedicated memory segment is essential to store these sensor data.The vibration signal's minimum and maximum values are −7.1409 and 7.5856, respectively.As a result, one bit is allocated for the sign and three bits for the integer part.Next, we experiment with varying the precision from 4 to 10 decimal bits to assess accuracy implications.The outcomes of these tests can be found in Figure 5.Note that, in Figure 5, the x-axis denotes the total presenting bits, which includes the 4-bit integer of the input, and the y-axis indicates the accuracy.Assuming an input value is represented using a 32-bit floating point and an image contains 4096 values, the necessary memory storage amounts to at least 131,072 bits.When each value is denoted using four bits for the integer part and five for the fractional part, summing up to nine bits in total, there is a slight reduction in accuracy by 0.42%.However, this approach allows for significant memory savings of 94,208 bits.During the execution of the CNN hardware circuit, the input is accessed multiple times.Hence, two instances of Ofmap_RAM are required to implement the ping-pong architecture.While one memory handles the data reading process, the other memory stores the results from the CNN operations.The outputs from each layer are also represented in fixed-point format to conserve space in Ofmap_RAM.
Owing to the vast variability in the output values across each layer of the neural network, it is crucial to assess the bit requirement for each layer before transitioning values from 32-bit floating-point to fixed-point format.The number of output bits is minimized During the execution of the CNN hardware circuit, the input is accessed multiple times.Hence, two instances of Ofmap_RAM are required to implement the ping-pong architecture.While one memory handles the data reading process, the other memory stores the results from the CNN operations.The outputs from each layer are also represented in fixed-point format to conserve space in Ofmap_RAM.
Owing to the vast variability in the output values across each layer of the neural network, it is crucial to assess the bit requirement for each layer before transitioning values from 32-bit floating-point to fixed-point format.The number of output bits is minimized by trimming extra bits before they are relayed to the subsequent layer for processing.It is worth noting that the count of output bits directly influences memory consumption.A Python-based simulation of the CNN network's computation process is employed to quickly determine the integer and fractional bits that need to be retained for each layer's output.During this Python-based CNN operation simulation, each layer's maximal and minimal values are logged throughout the computation, with the results shown in Table 8.This study examines two distinct fixed-point methods to calculate the necessary bit count.The first approach uses a consistent representation with a set number of integer and decimal bits per layer.Specifically, the integer part is set to 12 bits tested across decimal bits ranging from 5 to 10.The second method, dynamic fixed point, adapts the integer digits based on the layer's maximum value.For example, Conv1 through FC have integer bits set at 5, 8, 10, 12, and 10, respectively, and are tested using 17 to 22 bits to evaluate any accuracy drop in the CNN.
Results for both methods are shown in Figure 6.The data suggests that the dynamic fixed-point method outperforms the consistent representation, especially when fewer bits are employed.A clear observation from Figure 6 is that the dynamic approach delivers superior accuracy for the same total bit count, particularly at lower bit numbers.
FC 437.778 −637.388 11 This study examines two distinct fixed-point methods to calculate the necessary bit count.The first approach uses a consistent representation with a set number of integer and decimal bits per layer.Specifically, the integer part is set to 12 bits tested across decimal bits ranging from 5 to 10.The second method, dynamic fixed point, adapts the integer digits based on the layer's maximum value.For example, Conv1 through FC have integer bits set at 5, 8, 10, 12, and 10, respectively, and are tested using 17 to 22 bits to evaluate any accuracy drop in the CNN.
Results for both methods are shown in Figure 6.The data suggests that the dynamic fixed-point method outperforms the consistent representation, especially when fewer bits are employed.A clear observation from Figure 6 is that the dynamic approach delivers superior accuracy for the same total bit count, particularly at lower bit numbers.
Given these findings, this study opts for the dynamic fixed-point method, setting each output at 18 bits.The two Ofmap_RAM storage units require storage for 8192 and 4096 values.Using 32-bit floating points for the output would demand 262,144 bits and 131,072 bits of memory, respectively.By selecting the 18-bit dynamic method, we achieve significant memory savings of 114,688 bits and 57,344 bits for the two Ofmap_RAM units.Given these findings, this study opts for the dynamic fixed-point method, setting each output at 18 bits.The two Ofmap_RAM storage units require storage for 8192 and 4096 values.Using 32-bit floating points for the output would demand 262,144 bits and 131,072 bits of memory, respectively.By selecting the 18-bit dynamic method, we achieve significant memory savings of 114,688 bits and 57,344 bits for the two Ofmap_RAM units.
The accuracy for each category is shown in Table 9, derived from a Python evaluation used to quantize the fixed-point networks.Notably, the second and eighth categories display lower accuracy levels.These categories correspond to 0.007-inch and 0.021-inch inner race faults, respectively.Their identification poses a challenge; for instance, the 0.007-inch inner race fault is often misclassified as a 0.021-inch inner race fault.
In total, there are 8680 weights.Originally, each weight is represented by 32 bits.After quantization, each weight requires only four bits, resulting in memory savings of 87.5%.The bias, Ofmap_RAM1, and Ofmap_RAM2-with 10, 8192, and 4096 values, respectively-transition from using 32-bit floating-point representations to 18-bit fixedpoint numbers.This change means each value only needs 18 bits, saving 43.75% of memory space for these components.Additionally, the Input_RAM, comprising 4096 values, is optimized to use just nine bits per value, leading to a 71.88% reduction.Cumulatively, the overall memory savings is 63.49%, as shown in Table 10.

Experimental Results
When deploying on an FPGA, components such as Kernel_ROM, Input_RAM, Ofmap_RAM1, and Ofmap_RAM2 are constructed using the block memory generator, while the clocking wizard produces the system clock.Within the FPGA synthesis tool, the integrated logic analyzer (ILA) allows easy waveform visualization to facilitate debugging.Users can specify the signals and the depths they wish to observe via the ILA.However, it is essential to define the trigger conditions appropriately.
To achieve real-time capability, we analyze the requirement of the computing cycles and decide the clock rate.The CWRU dataset offers vibration signal samples at 12,000 and 48,000 samples per second.The circuit must wait for 4096 data points to be input before performing the CNN network operations using the 48,000 samples/second rate for illustration.An image containing 4096 time domain signal points for real-time state diagnosis must be processed within 0.085333 s.Completing the calculations for one such image requires 360,828 cycles, implying a required clock frequency of approximately 4.23 MHz for the intended hardware accelerator.However, generating a precise 4.23 MHz frequency during FPGA implementation is not feasible.Consequently, a 5 MHz clock frequency is employed in this work.
This research specifically utilizes the Xilinx Virtex-7 FPGA VC707 evaluation board to validate the circuit design.While the VC707 inherently offers a 200 MHz clock frequency, the necessary clock rate for our circuit is achieved using the clocking wizard.The circuit's power consumption is 342 mW at a clock rate of 5 MHz.Notably, the circuit can function at a maximum clock rate of 140 MHz, which consumes 616 mW.
In the proposed design, a single Input_RAM is utilized to access the input signal.Once the initial convolution layer processes are finalized, the RAM values can be refreshed.With a frequency of 5 MHz, completing the first convolution layer requires 6574,500 ns, while generating the label output takes 72,165,700 ns.Vibration signals are fed every 20,833 ns.Hence, besides the 4096 values needed for the initial convolution layer's computations, extra memory is necessary to accommodate the 316 vibration signals fed during this period.After the label generation, only 3464 vibration values are input, leading to a circuit idle time of 13,166,456 ns.
On the other hand, at 140 MHz clock rate, the first convolution layer completes in 236,682 ns, and producing the label output requires 2,597,965 ns.In this scenario, extra storage is required for the 12 vibration signals processed during the initial convolution phase.Post-label production, with just 125 vibration values input, the circuit remains inactive for 82,727,843 ns.The vibration signal's sampling rate can be elevated, or the operational frequency can be adjusted to 4.23 MHz to optimize and prevent such inactivity.
To assess real-time capabilities, we conduct a comparison between our method and a state-of-the-art real-time hardware solution [30] for bearing faults diagnosis.The referenced work [30] introduced a binary weight-based CNN on FPGA, achieving real-time performance at 130 MHz with 20,387,288 computing cycles and resulting in a detection iteration time of 156,819,019 ns.As previously noted, our solution, operating at a comparable clock rate (e.g., 140 MHz), demonstrates a significantly faster computational completion within 2,597,965 ns, thereby surpassing the performance of the referenced method.
Finally, In Table 11, the comparisons indicate that the proposed design, implemented with a lightweight network, has fewer parameters than other architectures.This reduction in parameters leads to decreased storage requirements.Regarding the FPGA implementation, LUT, DSP, FF, and BRAM usage in this work is significantly lower than in other comparable models, resulting in the lowest power consumption.The efficiency reported in this work stands at 1.126 (GOPS/W) when operating at 140 MHz.

Conclusions
This paper explores the design flow from software to hardware, applying various quantization approaches in the software to minimize both the quantity and the bit size of the model's parameters.Based on the sampling rate provided by the CWRU dataset,

Figure 1 .
Figure 1.The proposed CNN network architecture.

Figure 1 .
Figure 1.The proposed CNN network architecture.This research examines four kernel sizes: 1 × 3, 1 × 5, 2 × 2, and 3 × 3. The vibration signals are arranged on the image from the top to bottom and left to right.Notably, there is a lack of continuity among values from distinct data rows situated at the same point.To illustrate, the initial values of the first and second rows do not represent continuous signals.As a result, for the purposes of this research, kernel sizes of 1 × 3 and 1 × 5 are more suitable.For convolutional layers, Conv1 utilizes 8 filters, Conv2 employs 16 filters, while Conv3 and Conv4 use 32 filters to gauge accuracy and pinpoint the optimal kernel size.The outcomes of the simulations can be found in Table3.With a 1 × 5 kernel size, we achieve an accuracy rate of 97.16%, leading to the adoption of this size for our study.

Figure 2 .
Figure 2. The confusion matrix of the proposed CNN network architecture.

Figure 2 .
Figure 2. The confusion matrix of the proposed CNN network architecture.

Sensors 2023 ,Figure 3 .
Figure 3. From software training to quantization, parameter extraction, final software verification, and hardware implementation.The proposed CNN network hardware block diagram is shown in Figure 4.The entire calculation process can be divided into three blocks: Conv, max pooling, and FC.The Conv is responsible for the zero-padding, dynamic fixed-point adjustment, convolution calculations, and activation function.The image size output after the convolution opera-

Figure 3 .
Figure 3. From software training to quantization, parameter extraction, final software verification, and hardware implementation.

Sensors 2023 ,
23,  x FOR PEER REVIEW 11 of 20 the other memory stores the result of the operations.Since the CNN network proposed in this design has the most significant output values on the first and second layers, the number of first-and second-layer output values determines the memory size.

Figure 4 .
Figure 4.The proposed CNN network block diagram.

Figure 4 .
Figure 4.The proposed CNN network block diagram.

Figure 5 .
Figure 5. Accuracy vs. presenting bits of the input value.

Figure 5 .
Figure 5. Accuracy vs. presenting bits of the input value.

Figure 6 .
Figure 6.Effect on accuracy vs. bit number of the fixed representation and dynamic fixed point.Figure 6.Effect on accuracy vs. bit number of the fixed representation and dynamic fixed point.

Figure 6 .
Figure 6.Effect on accuracy vs. bit number of the fixed representation and dynamic fixed point.Figure 6.Effect on accuracy vs. bit number of the fixed representation and dynamic fixed point.

Table 1 .
Label and corresponding bearing status.

Table 2 .
Number of training and test images for each category.

Table 3 .
Different kernel sizes and the corresponding accuracy.

Table 3 .
Different kernel sizes and the corresponding accuracy.

Table 4 .
Different architectures and the corresponding accuracy.

Table 5 .
Comparison table of the software level.

Table 6 .
Accuracy with different quantization percentages.

Table 8 .
Maximum and minimum values of each layer output.

Table 9 .
Accuracy for each category.

Table 10 .
The memory usage of 32-bit floating point and this work.

Table 11 .
Hardware accelerator comparison table.