Advances in Machine Learning for Sensing and Condition Monitoring

Abstract: To overcome the complexities that sensing devices encounter in data collection, transmission, storage and analysis for condition monitoring, estimation and control purposes, machine learning algorithms have gained popularity for analyzing and interpreting big sensory data in modern industry. This paper puts forward a comprehensive survey of the advances in machine learning algorithms and their most recent applications in the sensing and condition monitoring fields. Current case studies on developing tailor-made data mining and deep learning algorithms from a practical perspective are carefully selected and discussed. The characteristics and contributions of these algorithms to the sensing and monitoring fields are elaborated.


Brief Introduction
Machine learning algorithms can be very useful for knowledge discovery [1], building models from training data. The knowledge discovery process of machine learning algorithms usually involves feedback at each iteration so that further improvement can be achieved. While the feedback can be provided by humans, this can be time-consuming and labor-intensive. Data mining algorithms are developed to automate the feedback process and overcome the disadvantages of manual feedback, with the goal of discovering unknown features in the data, whereas machine learning usually needs known features learned in the training process for prediction. Machine learning includes, for example, supervised learning such as classification and regression, and unsupervised learning such as clustering and dimensionality reduction. Clustering algorithms are used to group data without any pre-defined classes. These methods can be employed to extract valuable information from datasets [2]. Other machine learning approaches include semi-supervised learning, reinforcement learning, self-learning, robot learning and association rule learning, which are not covered in this review for sensor applications. Deep learning refers to machine learning algorithms with multi-layer structures that extract higher-level features from the input dataset.
This review paper is organized as follows. In Section 2, a review is undertaken for supervised machine learning. In Section 3, a review is undertaken for unsupervised machine learning, more specifically, clustering. In Section 4, a review is undertaken for deep learning.

Supervised Machine Learning
In order to build robust learning systems, in many cases, only the relevant features of the dataset are needed. This selection process is called feature selection. For cases that need optimal feature selection, this involves the exhaustive search of all possible feature target variable, as their underlying assumptions are different. Logistic transformation and logistic regression models are often employed for this type of application.
In gesture recognition and communication, sensors such as flex sensors and accelerometers are attached to a glove, and machine learning models are employed to predict the gesture by obtaining the values from the sensors. Krishnan and Vijay et al. [12] used logistic regression models to perform the classification of gestures from values of these sensors.
Li and Cock [13] focused on detecting the cognitive load of a user from the readings obtained by smart wristband sensors. Feature selection was employed, and the machine learning algorithms used in their application included logistic regression, a decision tree model, and support vector machines.
A decision tree can be employed to form smaller, more homogeneous sets with respect to a particular target variable. Decision rules are defined in order to split the records in the original dataset into smaller sets. Many classification and prediction problems can be handled satisfactorily by decision tree algorithms [14]. The underlying procedure of the various decision tree models is as follows: the data records are repeatedly split into smaller subsets, with the objective of achieving greater purity in the newly formed subsets than in their ancestors. The quality of a split is measured by the degree of purity it obtains. Measures such as the Gini index, information gain, or the chi-square statistic can be used for applications with a categorical target variable. Measures such as variance reduction and the F-test are used for cases with a numerically continuous target variable.
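As a minimal sketch of how split purity can be scored, the following Python fragment computes the Gini impurity and the entropy of a label set and the gain obtained by a candidate split. The function names and the toy "ok"/"fault" labels are hypothetical and chosen only for illustration; they are not taken from the cited works.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_gain(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Hypothetical toy data: six sensor records labelled "ok" or "fault".
parent = ["ok", "ok", "ok", "fault", "fault", "fault"]
pure_split = [["ok", "ok", "ok"], ["fault", "fault", "fault"]]
mixed_split = [["ok", "fault", "ok"], ["fault", "ok", "fault"]]

# With entropy as the impurity, split_gain is the information gain.
gain_pure = split_gain(parent, pure_split, entropy)
gain_mixed = split_gain(parent, mixed_split, entropy)
```

A tree-growing procedure would evaluate such a gain for every candidate decision rule and keep the rule with the largest value; here the perfectly separating split scores higher than the mixed one.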
Recognizing new types of attacks in intrusion detection systems is an important topic for the security of wireless sensor networks. Nancy and Muthurajkumar et al. [15] presented a new intrusion detection system that employs a decision tree classification algorithm to find attacks. Their proposed fuzzy temporal decision tree algorithm was integrated with convolutional neural networks for locating intruders. The experimental results indicated that the detection performance and efficiency were satisfactory.
The cleaning of rice is an important function of a combine harvester. Chen and Lian et al. [16] developed sensors for checking rice grain impurity in harvesters. High-quality images are recorded during harvesting. The morphological features of the particles extracted from the images serve as the inputs to the decision tree model for the subsequent classification. The output in their application is the visualized tree, which is useful for classifying the particles labeled in the binary image.
Inductive-based learning refers to the process of learning from instances. The system tries to induce general rules from the input examples [17]. In inductive methods, relational learners are employed to achieve a partial ordering among the hypotheses concerned.
Missing data values are a common problem in sensor applications. Elhassan and Abu-Soud et al. [18] developed an inductive learning algorithm for dealing with missing data values. They focused on enhancing an existing inductive learning algorithm to handle datasets with missing values and presented a new algorithm with the added ability to deal with noisy data.
Enormous amounts of spatial data are generated from remote sensing of geographical information system and computer cartography, etc. Mihai and Mocanu [19] focused on spatial data mining with the decision tree classifier algorithm. Information theory and an inductive learning method were used to construct a decision tree, which can in turn extract relevant relationships in a set of labeled input data.
Artificial neural networks (ANNs) are well known for their high performance in tasks involving filtering and prediction. This process usually involves filtering noise from the source dataset and making predictions based on the filtered dataset. Filtering the noise can also refer to the extraction of essential patterns from the input. Based on some function approximation assumptions, the filtered data can be used for predicting future values of the target variables. The artificial neural network is a popular advanced tool because of its proven robustness. A feedforward network is a network with more than one neuron but with no feedback paths in its structure. A multi-layer feedforward network is a network with an input layer, one or more hidden layers and an output layer.
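The structure of a multi-layer feedforward network can be sketched in a few lines of Python. This is an illustrative forward pass only, assuming one hidden layer with a sigmoid activation and a linear output layer; the function names and the example weights are hypothetical, and no training step is shown.

```python
from math import exp


def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one weight row per neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer (sigmoid) -> output layer (linear).

    Signals only flow forward; there are no feedback paths.
    """
    hidden = dense(x, hidden_w, hidden_b, sigmoid)
    return dense(hidden, out_w, out_b, lambda z: z)


# Hypothetical weights: 2 inputs, 2 hidden neurons, 1 output.
prediction = mlp_forward(
    x=[0.2, -0.4],
    hidden_w=[[0.5, -1.0], [1.5, 0.3]],
    hidden_b=[0.0, 0.1],
    out_w=[[1.0, -2.0]],
    out_b=[0.05],
)
```

In practice the weights would be fitted to sensor data, for example by back-propagation, rather than set by hand as in this sketch.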
Machine vision technology captures visual information with frame-based cameras, etc., converts the images into a digital format and processes them afterwards using machine learning algorithms. Mennel and Symonowicz et al. [20] presented an image sensor with a built-in ANN that can sense and process optical images at the same time without latency. The sensor can classify and encode images optically projected onto the chip at a rate of 20 million bins per second.
Falls in the elderly can cause serious consequences and are a major public health concern. Wearable inertial sensors, such as accelerometers and gyroscopes, can generate large datasets of various falls and activities of daily living (ADL). Yu and Qiu et al. [21] deployed three deep learning models for detecting falls from the large dataset obtained by wearable inertial sensors. These models are a convolutional neural network, a long short-term memory network, and a hybrid model integrating both convolution and long short-term memory. Predicting a fall during its descent may enable a safety mechanism that can prevent fall-related injuries. Chen and Zhang et al. [22] proposed a cuffless blood pressure estimation framework using a CNN-based Receptive Field Parallel Attention Shrinkage Network, which captures the long-term dynamics in the photoplethysmography signal without long short-term memory.
A multi-layer perceptron neural network was employed in [23] to identify the working condition of a mechanical indexing system, using data acquired by accelerometers, with the aim of preventing the onset of vibratory phenomena or failures. The extraction of features from the raw data is a very important phase of the diagnostic process, as it reduces the dimensionality of the problem and, therefore, of the networks. Different features used in [24], based on the power spectral density, the Fourier transform (FT), wavelets, the probability density function, and higher-order spectra (HOS), were compared for the case study of an indexed rotating table. The study showed that all the considered pre-processing techniques yielded acceptable classifications, but two of them (the FT and the HOS) gave better results.
Online fault detection in aircraft has become possible with advances in actuator and sensor technologies. Taimoor and Aijun et al. [25] increased fault detection capabilities by employing the Extended Kalman Filter to update the weight parameters of a multi-layer perceptron (MLP) neural network. With the online adaptation of the MLP weight parameters, the precision of the fault detection was found to increase.
In 1965, Lotfi Zadeh proposed the concept of fuzzy logic, a multi-valued logic that can handle approximate reasoning (Zadeh et al. [26]). The truth of a statement is no longer limited to the two traditional values "true" and "false". In fuzzy logic, the degree of truth can take any value in the interval from zero to one. Fuzzy systems may raise problems such as how to define the fuzzy operators for real-world applications.
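The idea of a degree of truth between zero and one can be illustrated with a short Python sketch. It assumes a triangular membership function and the standard minimum/maximum/complement operators, which are only one common choice among the many possible definitions of fuzzy operators; the temperature sets "warm" and "hot" and their breakpoints are hypothetical.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 outside (a, c), rising linearly to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzy_and(p, q):
    """Conjunction as the minimum of the two degrees of truth."""
    return min(p, q)


def fuzzy_or(p, q):
    """Disjunction as the maximum of the two degrees of truth."""
    return max(p, q)


def fuzzy_not(p):
    """Negation as the complement of the degree of truth."""
    return 1.0 - p


# Hypothetical fuzzy sets for a temperature reading of 28 degrees.
reading = 28.0
warm = triangular(reading, 15.0, 25.0, 35.0)  # degree to which 28 is "warm"
hot = triangular(reading, 25.0, 40.0, 55.0)   # degree to which 28 is "hot"

warm_and_hot = fuzzy_and(warm, hot)
warm_or_hot = fuzzy_or(warm, hot)
not_warm = fuzzy_not(warm)
```

The same reading is thus partly "warm" and partly "hot" at once, which a two-valued logic cannot express.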
Cooperative cargo transportation studies the management of unmanned aerial systems by utilizing information obtained by sensors. Teixeira and Neves-Jr. et al. [27] presented a fuzzy model to prevent the drones from colliding with each other or with other objects. A new approach was developed to evaluate potential fields with a fuzziness measure for collision avoidance. Four intelligent controllers were employed to monitor the motion of the drones for avoiding collisions.
A portable, wearable gait analysis system with signals obtained from pressure sensors can be used for accurate gait phase recognition and gait cycle segmentation. Yang and Gao et al. [28] applied fuzzy logic inference to achieve continuous and smooth gait phase recognition. Then, gait cycle segmentation was performed using the gait phases, fully considering the differences among different people.
The evolution of biological species is the inspiration for the development of evolutionary computing (Jong [29]). Evolutionary computational algorithms are iterative procedures involving the growth or shrinking of a population. In each iteration, the population is selected stochastically with the objective of getting closer to the desired result. Metaheuristic optimization methods such as genetic algorithms, evolution strategies, ant colony optimization and particle swarm optimization are popular in evolutionary computing.
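The iterative population update can be sketched with a small genetic algorithm in Python. This is a minimal illustration, assuming bit-string individuals, truncation selection, one-point crossover, bit-flip mutation and elitism; the function name, the parameter values and the toy fitness (the number of ones in the string) are hypothetical and do not reproduce any algorithm from the cited works.

```python
import random


def evolve(fitness, n_bits=20, pop_size=30, generations=40,
           mutation_rate=0.05, seed=0):
    """Maximize `fitness` over bit strings; returns (best individual, best-fitness history)."""
    rng = random.Random(seed)  # local generator, so runs are repeatable
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]      # truncation selection: keep the fitter half
        children = [pop[0][:]]              # elitism: carry the best individual over
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation applied independently to every gene.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


# Toy objective (an assumption for illustration): count the ones in the string.
best, history = evolve(fitness=sum)
```

Because the best individual is always carried over, the recorded best fitness never decreases from one generation to the next, although reaching the global optimum is not guaranteed.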
During the deployment of wireless sensor networks (WSNs), clustering and routing are two major issues that need to be addressed, and yet these two problems are both NP-hard. Kuila et al. [30] employed a genetic algorithm, particle swarm optimization and differential evolution for solving clustering and routing problems in WSNs. A comparison of the algorithms, together with their strengths and weaknesses, was highlighted.
In mobile wireless sensor networks (MWSNs), the sensor nodes are movable within a certain area. It becomes more and more important to prolong the lifetime of the sensors for real-time and effective information. Zhang et al. [31] employed five evolutionary computing algorithms to achieve an MWSN lifetime optimization model.
Computational learning theory uses induction to understand the methodology common to efficient learning algorithms and to identify the obstacles to effective learning (Kearns and Vazirani [32]). Mathematical analysis is often needed. There are learning algorithms that can forecast based on the values of past events. There are also algorithms that can improve with advice from experts or teachers. An algorithm that can be completed in polynomial time is called feasible. Probably approximately correct learning, Vapnik-Chervonenkis theory, Bayesian inference, and algorithmic learning theory are common approaches in computational learning theory.
Nowadays, many manufacturing systems perform monitoring tasks with the help of appropriate sensors. Transferring the industrial input data from sensors into knowledge-based automatic execution without human intervention can be challenging. Kozłowski et al. [33] developed a new approach to determine the remaining useful life of machine tools at an early stage and to classify the conditions of the machine tools. It utilized a support vector machine for classifying the machine tool conditions. Autoregressive integrated moving average-based identification was also employed to act as an expert during normal operation.
Remote sensing image captioning is about producing natural semantic descriptions of remote sensing images. Shen et al. [34] developed a two-stage multi-task learning model for accomplishing this task. The proposed transformer generated text describing the image from its spatial and semantic attributes. The sentence descriptions were further improved with reinforcement learning.
To further enhance the prediction capability of individual machine learning methods, ensemble modeling is often employed for applications involving both forecasting and classification. Many experimental results support that ensemble modeling can improve on the forecasting capability of the individual models in the whole system (Opitz and Maclin [35]). Although previous research supports that an ensemble can perform better than its individual components, it has also been highlighted that the ensemble works better if its individual components are chosen carefully to have high prediction accuracy.
A simple example of an ensemble model is the combination of individual machine learning methods with linear weightings. Studies by Maqsood et al. [36] supported that, with such an ensemble model, the system could achieve better prediction capability than its individual algorithms.
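A linear-weighted ensemble can be sketched in Python as follows. The example is deliberately contrived: the two hypothetical models have errors of opposite sign, so averaging them cancels the errors exactly, which real models will only do partially. The function names and numbers are assumptions for illustration.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model prediction lists with linear weights (normalized to sum to 1)."""
    total = sum(weights)
    norm = [w / total for w in weights]
    return [sum(w * p for w, p in zip(norm, row)) for row in zip(*predictions)]


def mse(pred, target):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Hypothetical forecasts: model A overestimates by 1, model B underestimates by 1.
target = [10.0, 20.0, 30.0]
model_a = [11.0, 21.0, 31.0]
model_b = [9.0, 19.0, 29.0]

combined = weighted_ensemble([model_a, model_b], weights=[1.0, 1.0])
errors = {
    "model_a": mse(model_a, target),
    "model_b": mse(model_b, target),
    "ensemble": mse(combined, target),
}
```

The weights would normally be tuned on validation data, for example in proportion to the accuracy of each component model.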
Abubakr et al. [37] proposed a classification method to monitor tool condition failures in machine operations. The input data are the signals obtained by sensors monitoring the current, vibration and acoustic emission. The random forest method was employed for feature reduction. The authors illustrated that an ensemble of individual methods can further improve performance and that the approach has potential applications in tool condition monitoring.
At AT&T Bell Laboratories, Vapnik and colleagues, Boser and Guyon, initiated the study of support vector machine (SVM) algorithms [38]. The development of SVM algorithms has clearly focused on industrial applications (Smola and Scholkopf [39]). Support vector machines for classification (SVC) and support vector regression (SVR) are the two main types of SVM algorithms. The mathematical properties of SVM algorithms are found to be robust (Brown et al. [40]). The advantages include, for example, sparseness of the solution, flexibility for large feature spaces, and outlier handling capabilities. With these mathematical properties, SVM algorithms can handle even large datasets well. Structural risk minimization, grounded in statistical learning theory, is the foundation of SVM algorithms. Minimization of the empirical risk while preventing the "over-fitting" problem can be achieved in the structural risk minimization process. SVM algorithms map low-dimensional input data into a much higher-dimensional feature space through a kernel function. Quadratic programming is often employed for solving the global optimization problem involved.
A wireless sensor system was developed by Liu et al. [41] to monitor water quality in real time. The system can handle the problem of data transmission delay well, with robust comparability for water quality forecasting. A wireless sensor network with a ZigBee protocol was employed to detect the quality of the water in the basin with the help of various indicators such as the amounts of nitrogen and phosphorus in the water. SVM algorithms were deployed in this system for automatic detection of water quality.
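The kernel mapping used by SVM algorithms can be illustrated with a small Python sketch. It assumes the homogeneous degree-2 polynomial kernel on two-dimensional inputs, for which the explicit feature map is known in closed form; the function names and sample points are hypothetical. The point of the sketch is that the kernel value equals the inner product in the higher-dimensional feature space without that space ever being constructed.

```python
from math import sqrt


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def feature_map(x):
    """Explicit map of a 2-D input into 3-D: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, sqrt(2.0) * x1 * x2, x2 * x2)


# Hypothetical sample points.
x = (1.0, 2.0)
z = (3.0, 4.0)

implicit = poly_kernel(x, z)                    # computed in the 2-D input space
explicit = dot(feature_map(x), feature_map(z))  # computed in the 3-D feature space
```

Both routes give the same value up to floating-point rounding; for kernels such as the Gaussian kernel the feature space is infinite-dimensional, so only the implicit route is practical.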
Condition monitoring methods provide useful information such as the status of the monitored objects. This information can be helpful in preventing catastrophic failures. Gómez et al. [42] developed a system for dynamically monitoring the condition of railway axles. Wavelet Packet Transform energy and a support vector machine diagnosis model were deployed satisfactorily in their proposed system.
A hybrid artificial intelligent system is a system that combines several artificial intelligence methods to work together toward a target. Individual methods such as neural networks, evolutionary computing, fuzzy logic, SVMs, Bayesian networks and statistical learning are often deployed to form hybrid systems such as hybrid multi-agent models, knowledge-based artificial neural networks, and hybrid optimization algorithms.
A common goal of hybrid artificial intelligent systems is to improve on the performance of the individual methods in the machine learning process. In [43], hybrid neural network regression models were combined with a fuzzy clustering technique, and clustering nonparametric regression models were developed. The neural network regression models worked iteratively with optimal fuzzy membership values for each object, with the goal of minimizing the total error of the neural network regression models. This hybrid system was shown to be able to cope with cases in which the individual methods, i.e., the K-means and Fuzzy C-means methods, could not perform satisfactorily.
Mustafa et al. [44] developed a hybrid artificial intelligent system for species recognition and early-stage herb disease detection using intelligent computer vision technologies and an electronic nose. The hybrid system employed fuzzy logic, naïve Bayes, an artificial neural network and the SVM algorithm to perform the tasks of species recognition and disease detection. The proposed hybrid technique, combining these machine learning approaches, has a recognition and detection rate of almost 99%.
In recent years, emerging technologies in the fields of cloud computing, robotic computing algorithms, wireless sensor networks and communication have helped to advance cloud robotics in smart cities. Kumaran et al. [45] developed a cloud robotic system using hybrid artificial intelligent algorithms. The proposed system was shown to perform crowd control in smart cities satisfactorily. The integrated framework can direct the robots to move efficiently to accomplish various tasks.
Swarm intelligence is useful for finding optimal solutions in optimization problems. The homing behavior of pigeons has been one inspiration for the development of swarm intelligence algorithms. Sun et al. [46] proposed a hybrid algorithm combining a transformation technique, an evolutionary computing technique and a swarm intelligence technique. This hybrid system can address the problem of becoming trapped in a local optimum during the optimization process. The system was further integrated into the Distance Vector-Hop algorithm to locate the nodes of wireless sensor networks.
A synthetic summary of the advances in supervised machine learning for sensing and condition monitoring is presented in Table 1.

Unsupervised Machine Learning (Clustering)
Clustering algorithms focus on searching for similarities among the feature vectors of the data and then grouping similar vectors [47]. Clustering is also known as unsupervised pattern learning, whereas supervised pattern learning requires a training dataset. Supervised learning such as classification needs the information obtained from the training dataset to guide the learning process. Clustering algorithms are deployed in many real-world applications in engineering and science.
Clustering algorithms address how to arrange the data points into clusters optimally. This combinatorial task is found to be NP-hard. The need to solve this combinatorial problem efficiently has led to the development of various clustering algorithms, whose common goal is to restrict the total number of cluster combinations to be investigated. Hierarchical clustering, partition clustering and spectral clustering are the three main types of clustering methods. Different clustering approaches may produce different clusters, and the nature of the problem may provide some guidance on which clustering approach to choose.
Determining the similarity between two feature vectors of the data is a challenging task. Employing a suitable similarity measure is essential for clustering algorithms. The procedure for clustering the vectors, based on the chosen measure, is the next issue in designing clustering algorithms. Different cluster outputs are often obtained with different clustering measures and procedures. For solving real-world tasks efficiently and accurately, opinions from experts may be very helpful.
Agglomerative clustering and divisive clustering are the two types of hierarchical clustering algorithms. Agglomerative clustering algorithms use a bottom-up procedure. First, each data object is regarded as an individual cluster. Then, the data objects are iteratively merged into larger clusters. Divisive clustering algorithms employ a top-down procedure. First, the whole set of data objects is treated as a single cluster. Then, the large clusters are iteratively divided into smaller ones. Co-clustering algorithms focus on clustering both the data objects and their features.
Centralized entities, for example the cloud or the edge, can enable automated decision making in fields such as the Internet of Things when fed with data from several sensors. Nevertheless, malicious outliers among the data obtained by the sensors may affect this automation process. Shukla and Sengupta [48] developed a scalable outlier detection algorithm based on hierarchical clustering together with an artificial neural network. In this system, the hierarchical clustering algorithm ensures the scalability of the outlier detection algorithm across correlated sensors, while an artificial neural network works together with statistical methods for detecting outliers in the time series obtained by the sensors.
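The bottom-up agglomerative procedure can be sketched in Python as follows. This minimal version assumes one-dimensional readings and single linkage (the distance between two clusters is the distance between their closest members); the function name and the sample readings are hypothetical, and the brute-force search is meant for clarity rather than efficiency.

```python
def agglomerative(points, n_clusters):
    """Bottom-up single-linkage clustering of 1-D points into n_clusters groups."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        best = None                   # (distance, i, j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical sensor readings forming three visually obvious groups.
readings = [1.0, 1.2, 5.0, 5.1, 9.0]
groups = agglomerative(readings, n_clusters=3)
```

Recording the sequence of merges, instead of stopping at a fixed number of clusters, would yield the full hierarchy that is usually drawn as a dendrogram.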
A biosensor platform can be used for the detection of drug contaminants in hormone drugs and antibiotics. M13 bacteriophage-based colorimetric sensors are able to detect extremely small amounts of target molecules, although further work is needed to enhance their capability to group the target molecules. Kim et al. [49] proposed a statistical approach to classify the types of target molecules with high computational performance even for very large datasets. The proposed method can analyze the pattern of color change caused by reactions between the sensors and foreign materials. A hierarchical clustering algorithm is employed for separating the target materials.
A common property of partition clustering algorithms is that all the clusters can be estimated at one time. Well-known partition clustering algorithms include the k-means and fuzzy c-means clustering algorithms. The k-means algorithm starts with k randomly generated clusters. The center of each cluster is computed as the average of all the data points within that cluster. Subsequently, each point is allocated to its closest cluster center, and the centers of the new clusters are evaluated again. These steps are iterated until a pre-defined criterion is satisfied. Fuzzy clustering algorithms utilize the concept of fuzzy logic. A data point no longer needs to belong to a single cluster. Instead, it can belong to several clusters to different degrees. This is the main difference between the fuzzy c-means and the k-means algorithms.
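The k-means iteration and the fuzzy memberships can be sketched in Python as follows. The sketch assumes Euclidean distance, random initial centers drawn from the data (or centers supplied through the hypothetical `init` argument), unchanged assignments as the stopping criterion, and the usual fuzzy c-means membership formula with fuzzifier m; all names and data are illustrative.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100, seed=0, init=None):
    """k-means on a list of tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its closest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # stopping criterion: assignments unchanged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:           # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means degrees of membership of one point; they sum to 1."""
    d2 = [sq_dist(point, c) for c in centers]
    if 0.0 in d2:                 # the point coincides with at least one center
        hits = d2.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in d2]
    inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
    total = sum(inv)
    return [v / total for v in inv]


# Hypothetical 2-D data with two well-separated groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(data, k=2, init=[(0.0, 0.0), (10.0, 10.0)])
degrees = fuzzy_memberships((2.0, 0.0), [(0.0, 0.0), (10.0, 0.0)])
```

Whereas `labels` assigns every point to exactly one cluster, `degrees` expresses that a point between two centers belongs mostly, but not exclusively, to the nearer one.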
Clustering algorithms have been employed in applications involving wireless sensor networks. A common difficulty is that the clustering process may be trapped in local minima. This can result in inaccurate cluster partitions. Kotary and Nanda [50] developed hybrid clustering techniques combined with evolutionary computing such that the global optima may be obtained. For the monitoring of outliers, a weight system was proposed, based on the volume and density of the data points. In this case, outliers are the ones with larger weights.
There are many difficulties in the deployment of wireless underwater sensor networks (WUSNs), such as the high loss of transmission power in the data transfer process. Clustering may address this issue by grouping wireless sensors into clusters with a local base station one hop away. As the sensor nodes are then close to the local base station, the transmission power can be reduced significantly. Omeke et al. [51] proposed a novel k-means clustering scheme for local base station selection. It was found to prolong the lifetime of WUSNs. The proposed algorithm can decide the optimal number of clusters in real time. The experimental results support that it can outperform traditional clustering algorithms by more than 90%.
A common property of spectral clustering algorithms is that they use the spectrum of the similarity matrix of the data to reduce the dimension before clustering. The Shi-Malik algorithm is a spectral clustering algorithm that is popular in image segmentation.
Gao and Shi [52] developed a novel clustering algorithm to monitor the behavior patterns of ship handling. Ship information is obtained from an array of sensors and is then fed into the identification system with trajectory data. The sliding window algorithm was employed to extract information from the data given by this sensor system. The trajectories were divided into sub-trajectories, and a spectral clustering algorithm was utilized to cluster the sub-trajectories in order to discover behavior patterns. This method can help in understanding the behavior patterns during ship handling. The proposed method can also increase the efficiency of the learning process for ship route planning, collision avoidance decision making, etc.
Sensors for hyperspectral imaging (HSI) can acquire data over a wide spectrum of wavelengths. Yet, HSI classification can be a challenging task because of the high-dimensional feature space. Sellami et al. [53] developed a new HSI classification method that combines a spectral technique with a deep neural network. The redundancy between spectral groups was addressed with an unsupervised selection algorithm. Spectral-spatial features were extracted from the different groups of selected bands to improve the classification accuracy. A 3D CNN model was applied to associate and fuse each group with the target to further enhance the classification accuracy.
A synthetic summary of the advances in unsupervised machine learning for sensing and condition monitoring is presented in Table 2.

Deep Learning
Deep learning (DL) is a machine learning (ML) framework developed from traditional neural networks since approximately 2006. Deep learning is based on large deep neural networks (DNNs), which can be referred to as neural networks with a deep structure (Zhao and Zheng et al. [54]). Deep models have outperformed conventional techniques in recent decades and are now a common tool for data representation (Yuan and Shen et al. [55]).
The main advantage of deep learning over traditional ML is the automatic identification of features, learned through a general-purpose learning procedure (LeCun and Bengio et al. [56]). Classic machine learning methods starting from raw data require pre-processing work, based on the identification and selection of feature vectors or of a suitable internal representation to be provided as input to the neural network. This data preparation work is intensive and time-consuming and must be carried out by expert engineers. In contrast, raw data can be supplied directly to a deep neural network and, through the composition of a number of levels consisting of simple but non-linear modules, progressive transformations are obtained with gradually increasing levels of abstraction, making it possible to learn even very complex functions. In deep learning for classification purposes, the aspects of the input that are crucial for discrimination are amplified by the higher levels of representation, while the irrelevant ones are suppressed.
High-level and abstract features are automatically extracted from a large variety and quantity of data, captured from various sources. Typical feature extraction methods are unable to obtain similar results (Samaras and Diamantidou et al. [57]). Hinton and Salakhutdinov [58] first demonstrated this superiority. As clearly expressed by LeCun and Bengio et al. [56], deep learning methods are, in short, representation-learning methods with multiple levels of representation. The representation in each layer is computed from the representation in the previous layer; the computation is based on internal parameters, which are updated through a back-propagation algorithm. With multiple non-linear layers, even an intricate structure in a large dataset can be discovered. Multi-layer learning allows very high performance in complex function approximation, image, video, speech or audio processing, classification problems and multi-sensor data aggregation, with extraordinary results in many fields such as speech, visual object or signal recognition, natural language processing, face, object or pedestrian detection, human activity identification, fault diagnosis, drug discovery, genomics, multi-task and transfer learning, domain adaptation, etc.
Input data quality is fundamental to the good functioning of deep learning. Consequently, the tools that enable data acquisition also play a key role. Depending on the field of application and the purpose, the nature of the data and of the sensors/acquisition devices is different. The use of data from sensors/sources of different nature is increasingly frequent, and therefore data fusion, which merges complementary information such as spatial-temporal-spectral resolution data, often occurs. DL models extract abstract features from multiple input streams and can establish robust relationships between dissimilar input signals, not influenced by sensor type and spatial scale. Moreover, DL is robust even in cases of missing or corrupted sensor data.
Starting from the classical NNs, such as the back-propagation feedforward NN (BPNN) or the radial basis type known as a generalized regression neural network (GRNN), different DL architectures have been developed to address different kinds of problems. The mainstream deep neural models are the deep belief network (DBN), the convolutional neural network (CNN), the autoencoder (AE), the recurrent NN (RNN) and the long short-term memory network (LSTMN). In the following, these architectures are discussed with reference to some specific applications, focusing on the input data and the types of sensors adopted for dataset creation.
Among the most commonly used deep learning model in recognition and detection tasks is the CNN. In order to extract features which are resistant to distortion, CNNs use interconnected network architectures.
A convolutional neural network (CNN) is a deep learning method that can take images as input, assign weights (importance) to objects in the images and classify them. For simple applications, a 1D convolutional neural network may be used. More sophisticated classification models, such as CNN-Net, Encoded-Net and CNN-LSTM, have more complicated architectures, with denser layers and larger kernel sizes than a 1D CNN. Medical care benefits from the automatic prediction of routine human activity. For the purpose of recognizing human activity, Mukherjee et al. [59] created the EnsemConvNet ensemble, which combines the CNN-Net, Encoded-Net and CNN-LSTM classification models. Each model accepts time-series data as a 2D matrix, and the classification result of EnsemConvNet is obtained by combining the individual classifiers using techniques including majority voting, the sum rule, the product rule and score fusion. The reported results indicate that the EnsemConvNet model outperforms long short-term memory, multi-headed CNN and hybrid CNN models.
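The fusion rules named above can be sketched in a few lines. This is a generic illustration of majority voting, the sum rule and the product rule over per-class scores, not the EnsemConvNet implementation; the three score vectors are invented for the example.

```python
import math
from collections import Counter

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def majority_vote(score_lists):
    """Each classifier votes for its top class; the most frequent vote wins."""
    votes = [argmax(scores) for scores in score_lists]
    return Counter(votes).most_common(1)[0][0]

def sum_rule(score_lists):
    """Add the per-class scores of all classifiers, then pick the largest."""
    return argmax([sum(col) for col in zip(*score_lists)])

def product_rule(score_lists):
    """Multiply the per-class scores of all classifiers, then pick the largest."""
    return argmax([math.prod(col) for col in zip(*score_lists)])

# Hypothetical class-probability outputs of three classifiers (3 classes).
scores = [
    [0.9, 0.1, 0.0],   # very confident in class 0
    [0.4, 0.6, 0.0],   # mildly prefers class 1
    [0.4, 0.6, 0.0],   # mildly prefers class 1
]
by_vote = majority_vote(scores)
by_sum = sum_rule(scores)
by_product = product_rule(scores)
```

In this example majority voting selects class 1, while the sum and product rules select class 0, because they take the confidence of each classifier into account rather than only its top choice.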
Input data are structured in multiple arrays, which may have different dimensions depending on the signal type: 1D for language-based signals and sequences, 2D for pictures or audio spectrograms, and 3D for video or volumetric images. A CNN is a feedforward network, formed in its early stages by a sequence of convolution, pooling and non-linear activation layers, and in its final stages by fully connected layers. In convolutional layers, units are organized in feature maps, and a filtering operation is applied that amounts to a discrete convolution, in order to detect local conjunctions of features from the preceding layer; hence the name given to these layers and, more generally, to the deep model.
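The filtering operation of a convolutional layer can be sketched for the 1D case as follows. The sketch assumes no padding and a stride of one, and the signal and the two-tap kernel are illustrative values rather than learned weights.

```python
def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal (no padding, stride 1).

    As in most deep learning libraries, the kernel is not flipped,
    so strictly speaking this is a cross-correlation.
    """
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [0, 0, 1, 1, 1, 0, 0]   # a step up followed by a step down
edge_kernel = [-1, 1]            # responds to changes between neighbours
feature_map = conv1d_valid(signal, edge_kernel)
# feature_map == [0, 1, 0, 0, -1, 0]
```

The feature map is non-zero only where the signal changes, which is the sense in which a filter detects a local pattern in the preceding layer.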
Simple features such as texture, lines and edges are often extracted by the bottom convolution layers, whereas more abstract features are typically extracted by the top layers (Chen and Li et al. [60]). Pooling layers combine semantically related features into a single feature, resulting in more robust feature descriptions as well as down-sampling and dimensionality reduction. Pooling and normalization operations include max-pooling, average-pooling, L2-pooling and local contrast normalization. To improve the CNN's capacity to fit non-linear data, the units of the activation layers apply non-linear functions such as the rectified linear unit (ReLU) or the sigmoid. Fully connected layers are located at the end of the network, closest to the output, and serve the purpose of classification. As in BPNNs, the back-propagation algorithm is employed to update the weights.
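A minimal sketch of the activation and pooling steps described above is given below. It assumes non-overlapping 2 x 2 pooling windows on a small 2D feature map; the numerical values are arbitrary.

```python
def relu(feature_map):
    """Rectified linear unit applied element-wise."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def mean(values):
    return sum(values) / len(values)

def pool2d(feature_map, size=2, op=max):
    """Non-overlapping pooling: reduce each size x size window with `op`."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[op([feature_map[r + dr][c + dc]
                 for dr in range(size) for dc in range(size)])
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]

fmap = [[1.0, -2.0, 3.0, 0.0],
        [-1.0, 5.0, -3.0, 2.0],
        [0.0, 1.0, -4.0, -1.0],
        [2.0, -2.0, 1.0, 4.0]]

activated = relu(fmap)
max_pooled = pool2d(activated, op=max)    # [[5.0, 3.0], [2.0, 4.0]]
avg_pooled = pool2d(activated, op=mean)   # [[1.5, 1.25], [0.75, 1.25]]
```

Both pooling variants reduce the 4 x 4 map to 2 x 2, which is the down-sampling effect mentioned in the text.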
Numerous articles discuss the use of CNNs in a wide range of disciplines, and a significant number of these works emphasize the importance of sensor selection and sensor-related concerns.
For image recognition, cameras are adopted. The widespread use of cell phones equipped with high-resolution cameras makes available a huge amount of data that can be used for various applications. The combination of mobile phones and deep learning is a promising solution in many fields.
In a framework for indoor localization, Ashraf et al. [61] presented a deep learning-based convolutional neural network (CNN) approach that uses smartphone photos. The CNN is used to distinguish between floors, recognize indoor scenes in a variety of lighting conditions and improve indoor localization precision. The classification is based on camera images captured at pre-defined collection points using the rear camera of a Samsung Galaxy S8. For scene recognition, the CNN has a prediction accuracy of 91.04%. The identified scene is then used to further reduce the search space in a geomagnetic database used for localization.
Chen and Cao et al. [62] use image processing and a CNN-based technique to determine UV intensity. They created a wearable UV sensor out of PDMS and a photochromic material, which changes color when exposed to UV radiation. Images of the sensor taken with a cell phone were used to construct the dataset. When a CNN was trained to measure UV intensity, the influence of ambient light was considerably diminished, yielding an identification rate of more than 90% under various ambient light conditions.
Yang et al. [63] presented an interpretable fuzzy fusion method to combine the outputs of CNN models, which assesses the relevance of each classifier by looking at the interaction index between classifiers. Additionally, SoftPool and the Mish activation function were added to conventional CNNs to improve their capacity for feature extraction. An experimentally collected dataset and an artificially generated bearing fault dataset are used to evaluate the performance of the suggested model and assess its capacity to extract features.
DL is extensively adopted in the health monitoring of industrial systems, as DL-based fault diagnosis methods achieve better results than traditional ML methods. Bearing fault detection, classification and localization is a problem on which CNNs have performed very well (Waziralilah and Fathiah et al. [64]).
To address the problem of bearing fault diagnostics with multi-sensor data, Niu and Liu et al. [65] proposed a deep residual convolutional neural network (DR-CNN) whose input data are gray-scale pictures obtained from multiple 3-axis accelerometers. The performance of a plain CNN degrades once the network depth reaches a particular level. In a residual network, certain connections skip some of the layers of the CNN structure, making it easy for parameter gradients to propagate from the output layer to the lower levels.
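The skip connection described above can be sketched as follows. This is a generic residual block, y = ReLU(F(x) + x), and not the DR-CNN architecture of [65]; for brevity the residual branch F uses two small linear maps as a stand-in for convolutional layers, and all weights are illustrative.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weights):
    """Matrix-vector product; stands in for a convolutional layer here."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), where F is linear -> ReLU -> linear."""
    fx = linear(relu(linear(x, w1)), w2)
    return relu([f + xi for f, xi in zip(fx, x)])

zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With an inactive residual branch, the input still passes through unchanged.
passthrough = residual_block([1.0, 2.0], zeros, zeros)      # [1.0, 2.0]
# With an active branch, the block adds a correction to its input.
corrected = residual_block([1.0, 2.0], identity, identity)  # [2.0, 4.0]
```

Because the input is added back to the branch output, the identity path remains available regardless of the branch weights, which is what eases gradient flow in deep networks.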
Another area in which accelerometer signals are widely adopted is the biomedical field, for example in human activity recognition (HAR). Kulchyk and Etemad [66] apply a deep CNN to HAR using a publicly available dataset (Ugulino and Cardador et al. [67]), which contains raw data from four tri-axial wearable accelerometers. The suggested approach is evaluated against conventional classifiers, such as decision trees, random forest, support vector machines (SVM) and k-nearest neighbors (kNN). A classification accuracy of 100% is reported, with the added advantage of eliminating the need for pre-processing.
Sun and Zhao [68] propose a combination of a CNN and a deep convolutional generative adversarial network (DCG), denoted DCG-CNN, for gas sensor condition monitoring to prevent faults; in the specific case, the gas is hydrogen. A DCG combines a CNN with a generative adversarial network (GAN), whose purpose is to produce new data samples with the same statistical properties as the available data, thereby improving fault detection accuracy when the data samples are imbalanced. The method consists of the following steps: the DCG approach is used to construct synthetic 2D grayscale images of sensor fault signals from 1D hydrogen sensor fault signals; the experimental and synthetic signals are mixed to balance the training dataset; and a CNN is trained and evaluated using the entire dataset.
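Two of the data-handling steps listed above, folding a 1D signal into a 2D grayscale image and topping up a minority class with synthetic samples, can be sketched as below. The generator itself is not implemented, and the min-max scaling, the row-by-row folding and the function names are assumptions made for illustration; the mapping actually used in [68] is not specified here.

```python
def signal_to_gray_image(signal, width):
    """Min-max scale a 1D signal to 0..255 and fold it row by row into 2D."""
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0          # avoid division by zero for flat signals
    pixels = [round(255 * (s - lo) / span) for s in signal]
    usable = len(pixels) - len(pixels) % width   # drop an incomplete last row
    return [pixels[i:i + width] for i in range(0, usable, width)]

def balance(real_minority, synthetic, target_size):
    """Add synthetic samples until the minority class reaches target_size."""
    needed = max(0, target_size - len(real_minority))
    return real_minority + synthetic[:needed]

# Hypothetical 1D fault signal with nine samples, folded into a 3 x 3 image.
image = signal_to_gray_image(
    [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0], width=3)

# Placeholder sample identifiers; real entries would be images.
fault_class = balance(["real_1", "real_2"],
                      ["synth_1", "synth_2", "synth_3"], target_size=4)
```

In a full pipeline the synthetic entries would be produced by the trained generator before the balanced set is passed to the CNN.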
Deep learning has shown great promise for wireless sensing tasks. There are, however, problems with labor-intensive training, which involves gathering training samples and retraining already trained systems. Wang and Gao et al. [69] investigated the feasibility of using deep learning networks to complete wireless sensing tasks with less training effort. Deep generative adversarial networks (DGAN) were used to provide virtual training samples for the suggested deep learning-based wireless sensing system. A case study on wireless gesture recognition established its efficacy.
Remote sensing is another important area of DL application, with an exponential rise in papers published on the subject in recent years (Zhu and Tuia et al. [70]). Supervised CNNs give optimal performances in the direct classification of hyperspectral images in the spectral domain, as obtained by Hu and Huang et al. [71]. The adopted CNN model has only one convolutional level, since authors verified that the typical CNN with two convolutional layers is actually not applicable for hyperspectral data. According to experimental results based on multiple hyperspectral image datasets, the suggested method produced high classification performance.
CNNs analyze image-based patterns and are ineffective at modeling temporally ordered events. RNNs, on the other hand, are particularly well suited to modeling temporal changes in data. Given sequential inputs, they perform temporal analysis of events in time-sequence applications such as language and speech recognition. An RNN processes one input element at a time and stores the history of the sequence in a state vector held in its hidden units. Back-propagation can be used to train an RNN because the outputs of the hidden neurons at different time steps can be treated as the outputs of different neurons in a deep multilayer network.
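The state-vector update described above can be sketched with the standard recurrence h_t = tanh(W_xh x_t + W_hh h_(t-1) + b_h). The weights below are hand-picked for a network with one input and one hidden unit and are not taken from any of the cited works.

```python
import math

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    return [math.tanh(sum(w * x for w, x in zip(w_xh[i], x_t))
                      + sum(w * h for w, h in zip(w_hh[i], h_prev))
                      + b_h[i])
            for i in range(len(b_h))]

def run_rnn(sequence, h0, w_xh, w_hh, b_h):
    """Process one input element at a time, carrying the state forward."""
    states, h = [], h0
    for x_t in sequence:
        h = rnn_step(x_t, h, w_xh, w_hh, b_h)
        states.append(h)
    return states

# One pulse followed by silence: the state keeps a fading trace of the pulse.
states = run_rnn([[1.0], [0.0], [0.0]], h0=[0.0],
                 w_xh=[[1.0]], w_hh=[[0.5]], b_h=[0.0])
```

The hidden state stays positive after the input has returned to zero, which illustrates how the history of the sequence is retained.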
Uddin and Mehedi et al. [72] use a deep RNN in the field of HAR to identify human behaviors (such as sitting, standing and walking) from data collected by wearable body sensors. In the study, two publicly available datasets, MHEALTH (mobile health) (Banos and Villalonga et al. [73]) and PUC-Rio, are used, as well as the AReM (activity recognition system based on multi-sensor data fusion) dataset gathered by the authors. The suggested method is based on data fusion from many wearable sensors, including an electrocardiogram (ECG), an accelerometer and a magnetometer. Features are then extracted using kernel principal component analysis (KPCA), and a deep RNN is applied to recognize behavior.
Controlling HMI devices or artificial limbs frequently involves the detection and classification of human movements. According to Wang and Chen et al. [74], an RNN is a promising decoder for classifying hand movements based on the combination of complicated time-series EMG signals and acceleration data.
Remote sensing also uses RNNs. Arefin and Michalski et al. [75] developed a super-resolution method based on an RNN architecture to produce a high-resolution image from a succession of low-resolution satellite photographs.
In order to learn long-term dependencies, Hochreiter and Schmidhuber [76] modified the recurrent neural network (RNN) and created the long short-term memory (LSTM). The LSTM employs a self-feeding loop in its inner layers that can learn time-based correlations, combining knowledge from previous inputs with the analysis of present inputs. Combining a CNN with an LSTM allows both spatial and temporal information to be extracted from data.
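The self-feeding loop of an LSTM can be sketched with a scalar cell. The sketch assumes the commonly used variant with input, forget and output gates (the forget gate is a later addition and is not part of the original formulation in [76]); the parameter values are chosen by hand so that the cell retains its content.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """Scalar LSTM cell; p maps a gate name to (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = p[gate]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new content to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # proposed new content
    c = f * c_prev + i * g            # self-feeding loop on the cell state
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters: keep the memory, ignore new input, expose the state.
params = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
          "output": (0.0, 0.0, 10.0), "candidate": (1.0, 0.0, 0.0)}

h, c = 0.0, 1.0
for _ in range(50):
    h, c = lstm_step(0.0, h, c, params)
```

After 50 steps the cell state is still close to its initial value of 1.0, whereas the state of a plain recurrent unit with a recurrent weight below one would have decayed towards zero.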
Bilgera et al. [77] used a CNN-LSTM for gas source localization (GSL) in an outdoor environment using a network of stationary sensors. Thirty commercially available metal oxide (MOX) gas sensors and one ultrasonic anemometer were used in the investigation, and the data from the gas sensor array were arranged as a series of monochrome images, turning GSL into a visual learning problem.
According to Nagrecha et al. [78], a deep CNN-LSTM provides reliable findings for predicting air pollution in the field of earth environmental monitoring. Ground-based pollution sensors are used in the solution, and the sensor data are recast into a modified pseudo-image to enable the usage of deep 1D CNN and LSTM.
Xia and Huang et al. [79] applied an LSTM-CNN model to a HAR problem, using inertial sensor data from a smartphone worn by the user to identify activities of daily life such as standing, walking, walking downstairs and walking upstairs. The model consists of a pre-processing phase, a two-layer LSTM to extract temporal features, two convolutional layers with a max-pooling layer to extract spatial features, a global average pooling (GAP) layer, a batch normalization (BN) layer, and an output layer with a Softmax classifier that produces a probability distribution over the classes. The Softmax classifier generalizes logistic regression to multi-class classification problems and is trained by minimizing a cost function. Three open datasets (UCI, WISDM and OPPORTUNITY) were used for testing, with overall accuracies of 95.78%, 95.85% and 92.63%, respectively.
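The last stages of such a model, global average pooling followed by a Softmax output and its cost, can be sketched as below. The sketch omits the batch normalization layer and any learned output weights, so the pooled values are used directly as class scores; the feature maps are invented for the example.

```python
import math

def global_average_pool(feature_maps):
    """Collapse each feature map (a list of activations) to its mean."""
    return [sum(fm) / len(fm) for fm in feature_maps]

def softmax(logits):
    """Probability distribution over classes (shifted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Cost minimized when training a Softmax classifier."""
    return -math.log(probs[true_class])

feature_maps = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
pooled = global_average_pool(feature_maps)   # [2.0, 0.0, 2.0]
probs = softmax(pooled)
loss_if_class_0 = cross_entropy(probs, 0)
loss_if_class_1 = cross_entropy(probs, 1)
```

The cost is lower when the true class is one to which the classifier assigns a high probability, which is what training by cost minimization exploits.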
The autoencoder (AE) is a generative deep learning model that converts high-dimensional data into low-dimensional feature vectors, using copies of the training data as the target output, which reduces the complexity of calculation. The AE is an unsupervised method for learning data codings, since it uses a feature learning paradigm that directly learns a parametric map from inputs to their representation (Ma and Sun et al. [80]; Lei et al. [81]). An AE has two parts: the encoder, a feature-extracting function, and the decoder, which maps the feature space back into the input space. Together, the encoder and decoder consist of an input layer, an output layer and numerous hidden layers in between.
A back-propagation technique is employed to adjust the encoder and decoder parameters (the weights of the hidden layers) so as to decrease the reconstruction error, which measures the discrepancy between the inputs and their reconstructions over the whole training dataset. The deep AE provides a data-driven method for learning feature extraction, lessening the over-reliance on manually designed features that is prevalent in conventional machine learning techniques. Several established AE variants are used in different fields.
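The reconstruction error can be made concrete with a minimal sketch. It assumes a purely linear encoder and decoder with fixed, hand-picked weights and uses the mean squared error as the measure of discrepancy; training by back-propagation is not shown.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def encode(x, w_enc):
    """Map the input to a lower-dimensional feature vector."""
    return matvec(w_enc, x)

def decode(z, w_dec):
    """Map the feature vector back into the input space."""
    return matvec(w_dec, z)

def reconstruction_error(dataset, w_enc, w_dec):
    """Mean squared error between inputs and their reconstructions."""
    total = 0.0
    for x in dataset:
        x_hat = decode(encode(x, w_enc), w_dec)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return total / len(dataset)

# Hypothetical 3 -> 2 -> 3 autoencoder that keeps the first two coordinates.
w_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

err_in_plane = reconstruction_error([[1.0, 2.0, 0.0], [3.0, 4.0, 0.0]], w_enc, w_dec)
err_off_plane = reconstruction_error([[1.0, 2.0, 3.0]], w_enc, w_dec)
```

Data that lie in the two-dimensional subspace captured by the code are reconstructed exactly (error 0.0), while the component outside it is lost (error 3.0); training would adjust the weights to reduce this error over the training set.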
A deep coupling AE (DCAE) model is created by connecting the hidden representations of two single AEs. The DCAE is used to capture the joint information in multimodal data. Ma et al. [80] developed a DCAE with the objective of discovering a joint feature between vibration and acoustic data in order to categorize the health state of gears and bearings. To combine multimodal signals obtained from several sensors, the model uses a deep learning approach based on the coupling AE (CAE). This technique merges multimodal data fusion and feature learning into a single step. Furthermore, by learning the high-level features through greedy layer-wise training, the resulting deep architecture can effectively extract correlations between vibration and acoustic data.
To increase precision and decrease over-fitting in HAR with smartphone-embedded accelerometer data, Alo and Nweke et al. [82] present a deep learning model based on a deep sparse AE. The sparse AE is an unsupervised DL technique that learns an overcomplete feature representation from the raw sensor data by adding a sparsity term to the loss function and setting some of the active units to zero. Thanks to the sparsity term, the model can learn feature representations that are stable, linearly separable, and invariant to displacement, distortion and change. These properties of the sparse AE guarantee effective low-dimensional feature extraction from the high-dimensional structure of the input sensor data. Additionally, a complex activity recognition framework is compactly represented.
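A sparsity term of the kind mentioned above can be sketched as follows. The sketch assumes the widely used Kullback-Leibler penalty between a target activation level rho and the mean activation of each hidden unit; whether [82] uses exactly this form is not stated here, and the activation values are invented.

```python
import math

def kl_sparsity_penalty(hidden_activations, rho=0.05):
    """Sum over hidden units of KL(rho || rho_hat_j).

    hidden_activations: one list of unit activations in (0, 1) per sample.
    rho_hat_j is the mean activation of unit j over the samples.
    """
    n = len(hidden_activations)
    penalty = 0.0
    for unit in zip(*hidden_activations):
        rho_hat = min(max(sum(unit) / n, 1e-8), 1 - 1e-8)   # keep log() finite
        penalty += (rho * math.log(rho / rho_hat)
                    + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))
    return penalty

sparse_code = [[0.05, 0.05], [0.05, 0.05]]   # units rarely active
dense_code = [[0.9, 0.8], [0.7, 0.9]]        # units almost always active

penalty_sparse = kl_sparsity_penalty(sparse_code)
penalty_dense = kl_sparsity_penalty(dense_code)
```

The penalty vanishes when the mean activations match the target and grows as the hidden units become more active, so adding it to the reconstruction loss pushes the learned code towards sparsity.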
The model and the training algorithm of the deep belief network (DBN) were proposed by Hinton and Osindero et al. [83]. DBNs have a hierarchical structure of numerous stacked restricted Boltzmann machines (RBMs), which are trained with a greedy layer-by-layer learning approach and followed by fine-tuning. Every RBM has a visible layer and a hidden layer, each containing a particular number of neurons. The two layers of an RBM are connected to each other, but the units within a layer are not. With this structure, the values of the hidden neurons can be updated using matrix operations, which speeds up training and makes the approach suitable for online prediction. The features learned by one layer are taken as the input of the following layer. A Softmax classifier is used in the final layer to produce the classification result, for example to label each pixel, and to fine-tune the network's parameters. RBMs are an efficient way to extract features for initializing a feedforward neural network, and they greatly enhance the network's generalization performance.
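Because units within a layer are not connected, all hidden units of an RBM can be updated at once from the visible layer, and vice versa. The sketch below shows these conditional probabilities and a single contrastive-divergence (CD-1) weight update. As a simplification, it propagates probabilities instead of sampled binary states and leaves the biases unchanged; the sizes, the learning rate and the zero initialization are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(v, W, c):
    """p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W[i][j])."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    """p(v_i = 1 | h) = sigmoid(b_i + sum_j W[i][j] h_j)."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def cd1_update(v0, W, b, c, lr=0.1):
    """One CD-1 step on the weights (probabilities used in place of samples)."""
    h0 = hidden_probs(v0, W, c)
    v1 = visible_probs(h0, W, b)      # reconstruction of the visible layer
    h1 = hidden_probs(v1, W, c)
    return [[W[i][j] + lr * (v0[i] * h0[j] - v1[i] * h1[j])
             for j in range(len(c))]
            for i in range(len(b))]

W = [[0.0, 0.0], [0.0, 0.0]]          # 2 visible x 2 hidden units
b, c = [0.0, 0.0], [0.0, 0.0]
W_new = cd1_update([1.0, 0.0], W, b, c)
```

After the update, weights attached to the active visible unit increase and those attached to the inactive one decrease, which makes the training vector more probable under the model.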
DBNs alleviate the problems of local optima and long training times that arise from the random initialization of the weight parameters. The convergence time is substantially shorter because, after pre-training, only a local search of the parameter space is required.
Chen and Jin et al. [84] employed a DBN to forecast tool wear in a high-speed CNC milling machine, using information gathered from numerous sensors (a three-component dynamometer, piezo-accelerometers and an acoustic emission sensor). The study found that the DBN performed well in terms of speed, accuracy and stability.
Zhong et al. [85] created a DBN-based technique for multivariate optical sensors and hyperspectral image classification. Their main contribution is the development of a stack of RBMs, for which they modified both the RBM and the learning algorithm. During the pre-training and fine-tuning procedures, the data are processed in small batches to minimize the loss function on the validation dataset. Deep features that model several ground-truth classes are extracted from the hyperspectral images. Experiments show the effectiveness of this generative feature learning for a spatial classifier (SC) or a combined spectral-SC (JSSC), with state-of-the-art performance on hyperspectral image classification.
Deep learning has also been extended to non-Euclidean data under the name of "geometric deep learning". Some current DNN topologies can be seen as graph neural networks (GNN). In the computer vision domain, for example, CNNs can be thought of as GNNs applied to graphs that are structured as regular grids of pixels. A GNN allows the processing of data represented in graph domains, e.g., chemical compounds, images and subsets of the web [86]. Graphs can be cyclic, directed, undirected, or a mixture of these. Social networks, molecular biology, chemistry, citation networks, forecasting of environmental conditions and physics are a few relevant application domains for GNNs.
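The view of an image as a graph can be sketched as follows. The example builds a 4-neighbour pixel grid and performs one message-passing step in which every node averages its own feature with those of its neighbours. Learned weights and non-linearities, which a real GNN layer would include, are omitted, and the function names are illustrative.

```python
def grid_edges(rows, cols):
    """Undirected edges of a 4-neighbour pixel grid, nodes numbered row-major."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:
                edges.append((node, node + 1))      # right neighbour
            if r + 1 < rows:
                edges.append((node, node + cols))   # neighbour below
    return edges

def mean_aggregate(features, edges):
    """One message-passing step: average each node with its neighbours."""
    n, dim = len(features), len(features[0])
    neighbours = {i: {i} for i in range(n)}         # include a self-loop
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return [[sum(features[j][d] for j in neighbours[i]) / len(neighbours[i])
             for d in range(dim)]
            for i in range(n)]

# A 2 x 2 "image" with a single bright pixel in the top-left corner.
pixels = [[1.0], [0.0], [0.0], [0.0]]
edges = grid_edges(2, 2)
smoothed = mean_aggregate(pixels, edges)
```

On a regular grid this aggregation behaves like a small averaging filter, which is the link between convolution on images and message passing on graphs; the same function works unchanged on irregular graphs.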
In [87], Jiao et al. merged a graph convolutional network (GCN) with a modified LSTM RNN to create a group solar irradiance neural network (GSINN) that captures the graph features of groups of photovoltaic (PV) panels. The role of the LSTM RNN is to capture the temporal correlations. Meteorological data from 17 silicon radiometers of the U.S. National Renewable Energy Laboratory [88] are used to conduct a thorough study. The test results demonstrate the higher performance of the suggested GSINN in terms of universality, dependability and accuracy when compared to existing prediction systems.
Shi and Rajkumar [89] applied a GNN to 3D object detection in a point cloud obtained through Lidar sensors. They propose a single-stage detection method in which a graph is constructed from the point cloud and a GNN with auto-registration refines the vertex features by aggregating features along the edges. The NN outputs (multiple bounding boxes) are then merged into one, and a confidence score is assigned.
CNN and GCN have been applied by Zhang et al. [90] to extract discriminative features from RNA sequences. They developed a method in which a two-layer CNN and a GCN operate in parallel to extract the hidden features, followed by a fully connected layer that predicts RNA-binding proteins, with the aim of dissecting the essential mechanism of gene regulation. The use of the spectral GCN in RNA sequence analysis suggests that GCNs are useful for extracting relevant characteristics from RNA sequences.
ML and DL models have been effectively used to address the intrusion detection challenge for wireless sensor networks. A redundancy identification system based on a convolutional DBN (CDBN) and a performance evaluation strategy was created by Wen and Shang et al. [91]. The improved method deals with the issue of unknown or insufficient prior samples by using unsupervised learning to extract characteristics from examples of both normal and abnormal behavior. To raise the execution efficiency of the CDBN, a knowledge contraction method was created. This mechanism can optimize the feature datasets and produce a useful classification sample space, improving the classification accuracy of intrusion detection.
Over the last decade, we have witnessed a large number of publications on ML and DL applications in science and engineering. Since both shallow and deep learning models are built as black-box models by ML and DL algorithms, respectively, an interpretation mechanism should be applied to explain or describe the results of ML or DL models and make them more transparent. An early work on explainable ML models can be found in [92]. However, there is a tradeoff between the complexity of deep learning models and the simplification needed to make them interpretable or understandable by humans. In this respect, the authors in [93] developed gradient-weighted class activation mapping (Grad-CAM) for generating "visual explanations" for decisions from a broad class of CNN-based models.
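The Grad-CAM computation can be sketched as follows: each feature map of a convolutional layer is weighted by the spatial average of the gradient of the class score with respect to that map, the weighted maps are summed, and a ReLU keeps only the regions with a positive influence. In practice the gradients come from an automatic differentiation framework; here the activations and gradients are supplied by hand as small illustrative arrays.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from feature maps and class-score gradients.

    activations[k][i][j]: value of feature map k at position (i, j).
    gradients[k][i][j]: d(class score) / d(activations[k][i][j]).
    """
    heatmap = None
    for fmap, grad in zip(activations, gradients):
        cells = [g for row in grad for g in row]
        alpha = sum(cells) / len(cells)          # importance weight of map k
        weighted = [[alpha * a for a in row] for row in fmap]
        if heatmap is None:
            heatmap = weighted
        else:
            heatmap = [[h + w for h, w in zip(h_row, w_row)]
                       for h_row, w_row in zip(heatmap, weighted)]
    return [[max(0.0, v) for v in row] for row in heatmap]   # final ReLU

# Two hypothetical 2 x 2 feature maps: the first supports the class,
# the second counts against it.
activations = [[[1.0, 0.0], [0.0, 0.0]],
               [[0.0, 0.0], [0.0, 1.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]],
             [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(activations, gradients)   # [[1.0, 0.0], [0.0, 0.0]]
```

The resulting heatmap highlights only the top-left position, where a feature map with a positive weight is active; up-sampled to the input size, such a map is what is overlaid on the image as a visual explanation.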
Recently, the authors in [94] provided a comprehensive review of the existing machine learning interpretability methods. Four main kinds of interpretability approaches were specifically examined: those for developing white-box models, those for explaining complex black-box models, those for promoting fairness and preventing prejudice, and, finally, those for measuring the sensitivity of model predictions.
A synthetic summary of the deep learning advances for sensing and condition monitoring is presented in Table 3.

Brief Conclusions
This article has provided an overview of the impact of machine learning techniques on real-time condition monitoring and sensing technologies. More specifically, various learning algorithms have been analyzed with respect to the accuracy and computational complexity challenges that arise in sensor data processing. Different machine learning and deep learning models have then been presented from an application point of view. Some important and yet challenging research topics are machine learning for complex sensing networks, integrated machine learning hardware for soft sensing applications, the interpretation of machine learning models, and a wider range of applications.