Data Mining Algorithms for Smart Cities: A Bibliometric Analysis

Smart cities connect people and places using innovative technologies such as Data Mining (DM), Machine Learning (ML), big data, and the Internet of Things (IoT). This paper presents a bibliometric analysis to provide a comprehensive overview of studies associated with DM technologies used in smart city applications. The study aims to identify the main DM techniques used in the context of smart cities and how the research field of DM for smart cities evolves over time. We adopted both qualitative and quantitative methods to explore the topic. We used the Scopus database to find relevant articles published in scientific journals. This study covers 197 articles published over the period from 2013 to 2021. For the bibliometric analysis, we used the Bibliometrix library, developed in R. Our findings show that there is a wide range of DM technologies used in every layer of a smart city project. Several ML algorithms, supervised or unsupervised, are adopted for operating the instrumentation, middleware, and application layers. The bibliometric analysis shows that DM for smart cities is a fast-growing scientific field. Scientists from all over the world show a great interest in researching and collaborating on this interdisciplinary scientific field.


Introduction
The last few decades have witnessed an unprecedented trend of people moving to live in urban areas, as cities accelerate time by compressing space [1]. The year 2008 was a milestone when, for the first time, the urban population surpassed that of rural areas; it is foreseen that by 2050, two-thirds of the global population will be metropolitan inhabitants [2,3]. This trend of people moving to cities causes immense pressure on city infrastructures [4]. Ever since the first cities were built, they have been dependent on technology to sustain life and produce ever more sophisticated technologies and tools [5]. Cities are like dynamic living organisms, and they constantly evolve [6].
The smart city is an innovation of the physical city with high integration of advanced monitoring, sensing, communication, and control technologies, aiming to provide real-time, interactive, and intelligent services to citizens [7]. In order to improve the city services, it is very important to infer patterns and analyze citizen behavior [8] because a smart city must focus on the needs of its inhabitants [9].
A city is a complex system to operate, and new methods are required to manage it and use the massive amounts of data it generates [10]. City administrations can gain knowledge that is hidden in large-scale data to provide better urban governance and management by applying Information and Communication Technologies (ICT) solutions. Such ICT solutions enable better transport planning, efficient water management, new energy efficiency strategies, improved waste management, and effective risk management policies for the city users.
Moreover, other important aspects of urban life, such as public health, air quality and pollution, and public security, can also benefit from these ICT solutions [11][12][13][14][15]. With the fast development of ICT and ubiquitous mobile computing, large quantities of digital traces that register individual human activities along both spatial and temporal axes have become available. The goal of a smart city is to provide citizens with a promising quality of life by using technology to improve the efficiency of services [47] and address inhabitants' demands [48] by optimizing its resources [49]. A smart city is conceived as an idealistic city, where the quality of life for citizens is significantly improved by combining ICT, new services, and new urban infrastructures [50].
Del Casino [51] considers contemporary cities as repositories of images, movements, flows, and representations. It is important to emphasize that what ultimately makes the city "smart" is not simply data availability but the process of "closing the loop" consisting of sensing, communicating, decision making, and actuating [52]. D'Aniello et al. [53] consider the smart city as an adaptive system that can achieve two goals: support decision making and enrich the city domain knowledge. They represent the smart city operation in three phases, shown in Figure 1. During the first phase, real-time data streams are generated by hard and soft sensing. These data should be processed in phase 2 to gain useful information able to support decision making. During the third phase, knowledge is transformed into actions in the city. A smart city rests on people, technology, and processes and how they connect with various domains, including healthcare, education, transportation, telecom, tourism, utilities, public safety, and buildings. The characteristics of a smart city could be summarized to include six elements: Smart People, Smart Economy, Smart Mobility, Smart Living, Smart Governance, and Smart Environment (Figure 2) [50,54,55]. Li et al. [29] state that economic development, environmental issues, social equity, and sustainability for a smart city can be achieved only if all these aspects are equally considered.
The smart city is built on the infrastructure of the digital city. A digital city provides a 3D geospatial framework for cities, while the IoT embedded in the ubiquitous sensor network realizes real-time sensing, measuring, and data transmitting of still or moving objects [29]. The interaction of computational and physical components, specifically the use of cyber-physical systems, led to the advancement of such integration [30].
Williamson [56] uses the concept of "programmable city" as a prototypical code/space, where city functions are delegated to computational systems, which then renovate how they act. The term "code/space" articulates how spatial phenomena are translated into code that then acts recursively to alter them. The digital and physical cities are linked by IoT, thereby forming an integrated cyber-physical space. In this space, the state and changes in the real world will be sensed automatically in real time [45]. In general, a digital city is a harmonious framework encompassing all kinds of information to ensure the city functions smoothly and orderly. Moreover, a digital city, as a considerable representation of our real world, contains all the natural, social, and economic information related to the physical city [29].
The overall architecture of a smart city has a hierarchical structure of three layers [1], as shown in Figure 3:

1. The "instrumentation" layer. The distributed sensor grid is embedded in infrastructure for acquiring and transferring real-time environmental and social data. Data acquisition elements are responsible for collecting and locally storing external data. They can capture any kind of information, including images, video, sound, temperature, humidity, pressure, etc. The network elements are used for data transferring and information routing between the distributed sensor layer and the service-oriented middleware layer. In other studies [29,30,33,45], this layer appears as two separate layers: a sensor layer and a network layer.

2. The service-oriented middleware layer takes charge of massive data storage, real-time analysis, and processing. It is based on cloud computing, DM, and highly efficient index services. The results can be used to support the decision making and effective operation of smart city applications.

3. The application layer for end users applies tailored intelligence services to different domains, and it is responsible for interacting directly with the user. It provides the user with information in a comprehensible manner, such as graphical forms, tables, or other presentation types, and facilitates interaction with the system.

This architecture projects the journey of big data from their raw origin until valuable information and knowledge are extracted for the benefit of decision makers and citizens [57].
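As a purely illustrative sketch of this layered flow, the following Python fragment passes a simulated sensor stream through instrumentation, middleware, and application stages; the function names, the simulated temperature stream, and the summary statistics are our own assumptions rather than elements of the cited architectures.

# Minimal, hypothetical sketch of the three-layer smart city pipeline.
import random
import statistics

def instrumentation_layer(n_readings=24):
    # Acquire raw environmental data (here: simulated hourly temperatures).
    return [20 + random.gauss(0, 3) for _ in range(n_readings)]

def middleware_layer(readings):
    # Store, analyze, and condense the raw stream into decision-ready information.
    return {"mean": statistics.mean(readings), "max": max(readings)}

def application_layer(summary):
    # Present the information to end users in a comprehensible form.
    print(f"Average temperature: {summary['mean']:.1f} C, peak: {summary['max']:.1f} C")

application_layer(middleware_layer(instrumentation_layer()))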

Data Mining Technologies for Smart Cities
In a smart city application, the production of analytics can lead to advanced insights and a better understanding of city phenomena, and it supports the design of evidence-based urban strategies and innovation [53,58]. Searching for interesting patterns and correlations [3] in the public-service facilities of developed cities using a DM approach has gradually become a significant area of research. The extracted patterns can be used to plan layouts or arrange new facilities in cities [59]. Advances in big DM technologies can support the exploration and discovery of environmental and societal changes, including how people go about their lives, their behavior and preferences, social trends, and public opinion [45,60]. DM and ML are vital technologies for data-centric applications for smart cities.
DM is a broad field that includes many algorithms and techniques from statistics to ML and information theory to extract information from data [57]. DM aims to build computer programs that extract hidden, previously unknown, and potentially valuable information from data [61]. The process must be automated or, more usually, semi-automated, and the regularities or patterns discovered must be meaningful in a practical sense [62]. Big DM needs to extend the entire process to the front and back end, under the characteristics of big data. This involves processing and analysis of massive and heterogeneous data, automatically discovering and extracting implicit, hidden patterns, rules, and knowledge, and visualizing them in an easily understandable form [45].
ML is the study of how to build computer programs that improve their performance at some tasks through experience [63], in order to address problems in which human expertise does not exist or is difficult to express [64]. In this framing, a program is said to learn from past experience E with respect to some task T and performance measure P if its performance on T, as measured by P, improves with experience E [62].
Essentially, ML is an application of AI that provides computers with the ability to learn from data and to derive insights that increase operational performance, without being explicitly programmed [65]. According to Din et al. [66], ML is classified into four categories: supervised learning, unsupervised learning, semi-supervised learning, and active learning [62]. Nef et al. [67] propose a typical ML pipeline (Figure 4) that starts with the raw data as input and uses clustering to further preprocess the data before the actual classification is performed. Finally, the computed results are displayed.

Data Preprocessing
Smart city applications consume data about both physical and virtual system entities. The prime goal of data collection is accurate and comprehensive data. These data, obtained from heterogeneous sources including sensors, social media (SM), and even manual submissions by users, are raw and noisy and require processing by applications to be transformed into usable information [68]. Feeding raw data into a training model will not produce a reasonable result, so data preprocessing is necessary for harnessing the data value.
In addition to dirty data, many kinds of data cannot be used directly for training, such as weather conditions, time, and so on [69]. To prepare the input for processing, the raw data must be enriched, cleansed, filtered, normalized, checked, and organized in a usable format [67,70]. Data fusion techniques combine multiple data sources to fix problematic data while improving data reliability, increasing data completeness, and extracting higher-level information from multiple data sources [71].
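As a minimal illustration of such preparation, the following sketch (assuming the scikit-learn library; the sensor readings are synthetic) imputes missing values and normalizes the features before they reach a learning model.

# Hypothetical preprocessing: fill gaps with column means, then scale to [0, 1].
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

raw = np.array([[21.0, 55.0], [np.nan, 60.0], [23.5, np.nan], [22.0, 58.0]])  # e.g., temperature, humidity
imputed = SimpleImputer(strategy="mean").fit_transform(raw)
scaled = MinMaxScaler().fit_transform(imputed)
print(scaled)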
In recent years, several techniques have been proposed and implemented for the improvement of data preparation methods [72]. There are thousands of features available in real-world problems. The main goal of feature selection/extraction is the identification of a feature subset that is more informative or predictive of a given response variable [73]. Bag of Words (BoW) is a standard feature descriptor. M. Sajjad et al. in the work of [74] proposed a face recognition framework that uses BoW with the Oriented FAST and Rotated BRIEF (ORB) feature extraction method.
Another frequently used method for data reduction is Principal Component Analysis (PCA). PCA consists of reducing attribute overlap. This can be used to identify whether some attributes are more relevant than others [75]. Applying dimensionality reduction approaches, such as PCA, mitigates the requirement for large training data sets and reduces the number of zeros in the feature matrix [76]. In order to model sensor observations, measurements, and sensing device information and characteristics, D'Aniello et al. [53] presented an ontological scheme named Semantic Sensor Network Ontology (SSNO). SSNO allows putting together data and metadata provided by a sensing device or a virtual sensor and representing them using interoperable statements that can be integrated with other additional domain knowledge.
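A minimal sketch of the PCA step discussed above follows (scikit-learn assumed; the feature matrix is synthetic and deliberately redundant), projecting the data onto its two principal components before training.

# Hypothetical dimensionality reduction with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                            # 200 samples, 10 features
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)   # introduce attribute overlap
X_reduced = PCA(n_components=2).fit_transform(X)          # keep the two directions of largest variance
print(X_reduced.shape)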

Machine Learning (ML) Algorithms
ML algorithms are capable of handling a large amount of data, as well as providing features such as frequent pattern mining [77], anomaly detection, predictive modeling, and optimization that can lead to better situational awareness, more efficient, dynamic, and adaptive control [78]. Depending on the characteristics of the specific application and on the requirements of the use case, some algorithms perform better than others [34]. Some ML algorithms have aimed at improving matching time and accuracy, e.g., DL for information retrieval and multimodal interaction [79].

Classification
Classification is perhaps the most familiar and most popular DM technique [80]. The goal of classification algorithms is to find functions and models in order to identify to which of a set of categories (sub-populations) a new observation belongs, based on a training set of data containing observations (or instances) whose category membership is known [58]. Classification is a paradigm of supervised learning. Supervised learning consists of two main phases: (1) the training or construction phase, in which the model learns from a given set of labeled samples, and (2) the classification phase, which outputs the label with the maximum likelihood for a given sample. It has been shown that classifier performance typically increases as the volume of training data grows [81].
Decision Trees (DT) are a widely used technique that has been effective for regression and classification. The DT predictive model represents observations in the branches and conclusions about the target value in the leaves. In the first stage (construction), the data set is broken down into smaller subsets while an associated DT is incrementally developed [77]. In the second stage, pruning is carried out inside the nodes, replacing a node, if necessary, with a regression plane [62]. DT is one of the most popular learning techniques. It can naturally handle data of mixed types and missing values, which occur in most data sets in a smart city application [82]. Some of the methods for DT construction include Classification and Regression Trees (CART), ID3, C4.5 (J48), and T3 [83][84][85][86][87].
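For illustration, a shallow decision tree can be trained as in the following sketch (scikit-learn, whose tree implementation is CART-style, is assumed; the data are synthetic).

# Hypothetical decision tree classification on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)   # limiting depth acts as a simple form of pruning
tree.fit(X_tr, y_tr)
print("accuracy:", tree.score(X_te, y_te))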
Random Forest (RF) is an ensemble learning model for classification and regression that operates by developing a batch of DTs at training time [75]. The output is the mean prediction (in regression problems) or the mode of the classes (in classification) of the individual trees. To make a prediction at a new point, random forests take the majority vote among the outputs of the grown trees in the ensemble. The number of trees in the ensemble is selected through cross-validation [52]. RF runs very efficiently both on scaled and non-scaled data [88,89]. It can handle data sets with unbalanced classes and generates strong predictive models, dealing well with the issue of overfitting. Two features make RF an attractive prediction technique: its ability to achieve high prediction accuracy and its ability to make use of desired features, such as daily electricity data consumed by various appliances. These two characteristics make RF a unique and desired model for analyzing smart city data [10].

Support Vector Machine (SVM) is a prominent supervised learning model that efficiently performs data classification and thereby finds applications in many real-world scenarios, such as disease diagnosis [77], activity recognition [85], anomaly detection [34], text classification [90], face recognition [74], electricity consumption prediction [91], and finding available parking slots [92]. The SVM training algorithm seeks a separating hyperplane in the feature space that maximizes the distance of each input data point from the hyperplane. The minimum overall distance is called the margin [93]. Often, in real-world data, the sample points are not linearly separable. For this reason, the original space is mapped through a kernel function into a higher-dimensional space, where presumably linear separation can be achieved [94]. Furthermore, a hyperplane that better separates most of the data is preferred, even if it ignores a few misclassified samples [52]. To bridge the gap between ideal assumptions and realistic constraints, Shen et al. [81] proposed the secure SVM, a privacy-preserving SVM training scheme over blockchain-based encrypted IoT data. When the classification involves more than two classes, multi-class SVM can be used, which classifies the output into more classes. SVM was initially proposed mainly for binary classification but was later extended to multi-class classification through a variety of methods, two of which are most commonly used: (1) reducing the problem to multiple binary classification problems [47] and (2) one versus one [95]. Garcia-Font et al. in the work of [34] used One-Class SVM (OC-SVM), a special case of semi-supervised SVMs that does not require labeled data. OC-SVMs build a frontier to classify new samples as normal or outliers.
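The following sketch contrasts the two models on the same synthetic task (scikit-learn assumed; scaling is applied before the SVM because kernel methods are distance-based).

# Hypothetical comparison of a Random Forest and an RBF-kernel SVM.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=800, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)   # majority vote over many trees
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)        # kernel maps data to a higher-dimensional space
print("RF:", rf.score(X_te, y_te), "SVM:", svm.score(X_te, y_te))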
A Bayesian Network (BN) is a concise representation of the direct dependencies between a set of statistical variables formed in a directed acyclic graph and a set of node probability tables [10]. In data modeling, a BN creates implicit assumptions about dependencies between variables, though, in the real world, two variables are theoretically never truly and fully independent. A Naïve Bayes (NB) classifier is technically a special case of BN, using Bayes' theorem in a naive way since it assumes every predictor variable is conditionally independent on the class (i.e., attribute) label [96]. Hence, an NB is a simple stochastic classifier based on applying Bayes' theorem with strong independence assumptions [10]. Brisimi et al. in the work of [93] deployed the Likelihood Ratio Test (LRT), which is an NB classifier and assumes that individual features (elements) of a feature vector x = (x1, . . . , xn) are independent random variables.
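A minimal Naive Bayes sketch (scikit-learn assumed; synthetic data) shows the conditional-independence assumption in practice: the model outputs class posteriors via Bayes' theorem.

# Hypothetical Gaussian Naive Bayes classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=6, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)
nb = GaussianNB().fit(X_tr, y_tr)
print("class posteriors of one sample:", nb.predict_proba(X_te[:1]))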
K Nearest Neighbors (KNN) is a simple regression model that estimates the output of new samples based on the average output of its k nearest neighbors. Nearest neighbors are found in the feature space [76]. To make a prediction, this technique aggregates the values of the K "closest" examples in the training set, where K is an input parameter [97]. To compute the distance among observations, we can use the Euclidean or Manhattan distance measures. KNN is a type of instance-based learning, or lazy learning, in which the function is only approximated locally, and all calculations are postponed until classification [62].
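The sketch below (scikit-learn assumed; the noisy sine data are synthetic) illustrates KNN regression: the prediction for a query point is the average of its five nearest training samples.

# Hypothetical KNN regression; no explicit model is built (lazy learning).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)
knn = KNeighborsRegressor(n_neighbors=5, metric="euclidean").fit(X, y)
print(knn.predict([[2.5]]))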
The Artificial Neural Network (ANN) is a very popular technique that relies on supervised learning [92,98]. An ANN is a strong, nonlinear modeling tool that imitates the operation of biological neurons. Training an ANN involves tuning the weights and biases of the network. The objective is to maximize the network prediction performance, which corresponds to minimizing the difference between the network outputs y_k and the desired outputs or targets t_k on validation data [94]. Using the Back Propagation (BP) algorithm [99], the NN can learn relevant statistical information from a suitable amount of training data, and the mathematical information learned can reflect the function mapping relation of the input-output data model [100]. An earlier ANN architecture is the Multilayer Perceptron (MLP), a Neural Network (NN) with a fully connected architecture. Generally, MLP performs well and has been applied widely. A quicker-to-train but more memory-intensive network is the Radial Basis Function (RBF) network [101], used by Olszewski et al. in the work of [84] for building a regression model that considers many variables. At present, many new architectures have been developed for ANN [102]. ANNs tend to overfit, which means they are trained to fit the noise trend without producing the suitable generalization expected from an ANN. Bayesian Regularized ANNs (BRANNs) try to overcome the overfitting problem by incorporating Bayes' modeling into the regularization scheme [91].

The Convolutional Neural Network (CNN) is now gaining considerable attention, leveraging its power in automatically learning the underlying patterns from the data. A basic structure of a CNN generally consists of two stages, namely the feature learning stage and the classifier training stage [103,104]. The structure of a CNN is more complicated than that of a traditional NN [100]. Recurrent Neural Networks (RNNs) are feedforward NNs with a recurrent loop. They are considered a powerful model for sequential data by using past history [105], and they are applied to a wide variety of problems involving time sequences of events and ordered data [92]. Long Short-Term Memory (LSTM) is a three-layer particular type of RNN proposed to identify a correlation between an input sequence and an output sequence. LSTM was used by Fenza et al. [105] for energy consumption prediction and anomaly detection in a smart grid application. To detect moving objects with maximum accuracy, the authors of [106] used a Counter-Propagation Artificial Neural Network (CP-ANN). The CP-ANN has a three-layer architecture where the first layer is an input layer, the second layer is a Kohonen layer, and the third layer is a Grossberg layer.
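Returning to the basic feedforward case, the following sketch trains an MLP with back-propagation (scikit-learn assumed; the data are synthetic, and the two hidden-layer sizes are arbitrary choices).

# Hypothetical multilayer perceptron trained with back-propagation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=4)
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=4))
mlp.fit(X_tr, y_tr)   # weights and biases are tuned to minimize the output error
print("accuracy:", mlp.score(X_te, y_te))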
While ANNs are capable of extracting and modeling the general behavior of the system, Fuzzy Logic (FL) modeling can be used for approximate reasoning and for modeling qualitative data and the uncertainties [103] that inherently appear in data, ensuring adaptable control even in the presence of noisy, imprecise data [107]. FL also assists in easily incorporating expert domain knowledge into the control system by means of human-interpretable linguistic rules. The authors of [108] used a firmly structured network, an Adaptive Neuro-Fuzzy Inference System (ANFIS), formed by the combination of ANN and fuzzy systems. In order to achieve near-optimal control, they used the Particle Swarm Optimization (PSO) method. Alternatively, Evolutionary Algorithms (EA) may be used for dynamically optimizing both ANN and FL techniques as well as classical control methods. As EAs are inspired by the process of natural selection, they provide the capability of optimizing complex real-world problems [109] and converging on near-optimal results when the search space is too large to be searched exhaustively [78].
Deep Learning (DL) derives from ANNs with many hidden perceptron layers that can help to identify hidden patterns. The basic idea of DL is to replicate what the human brain does in most cases [110]. The Deep Belief Network (DBN) was proposed in 2006 by Hinton and has been widely used since then. DL not only changes traditional ML methods but also affects our understanding of human perception [100]. DL has been applied to several classification and regression problems. Part of its success is due to automatic feature extraction at different levels of abstraction. This promotes the easy reuse of models on different domains without the intervention of a field specialist. Moreover, DL is capable of representing nonlinear, complex city data. DL models have been used to achieve state-of-the-art results in the field of computer vision and have also been applied to the problem of time series forecasting [111].
Reinforcement learning (RL) asks users who are already involved in sensing and actuation to reinforce and guide the system toward better accuracy and intuitive actuation [68]. The emerging Deep Reinforcement Learning (DRL) [112] can be considered as a promising technology, which takes a long-term goal into account and can generate optimal control actions to time-variant dynamic systems.
Shamshirband et al. [108] argue that ensemble and hybrid models are the future trends in ML. Ensemble meta-classifiers adopt classification techniques where multiple classifiers of a different or similar type are trained over the same or subsets of a training set [94]. Boosting is an ensemble supervised learning method that constructs a classifier as a linear combination of simpler weak classifiers. Brisimi et al. [52] used AdaBoost decision stumps as component classifiers. Moustafa et al. [113] applied three ML techniques, DT, NB, and ANN, to classify normal and attack records in their intrusion detection system. The techniques are implemented as an ensemble method with the AdaBoost mechanism, where each technique is considered a weak classifier whose individual performance is lower than that of the ensemble method. Other ensemble classifiers are bagging and random committees [62]. In order to build scalable learning systems, Zhang et al. [69] used XGBoost, a large-scale ML method. Hybrid ML models are shown to deliver higher performance in modeling and prediction due to their optimized algorithms for higher efficiency. Shamshirband et al. [108] proposed a hybrid model consisting of a neuro-fuzzy inference system and PSO.
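A boosting sketch follows (scikit-learn assumed; synthetic data): AdaBoost combines many weak learners, by default depth-1 decision stumps, into a stronger ensemble.

# Hypothetical AdaBoost ensemble; the default weak learner is a decision stump.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=5)
boosted = AdaBoostClassifier(n_estimators=100, random_state=5).fit(X_tr, y_tr)
print("ensemble accuracy:", boosted.score(X_te, y_te))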

Forecasting
Forecasting algorithms facilitate the process of making statements about events whose actual outcomes (typically) have not yet been observed [58]. For instance, Fernandez-Ares et al. [35] implemented a time series forecasting analysis with the adoption of three different techniques: the exponential smoothing state-space model (ETS), ARIMA, and Theta. They also tried a control method (Mean) but obtained worse results. Badii et al. [92] used the Auto-Regressive Integrated Moving Average (ARIMA) model as a forecasting method. The ARIMA model is composed of two parts: Auto-Regressive and Moving Average. The Auto-Regressive (AR) part creates the basis of the prediction and can be improved by Moving Average (MA) modeling of the errors made in previous time instants of the prediction. The order of ARIMA models is defined by the parameters (p; d; q), where p is the order of the AR model, d is the degree of differencing, and q is the order of the MA part, and by the corresponding seasonal counterparts (P; D; Q). ARIMA can be used for planning as well, as it poses the basis to be used as an instrument for early warning: that is, for detecting dysfunctions as unexpected patterns in the city users' behavior [8]. The Markov forecasting model can be applied to make governance decisions according to results generated by the ontology model, and a sequential evaluation model can also be built to assess the effect of governance decisions, which effectually optimizes the Markov forecasting model [44]. As there is a need for algorithms with low enough computational complexity to run on the devices found in the smart city infrastructure, Venkatesh et al. [68] leverage the Taylor Expanded Analog Forecasting Algorithm (TESLA), a statistical learning model that can be fully generalized, as the data translation algorithm in their context engine. Tse et al. employed the Granger Causality Test (GCT) in their study [42]. GCT is a statistical hypothesis test for determining whether one time series has causal relationships with another.
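As a minimal forecasting sketch (the statsmodels library is assumed; the daily series is synthetic, with a weekly cycle and a mild trend), an ARIMA(2, 1, 1) model is fitted and used to produce a one-week-ahead forecast.

# Hypothetical ARIMA(p, d, q) forecast on a synthetic daily series.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(6)
t = np.arange(200)
values = 10 + 0.05 * t + np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.5, size=200)
series = pd.Series(values, index=pd.date_range("2021-01-01", periods=200, freq="D"))
fitted = ARIMA(series, order=(2, 1, 1)).fit()   # p=2 AR terms, d=1 differencing, q=1 MA term
print(fitted.forecast(steps=7))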
One model that deals with numeric variables is Linear Regression (LinR), where a single linear formula represents the mapping from input to class values. Regression analysis aims to model the relationship between a dependent variable and one or more explanatory independent variables. Among linear regression models, Hashemi selected Least Squares (LS) in the work of [76], which fits a linear model to the data by minimizing the total squared error. Logistic Regression (LogR) is a linear, fairly simple classifier widely used in many classification applications. The basic idea is that for each instance, we model the posterior probability of the actionable class as a logistic function with parameters that weigh the features f and an offset β [52]. LogR is popular in the medical literature because it predicts the probability of a sample belonging to the positive class [93]. If data have a large number of features, we can also use a nonlinear regression technique that complements the linear regression method, Support Vector Regression (SVR) [92]. Dynamic Time Warping (DTW) is an algorithm for comparing two given (time-dependent) sequences that may vary in speed, applied to time series analysis of video, audio, and graphics data sequences [114].
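The following sketch (scikit-learn assumed; synthetic data) fits a least-squares linear regression to a numeric target and a logistic regression to a binary target derived from it.

# Hypothetical linear and logistic regression.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 3))
y_cont = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=300)   # numeric target
y_bin = (y_cont > 0).astype(int)                                     # binary target
lin = LinearRegression().fit(X, y_cont)     # least-squares fit of a single linear formula
log = LogisticRegression().fit(X, y_bin)    # posterior probability modeled as a logistic function
print(lin.coef_, log.predict_proba(X[:1]))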
Anomaly detection, also known as outlier detection or novelty detection, is an important research domain of data analysis. It aims to detect the few data patterns that do not conform to the expected data characteristics [115]. An anomaly can be defined as unusual or abnormal behavior [116]. To excavate the latent information of data and construct a compact data representation, Sun et al. proposed a generative Dictionary Learning Model (DLM) for anomaly detection [117], which jointly learns the latent representations and their recombination basis. Dictionary learning aims to learn a set of basis vectors that can encode all feature data into a compact form with a linear combination. In the application of anomaly detection, the maximum reconstruction error of test samples is calculated to discriminate whether a sample is abnormal. The ability of this method to reveal the underlying structure of data is the key to solving the problem of anomaly discrimination. Pena et al. [118] applied a rule-based system to energy efficiency indicators and historical data to detect anomalies in building energy consumption.

Unsupervised Learning
Clustering is an unsupervised learning technique aiming at partitioning a collection of data objects into groups (clusters), but unlike classification, the groups are not predefined [80]. The grouping is implemented in a way that intra-cluster dissimilarity is low (according to a set of metrics) and inter-cluster dissimilarity is high [58]. Clustering analysis has been extensively used in problems where there is little prior information available about the data [119], and in a wide range of applications, such as pattern recognition, market research, and image segmentation [120]. In general, the existing clustering algorithms can be broadly categorized into hierarchical clustering, partitioning clustering, model-based clustering, density-based clustering, grid-based clustering, and so on. Hierarchical Clustering (HC) techniques, used by Martinez-Espana et al. [62], define the cluster distance between two clusters to be the maximum distance between their individual components. At every stage of the clustering process, the two nearest clusters are merged into a new cluster. However, these methods are static, and data objects assigned to a cluster cannot move to a different cluster. Partitioning algorithms are dynamic, and data objects can move from one cluster to another. They can incorporate knowledge about the shape or size of clusters by using appropriate prototypes and distance measures [121].
K-means is a very popular [116] and simple partitioning clustering algorithm [105]. Because of its efficiency, the method is scalable in parallel and distributed systems for big data set handling [49]. K-means assigns n points to k clusters using distance as a similarity factor, iteratively updating the cluster centers until no point changes cluster membership. However, traditional K-means suffers from sensitivity to the initial selection of cluster centers and difficulty in specifying the number of clusters in advance [121]. Aydin et al. in the work of [31] used K-means for clustering two-dimensional GPS position data. In [8], two clustering techniques have been adopted. The first was a variant of K-means, Partitioning Around Medoids (PAM), which uses medoids, the most representative elements of a cluster, instead of the centroids used by K-means. The second approach is the model-based Expectation-Maximization (EM) algorithm. It is a generalization of K-means that uses an iterative process to find the maximum likelihood.
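A minimal K-means sketch (scikit-learn assumed; the two-dimensional points are synthetic stand-ins for, e.g., GPS coordinates) groups the data into four clusters by iteratively updating the cluster centers.

# Hypothetical K-means clustering of two-dimensional position data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=8)
km = KMeans(n_clusters=4, n_init=10, random_state=8).fit(X)
print(km.cluster_centers_)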
Arribas-Bel et al. in the work of [39] used two very distinct techniques for clustering: K-means and the more advanced Self-Organizing Map (SOM). Although their function in this context is the same (grouping observations based on attribute similarity), the underlying mechanics of the two algorithms differ substantially: while K-means tries to optimize an objective function that minimizes cluster variance, SOM employs an iterative approach in which a feedforward NN learns the properties of, or finds unknown relationships among, the set of variables that describe a problem, to later assign the original observations to output neurons. The main property of SOM is that it makes a nonlinear projection from a high-dimensional data space onto a regular, low-dimensional (usually 2D) grid of neurons. Fernandez-Ares et al. conducted a study for monitoring real traffic and mobility scenarios with the implementation of SOM [35].
Nesi et al. [122] adopted another technique for clustering, building a hierarchical tree, represented as a dendrogram. They consider hierarchical clustering as a suitable choice for their study in geographic localization since it does not a priori require the number of clusters.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is used in the work of [24,123]. Yang et al. state that all existing clustering methods share the difficulty of parameter setting [123]. DBSCAN can find arbitrarily shaped clusters with only one input parameter required, and it supports the user in determining an appropriate value for this parameter [120].
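For illustration, the sketch below (scikit-learn assumed; the "two moons" data are synthetic) recovers two non-convex clusters that centroid-based methods typically miss, labeling sparse points as noise.

# Hypothetical DBSCAN run; label -1 marks noise points.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=9)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))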
Pournaras et al. in the work of [124] illustrate a generic, unsupervised, and highly efficient collective learning algorithm designed to solve fully decentralized combinatorial optimization problems: I-EPOS, the Iterative Economic Planning and Optimized Selections. I-EPOS is applicable in the broader context of large-scale multi-agent systems.

Association Rules
Association Rules (AR) is one of the most popular methods [125] within the context of extracting relationships among items hidden within data sets [126], as it has been used in several smart city applications [44,77,127,128]. Agrawal and Srikant [129] presented the Apriori algorithm for discovering all significant AR between items in a large database of transactions. It exploits the downward-closure property of frequent itemsets, i.e., all non-empty subsets of a frequent itemset are also frequent [125]. It proceeds by identifying the frequent individual items in the database and extending them into larger itemsets as long as they appear sufficiently often in the database. The frequent itemsets can be used to determine AR that reflect general trends. Some AR could be discovered when applying DM techniques to a training set; then, the most meaningful ones could be extracted from these general AR by techniques such as Bayesian Nets [44]. Pattern growth is a method of frequent pattern mining that does not require candidate generation. It discovers the frequent single items, then compresses this information into a frequent pattern tree. Honavar and Sami [23] used PrefixSpan, which extends the pattern-growth approach to mine sequential patterns instead. Its general idea is to examine only the prefix subsequences and project only their corresponding postfix subsequences into projected databases.
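A minimal Apriori sketch follows, assuming the third-party mlxtend library and its commonly documented interface; the one-hot transaction table is invented purely for illustration.

# Hypothetical frequent-itemset and rule mining with Apriori (mlxtend assumed).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

transactions = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]],
    columns=["bus_stop", "bike_share", "parking"]).astype(bool)
frequent = apriori(transactions, min_support=0.4, use_colnames=True)   # keep itemsets that occur often enough
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])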

Spatial Mining
A digital city, as a considerable representation of our real world, contains all the natural, social, and economic information related to the physical real city. Particularly, it consists of 2D digital maps, 3D digital city models as the city grows both horizontally and vertically [130], 4D spatial-temporal databases, and points of interest [29]. Spatial data are required for many smart city applications, such as to infer spatial rain fields from streaming short text messages or to model spatially continuous fields, such as noise level, air temperature, or pollution [131]. Geographic Information Systems (GIS) are used to store information related to locations on the Earth's surface [80,132]. The typical descriptor of people flow analysis in the city is the Origin Destination (OD) matrix. It presents city zones on both axes, while the single element (at the intersection) contains the number of people (or the probability) of passing from the zone of origin to the zone of destination, in a given time window, for a given kind of user, for a given day of the week [8].
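As a small illustration of the OD-matrix data structure, the sketch below (pandas assumed; the trip records are invented) cross-tabulates origins and destinations into zone-to-zone trip counts.

# Hypothetical Origin-Destination (OD) matrix from invented trip records.
import pandas as pd

trips = pd.DataFrame({
    "origin":      ["A", "A", "B", "C", "B", "A"],
    "destination": ["B", "C", "C", "A", "A", "B"],
})
od_matrix = pd.crosstab(trips["origin"], trips["destination"])   # trips between zone pairs
print(od_matrix)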

Natural Language Processing
Natural Language Processing (NLP) is a core method for mining information from human language [133]. The textual information is properly preprocessed to remove noise by using techniques such as tokenization, normalization, stemming, and stop word removal [54]. NLP techniques include lexical acquisition, word sense disambiguation, and part-of-speech (POS) tagging [107]. NLP involves translating natural language into data (numbers) that a computer can use to learn about the world. This understanding of the world is sometimes used to generate natural language text that reflects that understanding [134]. Techniques for semantic analysis of textual content coming from social networks can provide very interesting findings and improve the understanding of psycho-social dynamics in a totally new way [89]. Qiu et al. [135] used two main techniques to mine knowledge from text, the Latent Dirichlet Allocation (LDA) algorithm and the Word2vec tool. The LDA algorithm considers each document as a mixture of a small number of latent topics, with each word contributing to one topic. Given all words appearing in each document of the document set, LDA can infer the implied topic distribution of documents [127]. Word2vec is a tool based on deep learning and was released by Google in 2013. This tool adopts two main model architectures, the Continuous Bag-of-Words (CBoW) model and the continuous skip-gram model.

Kaiser and Pozdnoukhov [131] generated a dictionary on the topic of interest based on a text corpus built from related sites, such as Wikipedia. The irrelevant tweets were generated based on some recent news from the New York Times. Twitter messages were then generated using a Markov chain trained on this corpus. Then, the incremental classification algorithm Projectron++ was trained using the bag-of-words feature representation and the known label used at the simulation stage to classify the stream into the topic of interest. Tse et al. [42] performed topic modeling on the whole data set using Non-negative Matrix Factorization (NMF). NMF is a multivariate analysis algorithm where an input matrix with non-negative elements is factorized into two other matrices with non-negative elements. Since NMF clusters the columns of the input matrix, it can be applied to topic modeling and document clustering. Costa et al. [133] used a multinomial NB classification model to determine the contextual severity of an event. In general, the NB algorithm has high recall and precision while classifying textual data, being one of the most suitable algorithms for this task. Nesi et al. used text mining and NLP algorithms for geographical annotation [122]. They consider this technique an important application for smart city frameworks, aiming at helping citizens by providing different services and useful information on publicly available Open Data (OD), including geographical information and spatial location of Places Of Interest (POI), real-time traffic, and parking structures, as well as any other kind of municipality resource that can be geolocated.
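To illustrate topic modeling of short city-related texts, the following sketch (scikit-learn assumed; the four documents are invented) builds a bag-of-words matrix and fits a two-topic LDA model.

# Hypothetical LDA topic modeling on a tiny invented corpus.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["traffic jam on main street", "heavy traffic and road works",
        "air quality alert downtown", "pollution and air quality sensors"]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                                   # bag-of-words representation
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for topic in lda.components_:                                        # print the top three words per latent topic
    print([terms[i] for i in topic.argsort()[-3:]])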

Sentiment Analysis
Sentiment Analysis (SA) infers the sentiment conveyed by a piece of text by relying on (external) lexical resources, which map each term to a categorical (positive, negative, neutral) sentiment score. As an example, terms such as wonderful, beautiful, and joy have a positive sentiment score, while terms such as fear and sadness have a negative one [90]. SA techniques are applied to almost every social domain because opinions are critical to almost all human behaviors [136]. Opinion mining and sentiment analysis methods can be applied to SM comments [137] to automatically identify issues that concern citizens, as well as features they liked [138]. The task is rather unwieldy because each word has to be treated separately [139]. The fuzzy nature of human emotion provides data that contain a vast amount of uncertainty. Fuzzy systems are capable of addressing this problem while achieving a suitable tradeoff between accuracy and performance [109]. The words that someone uses are not the only source of information for emotion recognition. Emotion recognition is closely related to Facial Expression Recognition (FER) [95]. FER-based systems comprise three steps. In the first step, the face is detected in a video stream and cropped as the region of interest for the next step. It is then resized into specified dimensions so that all the images given to the model are of the same size. In the next step, low-level features, high-level features, or both are extracted from the cropped region. In the last step, the features are classified using a classifier.
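A minimal lexicon-based scoring sketch follows; the tiny lexicon and example sentences are invented, and real systems rely on much larger external lexical resources.

# Hypothetical lexicon-based sentiment scoring.
LEXICON = {"wonderful": 1, "beautiful": 1, "joy": 1, "fear": -1, "sadness": -1, "dirty": -1}

def sentiment_score(text):
    # Sum the scores of known terms; > 0 positive, < 0 negative, 0 neutral.
    return sum(LEXICON.get(token.strip(".,!?"), 0) for token in text.lower().split())

print(sentiment_score("the new park is beautiful, a real joy"))   # positive
print(sentiment_score("fear of crime and dirty streets"))         # negative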

User Interface and Visualization
As Ploennigs et al. argue, IoT applications will only be successful if they are usable by everybody [65]. The operators and dwellers of smart cities are not data scientists, so they have difficulty applying ML techniques to their applications without experts [140]. Data analysis would be considerably simplified if one could visualize data graphically. Information visualization uses graphic techniques to help people understand and analyze data. Visual representations and interaction techniques take advantage of the human eye's broad-bandwidth pathway into the mind to allow users to see, explore, and understand large amounts of information at once [75]. The objective of the analysis and visualization of data is to highlight useful information and support decision making with the lowest degree of human intervention [140]. A frontend component should provide users with comprehensive charts that visualize the information [9]. Therefore, interfaces should enable users to intuitively understand the behavior of the system. The interfaces should be easy to use, responsive, mobile, and should abstract the complexity of the underlying processes. Once the data from the sensors have been retrieved, it may be helpful for the city operator to have tools available to create dashboards that show these data simply and intuitively [141]. Users should be able to interact with the smart city application through a smart city dashboard that represents the structure, real-time data, and key indexes on a city map [92]. The goal of the dashboard is to provide the user with a set of tools to visualize and handle the aggregated analysis results [90]. Thanks to tools that visualize data quickly, identifying correlations and conceiving innovative, unanticipated uses for existing information has become easier [38]. Additionally, the integration of speech interfaces shows great potential for interacting in natural language with the operators. In addition, augmented reality interfaces are paving new ways to seamlessly access sensor and system data [65].

Methodology
To explore the DM for smart cities research field, we review results from quantitative and qualitative methods, as follows:

• Quantitative method: a bibliometric analysis, and
• Qualitative method: a critical review of the 100 most cited articles.

Bibliometrics
As the number of publications continues to expand at increasing rates and publications develop fragmentarily, the task of accumulating knowledge becomes more complicated [142]. The term "bibliometrics" first appeared in the literature in 1969, when the relevant research area was defined as the "application of mathematical and statistical methods to books and other media of communication"; the term was quickly adopted and used, particularly in North America, while at almost the same time, the term "scientometrics" was widely used in Europe. Bibliometrics is applicable in the research sub-areas of methodology research, scientific disciplines, and science policy [143]. Bibliometric studies tend to examine statistically the quantitative aspects of scientific publications within a field [144]. Bibliometrics is often used as a measure of the quality of the work produced by individual scholarly contributions, venues of production, individual authors, groups and institutions, journals, etc. As a general indicator for measuring impact across the sciences, E. Garfield invented the calculation of the Impact Factor (IF), which remains, over half a century later, the "gold standard" for journals [145].

Bibliometric Analysis Software
For the bibliometric analysis, we used the Bibliometrix library, developed in R by Aria and Cuccurullo [142]. Bibliometrix is an efficient tool for quantitative research in bibliometrics, distributed under the GPL-3 license, performing bibliometric analysis and building networks for co-citation, coupling, scientific collaboration, and co-word analysis [146]. Bibliometrix offers a web application developed with the help of the Shiny library, named Biblioshiny. Biblioshiny supports scholars with data importing and filtering; analytics and plots for sources, authors, and documents; and analysis of the conceptual, intellectual, and social structure of the research topic.

Information Retrieval
We exported article information from the Scopus database. Roemer and Borchardt [145] consider Scopus as an authoritative and comprehensive online source for discovering citation-based connections between scholarly articles. Scopus, published by Elsevier, was established in 2004. We chose this source because it is the largest abstract and citation database since it indexes content from 24,600 active titles and 5000 publishers. Its main strength is that it offers the broadest coverage available for scientific, technical, medical, and social sciences [143].
Aiming to index journal articles in English containing the keywords "data mining" and "smart city" (or "smart cities") in their abstracts, titles, and keywords [147], we posed the following search query in the Scopus search form: TITLE-ABS-KEY ("data mining" AND ("smart city" OR "smart cities")) AND (LIMIT-TO (LANGUAGE, "English")) AND (LIMIT-TO (SRCTYPE, "j")).

Results
The query was run on 16 February 2021 and returned 197 records. The results were exported in BibTeX file format, which is compatible with the Bibliometrix library. In response to our query, Scopus returned 197 articles published within the period from January 2013 to February 2021, in 112 different journals, by 682 authors. There are 3.46 authors and 15.88 citations per document on average. Table 1 presents general information about our data. Table A1 in Appendix A presents the 100 most cited research documents on the DM for smart cities topic. The table is sorted by the number of total citations each article has received. When an article is cited, it can be considered to impact someone else's work. As the simplest and most straightforward of the citation-based metrics, the number of citations an article has received is the starting point for almost all bibliometric indicators, which have become more sophisticated since the advent of advanced computing modeling [145].

Annual Scientific Production
Andres in the work of [143] states that, according to Price's law, science grows in a multiplicative way over time and, according to this exponential function, the growth rate will be proportional to the population size, i.e., the bigger the population is, the faster it grows. Consequently, the number of publications also grows in a multiplicative way, as shown in Table 2. It can be assumed that the exponential growth described by Price's law fits within a logistic function, so this period of accelerated growth will be followed by a stabilization phase. In DM technologies for smart cities, we found an annual growth rate of 73.33% in the period between 2013 and 2020. The most productive year was 2019, when 54 articles were published. Articles published in 2015 and 2017 received the highest average number of total citations, but the latter achieved a higher score on the mean total citations per year index because of their fewer citable years.

Sources
The 10 most productive sources on the topic appear in Table 3. According to Bradford's law of scattering, "the bulk of articles on a given topic is concentrated in a small set of core journals and then scattered across other journals to such a degree that, if the set of relevant articles is subdivided into groups or zones containing the same number of items as the core, an exponentially increasing number of journals will be required to fill the succeeding zones" [148]. Thus, Bradford's law is widely used in bibliometrics to survey journal productivity. As observed in Figure 5, a small number of journals are the core sources of the total number of publications on the topic, whereas increasing numbers of journals publish fewer articles on the topic [143]. Figure 6 illustrates the growth of the top-5 related sources over the period from 2013 to 2021 (February).

The Impact Factor (IF) of a journal is "a citation measure of its average article's citation score over a relatively short period. It is computed for a given year through the division of the number of citations received in the processing years by the overall number of "citable" items (research articles, reviews, and notes) issued by the journal during the same years" [148]. For instance, when we say that a journal has an impact factor of three, we mean that in the last two years, this journal averaged three citations per published article [145]. With the IF, we can identify publications that have a strong impact during a given period. The journal IF has practical importance either for libraries to decide which journals to purchase or for authors to choose where to submit their articles. As a rule, journals with high IF are seen to be more prestigious [143]. De Bellis [148] considers that the higher the IF of the journals listed in the publication records of the units under assessment, the greater the candidate's chance of outperforming all the other applicants in a competition for promotion, tenure, or funding allocation. The Hirsch index, also known as the h-index, measures quantity and impact through a single indicator. The h-index is defined by Andres [143] as follows: "a scientist has index h if h of his or her N papers have at least h citations each and the other (N-h) papers have ≤h citations each". One of the most popular improvements of the h-index is the g-index, which considers the weight of citations received by the top articles of a scientist, and the total number of documents does not limit the value of the index [149]. The m-index is computed as the median of citations received by papers ranking smaller than or equal to the Hirsch core h [150]. Table 4 summarizes the impact measures of the 10 most relevant journals.
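As a small worked example of the h-index definition quoted above, the following sketch (the citation counts are invented) computes the index from a list of per-paper citation counts.

# Hypothetical h-index computation from per-paper citation counts.
def h_index(citations):
    # h papers have at least h citations each; the rest have no more than h.
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

print(h_index([25, 8, 5, 3, 3, 1]))   # -> 3: three papers have at least 3 citations each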
The Hirsch index, also known as the h-index, measures quantity and impact through a single indicator. The h-index is defined by Andres [143] as follows: "a scientist has index h if h of his or her N papers have at least h citations each and the other (N-h) papers have ≤h citations each". One of the most popular improvements of the h-index is the g-index, which considers the weight of citations received by the top articles of a scientist, and the total number of documents does not limit the value of the index [149]. The m-index is computed as the median of citations received by papers ranking smaller than or equal to the Hirsch core h [150]. Table 4 summarizes the impact measures of the 10 most relevant journals.
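The definitions above translate directly into a few lines of code. The sketch below computes the h-index and a common simplified formulation of the g-index (largest g such that the top g papers jointly have at least g² citations, capped at the number of papers) from a list of per-paper citation counts; the counts are invented for illustration.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def g_index(citations):
    """Largest g such that the top g papers have, together, at least g**2 citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank ** 2:
            g = rank
    return g

# Invented citation counts for a hypothetical author.
cites = [25, 18, 12, 9, 7, 4, 3, 1, 0]
print(h_index(cites), g_index(cites))  # -> 5 8 for these invented counts
```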

Authors
The 10 most productive authors are shown in Table 5. As De Bellis states, there is a reasonably good correlation between the eminence of a scientist and his or her productivity of papers [148]. Figure 7 presents top author productivity over time. Lotka's law assesses patterns in author productivity. It is also known as the inverse square law of author productivity. The law predicts how many authors would have published x studies according to the number of authors who have contributed to a single study [143]. Lotka's law determines "the degree scientists of different caliber, involved in the struggle for life on the forefront of scientific communication, contributed to the advancement of knowledge" [148]. Figure 8 visualizes the frequency distribution of scientific productivity (Table 6) through Lotka's law.
Citation analysis is often used to obtain information about the impact and, more often, the quality of a publication, a source, or an author [143]. Published articles are read and assessed by the community of peers, who recognize their value by citing them in their studies. The bibliographic citation, therefore, has been considered an elementary building block of the scientific reward system [148]. Table 7 summarizes the impact measures of the 10 most productive authors. Across all the studies we examined, China was the most productive country, followed by the USA. The productivity of the 10 most relevant countries is presented in Table 8. Table 9 includes the 10 top-cited countries. Articles from China have received 473 citations in total, while articles from the USA have received 357 and Spain 341.

Content
The frequency of words in a text can be studied by Zipf's law, which has been considered a generalization of both Lotka's and Bradford's laws [143]. According to Zipf's law, at its simplest, the vast majority of text words appear only a few times, while a limited number are extremely frequent [148], as shown in Figure 9. As presented in Table 10, the most frequent terms that appeared as "keywords plus" in the articles are "data mining", "smart cities", "internet of things", and "big data". Keywords plus are words or phrases that frequently appear in the titles of an article's references but are not included in the title of the article itself, and they are used to augment the power of cited-reference searching [151]. At first, the text analysis we performed returned various terms relating to common concepts (such as "city"-"cities" or "Internet-of-Things (IoT)"-"Internet of Things"), so after retrieving the data set from the Scopus database, we had to do some preprocessing work, such as lemmatizing the words, ignoring numbers, and disregarding case sensitivity. Figure 10 shows the word cloud constructed from these terms. The word dynamics graph (Figure 11) can help us see the trends in a research topic, as it illustrates growth or decline for each keyword [152].
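A minimal sketch of the kind of keyword normalization described above (lower-casing, dropping numbers, folding near-duplicate variants onto one form); the alias map and example keywords are invented illustrations, not the preprocessing pipeline actually used in this study.

```python
import re

# Invented alias map folding near-duplicate keyword variants onto a single form.
ALIASES = {
    "cities": "city",
    "internet-of-things (iot)": "internet of things",
    "iot": "internet of things",
}

def normalize_keyword(raw):
    """Lower-case, strip digits, collapse whitespace, then fold known variants."""
    term = raw.lower().strip()
    term = re.sub(r"\d+", "", term)            # ignore numbers
    term = re.sub(r"\s+", " ", term).strip()   # collapse whitespace
    return ALIASES.get(term, term)

keywords = ["Cities", "Internet-of-Things (IoT)", "Smart Cities 2020", "Data Mining"]
print([normalize_keyword(k) for k in keywords])
# -> ['city', 'internet of things', 'smart cities', 'data mining']
```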
The word co-occurrences are a measure of the different poles of interest that subserve to build the structure and the dynamics of a scientific field. Two words, Wi and Wj, co-occur in the corpus if there is at least one document containing both Wi and Wj. The strength of the link between Wi and Wj is given by the number Cij of documents in which the pair (Wi, Wj) appears [148]. A word co-occurrence network (Figure 12) can be considered a concept map [144] that facilitates the understanding of the knowledge components and cognitive structure of a research field by examining the structure of the map [152]. Each node in the network represents a semantic concept, and the size of a node shows the frequency of each concept. The weight of the edge between two nodes represents the strength of the relationship between the concepts. When word co-occurrence analysis is used for mapping a research topic, clusters of keywords, as shown in Figure 13, and their links are obtained. These clusters can be considered as themes. In a theme, the keywords and their interconnections portray a network graph called a "thematic network" (Figure 14). Each theme obtained in this process is characterized by two parameters: centrality, which represents the relevance degree on the horizontal axis, and density, which represents the development degree on the vertical axis [152].
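To make the Cij definition concrete, the sketch below counts, for every keyword pair, the number of documents in which both keywords appear; the tiny document set is invented for illustration and is not the corpus analyzed in this study.

```python
from collections import Counter
from itertools import combinations

# Invented keyword sets, one set per document.
documents = [
    {"data mining", "smart cities", "internet of things"},
    {"smart cities", "big data", "internet of things"},
    {"data mining", "smart cities", "big data"},
]

# C_ij = number of documents containing both keyword i and keyword j.
cooccurrence = Counter()
for keywords in documents:
    for pair in combinations(sorted(keywords), 2):
        cooccurrence[pair] += 1

for (w_i, w_j), c_ij in cooccurrence.most_common(3):
    print(f"{w_i} -- {w_j}: {c_ij}")
```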

Social Structure
Co-citation analysis is a common analysis in bibliometrics. It employs citation counts as a measure of similarity between documents, authors, and journals [142]. As De Bellis states, "Co-citation analysis rests on the premise that if two documents are cited by a third document, it is likely that some kind of structural relationship between them does exist, the strength of the relationship depending on how many times they are co-cited in a given corpus of literature. The more two documents are co-cited, the more likely it is that their repeated co-link has something to say about the sociocognitive structure of the subject area to which the papers belong. Consequently, their position is represented by nearby points on the surface of the map" [148]. These networks aim to quantify interdependencies and scholarly influence among the entities at different levels of detail, including between authors, journals, subject categories, institutions, and countries [144]. Figure 15 illustrates the co-citation network of the selected set of articles in our study.
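A minimal sketch of how a co-citation network of the kind shown in Figure 15 can be assembled from reference lists: two references become linked whenever a document cites both, and the edge weight counts how often that happens. The reference lists here are invented, and this is not necessarily the exact procedure used to produce the figure.

```python
from itertools import combinations
import networkx as nx

# Invented reference lists of three citing documents.
reference_lists = [
    ["Smith2015", "Chen2017", "Lopez2018"],
    ["Smith2015", "Chen2017"],
    ["Chen2017", "Lopez2018", "Kumar2019"],
]

G = nx.Graph()
for refs in reference_lists:
    for a, b in combinations(sorted(set(refs)), 2):
        # Increase the co-citation weight each time a pair is cited together.
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

for a, b, data in G.edges(data=True):
    print(a, b, data["weight"])
```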

A scientific collaboration network (Figure 16) is one of the most well-documented forms of scientific collaboration. It is a network where the nodes represent the authors, and the links represent the collaboration between authors, namely co-authorships [142].
The size of each node shows the centrality of its role in a cluster, and the edge weight shows how strong the relationship between two authors is, based on the number of articles published in co-authorship [143]. Collaboration serves social network development and knowledge building. Collaboration network analysis is especially prevalent in interdisciplinary domains, where authors from different areas contribute to scientific development and progression. Such networks support interpreting the behavioral characteristics of scientists in multiple domains, presenting the phenomenon of knowledge flows [144].
Figure 17 shows the visual representation of the collaboration network between countries. We posed the analogous query considering a threshold of two or more collaborative papers for each country relationship. We can see that the strongest edge appears between China and the USA, which are the most productive countries in the research domain. The most productive countries appear in deep blue. The 10 strongest relationships between countries are presented in Table 11. As Waheed et al. [144] state, the number of international co-publications shared between two regions indicates the extent of collaboration between those regions, implying knowledge flows.
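A sketch of the thresholding step mentioned above: keep only country pairs with at least two joint papers, then use degree centrality as a simple proxy for a node's prominence in the resulting network. The collaboration counts are invented placeholders, and degree centrality is only one of several possible centrality choices.

```python
import networkx as nx

# Invented counts of co-authored papers between country pairs.
collaborations = {
    ("China", "USA"): 7,
    ("China", "Spain"): 3,
    ("USA", "India"): 2,
    ("Greece", "U.K."): 1,   # below the threshold, dropped
}

G = nx.Graph()
for (c1, c2), n_papers in collaborations.items():
    if n_papers >= 2:                      # threshold: two or more joint papers
        G.add_edge(c1, c2, weight=n_papers)

print(nx.degree_centrality(G))             # simple proxy for node prominence
```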

Conclusions
Constructing a smart city is a systematic process, conducted over time, step by step. The design and implementation of smart city applications are extremely complex tasks, and the choice and use of appropriate DM techniques and tools for the communication between the real and digital worlds play a crucial role in operational success. This study aimed to provide a comprehensive view of the research published in the literature on DM algorithms for smart cities, based on a bibliometric analysis of Scopus data from 2013 to February 2021.
The study has indicated that DM algorithms for smart cities form an evolving and fast-growing research field, as publications have grown multiplicatively over the last eight years. The topic has been most popular among researchers in China, the USA, India, Spain, the U.K., and Greece and has emerged as a fertile field for collaboration among researchers from different countries, especially between China and the USA. The research landscape was also explored by means of bibliometric analysis at various levels, including the investigation of prominent articles, sources, and authors.
The most frequent terms that appear in the articles are "data mining", "smart cities", "internet of things", and "big data". The word dynamics graph (Figure 11) showed that the growth in the appearances of the most frequent words follows the trend of the research topic. The word co-occurrence network, the topic dendrogram, and the thematic map (Figures 12-14) represent the conceptual structure of the research field, as they illustrate not only the most frequent terms of the articles but also the connections between them.
The critical review of the selected articles highlighted the wide range of DM techniques employed in the development of smart cities. The integration of the different technologies used in smart city applications and services remains the most challenging issue to overcome, due to the volume, heterogeneity, and complexity of the collected data [153]. The development of advanced data-driven infrastructures and techniques that scale well and facilitate the interoperability of smart city applications and services is a promising direction for future work.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript.