Open Data Based Machine Learning Applications in Smart Cities: A Systematic Literature Review

Machine learning (ML) has already gained the attention of the researchers involved in smart city (SC) initiatives, along with other advanced technologies such as IoT, big data, cloud computing, or analytics. In this context, researchers have realized that data can help make the SC happen and that the open data movement has encouraged more research using machine learning. Based on this line of reasoning, the aim of this paper is to conduct a systematic literature review to investigate open data-based machine learning applications in the six different areas of smart cities. The results of this research reveal that: (a) machine learning applications using open data appear in all the SC areas, and specific ML techniques are identified for each area, with deep learning and supervised learning being the first choices; (b) open data platforms represent the most frequently used source of data; (c) the challenges associated with open data utilization range from data quality and frequency of data collection to data consistency and data format. Overall, the data synopsis as well as the in-depth analysis may be a valuable support and inspiration for future smart city projects.


Introduction
Urbanization is an important demographic mega-trend. With the urban population estimated to double by 2050, there is no doubt that "the future of the world's population is urban" [1]. In a rapidly urbanizing world, all the actors involved in finding solutions for sustainable living in urban agglomerations join forces, from academia and visionary companies to policy makers all over the world.
Based on advanced digital technologies and their implementation in cities, especially in the last decade, the smart city (SC) has become one of the most important concepts of the new economy and one of the most researched topics. In the history of ideas, the smart city is a relative newcomer [2]; interest in this research field has grown and publications have intensified after 2008, as reported in a recent systematic literature review [3]. The quoted analysis also notes the fuzziness of the SC concept, as there is no generally agreed definition in the literature; the SC concept is frequently described as many-sided, multidimensional, complex, widespread, or fuzzy, while being used in inconsistent ways [2,4-6]. Given all these different opinions, we have decided to combine the 'intelligent city' perspective with the multidimensional view of the smart city in our research. The multidimensional aspect is related to the idea that a smart city should perform well in areas such as economy, environment, living, mobility, as well as people and governance.
In the last decade, the technologies used to create smart applications for cities and their citizens have included IoT, cloud computing, big data, analytics, and artificial intelligence. Acknowledging that artificial intelligence (AI) for smart cities is a developing field of research and practice, a systematic literature review performed in 2020 indicated that AI is applied in many smart city areas such as education, security, transport, energy, environment, health, land use, and urban governance [7]. Its authors conclude that learning-based AI, also known as machine learning (ML), has a greater potential to solve SC problems than rules-based AI. In this respect, we have chosen to investigate the 'intelligent city' fostered by the impulse generated by the development of machine learning applications.
As stated in [8], open data should be the support of the new economy because the core of a smart city consists of the creation and use of data to generate new services and to support decision making. In addition, according to [9], data is one of the three major pillars of the SC, along with technology and people. Taking into consideration the ICT opportunities and the emerging open data movement, one of the most relevant emerging SC-related domains combines open data (OD) with machine learning (ML) applications.
The literature provides plenty of papers that review and analyze the employment of ML methods in smart cities. Numerous papers on ML applications for SC were found, as well as papers arguing for the relevance of OD for smart cities. Considering OD in relation to ML applications for smart cities, however, we have not found literature reviews that connect the three concepts of SC, ML, and OD. Thus, this paper aims to fill this gap and provides a state-of-the-art review of open data-based machine learning applications for smart cities. To do so, we applied the systematic literature review methodology.
First, we scrutinized the existing systematic literature reviews (SLRs) on smart cities, machine learning, and open data. We went through twelve selected papers that combine the three main pillars that form the background of our research (see all referred papers in Table 1) and discovered that only one had a similar approach as a starting point [10]. While the conceptual framework is comparable, the mentioned research [10] did not target ML applications but the optimization of sustainability for SC. Other SLRs were analyzed at this point to decide on the added value that our paper could bring to the literature.
Table 1. Related SLR papers.

Paper | SLR Scope/Focus
[4] | "A systematic literature review on maturity models that assess the level of maturity for smart city projects"
[11] | "A systematic literature survey on software architectures for big data systems", with a few connections with SC
[12] | "IoT challenges in smart cities and provide the gap between the existing state-of-the-art IoT application on S"
[13] | "A comprehensive analysis of the literature on interoperability of SC data platforms"
[10] | "Analyze the link between the concepts of smart cities, machine learning techniques and their applicability"
[14] | "Systematically investigated the evolution of OGD research"
[15] | "the relationship between big and open data and how they relate to the broad concept of open government"
[16] | "Covers the revision of the studies related to air pollution prediction using machine learning algorithms based on sensor data in the context of smart cities" and concludes that "open data movement has increased the number of research works in the field of machine learning, especially in the prediction of air quality".
[17] | "Comprehensive survey that explores the application of graph neural networks for traffic forecasting problems", presenting also "a comprehensive list of open data and source resources for each problem"
[18] | "The challenges faced by smart cities and the key role data mining, machine learning and statistical methods can play to enable intelligent solutions for different applications"
[19] | "Systematically reviews the top 200 Google Scholar publications in the area of smart city with the aid of data-driven methods from the fields natural language processing and time series forecasting"
[7] | "Generates insights into how AI can contribute to the development of smarter cities".
After analyzing the coverage area of the listed SLR type works, the main scope of our SLR was set and the research questions were formulated. The main objective is to investigate the particulars of open data-based machine learning applications for specific smart city areas. The following research questions were crystallized: (RQ1) Which learning types and algorithms are used in open data-based ML applications for each of the six smart city areas (smart governance, smart economy, smart mobility, smart environment, smart people, smart living)? (RQ2) What are the sources used for open data? (RQ3) What are the challenges of open data utilization in ML applications for smart cities?
The remainder of this paper is structured as follows. Section 2 provides a theoretical background covering the main concepts of SC, OD, and ML, and Section 3 explains the research methodology. In Section 4, we present and discuss the results of reviewing the selected papers and of the data analysis using Power BI. The paper ends with Section 5, which presents the research conclusions. In addition, Appendix A contains the list of the papers that were selected for the in-depth analysis.

Smart Cities
During the last decade, the concept of SC has evolved from the simple implementation of information technologies in public services into an ecosystem that also takes into account innovation, the environment, and human and social aspects [4]. In [13], SC is considered not an ecosystem but a development vision where ICT-based solutions are integrated, data is acquired from heterogeneous sources, and assets are connected in a platform with the objective of improving quality of life while enhancing efficiency and economic value.
Based on a systematic effort of analyzing over one hundred SC definitions from various sources, such as the academic research community, government programs, different organizations (European Commission, United Nations, ITU etc.), corporations, and standards development organizations, the focus group constituted within ITU proposed a comprehensive and integrative SC definition. In their view, SC is "an innovative city that uses ICTs and other means to improve quality of life, efficiency of urban operations and services and competitiveness, while ensuring that it meets the needs of present and future generations with respect to economic, social, and environmental aspects" [20]. The same focus group has inventoried eight key aspects that support a sustainable smart city: "(1) quality of life and lifestyle, (2) infrastructure and services, (3) ICT, communications, intelligence and information, (4) people, citizen and society, (5) environment and sustainability, (6) governance, management and administration, (7) economy and finance, and (8) mobility" [20].
Closely related to our approach, [5] considers the SC to be a city that employs technology to address public problems "on the basis of a multi-stakeholder, municipally based partnership". Further, [5] has also formulated six characteristics or dimensions of the SC: (1) smart economy, (2) smart mobility, (3) smart environment, (4) smart people, (5) smart living, (6) smart governance. These characteristics represent the areas (domains) that SC initiatives focus on. They are described in Figure 1.
The SC concept has gained greater attention since 2008, partly due to the launch of the visionary IBM Smarter Planet project, where SC is defined as "a comprehensive approach to helping cities run more efficiently, save money and resources, and improve the quality of life for citizens" [22]. The expanding role of data in the SC slowly moved to the center of interest; a report of the Academy of Smarter Communities states that "IoT and data platforms play a central role ( . . . ) in managing the vast amount of data generated across different urban domains" [23]. More recently, Oracle, another visionary company, put forward its solutions to help cities tackle high volumes of data by "combining artificial intelligence and machine learning to transform existing platforms into automated and mobile-friendly citizen services" [24].

Open Data for Smart Cities
In recent years, the proliferation of technologies such as IoT and analytics, along with the constant growth of data volume (big data), has motivated the vision of data and technology being used to create a better and more sustainable quality of life for the citizens and businesses that inhabit the city. Based on recent developments, we assert that cloud-based data and technology are used to enable data-informed decisions in real time that improve urban management. Overall, the recent literature agrees that data has become a key feature in the smart city conceptualization, and the new framework is now designed around three pillars: people, data, and technology.
Taking into account that a smart city initiative has to be developed around data, [9] specifies that nowadays it depends on connections, open data, and sensors. The nature of collected data depends on various factors and can vary from health services to governmental measures, and social, economic, and environmental impacts [25]. For a long time, public organizations have gathered, managed, and processed data for their internal operations. In the last decade, the emerging open data movement encouraged them to make their data available to the public as 'open data' [26]. Today, the ecosystem of a smart city has a plethora of sensors that generate large amounts of data [27]. Aside from these sensors, data is also collected using other available tools and technologies, such as cameras, kiosks, personal devices, appliances, social networks, and others. Data collection is a helpful tool for both citizens and planners, helping them to regain control and to access necessary information [28]. Data that are relevant for the SC can be gathered from numerous heterogeneous sources, from sensor data to user-contributed data in participatory sensing [29]. Open data is not limited to government data, because the private sector also recognizes the potential benefits of sharing data under the umbrella of open data [30].
Today, both public and private entities appreciate the value of data because this resource has already proved to be the key to improving efficiency and effectiveness in everyday activities [31]. For the SC, the number of stakeholders is higher and more diverse than in the private business case, i.e., utility companies, transport providers, mobile phone operators, social media sites, financial institutions, surveillance and security providers, emergency services, and others, along with the citizens themselves [28].
In [32], open data (OD) is defined as data freely available to everyone to use and republish as they wish, without any restrictions (copyright, patents or other control mechanisms). The European Portal for open data states the following three features for open data: free flow of data, transparency, and fair competition. In [30], the authors report that open data should comply with the following 10 principles: "complete, primary, timely, accessible, machine-processable, non-discriminatory, non-proprietary, permanent, license-free, and preferably free of charge." In the same respect, [33] mentions that open data should be complete, primary (should include original data and metadata about data collection), timely, and permanent (having an appropriate control mechanism for data versions), and [34] states that openness is a good governance principle. However, open data proliferation has brought potential perils and insecurity, related to questions such as who benefits from the data and who might be harmed by data sharing [8].
The following categories of open data are acknowledged in [29]:
• Sensor data: data collected from different types of sensors found in a city (from traditional sensors that provide data about physical phenomena to wearables that collect data about human activities and behavior);
• Image and video data: mostly data from video surveillance or other video sources;
• Text data: a complementary data source for many smart city applications.
As stated in [33], the most valuable sources of data are represented by the open data initiatives of governments. In the last couple of years, these initiatives have sprung up around the world, creating a goldmine for public administration [35]. Similar to open data, open government data (OGD) are available and accessible to everyone for their own needs, and they are made freely available for re-use for any purpose. The difference comes from the fact that they are produced with public money and the license specifies the terms of use (data.europa.eu). For the purposes of our research, we will further use the term open data to also cover open government data.
Open data proliferation has established a 'data commons'. Similar to a park or a playground, the data commons is a public good, which is accessible to the public. Open digital data can be copied limitlessly while the original in physical terms is not affected in any way [8]. Open data or OGD, as a source of information and knowledge in a knowledge-based economy, might well be a free resource for end-users; however, its production, maintenance, and gathering need to be secured and sustained, at significant cost, by skilled staff, with appropriate AI and Big Data technologies, and through implemented systems with open standards [34].
Data is collected using different tools and mechanisms; therefore, we can find different types of data sources. Open data sources include any information that can be obtained without a privileged position [36]. The most relevant open data sources are: social media (Facebook, Instagram, Twitter, LinkedIn, YouTube, and others), electronic media (newspapers, news sites, and others), blogs (WordPress, Tumblr), booking and accommodation (Booking, FourSquare), satellite imagery (Landviewer, Copernicus Open Hub, Sentinel Hub etc.), and government data (World Bank Open Data, European Union Data Portal, open data in Canada, Data.gov, country level sites with open access to government data, and many more).
At the governmental level, open data can be a powerful force for public accountability, as [37] mentions, because information can be analyzed, processed, and combined more easily, which allows a new level of public scrutiny. Making data available to the public will increase citizens' engagement with government and potentially add value to the data [38]. Data availability also supports innovation and contributes to economic growth.
Some data sets allow direct access, so that interested parties gain instant and easy access to the data. However, there are situations in which data must be extracted from the available data sources using different techniques. What is important for any application is the quality of the gathered data, so that the data correctly represent the real world and are fit for their intended use [31]. Furthermore, the following aspects are also extremely relevant for developing applications for smart cities [39]: (1) storing and managing databases, as large amounts of data are collected, and (2) integrating data from many sources.

Artificial Intelligence for Smart Cities: Machine Learning and Deep Learning
Moving forward in the SC ecosystem, another main component is technology; since 2010, an increasing variety of technologies has been employed in smart city related developments. Information and communications technologies enable the detection and collection of data, the diffusion of data through the network, and the development of specific applications. In recent years, key domains such as urban planning, transportation, or energy have made use of new technologies to provide smart applications to cities and their people: networking and communications, IoT, big data, analytics, cloud or edge computing, and artificial intelligence [40]. Other new technologies adopted in the SC context are autonomous vehicles, 5G, blockchain, virtual reality, and digital twins [41].
IoT (short for Internet of things) is a tool that provides specific services that give low-level support to different applications offered to citizens. Technological advances, such as standard communication protocols and wireless networks, have made it possible to obtain sensor data anytime and anywhere. The cloud-based infrastructure of a smart city architecture allows information to be communicated to the connected objects/entities. The cloud allows computational resources to be adjusted according to demand and transforms capital expenditures into operational costs. The enormous production of digital data in cities is due to the fact that all the actions performed on personal computers, laptops, mobile phones, and other connected objects leave a trace. Big data offers the capabilities to capture, store, manage, and analyze this huge volume of information, which is persistently growing, accumulating, and waiting to be analyzed [42].
Artificial intelligence (AI) represents an innovative technology meant to deal with the urban challenges of environment, people, transportation, security, or economy. Moreover, AI is a key enabler for improving data processing and its transformation into useful information and knowledge for sustainable cities [21]. While initially being defined as the science of making machines intelligent, today's AI represents a combination of machine learning and deep learning techniques. Things have evolved a lot in less than ten years from the moment when Google's unsupervised neural network learned to recognize cats in YouTube videos with 74.8% accuracy. In the SC environment, where the types of data acquired range from text, images, and videos to social media and sensor data, AI has the potential to analyze the gathered and integrated big data and to employ cloud computing to optimize operational costs and resources [43].
Machine learning (ML) allows applications to become more precise at predicting outcomes without being explicitly programmed to do so. ML algorithms use existing data as input and learn to predict new output values. Based on how algorithms learn, there are four ML types: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning (see Figure 2 for an overview of each type along with the problems they can solve). In supervised learning, algorithms are supplied with labeled data to be used in training (input), and the variables that the algorithm assesses for correlations are defined (output).
Unsupervised learning employs algorithms that train on unlabeled data. The algorithm is able to find meaningful connections in the scanned data set. Both the data the algorithm trains on and the kind of output it produces are predetermined.
Semi-supervised learning involves a mix of supervised and unsupervised learning. While the algorithm is fed with mostly labeled training data, the model has the liberty to investigate other data. The result is based on its own understanding of the data set.
Reinforcement learning teaches a machine to perform a "multi-step process" based on clearly defined rules. While the algorithm is programmed using positive or negative indications to fulfill a task, it may also determine on its own what steps to take during the process.
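To make the contrast between the first two learning types concrete, the following minimal sketch applies a supervised and an unsupervised procedure to the same toy values. Everything in it is an illustrative assumption rather than material from the reviewed papers: the traffic-count data are invented, and the nearest-centroid classifier and the one-dimensional two-cluster k-means were chosen only because they are short.

```python
from statistics import mean

# Hypothetical toy data: hourly traffic counts with a congestion label
# (invented for illustration only).
labeled = [(12, "low"), (15, "low"), (18, "low"),
           (80, "high"), (95, "high"), (88, "high")]
unlabeled = [x for x, _ in labeled]  # same values with labels removed


def nearest_centroid_fit(samples):
    """Supervised: learn one centroid per label from labeled samples."""
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}


def nearest_centroid_predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two groups (1-D k-means, k=2)."""
    c0, c1 = min(values), max(values)  # initial centroids
    for _ in range(iterations):
        g0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        g1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        c0, c1 = mean(g0), mean(g1)
    return sorted(g0), sorted(g1)


centroids = nearest_centroid_fit(labeled)
prediction = nearest_centroid_predict(centroids, 20)
clusters = two_means(unlabeled)
print(prediction, clusters)
```

The supervised part needs the labels to name its output ("low" or "high"), whereas the unsupervised part only recovers two unnamed groups from the values themselves.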
The last decade has been the decade of neural networks (NN, also known as artificial neural networks, ANN) due to the availability of the computational power and the data required for good training. Algorithms and architectures were adapted to the specifics of neural networks. Imitating the behavior of the human brain, a NN includes node layers with definite roles: one input layer, one or more hidden layers, and one output layer. Nodes are connected to one another, and each has an associated weight and threshold. From a node, data is sent to the next layer only if the output of that node is above the indicated threshold value. Otherwise, no data is passed along to the next layer [46]. A NN that includes more than three layers is considered a deep learning algorithm (see Figure 3).
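The layer-by-layer, threshold-gated flow described above can be sketched in a few lines. This is a simplified illustration of the description in the text, not a trainable network: the weights and thresholds are hypothetical values chosen by hand, and practical networks typically use differentiable activation functions instead of a hard threshold.

```python
def layer_forward(inputs, weights, thresholds):
    """Compute one layer: each node sums its weighted inputs and passes the
    value on only if it exceeds that node's threshold (otherwise 0.0)."""
    outputs = []
    for node_weights, threshold in zip(weights, thresholds):
        total = sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(total if total > threshold else 0.0)
    return outputs


def network_forward(inputs, layers):
    """Feed inputs through a list of (weights, thresholds) layers in order."""
    values = inputs
    for weights, thresholds in layers:
        values = layer_forward(values, weights, thresholds)
    return values


# Hypothetical weights and thresholds, chosen by hand for illustration only.
hidden = ([[0.5, 0.5], [1.0, -1.0]], [0.4, 0.0])  # two hidden nodes
output = ([[1.0, 1.0]], [0.5])                    # one output node

fired = network_forward([1.0, 0.0], [hidden, output])
silent = network_forward([0.0, 1.0], [hidden, output])
print(fired, silent)
```

With the first input both hidden nodes exceed their thresholds and the output node fires; with the second input one hidden node stays below its threshold, so the output node receives too little and passes nothing along.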
Neural networks provide a multitude of advantages: they require less formal training because of their excellent learning capabilities, they are able to detect complex nonlinear relationships, they may work with many different training algorithms, and they are flexible because they can handle various forms of data [47]. On the other hand, NN algorithms require more time and more computation to train a model on large volumes of data, and their ability to explicitly identify causal relationships between variables is limited. A great challenge for NN is to avoid overfitting, which impedes the capacity of the NN to generalize well to new data. In this respect, NN algorithms perform well when their complexity fits the complexity of the data. Deep learning (DL) is considered a subset of ML but differs in the way algorithms learn and in how much data each type of algorithm uses. A major benefit of DL is that it eliminates manual intervention by automating much of the feature extraction part of the process. Being able to use large data sets, it has earned the title of "scalable machine learning". From a practical point of view, deep learning is meant for more complex use cases because a DL model requires more data points to improve its accuracy, while machine learning is able to work with smaller datasets because a ML model may rely on less data.
As also reported in [10], in the last five years there has been an exponential growth in articles with experiments or applications of machine learning techniques in all the smart city areas. Unlike that study, we take into consideration the immense interest in the open data platforms that are constantly expanding in the public space and, in view of that, we investigate the specifics of open data-based machine learning applications for smart cities.

Methods
This section describes the methodology applied for the systematic literature review. The process of a systematic literature review consists of the following activities: formulate research questions, select studies, extract required data, analyze and synthesize data, and describe the results. First, we defined the research questions, i.e., clear statements that guide our literature review. For the SLR methodology, we have used PRISMA, which consists of four phases (identification, screening, eligibility, and inclusion) together with a comprehensive checklist. The included results were then assessed and interpreted to give answers to the research questions.

Search Strategy and Criteria for Inclusion/Exclusion
In our search strategy, we started by selecting major scientific databases, i.e., Web of Science, Scopus, IEEE Xplore, AIS (Association for Information Systems library), Springer, and ProQuest, along with some popular ones such as Semantic Scholar and MDPI. We tracked published research results in June 2021 using a comprehensive search string:
(("artificial intelligence" OR "machine learning" OR "deep learning") AND ("Open Data" OR "open government data") AND ("smart city" OR "smart cities"))
A total of 472 papers, all written in English and published in the 2011 to 2021 period, were selected in the eight searched databases (see Table 2). According to the PRISMA method, the initial 472 records were first screened to eliminate duplicates. Consequently, 61 records were excluded, which is explained by the fact that papers are frequently indexed in more than one database. The 411 remaining records were further screened, and inclusion and exclusion criteria were applied in two rounds (see Figure 4). In the first round, papers were checked for eligibility to make sure they were peer-reviewed articles and that they discuss machine learning applications for smart cities and open data. Assessing their abstracts, 194 papers were excluded. The large number of excluded papers is the consequence of having a long list of papers based on the comprehensive search string. When carefully reading the abstracts, we found that many papers did not actually make use of open data or did not employ machine learning techniques for smart city applications, and they were therefore eliminated.
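The logic of the search string above (OR within each parenthesized group, AND across the three groups) can be sketched as a small matcher. This is only an approximation for illustration: the case-insensitive substring matching is an assumption, each database applies its own query engine and indexed fields, and the sample records are invented.

```python
# Term groups copied from the search string; a record matches when it
# contains at least one term from every group (OR within, AND across).
SEARCH_GROUPS = [
    ["artificial intelligence", "machine learning", "deep learning"],
    ["open data", "open government data"],
    ["smart city", "smart cities"],
]


def matches_search_string(text, groups=SEARCH_GROUPS):
    """Return True if text satisfies the AND-of-ORs search string.

    Assumption: plain case-insensitive substring matching stands in for
    whatever matching the individual databases actually perform.
    """
    lowered = text.lower()
    return all(any(term in lowered for term in group) for group in groups)


# Hypothetical records, invented for illustration.
records = [
    "Deep learning for air quality forecasting with open data in a smart city",
    "Machine learning for traffic prediction in smart cities",
    "Open government data portals: a survey",
]
hits = [r for r in records if matches_search_string(r)]
print(hits)
```

Only the first record satisfies all three groups; the second lacks an open data term, and the third lacks both a learning term and a smart city term.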
In the second round, the content of the remaining 187 papers was examined (in-depth analysis) in order to retain for evaluation only the papers with actual applications (experiments) that were developed with machine learning techniques and based on open data. We excluded 118 papers because they did not contain definite applications developed with ML techniques; they were position papers or presented frameworks, taxonomies, or models. Only 69 eligible papers were included in the systematic review, which was performed through in-depth analysis of the ML techniques used, the type of open data and the challenges encountered in data utilization, and the SC area that the application addresses.
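The screening flow can be tallied from the stage totals given in the text (472, 411, 187, and 69 records). The sketch below derives the number of records removed between consecutive stages from those totals alone; the stage names are shorthand introduced here. For the first screening round, the text itemizes 194 exclusions based on abstract assessment, while the difference between the stage totals is what the sketch reports; the text does not itemize the remainder.

```python
# Stage totals as reported in the text; the stage names are shorthand.
stages = [
    ("identified", 472),
    ("after duplicate removal", 411),
    ("after first screening round", 187),
    ("included after in-depth analysis", 69),
]


def removed_per_stage(stage_totals):
    """Return (stage name, records removed to reach it) for consecutive stages."""
    return [
        (name, prev_count - count)
        for (_, prev_count), (name, count) in zip(stage_totals, stage_totals[1:])
    ]


removed = removed_per_stage(stages)
for name, n in removed:
    print(f"{name}: {n} records removed")
```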

Results and Discussion
Our relevant sample includes 69 selected records, of which 43 are journal articles and 26 appear in the proceedings of international conferences. Appendix A includes the complete list of selected papers with a synthetic description for each of them. While the search was run for the period 2011-2021, after the screening of the records the period narrowed to 2013-2021, because no eligible papers were published in 2011-2012. Data was initially collected using a shared Google sheet and then exported and processed to obtain visualizations with Microsoft Power BI. Results are detailed as follows.
As regards the time analysis, the records cover a period of 9 years, from 2013 to 2021. Figure 5 shows the distribution over time. The trend indicates a significant growth starting in 2017, with a maximum in 2019. It is our belief that the ascending trend illustrated for 2017-2020 will continue, given that in the last couple of years open data initiatives favored data-driven innovation [49] and fostered the delivery of ML-based smart solutions. The fact that we have only five records in 2021 is attributable to the time of the search (June 2021); as a consequence, we cannot have a final number for the papers published in 2021.
When investigating the areas of SC application as previously described (see Figure 1), we found that only two areas are poorly represented, i.e., smart people (two applications) and smart economy (four applications); the totals and the specific articles (coded as in Appendix A) are presented in Figure 6.

RQ1. Which Learning Types and Algorithms Are Used in Open Data Based ML Applications for each of the Smart City Areas?
Machine learning algorithms offer a world of potential, making available to developers many routes to take, along with the type of machine learning they opt for, in a wide variety of smart city applications.
As revealed above, the use of machine learning with open data in SC is a recent topic of interest; we only found papers dating from 2013 to 2021. While at first supervised learning stood out as the preferred ML technique, in the last five years deep learning has clearly been the most often used machine learning type in open data-based SC applications (see Figure 7). This may offer valuable insight in terms of the approach to take when designing open data-based SC solutions. Given the imperfect nature of the data within open data sources and the seemingly random data points generated within a city, deep learning may be the most relevant approach in such circumstances. Besides, unlike traditional ML algorithms, DL can deal with great amounts of data, therefore providing high-level solutions to smart city problems [49].
Based on this reasoning, we wanted to visualize the distinction between classical machine learning and neural networks/deep learning. The classical ML algorithms were divided into the following four categories: supervised learning, where data is labeled (e.g., decision trees, or linear regression); unsupervised learning, where data is not labeled (e.g., K-means, SVD); semi-supervised learning, where some of the data is labeled and some is not; and reinforcement learning. Deep learning is the fifth category, and the grouping of deep learning algorithms (such as LSTM or CNN) was based on our assessment of the algorithms that best fit the NN rendition of hidden layers. The complete list of papers organized for each of the five ML types discovered in our sample (coded as in Appendix A) is presented in Figure 8. Overall, three of the machine learning techniques stand out: deep learning (46.38% of the papers), supervised learning (34.78%), and unsupervised learning (10.14%). Our interpretation is that supervised learning is constantly used in SC applications because it has proved its value, while deep learning arose recently but has proved to be more suitable for SC applications. Focusing on deep learning-based applications, we looked deeper in order to identify which algorithms are mostly used (see Figure 9); the results include long short-term memory (LSTM), convolutional neural networks (CNN), and artificial neural networks (ANN), along with comparisons between different algorithms (multiple algorithms). The rest of the algorithms are used only rarely (fewer than two times) or are iterations of the aforementioned algorithms. From this finding, we may learn that LSTM, CNN, and ANN algorithms have so far dominated SC applications.
As regards the types of learning employed in the deep learning applications, classifying the algorithms used (based on [50]) showed that 26 papers (37.68% of all papers and 81.25% of the deep learning papers) applied supervised learning. Only one paper applied unsupervised learning, and the other five used hybrid or semi-supervised learning.
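The shares quoted above can be re-derived from the paper counts. The following is a minimal sketch, assuming the final sample of 69 papers and the deep learning breakdown reported in this section (26 supervised, 1 unsupervised, 5 hybrid or semi-supervised); the variable and function names are illustrative only and do not come from the original analysis.

```python
# Counts taken from the text; all names here are illustrative assumptions.
TOTAL_PAPERS = 69
DL_BY_LEARNING_TYPE = {
    "supervised": 26,
    "unsupervised": 1,
    "hybrid_or_semi_supervised": 5,
}


def share(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Total number of deep learning papers is the sum of the three groups.
dl_papers = sum(DL_BY_LEARNING_TYPE.values())

dl_share_of_all = share(dl_papers, TOTAL_PAPERS)
supervised_share_of_all = share(DL_BY_LEARNING_TYPE["supervised"], TOTAL_PAPERS)
supervised_share_of_dl = share(DL_BY_LEARNING_TYPE["supervised"], dl_papers)

print(dl_share_of_all, supervised_share_of_all, supervised_share_of_dl)
```

Under these assumptions, the three values correspond to the 46.38%, 37.68%, and 81.25% figures reported in the text.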
Another interesting piece of evidence is that in many ML applications researchers choose to use multiple algorithms. This is mainly the case for the supervised learning applications (54% of the cases), where the research focuses on identifying the best algorithm from a range of algorithms already validated for the specific application. The distribution of algorithms used in supervised learning SC applications is represented in Figure 10. Regarding the purpose of the applications, we investigated the SC areas relative to the ML techniques employed (see Figure 11, which shows the machine learning techniques used in each of the smart city areas) and discovered the following:

• Deep learning and supervised learning are reported in all the SC types of applications (except smart people, where one of the two is not reported).
• Reinforcement learning has only one field of application, i.e., smart mobility. It resembles a 'multi-step process' and does not fit the other SC areas of application.
• There is an evident predominance of deep learning applications in half of the SC areas: smart economy (75% of the applications), smart environment (67% of the applications), and smart living (53.85%).
• Supervised learning is utilized in half of the smart governance applications.
• Even though it is not a first choice, we believe that unsupervised learning has great potential for SC applications; we discovered one or two applications in almost all SC areas.
• While deep learning is the most prevalent machine learning type, semi-supervised learning is the least used in SC applications.

RQ2. What Are the Sources for Open Data?
Smart city applications are increasingly present in cities around the world, and they are designed to address different issues for a modern city in continuous development. The trend toward open data and open government is confirmed again by the use of open data for developing smart city applications. This has been possible because, over the last three decades, data sources have become available to the public through the "open movement". The trigger for this movement was the public launch of the World Wide Web in 1991. There are several issues related to this topic, as both public and private organizations have adhered to it, but these have not been an impediment to making data available to the wider public. Data is collected from different sources, such as sensors, cameras, mobile devices, and many others, and is found in different repositories, such as data sets, public websites, data platforms, and many others.
Researchers and smart city application developers need to define their scope and clearly identify what data is needed for their purpose. Obtaining data is no longer the 'show stopper', as they have numerous options to choose from.
Nowadays, open data are available from varying areas of interest, making it possible to create different SC applications. While reviewing our sample of records, we found that 55% of the applications used open data platforms (see Figure 12). This choice is readily explained by the availability of the data at any time. It is not unexpected, considering that each year more governments and local authorities align with the "open policy" (providing data for a transparent administration). Public websites of private companies are another source of data identified in our review (16%). Much more effort is needed to extract accurate data in this scenario, namely extracting, cleaning, and preparing the data, but it proves to be a valuable source. Numerous websites hold useful information that is shared with the public and that can be used in a variety of ways to build applications for smart cities.
It is imperative for cities to implement more solutions that support citizens, both in terms of services provided by local authorities (for example, authorizing new constructions and demolitions) and in terms of quality of life (for example, forestation of cities, pollution of areas, traffic control, etc.). This can be achieved using other sources of OD, such as satellite images. This data source is reported in 15% of the reviewed articles. With the development of satellite technology, such images have become easier to obtain, and it is no wonder that smart city applications have found use cases for them.
Nevertheless, there are also situations in which multiple data sources are needed to obtain the desired result, as in the case of complex projects or applications. In our review, this was the case for 13% of articles. We have encountered combinations such as the OD platform and user-generated, OD platform and private companies' websites, satellite images and user-generated, private companies' websites and private companies' websites. Using different types of data brings accompanying difficulties, especially when data retrieved from numerous sources needs to be integrated into one dataset.
As regards the type of the smart city applications, it is noteworthy to observe that for all of them open data platforms represent the most utilized OD source (see Figure 13). Not surprisingly, applications for smart governance rely mostly on OD platforms with 69% of them using this source. In addition, 63% of the applications for smart mobility are using open data platforms.
We obtained interesting results when analyzing the sources of data used for the different ML techniques (see Figure 13). OD platforms are the most used source of data in applications based on unsupervised learning, and they are also the favorite choice for supervised learning (58%) and deep learning applications (50%). Satellite images are used only in supervised learning and deep learning applications, while data from public websites are useful for all the types of applications we analyzed.

RQ3. What Are the Challenges of Open Data Utilization in ML Applications for Smart Cities?
In modern society, the vast amount of data is a challenge in itself. On the one hand, smart city initiatives may take advantage of the large volume of available raw data; on the other hand, collecting and sharing data among all stakeholders is not an easy task. The open data movement, a new form of democracy, changed the circumstances and gave new prospects to smart city applications. Shared open data portals ensure that all stakeholders are on the same page when information is updated effortlessly. When data portals open out from internal sharing to external publishing, inter-organizational 'synergy and connectivity' is attained in areas such as electricity, water, environment, traffic management, or safety. The most frequently reported challenges for open data usage may be summarized as:

• Integrating data from many sources, as a result of having many heterogeneous sources of data (from sensors to social media) that originate from different public organizations/departments or even private companies;
• Multitude of data formats, with text data that is structured or semi-structured, but also images, videos, and other unstructured data sources, all of which need to be harmonized in the data gathering phase;
• Quality of the data regarding aspects such as accuracy, consistency, or data imbalance, sometimes affecting the data validation activity;
• Data traceability, using specific mechanisms to track the origin of the data based on accurate and reliable metadata.
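To make the format, completeness, and traceability challenges listed above more concrete, the following minimal Python sketch harmonizes two small, invented inputs (one CSV, one JSON) into a common schema, tags each record with its origin, drops duplicates, and sets aside incomplete records. The field names, the sample values, and the helper functions `from_csv`, `from_json`, and `harmonize` are hypothetical and are not taken from any of the reviewed papers.

```python
import csv
import io
import json

# Hypothetical inputs: the same kind of measurement published in two formats.
CSV_TEXT = "station,pm10\nA,21.5\nB,\n"
JSON_TEXT = '[{"station_id": "A", "PM10": 21.5}, {"station_id": "C", "PM10": 18.0}]'


def from_csv(text, source):
    """Map CSV rows onto a common schema and tag each record with its origin."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        value = row["pm10"].strip()
        records.append({
            "station": row["station"],
            "pm10": float(value) if value else None,  # keep gaps explicit
            "source": source,  # simple provenance metadata
        })
    return records


def from_json(text, source):
    """Map JSON items onto the same schema, renaming fields as needed."""
    return [
        {"station": item["station_id"], "pm10": float(item["PM10"]), "source": source}
        for item in json.loads(text)
    ]


def harmonize(*batches):
    """Merge batches, skip duplicate (station, pm10) pairs, and collect incomplete records."""
    seen = set()
    merged, missing = [], []
    for batch in batches:
        for record in batch:
            if record["pm10"] is None:
                missing.append(record)
                continue
            key = (record["station"], record["pm10"])
            if key in seen:
                continue
            seen.add(key)
            merged.append(record)
    return merged, missing


merged, missing = harmonize(
    from_csv(CSV_TEXT, "csv_portal"),
    from_json(JSON_TEXT, "json_api"),
)
print(merged)
print(missing)
```

In a real project the schema mapping, the duplicate key, and the handling of missing values would depend on the data sources involved; the sketch only illustrates where each of the reported challenges tends to surface in a data gathering step.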
In light of this, considering the third RQ, across all the analyzed studies the most frequent entry (17 papers) was that no problems related to open data were reported. For those articles where problems were encountered, the challenge of overcoming the multitude of data formats appears most frequently (13 papers), followed by the challenge of not having all the needed data (11 papers), data quality issues (7 papers), and data consistency (5 papers). Another finding is that 13 of the developed applications came across multiple problems related to open data utilization. The least frequently encountered problems are duplicates and imbalanced datasets (see Figure 15, where the results are presented in descending order as percentages of the total number of papers).
With respect to the different sources of data utilized, some researchers reported that open data platforms are suitable as a data source in their ML-based applications (10 papers), while others were confronted with problems such as data format (nine cases) or insufficient data (nine cases) when using OD platforms. In addition, most of the multiple-problem situations were also encountered when using OD platforms (10 cases). When using satellite images, the most reported challenges in data utilization relate to data quality and data format or consistency. The complete picture of the problems encountered is represented in Figure 16.

Looking into the challenges met by the different types of machine learning applications, we discovered that 25 of the deep learning-based applications (78%) encountered problems related to data utilization. Most frequently, researchers came across the problem of insufficient or incomplete data (25% of the DL application papers) and had to address problems with the data format (19% of the DL application papers). The situation looks better for ML applications using supervised learning, where almost a third reported no problems with data utilization. For this type of application, frequent problems were associated with data format (17%) and data quality (17%); equally often, researchers indicated that they had met multiple problems with data utilization (17%). Figure 17 depicts the complete report on challenges related to data utilization for each type of ML application.

With respect to the SC areas of the applications, we observed that data format is a challenge for applications in smart living (38%) and smart mobility (25%), while applications in smart living also face the problem of not having sufficient or complete data.
All results of the analysis of open data utilization challenges for different areas of SC applications are included in Figure 18.
The SC area in which we found most of the applications (31%) that did not encounter problems with open data utilization is smart governance. Regarding this 'no problems' category, we should mention that some authors may simply not have included a discussion of the problems with open data utilization. Our results are based on the reported problems only.

Conclusions
Many organizations' leaders today are starting to ask 'what can my data do' and are beginning to realize the immense potential residing in data. Along with analytics, machine learning has already shown its value and usefulness. However, building and maintaining ML applications is not an easy task, and the experiments, application models, and case studies in the literature provide valuable support and inspiration for current and forthcoming smart city projects.
AI, and machine learning in particular, are becoming a core part of businesses, either private or public, around the world. Recent literature [10] pointed out the greater significance of proving the applicability of machine learning techniques in smart city initiatives. Our research contributes to the current literature on ML applications for SC with a broad analysis that takes into consideration open data, which have become more available in recent years. Considering the other SLR approaches, we decided to delineate and analyze the papers proposing actual applications for SC that make use of ML and open data. This area of investigation proved to be very recent: the initial 472 primary records were published in the last decade (2011-2021), and the final sample of 69 records covers the period 2013-2021. Furthermore, the increasing number of applications and experiments in the last couple of years indicates that we have addressed an emerging topic.
Some of the most significant results of the analysis are listed below:
• The most used source of data is open data platforms (in over 50% of the applications); this result is consistent with [16].
• All the ML techniques were encountered, but the most used are deep learning and classical ML supervised learning. Some ML types have applications in only one SC area (reinforcement learning in smart mobility), and there are dominant choices for others, i.e., deep learning in smart mobility and classical ML supervised learning in smart governance. All these insights may be useful for upcoming applications because researchers can find direct connections to similar or related applications.
• We consider deep learning the most appealing ML technique because it is exploited in all the SC areas of application, dominating by far in the smart environment and smart living areas. Deep learning techniques demonstrate that they can deal with the huge volumes of data constantly produced in modern cities and can develop solutions for the most prevalent urban issues.
• However, the algorithms applied are very diverse, depending on the application and the ML technique. The most employed learning type is supervised learning, and the predominant approach is to apply multiple algorithms.
• Among the four types of learning in classical ML applications and in deep learning applications, supervised learning stands out as the preferred option in all the SC areas, based on using open data platforms and operating multiple algorithms.
Furthermore, we have researched the challenges associated with open data utilization in ML applications where our analysis has generated meaningful results. As data can be retrieved from numerous data sources for different domains, there are also issues/challenges for researchers/developers when using these in the SC applications. We discovered that when using an open data platform the challenges could vary from quality of data, to frequency of data collection, to consistency of data, to data format, or to no issues at all. These challenges were described in relation to the different data source categories and the types of ML applications.
Our paper has some limitations. Concerning RQ3, there is a probability of bias in our analysis, since some authors may not have reported the problems they met with open data utilization: sometimes no such problems arise, but in other cases they exist and are not stated. We followed the PRISMA methodology, complemented by in-depth analysis with visualizations produced in Power BI, but we could also have applied other techniques, such as bibliometric analysis, to provide more information. Additionally, we could dig deeper into the usage of ML algorithms in order to discover more about the rationale for choosing algorithms for ML applications. From a regional perspective, we did not explore aspects such as the countries or regions that produce the largest number of such papers, the relevance or irrelevance of election years to the publishing of such papers, or the political climate and open data laws in the regions that produce the most such papers. We regard all these shortcomings as opportunities for further research.

This paper has some theoretical and practical implications. The first theoretical contribution is a detailed view of the current approaches to open data-based machine learning applications in smart city initiatives, obtained by reviewing the very recent literature. Another theoretical contribution is a synthesis of open data-based ML experiments/applications, obtained by classifying the investigated papers by SC area of application, ML technique and algorithms, and types of open data used. From a practical perspective, in view of the increasing interest of both academia and industry professionals in researching SC innovation from the perspective of the huge potential of machine learning and open data, we believe that they may find valuable insights in our analysis. Our results are particularly helpful for researchers who are beginning to develop ML applications for SC because they can calibrate their own work based on the previous practical results explored in this paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
• The proposed study deployed a deep sequential model for the early prediction of at-risk students based on their week-wise clickstream interactions with the VLE.
• The article proposes a model of mobility where displacements are grouped together into geographical clusters. The authors propose an ML algorithm to infer the probability of finding people in geographical locations and the probability of movement between pairs of locations.
• The authors propose an application created to forecast the future traffic flow and plan the driving route. It helps effectively relieve traffic flow and reduce travel time and carbon emissions.
• In this work, an ML application is created to make predictions about the statuses of the stations of a public bicycle service.
• In this work, the authors present and evaluate an end-to-end framework for computing disaggregated population mapping employing convolutional neural networks.
• The authors propose an integrated system model for intelligent waste collection and quantify its benefits and economic costs when deploying and using it, in order to evaluate its feasibility as a real-world smart city application.
• The main focus is to apply ML in order to perform an in-depth analysis of the major types of crimes that occurred in the city, observe the trend over the years, and determine how various attributes contribute to specific crimes.
• The paper proposes the use of one single, publicly available source of data, Sentinel-2 satellite imagery, and tests whether features could be automatically extracted from it with a state-of-the-art deep learning framework and whether, in the end, the extracted features could predict vitality.
• The paper proposes a system for collecting public data on car park occupancy values, displaying them in a user-friendly web service, storing them to be consulted as a historical archive, and using these past data to predict the car parks' occupancy rate in the coming week.
• This research examines two sets of supervised machine learning techniques in order to predict the visitors' distribution in the next timesteps and evaluates them using real data from a large music event.
• The study aims at exploring the potential of machine learning algorithms in the context of an object-based image analysis and at thoroughly testing the algorithms' performance under varying conditions to optimize their usage for urban pattern recognition tasks.
• The scope is to assess the feasibility and accuracy of an automated mapping methodology using a multi-step pipeline that combines deep learning and geospatial techniques for detecting sustainable roofs of up to 100% in some cities.
• The study proposed an interdisciplinary research method to predict multi-building energy use by integrating a social network analysis with an artificial neural network technique.
• This paper attempts to explore the ability of machine learning algorithms to model grid-level residential land prices using the case of Wuhan in China. Several land price prediction models were built using five machine learning algorithms and various geographic variables.
• This study provides the details of a machine learning based approach that enables the prediction of the impact of construction projects on quality of life in urban settings through the quantification of changes in quality of life indicators (e.g., noise, air quality, traffic) in cities, inferred from open city data.