Machine Learning and Big Data in the Impact Literature. A Bibliometric Review with Scientific Mapping in Web of Science

Abstract: Combined use of machine learning and large data allows us to analyze data and find explanatory models that would not be possible with traditional techniques, which is basic within the principles of symmetry. The present study focuses on the analysis of the scientific production and performance of the Machine Learning and Big Data (MLBD) concepts. A bibliometric methodology of scientific mapping has been used, based on processes of estimation, quantification, analytical tracking, and evaluation of scientific research. A total of 4240 scientific publications from the Web of Science (WoS) have been analyzed. Our results show a constant and ascending evolution of the scientific production on MLBD, with 2018 and 2019 being the most productive years. The publications are mainly in English. The topics vary across the different periods analyzed; "machine-learning" shows the greatest bibliometric indicators, is found among the motor topics in most periods, and offers the greatest line of continuity between the different periods. It can be concluded that research on MLBD is of interest and relevance to the scientific community, which focuses its studies on the branch of machine-learning.


Introduction
The idea of machine learning is not new in computing, but the constantly changing needs of the present world have led it to emerge in new forms. With the expansion of the web, a large amount of digital data is being created, which means that far more data are available for machines to learn from and analyze. Today, machine learning algorithms enable computers to communicate with humans, drive cars autonomously, write and match reports, and identify suspected terrorists. Supervised, unsupervised and reinforcement learning are the three sub-areas of machine learning [1].
Some machine learning techniques are not efficient or scalable enough to cope with the high volume, value, velocity, and variety of Big Data, and therefore need to be reworked for Big Data processing [2]. Scalability is a difficult problem for conventional machine learning algorithms [3]. If a machine learning approach is used to address a computational deficiency and a physics-based model is available, numerical results may be sufficient to compute acceptable performance measures [2]. Machine learning is used in web search, spam filters, ad placement, recommender systems, credit scoring,

Purpose of Study
In the present study the concepts "Machine Learning" and "Big Data" (MLBD) in the scientific literature registered in the Web of Science (WoS) database are analyzed. For the development of this research, scientific mapping will be used based on the measurement of different bibliometric indicators and the dynamic and structural development of the delimited constructs. Previous studies of impact journals of the Journal Citation Reports (JCR) have been taken as a methodological model with the purpose of following a research method validated by specialists in this type of analysis [27,28].
The purpose of this study is to analyze the path and projection of both terms in the indexed publications in the main WoS collection. First, the database was analyzed to inquire about the state of the matter and verify the existence of studies that have treated the concepts presented at the bibliometric level. As a result, no study was reported in which MLBD were related and analyzed using the scientific mapping technique.
This work therefore has an exploratory component that helps to reduce the gap found in the WoS literature. In this line, the findings reached here will represent an advance by presenting new results that may arouse the interest of other researchers to continue studying this field.
The objectives proposed in this research are:
To determine the performance of the scientific production indexed in WoS on MLBD.
To determine the scientific evolution of MLBD in WoS.
To identify the most prominent topics on MLBD in WoS.
To identify the most influential authors on MLBD in WoS.

Research Design
Bibliometrics was the methodology used to develop the study and achieve the proposed objectives. This research approach was chosen for the strengths of scientometrics in the search, registration, analysis and prediction of scientific literature [29]. For optimal development, the guidelines of experts in bibliometrics were followed [30].
This study focused on the analysis of co-words [31] and bibliometric indicators such as the h-index and its derivatives (g, hg, q2) [32]. This approach allowed us to generate maps with nodes that determined the performance and location of various conceptual subdomains linked to the terms "Machine Learning" and "Big Data". This served to specify the thematic development of these constructs in WoS [33].
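The indicators named above can be illustrated with a minimal sketch. It assumes the usual definitions from the bibliometric literature (h: the largest h such that h documents have at least h citations each; g: the largest g such that the top g documents total at least g² citations; hg: the geometric mean of h and g; q2: the geometric mean of h and the median citation count of the h-core). The function names and the citation counts are hypothetical and are not taken from the study's data.

```python
from math import sqrt
from statistics import median


def h_index(citations):
    """Largest h such that h documents have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)


def g_index(citations):
    """Largest g such that the top g documents total at least g**2 citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g


def hg_index(citations):
    """Geometric mean of the h-index and the g-index."""
    return sqrt(h_index(citations) * g_index(citations))


def q2_index(citations):
    """Geometric mean of h and the median citation count of the h-core."""
    h = h_index(citations)
    if h == 0:
        return 0.0
    h_core = sorted(citations, reverse=True)[:h]
    return sqrt(h * median(h_core))


# Hypothetical citation counts for the documents of one theme.
example = [10, 8, 5, 4, 3]
print(h_index(example), g_index(example), hg_index(example), q2_index(example))
```

In this sketch the g-index is capped at the number of documents, which is one common convention; other conventions exist.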

Procedure and Data Analysis
The research process was carried out in several steps. First, the database was selected. In this case, WoS was chosen because it contains a large number of indexed high-impact studies. Next, the keywords to be analyzed were determined. In this study the terms "Machine Learning" and "Big Data" were chosen, after consulting various specialized thesauri. Next, the search equation was constructed. The result was "Machine Learning" [TOPIC] AND "Big Data" [TOPIC], intended to retrieve the scientific documents that had these terms in the title, abstract, or keywords of indexed publications.
These first steps yielded a scientific production of 4328 documents. The first studies dated back to 2010. Therefore, the literature of the last 10 years (2010-2019) was taken, excluding studies published in 2020 (n = 74) because the year had not finished, as well as duplicated or incorrectly indexed documents (n = 14). The unit of analysis therefore comprised 4240 documents. This figure resulted from applying various production indicators with their respective inclusion criteria: year of publication (all production except 2020), language (x ≥ 10), publication area (x ≥ 700), type of document (x ≥ 100), organizations (x ≥ 50), authors (x ≥ 10), sources of origin (x ≥ 30), countries (x ≥ 200) and citation (the four most cited documents; x ≥ 250). These steps were recorded in a flow chart based on the PRISMA-P (PRISMA for systematic review protocols) matrix (Figure 1).
To analyze the reported literature, several software tools were used. Two are tools from WoS, Analyze Results and Create Citation Report. These were used to extract the data related to the year, authorship, country, type of document, institution, language, medium and most cited documents. The other program was SciMAT, used to longitudinally analyze the structural and dynamic development of scientific production. For an effective analysis, the instructions of experts in this software were followed [34].
SciMAT allowed the thematic co-word analysis to be carried out through the following processes:
Recognition: in this process various actions were carried out: (a) analyzing the keywords of the reported documents (n = 12657); (b) generating a map of co-occurrence nodes; (c) developing a normalized network of co-words; (d) detecting the most significant keywords (n = 11993); (e) representing the most influential topics and terms through a clustering algorithm.
Reproduction: Following the principles of centrality and density, a strategic diagram and a thematic network were developed. Centrality measures the degree of interaction of a network with other networks and is expressed by the equation $c = 10 \sum e_{kh}$, where k is a keyword belonging to the topic and h a keyword belonging to other topics. Centrality analyzes the strength of external links to other themes.
This value was considered the measure of the importance of a theme in the development of the entire field of research analyzed. Density measures the internal strength of the network and is expressed by the equation $d = 100 \left( \sum e_{ij} / w \right)$, where i and j are keywords belonging to the topic and w is the number of keywords in the topic. Density analyzes the strength of the internal links between all the keywords that describe the research topic. This value was considered a measure of the degree of development of the topic under study. The resulting diagram has four areas: upper right (motor and relevant themes), upper left (rooted and isolated themes), lower left (emerging or disappearing themes) and lower right (underdeveloped and transversal themes).
Determination: The development of the nodes in different periods or time intervals is studied. In this case, five periods were delimited (P1 = 2010-2015; P2 = 2016; P3 = 2017; P4 = 2018; P5 = 2019). The strength of association was achieved through the volume of keywords in common in the different periods. However, for the authorship all literary production was taken. Therefore, a single period was configured (PX = 2010-2019).
Performance: To carry out this process, various production indicators were taken with their corresponding inclusion criteria in order to be considered in the study (Table 1).
Table 1. Production indicators and inclusion criteria.

Configuration | Values
Analysis unit | Keywords authors, keywords WoS
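The centrality and density measures defined in the Reproduction step can be sketched as follows. The source does not define the link weight e, so this sketch assumes the equivalence index commonly used in co-word analysis, e_ij = c_ij² / (c_i · c_j), where c_ij is the number of documents in which keywords i and j co-occur and c_i, c_j are their individual document counts. The function names, the toy documents and the theme composition are hypothetical, and the code is not SciMAT's implementation.

```python
from collections import Counter
from itertools import combinations


def equivalence_index(docs):
    """Map each keyword pair to e_ij = c_ij**2 / (c_i * c_j) (assumed weight)."""
    occurrences = Counter()
    cooccurrences = Counter()
    for keywords in docs:
        unique = sorted(set(keywords))
        occurrences.update(unique)
        cooccurrences.update(frozenset(pair) for pair in combinations(unique, 2))
    e = {}
    for pair, c_ij in cooccurrences.items():
        i, j = pair
        e[pair] = c_ij ** 2 / (occurrences[i] * occurrences[j])
    return e


def centrality(theme, e):
    """c = 10 * sum of e_kh, with k inside the theme and h outside it."""
    return 10 * sum(v for pair, v in e.items() if len(pair & theme) == 1)


def density(theme, e):
    """d = 100 * (sum of e_ij / w), with i, j inside the theme and w its size."""
    internal = sum(v for pair, v in e.items() if pair <= theme)
    return 100 * internal / len(theme)


# Hypothetical keyword lists, one per document.
docs = [
    ["machine-learning", "big-data", "hadoop"],
    ["machine-learning", "big-data"],
    ["hadoop", "mapreduce"],
    ["machine-learning", "mapreduce"],
]
e = equivalence_index(docs)
theme = frozenset({"machine-learning", "big-data"})
print(centrality(theme, e), density(theme, e))
```

A theme with high values on both measures would fall in the upper right area of the strategic diagram; how the axes are split into areas is not specified in the text and is left out of the sketch.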

Performance and Scientific Production
The scientific production on MLBD, comprising 4240 documents, grew steadily over time, with exponential growth from its beginnings until 2018, after which it remained stable through 2019. In other words, the production levels in 2018 and 2019 were similar, showing an equal interest in both years by the scientific community (Figure 2). The language chosen by authors for the presentation of the academic results was mostly English (Table 2a). The main areas of knowledge in MLBD studies, with similar numbers, were Computer Science Theory Methods, Computer Science Information Systems and Engineering Electrical Electronic (Table 2b).
Articles and conference papers were the main document types used to present the information, in similar numbers (Table 2c). The main organization contributing MLBD studies was the University of California System, well ahead of the rest (Table 2d).
The authors with the highest production were Wang L. and Wang Y., with no great differences from the rest of the authors (Table 2e). The main source of studies on MLBD was the IEEE International Conference on Big Data, which compiles the works presented at the conference.
The main journal in this field of study was Lecture Notes in Computer Science (Table 2f). The country with the greatest production on MLBD was the United States, with twice as much production as the next country, China (Table 2g).
The reference authors for the scientific community, owing to their high citation count, were Kosinski, M.; Stillwell, D.; Graepel, T., whose article "Private traits and attributes are predictable from digital records of human behavior" accumulated a high number of citations in MLBD studies. They were followed, with fewer citations, by Muja, M.; Lowe, D.G., with their article "Scalable Nearest Neighbor Algorithms for High Dimensional Data" (Table 3).

Structural and Thematic Development
The evolution of keywords shows the number of keywords in each of the established time intervals, the number of matching keywords between periods and the number of keywords leaving and entering a given period with respect to another. In this case, a more established line of research can be observed in the last four years, which shows the same trend in the scientific community itself (Figure 3).
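The keyword counts described for the evolution of keywords (shared, leaving and entering between two consecutive periods) reduce to set operations. The sketch below is illustrative only: the function name and the keyword sets are hypothetical and do not reproduce the figures of the study.

```python
def keyword_flow(previous, current):
    """Count keywords shared by, leaving, and entering between two periods."""
    previous, current = set(previous), set(current)
    return {
        "shared": len(previous & current),    # present in both periods
        "leaving": len(previous - current),   # present only in the earlier period
        "entering": len(current - previous),  # present only in the later period
    }


# Hypothetical keyword sets for two consecutive periods.
period_a = {"hadoop", "machine-learning", "mapreduce"}
period_b = {"machine-learning", "privacy", "neural-networks", "smart-meter"}
print(keyword_flow(period_a, period_b))
```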

The academic performance in the established periods offers the subjects with the greatest bibliometric indicators, using the h index as the main reference, and completing this information with the g index, hg index and q2 index, in addition to the number of citations.
In this case, the "machine-learning" theme was shown to be the one that presents the highest bibliometric indicators in all periods, except in 2016, where "predictions" was the theme with the highest values. The variety of themes that appeared in the different periods is noteworthy, offering the main lines of research developed (Table 4).
The diagrams of the intervals show the importance of each theme in the different periods. For this purpose, a grouping process was developed according to Callon's indicators, which assess the degree of interaction of a network with respect to other networks along two axes: centrality, which analyzes the strength of the external links with other topics and shows the importance of a topic in the development of a field of research; and density, which assesses the internal strength of the network by analyzing the internal links between the keywords grouped around a specific topic, giving information on the degree of development of a field of study. In the first period (2010-2015), the driving themes were "machine-learning" and "Hadoop".
In the second period (2016), the driving themes were "privacy", "Smart-meter" and "neural-networks". In the third period (2017), they were "mapreduce", "apache-spark" and "support-vector-machine". In the fourth period (2018), they were "machine-learning", "mortality", "precision-medicine", "random-forest" and "data-analytics". In the last period (2019), they were "machine-learning", "internet", "artificial-neural-networks" and "mapreduce". In this period, the themes "feature-selection", "decision-making" and "information" deserve attention, since their location in the diagram makes their trajectory uncertain: they may become driving themes in the future or may tend to disappear from scientific production (Figure 4).

Thematic Evolution of the Terms
The thematic evolution analyzes the thematic development of the scientific field studied across the established time periods. Let $T^t$ be the set of themes detected in period $t$, where $U \in T^t$ represents each of the themes detected in that period, and let $V \in T^{t+1}$ be each of the themes detected in the following period $t+1$. There was thematic evolution from theme U to theme V if the thematic networks of both themes shared at least one keyword. In this way, V could be considered a theme evolved from U. The keywords $k \in U \cap V$ were considered the thematic nexus or conceptual nexus of evolution.
The importance of the thematic nexus was measured by the number of themes they had in common, and is represented by solid and dashed lines. A solid line means that the linked topics shared the same name, that is, both topics are labelled with the same keyword, or the label of one of the topics was part of the other topic. A dashed line means that the topics shared elements other than the name of the topics. The strength of the link between two topics is proportional to the value of the Jaccard index of both topics. The volume of the spheres is proportional to the number of documents associated with the topic.
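The linking rule described above can be sketched as follows, assuming each theme is stored as a set of keywords that includes its own label. The Jaccard index of two themes U and V is |U ∩ V| / |U ∪ V|. The function names and the theme compositions are hypothetical; the theme labels are borrowed from the periods discussed, but the keyword sets are invented for illustration.

```python
def jaccard(u, v):
    """Jaccard index |U ∩ V| / |U ∪ V| of two keyword sets."""
    u, v = set(u), set(v)
    union = u | v
    return len(u & v) / len(union) if union else 0.0


def thematic_links(themes_t, themes_next):
    """List links between themes of consecutive periods sharing a keyword.

    Each link is (U label, V label, shared keywords, Jaccard weight, line style).
    """
    links = []
    for name_u, kw_u in themes_t.items():
        for name_v, kw_v in themes_next.items():
            nexus = kw_u & kw_v
            if not nexus:
                continue  # no shared keyword, so no thematic evolution
            # Solid line: same label, or the label of one theme is in the other.
            same_name = name_u == name_v or name_u in kw_v or name_v in kw_u
            style = "solid" if same_name else "dashed"
            links.append((name_u, name_v, sorted(nexus), jaccard(kw_u, kw_v), style))
    return links


# Hypothetical theme compositions for two consecutive periods.
themes_2018 = {
    "machine-learning": {"machine-learning", "classification", "big-data"},
    "random-forest": {"random-forest", "prediction"},
}
themes_2019 = {
    "machine-learning": {"machine-learning", "big-data", "deep-learning"},
    "internet": {"internet", "prediction", "iot"},
}
for link in thematic_links(themes_2018, themes_2019):
    print(link)
```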
The results show a conceptual gap when all periods are taken into account, given that no theme was repeated in all the established intervals. The year 2016 produces this gap: in the rest of the periods the conceptual line is marked by "machine-learning", especially in the last three years, where the connection, besides being thematic, was solid and consolidated, placing it as a reference in this field of research.
A relevant aspect to bear in mind is that there are more thematic connections than keywords, which shows that the trends in research are established and connected. The evolution of the studies on MLBD shows that the studies were based on purely computer science aspects in the first years and evolved towards medical aspects, thus showing the transformation of the research in this field of study (Figure 5).

Authors with a Higher Relevance Index
According to the data shown in the analysis of authors, Peixoto, R. and Poornachandran, P. were positioned as the driving authors, while Song, J.N. was positioned as an author who may be relevant in the future in this field of study. Authors such as Passos, I.C., Momayoun, H. and Mosavi, A., although they showed the highest h indexes, were either developed and isolated or basic and transversal (Figure 6).

Discussion
As shown in the previous section, MLBD publications grew exponentially from 2010 to 2018, and the number of publications was maintained the following year [35]. This reflects a state of the art that is booming and is of interest to researchers from the scientific community, who are contributing to the advancement of this field of knowledge and, in the same way, to science.
As in other studies [36][37][38], the language of publications is predominantly English, the worldwide language of science; publications in other languages are marginal and do not reach significant figures.
Most of the published documents are articles and conference papers, which serve to disseminate the studies on MLBD [9]. Each of these types accounts for almost two thousand documents, while the remaining document types, such as reviews and editorial material, account for a small minority in comparison.
The most productive organization in MLBD studies is the University of California System, with a great difference from other organizations. As in studies on other topics [39], the United States remains the country with the largest MLBD production, followed by China with almost half its number of publications. The study that has received the most citations is "Private traits and attributes are predictable from digital records of human behavior", published in Proceedings of the National Academy of Sciences of the United States of America in 2013 by Kosinski, M., Stillwell, D., and Graepel, T. [35], with more than five hundred citations. In addition, the machine-learning theme has the greatest bibliometric indicators in the periods analyzed.

Conclusions
The analysis shows that there are more connections between the themes than between the keywords themselves, which are so important for discourse analysis; this reveals that the research trends in this field are established and connected. The publications on the subject of this research have been more prolific during the last four years, which is where there is a greater coincidence of keywords between periods. The theme par excellence in this type of study is "machine-learning", which has the highest bibliometric levels in most of the periods analyzed and is the one that most often appears as a motor theme in the established periods. This shows the great importance that the scientific community gives to the term when carrying out its investigations, placing it even above Big Data.
Research in this field also shows many thematic connections, from which it can be inferred that the studies are related to each other. In addition, it shows an evolution in the basis of the studies, which were at first based on purely computer science aspects and have advanced in recent times to aspects related to the field of medicine. Finally, the authors Peixoto, R. and Poornachandran, P. are placed as motor authors and are, therefore, those with the most relevance and importance for the scientific community.
As a future prospect, we propose an in-depth content analysis of MLBD publications, examining whether the texts report empirical experiences or remain at a theoretical level, and whether these publications take a positive or negative position on the subject analyzed.
This investigation has several limitations. First, there is the cleaning of the data retrieved from WoS, which contains repeated documents and documents that are not related to the subject of the study. Second, there is the establishment of the intervals, which in this case was a matter of balance, since the researchers tried to maintain a similar number of documents in each of the intervals. Third and last, the parameters set in this study were established according to the researchers' own criteria, with the aim of presenting the results according to their size and relevance. Therefore, the data presented here should be interpreted with caution, since changing the parameters established in this investigation may lead to variations in the number of themes presented and in the connections between them.