Big Data Usage in European Countries: Cluster Analysis Approach

The goal of this research was to investigate the level of digital divide among selected European countries according to the big data usage among their enterprises. For that purpose, we apply the K-means clustering methodology to the Eurostat data about the big data usage in European enterprises. The results indicate that there is a significant difference between selected European countries according to the overall usage of big data in their enterprises. Moreover, the enterprises that use internal experts also used diverse big data sources. Since the usage of diverse big data sources allows enterprises to gather more relevant information about their customers and competitors, this indicates that enterprises with stronger internal big data expertise also have a better chance of building strong competitiveness based on big data utilization. Finally, substantial differences among the industries were found according to the level of big data usage.


Introduction
The development of information and communication technologies (ICTs) in the last several decades has played an important role in the world's socio-economic progress. Countries with higher levels of ICT adoption enjoy better economic outcomes in return [1]. Nevertheless, a digital society is still an elusive aspiration for some countries, which, in turn, causes a digital divide at both the individual and the enterprise level [2].
In 2003, a World Summit was held in Geneva, which addressed various technological issues, with the digital divide being one of them. A digital divide occurs when groups are formed with different levels of access to specific technological infrastructures, and it is often measured at the level of individual persons. On a psychosocial level, this divide can refer to those who embrace the new digital revolution and those who reject it, for various personal and demographic reasons [3]. However, recently, the digital divide has substantially decreased for some of the technologies [4].
On the other hand, new and upcoming technologies contribute to the digital divide among enterprises, which is especially worrisome, since enterprises nowadays heavily depend on ICTs as leverage for increasing their competitiveness. One such technology is big data, which is mainly driven by the emergence of Industry 4.0. The notion of Industry 4.0 (or Industrie 4.0) was initially proposed as a concept at the 2011 Hannover Fair, while in 2013, it became a German strategic initiative [5]. As remarked by Witkowski [6], the fourth industrial revolution (Industry 4.0) is facilitated by the development of the Internet of Things (IoT) and big data. These technologies enabled the automation smart devices or sensors; (ii) analyzing big data from the geolocation of portable devices; and (iii) analyzing big data generated from social media.
Recent studies have reported on the beneficial impact of big data analytics in diverse industries [8,23,27]. The differences in impact stem from the different nature of the data relevant for different industries, e.g., structured data, textual data, multimedia files, web and social media logs, network logs, internet-of-things, and mobile logs [10,28]. Castelo-Branco et al. investigated Industry 4.0 in EU countries [29]. Their findings suggest that differences in manufacturing digitization could be partially explained by enterprises' big data maturity.
Since big data acquisition, management, and analytics have emerged only recently, the skills relevant to big data are scarce on the labor market, as well as in the curricula of bachelor and master educational programs [30]. To fill this gap, numerous massive open online courses and extracurricular courses have been launched, such as "Data Science and Big Data Analytics: Making Data-Driven Decisions" available at MIT [31]. Rohrbeck argued that the availability of internal ICT experts is a significant driver of profitability, since such experts have in-depth knowledge about the enterprise data, processes, and strategic goals [32]. Due to the shortage of big data skills, enterprises likely employ both internal and external big data experts. However, the question arises whether the availability of internal experts could lead to a greater level of big data utilization.
In this work, we focus on the usage of big data in Europe, intending to investigate differences between European countries according to the usage of big data by their enterprises, since the digital divide at the enterprise level has been demonstrated for various ICTs. For that purpose, we analyzed the data about big data usage from Eurostat, collected as part of the European ICT usage survey [33]. The data include information about the overall usage of big data and the usage of various big data sources (e.g., social media, internet of things), as well as the usage of internal and external big data expertise.
We analyzed these data by using K-means cluster analysis, which is often used for analyzing the digital divide due to its ability to form homogenous groups of cases based on several variables [34,35]. Our analysis generated three clusters, which were in turn compared according to the level of usage of internal or external big data experts, and the level of big data usage in various industries. The results indicate that the big data digital divide is present in European countries, both at the country and industry level. The use of experts is also confirmed to benefit big data utilization.
The paper is organized as follows. After the introduction section, the methodology section describes the data and statistical methods used. The third section presents the results of cluster analysis and compares the usage of internal or external big data expertise, and usage of big data in various industries. The final section summarizes the main ideas of the study and provides a discussion of theoretical and practical contributions.

Data
For this research, we use the data set about the big data usage in enterprises obtained from Eurostat. Table 1 presents the variables used in the research. The data consist of two groups of variables: (i) sources of big data used in enterprises and (ii) big data external or internal expertise employed in enterprises.
The data were collected by the National Statistical Offices in 2018 for 28 European countries. The dataset includes all of the European Union member states, as well as Norway, while leaving out the UK. Data were collected on the enterprise level for three groups of enterprises according to size (small, medium, and large). The size of the enterprise was established based on the number of persons employed. Information about the usage of the following big data sources was collected: enterprise's smart devices or sensors, geolocation of portable devices, and data generated from social media, as well as the usage of any data source. Variables on the usage of internal or external expertise for conducting big data analysis are also taken into account. Finally, the information about the enterprises' industry was extracted from Eurostat.

Research Questions and Statistical Analysis
For the analysis of the big data digital divide in European countries, we pose three research questions: (i) RQ1. What is the level of big data digital divide among European countries according to the usage of big data technologies in small, medium, and large enterprises, taking into account various sources of big data?; (ii) RQ2. What is the impact of using internal or external big data experts for delivering big data solutions on the level of acceptance of big data?; (iii) RQ3. What is the level of big data usage in various industries, and how is it related to the level of acceptance of big data?
The first research question (RQ1) was addressed by using cluster analysis. Cluster analysis aims to decrease the dimensionality of a dataset by identifying homogenous groups of data [36]. The clustering of data instances results in groups whose members have similar features, while the data instances in different groups have significantly different features.
The first step in cluster analysis is to determine the characteristics, i.e., variables, that will be used for the segmentation of data [37,38]. The clustering variables are usually selected with respect to the theory and the specific topic of the research [39]. Consequently, 12 observed variables on big data utilization were used for the clustering in our analysis. The second step in cluster analysis is to select the clustering method [39]. There are several clustering methods, but the most frequently employed one is the non-hierarchical k-means clustering approach [40,41], due to its ability to reach a stable solution, which increases the trustworthiness of the results [39]. The third step in cluster analysis is choosing the number of clusters. In k-means, the number of clusters should be selected by the analyst, using various rules or expert knowledge. There are several approaches proposed for this purpose [42]. We opted for observing the graph of the cost sequence to find the appropriate number of clusters [43], supplemented with the v-fold cross-validation approach to find the optimal number of clusters and ensure the robustness of the solution [39,42,44]. Finally, after the cluster solution is found, the clustering results can be interpreted with respect to the underlying theory and research domain.
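To make the clustering step concrete, the following is a minimal sketch of k-means in plain Python. It is not the software used in this study: the random choice of starting centroids, the function names, and the six two-variable data rows are assumptions made only for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to `point` (ties go to the lowest index)."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen observations.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments no longer change
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical usage shares (%) for six made-up countries, two variables each.
data = [[30.0, 12.0], [28.0, 11.0], [8.0, 3.0],
        [9.0, 2.5], [18.0, 7.0], [17.0, 6.5]]
centroids, labels = kmeans(data, k=3)
```

Because k-means can settle in a local optimum that depends on the starting centroids, in practice the procedure is usually repeated from several starting points, or started from a deliberate initialization.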
To provide an answer to the first research question (RQ1), we analyzed the countries in clusters according to their geographical position and according to the utilization of big data analysis among small, medium, and large enterprises.
The second research question (RQ2) was answered using ANOVA analysis to investigate the differences among countries in clusters according to the usage of internal or external expertise for delivering big data solutions.
ANOVA analysis was also used for answering the third research question (RQ3), to investigate the different levels of big data usage in European enterprises across various industries.
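The comparison of cluster means by one-way ANOVA can be sketched as follows. This is an illustrative computation of the F statistic only, written in plain Python; the function name and the three groups of values are hypothetical and are not taken from the Eurostat data.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical shares (%) of enterprises using internal experts, by cluster.
cluster_1 = [14.0, 16.0, 15.0]
cluster_2 = [5.0, 6.0, 4.0]
cluster_3 = [8.0, 9.0, 7.0]
f_stat, df_b, df_w = one_way_anova_f([cluster_1, cluster_2, cluster_3])
```

A p-value additionally requires the F distribution with (df_between, df_within) degrees of freedom; a statistical package such as SciPy (`scipy.stats.f_oneway`) would typically be used for that step.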

Descriptive Statistics Analysis
Table 2 presents the descriptive statistics of the observed variables. Big data utilization was measured as the percentage of enterprises using a certain big data source. Therefore, the data about the usage of various big data sources (enterprise's smart devices or sensors, geolocation of portable devices, data generated from social media, or any data sources) among the small, medium, and large enterprises were examined. Overall, ICTs are most often used by large enterprises, which have the highest need for sophisticated ICT solutions, as well as sufficient financial and human resources for their implementation [3]. This trend was also observed in the big data usage presented in Table 2. On average, 33.5% of large enterprises use big data from any source. In detail, 19.89% of large enterprises use big data from the enterprise's smart devices or sensors, and 13.71% of large enterprises use data from the geolocation of portable devices. Similarly, 13.5% of large enterprises exploit data insights from social media.
On the other hand, for every big data source category, small enterprises have the lowest level of big data usage. This result suggests that small enterprises either do not recognize the need for big data analysis or do not have the resources to conduct it. The lowest results are detected for the devices category, where only 3.21% of small enterprises indicate that they use big data from the enterprise's smart devices or sensors. The situation is somewhat better when observing the utilization of big data from any source, with 10.57% of small enterprises using at least one of the big data sources.
Regarding the usage of internal or external big data expertise, 29.8% of the large enterprises use in-house experts for big data analysis. At the same time, 15.36% of the medium enterprises do the same, followed by 7.56% of the small enterprises. Big data analysis is conducted by an external service provider in 12.6% of the large enterprises, 6.76% of the medium enterprises, and 3.92% of the small enterprises. This result suggests that small enterprises do not have sufficient human resources to utilize big data analysis.

K-Means Cluster Analysis
K-means clustering was applied using the variables presented in Table 2. To calculate the initial centroids, the maximum average distance was applied. Afterward, data instances were iteratively assigned to the cluster with the closest centroid, using the squared Euclidean distance. As already mentioned, k-means clustering starts with choosing the appropriate number of clusters. There are several approaches for deciding upon the number of clusters in k-means. Some of the approaches include the "elbow" method, the rule of thumb, information criteria, and cross-validation [42]. Along with these mathematically oriented and graphically assisted approaches, expert knowledge rooted in the theoretical background of the field is suitable for selecting the number of clusters in some situations [45]. However, this approach can introduce researcher bias. We opted for observing the graph of the cost sequence to find the appropriate number of clusters [44,46]. Additionally, v-fold cross-validation was employed [44,47]. V-fold cross-validation randomly divides the data into v samples, each of which serves in turn as the validation set while the remaining samples form the training set, to ensure the stability of the results. If the clustering algorithm works well, it provides similar partitions regardless of the sample drawn from the original dataset [42].
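The v-fold cross-validation idea can be illustrated with the following plain-Python sketch. It is a simplified stand-in rather than the procedure of the statistical software used here: the farthest-first initialization only approximates a distance-maximizing choice of starting centroids, and all function names and data values are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_init(points, k):
    """Assumed initialization: take the first point, then repeatedly add
    the point farthest from the centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def fit_kmeans(points, k, max_iter=100):
    """K-means with squared Euclidean distance; returns the centroids."""
    centroids = farthest_first_init(points, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[j])
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids

def cv_cost(points, k, v=3, seed=0):
    """Average distance of held-out points to their nearest centroid,
    where centroids are fitted on the remaining v - 1 folds."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::v] for i in range(v)]
    total, count = 0.0, 0
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        centroids = fit_kmeans(train, k)
        for p in test:
            total += min(sq_dist(p, c) for c in centroids) ** 0.5
            count += 1
    return total / count

# Hypothetical usage shares (%) for nine made-up countries.
data = [[2.0, 1.0], [3.0, 1.5], [2.5, 0.5],
        [10.0, 4.0], [11.0, 4.5], [10.5, 5.0],
        [30.0, 12.0], [29.0, 13.0], [31.0, 12.5]]
costs = {k: cv_cost(data, k) for k in (1, 2, 3, 4)}
```

The resulting `costs` dictionary is a cost sequence over candidate numbers of clusters. With a sample as small as 28 countries, the folds are small, so the cost estimates should be read with caution.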
The graph of the cost sequence is presented in Figure 1, which shows an error function for the different numbers of cluster solutions. The error function presented in Figure 1 can be defined as an "average distance of observations in testing samples to the cluster centroids to which the observations were assigned" [46]. The goal was to reduce the cost to an acceptable level, and the "elbow" method [43] was used for this purpose. As is noticeable from Figure 1, the graph displays an elbow at three clusters. Increasing the number of clusters beyond three does not substantially decrease the error function. Thus, the graph indicates that the three-cluster solution would be optimal in our case. Therefore, the k-means analysis was conducted with three clusters.
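The elbow reading of a cost sequence can also be expressed as a simple rule. The sketch below is one possible formalization, not the criterion applied in this study: it assumes a relative-improvement threshold (`min_gain`, a hypothetical parameter) and uses made-up cost values rather than those behind Figure 1.

```python
def elbow_k(costs, min_gain=0.10):
    """Return the smallest k after which adding one more cluster lowers
    the cost by less than `min_gain` (as a fraction of the current cost)."""
    ks = sorted(costs)
    for k, k_next in zip(ks, ks[1:]):
        gain = (costs[k] - costs[k_next]) / costs[k]
        if gain < min_gain:
            return k
    return ks[-1]  # no flattening found within the tested range

# Hypothetical cost sequence (not the values behind Figure 1).
costs = {1: 12.0, 2: 7.5, 3: 4.0, 4: 3.9, 5: 3.85}
best_k = elbow_k(costs)
```

In this made-up sequence the cost drops sharply up to three clusters and flattens afterwards, so the rule selects three. The threshold is a judgment call, which is why the visual inspection of the graph is usually reported alongside it.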
The ANOVA analysis of the clustering variables is shown in Table 3, indicating that all clustering variables are statistically significant for the formation of clusters. In other words, the average values of each variable differ significantly across clusters, confirming that unique clusters of countries can be identified. Countries that are members of Cluster 1 have the overall highest usage of big data analysis, taking into account all the observed variables, followed by Cluster 3 (Table 4). On the other hand, the lowest mean values are observed for the enterprises of the European countries within Cluster 2. Moreover, large enterprises analyze big data more than medium and small ones, for almost all of the data sources analyzed. Figure 2 presents the graph of the means of the observed variables across the clusters.
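Cluster means of the kind plotted in a graph of cluster means can be computed as in the following sketch. The function name, the cluster labels, and the four rows of values are hypothetical and serve only to show the calculation.

```python
def cluster_means(rows, labels):
    """Mean of every variable within each cluster.

    Returns {cluster_label: [mean_of_variable_1, mean_of_variable_2, ...]}.
    """
    grouped = {}
    for row, lab in zip(rows, labels):
        grouped.setdefault(lab, []).append(row)
    return {lab: [sum(col) / len(members) for col in zip(*members)]
            for lab, members in sorted(grouped.items())}

# Hypothetical data: one row per country; columns are the shares (%) of
# small, medium, and large enterprises using any big data source.
rows = [[20.0, 30.0, 50.0], [22.0, 32.0, 54.0],
        [8.0, 12.0, 20.0], [10.0, 16.0, 24.0]]
labels = [1, 1, 2, 2]
profile = cluster_means(rows, labels)
```

Comparing the rows of `profile` variable by variable shows which cluster leads on each measure and for which enterprise size the gap between clusters is widest.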

Geographical Distribution of Clusters
To answer the first research question (RQ1), which investigates the level of big data digital divide among European countries in small, medium, and large enterprises, taking into account various sources of big data, the geographical distribution of the clusters has been analyzed. Table 5 presents the distribution of the observed 28 European countries according to clusters, and Figure 3 presents the distribution of clusters of the European countries according to their geographical position. Cluster 1 has the highest mean values of all observed variables, and it contains the following countries: Belgium, Finland, Ireland, Malta, and the Netherlands, which is about 18% of the observed sample. Cluster 2 comprises the majority of the observed countries, 15 of them, which is 53.6% of the observed sample, including Austria, Bulgaria, Croatia, Cyprus, Czechia, Greece, Hungary, Italy, Latvia, Poland, Romania, Slovakia, Slovenia, Spain, and Sweden. Cluster 3 comprises the following countries: Denmark, Estonia, France, Germany, Lithuania, Luxembourg, Portugal, and Norway, which is 28.6% of the observed sample. It can be noted that the countries in Cluster 1, in which enterprises use big data to the highest extent compared to the other two clusters, are among the most developed countries in Europe. Countries in Cluster 3 are also among the most developed, and they follow the countries in Cluster 1 according to the big data usage among their enterprises. Cluster 2 contains the largest number of post-transition countries that are lagging in terms of economic development, such as Bulgaria, Greece, Romania, Slovakia, and Croatia. This cluster also contains developed countries, such as Sweden and Austria. It can be concluded that the big data digital divide is present in European countries, especially among large companies, among which the difference between the clusters is the highest (Figure 2). Although our results are informative and indicate substantial differences in the usage of big data between more developed and less developed European countries, they should be interpreted in light of the practices of the global economy, in which enterprises often operate in more than one country, organized as subsidiaries or large multinational corporations.
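The share of the sample that falls into each cluster follows directly from the country counts listed above. The short sketch below recomputes these shares; the variable names are illustrative.

```python
# Country counts per cluster, as listed in the text (5 + 15 + 8 = 28).
cluster_sizes = {"Cluster 1": 5, "Cluster 2": 15, "Cluster 3": 8}

total = sum(cluster_sizes.values())

# Share of the observed sample in each cluster, in percent (one decimal).
shares = {name: round(100 * size / total, 1)
          for name, size in cluster_sizes.items()}
```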

Relationship between Big Data Utilization and Source of Expertise (Internal or External)
The second research question (RQ2) refers to the investigation of the relationship between big data utilization and source of expertise, which can be internal or external. Therefore, the average values of the big data source of expertise across clusters have been calculated and presented in Figure 4. The results of the analysis reveal that the highest average values are observed in Cluster 1, followed by Cluster 3. Once again, the lowest average values, compared to other clusters, have been calculated for the countries belonging to Cluster 2. However, it can be noted that the differences between the clusters are the largest for the usage of internal expertise in large enterprises. A similar trend has been observed among medium-sized enterprises. On the other hand, the differences between the clusters are the smallest for the usage of both external and internal expertise in small enterprises. Table 6 presents in detail the mean values of the percentage of enterprises in each cluster according to the usage of external and internal expertise for big data analysis. For example, 14.75% of small enterprises are using internal expertise for big data analysis in Cluster 1, 5.08% in Cluster 2, and 8% in Cluster 3. ANOVA analysis revealed that these differences are statistically significant for all the observed variables at a 1% significance level. Since the usage of internal expertise is the highest in Cluster 1 compared to the other two clusters, and the observed differences are lower for the usage of external expertise, it can be concluded that the usage of internal expertise significantly contributes to the overall usage of big data in European enterprises, especially in the case of large and medium enterprises.

Relationship between Big Data Utilization and Industry Type
The last research question referred to the relationship between big data utilization and industry type (RQ3). Figure 5 presents the average usage of any big data source among countries in three clusters, according to specific industries. In all observed industry types, Cluster 1 achieved the highest average values regarding big data utilization in comparison to the other two clusters. In line with the results of other research questions, Cluster 2 has the lowest average values of big data usage for all the observed industry types. The highest average values have been achieved in the Information and communication industry, followed by Electricity, gas, steam, air conditioning, and water supply. Table 7 presents the results of the descriptive statistics of big data usage across industry types, as well as the results of the ANOVA analysis. For example, 16.6% of manufacturing enterprises are using big data in Cluster 1, 6.57% in Cluster 2, and 11.89% in Cluster 3. For most of the industries, the ANOVA analysis revealed that the observed differences are statistically significant at a 1% level. However, for the Transportation and storage and the Real estate activities industries, the differences are statistically significant only at a 5% level. In these industries, the observed mean values are also the most similar between the observed clusters, indicating that enterprises in these industries behave similarly across clusters. This result could be partially explained by the fact that these industries are among the most globalized, with enterprises that often operate in more than one country.

Discussion and Conclusion
The goal of the research was to investigate the level of digital divide among European countries according to big data usage at the country level and among different industries. Usage of big data helps enterprises to improve their competitiveness [7], which can be achieved in the following manner. First, big data allows enterprises to gather information about their customers, from social media and additional online sources, thus contributing to big data-driven customer intelligence. Second, big data allows enterprises to gather information about their competitors, from the competitors' websites and various secondary sources, such as stock exchanges, thus contributing to big data-driven competitive intelligence. Third, big data supports companies in the utilization of Industry 4.0, thus contributing to big data-driven process intelligence.
The first research question (RQ1) aimed to reveal the differences among European countries according to the usage of big data technologies in small, medium, and large enterprises. The results of the analysis revealed that the European countries can be divided into three homogenous clusters with distinctive differences between them according to the level of big data usage. The highest overall usage of big data is observed in Cluster 1, closely followed by Cluster 3, both of which mostly comprise the most developed European countries. The usage of big data is lowest in Cluster 2, which mostly comprises the post-transition developing European countries. Therefore, it can be concluded that the digital divide is present in European countries according to the usage of big data in their enterprises; however, this conclusion should be read bearing in mind that a substantial number of enterprises, such as multinational companies, operate in more than one country.
The second research question (RQ2) referred to the impact of using internal or external expertise for big data analysis. The results revealed that enterprises that use big data more often also rely on their internal experts far more than on external service providers. This trend is more pronounced in large enterprises than in small and medium ones.
The third research question (RQ3) referred to the level of big data usage in various industries. The results revealed that in all observed industry types, enterprises belonging to Cluster 1 (the best performing cluster) had the highest average values compared to the other two clusters, while those from Cluster 2 had the lowest ones. Within the Cluster 1 results, the highest average values have been achieved by enterprises in the Information and communication industry, followed by Electricity, gas, steam, air conditioning, and water supply, which suggests that such industries are the most efficient in big data utilization and its conversion to business value.
Our research contributes to several lines of research, resulting in the following theoretical contributions: (i) the confirmation of the research results about the leadership of Northern European countries in terms of technological innovations; (ii) the finding that there are substantial differences between the industries in terms of big data usage, with the manufacturing industry lagging, which can be a signal of a worrisome trend of the European countries lagging behind other leaders of Industry 4.0, such as the USA and China; and (iii) the finding that large enterprises continue to be the most effective in the utilization of innovative technologies, which is also a signal of substantial obstacles faced by small companies in the implementation of big data that could, in turn, further curb their growth and competitiveness. These contributions are elaborated in more detail in the following paragraphs.
First, we confirm the results of the previous research that the Northern European countries are leading in the utilization of innovative technologies, such as big data. Although the information technology development of the European Union is one of the highest in the world, a digital divide is manifested internally, among the member states [48,49]. Northern European countries still have a significantly greater percentage of citizens connected to the internet, likely due in part to the increasing capabilities of the hardware and the decreasing cost of electronic goods and services, such as internet services, computer software, and accessories, as well as personal computers [3]. This suggests that Northern European countries tend to experience fewer negative effects of automation, as the jobs in these countries are more complex and harder to automate. Therefore, a high level of digital development prevents the negative impact of technologies both at the country and enterprise level. On the other hand, the low level of digital development reinforces the negative impact of technologies in less developed countries. Although the digital divide has decreased at the personal level among the developed European countries [49], the digital divide at the country level is decreasing slowly due to its complex relationship with economic development. In the new digital divide, Industry 4.0 will play a significant role, and one of the major disadvantaged groups will be those with low levels of education. This is where the predicted job loss will mostly occur, as routine jobs will be replaced by those requiring analytical and problem-solving skills, flexibility in decision making, and higher levels of education and training in certain topics, such as computer science and mechanical and electronic engineering. Those with a mix of all of these skills, i.e., mechatronic experts, will have a particular advantage in this new industry. Moreover, it is worth noting that jobs requiring social and interpersonal skills, creativity, and innovation are predicted to increase [50].
Second, our results revealed that several industries, such as information technology, are leading in big data utilization. However, this is likely to be the result of the overall technical competence of their employees, since our research results revealed that enterprises mostly rely on internal big data experts. On the other hand, it is worrisome that European manufacturing enterprises, which should be the leaders in the Industry 4.0 revolution, are lagging in terms of big data usage.
Third, we confirmed the findings of previous research that large companies are leading in the implementation of innovative technologies, such as big data. Therefore, large enterprises will have a great advantage in this regard, as they will possess the top talent and resources, thereby being better able to decide on the right technologies. However, future small enterprises or garage start-ups, due to their groundbreaking new ideas, might be able to compete well in this kind of market as well, as was the case with top firms, such as Google, Facebook, and Amazon [50].
The practical implications of our work indicate the need for interventions in educational programs. First, higher education institutions should consider the introduction of strong bachelor and master curricula with a focus on big data acquisition, management, and analysis. Second, massive open online courses and lifelong learning programs about big data should be introduced at national levels, since internationally available courses (e.g., Udemy, Coursera) are not sufficient for fulfilling the demand for big data skills. Such programs should be specially tailored for the usage of open source big data software that could be used by small enterprises, to speed up their acquisition of internal expertise for big data while decreasing their costs. Moreover, our results are useful for the enterprises themselves, which may be reluctant to hire or educate big data experts due to possible costs. However, our research results indicate that the availability of internal experts is the strongest incentive for the utilization of big data analysis, which is, in turn, a path towards increased competitiveness.
The limitations of this study stem from the fact that the research was conducted on a sample of selected European countries with different legislation, history, and levels of economic development, which can all influence big data usage and acceptance within an enterprise from a certain country. Moreover, we focused our research on country-level data, while enterprise-level data could yield more evidence on the efficiency of enterprises in using big data for tactical, operational, and strategic decision-making. Finally, the global economy allows enterprises to operate in more than one country, which should be taken into account when evaluating the results of our research. For these reasons, future research should expand this study to enterprises worldwide, focusing on the enterprise level.

Figure 1. Graph of the cost sequence.

Figure 2. Graph of the cluster means.

Figure 3. European countries according to clusters. Grey color indicates countries that were not included in the analysis.

Figure 4. Average values of the big data source of expertise (internal or external) across clusters.

Figure 5. Average values of big data utilization across industry types and clusters.

Table 1. Variables on big data utilization in European countries used in the research.

Table 2. Descriptive statistics of the observed variables.

Table 5. Distribution of countries according to clusters.

Table 6. Descriptive statistics of the source of big data expertise according to clusters; ANOVA analysis.

Table 7. Descriptive statistics of big data utilization across industry types and clusters; ANOVA analysis.