Systematic Review

Data Science Project Barriers—A Systematic Review

Algoritmi Research Center/LASI (Associate Laboratory for Intelligent Systems), Department of Production and Systems, School of Engineering, University of Minho, 4800-058 Guimarães, Portugal
*
Author to whom correspondence should be addressed.
Data 2025, 10(8), 132; https://doi.org/10.3390/data10080132
Submission received: 14 May 2025 / Revised: 1 August 2025 / Accepted: 12 August 2025 / Published: 20 August 2025

Abstract

This study aims to identify and categorize barriers to the success of Data Science (DS) projects through a systematic literature review combined with quantitative methods of analysis. PRISMA is used to conduct a literature review to identify the barriers in the existing literature. With techniques from bibliometrics and network science, the barriers are hierarchically clustered using the Jaccard distance as a measure of dissimilarity. The review identified 27 barriers to the success of DS projects from 26 studies. These barriers were grouped into six thematic clusters: people, data and technology, management, economic, project, and external barriers. The barrier “insufficient skills” is the most frequently cited in the literature and the most frequently considered critical. From the quantitative analysis, the barriers “insufficient skills”, “poor data quality”, “data privacy and security”, “lack of support from top management”, “insufficient funding”, “insufficient ROI or justification”, “government policies and regulation”, and “inadequate, immature or inconsistent methodology” were identified as the most central in their cluster.

1. Introduction

The world economy has proven capable of generating more and more resources each year, despite climate concerns. During the decade that began in 2010, global GDP grew by more than 2% every year—except for 2020, the year of the coronavirus outbreak [1]. However, nothing in the economy has grown at a rate comparable to data generation: in 2020, 64 zettabytes of data (64 × 10²¹ bytes) were generated, equivalent to approximately 32 times what was generated 10 years before, at the beginning of the decade [2].
This exponential growth in data generation has led to a host of data storage and processing challenges, but also to many opportunities. One of the first people to realize this was Clive Humby, who in 2006 coined the famous phrase “data is the new oil” [3], referring not only to the intrinsic value of data but also to its need for refinement. At least from an economic perspective, his prediction proved accurate, as the so-called “FAAMG” (Facebook, Apple, Amazon, Microsoft, and Google), technology giants that generate and process enormous amounts of data, were among the top 10 companies in the world by 2021 [4]. Reinforcing this idea, a report commissioned by BusinessWire [5] predicts a $243 billion market by 2027 for so-called Big Data (BD)—data of great volume, velocity, and variety.
To deal with the world of Big Data, a new area of knowledge has emerged and consolidated—Data Science (DS). The term in its current definition was coined in 2008 [6]. LinkedIn [7] reports that one hundred and fifty million new jobs related to digital transformation are expected in the next 5 years—among them artificial intelligence (AI) specialists and data scientists. These professionals are responding to growing demand from industry. An annual survey by NewVantage [8] that interviews blue-chip executives in the US market shows that 99% of these companies are investing in Big Data and AI.
Despite the investment, not all companies succeed at implementation. Only 13% of companies surveyed by NewVantage [8] had fully deployed Big Data in their operations. From the perspective of deployment initiatives, surveys reveal a low rate of success, with estimates of 27% [9], 20% [10], and as low as 8% [11].
On one hand, Big Data and AI present huge potential. On the other hand, despite high investment, the success rate of initiatives is low. However, the field of Data Science goes beyond Big Data or AI—despite the common use of those terms as synonyms—and the current literature focuses on the adoption of Big Data by organizations, with a reduced analysis of barriers that are related to the context of a DS project. Moreover, when categorizing barriers, the results differ from study to study. Thus, this study aims to answer the following research questions:
RQ1: What are the barriers to the success of Data Science projects?
RQ2: How should these barriers be categorized? Additionally, which barriers are the most relevant to their categories (or clusters)?

2. Literature Review

2.1. Data Science

Data Science encompasses “a set of principles, problem definitions, algorithms, and processes for extracting non-obvious patterns from large data sets” and aims to “improve decision-making by basing decisions on knowledge extracted from data” [12].
There are, perhaps, two arguments against this definition. The first is that “large” in “large data sets” is subjective. An answer to that is the definition of large as “too large or complex for a human being to analyse”. While we are capable of understanding patterns in dozens of records with two or three attributes, data science is generally applied in contexts where we seek to find patterns among tens, hundreds, or perhaps even millions of attributes among an even larger number of records. The second argument is that there is already a branch of science that deals with data collection and analysis: statistics. This argument is also valid; however, the fields of statistics and Data Science have been differentiated to such an extent that it is no longer possible to give the same name to different fields. The first and most important differentiation is that data scientists use both statistical tools and computational algorithms, such as machine learning (ML) algorithms. The second differentiation is that data scientists are also concerned with how to extract the data, process it, and then implement the model they built. Finally, the third differentiation is perhaps cultural. Breiman [13] published an article entitled “Statistical Modelling: The Two Cultures.” In short, the traditional view is that the goal of data analysis is explanatory, in the sense of bringing about an understanding of how the data was generated. In contrast, the new view, which better defines Data Science, sees the goal of data analysis as that of developing algorithms with the highest prediction capacity possible.
Data Science, therefore, has become sufficiently differentiated from statistics and other fields to earn a name of its own. DS combines knowledge from statistics, computer science, and ML to solve current problems by analyzing large datasets. The discipline is recent because only recently has humanity begun to produce and be able to analyze large amounts of data. In addition to the emergence of Big Data, the economy of scale provided by the data market and the evolution of processing equipment contributed to the emergence of Data Science [12], such that even today, despite the emergence of several higher-education courses in the discipline, the demand for these professionals continues to increase [7].
When defining Data Science, it is also important to characterize some similar terms that are commonly mixed: specifically, Big Data, data analytics, and machine learning. Big Data is traditionally defined based on the 3Vs: data in large volume, variety, and velocity. Volume means the amount of data, variety refers to the types of data, and velocity is the speed with which data is generated. More recently, the 3Vs have been expanded into the 5Vs, to include veracity and value, and even the 6Vs (see, for example, Rehman et al. [14]). Data analytics, or data analysis, is simply the process of inspecting, cleaning, transforming, and modeling data for the purpose of supporting decision making [15]. ML, on the other hand, is an advanced technique of data analysis and can be seen as a component of data analytics [16].
The terms Big Data and data analytics are often used together under the new term Big Data Analytics (BDA). The latter is sometimes used as a synonym for DS (see, for example, Krasteva and Ilieva [17]). However, some authors choose to differentiate between the terms—considering BDA as the process of analyzing Big Data, and DS the broader field of knowledge. Moreover, DS does not apply only to Big Data. Having lots of data helps, but having the right data is more important [12].

2.2. Barriers for Data Science Projects

From PMI [18], a project is “a temporary effort undertaken to create a unique product, service, or result. The temporary nature of projects indicates a beginning and an end to the project work or a phase of the project work.” DS deals, naturally, more with projects than with processes—whether developing a clustering algorithm for customer segmentation [19,20], detecting anomalies such as fraud [21,22], or building a predictive model to identify whether a taxpayer is evading their taxes [23,24]—data scientists usually work in unique, temporary efforts.
These efforts, however, do not always pay off. As noted previously, up to 92% of projects fail [11]. Hindering the success of DS projects are multiple barriers—such as insufficient skills [25], low data quality [26], and lack of collaboration [27].
To deliver, a standard DS project goes through the stages of understanding the business problem, understanding the data, preparing the data, modeling, evaluating, and deploying [12]. Different methodologies have been developed to address three of the fundamental aspects of a DS project—project management, team management, and data management, with Cross-Industry Standard Process for Data Mining (CRISP-DM) historically being the most popular [28]. However, DS methodologies lack integrity, including, notably, CRISP-DM [29]. Moreover, as noted by Morlock and Boßlau [30], a fourth aspect—change management—is ignored.

3. Methods

To fulfil the research objective of identifying and categorizing barriers to the success of DS projects, this study is subdivided into two stages (similar to Rameezdeen [31]). In the first, the PRISMA methodology [32] is used to identify the barriers in the literature. In the second stage, the thematic clustering of the barriers is performed using quantitative methods with techniques from bibliometrics and network science. As suggested by Xiao and Watson [33], grouping factors into themes is a way to deepen the understanding of the research problem.

3.1. Systematic Literature Review

To identify the barriers to DS projects in the existing literature, a systematic literature review (SLR) was conducted based on the PRISMA methodology [32]. The methodology consists of 4 steps: identification, screening, selection, and inclusion of studies. After searching and selecting the primary studies with the search terms, an expansion of the search was performed through the snowball technique (or citation searching), in which new studies are sought from the citations of the initial studies.
The selection process included the following criteria: (a) studies that list DS barriers; (b) research methodology includes some variant of multi-criteria decision-making (MCDM) methods, since these methods offer a greater depth of factor analysis, including impact estimation and/or cause-and-effect relationships; (c) studies from journals included in the Scopus and Web of Science (WoS) databases; and (d) studies in English. No temporal criteria were included. This search revealed that there are no MCDM analyses specifically for DS projects. The result was not entirely satisfactory, in the sense that all 14 studies found focus on the adoption or deployment of Big Data or Big Data Analytics by organizations in general, albeit in varying contexts. Although these studies are still helpful in answering the research questions, given that Big Data is a field similar to DS, and were thus included in the SLR, they address barriers faced by organizations during an adoption process and do not fully cover barriers related to DS projects.
Thus, to supplement the first search strategy, a second strategy was designed, broadening the scope to include methodologies beyond MCDM analysis and including conference studies, but restricting the scope to DS projects, to exclude studies that refer broadly to technology adoption or deployment in organizations. In this search, the selection process included the following generic criteria: (a) studies that list barriers for DS projects; (b) studies from journals or conferences included in Scopus and WoS; and (c) studies in English. Table 1 details the two search strategies and terms for identification. The sum of strategies 1 and 2 is equivalent to the total number of studies identified in databases (452), before removal of duplicates (137), as shown in Figure 1. After identification, the screening process included title and abstract reading.
After retrieval, the studies were further assessed for eligibility with full paper reading and excluded for (a) being out of scope, when the study did not support in answering the research questions (e.g., [35]); (b) lack of clarity, when the barriers are not clearly defined (e.g., [36]); (c) being excessively circumstantial, when the barriers were context-specific and could not be generalized (e.g., [37]); and (d) having inconsistent methodology. The data collection process involved gathering the barriers from each of the studies included in the review in a single list, marking each of the studies that mentioned them. This list is presented in the Results section.
Researchers (1, 2, and 3) defined the identification, screening, inclusion, and data collection process, which was then carried out by researcher (1). We assessed the methodological quality of included studies by considering key aspects such as clarity of aims, study design, transparency of data collection and analysis, and adherence to the stated methods. Given that the review focused on identifying barriers rather than measuring effect sizes, the overall risk of bias was considered low.
By using a structured approach to the SLR, following PRISMA guidelines, and searching two major research databases with two distinct search strategies, we aimed to reduce the risk of reporting bias due to missing results.
The authors declare adherence to the PRISMA Statement [32]. However, the SLR was not registered, and the review protocol was not registered a priori in PROSPERO or OSF.

3.2. Quantitative Methods for Clustering Barriers

Of the selected studies, 12 categorized barriers into thematic groups, i.e., classified barriers as organizational, technological, economic, etc. (e.g., [38]). Each study used its own categorization criteria—including the choice of the number of groups, the definition of the group, and how the barriers were classified. This presents the literature review with the challenge of how to categorize the barriers, given the multitude of criteria used in the different studies. This challenge was overcome by choosing a quantitative method of thematic grouping of the barriers that considers the categorization of all 12 studies. Rosenthal and DiMatteo [39] state that by numerically aggregating the results of multiple studies, quantitative methods help solve the problem of multiple answers to a research question.
Each study categorized the barriers into 2 to 8 thematic groups. In total, 52 distinct groups were formed. Since the barriers were grouped based on their relation to a theme, it is implied that the barriers in a group are correlated (even if only thematically). Since a barrier–barrier relationship is defined as the common presence of two barriers in a group, the number of relationships formed in a group grows proportionally to the number of barriers in that group, following:
$$R = \frac{N(N-1)}{2},$$
where $R$ is the number of relationships in a group and $N > 1$ is the number of barriers in the group.
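As a quick illustration of the formula above, the pair count for a group can be computed directly (a minimal sketch; the function name is ours, not the paper's):

```python
def relationships(n: int) -> int:
    """Number of unordered barrier-barrier pairs in a group of n barriers."""
    if n <= 1:
        return 0  # a single barrier forms no relationships
    return n * (n - 1) // 2

# A thematic group containing 5 barriers yields 10 pairwise relationships.
print(relationships(5))  # 10
```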
From the relationships formed by the 52 groups, a relation network is established, with each barrier as a vertex, and each relationship a connection between two vertices. This network can be interpreted from the perspective of network science, whose goal is to develop approaches and techniques for understanding networks [40].
For the categorization of the barriers, this study used hierarchical clustering with the average linkage method, which is one of the most used in clustering and tends to generate better results [41]. This method considers not only the relationships between vertices, but also the strength of those relationships—or, more accurately, the lack thereof—and is therefore suitable for this study. If a barrier–barrier pair was categorized in the same group, this means that there is a relationship between these barriers. The higher the number of common groups two barriers have with each other, the greater the strength of the relationship.
The proximity measure that best characterizes this conjuncture is the Jaccard index. Named after the scientist of the same name [42], the measure was chosen thanks to its normalized range of values (from 0 to 1) and the characteristics of the network. Its complement, Jaccard distance, is commonly used by clustering algorithms [41], and is defined by the formula [43]:
$$J_\delta(B_n, B_m) = 1 - J(B_n, B_m) = 1 - \frac{|B_n \cap B_m|}{|B_n \cup B_m|},$$
where $J_\delta$ is the Jaccard distance, $J$ is the Jaccard index, and $B_n$ and $B_m$ are sets. In this case, each set represents a barrier and contains the different groups (from 1 to 52) into which the barrier is categorized.
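A minimal sketch of the Jaccard distance on Python sets follows; the group numbers are invented for illustration, not taken from the review:

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance: 1 minus the share of elements the two sets have in common."""
    union = a | b
    if not union:
        return 0.0  # convention: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)

# Hypothetical barriers, each represented by the set of groups (1..52)
# into which the studies categorized it.
b_n = {1, 4, 7}
b_m = {1, 4, 9}
print(jaccard_distance(b_n, b_m))  # 1 - 2/4 = 0.5
```

The more groups two barriers share, the smaller their distance, which is exactly the notion of similarity the clustering relies on.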
To determine the number of clusters, a combination of the elbow method and the silhouette score was used. The elbow method involves the relationship between the average within-cluster distance ($\bar{d}$) and the number of clusters ($k$). The idea is that increasing $k$ beyond a certain point yields diminishing returns in decreasing $\bar{d}$; that point marks the “right” number of clusters [44]. The average within-cluster distance for a given number of clusters can be calculated with:
$$\bar{d} = \frac{\sum_{i,j} d(b_i, b_j)}{\sum_k R_k},$$
where $d$ is the Jaccard distance between two barriers $b_i$ and $b_j$ belonging to the same cluster, and $R_k$ is the number of relationships within cluster $k$.
The silhouette statistic is calculated for each vertex and indicates how well it fits its assigned cluster. The statistic is normalized: a value near 1 means an almost perfect cluster allocation, and a value near −1 means the opposite. For a given vertex ($b_n$), the silhouette ($s$) is the difference between the mean nearest-cluster distance ($c$) and the mean intra-cluster distance ($a$), divided by the larger of the two [45]:
$$s(b_n) = \frac{c(b_n) - a(b_n)}{\max(c(b_n),\, a(b_n))},$$
where the silhouette score is the average value for all silhouettes.
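Under the definition above, the per-vertex silhouette and its average can be sketched as follows (the distance values are illustrative assumptions, not the study's data):

```python
def silhouette(c: float, a: float) -> float:
    """Silhouette of one vertex: mean nearest-cluster distance c minus
    mean intra-cluster distance a, over the larger of the two."""
    denom = max(c, a)
    if denom == 0:
        return 0.0  # degenerate case: both mean distances are zero
    return (c - a) / denom

def silhouette_score(pairs):
    """Average silhouette over all (c, a) pairs, one pair per vertex."""
    return sum(silhouette(c, a) for c, a in pairs) / len(pairs)

# A vertex far from the nearest other cluster (c = 0.9) and close to its
# own cluster (a = 0.1) is well placed:
print(round(silhouette(0.9, 0.1), 2))  # 0.89
```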
To complement the analysis, two centrality measures were calculated: degree centrality and closeness centrality. The degree centrality ($C_D$) represents the number of vertices adjacent (that is, related or connected) to the vertex ($b_n$) whose degree is being calculated [46]:
$$C_D(b_n) = \mathrm{degree}(b_n).$$
The second measure is closeness centrality, formally defined as the inverse of the sum of the distances from one vertex to the others. This measure computes the relative proximity of a vertex to the others [40]. Here, the measure is calculated within-group and is used to quantify the centrality of a vertex relative to its cluster. Closeness centrality ($C_c$) in its normalized form is defined by:
$$C_c(b_n) = \frac{N - 1}{\sum_{b_m} d(b_n,\, b_m)},$$
where $N$ is the number of vertices, or barriers, in a cluster, $d$ is the Jaccard distance between two vertices, $b_n$ is the vertex whose centrality is to be obtained, and $b_m$ represents the other vertices.
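Both measures can be sketched with plain dictionaries; the three-barrier cluster and its distances below are hypothetical, for illustration only:

```python
def degree_centrality(adjacency: dict, v: str) -> int:
    """Degree: number of vertices adjacent to v in the relation network."""
    return len(adjacency[v])

def closeness_centrality(dist: dict, v: str, cluster: list) -> float:
    """Normalized within-cluster closeness: (N - 1) divided by the sum of
    Jaccard distances from v to the other members of its cluster."""
    others = [u for u in cluster if u != v]
    return (len(cluster) - 1) / sum(dist[(v, u)] for u in others)

# Hypothetical cluster of three barriers with symmetric Jaccard distances.
cluster = ["B1", "B2", "B3"]
dist = {("B1", "B2"): 0.4, ("B2", "B1"): 0.4,
        ("B1", "B3"): 0.6, ("B3", "B1"): 0.6,
        ("B2", "B3"): 0.8, ("B3", "B2"): 0.8}
print(closeness_centrality(dist, "B1", cluster))  # 2 / 1.0 = 2.0
```

In this toy cluster, B1 has the smallest summed distance to the others and is therefore the most central, the same logic used to identify each cluster's central barrier in Table 6.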
We enhanced robustness and confidence by excluding studies with inconsistent methodology and by including only barriers that were reported by more than one study. This approach helped reduce the influence of outliers. Additionally, by clustering them as described, we ensure that the thematic categories reflect patterns observed across multiple sources.

4. Results

4.1. Systematic Literature Review

The SLR resulted in a total of 26 studies, whose context, methods, and keypoints are presented in Table 2. In the keypoints column, we summarize the main objective of the paper, what the studied barrier is hindering (e.g., BDA or ML adoption), and indicate whether the study analyzes the impact of each barrier, the relationships among them, and whether it provides a classification for the barriers. These last aspects are generally determined by the study’s methods—DEMATEL, for instance, calculates a relationship matrix.
Of the 26 studies included in the SLR, 12 categorized the barriers into themes, such as organizational or technological barriers. The barriers were grouped into two to eight categories, depending on the study. The method used for categorization was based on existing literature, expert opinion, or the authors’ own judgment. In addition, 11 studies performed an impact analysis capable of informing which are the most critical or highest impact barriers. In all these cases, some MCDM method was used for this purpose.
Figure 2 details the number of articles by research context. It is possible to notice a wide range of industries and sectors that were studied, with some focal points in manufacturing and supply chains.
Differences in context and methodology may help explain the heterogeneity in reported barriers across studies. In the following section, we list each barrier identified in individual studies and then group them into broader thematic categories. This approach helps capture underlying issues faced by data science projects across diverse settings.

Barriers Identified

The studies included in the SLR identified, on average, 12 barriers to the adoption of Big Data or to the success of DS projects, with a high degree of repetition. As presented in the methods section, one of the exclusion criteria for studies from the systematic review was a lack of clarity in exposing the barriers—that is, the study needs to establish, in objective and clear terms, which barriers were found in the investigation. After that, the procedure for identifying the barriers was a simple extraction of the barriers as written in the studies.
In addition to the study exclusion criteria, barriers were also discarded if (a) they were too specific to the research context, such as the industry or the case study in question; (b) they were not categorized into any group; or (c) they were not confirmed by more than one study.
The compilation procedure was based on the names and descriptions of the barriers as given by the studies. Thus, in response to RQ1, the review identified 27 unique barriers, described in Table 3 and identified in Table 4.
The 27 barriers identified and described above are presented in Table 4, cross-referenced with each study. Evaluating the SLR results, it is notable that some barriers are more frequently cited in the literature. Barriers B1 through B6 appeared at least 14 times in the literature (>50% of studies), suggesting with higher confidence that data science projects face these barriers independent of context. Of these, B1, B3, B4, and B6 are also the most cited as critical. Barrier B1 (Insufficient skills) is the most frequent, both in number of citations (20) and in the number of times it was identified as critical (5).

4.2. Clustering the Barriers

The input for the hierarchical clustering is the distance matrix, which determines how dissimilar two barriers are. This is based on the categorizations made by the studies included in the SLR: a relation between two barriers is formed when they belong to the same group. The greater the number of common groups to which a pair of barriers belongs, the greater the strength of the relationship, or similarity, that these barriers have. Conversely, the smaller the number of groups in common, the greater the distance, or dissimilarity, between the barriers. Jaccard distance, presented in the methods section, was the measure of dissimilarity used.
The distance matrix is shown in Table 5, and the network of relationships created from it can be seen in Figure 3. Visually, it is possible to see that the barriers cluster naturally into similar groups.
The hierarchical clustering of the barriers was accomplished with the average linkage method, using Jaccard distance as a measure of dissimilarity. The average linkage method tends to generate better results [41].
The number of clusters was determined using a combination of the elbow method and the silhouette score. The elbow method uses the average within-cluster distance to look for an elbow, that is, the point at which a further increase in the number of clusters yields a diminishing decrease in the average distance. The silhouette score compares between-cluster dissimilarity to within-cluster dissimilarity and is a normalized value between −1 and 1; the number of clusters chosen should maximize the silhouette score [45]. Looking at Figure 4, the elbow is formed at six clusters, and the silhouette is maximized at six and eight clusters. Therefore, six was chosen as the number of clusters, a value within the range used by the included studies (between two and eight clusters). The clustering can be seen in the dendrogram in Figure 5.
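The agglomeration underlying the dendrogram can be sketched with the standard library alone; the barrier names and distance matrix below are invented for the example, not taken from Table 5:

```python
def merge_closest(dist: dict, clusters: list) -> list:
    """One average-linkage step: merge the two clusters whose members have
    the smallest average pairwise distance."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            pairs = [dist[frozenset((a, b))]
                     for a in clusters[i] for b in clusters[j]]
            avg = sum(pairs) / len(pairs)
            if best is None or avg < best[0]:
                best = (avg, i, j)
    _, i, j = best
    merged = clusters[i] | clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

def agglomerate(dist: dict, items: list, k: int) -> list:
    """Start from singleton clusters and merge until k clusters remain."""
    clusters = [frozenset([x]) for x in items]
    while len(clusters) > k:
        clusters = merge_closest(dist, clusters)
    return clusters

# Illustrative distances: B1-B2 and B3-B4 are thematically similar pairs.
dist = {frozenset(p): d for p, d in [
    (("B1", "B2"), 0.1), (("B3", "B4"), 0.2),
    (("B1", "B3"), 0.9), (("B1", "B4"), 0.9),
    (("B2", "B3"), 0.9), (("B2", "B4"), 0.9)]}
agglomerate(dist, ["B1", "B2", "B3", "B4"], 2)
# result: the two clusters {B1, B2} and {B3, B4}
```

In practice, a library routine such as SciPy's average-linkage implementation would replace this sketch and also produce the dendrogram.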
In Table 6, the six clusters are presented, as well as the centrality measures for each of the barriers. The clusters were named based on the common theme between the barriers in the group, especially those considered most central to the cluster.

5. Discussion

Analyzing the Jaccard distances, some highly correlated pairs stand out, for example: barriers B2 (Poor data quality) and B8 (Lack of an integrated data environment) are frequently grouped in the same category, as well as B9 (Insufficient funding) and B17 (High investment and maintenance cost). The themes of the pairs are similar, which was the expected result.
Barriers B1 (Insufficient skills), B2 (Poor data quality), B6 (Lack of support from top management), B9 (Insufficient funding), and B22 (Inadequate or inconsistent methodology) were the barriers with the highest closeness centrality values of their respective clusters and can therefore be considered the geometric center of each cluster. Again, barriers B1, B6, and B9, and additionally barriers B4 (Data privacy and security), B15 (Insufficient ROI or business case), and B18 (Government policies and regulation), were the barriers with the highest degree in each cluster. Degree measures the number of vertices adjacent to a vertex, and thus these barriers have a greater number of relationships, internal or external to the cluster. Both measures denote greater relevance of these barriers relative to the others, although from different points of view, and they could be considered as potentially having greater overall importance for the success of DS projects. Compared to the frequency of citations in the SLR, four of these barriers are also among the most cited (B1, B2, B4, and B6).
The vertex of greatest closeness is not always the vertex of greatest degree—while B2 (Poor data quality) represents the geometric center of the cluster of data and technology barriers, B4 (Data privacy and security) is the highest degree in the cluster, as it has more connections to barriers from different clusters, such as barrier B18 (Government policies and regulation), for example. This suggests that data privacy and security in a DS project are related to government policies and regulations. In fact, the attention given to the topic of data privacy is greatly influenced by local regulations. While Europe has the extensive and restrictive GDPR (EU General Data Protection Regulation), in the US, data privacy is regulated by a multitude of federal regulations and even state legislation, such as California’s CCPA (California Consumer Privacy Act) [68].
At the top of people barriers is insufficient skills. DS is a skill-intensive field, requiring knowledge in statistics, ML, data wrangling, privacy regulations, domain expertise, and others [12], and therefore requires professionals with those specific skills.
For the management barriers, the most central is a lack of support from top management. This is likely to hinder success because a project requires continuous resource investment and inter-organizational collaboration [26]. Similarly, if there is a mismatch between strategy and project goals, it is unlikely to maintain investment and collaboration.
On the side of data and technology, data quality comes as the most frequently cited and as the most central to the cluster. Throughout the data lifecycle—from generation to processing, storage, and consumption—many factors can impact data quality, such as collection errors, lack of validation, mismanagement, etc. Data quality is commonly understood as the degree to which it is fit for use, that is, it must be accurate, relevant, and easy to interpret [69]. Without those qualities, it is hard to expect success in a project with such a high reliance on data. Besides managing poor data quality, the lack of an integrated environment can be very time-consuming [63] since this adds the complexity of having to merge and normalize data from many different sources.
Economic factors also play an important role. Big Data applications require investing in costly IT infrastructure, human resources, and tools [38]. Without financial support, higher-cost options such as leveraging cloud computing and increasing data generation can be vetoed, negatively impacting project outcomes [25].
Most central to project barriers is inadequate or inconsistent methodology. Besides the issue of not following standard practices [58], current methodologies lack integrity—which, according to Martinez et al. [29], is failing to properly address project management, team management, or data management. Moreover, a fourth aspect—change management—is notably ignored [30], which could lead to sustainability issues.
Lastly, organizations running DS projects face external barriers, which include government policies and difficulties in acquiring external data sources for usage. As stated, governments can impact data access and usage through privacy regulations, but also through other means, such as the availability of public data, public IT infrastructure, and collaboration with the private sector [26].
The centrality rankings are broadly consistent with the findings of previous reviews where impact or criticality was measured, which have frequently identified the lack of top management support, insufficient funding, and poor data quality as critical. The clusters also align broadly, both in the number of clusters (between two and eight for the included studies) and in themes, where technology, data, people, organization, and cost-related themes are the most common. The differences between all studies, including this paper, mostly arise in how individual barriers are clustered. While B4 (Data privacy and security) is usually categorized under data-related themes, Barham and Daim [64] classify it as legal-related, reflecting the different perspectives that can be taken into account. Beyond the comprehensive list of barriers, the main contribution of this study is a set of quantitatively derived categories, considering the classifications from all included studies, each with a central barrier identified through network centrality.
Organizations undertaking DS projects should expect to encounter one or many of the barriers listed here. Proactively addressing these issues, such as establishing data collection processes to avoid data quality issues, ensuring collaborators have the needed skills, and offering support from top management, is essential to improving project prospects. However, no amount of preparation can fully avoid all problems, and managers must act decisively. The six barrier categories proposed in this study provide a structured framework to guide managerial thinking, helping identify where issues are likely to arise and what areas require attention. When resources are limited, the centrality measures can serve as a useful tool for prioritization.

Limitations

The studies included in this review varied in focus, context, and terminology. Most of the evidence was qualitative or descriptive in nature, which is appropriate for identifying barriers but may affect the comparability of findings. Additionally, as many findings relied on participants’ perspectives, they reflect subjective experiences that are often context-specific, limiting generalizability. The methodological rigor of the MCDM studies and the number of studies from multiple contexts, as seen in Figure 2, help mitigate these limitations, though not entirely.
While the review followed established PRISMA guidelines, it relied on manual processes for screening and data extraction, which, while thorough, carry a small risk of oversight. Despite efforts to conduct a comprehensive search, it is possible that some relevant studies were missed.

6. Conclusions

The SLR resulted in the identification of 27 barriers to DS project success referenced by multiple studies in the literature. These barriers were grouped into six thematic clusters: people, data and technology, management, economic, project, and external barriers.
Barriers B1 (Insufficient skills), B2 (Poor data quality), B3 (Insufficient IT infrastructure), B4 (Data privacy and security), B5 (IT illiteracy), and B6 (Lack of support from top management) were the most frequently identified in the literature, being cited by 14 or more studies (>50%). Of these, B1, B3, B4, and B6 were also the most frequently cited as critical.
B1 (Insufficient skills), B2 (Poor data quality), B4 (Data privacy and security), B6 (Lack of support from top management), B9 (Insufficient funding), B15 (Insufficient ROI or business case), B18 (Government policies and regulation), and B22 (Inadequate or inconsistent methodology) were identified as being the most central, or of greatest importance, based on measures of network centrality. The intersection between the barriers most frequently cited in the literature and those identified as most central consists of barriers B1, B2, B4, and B6.
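The centrality measures behind this ranking can be sketched with NetworkX, the library used to build the relationship network in Figure 3. The edges below are a small illustrative subset with assumed values, not the full distance matrix of Table 5, and the within-cluster normalization used for Table 6 may differ from the whole-network computation shown here, so the resulting numbers are not meant to reproduce the table exactly.

```python
# Sketch: degree and closeness centrality over a Jaccard-distance network.
# Edge weights are Jaccard distances; pairs never co-categorized
# (distance = 1.0) are omitted, as in the network of Figure 3.
import networkx as nx

# (barrier_a, barrier_b, jaccard_distance) -- illustrative values only
edges = [
    ("B18", "B23", 0.3),  # government policies <-> external data sources
    ("B18", "B4", 0.9),
    ("B18", "B9", 0.9),
    ("B9", "B17", 0.3),   # insufficient funding <-> high investment cost
    ("B9", "B27", 0.8),
]

G = nx.Graph()
G.add_weighted_edges_from(edges, weight="distance")

# Degree: how many other barriers a barrier was ever co-categorized with.
degree = dict(G.degree())

# Closeness: treats the Jaccard distance as edge length, so barriers that
# sit "near" many others in categorization space score higher.
closeness = nx.closeness_centrality(G, distance="distance")

print(degree["B18"], round(closeness["B9"], 2))
```

Degree for B18 comes out as 3 in this toy network, coincidentally matching Table 6; closeness values depend on the chosen normalization (NetworkX rescales by the fraction of reachable nodes), which is one reason the within-cluster figures in Table 6 need not coincide with a whole-network computation.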
The distance matrix constructed from the categorizations of the studies included in the SLR and used for clustering the barriers also provides a proxy for the correlations between the barriers, which can serve as a basis for hypotheses about interrelationships and causality—that is, how the barriers relate to each other and whether one set of barriers causes others. For example, if the lack of an integrated data environment (B8) causes problems in data quality (B2), then solving the former will also solve the latter. This type of analysis is important for understanding failures in DS projects and could be explored in future studies.
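The distance-based clustering step can be sketched as follows. The binary incidence rows are hypothetical stand-ins for the category assignments extracted from the 26 included studies, so only the pipeline itself (pairwise Jaccard distances via `pdist`, then average-linkage agglomeration) reflects the method used in this review.

```python
# Sketch: barrier categorizations -> Jaccard distances -> hierarchical clusters.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Rows = barriers, columns = (study, category) slots; True means a study
# placed the barrier in that category. Toy data, not the extracted corpus.
barriers = ["B2", "B8", "B9", "B17"]
incidence = np.array([
    [1, 1, 1, 0, 0],  # B2  Poor data quality
    [1, 1, 0, 0, 0],  # B8  Lack of an integrated data environment
    [0, 0, 0, 1, 1],  # B9  Insufficient funding
    [0, 0, 1, 1, 1],  # B17 High investment and maintenance cost
], dtype=bool)

# Jaccard distance: 1 - |intersection| / |union| of each pair's assignments.
dist = pdist(incidence, metric="jaccard")

# Average-linkage agglomeration on the condensed distance matrix; the number
# of clusters would be chosen with the elbow method and silhouette score, as
# in Figure 4 (here fixed at 2 for the toy data).
Z = linkage(dist, method="average")
labels = dict(zip(barriers, fcluster(Z, t=2, criterion="maxclust")))
print(labels)
```

With real incidence data, `scipy.cluster.hierarchy.dendrogram(Z)` draws a tree like Figure 5, with merge heights on the Jaccard-distance scale.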
We hope the results of this study serve as an extensive yet specific survey of the barriers that hinder DS projects. Furthermore, this study used a quantitative method for clustering the barriers, helping reconcile the multiple categorizations proposed across the included studies. The categorization aids understanding of the barriers to the success of DS projects, and the centrality measures provide information about the relative importance of each.

Author Contributions

Conceptualization, N.L., L.C. and R.M.L.; methodology, N.L., L.C. and R.M.L.; software, N.L.; validation, L.C. and R.M.L.; formal analysis, N.L.; investigation, N.L.; resources, N.L.; data curation, N.L.; writing—original draft preparation, N.L.; writing—review and editing, L.C. and R.M.L.; visualization, N.L.; supervision, L.C. and R.M.L.; project administration, L.C. and R.M.L.; funding acquisition, L.C. and R.M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by FCT—Fundação para a Ciência e Tecnologia within the R&D Unit Project Scope UID/00319/Centro ALGORITMI (ALGORITMI/UM).

Data Availability Statement

The original contributions presented in the study are included in the article, and further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. The World Bank GDP Growth (Annual %). Available online: https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG (accessed on 28 January 2022).
  2. Statista Research Department Total Data Volume Worldwide 2010–2025. Available online: https://www.statista.com/statistics/871513/worldwide-data-created/ (accessed on 20 November 2022).
  3. Arthur, C. Tech Giants May Be Huge, but Nothing Matches Big Data. The Guardian, 23 August 2013. [Google Scholar]
  4. Statista Research Department Biggest Companies in the World by Market Cap 2021. Available online: https://www.statista.com/statistics/263264/top-companies-in-the-world-by-market-capitalization/ (accessed on 28 January 2022).
  5. BusinessWire Global $243 Billion Big Data Market Trajectory & Analytics to 2027. Available online: https://www.businesswire.com/news/home/20201208005685/en/Global-243-Billion-Big-Data-Market-Trajectory-Analytics-to-2027-Age-of-Analytics-Provides-the-Cornerstone-for-the-Disruptive-Growth-Proliferation-of-Big-Data-Technologies---ResearchAndMarkets.com (accessed on 28 January 2022).
  6. Davenport, T.H.; Patil, D.J. Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, 1 October 2012. [Google Scholar]
  7. LinkedIn US Jobs on the Rise Report. Available online: https://business.linkedin.com/talent-solutions/resources/talent-acquisition/jobs-on-the-rise-us (accessed on 28 January 2022).
  8. NewVantage Partners LLC. Big Data and AI Executive Survey 2021: Executive Summary of Findings. Available online: https://www.newvantage.com/_files/ugd/e5361a_d59b4629443945a0b0661d494abb5233.pdf (accessed on 28 January 2022).
  9. Capgemini Consulting Cracking the Data Conundrum: How Successful Companies Make Big Data Operational 2014. Available online: https://www.capgemini.com/gb-en/wp-content/uploads/sites/3/2019/01/Cracking-the-Data-Conundrum-How-Successful-Companies-Make-Big-Data-Operational.pdf (accessed on 28 January 2022).
  10. White, A. Our Top Data and Analytics Predicts for 2019. Gartner Blog Network 2019. Available online: https://blogs.gartner.com/andrew_white/2019/01/03/our-top-data-and-analytics-predicts-for-2019/ (accessed on 28 January 2022).
  11. Fleming, O.; Fountaine, T.; Henke, N.; Saleh, T. Getting Your Organization’s Advanced Analytics Program Right. Available online: https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/ten-red-flags-signaling-your-analytics-program-will-fail (accessed on 28 January 2022).
  12. Kelleher, J.D.; Tierney, B. Data Science; The MIT Press: Cambridge, MA, USA, 2018; ISBN 978-0-262-53543-4. [Google Scholar]
  13. Breiman, L. Statistical Modeling: The Two Cultures. Stat. Sci. 2001, 16, 199–215. [Google Scholar] [CrossRef]
  14. Habib ur Rehman, M.; Liew, C.S.; Abbas, A.; Jayaraman, P.P.; Wah, T.Y.; Khan, S.U. Big Data Reduction Methods: A Survey. Data Sci. Eng. 2016, 1, 265–284. [Google Scholar] [CrossRef]
  15. Brown, M.S. Transforming Unstructured Data into Useful Information. In Big Data, Mining, and Analytics; Auerbach Publications: Boca Raton, FL, USA, 2014; ISBN 978-0-429-09529-0. [Google Scholar]
  16. L’Heureux, A.; Grolinger, K.; Elyamany, H.F.; Capretz, M.A.M. Machine Learning with Big Data: Challenges and Approaches. IEEE Access 2017, 5, 7776–7797. [Google Scholar] [CrossRef]
  17. Krasteva, I.; Ilieva, S. Adopting Agile Software Development Methodologies in Big Data Projects—A Systematic Literature Review of Experience Reports. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 2028–2033. [Google Scholar]
  18. Project Management Institute (PMI). The Standard for Project Management and a Guide to the Project Management Body of Knowledge (PMBOK Guide), 7th ed.; Project Management Institute: Newtown Square, PA, USA, 2021; ISBN 978-1-62825-667-3. [Google Scholar]
  19. Bi, W.; Cai, M.; Liu, M.; Li, G. A Big Data Clustering Algorithm for Mitigating the Risk of Customer Churn. IEEE Trans. Ind. Inform. 2016, 12, 1270–1281. [Google Scholar] [CrossRef]
  20. Jagabathula, S.; Subramanian, L.; Venkataraman, A. A Model-Based Embedding Technique for Segmenting Customers. Oper. Res. 2018, 66, 1247–1267. [Google Scholar] [CrossRef]
  21. Thennakoon, A.; Bhagyani, C.; Premadasa, S.; Mihiranga, S.; Kuruwitaarachchi, N. Real-Time Credit Card Fraud Detection Using Machine Learning. In Proceedings of the 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 10–11 January 2019; pp. 488–493. [Google Scholar]
  22. Mittal, S.; Tyagi, S. Performance Evaluation of Machine Learning Algorithms for Credit Card Fraud Detection. In Proceedings of the 2019 9th International Conference on Cloud Computing, Data Science & Engineering, Noida, India, 10–11 January 2019; pp. 320–324. [Google Scholar]
  23. Liou, F.-M.; Yang, C.-H. Predicting Business Failure under the Existence of Fraudulent Financial Reporting. Int. J. Account. Inf. Manag. 2008, 16, 74–86. [Google Scholar] [CrossRef]
  24. Tian, F.; Lan, T.; Chao, K.-M.; Godwin, N.; Zheng, Q.; Shah, N.; Zhang, F. Mining Suspicious Tax Evasion Groups in Big Data. IEEE Trans. Knowl. Data Eng. 2016, 28, 2651–2664. [Google Scholar] [CrossRef]
  25. Raut, R.; Yadav, V.S.; Cheikhrouhou, N.; Narwane, V.S.; Narkhede, B.E. Big Data Analytics: Implementation Challenges in Indian Manufacturing Supply Chains. Comput. Ind. 2021, 125, 103368. [Google Scholar] [CrossRef]
  26. Park, J.-H.; Kim, Y.B. Factors Activating Big Data Adoption by Korean Firms. J. Comput. Inf. Syst. 2021, 61, 285–293. [Google Scholar] [CrossRef]
  27. Nayal, K.; Raut, R.D.; Queiroz, M.M.; Yadav, V.S.; Narkhede, B.E. Are Artificial Intelligence and Machine Learning Suitable to Tackle the COVID-19 Impacts? An Agriculture Supply Chain Perspective. Int. J. Logist. Manag. 2021, 34, 304–335. [Google Scholar] [CrossRef]
  28. Piatetsky, G. CRISP-DM, Still the Top Methodology for Analytics, Data Mining, or Data Science Projects. KDnuggets 2014. Available online: https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html (accessed on 7 May 2023).
  29. Martinez, I.; Viles, E.; Olaizola, I.G. Data Science Methodologies: Current Challenges and Future Approaches. Big Data Res. 2021, 24, 100183. [Google Scholar] [CrossRef]
  30. Morlock, F.; Boßlau, M. Concept for Enabling Customer-Oriented Data Analytics via Integration of Production Process Improvement Methods and Data Science Methods. Procedia CIRP 2021, 104, 542–546. [Google Scholar] [CrossRef]
  31. Rameezdeen, R.; Chileshe, N.; Hosseini, M.R.; Lehmann, S. A Qualitative Examination of Major Barriers in Implementation of Reverse Logistics within the South Australian Construction Sector. Int. J. Constr. Manag. 2016, 16, 185–196. [Google Scholar] [CrossRef]
  32. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
  33. Xiao, Y.; Watson, M. Guidance on Conducting a Systematic Literature Review. J. Plan. Educ. Res. 2019, 39, 93–112. [Google Scholar] [CrossRef]
  34. Haddaway, N.R.; Page, M.J.; Pritchard, C.C.; McGuinness, L.A. PRISMA2020: An R Package and Shiny App for Producing PRISMA 2020-compliant Flow Diagrams, with Interactivity for Optimised Digital Transparency and Open Synthesis. Campbell Syst. Rev. 2022, 18, e1230. [Google Scholar] [CrossRef]
  35. Brock, V.F.; Khan, H.U. Are Enterprises Ready for Big Data Analytics? A Survey-Based Approach. Int. J. Bus. Inf. Syst. 2017, 25, 256. [Google Scholar] [CrossRef]
  36. Saltz, J.; Shamshurin, I.; Connors, C. Predicting Data Science Sociotechnical Execution Challenges by Categorizing Data Science Projects. J. Assoc. Inf. Sci. Technol. 2017, 68, 2720–2728. [Google Scholar] [CrossRef]
  37. Bernardi, L.; Mavridis, T.; Estevez, P. 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.Com. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1743–1751. [Google Scholar]
  38. Alalawneh, A.A.F.; Alkhatib, S.F. The Barriers to Big Data Adoption in Developing Economies. Electron. J. Inf. Syst. Dev. Ctries. 2021, 87, e12151. [Google Scholar] [CrossRef]
  39. Rosenthal, R.; DiMatteo, M.R. Meta-Analysis: Recent Developments in Quantitative Methods for Literature Reviews. Annu. Rev. Psychol. 2001, 52, 59–82. [Google Scholar] [CrossRef]
  40. Börner, K.; Sanyal, S.; Vespignani, A. Network Science. Annu. Rev. Inf. Sci. Technol. 2007, 41, 537–607. [Google Scholar] [CrossRef]
  41. Newman, M.E.J. Communities, Modules and Large-Scale Structure in Networks. Nat. Phys. 2012, 8, 25–31. [Google Scholar] [CrossRef]
  42. Jaccard, P. The Distribution of the Flora in the Alpine Zone.1. New Phytol. 1912, 11, 37–50. [Google Scholar] [CrossRef]
  43. Kosub, S. A Note on the Triangle Inequality for the Jaccard Distance. Pattern Recognit. Lett. 2019, 120, 36–38. [Google Scholar] [CrossRef]
  44. Thorndike, R.L. Who Belongs in the Family? Psychometrika 1953, 18, 267–276. [Google Scholar] [CrossRef]
  45. Rousseeuw, P.J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  46. Riveros, C.; Salas, J.; Skibski, O. How to Choose the Root: Centrality Measures over Tree Structures. arXiv 2021, arXiv:2112.13736. [Google Scholar]
  47. Kavre, M.; Gardas, B.; Narwane, V.; Jafari Navimipour, N.; Yalcin, S. Evaluating the Effect of Human Factors on Big Data Analytics and Cloud of Things Adoption in the Manufacturing Micro, Small, and Medium Enterprises. IT Prof. 2022, 24, 17–26. [Google Scholar] [CrossRef]
  48. Sharma, M.; Luthra, S.; Joshi, S.; Kumar, A. Implementing Challenges of Artificial Intelligence: Evidence from Public Manufacturing Sector of an Emerging Economy. Gov. Inf. Q. 2022, 39, 101624. [Google Scholar] [CrossRef]
  49. Gangadhari, R.K.; Khanzode, V.; Murthy, S.; Dennehy, D. Modelling the Relationships between the Barriers to Implementing Machine Learning for Accident Analysis: The Indian Petroleum Industry. Benchmarking 2022, 30, 3357–3381. [Google Scholar] [CrossRef]
  50. Gupta, A.K.; Goyal, H. Framework for Implementing Big Data Analytics in Indian Manufacturing: ISM-MICMAC and Fuzzy-AHP Approach. Inf. Technol. Manag. 2021, 22, 207–229. [Google Scholar] [CrossRef]
  51. Bahrami, F.; Kanaani, F.; Turkina, E.; Moin, M.S.; Shahbazi, M. Key Challenges in Big Data Startups: An Exploratory Study in Iran. Iran. J. Manag. Stud. 2021, 14, 273–289. [Google Scholar] [CrossRef]
  52. Raut, R.; Narwane, V.; Kumar Mangla, S.; Yadav, V.S.; Narkhede, B.E.; Luthra, S. Unlocking Causal Relations of Barriers to Big Data Analytics in Manufacturing Firms. Ind. Manag. Data Syst. 2021, 121, 1939–1968. [Google Scholar] [CrossRef]
  53. Bag, S.; Gupta, S.; Wood, L. Big Data Analytics in Sustainable Humanitarian Supply Chain: Barriers and Their Interactions. Ann. Oper. Res. 2020, 319, 721–760. [Google Scholar] [CrossRef]
  54. Zhang, X.; Lam, J.S.L. A Fuzzy Delphi-AHP-TOPSIS Framework to Identify Barriers in Big Data Analytics Adoption: Case of Maritime Organizations. Marit. Policy Manag. 2019, 46, 781–801. [Google Scholar] [CrossRef]
  55. Moktadir, M.A.; Ali, S.M.; Paul, S.K.; Shukla, N. Barriers to Big Data Analytics in Manufacturing Supply Chains: A Case Study from Bangladesh. Comput. Ind. Eng. 2019, 128, 1063–1075. [Google Scholar] [CrossRef]
  56. Shukla, M.; Mattar, L. Next Generation Smart Sustainable Auditing Systems Using Big Data Analytics: Understanding the Interaction of Critical Barriers. Comput. Ind. Eng. 2019, 128, 1015–1026. [Google Scholar] [CrossRef]
  57. Kastouni, M.Z.; Ait Lahcen, A. Big Data Analytics in Telecommunications: Governance, Architecture and Use Cases. J. King Saud. Univ. Comput. Inf. Sci. 2022, 34, 2758–2770. [Google Scholar] [CrossRef]
  58. Aho, T.; Kilamo, T.; Lwakatare, L.; Mikkonen, T.; Sievi-Korte, O.; Yaman, S. Managing and Composing Teams in Data Science: An Empirical Study. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 2291–2300. [Google Scholar]
  59. Escobar, C.A.; McGovern, M.E.; Morales-Menendez, R. Quality 4.0: A Review of Big Data Challenges in Manufacturing. J. Intell. Manuf. 2021, 32, 2319–2334. [Google Scholar] [CrossRef]
  60. Wang, S.; Wang, H. Big Data for Small and Medium-Sized Enterprises (SME): A Knowledge Management Model. J. Knowl. Manag. 2020, 24, 881–897. [Google Scholar] [CrossRef]
  61. Saltz, J.S.; Shamshurin, I. Achieving Agile Big Data Science: The Evolution of a Team’s Agile Process Methodology. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 3477–3485. [Google Scholar]
  62. Jensen, M.H.; Nielsen, P.A.; Persson, J.S. Managing Big Data Analytics Projects: The Challenges of Realizing Value. In Proceedings of the 27th European Conference on Information Systems (ECIS), Muenster, Germany, 8–14 June 2019. [Google Scholar]
  63. Kim, M.; Zimmermann, T.; DeLine, R.; Begel, A. Data Scientists in Software Teams: State of the Art and Challenges. IEEE Trans. Softw. Eng. 2018, 44, 1024–1038. [Google Scholar] [CrossRef]
  64. Barham, H.; Daim, T. Identifying Critical Issues in Smart City Big Data Project Implementation. In Proceedings of the SCC ‘18: The 1st ACM/EIGSCC Symposium on Smart Cities and Communities, Portland, OR, USA, 20–22 June 2018. [Google Scholar]
  65. Barham, H. Achieving Competitive Advantage through Big Data: A Literature Review. In Proceedings of the 2017 Portland International Conference on Management of Engineering and Technology (PICMET), Portland, OR, USA, 9–13 July 2017; pp. 1–7. [Google Scholar]
  66. Becker, D.K. Predicting Outcomes for Big Data Projects: Big Data Project Dynamics (BDPD): Research in Progress. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; pp. 2320–2330. [Google Scholar]
  67. Chipidza, W.; George, J.; Koch, H. Chartering Predictive Analytics: A Case Study. In Proceedings of the 22nd Americas Conference on Information Systems, AMCIS 2016, San Diego, CA, USA, 11–14 August 2016. [Google Scholar]
  68. Klosowski, T. The State of Consumer Data Privacy Laws in the US (And Why It Matters). Wirecutter. 6 September 2021. Available online: https://www.nytimes.com/wirecutter/blog/state-of-privacy-laws-in-us/ (accessed on 20 November 2022).
  69. Wang, R.Y.; Strong, D.M. Beyond Accuracy: What Data Quality Means to Data Consumers. J. Manag. Inf. Syst. 1996, 12, 5–33. [Google Scholar] [CrossRef]
Figure 1. PRISMA flow diagram; created with [34].
Figure 2. Number of studies by research context; the total exceeds 26 because a study can be associated with more than one context; created with MS Excel.
Figure 3. Relationship network: a line represents a relation between two barriers, its length is the Jaccard distance, and colors represent the clusters; distances > 0.9 are not displayed for visual clarity; created with NetworkX v3.2.1 library for Python v3.12.4.
Figure 4. Determination of the number of clusters using the elbow method and the silhouette score; created with MS Excel.
Figure 5. Dendrogram showing the clusters, the y-axis displays the Jaccard distance; created with Scipy v1.13.1 library for Python v3.12.4.
Table 1. Search strategies and search terms.
Search Strategy | Topic | Search Terms * | Result | Search Date
Strategy 1 | Data Science | “data science” OR “big data” OR “data analytics” OR “machine learning” AND | 213 | 20 October 2022
 | Barrier | “barrier*” OR “obstacle*” OR “challenge*” OR “hindrance*” AND | |
 | MCDM | “multi-criteria decision-making” OR “MCDM” OR “AHP” OR “TOPSIS” OR “VIKOR” OR “ANP” OR “DEMATEL” OR “PROMETHEE” OR “ELECTRE” OR “ISM” OR “TISM” OR “MICMAC” | |
Strategy 2 | Data Science Project | “data science project” OR “big data project” OR “data analytics project” OR “machine learning project” AND | 239 | 21 October 2022
 | Barrier or project failure | “barrier*” OR “obstacle*” OR “challenge*” OR “hindrance*” OR “fail*” | |
* Terms such as “data mining” and “artificial intelligence” were initially considered and tested but ultimately excluded due to high retrieval of unrelated or overly broad studies, particularly outside the scope of data initiatives.
Table 2. Summary of the studies of Data Science barriers.
# | Author | Context | Methods | Keypoints
A | Kavre et al. (2022) [47] | Small and medium-sized enterprises (SME), India | Literature review, expert opinion, ISM, DEMATEL | Barriers to BDA adoption. Analyzes the impact and relationships among the barriers.
B | Sharma et al. (2022) [48] | Manufacturing, public sector, India | DEMATEL | Barriers to machine learning adoption. Analyzes the impact and relationships between the barriers.
C | Gangadhari et al. (2022) [49] | Oil industry, India | Literature review (PRISMA), DEMATEL, COPRAS, MOORA, Delphi (n = 10) | Barriers to machine learning adoption. Analyzes the impact and relationships between the barriers.
D | Nayal et al. (2021) [27] | Agriculture, COVID-19, India | Delphi, ISM, Fuzzy MICMAC, ANP | Barriers to machine learning adoption. Analyzes the impact and relationships between the barriers.
E | Raut, Yadav, et al. (2021) [25] | Manufacturing, India | ISM, Delphi, Fuzzy MICMAC, DEMATEL, expert opinion (n = 47) | Barriers to BDA adoption. Analyzes the impact and relationships between the barriers.
F | Park & Kim (2021) [26] | No specific industry, Korea | AHP (n = 50), regression analysis (n = 226) | Barriers to BD adoption. Categorizes and analyzes the impact of the barriers.
G | Gupta & Goyal (2021) [50] | Manufacturing, India | Literature review, survey, ISM, MICMAC, Fuzzy AHP, expert opinion (n = 16) | Barriers to BDA adoption. Categorizes and analyzes the impact and relationships among the barriers.
H | Bahrami et al. (2021) [51] | Startups, Iran | Interviews, survey, Fuzzy AHP | Barriers faced by BD startups. Categorizes and analyzes the impact of the barriers.
I | Raut, Narwane, et al. (2021) [52] | Supply chain, India | Literature review, survey, DEMATEL, ANP, expert opinion (n = 15) | Barriers to BDA adoption. Analyzes the impact and relationships among the barriers.
J | Bag et al. (2020) [53] | Third sector, supply chain, Africa | Literature review, Fuzzy TISM (n = 5), survey (n = 108), SEM | Barriers to BDA adoption. Analyzes the impact and relationships among the barriers.
K | Alalawneh & Alkhatib (2020) [38] | Financial, industrial, services, public and supply chain sectors, Jordan | Literature review, semi-structured interviews, survey (n = 23), AHP, TOPSIS | Barriers to BD adoption. Categorizes and analyzes the impact and relationships among the barriers.
L | Zhang & Lam (2019) [54] | Maritime industry | Fuzzy-Delphi (n = 6), Fuzzy AHP (n = 20), TOPSIS | Barriers to BDA adoption. Categorizes and analyzes the impact of the barriers.
M | Moktadir et al. (2019) [55] | Supply chain, Bangladesh | Literature review, Delphi (n = 15), AHP, expert opinion | Barriers to BDA adoption. Categorizes and analyzes the impact of the barriers.
N | Shukla & Mattar (2019) [56] | Agriculture | Literature review, ISM, MICMAC, expert opinion | Barriers to BDA adoption. Analyzes the impact and relationships between the barriers.
O | Kastouni & Lahcen (2022) [57] | Telecommunications | Case study (n = 1) | Barriers for the studied BDA project. Categorizes the barriers.
P | Martinez et al. (2021) [29] | No specific context | Literature review | DS project management methodologies. Proposes a framework for framing the methodologies. Raises challenges for DS projects.
Q | Aho et al. (2021) [58] | No specific context | Survey (n = 50) | Barriers and other issues for DS projects.
R | Escobar et al. (2021) [59] | Manufacturing | Literature review | Barriers to DS projects.
S | Wang & Wang (2020) [60] | Small and medium-sized enterprises | Literature review, case studies (n = 8) | Proposes a knowledge management model for BD. Raises barriers for BD projects.
T | Saltz & Shamshurin (2019) [61] | Undisclosed | Case study (n = 1) | Agile methodologies for DS projects. Raises barriers for DS projects.
U | Jensen et al. (2019) [62] | Energy | Case study (n = 1) | Barriers for the BD project studied.
V | Kim et al. (2018) [63] | Technology | Survey (n = 793) | Barriers and other issues faced by data scientists in projects. Categorizes the barriers.
W | Barham & Daim (2018) [64] | Smart cities | Literature review | Barriers for BD projects. Categorizes the barriers.
X | Barham (2017) [65] | No specific context | Literature review | Barriers and other issues for DS projects.
Y | Becker (2017) [66] | No specific context | Survey (n = 19), system modelling and simulation | Model for predicting outcomes of BD projects. Raises causes of failure for projects of this type.
Z | Chipidza et al. (2016) [67] | Supply chain | Case study (n = 1) | Barriers for the BD project studied.
Table 3. Description of the barriers.
# | Barrier | Description
B1 | Insufficient skills | The organization does not have individuals with the necessary skills to execute the project.
B2 | Poor data quality | The data provided is of poor quality (i.e., lacking accuracy, relevancy, and representation).
B3 | Insufficient IT infrastructure | The organization’s IT infrastructure does not support the project’s needs.
B4 | Data privacy and security | Privacy and security risks are not managed properly.
B5 | IT illiteracy | Employees involved with the project lack the basic technology knowledge needed to cooperate or to use the tools developed.
B6 | Lack of support from top management | Top management does not provide the necessary support for the development of the project.
B7 | Complexity of data or technology | Data or technology characteristics are too complex.
B8 | Lack of an integrated data environment | There is no integrated data environment from which the project can load and write information.
B9 | Insufficient funding | There is not enough funding for the project’s needs.
B10 | Immature technology and lack of appropriate tools | The technology is immature in the sense of well-established responsibilities, processes, and tools.
B11 | Strategy mismatch | There is a mismatch between company strategy and project goals.
B12 | Inadequate data sharing policy | The organization’s data sharing policy hinders the team’s access to needed data.
B13 | Scalability issues | The established infrastructure lacks the scalability to support data processing and storage over time.
B14 | Resistance to change and other cultural barriers | There is unwillingness on the part of the organization to adapt to the new processes. This includes other cultural barriers such as fear of technology and behavioral problems.
B15 | Insufficient ROI or business case | The proposed project does not have sufficient ROI or business case (justification) to gain the support of the decision-makers.
B16 | Inadequate training programs and facilities | The organization’s training programs and/or facilities are inadequate.
B17 | High investment and maintenance cost | Project implementation requires a high investment cost and also leads to high maintenance costs.
B18 | Government policies and regulation | Government regulation and/or policies are insufficient, counterproductive, or unclear.
B19 | Poor data management and architecture | The organization’s data management and architecture are inadequate, making it difficult to acquire data and integrate the new processes into the existing architecture.
B20 | Lack of coordination, collaboration, and communication | There is a lack of coordination, collaboration, or communication between the parties involved in the project.
B21 | Data availability | The necessary data is not available, whether due to a lack of access or other issues, such as data loss, lack of record keeping, and physical (non-electronic) storage.
B22 | Inadequate or inconsistent methodology | The methodology used is inadequate, inconsistent, or immature.
B23 | External sources of data | External sources of data are not available to the organization, whether due to legal issues, high costs, or other reasons.
B24 | Scope, objectives, and expected results unclear | Scope, objectives, and expected results were poorly defined at the beginning of the project.
B25 | Uncertainty about benefits | The organization is uncertain about the benefits of the project, resulting in a lack of support to start the project or to develop its activities.
B26 | Deployment and sustainability issues | Deployment and/or sustainability are inadequate, leading to low usage of project deliverables, possibly due to the lack of an implementation plan to manage the changes in the business processes.
B27 | Associated risks | Risks associated with the project are high or overestimated.
Table 4. Barriers identified in the literature. Freq. indicates the number of studies referencing each barrier, and F. (x) indicates the number of studies classifying the barrier as critical.
# | Freq. | F. (x)
B1 | 20 | 5
B2 | 18 | 2
B3 | 15 | 4
B4 | 15 | 4
B5 | 15 | 2
B6 | 14 | 4
B7 | 11 | 2
B8 | 11 | 2
B9 | 10 | 3
B10 | 9 | 3
B11 | 9 | 2
B12 | 9 | 1
B13 | 9 | 0
B14 | 9 | 0
B15 | 9 | 0
B16 | 8 | 3
B17 | 8 | 2
B18 | 8 | 2
B19 | 8 | 1
B20 | 8 | 1
B21 | 8 | 0
B22 | 6 | 1
B23 | 6 | 0
B24 | 6 | 0
B25 | 5 | 2
B26 | 4 | 0
B27 | 4 | 0
Table 5. Distance matrix. Values represent the Jaccard distance between one barrier and another, where 1 = no similarity (no relation) and 0 = maximum similarity.

      B1  B2  B3  B4  B5  B6  B7  B8  B9  B10 B11 B12 B13 B14 B15 B16 B17 B18 B19 B20 B21 B22 B23 B24 B25 B26
B2   1.0
B3   1.0 0.7
B4   1.0 0.6 0.9
B5   0.6 1.0 1.0 1.0
B6   0.8 1.0 1.0 0.9 0.8
B7   1.0 0.6 0.8 0.7 1.0 1.0
B8   1.0 0.3 0.8 0.7 1.0 1.0 0.4
B9   0.9 1.0 1.0 1.0 1.0 0.9 1.0 1.0
B10  1.0 0.9 0.6 0.9 1.0 1.0 0.8 1.0 1.0
B11  0.9 1.0 1.0 1.0 0.9 0.6 1.0 1.0 1.0 1.0
B12  0.9 0.9 1.0 0.9 1.0 0.8 1.0 0.9 1.0 1.0 0.9
B13  1.0 0.6 0.7 0.7 1.0 1.0 0.6 0.7 1.0 0.6 1.0 1.0
B14  1.0 1.0 1.0 0.9 1.0 0.8 1.0 1.0 1.0 1.0 0.8 0.7 1.0
B15  0.9 1.0 1.0 1.0 0.9 0.9 1.0 1.0 0.9 1.0 0.9 1.0 1.0 1.0
B16  0.7 1.0 0.9 1.0 0.9 0.9 1.0 1.0 0.9 0.9 1.0 0.9 1.0 0.9 1.0
B17  0.8 1.0 1.0 1.0 1.0 0.9 1.0 1.0 0.3 1.0 1.0 1.0 1.0 1.0 0.9 0.9
B18  1.0 1.0 1.0 0.9 1.0 1.0 1.0 1.0 0.9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
B19  1.0 0.8 0.9 0.9 1.0 1.0 0.9 0.8 1.0 1.0 1.0 0.9 0.9 1.0 1.0 1.0 1.0 1.0
B20  0.9 1.0 1.0 1.0 1.0 0.9 1.0 1.0 1.0 1.0 0.8 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
B21  1.0 0.6 0.9 0.9 1.0 1.0 0.9 0.7 1.0 0.8 1.0 0.9 0.9 1.0 1.0 1.0 1.0 1.0 1.0 1.0
B22  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.6 1.0 1.0 1.0 1.0 1.0 1.0
B23  1.0 1.0 1.0 0.9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.3 1.0 1.0 1.0 1.0
B24  1.0 1.0 1.0 1.0 1.0 0.9 1.0 1.0 1.0 1.0 0.8 0.9 1.0 0.8 0.8 1.0 1.0 1.0 1.0 1.0 1.0 0.8 1.0
B25  0.9 0.9 0.9 0.9 0.8 0.9 0.9 0.9 1.0 1.0 0.8 1.0 1.0 1.0 0.8 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
B26  0.9 1.0 1.0 1.0 0.9 0.9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.9 1.0 1.0 1.0 1.0 0.8 1.0 0.6 1.0 0.8 1.0
B27  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.8 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
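Each entry of the matrix is a Jaccard distance between the sets of references that cite the two barriers. A minimal sketch of the computation, using illustrative reference sets rather than the review's actual data:

```python
# Jaccard distance between two barriers, computed from the sets of references
# that cite each one (illustrative sets, not the review's actual data).
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A & B| / |A | B|: 1.0 means no shared references, 0.0 identical sets."""
    union = a | b
    if not union:
        return 1.0  # no citation information at all
    return 1.0 - len(a & b) / len(union)

refs_b2 = {"A", "C", "H", "K"}  # hypothetical references citing B2
refs_b8 = {"A", "C", "H"}       # hypothetical references citing B8
d = jaccard_distance(refs_b2, refs_b8)  # 1 - 3/4 = 0.25
```

Because the distance depends only on co-citation overlap, two barriers never discussed by the same reference sit at the maximum distance of 1.0, as most cells in Table 5 show.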
Table 6. Clusters and centrality measures.

#     Cluster / barrier                                        Closeness   Degree
People
B1    Insufficient skills                                      1.38        11
B5    IT illiteracy                                            1.28        7
B16   Inadequate training programs and facilities              1.16        9
B25   Uncertainty about benefits                               1.10        10
Data and technology
B2    Poor data quality                                        1.57        10
B8    Lack of an integrated data environment                   1.52        9
B7    Complexity of data or technology                         1.42        9
B13   Scalability issues                                       1.42        8
B3    Insufficient IT infrastructure                           1.29        10
B4    Data privacy and security                                1.27        14
B21   Data availability                                        1.22        8
B10   Immature technology and lack of appropriate tools        1.20        7
B19   Poor data management and architecture                    1.10        7
Management
B6    Lack of support from top management                      1.31        14
B11   Strategy mismatch                                        1.27        9
B14   Resistance to change and other cultural barriers         1.22        6
B12   Inadequate data sharing policy                           1.17        11
B20   Lack of coordination, collaboration, and communication   1.09        4
Economic
B9    Insufficient funding                                     1.72        7
B17   High investment and maintenance cost                     1.50        5
B27   Associated risks                                         1.09        1
Project
B22   Inadequate or inconsistent methodology                   1.54        3
B15   Insufficient ROI or business case                        1.33        10
B26   Deployment and maintenance issues                        1.33        7
B24   Scope, objectives, or expected results unclear           1.28        7
External
B18   Government policies and regulation                       4.00        3
B23   External sources of data                                 4.00        2
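As an illustration of how such centrality measures arise from a distance matrix like Table 5, the sketch below links barriers whose Jaccard distance is below 1.0, then computes each node's degree and one common closeness normalization, (n − 1) divided by the sum of weighted shortest-path distances. The distances and the normalization here are assumptions for illustration; the values will not reproduce Table 6.

```python
from itertools import combinations

# Illustrative subset of Jaccard distances (Table 5 holds the full matrix).
dist = {
    frozenset({"B2", "B8"}): 0.3,
    frozenset({"B2", "B4"}): 0.6,
    frozenset({"B4", "B8"}): 0.7,
    frozenset({"B2", "B13"}): 0.6,
    frozenset({"B8", "B13"}): 0.7,
}

# An edge links two barriers whenever their distance is below 1.0 (i.e., they
# share at least one citing reference).
edges = {pair: d for pair, d in dist.items() if d < 1.0}
nodes = sorted({n for pair in edges for n in pair})

# Degree: number of neighbours in the thresholded graph.
degree = {n: sum(n in pair for pair in edges) for n in nodes}

# Closeness as (n - 1) / sum of weighted shortest-path distances,
# via Floyd-Warshall on the small graph.
INF = float("inf")
sp = {(a, b): 0.0 if a == b else edges.get(frozenset({a, b}), INF)
      for a in nodes for b in nodes}
for k in nodes:
    for a, b in combinations(nodes, 2):
        via = sp[(a, k)] + sp[(k, b)]
        if via < sp[(a, b)]:
            sp[(a, b)] = sp[(b, a)] = via
closeness = {n: (len(nodes) - 1) / sum(sp[(n, m)] for m in nodes if m != n)
             for n in nodes}
```

In this toy graph, B2 and B8 share edges with every other node and so have the highest degree, mirroring how B1, B4, and B6 stand out within their clusters in Table 6.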

Share and Cite

MDPI and ACS Style

Labarrère, N.; Costa, L.; Lima, R.M. Data Science Project Barriers—A Systematic Review. Data 2025, 10, 132. https://doi.org/10.3390/data10080132
