Article

Future-Ready Skills Across Big Data Ecosystems: Insights from Machine Learning-Driven Human Resource Analytics

by Fatih Gurcan 1, Beyza Gudek 1, Gonca Gokce Menekse Dalveren 2 and Mohammad Derawi 3,*
1 Department of Management Information Systems, Faculty of Economics and Administrative Sciences, Karadeniz Technical University, Trabzon 61080, Turkey
2 Department of Computer Engineering, Izmir Bakircay University, Izmir 35665, Turkey
3 Department of Electronic Systems, Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology, 7034 Gjøvik, Norway
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 5841; https://doi.org/10.3390/app15115841
Submission received: 19 April 2025 / Revised: 17 May 2025 / Accepted: 21 May 2025 / Published: 22 May 2025

Abstract

This study aims to analyze online job postings using machine learning-based, semantic approaches and to identify the expertise roles and competencies required for big data professions. The methodology of this study employs latent Dirichlet allocation (LDA), a probabilistic topic modeling technique, to reveal hidden semantic structures within a corpus of big data job postings. As a result of our analysis, we have identified seven expertise roles, six proficiency areas, and 32 competencies (knowledge, skills, and abilities) necessary for big data professions. These positions include “developer”, “engineer”, “architect”, “analyst”, “manager”, “administrator”, and “consultant”. The six essential proficiency areas for big data are “big data knowledge”, “developer skills”, “big data analytics”, “cloud services”, “soft skills”, and “technical background”. Furthermore, the top five skills emerged as “big data processing”, “big data tools”, “communication skills”, “remote development”, and “big data architecture”. The findings of our study indicated that the competencies required for big data careers cover a broad spectrum, including technical, analytical, developer, and soft skills. Our findings provide a competency map for big data professions, detailing the roles and skills required. It is anticipated that the findings will assist big data professionals in assessing and enhancing their competencies, businesses in meeting their big data labor force needs, and academies in customizing their big data training programs to meet industry requirements.

1. Introduction

Driven by today’s rapidly advancing information and communication technologies, big data has become an indispensable strategic information resource, more so than ever before. The ever-increasing participation and utilization of online communication and interaction networks have led to the exponential growth of various types of data produced and shared by individuals, institutions, devices, and systems [1]. Big data consists of massive and frequently unstructured datasets made available by a variety of data sources, such as online platforms, social networks, and organizations [2]. These large-scale data sources have the potential to facilitate increased productivity, quicker and more effective decision-making, more accurate forecasts, and greater adaptability [3]. However, processing this large volume of diverse, unstructured data efficiently and drawing meaningful inferences from it requires extensive knowledge, skills, and abilities [4]. Big data competencies include a variety of data-driven operational skills, such as data collection, data logging, data cleansing, data analytics, data visualization, data security and privacy, scaling, deployment, empowerment, and querying [3,5,6]. Among the responsibilities of big data specialists, the most essential is interpreting data and identifying their strategic potential through the use of data mining and analytic models. Big data expertise requires familiarity with the software and hardware needed to acquire data and manage multiple data sources [1,7,8].
To derive all possible benefits from sources of big data, both government and commercial organizations must possess the necessary big data knowledge and skills [9]. Big data competencies are crucial skills that will play an important role in today’s data-driven decision-making across all industries [3]. As a result, the need for professionals with the knowledge and skills to effectively manage and interpret big data has become a pressing issue that must be addressed. There are new career opportunities for big data professionals every day, particularly in the fields of big data processing and big data analytics [10,11]. As the use of services and applications centered on big data has become more pervasive, the demand for qualified labor in the field of big data has steadily increased, and big data has become one of the most important sources of employment in the current labor market [4,5,8]. In this context, big data competency requirements provide IT professionals with diverse career opportunities [6]. For these workers, big data jobs have become increasingly tangible over the past decade. Throughout the life cycle of big data-oriented applications, big data professionals may assume various responsibilities and roles, depending on the scope and nature of their work [4,12]. By analyzing and defining the roles (titles and areas of expertise) and responsibilities (competencies and skills) of big data professionals, it is possible to gain a better understanding of all available big data professions [4,5]. There is a clear research gap in identifying the roles and related skills required for the most prominent big data jobs and defining the academic supply required to meet this demand.
Given the current background, we aim to close the aforementioned knowledge gap by empirically identifying and defining the job roles and skills necessary for businesses to effectively exploit the potential of big data. In particular, we intend to (1) identify and categorize competency requirements for big data professionals from a practitioner’s perspective and (2) establish correlations between these requirements and professional roles. To achieve this, we utilized online job postings as a data source and conducted a topic modeling-based semantic content analysis on 2280 big data job postings. This analysis resulted in the identification of 32 topics (skills) for big data professions, which define seven expertise roles and their respective competencies. For each of these roles, a conceptual job description has been developed, and the primary roles and responsibilities of big data professionals have been identified. Consequently, the findings of this study can provide insightful contributions to big data communities with a wide spectrum of profiles, and to industries in different arenas that are embracing the potential and benefits of big data.

2. Background and Related Work

Since the term “big data” first appeared in the scientific literature, it has been a dynamic industry closely related to human resources and a significant source of employment [13]. The requirement for expertise in big data, which has emerged as a natural consequence of the widespread use of big data, has been a fundamental issue frequently emphasized in the literature, particularly over the past decade [14]. Recent research indicates that, unlike traditional data proficiency, big data-driven services and applications require more advanced and specialized professional knowledge and skills.
Numerous synchronized processes based on storage, processing, analytics, and visualization comprise big data workflows [2]. Services and applications based on real-time data processing and analysis utilizing dynamic data flows have spawned new specializations for big data. Notable big data applications include online analytical processing (OLAP), streaming analytics, business intelligence, and business data analytics [15]. Consequently, meeting the growing demand for skilled labor in the big data industry has emerged as a problem that must be addressed in the near future [12]. This issue has been identified in the literature as a big data skills gap and has been highlighted in numerous scientific studies and industry reports [8,12,14]. On the other hand, a limited number of specific studies have been conducted based on the analysis of job postings to reveal the required knowledge and skills in the big data industry [5]. The focus of these studies was on various aspects of big data competency requirements, including big data analytics, data science, and other big data specializations [3,6,8,16].
Among them, the study by Debortoli et al. (2014) is one of the most prominent. They compiled a dataset with 1357 business intelligence and 450 big data job postings [3] and analyzed these postings using text mining procedures and the latent semantic analysis (LSA) technique. In a separate study, De Mauro et al. (2018) analyzed 2786 job postings related to big data using the latent Dirichlet allocation (LDA) method [4]. They identified four roles (business analysts, data scientists, developers, and system managers) and nine skills that correspond to these roles. Using text mining techniques, Gardiner et al. (2018) analyzed 1216 big data job postings [8]. In a similar study, Radovilsky et al. (2018) surveyed 1050 online job postings using semantic content analysis to identify the knowledge domains and skill sets that distinguish data scientist and data analyst positions [16].
Gurcan (2019) analyzed 2175 big data job postings using latent Dirichlet allocation (LDA), a generative method for topic modeling, and identified 60 topics for big data competencies [17]. Using the LDA-based topic modeling technique, Gurcan and Cagiltay (2019) analyzed 2638 job postings and revealed the big data skills required for software engineering [5]. Using text mining techniques, Persaud (2020) analyzed 3009 job postings for big data analytics. This study found that cognitive and functional skills, as well as social and meta-skills, are necessary for big data analytics [10]. Using the LSA method, Halwani et al. (2022) conducted a study on big data and data science skills by analyzing 1200 job postings. As a result, they determined the connections between academic and sectoral perspectives [6].
These studies, which investigate the various contexts of big data careers, have revealed the knowledge domains and skill sets required for big data professionals. The studies aimed to close the talent gap in the big data industry, benefiting both industry and academia. In this regard, our study on the competency requirements for big data extends the methodology and findings of the previous research.

3. Materials and Methods

Using quantitative procedures based on topic modeling, we investigate the major dynamics, themes, and tendencies of the big data labor market. The study comprises four phases: data collection, preprocessing, analysis, and interpretation. An empirical dataset of big data job postings was subjected to a semantic content analysis based on probabilistic topic modeling using a semi-automated methodology. The overall research design and methodological workflow are illustrated in Figure 1, which provides a visual guide to the sequential steps of the analysis. Each phase is addressed in detail in the following subsections.

3.1. Data Collection and Preprocessing

The World Wide Web, which contains a vast amount of information in a variety of formats and structures, is an ideal source of information for such scientific studies. To create the most relevant dataset, we sought a reputable online employment website from which job postings could be retrieved without restriction. Indeed.com [18], a well-known employment website with extensive search, filter, and retrieval capabilities, was selected as the data source for this study. Indeed.com is the most popular job site in the world, with over 250 million monthly visitors [17,18,19]. In this context, all job postings containing the phrase “big data” in the job title were considered big data jobs. Examples of the job postings that comprise our empirical dataset can be viewed by running the same search query on the Indeed website. Using an API developed by Indeed.com, 2280 big data job postings published between July 2024 and December 2024 were retrieved [18]. The job postings were saved in a database to form the experimental dataset. Each job posting in the dataset consists of two sections: job title and job description.
After creating the dataset, a series of sequential preprocessing steps was applied to the job posting texts [20]. Preprocessing is essential for the success of analyses based on unstructured web text [21]. First, the text was tokenized, that is, divided into words, to obtain meaningful attributes [22,23]. All text was converted to lowercase. Next, unnecessary punctuation, web links, HTML tags, and characters were removed. English stop words were then removed, since they carry little semantic content. Lemmatization was applied to reduce terms from their derived forms to their base forms [24,25,26]. The remaining words were used to create a document–term matrix (DTM) for the subsequent numerical analysis [27,28,29].
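The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the pipeline used in the study: the stop-word list, the lemma lookup table, and the sample postings are hypothetical stand-ins, and the text is lowercased and cleaned before it is split into tokens, which is a simplification of the order given in the text.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "at", "the", "of", "in", "to", "for", "with", "is", "are", "on"}

# Hypothetical lemma lookup standing in for a real lemmatizer.
LEMMAS = {"pipelines": "pipeline", "engineers": "engineer",
          "building": "build", "clusters": "cluster"}


def preprocess(text):
    """Lowercase, strip links/HTML/punctuation, tokenize, drop stop words, lemmatize."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # web links
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)               # punctuation, digits, other characters
    tokens = text.split()                               # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]


def document_term_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(preprocess(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical job description snippets.
postings = [
    "<p>Building Spark pipelines for the data team.</p> Apply at https://example.com/jobs",
    "Engineers with Hadoop clusters and Spark experience",
]
vocab, dtm = document_term_matrix(postings)
```

Each row of `dtm` corresponds to one posting and each column to one vocabulary term, which is the input shape a topic model expects.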

3.2. Data Analysis and Interpretation

In the first phase of the empirical analysis, we developed a taxonomy of big data specializations by classifying job titles according to their essential expertise roles. The job title of a big data position is a significant indicator of the expertise role in which the candidate will work [4]. Accordingly, each position was assigned a role based on the words in its job title. For instance, job titles such as big data engineer, big data systems engineer, and big data infrastructure engineer, which contain the word “engineer”, fall within the “engineer” role. Similarly, the words in all other big data job titles were identified and assigned to the corresponding expertise roles. In this way, seven primary expertise roles for big data professions were identified.
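The title-based role assignment can be illustrated with a simple keyword match. The sketch below is an assumption about how such a rule might look, not the procedure used in the study: the word stems, the first-match tie-breaking order, and the sample titles are all hypothetical.

```python
from collections import Counter

# Word stems per role, checked in this order; the stems and the
# first-match tie-breaking rule are illustrative assumptions.
ROLE_STEMS = [
    ("developer", "develop"),
    ("engineer", "engineer"),
    ("architect", "architect"),
    ("analyst", "analyst"),
    ("manager", "manag"),
    ("administrator", "administ"),
    ("consultant", "consult"),
]


def assign_role(job_title):
    """Return the first role whose stem occurs in the title, or None."""
    title = job_title.lower()
    for role, stem in ROLE_STEMS:
        if stem in title:
            return role
    return None


def role_shares(job_titles):
    """Percentage of titles assigned to each role (unmatched titles are skipped)."""
    roles = [r for r in map(assign_role, job_titles) if r is not None]
    counts = Counter(roles)
    return {role: 100.0 * n / len(roles) for role, n in counts.items()}


# Hypothetical titles for illustration.
titles = [
    "Big Data Engineer",
    "Big Data Systems Engineer",
    "Senior Big Data Developer",
    "Big Data Architect",
]
shares = role_shares(titles)
```

Titles that match more than one stem need an explicit priority rule; the order of `ROLE_STEMS` is one possible choice.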
After identifying the expertise roles, latent Dirichlet allocation (LDA) topic modeling was performed on the textual corpus of big data job postings [27,29,30]. LDA is a probabilistic, generative method for revealing the semantic structure of a textual corpus [27]. Text documents contain hidden semantic patterns referred to as “topics”, and multiple topics may appear in varying proportions in a single document. LDA-based topic modeling estimates these topics and their proportions. Because LDA is an unsupervised learning model, it discovers hidden topics in an unstructured text corpus without labeled training data. Consequently, the LDA algorithm is well suited to the topic modeling analysis of large datasets [29,31,32,33].
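The generative view behind LDA can be made concrete with a short simulation: each document draws its own topic proportions, and each word is produced by first picking a topic and then picking a word from that topic. The sketch below uses only the Python standard library; the vocabulary, the number of topics, and the Dirichlet parameters are hypothetical values chosen for illustration.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw a symmetric Dirichlet(alpha) vector via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocabulary, alpha, length, rng):
    """Generate one document following the LDA generative process."""
    num_topics = len(topic_word)
    theta = sample_dirichlet(alpha, num_topics, rng)  # topic proportions of this document
    words = []
    for _ in range(length):
        z = rng.choices(range(num_topics), weights=theta)[0]   # pick a topic
        w = rng.choices(vocabulary, weights=topic_word[z])[0]  # pick a word from that topic
        words.append(w)
    return theta, words


rng = random.Random(0)
vocabulary = ["spark", "hadoop", "communication", "team", "aws", "cloud"]
# One word distribution per topic (three hypothetical topics).
topic_word = [sample_dirichlet(0.5, len(vocabulary), rng) for _ in range(3)]
theta, words = generate_document(topic_word, vocabulary, alpha=0.5, length=12, rng=rng)
```

Fitting an LDA model runs this process in reverse: given only the observed words, it infers the topic-word distributions and the per-document proportions.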
To fit and apply the LDA model to the big data job postings corpus, we used Gensim, a Python (3.12) library specialized for topic modeling [22,29]. The LDA model was fitted with K values ranging from 10 to 60 in order to determine the optimal number of topics [29,34]. For each K value within this range, the model’s coherence score was calculated. When we examined these scores, we found that the coherence score declined as the number of topics increased beyond the optimum. The maximum coherence score, which expresses the best semantic consistency of the discovered topics, was obtained for K = 32 topics [22,34]. For each of the 32 topics, the top 10 descriptive keywords with the highest frequency were determined [19,29,30]. Using the open card sorting method, each topic was then named on the basis of its descriptive keywords [35]. Additionally, the distribution rates of the discovered topics per document, the distribution of words per topic, and the proportions of the topics in the entire corpus were calculated [19,31]. In all subsequent analyses, these 32 topics discovered by LDA were used to represent big data skills.
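Selecting K by coherence amounts to scoring the top keywords of every candidate model and keeping the model with the highest score. The source does not state which coherence measure was used, so the sketch below assumes the UMass measure, averaged over word pairs, and implements it with the standard library instead of Gensim. The toy corpus and the candidate models are hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """Average UMass coherence of one topic, given its top words in rank order."""
    doc_sets = [set(d) for d in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score, pairs = 0.0, 0
    for i in range(1, len(top_words)):
        for j in range(i):
            d_j = doc_freq(top_words[j])
            if d_j == 0:
                continue  # skip words that never occur in the corpus
            score += math.log((doc_freq(top_words[i], top_words[j]) + 1) / d_j)
            pairs += 1
    return score / pairs if pairs else 0.0


def select_num_topics(candidate_models, documents):
    """Return the K whose topics have the highest mean coherence."""
    def mean_coherence(topics):
        return sum(umass_coherence(t, documents) for t in topics) / len(topics)

    return max(candidate_models, key=lambda k: mean_coherence(candidate_models[k]))


# Hypothetical tokenized corpus and candidate models (K -> top words per topic).
docs = [
    ["spark", "hadoop", "hive"],
    ["spark", "hadoop"],
    ["communication", "team"],
    ["communication", "team", "leadership"],
]
models = {
    1: [["spark", "team"]],
    2: [["spark", "hadoop"], ["communication", "team"]],
}
best_k = select_num_topics(models, docs)
```

In this toy corpus the two-topic model wins because its keywords co-occur in the same documents, whereas "spark" and "team" never appear together.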

4. Results

4.1. Identification of Expertise Roles

This section presents the results of an analysis of job postings involving big data. Expertise roles and job titles were extracted from the job postings’ title texts. Taking into account the distribution of job titles across the roles, seven specialization roles were determined. Table 1 lists these positions, their top five job titles, and their respective percentages.
According to Table 1, “developer” is the most prevalent expertise role, accounting for 34.69%, followed closely by “engineer” at 33.99%. The “architect” role comes third at 10.48%. The remaining roles are “analyst” (9.15%), “manager” (5.18%), “administrator” (3.39%), and “consultant” (3.21%). According to these results, the “developer” and “engineer” roles together account for nearly 70% of the big data positions, while the remaining five roles account for roughly 30%.

4.2. Identification of Skill Sets

This phase of our analysis reveals the fundamental competencies and information domains associated with big data. Thirty-two skills have been identified as a result of LDA-based topic modeling analysis. Table 2 provides the percentage distribution of the identified competencies along with the keywords that define each skill-based topic label.
The results depicted in Table 2 reveal a wide range of technical, analytic, and soft skills within the domain of big data. These 32 discovered skills also emphasize the interdisciplinary nature of big data competencies. As shown in the table, “big data processing” (6.82%) is the most popular skill related to big data, followed by “big data tools” (5.93%). “Communication skills” (5.02%) is positioned third. This demonstrates that soft skills are as important as technical skills for jobs involving big data. The lowest-ranked competencies in the distribution are “Azure Cloud” (1.77%), “team working” (1.76%), and “decision-making” (1.63%), indicating a relatively lower emphasis on these topics.

4.3. Taxonomy of the Skill Sets by Competency Areas

In this phase of the analysis, the knowledge domains and skill sets were categorized and presented in a more systematic way. The knowledge domains and skills denoted by the 32 topics were allocated into six competency areas, and a systematic competency map was developed for big data skills. The percentages of the skills according to the competency areas are presented in Figure 2. As shown in Figure 2, the first competency area, “big data knowledge” (36.30%), contains ten knowledge and skill items consisting of “big data processing”, “big data tools”, “big data architecture”, “information security”, “scalable systems”, “database”, “distributed systems”, “data warehousing”, “Hadoop ecosystem”, and “data streaming”. The second, “developer skills” (20.25%), has six items, namely “remote development”, “programming languages”, “Agile development”, “business applications”, “application development”, and “testing”. The third, “big data analytics” (14.88%), contains six items, specifically “data analytics”, “analytical skills”, “business analytics”, “machine learning”, “data visualization”, and “decision-making”. The fourth, “cloud services” (10.51%), has four items: “Amazon EMR”, “Google Cloud”, “AWS Data Services”, and “Azure Cloud”. The fifth, “soft skills” (10.16%), consists of three items: “communication skills”, “project management”, and “team working”. The sixth, “technical background” (7.90%), has three items comprising “technical knowledge”, “bachelor degree”, and “troubleshooting”.
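The competency map described above can be written down as a small data structure, which also makes it easy to check that the six areas cover all 32 skills and that their shares add up to 100%. The area names, shares, and skill lists below are taken from the text; the variable names and the dictionary layout are illustrative choices.

```python
# Competency area -> (share in %, skills), as listed in the text.
COMPETENCY_MAP = {
    "big data knowledge": (36.30, [
        "big data processing", "big data tools", "big data architecture",
        "information security", "scalable systems", "database",
        "distributed systems", "data warehousing", "Hadoop ecosystem",
        "data streaming",
    ]),
    "developer skills": (20.25, [
        "remote development", "programming languages", "Agile development",
        "business applications", "application development", "testing",
    ]),
    "big data analytics": (14.88, [
        "data analytics", "analytical skills", "business analytics",
        "machine learning", "data visualization", "decision-making",
    ]),
    "cloud services": (10.51, [
        "Amazon EMR", "Google Cloud", "AWS Data Services", "Azure Cloud",
    ]),
    "soft skills": (10.16, [
        "communication skills", "project management", "team working",
    ]),
    "technical background": (7.90, [
        "technical knowledge", "bachelor degree", "troubleshooting",
    ]),
}

# Consistency checks: number of skills and total share.
total_skills = sum(len(skills) for _, skills in COMPETENCY_MAP.values())
total_share = round(sum(share for share, _ in COMPETENCY_MAP.values()), 2)

# Reverse lookup: which competency area does a skill belong to?
AREA_OF_SKILL = {
    skill: area
    for area, (_, skills) in COMPETENCY_MAP.items()
    for skill in skills
}
```

With this layout, `AREA_OF_SKILL["machine learning"]` returns the area that a given skill was assigned to.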
This taxonomy systematically reveals the most in-demand competency areas within the big data ecosystem. The fact that “big data knowledge” ranks first, with a high proportion of 36.30%, highlights the central role of technically grounded expertise in the industry. The substantial shares of “developer skills” and “big data analytics” further emphasize the importance of practical software development and analytical capabilities. In contrast, the lower percentages associated with “soft skills” and “technical background” suggest that while technical competencies are prioritized, complementary skills are also necessary. In summary, this distribution reflects the industry’s strong demand for professionals who possess both domain-specific technical knowledge and applied problem-solving abilities.

4.4. Mapping Skill Sets to Expertise Roles

At this stage of analysis, skill sets and roles are matched. From this perspective, a systematic taxonomy of roles and skills is provided for the field of big data. The proportional distributions, correlations, and skills in the field of big data are presented in Table 3. The table’s skills are arranged according to the percentages in the final column. This table displays each role’s dominant and passive skills in green and red, respectively. The seven roles are ordered from largest to smallest in columns.
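One plausible way to build such a role-by-skill table is to average the per-document topic weights within each role and then rank the skills for every role. The source does not describe the exact computation, so the sketch below is an assumption; the role labels, topic weights, and skill names are hypothetical toy data.

```python
from collections import defaultdict


def role_skill_shares(doc_roles, doc_topic_weights, skill_names):
    """Average each skill's topic weight over the documents of every role."""
    sums = defaultdict(lambda: [0.0] * len(skill_names))
    counts = defaultdict(int)
    for role, weights in zip(doc_roles, doc_topic_weights):
        counts[role] += 1
        for i, w in enumerate(weights):
            sums[role][i] += w
    return {
        role: {skill: sums[role][i] / counts[role] for i, skill in enumerate(skill_names)}
        for role in sums
    }


def top_skills(shares, n):
    """Return the n skills with the highest share for one role."""
    return sorted(shares, key=shares.get, reverse=True)[:n]


# Hypothetical toy data: three postings, three skills.
skills = ["big data processing", "big data tools", "communication skills"]
roles = ["engineer", "engineer", "developer"]
weights = [
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
    [0.2, 0.5, 0.3],
]
table = role_skill_shares(roles, weights, skills)
engineer_top = top_skills(table["engineer"], 2)
developer_top = top_skills(table["developer"], 2)
```

Sorting each role's column in this way yields a per-role ranking of the kind summarized in a top-ten list.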
Although “big data processing”, in the first row of Table 3, has the highest overall rate, its weight varies by role: it is dominant in the “engineer” role and relatively passive in the “administrator” role. “Big data tools” and “communication skills” are also essential “developer” skills. These findings indicate that the “developer” role in the big data field requires both technical and soft skills.
To provide a clearer understanding of the results, we have summarized the top ten skills of each role in Table 4. As seen in the table, 24 of the 32 skills appear among the top ten skills of at least one role.
As shown in Table 4, “communication skills” is one of the top ten skills for all big data positions. Second, the term “big data tools” is present in six of the seven roles. Additional dominant skills include “analytical skills”, “big data architecture”, and “big data processing”. Similar to Table 3, Table 4 applies a coloring scheme (from green to red) from dominant to passive skills. For instance, the top three competencies for the “developer” position are “big data tools”, “programming languages”, and “remote development”. Their order numbers are displayed in green. In contrast, the last three highlighted skills are “Amazon EMR” (8), “scalable systems” (9), and “business applications” (10). The order numbers of these last three are shown in red. To identify the most dominant skills associated with other roles, refer to Table 4 by focusing on the corresponding column for each role. Each row in this table indicates the relative importance of each skill. For instance, in Row 2 of Table 4, “big data tools” ranks first for “developer”, second for “engineer”, and fourth for “architect”. However, this skill is not among the top ten for “analyst”.

5. Discussion

Using text mining and probabilistic topic modeling techniques, this study analyzed the semantic content of job postings for big data jobs. This analysis uncovered seven expertise roles, 32 skills, and six competency areas required for big data professions. A significant inference is that nearly 70% of big data positions have “developer” or “engineer” as their primary role. This finding was also emphasized in previous studies. De Mauro et al. (2018) identified four roles as “Business Analysts”, “Data Scientists”, “Developers”, and “System Managers” in their study [4]. Our findings overlap with three of these four roles. Our results for the “engineer”, “architect”, “manager”, and “consultant” roles, however, differ from theirs. We found that the “architect” position is approximately three times more in demand than the “administrator” position. However, De Mauro et al. (2018) added “Architect” to “System Managers” [4].
These results are significant indicators of the diversification and evolution of big data roles and skills over time. According to their study, the “Architect” role is part of the engineering family [4]. According to our findings, however, the share of the “architect” role is notably high (10.48%), and architects are expected to have skills that differ from those of “engineers” and “developers”. In addition, the emergence of “consultant” as a distinct expertise role, unlike in other studies, is another remarkable finding. Due to rapidly changing work dynamics, the role of “consultant” appears to be becoming increasingly important for big data professions. De Mauro et al. (2018) described program managers and data consultants as roles within the family of business analysts [4]. In the literature, “consultant” has not been highlighted as a primary role [4,10,14]. A further finding is that “project management” proficiency is the most sought-after skill for “manager” and “consultant” positions. In this sense, experts who assume the role of “consultant” are expected to have senior management skills.
Our second finding is that big data skills span six core proficiency areas: “big data knowledge” (36.30%), “developer skills” (20.25%), “big data analytics” (14.88%), “cloud services” (10.51%), “soft skills” (10.16%), and “technical background” (7.90%). This finding shows that the skills required for big data jobs cover a wide range of competency areas, an observation frequently made in previous studies [4,5,8,10]. Compared to earlier studies such as [4,5], our findings confirm a broader spectrum of competencies, especially highlighting the simultaneous importance of technical and interpersonal skills in the current big data labor market. The presence of soft skills, most notably “communication skills” and “team working”, among the competencies required across multiple roles reflects a shift towards more human-centric expectations in technically demanding roles, consistent with the evolving nature of data-driven teamwork and organizational adaptability [3,6,10].
Our findings clearly demonstrate that the vast majority of big data professions require employees at all levels to possess a collective set of technical, developer, analytical, and soft skills in order to drive decision-making processes that combine technical and analytical understandings with intuition and vision [10,16,17,36]. This alignment suggests that the big data profession is increasingly interdisciplinary, requiring a fusion of analytical thinking, programming expertise, and interpersonal competencies, rather than purely technical proficiency. Our results reinforce the argument that in environments characterized by complex data infrastructures and agile development teams, soft skills play a decisive role in enabling collaboration, problem-solving, and cross-functional communication. It has been revealed by our study that valuable insights and predictions for businesses are provided by combining machine learning and business analytics approaches with big data skills [15,37,38,39].
Our study also showed that big data is a multidisciplinary field comprised of various disciplines, as stated in previous studies [4,8,17]. The aforementioned studies argued that a variety of skills are required to analyze, interpret, and maintain big data in order to make effective business decisions, emphasizing that the discipline of big data is complex and multifaceted [4,8,17]. According to our topic modeling analysis, the top five skills emerged as “big data processing”, “big data tools”, “communication skills”, “remote development”, and “big data architecture”. The topic of “communication skills” is another significant finding of our analysis. There are many studies supporting this finding [4,6,8,17,40]. As shown in Table 4, “communication skills” is the only skill required for all roles and is among the top ten skills. With a rate of 5.02%, it also ranks third among big data skills. This result demonstrates the significance of communication-focused abilities. Despite the increasing demand for new technological skills, many studies have clearly emphasized the necessity of communication and social skills for big data professions [4,5,6,8,10,17,20,26,41].
“Team working” is another soft skill that should be emphasized. Employers expect a coordinated and collaborative working style from employees with various qualifications, because big data work requires different skill sets from experts in different roles (see Table 3). Another remarkable finding from our analysis concerns the topic “bachelor degree”. As seen in Table 4, “bachelor degree” appears only as the tenth-ranked skill of the “analyst” role and is not among the top ten skills of the other six roles. This suggests that professional experience and competencies carry more weight than undergraduate degrees in the requirements for big data professions.
Among big data knowledge and skills, the number and high rates of competencies based on cloud services such as “Amazon EMR”, “Google Cloud”, “AWS Data Services”, and “Azure Cloud” have revealed the close relationship between big data and cloud services (see Figure 2). Likewise, the fact that many tools and technologies used in big data processes are cloud-based confirms this finding. Our findings regarding cloud services show that, as stated in other studies [5,8,19], big data specialists prefer cloud-based services and platforms over traditional platforms. This is because cloud computing platforms and services provide big data specialists with more innovative paradigms and solutions for storing, analyzing, managing, and processing big data [42].
Although the big data knowledge and skill sets that emerged as a result of our analysis are similar to other studies, some specific skills have changed proportionally over time. The change and diversification in industrial skill demands have caused some competencies to fall into the background and some new skills to emerge [5,43,44,45]. This is due to the dynamic structure of the field, which is constantly updated and renews itself. Information technologies and big data-oriented applications and services are constantly evolving and changing [33,44]. In addition to identifying key competencies, this study emphasizes the practical and ethical implications of digital transformation. Educational programs should integrate both technical and interpersonal skills, while human resources departments must adapt to continuous upskilling demands. The growing need for inclusive, lifelong learning highlights the importance of providing training that is not only relevant but also socially equitable [26,46]. Based on our findings, we predict that in the near future, the roles of expertise and related competencies in the big data field will become more diverse, and different sub-disciplines will emerge as the umbrella of the big data field. The big data phenomenon is becoming more popular day by day, and therefore it is likely that big data undergraduate and graduate programs in various scopes will emerge from the big data discipline.

6. Conclusions

In this study, we aimed to bridge the skills gap between industry and academia by empirically defining the roles and skills required for big data expertise from the practitioner’s perspective, thereby enabling industries to effectively exploit the potential gains of big data. To this end, a machine learning-based semantic content analysis was conducted on 2280 online job postings for big data positions using text mining and topic modeling procedures. As a result of this analysis, seven expertise roles, six proficiency areas, and 32 competencies (knowledge, skills, and abilities) required for big data professions were revealed. Based on these findings, we developed a practitioner-oriented conceptual taxonomy of the knowledge, skills, and abilities required for big data professions. This taxonomy confirms the multifaceted nature of big data competencies as well as their complexity and diversity.
The big data competencies revealed in this research can serve as an informative guide for data science education and lifelong learning programs that aim to meet the industry’s need for qualified human resources. The findings are expected to benefit (1) companies, in human resource management and the recruitment of a qualified big data workforce; (2) big data experts, in evaluating and developing their competencies; (3) academic institutions and organizations, in developing up-to-date big data curricula that meet emerging industrial workforce demands; and (4) students and candidates, in guiding their careers in the field of big data.
As with many studies, this research has several limitations. First, because the big data discipline and the employment market are evolving rapidly, our analysis covers only a certain time period; similar analyses may produce different findings in the future as industries’ requirements for qualified human resources change. Second, the study relies on a single data source, Indeed.com, the world’s leading job search site, which may introduce sampling bias. Future research can benefit from incorporating multiple job platforms such as LinkedIn, Glassdoor, and Monster to ensure broader representativeness and cross-platform validation. Third, the methodology is based on the LDA topic modeling approach, which is extensively used in text mining. By using alternative approaches to text mining and topic modeling, future studies can extend our methodology and findings and provide descriptive implications that reveal the time-based evolution of big data proficiencies.

Author Contributions

Literature review and research background, B.G., F.G., G.G.M.D. and M.D.; conceptualization, B.G., F.G., G.G.M.D. and M.D.; methodology, B.G., F.G., G.G.M.D. and M.D.; software, B.G., F.G., G.G.M.D. and M.D.; validation, B.G., F.G., G.G.M.D. and M.D.; investigation, B.G., F.G., G.G.M.D. and M.D.; resources, B.G., F.G., G.G.M.D. and M.D.; data curation, B.G., F.G., G.G.M.D. and M.D.; writing—original draft preparation, B.G., F.G., G.G.M.D. and M.D.; writing—review and editing, B.G., F.G., G.G.M.D. and M.D.; visualization, B.G., F.G., G.G.M.D. and M.D.; supervision, B.G., F.G., G.G.M.D. and M.D.; project administration, B.G., F.G., G.G.M.D. and M.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, H.; Chiang, R.H.L.; Storey, V.C. Business Intelligence and Analytics: From Big Data to Big Impact. MIS Q. 2012, 36, 1165–1188.
  2. Philip Chen, C.L.; Zhang, C.-Y. Data-Intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data. Inf. Sci. 2014, 275, 314–347.
  3. Debortoli, S.; Müller, O.; Vom Brocke, J. Comparing Business Intelligence and Big Data Skills: A Text Mining Study Using Job Advertisements. Bus. Inf. Syst. Eng. 2014, 6, 289–300.
  4. De Mauro, A.; Greco, M.; Grimaldi, M.; Ritala, P. Human Resources for Big Data Professions: A Systematic Classification of Job Roles and Required Skill Sets. Inf. Process. Manag. 2018, 54, 807–817.
  5. Gurcan, F.; Cagiltay, N.E. Big Data Software Engineering: Analysis of Knowledge Domains and Skill Sets Using LDA-Based Topic Modeling. IEEE Access 2019, 7, 82541–82552.
  6. Halwani, M.A.; Amirkiaee, S.Y.; Evangelopoulos, N.; Prybutok, V. Job Qualifications Study for Data Science and Big Data Professions. Inf. Technol. People 2022, 35, 510–525.
  7. Kantardzic, M. Data Mining: Concepts, Models, Methods, and Algorithms, 2nd ed.; Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2011; ISBN 9780470890455.
  8. Gardiner, A.; Aasheim, C.; Rutner, P.; Williams, S. Skill Requirements in Big Data: A Content Analysis of Job Advertisements. J. Comput. Inf. Syst. 2018, 58, 374–384.
  9. Deb, D.; Fuad, M. Integrating Big Data and Cloud Computing Topics into the Computing Curricula: A Modular Approach. J. Parallel Distrib. Comput. 2021, 157, 303–315.
  10. Persaud, A. Key Competencies for Big Data Analytics Professions: A Multimethod Study. Inf. Technol. People 2021, 34, 178–203.
  11. Verma, A.; Yurov, K.M.; Lane, P.L.; Yurova, Y.V. An Investigation of Skill Requirements for Business and Data Analytics Positions: A Content Analysis of Job Advertisements. J. Educ. Bus. 2019, 94, 243–250.
  12. Miller, S. Collaborative Approaches Needed to Close the Big Data Skills Gap. J. Organ. Des. 2014, 3, 26.
  13. Gandomi, A.; Haider, M. Beyond the Hype: Big Data Concepts, Methods, and Analytics. Int. J. Inf. Manag. 2015, 35, 137–144.
  14. Debao, D.; Yinxia, M.; Min, Z. Analysis of Big Data Job Requirements Based on K-Means Text Clustering in China. PLoS ONE 2021, 16, e0255419.
  15. Najafabadi, M.M.; Villanustre, F.; Khoshgoftaar, T.M.; Seliya, N.; Wald, R.; Muharemagic, E. Deep Learning Applications and Challenges in Big Data Analytics. J. Big Data 2015, 2, 1.
  16. Radovilsky, Z.; Hegde, V.; Acharya, A.; Uma, U. Skills Requirements of Business Data Analytics and Data Science Jobs: A Comparative Analysis. J. Supply Chain Oper. Manag. 2018, 16, 82–101.
  17. Gurcan, F. Extraction of Core Competencies for Big Data: Implications for Competency-Based Engineering Education. Int. J. Eng. Educ. 2019, 35, 1110–1115.
  18. Indeed. Indeed Job Search. Available online: https://www.indeed.com/ (accessed on 16 January 2024).
  19. Ozyurt, O.; Gurcan, F.; Dalveren, G.G.M.; Derawi, M. Career in Cloud Computing: Exploratory Analysis of In-Demand Competency Areas and Skill Sets. Appl. Sci. 2022, 12, 9787.
  20. Montandon, J.E.; Politowski, C.; Silva, L.L.; Valente, M.T.; Petrillo, F.; Guéhéneuc, Y.G. What Skills Do IT Companies Look for in New Developers? A Study with Stack Overflow Jobs. Inf. Softw. Technol. 2021, 129, 106429.
  21. Gurcan, F. Major Research Topics in Big Data: A Literature Analysis from 2013 to 2017 Using Probabilistic Topic Models. In Proceedings of the 2018 International Conference on Artificial Intelligence and Data Processing, IDAP 2018, Malatya, Turkey, 28–30 September 2018; pp. 1–4.
  22. Řehůřek, R.; Sojka, P. Gensim—Statistical Semantics in Python; Masaryk University: Brno, Czech Republic, 2011; Volume 6611.
  23. Ningrum, P.K.; Pansombut, T.; Ueranantasun, A. Text Mining of Online Job Advertisements to Identify Direct Discrimination during Job Hunting Process: A Case Study in Indonesia. PLoS ONE 2020, 15, e0233746.
  24. Uysal, A.K.; Gunal, S. The Impact of Preprocessing on Text Classification. Inf. Process. Manag. 2014, 50, 104–112.
  25. Murakami, R.; Chakraborty, B. Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts. Sensors 2022, 22, 852.
  26. Calanca, F.; Sayfullina, L.; Minkus, L.; Wagner, C.; Malmi, E. Responsible Team Players Wanted: An Analysis of Soft Skill Requirements in Job Advertisements. EPJ Data Sci. 2019, 8, 13.
  27. Blei, D.M. Probabilistic Topic Models. Commun. ACM 2012, 55, 77–84.
  28. Subakti, A.; Murfi, H.; Hariadi, N. The Performance of BERT as Data Representation of Text Clustering. J. Big Data 2022, 9, 15.
  29. Egger, R.; Yu, J. A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts. Front. Sociol. 2022, 7, 886498.
  30. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
  31. Blei, D.M.; Lafferty, J.D. Correction: A Correlated Topic Model of Science. Ann. Appl. Stat. 2007, 1, 634.
  32. Gurcan, F.; Erdogdu, F.; Cagiltay, N.E.; Cagiltay, K. Student Engagement Research Trends of Past 10 Years: A Machine Learning-Based Analysis of 42,000 Research Articles. Educ. Inf. Technol. 2023, 28, 15067–15091.
  33. Alibasic, A.; Upadhyay, H.; Simsekler, M.C.E.; Kurfess, T.; Woon, W.L.; Omar, M.A. Evaluation of the Trends in Jobs and Skill-Sets Using Data Analytics: A Case Study. J. Big Data 2022, 9, 32.
  34. Mimno, D.; Wallach, H.M.; Talley, E.; Leenders, M.; McCallum, A. Optimizing Semantic Coherence in Topic Models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), Edinburgh, UK, 27–29 July 2011.
  35. Katsanos, C.; Avouris, N.; Stamelos, I.; Tselios, N.; Demetriadis, S.; Angelis, L. Cross-Study Reliability of the Open Card Sorting Method. In Proceedings of the Conference on Human Factors in Computing Systems, Glasgow, Scotland, 4–9 May 2019.
  36. Han, F.; Ren, J. Analyzing Big Data Professionals: Cultivating Holistic Skills Through University Education and Market Demands. IEEE Access 2024, 12, 23568–23577.
  37. Kılınç, M.; Aydın, C.; Tarhan, Ç. Do Machine Learning and Business Analytics Approaches Answer the Question of ‘Will Your Kickstarter Project Be Successful?’ Istanbul Bus. Res. 2021, 50, 255–274.
  38. Gurcan, F. What Are Developers Talking about Information Security? A Large-Scale Study Using Semantic Analysis of Q&A Posts. PeerJ Comput. Sci. 2024, 10, e1954.
  39. Bonesso, S.; Bruni, E.; Gerli, F. How Big Data Creates New Job Opportunities: Skill Profiles of Emerging Professional Roles. In Behavioral Competencies of Digital Professionals; Springer: Berlin/Heidelberg, Germany, 2020; pp. 21–39.
  40. Wowczko, I.A. Skills and Vacancy Analysis with Data Mining Techniques. Informatics 2015, 2, 31–49.
  41. Karakolis, E.; Kapsalis, P.; Skalidakis, S.; Kontzinos, C.; Kokkinakos, P.; Markaki, O.; Askounis, D. Bridging the Gap between Technological Education and Job Market Requirements through Data Analytics and Decision Support Services. Appl. Sci. 2022, 12, 7139.
  42. Yang, C.; Huang, Q.; Li, Z.; Liu, K.; Hu, F. Big Data and Cloud Computing: Innovation Opportunities and Challenges. Int. J. Digit. Earth 2017, 10, 13–53.
  43. Boselli, R.; Cesarini, M.; Mercorio, F.; Mezzanzanica, M. Classifying Online Job Advertisements through Machine Learning. Futur. Gener. Comput. Syst. 2018, 2, 31–49.
  44. Gurcan, F. What Issues Are Data Scientists Talking about? Identification of Current Data Science Issues Using Semantic Content Analysis of Q&A Communities. PeerJ Comput. Sci. 2023, 9, e1361.
  45. Aljohani, N.R.; Aslam, M.A.; Khadidos, A.O.; Hassan, S.U. A Methodological Framework to Predict Future Market Needs for Sustainable Skills Management Using AI and Big Data Technologies. Appl. Sci. 2022, 12, 6898.
  46. Moreno, A.M.; Sanchez-Segura, M.I.; Medina-Dominguez, F.; Carvajal, L. Balancing Software Engineering Education and Industrial Needs. J. Syst. Softw. 2012, 85, 1607–1620.
Figure 1. An overview of the research methodology.
Figure 2. Taxonomy of big data skills by competency areas.
Table 1. Expertise roles and related job titles for big data professions.

| Role | Related Titles | Rate (%) |
|---|---|---|
| Developer | Big Data Developer; Big Data Software Engineer; Big Data Spark Developer; Big Data Hadoop Developer; Java Big Data Developer | 34.69 |
| Engineer | Big Data Engineer; Senior Big Data Engineer; Lead Big Data Engineer; Principal Big Data Engineer; Big Data Platform Engineer | 33.99 |
| Architect | Big Data Architect; Big Data Solution Architect; Google Cloud Big Data Architect; Senior Big Data Architect | 10.48 |
| Analyst | Big Data Analyst; Big Data Analytics; Business Analyst; BI analyst; AWS Big Data Analytics | 9.05 |
| Manager | Big Data Lead; Big Data Technical Lead; Big Data Program Manager; Big Data Solution Manager; Big Data Product Manager | 5.18 |
| Administrator | Big Data DBA; Big Data Admin; Big Data Administrator; Big Data Hadoop Administrator; Big Data Platform Administrator | 3.39 |
| Consultant | Big Data Consultant; Big Data Hadoop Consultant; Big Data Solution Consultant; Big Data Technology Consultant; AWS Big Data Consultant | 3.21 |
Table 2. The 32 topics (skills) discovered by LDA.

| Topic Name | Keywords | Rate (%) |
|---|---|---|
| Big Data Processing | data pipeline process build engineer platform analytic etl lake develop | 6.82 |
| Big Data Tools | spark hadoop hive hbase java knowledge scala developer python sql | 5.93 |
| Communication Skills | skill strong ability communication write environment excellent problem good service | 5.02 |
| Remote Development | remote developer warehouse software reliably telecommuting connect good location part-time | 4.69 |
| Big Data Architecture | solution design architecture architect technical pipeline system structure enterprise implement | 4.64 |
| Programming Languages | software solution development programming tool practice java python language scala | 4.38 |
| Agile Development | software development design agile scrum product modeling team cross-functional customer | 4.03 |
| Information Security | information safety security privacy data financial banking risk prevent threat | 3.75 |
| Project Management | project management technical plan identify lead manage process manager program | 3.38 |
| Scalable Systems | scalability platform engineering engineer deliver structure scale service analytic healthcare | 3.28 |
| Amazon EMR | amazon system design software service emr aws engineer development distribute | 3.23 |
| Data Analytics | data analysis model business develop tool sql quality analytic knowledge | 3.16 |
| Business Applications | application development business support integration deploy delivery software agility design | 3.09 |
| Analytical Skills | analysis analytics critical prediction capability inferential analytical problem-solving report | 3.05 |
| Business Analytics | business analytics drive product customer leadership partner organization role strategy | 3.01 |
| Google Cloud | cloud platform build google service gcp infrastructure kubernete docker engine | 2.99 |
| Technical Knowledge | technical knowledge expert strong skill background look grow diversity professional | 2.98 |
| Database | database nosql sql stream management system distribute relational kafka mongodb | 2.93 |
| Bachelor Degree | science computer degree engineering relate field security system minimum qualification | 2.73 |
| AWS Data Services | aws cloud redshift sql emr python tool lambda glue engineer | 2.53 |
| Distributed Systems | distribute apache system kafka process parallel hdfs storm computing oracle | 2.39 |
| Data Warehousing | warehouse data warehousing process storage repository store tool model server | 2.33 |
| Hadoop Ecosystem | system hadoop performance issue cluster support database infrastructure security admin | 2.28 |
| Machine Learning | learn machine data science ml build model algorithm scientist learning | 2.25 |
| Troubleshooting | customer support service troubleshooting aws help technical engineer application amazon | 2.19 |
| Application Development | application develop process system software programming platform qualify language code | 2.11 |
| Data Streaming | data processing spark hadoop kafka real-time frameworks streaming apache storm | 1.94 |
| Testing | test testing code quality design unit automation etl qa agile | 1.94 |
| Data Visualization | data visual report visualization graph view time tableau chart infogram design | 1.78 |
| Azure Cloud | client azure solution consult delivery consultant databrick professional microsoft sql | 1.77 |
| Team Working | teamwork lead member join independently collaborate member contact solidarity | 1.76 |
| Decision-making | data business work decision-making core key judgment successful system | 1.63 |
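Table 3. Distribution of skills by expertise roles.

| Topics (Skills) | Developer | Engineer | Architect | Analyst | Manager | Administrator | Consultant | Rate |
|---|---|---|---|---|---|---|---|---|
| Big Data Processing | 1.90 | 3.26 | 0.90 | 0.29 | 0.26 | 0.10 | 0.11 | 6.82 |
| Big Data Tools | 2.93 | 1.88 | 0.50 | 0.09 | 0.22 | 0.16 | 0.15 | 5.93 |
| Communication Skills | 1.93 | 1.69 | 0.46 | 0.29 | 0.30 | 0.12 | 0.22 | 5.02 |
| Remote Development | 2.17 | 1.55 | 0.45 | 0.12 | 0.17 | 0.11 | 0.11 | 4.69 |
| Big Data Architecture | 1.01 | 1.46 | 1.42 | 0.17 | 0.23 | 0.17 | 0.18 | 4.64 |
| Programming Languages | 2.22 | 1.43 | 0.30 | 0.16 | 0.12 | 0.08 | 0.07 | 4.38 |
| Agile Development | 1.88 | 1.43 | 0.28 | 0.19 | 0.14 | 0.07 | 0.05 | 4.03 |
| Information Security | 2.01 | 0.88 | 0.28 | 0.16 | 0.11 | 0.12 | 0.19 | 3.75 |
| Project Management | 0.92 | 1.00 | 0.35 | 0.26 | 0.55 | 0.06 | 0.23 | 3.38 |
| Scalable Systems | 1.20 | 1.37 | 0.20 | 0.23 | 0.17 | 0.05 | 0.07 | 3.28 |
| Amazon EMR | 1.86 | 0.47 | 0.41 | 0.19 | 0.06 | 0.01 | 0.22 | 3.23 |
| Data Analytics | 0.74 | 1.56 | 0.27 | 0.39 | 0.10 | 0.03 | 0.07 | 3.16 |
| Business Applications | 1.17 | 1.00 | 0.29 | 0.17 | 0.24 | 0.07 | 0.15 | 3.09 |
| Analytical Skills | 0.61 | 0.27 | 0.35 | 1.15 | 0.26 | 0.23 | 0.17 | 3.05 |
| Business Analytics | 0.65 | 0.36 | 0.47 | 1.04 | 0.30 | 0.04 | 0.16 | 3.01 |
| Google Cloud | 0.81 | 1.28 | 0.53 | 0.10 | 0.08 | 0.05 | 0.14 | 2.99 |
| Technical Knowledge | 1.08 | 1.15 | 0.26 | 0.20 | 0.12 | 0.07 | 0.10 | 2.98 |
| Database | 1.17 | 1.22 | 0.27 | 0.08 | 0.09 | 0.05 | 0.05 | 2.93 |
| Bachelor Degree | 0.81 | 1.19 | 0.19 | 0.27 | 0.13 | 0.08 | 0.07 | 2.73 |
| AWS Data Services | 0.81 | 1.20 | 0.28 | 0.07 | 0.09 | 0.04 | 0.04 | 2.53 |
| Distributed Systems | 0.70 | 0.97 | 0.16 | 0.17 | 0.14 | 0.12 | 0.13 | 2.39 |
| Data Warehousing | 0.43 | 1.03 | 0.26 | 0.05 | 0.06 | 0.47 | 0.02 | 2.33 |
| Hadoop Ecosystem | 0.51 | 0.83 | 0.25 | 0.15 | 0.05 | 0.46 | 0.04 | 2.28 |
| Machine Learning | 0.71 | 0.50 | 0.08 | 0.76 | 0.05 | 0.02 | 0.13 | 2.25 |
| Troubleshooting | 0.34 | 0.96 | 0.14 | 0.30 | 0.07 | 0.34 | 0.05 | 2.19 |
| Application Development | 1.11 | 0.64 | 0.12 | 0.09 | 0.08 | 0.04 | 0.04 | 2.11 |
| Data Streaming | 0.34 | 0.96 | 0.34 | 0.10 | 0.07 | 0.08 | 0.05 | 1.94 |
| Testing | 0.94 | 0.69 | 0.07 | 0.08 | 0.09 | 0.03 | 0.04 | 1.94 |
| Data Visualization | 0.52 | 0.12 | 0.06 | 0.87 | 0.17 | 0.02 | 0.02 | 1.78 |
| Azure Cloud | 0.40 | 0.65 | 0.33 | 0.10 | 0.15 | 0.03 | 0.11 | 1.77 |
| Team Working | 0.56 | 0.85 | 0.09 | 0.11 | 0.07 | 0.05 | 0.03 | 1.76 |
| Decision-making | 0.24 | 0.12 | 0.14 | 0.65 | 0.42 | 0.03 | 0.02 | 1.63 |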
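As a reading aid for Table 3, the following minimal sketch checks that the seven per-role shares in a row add up to the row's overall rate, and reports which role contributes most to a skill. The four rows are copied from Table 3; the helper names (`row_is_consistent`, `dominant_role`) and the 0.02 rounding tolerance are illustrative assumptions, not part of the study's pipeline.

```python
ROLES = ["Developer", "Engineer", "Architect", "Analyst",
         "Manager", "Administrator", "Consultant"]

# A few rows copied from Table 3: seven per-role shares, then the overall rate (%).
TABLE3_SAMPLE = {
    "Big Data Processing": ([1.90, 3.26, 0.90, 0.29, 0.26, 0.10, 0.11], 6.82),
    "Big Data Tools":      ([2.93, 1.88, 0.50, 0.09, 0.22, 0.16, 0.15], 5.93),
    "Analytical Skills":   ([0.61, 0.27, 0.35, 1.15, 0.26, 0.23, 0.17], 3.05),
    "Project Management":  ([0.92, 1.00, 0.35, 0.26, 0.55, 0.06, 0.23], 3.38),
}


def row_is_consistent(shares, rate, tol=0.02):
    """True if the per-role shares sum to the overall rate within rounding error.

    The tolerance is an assumption chosen to absorb two-decimal rounding.
    """
    return abs(sum(shares) - rate) <= tol


def dominant_role(shares):
    """Name of the role with the largest share for one skill."""
    return ROLES[shares.index(max(shares))]


if __name__ == "__main__":
    for skill, (shares, rate) in TABLE3_SAMPLE.items():
        print(skill, row_is_consistent(shares, rate), dominant_role(shares))
```

Read this way, a skill's overall rate is split across roles, so a large share for Developers or Engineers partly reflects how many postings those roles account for rather than the skill's importance within the role alone.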
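Table 4. Ranking of the top 10 skills for each of the expertise roles.

| Topics (Skills) | Developer | Engineer | Architect | Analyst | Manager | Administrator | Consultant | Total |
|---|---|---|---|---|---|---|---|---|
| Communication Skills | 5 | 3 | 6 | 9 | 3 | 7 | 3 | 7 |
| Big Data Tools | 1 | 2 | 4 | | 9 | 6 | 9 | 6 |
| Analytical Skills | | | 10 | 1 | 5 | 4 | 6 | 5 |
| Big Data Architecture | | 6 | 1 | | 8 | 5 | 5 | 5 |
| Big Data Processing | 6 | 1 | 2 | 8 | 6 | | | 5 |
| Remote Development | 3 | 5 | 7 | | | 10 | | 4 |
| Business Analytics | | | 5 | 2 | 4 | | 7 | 4 |
| Business Applications | 10 | | | | 7 | | 8 | 3 |
| Google Cloud | | 10 | 3 | | | | 10 | 3 |
| Amazon EMR | 8 | | 8 | | | | 2 | 3 |
| Information Security | 4 | | | | | 8 | 4 | 3 |
| Project Management | | | 9 | | 1 | | 1 | 3 |
| Scalable Systems | 9 | 9 | | | | | | 2 |
| Agile Development | 7 | 8 | | | | | | 2 |
| Data Visualization | | | | 3 | 10 | | | 2 |
| Troubleshooting | | | | 7 | | 3 | | 2 |
| Data Analytics | | 4 | | 6 | | | | 2 |
| Programming Languages | 2 | 7 | | | | | | 2 |
| Decision-making | | | | 5 | 2 | | | 2 |
| Bachelor Degree | | | | 10 | | | | 1 |
| Distributed Systems | | | | | | 9 | | 1 |
| Machine Learning | | | | 4 | | | | 1 |
| Hadoop Ecosystem | | | | | | 2 | | 1 |
| Data Warehousing | | | | | | 1 | | 1 |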
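The per-role rankings in Table 4 can be approximated from the shares in Table 3 by sorting each role's column in descending order and keeping the first ten skills. The sketch below is one way to do this in Python and is not the authors' code: the values are copied from Table 3, while the function name `top_skills` and the tie-breaking rule (equal rounded shares keep Table 3 order) are assumptions. Because Table 3 is rounded to two decimals, tied skills may appear in a different order than in the published ranking.

```python
ROLES = ["Developer", "Engineer", "Architect", "Analyst",
         "Manager", "Administrator", "Consultant"]

# Per-role shares (%) copied from Table 3, in the table's row order.
SHARES = {
    "Big Data Processing":     [1.90, 3.26, 0.90, 0.29, 0.26, 0.10, 0.11],
    "Big Data Tools":          [2.93, 1.88, 0.50, 0.09, 0.22, 0.16, 0.15],
    "Communication Skills":    [1.93, 1.69, 0.46, 0.29, 0.30, 0.12, 0.22],
    "Remote Development":      [2.17, 1.55, 0.45, 0.12, 0.17, 0.11, 0.11],
    "Big Data Architecture":   [1.01, 1.46, 1.42, 0.17, 0.23, 0.17, 0.18],
    "Programming Languages":   [2.22, 1.43, 0.30, 0.16, 0.12, 0.08, 0.07],
    "Agile Development":       [1.88, 1.43, 0.28, 0.19, 0.14, 0.07, 0.05],
    "Information Security":    [2.01, 0.88, 0.28, 0.16, 0.11, 0.12, 0.19],
    "Project Management":      [0.92, 1.00, 0.35, 0.26, 0.55, 0.06, 0.23],
    "Scalable Systems":        [1.20, 1.37, 0.20, 0.23, 0.17, 0.05, 0.07],
    "Amazon EMR":              [1.86, 0.47, 0.41, 0.19, 0.06, 0.01, 0.22],
    "Data Analytics":          [0.74, 1.56, 0.27, 0.39, 0.10, 0.03, 0.07],
    "Business Applications":   [1.17, 1.00, 0.29, 0.17, 0.24, 0.07, 0.15],
    "Analytical Skills":       [0.61, 0.27, 0.35, 1.15, 0.26, 0.23, 0.17],
    "Business Analytics":      [0.65, 0.36, 0.47, 1.04, 0.30, 0.04, 0.16],
    "Google Cloud":            [0.81, 1.28, 0.53, 0.10, 0.08, 0.05, 0.14],
    "Technical Knowledge":     [1.08, 1.15, 0.26, 0.20, 0.12, 0.07, 0.10],
    "Database":                [1.17, 1.22, 0.27, 0.08, 0.09, 0.05, 0.05],
    "Bachelor Degree":         [0.81, 1.19, 0.19, 0.27, 0.13, 0.08, 0.07],
    "AWS Data Services":       [0.81, 1.20, 0.28, 0.07, 0.09, 0.04, 0.04],
    "Distributed Systems":     [0.70, 0.97, 0.16, 0.17, 0.14, 0.12, 0.13],
    "Data Warehousing":        [0.43, 1.03, 0.26, 0.05, 0.06, 0.47, 0.02],
    "Hadoop Ecosystem":        [0.51, 0.83, 0.25, 0.15, 0.05, 0.46, 0.04],
    "Machine Learning":        [0.71, 0.50, 0.08, 0.76, 0.05, 0.02, 0.13],
    "Troubleshooting":         [0.34, 0.96, 0.14, 0.30, 0.07, 0.34, 0.05],
    "Application Development": [1.11, 0.64, 0.12, 0.09, 0.08, 0.04, 0.04],
    "Data Streaming":          [0.34, 0.96, 0.34, 0.10, 0.07, 0.08, 0.05],
    "Testing":                 [0.94, 0.69, 0.07, 0.08, 0.09, 0.03, 0.04],
    "Data Visualization":      [0.52, 0.12, 0.06, 0.87, 0.17, 0.02, 0.02],
    "Azure Cloud":             [0.40, 0.65, 0.33, 0.10, 0.15, 0.03, 0.11],
    "Team Working":            [0.56, 0.85, 0.09, 0.11, 0.07, 0.05, 0.03],
    "Decision-making":         [0.24, 0.12, 0.14, 0.65, 0.42, 0.03, 0.02],
}


def top_skills(role, n=10):
    """Return the n skills with the largest share for one role, best first."""
    col = ROLES.index(role)
    # sorted() is stable, so skills with equal rounded shares keep Table 3 order.
    ranked = sorted(SHARES, key=lambda skill: SHARES[skill][col], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    for role in ROLES:
        print(role, top_skills(role, 3))
```

For the Developer role this puts Big Data Tools, Programming Languages, and Remote Development first, in line with Table 4. The Total column of Table 4 then corresponds to the number of roles in whose top 10 a skill appears.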
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gurcan, F.; Gudek, B.; Menekse Dalveren, G.G.; Derawi, M. Future-Ready Skills Across Big Data Ecosystems: Insights from Machine Learning-Driven Human Resource Analytics. Appl. Sci. 2025, 15, 5841. https://doi.org/10.3390/app15115841
