Data | Special Issue: Data Mining and Computational Intelligence for E-Learning and Education

Research

31 pages, 1887 KB

Open Access Article

ZaQQ: A New Arabic Dataset for Automatic Essay Scoring via a Novel Human–AI Collaborative Framework

by Yomna Elsayed, Emad Nabil, Marwan Torki, Safiullah Faizullah and Ayman Khalafallah

Data 2025, 10(9), 148; https://doi.org/10.3390/data10090148 - 19 Sep 2025

Cited by 5 | Viewed by 3091

Abstract

Automated essay scoring (AES) has become an essential tool in educational assessment. However, applying AES to the Arabic language presents notable challenges, primarily due to the lack of labeled datasets. This data scarcity hampers the development of reliable machine learning models and slows progress in Arabic natural language processing for educational use. While manual annotation by human experts remains the most accurate method for essay evaluation, it is often too costly and time-consuming to create large-scale datasets, especially for low-resource languages like Arabic. In this work, we introduce a human–AI collaborative framework designed to overcome the shortage of scored Arabic essays. Leveraging QAES, a high-quality annotated dataset, our approach uses Large Language Models (LLMs) to generate multidimensional essay evaluations across seven key writing traits: Relevance, Organization, Vocabulary, Style, Development, Mechanics, and Structure. To ensure accuracy and consistency, we design prompting strategies and validation procedures tailored to each trait. This system is then applied to two unannotated Arabic essay datasets: ZAEBUC and QALB. As a result, we introduce ZaQQ, a newly annotated dataset that merges ZAEBUC, QAES, and QALB. Our findings demonstrate that human–AI collaboration can significantly enhance the availability of labeled resources without compromising assessment quality. The proposed framework serves as a scalable and replicable model for addressing data annotation challenges in low-resource languages and supports the broader goal of expanding access to automated educational assessment tools where expert evaluation is limited. Full article

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

18 pages, 1810 KB

Open Access Article

Analysis of Student Dropout Risk in Higher Education Using Proportional Hazards Model and Based on Entry Characteristics

by Liga Paura, Irina Arhipova, Gatis Vitols and Sandra Sproge

Data 2025, 10(7), 110; https://doi.org/10.3390/data10070110 - 8 Jul 2025

Cited by 8 | Viewed by 7309

Abstract

The aim of this study is to identify the key factors contributing to student dropout and to develop a predictive model that estimates the dropout risk of students based on their entry characteristics and enrolment registration data. Our analysis is based on the registration and academic data of 971 full-time and part-time bachelor’s students in five faculties, who were enrolled in the academic year 2021–2022 at the Latvia University of Life Sciences and Technologies (LBTU). Dropout was analysed over the first 3.5 years of study, by which point students in engineering and information technology, agriculture and food technology, economics and social sciences, and forest and environmental studies had started their last semester, and veterinary medicine students had completed more than half of their program of study. Survival analysis methods were used. Students’ dropout risk in relation to gender, faculty, priority to study in the program, and secondary school performance (SM) was estimated using the proportional hazards model (Cox model). The highest student dropout was observed during the first year of study. Secondary school performance was a significant predictor of students’ dropout risk; students with higher SM had a lower dropout risk (HR = 0.66, p < 0.05). Dropout also varied by faculty or study programme: students in economics and social sciences were at lower dropout risk than students from the other faculties. The model’s concordance index was 0.59, which indicates that additional or stronger predictors may be needed to improve model performance. Full article
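
To make the reported concordance index concrete: a value of 0.59 means that, among comparable pairs of students, the model assigns the higher risk to the student who dropped out earlier in roughly 59% of cases, with 0.5 corresponding to chance. The sketch below computes Harrell's C on toy data. The function name, the toy values, and the simplification of skipping pairs with tied times are assumptions for illustration, not the authors' code or data.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C for right-censored data (simplified: tied times are skipped).

    times  : observed follow-up time per subject
    events : 1 if dropout was observed, 0 if the record is censored
    risks  : model risk score (higher means earlier dropout is expected)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is comparable only if the times differ and the
        # shorter time is an observed event rather than a censoring.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four students, the third one censored.
toy_c = concordance_index(
    times=[1, 2, 3, 4],
    events=[1, 1, 0, 1],
    risks=[0.9, 0.4, 0.5, 0.1],
)
print(toy_c)  # 0.8 (4 concordant pairs out of 5 comparable pairs)
```

In the toy data, the censored student cannot serve as the earlier member of a pair, which is why only five of the six pairs are counted.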

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

24 pages, 1586 KB

Open Access Article

Effective Education System for Athletes Utilising Big Data and AI Technology

by Martin Mičiak, Dominika Toman, Roman Adámik, Ema Kufová, Branislav Škulec, Nikola Mozolová and Aneta Hoferová

Data 2025, 10(7), 102; https://doi.org/10.3390/data10070102 - 24 Jun 2025

Cited by 3 | Viewed by 3379

Abstract

Education leads to building successful careers. However, different groups of students have different studying preferences. Our target group is athletes who combine education with sports training. The main objective is to provide recommendations for an effective education system for athletes, improving their chances of finding new careers after leaving sports. Such a system must draw on Big Data and on currently available AI tools that support athletes’ career planning and development in a meaningful way. The main objective is broken down into the following partial objectives: identifying what types of Big Data to analyse in connection with the athletes’ education; revealing what AI tools to include in the athletes’ education for their better preparation for a career after sports; determining what knowledge of AI and Big Data athletes need to stay relevant once they enter the labour market. Our study combines secondary and primary data sources. The secondary data (used in the orientation analysis) include case studies on AI and Big Data connected to education. The primary data were collected via a survey of over 200 Slovak junior athletes. The results indicate directions for sports policymakers and managers of sports organisations who want to improve their athletes’ career prospects. Full article

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

10 pages, 267 KB

Open Access Article

Dataset on Programming Competencies Development Using Scratch and a Recommender System in a Non-WEIRD Primary School Context

by Jesennia Cárdenas-Cobo, Cristian Vidal-Silva and Nicolás Máquez

Data 2025, 10(6), 86; https://doi.org/10.3390/data10060086 - 3 Jun 2025

Cited by 1 | Viewed by 2110

Abstract

The ability to program has become an essential competence for individuals in an increasingly digital world. However, access to programming education remains unequal, particularly in non-WEIRD (Western, Educated, Industrialized, Rich, and Democratic) contexts. This study presents a dataset resulting from an educational intervention designed to foster programming competencies and computational thinking skills among primary school students aged 8 to 12 years in Milagro, Ecuador. The intervention integrated Scratch, a block-based programming environment that simplifies coding by eliminating syntactic barriers, and the CARAMBA recommendation system, which provided personalized learning paths based on students’ progression and preferences. A structured educational process was implemented, including an initial diagnostic test to assess logical reasoning, guided activities in Scratch to build foundational skills, a phase of personalized practice with CARAMBA, and a final computational thinking evaluation using a validated assessment instrument. The resulting dataset encompasses diverse information: demographic data, logical reasoning test scores, computational thinking test results pre- and post-intervention, activity logs from Scratch, recommendation histories from CARAMBA, and qualitative feedback from university student tutors who supported the intervention. The dataset is anonymized, ethically collected, and made available under a CC-BY 4.0 license to encourage reuse. This resource is particularly valuable for researchers and practitioners interested in computational thinking development, educational data mining, personalized learning systems, and digital equity initiatives. It supports comparative studies between WEIRD and non-WEIRD populations, validation of adaptive learning models, and the design of inclusive programming curricula. 
Furthermore, the dataset enables the application of machine learning techniques to predict educational outcomes and optimize personalized educational strategies. By offering this dataset openly, the study contributes to filling critical gaps in educational research, promoting inclusive access to programming education, and fostering a more comprehensive understanding of how computational competencies can be developed across diverse socioeconomic and cultural contexts. Full article

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

29 pages, 4066 KB

Open Access Article

SAPEx-D: A Comprehensive Dataset for Predictive Analytics in Personalized Education Using Machine Learning

by Muhammad Adnan Aslam, Fiza Murtaza, Muhammad Ehatisham Ul Haq, Amanullah Yasin and Numan Ali

Data 2025, 10(3), 27; https://doi.org/10.3390/data10030027 - 20 Feb 2025

Cited by 12 | Viewed by 4882

Abstract

Education is crucial for leading a productive life and obtaining necessary resources. Higher education institutions are progressively incorporating artificial intelligence into conventional teaching methods as a result of innovations in technology. As a high academic record raises a university’s ranking and increases student career chances, predicting learning success has been a central focus in education. Both performance analysis and providing high-quality instruction are challenges faced by modern schools. Maintaining high academic standards, juggling life and academics, and adjusting to technology are problems that students must overcome. In this study, we present a comprehensive dataset, SAPEx-D (Student Academic Performance Exploration), designed to predict student performance, encompassing a wide array of personal, familial, academic, and behavioral factors. Our data collection effort at Air University, Islamabad, Pakistan, involved both online and paper questionnaires completed by students across multiple departments, ensuring diverse representation. After meticulous preprocessing to remove duplicates and entries with significant missing values, we retained 494 valid responses. The dataset includes detailed attributes such as demographic information, parental education and occupation, study habits, reading frequencies, and transportation modes. To facilitate robust analysis, we encoded ordinal attributes using label encoding and nominal attributes using one-hot encoding, expanding our dataset from 38 to 88 attributes. Feature scaling was performed to standardize the range and distribution of data, using a normalization technique. Our analysis revealed that factors such as degree major, parental education, reading frequency, and scholarship type significantly influence student performance. 
The machine learning models applied to this dataset, including Gradient Boosting and Random Forest, demonstrated high accuracy and robustness, underscoring the dataset’s potential for insightful academic performance prediction. In terms of model performance, Gradient Boosting achieved an accuracy of 68.7% and an F1-score of 68% for the eight-class classification task. For the three-class classification, Random Forest outperformed other models, reaching an accuracy of 80.8% and an F1-score of 78%. These findings highlight the importance of comprehensive data in understanding and predicting academic outcomes, paving the way for more personalized and effective educational strategies. Full article
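
The preprocessing described in this abstract (label encoding for ordinal attributes, one-hot encoding for nominal attributes, then normalization) can be sketched in plain Python as follows. The column names, category orders, and values are hypothetical and not taken from SAPEx-D, and min-max scaling is an assumption, since the abstract does not name the normalization technique.

```python
def label_encode(values, order):
    """Map ordinal categories to integers following an explicit order."""
    rank = {category: idx for idx, category in enumerate(order)}
    return [rank[v] for v in values]


def one_hot_encode(values):
    """Expand a nominal column into one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def min_max_scale(values):
    """Rescale numeric values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical columns for four students.
reading = ["never", "sometimes", "often", "sometimes"]  # ordinal
transport = ["bus", "car", "bus", "walk"]               # nominal

reading_codes = label_encode(reading, ["never", "sometimes", "often"])
transport_cols, transport_rows = one_hot_encode(transport)
reading_scaled = min_max_scale(reading_codes)

print(reading_codes)   # [0, 1, 2, 1]
print(transport_cols)  # ['bus', 'car', 'walk']
print(transport_rows)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(reading_scaled)  # [0.0, 0.5, 1.0, 0.5]
```

A nominal column with k categories becomes k columns under one-hot encoding, which is the mechanism behind the growth in attribute count that the abstract reports (38 to 88).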

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

17 pages, 662 KB

Open Access Article

A Bayesian State-Space Approach to Dynamic Hierarchical Logistic Regression for Evolving Student Risk in Educational Analytics

by Moeketsi Mosia

Data 2025, 10(2), 23; https://doi.org/10.3390/data10020023 - 7 Feb 2025

Cited by 3 | Viewed by 3170

Abstract

Early detection of academically at-risk students is crucial for designing timely interventions that improve educational outcomes. However, many existing approaches either ignore the temporal evolution of student performance or rely on “black box” models that sacrifice interpretability. In this study, we develop a dynamic hierarchical logistic regression model in a fully Bayesian framework to address these shortcomings. Our method leverages partial pooling across students and employs a state-space formulation, allowing each student’s log-odds of failure to evolve over multiple assessments. By using Markov chain Monte Carlo for inference, we obtain robust posterior estimates and credible intervals for both population-level and individual-specific effects, while posterior predictive checks ensure model adequacy and calibration. Results from simulated and real-world datasets indicate that the proposed approach more accurately tracks fluctuations in student risk compared to static logistic regression, and it yields interpretable insights into how engagement patterns and demographic factors influence failure probability. We conclude that a Bayesian dynamic hierarchical model not only enhances prediction of at-risk students but also provides actionable feedback for instructors and administrators seeking evidence-based interventions. Full article

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

14 pages, 770 KB

Open Access Article

Stress Factors in Higher Education: A Data Analysis Case

by Rodolfo Bojorque, Fernando Moscoso, Fernando Pesántez and Ángela Flores

Data 2025, 10(2), 22; https://doi.org/10.3390/data10020022 - 7 Feb 2025

Cited by 2 | Viewed by 4900

Abstract

This study investigates stressors in higher education, focusing on their impact on students and faculty at Universidad Politécnica Salesiana (UPS) and using eight years of comprehensive data. Employing data mining techniques, the research analyzed enrollment, retention, graduation, employability, socioeconomic status, academic performance, and faculty workload to uncover patterns affecting academic outcomes. The study found that UPS exhibits a stable educational system, maintaining consistent metrics across student success indicators. However, the COVID-19 pandemic presented unique stressors, evidenced by a paradoxical increase in student grades during heightened faculty stress levels. This anomaly suggests a potential link between academic rigor and faculty well-being during systemic disruptions. Stressors affecting students directly correlated with reduced academic performance, highlighting the importance of early detection and intervention. Conversely, faculty stress was reflected in adjustments to grading practices, raising questions about institutional pressures and faculty motivation. These findings emphasize the value of proactive data analytics in identifying stress-induced anomalies to support student success and faculty well-being. The study advocates for further research on faculty burnout, motivation, and institutional strategies to mitigate stressors, underscoring the potential of data-driven approaches to enhance the quality and sustainability of higher education ecosystems. Full article

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

Other

9 pages, 806 KB

Open Access Data Descriptor

Tracking K-12 and Higher Education Job Postings Through Web-Scraped Longitudinal Data

by Mark A. Perkins and Bolaji Aderibigbe Akorede

Data 2026, 11(3), 52; https://doi.org/10.3390/data11030052 - 6 Mar 2026

Viewed by 1073

Abstract

Teacher shortages and workforce trends in education are critical policy and research concerns. This study presents a robust data collection pipeline that systematically web-scrapes K-12 and higher education job postings across multiple sources. While the methodology could theoretically be adapted to other job categories, the pipeline is specifically implemented for educational job postings due to platform-specific structures and scraping constraints. Using R, we extract, clean, and archive job postings weekly, compiling them into a longitudinal master dataset that tracks trends in teacher openings over time. Our approach enables monthly trend analysis, providing insights into hiring patterns, subject-area demands, and geographic disparities. By making this dataset available, we contribute both a reproducible methodological pipeline for scraping, cleaning, and standardizing K-12 and higher education job postings, and a validated longitudinal dataset for research and workforce policy applications. This data descriptor details the methodology, data structure, and potential applications for researchers and policymakers monitoring education sector employment trends. Full article

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

11 pages, 335 KB

Open Access Data Descriptor

Anonymized Dataset of Information Systems and Technology Students at a South African University for Learning Analytics

by Rushil Raghavjee, Prabhakar Rontala Subramaniam and Irene Govender

Data 2026, 11(1), 1; https://doi.org/10.3390/data11010001 - 19 Dec 2025

Viewed by 1024

Abstract

Advancements in data storage and data processing technologies have compelled higher education institutions to optimise the use of their data. Many universities globally have begun to implement learning analytics at their institutions to better understand and improve teaching and learning. African higher education institutions have been slow to implement learning analytics despite the continued accumulation of digital data. This study presents a dataset of Information Systems and Technology (IS&T) students from the University of KwaZulu-Natal, a South African university. The dataset comprises approximately 14,000 registered student records from 10 IS&T courses, primarily consisting of demographic data, academic performance (including past IS&T courses and school records), and Learning Management System (LMS) interaction data. The dataset is imbalanced, with a higher proportion of students who have successfully completed courses than students who have not. The dataset will be of interest to researchers engaged in learning analytics application studies, including early pass/fail prediction and grade classification, as well as those who want to test their techniques on a real-world dataset. Full article

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

10 pages, 1572 KB

Open Access Data Descriptor

Simultaneous EEG-fNIRS Data on Learning Capability via Implicit Learning Induced by Cognitive Tasks

by Chayapol Chaiyanan, Thanate Angsuwatanakul, Keiji Iramina and Boonserm Kaewkamnerdpong

Data 2025, 10(8), 131; https://doi.org/10.3390/data10080131 - 18 Aug 2025

Cited by 2 | Viewed by 2471

Abstract

The development of real-time learning assessment tools is hindered by an incomplete understanding of the underlying neural mechanisms. To address this gap, this study aimed to identify the specific neural correlates of implicit learning, a foundational process crucial for skill acquisition. We collected simultaneous electroencephalography and functional near-infrared spectroscopy data from thirty healthy adults (ages 21–29) performing a serial reaction time task designed to induce implicit learning. By capturing both electrophysiological and hemodynamic responses concurrently at shared locations, this dataset offers a unique opportunity to investigate neurovascular coupling during implicit learning and gain deeper insights into the neural mechanisms of learning. The dataset is categorized into two groups: participants who demonstrated implicit learning (based on post-experiment interviews) and those who did not. This dataset enables the identification of prominent brain regions, features, and temporal patterns associated with successful implicit learning. This identification will form the basis for future real-time learning assessment tools. Full article

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

8 pages, 529 KB

Open Access Data Descriptor

An Extended Dataset of Educational Quality Across Countries (1970–2023)

by Hanol Lee and Jong-Wha Lee

Data 2025, 10(8), 130; https://doi.org/10.3390/data10080130 - 15 Aug 2025

Cited by 2 | Viewed by 5757

Abstract

This study presents an extended dataset on educational quality covering 101 countries, from 1970 to 2023. While existing international assessments, such as the Programme for International Student Assessment (PISA) and Trends in International Mathematics and Science Study (TIMSS), offer valuable snapshots of student performance, their limited coverage across countries and years constrains broader analyses. To address this limitation, we harmonized observed test scores across assessments and imputed missing values using both linear interpolation and machine learning (Least Absolute Shrinkage and Selection Operator (LASSO) regression). The dataset includes (i) harmonized test scores for 15-year-olds, (ii) annual educational quality indicators for the 15–19 age group, and (iii) educational quality indexes for the working-age population (15–64). These measures are provided in machine-readable formats and support empirical research on human capital, economic development, and global education inequalities across economies. Full article
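
As a rough illustration of the linear-interpolation step mentioned in this abstract, the sketch below fills gaps between observed years in a single country's score series. The function name, years, and scores are hypothetical. It covers only interpolation between observations, not the harmonization or LASSO steps, and the authors' actual procedure may differ.

```python
def interpolate_linear(series):
    """Fill None gaps between observed points by linear interpolation.

    series: dict mapping year -> score or None.
    Years before the first or after the last observation stay None,
    because interpolation needs an observed value on both sides.
    """
    years = sorted(series)
    known = [y for y in years if series[y] is not None]
    filled = dict(series)
    # Walk over each pair of neighbouring observed years.
    for left, right in zip(known, known[1:]):
        span = right - left
        for y in years:
            if left < y < right:
                weight = (y - left) / span
                filled[y] = series[left] + weight * (series[right] - series[left])
    return filled


# Hypothetical test-score series with gaps.
scores = {2000: 480.0, 2001: None, 2002: None, 2003: 510.0, 2004: None}
filled = interpolate_linear(scores)
for year in sorted(filled):
    print(year, filled[year])
```

Here 2001 and 2002 receive values on the straight line between 480 and 510 (about 490 and 500), while 2004 stays missing because there is no later observation to interpolate towards.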

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

15 pages, 781 KB

Open Access Data Descriptor

NPFC-Test: A Multimodal Dataset from an Interactive Digital Assessment Using Wearables and Self-Reports

by Luis Fernando Morán-Mirabal, Luis Eduardo Güemes-Frese, Mariana Favarony-Avila, Sergio Noé Torres-Rodríguez and Jessica Alejandra Ruiz-Ramirez

Data 2025, 10(7), 103; https://doi.org/10.3390/data10070103 - 30 Jun 2025

Cited by 2 | Viewed by 1947

Abstract

The growing implementation of digital platforms and mobile devices in educational environments has generated the need to explore new approaches for evaluating the learning experience beyond traditional self-reports or instructor presence. In this context, the NPFC-Test dataset was created from an experimental protocol conducted at the Experiential Classroom of the Institute for the Future of Education. The dataset was built by collecting multimodal indicators such as neuronal, physiological, and facial data using a portable EEG headband, a medical-grade biometric bracelet, a high-resolution depth camera, and self-report questionnaires. The participants were exposed to a digital test lasting 20 min, composed of audiovisual stimuli and cognitive challenges, during which synchronized data from all devices were gathered. The dataset includes timestamped records related to emotional valence, arousal, and concentration, offering a valuable resource for multimodal learning analytics (MMLA). The recorded data were processed through calibration procedures, temporal alignment techniques, and emotion recognition models. It is expected that the NPFC-Test dataset will support future studies in human–computer interaction and educational data science by providing structured evidence to analyze cognitive and emotional states in learning processes. In addition, it offers a replicable framework for capturing synchronized biometric and behavioral data in controlled academic settings. Full article

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

18 pages, 6224 KB

Open Access Data Descriptor

A Structured Dataset for Automated Grading: From Raw Data to Processed Dataset

by Ibidapo Dare Dada, Adio T. Akinwale and Ti-Jesu Tunde-Adeleke

Data 2025, 10(6), 87; https://doi.org/10.3390/data10060087 - 6 Jun 2025

Cited by 4 | Viewed by 5508

Abstract

The increasing volume of student assessments, particularly open-ended responses, presents a significant challenge for educators in ensuring grading accuracy, consistency, and efficiency. This paper presents a structured dataset designed for the development and evaluation of automated grading systems in higher education. The primary objective is to create a high-quality dataset that facilitates the development and evaluation of natural language processing (NLP) models for automated grading. The dataset comprises student responses to open-ended questions from the Management Information Systems (MIS221) and Project Management (MIS415) courses at Covenant University, collected during the 2022/2023 academic session. The responses were originally handwritten, scanned, and transcribed into Word documents. Each response is paired with corresponding scores assigned by human graders, following a detailed marking guide. To assess the dataset’s potential for automated grading applications, several machine learning and transformer-based models were tested, including TF-IDF with Linear Regression, TF-IDF with Cosine Similarity, BERT, SBERT, RoBERTa, and Longformer. The experimental results demonstrate that transformer-based models outperform traditional methods, with Longformer achieving the highest Spearman’s Correlation of 0.77 and the lowest Mean Squared Error (MSE) of 0.04, indicating a strong alignment between model predictions and human grading. The findings highlight the effectiveness of deep learning models in capturing the semantic and contextual meaning of both student responses and marking guides, making it possible to develop more scalable and reliable automated grading solutions. This dataset offers valuable insights into student performance and serves as a foundational resource for integrating educational technology into automated assessment systems. Future work will focus on enhancing grading consistency and expanding the dataset for broader academic applications. 
Full article
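
The two evaluation metrics named in this abstract, Spearman's correlation and mean squared error, can be computed as in the following sketch. The score lists are hypothetical values on an assumed 0 to 1 scale, not results from the dataset, and the helper names are illustrative.

```python
def average_ranks(values):
    """Rank values starting at 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend `end` over the run of values tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical human and model scores for four responses.
human = [0.2, 0.5, 0.9, 0.7]
model = [0.3, 0.4, 0.8, 0.9]

rho = spearman(human, model)
mse = mean_squared_error(human, model)
print(round(rho, 3), round(mse, 4))  # 0.8 0.0175
```

Spearman's correlation rewards a model for ordering responses the way human graders do, while MSE penalizes the size of each scoring error, so the two metrics can disagree and are usually reported together.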

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

11 pages, 1930 KB

Open Access Data Descriptor

Towards a Datatset of Digitalized Historical German VET and CVET Regulations

by Thomas Reiser, Jens Dörpinghaus, Petra Steiner and Michael Tiemann

Data 2024, 9(11), 128; https://doi.org/10.3390/data9110128 - 3 Nov 2024

Cited by 4 | Viewed by 2472

Abstract

The digitization of historical documents has gained particular interest in recent years in the digital humanities. The goal is to digitize historical documents by extracting and structuring text from scanned images. Here, we focus on the processing of historical German VET (vocational education and training) and CVET (continuing vocational education and training) regulations to support educational research. This dataset contains data from 1908 to the present and includes 2125 documents as PDF, 983 fully converted XML documents, and additional metadata for 7090 documents from the archive. We present an overview of the historical background and the challenges of processing different historical documents from three different federal states. Full article

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)
