Data, Volume 10, Issue 10 (October 2025) – 13 articles

  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the tables of contents of newly released issues.
  • PDF is the official format for papers published in both HTML and PDF forms. To view the papers in PDF format, click on the "PDF Full-text" link and use the free Adobe Reader to open them.
58 pages, 744 KB  
Article
Review and Comparative Analysis of Databases for Speech Emotion Recognition
by Salvatore Serrano, Omar Serghini, Giulia Esposito, Silvia Carbone, Carmela Mento, Alessandro Floris, Simone Porcu and Luigi Atzori
Data 2025, 10(10), 164; https://doi.org/10.3390/data10100164 - 14 Oct 2025
Abstract
Speech emotion recognition (SER) has become increasingly important in areas such as healthcare, customer service, robotics, and human–computer interaction. The progress of this field depends not only on advances in algorithms but also on the databases that provide the training material for SER systems. These resources set the boundaries for how well models can generalize across speakers, contexts, and cultures. In this paper, we present a narrative review and comparative analysis of emotional speech corpora released up to mid-2025, bringing together both psychological and technical perspectives. Rather than following a systematic review protocol, our approach focuses on providing a critical synthesis of more than fifty corpora covering acted, elicited, and natural speech. We examine how these databases were collected, how emotions were annotated, their demographic diversity, and their ecological validity, while also acknowledging the limits of available documentation. Beyond description, we identify recurring strengths and weaknesses, highlight emerging gaps, and discuss recent usage patterns to offer researchers both a practical guide for dataset selection and a critical perspective on how corpus design continues to shape the development of robust and generalizable SER systems. Full article

16 pages, 5977 KB  
Data Descriptor
Comparative Data Analysis of Non-Destructive Testing for Hollow Heart in Potatoes
by Mary M. Hofle, Nusrat Farheen, Mathew Zachary Shumway, Evan D. Mosher, Keyave C. Hone and Marco P. Schoen
Data 2025, 10(10), 163; https://doi.org/10.3390/data10100163 - 14 Oct 2025
Abstract
Hollow heart and other crop defects can be devastating to farmers. Hollow heart is not a disease but a physiological disorder affected by temperature, soil moisture, plant density, and other factors. These defects can cause substantial annual losses for farmers. Currently, potatoes are shipped from producers to shipping points and markets, where samples are inspected for defects. Detection of hollow heart consists of halving potatoes and visually inspecting them for defects; the defect size is compared to USDA hollow heart classification charts for acceptance or rejection. An automatic, non-destructive system to identify hollow heart has the potential to improve quality. Two methods have been developed to collect data for such a system: acoustic signal capture and visual/vibration signal capture. Data are collected and stored for one potato at a time. The procedure includes the collection of weight, proportional size, and volume, as well as the generation of an acoustic signal through a drop test and a motion signal captured through a vision system. To simulate hollow heart, potatoes are cored and retested to produce a new set of data. Each potato is then manually cut and inspected for true hollow heart. The generated data include over 1000 samples, each comprising proportional volume, weight, proportional size, motion, and acoustic data. Such a dataset does not exist in the current literature and can serve for the development of machine learning algorithms to detect hollow heart non-destructively. In this paper, the data are also analyzed in terms of their statistical properties, with a view to possible feature engineering in machine learning. Full article

12 pages, 1191 KB  
Data Descriptor
University Student Dropout: A Longitudinal Dataset of Demographic, Socioeconomic, and Academic Indicators
by Arnau Igualde-Sáez, José P. Garcia-Sabater, Juan A. Marin-Garcia, Sergio Puche García, Carlos Turró, Ignacio Despujol, Marina Alonso, José V. Benlloch-Dualde, Pedro Pablo Soriano Jiménez and Julien Maheut
Data 2025, 10(10), 162; https://doi.org/10.3390/data10100162 - 14 Oct 2025
Abstract
This dataset contains detailed information on student trajectories and dropout factors at a Spanish technological university offering Science, Technology, Engineering, Arts, and Mathematics programs. The data comprise demographic, socioeconomic, and academic variables for all enrolled students, including those in bachelor’s, master’s, doctoral, and lifelong learning programs, across three complete academic years, excluding periods affected by the SARS-CoV-2 pandemic. The data were collected and standardized from disparate internal data sources and fully anonymized. The dataset contains information about 39,364 students, 4989 courses in 163 degrees, and 77 variables related to admission pathways, academic performance indicators, socio-demographic background, digital activity in the Learning Management System, and Wi-Fi access records. Each of the 464,739 records corresponds to a course enrolment per student per year, enabling longitudinal analyses of academic progression and dropout. This dataset can be reused to support research on factors influencing student retention, to develop predictive models that identify students at risk of leaving their studies, and to provide a resource for comparative studies in higher education. Full article
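The record granularity (one row per course enrolment per student per year) is what enables the longitudinal analyses described above. The following is a minimal sketch, assuming hypothetical column names rather than the dataset's actual 77 variables, of how such records could be aggregated per student and year to derive a simple last-enrolment indicator for dropout modelling.

```python
# Minimal sketch, with hypothetical column names, of aggregating per-enrolment
# records for longitudinal dropout analysis. The real dataset's variables and
# anonymization scheme are not reproduced here.
import pandas as pd

enrolments = pd.DataFrame({
    "student_id":     [1, 1, 1, 1, 2, 2],
    "academic_year":  [2021, 2021, 2022, 2022, 2021, 2021],
    "course_id":      ["MATH1", "PHYS1", "MATH2", "PHYS2", "MATH1", "PROG1"],
    "credits_passed": [6, 6, 0, 6, 6, 0],
})

# One row per student per year: course load and credits earned.
per_year = (enrolments
            .groupby(["student_id", "academic_year"])
            .agg(courses=("course_id", "nunique"),
                 credits=("credits_passed", "sum"))
            .reset_index())

# A student whose last recorded year precedes the final observed year is a
# candidate dropout in a simple longitudinal reading of the data.
final_year = per_year["academic_year"].max()
last_seen = per_year.groupby("student_id")["academic_year"].max()
candidate_dropouts = last_seen[last_seen < final_year].index.tolist()
print(per_year)
print("candidate dropouts:", candidate_dropouts)
```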

18 pages, 2882 KB  
Article
A Preferences Corpus and Annotation Scheme for Human-Guided Alignment of Time-Series GPTs
by Ricardo A. Calix, Tyamo Okosun, Chenn Zhou and Hong Wang
Data 2025, 10(10), 161; https://doi.org/10.3390/data10100161 - 9 Oct 2025
Viewed by 315
Abstract
Time-series forecasting, such as predicting trajectories of silicon content in blast furnaces, is a difficult task. Most time-series approaches today focus on scalar MSE loss optimization. This optimization approach, while widespread, could benefit from the use of human expert or process-level preferences. In this paper, we introduce a novel alignment and fine-tuning approach that involves learning from a corpus of preferred and dis-preferred time-series prediction trajectories. Our contributions include (1) a preference annotation pipeline for time-series forecasts, (2) the application of Score-based Preference Optimization (SPO) to train decoder-only transformers from preferences, and (3) results showing improvements in forecast quality. The approach is validated on both proprietary blast furnace data and the UCI Appliances Energy dataset. The proposed preference corpus and training strategy offer a new option for fine-tuning sequence models in industrial settings. Full article
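The paper's exact SPO objective is not reproduced here. As a rough illustration of learning from preferred versus dis-preferred trajectories, the sketch below uses a generic pairwise (Bradley-Terry style) preference loss over a hypothetical scalar scoring function; both the scoring rule and the loss form are assumptions, not the authors' formulation.

```python
# Sketch of a pairwise preference loss over forecast trajectories.
# score() and the negative-MSE scoring are illustrative assumptions; the paper's
# Score-based Preference Optimization (SPO) may differ.
import numpy as np

def score(pred: np.ndarray, reference: np.ndarray) -> float:
    """Scalar score for a predicted trajectory (higher = better); here, negative MSE
    against a reference, whereas the paper relies on human/process-level preferences."""
    return -float(np.mean((pred - reference) ** 2))

def preference_loss(pred_preferred, pred_dispreferred, reference, beta=1.0):
    """Bradley-Terry style loss: push the preferred trajectory's score above the
    dis-preferred one's. Gradients of such a loss drive preference fine-tuning."""
    margin = beta * (score(pred_preferred, reference) - score(pred_dispreferred, reference))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# Toy usage: two candidate silicon-content forecasts against a reference trajectory.
t = np.linspace(0, 1, 24)
reference = 0.5 + 0.1 * np.sin(2 * np.pi * t)
good = reference + np.random.normal(0, 0.01, t.size)
bad = reference + np.random.normal(0, 0.10, t.size)
print(preference_loss(good, bad, reference))  # small loss when the preference is respected
```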

14 pages, 1530 KB  
Article
Assessing Musculoskeletal Injury Risk in Hospital Healthcare Professionals During a Single Daily Patient-Handling Task
by Xiaoxu Ji, Thomaz Ahualli de Sanctis, Mahmoud Alwahkyan, Xin Gao, Jenna Miller and Sarah Thomas
Data 2025, 10(10), 160; https://doi.org/10.3390/data10100160 - 8 Oct 2025
Viewed by 239
Abstract
Background: Healthcare professionals are at significant risk of musculoskeletal injuries due to the physically demanding nature of patient-handling tasks. While various ergonomic interventions have been introduced to mitigate these risks, comprehensive methods for assessing and addressing musculoskeletal hazards remain limited. Purpose: This study presents a novel approach to evaluating musculoskeletal injury risks among healthcare workers, marking the first instance in which two motion tracking systems are used simultaneously. This dual-system setup enables a more comprehensive and dynamic analysis of worker interactions in real time. Healthcare professionals were divided into three groups to perform patient transfer tasks. Three key poses within the task, associated with peak lumbar forces, were identified and analyzed. Results: The resulting compressive forces on the participants’ lower back ranged from 581.0 N to 3589.1 N, and the Anterior–Posterior (A/P) shear forces ranged from 33.1 N to 912.3 N across the three poses. Relative differences in trunk flexion showed strong correlations with compressive and A/P shear forces at each pose. Discussion and conclusions: Strong associations were found between lumbar loads and participants’ anthropometrics. Recommendations for optimal postures and partner pairings were developed to help reduce the risk of lower back injuries during patient handling. Full article

9 pages, 616 KB  
Article
Expected Shot Impact Timing (xSIT) and Other Advanced Metrics as Indicators of Performance in English Men’s and Women’s Professional Football
by Blanca De-la-Cruz-Torres, Miguel Navarro-Castro and Anselmo Ruiz-de-Alarcón-Quintero
Data 2025, 10(10), 159; https://doi.org/10.3390/data10100159 - 2 Oct 2025
Viewed by 308
Abstract
Background: Football performance analysis has grown rapidly in recent years, with increasing interest in advanced metrics to more accurately evaluate both individual and team performance. The aim of this study was to examine the utility of the Expected Shots Impact Timing (xSIT) metric as an indicator of shooting performance in English professional football, specifically in the men’s Premier League (PL) and the Women’s Super League (WSL). Methods: A total of 9831 shots from the PL (2015/16 season) and 3219 shots from the WSL (2020/21 season) were analyzed. Data were obtained from publicly accessible football databases. The variables examined included goals, Possession Value (PV), Expected Goals (xG), Expected Goals on Target (xGOT), and xSIT. All variables were normalized per match (90 min). Descriptive statistics, correlational analyses, and comparative analyses between leagues were performed. Results: The WSL exhibited a significantly higher PV than the PL (p < 0.001), whereas the remaining metrics showed no significant differences between leagues (p > 0.05). Moreover, in the WSL, all performance indicators displayed very strong correlations with goals, while in the PL, similarly strong associations were observed, except for PV, which showed only a weak relationship. Conclusions: The xSIT metric, as an indicator of shooting performance, may be regarded as an influential factor in determining match outcomes across both leagues. Full article
(This article belongs to the Special Issue Big Data and Data-Driven Research in Sports)
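As a rough illustration of the per-match (90-minute) normalization and the correlational analysis described in the abstract, the sketch below uses hypothetical team-level totals and column names; the public shot-level data used in the study are not reproduced.

```python
# Sketch of per-match normalization of shot metrics and their correlation with
# goals. Team names, values, and columns are hypothetical placeholders.
import pandas as pd

teams = pd.DataFrame({
    "team":    ["A", "B", "C", "D"],
    "matches": [38, 38, 22, 22],
    "goals":   [68, 55, 40, 31],
    "xG":      [64.2, 52.8, 38.5, 33.0],
    "xGOT":    [60.1, 50.3, 36.9, 30.2],
    "xSIT":    [12.4, 10.1, 8.7, 7.5],
})

# Normalize every metric per match (90 minutes), as described in the study.
for col in ["goals", "xG", "xGOT", "xSIT"]:
    teams[f"{col}_p90"] = teams[col] / teams["matches"]

# Correlational analysis: association of each normalized metric with goals per match.
corr = teams[["goals_p90", "xG_p90", "xGOT_p90", "xSIT_p90"]].corr(method="spearman")
print(corr["goals_p90"])
```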

13 pages, 724 KB  
Article
Research on the Development and Application of the GDELT Event Database
by Dengxi Hong, Zexin Fu, Xin Zhang and Yan Pan
Data 2025, 10(10), 158; https://doi.org/10.3390/data10100158 - 1 Oct 2025
Viewed by 439
Abstract
This study investigates the development and application of the GDELT (Global Database of Events, Language, and Tone) news database. Through experiments, we conducted a quantitative statistical analysis of the GDELT event database to evaluate its practical characteristics. The results indicate that although the database achieves comprehensive coverage across all countries and regions and includes most major global media outlets, the accuracy rate of its key fields is only approximately 55%, with data redundancy as high as 20%. Based on these findings, while the GDELT data demonstrates good coverage and data integrity, data correction and deduplication are recommended before its use in research contexts and industrial applications. Subsequently, a survey of the existing literature reveals that current studies using GDELT primarily focus on event-related metrics, such as event quantity, tone, and GoldsteinScale, for application in international relations analysis, crisis event prediction, policy effectiveness testing, and public opinion impact analysis. Nevertheless, news constitutes a fundamental channel of information dissemination in media networks, and the propagation of news events through these networks represents a critical area of study for information recommendation, public opinion guidance, and crisis intervention. Existing research has employed the Event, GKG, and Mentions tables to construct cross-national news flow network models. However, the informational correlations across different data table fields have not been fully leveraged in preliminary data selection, leading to substantial computational overhead. To advance research in this field, this study employs chained list queries on the Event and Mentions tables within GDELT. Using social network analysis, we constructed a media co-occurrence network of event reports, through which core hubs and associative relationships within the event dissemination network are identified. Full article
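A minimal sketch of the media co-occurrence idea follows: two outlets are linked whenever they mention the same event, and hub outlets emerge from the weighted degree. The field names follow the public GDELT 2.0 Mentions schema but should be treated as assumptions; the paper's chained Event/Mentions queries, filtering, and deduplication steps are not reproduced.

```python
# Sketch of building a media co-occurrence network from a GDELT-style Mentions
# table. Column names mirror the GDELT 2.0 Mentions schema (treated here as an
# assumption); the sample rows are synthetic.
from itertools import combinations
import pandas as pd
import networkx as nx

mentions = pd.DataFrame({
    "GLOBALEVENTID": [1, 1, 1, 2, 2, 3],
    "MentionSourceName": ["bbc.com", "cnn.com", "reuters.com",
                          "bbc.com", "reuters.com", "cnn.com"],
})

G = nx.Graph()
for _, outlets in mentions.groupby("GLOBALEVENTID")["MentionSourceName"]:
    # Link every pair of outlets that report the same event; edge weight counts shared events.
    for a, b in combinations(sorted(set(outlets)), 2):
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

# Core hubs of the dissemination network, e.g. by weighted degree.
hubs = sorted(G.degree(weight="weight"), key=lambda kv: kv[1], reverse=True)
print(hubs[:3])
```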

10 pages, 2446 KB  
Data Descriptor
A Multi-Class Labeled Ionospheric Dataset for Machine Learning Anomaly Detection
by Aleksandra Kolarski, Filip Arnaut, Sreten Jevremović, Zoran R. Mijić and Vladimir A. Srećković
Data 2025, 10(10), 157; https://doi.org/10.3390/data10100157 - 30 Sep 2025
Viewed by 314
Abstract
The binary anomaly detection (classification) of ionospheric data related to Very Low Frequency (VLF) signal amplitude in prior research demonstrated the potential for development and further advancement. Further data quality improvement is integral for advancing the development of machine learning (ML)-based ionospheric data (VLF signal amplitude) anomaly detection. This paper presents the transition from binary to multi-class classification of ionospheric signal amplitude datasets. The dataset comprises 19 transmitter–receiver pairs and 383,041 manually labeled amplitude instances. The target variable was reclassified from a binary classification (normal and anomalous data points) to a six-class classification that distinguishes between daytime undisturbed signals, nighttime signals, solar flare effects, instrument errors, instrumental noise, and outlier data points. Furthermore, in addition to the dataset, we developed a freely accessible web-based tool designed to facilitate the conversion of MATLAB data files to TRAINSET-compatible formats, thereby establishing a completely free and open data pipeline from the WALDO world data repository to data labeling software. This novel dataset facilitates further research in ionospheric signal amplitude anomaly detection, concentrating on effective and efficient anomaly detection in ionospheric signal amplitude data. The potential outcomes of employing anomaly detection techniques on ionospheric signal amplitude data may be extended to other space weather parameters in the future, such as ELF/LF datasets and other relevant datasets. Full article
(This article belongs to the Section Spatial Data Science and Digital Earth)
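The move from two labels to six can be illustrated with a small sketch. The class names below follow the abstract, while the numeric encoding and the mapping logic from the original binary labels are illustrative assumptions, not the dataset's actual labelling procedure.

```python
# Sketch of the binary-to-multi-class relabelling described in the abstract.
# Class names come from the abstract; the codes and mapping rules are assumptions.
CLASSES = {
    0: "daytime undisturbed signal",
    1: "nighttime signal",
    2: "solar flare effect",
    3: "instrument error",
    4: "instrumental noise",
    5: "outlier data point",
}

def refine_label(binary_label: str, context: dict) -> int:
    """Map an old binary label plus contextual flags to one of the six classes."""
    if binary_label == "normal":
        return 1 if context.get("is_night") else 0
    if context.get("solar_flare"):
        return 2
    if context.get("instrument_error"):
        return 3
    if context.get("noise"):
        return 4
    return 5

print(CLASSES[refine_label("anomalous", {"solar_flare": True})])  # solar flare effect
```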

20 pages, 970 KB  
Article
Automated Test Generation Using Large Language Models
by Marcin Andrzejewski, Nina Dubicka, Jędrzej Podolak, Marek Kowal and Jakub Siłka
Data 2025, 10(10), 156; https://doi.org/10.3390/data10100156 - 30 Sep 2025
Viewed by 523
Abstract
This study explores the potential of generative AI, specifically Large Language Models (LLMs), in automating unit test generation in Python 3.13. We analyze tests, both those created by programmers and those generated by LLM models, for fifty source code cases. Our main focus is on how the choice of model, the difficulty of the source code, and the prompting strategy influence the quality of the generated tests. The results show that AI models can help automate test creation for simple code, but their effectiveness decreases for more complex tasks. We introduce an embedding-based similarity analysis to assess how closely AI-generated tests resemble human-written ones, revealing that AI outputs often lack semantic diversity. The study also highlights the potential of AI models for rapid test prototyping, which can significantly speed up the software development cycle. However, further customization and training of the models on specific use cases is needed to achieve greater precision. Our findings provide practical insights into integrating LLMs into software testing workflows and emphasize the importance of prompt design and model selection. Full article
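As an illustration of the embedding-based similarity analysis mentioned above, the sketch below compares a human-written and an AI-generated unit test by cosine similarity. The hash-based embedding is a self-contained stand-in; the paper's actual embedding model and similarity protocol are not specified here and should be treated as assumptions.

```python
# Sketch of embedding-based similarity between a human-written and an
# AI-generated test. embed() is a toy stand-in for a sentence-embedding model.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Deterministic toy 'embedding' (hash-seeded) so the sketch runs without a model;
    in practice a trained sentence-embedding model would be used."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)  # vectors are already unit-normalized

human_test = "def test_add(): assert add(2, 3) == 5"
llm_test = "def test_add_returns_sum(): assert add(2, 3) == 5"
print(cosine_similarity(embed(human_test), embed(llm_test)))
```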

12 pages, 1732 KB  
Data Descriptor
A Dataset of Environmental Toxins for Water Monitoring in Coastal Waters of Southern Centre, Vietnam: Case of Nha Trang Bay
by Hoang Xuan Ben, Tran Cong Thinh and Phan Minh-Thu
Data 2025, 10(10), 155; https://doi.org/10.3390/data10100155 - 29 Sep 2025
Viewed by 374
Abstract
This study presents a comprehensive dataset developed to monitor coastal water quality in the south-central region of Vietnam, focusing on Nha Trang Bay. Environmental data were collected from four research cruises conducted between 2013 and 2024. Water samples were taken at two depths: surface samples at approximately 0.5–1.0 m below the water surface, and bottom samples 1.0 to 2.0 m above the seabed, depending on site-specific bathymetry. These samples were analyzed for key water quality parameters, including biological oxygen demand (BOD5), dissolved inorganic nitrogen (DIN), dissolved inorganic phosphorus (DIP), and Chlorophyll-a (Chl-a). The data establish a valuable baseline for assessing both spatial and temporal patterns of water quality, and for calculating the eutrophication index to evaluate potential environmental degradation. Importantly, the dataset also demonstrates practical applications for environmental management. It can support assessments of how seasonal tourism peaks contribute to nutrient enrichment, how aquaculture expansion affects dissolved oxygen dynamics, and how water quality trends evolve under increasing anthropogenic pressure. These applications make it a useful resource for evaluating pollution control efforts and for guiding sustainable development in coastal areas. By promoting open access, the dataset not only supports scientific research but also strengthens evidence-based management strategies to protect ecosystem health and socio-economic resilience in Nha Trang Bay. Full article

18 pages, 1089 KB  
Data Descriptor
Digital Accessibility of Solar Energy Variability Through Short-Term Measurements: Data Descriptor
by Fernando Venâncio Mucomole, Carlos Augusto Santos Silva and Lourenço Lázaro Magaia
Data 2025, 10(10), 154; https://doi.org/10.3390/data10100154 - 28 Sep 2025
Viewed by 215
Abstract
A variety of factors, such as absorption, reflection, and attenuation by atmospheric elements, influence the quantity of solar energy that reaches the surface of the Earth. This, in turn, impacts photovoltaic (PV) power generation. In light of this, a digital assessment of solar energy variability through short-term measurements was conducted to enhance PV power output. The clear-sky index Kt* methodology was employed, effectively eliminating any indications of solar energy obstruction by comparing the measured radiation to the theoretical clear-sky radiation. The solar energy data were gathered in Mozambique over the period from 2005 to 2024, in the southern region at Maputo–1, Massangena, Ndindiza, and Pembe, in the mid-region at Chipera, Nhamadzi, Barue–1, and Barue–2, and in the northern region at Nipepe-1, Nipepe-2, Nanhupo-1, Nanhupo-2, and Chomba. Measurement intervals ranged from 1 to 10 min and 1 h during the campaigns conducted by FUNAE and INAM, with additional data sourced from the PVGIS, Meteonorm, NOAA, and NASA solar databases. The analysis indicates a Kt* value with a density approaching 1 for clear days, while intermediate-sky days exhibit characteristics that lie between those of clear and cloudy days. It can be inferred that there exists a robust correlation among sky types, with values ranging from 0.95 to 0.89 per station, alongside correlated energies, which experience a regression with coefficients between 0.79 and 0.95. Based on the analysis of the sample, the region demonstrates significant potential for solar energy utilization, and similar sampling methodologies can be applied in other locations to optimize PV output and other solar energy projects. Full article
(This article belongs to the Topic Smart Energy Systems, 2nd Edition)
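The clear-sky index at the core of the methodology is the ratio of measured irradiance to the modeled clear-sky irradiance at the same instant, so values near 1 indicate unobstructed conditions. The numbers below are placeholders; in practice the clear-sky values would come from a standard clear-sky model rather than being hard-coded.

```python
# Sketch of the clear-sky index Kt* = measured irradiance / clear-sky irradiance.
# Both series here are hypothetical placeholders.
import numpy as np

measured_ghi  = np.array([120., 540., 810., 760., 300.])   # W/m^2, hypothetical measurements
clear_sky_ghi = np.array([130., 560., 830., 800., 650.])   # W/m^2, hypothetical clear-sky model

kt_star = measured_ghi / clear_sky_ghi
print(kt_star)         # close to 1 under clear conditions, lower when obstructed
print(kt_star.mean())  # summary value of the kind used to group sky types
```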

17 pages, 3841 KB  
Article
Sliding Performance Evaluation with Machine Learning-Based Trajectory Analysis for Skeleton
by Ting Yu, Zhen Peng, Zining Wang, Weiya Chen and Bo Huo
Data 2025, 10(10), 153; https://doi.org/10.3390/data10100153 - 24 Sep 2025
Viewed by 408
Abstract
Skeleton is an extreme sliding sport in the Winter Olympics, where formulating targeted sliding strategies, based on training videos to navigate complex tracks, is particularly important. To make in-depth use of training video records, this study proposes an analytical method based on Mixture of Gaussians (MoG) and K-means clustering to extract and analyze trajectories from recorded videos for sliding performance evaluation and strategy development. A case study was conducted using data from the Chinese national skeleton team at the Yanqing Sliding Center, obtaining 741, 834, and 726 sliding trajectories from three representative curves. These trajectories were divided into groups based on sliding completion time (fast, medium, and slow groups). The consistency of trajectories within each group was calculated to evaluate sliding stability, while trajectory patterns in the fast group were clustered and described based on the average values of multiple features (starting position, ending position, and apex orthogonal offset). The results showed that more skilled athletes exhibited greater sliding stability (lower ρC-values), and on each curve, there were sliding patterns that performed significantly better than others. This research quantifies the characteristics of athletes’ sliding trajectories on curves, facilitating the visual tracking of training effects and the development of personalized strategies. It provides coaches and athletes with scientific decision-making support and clear directions for improvement, ultimately enabling precise enhancements in training efficiency and competitive performance, while also laying a technical foundation for the future development of intelligent training systems. Full article
(This article belongs to the Special Issue Big Data and Data-Driven Research in Sports)
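As a rough illustration of the pattern-discovery step, the sketch below clusters synthetic per-trajectory features (starting position, ending position, apex orthogonal offset) with K-means; the MoG-based extraction of trajectories from the training videos and the consistency metric are not reproduced here.

```python
# Sketch of clustering fast-group sliding trajectories by a few features.
# Feature values are synthetic; the video-derived trajectories are not reproduced.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical features per trajectory: [start_offset_m, end_offset_m, apex_offset_m]
fast_group = np.vstack([
    rng.normal([0.3, 0.6, 1.1], 0.05, size=(40, 3)),   # one sliding pattern
    rng.normal([0.5, 0.4, 0.9], 0.05, size=(35, 3)),   # a second pattern
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(fast_group)
for k in range(km.n_clusters):
    # Each cluster centre summarizes a candidate sliding pattern on the curve.
    print(f"pattern {k}: mean features = {km.cluster_centers_[k].round(2)}")
```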

10 pages, 2476 KB  
Data Descriptor
In Situ Monitoring and Bioluminescence Kinetics of Pseudomonas fluorescens M3A Bioluminescent Reporter with Bacteriophage ΦS1
by Phillip R. Myer, Pankaj Bhatt, Halis Simsek and Bruce M. Applegate
Data 2025, 10(10), 152; https://doi.org/10.3390/data10100152 - 23 Sep 2025
Viewed by 318
Abstract
Food spoilage and the associated organisms are a continuing concern for the food industry. The microorganisms involved with food spoilage in pasteurized milk can be introduced in a variety of ways, which include those that survive pasteurization and/or are introduced post-pasteurization. The use of bacteriophages for therapeutic regimens and as a method for the biocontrol of food-borne pathogens has been widely studied and applied; however, their use in biocontrol against spoilage organisms is in its nascency. Bioluminescent bacteria offer the ability to act as cell-death reporters. In the case of using bacteriophage against spoilage-associated bacteria, cell death results in the loss of bioluminescence. In this study, a bioluminescent Pseudomonas species, Pseudomonas fluorescens M3A, was used to monitor the efficacy of the bacteriophage-associated biocontrol system within laboratory bacterial growth broth and fluid milk using bacteriophage ΦS1. A bioluminescence kinetic assay with ten-fold serially diluted P. fluorescens M3A and bacteriophage ΦS1 demonstrated rapid inactivation of bacterial growth, even at low bacteriophage titers. Cell death was indicated by the loss of bacterial bioluminescence. These data help to support the application of bacteriophage-based technologies against spoilage-associated bacteria to prolong shelf life in the event of microbial growth. Full article
