Data, Volume 10, Issue 5 (May 2025) – 23 articles

Cover Story: Harnessing the power of electroencephalography (EEG) as a potential biomarker for quantifying dementias such as Alzheimer's disease or frontotemporal dementia has long been the focus of extensive research. While the exploration of dementia biomarkers and the investigation of automatic diagnosis are ongoing, progress in these areas has been hindered by the scarcity of publicly available datasets. Offering a groundbreaking contribution, our paper presents the first publicly accessible dataset of EEG recordings, encompassing patients with Alzheimer's disease and frontotemporal dementia as well as healthy individuals. By providing this invaluable resource, we aim to accelerate research in the field and foster collaboration among diverse teams.
13 pages, 1886 KiB  
Data Descriptor
δ-MedBioclim: A New Dataset Bridging Current and Projected Bioclimatic Variables for the Euro-Mediterranean Region
by Giovanni-Breogán Ferreiro-Lera, Ángel Penas and Sara del Río
Data 2025, 10(5), 78; https://doi.org/10.3390/data10050078 - 16 May 2025
Abstract
This data descriptor presents δ-MedBioclim, a newly developed dataset for the Euro-Mediterranean region. This dataset applies the delta-change method by comparing the values of 25 General Circulation Models (GCMs) for the reference period (1981–2010) with their projections for future periods (2026–2050, 2051–2075, and 2076–2100) under the SSP1-RCP2.6, SSP2-RCP4.5, and SSP5-RCP8.5 scenarios. These anomalies are added to two pre-existing datasets, ERA5-Land and CHELSA, yielding resolutions of 0.1° and 0.01°, respectively. Additionally, this manuscript provides a ranking of GCMs for each major river basin within the study area to guide model selection. δ-MedBioclim includes, for all the aforementioned scenarios, monthly mean temperature, total monthly precipitation, and 23 bioclimatic variables, including 9 (biorm1 to biorm9) from the Worldwide Bioclimatic Classification System (WBCS) that are not available in other databases. It also provides two bioclimatic classifications: Köppen–Geiger and WBCS. This dataset is expected to be a valuable resource for modeling the distribution of Mediterranean species and habitats, which are highly affected by climate change. Full article
(This article belongs to the Section Spatial Data Science and Digital Earth)
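The delta-change step this abstract describes can be sketched in a few lines. As a hedged illustration (the function name, values, and purely additive form are invented here, not taken from the paper), the idea is: compute a GCM anomaly as the future-period value minus the reference-period value, then add that anomaly to a high-resolution observational baseline.

```python
# Delta-change sketch: add a coarse-model anomaly to a fine-scale baseline.
def delta_change(obs_baseline, gcm_reference, gcm_future):
    """Additive delta-change: baseline + (GCM future - GCM reference)."""
    anomaly = gcm_future - gcm_reference
    return obs_baseline + anomaly

# Invented example: the GCM projects +2.5 degrees of warming for this cell.
corrected = delta_change(obs_baseline=14.2, gcm_reference=13.0, gcm_future=15.5)
print(round(corrected, 2))  # 16.7
```

Precipitation is often handled multiplicatively instead (the baseline scaled by the ratio of future to reference totals), which avoids producing negative values.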

12 pages, 24527 KiB  
Data Descriptor
A Machine Learning Dataset of Artificial Inner Ring Damage on Cylindrical Roller Bearings Measured Under Varying Cross-Influences
by Christopher Schnur, Payman Goodarzi, Yannick Robin, Julian Schauer and Andreas Schütze
Data 2025, 10(5), 77; https://doi.org/10.3390/data10050077 - 16 May 2025
Abstract
In practical machine learning (ML) applications, covariate shifts and dependencies can significantly impact model robustness and prediction quality, leading to performance degradation under distribution shifts. In industrial settings, it is crucial to account for covariates during the design of experiments to ensure reliable generalization. The presented dataset of undamaged and artificially damaged cylindrical roller bearings is designed to address the lack of data resources for targeting domain and distribution shifts in this field. The dataset considers multiple key covariates, including mounting position, load, and rotational speed. Each covariate consists of multiple levels optimized for group-based cross-validation. This allows the user to exclude specific groups from training in order to validate and test the algorithm. Using this approach, algorithms can be evaluated for their robustness to distribution shifts and for the effect such shifts have on the model, allowing their generalization capabilities to be studied under realistic conditions. Full article
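The group-based cross-validation scheme described above can be illustrated with a minimal, pure-Python sketch (the data and the leave-one-group-out variant are invented for illustration): an entire covariate level, such as one rotational speed, is held out so the model is always tested under a shift it never saw in training.

```python
def leave_one_group_out(samples):
    """Yield (held_out_group, train, test) splits, one per group level."""
    groups = sorted({g for _, g in samples})
    for held_out in groups:
        train = [s for s in samples if s[1] != held_out]
        test = [s for s in samples if s[1] == held_out]
        yield held_out, train, test

# Each sample is (measurement_id, rotational_speed_level) - values invented.
data = [(0, 1000), (1, 1000), (2, 2000), (3, 2000), (4, 3000)]
for speed, train, test in leave_one_group_out(data):
    print(speed, len(train), len(test))
```

In practice, libraries such as scikit-learn provide ready-made group-aware splitters (e.g., `GroupKFold`, `LeaveOneGroupOut`) that implement the same idea.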

21 pages, 792 KiB  
Article
Computing Non-Dominated Flexible Skylines in Vertically Distributed Datasets with No Random Access
by Davide Martinenghi
Data 2025, 10(5), 76; https://doi.org/10.3390/data10050076 - 15 May 2025
Abstract
In today’s data-driven world, algorithms operating with vertically distributed datasets are crucial due to the increasing prevalence of large-scale, decentralized data storage. These algorithms process data locally, thereby reducing data transfer and exposure to breaches, while at the same time improving scalability thanks to data distribution across multiple sources. Top-k queries are a key tool in vertically distributed scenarios and are widely applied in critical applications involving sensitive data. Classical top-k algorithms typically resort to sorted access to sequentially scan the dataset and to random access to retrieve a tuple by its id. However, the latter kind of access is sometimes too costly to be feasible, and algorithms need to be designed for the so-called “no random access” (NRA) scenario. The latest efforts in this direction do not cover the recent advances in ranking queries, which propose hybridizations of top-k queries (which are preference-aware and control the output size) and skyline queries (which are preference-agnostic and have uncontrolled output size). The non-dominated flexible skyline (ND) is one such proposal, which tries to obtain the best of top-k and skyline queries. We introduce an algorithm for computing ND in the NRA scenario, prove its correctness and optimality within its class, and provide an experimental evaluation covering a wide range of cases, with both synthetic and real datasets. Full article
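The paper's algorithm for ND is not reproduced here; the following is a simplified, generic sketch of the NRA mechanism the abstract refers to, for plain top-k with a sum scoring function (names and data invented, and each attribute list is assumed to rank all objects). It uses sorted accesses only, keeps per-object lower and upper score bounds, and stops once no object outside the current top-k can still overtake it.

```python
def nra_topk(lists, k):
    """NRA sketch. lists: one descending [(obj, score), ...] list per attribute."""
    m = len(lists)
    seen = {}                 # obj -> {attr_index: score seen so far}
    frontier = [None] * m     # last score read from each list (upper bound
                              # on any score not yet seen in that list)
    lower = {}
    for depth in range(max(len(lst) for lst in lists)):
        for i, lst in enumerate(lists):      # one sorted access per list
            if depth < len(lst):
                obj, score = lst[depth]
                frontier[i] = score
                seen.setdefault(obj, {})[i] = score
        lower = {o: sum(s.values()) for o, s in seen.items()}
        upper = {o: sum(s.get(i, frontier[i]) for i in range(m))
                 for o, s in seen.items()}
        top = sorted(lower, key=lower.get, reverse=True)[:k]
        threshold = min(lower[o] for o in top)
        # Stop when no object outside the current top-k can still beat it.
        if len(top) == k and all(upper[o] <= threshold
                                 for o in seen if o not in top):
            return top
    return sorted(lower, key=lower.get, reverse=True)[:k]

l1 = [("a", 0.9), ("b", 0.8), ("c", 0.1)]   # attribute 1, sorted descending
l2 = [("a", 0.9), ("c", 0.7), ("b", 0.2)]   # attribute 2, sorted descending
print(nra_topk([l1, l2], k=1))  # ['a']
```

Note that no step above retrieves a tuple by its id: that would be a random access, which the NRA scenario forbids.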

8 pages, 498 KiB  
Data Descriptor
First Whole Genome Sequencing Data of Six Greek Sheep Breeds
by Antiopi Tsoureki, George Tsiolas, Maria Kyritsi, Eleftherios Pavlou, Anagnostis Argiriou and Sofia Michailidou
Data 2025, 10(5), 75; https://doi.org/10.3390/data10050075 - 14 May 2025
Abstract
Sheep farming is a common agricultural practice in Greece, with many sheep populations belonging to Greek breeds. However, their genetic makeup remains relatively unexplored, and limited information is available on their genetic variability. Here, we provide the first whole genome sequencing (WGS) data for six Greek sheep breeds, namely Chios, Kalarritiko, Karagouniko, Lesvos, Serres, and Thraki breeds. We performed variant discovery analysis on the data and identified 23,526,500 high-quality variants. The high average variant depth (148.7X ± 28.3) and low Single Nucleotide Polymorphism (SNP) density (1 variant per 111 bases) in the callset demonstrated the high quality of the data. The vast majority of the variants (97.46%) were located in non-coding regions, while a small percentage (1.32%) was positioned in exonic regions. The overall transition-to-transversion (Ti/Tv) ratio (2.449) and heterozygous-to-non-reference-homozygous (Het/Hom) ratio (1.49) further confirmed the callset’s high quality. This dataset comprises the first WGS data for six Greek sheep breeds, providing invaluable information to the Greek agricultural sector for the design and implementation of targeted breeding schemes, for traceability purposes, and for the overall enhancement of the sector, in terms of performance and sustainability. Full article
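The Ti/Tv ratio used above as a callset quality check is simple to compute; a self-contained sketch with invented variant calls (transitions are purine-purine or pyrimidine-pyrimidine substitutions, everything else is a transversion):

```python
# Purine<->purine (A/G) and pyrimidine<->pyrimidine (C/T) swaps are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snps):
    """snps: list of (ref, alt) single-nucleotide substitutions."""
    ti = sum(1 for s in snps if s in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv

calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
print(ti_tv_ratio(calls))  # 1.5  (3 transitions, 2 transversions)
```

Genome-wide values near 2.0 to 2.1 (higher in exonic regions) are the commonly cited expectation for human data; the 2.449 reported above is the authors' sheep callset value.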

13 pages, 726 KiB  
Data Descriptor
A Non-Binary Approach to Super-Enhancer Identification and Clustering: A Dataset for Tumor- and Treatment-Associated Dynamics in Mouse Tissues
by Ekaterina D. Osintseva, German A. Ashniev, Alexey V. Orlov, Petr I. Nikitin, Zoia G. Zaitseva, Vladimir V. Volkov and Natalia N. Orlova
Data 2025, 10(5), 74; https://doi.org/10.3390/data10050074 - 14 May 2025
Abstract
Super-enhancers (SEs) are large clusters of highly active enhancers that play key regulatory roles in cell identity, development, and disease. While conventional methods classify SEs in a binary fashion—super-enhancer or not—this threshold-based approach can overlook significant intermediate states of enhancer activity. Here, we present a dataset and accompanying framework that facilitate a more nuanced, non-binary examination of SE activation across mouse tissue types (mammary gland, lung tissue, and NMuMG cells) and various experimental conditions (normal, tumor, and drug-treated samples). By consolidating overlapping SE intervals and capturing continuous enhancer activity metrics (e.g., ChIP-seq signal intensities), our dataset reveals gradual transitions between moderate and high enhancer activity levels that are not captured by strictly binary classification. Additionally, the data include extensive functional annotations, linking SE loci to nearby genes and enabling immediate downstream analyses such as clustering and gene ontology enrichment. The flexible approach supports broader investigations of enhancer landscapes, offering a comprehensive platform for understanding how SE activation underpins disease mechanisms, therapeutic response, and developmental processes. Full article
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 2nd Edition)

13 pages, 1955 KiB  
Article
A Data-Driven Approach to Tourism Demand Forecasting: Integrating Web Search Data into a SARIMAX Model
by Geun-Cheol Lee
Data 2025, 10(5), 73; https://doi.org/10.3390/data10050073 - 10 May 2025
Abstract
Tourism is a core sector of Singapore’s economy, contributing significantly to Gross Domestic Product (GDP) and employment. Accurate tourism demand forecasting is essential for strategic planning, resource allocation, and economic stability, particularly in the post-COVID-19 era. This study develops a SARIMAX-based forecasting model to predict monthly visitor arrivals to Singapore, integrating web search data from Google Trends and external factors. To enhance model accuracy, a systematic selection process was applied to identify the effective subset of external variables. Results of the empirical experiments demonstrate that the proposed SARIMAX model outperforms traditional univariate models, including SARIMA, Holt–Winters, and Prophet, as well as machine learning-based approaches such as Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNNs). When forecasting the 24-month period of 2023 and 2024, the proposed model achieves the lowest Mean Absolute Percentage Error (MAPE) of 7.32%. Full article
(This article belongs to the Section Information Systems and Data Management)
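The headline accuracy figure above is a MAPE of 7.32%. For reference, a minimal definition of the metric with invented numbers (not the paper's data):

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    return 100 * sum(abs(a - f) / abs(a)
                     for a, f in zip(actual, forecast)) / len(actual)

arrivals = [100.0, 120.0, 80.0]     # invented monthly visitor arrivals
predicted = [110.0, 114.0, 84.0]    # invented model forecasts
print(round(mape(arrivals, predicted), 2))  # 6.67
```

MAPE is scale-free, which is why it is a common yardstick for comparing forecasting models, though it is undefined when an actual value is zero.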

26 pages, 3763 KiB  
Article
Tracking Religious Freedom Violations with the Violent Incidents Database: A Methodological Approach and Comparative Analysis
by Dennis P. Petri, Kyle J. Wisdom and John T. Bainbridge
Data 2025, 10(5), 72; https://doi.org/10.3390/data10050072 - 10 May 2025
Abstract
Measuring and comparing religious freedom across countries and over time requires reliable and valid data sources. Existing religious freedom datasets are either based on the coding of qualitative data (such as the Religion and State Project or the Pew Research Center), on expert opinions (V-Dem or the World Watch List) or on surveys (Anti-Defamation League). Each of these approaches has its strengths and limitations. In this study, we present the Violent Incidents Database (VID), a complementary tool designed to collect, record, and analyze violent incidents related to violations of religious freedom based on media reports and other public sources. We critically describe the criteria and process for selecting, coding and verifying the incidents, as well as the categories and indicators used to classify them. We also compare the VID with other existing religious freedom datasets and show how the VID provides a complementary picture of the nature and dynamics of religious freedom violations. We offer a preliminary analysis of the data collected through the end of 2024 with selected figures for data visualization. We conclude by discussing anticipated improvements for the VID as well as its potential applications for policy makers, advocates, and practitioners. Full article
(This article belongs to the Section Information Systems and Data Management)

10 pages, 1880 KiB  
Data Descriptor
Historical Bolide Infrasound Dataset (1960–1972)
by Elizabeth A. Silber and Rodney W. Whitaker
Data 2025, 10(5), 71; https://doi.org/10.3390/data10050071 - 9 May 2025
Abstract
We present the first fully curated, publicly accessible archive of infrasonic records from ten large bolide events documented by the U.S. Air Force Technical Applications Center’s global microbarometer network between 1960 and 1972. Captured on analog strip-chart paper, these waveforms predate modern digital arrays and space-based sensors, making them a unique window on meteoroid activity in the mid-twentieth century. Prior studies drew important scientific conclusions from the records but released only limited artifacts, chiefly period–amplitude tables and unprocessed scans, leaving the underlying data inaccessible for independent study. The present release transforms those limited excerpts into a research-ready resource. By capturing ten large events in the mid-20th century, the dataset constitutes a critical reference point for assessing bolide activity before the advent of modern space-based and digital ground-based monitoring. The multi-year coverage and worldwide distribution of events provide a valuable reference for comparing past and more recent detections, facilitating assessments of long-term flux and the dynamics of acoustic wave propagation in Earth’s atmosphere. The dataset’s availability in a consolidated format ensures straightforward access to waveforms and derived measurements, supporting a wide range of scientific inquiries into bolide physics and infrasound monitoring. By preserving these historical acoustic observations, the collection maintains a significant record of mid-20th-century meteoroid entries. It thereby establishes a basis for further refinement of impact hazard evaluations, contributes to historical continuity in atmospheric observation, and enriches the study of meteoroid-generated infrasound signals on a global scale. Full article

21 pages, 360 KiB  
Article
Linear Dimensionality Reduction: What Is Better?
by Mohit Baliyan and Evgeny M. Mirkes
Data 2025, 10(5), 70; https://doi.org/10.3390/data10050070 - 6 May 2025
Abstract
This research paper focuses on dimensionality reduction, which is a major subproblem in any data processing operation. Dimensionality reduction based on principal components is the most used methodology. Our paper examines three heuristics, namely Kaiser’s rule, the broken stick, and the conditional number rule, for selecting informative principal components when using principal component analysis to reduce high-dimensional data to lower dimensions. This study uses 22 classification datasets and three classifiers, namely Fisher’s discriminant classifier, logistic regression, and K nearest neighbors, to test the effectiveness of the three heuristics. The results show that there is no universal answer to the best intrinsic dimension, but the conditional number heuristic performs better, on average. This means that the conditional number heuristic is the best candidate for automatic data pre-processing. Full article
(This article belongs to the Section Information Systems and Data Management)
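The three component-selection heuristics compared in this article are easy to prototype. The sketches below are generic textbook forms with an invented eigenvalue spectrum and an assumed condition-number threshold of 10; the paper's exact formulations may differ.

```python
def kaiser(eigvals):
    """Kaiser-Guttman rule: keep components above the average eigenvalue."""
    mean = sum(eigvals) / len(eigvals)
    return sum(1 for v in eigvals if v > mean)

def broken_stick(eigvals):
    """Keep leading components whose variance share beats the broken-stick model."""
    n, total = len(eigvals), sum(eigvals)
    stick = [sum(1 / j for j in range(i + 1, n + 1)) / n for i in range(n)]
    k = 0
    while k < n and eigvals[k] / total > stick[k]:
        k += 1
    return k

def condition_number(eigvals, kappa=10.0):
    """Keep components while eigval_1 / eigval_k stays below the threshold."""
    return sum(1 for v in eigvals if eigvals[0] / v < kappa)

spectrum = [5.0, 2.5, 1.2, 0.8, 0.5]   # descending PCA eigenvalues (invented)
print(kaiser(spectrum), broken_stick(spectrum), condition_number(spectrum))
# 2 1 4
```

As the toy spectrum shows, the three rules can disagree substantially on the same data, which is exactly the gap the study's benchmark addresses.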

17 pages, 6804 KiB  
Data Descriptor
Mineralogical and Geochemical Compositions of Sedimentary Rocks in the Gosau Group (Late Cretaceous), Grünbach–Neue Welt Area, Austria
by Xinxuan Xiang, Eun Young Lee, Erich Draganits and Michael Wagreich
Data 2025, 10(5), 69; https://doi.org/10.3390/data10050069 - 6 May 2025
Abstract
Sedimentary rocks of the Gosau Group in the Grünbach–Neue Welt area (Eastern Alps, Austria) were analyzed to determine their mineralogical and geochemical compositions. This study includes the following: (1) the identification of major minerals using X-ray diffraction (XRD), (2) the analysis of major, minor, and trace elements via X-ray fluorescence spectroscopy (XRF) and inductively coupled plasma mass spectrometry (ICP-MS), and (3) the quantification of total organic carbon (TOC), total nitrogen (TN), and total sulfur (TS) using an Elementar Unicube analyzer. Samples were collected from four artificial trenches and one outcrop in Maiersdorf, spanning the Grünbach and Piesting formations, deposited during a terrestrial-to-marine transition in the upper Santonian to Campanian (Late Cretaceous). The dominant minerals—quartz, muscovite, illite, and calcite—exhibit relative abundances corresponding to variations in major oxide concentrations. Minor elements show variability but generally follow consistent trends. Trace and rare earth elements display greater variability but similar patterns, with a broader distribution in the Grünbach Formation. Elevated TOC, TN, and TS values are observed near the formation boundary and in the Piesting Formation. These results offer a mineralogical and geochemical characterization of the strata and lay a foundation for further investigations into the paleoenvironmental and basin evolution of the Gosau Group in the region, providing a comparative framework for Gosau basins across the Eastern Alps. Full article

22 pages, 687 KiB  
Article
Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets
by Pedro Martins, Filipe Cardoso, Paulo Váz, José Silva and Maryam Abbasi
Data 2025, 10(5), 68; https://doi.org/10.3390/data10050068 - 5 May 2025
Abstract
Data cleaning remains one of the most time-consuming and critical steps in modern data science, directly influencing the reliability and accuracy of downstream analytics. In this paper, we present a comprehensive evaluation of five widely used data cleaning tools—OpenRefine, Dedupe, Great Expectations, TidyData (PyJanitor), and a baseline Pandas pipeline—applied to large-scale, messy datasets spanning three domains (healthcare, finance, and industrial telemetry). We benchmark each tool on dataset sizes ranging from 1 million to 100 million records, measuring execution time, memory usage, error detection accuracy, and scalability under increasing data volumes. Additionally, we assess qualitative aspects such as usability and ease of integration, reflecting real-world adoption concerns. We incorporate recent findings on parallelized data cleaning and highlight how domain-specific anomalies (e.g., negative amounts in finance, sensor corruption in industrial telemetry) can significantly impact tool choice. Our findings reveal that no single solution excels across all metrics; while Dedupe provides robust duplicate detection and Great Expectations offers in-depth rule-based validation, tools like TidyData and baseline Pandas pipelines demonstrate strong scalability and flexibility under chunk-based ingestion. The choice of tool ultimately depends on domain-specific requirements (e.g., approximate matching in finance and strict auditing in healthcare) and the magnitude of available computational resources. By highlighting each framework’s strengths and limitations, this study offers data practitioners clear, evidence-driven guidance for selecting and combining tools to tackle large-scale data cleaning challenges. Full article
(This article belongs to the Section Information Systems and Data Management)
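The chunk-based ingestion pattern credited above with good scalability can be sketched in pure Python (the record layout and the cleaning rule, dropping negative finance amounts, are invented for illustration): records are processed in fixed-size chunks so peak memory stays bounded regardless of total dataset size.

```python
def chunks(records, size):
    """Yield fixed-size slices of the input, one chunk at a time."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def clean_chunk(chunk):
    """Drop records with a negative amount (the finance anomaly noted above)."""
    return [r for r in chunk if r["amount"] >= 0]

raw = [{"amount": 10}, {"amount": -3}, {"amount": 7}, {"amount": -1},
       {"amount": 2}]
cleaned = [r for c in chunks(raw, size=2) for r in clean_chunk(c)]
print(len(cleaned))  # 3
```

With a real file, the same structure applies with a streaming reader in place of the in-memory list (for example, pandas' `read_csv(..., chunksize=N)` yields DataFrame chunks the same way).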

15 pages, 14645 KiB  
Data Descriptor
Tracking U.S. Land Cover Changes: A Dataset of Sentinel-2 Imagery and Dynamic World Labels (2016–2024)
by Antonio Rangel, Juan Terven, Diana-Margarita Córdova-Esparza, Julio-Alejandro Romero-González, Alfonso Ramírez-Pedraza, Edgar A. Chávez-Urbiola, Francisco J. Willars-Rodríguez and Gendry Alfonso-Francia
Data 2025, 10(5), 67; https://doi.org/10.3390/data10050067 - 4 May 2025
Abstract
Monitoring land cover changes is crucial for understanding how natural processes and human activities such as deforestation, urbanization, and agriculture reshape the environment. We introduce a publicly available dataset covering the entire United States from 2016 to 2024, integrating six spectral bands (Red, Green, Blue, NIR, SWIR1, and SWIR2) from Sentinel-2 imagery with pixel-level land cover annotations from the Dynamic World dataset. This combined resource provides a consistent, high-resolution view of the nation’s landscapes, enabling detailed analysis of both short- and long-term changes. To ease the complexities of remote sensing data handling, we supply comprehensive code for data loading, basic analysis, and visualization. We also demonstrate an example application—semantic segmentation with state-of-the-art models—to evaluate dataset quality and reveal challenges associated with minority classes. The dataset and accompanying tools facilitate research in environmental monitoring, urban planning, and climate adaptation, offering a valuable asset for understanding evolving land cover dynamics over time. Full article

4 pages, 482 KiB  
Data Descriptor
Zooplankton Standing Stock Biomass and Population Density: Data from Long-Term Studies Covering Changes in Trophy and Climate Impacts in a Deep Subalpine Lake (Lake Maggiore, Italy)
by Roberta Piscia, Rossana Caroni and Marina Manca
Data 2025, 10(5), 66; https://doi.org/10.3390/data10050066 - 2 May 2025
Abstract
Lake Maggiore is a deep subalpine lake that has been well studied since the last century thanks to a monitoring program funded by the International Commission for the Protection of Italian–Swiss Waters. The monitoring program comprises both abiotic and biotic parameters, including zooplankton pelagic organisms. In this study, we present a dataset of 15,563 records of population densities and standing stock biomass for zooplankton pelagic taxa recorded over 43 years (1981–2023). The long-term dataset is valuable for tracing changes in trophic conditions experienced by the lake during the last century (eutrophication and its reversal) and the impact of global warming. Zooplankton samples (Crustacea and Rotifera Monogononta) were collected within 0–50 m depth by vertical hauls with an 80 µm light plankton sampler. The sampling frequency was monthly, with the exception of the 2009–2012 period, which employed seasonal frequency. The estimation of zooplankton taxon abundance and of its standing stock biomass is crucial in order to quantify the flux of matter, energy, and pollutants up to the upper trophic levels of the food web. The dataset provided is also suitable for food web analysis because the zooplankton taxa have been classified according to their ecological roles (microphagous organisms; primary and secondary consumers). Full article

8 pages, 372 KiB  
Data Descriptor
Dataset on Food Waste in Households: The Case of Latvia
by Ilze Beitane, Sandra Iriste, Martins Sabovics, Gita Krumina-Zemture and Janis Jenzis
Data 2025, 10(5), 65; https://doi.org/10.3390/data10050065 - 30 Apr 2025
Abstract
This publication presents raw data from an online survey in Latvia that reflects households’ practices, opinions, attitudes, and social responsibility regarding food waste. A total of 1336 respondents (households) participated in the survey. The questionnaire consisted of three parts, with the first part focusing on daily food habits and shopping habits, the second part focusing on respondents’ opinions and social responsibility on food waste management, and the third part containing questions on the frequency of shopping for different product groups. The dataset presented in the publication includes survey questions and response options, as well as raw survey data that can be used to compare households’ food waste behavior across countries. The data can help policy makers make data-driven decisions or serve as the basis for further research. Full article

14 pages, 4526 KiB  
Data Descriptor
A Complementary Dataset of Scalp EEG Recordings Featuring Participants with Alzheimer’s Disease, Frontotemporal Dementia, and Healthy Controls, Obtained from Photostimulation EEG
by Aimilia Ntetska, Andreas Miltiadous, Markos G. Tsipouras, Katerina D. Tzimourta, Theodora Afrantou, Panagiotis Ioannidis, Dimitrios G. Tsalikakis, Konstantinos Sakkas, Emmanouil D. Oikonomou, Nikolaos Grigoriadis, Pantelis Angelidis, Nikolaos Giannakeas and Alexandros T. Tzallas
Data 2025, 10(5), 64; https://doi.org/10.3390/data10050064 - 29 Apr 2025
Abstract
Research interest in the application of electroencephalogram (EEG) as a non-invasive diagnostic tool for the automated detection of neurodegenerative diseases is growing. Open-access datasets have become crucial for researchers developing such methodologies. Our previously published open-access dataset of resting-state (eyes-closed) EEG recordings from patients with Alzheimer’s disease (AD), frontotemporal dementia (FTD), and cognitively normal (CN) controls has attracted significant attention. In this paper, we present a complementary dataset consisting of eyes-open photic stimulation recordings from the same cohort. The dataset includes recordings from 88 participants (36 AD, 23 FTD, and 29 CN) and is provided in Brain Imaging Data Structure (BIDS) format, promoting consistency and ease of use across research groups. Additionally, a fully preprocessed version is included, using EEGLAB-based pipelines that involve filtering, artifact removal, and Independent Component Analysis, preparing the data for machine learning applications. This new dataset enables the study of brain responses to visual stimulation across different cognitive states and supports the development and validation of automated classification algorithms for dementia detection. It offers a valuable benchmark for both methodological comparisons and biological investigations, and it is expected to significantly contribute to the fields of neurodegenerative disease research, biomarker discovery, and EEG-based diagnostics. Full article
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 2nd Edition)

53 pages, 1551 KiB  
Article
From Crisis to Algorithm: Credit Delinquency Prediction in Peru Under Critical External Factors Using Machine Learning
by Jomark Noriega, Luis Rivera, Jorge Castañeda and José Herrera
Data 2025, 10(5), 63; https://doi.org/10.3390/data10050063 - 28 Apr 2025
Abstract
Robust credit risk prediction in emerging economies increasingly demands the integration of external factors (EFs) beyond borrowers’ control. This study introduces a scenario-based methodology to incorporate EFs—namely COVID-19 severity (mortality and confirmed cases), climate anomalies (temperature deviations, weather-induced road blockages), and social unrest—into machine learning (ML) models for credit delinquency prediction. The approach is grounded in a CRISP-DM framework, combining stationarity testing (Dickey–Fuller), causality analysis (Granger), and post hoc explainability (SHAP, LIME), along with performance evaluation via AUC, ACC, KS, and F1 metrics. The empirical analysis uses nearly 8.2 million records compiled from multiple sources, including 367,000 credit operations granted to individuals and microbusiness owners by a regulated Peruvian financial institution (FMOD) between January 2020 and September 2023. These data also include time series of delinquency by economic activity, external factor indicators (e.g., mortality, climate disruptions, and protest events), and their dynamic interactions assessed through Granger causality to evaluate both the intensity and propagation of external shocks. The results confirm that EF inclusion significantly enhances model performance and robustness. Time-lagged mortality (COVID MOV) emerges as the most powerful single predictor of delinquency, while compound crises (climate and unrest) further intensify default risk—particularly in portfolios without public support. Among the evaluated models, CNN and XGB consistently demonstrate superior adaptability, defined as their ability to maintain strong predictive performance across diverse stress scenarios—including pandemic, climate, and unrest contexts—and to dynamically adjust to varying input distributions and portfolio conditions. Post hoc analyses reveal that EF effects dynamically interact with borrower income, indebtedness, and behavioral traits.
This study provides a scalable, explainable framework for integrating systemic shocks into credit risk modeling. The findings contribute to more informed, adaptive, and transparent lending decisions in volatile economic contexts, relevant to financial institutions, regulators, and risk practitioners in emerging markets. Full article
(This article belongs to the Section Information Systems and Data Management)
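The abstract's finding that time-lagged mortality is the strongest single predictor implies a simple feature-engineering step: shifting the external factor series before model fitting. A minimal pandas sketch with an invented monthly panel (the column names and values are illustrative, not the FMOD data):

```python
import pandas as pd

# Hypothetical monthly panel: delinquency rate plus external factor (EF)
# indicators. All values invented for illustration.
df = pd.DataFrame({
    "delinquency": [0.021, 0.025, 0.031, 0.043, 0.040, 0.038],
    "covid_mortality": [120, 340, 910, 2300, 1800, 900],
    "road_blockages": [2, 1, 5, 8, 3, 2],
})

# Build time-lagged EF features (mortality lagged 1-2 months), mirroring
# the idea that shocks propagate into delinquency with a delay
for lag in (1, 2):
    df[f"covid_mortality_lag{lag}"] = df["covid_mortality"].shift(lag)

# Drop the rows made incomplete by lagging before fitting any ML model
model_input = df.dropna().reset_index(drop=True)
print(model_input.shape)
```

The lagged columns would then enter the ML models alongside borrower-level features; the study's actual lag selection (via Granger causality) is more involved than this sketch.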

10 pages, 1175 KiB  
Data Descriptor
A Dataset for Examining the Problem of the Use of Accounting Semi-Identity-Based Models in Econometrics
by Francisco Javier Sánchez-Vidal
Data 2025, 10(5), 62; https://doi.org/10.3390/data10050062 - 28 Apr 2025
Abstract
The problem of using accounting semi-identity-based (ASI) models in Econometrics can be severe in certain circumstances, and estimations from OLS regressions in such models may not accurately reflect causal relationships. This dataset was generated through Monte Carlo simulations, which allowed for the precise control of a causal relationship. The problem of an ASI cannot be directly demonstrated in real samples, as researchers lack insight into the specific factors driving each company’s investment policy. Consequently, it is impossible to distinguish whether regression results in such datasets stem from actual causality or are merely a byproduct of arithmetic distortions introduced by the ASI. The strategy of addressing this issue through simulations allows researchers to determine the true value of any estimator with certainty. The selected model for testing the influence of the ASI problem is the investment-cash flow sensitivity model (Fazzari, Hubbard and Petersen (FHP hereinafter) (1988)), which seeks to establish a relationship between a company’s investments and its cash flows and which is an ASI as well. The dataset includes randomly generated independent variables (cash flows and Tobin’s Q) to analyze how they influence the dependent variable (investment). The Monte Carlo methodology in Stata enabled repeated sampling to assess how ASIs affect regression models, highlighting their impact on variable relationships and the unreliability of estimated coefficients. The purpose of this paper is twofold: its first goal is to provide a deeper explanation of the syntax in the related article, offering more insights into the ASI problem. The openly available dataset supports replication and further research on ASIs’ effects in economic models and can be adapted for other ASI-based analyses, as the reusability examples demonstrate. 
Second, our aim is to encourage research supported by Monte Carlo simulations, as they enable the modeling of a comprehensive ecosystem of economic relationships between variables. This allows researchers to address a variety of issues, such as partial correlations, heteroskedasticity, multicollinearity, autocorrelation, endogeneity, and more, while testing their impact on the true value of coefficients. Full article
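The ASI mechanism described above can be reproduced in a few lines: if investment is defined mechanically as cash flow minus its other uses, OLS recovers a cash flow coefficient near one even though no behavioral investment-cash flow sensitivity was simulated. A NumPy sketch (the article's own simulations are written in Stata and are richer than this):

```python
import numpy as np

rng = np.random.default_rng(42)
n_firms, n_reps = 1000, 200
betas = []
for _ in range(n_reps):
    # Independent draws: by construction there is NO behavioral link
    # from cash flow or Tobin's Q to investment decisions
    cash_flow = rng.normal(size=n_firms)
    tobins_q = rng.normal(size=n_firms)
    other_uses = rng.normal(size=n_firms)
    # Accounting semi-identity: investment is mechanically whatever is
    # left of cash flow after its other uses
    investment = cash_flow - other_uses
    # OLS of investment on cash flow and Tobin's Q
    X = np.column_stack([np.ones(n_firms), cash_flow, tobins_q])
    coef, *_ = np.linalg.lstsq(X, investment, rcond=None)
    betas.append(coef[1])

# The cash flow coefficient clusters near 1.0 despite no causal channel
print(round(float(np.mean(betas)), 2))
```

The significant, near-unit coefficient here is pure arithmetic, which is exactly the distortion the dataset is built to expose.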

12 pages, 2485 KiB  
Data Descriptor
Time-Course Transcriptomic Dataset of Gallic Acid-Induced Human Cervical Carcinoma HeLa Cell Death
by Ho Man Tang and Peter Chi Keung Cheung
Data 2025, 10(5), 61; https://doi.org/10.3390/data10050061 - 28 Apr 2025
Abstract
Gallic acid is a natural phenolic acid that displays potent anti-cancer activity in a large variety of cell types and rodent cancer xenograft models. Although research has focused on determining the efficacy of gallic acid against various types of human cancer cells, the molecular mechanisms governing the anti-cancer properties of gallic acid remain largely unclear, and a transcriptomic study of gallic acid-induced cancer cell death has rarely been reported. Therefore, we applied time-course bulk RNA-sequencing to elucidate the molecular signature of gallic acid-induced cell death in human cervical cancer HeLa cells, as this is a widely used in vitro model in the field. Our RNA-sequencing dataset covers the early (2nd hour), middle (4th, 6th hour), and late (9th hour) stages of the cell death process after exposure of HeLa cells to gallic acid, and the untreated (0th hour) cells served as controls. Differential expression of messenger RNAs (mRNAs) and long non-coding RNAs (lncRNAs) was identified at each time point in the dataset. In summary, this dataset is a unique and valuable resource with which the scientific community can explore the molecular mechanisms and identify druggable regulators of the gallic acid-induced cell death process in cancer. Full article
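Differential expression across such a time course is commonly summarized as the log2 fold change of each later time point against the untreated (0th hour) control. A minimal NumPy/pandas sketch with invented counts (the gene names and values are placeholders, and the authors' actual differential expression workflow is not specified here):

```python
import numpy as np
import pandas as pd

# Toy expression matrix: mean counts per gene at the sampled time points
# (0 h untreated control; 2, 4, 6, 9 h after gallic acid). Values invented.
expr = pd.DataFrame(
    {"0h": [100, 50, 200], "2h": [110, 45, 400],
     "4h": [130, 20, 800], "6h": [90, 10, 900], "9h": [40, 5, 950]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# log2 fold change of each time point vs. the 0 h control
# (pseudocount of 1 avoids log of zero)
log2fc = np.log2(expr.add(1)).sub(np.log2(expr["0h"] + 1), axis=0)
print(log2fc.round(2))
```

Real RNA-seq analyses would use a dedicated method with replicate-aware statistics; this only shows the fold-change bookkeeping across the early, middle, and late stages of the time course.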

22 pages, 2020 KiB  
Article
A Synergistic Bridge Between Human–Computer Interaction and Data Management Within CDSS
by Ali Azadi and Francisco José García-Peñalvo
Data 2025, 10(5), 60; https://doi.org/10.3390/data10050060 - 26 Apr 2025
Abstract
Clinical Decision Support Systems (CDSSs) have become indispensable in medical decision-making. The heterogeneity and vast volume of medical data require firm attention to data management and integration strategies. On the other hand, CDSS functionality must be enhanced through improved human–computer interaction (HCI) principles. This study investigates the bidirectional relationship between data management practices (specifically data entry management, data transformation, and data integration) and HCI principles within CDSSs. Through a novel framework and practical case studies, we demonstrate how high-quality data entry, driven by controlled workflows and automated technologies, is crucial for system usability and reliability. We explore the transformative positive impact of robust data management techniques, including standardization, normalization, and advanced integration solutions, on the HCI elements and overall system performance. Conversely, we illustrate how effective HCI design improves data quality by reducing cognitive load, minimizing errors, and fostering user engagement. The findings reveal a synergistic relationship between HCI and data science, providing actionable insights for designing intuitive and efficient CDSSs. This research bridges the gap between technical and human-centric approaches, advancing CDSS usability, decision accuracy, and clinician trust for better patient outcomes. Full article
(This article belongs to the Section Information Systems and Data Management)

28 pages, 11666 KiB  
Data Descriptor
Introducing UWF-ZeekData24: An Enterprise MITRE ATT&CK Labeled Network Attack Traffic Dataset for Machine Learning/AI
by Marshall Elam, Dustin Mink, Sikha S. Bagui, Russell Plenkers and Subhash C. Bagui
Data 2025, 10(5), 59; https://doi.org/10.3390/data10050059 - 25 Apr 2025
Abstract
This paper describes the creation of a new dataset, UWF-ZeekData24, aligned with the Enterprise MITRE ATT&CK Framework, that addresses critical shortcomings in existing network security datasets. Controlling the construction of attacks and meticulously labeling the data provides a more accurate and dynamic environment for testing IDS/IPS systems and their machine learning algorithms. The outcomes of this research will assist in the development of cybersecurity solutions as well as increase their robustness and adaptability to modern-day cybersecurity threats. This new, carefully engineered dataset will enhance cyber defense mechanisms that safeguard critical infrastructures and digital assets. Finally, this paper discusses the differences between crowd-sourced data and data collected in a more controlled environment. Full article
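Labeling network traffic against the MITRE ATT&CK Framework amounts to joining connection records with a record of which sessions belonged to which attack stage. A deliberately simplified Python sketch: the field names follow Zeek's conn.log, but the rows, the `uid` values, and the uid-to-tactic map are all invented, and the dataset's actual labeling procedure is far more meticulous.

```python
import csv
import io

# Fake conn.log excerpt in CSV form (real Zeek logs are TSV with a header
# block); uids and addresses are placeholders, not dataset values
conn_log = io.StringIO(
    "uid,id.orig_h,id.resp_h,proto\n"
    "uid-001,10.0.0.5,10.0.0.9,tcp\n"
    "uid-002,10.0.0.7,10.0.0.9,udp\n"
)

# Hypothetical map of sessions known (from controlled attack construction)
# to belong to an ATT&CK tactic; everything else is treated as benign
tactic_by_uid = {"uid-001": "Reconnaissance"}

labeled = []
for row in csv.DictReader(conn_log):
    row["mitre_attack"] = tactic_by_uid.get(row["uid"], "none")
    labeled.append(row)

print([r["mitre_attack"] for r in labeled])
```

Because the attacks were constructed in a controlled environment, this join can be exact, which is the labeling advantage the abstract contrasts with crowd-sourced data.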

25 pages, 14600 KiB  
Article
Using Visualization to Evaluate the Performance of Algorithms for Multivariate Time Series Classification
by Edgar Acuña and Roxana Aparicio
Data 2025, 10(5), 58; https://doi.org/10.3390/data10050058 - 24 Apr 2025
Abstract
In this paper, we use visualization tools to give insight into the performance of six classifiers on multivariate time series data. Five of these classifiers are deep learning models, while the ROCKET classifier represents a non-deep learning approach. Our comparison is conducted across fifteen datasets from the UEA repository. Additionally, we apply data engineering techniques to each dataset, allowing us to assess classifier performance with respect to the available features and channels within the time series. The results of our experiments indicate that the ROCKET classifier consistently achieves strong performance across most datasets, while the Transformer model underperforms, likely due to the limited number of instances per class in certain datasets. Full article
(This article belongs to the Topic Future Trends and Challenges in Data Mining Technology)
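ROCKET's core idea is to convolve a series with many random kernels and pool each kernel's output into simple features (the original algorithm uses the maximum and the proportion of positive values, PPV) for a linear classifier. A deliberately minimal NumPy sketch of the transform, omitting the dilation, bias, and padding that the real algorithm randomizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def rocket_features(series, n_kernels=100):
    """Minimal ROCKET-style transform: random convolutional kernels,
    keeping each kernel's max response and PPV. Omits dilation/bias."""
    feats = []
    for _ in range(n_kernels):
        length = rng.choice([7, 9, 11])          # random kernel length
        weights = rng.standard_normal(length)
        weights -= weights.mean()                # zero-centered weights
        conv = np.convolve(series, weights, mode="valid")
        feats.append(conv.max())                 # max pooling
        feats.append((conv > 0).mean())          # proportion of positives
    return np.array(feats)

x = np.sin(np.linspace(0, 10, 200))              # toy univariate series
f = rocket_features(x)
print(f.shape)                                   # 2 features per kernel
```

A ridge or logistic classifier fitted on these features is then the whole model, which is why ROCKET stays competitive with deep learning at a fraction of the training cost.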

10 pages, 1490 KiB  
Data Descriptor
The Long-Term Annual Datasets for Azov Sea Basin Ecosystems for 1925–2024 and Russian Sturgeon Occurrences in 2000–2024
by Mikhail M. Piatinskii, Dmitrii G. Bitiutskii, Arsen V. Mirzoyan, Valerii A. Luzhniak, Vladimir N. Belousov, Dmitry F. Afanasyev, Svetlana V. Zhukova, Sergey N. Kulba, Lyubov A. Zhivoglyadova, Dmitrii V. Hrenkin, Tatjana I. Podmareva, Polina M. Cherniavksaia, Dmitrii S. Burlachko, Nadejda S. Elfimova, Olga V. Kirichenko and Inna D. Kozobrod
Data 2025, 10(5), 57; https://doi.org/10.3390/data10050057 - 24 Apr 2025
Abstract
The abundance of the Russian sturgeon population in the Sea of Azov declined dramatically over the 20th and 21st centuries. This paper presents long-term annual and spatial occurrence datasets for building statistical and machine learning models to better understand the species' distribution patterns as well as its biological and ecological features. The annual dataset provides annually averaged environmental and biotic population estimates obtained by in situ observations for 1925–2024. The spatial occurrence dataset contains raw survey observations made with a bottom trawl over the period 2000–2024. Preliminary diagnostics of the annual dataset reveal no evidence of non-stationarity or significant outliers that cannot be explained by biological parameters. The published datasets allow any researcher to perform statistical and machine learning-based analyses in order to compare and describe the population abundance or spatial occurrence of Russian sturgeon in the Sea of Azov. Full article
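One form the preliminary outlier diagnostics mentioned above can take is a robust z-score screen on the annual series. A NumPy sketch using the median absolute deviation, on an invented series (the actual diagnostics applied to the dataset are not detailed here):

```python
import numpy as np

# Invented annual abundance index; NOT the published sturgeon data
abundance = np.array([5.1, 4.8, 4.9, 5.3, 4.7, 5.0, 4.6, 4.9, 5.2, 4.8])

# Robust z-scores based on median and MAD (0.6745 rescales MAD so the
# scores are comparable to ordinary z-scores under normality)
med = np.median(abundance)
mad = np.median(np.abs(abundance - med))
robust_z = 0.6745 * (abundance - med) / mad

# Flag observations more than 3 robust z-scores from the median
outliers = np.where(np.abs(robust_z) > 3)[0]
print(outliers.size)
```

Flagged years would then be checked against known biological events before being treated as data errors, matching the abstract's caveat about outliers explainable by biological parameters.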

8 pages, 1783 KiB  
Data Descriptor
Orange Leaves Images Dataset for the Detection of Huanglongbing
by Juan Carlos Torres-Galván, Paul Hernández Herrera, Juan Antonio Obispo, Xocoyotzin Guadalupe Ávila Cruz, Liliana Montserrat Camacho Ibarra, Paula Magaldi Morales Orosco, Alfonso Alba, Edgar R. Arce-Santana, Valdemar Arce-Guevara, J. S. Murguía, Edgar Guevara and Miguel G. Ramírez-Elías
Data 2025, 10(5), 56; https://doi.org/10.3390/data10050056 - 23 Apr 2025
Abstract
In agriculture, the use of machine learning (ML) and deep learning (DL) has increased significantly in the last few years. The use of ML and DL for image classification in plant disease detection has generated significant interest due to its low cost, automation, scalability, and capacity for early detection. However, high-quality image datasets are required to train robust classifier models for plant disease detection. In this work, we have created an image dataset of 649 orange leaves divided into two groups: control (n = 379) and huanglongbing (HLB) disease (n = 270). The images were acquired with several high-resolution smartphone cameras and processed to remove the background. The dataset enriches the information on the characteristics and symptoms of citrus leaves with HLB and healthy leaves. This makes the dataset potentially valuable for disease identification through leaf segmentation and abnormality detection, particularly when applying ML and DL models. Full article
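Background removal of the kind described can be sketched as masking pixels where the green channel does not dominate. The NumPy fragment below uses a tiny synthetic "image" and a crude color rule; the authors' actual processing method is not specified in the abstract, so everything here is an illustrative assumption.

```python
import numpy as np

# Tiny fake RGB "photo" standing in for a smartphone image of a leaf
rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Crude leaf mask: keep pixels where green exceeds both red and blue
# (cast to int so the comparisons are not done in overflow-prone uint8)
r = img[..., 0].astype(int)
g = img[..., 1].astype(int)
b = img[..., 2].astype(int)
leaf_mask = (g > r) & (g > b)

# Zero out background pixels, as in the dataset's background removal
segmented = img * leaf_mask[..., None]
print(segmented.shape)
```

A real pipeline would likely use a proper segmentation model or color-space thresholding (e.g., in HSV) plus morphology, but the masking step has this shape regardless.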
