Data, Volume 10, Issue 11 (November 2025) – 26 articles

Cover Story: The Cramér–von Mises statistic tests whether data follow a theoretical distribution. An accurate probability is obtained from a Monte Carlo simulation. Here, for sample sizes from 2 to 30, 21 replicates of very large size (5,120,000,000) have been generated, allowing accurate permilles of the CM statistic to be obtained. Variability increases from smaller to larger values of the CM statistic in the MC experiment; however, the standard deviation shows that the estimation noise stays below 10⁻⁴ most of the time. This permille-level precision enables precise critical values and p-values, improving confidence in hypothesis testing in quality control, bioinformatics, and financial modeling, to give only some examples. View this paper
  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the tables of contents of newly released issues.
  • PDF is the official format for papers published in both HTML and PDF forms. To view a paper in PDF format, click on the "PDF Full-text" link and use the free Adobe Reader to open it.
13 pages, 2259 KB  
Data Descriptor
Sampling the Darcy Friction Factor Using Halton, Hammersley, Sobol, and Korobov Sequences: Data Points from the Colebrook Relation
by Dejan Brkić and Marko Milošević
Data 2025, 10(11), 193; https://doi.org/10.3390/data10110193 - 20 Nov 2025
Viewed by 327
Abstract
When the Colebrook equation is used in its original implicit form, the unknown pipe flow friction factor can only be obtained through time-consuming and computationally demanding iterative calculations. The empirical Colebrook equation relates the unknown Darcy friction factor to a known Reynolds number and a known relative roughness of a pipe’s inner surface. It is widely used in engineering. To simplify computations, a variety of explicit approximations have been developed, the accuracy of which must be carefully evaluated. For this purpose, this Data Descriptor gives a sufficient number of pipe flow friction factor values computed using a highly accurate iterative algorithm to solve the implicit Colebrook equation. These values serve as reference data, spanning the range relevant to engineering applications, and provide benchmarks for evaluating the accuracy of the approximations. The sampling points within the datasets are distributed in a way that minimizes gaps in the data. In this study, a Python script (v1) was used to generate quasi-random samples, including Halton, Hammersley, Sobol, and deterministic lattice-based Korobov samples, which produce smaller gaps than purely random samples generated for comparison purposes. Using these sequences, a total of 2²⁰ = 1,048,576 data points were generated, and the corresponding datasets are provided in the Zenodo repository. When a smaller subset of points is needed, the required number of initial points from these sequences can be used directly. Full article
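For readers who want to reproduce this kind of sampling, a minimal sketch is given below: it draws Halton points over the (Re, relative roughness) plane with scipy.stats.qmc and solves the implicit Colebrook relation by fixed-point iteration. The parameter ranges and the iteration scheme are illustrative assumptions, not the descriptor's exact script.

```python
import numpy as np
from scipy.stats import qmc

def colebrook(re, rr, tol=1e-12, max_iter=100):
    """Fixed-point iteration on x = 1/sqrt(f): x = -2 log10(rr/3.7 + 2.51*x/Re)."""
    x = 8.0  # initial guess for 1/sqrt(f)
    for _ in range(max_iter):
        x_new = -2.0 * np.log10(rr / 3.7 + 2.51 * x / re)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return 1.0 / x_new**2  # Darcy friction factor f

sampler = qmc.Halton(d=2, scramble=False)  # Sobol/Hammersley work analogously
u = sampler.random(1024)                   # low-discrepancy points in the unit square
# Assumed engineering ranges: Re in [4e3, 1e8] (log scale), rr in [1e-6, 1e-1]
re = 10.0 ** (np.log10(4e3) + u[:, 0] * (np.log10(1e8) - np.log10(4e3)))
rr = 10.0 ** (-6.0 + u[:, 1] * 5.0)
f = np.array([colebrook(r, k) for r, k in zip(re, rr)])
```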
16 pages, 594 KB  
Article
A Data-Driven Analysis of Cognitive Learning and Illusion Effects in University Mathematics
by Rodolfo Bojorque, Fernando Moscoso, Miguel Arcos-Argudo and Fernando Pesántez
Data 2025, 10(11), 192; https://doi.org/10.3390/data10110192 - 19 Nov 2025
Viewed by 441
Abstract
The increasing adoption of video-based instruction and digital assessment in higher education has reshaped how students interact with learning materials. However, it also introduces cognitive and behavioral biases that challenge the accuracy of self-perceived learning. This study aims to bridge the gap between perceived and actual learning by investigating how illusion learning—an overestimation of understanding driven by the fluency of instructional media and autonomous study behaviors—affects cognitive performance in university mathematics. Specifically, it examines how students’ performance evolves across Bloom’s cognitive domains (Understanding, Application, and Analysis) from midterm to final assessments. This paper presents a data-driven investigation that combines the theoretical framework of illusion learning, the tendency to overestimate understanding based on the fluency of instructional media, with empirical evidence drawn from a structured and anonymized dataset of 294 undergraduate students enrolled in a Linear Algebra course. The dataset records midterm and final exam scores across three cognitive domains (Understanding, Application, and Analysis) aligned with Bloom’s taxonomy. Through paired-sample testing, descriptive analytics, and visual inspection, the study identifies significant improvement in analytical reasoning, moderate progress in application, and persistent overconfidence in self-assessment. These results suggest that while students develop higher-order problem-solving skills, a cognitive gap remains between perceived and actual mastery. Beyond contributing to the theoretical understanding of metacognitive illusion, this paper provides a reproducible dataset and analysis framework that can inform future work in learning analytics, educational psychology, and behavioral modeling in higher education. Full article
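A minimal sketch of the paired-sample testing this abstract describes, assuming a flat CSV with one midterm and one final score column per Bloom domain (all file and column names are hypothetical):

```python
import pandas as pd
from scipy.stats import ttest_rel

df = pd.read_csv("linear_algebra_scores.csv")  # hypothetical file, one row per student
for domain in ["understanding", "application", "analysis"]:
    t, p = ttest_rel(df[f"{domain}_midterm"], df[f"{domain}_final"])
    print(f"{domain}: t = {t:.2f}, p = {p:.4f}")
```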
26 pages, 2492 KB  
Data Descriptor
A Mexican Enhanced Dataset of Pollutant Releases and Transfers (2004 to 2022) with IARC Cancer Classifications
by Hugo G. Reyes-Anastacio, Ivan Lopez-Arevalo, Jose L. Gonzalez-Compean, Melesio Crespo-Sanchez, Jaqueline Calderon and Heriberto Aguirre-Meneses
Data 2025, 10(11), 191; https://doi.org/10.3390/data10110191 - 19 Nov 2025
Viewed by 378
Abstract
As a member of the North American Free Trade Agreement, the Mexican Ministry of Environment and Natural Resources publishes the Pollutant Releases and Transfers Registry of Substances annually, in accordance with the Official Mexican Standard NOM-165-Semarnat-2013. This registry comprises 19 datasets (one per year, from 2004 to 2022). These have not preserved the same structure and categorical values, making it difficult to fuse them with other datasets and to conduct exploratory studies. These datasets contain (a) data on substances released and transferred to the environment and (b) data on producer facilities. They do not include additional data to support any other kind of query, so users must create adapted versions of these datasets to carry out isolated analyses. This paper describes a method for integrating the Pollutant Release and Transfer Registry dataset, enhanced with facilities data and cancer classifications from the International Agency for Research on Cancer, to produce an improved and augmented public data source for academic or research purposes. The resulting database contains geospatial information, which enabled us to analyze the dataset at the state or municipal level and to create digital products that can inform decisions about environmental pollution. Full article
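A hedged sketch of the integration step described above, assuming one CSV per registry year and a substance-keyed IARC lookup table (file and column names are hypothetical):

```python
import pandas as pd

# 19 annual PRTR files, 2004-2022 (file and column names are hypothetical)
prtr = pd.concat(
    [pd.read_csv(f"retc_{year}.csv") for year in range(2004, 2023)],
    ignore_index=True,
)
iarc = pd.read_csv("iarc_classifications.csv")   # substance -> IARC group (1, 2A, ...)
enhanced = prtr.merge(iarc, on="substance", how="left")
```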
2 pages, 113 KB  
Correction
Correction: Zhao, Q.; Wentz, E.A. A MODIS/ASTER Airborne Simulator (MASTER) Imagery for Urban Heat Island Research. Data 2016, 1, 7
by Qunshan Zhao and Elizabeth A. Wentz
Data 2025, 10(11), 190; https://doi.org/10.3390/data10110190 - 19 Nov 2025
Viewed by 116
Abstract
Additional Affiliation(s) [...] Full article
16 pages, 225 KB  
Data Descriptor
Increasing the Usability of the American Time Use Survey: IPUMS ATUS
by Kari C. W. Williams, Sarah M. Flood, Liana C. Sayer and Julia A. Rivera Drew
Data 2025, 10(11), 189; https://doi.org/10.3390/data10110189 - 14 Nov 2025
Viewed by 626
Abstract
This paper describes IPUMS ATUS, which simplifies the use of time diary data by disseminating a harmonized and enhanced version of the American Time Use Survey (ATUS). The ATUS time diary data capture the detailed activities over a 24 h period for thousands of respondents along with their sociodemographic characteristics. The ability to measure, at a population level, how people spend their time provides nearly endless possibilities for examining questions that hinge on understanding human behavior. The flexible data format can be used to estimate time use as captured by stylized survey questions (e.g., sleep duration, work hours), but it also allows for the study of activity sequencing and the context of time use (e.g., where it happens, who else is present). However, wrangling the complex, hierarchical record structure of the data requires advanced programming skills. To address these challenges, IPUMS ATUS harmonizes the ATUS data and provides customization tools that allow researchers to (i) combine data from multiple original ATUS files and (ii) easily create and save custom variables that summarize time use utilizing the full array of contextual information spread across the complex record structure of the ATUS. Full article
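As an illustration of the custom summary variables IPUMS ATUS generates for users, the sketch below sums activity durations under a context filter with pandas; the flat file, column names, and the activity code for television watching are assumptions, not the actual IPUMS extract format:

```python
import pandas as pd

act = pd.read_csv("atus_activities.csv")   # hypothetical flat activity-level file
# Hypothetical custom variable: minutes of TV watching at home, per respondent
tv_at_home = (
    act[(act["activity"] == 120303) & (act["location"] == "home")]
    .groupby("caseid")["duration"]
    .sum()
    .rename("tv_minutes_at_home")
)
```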
22 pages, 1540 KB  
Article
Building Data Literacy for Sustainable Development: A Framework for Effective Training
by Raed A. T. Said, Kassim S. Mwitondi, Leila Benseddik and Laroussi Chemlali
Data 2025, 10(11), 188; https://doi.org/10.3390/data10110188 - 11 Nov 2025
Viewed by 542
Abstract
As the transformative influence of novel technologies sweeps across industries, organisations are called upon to position their staff in an equally dynamic operational environment, which includes embedding technical and legal communication skills in their training programs. For many organisations, internal and external communication of data modelling and related concepts, reporting, and monitoring still pose major challenges. The aim of this research is to develop an effective data training framework for learners with or without mathematical or computational maturity. It also addresses subtle aspects such as the legal and ethical implications of dealing with organisational data. Data was collected from a training course in Python, delivered to government employees in different departments in the United Arab Emirates (UAE). A structured questionnaire was designed to measure the effectiveness of the Python training program, from the employees’ perspective, based on three key attributes: their personal characteristics, professional characteristics, and technical knowledge. A descriptive analysis of aggregations, deviations, and proportions was used to describe the data attributes gathered for the study. The main findings revealed a substantial knowledge gap across disciplines regarding the core skills of big data analytics. In addition, the findings highlighted that prior knowledge of statistical methods of data analysis, along with prior programming knowledge, made it easier for employees to gain skills in data analytics. While the results of this study showed that the training program was beneficial for the vast majority of participants, survey responses indicate that providing a solid grounding in technical communication and in legal and ethical aspects would offer significant value in the big data analytics field. Based on the findings, we make recommendations for adapting conventional data analytics approaches to align with the complexity of attaining the non-orthogonal United Nations Sustainable Development Goals (SDGs). Associations of selected survey responses with key data attributes highlight the vital roles that technology and data-driven skills will play in ensuring a more prosperous and sustainable future for all. Full article
19 pages, 1054 KB  
Article
Perspectives on Research and Personalized Healthcare in the Context of Federated FAIR Data Based on an Exploratory Study by Medical Researchers
by Elena Poenaru, Monica Dugăeşescu, Călin Poenaru, Iulia Andrei-Bitere, Livia-Cristiana Băicoianu-Niţescu, Traian-Vasile Constantin, Aurelian Zugravu, Brandusa Bitel, Maria Magdalena Constantin and Smaranda Stoleru
Data 2025, 10(11), 187; https://doi.org/10.3390/data10110187 - 11 Nov 2025
Viewed by 402
Abstract
Background: Research in personalized medicine, with applications in oncology, dermatology, cardiology, urology, and general healthcare, requires facile and safe access to accurate data. Due to its particularly sensitive character, obtaining health-related data, storing it in repositories, and federating it are challenging, especially in the context of open science and FAIR data. Methods: An online survey was conducted among medical researchers to gain insights into their knowledge and experience regarding the following topics: health data repositories and data federation, as well as their opinions regarding data sharing and their willingness to participate in sharing data. Results: The survey was completed by 189 respondents, the majority of whom were attending physicians and PhD candidates. Most of them acknowledged the complex, beneficial implications of data federation in the medical field but had concerns about data protection, with 75% declaring that they would agree to share data. A general lack of awareness (80%) about the importance of interoperability for federated data repositories was observed. Conclusions: Implementing federated data repositories in the health field requires thorough understanding, knowledge, and collaboration, enabling translational medicine to reach its full potential. Understanding the needs of all involved parties can shape the success of medical data federation initiatives, with this study serving as a foundation for further research. Full article
(This article belongs to the Special Issue Data Management in Life Sciences)
11 pages, 229 KB  
Data Descriptor
A Thirty-Day Dataset of Malicious HTTP Requests Blocked by OWASP ModSecurity on a Production Web Server
by Geza Lucz and Bertalan Forstner
Data 2025, 10(11), 186; https://doi.org/10.3390/data10110186 - 11 Nov 2025
Viewed by 752
Abstract
We present a real-world dataset capturing thirty consecutive days of malicious HTTP traffic filtered and blocked by the OWASP ModSecurity Web Application Firewall (WAF) on a live production server. Each entry corresponds to a request that triggered one or more rules in the OWASP Core Rule Set (CRS), resulting in its inclusion in the audit log due to suspected exploitation attempts. The dataset includes attack categories such as SQL injection, cross-site scripting (XSS), local file inclusion, scanner probes, and various malformed or evasive input forms. The data has been carefully anonymized to protect sensitive information while preserving critical structural tags, including request method, URI, triggered rule IDs, request headers, and user-agent strings. This dataset provides a real-world resource for cybersecurity researchers, particularly those developing or evaluating intrusion detection systems (IDSs), WAF rule tuning strategies, anomaly detection algorithms, and adversarial machine learning models. The dataset also allows performance testing of threat prevention pipelines. By making this dataset publicly available, we aim to support reproducible research in web security, encourage benchmarking of detection techniques under real-world conditions, and contribute insight into the nature of contemporary web-based threats observed in an uncontrolled environment. Full article
(This article belongs to the Section Information Systems and Data Management)
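A small sketch of how such a dataset might be consumed, assuming one JSON object per blocked request with a list of triggered CRS rule IDs (the field names are hypothetical, not the dataset's actual schema):

```python
import json
from collections import Counter

counts = Counter()
with open("modsec_30day.jsonl", encoding="utf-8") as fh:
    for line in fh:
        entry = json.loads(line)
        counts.update(entry.get("rule_ids", []))   # tally triggered CRS rules
print(counts.most_common(10))
```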
13 pages, 1217 KB  
Article
Photodissociation Processes Involving the SiH+ Molecular Ion: New Datasets for Modeling
by V. A. Srećković, H. Delibašić-Marković, L. M. Ignjatović, V. Petrović and V. Vujčić
Data 2025, 10(11), 185; https://doi.org/10.3390/data10110185 - 7 Nov 2025
Viewed by 428
Abstract
This paper investigates the photodissociation of the SiH+ molecular ion, a non-symmetric diatomic species composed of silicon and hydrogen. We provide calculated molecular data and characterize electronic states, deriving cross-sections and spectral absorption rate coefficients as functions of temperature (1000–10,000 K) and EUV and UV wavelength. The calculations are performed within a quantum–mechanical framework of bound–free radiative transitions, using ab initio electronic potentials and dipole transition functions as inputs. In addition, we present a straightforward fitting formula that enables practical interpolation of photodissociation cross-sections and spectral rate coefficients, providing a novel closed-form representation of the dataset for modeling purposes. The resulting dataset provides a consistent and accessible reference for advanced photochemical modeling in laboratory plasmas and astrophysical environments. Full article
29 pages, 1971 KB  
Article
Resilience of Scientific Collaboration Networks in Young Universities Based on Bibliometric and Network Analysis
by Oleksandr Kuchanskyi, Yurii Andrashko, Andrii Biloshchytskyi, Aidos Mukhatayev, Svitlana Biloshchytska and Firuza Numanova
Data 2025, 10(11), 184; https://doi.org/10.3390/data10110184 - 7 Nov 2025
Viewed by 594
Abstract
The resilience of scientific collaboration networks is a key factor in ensuring the long-term academic development of young universities. This study examines the resilience of scientific collaboration networks among young universities based on bibliometric and network analysis. Based on bibliometric data from the open database OpenAlex (as of September 2025, the database contains over 271 million scientific publications and 105 million authors), weighted undirected co-authorship graphs were constructed for four young universities from China, Kazakhstan, and the United Kingdom: Astana IT University, AITU (founded in 2019), Nazarbayev University, NU (2010), University of Suffolk, US (2007), and ShanghaiTech University, STU (2013). Key resilience indicators were calculated, including clustering coefficients, assortativity, modularity, and the dynamics of the largest connected component under different node removal scenarios. The study revealed that NU and STU have a highly resilient structure of scientific collaboration. AITU has been characterized by dynamic development and increasing resilience, particularly after 2023. The US network is fragmented and dependent on a small group of core researchers. However, despite its limited scale, it demonstrates a certain stability in preserving its core. Therefore, recommendations for the development of young universities have been formulated based on the research results. The findings highlight the importance of fostering horizontal scientific ties, deepening international cooperation, and developing long-term institutional strategies for young universities. Full article
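The largest-connected-component dynamics mentioned above can be sketched in a few lines of networkx; the targeted-attack variant below removes nodes in initial-degree order and is an illustrative reading of the method, not the authors' exact procedure:

```python
import networkx as nx

def lcc_under_attack(g, fraction=0.2):
    """Relative size of the largest connected component while removing
    the top-degree nodes (initial-degree targeted attack)."""
    g = g.copy()
    n0 = g.number_of_nodes()
    order = sorted(g.degree, key=lambda kv: kv[1], reverse=True)
    sizes = []
    for node, _ in order[: int(fraction * n0)]:
        g.remove_node(node)
        sizes.append(len(max(nx.connected_components(g), key=len)) / n0)
    return sizes

# Companion indicators reported in the paper can be computed with
# nx.average_clustering(g), nx.degree_assortativity_coefficient(g),
# and community modularity via nx.algorithms.community.
```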
11 pages, 190 KB  
Data Descriptor
Survey Data on the Knowledge, Attitudes, and Practices of Patients Attending the Diabetes Control Program in a Network of Health Institutions in Cali, Colombia
by Janeth Gil-Forero, Luis Felipe Ramírez-Otero, Naydú Acosta-Ramírez and Gloria Anais Tunubala-Ipia
Data 2025, 10(11), 183; https://doi.org/10.3390/data10110183 - 6 Nov 2025
Viewed by 573
Abstract
Diabetes is a global and local epidemic, with an exponential growth trend in prevalence rates. This article presents data collected through a survey administered to a probabilistic sample of patients enrolled in a diabetes control program within a network of health institutions in Cali, Colombia. The purpose of the survey was to explore knowledge, attitudes, and practices related to diabetes. The survey was designed as part of the quantitative component of a mixed methods macroproject, and the questionnaire was developed based on a review of the literature and the research team’s expertise in the field. The results of the article correspond to the description of the database and combine raw survey data with additional analytical variables derived from grouped response options or recoded items. The data provides a valuable source of information for further research and for decision-makers interested in diabetes risk management. In conclusion, this database enables other broader studies on factors related to adherence to conventional treatments and the use of nonconventional treatments for type 2 diabetes. Full article
35 pages, 1313 KB  
Review
Big Data Sharing: A Comprehensive Survey
by Shan Jiang
Data 2025, 10(11), 182; https://doi.org/10.3390/data10110182 - 5 Nov 2025
Viewed by 1190
Abstract
The transformative potential of big data across various industries has been demonstrated. However, the data held by different stakeholders often lack interoperability, resulting in isolated data silos that limit the overall value. Collaborative data efforts can enhance the total value beyond the sum of individual parts. Thus, big data sharing is crucial for transitioning from isolated data silos to integrated data ecosystems, thereby maximizing the value of big data. Despite its potential, big data sharing faces numerous challenges, including data heterogeneity, the absence of pricing models, and concerns about data security. A substantial body of research has been dedicated to addressing these issues. This paper offers the first comprehensive survey that formally defines and delves into the technical details of big data sharing. Initially, we formally define big data sharing as the act by which data sharers share big data so that sharees can find, access, and use it in agreed ways, and we differentiate it from related concepts such as open data, data exchange, and big data trading. We clarify the general procedures, benefits, requirements, and applications associated with big data sharing. Subsequently, we examine existing big data-sharing platforms, categorizing them into data-hosting centers, data aggregation centers, and decentralized solutions. We then identify the challenges in developing big data-sharing solutions and provide explanations of the existing approaches to these challenges. Finally, the survey concludes with a discussion on future research directions. This survey presents the latest developments and research in the field of big data sharing and aims to inspire further scholarly inquiry. Full article
7 pages, 3618 KB  
Data Descriptor
Small Samples’ Permille Cramér–Von Mises Statistic Critical Values for Continuous Distributions as Functions of Sample Size
by Lorentz Jäntschi
Data 2025, 10(11), 181; https://doi.org/10.3390/data10110181 - 5 Nov 2025
Viewed by 257
Abstract
Along with other order statistics, the Cramér–von Mises (CM) statistic can assess goodness of fit. The CM statistic has no explicit formula for its cumulative distribution function, and the alternative is to obtain its critical values from a Monte Carlo (MC) experiment. A high-resolution experiment was deployed to generate a large amount of data resembling the CM statistic. Twenty-one repetitions of the experiment were conducted, and in each case, critical values of the CM statistic were obtained for all permilles and for sample sizes from 2 to 30. The raw data presented here can serve to interpolate and extract probabilities associated with the CM statistic directly, or to obtain a mathematical model for the bivariate dependence. Full article
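The underlying Monte Carlo experiment is easy to sketch at toy scale: simulate the CM statistic for uniform samples under the null and take empirical quantiles. The replicate count below is, of course, tiny compared with the 5,120,000,000 draws used for the dataset:

```python
import numpy as np

def cm_statistic(u):
    """Cramer-von Mises W^2 for a sample assumed Uniform(0,1) under the null."""
    u = np.sort(u)
    n = u.size
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + np.sum(((2 * i - 1) / (2 * n) - u) ** 2)

rng = np.random.default_rng(1)
n, reps = 10, 100_000
w2 = np.array([cm_statistic(rng.uniform(size=n)) for _ in range(reps)])
print(np.quantile(w2, [0.90, 0.95, 0.99]))  # approximate critical values
```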
22 pages, 15846 KB  
Article
NutritionVerse3D2D: Large 3D Object and 2D Image Food Dataset for Dietary Intake Estimation
by Chi-en Amy Tai, Matthew Keller, Saeejith Nair, Yuhao Chen, Yifan Wu, Olivia Markham, Krish Parmar, Pengcheng Xi and Alexander Wong
Data 2025, 10(11), 180; https://doi.org/10.3390/data10110180 - 4 Nov 2025
Viewed by 595
Abstract
Elderly populations often face significant challenges with dietary intake tracking, often exacerbated by health complications. Unfortunately, conventional diet assessment techniques such as food frequency questionnaires, food diaries, and 24 h recall are subject to substantial bias. Recent advancements in machine learning and computer vision show promise for automated nutrition tracking, but such methods require a large, high-quality dataset to accurately identify the nutrients from the food on the plate. However, manual creation of large-scale datasets with such diversity is time-consuming and hard to scale. On the other hand, synthesized 3D food models enable view augmentation to generate countless photorealistic 2D renderings from any viewpoint, reducing imbalance across camera angles. In this paper, we present a process to collect a large image dataset of food scenes that span diverse viewpoints and highlight its usage in dietary intake estimation. We first collect quality 3D objects of food items (NV-3D), which are used to generate photorealistic synthetic 2D food images (NV-Synth), and then manually collect a validation 2D food image dataset (NV-Real). We benchmark various intake estimation approaches on these datasets and present NutritionVerse3D2D, a collection of datasets containing 3D objects and 2D images, along with models that estimate intake from the 2D food images. We release all the datasets and developed models to accelerate machine learning research on dietary sensing. Full article
13 pages, 457 KB  
Data Descriptor
CBIS-DDSM-R: A Curated Radiomic Feature Dataset for Breast Cancer Classification
by Erika Sánchez-Femat, Carlos E. Galván-Tejada, Jorge I. Galván-Tejada, Hamurabi Gamboa-Rosales, Huizilopoztli Luna-García, Luis Alberto Flores-Chaires, Javier Saldívar-Pérez, Rafael Reveles-Martínez and José M. Celaya-Padilla
Data 2025, 10(11), 179; https://doi.org/10.3390/data10110179 - 4 Nov 2025
Viewed by 1736
Abstract
Early and accurate breast cancer detection is critical for patient outcomes. The Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) has been instrumental for computer-aided diagnosis (CAD) systems. However, the lack of a standardized preprocessing pipeline and consistent metadata has limited its utility for reproducible quantitative imaging or radiomics. This paper introduces CBIS-DDSM-R, an open-source, radiomics-ready extension of the original dataset. It provides an automated pipeline for preprocessing mammograms and extracts a standardized set of 93 radiomics features per lesion, adhering to Image Biomarker Standardisation Initiative (IBSI) guidelines using PyRadiomics. The resulting dataset combines clinical and radiomics data into a unified format, offering a robust benchmark for developing and validating reproducible radiomics models for breast cancer characterization. Full article
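A hedged sketch of IBSI-style extraction with PyRadiomics, the library named in the abstract; the image and mask paths are placeholders, not the descriptor's exact pipeline settings:

```python
from radiomics import featureextractor

extractor = featureextractor.RadiomicsFeatureExtractor()  # default settings
features = extractor.execute("mammogram.nrrd", "lesion_mask.nrrd")  # hypothetical paths
for name, value in features.items():
    if not name.startswith("diagnostics_"):   # skip provenance entries
        print(name, value)
```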
10 pages, 11571 KB  
Technical Note
ncPick: A Lightweight Toolkit for Extracting, Analyzing, and Visualizing ECMWF ERA5 NetCDF Data
by Sreten Jevremović, Filip Arnaut, Aleksandra Kolarski and Vladimir A. Srećković
Data 2025, 10(11), 178; https://doi.org/10.3390/data10110178 - 2 Nov 2025
Viewed by 476
Abstract
The European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5) datasets provide a rich source of climatological data. However, their Network Common Data Form (NetCDF) structure can be a barrier for researchers who are not experienced with specialized data tools or programming languages. To address this challenge, we developed ncPick, a lightweight, Windows-based application designed to make ERA5 data more accessible and easier to use. The software enables users to load NetCDF files, select points of interest manually or through shapefiles, and export the data directly to comma-separated values (CSV) format for further processing in common tools such as Excel, R, or within ncPick itself. Additional modules allow for quick visualization, descriptive statistics, interpolation, and the generation of time-of-day heatmaps, as well as practical data handling functions such as merging and downsampling CSV files along the time axis. Validation tests confirmed that ncPick outputs are consistent with those from established tools such as Panoply. The toolkit was found to be stable across different Windows systems and suitable for a range of datasets. While it has limitations with very large files, and version 1 does not include automated data download, ncPick offers an accessible solution for researchers, students, and other professionals seeking a reliable and intuitive way to work with ERA5 NetCDF data. Full article
(This article belongs to the Section Spatial Data Science and Digital Earth)
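What ncPick automates through its GUI can be approximated in a few lines of xarray, shown here for orientation; the variable and coordinate names are typical of ERA5 files but are assumptions for any given dataset:

```python
import xarray as xr

ds = xr.open_dataset("era5_sample.nc")                       # hypothetical file
# Extract 2 m temperature at the nearest grid point to a point of interest
series = ds["t2m"].sel(latitude=44.8, longitude=20.5, method="nearest")
series.to_dataframe().to_csv("t2m_point.csv")                # CSV for Excel/R
```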
26 pages, 15315 KB  
Article
Machine and Deep Learning Framework for Sargassum Detection and Fractional Cover Estimation Using Multi-Sensor Satellite Imagery
by José Manuel Echevarría-Rubio, Guillermo Martínez-Flores and Rubén Antelmo Morales-Pérez
Data 2025, 10(11), 177; https://doi.org/10.3390/data10110177 - 1 Nov 2025
Viewed by 551
Abstract
Over the past decade, recurring influxes of pelagic Sargassum have posed significant environmental and economic challenges in the Caribbean Sea. Effective monitoring is crucial for understanding bloom dynamics and mitigating their impacts. This study presents a comprehensive machine learning (ML) and deep learning (DL) framework for detecting Sargassum and estimating its fractional cover using imagery from key satellite sensors: the Operational Land Imager (OLI) on Landsat-8 and the Multispectral Instrument (MSI) on Sentinel-2. A spectral library was constructed from five core spectral bands (Blue, Green, Red, Near-Infrared, and Short-Wave Infrared). It was used to train an ensemble of five diverse classifiers: Random Forest (RF), K-Nearest Neighbors (KNN), XGBoost (XGB), a Multi-Layer Perceptron (MLP), and a 1D Convolutional Neural Network (1D-CNN). All models achieved high classification performance on a held-out test set, with weighted F1-scores exceeding 0.976. The probabilistic outputs from these classifiers were then leveraged as a direct proxy for the sub-pixel fractional cover of Sargassum. Critically, an inter-algorithm agreement analysis revealed that detections on real-world imagery are typically either of very high (unanimous) or very low (contentious) confidence, highlighting the diagnostic power of the ensemble approach. The resulting framework provides a robust and quantitative pathway for generating confidence-aware estimates of Sargassum distribution. This work supports efforts to manage these harmful algal blooms by providing vital information on detection certainty, while underscoring the critical need to empirically validate fractional cover proxies against in situ or UAV measurements. Full article
(This article belongs to the Section Spatial Data Science and Digital Earth)
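The probability-as-fractional-cover idea is simple to demonstrate: the classifier's per-pixel class probability is read as sub-pixel cover. The sketch below uses toy data and a random forest, purely to illustrate the mechanism:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.random((500, 5))               # 5 bands: Blue, Green, Red, NIR, SWIR
y_train = (X_train[:, 3] > 0.6).astype(int)  # toy "Sargassum" labels
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

pixels = rng.random((1000, 5))               # spectra from one image tile
fractional_cover = clf.predict_proba(pixels)[:, 1]  # P(Sargassum) as cover proxy
```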
7 pages, 9296 KB  
Data Descriptor
Groundwater Table Depth Monitoring Dataset (2023–2025) from an Extracted Kaigu Peatland Section in Central Latvia
by Normunds Stivrins, Jānis Bikše, Sabina Alta and Inga Grinfelde
Data 2025, 10(11), 176; https://doi.org/10.3390/data10110176 - 1 Nov 2025
Viewed by 321
Abstract
Extracted peatlands experience strong hydrological fluctuations due to drainage, vegetation succession, and climatic variability, yet long-term, high-frequency groundwater data remain scarce in Northern Europe. Our dataset presents two years (June 2023–May 2025) of 30-min groundwater table depth (WTD) measurements from six wells installed across contrasting Greenhouse Gas Emission Site Types (GEST 5, 6, 15, 20) in the Kaigu peatland, central Latvia. Each well was equipped with an automatic pressure transducer (TD-Diver, van Essen Instruments) recording absolute pressure (m H2O). The dataset also includes metadata on coordinates, installation elevation, well construction, and manual control measurements. All values are unprocessed, i.e., they represent original logger outputs without atmospheric or elevation correction, enabling users to apply their own calibration or referencing methods. This is the first openly available high-frequency groundwater pressure dataset from an extracted peatland in the Baltic region, and it provides a foundation for hydrological modelling and rewetting designs. Full article
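Because the loggers record uncorrected absolute pressure, users must apply their own barometric compensation; a hedged sketch of that step is given below, with hypothetical column names and sensor depth:

```python
import pandas as pd

SENSOR_DEPTH = 2.50  # m below ground surface; hypothetical reference value

diver = pd.read_csv("well_01.csv", parse_dates=["timestamp"])  # absolute m H2O
baro = pd.read_csv("baro.csv", parse_dates=["timestamp"])      # air pressure, m H2O
df = pd.merge_asof(diver.sort_values("timestamp"),
                   baro.sort_values("timestamp"), on="timestamp")
water_above_sensor = df["pressure_abs"] - df["pressure_air"]   # compensated column
df["wtd"] = SENSOR_DEPTH - water_above_sensor                  # m below surface
```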
11 pages, 1035 KB  
Data Descriptor
Electroencephalography Dataset of Young Drivers and Non-Drivers Under Visual and Auditory Distraction Using a Go/No-Go Paradigm
by Yasmany García-Ramírez, Luis Gordillo and Brian Pereira
Data 2025, 10(11), 175; https://doi.org/10.3390/data10110175 - 1 Nov 2025
Viewed by 886
Abstract
Electroencephalography (EEG) provides insights into the neural mechanisms underlying attention, response inhibition, and distraction in cognitive tasks. This dataset was collected to examine neural activity in young drivers and non-drivers performing Go/No-Go tasks under visual and auditory distraction conditions. A total of 40 university students (20 drivers, 20 non-drivers; balanced by sex) completed eight experimental blocks combining visual or auditory stimuli with realistic distractions, such as text message notifications and phone call simulations. EEG was recorded using a 16-channel BrainAccess MIDI system at 250 Hz. Experiments 1, 3, 5, and 7 served as transitional blocks without participant responses and were excluded from behavioral and event-related potential analyses; however, their EEG recordings and event markers are included for baseline or exploratory analyses. The dataset comprises raw EEG files, event markers for Go/No-Go stimuli and distractions, and metadata on participant demographics and mobile phone usage. This resource enables studies of attentional control, inhibitory processes, and distraction-related neural dynamics, supporting research in cognitive neuroscience, brain–computer interfaces, and transportation safety. Full article
19 pages, 3087 KB  
Article
Web Scraping Chilean News Media: A Dataset for Analyzing Social Unrest Coverage (2019–2023)
by Ignacio Molina, José Morales and Brian Keith
Data 2025, 10(11), 174; https://doi.org/10.3390/data10110174 - 31 Oct 2025
Viewed by 686
Abstract
This paper presents a dataset of Chilean news media coverage during the social unrest and constitutional processes from 2019 to 2023. Using Python-based web scraping with BeautifulSoup and Selenium, we collected articles from 15 Chilean news outlets between 15 November 2019 and 17 December 2023. The initial collection of 1254 articles was filtered to 931 usable data points after removing non-relevant content, duplicates, and articles unrelated to the Chilean social outburst. Each news outlet required specific extraction approaches due to varying HTML structures, with some outlets inaccessible due to paywalls or anti-scraping mechanisms. The dataset is structured in JSON format with standardized fields including title, content, date, author, and source metadata. This resource supports research on media coverage during political events and provides data for Spanish-language processing tasks. The dataset and extraction code are publicly available on GitHub. Full article
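The per-outlet extraction pattern looks roughly like the sketch below; the URL and CSS selectors are hypothetical, since each of the 15 outlets required its own variant:

```python
import json
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example-outlet.cl/nota", timeout=30).text
soup = BeautifulSoup(html, "html.parser")
time_tag = soup.select_one("time")
record = {
    "title": soup.select_one("h1").get_text(strip=True),
    "content": " ".join(p.get_text(strip=True) for p in soup.select("article p")),
    "date": time_tag["datetime"] if time_tag else None,
    "source": "example-outlet",
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```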
55 pages, 6674 KB  
Article
Method for Detecting Low-Intensity DDoS Attacks Based on a Combined Neural Network and Its Application in Law Enforcement Activities
by Serhii Vladov, Oksana Mulesa, Victoria Vysotska, Petro Horvat, Nataliia Paziura, Oleksandra Kolobylina, Oleh Mieshkov, Oleksandr Ilnytskyi and Oleh Koropatov
Data 2025, 10(11), 173; https://doi.org/10.3390/data10110173 - 30 Oct 2025
Viewed by 655
Abstract
The article presents a method for detecting low-intensity DDoS attacks, focused on identifying hard-to-detect “low-and-slow” scenarios that remain undetectable by traditional defence systems. The key feature of the developed method is the integration of statistical criteria (χ² and T statistics, energy ratio, reconstruction errors) with a combined neural network architecture, including convolutional and transformer blocks coupled with an autoencoder and a calibrated regressor. The developed neural network architecture combines mathematical validity and high sensitivity to weak anomalies with the ability to generate interpretable artefacts suitable for subsequent forensic analysis. The developed method implements a multi-layered process: the first level statistically evaluates flow intensity and inter-packet intervals, while the second level processes features using a neural network module, generating an integral blend-score metric S. ROC-AUC and PR-AUC metrics, learning curve analysis, and the expected calibration error (ECE) were used for validation. Experimental results demonstrated the superiority of the proposed method over existing approaches, as the achieved ROC-AUC and PR-AUC values were 0.80 and 0.866, respectively, with an ECE of 0.04, indicating high attack detection accuracy. The study’s contribution lies in the development of a method combining statistical and neural network analysis, as well as in ensuring the evidentiary value of the results through the generation of structured incident reports (PCAP slices, time windows, cryptographic hashes). The obtained results expand the toolkit for cyber-attack analysis and open up prospects for the method’s practical application in monitoring systems and law enforcement agencies. Full article
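The first statistical layer can be illustrated with a chi-square screen on inter-packet intervals, as sketched below; the bin edges and baseline histogram are assumptions standing in for the profile a deployment would learn offline:

```python
import numpy as np
from scipy.stats import chisquare

def window_chi2(intervals, baseline_probs, edges):
    """Chi-square test of a window's inter-packet intervals against a baseline."""
    observed, _ = np.histogram(intervals, bins=edges)
    expected = baseline_probs * observed.sum()   # scale baseline to window size
    return chisquare(observed, f_exp=expected)

edges = np.array([0, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0])        # seconds; assumed
baseline = np.array([0.30, 0.25, 0.20, 0.15, 0.07, 0.03])    # learned offline; assumed
stat, p = window_chi2(np.random.exponential(0.1, 500), baseline, edges)
print(f"chi2 = {stat:.1f}, p = {p:.4f}")
```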
31 pages, 767 KB  
Article
From Offloading to Engagement: An Experimental Study on Structured Prompting and Critical Reasoning with Generative AI
by Michael Gerlich
Data 2025, 10(11), 172; https://doi.org/10.3390/data10110172 - 30 Oct 2025
Viewed by 4234
Abstract
The rapid adoption of generative AI raises questions not only about its transformative potential but also about its cognitive and societal risks. This study contributes to the debate by presenting cross-country experimental data (n = 150; Germany, Switzerland, United Kingdom) on how individuals engage with generative AI under different conditions: human-only, human + AI (unguided), human + AI (guided with structured prompting), and AI-only benchmarks. Across 450 evaluated responses, critical reasoning was assessed via expert rubric ratings, while perceived reflective engagement was captured through self-report indices. Results show that unguided AI use fosters cognitive offloading without improving reasoning quality, whereas structured prompting significantly reduces offloading and enhances both critical reasoning and reflective engagement. Mediation and latent class analyses reveal that guided AI use supports deeper human involvement and mitigates demographic disparities in performance. Beyond theoretical contributions, this study offers practical implications for business and society. As organisations integrate AI into workflows, unstructured use risks undermining workforce decision making and critical engagement. Structured prompting, by contrast, provides a scalable and low-cost governance tool that fosters responsible adoption, supports equitable access to technological benefits, and aligns with societal calls for human-centric AI. These findings highlight the dual nature of AI as both a productivity enabler and a cognitive risk, and position structured prompting as a promising intervention to navigate the emerging challenges of AI adoption in business and society. Full article
27 pages, 8980 KB  
Article
A Database of High-Resolution Meteorological Drought Comprehensive Index Across China for the 1951–2022 Period
by Xijia Zhou, Mingwei Zhang, Guicai Li, Yuanyuan Wang and Zhaodi Guo
Data 2025, 10(11), 171; https://doi.org/10.3390/data10110171 - 28 Oct 2025
Viewed by 642
Abstract
Drought events exacerbated by global climate change occur frequently in China. Currently, high-spatiotemporal-resolution gridded meteorological drought index datasets are generally available only for single time scales (e.g., 30, 60, 90, and 150 days) and do not fully account for seasonal differences in the impact of drought on vegetation, thus limiting their accuracy when monitoring drought in different regions of China. To compensate for the limitations of existing drought index datasets, a Chinese regional daily meteorological drought comprehensive index (MCI) dataset covering 1951–2022 with a spatial resolution of 0.1 degrees was developed, and standardized precipitation index (SPI) and standardized precipitation evapotranspiration index (SPEI) datasets at 30- and 90-day scales were constructed based on ERA5-Land datasets. Compared with the existing SPI and SPEI datasets, the generated dataset exhibits a high degree of consistency in the eastern part of China (R2 > 0.5; the average biases are close to 0 and significantly smaller than the RMSEs of the fit). Additionally, the MCI dataset can more accurately and promptly reflect changes in shallow soil moisture in the eastern part of China (R2 > 0.7 for the 0–7 cm depth), thus providing notable empirical support for research on drought development in different ecosystems. Full article
(This article belongs to the Section Spatial Data Science and Digital Earth)
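For orientation, a 30-day SPI can be sketched as a gamma fit to accumulated precipitation followed by a normal-quantile transform; this is a common formulation, not necessarily the exact procedure used for this dataset:

```python
import numpy as np
from scipy import stats

def spi(precip_30d):
    """SPI from 30-day accumulated precipitation (mixed gamma with zero mass)."""
    precip_30d = np.asarray(precip_30d, dtype=float)
    nonzero = precip_30d[precip_30d > 0]
    a, _, scale = stats.gamma.fit(nonzero, floc=0)   # fit gamma to wet totals
    q = (precip_30d == 0).mean()                     # probability of zero rain
    cdf = q + (1 - q) * stats.gamma.cdf(precip_30d, a, loc=0, scale=scale)
    return stats.norm.ppf(np.clip(cdf, 1e-6, 1 - 1e-6))
```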
14 pages, 22331 KB  
Data Descriptor
Electrical Measurement Dataset from a University Laboratory for Smart Energy Applications
by Sergio D. Saldarriaga-Zuluaga, José Ricardo Velasco-Méndez, Carlos Mario Moreno-Paniagua, Bayron Alvarez-Arboleda and Sergio Andres Estrada-Mesa
Data 2025, 10(11), 170; https://doi.org/10.3390/data10110170 - 26 Oct 2025
Viewed by 812
Abstract
Continuous monitoring of electrical parameters is essential for understanding energy consumption, assessing power quality, and analyzing load behavior. This paper presents a dataset comprising measurements of three-phase voltages and currents, active and reactive power (per phase and total), power factor, and system frequency. The data was collected between April and December 2024 in the low-voltage system of a university laboratory, using high-accuracy power analyzers installed at the point of common coupling. Measurements were recorded every 10 min, generating 79 files with 432 records each, for a total of approximately 34,128 entries. To ensure data quality, the values were validated, erroneous entries removed, and consistency verified using power triangle relationships. The curated dataset is provided in tabular (CSV) format, with each record including a timestamp, three-phase voltages, three-phase currents, active and reactive power (per phase and total), power factor (per phase and global), and system frequency. This dataset offers a comprehensive characterization of electrical behavior in a university laboratory over a nine-month period. It is openly available for reuse and can support research in power system analysis, renewable energy integration, demand forecasting, energy efficiency, and the development of machine learning models for smart energy applications. Full article
(This article belongs to the Topic Smart Energy Systems, 2nd Edition)
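The power-triangle consistency check mentioned above amounts to verifying S² ≈ P² + Q², with apparent power S = V·I; a minimal sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("lab_measurements.csv")                    # hypothetical file
s_apparent = df["v_phase_a"] * df["i_phase_a"]              # VA, one phase
s_triangle = np.hypot(df["p_phase_a"], df["q_phase_a"])     # sqrt(P^2 + Q^2)
bad = np.abs(s_apparent - s_triangle) / s_apparent > 0.05   # 5% tolerance; assumed
print(f"{bad.sum()} of {len(df)} records fail the power-triangle check")
```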
12 pages, 1202 KB  
Data Descriptor
Toward Responsible AI in High-Stakes Domains: A Dataset for Building Static Analysis with LLMs in Structural Engineering
by Carlos Avila, Daniel Ilbay, Paola Tapia and David Rivera
Data 2025, 10(11), 169; https://doi.org/10.3390/data10110169 - 24 Oct 2025
Viewed by 639
Abstract
Modern engineering increasingly operates within socio-technical networks, such as the interdependence of energy grids, transport systems, and building codes, where decisions must be reliable and transparent. Large language models (LLMs) such as GPT promise efficiency by interpreting domain-specific queries and generating outputs, yet their predictive nature can introduce biases or fabricated values—risks that are unacceptable in structural engineering, where safety and compliance are paramount. This work presents a dataset that embeds generative AI into validated computational workflows through the Model Context Protocol (MCP). MCP enables API-based integration between ChatGPT (GPT-4o) and numerical solvers by converting natural-language prompts into structured solver commands. This creates context-aware exchanges—for example, transforming a query on seismic drift limits into an OpenSees analysis—whose results are benchmarked against manually generated ETABS models. This architecture ensures traceability, reproducibility, and alignment with seismic design standards. The dataset contains prompts, GPT outputs, solver-based analyses, and comparative error metrics for four reinforced concrete frame models designed under Ecuadorian (NEC-15) and U.S. (ASCE 7-22) codes. The end-to-end runtime for these scenarios, including LLM prompting, MCP orchestration, and solver execution, ranged between 6 and 12 s, demonstrating feasibility for design and verification workflows. Beyond providing records, the dataset establishes a reproducible methodology for integrating LLMs into engineering practice, with three goals: enabling independent verification, fostering collaboration across AI and civil engineering, and setting benchmarks for responsible AI use in high-stakes domains. Full article
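Illustrative only: the shape an MCP-style exchange might take when a natural-language drift query is turned into a structured solver command. Every field name below is hypothetical; the paper's actual protocol schema may differ:

```python
import json

# Hypothetical command object an LLM-to-solver bridge might emit
command = {
    "tool": "opensees_run",                    # target solver
    "analysis": "seismic_drift",               # derived from the user prompt
    "model": "rc_frame_nec15",                 # one of the four frame models
    "parameters": {"drift_limit": 0.02, "load_combo": "ASCE 7-22"},
}
print(json.dumps(command, indent=2))           # results would then be benchmarked
                                               # against the ETABS reference models
```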
28 pages, 2676 KB  
Article
Multi-Aspect Sentiment Classification of Arabic Tourism Reviews Using BERT and Classical Machine Learning
by Samar Zaid, Amal Hamed Alharbi and Halima Samra
Data 2025, 10(11), 168; https://doi.org/10.3390/data10110168 - 23 Oct 2025
Viewed by 850
Abstract
Understanding visitor sentiment is essential for developing effective tourism strategies, particularly as Google Maps reviews have become a key channel for public feedback on tourist attractions. Yet the unstructured format and dialectal diversity of Arabic reviews pose significant challenges for extracting actionable insights at scale. This study evaluates the performance of traditional machine learning and transformer-based models for aspect-based sentiment analysis (ABSA) on Arabic Google Maps reviews of tourist sites across Saudi Arabia. A manually annotated dataset of more than 3500 reviews was constructed to assess model effectiveness across six tourism-related aspects: price, cleanliness, facilities, service, environment, and overall experience. Experimental results demonstrate that multi-head BERT architectures, particularly AraBERT, consistently outperform traditional classifiers in identifying aspect-level sentiment. AraBERT achieved an F1-score of 0.97 for the cleanliness aspect, compared with 0.91 for the best-performing classical model (LinearSVC), indicating a substantial improvement. The proposed ABSA framework facilitates automated, fine-grained analysis of visitor perceptions, enabling data-driven decision-making for tourism authorities and contributing to the strategic objectives of Saudi Vision 2030. Full article
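A hedged sketch of a multi-head ABSA architecture of the kind the abstract describes: one shared AraBERT encoder with a sentiment head per aspect. The checkpoint name, label count, and pooling choice are assumptions:

```python
import torch.nn as nn
from transformers import AutoModel

ASPECTS = ["price", "cleanliness", "facilities", "service", "environment", "overall"]

class MultiHeadABSA(nn.Module):
    def __init__(self, name="aubmindlab/bert-base-arabertv02", n_labels=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)   # shared AraBERT encoder
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict({a: nn.Linear(hidden, n_labels) for a in ASPECTS})

    def forward(self, input_ids, attention_mask):
        pooled = self.encoder(input_ids,
                              attention_mask=attention_mask).last_hidden_state[:, 0]
        return {a: head(pooled) for a, head in self.heads.items()}  # logits per aspect
```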