Journal Description
Data is a peer-reviewed, open access journal on data in science, with the aim of enhancing data transparency and reusability. The journal publishes in two sections: a section on the collection, treatment and analysis methods of data in science; a section publishing descriptions of scientific and scholarly datasets (one dataset per paper). The journal is published monthly online by MDPI.
- Open Access: free for readers, with article processing charges (APC) paid by authors or their institutions.
- High Visibility: indexed within Scopus, ESCI (Web of Science), Ei Compendex, dblp, Inspec, RePEc, and other databases.
- Journal Rank: JCR - Q2 (Multidisciplinary Sciences) / CiteScore - Q2 (Information Systems and Management)
- Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 25 days after submission; acceptance to publication is undertaken in 2.9 days (median values for papers published in this journal in the second half of 2025).
- Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.
- Journal Cluster of Information Systems and Technology: Analytics, Applied System Innovation, Cryptography, Data, Digital, Informatics, Information, Journal of Cybersecurity and Privacy and Multimedia.
Impact Factor: 2.0 (2024); 5-Year Impact Factor: 2.1 (2024)
Latest Articles
VaxiGen Database of Tumor Immunogens
Data 2026, 11(5), 123; https://doi.org/10.3390/data11050123 - 20 May 2026
Abstract
Peptide-based cancer vaccines have emerged as a prominent focus in contemporary oncological research, as the quest for innovative cancer treatment modalities continues to gain momentum. A pivotal facet of their development is the precise delineation and characterization of immunogenic tumor antigens. In this context, VaxiJen stands out as one of the most widely used and cited computational servers for predicting immunogenicity, making it an invaluable tool for in silico antigen prediction. However, the database underpinning VaxiJen’s predictions has not undergone a comprehensive update for over fifteen years. To address this, a systematic search of the PubMed database was conducted to identify scholarly articles reporting data on novel immunogenic proteins and peptides undergoing human testing. The corresponding sequences of these proteins and peptides were subsequently curated from UniProtKB. Therefore, in this study, we introduce an updated dataset encompassing a repertoire of tumor immunogens, comprising 546 full-length human proteins and 212 human tumor peptides, as well as tumor non-immunogens, comprising 548 full-length human proteins and 181 human tumor peptides. The recently compiled VaxiGen tumor dataset is openly accessible. Researchers can conveniently download, search, and process it. This dataset, when paired with a suitable negative dataset, can further serve as a valuable training set, thereby facilitating improved predictions of the potential immunogenicity of hitherto uncharacterized protein or peptide sequences.
(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)
Open Access Article
Evaluating the Integrity of LLM-Generated Citations: Prevalence and Risks of Fabricated References in Scientific Literature
by
Pablo Picazo-Sanchez and Lara Ortiz-Martin
Data 2026, 11(5), 122; https://doi.org/10.3390/data11050122 - 20 May 2026
Abstract
Large Language Models have become important in our lives, and academia is not agnostic to this trend, offering tools like text rephrasing and summarisation. However, this integration raises significant concerns regarding the integrity of science. In this paper, we investigate hallucinations of LLMs when generating scientific references. Using nine LLMs, we generated a dataset of 74,196 Bib references to quantify and analyse fabricated references, focusing on distinguishing between intrinsic and extrinsic hallucinations. Also, we extracted and analysed 127,063 references from 3541 published papers in 2023 to assess the prevalence of fake bibliographic data. Our manual verification process identified eight instances of fabricated references. While the overall rate is statistically low, the mere existence of fabricated content in the peer-reviewed literature is a critical integrity issue, demonstrating a vulnerability in current academic validation systems. The significance of our finding is not the statistical prevalence but rather the necessity for rigorous, human-validated processes to prevent the injection of spurious citations regardless of their source.
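The "statistically low" rate mentioned above can be checked from the two counts given in the abstract (eight fabricated references among 127,063 extracted ones). The following is only a back-of-the-envelope sketch; the function name is hypothetical and the abstract itself does not report a percentage.

```python
def fabricated_rate_percent(fabricated: int, total: int) -> float:
    """Share of fabricated references, expressed as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * fabricated / total


# Counts taken from the abstract: 8 fabricated out of 127,063 references.
rate_pct = fabricated_rate_percent(8, 127_063)
print(f"{rate_pct:.4f}% of the extracted references were fabricated")
```

This works out to roughly 6 fabricated references per 100,000, which is consistent with the authors' point that the concern is the existence of such references rather than their frequency.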
(This article belongs to the Special Issue Mining and Computational Intelligence for E-Learning and Education—4th Edition)
Open Access Data Descriptor
Agricultural Life Cycle Assessment Dataset for Phase 1 Goals, Products, and Scope Definitions
by
Rahmah Alhashim and Aavudai Anandhi
Data 2026, 11(5), 121; https://doi.org/10.3390/data11050121 - 20 May 2026
Abstract
Life cycle assessment (LCA), which comprises four phases, is widely used to evaluate the environmental impacts of agricultural production systems. The first phase (Phase 1) describes the goal and scope of the entire LCA. However, data on existing and potential goals and scope are scattered across studies and not available at a single location, making it hard to reuse and compare them. The objective of this study is to create a dataset of Phase 1 information (goals, products, and scopes) for agricultural LCA. The dataset was generated from a systematic review of 184 published agricultural LCA studies, including peer-reviewed journal articles and selected conference papers, published between 1999 and 2025, following PRISMA guidelines. Studies were identified through keyword searches on Google Scholar and screened for relevance and availability. Only studies that clearly reported Phase 1 information were included. Data was collected manually and organized using standard IDs. The dataset has 41 goals, 65 products, and 7 scopes; each was assigned an ID (Goal_ID, Product_ID, Scope_ID, and Stage_ID) to support consistency and traceability. The dataset supports comparisons across studies, assists users in selecting appropriate goals, products, and system boundaries, and can support the development of LCA tools, databases, and decision-support frameworks.
(This article belongs to the Section Information Systems and Data Management)
Open Access Article
Long-Term Repositories—Maintaining Research Databases for Interdisciplinary Projects
by
Vincent Feldmar, Tanja Kramm, Constanze Curdt, Dirk Hoffmeister, Olaf Bubenzer and Georg Bareth
Data 2026, 11(5), 120; https://doi.org/10.3390/data11050120 - 19 May 2026
Abstract
This paper describes the maintenance and modernisation of custom data repositories that have supported three long-term research projects since their launch in 2007. It highlights an almost complete rewrite of the code base in 2020 and 2021. To enable research data management (RDM) that adheres to modern and FAIR standards, many features were rebuilt and streamlined for ease of use, with the goal of reducing the friction involved in RDM as much as possible. The update significantly improved the file upload by switching to a fully browser-based solution and replaced the outdated metadata editor with a more interactive version. The ability to search for and find data in the repository has also been enhanced by switching to a flexible, filter-based solution which displays results in real time. The work shows that older repositories can be kept in line with the changing landscape of RDM, ensuring that the research data contained therein is not lost. Through these updates and continuing maintenance, these data repositories have stayed available for almost 20 years, even after project funding has ended.
(This article belongs to the Section Information Systems and Data Management)
Open Access Data Descriptor
Genome-Based Characterization of Bacillus velezensis HM1 from Silver Mine Tailings Reveals Potential Metal Resistance and Sulfur Assimilation Traits
by
Gustavo Cuaxinque-Flores, Lorena Jacqueline Gómez-Godínez, Marco A. Ramírez-Mosqueda, Jorge David Cadena-Zamudio, Alma Armenta-Medina and José Luis Aguirre-Noyola
Data 2026, 11(5), 119; https://doi.org/10.3390/data11050119 - 15 May 2026
Abstract
The genus Bacillus is widely recognized for its metabolic versatility, enabling it to colonize extreme environments, including sites contaminated with metals. In this study, we report the genome of B. velezensis strain HM1, isolated from sulfur-rich mine tailings from silver mining activities in southwestern Mexico. Isolation was performed by heat treatment followed by selective cultivation in a medium enriched with mine tailings extract (metals and sulfates), resulting in a single dominant morphotype corresponding to strain HM1. Whole-genome sequencing was carried out using the Illumina NovaSeq platform (2 × 250 bp). The assembled genome of strain HM1 has a size of 4,044,128 bp, distributed across 20 contigs, with an N50 of 700,388 bp and an L50 of 3, and an average coverage of 66.8×. The GC content was 46.31%, with an estimated completeness of 99.81% and contamination of 0.01%. Genome analyses indicate that the assembly corresponds to a single chromosome, with no evidence of plasmid replicons. Genome annotation identified 3950 coding sequences (CDSs), 83 tRNAs, 11 rRNAs, 26 ncRNAs, and 4 sORFs. Phylogenomic analysis, together with genomic similarity metrics (ANI > 98.6%, AAI > 98.8%, dDDH > 87%), confirms its classification as Bacillus velezensis. Functionally, the genome encodes multiple genes involved in resistance to metals and metalloids (including ABC transporters, efflux pumps, and biotransformation enzymes), as well as a complete pathway for sulfate assimilation. Collectively, these genomic features reveal a broad repertoire of adaptive strategies employed by strain HM1 to thrive in metal-contaminated environments.
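The N50 and L50 statistics quoted for the assembly are derived from the sorted contig lengths: N50 is the length of the contig at which the running total first reaches half of the assembly size, and L50 is the number of contigs needed to get there. A minimal sketch follows, using a hypothetical toy list of contig lengths, since the individual HM1 contig lengths are not given in the abstract.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    # Walk the contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # 'length' is N50; 'count' contigs were needed, which is L50.
            return length, count


# Hypothetical toy assembly (not the HM1 contigs).
toy_contigs = [500, 300, 200, 100, 50]
n50, l50 = n50_l50(toy_contigs)
print(n50, l50)
```

An L50 of 3 with 20 contigs, as reported for HM1, means that the three longest contigs already cover at least half of the 4,044,128 bp assembly.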
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 3rd Edition)
Open Access Data Descriptor
A Dataset: Experimental Analysis of Outdoor Exposed Four-Year-Old Photovoltaic Modules in Dhaka, Bangladesh
by
Md. Sabbir Alam, Ahmed Al Mansur, Shahariar Ahmed Himo, Md. Imamul Islam, Khawza Iftekhar Uddin Ahmed and Md. Fayyaz Khan
Data 2026, 11(5), 118; https://doi.org/10.3390/data11050118 - 14 May 2026
Abstract
The long-term performance of photovoltaic (PV) modules significantly affects the reliability and economic viability of solar energy systems, as various environmental and operational factors can gradually degrade module efficiency and reduce energy output. This study investigates the long-term performance degradation of 40 PV modules exposed outdoors for four years on a five-level building in Mirpur, Dhaka, Bangladesh. Electrical parameters, including voltage, current, power, and fill factor, were measured using a PROVA 1011 PV analyzer under IEC 60904-1 standard test conditions, and analyzed to evaluate the extent of long-term degradation of the modules. Image-based analysis identified degradation factors such as dust accumulation, soiling, hotspots, discoloration, micro-cracks, delamination, and corrosion. All test data were normalized to standard conditions (1000 W/m², 25 °C) for consistency. The measured average maximum power output was 9.85 W, with an average fill factor of 0.713 and a standard deviation of 0.939 for the 40 photovoltaic modules with a rated capacity of 10 W each. The dataset provides valuable insights for researchers and industry professionals to assess long-term PV performance, optimize maintenance strategies, and support solar energy deployment in tropical environments. Additionally, it can aid policymakers in developing regulatory frameworks for improving solar infrastructure resilience.
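For readers unfamiliar with the quantities listed above, the fill factor is conventionally defined as the maximum power divided by the product of open-circuit voltage and short-circuit current. The sketch below illustrates that definition and a simple power shortfall relative to the rated value. The voltage and current readings are hypothetical, chosen only for illustration; the 9.85 W and 10 W figures come from the abstract, and the shortfall computed here is not a degradation rate reported by the authors.

```python
def fill_factor(p_max_w: float, v_oc_v: float, i_sc_a: float) -> float:
    """Fill factor FF = P_max / (V_oc * I_sc), dimensionless."""
    if v_oc_v <= 0 or i_sc_a <= 0:
        raise ValueError("V_oc and I_sc must be positive")
    return p_max_w / (v_oc_v * i_sc_a)


def power_shortfall_percent(measured_w: float, rated_w: float) -> float:
    """Shortfall of measured maximum power relative to rated power, in percent."""
    if rated_w <= 0:
        raise ValueError("rated power must be positive")
    return 100.0 * (rated_w - measured_w) / rated_w


# Hypothetical single-module reading (V_oc and I_sc are NOT from the dataset).
ff = fill_factor(p_max_w=9.85, v_oc_v=21.6, i_sc_a=0.64)

# Average measured power (9.85 W) and rated power (10 W) as stated in the abstract.
shortfall = power_shortfall_percent(measured_w=9.85, rated_w=10.0)
print(f"FF = {ff:.3f}, shortfall = {shortfall:.1f}%")
```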
Open Access Data Descriptor
A Rare Earth Elements Database for Peru
by
Sergio Ticona, Pablo A. Garcia-Chevesich, Héctor L. Venegas-Quiñones, Guido Salas, Oliver Wanderley Gomez Villagra, Zidane Rooney Pachari Gutierrez, Johanys Trujillo Choque, Madeleine Guillen, Gisella Martínez, Rolando Quispe Aquino, Marcela Huerta, Yezelia Cáceres, Eliseo Zeballos, Mario Nuñez, Cesar Carbajal, Elizabeth Holley and Rod Eggert
Data 2026, 11(5), 117; https://doi.org/10.3390/data11050117 - 13 May 2026
Abstract
The global transition toward low-carbon energy and advanced technologies has intensified demands for rare earth elements (REEs), while available data remain fragmented across government and academic sources. To address this gap, this study compiles and standardizes a publicly accessible geochemical database of REE concentrations (among other variables) across Peru, motivated by the need for consolidated, evidence-based resource assessment. The dataset integrates over 30,000 records from national agencies and university repositories, classified by sample source (stream sediments, rocks and minerals, deep and surface soils, tailings, and industrial materials). This structure enhances comparability and interpretation across geological and environmental contexts. By providing a centralized, high-resolution dataset for an undercharacterized mineral province, this work offers a resource for exploration, policy development, and sustainable management amid growing global demand and strategic interest in critical minerals.
(This article belongs to the Section Spatial Data Science for Environment and Earth)
Open Access Data Descriptor
A Dataset of Raw Fabric Grayscale Images for Defect Detection
by
Ruben Pérez-Llorens, Teresa Albero-Albero and Javier Silvestre-Blanes
Data 2026, 11(5), 116; https://doi.org/10.3390/data11050116 - 12 May 2026
Abstract
This article presents RAW-FABRID (RAW FABric Image Dataset), a publicly available annotated dataset for raw fabric defect detection using computer vision techniques. It addresses a major limitation in textile inspection, where reliance on private datasets hinders objective methodological comparisons. RAW-FABRID was acquired using a custom-built inspection machine equipped with controlled LED illumination and a line-scan camera. The dataset includes grayscale fabric images collected from several manufacturers to ensure variability in textures and patterns. It comprises 709 high-resolution images (1792 × 1024 pixels), including both defect-free and defective samples. To maximize reusability, data are provided in two complementary formats: high-resolution images (cropped to remove peripheral acquisition artifacts) for global analysis, and a patch-based organization following the widely adopted MVTec Anomaly Detection benchmark structure. The latter divides images into 256 × 256 pixel patches for direct machine learning integration. Crucially, the dataset is accompanied by comprehensive metadata (CSV) and precise COCO-formatted annotations (JSON) for both subsets, ensuring full traceability and supporting object detection and semantic segmentation. The dataset is publicly available through Mendeley Data, enabling reproducible research and objective benchmarking of defect detection algorithms.
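The patch-based organization can be pictured as a regular tiling of each image. The sketch below computes the bounding boxes of 256 × 256 patches over a 1792 × 1024 frame. It assumes non-overlapping patches laid out on the full frame; the dataset's actual patch layout (for example, after cropping) may differ, and the function name is hypothetical.

```python
def patch_boxes(width: int, height: int, patch: int = 256):
    """Return (left, top, right, bottom) boxes for non-overlapping square patches.

    Partial patches at the right or bottom edge are dropped (an assumption).
    """
    boxes = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            boxes.append((left, top, left + patch, top + patch))
    return boxes


# Image size and patch size as stated in the abstract.
boxes = patch_boxes(1792, 1024, patch=256)
print(len(boxes), boxes[0], boxes[-1])
```

Under these assumptions each image yields a 7 × 4 grid of patches, and the boxes can be passed directly to an image library's crop function.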
(This article belongs to the Section Information Systems and Data Management)
Open Access Article
A New Measurement-Based Benchmark Data Set for Radio Spectrum Analysis Applications
by
Szilárd László Takács, Lajos Muzsai, Zoltán Németh, Bence Bakos, András Lukács, Csaba Huszty, Péter Vári and András Lapsánszky
Data 2026, 11(5), 115; https://doi.org/10.3390/data11050115 - 11 May 2026
Abstract
Radio spectrum is a limited national resource whose efficient utilization is of strategic importance. With the rapid advancement of wireless technologies, maintaining spectrum cleanliness and enabling fast and reliable anomaly detection have become critical challenges. Artificial intelligence (AI)-based approaches have recently shown great promise in addressing these issues; however, their effectiveness strongly depends on the availability of high-quality, representative, and annotated datasets. Generating such datasets is a complex task, further complicated by environmental conditions such as weather. Hungary’s nationwide spectrum monitoring network enables continuous observation of frequency bands, thereby providing the opportunity to construct large-scale and sustainable datasets. This study introduces a measurement methodology designed for FM sound broadcasting in the VHF band (87.5–108 MHz), presents the resulting dataset, and details the annotation process. The published, openly accessible dataset is expected to serve not only as a valuable reference point but also as a benchmark for the international research community, facilitating the development, validation, and objective comparison of AI-driven spectrum monitoring solutions.
(This article belongs to the Topic Data Stream Mining and Processing)
Open Access Data Descriptor
Dataset for Collaborative Robotics
by
Shurook S. Almohamade, John A. Clark and James Law
Data 2026, 11(5), 114; https://doi.org/10.3390/data11050114 - 10 May 2026
Abstract
This dataset represents the physical interactions collected by the robot’s sensors during a collaborative effort between humans and robots. The experiment was conducted at the Sheffield Robotics laboratory at the University of Sheffield, UK, utilizing a KUKA LBR iiwa 7 R800 serial manipulator. Thirty participants, consisting of 14 males and 16 females, participated, including both students and faculty members. Participants were instructed to guide the robot’s end effector through a two-dimensional maze situated on a horizontal plane. Each participant performed the same task 15 times, resulting in 450 complete interaction sequences. All data files are provided in CSV (comma-separated values) file format, which allows data to be stored in a table-structured format. The complete dataset is publicly available via the Mendeley Data repository (DOI: 10.17632/4fr33dkrjt.3).
Open Access Article
TCM-MS2Link: A Unified AI-Ready Dataset Integrating TCM Herb–Compound Knowledge and MS/MS Spectral Data
by
Qianjin Li, Feifan Zhao, Jihang Zhang, Heng Zhou, Lin Guo and Xingchuang Xiong
Data 2026, 11(5), 113; https://doi.org/10.3390/data11050113 - 10 May 2026
Abstract
This study presents TCM-MS2Link, a standardized mass spectrometry-based association dataset for traditional Chinese medicine (TCM), serving as an important resource for natural product research in TCM. The dataset adopts a dual-layer “knowledge–data” architecture: the first layer, TCM-MolLink, comprises curated herb–compound association data, constructed through the integration of multiple heterogeneous databases and rigorous consistency filtering to establish high-confidence relationships between TCM herbs and their chemical constituents; the second layer, MS2-MLReady, is a benchmark dataset for mass spectrometry-based machine learning which, after systematic data cleaning, standardized preprocessing, and well-designed data partitioning, can directly support the training and evaluation of artificial intelligence models. By addressing key limitations in existing public resources, including data fragmentation, inconsistent annotations, and insufficient computational usability, TCM-MS2Link effectively overcomes major bottlenecks in the systematic analysis of TCM components and data-driven research. This study significantly enhances the reliability of herb–compound associations and the modeling readiness of mass spectrometry data, providing a high-quality, standardized, and reusable data foundation for applications such as TCM knowledge base construction and automated spectrum–structure identification, thereby promoting the advancement of TCM informatics and data-driven research.
(This article belongs to the Section Data Science for Chemistry, Energy and Materials)
Open Access Article
A Scalable Data Pipeline for Early Detection and Decision Support in Higher Education: YuumCare
by
Anabel Pineda-Briseño, María Guadalupe Hernández-Compean, Gabriela Aida Flores-Becerra, María de Jesús Hernández-Quezada and Mayra Manuela De los Santos-Alonso
Data 2026, 11(5), 112; https://doi.org/10.3390/data11050112 - 10 May 2026
Abstract
Early identification of behavioral risk patterns in large student populations remains a challenge in higher education, particularly when support systems depend on voluntary help-seeking. This study presents YuumCare, a structured and scalable framework that operationalizes population-level digital screening through a reproducible data pipeline for early detection and decision support. The framework was implemented during the first weeks of the academic term in a public higher education institution in Latin America, where 466 first-year students (38.9% coverage) completed a structured questionnaire capturing indicators of emotional well-being, academic pressure, and help-seeking attitudes. Responses were processed through a structured data pipeline comprising data ingestion, preparation, feature construction, and rule-based classification, transforming distributed self-reported data into standardized features and interpretable institutional signals for consistent analysis at scale. Results show that emotional strain, evaluation-related anxiety, and adaptation difficulties emerge early and frequently co-occur, while most students report low willingness to seek professional support. The classification process indicates that approximately one third of the cohort presents moderate to critical levels of need, providing a structured representation of vulnerability. The proposed approach connects digital screening with institutional decision-making through an interpretable and operational workflow that does not rely on complex infrastructure. Beyond descriptive findings, the study contributes a lightweight and reproducible data framework that supports scalable monitoring and coordinated response under real-world constraints, demonstrating the feasibility of transforming self-reported behavioral data into actionable decision-support signals for population-level monitoring in higher education.
(This article belongs to the Special Issue Mining and Computational Intelligence for E-Learning and Education—4th Edition)
Open Access Article
Intervention to Improve Attitudes Toward Stuttering: A Multi-Site International Replication and Expansion
by
Kenneth O. St. Louis, Ben Bolton-Grant, Autumn Cannon, Edna J. Carlo, Sveta Fichman, Shweta Gupta, Krittika Kunda, Hailey M. O’Como, Catherine Porter, Bárbara M. Pratts Pérez, Isabella Reichel, Anne Z. Williams, Salman Abdi, Elizabeth F. Aliveto, Ann Beste-Guldborg, Agata Błachnio, Timothy Flynn, Lejla Junuzović-Žunić, Aneta Przepiórka, Hossein Rezai, Chelsea Roche, Mohyeddin Teimouri Sangani, Michael Azios, Shin Ying Chu, Irena Polewczyk, Cara M. Singer, John A. Tetnowski, Janet S. Tilstra and Katarzyna Węsierska
Data 2026, 11(5), 111; https://doi.org/10.3390/data11050111 - 8 May 2026
Abstract
Background: Negative public attitudes promote undesirable stereotypes and stigma in stutterers. Method: To mitigate negative attitudes, 403 respondents combined from 16 international samples filled out the Public Opinion Survey of Human Attributes–Stuttering (POSHA–S) before and after interventions to improve attitudes and were compared to 249 respondents from seven control groups. Investigators aimed (a) to replicate an extreme case of regression to the mean (i.e., “crossover” effect) reported earlier in larger combined samples in which respondents with high pre-scores ended with low post-scores, respondents with low pre-scores finished with high post-scores, and intermediate scorers were unchanged; and (b) to identify individual POSHA–S items related to overall attitude change and among the high and low scorers. Results: As in previous studies, stuttering attitudes improved in the intervention group but not in the control group. Intervention and control respondents demonstrated “crossover” but less than the earlier samples due to lower pre–post correlations. Item contributions to pre–post change and differences among the three change groups were inconsistent; however, high agreement items by respondents were less likely to vary than low agreement items. Conclusion: The “crossover” effect was replicated, and future research should explore its presence in other measures or conditions.
Open Access Data Descriptor
Methodology for Generating, Augmenting, and Validating an Audio Dataset for Classifying Fire and Forest Sounds
by
Robert-Nicolae Boştinaru, Sebastian-Alexandru Drǎguşin, Nicu Bizon and Vasile-Gabriel Iana
Data 2026, 11(5), 110; https://doi.org/10.3390/data11050110 - 8 May 2026
Abstract
This paper proposes a methodological framework for building a binary audio dataset for the automatic classification of fire sounds and forest ambience. Two operational recordings, one for the fire class and one for the forest class, are used strictly as seed data for controlled segmentation and augmentation. The workflow includes mono conversion at 16 kHz, amplitude normalization, segmentation into 5 s windows with 2 s overlap, low-intensity stochastic augmentation, and the systematic logging of the generated samples. The study also explains why augmented data are appropriate for training and internal validation, while final performance claims must remain reserved for testing on independent, standardized real recordings.
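The segmentation step described above (16 kHz mono audio, 5 s windows, 2 s overlap) implies a hop of 3 s between window starts. The sketch below computes the sample ranges of each window under those settings. Dropping a trailing partial window is an assumption made here, not something the abstract states, and the function name is hypothetical.

```python
def segment_bounds(n_samples: int, sample_rate: int = 16_000,
                   window_s: float = 5.0, overlap_s: float = 2.0):
    """Return (start, end) sample indices of overlapping fixed-length windows.

    Windows that would run past the end of the recording are dropped
    (an assumption; padding the last window is another common choice).
    """
    if overlap_s >= window_s:
        raise ValueError("overlap must be shorter than the window")
    window = int(window_s * sample_rate)            # 80,000 samples at 16 kHz
    hop = int((window_s - overlap_s) * sample_rate)  # 48,000 samples (3 s)
    return [(start, start + window)
            for start in range(0, n_samples - window + 1, hop)]


# Hypothetical 20 s recording at 16 kHz.
bounds = segment_bounds(n_samples=20 * 16_000)
print(len(bounds), bounds[0], bounds[-1])
```

Because adjacent windows share 2 s of audio, segments cut from the same seed recording are strongly correlated, which is one reason the authors reserve final performance claims for independent recordings.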
(This article belongs to the Section Information Systems and Data Management)
Open Access Data Descriptor
A Curated Experimental Dataset of UCS and CBR Results from Biopolymer-Based Two-Additive Stabilisation Studies on Fine-Grained Soils
by
Abolfazl Baghbani, Delaram Bahrampour, Ahmad Moballegh and Firas Daghistani
Data 2026, 11(5), 109; https://doi.org/10.3390/data11050109 - 8 May 2026
Abstract
Published laboratory data on soil stabilisation are abundant, yet they remain fragmented across studies and are often difficult to reuse because of inconsistent reporting formats, heterogeneous testing conditions, and incomplete metadata. This article presents a curated experimental dataset compiled from 20 published studies on fine-grained soils, comprising 560 records, including 397 unconfined compressive strength (UCS) results and 163 California Bearing Ratio (CBR) results. The dataset is defined by the inclusion of laboratory studies designed around biopolymer-based two-additive stabilisation frameworks, while intentionally retaining untreated and single-additive comparator records reported within the same experimental programmes. This design is a key distinguishing feature of the dataset because it enables analysis of baseline soil behaviour, isolated additive effects, and combined-additive responses within a traceable study context. Across the included studies, the treatment systems cover a wide range of biopolymer- and lignin-related materials, including xanthan gum, guar gum, chitosan, sodium lignosulfonate, and electrolyte lignin stabiliser, together with complementary additives such as cement, lime, fly ash, ground granulated blast-furnace slag, rice husk ash, glass powder, concrete waste, nano-additives, and natural or synthetic fibres. In addition to UCS and CBR outcomes, the dataset preserves key contextual variables required for meaningful secondary reuse, including soil classification, grain-size fractions, Atterberg limits, compaction properties, curing duration, additive identities and dosages, and source-level traceability. The data are distributed as a structured Excel workbook comprising two cleaned outcome-specific sheets (CBR_clean and UCS_clean) and four supporting documentation sheets (StudyInventory, DataDictionary, VocabularyMap, and QC_Log). Record-level identifiers, DOI-linked source fields, inferred-curing flags, and qualified outcome descriptors are retained to support auditability, selective filtering, and reproducible reuse. The resulting dataset provides a practical foundation for comparative assessment of stabilisation strategies, pavement and subgrade engineering studies, meta-analysis, and machine learning applications in geotechnical engineering.
(This article belongs to the Section Information Systems and Data Management)
Open Access Article
A Data-Driven Approach to Cardiometabolic Risk Stratification: Development of the Adiposity-Fitness Imbalance Index Using a National Chilean Dataset
by Rodrigo Yáñez-Sepúlveda, José Francisco Tornero-Aguilera, Mario Muñoz-López, Edgar Sancho-Haro, Yeny Concha-Cisternas, Exal Garcia-Carrillo, Jacqueline Páez-Herrera, Felipe Montalva-Valenzuela and Eduardo Guzmán-Muñoz
Data 2026, 11(5), 108; https://doi.org/10.3390/data11050108 - 8 May 2026
Abstract
The increasing prevalence of adolescent obesity and declining physical fitness highlights the need for integrative, non-invasive tools to identify central-adiposity–related cardiometabolic risk early. This study aimed to develop and analytically evaluate the adiposity–fitness imbalance (AFI) index and to examine its association with an anthropometric proxy of cardiometabolic risk (waist-to-height ratio > 0.50) in a nationally representative sample of Chilean adolescents. This cross-sectional study analyzed data from 7852 students from the Chilean National Physical Fitness Assessment System (SIMCE-EF). The AFI index was calculated as the difference between standardized adiposity and fitness components. Logistic and robust linear regression models were used. Higher standing long jump (OR = 0.69, 95% CI 0.65–0.74), push-ups (OR = 0.76, 95% CI 0.71–0.80), sit-ups (OR = 0.81, 95% CI 0.77–0.85), and VO2max (OR = 0.82, 95% CI 0.75–0.89) were associated with lower odds of elevated WHtR (all p < 0.001), and a small protective association was also observed for flexibility (OR = 0.93, 95% CI 0.88–0.99, p = 0.016). Each one-standard-deviation increase in the AFI index was associated with a substantially higher odds of elevated WHtR (OR = 26.74, 95% CI 22.57–31.68, p < 0.001). In a sensitivity analysis that removed WHtR from the adiposity pillar, to avoid component–outcome overlap, the AFI index remained strongly associated with the outcome (OR per 1 SD = 14.60, 95% CI 12.77–16.70), with internal-validation discrimination of AUC = 0.93. The AFI index may represent a practical and scalable tool for early screening of central-adiposity–related risk in adolescents.
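The index construction described in this abstract (the difference between standardized adiposity and fitness components) can be illustrated with a minimal Python sketch. This is not the authors' code. The use of a single measure per component, the population standard deviation, the variable names, and the sample values are all assumptions made for illustration; the abstract does not state which measures enter each component or how they are combined.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


def afi_index(adiposity, fitness):
    """Standardized adiposity minus standardized fitness, per participant.

    Higher values indicate more adiposity relative to fitness.
    """
    if len(adiposity) != len(fitness):
        raise ValueError("both components need one value per participant")
    return [a - f for a, f in zip(z_scores(adiposity), z_scores(fitness))]


# Hypothetical values for four participants (not taken from the dataset).
waist_to_height = [0.42, 0.48, 0.55, 0.61]   # assumed adiposity measure
long_jump_cm = [190.0, 170.0, 150.0, 130.0]  # assumed fitness measure

afi = afi_index(waist_to_height, long_jump_cm)
```

Because each component is centred before subtraction, the index averages to zero over the sample, so positive values mark participants whose adiposity is high relative to their fitness.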
(This article belongs to the Section Information Systems and Data Management)
Open Access Article
FD-TamperBoard: A Tampering Features Dataset of Fuel Dispenser PCBs for Illicit Metering Detection
by Chenbo Pei, Bin Wang, Xingchuang Xiong, Zhanshuo Cao and Zilong Liu
Data 2026, 11(5), 107; https://doi.org/10.3390/data11050107 - 7 May 2026
Abstract
With the development of the Internet of Things (IoT) and microelectronics technology, the methods used to tamper with fuel dispensers have become increasingly concealed, posing significant challenges to market supervision and law enforcement. This paper releases a tampering features dataset of assembled printed circuit boards (PCBs) from fuel dispensers, aiming to provide high-quality data support for automated, computer-vision-based illicit metering detection. The dataset encompasses multi-class tampering features derived from 189 high-resolution images of PCBs seized during real-world law enforcement, covering 5 mainstream brands. To eliminate perspective bias, rigorous lens distortion correction and four-point homography transformation preprocessing were conducted on the images. Additionally, six typical tampering features (e.g., the addition of tampered surface-mount resistors) were manually and precisely annotated, and then cross-checked and confirmed by domain experts. Furthermore, the dataset was benchmarked using multiple generations of You Only Look Once (YOLO) object detection models (Baseline Validation), which have been demonstrated to handle both large and small object detection in high-resolution images. The evaluation results, including confusion matrices and t-distributed Stochastic Neighbor Embedding (t-SNE) feature clustering diagrams, demonstrate the reliability and effectiveness of this dataset for training high-precision fraud detection models. This dataset is intended to support computer vision and anti-fraud research, promoting the automated development of fuel dispenser tampering detection.
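The four-point homography step mentioned in this abstract can be sketched from first principles. The code below is an illustrative Python example and not the authors' pipeline: it solves the standard eight-equation linear system for a 3x3 projective matrix (last entry fixed to 1) from four point correspondences. The corner coordinates are hypothetical, and the abstract does not say which library or target geometry the authors used; a real workflow would more likely rely on an image-processing library.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def four_point_homography(src, dst):
    """Return the 3x3 matrix mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and likewise for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(hm, point):
    """Map one (x, y) point through the homography, with perspective divide."""
    x, y = point
    w = hm[2][0] * x + hm[2][1] * y + hm[2][2]
    u = (hm[0][0] * x + hm[0][1] * y + hm[0][2]) / w
    v = (hm[1][0] * x + hm[1][1] * y + hm[1][2]) / w
    return u, v


# Hypothetical board corners in a skewed photo and their rectified targets.
src_corners = [(12, 8), (410, 20), (398, 300), (5, 290)]
dst_corners = [(0, 0), (400, 0), (400, 300), (0, 300)]
h_matrix = four_point_homography(src_corners, dst_corners)
```

Applying `apply_homography(h_matrix, p)` to each source corner reproduces the corresponding target corner up to floating-point error, which is a quick sanity check before warping a whole image.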
Open Access Data Descriptor
Multimodal Dataset of In-Home Physiological and Inertial Measurements from Older Heart Failure Patients
by Marcin Kolakowski, Vitomir Djaja-Josko, Jerzy Kolakowski, Irina Georgiana Mocanu, Oana Cramariuc, Ian Perera, Jerzy Gąsowski and Karolina Piotrowicz
Data 2026, 11(5), 106; https://doi.org/10.3390/data11050106 - 7 May 2026
Abstract
Numerous studies have shown that remote monitoring of heart failure patients can reduce hospital readmission rates and mortality. This dataset includes multimodal physiological and inertial signals (acceleration and angular velocity data) recorded with PerHeart—a remote health monitoring platform intended for heart failure patients. In the pilot, which took place in Poland, 27 participants’ health was monitored for one month using the platform with commercially available devices (blood pressure meters, pulse oximeters, bathroom scales, thermometers, and glucometers), resulting in over four thousand physiological measurements. Eight adults were additionally monitored for gait and activity analysis using custom wrist sensors with inertial measurement units, yielding 2536 h of movement data collected over 204 days with almost 690,000 steps detected.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 3rd Edition)
Open Access Article
A Conceptual Framework for Semantic Indexing of Data Sources Based on Structured Peer-to-Peer Model, Hilbert Curve, Hypercube and Data Analysis
by Mohammed Ammari, Fadwa Ammari and Abdelaziz Boumahdi
Data 2026, 11(5), 105; https://doi.org/10.3390/data11050105 - 5 May 2026
Abstract
Semantic indexing ensures better organization and optimized searching of heterogeneous, autonomous, and distributed data sources. This approach leverages meaning and context rather than just keywords to better manage the increasing volume, complexity, and heterogeneity of modern data, enabling precise searching, optimized integration, and improved interoperability between domains. Several approaches to semantic indexing are available: ontology-based indexing, machine learning and automated semantic annotation of data sources. However, the main challenge remains scaling up. This article focuses on a conceptual framework designed for scalable semantic indexing of data sources based on a structured peer-to-peer architecture adapted for managing a very large number of nodes, Hilbert curve renowned for its preservation of semantic affinity while scaling, hypercube structure with its efficient diffusion algorithm, semantic annotation of data sources based on keywords, as well as machine learning techniques, in particular, multidimensional data analysis. An illustrative exploratory example of the Meta Skills semantic class is presented to outline the proposed architecture. This study proposes a conceptual and exploratory framework for large-scale semantic indexing of data sources. The proposed approach has not yet been implemented or validated on a large scale; its objective is to provide an initial structured model to serve as a basis for future empirical research.
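Two of the building blocks named in this abstract have compact, well-known formulations that a short Python sketch can make concrete. The abstract states that the framework has not been implemented, so the code below is only a generic illustration and not the authors' design: `hilbert_index` is the standard mapping from a 2D grid cell to its position along a Hilbert curve, and `hypercube_neighbors` lists the nodes that differ from a given node in exactly one bit of its label. The function names, the 2D grid, and the parameters are assumptions.

```python
def hilbert_index(order, x, y):
    """Position of cell (x, y) along the Hilbert curve on a 2**order grid."""
    n = 1 << order
    if not (0 <= x < n and 0 <= y < n):
        raise ValueError("cell lies outside the grid")
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hypercube_neighbors(node, dimension):
    """Labels of the nodes adjacent to `node` in a hypercube of this dimension."""
    return [node ^ (1 << bit) for bit in range(dimension)]


# Walk an 8 x 8 grid (order 3) and record which cell each curve index covers.
order = 3
side = 1 << order
cell_at = {
    hilbert_index(order, x, y): (x, y)
    for x in range(side)
    for y in range(side)
}
```

Consecutive curve indices always fall on adjacent grid cells, which is the locality property that makes the curve attractive for keeping related items close together in a one-dimensional key space.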
(This article belongs to the Section Information Systems and Data Management)
Open Access Article
Development of Intra-Individual Process Metrics in a Serious-Video Game Intervention for ADHD
by Marina Martin-Moratinos, Marcos Bella-Fernández, Maria Rodrigo-Yanguas, Carlos González-Tardón, Aarón Sújar and Hilario Blasco-Fontecilla
Data 2026, 11(5), 104; https://doi.org/10.3390/data11050104 - 5 May 2026
Abstract
(1) Background: Attention-deficit/hyperactivity disorder (ADHD) is characterized by persistent difficulties related to inattention, hyperactivity, and impulsivity, which significantly impair daily functioning. The primary objective of this study is to examine the utility of intra-individual metrics as indicators of dynamic cognitive regulation during the intervention with a serious video game (The Secret Trail of Moon, MOON). (2) Methods: Performance data were collected from participants with ADHD enrolled in a randomized clinical trial. Within the MOON group, intra-individual metrics were derived from repeated gameplay sessions of a continuous performance task. For each participant, simple linear regression models were used to estimate the slope of performance across repeated exposures to the task. Slopes were interpreted as indicators of intra-individual change over time. The within-subject standard deviation was also calculated to observe how much a person’s performance fluctuates between sessions. (3) Results: A total of 76 patients with ADHD participated in the clinical trial and were randomized in a 1:1 ratio (MOON: n = 38, 50% and control: n = 38, 50%). The mean performance index of the MOON group (M = 0.88, SD = 0.09) indicates a generally high level of response accuracy, with moderate inter-individual variability across participants. Notably, moderate intra-individual variability (e.g., RT variability, lapse-related indices) was observed, suggesting fluctuations in attentional control despite stable average performance. The absence of linear improvement should not be interpreted as a lack of intervention effect, but rather as evidence of rapid task familiarization and ceiling effects. (4) Conclusions: Intra-individual variability may be a key metric for understanding attentional control in ecological, game-based environments. In this context, performance variability and attentional stability emerge as more sensitive indicators of cognitive regulation than mean-level changes.
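The two intra-individual metrics described in the Methods (a per-participant regression slope across repeated sessions and a within-subject standard deviation) can be sketched in a few lines of Python. This is an illustration and not the authors' analysis code: it assumes sessions are numbered 1, 2, 3, ... as the predictor, uses the sample standard deviation, and runs on made-up scores for two hypothetical participants.

```python
from statistics import mean, stdev


def ols_slope(scores):
    """Least-squares slope of scores regressed on session number 1..n."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two sessions to estimate a slope")
    sessions = list(range(1, n + 1))
    x_bar = mean(sessions)
    y_bar = mean(scores)
    sxx = sum((x - x_bar) ** 2 for x in sessions)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sessions, scores))
    return sxy / sxx


def intra_individual_metrics(scores_by_participant):
    """Slope (change over sessions) and within-subject SD for each participant."""
    return {
        pid: {"slope": ols_slope(scores), "within_sd": stdev(scores)}
        for pid, scores in scores_by_participant.items()
    }


# Hypothetical performance indices per session (not data from the trial).
scores_by_participant = {
    "P01": [0.80, 0.85, 0.90, 0.95],  # steady improvement
    "P02": [0.90, 0.84, 0.92, 0.86],  # flat trend, fluctuating sessions
}
metrics = intra_individual_metrics(scores_by_participant)
```

In this toy input, P02 has a slope close to zero but a clearly non-zero within-subject SD, the pattern the abstract describes as fluctuation despite stable average performance.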
(This article belongs to the Section Information Systems and Data Management)
Topics
Topic in
Algorithms, Data, Earth, Geosciences, Mathematics, Land, Water, IJGI
Applications of Algorithms in Risk Assessment and Evaluation
Topic Editors: Yiding Bao, Qiang Wei
Deadline: 31 July 2026
Topic in
AI, Algorithms, BDCC, Computers, Data, Future Internet, Informatics, Information, MAKE, Publications, Smart Cities
Learning to Live with Gen-AI
Topic Editors: Antony Bryant, Paolo Bellavista, Kenji Suzuki, Horacio Saggion, Roberto Montemanni, Andreas Holzinger, Min Chen
Deadline: 31 August 2026
Topic in
Geosciences, IJGI, Remote Sensing, Sensors, Data
Advances in Sensor Data Fusion and AI for Environmental Monitoring
Topic Editors: Zhenyu Yu, Mohd Yamani Idna Idris, Yu Li, Aleksandar Dj Valjarević
Deadline: 30 September 2026
Topic in
Aerospace, Applied Sciences, Data, Remote Sensing, Sensors, Universe
Techniques and Science Exploitations for Earth Observation and Planetary Exploration-2nd Edition
Topic Editors: Yu Tao, Siting Xiong, Rui Song
Deadline: 30 November 2026
Special Issues
Special Issue in
Data
Lossy Compression of Scientific Data
Guest Editors: Jinyang Liu, Sian Jin
Deadline: 31 May 2026
Special Issue in
Data
Digital Transformation in Materials Science: Data-Driven Approaches to Metal Matrix Composites
Guest Editors: Tzanko Donchev, Mihail Kolev
Deadline: 30 June 2026
Special Issue in
Data
Ethical AI and Responsible Data Science
Guest Editor: Donghee Shin
Deadline: 30 June 2026
Special Issue in
Data
Data in Behavioral and Experimental Research: Datasets and Applications
Guest Editors: Jie Zheng, Jaimie W Lien
Deadline: 31 July 2026
Topical Collections
Topical Collection in
Data
Modern Geophysical and Climate Data Analysis: Tools and Methods
Collection Editors: Vladimir Sreckovic, Zoran Mijic


