Data, Volume 10, Issue 6 (June 2025) – 13 articles

Cover Story: Artificial intelligence (AI) shows great potential in the field of biodiversity conservation; however, high-quality, regional datasets are scarce, particularly for endangered species in urban environments. In this paper, we introduce Macao-ebird, a novel dataset for the AI-driven conservation of birds in Macao. The dataset comprises two subsets: Macao-ebird-cls for classification (7341 images; 24 species) and Macao-ebird-det for object detection, created with efficient AI-assisted labeling. Validation with YOLOv8–v12 series models achieved a high mAP50 of 0.984. By focusing on region-specific endangered species in a complex urban setting, Macao-ebird provides a crucial benchmark for advancing AI in avian conservation.
  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive table of contents of newly released issues.
  • PDF is the official format for papers published in both HTML and PDF forms. To view the papers in PDF format, click on the "PDF Full-text" link, and use the free Adobe Reader to open them.
11 pages, 841 KiB  
Data Descriptor
Sensor-Based Monitoring Data from an Industrial System of Centrifugal Pumps
by Angelo Martone, Alessia D’Ambrosio, Michele Ferrucci, Assuntina Cembalo, Gianpaolo Romano and Gaetano Zazzaro
Data 2025, 10(6), 91; https://doi.org/10.3390/data10060091 (registering DOI) - 19 Jun 2025
Abstract
We present a detailed dataset collected via a wireless IoT sensor network monitoring three industrial centrifugal pumps (units A, B, and C) at the Italian Aerospace Research Centre (CIRA), along with the methods for data collection and structuring. Background: Centrifugal pumps are critical in industrial plants, and monitoring their condition is essential to ensure reliability, safety, and efficiency. High-quality operational data under normal operating conditions are fundamental for developing effective maintenance strategies and diagnostic models. Methods: Data were gathered by means of smart sensors measuring motor and pump vibrations, temperatures, outlet fluid pressures, and environmental conditions. Data were transmitted over a WirelessHART mesh network and acquired through an IoT architecture. Results: The dataset consists of eight CSV files, each representing a specific pump during a distinct operational day. Each file includes timestamped measurements of displacement, peak vibration values, sensor temperatures, fluid pressure, ambient temperature, and atmospheric pressure. Conclusions: This dataset supports advanced methodologies in feature extraction, multivariate signal analysis, unsupervised pattern discovery, vibration analysis, and the development of digital twins and soft sensing models for predictive maintenance optimization. Full article

22 pages, 979 KiB  
Article
Machine Learning Applications for Predicting High-Cost Claims Using Insurance Data
by Esmeralda Brati, Alma Braimllari and Ardit Gjeçi
Data 2025, 10(6), 90; https://doi.org/10.3390/data10060090 - 17 Jun 2025
Viewed by 14
Abstract
Insurance is essential for financial risk protection, but claim management is complex and requires accurate classification and forecasting strategies. This study aimed to empirically evaluate the performance of classification algorithms, including Logistic Regression, Decision Tree, Random Forest, XGBoost, K-Nearest Neighbors, Support Vector Machine, and Naïve Bayes, in predicting high-cost insurance claims. The research analyses the variables of claims, vehicles, and insured parties that influence the classification of high-cost claims. This investigation utilizes a dataset comprising 802 observations of bodily injury claims from the motor liability portfolio of a private insurance company in Albania, covering the period from 2018 to 2024. To evaluate and compare the performance of the models, we employed evaluation criteria including classification accuracy (CA), area under the curve (AUC), confusion matrix, and error rates. We found that Random Forest performs best, achieving the highest classification accuracy (CA = 0.8867, AUC = 0.9437) with the lowest error rates, followed by the XGBoost model, while Logistic Regression demonstrated the weakest performance. Key predictive factors in high-cost claim classification include claim type, deferred period, vehicle brand, and age of driver. These findings highlight the potential of machine learning models to improve claim classification and risk assessment and to refine underwriting policy.

23 pages, 3249 KiB  
Article
User Experience and Perceptions of AI-Generated E-Commerce Content: A Survey-Based Evaluation of Functionality, Aesthetics, and Security
by Chrysa Stamkou, Vaggelis Saprikis, George F. Fragulis and Ioannis Antoniadis
Data 2025, 10(6), 89; https://doi.org/10.3390/data10060089 - 17 Jun 2025
Viewed by 8
Abstract
The integration of generative artificial intelligence (AI) in e-commerce is steadily increasing and taking different forms, transforming content creation, yet its impact on user experience remains underexplored. This study examines user perceptions of AI-generated e-commerce content, focusing on functionality, aesthetics, and security. A survey was conducted in which 223 participants were asked to browse the pages of an online store developed using ChatGPT and DALL·E and to evaluate it by providing feedback through a constructed questionnaire. The collected data were subjected to descriptive statistical analysis, exploratory factor analysis (EFA), and comparative statistical tests to identify key user experience dimensions and possible demographic variances in satisfaction. Factor analysis extracted two main components influencing user experience: “Service Quality and Security” and “Design and Aesthetics”. Further analysis highlighted a slight variation in user evaluations between male and female participants. Although security-related questions were addressed with caution, the rest of the findings indicate that AI-generated content was well received and highly rated. Clearly, generative AI is a valuable tool for businesses, AI developers, and anyone seeking to optimize AI-driven processes to enhance user engagement. It can be confidently concluded that it positively contributes to the development of a functional and aesthetically appealing e-commerce platform.

32 pages, 2284 KiB  
Article
Rethinking Inequality: The Complex Dynamics Beyond the Kuznets Curve
by Sarthak Pattnaik, Maryan Rizinski and Eugene Pinsky
Data 2025, 10(6), 88; https://doi.org/10.3390/data10060088 - 14 Jun 2025
Viewed by 120
Abstract
Income inequality has emerged as a defining challenge of our time, particularly in advanced economies, where the gap between rich and poor has reached unprecedented levels. This study analyzes income inequality trends from 2000 to 2023 across developed countries (the United States, the United Kingdom, Germany, and France) and developing nations using World Bank Gini coefficient data. We employ comprehensive visualization techniques, Pareto distribution analysis, and ARIMA time-series forecasting models to evaluate the effectiveness of the Kuznets curve as a predictor of income inequality. Our analysis reveals significant deviations from the traditional inverse U-shaped Kuznets curve across all examined countries, with persistent volatility rather than the predicted decline in inequality. Forecasts using ARIMA and neural networks indicate continued fluctuations in inequality through 2030, with the U.S. and Germany showing upward trends while France and the UK demonstrate relative stability. These findings challenge the conventional Kuznets hypothesis and demonstrate that contemporary inequality patterns are influenced by factors beyond economic development, including technological change, globalization, and policy choices. This research contributes to the literature by providing empirical evidence that the Kuznets curve has limited predictive power in modern economies, informing policymakers about the need for targeted interventions to address persistent inequality rather than relying on economic growth alone. Full article

17 pages, 6224 KiB  
Data Descriptor
A Structured Dataset for Automated Grading: From Raw Data to Processed Dataset
by Ibidapo Dare Dada, Adio T. Akinwale and Ti-Jesu Tunde-Adeleke
Data 2025, 10(6), 87; https://doi.org/10.3390/data10060087 - 6 Jun 2025
Viewed by 396
Abstract
The increasing volume of student assessments, particularly open-ended responses, presents a significant challenge for educators in ensuring grading accuracy, consistency, and efficiency. This paper presents a structured dataset designed for the development and evaluation of automated grading systems in higher education. The primary objective is to create a high-quality dataset that facilitates the development and evaluation of natural language processing (NLP) models for automated grading. The dataset comprises student responses to open-ended questions from the Management Information Systems (MIS221) and Project Management (MIS415) courses at Covenant University, collected during the 2022/2023 academic session. The responses were originally handwritten, scanned, and transcribed into Word documents. Each response is paired with corresponding scores assigned by human graders, following a detailed marking guide. To assess the dataset’s potential for automated grading applications, several machine learning and transformer-based models were tested, including TF-IDF with Linear Regression, TF-IDF with Cosine Similarity, BERT, SBERT, RoBERTa, and Longformer. The experimental results demonstrate that transformer-based models outperform traditional methods, with Longformer achieving the highest Spearman’s Correlation of 0.77 and the lowest Mean Squared Error (MSE) of 0.04, indicating a strong alignment between model predictions and human grading. The findings highlight the effectiveness of deep learning models in capturing the semantic and contextual meaning of both student responses and marking guides, making it possible to develop more scalable and reliable automated grading solutions. This dataset offers valuable insights into student performance and serves as a foundational resource for integrating educational technology into automated assessment systems. Future work will focus on enhancing grading consistency and expanding the dataset for broader academic applications. 

10 pages, 267 KiB  
Article
Dataset on Programming Competencies Development Using Scratch and a Recommender System in a Non-WEIRD Primary School Context
by Jesennia Cárdenas-Cobo, Cristian Vidal-Silva and Nicolás Máquez
Data 2025, 10(6), 86; https://doi.org/10.3390/data10060086 - 3 Jun 2025
Viewed by 302
Abstract
The ability to program has become an essential competence for individuals in an increasingly digital world. However, access to programming education remains unequal, particularly in non-WEIRD (Western, Educated, Industrialized, Rich, and Democratic) contexts. This study presents a dataset resulting from an educational intervention designed to foster programming competencies and computational thinking skills among primary school students aged 8 to 12 years in Milagro, Ecuador. The intervention integrated Scratch, a block-based programming environment that simplifies coding by eliminating syntactic barriers, and the CARAMBA recommendation system, which provided personalized learning paths based on students’ progression and preferences. A structured educational process was implemented, including an initial diagnostic test to assess logical reasoning, guided activities in Scratch to build foundational skills, a phase of personalized practice with CARAMBA, and a final computational thinking evaluation using a validated assessment instrument. The resulting dataset encompasses diverse information: demographic data, logical reasoning test scores, computational thinking test results pre- and post-intervention, activity logs from Scratch, recommendation histories from CARAMBA, and qualitative feedback from university student tutors who supported the intervention. The dataset is anonymized, ethically collected, and made available under a CC-BY 4.0 license to encourage reuse. This resource is particularly valuable for researchers and practitioners interested in computational thinking development, educational data mining, personalized learning systems, and digital equity initiatives. It supports comparative studies between WEIRD and non-WEIRD populations, validation of adaptive learning models, and the design of inclusive programming curricula. 
Furthermore, the dataset enables the application of machine learning techniques to predict educational outcomes and optimize personalized educational strategies. By offering this dataset openly, the study contributes to filling critical gaps in educational research, promoting inclusive access to programming education, and fostering a more comprehensive understanding of how computational competencies can be developed across diverse socioeconomic and cultural contexts. Full article

27 pages, 1199 KiB  
Article
Event Prediction Using Spatial–Temporal Data for a Predictive Traffic Accident Approach Through Categorical Logic
by Eleftheria Koutsaki, George Vardakis and Nikos Papadakis
Data 2025, 10(6), 85; https://doi.org/10.3390/data10060085 - 3 Jun 2025
Viewed by 334
Abstract
An event is an occurrence that takes place at a specific time and location that can be either weather-related (snowfall), social (crime), natural (earthquake), political (political unrest), or medical (pandemic) in nature. These events do not belong to the “normal” or “usual” spectrum and result in a change in a given situation; thus, their prediction would be very beneficial, both in terms of timely response to them and for their prevention, for example, the prevention of traffic accidents. However, this is currently challenging for researchers, who are called upon to manage and analyze a huge volume of data in order to design applications for predicting events using artificial intelligence and high computing power. Although significant progress has been made in this area, the heterogeneity in the input data that a forecasting application needs to process—in terms of their nature (spatial, temporal, and semantic)—and the corresponding complex dependencies between them constitute the greatest challenge for researchers. For this reason, the initial forecasting applications process data for specific situations, in terms of number and characteristics, while, at the same time, having the possibility to respond to different situations, e.g., an application that predicts a pandemic can also predict a central phenomenon, simply by using different data types. In this work, we present the forecasting applications that have been designed to date. We also present a model for predicting traffic accidents using categorical logic, creating a Knowledge Base using the Resolution algorithm as a proof of concept. We study and analyze all possible scenarios that arise under different conditions. Finally, we implement the traffic accident prediction model using the Prolog language with the corresponding Queries in JPL. Full article
(This article belongs to the Section Information Systems and Data Management)

15 pages, 3594 KiB  
Article
Macao-ebird: A Curated Dataset for Artificial-Intelligence-Powered Bird Surveillance and Conservation in Macao
by Xiaoyuan Huang, Silvia Mirri and Su-Kit Tang
Data 2025, 10(6), 84; https://doi.org/10.3390/data10060084 - 30 May 2025
Viewed by 373
Abstract
Artificial intelligence (AI) currently exhibits considerable potential within the realm of biodiversity conservation. However, high-quality regionally customized datasets remain scarce, particularly within urban environments. The existing large-scale bird image datasets often lack a dedicated focus on endangered species endemic to specific geographic regions, as well as a nuanced consideration of the complex interplay between urban and natural environmental contexts. Therefore, this paper introduces Macao-ebird, a novel dataset designed to advance AI-driven recognition and conservation of endangered bird species in Macao. The dataset comprises two subsets: (1) Macao-ebird-cls, a classification dataset with 7341 images covering 24 bird species, emphasizing endangered and vulnerable species native to Macao; and (2) Macao-ebird-det, an object detection dataset generated through AI-agent-assisted labeling using grounding DETR with improved denoising anchor boxes (DINO), significantly reducing manual annotation effort while maintaining high-quality bounding-box annotations. We validate the dataset’s utility through baseline experiments with the You Only Look Once (YOLO) v8–v12 series, achieving a mean average precision (mAP50) of up to 0.984. Macao-ebird addresses critical gaps in the existing datasets by focusing on region-specific endangered species and complex urban–natural environments, providing a benchmark for AI applications in avian conservation. Full article
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 2nd Edition)

16 pages, 342 KiB  
Article
Strategies for Embedding Research Data Management Through Effective Communication
by Fadwa Alshawaf
Data 2025, 10(6), 83; https://doi.org/10.3390/data10060083 - 27 May 2025
Viewed by 304
Abstract
Effective research data management (RDM) is essential for ensuring research integrity, reproducibility, and compliance with FAIR principles. Despite the development of comprehensive RDM frameworks, many institutions still struggle to ensure widespread engagement and compliance among researchers and staff. Adoption of RDM practices remains slow due to limited awareness, unclear benefits, and perceived administrative burdens. Using Mendelow’s Matrix, this study draws on survey data to map key stakeholders, such as researchers, RDM professionals, institutional leadership, funding bodies, and infrastructure providers, based on their power and interest to ensure developing tailored communication strategies. This paper presents a communication strategy to enhance RDM adoption by improving visibility, fostering engagement, and encouraging the integration of RDM into research workflows and curricula. It outlines key approaches, including awareness campaigns, targeted publishing, strategic partnerships, and knowledge-driven promotion to embed RDM into research workflows. Full article

15 pages, 3561 KiB  
Data Descriptor
Acoustic Data on Vowel Nasalization Across Prosodic Conditions in L1 Korean and L2 English by Native Korean Speakers
by Jiyoung Jang, Sahyang Kim and Taehong Cho
Data 2025, 10(6), 82; https://doi.org/10.3390/data10060082 - 23 May 2025
Viewed by 382
Abstract
This article presents acoustic data on coarticulatory vowel nasalization from the productions of twelve L1 Korean speakers and of fourteen Korean learners of L2 English. The dataset includes eight monosyllabic target words embedded in eight carrier sentences, each repeated four times per speaker. Half of the words contain a nasal coda such as p*am in Korean and bomb in English and the other half a nasal onset such as mat in Korean and mob in English. These were produced under varied prosodic conditions, including three phrase positions and two focus conditions, enabling analysis of prosodic effects on vowel nasalization across languages along with individual speaker variation. The accompanying CSV files provide acoustic measurements such as nasal consonant duration, A1-P0, and normalized A1-P0 at multiple timepoints within the vowel. While theoretical implications have been discussed in two published studies, the full dataset is published here. By making these data publicly available, we aim to promote broad reuse and encourage further research at the intersection of prosody, phonetics, and second language acquisition—ultimately advancing our understanding of how phonetic patterns emerge, transfer, and vary across languages and learners. Full article

11 pages, 1781 KiB  
Data Descriptor
Electroencephalogram Dataset of Visually Imagined Arabic Alphabet for Brain–Computer Interface Design and Evaluation
by Rami Alazrai, Khalid Naqi, Alaa Elkouni, Amr Hamza, Farah Hammam, Sahar Qaadan, Mohammad I. Daoud, Mostafa Z. Ali and Hasan Al-Nashash
Data 2025, 10(6), 81; https://doi.org/10.3390/data10060081 - 22 May 2025
Viewed by 362
Abstract
Visual imagery (VI) is a mental process in which an individual generates and sustains a mental image of an object without physically seeing it. Recent advancements in assistive technology have enabled the utilization of VI mental tasks as a control paradigm to design brain–computer interfaces (BCIs) capable of generating numerous control signals. This, in turn, enables the design of control systems to assist individuals with locked-in syndrome in communicating and interacting with their environment. This paper presents an electroencephalogram (EEG) dataset captured from 30 healthy native Arabic-speaking subjects (12 females and 18 males; mean age: 20.8 years; age range: 19–23) while they visually imagined the 28 letters of the Arabic alphabet. Each subject conducted 10 trials per letter, resulting in 280 trials per participant and a total of 8400 trials for the entire dataset. The EEG signals were recorded using the EMOTIV Epoc X wireless EEG headset (San Francisco, CA, USA), which is equipped with 14 data electrodes and two reference electrodes arranged according to the 10–20 international system, with a sampling rate of 256 Hz. To the best of our knowledge, this is the first EEG dataset that focuses on visually imagined Arabic letters. Full article

13 pages, 464 KiB  
Article
Population Genetics Data of 21 Autosomal STR Loci in the Romanian Population
by George Popoiu, Florin Stanciu, Veronica Cuțăr, Simona Vladu, Paulina Podgoreanu, Violeta Nicola, Ionel Marius Stoian, Anastasia Procopciuc, Bogdan Hațegan, Bogdan Negoiță, Alis Mihaela Păunache, Adnana Cotolea, Ana Rădulescu, Adrian Constantin Hubca and Sergiu Emil Georgescu
Data 2025, 10(6), 80; https://doi.org/10.3390/data10060080 - 22 May 2025
Viewed by 313
Abstract
This study aimed to determine the allele frequencies and genetic diversity of 21 autosomal short tandem repeat (STR) loci from the Expanded U.S. Core Loci and European Standard Set in the Romanian population. A random sample of 928 unrelated men from all Romanian counties was analyzed using the Investigator 24plex QS and Investigator 24plex GO! Kits (Qiagen). The genotypes were determined, and the allele frequencies were calculated using the STRidER tool. The results provide updated population genetic data for the Romanian population, which is essential for accurate calculation of DNA evidence weight in forensic casework. Full article

10 pages, 923 KiB  
Data Descriptor
Dataset of the Effects of a Low Dose of Isoflavones in Beef Cattle Undergoing Tall Fescue Toxicosis
by Juan F. Cordero-Llarena, Kyle J. McLean, Madison T. Henniger, F. Neal Schrick, Gary E. Bates and Phillip R. Myer
Data 2025, 10(6), 79; https://doi.org/10.3390/data10060079 - 22 May 2025
Viewed by 282
Abstract
Tall fescue toxicosis negatively impacts blood flow, elevates body temperature, and reduces beef cattle’s average daily gain (ADG). In previous studies, isoflavones have diminished the symptoms of tall fescue toxicosis in ruminants. Therefore, this dataset determined the impact of low concentrations of isoflavone doses on animal vasculature, body temperature, ADG, and rumen microbial communities in beef cattle. A 21-day experiment with Angus cattle consisted of four isoflavone doses: 0 g, 2 g, 4 g, and 6 g, along with a control group. Isoflavones were mixed with 0.5 kg of dried distiller’s grains (DDGs). Daily individual rectal temperatures were recorded. Weekly blood serum was collected via coccygeal venipuncture, blood vasculature data were measured via color Doppler ultrasound, and body weight (BW) was recorded. Approximately 100 mL of rumen content was collected at the end of the trial. The pulsatility index (PI) decreased in the control group compared to the 2 g and 4 g groups (p = 0.01). Animals in the isoflavone treatment groups recorded a higher rectal temperature (p < 0.05). ADG was reduced in animals undergoing isoflavone treatments (p < 0.001). Finally, there was no impact on the rumen microbial communities (p > 0.05). Isoflavone supplementation may mitigate tall fescue toxicosis and improve animal performance at greater doses. Full article