Data

9 pages, 1210 KB

Open Access Data Descriptor

Preferred Colleague Dataset: A Human-Annotated Dataset of Perceived Colleague Preference

by Deepu Krishnareddy, Bakir Hadžić, Hamid Gazerpour, Michael Danner, Zhuoqi Zeng and Matthias Rätsch

Data 2026, 11(5), 100; https://doi.org/10.3390/data11050100 - 1 May 2026

Abstract

Recruitment is a time-consuming process, and AI systems are increasingly being used to support the decision-making process. However, machine learning models used in such systems can inherit bias if the underlying training data reflects biased human preferences. It is essential to analyze and quantify these biases in order to develop fairer AI systems. To address this issue, we collected human judgments of colleague preference for 2200 face images. The face image set includes images of different ethnicities and genders, as well as both real and synthetically generated faces. The images were annotated by humans from diverse backgrounds in terms of age, gender, and ethnicity. Annotators were shown a series of pairs of face images and asked to select which individual they would prefer as a colleague. We gathered responses from 451 annotators and aggregated the annotations to compute a preference score for each image. This dataset provides a basis for understanding human bias in colleague preference and can support the development of fair and unbiased AI models for use in recruitment settings.

(This article belongs to the Special Issue Data in Behavioral and Experimental Research: Datasets and Applications)
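
The abstract does not say how the pairwise choices are aggregated into a per-image preference score. One simple possibility, shown here purely as an assumption, is the fraction of comparisons an image wins. The function name `preference_scores` and the `(winner, loser)` tuple layout are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

def preference_scores(comparisons):
    """Return {image_id: wins / appearances} for (winner, loser) pairs.

    Assumption: a plain win rate; the published dataset may use a
    different aggregation.
    """
    wins = defaultdict(int)
    shown = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        shown[winner] += 1
        shown[loser] += 1
    return {image: wins[image] / shown[image] for image in shown}

# Hypothetical annotations: each tuple is (preferred image, other image).
example = [("img_a", "img_b"), ("img_a", "img_c"), ("img_b", "img_c")]
scores = preference_scores(example)
```

A win rate is easy to interpret but ignores the strength of the opponents each image faced; a paired-comparison model would be the usual alternative if that matters.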

8 pages, 528 KB

Open Access Data Descriptor

Whole-Genome Sequencing Dataset from Two High-Risk Breast Cancer Families Negative for BRCA1/2 and Other Known Susceptibility Genes

by Silvia González-Martínez, Alejandra Rezqallah Arón, José Manuel Pérez-García, José Palacios, Belén Pérez-Mies, Javier Román, Laia Garrigos, Judith Balmaña, Daniela Camacho, Sandra Íñiguez-Muñoz, Diego M. Marzese and Javier Cortés

Data 2026, 11(5), 99; https://doi.org/10.3390/data11050099 - 30 Apr 2026

Abstract

Hereditary breast cancer (BC) remains unexplained in a substantial proportion of families who test negative for BRCA1/2 and other known susceptibility genes. To contribute to the genomic characterization of these unresolved cases, we generated a whole-genome sequencing (WGS) dataset from six women belonging to two unrelated high-risk families, each comprising three sisters diagnosed with BC. All participants had previously received negative results in conventional multigene panel testing. WGS was performed on peripheral blood DNA using the Illumina NovaSeq platform, followed by variant calling against GRCh38 and the comprehensive annotation of single-nucleotide variants, indels, and structural variants. For each family, we identified shared ClinVar-annotated variants, rare exonic or splice-site alterations, and intronic variants located within a curated set of 286 cancer-related genes. The dataset includes per-patient VCF files, copy number variation annotations, and family-level variant summaries. Raw and processed data are publicly available through the Sequence Read Archive and Zenodo. This resource supports variant reinterpretation, exploration of regulatory and intronic regions, and methodological benchmarking in the study of familial BC beyond established susceptibility genes.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

20 pages, 1275 KB

Open Access Article

Machine Learning Models for Predicting Professional Disqualification in Peruvian Association Members

by Manuel Pretel Pretel, Yeny Chávez Llempén, Abel Angel Sullon Macalupu, Paulo Canas Rodrigues, Javier Linkolk López-Gonzales and Esteban Tocto-Cano

Data 2026, 11(5), 98; https://doi.org/10.3390/data11050098 - 30 Apr 2026

Abstract

The disqualification of licensed professionals for non-payment of their monthly fees constitutes a significant operational risk to the financial sustainability of professional associations. This problem highlights the need for predictive tools that can anticipate the risk of disqualification and protect institutional stability. The main objective of this study was to develop a supervised machine learning model for estimating the risk of disqualification among registered professionals based on historical and contextual variables. An empirical, applied, and quantitative study was conducted by analyzing more than 5.7 million financial records corresponding to 27,964 registered professionals. Multiple supervised classification algorithms, including ensemble models such as CatBoost and XGBoost, were evaluated using stratified cross-validation and class-balancing techniques to address the substantial imbalance in the data. The results indicated that CatBoost performed best (F1-score = 57.96%; AUC = 0.72), whereas XGBoost showed greater stability across cross-validation folds. In conclusion, the model developed supports the timely identification of members at high risk of disqualification, enabling the implementation of early warning systems and proactive institutional financial management strategies.

(This article belongs to the Topic Applications of Algorithms in Risk Assessment and Evaluation)

12 pages, 2488 KB

Open Access Article

Bibliometric Analysis of the Literature Regarding MRI-Linac: A Paradigm Shift in Radiation Oncology

by Andrea Emanuele Guerini, Paolo Rondi, Federico Mastroleo, Stefania Volpe, Stefano Riga, Stefania Nici, Marco Luzzara, Giulio Ferrazzi, Marco Krengli, Davide Farina, Luigi Spiazzi, Barbara Alicja Jereczek-Fossa, Marco Ravanelli and Michela Buglione di Monale e Bastia

Data 2026, 11(5), 97; https://doi.org/10.3390/data11050097 - 28 Apr 2026

Abstract

Background: By integrating an MRI scanner and a linear accelerator, MR-linac systems provide superior soft tissue imaging and enable adaptive radiotherapy adjusted to daily anatomical changes. The advent of this technology represents a revolution in radiation oncology and could improve treatment accuracy and clinical outcomes. We performed a comprehensive bibliometric analysis with the aim of displaying the available scientific literature and trends regarding MR-linac. Methods: The Scopus database was searched, considering documents published up to 6 April 2025. Keywords encompassed terms related to “MR-linac” or “MRI-linac” and possible combinations and acronyms. The BibTeX data file was imported into Biblioshiny (Bibliometrix package, v. 4.1.4), and the analysis was conducted in R (version 4.3.2) with the Bibliometrix package. Results: A total of 1624 articles on MR-linac were identified. The number of annual publications increased gradually from 21 in 2008 to a peak of 211 in 2022 and then remained substantially stable in subsequent years. Most of the papers were original articles (79.2%), and the majority were published by the 10 journals with the largest output. Remarkably, of 6385 identified authors, over 85% were from one of the 10 most represented countries (including European, North American and Asian nations). Consistently, the 10 institutions with the largest output were North American, Australian or European and provided over 60% of the articles. International co-authorship was found in only 23.6% of the articles. Keyword and co-occurrence analyses identified MR-guided radiotherapy, SBRT, dosimetry, and adaptive strategies as core themes, with emerging trends in radiomics, diffusion metrics, and deep learning. Conclusions: Bibliometric analysis identified trends and patterns of scientific publications regarding MR-linac, highlighting a growing interest in the topic. Nonetheless, it should be considered that the majority of the papers were published by a few journals and over 85% of authors were from 10 countries, demonstrating an evident disparity across nations. Multicentric international research protocols and common frameworks could foster the transition towards collaborative practice-changing studies.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

9 pages, 1748 KB

Open Access Data Descriptor

Draft Genome Sequence Data of Multidrug-Resistant Pseudomonas aeruginosa, Strain ASK-80

by Shilippreet Kour, Shilpa Sharma, Achhada Ujalkaur Avatsingh, Prem Prashant Chaudhary and Nasib Singh

Data 2026, 11(5), 96; https://doi.org/10.3390/data11050096 - 26 Apr 2026

Abstract

In this study, we report the draft genome sequence of Pseudomonas aeruginosa strain ASK-80, a multidrug-resistant bacterium isolated from municipal wastewater in Baddi, district Solan, Himachal Pradesh, India. The whole genome was sequenced through Illumina MiSeq sequencing (150 bp paired-end). The size of the assembled genome was 6,261,345 bp, and the genome annotation revealed 5834 genes, including 5778 CDSs, 5748 protein-coding genes, 56 RNA genes and 30 pseudogenes. Genomic characterization revealed the occurrence of multiple antibiotic resistance genes (bla_OXA-396, bla_OXA-486, bla_OXA-494, bla_PAO, bla_PDC-8, aph(3′)-IIb, catB7, fosA and others), virulence genes (algB, chpA, clpV1, exsA, flgA, pilB, pvcA, toxA, tse1, and waaA), insertion sequences, transposable elements and phage sequences. These genome data may serve as a valuable resource for comparative genomics of P. aeruginosa and for research on antibiotic resistance surveillance in wastewater.

(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 3rd Edition)

32 pages, 1307 KB

Open Access Article

The Influence of AI Competency and Soft Skills on Innovative University Competency: An Integrated SEM–Artificial Neural Network (SEM–ANN) Model

by Kittipol Wisaeng and Thongchai Kaewkiriya

Data 2026, 11(5), 95; https://doi.org/10.3390/data11050095 - 25 Apr 2026

Abstract

This study addresses the growing necessity to understand how artificial intelligence (AI) competency and soft skills jointly influence organizational innovation and performance in the era of digital transformation. Despite the rapid adoption of AI technologies across industries, organizations continue to face significant challenges in effectively integrating technical AI capabilities with essential human-centric soft skills such as communication, adaptability, and leadership. This gap often limits the realization of AI-driven value and sustainable competitive advantage. The primary challenge in this research area is the lack of comprehensive models that simultaneously examine AI competency and soft skills within a unified framework, particularly in emerging economies where digital maturity varies widely. Existing studies tend to focus either on technical competencies or behavioral factors in isolation, leading to fragmented insights. To address these challenges, this study proposes a novel integrated research model that examines the combined effects of AI competency and soft skills on innovation outcomes and organizational performance. The model is empirically validated using structural equation modeling (SEM), providing robust evidence of the interrelationships among key constructs. The findings reveal that both AI competency and soft skills significantly contribute to innovation capability, which in turn enhances organizational performance. The study offers important theoretical and practical implications by bridging the gap between technical and human dimensions of AI adoption, thereby providing a more holistic understanding of digital transformation success.

(This article belongs to the Special Issue Mining and Computational Intelligence for E-Learning and Education—4th Edition)

15 pages, 1359 KB

Open Access Data Descriptor

Dataset for Cyclic Nonlinear Numerical Modelling of Corroded Reinforced Concrete Columns and Frames

by Dariniel Barrera-Jiménez, Franco Carpio-Santamaría, Sergio Márquez-Domínguez, Irving Ramírez-González, José Barradas-Hernández, Rolando Salgado-Estrada, Alejandro Vargas-Colorado, José Piña-Flores, Gustavo Delgado-Reyes and Armando Aguilar-Menéndez

Data 2026, 11(5), 94; https://doi.org/10.3390/data11050094 - 25 Apr 2026

Abstract

Corrosion of reinforcing steel is a key cause of deterioration in reinforced concrete (RC) structures exposed to coastal environments with chloride presence. The loss of reinforcing steel cross-sectional area, cracking of the concrete cover, and reduction in confinement progressively decrease both the strength and the ductility of structural elements. This study provides a reproducible, open-access dataset compiling input parameters and numerical results for the cyclic behaviour of isolated RC columns and RC frames. It addresses their nonlinear cyclic response under moderate corrosion (η < 25%) as well as in the non-corroded (baseline) condition, generated through conventional nonlinear modelling. The methodology applies fibre-section modelling for columns and concentrated plastic hinges for beams. Corrosion effects are incorporated by reducing the steel area and ultimate strain, while also accounting for the decrease in compressive strength of the cracked concrete cover. The cyclic response is represented by a Pivot-type hysteretic model. The dataset provides model input information, such as material stress–strain relationships and backbone curves reflecting corrosion-induced deterioration. It also includes structural outputs, such as force–displacement relationships and envelopes of quasi-static hysteretic cycles for the analyzed columns and frames. Overall, the dataset facilitates the calibration and validation of numerical models for RC structures affected by corrosion. The contribution enhances the reliability of computational simulations and supports the development of predictive tools for structural performance under degradation scenarios.

15 pages, 2272 KB

Open Access Data Descriptor

Dataset on Visitor Experience and Digital Technologies at the Archaeological Site of Ancient Dodona

by Elissavet Kosta, Fotios Bosmos, Nikolaos Giannakeas and Alexandros Τ. Tzallas

Data 2026, 11(5), 93; https://doi.org/10.3390/data11050093 - 24 Apr 2026

Abstract

This paper presents a dataset collected through a visitor questionnaire survey conducted at the Archaeological Site of Ancient Dodona, Greece, a large-scale, spatially complex open-air archaeological site. The dataset documents visitors’ experiences, perceptions, and information needs, as well as their attitudes toward the use of digital technologies for heritage interpretation and engagement. The questionnaire was administered in printed form to adult visitors at the entrance and exit of the archaeological site. A total of 99 valid responses were collected. The dataset includes information on visitor demographics, visit characteristics, perceptions of existing interpretive material, spatial behavior within the site, and attitudes toward digital applications such as augmented reality, digital storytelling, and interactive tools. All data are fully anonymized and contain no personally identifiable or sensitive information. The dataset supports research in the fields of visitor studies, cultural heritage interpretation, digital heritage, and cultural tourism, and may be reused for comparative studies or for the design and evaluation of digital mediation applications in archaeological contexts. The dataset enables cross-tabulation analyses exploring associations between visitor characteristics and attitudes toward digital mediation, thereby supporting visitor segmentation and the evidence-based development of digital interpretation strategies in archaeological contexts.

20 pages, 10122 KB

Open Access Data Descriptor

A Decadal Dataset of Offshore Weather and Normalized Wind–Solar Power Yield for Long-Term Evolution and Capacity Siting Planning in the Beibu Gulf, China

by Ziniu Li, Xin Guo, Zhonghao Qian, Aihua Zhou, Lin Peng and Suyang Zhou

Data 2026, 11(5), 92; https://doi.org/10.3390/data11050092 - 24 Apr 2026

Abstract

For offshore renewable energy planning and intelligent power management, access to long-term, high-resolution, and physically consistent meteorological and power generation records is essential. Such data supports a wide range of tasks, including resource assessment, hybrid system capacity sizing, grid operation planning, and data-driven forecasting model development. This article presents the construction of a 10-year continuous hourly dataset for 16 deep-sea grid sites in the Beibu Gulf, China, spanning from January 2016 to December 2025. The raw meteorological variables, including 10 m wind speed, wind direction, solar irradiance, and 2 m air temperature, were retrieved from the NASA POWER satellite database and subsequently cleaned using a 24 h periodic substitution algorithm designed to preserve the physical integrity of daily weather cycles. The dataset is organized into two sub-datasets, the Historical Weather Dataset and the Normalized Power Yield Dataset, with the latter providing normalized wind and solar power outputs on a 1.0 per-unit (p.u.) basis derived from a wind turbine power curve model and a PV thermodynamic model. All 32 CSV files are freely accessible online with UTF-8 encoding. The utility of the dataset is illustrated through two representative application cases, offshore site selection with hybrid capacity sizing and physics-informed deep learning forecasting, demonstrating its suitability for both engineering analysis and machine learning model development.
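
The abstract names a "24 h periodic substitution algorithm" without describing it. One plausible reading, offered only as an assumption, is that a missing hourly value is replaced by the value from the same hour one day earlier, falling back to one day later. The function name `fill_periodic` and the use of `None` for missing values are hypothetical.

```python
def fill_periodic(values, period=24):
    """Fill None entries from the same position one period earlier, else later.

    Assumption: this is only one possible interpretation of "periodic
    substitution"; the dataset authors may have used a different rule.
    Entries with no valid neighbour one period away are left as None.
    """
    filled = list(values)
    for i, value in enumerate(filled):
        if value is not None:
            continue
        if i - period >= 0 and filled[i - period] is not None:
            filled[i] = filled[i - period]
        elif i + period < len(filled) and filled[i + period] is not None:
            filled[i] = filled[i + period]
    return filled

# Two hypothetical days of hourly readings; day two is one unit warmer.
series = [float(h) for h in range(24)] + [float(h) + 1.0 for h in range(24)]
series[3] = None    # gap on day one, filled from day two
series[29] = None   # gap on day two (hour 5), filled from day one
cleaned = fill_periodic(series)
```

Copying from the same hour of an adjacent day keeps the diurnal shape intact, which linear interpolation across a long gap would flatten.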

22 pages, 3857 KB

Open Access Data Descriptor

Methodology and Toolset for an Electric Vehicle Trajectory Dataset Creation: DEVRT

by Harbil Arregui, Iñaki Cejudo, Eider Irigoyen and Estíbaliz Loyo

Data 2026, 11(5), 91; https://doi.org/10.3390/data11050091 - 23 Apr 2026

Abstract

This paper presents the toolset, methodology and procedure followed to create a dataset of battery electric vehicle trajectories, called DEVRT (Dataset of Electric Vehicle Real Trips). Understanding the behaviour of electric vehicles and their battery consumption under real-life conditions and journeys is required in the shift towards the electrification of the transport of people and goods. This paper contributes real measurements taken on different types of routes and in different environmental contexts at the time of driving, in order to support the data analytics and modelling techniques that are essential for extracting actionable insights from electric vehicle battery consumption. The preparation, on-route and post-processing steps of the methodology are described. The resulting dataset consists of probe data collected over 4 days following heterogeneous routes driven by four different drivers using two electric vehicles (one more suitable for city usage and the other more suitable for longer trips). This probe data is complemented with associated road network characterisation information, traffic flow measurements and weather data extracted from auxiliary data sources. The paper presents a comprehensive description of the geographical characteristics of the trajectories, a qualitative and quantitative characterisation of the planned routes used to create these trajectories, and the criteria used to select them.

(This article belongs to the Section Spatial Data Science and Digital Earth)

20 pages, 1445 KB

Open Access Article

Agricultural Soil pH in Fiji

by Diogenes L. Antille, Xueyu Zhao, Jack C. J. Vernon, Timothy P. Stewart, Maria Narayan, James R. F. Barringer, Thomas Caspari, Peter Zund and Ben C. T. Macdonald

Data 2026, 11(4), 90; https://doi.org/10.3390/data11040090 - 20 Apr 2026

Abstract

Agriculture in the Pacific is driven primarily by small-scale private farmers, many of whom do not have access to soil testing services or advice, nor the means to translate analytical results into soil management and agronomic recommendations. Soil degradation through the process of acidification poses a significant risk to food and income security as it directly threatens crop productivity. The nutritional quality of food crops may also be affected through sub-optimal nutrient uptake by plants and nutrient imbalances. The dataset reported here provides a useful platform for the development of a decision-support tool (DST) that will assist Fiji farmers in understanding and managing soil pH and soil acidity. The DST will enable informed decisions about liming to help correct soil pH. To support this development, historical soil pH data available from the Pacific Soils Portal were combined with updated analyses of agricultural soils from 17 locations on Viti Levu Island (Fiji) collected during a field campaign undertaken in August 2025. The soils were sampled at two depth intervals (0–15 and 15–30 cm) and analyzed for pH using a variety of methods. These methods included direct field measurements using a portable pH-meter as well as traditional laboratory determinations. Most of the sampled soils exhibited pH levels below 7 at both depth intervals. Across all samples taken in 2025, 54.3% had soil pH < 5, 38.6% had soil pH between 5 and 6, and 7.1% had pH > 6 (based on the soil pH_1:5 soil-to-water method). Depending upon specific land uses, climate and cropping intensity, it was recommended that routine liming be built into soil fertility management programs to help farmers overcome soil acidity-related constraints to production. Liming frequency, timing of application and application rate will need to be determined for specific soil and cropping situations; however, it was suggested that soil pH not be changed by more than 1 unit each time lime is applied. Such an approach should reduce the risk of soil organic matter loss through accelerated mineralization, which would be challenging to restore in that environment if soils remained under continuous cropping. The analytical information contained in this article expanded and updated the datasets available in the Pacific Soils Portal. Furthermore, this work provided an opportunity to build analytical expertise in aspects of soil chemistry at local organizations to support academic and extension activities as well as the ongoing development of the Pacific Soils Portal.

(This article belongs to the Section Spatial Data Science and Digital Earth)

23 pages, 1495 KB

Open Access Article

Quantitative Evaluation of the Data Governance Policies of “Double First-Class” Universities in China—Based on the PMC Index Model

by Jianfang Gao, Chunlin Li and Tifeng Jiao

Data 2026, 11(4), 89; https://doi.org/10.3390/data11040089 - 20 Apr 2026

Abstract

University data governance is an essential requirement for the informatization of universities and holds significant importance in advancing the modernization of university governance systems and governance capabilities. This study focuses on the data governance policies released by “Double First-Class” universities in China since 2015. Based on policy text mining and the PMC index model, the paper developed an evaluation system for university data governance policies consisting of 9 primary indicators and 43 secondary indicators and conducted a quantitative assessment. The results indicate that the policies are of good quality overall, with 25% rated as excellent, 66.1% as good, and 8.9% as moderate. Many universities have made significant progress in formulating data governance policies. However, there is still considerable room for improvement. For example, while the policy objectives are clearly defined, certain aspects require further refinement; stakeholder involvement is relatively narrow and lacks diversity; and the mix of policy instruments is imbalanced. To address these issues, it is recommended that policies be optimized by balancing regulatory priorities, establishing a multi-stakeholder collaborative governance framework, and rationalizing the mix of policy instruments.

(This article belongs to the Section Information Systems and Data Management)

16 pages, 470 KB

Open Access Data Descriptor

PromptTone: A Dataset for Evaluating Large Language Model Code Generation Under Varying Prompt Politeness Levels

by Manuel Andruccioli, Giovanni Delnevo, Silvia Mirri and Paola Salomoni

Data 2026, 11(4), 88; https://doi.org/10.3390/data11040088 - 19 Apr 2026

Abstract

The increasing adoption of Large Language Models (LLMs) in software development has enabled automatic code generation from natural language, yet the influence of communicative factors such as prompt tone remains underexplored. This work introduces PromptTone, a controlled dataset designed to investigate how variations in prompt politeness affect LLM-based code generation in web development. The dataset is constructed through a structured experimental design combining three variables: programming paradigm (Vue.js Composition API vs. Options API), LLM provider (GPT, Claude, Gemini), and prompt tone (impolite, neutral, polite), resulting in 396 generated components across 22 implementations. Data were collected in an educational setting under a single-prompt constraint to capture first-shot model behavior, and are provided in both hierarchical and CSV formats, including prompts, generated code, and error annotations. Preliminary analysis reveals that prompt tone influences output characteristics such as verbosity, with model-specific patterns: for instance, some models exhibit increased output length with more polite prompts, while others remain stable. Differences also emerge across programming paradigms, suggesting an interaction between tone and code structure. These findings highlight that LLMs are sensitive not only to semantic content but also to pragmatic aspects of input. Overall, the dataset provides a novel benchmark for studying human–LLM interaction in code generation, supporting future research on prompt engineering, model evaluation, and socially-aware Artificial Intelligence (AI)-assisted development tools.

(This article belongs to the Section Information Systems and Data Management)

21 pages, 1060 KB

Open Access Article

Data-Driven Probabilistic MACCs for Smart Cities: Monte Carlo Simulation and Bayesian Inference of Rebound Effects

by Arnoldo Eluzaim Rodriguez-Sanchez, Edgar Tello-Leal, Bárbara A. Macías-Hernández and Jaciel David Hernandez-Resendiz

Data 2026, 11(4), 87; https://doi.org/10.3390/data11040087 - 17 Apr 2026

Abstract

The shift toward Smart Cities heavily relies on adopting energy-efficiency strategies to meet ambitious decarbonization targets. However, the rebound effect, where improvements in technical efficiency are partly offset by increased energy consumption, often reduces the expected environmental and economic benefits. Traditional Marginal Abatement Cost Curves (MACCs) often ignore this behavioral feedback, which can lead to an overestimation of mitigation potential. This paper introduces a data-driven probabilistic framework for assessing the influence of the rebound effect on a portfolio of urban mitigation strategies by integrating behavioral feedback into a bottom-up MACC. By combining Monte Carlo (MC) simulations to address parametric uncertainty with Bayesian Networks (BN) for conditional inference, the robustness of nine strategies is examined across residential, commercial, and transportation sectors. The results demonstrate that even a moderate rebound effect (η = 0.5) causes a 10.09% decrease in total net abatement, dropping from 24.86 to 22.35 tCO₂e, and significantly raises costs. Notably, the number of strictly cost-effective strategies (MAC < 0) decreases from six to three, highlighting the fragility of certain “win–win” measures. This framework introduces the concepts of Financial Backfire Probability (FBP) and Environmental Backfire Probability (EBP) as new metrics for urban planning. These findings emphasize that rebound tolerance is a critical factor in climate policy, indicating that additional measures, such as Internet of Things (IoT)-based monitoring and demand-side management, may be necessary to prevent performance erosion amid behavioral uncertainty.

(This article belongs to the Special Issue IoT and Big Data Applications in Smart Cities: Recent Advances, Challenges, and Critical Issues)


17 pages, 592 KB

Open Access Article

Modelling Extreme Losses in JSE Life Insurance Price Index Growth Rates Using the Generalised Extreme Value Distribution (GEVD) and the Generalised Pareto Distribution (GPD)

by Delson Chikobvu, Tendai Makoni and Frans Frederik Koning

Data 2026, 11(4), 86; https://doi.org/10.3390/data11040086 - 16 Apr 2026

Abstract


The life insurance sector plays a critical role in financial system stability but is inherently exposed to extreme market fluctuations due to long-term liabilities and asset–liability mismatches. This study investigates extreme losses in the growth rates of the JSE Life Insurance Price Index (LIPI) using the Generalised Extreme Value Distribution (GEVD) and the Generalised Pareto Distribution (GPD) under the Extreme Value Theory (EVT) framework. Monthly data from January 2000 to October 2023 were transformed into a loss series, and extreme events were captured using quarterly block maxima and a POT threshold at the 95th percentile. Model parameters were estimated through Maximum Likelihood Estimation, and downside risk was assessed using return levels, Value-at-Risk (VaR), and Tail Value-at-Risk (tVaR). The GEVD model produced a negative shape parameter, consistent with a bounded Weibull-type tail, while the GPD indicated a heavy-tailed distribution. Return level estimates show escalating loss magnitudes and widening uncertainty over longer horizons, reflecting the challenges of projecting rare events. Kupiec backtesting confirms the adequacy and reliability of the GEVD-based VaR across all confidence levels, whereas the GPD underestimates risk at lower thresholds. These findings indicate significant tail risk within the South African life insurance equity segment and underscore the importance of EVT-based risk measures for capital planning and regulatory oversight. The study contributes to financial risk modelling in the life insurance sector and offers practical insights for strengthening solvency assessment and enterprise risk management frameworks. Full article
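The two sampling schemes named in the abstract, quarterly block maxima and peaks over a 95th-percentile threshold, can be sketched as follows. This is a minimal illustration on synthetic data, not the study's code: the loss series is assumed to be the negated growth rate, the percentile uses linear interpolation, and all function names are hypothetical. Distribution fitting by maximum likelihood is left out.

```python
import random

def block_maxima(losses, block_size=3):
    """Largest loss in each complete block of `block_size` consecutive months."""
    n_blocks = len(losses) // block_size
    return [max(losses[i * block_size:(i + 1) * block_size]) for i in range(n_blocks)]

def percentile(values, q):
    """Percentile with linear interpolation between order statistics, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def peaks_over_threshold(losses, q=95.0):
    """Return the threshold u and the excesses (x - u) of all losses above u."""
    u = percentile(losses, q)
    return u, [x - u for x in losses if x > u]

random.seed(0)
# Synthetic monthly growth rates in percent; NOT the JSE index data.
growth = [random.gauss(0.5, 5.0) for _ in range(285)]
# Assumed transformation: a loss is the negative of a growth rate.
losses = [-g for g in growth]

maxima = block_maxima(losses)            # input for a GEVD fit
u, excesses = peaks_over_threshold(losses)  # input for a GPD fit
print(len(maxima), len(excesses))
```

The block maxima would feed a GEVD fit and the excesses a GPD fit. When comparing fitted shape parameters across software, note that sign conventions differ between libraries.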


20 pages, 7589 KB

Open Access Article

AEConvs: A Novel Dataset and Benchmark for Evaluating Empathetic Response Generation in Arabic LLMs

by Afnan Alkhathlan and Abdulrahman A. Mirza

Data 2026, 11(4), 85; https://doi.org/10.3390/data11040085 - 14 Apr 2026

Abstract


Empathy—the ability to understand and respond to others’ emotions and perspectives—is a key communication skill for humans; however, it is under-explored within current conversational systems. While large language models (LLMs) have demonstrated a remarkable capability to generate coherent and contextually relevant output, they often struggle to exhibit genuine empathy, resulting in artificial and dull responses, particularly in low-resource languages such as Arabic. Notably, the research on empathetic conversational systems in Arabic is still in its early stages, mainly due to the scarcity of open-domain conversational data. To address this gap, we introduce Arabic Empathetic Conversations (AEConvs), a genuine Arabic conversational dataset featuring more than 4K open-domain dyadic empathetic conversations. This dataset provides a valuable resource that captures nuanced emotional and empathetic cues in the Arabic language. Using AEConvs, we evaluate and compare the empathetic capabilities of two state-of-the-art generative Arabic LLMs—AceGPT-chat and Jais-chat—under zero-shot and fine-tuning training settings. Human evaluation results demonstrate that while both models exhibit some form of empathy in zero-shot settings, fine-tuning on AEConvs improved their ability to generate more fine-grained empathetic responses while also yielding enhancements in fluency and context adherence. Additionally, automatic evaluation indicated improved language modeling and better lexical and semantic similarity with human reference responses. This study highlights the importance of culturally and linguistically tailored datasets in advancing empathetic conversational AI. We publicly release the AEConvs dataset, providing a valuable resource for future advancements in the field. Full article

(This article belongs to the Section Information Systems and Data Management)


17 pages, 735 KB

Open Access Data Descriptor

Daily and Accumulated Training-to-Match Load Ratios in Professional Soccer: The Influence of Starting Status and Playing Position Across a Full Competitive Season

by Alejandro Sierra-Casas, Daniel Castillo, Filipe Manuel Clemente and Alejandro Rodríguez-Fernández

Data 2026, 11(4), 84; https://doi.org/10.3390/data11040084 - 14 Apr 2026

Abstract


Introduction: Monitoring training load is essential in elite soccer to optimize performance and reduce injury risk. The training-to-match load ratio (TMr) has emerged as a useful metric to contextualize training demands relative to competitive match exposure. The objective of this study was to compare daily and accumulated TMr between starters and non-starters over a professional season, considering microcycle day and playing position. Methods: Twenty players (Tier 3) from a professional team were monitored during a full competitive season (30 microcycles; 144 training sessions; 30 matches). External load variables, namely total distance (TD), high-speed distance (HSD), sprint distance (SPD), high metabolic load distance (HMLD), acceleration (ACC) and deceleration (DCC), were collected using 10 Hz GPS devices (STATSports). Daily and microcycle TMr were calculated relative to each player’s maximal match value registered during a full competitive period. Linear mixed-effects models examined the effects of starting status, microcycle day, and playing position. Results: Linear mixed models revealed significant three-way interactions (status × day × position) for locomotor variables: TD (F = 3.36, p < 0.001), HSD (F = 2.49, p < 0.001), and SPD (F = 3.37, p < 0.001). Starters accumulated higher loads on match day, whereas non-starters showed higher TMr on MD + 1 and MD + 2. Position-specific differences emerged during acquisition sessions (i.e., MD − 5 to MD − 3), particularly for wide midfielders (WMs) and central defenders (CDs). No significant three-way interactions were observed for ACC, DCC, or HMLD absolute loads (p > 0.05), nor for any accumulated microcycle TMr metrics (p > 0.05). Conclusions: TMr effectively differentiates preparation strategies between starters and non-starters. 
Although “top-up conditioning” sessions increase early-week relative loads for non-starters, position-specific variations, particularly in mechanical variables during acquisition sessions, highlight the need for individualized load prescription. Full article

(This article belongs to the Special Issue Big Data and Data-Driven Research in Sports)
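As described in the abstract, TMr expresses a training load relative to the player's maximal match value. A minimal sketch of that ratio is given below. The function names and the example loads are hypothetical, and treating the accumulated microcycle TMr as the summed session load divided by the same match maximum is an assumption, not a definition confirmed by the abstract.

```python
def tmr(load, max_match_load):
    """Training-to-match ratio: a load divided by the player's maximal match value."""
    if max_match_load <= 0:
        raise ValueError("max_match_load must be positive")
    return load / max_match_load

def accumulated_tmr(session_loads, max_match_load):
    """Assumed accumulated form: summed session loads over the same match maximum."""
    return tmr(sum(session_loads), max_match_load)

# Hypothetical total-distance values in metres for one player.
match_distances = [10200.0, 10850.0, 9900.0]
max_match = max(match_distances)

daily = tmr(5425.0, max_match)
weekly = accumulated_tmr([5425.0, 2712.5, 2712.5], max_match)
print(daily, weekly)  # 0.5 1.0
```

A daily value of 0.5 means the session covered half of the player's maximal match distance; the same calculation would apply to each of the other external load variables.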


13 pages, 2447 KB

Open Access Data Descriptor

Electric Vehicle Routing with Time Windows and Heterogeneous Charging-Station Attribute Dataset

by Ayoub Hanif, Meryem Abid, Mohamed Tabaa, Hassna Bensag and Mohamed Youssfi

Data 2026, 11(4), 83; https://doi.org/10.3390/data11040083 - 12 Apr 2026

Abstract


This paper describes the benchmark dataset for the electric vehicle routing problem with time windows. It is designed to facilitate the large-scale and reproducible evaluation of routing approaches under diverse charging scenarios. It is an extension of the Homberger 1000-customer vehicle-routing benchmark dataset through the incorporation of computationally derived charging-station data. For the 60 base instances included in the dataset, charging-station locations are randomly generated within the customer-coordinate bounds, and two variants are provided, resulting in 120 benchmark problems used in the validation and baseline analyses. A normalized local customer-density score is derived for each station. It is used to determine charging rates and log-normal parameters for prices and waiting times. Two variants are included in the dataset. Variant A maintains the original customer time-window constraints, while Variant B relaxes customer due dates based on the distance from the depot, subject to the depot closing time. The dataset is complemented by instance files, station attributes, parameters, and scripts. It also includes the results of feasibility tests, baseline solver tests, difficulty analyses, and sensitivity tests. These results show that the benchmark includes both easier and harder instance classes under different charging settings. Overall, the dataset is intended to support its use as a reproducible benchmark. Full article
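The station-generation step can be illustrated with a small sketch: stations are drawn uniformly inside the customer bounding box, and each receives a normalized local customer-density score. The radius-based count, the min-max normalization, the parameter values, and all names are assumptions made for illustration. The abstract does not say how the score maps to charging rates or log-normal parameters, so that step is omitted.

```python
import math
import random

def generate_stations(customers, n_stations, rng):
    """Draw station coordinates uniformly within the customer bounding box."""
    xs = [x for x, _ in customers]
    ys = [y for _, y in customers]
    return [
        (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        for _ in range(n_stations)
    ]

def density_scores(stations, customers, radius):
    """Min-max normalized count of customers within `radius` of each station."""
    counts = [
        sum(1 for c in customers if math.dist(s, c) <= radius)
        for s in stations
    ]
    lo, hi = min(counts), max(counts)
    if hi == lo:
        # All stations see the same local density; no spread to normalize.
        return [0.0 for _ in counts]
    return [(n - lo) / (hi - lo) for n in counts]

rng = random.Random(42)
# Synthetic customers; NOT the Homberger instances.
customers = [(rng.uniform(0, 500), rng.uniform(0, 500)) for _ in range(1000)]
stations = generate_stations(customers, 20, rng)
scores = density_scores(stations, customers, radius=50.0)
```

Passing an explicit `random.Random` instance keeps station placement reproducible for a given seed, which matches the benchmark's emphasis on reproducibility.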


8 pages, 586 KB

Open Access Data Descriptor

Urinary Metabolite Panel Dataset for Bulgarian Children with Autism Spectrum Disorder (ASD)

by Victor Slavov, Lubomir Traikov, Stanislava Ciurinskiene, Maria Savcheva, Till Heine, Radka Tafradjiiska-Hadjiolova, Alexandra Zlatarova, Ivan Tourtourikov, Dilyana Madzharova, Anita Kavrakova and Tanya Kadiyska

Data 2026, 11(4), 82; https://doi.org/10.3390/data11040082 - 10 Apr 2026

Abstract


This Data Descriptor presents an anonymized, shuffled dataset of creatinine-normalized urinary metabolite measurements from 73 Bulgarian children with autism spectrum disorder (ASD), released to support reuse in secondary analyses and cross-cohort comparisons. The public release represents a pathway-oriented 24-marker subset from a broader urinary diagnostic panel, assembled as a self-contained resource for investigators working in these metabolic domains. Spot urine results are provided as individual-level values after creatinine normalization; for trimethylamine, values below the limit of quantification (LOQ) were replaced with LOQ/2. The deposit contains measurements for 24 urinary markers grouped into three functional classes (neurotransmitters and aromatic amino acid precursors; one-carbon/methylation and vitamin-related metabolites; and energy metabolism/organic acids with microbiome-related amines). The underlying cohort comprised children aged 3–13 years, and no contemporaneous neurotypical control group was enrolled. Second-morning, midstream, acid-stabilized spot urine samples were collected within the provider’s workflow; metabolites were measured by LC–MS/MS, and spot urinary creatinine was measured enzymatically for normalization. The release includes the results table in both XLSX and CSV formats, a reference limits and units file for contextual interpretation, a data dictionary, a README, a changelog, and SHA-256 checksums for integrity verification. The public files contain de-identified analytical variables only and omit individual-level demographics, dates, standalone urinary creatinine, and richer clinical metadata to preserve anonymity. Full article

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)
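Two steps mentioned in the abstract lend themselves to a short sketch: the LOQ/2 substitution applied to trimethylamine and the SHA-256 integrity check a reuser might run on the downloaded files. The function names, the example values, and the LOQ of 0.5 are hypothetical; the actual LOQ and file names should be taken from the dataset's reference limits file and README.

```python
import hashlib

def substitute_below_loq(values, loq):
    """Replace every value below the limit of quantification with LOQ/2."""
    return [loq / 2 if v < loq else v for v in values]

def sha256_of_file(path, chunk_size=65536):
    """Hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical measurements with a hypothetical LOQ of 0.5.
print(substitute_below_loq([0.8, 0.1, 2.4], loq=0.5))  # [0.8, 0.25, 2.4]
```

To verify a download, compare the output of `sha256_of_file` for each file with the corresponding entry in the published checksum list; any mismatch indicates a corrupted or altered file.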

