Data

15 pages, 905 KB

Open Access Data Descriptor

Dataset on Continuous Sewer Hydraulic and Pollutant Concentration Observations from 2008 to 2011 Including Precipitation Data, Laboratory Analysis and a Hydrodynamic Model

by Markus Pichler, Thomas Hofer, Valentin Gamerith and Günter Gruber

Data 2026, 11(3), 45; https://doi.org/10.3390/data11030045 - 26 Feb 2026

Abstract

This dataset compiles continuous hydraulic and water quality observations from the combined sewer overflow structure at the outlet of the Graz-West R05 catchment in Austria, covering the period from 2008 to 2011. It integrates high-resolution in-sewer measurements of flow rate, water level, flow velocity and water quality parameters (COD, TSS, temperature), complemented by laboratory analyses of discrete grab samples. Water quality parameters were monitored using an in situ UV/VIS spectrometer installed on a floating pontoon. Additional locally calibrated COD values derived from laboratory measurements are included. The in-sewer data were acquired at 1 or 3 min intervals depending on flow conditions. Flow rates, water levels and overflow discharges were monitored using radar and ultrasonic sensors. Three nearby tipping-bucket rain gauges provided time-stamped precipitation increments, enabling the detailed reconstruction of wet-weather dynamics. A hydrodynamic SWMM model of the catchment, including geospatial information and dry-weather calibration, is included to support modelling applications. This combination of long-term measurements and a calibrated hydrodynamic model supports the development, testing and validation of process-based, statistical or data-driven approaches for simulating combined sewer system behaviour and pollutant dynamics.

9 pages, 323 KB

Open Access Data Descriptor

Dataset of Students for Learning Analytics with Gamification

by Fotios Bosmos, Elissavet Kosta, Konstantinos Sakkas, Niki Eleni Ntagka and Nikolaos Giannakeas

Data 2026, 11(3), 44; https://doi.org/10.3390/data11030044 - 25 Feb 2026

Abstract

Digital technologies for storing, processing, and extracting knowledge from data have significantly influenced educational institutions, leading to the adoption, evaluation, and adaptation of new learning models. This data descriptor presents a dataset collected from junior high school students during Computer Science lessons focused on creating geometric constructions using the Scratch visual programming environment. The dataset includes 56 recorded student files consisting mainly of student feedback collected after using gamification through digital quizzes for evaluation and self-assessment, addressing psychological aspects, motivation, participation, and collaboration. The dataset presents a balanced distribution in terms of respondent characteristics and will be of interest to researchers involved in the application of gamification elements in the learning process, researchers studying comprehensive education programs, and teachers interested in innovative teaching practices in their subjects.

(This article belongs to the Special Issue Mining and Computational Intelligence for E-Learning and Education—4th Edition)

14 pages, 307 KB

Open Access Data Descriptor

Dataset on Suicide Risk, Substance Abuse, and Family Functioning Among University Students in Cali, Colombia

by Naydu Acosta-Ramírez, Jorge Mario Angulo-Mosquera and Alejandro Botero-Carvajal

Data 2026, 11(2), 43; https://doi.org/10.3390/data11020043 - 23 Feb 2026

Abstract

Globally, one in eight people experience a mental disorder, which constitutes a leading cause of years lived with disability and disproportionately affects young people. Gaps in scientific knowledge have been identified, with few studies of university students. This article presents an open-access database on mental health and family functioning, collected through a survey of undergraduate students in health sciences programs at a private university in Cali (Colombia). The purpose was to explore suicide risk, substance abuse and family functioning using three structured questionnaires (Family APGAR, DAST-10, and PANSI), together with sociodemographic variables, organized in four sections (family and peer support, substance use, suicidal ideation, and background). The results of the article correspond to the database description; the final database comprises 574 records obtained from students of health sciences programs (medicine, dentistry, psychology, prehospital care, nursing, dental mechanics). The data are provided as raw, analyzable files (spreadsheet formats) free of charge from Mendeley Data. In conclusion, the scientific impact of these data lies in their potential to be reused by researchers and higher-education decision-makers for secondary analyses that guide the development of mental and family health interventions for groups linked to undergraduate programs in the health sector.

(This article belongs to the Topic Communications Challenges in Health and Well-Being, 2nd Edition)

18 pages, 4591 KB

Open Access Data Descriptor

Individual-Level Behavioral Dataset Linking Trace Eyeblink Conditioning, Contextual Fear Memory, and Home-Cage Activities in rTg4510 and Wild-Type Mice with Doxycycline Treatment

by Ryo Kachi, Takuma Nishijo and Yasushi Kishimoto

Data 2026, 11(2), 42; https://doi.org/10.3390/data11020042 - 16 Feb 2026

Abstract

This dataset provides synchronized multimodal behavioral measurements from 36 mice across four experimental groups: wild-type and rTg4510 tauopathy mice, each tested with or without doxycycline-mediated suppression of mutant tau expression. Of these, 34 mice had complete measurements across all three behavioral paradigms and were used for analyses requiring full cross-task linkage. At six months of age, all animals underwent three standardized behavioral paradigms: home cage monitoring, ten-day trace eyeblink conditioning, and contextual fear conditioning. The individual-level data included locomotor activity, rearing duration, conditioned response metrics, eyelid closure latencies, and contextual freezing percentages. All measurements were linked using unique mouse identifiers, enabling cross-task analysis without preprocessing or imputation. The dataset was accompanied by a complete data dictionary, processing workflow diagram, and validation analyses demonstrating cross-paradigm correlations. The cross-task associations are illustrated in the main figures, with additional early phase acquisition and temporal processing correlations provided in the main figures. Provided in an open CSV format with detailed metadata, this resource supports behavioral phenotyping, machine learning applications, and the investigation of learning mechanisms in tauopathy models.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

12 pages, 561 KB

Open Access Data Descriptor

Perceptions of Security, Victimization, and Coexistence: A Database from Cali, Colombia

by Jhon James Mora, Enrique Javier Burbano-Valencia, Angie Mondragón-Mayo and José Santiago Arroyo Mina

Data 2026, 11(2), 41; https://doi.org/10.3390/data11020041 - 14 Feb 2026

Abstract

This article addresses a key evidence gap in urban safety policy in Colombia: the absence of publicly accessible microdata that jointly measure victimization, perception of security, and probability of sanctions among socioeconomically vulnerable residents. It aims to provide a clean, linkable dataset that enables analysis of variations in these issues across demographic and territorial groups in Cali (recently classified as the 29th most dangerous city worldwide, with 1028 and 1065 homicides in 2024 and 2025, respectively). It reports face-to-face survey data collected from 22 July to 16 August 2024, at Sistema de Identificación de Potenciales Beneficiarios de Programas Sociales (SISBEN) service points. The final dataset includes 2139 adults (aged 18–95 years) and combines (i) primary responses on perceived safety (e.g., public space safety and surveillance cameras), perceived likelihood of sanction, victimization, and self-protection measures with (ii) selected sociodemographic and household characteristics drawn from SISBEN IV records. Individual-level linkage was implemented using respondent identification at interviews, yielding an integrated anonymized file suitable for replication and secondary analysis. The dataset enables distributive analyses of insecurity (e.g., by sex, age, and ethnicity, including Afro-descendant populations) within a policy-relevant target group and supports evaluation and targeting of local interventions by providing individual-level indicators.

13 pages, 2157 KB

Open Access Data Descriptor

Georeferenced Snow Depth and Snow Water Equivalent Dataset (2025) from East Kazakhstan Region

by Dmitry Chernykh, Roman Biryukov, Lilia Lubenets, Andrey Bondarovich, Nurassyl Zhomartkan, Almasbek Maulit, Dauren Nurekenov, Kamilla Rakhymbek, Yerzhan Baiburin and Aliya Nugumanova

Data 2026, 11(2), 40; https://doi.org/10.3390/data11020040 - 13 Feb 2026

Abstract

In this work, we present the Snow Depth and Snow Water Equivalent Dataset for specific areas located in the East Kazakhstan Region that can be exploited to monitor and understand water resource dynamics in mountain regions. The present dataset represents a georeferenced collection of snow depth, snow density, and derived snow water equivalent (SWE) measurements obtained through manual snow surveys. Snow survey observations were conducted during field campaigns in the East Kazakhstan Region during the period of maximum snow accumulation from 27 February to 6 March 2025. Snow survey sites were selected to maximize coverage of diverse landscape settings and snow accumulation conditions. In total, 111 snow survey sites were established across the East Kazakhstan Region, and 2331 snow depth measurements and 555 snow density measurements were collected. In post-field (laboratory) processing, snow water equivalent (SWE) was calculated for all snow survey sites based on measured snow depth and snow density values.
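
The abstract does not state the exact formula or units used to derive SWE. As a minimal sketch, the standard relation SWE = depth × (snow density / water density) can be written as follows. The function names, the units (cm for depth, kg/m³ for density) and the use of per-site means are assumptions made for illustration, not details taken from the dataset.

```python
from statistics import mean

WATER_DENSITY_KG_M3 = 1000.0  # reference density of liquid water


def snow_water_equivalent_mm(depth_cm: float, density_kg_m3: float) -> float:
    """Return SWE in mm of water for one depth/density pair.

    Units are assumed for this sketch: depth in cm, density in kg/m^3.
    """
    if depth_cm < 0 or density_kg_m3 < 0:
        raise ValueError("depth and density must be non-negative")
    depth_m = depth_cm / 100.0
    swe_m = depth_m * density_kg_m3 / WATER_DENSITY_KG_M3
    return swe_m * 1000.0  # metres of water -> millimetres


def site_swe_mm(depths_cm: list[float], densities_kg_m3: list[float]) -> float:
    """Hypothetical per-site aggregation: SWE from mean depth and mean density."""
    return snow_water_equivalent_mm(mean(depths_cm), mean(densities_kg_m3))


if __name__ == "__main__":
    # Illustrative values only, not measurements from the dataset.
    print(snow_water_equivalent_mm(50.0, 250.0))
```

Averaging depth and density separately before multiplying is only one possible aggregation; the dataset's own processing may differ.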

10 pages, 1011 KB

Open Access Article

The Role of Shot Velocity in Advanced Post-Shot Metrics: Evidence from the UEFA European Football Championships

by Blanca De-la-Cruz-Torres, Anselmo Ruiz-de-Alarcón-Quintero and Miguel Navarro-Castro

Data 2026, 11(2), 39; https://doi.org/10.3390/data11020039 - 13 Feb 2026

Abstract

Introduction: Ball velocity is a critical determinant of shot effectiveness in football, yet its influence on advanced post-shot metrics, such as expected shot impact timing (xSIT) and expected goals on target (xGOT), remains poorly understood, particularly in the context of sex-specific differences. This study examined the relationship between ball velocity and these metrics in men’s and women’s elite European tournaments. Methods: A total of 2174 shots were analyzed from all matches of the 2024 UEFA Men’s EURO (n = 1305) and 2025 UEFA Women’s EURO (n = 869), classified as goal shots on target, non-goal shots on target, and shots off target. Ball velocity was measured for each shot, and its associations with xSIT, our own xGOT model and the StatsBomb xGOT model were quantified using correlation coefficients. Results: Ball velocity differed significantly between sexes (p < 0.001), with higher values in men, and goal shots on target exhibited lower velocities than non-goal or off-target shots, indicating a speed–accuracy trade-off. Only xSIT and our own xGOT model were sensitive to ball velocity, reflecting sex-specific differences (p < 0.001). When comparing shot types across advanced metrics, a consistent trend was observed in both tournaments: xSIT showed no significant differences between goal and non-goal shots, whereas both xGOT models were higher for goal shots on target. Correlations indicated a moderate positive relationship between xSIT and ball velocity, and moderate negative correlations for both xGOT models, slightly stronger in men. Conclusions: Ball velocity is a critical factor influencing shot performance and advanced post-shot metrics, with notable sex-specific differences.

(This article belongs to the Special Issue Big Data and Data-Driven Research in Sports)

18 pages, 6606 KB

Open Access Data Descriptor

Annotated IoT Dataset of Waste Collection Events

by Peter Tarábek, Andrej Michalek, Roman Hriník, Ľubomír Králik and Karol Decsi

Data 2026, 11(2), 38; https://doi.org/10.3390/data11020038 - 11 Feb 2026

Abstract

This work presents a curated dataset of multimodal sensor measurements from Internet of Things (IoT) units mounted on waste collection vehicles. Each unit records multiple data streams including GPS position, vehicle velocity, radar-based container presence, accelerometer readings of the lifting arm, and RFID tag identifiers of the bins. The dataset provides two complementary forms of annotation: (1) algorithmically generated events that were manually cleaned through visual inspection of sensor signals, offering large-scale coverage across 5 vehicles over a total of 25 collection days, and (2) manually validated events derived from synchronized video recordings, representing ground truth for 3 vehicles over 8 collection days. In total, the dataset contains 12,391 annotated waste collection events. The dataset spans diverse operational conditions with varying container sizes and includes both RFID-equipped and non-RFID bins. It can be used to train and evaluate machine learning models for event detection, anomaly recognition, or explainability studies, and to support practical applications such as Pay-as-you-throw (PAYT) waste management schemes. By combining multimodal sensor signals with reliable annotations, the dataset represents a unique resource for advancing research in smart waste collection and the broader field of IoT-enabled urban services.

(This article belongs to the Section Information Systems and Data Management)

25 pages, 18392 KB

Open Access Data Descriptor

A Century of Migration (1830–1939): 735,000 Enriched Records from Bremen’s Ship Passenger Lists

by Tobias Perschl, Pauline Schmidt, Sebastian Gassner and Malte Rehbein

Data 2026, 11(2), 37; https://doi.org/10.3390/data11020037 - 10 Feb 2026

Abstract

This paper publishes 735,000 historical passenger entries from the German North Sea port of Bremen, created between 1830 and 1939, and now structured, enriched, and processed into a research-ready database. It provides an overview of the original archival documents and their datafication, beginning with a historical account of why the passenger lists were created and which information they recorded. Building on extensive prior work (largely carried out by a team of volunteer transcribers with expertise in family history and genealogy), the lists were transcribed manually and first made available online in 2003. To enhance their analytical value, we computationally post-processed these data through (1) data cleaning, especially addressing spelling variants and transcription errors; (2) data normalisation, including conversion into standardised formats; and (3) data augmentation by adding identifiers, geographic information, and multiple classifications. Finally, we discuss limitations of the resulting dataset as well as its analytical potential.

21 pages, 12506 KB

Open Access Data Descriptor

S-EDARA: An Atmospheric River Dataset Supplement to EDARA for Impact Assessment

by Ruping Mo

Data 2026, 11(2), 36; https://doi.org/10.3390/data11020036 - 10 Feb 2026

Abstract

Atmospheric rivers (ARs) play a critical role in producing high-impact weather events, including extreme precipitation, flooding, gusty winds, and rapid temperature changes. Building upon the recently published EDARA (ERA5-based Dataset for Atmospheric River Analysis), we present S-EDARA, a supplementary dataset that enhances AR impact assessment capabilities through a newer AR detection algorithm and additional impact-related metrics. S-EDARA includes AR shapes identified by the tARget version 4 (ARS4) algorithm, strong integrated vapour transport (SIVT) indicators, and pseudo total precipitation rate (PTPR) fields. The dataset features both numerical data and interactive graphical catalogues displaying ARS4, SIVT, PTPR, gusty winds, and 24 h temperature changes at 6-hourly intervals. These enhancements enable more comprehensive analysis of AR impacts and characteristics, particularly for regions experiencing rapidly changing meteorological conditions during AR events. The dataset covers the period from 1940 to the present and is publicly available through the Federated Research Data Repository.

(This article belongs to the Section Spatial Data Science and Digital Earth)

14 pages, 717 KB

Open Access Data Descriptor

In Situ Crop and Soil Data and UAV Imagery from Winter Wheat Fields in a Bulgarian Site

by Petar Dimitrov, Eugenia Roumenina, Georgi Jelev, Lachezar Filchev, Alexander Gikov, Ilina Kamenova, Iliana Ilieva, Dessislava Ganeva, Milena Kercheva, Martin Banov, Veneta Krasteva, Viktor Kolchakov, Emil Dimitrov and Nevena Miteva

Data 2026, 11(2), 35; https://doi.org/10.3390/data11020035 - 7 Feb 2026

Abstract

This data descriptor presents a dataset comprising crop and soil parameters measured in winter wheat fields near the town of Knezha, Bulgaria. The data were collected as part of a project evaluating the potential of vegetation indices derived from Sentinel-2 satellite imagery to predict biophysical and biochemical crop parameters. The core dataset consists of measurements obtained from 20 m × 20 m field plots and includes a broad range of parameters: leaf area index, fraction of absorbed photosynthetically active radiation, vegetation cover fraction, chlorophyll content, above-ground biomass, plant nitrogen content, biological yield, surface soil moisture, spectral reflectance, plant density, crop height, visual assessments of disease or pest damage, and data on weed occurrence. The dataset is complemented by unmanned aerial vehicle imagery, crop calendars, and field management information. The main soil types in the study area were characterized through soil profiles, while meteorological data were obtained from an automated weather station. The data were collected during the 2016–2017 and 2017–2018 agricultural seasons. The dataset is freely available for download and serves as a valuable resource for researchers in remote sensing (particularly for validating satellite-derived products) as well as for specialists involved in winter wheat monitoring, modeling, and agronomic studies.

(This article belongs to the Section Spatial Data Science and Digital Earth)

3 pages, 157 KB

Open Access Data Descriptor

Normative Physical Fitness Profiles and Sex Differences in University Students of Sport Sciences: An Open Dataset of Anthropometrics, Flexibility, Strength, and Jump Performance

by Julio Martín-Ruiz and Laura Ruiz-Sanchis

Data 2026, 11(2), 34; https://doi.org/10.3390/data11020034 - 7 Feb 2026

Abstract

This Data Descriptor provides an open, anonymized dataset describing anthropometric and physical fitness outcomes in undergraduate students enrolled in a Physical Activity and Sport Sciences degree program. The dataset included 156 participants (28 females and 128 males) and reported sex, age, body mass, stature, and body mass index, alongside standardized field-based tests covering flexibility, muscular endurance, strength, and jump performance. Hip flexibility was assessed using the Thomas test on both sides. Trunk extensor endurance was measured using the Biering–Sørensen test, and upper-body strength–endurance was assessed using a dead-hang test. Upper limb strength was recorded as elbow flexion strength. Lower limb power was evaluated using vertical jump tests, including Abalakov, squat jump, and countermovement jump, and a derived indicator (IE) was provided to facilitate comparisons across jump modalities. The data are distributed as a machine-readable CSV file accompanied by a detailed data dictionary describing the variables, units, and missingness. The dataset is intended to support the reproducible reporting of normative fitness profiles in sports science students, facilitate teaching and benchmarking in exercise science contexts, and enable secondary analyses exploring associations between anthropometry and physical performance. For reproducible inferential comparisons, users may apply Welch’s two-sample t-test for sex-based differences.
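
Welch's two-sample t-test, mentioned in the abstract, suits this dataset because the groups are unequal in size (28 vs. 128) and need not share a variance. The sketch below computes the Welch t statistic and the Welch–Satterthwaite degrees of freedom using only the standard library. The function name and the sample values are hypothetical and are not drawn from the dataset.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a: list[float], sample_b: list[float]) -> tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Each sample needs at least two observations so that the
    sample variance is defined.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error contributed by each group.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
    return t_stat, dof


if __name__ == "__main__":
    # Made-up jump heights in cm, for illustration only.
    group_f = [24.1, 26.3, 25.0, 27.8, 23.5]
    group_m = [31.2, 35.6, 29.9, 38.4, 33.0, 36.1]
    print(welch_t(group_f, group_m))
```

To obtain a p-value, the same test is available as `scipy.stats.ttest_ind(a, b, equal_var=False)`, provided SciPy is installed.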

(This article belongs to the Special Issue Big Data and Data-Driven Research in Sports)

17 pages, 1299 KB

Open Access Data Descriptor

Synthetic and Encoded Database of Dengue, Zika, Chikungunya, and Influenza Derived from the Literature

by Elí Cruz-Parada, Guillermina Vivar-Estudillo, Laura Pérez-Campos Mayoral, María Teresa Hernández-Huerta, Alma Dolores Pérez-Santiago, Carlos Romero-Diaz, Eduardo Pérez-Campos Mayoral, Iván Antonio García-Montalvo, Lucia Martínez-Martínez, Héctor Martínez-Ruiz, Idarh Matadamas, Miriam Emily Avendaño-Villegas, Margarito Martínez Cruz, Hector Alejandro Cabrera-Fuentes, Aldo Eleazar Pérez-Ramos, Eduardo Lorenzo Pérez-Campos and Carlos Mauricio Lastre-Domínguez

Data 2026, 11(2), 33; https://doi.org/10.3390/data11020033 - 6 Feb 2026

Abstract

This work presents a synthetic binary database of Dengue, Zika, Chikungunya, and Influenza constructed entirely from clinical information extracted from the scientific literature. Due to the limited availability and heterogeneity of clinical records in medical units (particularly for arboviral diseases), existing datasets are often insufficient for developing robust Machine Learning models. To address this limitation, an extensive search of PubMed and Google Scholar was conducted between February 2024 and May 2025, following strict selection criteria focused on diagnostic confirmation. The resulting dataset comprises 48,214 records and 67 standardized signs and symptoms, homogenized across all pathologies. Each record is fully binary, contains no missing values, and represents symptom presence or absence. The composition includes 22,379 Dengue records, 7135 Zika records, 7959 Chikungunya records, and 10,741 Influenza records. Symptom prevalence was analyzed, revealing consistency with patterns reported in epidemiological and clinical studies, supporting the dataset’s plausibility. This database enables statistical exploration and direct integration into Machine Learning pipelines without the need for imputation. It has been used in an in silico predictive study of arboviral diseases, employing Influenza as a negative control, and serves as a reproducible, literature-derived resource for computational modeling.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

12 pages, 1148 KB

Open Access Data Descriptor

Psoriatic Arthritis (PsA) Clinical Lipidomics Dataset with Hidden Laboratory Workflow Artifacts: A Benchmark Dataset for Data Processing Quality Control in Lipidomics

by Jörn Lötsch, Robert Gurke, Lisa Hahnefeld, Frank Behrens and Gerd Geisslinger

Data 2026, 11(2), 32; https://doi.org/10.3390/data11020032 - 3 Feb 2026

Abstract

This dataset presents a real-world lipidomics resource for developing and benchmarking quality control methods, batch effect detection algorithms, and data validation workflows. The data originate from a cross-sectional clinical study of psoriatic arthritis (PsA) patients (n = 81) and healthy controls (n = 26), matched for age, sex, and body mass index, which was conducted at a tertiary university rheumatology center. Subtle laboratory irregularities were detected only through advanced unsupervised analysis, after the data had passed conventional quality control and standard analytical methods. Blood samples were processed using standardized protocols and analyzed using high-resolution and tandem mass spectrometry platforms. Both targeted and untargeted lipid assays captured lipids of several classes (including carnitines, ceramides, glycerophospholipids, sphingolipids, glycerolipids, fatty acids, sterols and esters, endocannabinoids). The dataset is organized into four comma-separated value (CSV) files: (1) Box–Cox-transformed and imputed lipidomics values; (2) outlier-cleaned and imputed values on the original scale; (3) metadata including clinical classifications, biological sex, and batch information for all assay types and control sample processing dates; and (4) a variable-level description file (readme.csv). The 292 lipid variables are named according to LIPID MAPS classification and standardized nomenclature. Complete batch documentation and FAIR-compliant data structure make this dataset valuable for testing the robustness of analytical pipelines and quality control in lipidomics and related omics fields. The dataset does not compete with larger lipidomics quality control datasets for comparisons of results; instead, it offers real-life lipidomics data that display traces of the laboratory sample processing schedule and can be used to challenge quality control frameworks.

18 pages, 2474 KB

Open Access Data Descriptor

An Integrated Environmental and Perceptual Dataset for Predicting Comfort in Smart Campuses During the Fall Semester

by Gianni Tumedei, Chiara Ceccarini, Giovanni Delnevo and Catia Prandi

Data 2026, 11(2), 31; https://doi.org/10.3390/data11020031 - 3 Feb 2026

Abstract

Indoor environmental comfort plays a central role in occupants’ well-being, learning outcomes, and productivity, especially in educational buildings characterized by high occupancy variability and diverse activities. This paper presents a real-world dataset collected at the Cesena Campus of the University of Bologna, aimed at supporting occupant-centric comfort analysis and prediction in classrooms and laboratories. The dataset integrates continuous environmental measurements, such as temperature, humidity, noise, air pressure, and CO₂ concentration, with subjective comfort feedback gathered from students during regular lectures. Data were collected using permanently installed ceiling sensors and additional control sensors placed near occupants, enabling both longitudinal monitoring and validation analyses. Furthermore, the dataset includes both repeated comfort perception reports and a one-time comfort definition phase capturing individual relevance weights for different comfort dimensions. By combining objective and subjective data in realistic academic settings, the dataset provides a valuable resource for developing, benchmarking, and validating data-driven models for smart campus applications, indoor comfort prediction, and human-centered building analytics.

11 pages, 1038 KB

Open Access Data Descriptor

Refined IDRiD: An Enhanced Dataset for Diabetic Retinopathy Segmentation with Expert-Validated Annotations and Comprehensive Anatomical Context

by Sakon Chankhachon, Supaporn Kansomkeat, Patama Bhurayanontachai and Sathit Intajag

Data 2026, 11(2), 30; https://doi.org/10.3390/data11020030 - 1 Feb 2026

Abstract

The Indian Diabetic Retinopathy Image Dataset (IDRiD) has been widely adopted for DR lesion segmentation research. However, it contains annotation gaps for proliferative DR lesions and labeling errors that limit its utility for comprehensive automated screening systems. We present Refined IDRiD, an enhanced version that addresses these limitations through (1) expert ophthalmologist validation and correction of labeling errors in the original annotations for four non-proliferative lesions (microaneurysms, hemorrhages, hard exudates, cotton-wool spots), (2) the addition of three critical proliferative DR lesion annotations (neovascularization, vitreous hemorrhage, intraretinal microvascular abnormalities), and (3) the integration of comprehensive anatomical context (optic disc, fovea, blood vessels, retinal region). A team of three ophthalmologists (one senior specialist with >10 years’ experience and two expert fundus image annotators) conducted systematic annotation refinement, achieving an inter-rater agreement F1-score of 0.9012. The enhanced dataset comprises 81 high-resolution fundus images with pixel-level annotations for seven DR lesion types and four anatomical structures. All images were cropped to the retinal region of interest and resized to 1024 × 1024 pixels, with annotations stored as unified grayscale masks containing 12 classes, which enables efficient multi-task learning. Refined IDRiD enables training of comprehensive DR screening systems capable of detecting both non-proliferative and proliferative stages while reducing false positives through anatomical context awareness.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

18 pages, 1081 KB

Open Access Data Descriptor

Controlled Generation of Synthetic Spanish Texts: A Dataset Using LLMs with and Without Contextual Retrieval

by José M. García-Campos, Agustín W. Lara-Romero, Vicente Mayor and Jorge Calvillo-Arbizu

Data 2026, 11(2), 29; https://doi.org/10.3390/data11020029 - 1 Feb 2026

Abstract

The increasing ability of Large Language Models (LLMs) to generate fluent and coherent text has heightened the need for resources to analyze and detect synthetic content, particularly in Spanish, where the scarcity of datasets hinders the development of reliable detection systems. This work presents a Spanish-language dataset of 18,236 synthetic news descriptions generated from real journalistic headlines using a fully reproducible, open-source pipeline. The methodology used to produce the dataset includes both a Retrieval Augmented Generation (RAG) approach, which incorporates contextual information from recent news descriptions, and a NO-RAG approach, which relies solely on the headline. Texts were generated with the instruction-tuned Mistral 7B Instruct model, systematically varying temperature to explore the effect of generation parameters. The dataset includes detailed metadata linking each synthetic description to its source headline, generation settings, and, when applicable, retrieved contextual content. By combining contextual grounding, controlled parameter variation, and source-level traceability, this dataset provides a reproducible and richly annotated resource that supports research on synthetic Spanish text and the evaluation of LLM-based generation.

11 pages, 1353 KB

Open Access Data Descriptor

Dual-Source Synthetic Uzbek Corpora for Sentiment Analysis and NER with Controlled Emoji Signals

by Bobur Saidov, Vladimir Barakhnin, Shohrux Madirimov, Umid Ibragimov, Shakhboz Meylikulov, Sultonbek Normamatov, Feruza Bahodirova, Javlonbek Matnazarov and Zarnigor Fayzullaeva

Data 2026, 11(2), 28; https://doi.org/10.3390/data11020028 - 1 Feb 2026

Abstract

This data descriptor presents two fully synthetic corpora for sentiment analysis and named entity recognition (NER) in Uzbek. The first corpus contains 12,000 hybrid synthetic sentences generated from templates with lexical randomization, automatic insertion of named entities (PER/ORG/LOC), lexicon-based polarity scoring, and a controlled emoji distribution. The second corpus includes 3000 “manual-style” sentences designed to resemble short, naturally structured messages. Although the manual-style subset was initially intended to be emoji-free, 39.6% of the sentences in the released version contain at least one emoji, to maintain comparability in emotional markers across corpora. Both corpora are released in CSV, XLSX, and JSONL formats and share a unified schema (id, text, sentiment, entities, entity_type, polarity_score, polarity_source, token_count, emojis, emoji_position, emoji_sentiment, conflict_flag, sentiment_from_polarity_score, split). The dataset is publicly available via Mendeley Data (DOI: 10.17632/y2d5pcyrzz.3).
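
As an illustration of the unified schema listed in the abstract, the following minimal sketch checks that each record in a JSONL file carries exactly those fields. The field names are taken from the abstract; the function names, the example record, and the assumption that each JSONL line is one JSON object keyed by these names are hypothetical and not confirmed by the dataset's documentation.

```python
import json

# Field names copied from the unified schema listed in the abstract.
SCHEMA_FIELDS = (
    "id", "text", "sentiment", "entities", "entity_type",
    "polarity_score", "polarity_source", "token_count", "emojis",
    "emoji_position", "emoji_sentiment", "conflict_flag",
    "sentiment_from_polarity_score", "split",
)


def check_record(record):
    """Return (missing, unexpected) field names for one parsed record."""
    missing = [f for f in SCHEMA_FIELDS if f not in record]
    unexpected = [k for k in record if k not in SCHEMA_FIELDS]
    return missing, unexpected


def check_jsonl(lines):
    """Yield (line_number, missing, unexpected) for records that deviate."""
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        missing, unexpected = check_record(json.loads(line))
        if missing or unexpected:
            yield n, missing, unexpected


# Hypothetical records: one with every schema field (placeholder values),
# one deliberately incomplete. Neither is taken from the real corpora.
complete = json.dumps(dict.fromkeys(SCHEMA_FIELDS))
incomplete = json.dumps({"id": 1, "text": "..."})
problems = list(check_jsonl([complete, incomplete]))
```

With a real file, the same check could be run by passing an open file handle to `check_jsonl`, since a file object iterates line by line.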

10 pages, 1516 KB

Open Access Data Descriptor

Multiplex Immunofluorescence and Histopathology Dataset of Cell Cycle–Related Proteins in Renal Cell Carcinoma

by Hazem Abdullah, In Hwa Um, Grant D. Stewart, Alexander Laird, Kathryn Kirkwood, Chang Wook Jeong, Cheol Kwak, Kyung Chul Moon, TranSORCE Team, Tim Eisen, Elena Frangou, Anne Warren, Angela Meade and David J. Harrison

Data 2026, 11(2), 27; https://doi.org/10.3390/data11020027 - 1 Feb 2026

Abstract

Clear-cell renal cell carcinoma (ccRCC) accounts for the majority of kidney cancer diagnoses and exhibits widely variable clinical behaviour. The dataset described here was generated to support the discovery of robust biomarkers of tumour cell-cycle arrest and to inform the risk-stratified management of ccRCC. We assembled four independent cohorts: 480 patients from the UK arm of the SORCE adjuvant trial, 300 patients from a surgically treated series in Korea, 120 patients from a retrospective Scottish cohort, and a paired primary–metastatic cohort comprising 62 patients. Formalin-fixed paraffin-embedded nephrectomy specimens were processed for routine hematoxylin and eosin (H&E) histology and for multiplex immunofluorescence (mIF). The mIF panels detect the cyclin-dependent kinase inhibitor p21^CDKN1a, the DNA replication licencing factor MCM2, endoglin/CD105, Lamin B1 and nuclear DNA (Hoechst). Whole-slide images (WSIs) were acquired at high resolution, and artificial-intelligence pipelines were used to segment nuclei, classify individual cells into arrested phenotypes, and calculate the fraction of cells. Accompanying metadata include demographics, tumour stage, grade, Leibovich score, treatment arm (sorafenib/placebo), relapse events, and disease-free survival. All images and derived tables are released under a CC0 licence via the BioImage Archive, ensuring unrestricted reuse. This multi-cohort dataset provides a rich resource for studying cell-cycle arrest and proliferation markers, training image-analysis algorithms, and developing prognostic signatures in RCC.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

