Data, Volume 10, Issue 12 (December 2025) – 18 articles

Cover Story: Liepāja Lake is a shallow, coastal freshwater body and a Natura 2000 protected area that has experienced long-term pressures from urbanization, agriculture, and legacy industrial activities. This paper presents an openly accessible dataset of major and trace element concentrations in surface sediments and surface waters collected across Liepāja Lake during a coordinated field campaign in July 2024. Using standardized sampling protocols and ICP-MS analysis, the dataset provides high-resolution georeferenced measurements for 31 elements with accompanying uncertainty estimates. These data establish a robust baseline for pollution assessment, spatial analysis, modelling, and future monitoring, supporting evidence-based management and restoration of climate-sensitive Baltic coastal lakes.
12 pages, 2468 KB  
Article
A Real-World Underwater Video Dataset with Labeled Frames and Water-Quality Metadata for Aquaculture Monitoring
by Osbaldo Aragón-Banderas, Leonardo Trujillo, Yolocuauhtli Salazar, Guillaume J. V. E. Baguette and Jesús L. Arce-Valdez
Data 2025, 10(12), 211; https://doi.org/10.3390/data10120211 - 18 Dec 2025
Viewed by 654
Abstract
Aquaculture monitoring increasingly relies on computer vision to evaluate fish behavior and welfare under farming conditions. This dataset was collected in a commercial recirculating aquaculture system (RAS) integrated with hydroponics in Queretaro, Mexico, to support the development of robust visual models for Nile tilapia (Oreochromis niloticus). More than ten hours of underwater recordings were curated into 31 clips of 30 s each, a duration selected to balance representativeness of fish activity with a manageable size for annotation and training. Videos were captured using commercial action cameras at multiple resolutions (1920 × 1080 to 5312 × 4648 px), frame rates (24–60 fps), depths, and lighting configurations, reproducing real-world challenges such as turbidity, suspended solids, and variable illumination. For each recording, physicochemical parameters were measured, including temperature, pH, dissolved oxygen and turbidity, and are provided in a structured CSV file. In addition to the raw videos, the dataset includes 3520 extracted frames annotated using a polygon-based JSON format, enabling direct use for training object detection and behavior recognition models. This dual resource of unprocessed clips and annotated images enhances reproducibility, benchmarking, and comparative studies. By combining synchronized environmental data with annotated underwater imagery, the dataset contributes a non-invasive and versatile resource for advancing aquaculture monitoring through computer vision.

18 pages, 3229 KB  
Article
Labels4Rails: A Railway Image Annotation Tool and Associated Reference Dataset
by Tina Hiebert, Florian Hofstetter, Carsten Thomas, Savera Mushtaq, Eero Kaan and Biranavan Parameswaran
Data 2025, 10(12), 210; https://doi.org/10.3390/data10120210 - 16 Dec 2025
Viewed by 486
Abstract
The development of autonomous train systems relies heavily on machine learning (ML) models, which in turn depend on large, high-quality annotated datasets for training and evaluation. The railway domain lacks adequate public datasets and efficient annotation tools. To address this gap, we present Labels4Rails, a tool designed specifically for the annotation of railway scenes. It captures track topology, switch states including switch directions, and informational tags regarding the images’ content and leverages consistent camera perspectives and the fixed track geometries inherent to railways for annotation efficiency. We used Labels4Rails to create the L4R_NLB reference dataset from Norwegian railway footage. The dataset contains 10,253 annotated images across four seasons, including 1415 switch annotations. Both the tool and dataset are publicly available.

13 pages, 3528 KB  
Data Descriptor
AlimurgITA: A Database of the Italian Alimurgic Flora
by Piera Di Marzio, Angela Di Iorio, Carmen Giancola and Bruno Paura
Data 2025, 10(12), 209; https://doi.org/10.3390/data10120209 - 16 Dec 2025
Viewed by 278
Abstract
The AlimurgITA portal is a user-friendly and effective tool for researching Wild Edible Plants (WEPs). It provides valuable information on alimurgic plant species, aiding conservation and potential applications (agricultural, food, etc.). Users can interact with authors to report errors and contribute to the knowledge base regarding local uses. The authors will update the site every six months to include new data. Currently, the online database contains data on 1116 taxa used in 20 Italian regions: updated scientific name and link to the site Acta Plantarum, family, main synonyms, common name in Italian and regional dialect, chorotype, life form, a map showing the regions where it is known to be used, the part used, how it is used, and the bibliography. From the home page, you can search for taxa by scientific name, and there are pages dedicated to summaries of the entries: scientific name, family, chorotype, life form, method of use, and part used. Additionally, within the FuD WE PIC Project, the AlimurgITA entity list is being integrated with Italian vegetation data from the European Vegetation Archive to model WEP richness, identify diversity hotspots, and explore the relationship between WEP diversity and habitat types.
(This article belongs to the Section Information Systems and Data Management)

14 pages, 2851 KB  
Article
Automated Building of a Multidialectal Parallel Arabic Corpus Using Large Language Models
by Khalid Almeman
Data 2025, 10(12), 208; https://doi.org/10.3390/data10120208 - 12 Dec 2025
Viewed by 592
Abstract
The development of Natural Language Processing applications tailored for diverse Arabic-speaking users requires specialized Arabic corpora, which are currently lacking in existing Arabic linguistic resources. Therefore, in this study, a multidialectal parallel Arabic corpus is built, focusing on the travel and tourism domain. By leveraging the text generation and dialectal transformation capabilities of Large Language Models, an initial set of approximately 100,000 parallel sentences was generated. Following a rigorous multi-stage deduplication process, 50,010 unique parallel sentences were obtained from Modern Standard Arabic (MSA) and five major Arabic dialects: Saudi, Egyptian, Iraqi, Levantine, and Moroccan. This study presents the detailed methodology of corpus generation and refinement, describes the characteristics of the generated corpus, and provides a comprehensive statistical analysis highlighting the corpus size, lexical diversity, and linguistic overlap between MSA and the five dialects. This corpus represents a valuable resource for researchers and developers in Arabic dialect processing and AI applications that require nuanced contextual understanding.

24 pages, 9423 KB  
Article
Operator Learning with Branch–Trunk Factorization for Macroscopic Short-Term Speed Forecasting
by Bin Yu, Yong Chen, Dawei Luo and Joonsoo Bae
Data 2025, 10(12), 207; https://doi.org/10.3390/data10120207 - 12 Dec 2025
Viewed by 427
Abstract
Logistics operations demand real-time visibility and rapid response, yet minute-level traffic speed forecasting remains challenging due to heterogeneous data sources and frequent distribution shifts. This paper proposes a Deep Operator Network (DeepONet)-based framework that treats traffic prediction as learning a mapping from historical states and boundary conditions to future speed states, enabling robust forecasting under changing scenarios. We project logistics demand onto a road network to generate diverse congestion scenarios and employ a branch–trunk architecture to decouple historical dynamics from exogenous contexts. Experiments on both a controlled simulation dataset and the real-world Metropolitan Los Angeles (METR-LA) benchmark demonstrate that the proposed method outperforms classical regression and deep learning baselines in cross-scenario generalization. Specifically, the operator learning approach effectively adapts to unseen boundary conditions without retraining, establishing a promising direction for resilient and adaptive logistics forecasting.

21 pages, 33699 KB  
Data Descriptor
A Dataset for the Medical Support Vehicle Location–Allocation Problem
by Miguel Medina-Perez, Giovanni Guzmán, Magdalena Saldana-Perez, Adriana Lara and Miguel Torres-Ruiz
Data 2025, 10(12), 206; https://doi.org/10.3390/data10120206 - 10 Dec 2025
Viewed by 358
Abstract
In mass-casualty incidents, emergency responders require access to accurate and timely information to support informed decision-making and ensure the efficient allocation of resources. This article presents a dataset derived from a case study conducted in Mexico City (CDMX) based on the earthquake of 19 September 2017. The dataset presents hypothetical scenarios involving multiple demand points and large numbers of victims, making it suitable for analysis using optimization techniques. It integrates voluntary collaborative geographic information, open government data sources, and historical records, and details the data collection, cleaning, and preprocessing stages. The accompanying Python 3 source code enables users to update the original data for consistent analysis and processing. Researchers can adapt this dataset to other cities with similar risk characteristics, such as Santiago (Chile), Los Angeles (USA), or Tokyo (Japan), and extend it to other types of catastrophic events, including floods, landslides, or epidemics, to support emergency response and resource allocation planning.

12 pages, 697 KB  
Data Descriptor
Computational Dataset for Polymer–Pharmaceutical Interactions: MD/MM-PBSA and DFT Resources for Molecularly Imprinted Polymer (MIP) Design
by David Visentin, Mario Lovrić, Dejan Milenković, Robert Vianello, Željka Maglica, Kristina Tolić Čop and Dragana Mutavdžić Pavlović
Data 2025, 10(12), 205; https://doi.org/10.3390/data10120205 - 10 Dec 2025
Viewed by 385
Abstract
Molecularly imprinted polymers (MIPs) are promising sorbents for selectively capturing pharmaceutically active compounds (PhACs), but design remains slow because candidate screening is largely experimental or based on computationally expensive methods. We present MIP–PhAC, an open, curated resource of polymer–pharmaceutical interaction energies generated from molecular dynamics (MD) followed by MM/PBSA analysis, with a small DFT subset for cross-method comparison. This resource comprises two complementary datasets: MIP–PhAC-Calibrated, a benchmark set with manually verified pH-7 microstates that reports both monomeric (pre-polymerized) and polymeric (short-chain) MD/MMPBSA energies and includes a DFT subset; and MIP–PhAC-Screen, a broader, high-throughput collection produced under a uniform automated workflow (including automated protonation) for rapid within-polymer ranking and machine learning development. For each MIP–PhAC pair we provide ΔG* components (electrostatics, van der Waals, polar and non-polar solvation; −TΔS omitted), summary statistics from post-convergence frames, simulation inputs, and chemical metadata. To our knowledge, MIP–PhAC is the largest open, curated dataset of polymer–pharmaceutical interaction energies to date. It enables benchmarking of end-point methods, reproducible protocol evaluation, data-driven ranking of polymer–pharmaceutical combinations, and training/validation of machine learning (ML) models for MIP design on modest compute budgets.

10 pages, 4187 KB  
Data Descriptor
Early-Season Field Reference Dataset of Croplands in a Consolidated Agricultural Frontier in the Brazilian Cerrado
by Ana Larissa Ribeiro de Freitas, Fábio Furlan Gama, Ivo Augusto Lopes Magalhães and Edson Eyji Sano
Data 2025, 10(12), 204; https://doi.org/10.3390/data10120204 - 10 Dec 2025
Viewed by 530
Abstract
This dataset presents field observations collected in the municipality of Goiatuba, Goiás State, Brazil, a consolidated and representative agricultural frontier of the Brazilian Cerrado biome. The region presents diverse land use dynamics, including annual cropping systems, irrigated fields with up to three harvests per year, and pasturelands. We conducted a field campaign from 3 to 7 November 2025, corresponding to the beginning of the 2025/2026 Brazilian crop season, when crops were at distinct early phenological stages. To ensure representativeness, we delineated 117 reference fields prior to the field campaign, and an additional 463 plots were surveyed during fieldwork. Geographic coordinates, crop types, and photographic records were obtained using the GPX Viewer application, a handheld GPS receiver, and the QField 3.7.9 mobile GIS application running on a tablet uploaded with Sentinel-2 true-color imagery and the municipal road network. Plot boundaries were subsequently digitized in QGIS Desktop 3.34.1 software, following a conservative mapping strategy to minimize edge effects and internal heterogeneity associated with trees and water catchment basins. In total, more than 26,000 hectares of agricultural fields were mapped, along with additional land use and land cover polygons representing water bodies, urban areas, and natural vegetation fragments. All reference fields were labeled based on in situ observations and linked to Sentinel-2 mosaics downloaded via the Google Earth Engine platform. This dataset is well-suited for training, testing, and validation of remote sensing classifiers, benchmarking studies, and agricultural mapping initiatives focused on the beginning of the agricultural season in the Brazilian Cerrado.
(This article belongs to the Special Issue New Progress in Big Earth Data)

26 pages, 10467 KB  
Article
ANSEC-MM: Identifying Antecedents of Negative Public Sentiment Through Expression Capacity: A Mixed-Methods Approach to Crisis Mitigation
by Zeeshan Rasheed, Shahzad Ashraf and Syed Kanza Mehak
Data 2025, 10(12), 203; https://doi.org/10.3390/data10120203 - 9 Dec 2025
Viewed by 410
Abstract
Social networks have emerged as integral platforms for communication and information dissemination in contemporary society. The spread of negative sentiments and its impact on activities of users in social networks is a crucial issue. When users receive negative reviews about news or articles, regardless of authenticity, they form opinions based on their own understanding, and statistics show that more than 90% of the time this reveals predictable behavior patterns. To address this situation, the proposed Antecedents of Negative Sentiment through Expression Capacity: Mixed Methods (ANSEC-MM) study identifies the antecedents of negative sentiment using expression capacity as a mixed-methods approach to mitigate the generation of negative sentiments. The proposed model introduces the concept of identification of influencer nodes with further categorization into active and inactive influencer nodes. The model separates negative influencer nodes from positive nodes and processes the negative influencer nodes further. A Node Expressive Capacity (NE) metric predicts the frequency with which users interact with neighboring influencer nodes, which contributes to the generation of negative sentiments. A Cognitive Effect Coefficient (φ) defines the temperament status of the users. Through further computation, the model distinguishes the proportion of negative sentiments from positive ones. Negative sentiment mitigation is achieved through a developed algorithmic approach. Performance is tested and compared across three datasets against state-of-the-art models: EANN, BERT, and AOAN. The proposed model demonstrated superior performance in negative sentiment detection and mitigation, achieving accuracy rates of 90% and 88%, respectively, compared to existing models.
(This article belongs to the Special Issue Advances in Graph-Structured Data: Methods and Applications)

18 pages, 1347 KB  
Data Descriptor
China’s 15-Year Mine Accident Report Dataset (2010–2025): Construction and Analysis
by Maoquan Wan, Hao Li, Hao Wang, Hanjun Gong and Jie Hou
Data 2025, 10(12), 202; https://doi.org/10.3390/data10120202 - 4 Dec 2025
Viewed by 1223
Abstract
Mine accidents pose severe threats to worker safety and sustainable mining development in China. However, existing mine accident data in China are often scattered, unstructured, and lack systematic integration, which limits their application in safety research and practice. This study constructed a standardized structured dataset using 532 mine accident reports from official channels covering the period 2010–2025. The dataset went through four stages: data collection, standardized cleaning, structured annotation, and quality validation. It is stored in JSON Lines (JSONL) format for easy reuse. The dataset covers 27 provinces/autonomous regions/municipalities in China. Among accident levels, general accidents account for 65.6%; among accident types, roof accidents account for 20.3%. Accidents are geographically concentrated, with 11.7%, 8.3%, and 7.7% occurring in Shanxi, Gansu, and Inner Mongolia, respectively. Official data have shown an annual average decrease of 9.7% in mine accidents from 2018 to 2022, reflecting improved safety governance. This dataset addresses the gap of a full-element structured mine accident database in China, providing high-quality data for accident causation modeling, regional risk early warning, and safety policy evaluation. It also supports mine enterprises in targeted risk prevention and regulatory authorities in precise regulatory enforcement.

40 pages, 614 KB  
Review
Data Quality in the Age of AI: A Review of Governance, Ethics, and the FAIR Principles
by Miriam Guillen-Aguinaga, Enrique Aguinaga-Ontoso, Laura Guillen-Aguinaga, Francisco Guillen-Grima and Ines Aguinaga-Ontoso
Data 2025, 10(12), 201; https://doi.org/10.3390/data10120201 - 4 Dec 2025
Viewed by 3052
Abstract
Data quality is fundamental to scientific integrity, reproducibility, and evidence-based decision-making. Nevertheless, many datasets lack transparency in their collection and curation, undermining trust and reusability across research domains. This narrative review synthesizes scientific and technical literature published between 1996 and 2025, complemented by international standards (ISO/IEC 25012, ISO 8000), to provide an integrated overview of data quality frameworks, governance, and ethical considerations in the era of Artificial Intelligence (AI). Sources were retrieved from PubMed, Scopus, Web of Science, and grey literature. Across sectors, accuracy, completeness, consistency, timeliness, and accessibility consistently emerged as universal quality dimensions. Evidence from healthcare, business, and public administration suggests that poor data quality leads to substantial financial losses, operational inefficiencies, and erosion of trust. Emerging frameworks are increasingly integrating FAIR principles (Findability, Accessibility, Interoperability, Reusability) and incorporating ethical safeguards, including bias mitigation in AI systems. Data quality is not solely a technical issue but a socio-organizational challenge that requires robust governance and continuous assurance throughout the data lifecycle. Embedding quality and ethical governance into data management practices is crucial for producing trustworthy, reusable, and reproducible data that supports sound science and informed decision-making.

10 pages, 11432 KB  
Data Descriptor
Georeferenced Sediment and Surface Water Element Concentrations in the Coastal Liepāja Lake (Latvia), 2024
by Inga Grinfelde, Uldis Valainis, Maris Nitcis, Ieva Buske, Jana Grave, Normunds Stivrins, Vilda Grybauskiene, Gitana Vyciene, Maris Bertins and Jovita Pilecka-Ulcugaceva
Data 2025, 10(12), 200; https://doi.org/10.3390/data10120200 - 3 Dec 2025
Viewed by 350
Abstract
Liepāja Lake, a Natura 2000 protected area and one of the largest coastal freshwater bodies in Latvia, has been historically influenced by urbanization, diffuse agricultural inputs, and legacy contamination from metallurgy and ship-repair industries. Comprehensive, spatially explicit data on its sediment and water chemistry were previously lacking. The dataset used in this study provides an openly accessible record of major and trace element concentrations in surface sediments and surface waters collected during the 2024 field campaign. Sampling sites were distributed across northern, central, and southern zones to capture gradients in anthropogenic pressure and natural variability. Water samples were filtered and acidified following ISO 15587-2:2002, while sediments were homogenized, sieved, and digested following EPA 3051a. Both matrices were analyzed using Inductively Coupled Plasma Mass Spectrometry (ICP-MS, Agilent 8900 ICP-QQQ) with multi-element calibration traceable to NIST standards. The dataset comprises 31 analytes (Li–Bi) with paired standard deviation values, reported in mg kg⁻¹ (sediments) and µg L⁻¹ (water). Rigorous validation included certified reference materials, duplicates, blanks, and statistical outlier screening. The resulting data form a reliable geochemical baseline for assessing pollution sources, quantifying spatial heterogeneity, and supporting future monitoring, modeling, and restoration efforts in climate-sensitive Baltic coastal lakes.

12 pages, 2829 KB  
Data Descriptor
Sound Absorption Coefficient Data for Laboratory-Produced Sound-Absorbing Panels from Textile Waste
by Kristaps Siltumens, Inga Grinfelde, Raitis Brencis and Andris Paeglitis
Data 2025, 10(12), 199; https://doi.org/10.3390/data10120199 - 2 Dec 2025
Viewed by 494
Abstract
With the increasing demand for sustainable building materials, it has become essential to identify sustainable alternatives to conventional sound absorbers, particularly in the context of waste reduction and the circular economy. The aim of this study was to compile and describe a structured dataset of sound absorption coefficients for laboratory-produced panels made from recycled textile materials. Five types of panels were developed using cotton, polyester, wool, linen, and a mixed composition of textiles. A biopolymer binder was applied to ensure structural stability of the materials. Following careful sorting, shredding, and homogenization of the textile waste, test specimens were prepared and examined under controlled laboratory conditions. The sound absorption coefficients were measured using an AFD 1000 impedance tube in accordance with the ISO 10534-2 standard, across a frequency range from 6.25 to 6393.75 Hz. For each material, three repeated measurements were performed, and mean values were calculated to ensure accuracy and reliability. The resulting dataset contains structured values of sound absorption coefficients, which can be applied in building acoustics modeling, comparative studies with conventional insulation materials, and the development of new sustainable products. In addition, the data can be used in educational contexts and machine learning applications to predict the acoustic properties of recycled textile composites.

16 pages, 446 KB  
Data Descriptor
Open Dataset on Neurocognitive Complaints and Physical Symptoms in Long COVID: A Six-Month Post-Infection Cohort
by Somayeh Pour Mohammadi, Francisco Mercado Romero, Moein Noroozi Fashkhami and Irene Peláez
Data 2025, 10(12), 198; https://doi.org/10.3390/data10120198 - 1 Dec 2025
Viewed by 626
Abstract
Long COVID is frequently accompanied by enduring neurocognitive and physical symptoms that substantially affect quality of life. Cognitive complaints, including difficulties in memory, attention, and executive functioning, often co-occur with physical manifestations such as fatigue, dyspnea, and headache. Despite growing research, openly available datasets integrating demographic, cognitive, and physical symptom profiles assessed during chronic phases of Long COVID remain scarce. Here, we present two complementary self-report datasets collected ≥6 months after the most recent COVID-19 infection. The first dataset (“Neuro–Long COVID-212”) includes demographic information, binary neurocognitive symptom indicators, and a 14-item Post-COVID Cognitive Impairment Scale assessing memory and attention complaints. The second dataset (“Neuro–Long COVID–210”) provides a broad range of physical symptoms, operationally defined as somatic and neurological complaints (e.g., fatigue, pain, sleep disturbance, anosmia/ageusia), recorded as binary indicators (present/absent). Data were collected online via the Porsline platform using individualized links, with remote researcher support to ensure accuracy. Quality assurance procedures included duplicate-response removal, consistency checks, and transparent handling of missing values. The datasets are released in Excel (.xlsx) format, fully de-identified and accompanied by a detailed data dictionary to facilitate reuse. These datasets enable reproducibility, secondary analyses, and meta-analyses on cognitive and physical outcomes in Long COVID, and may inform future cross-disciplinary rehabilitation research.

14 pages, 2974 KB  
Data Descriptor
Articulatory Data on Preboundary Lengthening Across Prominence Conditions in American English
by Jiyoung Jang, Sahyang Kim and Taehong Cho
Data 2025, 10(12), 197; https://doi.org/10.3390/data10120197 - 1 Dec 2025
Viewed by 271
Abstract
This article presents articulatory–kinematic data on preboundary lengthening (Intonational Phrase-final lengthening) from the productions of ten native speakers of American English, a relatively rare class of phonetic data compared with the more widely available acoustic data. The dataset includes three trisyllabic nonce words (bábaba, babába, bababá), each designed to manipulate the location of lexical stress. These were produced under prosodic conditions that varied in boundary position and focus-induced phrasal prominence, enabling analysis of how preboundary lengthening is distributed across words with different lexical stress locations and how it interacts with prosodic prominence. Articulatory data were collected using electromagnetic articulography (EMA, Carstens AG200), providing kinematic measurements such as movement duration, peak velocity, and displacement of articulatory gestures. The accompanying files allow examination of individual speaker variation in these measures as modulated by prosodic structure, including boundary and prominence effects. While theoretical findings have been reported in a previous study, the full dataset, including detailed descriptions of individual speaker patterns, is made available here. By making these less commonly available articulatory data publicly available, we aim to promote broad reuse and support further research in prosody, articulatory phonetics, and speech production.

14 pages, 1118 KB  
Article
Using Machine Learning to Identify Predictors of Heterogeneous Intervention Effects in Childhood Obesity Prevention
by Elizabeth Mannion, Kristine Bihrmann, Nanna Julie Olsen, Berit Lilienthal Heitmann and Christian Ritz
Data 2025, 10(12), 196; https://doi.org/10.3390/data10120196 - 1 Dec 2025
Viewed by 453
Abstract
Obesity prevention interventions in children often produce small or null effects. However, ignoring heterogeneous responses may widen pre-existing inequalities. This secondary analysis explored baseline predictors of differential effects on BMI z-score, fat mass (%), stress, and sleep outcomes in obesity-susceptible, healthy-weight children (n = 543). A modified LASSO regression was applied to baseline characteristics, including physical activity and socio-demographics. Few predictors were retained. For BMI z-score, weekly chores and parental divorce were the strongest predictors: children who did chores had a slightly larger increase in BMI z-score in the intervention group compared with controls (MD = 0.15, 95% CI: −0.03, 0.33), while children with divorced parents showed a smaller increase (MD = −0.19, 95% CI: −0.69, 0.31). These results align with evidence that low-intensity activity has limited impact on obesity outcomes and that children with compounded vulnerability may respond differently to tailored interventions. Even when overall effects are small, machine learning approaches can identify potential predictors of heterogeneous intervention effects, supporting the design of future targeted interventions aimed at reducing inequalities.

27 pages, 2582 KB  
Data Descriptor
DECOVID: A UK Two-Center Harmonized Database of Acute Care Electronic Health Records for COVID-19 Research
by DECOVID Consortium, Louis J. M. Aslett, Andreea Avramescu, Nicholas Bakewell, Isabel Birds, Louise Bowler, Michael P. J. Camilleri, Sheng-Chia Chung, David A. Clifton, Samuel N. Cohen, Nathan Constantine-Cooke, Eric G. Daub, Shaun Davidson, Spiros Denaxas, Karla Diaz-Ordaz, Richard Feltbower, Suzy Gallier, Stephen Gardiner, Francesca Gasperoni, Robert J. B. Goudie, Rebecca E. Green, Marlous Hall, Chris Holmes, John R. Hurst, Mark M. Iles, Joao Jorge, Emma Karoune, Ruth Keogh, Ruairidh King, Ruth King, Paul D. W. Kirk, Roman Klapaukh, Samaneh Kouchaki, Alvina G. Lai, Nathan Lea, Clemence Leyrat, Kezhi Li, Watjana Lilaonitkul, Huiqi Y. Lu, Terry Lyons, Ann Marie Mallon, Andrew Manderson, Nicolò Margaritella, Joshua Matteson, Sam Morley, Hannah Nicholls, Martin O’Reilly, Christina Pagel, Edward Palmer, Jack Roberts, Timothy J. Roberts, David S. Robertson, James Robinson, Patrick Rockenschaub, Roy Ruddle, Elizabeth Sapey, Luis Santos, Andrew A. S. Soltan, Fang Gao Smith, Colin Starr, Oliver Strickson, Li Su, Mia S. Tackney, Johan H. Thygesen, Ana Torralbo, Alice Turner, Catalina A. Vallejos, Chenyang Wang, Kirstie Whitaker, Tony Whitehouse, David R. Westhead, Wai Keong Wong, Yue Wu, Lingyi Yang and Xiaoxu Zou
Data 2025, 10(12), 195; https://doi.org/10.3390/data10120195 - 24 Nov 2025
Viewed by 678
Abstract
The DECOVID database contains harmonized pseudonymized electronic health record (EHR) data on all adult (≥18 years old) patients presenting to two large, digitally mature centers in the United Kingdom between 1 January 2020 and 28 February 2021, with follow-up until at least 28 March 2021. The database was originally developed to support the COVID-19 response but is now available via the PIONEER data hub for researchers to explore a wide range of research questions, including exploratory analyses, risk factor assessment, prediction modeling, and comparative effectiveness studies. Raw data were extracted from local EHRs and transformed into a standardized form (Observational Health Data Sciences and Informatics-Common Data Model version 5.3.1). The database includes 165,420 patients across 256,804 hospital presentations. For these patients, highly granular data are available, including patient demographics, longitudinal vital signs, physiology, treatments, laboratory findings, clinical diagnoses, and outcomes. There are 10,030 patients with COVID-19, of whom 1472 died in hospital.

12 pages, 7963 KB  
Data Descriptor
SurfaceEMG Datasets for Hand Gesture Recognition Under Constant and Three-Level Force Conditions
by Cinthya Alejandra Zúñiga-Castillo, Víctor Alejandro Anaya-Mosqueda, Natalia Margarita Rendón-Caballero, Marcos Aviles, José M. Álvarez-Alvarado, Roberto Augusto Gómez-Loenzo and Juvenal Rodríguez-Reséndiz
Data 2025, 10(12), 194; https://doi.org/10.3390/data10120194 - 22 Nov 2025
Viewed by 924
Abstract
This work introduces two complementary surface electromyography (sEMG) datasets for hand gesture recognition. Signals were collected from 40 healthy subjects aged 18 to 40 years, divided into two independent groups of 20 participants each. In both datasets, subjects performed five hand gestures. Most of the gestures are the same, although the exact set and the order differ slightly between datasets. For example, Dataset 2 (DS2) includes the simultaneous flexion of the thumb and index finger, which is not present in Dataset 1 (DS1). Data were recorded with three bipolar sEMG sensors placed on the dominant forearm (flexor digitorum superficialis, extensor digitorum, and flexor pollicis longus). A battery-powered acquisition system was used, with sampling rates of 1000 Hz for DS1 and 1500 Hz for DS2. DS1 contains recordings performed at a constant moderate force, while DS2 includes three force levels (low, medium, and high). Both datasets provide raw signals and pre-processed versions segmented into overlapping windows, with clear file structures and annotations, enabling feature extraction for machine learning applications. Together, they constitute a large-scale standardized sEMG resource that supports the development and benchmarking of gesture and force recognition algorithms for rehabilitation, assistive technologies, and prosthetic control.
