Data, Volume 10, Issue 9 (September 2025) – 17 articles

Cover Story: Meteoroids entering Earth’s atmosphere generate both luminous trails and shock waves that are detected as low-frequency acoustic signals (infrasound) at ground arrays. The authors present the first openly accessible dataset of 71 regional meteor events simultaneously recorded by all-sky cameras and a dedicated infrasound array in Ontario, Canada. Each event entry provides trajectory data, acoustic waveforms, and atmospheric profiles, enabling robust analysis of shock generation, propagation, and energy deposition. By integrating optical and acoustic observations, this dataset offers a durable community resource supporting meteor physics, atmospheric acoustics, and planetary defense research. SNL is managed and operated by NTESS under DOE NNSA contract DE-NA0003525.
24 pages, 4286 KB  
Article
Validation of Anthropogenic Emission Inventories in Japan: A WRF-Chem Comparison of PM2.5, SO2, NOx and CO Against Observations
by Kenichi Tatsumi and Nguyen Thi Hong Diep
Data 2025, 10(9), 151; https://doi.org/10.3390/data10090151 - 22 Sep 2025
Abstract
Reliable, high-resolution emission inventories are essential for accurately simulating air quality and for designing evidence-based mitigation policies. Yet their performance over Japan—where transboundary inflow, strict fuel regulations, and complex source mixes coexist—remains poorly quantified. This study therefore benchmarks four widely used anthropogenic inventories—REAS v3.2.1, CAMS-GLOB-ANT v6.2, ECLIPSE v6b, and HTAP v3—by coupling each to WRF-Chem (10 km grid) and comparing simulated surface PM2.5, SO2, CO, and NOx with observations from >900 stations across eight Japanese regions for the years 2010 and 2015. All simulations shared identical meteorology, chemistry, and natural-source inputs (MEGAN 2.1 biogenic VOCs; FINN v1.5 biomass burning) so that differences in model output isolate the influence of anthropogenic emissions. HTAP delivered the most balanced SO2 and CO fields (regional mean biases mostly within ±25%), whereas ECLIPSE reproduced NOx spatial gradients best, albeit with a negative overall bias. REAS captured industrial SO2 reliably but over-estimated PM2.5 and NOx in western conurbations while under-estimating them in rural prefectures. CAMS-GLOB-ANT showed systematic biases—under-estimating PM2.5 and CO yet markedly over-estimating SO2—highlighting the need for Japan-specific sulfur-fuel adjustments. For several pollutant–region combinations, absolute errors exceeded 100%, confirming that emissions uncertainty, not model physics, dominates regional air quality error even under identical dynamical and chemical settings. These findings underscore the importance of inventory-specific and pollutant-specific selection—or better, multi-inventory ensemble approaches—when assessing Japanese air quality and formulating policy. Routine assimilation of ground and satellite data, together with inverse modeling, is recommended to narrow residual biases and improve future inventories. Full article
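The regional bias statistics quoted above can be illustrated with a minimal sketch: the function below computes a normalised mean bias in percent, a common air-quality evaluation metric. The authors' exact metric definitions are given in the paper, and the sample values here are invented.

```python
def normalized_mean_bias(sim, obs):
    """Normalised mean bias (NMB) in percent: the kind of regional bias
    statistic behind statements like 'mean biases mostly within +/-25%'.
    Positive values indicate the model overestimates the observations."""
    return 100.0 * sum(s - o for s, o in zip(sim, obs)) / sum(obs)

# Invented station means (e.g. PM2.5 in ug/m3): model vs. observed
obs = [10.0, 12.0, 8.0, 14.0]
sim = [11.0, 13.0, 9.0, 15.0]
print(normalized_mean_bias(sim, obs))  # ~9.09, i.e. a ~9% overestimate
```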
9 pages, 952 KB  
Data Descriptor
A Framework for the Datasets of CRDS CO2 and CH4 Stable Carbon Isotope Measurements in the Atmosphere
by Francesco D’Amico, Ivano Ammoscato, Giorgia De Benedetto, Luana Malacaria, Salvatore Sinopoli, Teresa Lo Feudo, Daniel Gullì and Claudia Roberta Calidonna
Data 2025, 10(9), 150; https://doi.org/10.3390/data10090150 - 22 Sep 2025
Abstract
Accessible datasets of greenhouse gas (GHG) concentrations help define long-term trends on a global scale and also provide significant information on the characteristic variability of emission sources and sinks. The integration of stable carbon isotope measurements of carbon dioxide (CO2) and methane (CH4) can significantly increase the accuracy and reliability of source apportionment efforts, due to the isotopic fractionation processes and fingerprint that characterize each mechanism. Via isotopic parameters such as δ13C, the ratio of 13C to 12C compared to an international standard (VPDB, Vienna Pee Dee Belemnite), it is in fact possible to discriminate, for example, between thermogenic and microbial sources of CH4, thus ensuring a more detailed understanding of global balances. A number of stations within the Italian consortium of atmospheric observation sites have been equipped with Picarro G2201-i CRDS (Cavity Ring-Down Spectroscopy) analyzers capable of measuring the stable carbon isotopic ratios of CO2 and CH4, reported as δ13C-CO2 and δ13C-CH4, respectively. The first dataset (Lamezia Terme, Calabria region) of the consortium resulting from these measurements was released, and a second dataset (Potenza, Basilicata region) from another station was also released, relying on the same format to effectively standardize these new types of datasets. This work provides details on the data, format, and methods used to generate these products and describes a framework for the format and processing of similar data products based on CRD spectroscopy. Full article
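For readers unfamiliar with the δ13C notation used above, a minimal sketch of the per mil calculation follows. The VPDB ratio constant is a commonly cited reference value, not taken from the paper, and the sample ratio is invented.

```python
# delta-13C in per mil relative to the VPDB standard, as described in the
# abstract: the sample's 13C/12C ratio compared to an international standard.
VPDB_R = 0.0111802  # commonly cited 13C/12C ratio of VPDB (an assumption here)

def delta13C(r_sample: float, r_standard: float = VPDB_R) -> float:
    """Return delta-13C in per mil: (R_sample / R_standard - 1) * 1000."""
    return (r_sample / r_standard - 1.0) * 1000.0

# An invented CH4 sample with a depleted 13C/12C ratio gives a strongly
# negative delta-13C, as expected for microbial rather than thermogenic CH4.
print(delta13C(0.0105))  # strongly negative (around -60 per mil)
```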
13 pages, 874 KB  
Data Descriptor
The Tabular Accessibility Dataset: A Benchmark for LLM-Based Web Accessibility Auditing
by Manuel Andruccioli, Barry Bassi, Giovanni Delnevo and Paola Salomoni
Data 2025, 10(9), 149; https://doi.org/10.3390/data10090149 - 19 Sep 2025
Abstract
This dataset was developed to support research at the intersection of web accessibility and Artificial Intelligence, with a focus on evaluating how Large Language Models (LLMs) can detect and remediate accessibility issues in source code. It consists of code examples written in PHP, Angular, React, and Vue.js, organized into accessible and non-accessible versions of tabular components. A substantial portion of the dataset was collected from student-developed Vue components, implemented using both the Options and Composition APIs. The dataset is structured to enable both a static analysis of source code and a dynamic analysis of rendered outputs, supporting a range of accessibility research tasks. All files are in plain text and adhere to the FAIR principles, with open licensing (CC BY 4.0) and long-term hosting via Zenodo. This resource is intended for researchers and practitioners working on LLM-based accessibility validation, inclusive software engineering, and AI-assisted frontend development. Full article
(This article belongs to the Section Information Systems and Data Management)
31 pages, 1887 KB  
Article
ZaQQ: A New Arabic Dataset for Automatic Essay Scoring via a Novel Human–AI Collaborative Framework
by Yomna Elsayed, Emad Nabil, Marwan Torki, Safiullah Faizullah and Ayman Khalafallah
Data 2025, 10(9), 148; https://doi.org/10.3390/data10090148 - 19 Sep 2025
Abstract
Automated essay scoring (AES) has become an essential tool in educational assessment. However, applying AES to the Arabic language presents notable challenges, primarily due to the lack of labeled datasets. This data scarcity hampers the development of reliable machine learning models and slows progress in Arabic natural language processing for educational use. While manual annotation by human experts remains the most accurate method for essay evaluation, it is often too costly and time-consuming to create large-scale datasets, especially for low-resource languages like Arabic. In this work, we introduce a human–AI collaborative framework designed to overcome the shortage of scored Arabic essays. Leveraging QAES, a high-quality annotated dataset, our approach uses Large Language Models (LLMs) to generate multidimensional essay evaluations across seven key writing traits: Relevance, Organization, Vocabulary, Style, Development, Mechanics, and Structure. To ensure accuracy and consistency, we design prompting strategies and validation procedures tailored to each trait. This system is then applied to two unannotated Arabic essay datasets: ZAEBUC and QALB. As a result, we introduce ZaQQ, a newly annotated dataset that merges ZAEBUC, QAES, and QALB. Our findings demonstrate that human–AI collaboration can significantly enhance the availability of labeled resources without compromising assessment quality. The proposed framework serves as a scalable and replicable model for addressing data annotation challenges in low-resource languages and supports the broader goal of expanding access to automated educational assessment tools where expert evaluation is limited. Full article
16 pages, 729 KB  
Data Descriptor
An International Database of Public Attitudes Toward Stuttering
by Kenneth O. St. Louis
Data 2025, 10(9), 147; https://doi.org/10.3390/data10090147 - 18 Sep 2025
Abstract
The Public Opinion Survey of Human Attributes–Stuttering (POSHA–S) Database is intermittently updated and, at the time of this report, contains 25,739 respondents from 45 countries with responses in 28 languages, representing 11 world regions. Among public and selected population samples, more than 600 self-identified stutterers are included. The Microsoft Excel database file features more than 150 columns of POSHA–S results. Some data, such as state/province and country of respondents, primary job or occupation, languages known, race, and religion, are included as text. Other demographic items and all attitude items are numerical data. The POSHA–S has check boxes or scales of 1–5 for other demographic variables and general ratings that compare stuttering to four other “anchor” attributes (intelligence, left-handedness, obesity, and mental illness). All subsequent stuttering attitude items are scored on a scale of 1–3, reflecting “no”, “not sure”, and “yes”, respectively. All scaled ratings are converted to a uniform −100 to +100 scale, with some item ratings inverted so that, uniformly, higher ratings reflect more positive attitudes and lower ratings reflect more negative attitudes. All respondents are classified according to population, a category within population, region or continent, country, language, and other distinctive features. Full article
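The uniform rescaling described above is a linear map; the helper below is a hypothetical reimplementation of that conversion, not the database's own code, and the exact formula used for the POSHA–S may differ.

```python
def to_pm100(value: float, lo: float, hi: float, invert: bool = False) -> float:
    """Map a rating on a lo..hi scale linearly onto -100..+100, optionally
    inverting it so that higher output always means a more positive attitude.
    A sketch of the uniform rescaling described in the abstract."""
    mid = (lo + hi) / 2.0
    half = (hi - lo) / 2.0
    score = (value - mid) / half * 100.0
    return -score if invert else score

# 1-5 scale: 1 -> -100, 3 -> 0, 5 -> +100
print(to_pm100(5, 1, 5))                # 100.0
# 1-3 "no / not sure / yes" scale, inverted for a negatively worded item
print(to_pm100(1, 1, 3, invert=True))   # 100.0
```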
22 pages, 13347 KB  
Article
UTHECA_USE: A Multi-Source Dataset on Human Thermal Perception and Urban Environmental Factors in Seville
by Noelia Hernández-Barba, José-Antonio Rodríguez-Gallego, Carlos Rivera-Gómez and Carmen Galán-Marín
Data 2025, 10(9), 146; https://doi.org/10.3390/data10090146 - 16 Sep 2025
Abstract
This paper introduces UTHECA_USE, a dataset of 989 observations collected in Seville, Spain (2023–2025), integrating microclimatic, personal, and urban morphological data. It comprises 55 variables, including in situ measurements of air and globe temperatures, humidity, wind speed, derived indices such as the Universal Thermal Climate Index (UTCI), demographic and physiological participant data, subjective thermal perception, and detailed urban form characteristics. The surface temperature data of urban materials are included in a subset. The dataset is openly accessible under a permissive license, and this data descriptor documents the collection methods, calibration, survey design, and data processing to ensure reproducibility and transparency. The UTHECA project aims to develop a more accurate and adaptive outdoor thermal comfort (OTC) assessment model to guide effective, inclusive urban strategies to improve human thermal perception and climate resilience. UTHECA_USE facilitates research on outdoor thermal comfort and urban microclimates, supporting diverse analyses linking human perception, environmental conditions, and urban morphology. Full article
(This article belongs to the Collection Modern Geophysical and Climate Data Analysis: Tools and Methods)
7 pages, 622 KB  
Data Descriptor
High-Resolution Magnetic Susceptibility Dataset from Borehole Samples from the “Rudnik” Mine Tailings, Republic of Serbia
by Vesna Cvetkov and Filip Arnaut
Data 2025, 10(9), 145; https://doi.org/10.3390/data10090145 - 16 Sep 2025
Abstract
In 2024, high-resolution (10 cm) magnetic susceptibility (MS) data acquisition and subsequent sample preparation and laboratory measurements were conducted at the “Rudnik” mine tailings site in the Republic of Serbia. The dataset consists of 1010 measurements obtained from 7 boreholes, with the largest borehole containing 218 continuously measured MS samples and the smallest containing 103 measured values. The dataset includes mass magnetic susceptibility data from seven boreholes, accompanied by lithological descriptions of the respective samples and measured sample mass data. High-resolution MS data were obtained during the characterization phase of flotation tailings, as the MS technique is established as an effective proxy for detecting heavy metals in tailings, while also being cost-effective, straightforward, and rapid. Consequently, researchers can acquire extensive data that are correlated with heavy metal concentrations while reserving costly and time-intensive chemical analyses for only the most relevant samples identified by the analysis of MS values. Beyond its direct geophysical applications, the dataset fosters transparency and interdisciplinary collaboration, allowing geoscientists, statisticians, and data scientists to evaluate and refine methodologies that could improve the efficiency of the MS technique in the future. Full article
(This article belongs to the Section Spatial Data Science and Digital Earth)
18 pages, 2599 KB  
Article
EEG Dataset for Emotion Analysis
by Catalina Aguirre-Grisales, Maria José Arbeláez-Arias, Andrés Felipe Valencia-Rincón, Hector Fabio Torres-Cardona and Jose Luis Rodriguéz-Sotelo
Data 2025, 10(9), 144; https://doi.org/10.3390/data10090144 - 11 Sep 2025
Abstract
This work presents an EEG signal database derived from the induction of three emotional states using auditory stimuli. To this end, an experiment was designed in which 30 selected affective sounds from the IADS database were presented to 36 volunteers, from whom EEG signals were acquired. Stimuli were randomly configured in the Psychopy platform and synchronized via the LSL library with the OpenVibe signal acquisition platform. The 16-channel NautilusG brain computer interface was used for signal acquisition. As part of the database validation, a recognition system for the three emotional states was developed. This system utilized machine-learning-based parameterization and classification techniques, achieving detection percentages of 88.57%. Full article
25 pages, 1262 KB  
Article
Comprehensive Evaluation of Water Resource Carrying Capacity in Hebei Province Based on a Combined Weighting–TOPSIS Model
by Nianning Wang, Qichao Zhao, Lihua Yuan, Yaosen Chen, Ying Hong and Sijie Chen
Data 2025, 10(9), 143; https://doi.org/10.3390/data10090143 - 10 Sep 2025
Abstract
Water scarcity severely restricts the sustainable development of water-stressed regions like Hebei Province. A scientific assessment of water resource carrying capacity (WRCC) is essential. However, single-weighting methods often lead to biased results. To address this limitation, we propose a combined weighting model that integrates the Entropy Weight Method (EWM), Projection Pursuit (PP), and CRITIC. To support this model, we developed a multi-dimensional, long-term WRCC evaluation dataset covering 11 prefecture-level cities in Hebei Province over 24 years (2000–2023). This approach simultaneously considers data dispersion, inter-indicator conflict, and structural features. It ensures that a more balanced weighting scheme is obtained. The traditional TOPSIS model was further improved through Grey Relational Analysis (GRA), which enhanced the discriminatory power and stability of WRCC assessment. The findings were as follows: (1) From 2000 to 2023, the WRCC in Hebei Province showed a fluctuating upward trend and a “high-north, low-south” spatial gradient. (2) Obstacle analysis revealed a vicious cycle of “resource scarcity–structural conflict–ecological deficit”. This cycle is caused by excessive exploitation of groundwater and low efficiency of industrial water use. The combined weighting–GRA–TOPSIS model offers a reliable WRCC diagnostic tool. The results indicate the core barriers to water use in Hebei and provide targeted policy ideas for sustainable development. Full article
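Of the three weighting schemes combined in the model above, the Entropy Weight Method has a compact standard formulation. The sketch below illustrates it on invented data; it is not the authors' implementation, and it assumes indicators that are already positive and benefit-oriented.

```python
import math

def entropy_weights(matrix):
    """Entropy Weight Method (EWM): indicators whose values are more
    dispersed across alternatives carry more information and get larger
    weights. Rows = alternatives (e.g. city-years), columns = indicators."""
    n = len(matrix)
    m = len(matrix[0])
    k = 1.0 / math.log(n)  # normalises entropy to [0, 1]
    divergences = []
    for j in range(m):
        col = [row[j] for row in matrix]
        total = sum(col)
        p = [x / total for x in col]
        e = -k * sum(pi * math.log(pi) for pi in p if pi > 0)  # entropy of j
        divergences.append(1.0 - e)  # more dispersed -> larger divergence
    s = sum(divergences)
    return [d / s for d in divergences]

# Indicator 2 varies far more across alternatives, so it gets a larger weight.
data = [[0.2, 0.9], [0.25, 0.1], [0.22, 0.05]]
w = entropy_weights(data)
print(w[1] > w[0])  # True
```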
11 pages, 1247 KB  
Data Descriptor
A Leaf Chlorophyll Content Dataset for Crops: A Comparative Study Using Spectrophotometric and Multispectral Imagery Data
by Andrés Felipe Solis Pino, Juan David Solarte Moreno, Carlos Iván Vásquez Valencia and Jhon Alexander Guerrero Narváez
Data 2025, 10(9), 142; https://doi.org/10.3390/data10090142 - 9 Sep 2025
Abstract
This paper presents a dataset for a comparative analysis of direct (spectrophotometric) and indirect (multispectral imagery-based) methods for quantifying crop leaf chlorophyll content. The dataset originates from a study conducted in the Department of Cauca, Colombia, a region characterized by diverse agricultural production. Data collection focused on seven economically important crops, namely coffee (Coffea arabica), Hass avocado (Persea americana), potato (Solanum tuberosum), tomato (Solanum lycopersicum), sugar cane (Saccharum officinarum), corn (Zea mays) and banana (Musa paradisiaca). Sampling was conducted across various locations and phenological stages (healthy, wilted, senescent), with each leaf subdivided into six sections (A–F) to facilitate the analysis of intra-leaf chlorophyll distribution. Direct measurements of leaf chlorophyll content were obtained by laboratory spectrophotometry following the method of Jeffrey and Humphrey, allowing for the determination of chlorophyll A and B content. Simultaneously, indirect estimates of leaf chlorophyll content were obtained from multispectral images captured at the leaf level using a MicaSense Red-Edge camera under controlled illumination. A set of 32 vegetation indices was then calculated from these multispectral images using MATLAB. Both direct and indirect methods were applied to the same leaf samples to allow for direct comparison. The dataset, provided as an Excel (.xlsx) file, comprises raw data covering laboratory-measured chlorophyll A and B content and calculated values for the 32 vegetation indices. Each row of the tabular dataset represents an individual leaf sample, identified by plant species, leaf identifier, and phenological stage. The resulting dataset, containing 16,660 records, is structured to support research evaluating the direct relationship between spectrophotometric measurements and multispectral image-based vegetation indices for estimating leaf chlorophyll content. 
Spearman’s correlation coefficient reveals significant positive relationships between leaf chlorophyll content and several vegetation indices, highlighting its potential for a nondestructive assessment of this pigment. Therefore, this dataset offers significant potential for researchers in remote sensing, precision agriculture, and plant physiology to assess the accuracy and reliability of various vegetation indices in diverse crops and conditions, develop and refine chlorophyll estimation models, and execute meta-analyses or comparative studies on leaf chlorophyll quantification methodologies. Full article
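The relationship the abstract tests (chlorophyll content versus vegetation index, via Spearman's coefficient) can be illustrated with a small sketch. NDVI stands in for the 32 indices tabulated in the dataset, all numbers below are invented, and the rank correlation is a minimal implementation without tie correction, not the authors' code.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalised Difference Vegetation Index from NIR and red reflectances,
    one classic index of the kind the dataset tabulates."""
    return (nir - red) / (nir + red)

def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks
    (no tie handling; a minimal sketch)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical lab chlorophyll values vs. NDVI from leaf-level reflectances
chl = [12.1, 25.3, 31.0, 8.4, 19.7]
idx = [ndvi(nir, red) for nir, red in [(0.45, 0.22), (0.61, 0.12),
                                       (0.70, 0.08), (0.40, 0.28), (0.55, 0.16)]]
print(spearman(chl, idx))  # ~1.0: the invented relationship is monotone
```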
13 pages, 293 KB  
Article
Scalable Model-Based Diagnosis with FastDiag: A Dataset and Parallel Benchmark Framework
by Delia Isabel Carrión León, Cristian Vidal-Silva and Nicolás Márquez
Data 2025, 10(9), 141; https://doi.org/10.3390/data10090141 - 3 Sep 2025
Abstract
FastDiag is a widely used algorithm for model-based diagnosis, computing minimal subsets of constraints whose removal restores consistency in knowledge-based systems. As applications grow in complexity, researchers have proposed parallel extensions such as Java-version FastDiagP and FastDiagP++ to accelerate diagnosis through speculative and multiprocessing strategies. This paper presents a reproducible and extensible framework for evaluating FastDiag and its parallel variants across a benchmark suite of feature models and ontology-like constraints. We analyze each variant in terms of recursion structure, runtime performance, and diagnostic correctness. Tracking mechanisms and structured logs enable the fine-grained comparison of recursive behavior and branching strategies. Technical validation confirms that parallel execution preserves minimality and structural soundness, while benchmark results show runtime improvements of up to 4× with FastDiagP++. The accompanying dataset, available as open source, supports educational use, algorithmic benchmarking, and integration into interactive configuration environments. The framework is primarily intended for reproducible benchmarking and teaching with open-source implementations that facilitate analysis and extension. Full article
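The divide-and-conquer recursion at the core of FastDiag is well documented in the literature; the sketch below shows how a minimal diagnosis is computed, using a toy literal-set consistency check rather than the benchmarked implementations from the paper.

```python
def fastdiag(constraints, all_constraints, consistent):
    """Sketch of the FastDiag scheme: return a minimal subset of
    `constraints` whose removal restores consistency. `consistent` takes a
    list of constraints and returns True/False. Not the paper's code."""
    if not constraints or consistent(all_constraints):
        return []
    return _fd([], constraints, all_constraints, consistent)

def _fd(d, c, ac, consistent):
    if d and consistent(ac):
        return []          # removing d already restored consistency
    if len(c) == 1:
        return c           # a single culprit constraint remains
    k = len(c) // 2
    c1, c2 = c[:k], c[k:]
    d1 = _fd(c2, c1, [x for x in ac if x not in c2], consistent)
    d2 = _fd(d1, c2, [x for x in ac if x not in d1], consistent)
    return d1 + d2

# Toy knowledge base: constraints are sets of integer literals, and a
# collection is consistent when no literal co-occurs with its negation.
def consistent(cs):
    lits = set().union(*cs) if cs else set()
    return not any(-l in lits for l in lits)

C = [{1}, {2}, {-1}, {3}]  # {1} and {-1} clash; dropping one of them suffices
print(fastdiag(C, C, consistent))  # a minimal diagnosis, e.g. [{-1}]
```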
5 pages, 205 KB  
Data Descriptor
Data on Stark Broadening of N V Spectral Lines
by Milan S. Dimitrijević, Magdalena D. Christova and Sylvie Sahal-Bréchot
Data 2025, 10(9), 140; https://doi.org/10.3390/data10090140 - 31 Aug 2025
Abstract
A data set on Stark broadening parameters defining the Lorentzian line profile (spectral line widths and shifts) for 31 multiplets of four-times-charged nitrogen ion (N V), with lines broadened by impacts with electrons (e), protons (p), He II ions, α particles (He III), and B III, B IV, B V, and B VI ions, is given. The above-mentioned data have been calculated within the frame of the semiclassical perturbation theory, for temperatures from 50,000 K to 1,000,000 K, and densities of perturbers from 1015 cm−3 up to 1021 cm−3. These data are, first of all, of interest for diagnostics and modeling of laser-driven plasma in experiments and investigations of proton–boron fusion, especially when the target is boron nitride (BN). Data on Stark broadening by collisions with e, p, He II ions, and α particles are useful for the investigation of stellar plasma, in particular for white dwarf atmospheres and subphotospheric layer modeling. Full article
(This article belongs to the Section Spatial Data Science and Digital Earth)
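The two tabulated Stark broadening parameters, the line width and shift, define the Lorentzian profile mentioned in the abstract. A minimal sketch follows; symbol names and sample values are illustrative, not taken from the dataset.

```python
import math

def lorentzian(x: float, center: float, width: float, shift: float = 0.0) -> float:
    """Area-normalised Lorentzian line profile built from the two Stark
    broadening parameters such datasets tabulate: the full width at half
    maximum `width` and the line shift `shift`."""
    hwhm = width / 2.0
    return (hwhm / math.pi) / ((x - center - shift) ** 2 + hwhm ** 2)

# The peak sits at the shifted position, and the profile falls to half its
# maximum exactly one half-width (HWHM) away from it.
peak = lorentzian(100.4, 100.0, 0.2, shift=0.4)
half = lorentzian(100.5, 100.0, 0.2, shift=0.4)
print(abs(half - peak / 2) < 1e-9)  # True
```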
14 pages, 1043 KB  
Article
A Dataset and Experimental Evaluation of a Parallel Conflict Detection Solution for Model-Based Diagnosis
by Jessica Janina Cabezas-Quinto, Cristian Vidal-Silva, Jorge Serrano-Malebrán and Nicolás Márquez
Data 2025, 10(9), 139; https://doi.org/10.3390/data10090139 - 29 Aug 2025
Abstract
This article presents a dataset and experimental evaluation of a parallelized variant of Junker’s QuickXPlain algorithm, designed to efficiently compute minimal conflict sets in constraint-based diagnosis tasks. The dataset includes performance benchmarks, conflict traces, and solution metadata for a wide range of configurable diagnosis problems based on real-world and synthetic CSP instances. Our parallel variant leverages multicore architectures to reduce computation time while preserving the completeness and minimality guarantees of QuickXPlain. All evaluations were conducted using reproducible scripts and parameter configurations, enabling comparison across different algorithmic strategies. The provided dataset can be used to replicate experiments, analyze scalability under varying problem sizes, and serve as a baseline for future improvements in conflict explanation algorithms. The full dataset, codebase, and benchmarking scripts are openly available and documented to promote transparency and reusability in constraint-based diagnostic systems research. Full article
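Junker's QuickXPlain, which the evaluated solution parallelizes, follows a well-known divide-and-conquer scheme. The sketch below is the sequential textbook version with a toy consistency check, not the parallel variant from the paper.

```python
def quickxplain(background, constraints, consistent):
    """Sketch of Junker's QuickXPlain: return a minimal conflict set among
    `constraints`, given a `background` assumed consistent. `consistent`
    takes a list of constraints and returns True/False."""
    if consistent(background + constraints):
        return None        # no conflict to explain
    if not constraints:
        return []
    return _qx(background, background, constraints, consistent)

def _qx(b, delta, c, consistent):
    if delta and not consistent(b):
        return []          # the accumulated background is already conflicting
    if len(c) == 1:
        return c
    k = len(c) // 2
    c1, c2 = c[:k], c[k:]
    d2 = _qx(b + c1, c1, c2, consistent)
    d1 = _qx(b + d2, d2, c1, consistent)
    return d1 + d2

# Toy consistency check: constraints are sets of integer literals, and a
# collection is inconsistent when a literal co-occurs with its negation.
def consistent(cs):
    lits = set().union(*cs) if cs else set()
    return not any(-l in lits for l in lits)

C = [{1}, {2}, {-1}, {3}]  # the minimal conflict here is {1} vs. {-1}
print(quickxplain([], C, consistent))  # minimal conflict, e.g. [{1}, {-1}]
```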
15 pages, 2075 KB  
Data Descriptor
A Curated Dataset of Regional Meteor Events with Simultaneous Optical and Infrasound Observations (2006–2011)
by Elizabeth A. Silber, Emerson Brown, Andrea R. Thompson and Vedant Sawal
Data 2025, 10(9), 138; https://doi.org/10.3390/data10090138 - 28 Aug 2025
Abstract
We present a curated, openly accessible dataset of 71 regional meteor events simultaneously recorded by optical and infrasound instrumentation between 2006 and 2011. These events were captured during an observational campaign using the all-sky cameras of the Southern Ontario Meteor Network and the co-located Elginfield Infrasound Array. Each entry provides optical trajectory measurements, infrasound waveforms, and atmospheric specification profiles. The integration of optical and acoustic data enables robust linkage between observed acoustic signals and specific points along meteor trajectories, offering new opportunities to examine shock wave generation, propagation, and energy deposition processes. This release fills a critical observational gap by providing the first validated, openly accessible archive of simultaneous optical–infrasound meteor observations that supports trajectory reconstruction, acoustic propagation modeling, and energy deposition analyses. By making these data openly available in a structured format, this work establishes a durable reference resource that advances reproducibility, fosters cross-disciplinary research, and underpins future developments in meteor physics, atmospheric acoustics, and planetary defense. Full article
18 pages, 712 KB  
Article
The Discussions of Monkeypox Misinformation on Social Media
by Or Elroy and Abraham Yosipof
Data 2025, 10(9), 137; https://doi.org/10.3390/data10090137 - 25 Aug 2025
Abstract
The global outbreak of the monkeypox virus was declared a health emergency by the World Health Organization (WHO). During such emergencies, misinformation about health suggestions can spread rapidly, leading to serious consequences. This study investigates the relationships between tweet readability, user engagement, and susceptibility to misinformation. Our conceptual model posits that tweet readability influences user engagement, which in turn affects the spread of misinformation. Specifically, we hypothesize that tweets with higher readability and grammatical correctness garner more user engagement and that misinformation tweets tend to be less readable than accurate information tweets. To test these hypotheses, we collected over 1.4 million tweets related to monkeypox discussions on X (formerly Twitter) and trained a semi-supervised learning classifier to categorize them as misinformation or not-misinformation. We analyzed the readability and grammar levels of these tweets using established metrics. Our findings indicate that readability and grammatical correctness significantly boost user engagement with accurate information, thereby enhancing its dissemination. Conversely, misinformation tweets are generally less readable, which reduces their spread. This study contributes to the advancement of knowledge by elucidating the role of readability in combating misinformation. Practically, it suggests that improving the readability and grammatical correctness of accurate information can enhance user engagement and consequently mitigate the spread of misinformation during health emergencies. These insights offer valuable strategies for public health communication and social media platforms to more effectively address misinformation. Full article
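The readability analysis relies on established metrics; as one example of the kind of metric involved, the sketch below computes a Flesch Reading Ease score. The abstract does not name the authors' exact metric set, and the syllable counter here is a crude vowel-group heuristic, so scores are approximate.

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text. Uses the standard
    formula 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word), with a
    rough vowel-group syllable heuristic."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(max(1, len(re.findall(r"[aeiouyAEIOUY]+", w)))
                    for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

simple = flesch_reading_ease("The cat sat. The dog ran. It was fun.")
dense = flesch_reading_ease(
    "Epidemiological misinformation proliferates disproportionately "
    "throughout algorithmically amplified communication infrastructures."
)
print(simple > dense)  # True: shorter words and sentences score as more readable
```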
22 pages, 1506 KB  
Article
A FAIR Perspective on Data Quality Frameworks
by Nicholas Nicholson, Raquel Negrao Carvalho and Iztok Štotl
Data 2025, 10(9), 136; https://doi.org/10.3390/data10090136 - 23 Aug 2025
Viewed by 768
Abstract
Despite considerable effort and analysis over the last two to three decades, no integrated scenario yet exists for data quality frameworks. Currently, the choice is between several frameworks dependent upon the type and use of data. While the frameworks are appropriate to their specific purposes, they are generally prescriptive about the quality dimensions they cover. We reappraise the basis for measuring data quality by laying out a concept for a framework that addresses data quality from the foundational basis of the FAIR data guiding principles. We advocate for a federated data contextualisation framework able to handle the FAIR-related quality dimensions in the general data contextualisation descriptions and the remaining intrinsic data quality dimensions in associated dedicated context spaces, without being overly prescriptive. A framework designed along these lines provides several advantages, not least of which is its ability to encapsulate most other data quality frameworks. Moreover, by contextualising data according to the FAIR data principles, many subjective quality measures are managed automatically and can even be quantified to a degree, whereas objective intrinsic quality measures can be handled to any level of granularity for any data type. This serves to avoid blurring quality dimensions between the data and the data application perspectives, as well as to support data quality provenance by providing traceability over a chain of data processing operations. We show by example how some of these concepts can be implemented at a practical level. Full article
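The abstract proposes carrying FAIR-related quality dimensions in the general contextualisation description and intrinsic dimensions in a dedicated context space. A hypothetical sketch of that separation (all field names here are illustrative, not taken from the paper):

```python
# Hypothetical dataset descriptor: FAIR contextualisation fields at the top
# level, intrinsic quality dimensions in a dedicated "context space".
dataset = {
    "identifier": "doi:10.1234/example",       # Findable: persistent identifier
    "access_url": "https://example.org/data",  # Accessible: retrieval endpoint
    "schema": "https://schema.org/Dataset",    # Interoperable: shared vocabulary
    "licence": "CC-BY-4.0",                    # Reusable: clear usage licence
    "quality_context": {                       # intrinsic quality dimensions
        "completeness": 0.97,
        "provenance": ["raw_export", "deduplication", "validation"],
    },
}

def fair_completeness(desc: dict) -> float:
    """Fraction of the four illustrative FAIR fields that are populated,
    a crude stand-in for the 'automatically managed' subjective measures."""
    fields = ("identifier", "access_url", "schema", "licence")
    return sum(1 for f in fields if desc.get(f)) / len(fields)
```

Keeping intrinsic measures inside `quality_context` mirrors the paper's point that they can be refined to any granularity without disturbing the FAIR-level description.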
15 pages, 1905 KB  
Article
Predicting Real Estate Prices Using Machine Learning in Bosnia and Herzegovina
by Zvezdan Stojanović, Dario Galić and Hava Kahrić
Data 2025, 10(9), 135; https://doi.org/10.3390/data10090135 - 23 Aug 2025
Viewed by 1019
Abstract
The real estate market has a major impact on the economy and everyday life. Accurate real estate valuation is essential for buyers, sellers, investors, and government institutions. Traditionally, valuation has been conducted using various estimation models. However, recent advancements in information technology, particularly in artificial intelligence and machine learning, have enabled more precise predictions of real estate prices. Machine learning allows computers to recognize patterns in data and create models that can predict prices based on the characteristics of the property, such as location, square footage, number of rooms, age of the building, and similar features. The aim of this paper is to investigate how machine learning can be applied to predict real estate prices. A machine learning model was developed using four algorithms: Linear Regression, Random Forest Regression, XGBoost, and K-Nearest Neighbors. The dataset used in this study was collected from major online real estate listing portals in Bosnia and Herzegovina. The performance of each model was evaluated using the R² score, Root Mean Squared Error (RMSE), scatter plots, and error distributions. Based on this evaluation, the most accurate model was selected. Additionally, a simple web interface was created to allow non-experts to easily obtain property price estimates. Full article
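The evaluation metrics named in the abstract, R² and RMSE, can be sketched on a one-feature least-squares fit. This is a minimal illustration with invented toy data (the paper's dataset, features, and models are not reproduced here):

```python
import math

def fit_simple_ols(x, y):
    """Closed-form least-squares fit y ≈ a + b*x for a single feature."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b  # intercept a, slope b

def r2_and_rmse(y_true, y_pred):
    """R² = 1 - SS_res/SS_tot; RMSE = sqrt(mean squared residual)."""
    n = len(y_true)
    my = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / n)

# Toy data: price (thousand KM) vs. floor area (m^2) -- illustrative only.
area = [40.0, 55.0, 70.0, 85.0, 100.0]
price = [80.0, 108.0, 142.0, 171.0, 198.0]
a, b = fit_simple_ols(area, price)
preds = [a + b * xi for xi in area]
r2, rmse = r2_and_rmse(price, preds)
```

In practice the four algorithms compared in the paper would come from libraries such as scikit-learn and XGBoost, with the same two metrics computed on a held-out test split rather than on training data.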