Journal Description

Data

Data is a peer-reviewed, open access journal on data in science, with the aim of enhancing data transparency and reusability. The journal publishes in two sections: a section on the collection, treatment and analysis methods of data in science; a section publishing descriptions of scientific and scholarly datasets (one dataset per paper). The journal is published monthly online by MDPI.

Open Access: free for readers, with article processing charges (APC) paid by authors or their institutions.
High Visibility: indexed within Scopus, ESCI (Web of Science), Ei Compendex, dblp, Inspec, RePEc, and other databases.
Journal Rank: JCR - Q2 (Multidisciplinary Sciences) / CiteScore - Q2 (Information Systems and Management)
Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 27.7 days after submission; acceptance to publication is undertaken in 3.5 days (median values for papers published in this journal in the first half of 2024).
Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.

Impact Factor: 2.2 (2023); 5-Year Impact Factor: 2.4 (2023)

ISSN: 2306-5729

Latest Articles

13 pages, 3690 KiB

Open Access Article

Non-Linear Relationship between MiRNA Regulatory Activity and Binding Site Counts on Target mRNAs

by Shuangmei Tian, Ziyu Zhao, Beibei Ren and Degeng Wang

Data 2024, 9(10), 111; https://doi.org/10.3390/data9100111 - 25 Sep 2024

MicroRNAs (miRNAs) exert regulatory actions via base pairing with their binding sites on target mRNAs. Cooperative binding, i.e., synergism, among binding sites on an mRNA is biochemically well characterized. We studied whether this synergism is reflected in the global relationship between miRNA-mediated regulatory activity and miRNA binding site count on the target mRNAs, i.e., leading to a non-linear relationship between the two. Recently, using our own and public datasets, we have enquired into miRNA regulatory actions: first, we analyzed the power-law distribution pattern of miRNA binding sites; second, we found that, strikingly, mRNAs for core miRNA regulatory apparatus proteins have extraordinarily high binding site counts, forming self-feedback-control loops; third, we revealed that tumor suppressor mRNAs generally have more sites than oncogene mRNAs; and fourth, we characterized enrichment of miRNA-targeted mRNAs in translationally less active polysomes relative to more active polysomes. In these four studies, we qualitatively observed an obvious positive correlation between the extent to which an mRNA is miRNA-regulated and its binding site count. This paper summarizes the datasets used. We also quantitatively analyzed the correlation by comparative linear and non-linear regression analyses. Non-linear relationships, i.e., accelerating rise of regulatory activity as binding site count increases, fit the data much better, conceivably a transcriptome-level reflection of cooperative binding among miRNA binding sites on a target mRNA. This observation is potentially a guide for integrative quantitative modeling of the miRNA regulatory system.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)
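
The comparison of linear and non-linear fits mentioned in the abstract above can be illustrated with a minimal sketch. It makes several assumptions that do not come from the paper: the data points are synthetic, the non-linear model is a simple quadratic term standing in for whatever model the authors used, and goodness of fit is summarized by R². It only shows how an accelerating relationship is fitted better by the non-linear form.

```python
from statistics import mean


def fit_r2(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, R^2)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot


# Synthetic, illustrative values only (not taken from the paper's datasets).
site_counts = [1, 2, 3, 4, 5, 6, 7, 8]
activity = [0.6, 1.9, 4.4, 8.1, 12.4, 18.2, 24.3, 32.1]

# Linear model: activity ~ a + b * count
_, _, r2_linear = fit_r2(site_counts, activity)
# Assumed non-linear model: activity ~ a + b * count^2 (still linear in a, b)
_, _, r2_quadratic = fit_r2([n ** 2 for n in site_counts], activity)

print(f"linear R^2 = {r2_linear:.4f}, quadratic R^2 = {r2_quadratic:.4f}")
```

With real data, a comparison like this would normally also account for the number of model parameters, for example through an information criterion, rather than relying on R² alone.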

11 pages, 2193 KiB

Open Access Article

Comprehensive Overview of Long-Term Ecosystem Research Datasets at LTER Site Oberes Stubachtal

by Bernhard Zagel, Hans Wiesenegger, Robert R. Junker and Gerhard Ehgartner

Data 2024, 9(10), 110; https://doi.org/10.3390/data9100110 - 25 Sep 2024

This article provides a comprehensive overview of all currently available datasets of the Long-term Ecosystem Research (LTER) site Oberes Stubachtal. The site is located in the Hohe Tauern mountain range (Eastern Alps, Austria) and includes both protected areas (Hohe Tauern National Park) and unprotected areas (Stubach valley). While the main research focus of the site is on high mountains, glaciology, glacial hydrology, and biodiversity, the eLTER Whole-System Approach (WAILS) was used for data selection. This approach involves a systematic screening of all available data to assess their suitability as eLTER Standard Observations (SOs). This includes the geosphere, atmosphere, hydrosphere, biosphere, and sociosphere. These SOs are fundamental to the development of a comprehensive long-term ecosystem research framework. In total, more than 40 datasets have been collated for the LTER site Oberes Stubachtal and included in the Dynamic Ecological Information Management System - Site and Data Registry (DEIMS-SDR), the eLTER’s data platform. This paper provides a detailed inventory of the datasets and their primary attributes, evaluates them against the WAILS-required observation data, and offers insights into strategies for future initiatives. All datasets are made available through dedicated repositories for FAIR (findable, accessible, interoperable, reusable) use.

(This article belongs to the Section Spatial Data Science and Digital Earth)

17 pages, 1246 KiB

Open Access Data Descriptor

Data on Economic Analysis: 2017 Social Accounting Matrices (SAMs) for South Africa

by Ramigo Pfunzo, Yonas T. Bahta and Henry Jordaan

Data 2024, 9(9), 109; https://doi.org/10.3390/data9090109 - 20 Sep 2024

The purpose of the Social Accounting Matrix (SAM) is to improve the quality of the database for modelling, including, but not limited to, policy analysis, multiplier analysis, price analysis, and Computable General Equilibrium. This article contributes to constructing the 2017 national SAM for South Africa, incorporating regional accounts. Only in Limpopo Province of South Africa are agricultural industries, labour, and households captured at the district level, while agricultural industry, labour, and household accounts in other provinces remain unchanged. The data for constructing the SAM come from different sources, such as Supply and Use Tables, National Accounts, Census of Commercial Agriculture, Quarterly Labour Force Survey, South African Revenue Service, Global Insight (regional explorer), and South African Reserve Bank. The dataset recorded that land returns for irrigation agriculture were highest (18.2%) in the Northern Cape Province of South Africa compared to other provinces, whereas rainfed agriculture in the Free State Province of South Africa had the largest share (22%) of payment to land. Regarding intermediate inputs, rainfed agriculture in the Western Cape, Free State, and KwaZulu-Natal Provinces paid approximately 0.4% for using intermediate inputs. In terms of the districts, land returns for irrigation were highest in the Vhembe district of Limpopo Province of South Africa with 0.3%. Despite Mopani district of Limpopo Province of South Africa having the lowest land returns for irrigation agriculture, it has the highest share (1.6%) of payment to land from rainfed agriculture. The manufacturing and community service sectors had a trade deficit, whereas other sectors experienced a trade surplus. The main challenges found in developing a SAM are scarcity of data to attain the information needed for disaggregation for the sub-matrices and insufficient information from different data sources for estimating missing information to ensure the row and column totals of the SAM are consistent and complete.

20 pages, 1147 KiB

Open Access Data Descriptor

Dataset on the Validation and Standardization of the Questionnaire for the Self-Assessment of Service-Learning Experiences in Higher Education (QaSLu)

by Roberto Sánchez-Cabrero, Elena López-de-Arana Prado, Pilar Aramburuzabala and Rosario Cerrillo

Data 2024, 9(9), 108; https://doi.org/10.3390/data9090108 - 19 Sep 2024

This dataset shows the original validation and standardization of the Questionnaire for the Self-Assessment of Service-Learning Experiences in Higher Education (QaSLu). The QaSLu is the first instrument to measure university service-learning (USL), validated following a strict qualitative and quantitative process by a sample of experts in USL and generating rating scales for different profiles of professors. The Delphi method was used for the qualitative validation by 16 academic experts, who evaluated the relevance and clarity of the items. After two consultation rounds, 45 items were qualitatively validated, generating the QaSLu-45. Then, 118 instructors from 43 universities took part as the sample in the quantitative validation procedure. Quantitative validation was carried out through goodness-of-fit measures using confirmatory factor analysis, and the final configuration was optimized using one-factor robust exploratory factor analysis, determining the optimal version of the questionnaire under the law of parsimony, the QaSLu-27, with only 27 items and better psychometric properties. Finally, rating scales were calculated to compare different profiles of USL professors. These findings offer a valid, strong, and trustworthy instrument. The QaSLu-27 may be helpful for the design of USL experiences, in addition to facilitating the assessment of such programs to enhance teaching and learning processes.

13 pages, 3866 KiB

Open Access Data Descriptor

OSBA: An Open Neonatal Neuroimaging Atlas and Template for Spina Bifida Aperta

by Anna Speckert, Hui Ji, Kelly Payette, Patrice Grehten, Raimund Kottke, Samuel Ackermann, Beth Padden, Luca Mazzone, Ueli Moehrlen, Spina Bifida Study Group Zurich and Andras Jakab

Data 2024, 9(9), 107; https://doi.org/10.3390/data9090107 - 17 Sep 2024

We present the Open Spina Bifida Aperta (OSBA) atlas, an open atlas and set of neuroimaging templates for spina bifida aperta (SBA). Traditional brain atlases may not adequately capture anatomical variations present in pediatric or disease-specific cohorts. The OSBA atlas fills this gap by representing the computationally averaged anatomy of the neonatal brain with SBA after fetal surgical repair. The OSBA atlas was constructed using structural T2-weighted and diffusion tensor MRIs of 28 newborns with SBA who underwent prenatal surgical correction. The corrected gestational age at MRI was 38.1 ± 1.1 weeks (mean ± SD). The OSBA atlas consists of T2-weighted and fractional anisotropy templates, along with nine tissue prior maps and region of interest (ROI) delineations. The OSBA atlas offers a standardized reference space for spatial normalization and anatomical ROI definition. Our image segmentation and cortical ribbon definition are based on a human-in-the-loop approach, which includes manual segmentation. The precise alignment of the ROIs was achieved by a combination of manual image alignment and automated, non-linear image registration. From the clinical and neuroimaging perspective, the OSBA atlas enables more accurate spatial standardization and ROI-based analyses and supports advanced analyses such as diffusion tractography and connectomic studies in newborns affected by this condition.

13 pages, 8984 KiB

Open Access Data Descriptor

Analysis of Split-System Air Conditioner Faults through Electrical Measurement Data

by Anderson Carlos de Oliveira, Abel Cavalcante Lima Filho, Francisco Antonio Belo and André Victor Oliveira Cadena

Data 2024, 9(9), 106; https://doi.org/10.3390/data9090106 - 13 Sep 2024

This work presents an electrical measurement dataset from a split-system air conditioner in normal operating conditions and with specific faults, such as incrustation in the condenser and evaporator air inlet with different levels of blocking, which often occurs in this type of equipment. We also added compressor capacitor degradation, which is a very common fault in this type of equipment, although it is scarcely addressed in research. The data were obtained through a non-invasive current sensor and a grain-oriented voltage sensor and contain the current and voltage values of equipment that was installed in the field and tested at different levels of these fault conditions. This work not only explains how the entire data collection process was carried out but also presents two examples of fast Fourier transform (FFT) applications for the detection and diagnosis of faults through the electrical measurements analyzed in our studies, which proved effective.

17 pages, 6352 KiB

Open Access Data Descriptor

Experimental Data in a Greenhouse with and without Cultivation of Stringless Blue Lake Beans

by Sebastian-Camilo Vanegas-Ayala, Julio Barón-Velandia, Oscar-Mauricio Garcia-Chavez, Adrian Romero-Palencia and Daniel-David Leal-Lara

Data 2024, 9(9), 105; https://doi.org/10.3390/data9090105 - 4 Sep 2024

Greenhouse cultivation is one of the current strategies to address the challenges of food production, sustainability, and food quality. Similarly, the use of technological tools to automate greenhouse environments through a set of sensors and actuators allows for the control and improvement of processes within this environment. This document presents data collected from the sensors and actuators of two identical greenhouse environments, one with the cultivation of stringless blue lake beans and the other without cultivation. The aim is that this dataset will provide a broader characterization of the behavior of climatic variables inside greenhouse environments and how they are impacted by control actions, subsequently contributing to the development of new research on implementations of or improvements to control, supervision, management, and automation actions in greenhouse environments.

8 pages, 339 KiB

Open Access Data Descriptor

Interruption Audio & Transcript: Derived from Group Affect and Performance Dataset

by Daniel Doyle and Ovidiu Şerban

Data 2024, 9(9), 104; https://doi.org/10.3390/data9090104 - 31 Aug 2024

Despite the widespread development and use of chatbots, there is a lack of audio-based interruption datasets. This study provides a dataset of 200 manually annotated interruptions from a broader set of 355 data points of overlapping utterances. The dataset is derived from the Group Affect and Performance dataset managed by the University of the Fraser Valley, Canada. It includes both audio files and transcripts, allowing for multi-modal analysis. Given the extensive literature and the varied definitions of interruptions, it was necessary to establish precise definitions. The study aims to provide a comprehensive dataset for researchers to build and improve interruption prediction models. The findings demonstrate that classification models can generalize well to identify interruptions based on this dataset’s audio. This opens up research avenues with respect to interruption-related topics, ranging from multi-modal interruption classification using text and audio modalities to the analysis of group dynamics.

10 pages, 1662 KiB

Open Access Data Descriptor

TM–IoV: A First-of-Its-Kind Multilabeled Trust Parameter Dataset for Evaluating Trust in the Internet of Vehicles

by Yingxun Wang, Adnan Mahmood, Mohamad Faizrizwan Mohd Sabri and Hushairi Zen

Data 2024, 9(9), 103; https://doi.org/10.3390/data9090103 - 31 Aug 2024

The emerging and promising paradigm of the Internet of Vehicles (IoV) employs vehicle-to-everything communication to enable vehicles to communicate not only with one another but also with the supporting roadside infrastructure, vulnerable pedestrians, and the backbone network, primarily to address a number of safety-critical vehicular applications. Nevertheless, owing to the inherent characteristics of IoV networks, in particular being (a) highly dynamic in nature, which results in a continual change in the network topology, and (b) non-deterministic owing to the intricate nature of their entities and their interrelationships, they are susceptible to a number of malicious attacks. Such attacks, if and when they materialize, jeopardize the entire IoV network, thereby putting human lives at risk. Whilst cryptography-based mechanisms are capable of mitigating external attacks, internal attacks are extremely hard to tackle. Trust, therefore, is an indispensable tool since it facilitates the timely identification and eradication of malicious entities responsible for launching internal attacks in an IoV network. To date, there has been no dataset pertinent to trust management in the context of IoV networks, which has proven to be a bottleneck for conducting in-depth research in this domain. This manuscript, accordingly, presents a first-of-its-kind trust-based IoV dataset encompassing 96,707 interactions amongst 79 vehicles at different time instances. The dataset involves nine salient trust parameters, i.e., packet delivery ratio, similarity, external similarity, internal similarity, familiarity, external familiarity, internal familiarity, reward/punishment, and context, which play a considerable role in ascertaining the trust of a vehicle within an IoV network.

9 pages, 878 KiB

Open Access Article

An Expected Goals on Target (xGOT) Metric as a New Metric for Analyzing Elite Soccer Player Performance

by Anselmo Ruiz-de-Alarcón-Quintero and Blanca De-la-Cruz-Torres

Data 2024, 9(9), 102; https://doi.org/10.3390/data9090102 - 28 Aug 2024

Introduction: Football analysis is an applied research area that has seen a huge upsurge in recent years. More complex analysis is required to understand the performances of soccer players or teams during matches. The objective of this study was to prove the usefulness of the expected goals on target (xGOT) metric as a good indicator of a soccer team’s performance in professional Spanish football leagues, both in the women’s and men’s categories. Method: The data for the Spanish teams were collected from the statistical website Football Reference. The 2023/24 season was analyzed for Spanish leagues, both in the women’s and men’s categories (LigaF and LaLiga, respectively). For all teams, the following variables were calculated: goals, possession value (PV), expected goals (xG) and xGOT. All data obtained for each variable were normalized by match (90 min). A descriptive and correlational statistical analysis was carried out. Results: In the men’s league, this study found a high correlation between goals per match and xGOT (R² = 0.9248), while in the women’s league, there was a high correlation between goals per match and xG (R² = 0.9820) and between goals per match and xGOT (R² = 0.9574). Conclusions: In LaLiga, xGOT was the metric that best represented the match result, while in LigaF, xG and xGOT were the metrics that best represented the match score.

(This article belongs to the Special Issue Machine Learning and Data Mining in Exercise, Sports and Health Research)

12 pages, 2888 KiB

Open Access Article

Viral Targets in the Human Interactome with Comprehensive Centrality Analysis: SARS-CoV-2, a Case Study

by Nilesh Kumar and M. Shahid Mukhtar

Data 2024, 9(8), 101; https://doi.org/10.3390/data9080101 - 20 Aug 2024

Network centrality analyses have proven to be successful in identifying important nodes in diverse host–pathogen interactomes. The current study presents a comprehensive investigation of the human interactome and SARS-CoV-2 host targets. We first constructed a comprehensive human interactome by compiling experimentally validated protein–protein interactions (PPIs) from eight distinct sources. Additionally, we compiled a comprehensive list of 1449 SARS-CoV-2 host proteins and analyzed their interactions within the human interactome, which identified enriched biological processes and pathways. Seven diverse topological features were employed to reveal the enrichment of the SARS-CoV-2 targets in the human interactome, with closeness centrality emerging as the most effective metric. Furthermore, a novel approach called CentralityCosDist was employed to predict SARS-CoV-2 targets, which proved to be effective in expanding the pool of predicted targets. Pathway enrichment analyses further elucidated the functional roles and potential mechanisms associated with predicted targets. Overall, this study provides valuable insights into the complex interplay between SARS-CoV-2 and the host’s cellular machinery, contributing to a deeper understanding of viral infection and immune response modulation.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)
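
Closeness centrality, the metric highlighted in the abstract above, can be sketched on a toy network. This is a minimal illustration under stated assumptions: the graph, the node labels, and the function name are hypothetical, the network is unweighted and undirected, and the standard definition (number of reachable nodes divided by the sum of shortest-path distances) is used. It does not reproduce the paper's interactome or its CentralityCosDist method.

```python
from collections import deque


def closeness_centrality(graph, node):
    """Closeness of `node` in an unweighted, undirected graph given as
    {node: set_of_neighbours}: reachable nodes / sum of shortest-path lengths."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search yields shortest hop counts
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Toy interaction network with hypothetical protein labels.
ppi = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}

scores = {protein: closeness_centrality(ppi, protein) for protein in ppi}
most_central = max(scores, key=scores.get)
print(most_central, round(scores[most_central], 3))
```

A node with high closeness reaches the rest of the network in few steps, which is why hub-like proteins tend to score highly on this measure.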

10 pages, 13509 KiB

Open Access Data Descriptor

Dataset of Registered Hematoxylin–Eosin and Ki67 Histopathological Image Pairs Complemented by a Registration Algorithm

by Dominika Petríková, Ivan Cimrák, Katarína Tobiášová and Lukáš Plank

Data 2024, 9(8), 100; https://doi.org/10.3390/data9080100 - 7 Aug 2024

In this work, we describe a dataset suitable for analyzing the extent to which hematoxylin–eosin (HE)-stained tissue contains information about the expression of Ki67 in immunohistochemistry staining. The dataset provides images of corresponding pairs of HE and Ki67 stainings and is complemented by algorithms for computing the Ki67 index. We introduce a dataset of high-resolution histological images of testicular seminoma tissue. The dataset comprises digitized histology slides from 77 conventional testicular seminoma patients, obtained via surgical resection. For each patient, two physically adjacent tissue sections are stained: one with hematoxylin and eosin, and one with Ki67 immunohistochemistry staining. This results in a total of 154 high-resolution images. The images are provided in PNG format, facilitating ease of use for image analysis compared to the original scanner output formats. Each image contains enough tissue to generate thousands of non-overlapping 224 × 224 pixel patches. This shows the potential to generate more than 50,000 pairs of patches, one with HE staining and a corresponding Ki67 patch that depicts a very similar part of the tissue. Finally, we present the results of applying a ResNet neural network for the classification of HE patches into categories according to their Ki67 label.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)
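
The patch arithmetic in the abstract above (non-overlapping 224 × 224 pixel tiles per image) can be sketched as follows. The image dimensions and function names are hypothetical, and the sketch counts every tile that fits in the image, whereas a real pipeline would presumably also discard tiles that contain too little tissue.

```python
def count_patches(width, height, patch=224):
    """Number of non-overlapping patch x patch tiles that fit in the image."""
    return (width // patch) * (height // patch)


def patch_origins(width, height, patch=224):
    """Yield the top-left (x, y) pixel of each non-overlapping tile."""
    for y in range(0, height - patch + 1, patch):
        for x in range(0, width - patch + 1, patch):
            yield x, y


# Hypothetical slide size in pixels; not taken from the dataset.
slide_width, slide_height = 10_000, 8_000
n_patches = count_patches(slide_width, slide_height)
origins = list(patch_origins(slide_width, slide_height))
print(n_patches, origins[:3])
```

Applying the same grid of origins to a registered HE image and its Ki67 counterpart is one simple way to obtain corresponding patch pairs, assuming the two images have already been aligned.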

24 pages, 696 KiB

Open Access Article

A Performance Analysis of Hybrid and Columnar Cloud Databases for Efficient Schema Design in Distributed Data Warehouse as a Service

by Fred Eduardo Revoredo Rabelo Ferreira and Robson do Nascimento Fidalgo

Data 2024, 9(8), 99; https://doi.org/10.3390/data9080099 - 5 Aug 2024

A Data Warehouse (DW) is a centralized database that stores large volumes of historical data for analysis and reporting. In a world where enterprise data grows exponentially, new architectures are being investigated to overcome the deficiencies of traditional Database Management Systems (DBMSs), driving a shift towards more modern, cloud-based solutions that provide resources such as distributed processing, columnar storage, and horizontal scalability without the overhead of physical hardware management, i.e., a Database as a Service (DBaaS). Choosing the appropriate class of DBMS is a critical decision for organizations, and there are important differences that impact data volume and query performance (e.g., architecture, data models, and storage) when supporting analytics efficiently in a distributed cloud environment. In this sense, we carry out an experimental evaluation to analyze the performance of several DBaaS solutions and the impact of data modeling, specifically the usage of a partially normalized Star Schema and a fully denormalized Flat Table Schema, to further comprehend their behavior in different configurations and designs in terms of data schema, storage form, memory availability, and cluster size. The analysis is done on two volumes of data generated by a well-established benchmark, comparing the performance of the DW in terms of average execution time, memory usage, data volume, and loading time. Our results provide guidelines for efficient DW design, showing, for example, that the denormalization of the schema does not guarantee improved performance, as solutions performed differently depending on their architecture. We also show that a Hybrid Processing (HTAP) NewSQL solution can outperform solutions that support only Online Analytical Processing (OLAP) in terms of overall execution time, but that the performance of each query is deeply influenced by its selectivity and by the number of join functions.

(This article belongs to the Section Information Systems and Data Management)
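
The two schema designs compared in the abstract above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the table and column names are hypothetical, the data are invented, and an in-memory SQLite database stands in for the cloud databases the paper evaluates. The sketch shows only the structural difference (a join against a dimension table versus a single wide table), not any performance result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Star schema: a fact table that references a dimension table.
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount REAL
);
-- Flat table: the dimension attribute is copied into every fact row.
CREATE TABLE flat_sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
""")

cur.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                [(1, "north"), (2, "south")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5)])
# Denormalize once at load time by materializing the join.
cur.execute("""
    INSERT INTO flat_sales
    SELECT f.sale_id, d.region, f.amount
    FROM fact_sales f JOIN dim_customer d USING (customer_id)
""")

# The star-schema query needs a join at query time; the flat query does not.
star_totals = cur.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer d USING (customer_id)
    GROUP BY d.region ORDER BY d.region
""").fetchall()
flat_totals = cur.execute("""
    SELECT region, SUM(amount)
    FROM flat_sales GROUP BY region ORDER BY region
""").fetchall()

print(star_totals, flat_totals)
conn.close()
```

Both queries return the same totals; the trade-off is that the flat table avoids the join but stores the region value redundantly, which affects data volume and loading time.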

16 pages, 677 KiB

Open Access Article

Arabic Lexical Substitution: AraLexSubD Dataset and AraLexSub Pipeline

by Eman Naser-Karajah and Nabil Arman

Data 2024, 9(8), 98; https://doi.org/10.3390/data9080098 - 30 Jul 2024

Lexical substitution aims to generate a list of equivalent substitutions (i.e., synonyms) for a sentence’s target word or phrase while preserving the sentence’s meaning, in order to improve writing, enhance language understanding, improve natural language processing models, and handle ambiguity. This task has recently attracted much attention in many languages. Despite the richness of Arabic vocabulary, limited research has been performed on the lexical substitution task due to the lack of annotated data. To bridge this gap, we present the first Arabic lexical substitution benchmark dataset, AraLexSubD, for benchmarking lexical substitution pipelines. AraLexSubD was manually built by eight native Arabic speakers and linguists (six linguist annotators, a doctor, and an economist), who annotated the 630 sentences. AraLexSubD covers three domains: general, finance, and medical. It encompasses 2476 substitution candidates ranked according to their semantic relatedness. We also present the first Arabic lexical substitution pipeline, AraLexSub, which uses the AraBERT pre-trained language model. The pipeline consists of several modules: substitute generation, substitute filtering, and candidate ranking. The filtering step shows its effectiveness by achieving an increase of 1.6 in the F1 score on the entire AraLexSubD dataset. Additionally, an error analysis of the experiment is reported. To our knowledge, this is the first study on Arabic lexical substitution.

9 pages, 1535 KiB

Open Access Data Descriptor

Genomic Insights into Bacillus thuringiensis V-CO3.3: Unveiling Its Genetic Potential against Nematodes

by Leopoldo Palma, Yolanda Bel and Baltasar Escriche

Data 2024, 9(8), 97; https://doi.org/10.3390/data9080097 - 29 Jul 2024

Bacillus thuringiensis (Bt) is a Gram-positive, spore-forming, and ubiquitous bacterium harboring plasmids that encode a variety of proteins with insecticidal activity, as well as activity against nematodes. The aim of this work was to sequence and analyze the genome of a native Bt strain, designated V-CO3.3, that shows bipyramidal parasporal crystals and was isolated from the dust of a grain storehouse in Córdoba (Spain). Its genome comprised 99 high-quality assembled contigs accounting for a total size of 5.2 Mb and 35.1% G + C. Phylogenetic analyses suggested that this strain should be renamed as Bacillus cereus s.s. biovar Thuringiensis. Gene annotation revealed a total of 5495 genes, one of which was identified as encoding a Cry5Ba homolog protein with well-documented toxicity against nematodes. These results suggest that this Bt strain has interesting potential for nematode biocontrol.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

13 pages, 2561 KiB

Open AccessData Descriptor

Data on the Land Cover Transition, Subsequent Landscape Degradation, and Improvement in Semi-Arid Rainfed Agricultural Land in North–West Tunisia

by Zahra Shiri, Aymen Frija, Hichem Rejeb, Hassen Ouerghemmi and Quang Bao Le

Data 2024, 9(8), 96; https://doi.org/10.3390/data9080096 - 29 Jul 2024

Understanding past landscape changes is crucial for promoting agroecological landscape transitions. This study analyzes past land cover changes (LCCs) alongside subsequent degradation and improvements in the study area. The input land cover (LC) data were taken from ESRI’s ArcGIS Living Atlas of the World and then assessed for accuracy using ground truth data points randomly selected from high-resolution images in Google Earth Engine. The LCC analyses were performed in QGIS 3.28.15 using the Semi-Automatic Classification Plugin (SCP) to generate LCC data. The degradation or improvement derived from the analyzed data was subsequently assessed using the UNCCD Good Practice Guidance to generate land cover degradation data. Using the Landscape Ecology Statistics (LecoS) plugin in QGIS, the input LC data were processed to provide landscape metrics. The data presented in this article show that the studied landscape is not static, even over a short-term time horizon (2017–2022). The transition from one LC class to another had an impact on the ecosystem and induced different states of degradation. For the three main LC classes (forest, crops, and rangeland), which represented 98.9% of the total area in 2022, the landscape metrics, especially the number of patches, reflected a 105% increase in landscape fragmentation between 2017 and 2022.

(This article belongs to the Topic Techniques and Science Exploitations for Earth Observation and Planetary Exploration)

20 pages, 2105 KiB

Open AccessArticle

Bootstrap Method as a Tool for Analyzing Data with Atypical Distributions Deviating from Parametric Assumptions: Critique and Effectiveness Evaluation

by Joanna Kostanek, Kamil Karolczak, Wiktor Kuliczkowski and Cezary Watala

Data 2024, 9(8), 95; https://doi.org/10.3390/data9080095 - 26 Jul 2024

In today’s research environment, characterized by exponential data growth and increasing complexity, the selection of appropriate statistical tests, tailored to research objectives and data distributions, is paramount for rigorous analysis and accurate interpretation. This article explores the growing prominence of bootstrapping, an advanced statistical technique for multiple comparisons analysis. It offers flexibility and customization by estimating sample distributions without assuming population distributions, and thus serves as a valuable alternative to traditional methods in various data scenarios. Computer simulations were conducted using data from cardiovascular disease patients. Two approaches, spontaneous partly controlled simulation and fully constrained simulation using self-written R scripts, were used to generate datasets with specified distributions and to analyze the data using tests for comparing more than two groups. The bootstrap method greatly improves statistical analysis, especially in overcoming the constraints of conventional parametric tests. Our research showcased its effectiveness in comparing multiple scenarios, yielding strong findings across diverse distributions, even with minor inflation in p values. Serving as a valuable substitute for parametric approaches, bootstrap promotes careful consideration when rejecting hypotheses, thus fostering a deeper understanding of statistical nuances and bolstering analytical rigor.
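To make the idea of a bootstrap comparison of more than two groups concrete, here is a minimal Python sketch. It is not the authors' R code, and the abstract does not state which resampling scheme they used. The sketch assumes one common choice: resampling with replacement from the pooled data (which mimics the null hypothesis that all groups share one distribution) and comparing the one-way ANOVA F statistic against its bootstrap distribution. The names `f_statistic` and `bootstrap_f_test` and the example data are illustrative only.

```python
import numpy as np


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of 1-D numeric arrays."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)          # number of groups
    n = all_values.size      # total number of observations
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def bootstrap_f_test(groups, n_boot=2000, seed=0):
    """Bootstrap p value for the F statistic (assumed pooled-resampling scheme).

    Each bootstrap replicate draws groups of the original sizes, with
    replacement, from the pooled data, so no parametric form is assumed.
    """
    rng = np.random.default_rng(seed)
    groups = [np.asarray(g, dtype=float) for g in groups]
    observed = f_statistic(groups)
    pooled = np.concatenate(groups)
    sizes = [g.size for g in groups]

    exceed = 0
    for _ in range(n_boot):
        resampled = [rng.choice(pooled, size=s, replace=True) for s in sizes]
        if f_statistic(resampled) >= observed:
            exceed += 1

    # Add-one correction keeps the estimated p value strictly positive.
    return observed, (exceed + 1) / (n_boot + 1)


if __name__ == "__main__":
    # Hypothetical skewed data: three groups, the third shifted upward.
    rng = np.random.default_rng(42)
    a = rng.exponential(1.0, size=30)
    b = rng.exponential(1.0, size=30)
    c = rng.exponential(1.0, size=30) + 3.0
    f_obs, p_value = bootstrap_f_test([a, b, c])
    print(f"F = {f_obs:.2f}, bootstrap p = {p_value:.4f}")
```

Because the reference distribution is built from the data themselves, the same function can be applied to skewed or otherwise atypical samples where the F distribution assumed by classical ANOVA may not hold.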

18 pages, 1124 KiB

Open AccessData Descriptor

SparrKULee: A Speech-Evoked Auditory Response Repository from KU Leuven, Containing the EEG of 85 Participants

by Bernd Accou, Lies Bollens, Marlies Gillis, Wendy Verheijen, Hugo Van hamme and Tom Francart

Data 2024, 9(8), 94; https://doi.org/10.3390/data9080094 - 26 Jul 2024

Cited by 4

Researchers investigating the neural mechanisms underlying speech perception often employ electroencephalography (EEG) to record brain activity while participants listen to spoken language. The high temporal resolution of EEG enables the study of neural responses to fast and dynamic speech signals. Previous studies have successfully extracted speech characteristics from EEG data and, conversely, predicted EEG activity from speech features. Machine learning techniques are generally employed to construct encoding and decoding models, which necessitate a substantial quantity of data. We present SparrKULee, a Speech-evoked Auditory Repository of EEG data, measured at KU Leuven, comprising 64-channel EEG recordings from 85 young individuals with normal hearing, each of whom listened to 90–150 min of natural speech. This dataset is more extensive than any currently available dataset in terms of both the number of participants and the quantity of data per participant. It is suitable for training larger machine learning models. We evaluate the dataset using linear and state-of-the-art non-linear models in a speech encoding/decoding and match/mismatch paradigm, providing benchmark scores for future research.

24 pages, 388 KiB

Open AccessArticle

Optimizing Database Performance in Complex Event Processing through Indexing Strategies

by Maryam Abbasi, Marco V. Bernardo, Paulo Váz, José Silva and Pedro Martins

Data 2024, 9(8), 93; https://doi.org/10.3390/data9080093 - 24 Jul 2024

Complex event processing (CEP) systems have gained significant importance in various domains, such as finance, logistics, and security, where the real-time analysis of event streams is crucial. However, as the volume and complexity of event data continue to grow, optimizing the performance of CEP systems becomes a critical challenge. This paper investigates the impact of indexing strategies on the performance of databases handling complex event processing. We propose a novel indexing technique, called Hierarchical Temporal Indexing (HTI), specifically designed for the efficient processing of complex event queries. HTI leverages the temporal nature of event data and employs a multi-level indexing approach to optimize query execution. By combining temporal indexing with spatial- and attribute-based indexing, HTI aims to accelerate the retrieval and processing of relevant events, thereby improving overall query performance. In this study, we evaluate the effectiveness of HTI by implementing complex event queries on various CEP systems with different indexing strategies. We conduct a comprehensive performance analysis, measuring the query execution times and resource utilization (CPU, memory, etc.), and analyzing the execution plans and query optimization techniques employed by each system. Our experimental results demonstrate that the proposed HTI indexing strategy outperforms traditional indexing approaches, particularly for complex event queries involving temporal constraints and multi-dimensional event attributes. We provide insights into the strengths and weaknesses of each indexing strategy, identifying the factors that influence performance, such as data volume, query complexity, and event characteristics. Furthermore, we discuss the implications of our findings for the design and optimization of CEP systems, offering recommendations for indexing strategy selection based on the specific requirements and workload characteristics. Finally, we outline the potential limitations of our study and suggest future research directions in this domain.

18 pages, 7475 KiB

Open AccessData Descriptor

BELMASK—An Audiovisual Dataset of Adversely Produced Speech for Auditory Cognition Research

by Cleopatra Christina Moshona, Frederic Rudawski, André Fiebig and Ennes Sarradj

Data 2024, 9(8), 92; https://doi.org/10.3390/data9080092 - 24 Jul 2024

In this article, we introduce the Berlin Dataset of Lombard and Masked Speech (BELMASK), a phonetically controlled audiovisual dataset of speech produced in adverse speaking conditions, and describe the development of the related speech task. The dataset contains a total of 128 min of audio and video recordings of 10 native German speakers (4 female, 6 male) with a mean age of 30.2 years (SD: 6.3 years), uttering matrix sentences in cued, uninstructed speech in four conditions: (i) with a Filtering Facepiece P2 (FFP2) mask in silence, (ii) without an FFP2 mask in silence, (iii) with an FFP2 mask while exposed to noise, (iv) without an FFP2 mask while exposed to noise. Noise consisted of mixed-gender six-talker babble played over headphones to the speakers, triggering the Lombard effect. All conditions are readily available in face-and-voice and voice-only formats. The speech material is annotated, employing a multi-layer architecture, and was originally conceptualized to be used for the administration of a working memory task. The dataset is stored in a restricted-access Zenodo repository and is available for academic research in the areas of speech communication, acoustics, psychology and related disciplines upon request, after signing an End User License Agreement (EULA).

News

23 September 2024
Meet Us at the 87th Annual Meeting of the Association for Information Science and Technology, 25–29 October 2024, Calgary, Canada

11 September 2024
MDPI’s 2023 Best PhD Thesis Awards—Winners Announced

23 August 2024
Meet Us at the 2024 IEEE Information Theory Workshop, 24–28 November 2024, Shenzhen, China

Topics

Topic in Algorithms, Data, Information, Mathematics, Symmetry

Decision-Making and Data Mining for Sustainable Computing
Topic Editors: Sunil Jha, Malgorzata Rataj, Xiaorui Zhang
Deadline: 30 November 2024

Topic in BDCC, Data, MAKE, Mathematics

Big Data Intelligence: Methodologies and Applications
Topic Editors: Liang Zhao, Liang Zou, Boxiang Dong
Deadline: 31 December 2024

Topic in BDCC, Data, Environments, Geosciences, Remote Sensing

Database, Mechanism and Risk Assessment of Slope Geologic Hazards
Topic Editors: Chong Xu, Yingying Tian, Xiaoyi Shao, Zikang Xiao, Yulong Cui
Deadline: 28 February 2025

Topic in Data, Energies, Sensors, Sustainability, Water

Water and Energy Monitoring and Their Nexus
Topic Editors: Lucas Pereira, Hugo Morais, Wolf-Gerrit Früh
Deadline: 31 March 2025

Special Issues

Special Issue in Data

Data-Driven Approaches for Safety in Industrial Sites
Guest Editors: Francesca Mauro, Mara Lombardi, Mario Fargnoli
Deadline: 30 October 2024

Special Issue in Data

Machine Learning and Data Mining in Exercise, Sports and Health Research
Guest Editor: Daniel Rojas-Valverde
Deadline: 31 October 2024

Special Issue in Data

Benchmarking Datasets in Bioinformatics, 2nd Volume
Guest Editor: Pufeng Du
Deadline: 20 November 2024

Special Issue in Data

Data in Astrophysics and Geophysics: Research and Applications, 3rd Volume
Guest Editors: Vladimir Sreckovic, Milan S. Dimitrijević, Zoran Mijic
Deadline: 30 November 2024

Topical Collections

Topical Collection in Data

Modern Geophysical and Climate Data Analysis: Tools and Methods
Collection Editors: Vladimir Sreckovic, Zoran Mijic
