Data, Volume 9, Issue 8 (August 2024) – 12 articles

Cover Story: Researchers investigating the neural mechanisms underlying speech perception often employ electroencephalography (EEG) to record brain activity while participants listen to spoken language. The high temporal resolution of EEG enables the study of neural responses to dynamic speech signals. Machine learning techniques are generally employed to construct encoding and decoding models, and such techniques necessitate a substantial quantity of data. We present SparrKULee, a speech-evoked auditory repository of EEG data comprising 64-channel EEG recordings from 85 young individuals with normal hearing, each of whom listened to 90–150 min of natural speech. SparrKULee is more extensive than currently available datasets in terms of both the number of participants and the quantity of data per participant.
12 pages, 2888 KiB  
Article
Viral Targets in the Human Interactome with Comprehensive Centrality Analysis: SARS-CoV-2, a Case Study
by Nilesh Kumar and M. Shahid Mukhtar
Data 2024, 9(8), 101; https://doi.org/10.3390/data9080101 - 20 Aug 2024
Viewed by 1676
Abstract
Network centrality analyses have proven to be successful in identifying important nodes in diverse host–pathogen interactomes. The current study presents a comprehensive investigation of the human interactome and SARS-CoV-2 host targets. We first constructed a comprehensive human interactome by compiling experimentally validated protein–protein interactions (PPIs) from eight distinct sources. Additionally, we compiled a comprehensive list of 1449 SARS-CoV-2 host proteins and analyzed their interactions within the human interactome, which identified enriched biological processes and pathways. Seven diverse topological features were employed to reveal the enrichment of the SARS-CoV-2 targets in the human interactome, with closeness centrality emerging as the most effective metric. Furthermore, a novel approach called CentralityCosDist was employed to predict SARS-CoV-2 targets, which proved to be effective in expanding the pool of predicted targets. Pathway enrichment analyses further elucidated the functional roles and potential mechanisms associated with predicted targets. Overall, this study provides valuable insights into the complex interplay between SARS-CoV-2 and the host’s cellular machinery, contributing to a deeper understanding of viral infection and immune response modulation.
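As a rough illustration of the closeness centrality metric named in this abstract, the sketch below computes it with a breadth-first search on a tiny, hypothetical graph. The graph, the node names, and the function `closeness_centrality` are assumptions made for illustration; they are not the authors' interactome, code, or their CentralityCosDist method.

```python
from collections import deque

def closeness_centrality(graph, node):
    """Closeness of `node`: reachable nodes divided by the sum of
    shortest-path distances to them (unweighted, undirected graph)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    reachable = len(dist) - 1  # exclude the start node itself
    return reachable / total if total > 0 else 0.0

# Hypothetical star-shaped toy network: B is the hub.
toy_graph = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B"],
    "D": ["B"],
}
scores = {n: closeness_centrality(toy_graph, n) for n in toy_graph}
```

In this toy graph the hub `B` receives the highest score, which is the intuition behind using closeness to rank candidate nodes.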

10 pages, 13509 KiB  
Data Descriptor
Dataset of Registered Hematoxylin–Eosin and Ki67 Histopathological Image Pairs Complemented by a Registration Algorithm
by Dominika Petríková, Ivan Cimrák, Katarína Tobiášová and Lukáš Plank
Data 2024, 9(8), 100; https://doi.org/10.3390/data9080100 - 7 Aug 2024
Cited by 2 | Viewed by 2665
Abstract
In this work, we describe a dataset suitable for analyzing the extent to which hematoxylin–eosin (HE)-stained tissue contains information about the expression of Ki67 in immunohistochemistry staining. The dataset provides images of corresponding pairs of HE and Ki67 stainings and is complemented by algorithms for computing the Ki67 index. We introduce a dataset of high-resolution histological images of testicular seminoma tissue. The dataset comprises digitized histology slides from 77 conventional testicular seminoma patients, obtained via surgical resection. For each patient, two physically adjacent tissue sections are stained: one with hematoxylin and eosin, and one with Ki67 immunohistochemistry staining. This results in a total of 154 high-resolution images. The images are provided in PNG format, facilitating ease of use for image analysis compared to the original scanner output formats. Each image contains enough tissue to generate thousands of non-overlapping 224 × 224 pixel patches. This shows the potential to generate more than 50,000 pairs of patches, one with HE staining and a corresponding Ki67 patch that depicts a very similar part of the tissue. Finally, we present the results of applying a ResNet neural network for the classification of HE patches into categories according to their Ki67 label.
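To make the patch arithmetic in this abstract concrete, here is a minimal sketch that lists the top-left corners of non-overlapping 224 × 224 patches in an image. The image dimensions used below are hypothetical (the abstract does not state them), and `patch_grid` is an illustrative helper, not part of the published registration algorithm.

```python
def patch_grid(width, height, patch=224):
    """Top-left (x, y) corners of all non-overlapping patch x patch tiles
    that fit entirely inside a width x height image."""
    cols = width // patch
    rows = height // patch
    return [(c * patch, r * patch) for r in range(rows) for c in range(cols)]

# Hypothetical image size, chosen only for illustration.
corners = patch_grid(10000, 8000)
n_patches = len(corners)
```

Tissue-free background tiles would still need to be filtered out, so the usable count per image is lower than the raw grid size.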

24 pages, 696 KiB  
Article
A Performance Analysis of Hybrid and Columnar Cloud Databases for Efficient Schema Design in Distributed Data Warehouse as a Service
by Fred Eduardo Revoredo Rabelo Ferreira and Robson do Nascimento Fidalgo
Data 2024, 9(8), 99; https://doi.org/10.3390/data9080099 - 5 Aug 2024
Cited by 2 | Viewed by 2373
Abstract
A Data Warehouse (DW) is a centralized database that stores large volumes of historical data for analysis and reporting. In a world where enterprise data grows exponentially, new architectures are being investigated to overcome the deficiencies of traditional Database Management Systems (DBMSs), driving a shift towards more modern, cloud-based solutions that provide resources such as distributed processing, columnar storage, and horizontal scalability without the overhead of physical hardware management, i.e., a Database as a Service (DBaaS). Choosing the appropriate class of DBMS is a critical decision for organizations, and there are important differences that impact data volume and query performance (e.g., architecture, data models, and storage) to support analytics in a distributed cloud environment efficiently. In this sense, we carry out an experimental evaluation to analyze the performance of several DBaaS and the impact of data modeling, specifically the usage of a partially normalized Star Schema and a fully denormalized Flat Table Schema, to further comprehend their behavior in different configurations and designs in terms of data schema, storage form, memory availability, and cluster size. The analysis is performed on two volumes of data generated by a well-established benchmark, comparing the performance of the DW in terms of average execution time, memory usage, data volume, and loading time. Our results provide guidelines for efficient DW design, showing, for example, that the denormalization of the schema does not guarantee improved performance, as solutions performed differently depending on their architecture. We also show that a Hybrid Processing (HTAP) NewSQL solution can outperform solutions that support only Online Analytical Processing (OLAP) in terms of overall execution time, but that the performance of each query is deeply influenced by its selectivity and by the number of join functions.
(This article belongs to the Section Information Systems and Data Management)

16 pages, 677 KiB  
Article
Arabic Lexical Substitution: AraLexSubD Dataset and AraLexSub Pipeline
by Eman Naser-Karajah and Nabil Arman
Data 2024, 9(8), 98; https://doi.org/10.3390/data9080098 - 30 Jul 2024
Cited by 2 | Viewed by 1634
Abstract
Lexical substitution aims to generate a list of equivalent substitutions (i.e., synonyms) to a sentence’s target word or phrase while preserving the sentence’s meaning to improve writing, enhance language understanding, improve natural language processing models, and handle ambiguity. This task has recently attracted much attention in many languages. Despite the richness of Arabic vocabulary, limited research has been performed on the lexical substitution task due to the lack of annotated data. To bridge this gap, we present the first Arabic lexical substitution benchmark dataset AraLexSubD for benchmarking lexical substitution pipelines. AraLexSubD is manually built by eight native Arabic speakers and linguists (six linguist annotators, a doctor, and an economist) who annotate the 630 sentences. AraLexSubD covers three domains: general, finance, and medical. It encompasses 2476 substitution candidates ranked according to their semantic relatedness. We also present the first Arabic lexical substitution pipeline, AraLexSub, which uses the AraBERT pre-trained language model. The pipeline consists of several modules: substitute generation, substitute filtering, and candidate ranking. The filtering step shows its effectiveness by achieving an increase of 1.6 in the F1 score on the entire AraLexSubD dataset. Additionally, an error analysis of the experiment is reported. To our knowledge, this is the first study on Arabic lexical substitution.

9 pages, 1535 KiB  
Data Descriptor
Genomic Insights into Bacillus thuringiensis V-CO3.3: Unveiling Its Genetic Potential against Nematodes
by Leopoldo Palma, Yolanda Bel and Baltasar Escriche
Data 2024, 9(8), 97; https://doi.org/10.3390/data9080097 - 29 Jul 2024
Cited by 1 | Viewed by 1840
Abstract
Bacillus thuringiensis (Bt) is a Gram-positive, spore-forming, and ubiquitous bacterium harboring plasmids encoding a variety of proteins with insecticidal activity, but also with activity against nematodes. The aim of this work was to perform the genome sequencing and analysis of a native Bt strain showing bipyramidal parasporal crystals and designated V-CO3.3, which was isolated from the dust of a grain storehouse in Córdoba (Spain). Its genome comprised 99 high-quality assembled contigs accounting for a total size of 5.2 Mb and 35.1% G + C. Phylogenetic analyses suggested that this strain should be renamed as Bacillus cereus s.s. biovar Thuringiensis. Gene annotation revealed a total of 5495 genes, one of which was identified as encoding a Cry5Ba homolog protein with well-documented toxicity against nematodes. These results suggest that this Bt strain has interesting potential for nematode biocontrol.

13 pages, 2561 KiB  
Data Descriptor
Data on the Land Cover Transition, Subsequent Landscape Degradation, and Improvement in Semi-Arid Rainfed Agricultural Land in North–West Tunisia
by Zahra Shiri, Aymen Frija, Hichem Rejeb, Hassen Ouerghemmi and Quang Bao Le
Data 2024, 9(8), 96; https://doi.org/10.3390/data9080096 - 29 Jul 2024
Cited by 2 | Viewed by 2014
Abstract
Understanding past landscape changes is crucial to promote agroecological landscape transitions. This study analyzes past land cover changes (LCCs) alongside subsequent degradation and improvements in the study area. The input land cover (LC) data were taken from ESRI’s ArcGIS Living Atlas of the World and then assessed for accuracy using ground truth data points randomly selected from high-resolution images on the Google Earth Engine. The LCC analyses were performed on QGIS 3.28.15 using the Semi-Automatic Classification Plugin (SCP) to generate LCC data. The degradation or improvement derived from the analyzed data was subsequently assessed using the UNCCD Good Practice Guidance to generate land cover degradation data. Using the Landscape Ecology Statistics (LecoS) plugin in QGIS, the input LC data were processed to provide landscape metrics. The data presented in this article show that the studied landscape is not static, even over a short-term time horizon (2017–2022). The transition from one LC class to another had an impact on the ecosystem and induced different states of degradation. For the three main LC classes (forest, crops, and rangeland) representing 98.9% of the total area in 2022, the landscape metrics, especially the number of patches, reflected a 105% increase in landscape fragmentation between 2017 and 2022.
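The "number of patches" metric mentioned in this abstract can be pictured as counting connected regions of one land cover class in a raster. The sketch below does this with a flood fill on two tiny, made-up rasters. The rasters, the class letters, and the choice of 4-neighbor connectivity are all assumptions for illustration; the LecoS plugin used by the authors may define patches differently.

```python
def count_patches(grid, target):
    """Count 4-connected regions of cells equal to `target` in a 2D grid."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    patches = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != target or (r, c) in seen:
                continue
            patches += 1  # found an unvisited cell: a new patch starts here
            stack = [(r, c)]
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == target
                            and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return patches

# Made-up rasters: F = forest, C = crops, R = rangeland.
raster_2017 = ["FFCC", "FFCC", "RRCC"]
raster_2022 = ["FCFC", "CFCC", "RRCF"]

forest_2017 = count_patches(raster_2017, "F")
forest_2022 = count_patches(raster_2022, "F")
```

The forest area is the same in both toy rasters, but it is split into more patches in the second one, which is the sense in which a rising patch count indicates fragmentation.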

20 pages, 2105 KiB  
Article
Bootstrap Method as a Tool for Analyzing Data with Atypical Distributions Deviating from Parametric Assumptions: Critique and Effectiveness Evaluation
by Joanna Kostanek, Kamil Karolczak, Wiktor Kuliczkowski and Cezary Watala
Data 2024, 9(8), 95; https://doi.org/10.3390/data9080095 - 26 Jul 2024
Cited by 8 | Viewed by 4817
Abstract
In today’s research environment characterized by exponential data growth and increasing complexity, the selection of appropriate statistical tests, tailored to research objectives and data distributions, is paramount for rigorous analysis and accurate interpretation. This article explores the growing prominence of bootstrapping, an advanced statistical technique for multiple comparisons analysis, offering flexibility and customization by estimating sample distributions without assuming population distributions, thus serving as a valuable alternative to traditional methods in various data scenarios. Computer simulations were conducted using data from cardiovascular disease patients. Two approaches, spontaneous partly controlled simulation and fully constrained simulation using self-written R scripts, were utilized to generate datasets with specified distributions and analyze the data using tests for comparing more than two groups. The utilization of the bootstrap method greatly improves statistical analysis, especially in overcoming the constraints of conventional parametric tests. Our research showcased its effectiveness in comparing multiple scenarios, yielding strong findings across diverse distributions, even with minor inflation in p values. Serving as a valuable substitute for parametric approaches, bootstrap promotes careful consideration when rejecting hypotheses, thus fostering a deeper understanding of statistical nuances and bolstering analytical rigor.
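For readers unfamiliar with the basic idea, the following is a minimal sketch of a percentile bootstrap confidence interval for the median of a small, skewed sample. It is a deliberately simpler case than the multi-group comparisons studied in the article, it is written in Python rather than the authors' R, and the sample values and the `bootstrap_ci` helper are made up for illustration.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.median, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement, recompute the statistic each time.
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up right-skewed sample, where normality assumptions would be doubtful.
skewed = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 3.0, 3.8, 7.5, 12.0]
low, high = bootstrap_ci(skewed)
```

No distributional form is assumed for the population; the interval comes entirely from the spread of the statistic across resamples.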

18 pages, 1124 KiB  
Data Descriptor
SparrKULee: A Speech-Evoked Auditory Response Repository from KU Leuven, Containing the EEG of 85 Participants
by Bernd Accou, Lies Bollens, Marlies Gillis, Wendy Verheijen, Hugo Van hamme and Tom Francart
Data 2024, 9(8), 94; https://doi.org/10.3390/data9080094 - 26 Jul 2024
Cited by 6 | Viewed by 2203
Abstract
Researchers investigating the neural mechanisms underlying speech perception often employ electroencephalography (EEG) to record brain activity while participants listen to spoken language. The high temporal resolution of EEG enables the study of neural responses to fast and dynamic speech signals. Previous studies have successfully extracted speech characteristics from EEG data and, conversely, predicted EEG activity from speech features. Machine learning techniques are generally employed to construct encoding and decoding models, which necessitate a substantial quantity of data. We present SparrKULee, a Speech-evoked Auditory Repository of EEG data, measured at KU Leuven, comprising 64-channel EEG recordings from 85 young individuals with normal hearing, each of whom listened to 90–150 min of natural speech. This dataset is more extensive than any currently available dataset in terms of both the number of participants and the quantity of data per participant. It is suitable for training larger machine learning models. We evaluate the dataset using linear and state-of-the-art non-linear models in a speech encoding/decoding and match/mismatch paradigm, providing benchmark scores for future research.

24 pages, 388 KiB  
Article
Optimizing Database Performance in Complex Event Processing through Indexing Strategies
by Maryam Abbasi, Marco V. Bernardo, Paulo Váz, José Silva and Pedro Martins
Data 2024, 9(8), 93; https://doi.org/10.3390/data9080093 - 24 Jul 2024
Cited by 1 | Viewed by 2308
Abstract
Complex event processing (CEP) systems have gained significant importance in various domains, such as finance, logistics, and security, where the real-time analysis of event streams is crucial. However, as the volume and complexity of event data continue to grow, optimizing the performance of CEP systems becomes a critical challenge. This paper investigates the impact of indexing strategies on the performance of databases handling complex event processing. We propose a novel indexing technique, called Hierarchical Temporal Indexing (HTI), specifically designed for the efficient processing of complex event queries. HTI leverages the temporal nature of event data and employs a multi-level indexing approach to optimize query execution. By combining temporal indexing with spatial- and attribute-based indexing, HTI aims to accelerate the retrieval and processing of relevant events, thereby improving overall query performance. In this study, we evaluate the effectiveness of HTI by implementing complex event queries on various CEP systems with different indexing strategies. We conduct a comprehensive performance analysis, measuring the query execution times and resource utilization (CPU, memory, etc.), and analyzing the execution plans and query optimization techniques employed by each system. Our experimental results demonstrate that the proposed HTI indexing strategy outperforms traditional indexing approaches, particularly for complex event queries involving temporal constraints and multi-dimensional event attributes. We provide insights into the strengths and weaknesses of each indexing strategy, identifying the factors that influence performance, such as data volume, query complexity, and event characteristics. Furthermore, we discuss the implications of our findings for the design and optimization of CEP systems, offering recommendations for indexing strategy selection based on the specific requirements and workload characteristics. Finally, we outline the potential limitations of our study and suggest future research directions in this domain.
18 pages, 7475 KiB  
Data Descriptor
BELMASK—An Audiovisual Dataset of Adversely Produced Speech for Auditory Cognition Research
by Cleopatra Christina Moshona, Frederic Rudawski, André Fiebig and Ennes Sarradj
Data 2024, 9(8), 92; https://doi.org/10.3390/data9080092 - 24 Jul 2024
Viewed by 1956
Abstract
In this article, we introduce the Berlin Dataset of Lombard and Masked Speech (BELMASK), a phonetically controlled audiovisual dataset of speech produced in adverse speaking conditions, and describe the development of the related speech task. The dataset contains in total 128 min of audio and video recordings of 10 German native speakers (4 female, 6 male) with a mean age of 30.2 years (SD: 6.3 years), uttering matrix sentences in cued, uninstructed speech in four conditions: (i) with a Filtering Facepiece P2 (FFP2) mask in silence, (ii) without an FFP2 mask in silence, (iii) with an FFP2 mask while exposed to noise, (iv) without an FFP2 mask while exposed to noise. Noise consisted of mixed-gender six-talker babble played over headphones to the speakers, triggering the Lombard effect. All conditions are readily available in face-and-voice and voice-only formats. The speech material is annotated, employing a multi-layer architecture, and was originally conceptualized to be used for the administration of a working memory task. The dataset is stored in a restricted-access Zenodo repository and is available for academic research in the area of speech communication, acoustics, psychology and related disciplines upon request, after signing an End User License Agreement (EULA).

8 pages, 3193 KiB  
Data Descriptor
Data Descriptor of Snakebites in Brazil from 2007 to 2020
by Alexandre Vilhena Silva-Neto, Gabriel Santos Mouta, Antônio Alcirley Silva Balieiro, Jady Shayenne Mota Cordeiro, Patricia Carvalho Silva Balieiro, Tatyana Costa Amorin Ramos, Djane Clarys Baia-da-Silva, Élisson Silva Rocha, Patricia Takako Endo, Theo Lynn, Wuelton Marcelo Monteiro and Vanderson Souza Sampaio
Data 2024, 9(8), 91; https://doi.org/10.3390/data9080091 - 24 Jul 2024
Viewed by 1722
Abstract
Snakebite envenomations (SBE) are a significant global public health threat due to their morbidity and mortality. This is a neglected public health issue in many tropical and subtropical countries. Brazil is in the top ten countries affected by SBE, with 32,160 cases reported in 2020 alone, posing a high burden for this population. In this paper, we describe the data structure of snakebite records from 2007 to 2020 in the Notifiable Disease Information System (SINAN), made available by the Brazilian Ministry of Health (MoH). In addition, we provide R scripts that allow a quick and automatic updating of data from the SINAN according to its availability. The data presented in this work are related to clinical and demographic information on SBE cases. Data on outcomes, laboratory results, and treatment are also available. The dataset is available and freely accessible; however, preprocessing, adjustments, and standardization are necessary due to incompleteness and inconsistencies. Regardless of these limitations, it provides a solid basis for assessing different aspects and the national burden of envenoming.

12 pages, 5548 KiB  
Data Descriptor
SaBi3d—A LiDAR Point Cloud Data Set of Car-to-Bicycle Overtaking Maneuvers
by Christian Odenwald and Moritz Beeking
Data 2024, 9(8), 90; https://doi.org/10.3390/data9080090 - 24 Jul 2024
Cited by 1 | Viewed by 3045
Abstract
While cycling presents environmental benefits and promotes a healthy lifestyle, the risks associated with overtaking maneuvers by motorized vehicles represent a significant barrier for many potential cyclists. A large-scale analysis of overtaking maneuvers could inform traffic researchers and city planners how to reduce these risks by better understanding these maneuvers. Drawing from the fields of sensor-based cycling research and from LiDAR-based traffic data sets, this paper provides a step towards addressing these safety concerns by introducing the Salzburg Bicycle 3d (SaBi3d) data set, which consists of LiDAR point clouds capturing car-to-bicycle overtaking maneuvers. The data set, collected using a LiDAR-equipped bicycle, facilitates the detailed analysis of a large quantity of overtaking maneuvers without the need for manual annotation through enabling automatic labeling by a neural network. Additionally, a benchmark result for 3D object detection using a competitive neural network is provided as a baseline for future research. The SaBi3d data set is structured identically to the nuScenes data set, and therefore offers compatibility with numerous existing object detection systems. This work provides valuable resources for future researchers to better understand cycling infrastructure and mitigate risks, thus promoting cycling as a viable mode of transportation.
(This article belongs to the Section Spatial Data Science and Digital Earth)
