Data

Data, Vol. 9, Pages 66: A Series Production Data Set for Five-Axis CNC Milling

Anna-Maria Schmitt — 2024-04-30

Data, Vol. 9, Pages 66: A Series Production Data Set for Five-Axis CNC Milling

Authors: Anna-Maria Schmitt Bastian Engelmann

The described data set contains features from the machine control of a five-axis milling machine. The features were recorded during thirteen series productions. Each series production includes a changeover process in which the machine was set up for the production of a different product. In addition to the timestamps and the twenty recorded features derived from Numerical Control (NC) variables, the data set also contains labels for the different production phases. For this purpose, up to 23 phases were assigned, which are based on a generalized milling process. The data set consists of thirteen .csv files, each representing a series production. The data set was recorded in a production company in the contract manufacturing sector for components with real series orders in ongoing industrial production.

Data, Vol. 9, Pages 65: Spectral Library of Plant Species from Montesinho Natural Park in Portugal

Isabel Pôças — 2024-04-30

Data, Vol. 9, Pages 65: Spectral Library of Plant Species from Montesinho Natural Park in Portugal

Data doi: 10.3390/data9050065

Authors: Isabel Pôças Cátia Rodrigues de Almeida Salvador Arenas-Castro João C. Campos Nuno Garcia João Alírio Neftalí Sillero Ana C. Teodoro

In this work, we present and describe a spectral library (SL) with 15 vascular plant species from Montesinho Natural Park (MNP), a protected area in Northeast Portugal. We selected species from the vascular plants that are characteristic of the habitats in the MNP, based on their prevalence, and also included one invasive species: Alnus glutinosa (L.) Gaertn, Castanea sativa Mill., Cistus ladanifer L., Crataegus monogyna Jacq., Frangula alnus Mill., Fraxinus angustifolia Vahl, Quercus pyrenaica Willd., Quercus rotundifolia Lam., Trifolium repens L., Arbutus unedo L., Dactylis glomerata L., Genista falcata Brot., Cytisus multiflorus (L’Hér.) Sweet, Erica arborea L., and Acacia dealbata Link. We collected spectra (300–2500 nm) from five records per leaf and leaf side, which resulted in 538 spectra compiled in the SL. Additionally, we computed five vegetation indices from spectral data and analysed them to highlight specific characteristics and differences among the sampled species. We detail the data repository information and its organisation for a better understanding of the data and to facilitate its use. The SL structure can add valuable information about the selected plant species in MNP, contributing to conservation purposes. This plant species SL is publicly available in Zenodo platform.

Data, Vol. 9, Pages 64: A Comprehensive Dataset of the Aerodynamic and Geometric Coefficients of Airfoils in the Public Domain

Kanak Agarwal — 2024-04-30

Data, Vol. 9, Pages 64: A Comprehensive Dataset of the Aerodynamic and Geometric Coefficients of Airfoils in the Public Domain

Data doi: 10.3390/data9050064

Authors: Kanak Agarwal Vedant Vijaykrishnan Dyutit Mohanty Manikandan Murugaiah

This study presents an extensive collection of data on the aerodynamic behavior at a low Reynolds number and geometric coefficients for 2900 airfoils obtained through the class shape transformation (CST) method. By employing a verified OpenFOAM-based CFD simulation framework, lift and drag coefficients were determined at a Reynolds number of 105. Considering the limited availability of data on low Reynolds number airfoils, this dataset is invaluable for a wide range of applications, including unmanned aerial vehicles (UAVs) and wind turbines. Additionally, the study offers a method for automating CFD simulations that could be applied to obtain aerodynamic coefficients at higher Reynolds numbers. The breadth of this dataset also supports the enhancement and creation of machine learning (ML) models, further advancing research into the aerodynamics of airfoils and lifting surfaces.

Data, Vol. 9, Pages 63: Detailed Landslide Traces Database of Hancheng County, China, Based on High-Resolution Satellite Images Available on the Google Earth Platform

Zhao — 2024-04-29

Data, Vol. 9, Pages 63: Detailed Landslide Traces Database of Hancheng County, China, Based on High-Resolution Satellite Images Available on the Google Earth Platform

Data doi: 10.3390/data9050063

Authors: Zhao Xu Huang

Hancheng is located in the eastern part of China’s Shaanxi Province, near the west bank of the Yellow River. It is located at the junction of the active geological structure area. The rock layer is relatively fragmented, and landslide disasters are frequent. The occurrence of landslide disasters often causes a large number of casualties along with economic losses in the local area, seriously restricting local economic development. Although risk assessment and deformation mechanism analysis for single landslides have been performed for landslide disasters in the Hancheng area, this area lacks a landslide traces database. A complete landslide database comprises the basic data required for the study of landslide disasters and is an important requirement for subsequent landslide-related research. Therefore, this study used multi-temporal high-resolution optical images and human-computer interaction visual interpretation methods of the Google Earth platform to construct a landslide traces database in Hancheng County. The results showed that at least 6785 landslides had occurred in the study area. The total area of the landslides was about 95.38 km2, accounting for 5.88% of the study area. The average landslide area was 1406.04 m2, the largest landslide area was 377,841 m2, and the smallest landslide area was 202.96 m2. The results of this study provides an important basis for understanding the spatial distribution of landslides in Hancheng County, the evaluation of landslide susceptibility, and local disaster prevention and mitigation work.

Data, Vol. 9, Pages 62: Stimulated Microcontroller Dataset for New IoT Device Identification Schemes through On-Chip Sensor Monitoring

Alberto Ramos — 2024-04-28

Data, Vol. 9, Pages 62: Stimulated Microcontroller Dataset for New IoT Device Identification Schemes through On-Chip Sensor Monitoring

Data doi: 10.3390/data9050062

Authors: Alberto Ramos Honorio Martín Carmen Cámara Pedro Peris-Lopez

Legitimate identification of devices is crucial to ensure the security of present and future IoT ecosystems. In this regard, AI-based systems that exploit intrinsic hardware variations have gained notable relevance. Within this context, on-chip sensors included for monitoring purposes in a wide range of SoCs remain almost unexplored, despite their potential as a valuable source of both information and variability. In this work, we introduce and release a dataset comprising data collected from the on-chip temperature and voltage sensors of 20 microcontroller-based boards from the STM32L family. These boards were stimulated with five different algorithms, as workloads to elicit diverse responses. The dataset consists of five acquisitions (1.3 billion readouts) that are spaced over time and were obtained under different configurations using an automated platform. The raw dataset is publicly available, along with metadata and scripts developed to generate pre-processed T–V sequence sets. Finally, a proof of concept consisting of training a simple model is presented to demonstrate the feasibility of the identification system based on these data.

Data, Vol. 9, Pages 61: Training Datasets for Epilepsy Analysis: Preprocessing and Feature Extraction from Electroencephalography Time Series

Christian Riccio — 2024-04-26

Data, Vol. 9, Pages 61: Training Datasets for Epilepsy Analysis: Preprocessing and Feature Extraction from Electroencephalography Time Series

Data doi: 10.3390/data9050061

Authors: Christian Riccio Angelo Martone Gaetano Zazzaro Luigi Pavone

We describe 20 datasets derived through signal filtering and feature extraction steps applied to the raw time series EEG data of 20 epileptic patients, as well as the methods we used to derive them. Background: Epilepsy is a complex neurological disorder which has seizures as its hallmark. Electroencephalography plays a crucial role in epilepsy assessment, offering insights into the brain’s electrical activity and advancing our understanding of seizures. The availability of tagged training sets covering all seizure phases—inter-ictal, pre-ictal, ictal, and post-ictal—is crucial for data-driven epilepsy analyses. Methods: Using the sliding window technique with a two-second window length and a one-second time slip, we extract multiple features from the preprocessed EEG time series of 20 patients from the Freiburg Seizure Prediction Database. In addition, we assign a class label to each instance to specify its corresponding seizure phase. All these operations are made through a software application we developed, which is named Training Builder. Results: The 20 tagged training datasets each contain 1080 univariate and bivariate features, and are openly and publicly available. Conclusions: The datasets support the training of data-driven models for seizure detection, prediction, and clustering, based on features engineering.

Data, Vol. 9, Pages 60: Predicting Academic Success of College Students Using Machine Learning Techniques

Jorge Humberto Guanin-Fajardo — 2024-04-22

Data, Vol. 9, Pages 60: Predicting Academic Success of College Students Using Machine Learning Techniques

Data doi: 10.3390/data9040060

Authors: Jorge Humberto Guanin-Fajardo Javier Guaña-Moya Jorge Casillas

College context and academic performance are important determinants of academic success; using students’ prior experience with machine learning techniques to predict academic success before the end of the first year reinforces college self-efficacy. Dropout prediction is related to student retention and has been studied extensively in recent work; however, there is little literature on predicting academic success using educational machine learning. For this reason, CRISP-DM methodology was applied to extract relevant knowledge and features from the data. The dataset examined consists of 6690 records and 21 variables with academic and socioeconomic information. Preprocessing techniques and classification algorithms were analyzed. The area under the curve was used to measure the effectiveness of the algorithm; XGBoost had an AUC = 87.75% and correctly classified eight out of ten cases, while the decision tree improved interpretation with ten rules in seven out of ten cases. Recognizing the gaps in the study and that on-time completion of college consolidates college self-efficacy, creating intervention and support strategies to retain students is a priority for decision makers. Assessing the fairness and discrimination of the algorithms was the main limitation of this work. In the future, we intend to apply the extracted knowledge and learn about its influence of on university management.

Data, Vol. 9, Pages 59: Mapping of Data-Sharing Repositories for Paediatric Clinical Research—A Rapid Review

Mariagrazia Felisi — 2024-04-20

Data, Vol. 9, Pages 59: Mapping of Data-Sharing Repositories for Paediatric Clinical Research—A Rapid Review

Data doi: 10.3390/data9040059

Authors: Mariagrazia Felisi Fedele Bonifazi Maddalena Toma Claudia Pansieri Rebecca Leary Victoria Hedley Ronald Cornet Giorgio Reggiardo Annalisa Landi Annunziata D’Ercole Salma Malik Sinéad Nally Anando Sen Avril Palmeri Donato Bonifazi Adriana Ceci

The reuse of paediatric individual patient data (IPD) from clinical trials (CTs) is essential to overcome specific ethical, regulatory, methodological, and economic issues that hinder the progress of paediatric research. Sharing data through repositories enables the aggregation and dissemination of clinical information, fosters collaboration between researchers, and promotes transparency. This work aims to identify and describe existing data-sharing repositories (DSRs) developed to store, share, and reuse paediatric IPD from CTs. A rapid review of platforms providing access to electronic DSRs was conducted. A two-stage process was used to characterize DSRs: a first step of identification, followed by a second step of analysis using a set of eight purpose-built indicators. From an initial set of forty-five publicly available DSRs, twenty-one DSRs were identified as meeting the eligibility criteria. Only two DSRs were found to be totally focused on the paediatric population. Despite an increased awareness of the importance of data sharing, the results of this study show that paediatrics remains an area in which targeted efforts are still needed. Promoting initiatives to raise awareness of these DSRs and creating ad hoc measures and common standards for the sharing of paediatric CT data could help to bridge this gap in paediatric research.

Data, Vol. 9, Pages 58: Introduction to Reproducible Geospatial Analysis and Figures in R: A Tutorial Article

Philippe Maesen — 2024-04-20

Data, Vol. 9, Pages 58: Introduction to Reproducible Geospatial Analysis and Figures in R: A Tutorial Article

Data doi: 10.3390/data9040058

Authors: Philippe Maesen Edouard Salingros

The present article is intended to serve an educational purpose for data scientists and students who already have experience with the R language and which to start using it for geospatial analysis and map creation. The basic concepts of raster data, vector data, CRS and datum are first presented along with a basic workflow to conduct reproducible geospatial research in R. Examples of important types of maps (scatter, bubble, choropleth, hexbin and faceted) created from open-source environmental data are illustrated and their practical implementation in R is discussed. Through these examples, essential manipulations on geospatial vector data are demonstrated (reading, transforming CRS, creating geometries from scratch, buffer zones around existing geometries and intersections between geometries).

Data, Vol. 9, Pages 57: Experimental Data on Maximum Swelling Pressure of Clayey Soils and Related Soil Properties

Reza Taherdangkoo — 2024-04-16

Data, Vol. 9, Pages 57: Experimental Data on Maximum Swelling Pressure of Clayey Soils and Related Soil Properties

Data doi: 10.3390/data9040057

Authors: Reza Taherdangkoo Muntasir Shehab Thomas Nagel Faramarz Doulati Ardejani Christoph Butscher

Clayey soils exhibit significant volumetric changes in response to variations in water content. The swelling pressure of clayey soils is a critical parameter for evaluating the stability and performance of structures built on them, facilitating the development of appropriate design methodologies and mitigation strategies to ensure their long-term integrity and safety. We present a dataset comprising maximum swelling pressure values from 759 compacted soil samples, compiled from 16 articles published between 1994 and 2022. The dataset is classified into two main groups: 463 samples of natural clays and 296 samples of bentonite and bentonite mixtures, providing data on various types of soils and their properties. Different swelling test methods, including zero swelling, swell consolidation, restrained swell, double oedometer, free swelling, constant volume oedometer, UPC isochoric cell, isochoric oedometer and consolidometer, were employed to measure the maximum swelling pressure. The comprehensive nature of the dataset enhances its applicability for geotechnical projects. The dataset is a valuable resource for understanding the complex interactions between soil properties and swelling behavior, contributing to advancements in soil mechanics and geotechnical engineering.

Data, Vol. 9, Pages 56: A Dataset for Studying the Relationship between Human and Smart Devices

Francesco Lelli — 2024-04-11

Data, Vol. 9, Pages 56: A Dataset for Studying the Relationship between Human and Smart Devices

Data doi: 10.3390/data9040056

Authors: Francesco Lelli Heidi Toivonen

This dataset reports the responses to a survey designed for investigating the relationship that humans have with their smart devices. The dataset was collected between May and July 2020 and is a sample of over 500 respondents of various ethnicities and backgrounds. These data were used for modeling the ways that people relate to their devices using the notion of agency. However, the data can be used for complementing any study that intends to investigate a tool-mediated communication from the perspective of users, applying a variety of beliefs, attitudes, and expectations that users have in relation to their devices and themselves. This article presents the survey items as well as some preliminary data insights. The collected data were in English and the responses were anonymized to ensure GDPR compliance. The data were stored in a .csv file containing the respondents’ answers to the questions.

Data, Vol. 9, Pages 55: Learning from conect4children: A Collaborative Approach towards Standardisation of Disease-Specific Paediatric Research Data

Anando Sen — 2024-04-08

Data, Vol. 9, Pages 55: Learning from conect4children: A Collaborative Approach towards Standardisation of Disease-Specific Paediatric Research Data

Data doi: 10.3390/data9040055

Authors: Anando Sen Victoria Hedley Eva Degraeuwe Steven Hirschfeld Ronald Cornet Ramona Walls John Owen Peter N. Robinson Edward G. Neilan Thomas Liener Giovanni Nisato Neena Modi Simon Woodworth Avril Palmeri Ricarda Gaentzsch Melissa Walsh Teresa Berkery Joanne Lee Laura Persijn Kasey Baker Kristina An Haack Sonia Segovia Simon Julius O. B. Jacobsen Giorgio Reggiardo Melissa A. Kirwin Jessie Trueman Claudia Pansieri Donato Bonifazi Sinéad Nally Fedele Bonifazi Rebecca Leary Volker Straub

The conect4children (c4c) initiative was established to facilitate the development of new drugs and other therapies for paediatric patients. It is widely recognised that there are not enough medicines tested for all relevant ages of the paediatric population. To overcome this, it is imperative that clinical data from different sources are interoperable and can be pooled for larger post hoc studies. c4c has collaborated with the Clinical Data Interchange Standards Consortium (CDISC) to develop cross-cutting data resources that build on existing CDISC standards in an effort to standardise paediatric data. The natural next step was an extension to disease-specific data items. c4c brought together several existing initiatives and resources relevant to disease-specific data and analysed their use for standardising disease-specific data in clinical trials. Several case studies that combined disease-specific data from multiple trials have demonstrated the need for disease-specific data standardisation. We identified three relevant initiatives. These include European Reference Networks, European Joint Programme on Rare Diseases, and Pistoia Alliance. Other resources reviewed were National Cancer Institute Enterprise Vocabulary Services, CDISC standards, pharmaceutical company-specific data dictionaries, Human Phenotype Ontology, Phenopackets, Unified Registry for Inherited Metabolic Disorders, Orphacodes, Rare Disease Cures Accelerator-Data and Analytics Platform (RDCA-DAP), and Observational Medical Outcomes Partnership. The collaborative partners associated with these resources were also reviewed briefly. A plan of action focussed on collaboration was generated for standardising disease-specific paediatric clinical trial data. A paediatric data standards multistakeholder and multi-project user group was established to guide the remaining actions—FAIRification of metadata, a Phenopackets pilot with RDCA-DAP, applying Orphacodes to case report forms of clinical trials, introducing CDISC standards into European Reference Networks, testing of the CDISC Pediatric User Guide using data from the mentioned resources and organisation of further workshops and educational materials.

Data, Vol. 9, Pages 54: Illumina 16S rRNA Gene Sequencing Dataset of Bacterial Communities of Soil Associated with Ironwood Trees (Casuarina equisetifolia) in Guam

Tao Jin — 2024-04-07

Data, Vol. 9, Pages 54: Illumina 16S rRNA Gene Sequencing Dataset of Bacterial Communities of Soil Associated with Ironwood Trees (Casuarina equisetifolia) in Guam

Data doi: 10.3390/data9040054

Authors: Tao Jin Robert L. Schlub Claudia Husseneder

Ironwood trees, which are of great importance for the economy and environment of tropical areas, were first discovered to suffer from a slow progressive dieback in Guam in 2002, later referred to as ironwood tree decline (IWTD). A variety of biotic factors have been shown to be associated with IWTD, including putative bacterial pathogens Ralstonia solanacearum and Klebsiella species (K. variicola and K. oxytoca), the fungus Ganoderma australe, and termites. Due to the soilborne nature of these pathogens, soil microbiomes have been suggested to be a significant factor influencing tree health. In this project, we sequenced the microbiome in the soil collected from the root region of healthy ironwood trees and those showing signs of IWTD to evaluate the association between the bacterial community in soil and IWTD. This dataset contains 4,782,728 raw sequencing reads present in soil samples collected from thirty-nine ironwood trees with varying scales of decline severity in Guam obtained via sequencing the V1–V3 region of the 16S rRNA gene on the Illumina NovaSeq (2 × 250 bp) platform. Sequences were taxonomically assigned in QIIME2 using the SILVA 132 database. Firmicutes and Actinobacteria were the most dominant phyla in soil. Differences in soil microbiomes were detected between limestone and sand soil parent materials. No putative plant pathogens of the genera Ralstonia or Klebsiella were found in the samples. Bacterial diversity was not linked to parameters of IWTD. The dataset has been made publicly available through NCBI GenBank under BioProject ID PRJNA883256. This dataset can be used to compare the bacterial taxa present in soil associated with ironwood trees in Guam to bacteria communities of other geographical locations to identify microbial signatures of IWTD. In addition, this dataset can also be used to investigate the relationship between soil microbiomes and the microbiomes of ironwood trees as well as those of the termites which attack ironwood trees.

Data, Vol. 9, Pages 53: Wearable Device Bluetooth/BLE Physical Layer Dataset

Artis Rusins — 2024-04-03

Data, Vol. 9, Pages 53: Wearable Device Bluetooth/BLE Physical Layer Dataset

Data doi: 10.3390/data9040053

Authors: Artis Rusins Deniss Tiscenko Eriks Dobelis Eduards Blumbergs Krisjanis Nesenbergs Peteris Paikens

Wearable devices, such as headsets and activity trackers, rely heavily on the Bluetooth and/or the Bluetooth Low Energy wireless communication standard to exchange data with smartphones or other peripherals. Since these devices collect personal health and activity data, ensuring the privacy and security of the transmitted data is crucial. Therefore, we present a dataset that captures complete Bluetooth communications—including advertising, connection, data exchange, and disconnection—in an RF isolated environment using software-defined radio. We were able to successfully decode the captured Bluetooth packets using existing tools. This dataset provides researchers with the ability to fully analyze Bluetooth traffic and gain insight into communication patterns and potential security vulnerabilities.

Data, Vol. 9, Pages 52: Natural Language Processing Patents Landscape Analysis

Hend S. Al-Khalifa — 2024-03-31

Data, Vol. 9, Pages 52: Natural Language Processing Patents Landscape Analysis

Data doi: 10.3390/data9040052

Authors: Hend S. Al-Khalifa Taif AlOmar Ghala AlOlyyan

Understanding NLP patents provides valuable insights into innovation trends and competitive dynamics in artificial intelligence. This study uses the Lens patent database to investigate the landscape of NLP patents. The overall patent output in the NLP field on a global scale has exhibited a rapid growth over the past decade, indicating rising research and commercial interests in applying NLP techniques. By analyzing patent assignees, technology categories, and geographic distribution, we identify leading innovators as well as research hotspots in applying NLP. The patent landscape reflects intensifying competition between technology giants and research institutions. This research aims to synthesize key patterns and developments in NLP innovation revealed through patent data analysis, highlighting implications for firms and policymakers. A detailed understanding of NLP patenting activity can inform intellectual property strategy and technology investment decisions in this burgeoning AI domain.

Data, Vol. 9, Pages 51: Longitudinal Patterns of Online Activity and Social Feedback Are Associated with Current and Perceived Changes in Quality of Life in Adult Facebook Users

Davide Marengo — 2024-03-31

Data, Vol. 9, Pages 51: Longitudinal Patterns of Online Activity and Social Feedback Are Associated with Current and Perceived Changes in Quality of Life in Adult Facebook Users

Data doi: 10.3390/data9040051

Authors: Davide Marengo Michele Settanni

The present study explored how sharing verbal status updates on Facebook and receiving Likes, as a form of positive social feedback, correlate with current and perceived changes in Quality of Life (QoL). Utilizing the Facebook Graph API, we collected a longitudinal dataset comprising status updates and Likes received by 1577 adult Facebook users over a 12-month period. Two monthly indicators were calculated: the percentage of verbal status updates and the average number of Likes per post. Participants were administered a survey to assess current and perceived changes in QoL. Confirmatory Factor Analysis (CFA) and the Auto-Regressive Latent Trajectory Model with Structured Residuals (ALT-SRs) were used to model longitudinal patterns emerging from the objective recordings of Facebook activity and explore their correlation with QoL measures. Findings indicated a positive correlation between the percentage of verbal status updated on Facebook and current QoL. Online positive social feedback, measured through received Likes, was associated with both current QoL and perceived improvements in QoL. Of note, perceived improvements in QoL correlated with an increase in received Likes over time. Results highlight the relevance of collecting and modeling longitudinal Facebook data for the investigation of the association between activity on social media and individual well-being.

Data, Vol. 9, Pages 50: DNA of Music: Identifying Relationships among Different Versions of the Composition Sadhukarn from Thailand, Laos, and Cambodia Using Multivariate Statistics

Sumetus Eambangyung — 2024-03-30

Data, Vol. 9, Pages 50: DNA of Music: Identifying Relationships among Different Versions of the Composition Sadhukarn from Thailand, Laos, and Cambodia Using Multivariate Statistics

Data doi: 10.3390/data9040050

Authors: Sumetus Eambangyung Gretel Schwörer-Kohl Witoon Purahong

Sadhukarn, a sacred music composition performed ritually to salute and invite divine powers to open a ceremony or feast, is played in Thailand, Cambodia, and Laos. Different countries have unique versions, arranged based on musicians’ skills and en vogue styles. This study presents the results of multivariate statistical analyses of 26 different versions of Sadhukarn main melodies using non-metric multidimensional scaling (NMDS) and cluster analysis. The objective was to identify the optimal number of parameters for identifying the origin and relationships among Sadhukarn versions, including rhyme structures, pillar tone, rhythmic and melodic patterns, intervals, pitches, and combinations of these parameters. The data were analyzed using both full and normalized datasets (32 phrases) to avoid biases due to differences in phrases among versions. Overall, the combination of six parameters is the best approach for data analysis in both full and normalized datasets. The analysis of the ‘full version’ shows the separation of Sadhukarn versions from different countries of origin, while the analysis of the ‘normalized version’ reveals the rhyme structure, rhythmic structure, and pitch as crucial parameters for identifying Sadhukarn versions. We conclude that multivariate statistics are powerful tools for identifying relationships among different versions of Sadhukarn compositions from Thailand, Laos, and Cambodia and within the same countries of origin.

Data, Vol. 9, Pages 49: Analysis of a Bluetooth Traffic Dataset Obtained during University Examination Sessions

Radu Bouaru — 2024-03-30

Data, Vol. 9, Pages 49: Analysis of a Bluetooth Traffic Dataset Obtained during University Examination Sessions

Data doi: 10.3390/data9040049

Authors: Radu Bouaru Adrian Peculea Bogdan Iancu Sorin Buzura Emil Cebuc Vasile Dadarlat

In academic environments, students take exams simultaneously in campus examination classrooms. Due to recent advancements in technology, examination rooms are flooded with Bluetooth data traffic generated by personal devices (smartphones, smartwatches, etc.). The work presented in this article proposes a method for collecting Bluetooth traffic in an academic examination setting. The desired data were collected during several examination sessions using an Ubertooth One device, and then an in-depth post-processing analysis was performed on the collected dataset. The devices generating traffic were precisely located within the examination room, and areas with heightened data traffic were highlighted. Additionally, another goal of the current research was to provide a unique type of dataset to the academic community, facilitating its utilization in further research endeavors.

Data, Vol. 9, Pages 48: Luxury Car Data Analysis: A Literature Review

Pegah Barakati — 2024-03-30

Data, Vol. 9, Pages 48: Luxury Car Data Analysis: A Literature Review

Data doi: 10.3390/data9040048

Authors: Pegah Barakati Flavio Bertini Emanuele Corsi Maurizio Gabbrielli Danilo Montesi

The concept of luxury, considering it a rare and exclusive attribute, is evolving due to technological advances and the increasing influence of consumers in the market. Luxury cars have always symbolized wealth, social status, and sophistication. Recently, as technology progresses, the ability and interest to gather, store, and analyze data from these elegant vehicles has also increased. In recent years, the analysis of luxury car data has emerged as a significant area of research, highlighting researchers’ exploration of various aspects that may differentiate luxury cars from ordinary ones. For instance, researchers study factors such as economic impact, technological advancements, customer preferences and demographics, environmental implications, brand reputation, security, and performance. Although the percentage of individuals purchasing luxury cars is lower than that of ordinary cars, the significance of analyzing luxury car data lies in its impact on various aspects of the automotive industry and society. This literature review aims to provide an overview of the current state of the art in luxury car data analysis.

Data, Vol. 9, Pages 47: An EEG Dataset of Subject Pairs during Collaboration and Competition Tasks in Face-to-Face and Online Modalities

María A. Hernández-Mustieles — 2024-03-27

Data, Vol. 9, Pages 47: An EEG Dataset of Subject Pairs during Collaboration and Competition Tasks in Face-to-Face and Online Modalities

Data doi: 10.3390/data9040047

Authors: María A. Hernández-Mustieles Yoshua E. Lima-Carmona Axel A. Mendoza-Armenta Ximena Hernandez-Machain Diego A. Garza-Vélez Aranza Carrillo-Márquez Diana C. Rodríguez-Alvarado Jorge de J. Lozoya-Santos Mauricio A. Ramírez-Moreno

This dataset was acquired during collaboration and competition tasks performed by sixteen subject pairs (N = 32) of one female and one male under different (face-to-face and online) modalities. The collaborative task corresponds to cooperating to put together a 100-piece puzzle, while the competition task refers to playing against each other in a one-on-one classic 28-piece dominoes game. In the face-to-face modality, all interactions between the pair occurred in person. On the other hand, in the online modality, participants were physically separated, and interaction was only allowed through Zoom software with an active microphone and camera. Electroencephalography data of the two subjects were acquired simultaneously while performing the tasks. This article describes the experimental setup, the process of the data streams acquired during the tasks, and the assessment of data quality.

Data, Vol. 9, Pages 46: WEA-Acceptance Data—A Dataset of Acoustic, Meteorological, and Operational Wind Turbine Measurements

Daphne Schössow — 2024-03-15

Data, Vol. 9, Pages 46: WEA-Acceptance Data—A Dataset of Acoustic, Meteorological, and Operational Wind Turbine Measurements

Data doi: 10.3390/data9030046

Authors: Daphne Schössow Stephan Preihs Jürgen Peissig

In this article, a dataset is described which combines wind turbine supervisory control and data acquisition (SCADA), meteorological and acoustical data and thus gives a detailed description of a wind farm and its atmospheric and acoustic environment. The data were collected during different seasons for several weeks at a time, such that a multitude of environmental and operational conditions are covered. In five measurement campaigns, in total three different locations with similar surroundings were captured. The raw data were enhanced with derived values such as atmospheric stability or direction of sound propagation. Data of one month including all time series measurements as well as monophonic audio recordings are now published. The dataset also contains three exemplary use cases along with documents that describe the data pre-processing.

Data, Vol. 9, Pages 45: A Dataset of Benthic Species from Mesophotic Bioconstructions on the Apulian Coast (Southeastern Italy, Mediterranean Sea)

Maria Mercurio — 2024-03-08

Data, Vol. 9, Pages 45: A Dataset of Benthic Species from Mesophotic Bioconstructions on the Apulian Coast (Southeastern Italy, Mediterranean Sea)

Data doi: 10.3390/data9030045

Authors: Maria Mercurio Guadalupe Giménez Giorgio Bavestrello Frine Cardone Giuseppe Corriero Jacopo Giampaoletti Maria Flavia Gravina Cataldo Pierri Caterina Longo Adriana Giangrande Carlotta Nonnis Marzano

Marine bioconstructions are complex habitats that represent a hotspot of biodiversity. Among Mediterranean bioconstructions, those thriving on mesophotic bottoms on southeastern Italian coasts are of particular interest due to their horizontal and vertical extension. In general, the communities that develop in the Mediterranean twilight zone encompassed within the first 30 m of depth are better known, while relatively few data are available on those at greater depths. By further investigating the diversity and structure of mesophotic bioconstructions in the southern Adriatic, we can improve our understanding of Mediterranean biodiversity while developing effective conservation strategies to preserve these habitats of particular interest. The dataset reported here comprises records of benthic marine taxa from algae and invertebrate mesophotic bioconstructions investigated at six sites along the southern Adriatic coast of Italy, at depths between approximately 25 and 65 m. The dataset contains a total of 1718 records, covering 11 phyla and 648 benthic taxa, of which 580 were recognized at the species level. These data could provide a reference point for further investigations with descriptive or management purposes, including the possible assessment of mesophotic bioconstructions as refuges for shallow-water species.

Data, Vol. 9, Pages 44: Subjective Well-Being and Mental Health among College Students: Two Datasets for Diagnosis and Program Evaluation

Lina Martínez — 2024-03-06

Data, Vol. 9, Pages 44: Subjective Well-Being and Mental Health among College Students: Two Datasets for Diagnosis and Program Evaluation

Data doi: 10.3390/data9030044

Authors: Lina Martínez Esteban Robles Valeria Trofimoff Nicolás Vidal Andrés David Espada Nayith Mosquera Bryan Franco Víctor Sarmiento María Isabel Zafra

This paper presents two datasets about college students’ subjective well-being and mental health in a developing country. The first data set of this report offers a diagnosis of the prevalence of self-reported symptoms associated with stress, anxiety, depression, and overall evaluation of subjective well-being. The study uses validated scales to measure self-reported symptoms related to mental health conditions. To measure stress, the study used the Perceived Stress Scale (PSS-10) and the 7-item Generalized Anxiety Disorder Scale (GAD-7) to measure symptoms associated with anxiety (GAD-7), and the 9-item Patient Health Questionnaire (PHQ-9) to measure symptoms associated with depression. This diagnosis was collected in a college student sample of 3052 undergrad students in 2022 at a medium-sized university in Colombia. The second dataset reports the evaluation of a positive education intervention implemented in the same university. The Colombian Minister of Science and Technology financed the intervention to promote strategies to mitigate the consequences on college students’ well-being and mental health after the pandemic. The program evaluation data cover two years (2020–2022) with 193 college students in the treatment group (students enrolled in a class teaching evidence-based interventions to promote well-being and mental health awareness) and 135 students in the control group. Data for evaluation include a broad array of variables of life satisfaction, happiness, negative emotions, COVID-19 effects, relationships valuations, and habits and the measurement of three scales: The Satisfaction with Life Scale (SWLS), a brief measurement of depressive symptomatology (CESD-7), and the Brief Strengths Scale (BSS).

Data, Vol. 9, Pages 43: Pupil Data Upon Stimulation by Auditory Stimuli

Davide La Rosa — 2024-03-05

Data, Vol. 9, Pages 43: Pupil Data Upon Stimulation by Auditory Stimuli

Data doi: 10.3390/data9030043

Authors: Davide La Rosa Luca Bruschini Maria Paola Tramonti Fantozzi Paolo Orsini Mario Milazzo Antonino Crivello

Evaluating hearing in newborns and uncooperative patients can pose a considerable challenge. One potential solution might be to employ the Pupil Dilation Response (PDR) as an objective physiological metric. In this dataset descriptor paper, we present a collection of data showing changes in pupil dimension and shape upon presentation of auditory stimuli. In particular, we collected pupil data from 16 subjects, with no known hearing loss, upon different lighting conditions, measured in response to a series of 60–100 audible tones, all of the same frequency and amplitude, which may serve to further investigate any relationship between hearing capabilities and PDRs.

Data, Vol. 9, Pages 42: A Set of Ground Penetrating Radar Measures from Quarries

Stefano Bonduà — 2024-03-03

Data, Vol. 9, Pages 42: A Set of Ground Penetrating Radar Measures from Quarries

Data doi: 10.3390/data9030042

Authors: Stefano Bonduà André Monteiro Klen Massimiliano Pilone Laurentiu Asimopolos Natalia-Silvia Asimopolos

This paper presents a set of Ground Penetrating Radar (GPR) data obtained from in situ measurements conducted in four ornamental stone quarries located in Italy (Botticino quarry) and Romania (Ruschita, Carpinis, and Pietroasa quarries). The GPR is a Non-Destructive Testing (NDT) technique that enables the detection and localization of fractures without damage to the surface, among other capabilities. In this study, two instruments of ground-coupled GPR were used to detect and locate the fractures, discontinuities, or weakened zones. The GPR data contains radargrams for discontinuities and fracture detection, besides the geographic location of the measures. For each measurement site, a set of radargrams has been acquired in two orthogonal directions, allowing for a 3D reconstruction of the investigated site.

Data, Vol. 9, Pages 41: Defining the Balearic Islands’ Tourism Data Space: An Approach to Functional and Data Requirements

Dolores Ordóñez-Martínez — 2024-02-29

Data, Vol. 9, Pages 41: Defining the Balearic Islands’ Tourism Data Space: An Approach to Functional and Data Requirements

Data doi: 10.3390/data9030041

Authors: Dolores Ordóñez-Martínez Joana M. Seguí-Pons Maurici Ruiz-Pérez

The definition of a tourism data space (TDS) in the Balearic Islands is a complex process that involves identifying the types of questions to be addressed, including analytical tools, and determining the type of information to be incorporated. This study delves into the functional requirements of a Balearic Islands’ TDS based on the study of scientific research carried out in the field of tourism in the Balearic Islands and drawing comparisons with international scientific research in the field of tourism information. Utilizing a bibliometric analysis of the scientific literature, this study identifies the scientific requirements that should be met for the development of a robust, rigorous, and efficient TDS. The goal is to support excellent scientific research in tourism and facilitate the transfer of research results to the productive sector to maintain and improve the competitiveness of the Balearic Islands as a tourist destination. The results of the analysis provide a structured framework for the construction of the Balearic Islands’ TDS, outlining objectives, methods to be implemented, and information to be considered.

Data, Vol. 9, Pages 40: Draft Genome Sequence of Bacillus thuringiensis INTA 103-23 Reveals Its Insecticidal Properties: Insights from the Genomic Sequence

Leopoldo Palma — 2024-02-28

Data, Vol. 9, Pages 40: Draft Genome Sequence of Bacillus thuringiensis INTA 103-23 Reveals Its Insecticidal Properties: Insights from the Genomic Sequence

Data doi: 10.3390/data9030040

Authors: Leopoldo Palma Leila Ortiz José Niz Marcelo Berretta Diego Sauka

The genome of Bacillus thuringiensis strain INTA 103-23 was sequenced, revealing a high-quality draft assembly comprising 243 contigs with a total size of 6.30 Mb and a completeness of 99%. Phylogenetic analysis classified INTA 103-23 within the Bacillus cereus sensu stricto cluster. Genome annotation identified 6993 genes, including 2476 hypothetical proteins. Screening for pesticidal proteins unveiled 10 coding sequences with significant similarity to known pesticidal proteins, showcasing a potential efficacy against various insect orders. AntiSMASH analysis predicted 13 biosynthetic gene clusters (BGCs), including clusters with 100% similarity to petrobactin and anabaenopeptin NZ857/nostamide A. Notably, fengycin exhibited a 40% similarity within the identified clusters. Further exploration involved a comparative genomic analysis with ten phylogenetically closest genomes. The ANI values, calculated using fastANI, confirmed the closest relationships with strains classified under Bacillus cereus sensu stricto. This comprehensive genomic analysis of B. thuringiensis INTA 103-23 provides valuable insights into its genetic makeup, potential pesticidal activity, and biosynthetic capabilities. The identified BGCs and pesticidal proteins contribute to our understanding of the strain’s biocontrol potential against diverse agricultural pests.

Data, Vol. 9, Pages 39: CybAttT: A Dataset of Cyberattack News Tweets for Enhanced Threat Intelligence

Huda Lughbi — 2024-02-23

Data, Vol. 9, Pages 39: CybAttT: A Dataset of Cyberattack News Tweets for Enhanced Threat Intelligence

Data doi: 10.3390/data9030039

Authors: Huda Lughbi Mourad Mars Khaled Almotairi

The continuous developments in information technologies have resulted in a significant rise in security concerns, including cybercrimes, unauthorized access, and cyberattacks. Recently, researchers have increasingly turned to social media platforms like X to investigate cyberattacks. Analyzing and collecting news about cyberattacks from tweets can efficiently provide crucial insights into the attacks themselves, including their impacts, occurrence regions, and potential mitigation strategies. However, there is a shortage of labeled datasets related to cyberattacks. This paper describes CybAttT, a dataset of 36,071 English cyberattack-related tweets. These tweets are manually labeled into three classes: high-risk news, normal news, and not news. Our final overall Inner Annotation agreement was 0.99 (Fleiss kappa), which represents high agreement. To ensure dataset reliability and accuracy, we conducted rigorous experiments using different supervised machine learning algorithms and various fine-tuned language models to assess its quality and suitability for its intended purpose. A high F1-score of 87.6% achieved using the CybAttT dataset not only demonstrates the potential of our approach but also validates the high quality and thoroughness of its annotations. We have made our CybAttT dataset accessible to the public for research purposes.

Data, Vol. 9, Pages 38: Multimodal Hinglish Tweet Dataset for Deep Pragmatic Analysis

Pratibha — 2024-02-15

Data, Vol. 9, Pages 38: Multimodal Hinglish Tweet Dataset for Deep Pragmatic Analysis

Data doi: 10.3390/data9020038

Authors: Pratibha Amandeep Kaur Meenu Khurana Robertas Damaševičius

Wars, conflicts, and peace efforts have become inherent characteristics of regions, and understanding the prevailing sentiments related to these issues is crucial for finding long-lasting solutions. Twitter/‘X’, with its vast user base and real-time nature, provides a valuable source to assess the raw emotions and opinions of people regarding war, conflict, and peace. This paper focuses on collecting and curating hinglish tweets specifically related to wars, conflicts, and associated taxonomy. The creation of said dataset addresses the existing gap in contemporary literature, which lacks comprehensive datasets capturing the emotions and sentiments expressed by individuals regarding wars, conflicts, and peace efforts. This dataset holds significant value and application in deep pragmatic analysis as it enables future researchers to identify the flow of sentiments, analyze the information architecture surrounding war, conflict, and peace effects, and delve into the associated psychology in this context. To ensure the dataset’s quality and relevance, a meticulous selection process was employed, resulting in the inclusion of explanable 500 carefully chosen search filters. The dataset currently has 10,040 tweets that have been validated with the help of human expert to make sure they are correct and accurate.

Data, Vol. 9, Pages 37: Digital Elevation Models and Orthomosaics of the Dutch Noordwest Natuurkern Foredune Restoration Project

Gerben Ruessink — 2024-02-15

Data, Vol. 9, Pages 37: Digital Elevation Models and Orthomosaics of the Dutch Noordwest Natuurkern Foredune Restoration Project

Data doi: 10.3390/data9020037

Authors: Gerben Ruessink Dick Groenendijk Bas Arens

Coastal dunes worldwide are increasingly under pressure from the adverse effects of human activities. Therefore, more and more restoration measures are being taken to create conditions that help disturbed coastal dune ecosystems regenerate or recover naturally. However, many projects lack the (open-access) monitoring observations needed to signal whether further actions are needed, and hence lack the opportunity to “learn by doing”. This submission presents an open-access data set of 37 high-resolution digital elevation models and 24 orthomosaics collected before and after the excavation of five artificial foredune trough blowouts (“notches”) in winter 2012/2013 in the Dutch Zuid-Kennemerland National Park, one of the largest coastal dune restoration projects in northwest Europe. These high-resolution data provide a valuable resource for improving understanding of the biogeomorphic processes that determine the evolution of restored dune systems as well as developing guidelines to better design future restoration efforts with foredune notching.

Data, Vol. 9, Pages 36: AriAplBud: An Aerial Multi-Growth Stage Apple Flower Bud Dataset for Agricultural Object Detection Benchmarking

Wenan Yuan — 2024-02-11

Data, Vol. 9, Pages 36: AriAplBud: An Aerial Multi-Growth Stage Apple Flower Bud Dataset for Agricultural Object Detection Benchmarking

Data doi: 10.3390/data9020036

Authors: Wenan Yuan

As one of the most important topics in contemporary computer vision research, object detection has received wide attention from the precision agriculture community for diverse applications. While state-of-the-art object detection frameworks are usually evaluated against large-scale public datasets containing mostly non-agricultural objects, a specialized dataset that reflects unique properties of plants would aid researchers in investigating the utility of newly developed object detectors within agricultural contexts. This article presents AriAplBud: a close-up apple flower bud image dataset created using an unmanned aerial vehicle (UAV)-based red–green–blue (RGB) camera. AriAplBud contains 3600 images of apple flower buds at six growth stages, with 110,467 manual bounding box annotations as positive samples and 2520 additional empty orchard images containing no apple flower bud as negative samples. AriAplBud can be directly deployed for developing object detection models that accept Darknet annotation format without additional preprocessing steps, serving as a potential benchmark for future agricultural object detection research. A demonstration of developing YOLOv8-based apple flower bud detectors is also presented in this article.

Data, Vol. 9, Pages 35: COVID-19 Lockdown Effects on Sleep, Immune Fitness, Mood, Quality of Life, and Academic Functioning: Survey Data from Turkish University Students

Pauline A. Hendriksen — 2024-02-10

Data, Vol. 9, Pages 35: COVID-19 Lockdown Effects on Sleep, Immune Fitness, Mood, Quality of Life, and Academic Functioning: Survey Data from Turkish University Students

Data doi: 10.3390/data9020035

Authors: Pauline A. Hendriksen Sema Tan Evi C. van Oostrom Agnese Merlo Hilal Bardakçi Nilay Aksoy Johan Garssen Gillian Bruce Joris C. Verster

Previous studies from the Netherlands, Germany, and Argentina revealed that the 2019 coronavirus disease (COVID-19) pandemic and associated lockdown periods had a significant negative impact on the wellbeing and quality of life of students. The negative impact of lockdown periods on health correlates such as immune fitness, alcohol consumption, and mood were reflected in their academic functioning. As both the duration and intensity of lockdown measures differed between countries, it is important to replicate these findings in different countries and cultures. Therefore, the purpose of the current study was to examine the impact of the COVID-19 pandemic on immune fitness, mood, academic functioning, sleep, smoking, alcohol consumption, healthy diet, and quality of life among Turkish students. Turkish students in the age range of 18 to 30 years old were invited to complete an online survey. Data were collected from n = 307 participants and included retrospective assessments for six time periods: (1) BP (before the COVID-19 pandemic, 1 January 2020–10 March 2020), (2) NL1 (the first no lockdown period, 11 March 2020–28 April 2021), (3) the lockdown period (29 April 2021–17 May 2021), (4) NL2 (the second no lockdown period, 18 May 2021–31 December 2021), (5) NL3 (the third no lockdown period, 1 January 2022–December 2022), and (6) for the past month. In this data descriptor article, the content of the survey and the dataset are described.

Data, Vol. 9, Pages 34: Draft Genome Sequencing of the Bacillus thuringiensis var. Thuringiensis Highly Insecticidal Strain 800/15

Anton E. Shikov — 2024-02-10

Data, Vol. 9, Pages 34: Draft Genome Sequencing of the Bacillus thuringiensis var. Thuringiensis Highly Insecticidal Strain 800/15

Data doi: 10.3390/data9020034

Authors: Anton E. Shikov Iuliia A. Savina Maria N. Romanenko Anton A. Nizhnikov Kirill S. Antonets

The Bacillus thuringiensis serovar thuringiensis strain 800/15 has been actively used as an agent in biopreparations with high insecticidal activity against the larvae of the Colorado potato beetle Leptinotarsa decemlineata and gypsy moth Lymantria dispar. In the current study, we present the first draft genome of the 800/15 strain coupled with a comparative genomic analysis of its closest reference strains. The raw sequence data were obtained by Illumina technology on the HiSeq X platform and de novo assembled with the SPAdes v3.15.4 software. The genome reached 6,524,663 bp. in size and carried 6771 coding sequences, 3 of which represented loci encoding insecticidal toxins, namely, Spp1Aa1, Cry1Ab9, and Cry1Ba8 active against the orders Lepidoptera, Blattodea, Hemiptera, Diptera, and Coleoptera. We also revealed the biosynthetic gene clusters responsible for the synthesis of secondary metabolites, including fengycin, bacillibactin, and petrobactin with predicted antibacterial, fungicidal, and growth-promoting properties. Further comparative genomics suggested the strain is not enriched with genes linked with biological activities implying that agriculturally important properties rely more on the composition of loci rather than their abundance. The obtained genomic sequence of the strain with the experimental metadata could facilitate the computational prediction of bacterial isolates’ potency from genomic data.

Data, Vol. 9, Pages 33: Conflicting Marks Archive Dataset: A Dataset of Conflicting Marks from the Brazilian Intellectual Property Office

Igor Bezerra Reis — 2024-02-09

Data, Vol. 9, Pages 33: Conflicting Marks Archive Dataset: A Dataset of Conflicting Marks from the Brazilian Intellectual Property Office

Data doi: 10.3390/data9020033

Authors: Igor Bezerra Reis Rafael Ângelo Santos Leite Mateus Miranda Torres Alcides Gonçalves da Silva Neto Francisco José da Silva e Silva Ariel Soares Teles

A registered trademark represents one of a company’s most valuable intellectual assets, acting as a safeguard against possible reputational damage and financial losses resulting from infringements of this intellectual property. To be registered, a mark must be unique and distinctive in relation to other trademarks which are already registered. In this paper, we describe the CMAD, an acronym for Conflicting Marks Archive Dataset. This dataset has been meticulously organized into pairs of marks (Number of pairs = 18,355) involved in copyright infringement across word, figurative and mixed marks. Organizations sought to register these marks with the National Institute of Industrial Property (INPI) in Brazil, and had their applications denied after analysis by intellectual property specialists. The robustness of this dataset is ensured by the intrinsic similarity of the conflicting marks, since the decisions were made by INPI specialists. This characteristic provides a reliable basis for the development and testing of tools designed to analyze similarity between marks, thus contributing to the evolution of practices and computer-based solutions in the field of intellectual property.

Data, Vol. 9, Pages 32: Data in Astrophysics and Geophysics: Novel Research and Applications

Vladimir A. Srećković — 2024-02-08

Data, Vol. 9, Pages 32: Data in Astrophysics and Geophysics: Novel Research and Applications

Data doi: 10.3390/data9020032

Authors: Vladimir A. Srećković Milan S. Dimitrijević Zoran R. Mijić

Rapid development of communication technologies and constant technological improvements as a result of scientific discoveries require the establishment of specific databases [...]

Data, Vol. 9, Pages 31: The Yinshan Mountains Record over 10,000 Landslides

Jingjing Sun — 2024-02-08

Data, Vol. 9, Pages 31: The Yinshan Mountains Record over 10,000 Landslides

Data doi: 10.3390/data9020031

Authors: Jingjing Sun Chong Xu Liye Feng Lei Li Xuewei Zhang Wentao Yang

China boasts a vast expanse of mountainous terrain, characterized by intricate geological conditions and structural features, resulting in frequent geological disasters. Among these, landslides, as prototypical geological hazards, pose significant threats to both lives and property. Consequently, conducting a comprehensive landslide inventory in mountainous regions is imperative for current research. This study concentrates on the Yinshan Mountains, an ancient fault-block mountain range spanning east–west in the central Inner Mongolia Autonomous Region, extending from Langshan Mountains in the west to Damaqun Mountains in the east, with the narrow sense Xiao–Yin Mountains District in between. Employing multi-temporal high-resolution remote sensing images from Google Earth, this study conducted visual interpretation, identifying 10,968 landslides in the Yinshan area, encompassing a total area of 308.94 km2. The largest landslide occupies 2.95 km2, while the smallest covers 84.47 m2. Specifically, the Langshan area comprises 331 landslides with a total area of 11.96 km2, the narrow sense Xiao–Yin Mountains include 3393 landslides covering 64.13 km2, and the Manhan Mountains, Damaqun Mountains, and adjacent areas account for 7244 landslides over a total area of 232.85 km2. This research not only contributes to global landslide cataloging initiatives but also serves as a robust foundation for future geohazard prevention and management efforts.

Data, Vol. 9, Pages 30: Expanded Brain CT Dataset for the Development of AI Systems for Intracranial Hemorrhage Detection and Classification

Anna N. Khoruzhaya — 2024-02-06

Data, Vol. 9, Pages 30: Expanded Brain CT Dataset for the Development of AI Systems for Intracranial Hemorrhage Detection and Classification

Data doi: 10.3390/data9020030

Authors: Anna N. Khoruzhaya Tatiana M. Bobrovskaya Dmitriy V. Kozlov Dmitriy Kuligovskiy Vladimir P. Novik Kirill M. Arzamasov Elena I. Kremneva

Intracranial hemorrhage (ICH) is a dangerous life-threatening condition leading to disability. Timely and high-quality diagnosis plays a huge role in the course and outcome of this disease. The gold standard in determining ICH is computed tomography. This method requires a prompt involvement of highly qualified personnel, which is not always possible, for example, in case of a staff shortage or increased workload. In such a situation, every minute counts, and time can be lost. The solution to this problem seems to be a set of diagnostic decisions, including the use of artificial intelligence, which will help to identify patients with ICH in a timely manner and provide prompt and quality medical care. However, the main obstacle to the development of artificial intelligence is a lack of high-quality datasets for training and testing. In this paper, we present a dataset including 800 brain CT scans consisting of multiple series of DICOM images with and without signs of ICH, enriched with clinical and technical parameters, as well as the methodology of its generation utilizing natural language processing tools. The dataset is publicly available, which contributes to increased competition in the development of artificial intelligence systems and their advancement and quality improvement.

Data, Vol. 9, Pages 29: A Comprehensive Data Pipeline for Comparing the Effects of Momentum on Sports Leagues

Jordan Truman Paul Noel — 2024-02-01

Data, Vol. 9, Pages 29: A Comprehensive Data Pipeline for Comparing the Effects of Momentum on Sports Leagues

Data doi: 10.3390/data9020029

Authors: Jordan Truman Paul Noel Vinicius Prado da Fonseca Amilcar Soares

Momentum has been a consistently studied aspect of sports science for decades. Among the established literature, there has, at times, been a discrepancy between conclusions. However, if momentum is indeed an actual phenomenon, it would affect all aspects of sports, from player evaluation to pre-game prediction and betting. Therefore, using momentum-based features that quantify a team’s linear trend of play, we develop a data pipeline that uses a small sample of recent games to assess teams’ quality of play and measure the predictive power of momentum-based features versus the predictive power of more traditional frequency-based features across several leagues using several machine learning techniques. More precisely, we use our pipeline to determine the differences in the predictive power of momentum-based features and standard statistical features for the National Hockey League (NHL), National Basketball Association (NBA), and five major first-division European football leagues. Our findings show little evidence that momentum has superior predictive power in the NBA. Still, we found some instances of the effects of momentum on the NHL that produced better pre-game predictors, whereas we view a similar trend in European football/soccer. Our results indicate that momentum-based features combined with frequency-based features could improve pre-game prediction models and that, in the future, momentum should be studied more from a feature/performance indicator point-of-view and less from the view of the dependence of sequential outcomes, thus attempting to distance momentum from the binary view of winning and losing.

Data, Vol. 9, Pages 28: Organ-On-A-Chip (OOC) Image Dataset for Machine Learning and Tissue Model Evaluation

Valērija Movčana — 2024-02-01

Data, Vol. 9, Pages 28: Organ-On-A-Chip (OOC) Image Dataset for Machine Learning and Tissue Model Evaluation

Data doi: 10.3390/data9020028

Authors: Valērija Movčana Arnis Strods Karīna Narbute Fēlikss Rūmnieks Roberts Rimša Gatis Mozoļevskis Maksims Ivanovs Roberts Kadiķis Kārlis Gustavs Zviedris Laura Leja Anastasija Zujeva Tamāra Laimiņa Arturs Abols

Organ-on-a-chip (OOC) technology has emerged as a groundbreaking approach for emulating the physiological environment, revolutionizing biomedical research, drug development, and personalized medicine. OOC platforms offer more physiologically relevant microenvironments, enabling real-time monitoring of tissue, to develop functional tissue models. Imaging methods are the most common approach for daily monitoring of tissue development. Image-based machine learning serves as a valuable tool for enhancing and monitoring OOC models in real-time. This involves the classification of images generated through microscopy contributing to the refinement of model performance. This paper presents an image dataset, containing cell images generated from OOC setup with different cell types. There are 3072 images generated by an automated brightfield microscopy setup. For some images, parameters such as cell type, seeding density, time after seeding and flow rate are provided. These parameters along with predefined criteria can contribute to the evaluation of image quality and identification of potential artifacts. This dataset can be used as a basis for training machine learning classifiers for automated data analysis generated from an OOC setup providing more reliable tissue models, automated decision-making processes within the OOC framework and efficient research in the future.

Data, Vol. 9, Pages 27: Understanding Data Breach from a Global Perspective: Incident Visualization and Data Protection Law Review

Gabriel Arquelau Pimenta Rodrigues — 2024-01-31

Data, Vol. 9, Pages 27: Understanding Data Breach from a Global Perspective: Incident Visualization and Data Protection Law Review

Data doi: 10.3390/data9020027

Authors: Gabriel Arquelau Pimenta Rodrigues André Luiz Marques Serrano Amanda Nunes Lopes Espiñeira Lemos Edna Dias Canedo Fábio Lúcio Lopes de Mendonça Robson de Oliveira Albuquerque Ana Lucila Sandoval Orozco Luis Javier García Villalba

Data breaches result in data loss, including personal, health, and financial information that are crucial, sensitive, and private. The breach is a security incident in which personal and sensitive data are exposed to unauthorized individuals, with the potential to incur several privacy concerns. As an example, the French newspaper Le Figaro breached approximately 7.4 billion records that included full names, passwords, and e-mail and physical addresses. To reduce the likelihood and impact of such breaches, it is fundamental to strengthen the security efforts against this type of incident and, for that, it is first necessary to identify patterns of its occurrence, primarily related to the number of data records leaked, the affected geographical region, and its regulatory aspects. To advance the discussion in this regard, we study a dataset comprising 428 worldwide data breaches between 2018 and 2019, providing a visualization of the related statistics, such as the most affected countries, the predominant economic sector targeted in different countries, and the median number of records leaked per incident in different countries, regions, and sectors. We then discuss the data protection regulation in effect in each country comprised in the dataset, correlating key elements of the legislation with the statistical findings. As a result, we have identified an extensive disclosure of medical records in India and government data in Brazil in the time range. Based on the analysis and visualization, we find some interesting insights that researchers seldom focus on before, and it is apparent that the real dangers of data leaks are beyond the ordinary imagination. Finally, this paper contributes to the discussion regarding data protection laws and compliance regarding data breaches, supporting, for example, the decision process of data storage location in the cloud.

Data, Vol. 9, Pages 26: Dataset for Electronics and Plasmonics in Graphene, Silicene, and Germanene Nanostrips

Talia Tene — 2024-01-30

Data, Vol. 9, Pages 26: Dataset for Electronics and Plasmonics in Graphene, Silicene, and Germanene Nanostrips

Data doi: 10.3390/data9020026

Authors: Talia Tene Nataly Bonilla García Miguel Ángel Sáez Paguay John Vera Marco Guevara Cristian Vacacela Gomez Stefano Bellucci

The quest for novel materials with extraordinary electronic and plasmonic properties is an ongoing pursuit in the field of materials science. The dataset provides the results of a computational study that used ab initio and semi-analytical computations to model freestanding nanosystems. We delve into the world of ribbon-like materials, specifically graphene nanoribbons, silicene nanoribbons, and germanene nanoribbons, comparing their electronic and plasmonic characteristics. Our research reveals a myriad of insights, from the tunability of band structures and the influence of an atomic number on electronic properties to the adaptability of nanoribbons for optoelectronic applications. Further, we uncover the promise of these materials for biosensing, demonstrating their plasmon frequency tunability based on charge density and Fermi velocity modification. Our findings not only expand the understanding of these quasi-1D materials but also open new avenues for the development of cutting-edge devices and technologies. This data presentation holds immense potential for future advancements in electronics, optics, and molecular sensing.

Data, Vol. 9, Pages 25: Curating, Collecting, and Cataloguing Global COVID-19 Datasets for the Aim of Predicting Personalized Risk

Sepehr Golriz Khatami — 2024-01-29

Data, Vol. 9, Pages 25: Curating, Collecting, and Cataloguing Global COVID-19 Datasets for the Aim of Predicting Personalized Risk

Data doi: 10.3390/data9020025

Authors: Sepehr Golriz Khatami Astghik Sargsyan Maria Francesca Russo Daniel Domingo-Fernández Andrea Zaliani Abish Kaladharan Priya Sethumadhavan Sarah Mubeen Yojana Gadiya Reagon Karki Stephan Gebel Ram Kumar Ruppa Surulinathan Vanessa Lage-Rupprecht Saulius Archipovas Geltrude Mingrone Marc Jacobs Carsten Claussen Martin Hofmann-Apitius Alpha Tom Kodamullil

Although hundreds of datasets have been published since the beginning of the coronavirus pandemic, there is a lack of centralized resources where these datasets are listed and harmonized to facilitate their applicability and uptake by predictive modeling approaches. Firstly, such a centralized resource provides information about data owners to researchers who are searching datasets to develop their predictive models. Secondly, the harmonization of the datasets supports simultaneously taking advantage of several similar datasets. This, in turn, does not only ease the imperative external validation of data-driven models but can also be used for virtual cohort generation, which helps to overcome data sharing impediments. Here, we present that the COVID-19 data catalogue is a repository that provides a landscape view of COVID-19 studies and datasets as a putative source to enable researchers to develop personalized COVID-19 predictive risk models. The COVID-19 data catalogue currently contains over 400 studies and their relevant information collected from a wide range of global sources such as global initiatives, clinical trial repositories, publications, and data repositories. Further, the curated content stored in this data catalogue is complemented by a web application, providing visualizations of these studies, including their references, relevant information such as measured variables, and the geographical locations of where these studies were performed. This resource is one of the first to capture, organize, and store studies, datasets, and metadata related to COVID-19 in a comprehensive repository. We believe that our work will facilitate future research and development of personalized predictive risk models for COVID-19.

Data, Vol. 9, Pages 24: Mapping Hierarchical File Structures to Semantic Data Models for Efficient Data Integration into Research Data Management Systems

Henrik tom Wörden — 2024-01-26

Data, Vol. 9, Pages 24: Mapping Hierarchical File Structures to Semantic Data Models for Efficient Data Integration into Research Data Management Systems

Data doi: 10.3390/data9020024

Authors: Henrik tom Wörden Florian Spreckelsen Stefan Luther Ulrich Parlitz Alexander Schlemmer

Although other methods exist to store and manage data in modern information technology, the standard solution is file systems. Therefore, keeping well-organized file structures and file system layouts can be key to a sustainable research data management infrastructure. However, file structures alone lack several important capabilities for FAIR data management: the two most significant being insufficient visualization of data and inadequate possibilities for searching and obtaining an overview. Research data management systems (RDMSs) can fill this gap, but many do not support the simultaneous use of the file system and RDMS. This simultaneous use can have many benefits, but keeping data in RDMS in synchrony with the file structure is challenging. Here, we present concepts that allow for keeping file structures and semantic data models (in RDMS) synchronous. Furthermore, we propose a specification in yaml format that allows for a structured and extensible declaration and implementation of a mapping between the file system and data models used in semantic research data management. Implementing these concepts will facilitate the re-use of specifications for multiple use cases. Furthermore, the specification can serve as a machine-readable and, at the same time, human-readable documentation of specific file system structures. We demonstrate our work using the Open Source RDMS LinkAhead (previously named “CaosDB”).

Data, Vol. 9, Pages 23: Comprehensive Dataset on Pre-SARS-CoV-2 Infection Sports-Related Physical Activity Levels, Disease Severity, and Treatment Outcomes: Insights and Implications for COVID-19 Management

Dimitrios I. Bourdas — 2024-01-26

Data, Vol. 9, Pages 23: Comprehensive Dataset on Pre-SARS-CoV-2 Infection Sports-Related Physical Activity Levels, Disease Severity, and Treatment Outcomes: Insights and Implications for COVID-19 Management

Data doi: 10.3390/data9020023

Authors: Dimitrios I. Bourdas Panteleimon Bakirtzoglou Antonios K. Travlos Vasileios Andrianopoulos Emmanouil Zacharakis

This dataset aimed to explore associations between pre-SARS-CoV-2 infection exercise and sports-related physical activity (PA) levels and disease severity, along with treatments administered following the most recent SARS-CoV-2 infection. A comprehensive analysis investigated the relationships between PA categories (“Inactive”, “Low PA”, “Moderate PA”, “High PA”), disease severity (“Sporadic”, “Episodic”, “Recurrent”, “Frequent”, “Persistent”), and treatments post-SARS-CoV-2 infection (“No treatment”, “Home remedies”, “Prescribed medication”, “Hospital admission”, “Intensive care unit admission”) within a sample population (n = 5829) from the Hellenic territory. Utilizing the Active-Q questionnaire, data were collected from February to March 2023, capturing PA habits, participant characteristics, medical history, vaccination status, and illness experiences. Findings revealed an independent relationship between preinfection PA levels and disease severity (χ2 = 9.097, df = 12, p = 0.695). Additionally, a statistical dependency emerged between PA levels and illness treatment categories (χ2 = 39.362, df = 12, p < 0.001), particularly linking inactive PA with home remedies treatment. These results highlight the potential influence of preinfection PA on disease severity and treatment choices following SARS-CoV-2 infection. The dataset offers valuable insights into the interplay between PA, disease outcomes, and treatment decisions, aiding future research in shaping targeted interventions and public health strategies related to COVID-19 management.

Data, Vol. 9, Pages 22: Genomic Epidemiology Dataset for the Important Nosocomial Pathogenic Bacterium Acinetobacter baumannii

Andrey Shelenkov — 2024-01-26

Data, Vol. 9, Pages 22: Genomic Epidemiology Dataset for the Important Nosocomial Pathogenic Bacterium Acinetobacter baumannii

Data doi: 10.3390/data9020022

Authors: Andrey Shelenkov Yulia Mikhaylova Vasiliy Akimkin

The infections caused by various bacterial pathogens both in clinical and community settings represent a significant threat to public healthcare worldwide. The growing resistance to antimicrobial drugs acquired by bacterial species causing healthcare-associated infections has already become a life-threatening danger noticed by the World Health Organization. Several groups or lineages of bacterial isolates, usually called ‘the clones of high risk’, often drive the spread of resistance within particular species. Thus, it is vitally important to reveal and track the spread of such clones and the mechanisms by which they acquire antibiotic resistance and enhance their survival skills. Currently, the analysis of whole-genome sequences for bacterial isolates of interest is increasingly used for these purposes, including epidemiological surveillance and the development of spread prevention measures. However, the availability and uniformity of the data derived from genomic sequences often represent a bottleneck for such investigations. With this dataset, we present the results of a genomic epidemiology analysis of 17,546 genomes of a dangerous bacterial pathogen, Acinetobacter baumannii. Important typing information, including multilocus sequence typing (MLST)-based sequence types (STs), intrinsic blaOXA-51-like gene variants, capsular (KL) and oligosaccharide (OCL) types, CRISPR-Cas systems, and cgMLST profiles are presented, as well as the assignment of particular isolates to nine known international clones of high risk. The presence of antimicrobial resistance genes within the genomes is also reported. These data will be useful for researchers in the field of A. baumannii genomic epidemiology, resistance analysis, and prevention measure development.

Data, Vol. 9, Pages 21: MHAiR: A Dataset of Audio-Image Representations for Multimodal Human Actions

Muhammad Bilal Shaikh — 2024-01-25

Data, Vol. 9, Pages 21: MHAiR: A Dataset of Audio-Image Representations for Multimodal Human Actions

Data doi: 10.3390/data9020021

Authors: Muhammad Bilal Shaikh Douglas Chai Syed Mohammed Shamsul Islam Naveed Akhtar

Audio-image representations for a multimodal human action (MHAiR) dataset contains six different image representations of the audio signals that capture the temporal dynamics of the actions in a very compact and informative way. The dataset was extracted from the audio recordings which were captured from an existing video dataset, i.e., UCF101. Each data sample captured a duration of approximately 10 s long, and the overall dataset was split into 4893 training samples and 1944 testing samples. The resulting feature sequences were then converted into images, which can be used for human action recognition and other related tasks. These images can be used as a benchmark dataset for evaluating the performance of machine learning models for human action recognition and related tasks. These audio-image representations could be suitable for a wide range of applications, such as surveillance, healthcare monitoring, and robotics. The dataset can also be used for transfer learning, where pre-trained models can be fine-tuned on a specific task using specific audio images. Thus, this dataset can facilitate the development of new techniques and approaches for improving the accuracy of human action-related tasks and also serve as a standard benchmark for testing the performance of different machine learning models and algorithms.

Data, Vol. 9, Pages 20: An Optimized Hybrid Approach for Feature Selection Based on Chi-Square and Particle Swarm Optimization Algorithms

Amani Abdo — 2024-01-25

Data, Vol. 9, Pages 20: An Optimized Hybrid Approach for Feature Selection Based on Chi-Square and Particle Swarm Optimization Algorithms

Data doi: 10.3390/data9020020

Authors: Amani Abdo Rasha Mostafa Laila Abdel-Hamid

Feature selection is a significant issue in the machine learning process. Most datasets include features that are not needed for the problem being studied. These irrelevant features reduce both the efficiency and accuracy of the algorithm. It is possible to think about feature selection as an optimization problem. Swarm intelligence algorithms are promising techniques for solving this problem. This research paper presents a hybrid approach for tackling the problem of feature selection. A filter method (chi-square) and two wrapper swarm intelligence algorithms (grey wolf optimization (GWO) and particle swarm optimization (PSO)) are used in two different techniques to improve feature selection accuracy and system execution time. The performance of the two phases of the proposed approach is assessed using two distinct datasets. The results show that PSOGWO yields a maximum accuracy boost of 95.3%, while chi2-PSOGWO yields a maximum accuracy improvement of 95.961% for feature selection. The experimental results show that the proposed approach performs better than the compared approaches.

Data, Vol. 9, Pages 19: Draft Genome Sequence of the Commercial Strain Rhizobium ruizarguesonis bv. viciae RCAM1022

Olga A. Kulaeva — 2024-01-23

Data, Vol. 9, Pages 19: Draft Genome Sequence of the Commercial Strain Rhizobium ruizarguesonis bv. viciae RCAM1022

Data doi: 10.3390/data9020019

Authors: Olga A. Kulaeva Evgeny A. Zorin Anton S. Sulima Gulnar A. Akhtemova Vladimir A. Zhukov

Legume plants enter a symbiosis with soil nitrogen-fixing bacteria (rhizobia), thereby gaining access to assimilable atmospheric nitrogen. Since this symbiosis is important for agriculture, biofertilizers with effective strains of rhizobia are created for crop legumes to increase their yield and minimize the amounts of mineral fertilizers required. In this work, we sequenced and characterized the genome of Rhizobium ruizarguesonis bv. viciae strain RCAM1022, a component of the ‘Rhizotorfin’ biofertilizer produced in Russia and used for pea (Pisum sativum L.).

Data, Vol. 9, Pages 18: Can Data and Machine Learning Change the Future of Basic Income Models? A Bayesian Belief Networks Approach

Hamed Khalili — 2024-01-23

Data, Vol. 9, Pages 18: Can Data and Machine Learning Change the Future of Basic Income Models? A Bayesian Belief Networks Approach

Data doi: 10.3390/data9020018

Authors: Hamed Khalili

Appeals to governments for implementing basic income are contemporary. The theoretical backgrounds of the basic income notion only prescribe transferring equal amounts to individuals irrespective of their specific attributes. However, the most recent basic income initiatives all around the world are attached to certain rules with regard to the attributes of the households. This approach is facing significant challenges to appropriately recognize vulnerable groups. A possible alternative for setting rules with regard to the welfare attributes of the households is to employ artificial intelligence algorithms that can process unprecedented amounts of data. Can integrating machine learning change the future of basic income by predicting households vulnerable to future poverty? In this paper, we utilize multidimensional and longitudinal welfare data comprising one and a half million individuals’ data and a Bayesian beliefs network approach to examine the feasibility of predicting households’ vulnerability to future poverty based on the existing households’ welfare attributes.

Data, Vol. 9, Pages 17: Machine Learning Classification Workflow and Datasets for Ionospheric VLF Data Exclusion

Filip Arnaut — 2024-01-18

Data, Vol. 9, Pages 17: Machine Learning Classification Workflow and Datasets for Ionospheric VLF Data Exclusion

Data doi: 10.3390/data9010017

Authors: Filip Arnaut Aleksandra Kolarski Vladimir A. Srećković

Machine learning (ML) methods are commonly applied in the fields of extraterrestrial physics, space science, and plasma physics. In a prior publication, an ML classification technique, the Random Forest (RF) algorithm, was utilized to automatically identify and categorize erroneous signals, including instrument errors, noisy signals, outlier data points, and the impact of solar flares (SFs) on the ionosphere. This data communication includes the pre-processed dataset used in the aforementioned research, along with a workflow that utilizes the PyCaret library and a post-processing workflow. The code and data serve educational purposes in the interdisciplinary field of ML and ionospheric physics science, as well as being useful to other researchers for diverse objectives.

Data, Vol. 9, Pages 16: Elliott State Research Forest Timber Cruise, Oregon, 2015–2016

Todd West — 2024-01-18

Data, Vol. 9, Pages 16: Elliott State Research Forest Timber Cruise, Oregon, 2015–2016

Data doi: 10.3390/data9010016

Authors: Todd West Bogdan M. Strimbu

The Elliott State Research Forest comprises 33,700 ha of temperate, Douglas-fir rainforest along North America’s Pacific Coast (Oregon, United States). In 2015, naturally regenerated stands at least 92 years old covered 49% of the research area and sawtimber plantations younger than 68 years another 50%. During the winter of 2015–2016, a forest wide inventory sampled both naturally regenerated and plantation stands, recording 97,424 trees on 17,866 plots in 738 stands. The resulting dataset is atypical for the area as plot locations were not restricted to upland, commercially harvestable timber. Multiage stands and riparian areas were therefore documented along with plantations 2–61 years old and trees retained through clearcut harvests. This dataset constitutes the only open access, stand-based forest inventory currently available for a large area within the Oregon Coast Range. The dataset enables development of suites of models as well as many comparisons across stand ages and types, both at stand level and at the level of individual trees.

Data, Vol. 9, Pages 15: Proteomic and Metabolomic Analyses of the Blood Samples of Highly Trained Athletes

Kristina A. Malsagova — 2024-01-16

Data, Vol. 9, Pages 15: Proteomic and Metabolomic Analyses of the Blood Samples of Highly Trained Athletes

Data doi: 10.3390/data9010015

Authors: Kristina A. Malsagova Arthur T. Kopylov Vasiliy I. Pustovoyt Evgenii I. Balakin Ksenia A. Yurku Alexander A. Stepanov Liudmila I. Kulikova Vladimir R. Rudnev Anna L. Kaysheva

High exercise loading causes intricate and ambiguous proteomic and metabolic changes. This study aims to describe the dataset on protein and metabolite contents in plasma samples collected from highly trained athletes across different sports disciplines. The proteomic and metabolomic analyses of the plasma samples of highly trained athletes engaged in sports disciplines of different intensities were carried out using HPLC-MS/MS. The results are reported as two datasets (proteomic data in a derived mgf-file and metabolomic data in processed format), each containing the findings obtained by analyzing 93 mass spectra. Variations in the protein and metabolite contents of the biological samples are observed, depending on the intensity of training load for different sports disciplines. Mass spectrometric proteomic and metabolomic studies can be used for classifying different athlete phenotypes according to the intensity of sports discipline and for the assessment of the efficiency of the recovery period.

Data, Vol. 9, Pages 14: GeMSyD: Generic Framework for Synthetic Data Generation

Ramona Tolas — 2024-01-11

Data, Vol. 9, Pages 14: GeMSyD: Generic Framework for Synthetic Data Generation

Data doi: 10.3390/data9010014

Authors: Ramona Tolas Raluca Portase Rodica Potolea

In the era of data-driven technologies, the need for diverse and high-quality datasets for training and testing machine learning models has become increasingly critical. In this article, we present a versatile methodology, the Generic Methodology for Constructing Synthetic Data Generation (GeMSyD), which addresses the challenge of synthetic data creation in the context of smart devices. GeMSyD provides a framework that enables the generation of synthetic datasets, aligning them closely with real-world data. To demonstrate the utility of GeMSyD, we instantiate the methodology by constructing a synthetic data generation framework tailored to the domain of event-based data modeling, specifically focusing on user interactions with smart devices. Our framework leverages GeMSyD to create synthetic datasets that faithfully emulate the dynamics of human–device interactions, including the temporal dependencies. Furthermore, we showcase how the synthetic data generated using our framework can serve as a valuable resource for machine learning practitioners. By employing these synthetic datasets, we perform a series of experiments to evaluate the performance of a neural-network-based prediction model in the domain of smart device interaction. Our results underscore the potential of synthetic data in facilitating model development and benchmarking.

Data, Vol. 9, Pages 13: Adaptive Forecasting in Energy Consumption: A Bibliometric Analysis and Review

Manuel Jaramillo — 2024-01-11

Data, Vol. 9, Pages 13: Adaptive Forecasting in Energy Consumption: A Bibliometric Analysis and Review

Data doi: 10.3390/data9010013

Authors: Manuel Jaramillo Wilson Pavón Lisbeth Jaramillo

This paper addresses the challenges in forecasting electrical energy in the current era of renewable energy integration. It reviews advanced adaptive forecasting methodologies while also analyzing the evolution of research in this field through bibliometric analysis. The review highlights the key contributions and limitations of current models with an emphasis on the challenges of traditional methods. The analysis reveals that Long Short-Term Memory (LSTM) networks, optimization techniques, and deep learning have the potential to model the dynamic nature of energy consumption, but they also have higher computational demands and data requirements. This review aims to offer a balanced view of current advancements and challenges in forecasting methods, guiding researchers, policymakers, and industry experts. It advocates for collaborative innovation in adaptive methodologies to enhance forecasting accuracy and support the development of resilient, sustainable energy systems.

Data, Vol. 9, Pages 12: DeepSpaceYoloDataset: Annotated Astronomical Images Captured with Smart Telescopes

Olivier Parisot — 2024-01-10

Data, Vol. 9, Pages 12: DeepSpaceYoloDataset: Annotated Astronomical Images Captured with Smart Telescopes

Data doi: 10.3390/data9010012

Authors: Olivier Parisot

Recent smart telescopes allow the automatic collection of a large quantity of data for specific portions of the night sky—with the goal of capturing images of deep sky objects (nebula, galaxies, globular clusters). Nevertheless, human verification is still required afterwards to check whether celestial targets are effectively visible in the images produced by these instruments. Depending on the magnitude of deep sky objects, the observation conditions and the cumulative time of data acquisition, it is possible that only stars are present in the images. In addition, unfavorable external conditions (light pollution, bright moon, etc.) can make capture difficult. In this paper, we describe DeepSpaceYoloDataset, a set of 4696 RGB astronomical images captured by two smart telescopes and annotated with the positions of deep sky objects that are effectively in the images. This dataset can be used to train detection models on this type of image, enabling the better control of the duration of capture sessions, but also to detect unexpected celestial events such as supernova.

Data, Vol. 9, Pages 11: ADAS Simulation Result Dataset Processing Based on Improved BP Neural Network

Songyan Zhao — 2024-01-05

Data, Vol. 9, Pages 11: ADAS Simulation Result Dataset Processing Based on Improved BP Neural Network

Data doi: 10.3390/data9010011

Authors: Songyan Zhao Lingshan Chen Yongchao Huang

The autonomous driving simulation field lacks evaluation and forecasting systems for simulation results. The data obtained from the simulation of target algorithms and vehicle models cannot be reasonably estimated. This problem affects subsequent vehicle improvement and parameter calibration. The authors relied on the simulation results of the AEB algorithm. We selected the BP Neural Network as the basis and improved it with a genetic algorithm optimized via a roulette algorithm. The regression evaluation indicators of the prediction results show that the GA-BP neural network has better prediction accuracy and generalization ability than the original BP neural network and other optimized BP neural networks. This GA-BP neural network also fills the Gap in Evaluation and Prediction Systems.

Data, Vol. 9, Pages 10: Experimental Dataset of Tunable Mode Converter Based on Long-Period Fiber Gratings Written in Few-Mode Fiber: Impacts of Thermal, Wavelength, and Polarization Variations

Juan Soto-Perdomo — 2023-12-31

Data, Vol. 9, Pages 10: Experimental Dataset of Tunable Mode Converter Based on Long-Period Fiber Gratings Written in Few-Mode Fiber: Impacts of Thermal, Wavelength, and Polarization Variations

Data doi: 10.3390/data9010010

Authors: Juan Soto-Perdomo Erick Reyes-Vera Jorge Montoya-Cardona Pedro Torres

Mode division multiplexing (MDM) is currently one of the most attractive multiplexing techniques in optical communications, as it allows for an increase in the number of channels available for data transmission. Optical modal converters are one of the main devices used in this technique. Therefore, the characterization and improvement of these devices are of great current interest. In this work, we present a dataset of 49,736 near-field intensity images of a modal converter based on a long-period fiber grating (LPFG) written on a few-mode fiber (FMF). This characterization was performed experimentally at various wavelengths, polarizations, and temperature conditions when the device converted from LP01 mode to LP11 mode. The results show that the modal converter can be tuned by adjusting these parameters, and that its operation is optimal under specific circumstances which have a great impact on its performance. Additionally, the potential application of the database is validated in this work. A modal decomposition technique based on the particle swarm algorithm (PSO) was employed as a tool for determining the most effective combinations of modal weights and relative phases from the spatial distributions collected in the dataset. The proposed dataset can open up new opportunities for researchers working on image segmentation, detection, and classification problems related to MDM technology. In addition, we implement novel artificial intelligence techniques that can help in finding the optimal operating conditions for this type of device.

Data, Vol. 9, Pages 9: Wi-Gitation: Replica Wi-Fi CSI Dataset for Physical Agitation Activity Recognition

Nikita Sharma — 2023-12-30

Data, Vol. 9, Pages 9: Wi-Gitation: Replica Wi-Fi CSI Dataset for Physical Agitation Activity Recognition

Data doi: 10.3390/data9010009

Authors: Nikita Sharma Jeroen Klein Brinke L. M. A. Braakman Jansen Paul J. M. Havinga Duc V. Le

Agitation is a commonly found behavioral condition in persons with advanced dementia. It requires continuous monitoring to gain insights into agitation levels to assist caregivers in delivering adequate care. The available monitoring techniques use cameras and wearables which are distressful and intrusive and are thus often rejected by older adults. To enable continuous monitoring in older adult care, unobtrusive Wi-Fi channel state information (CSI) can be leveraged to monitor physical activities related to agitation. However, to the best of our knowledge, there are no realistic CSI datasets available for facilitating the classification of physical activities demonstrated during agitation scenarios such as disturbed walking, repetitive sitting–getting up, tapping on a surface, hand wringing, rubbing on a surface, flipping objects, and kicking. Therefore, in this paper, we present a public dataset named Wi-Gitation. For Wi-Gitation, the Wi-Fi CSI data were collected with twenty-three healthy participants depicting the aforementioned agitation-related physical activities at two different locations in a one-bedroom apartment with multiple receivers placed at different distances (0.5–8 m) from the participants. The validation results on the Wi-Gitation dataset indicate higher accuracies (F1-Scores ≥0.95) when employing mixed-data analysis, where the training and testing data share the same distribution. Conversely, in scenarios where the training and testing data differ in distribution (i.e., leave-one-out), the accuracies experienced a notable decline (F1-Scores ≤0.21). This dataset can be used for fundamental research on CSI signals and in the evaluation of advanced algorithms developed for tackling domain invariance in CSI-based human activity recognition.

Data, Vol. 9, Pages 8: DNA Methylome and Transcriptome Maps of Primary Colorectal Cancer and Matched Liver Metastasis

Priyadarshana Ajithkumar — 2023-12-29

Data, Vol. 9, Pages 8: DNA Methylome and Transcriptome Maps of Primary Colorectal Cancer and Matched Liver Metastasis

Data doi: 10.3390/data9010008

Authors: Priyadarshana Ajithkumar Gregory Gimenez Peter A. Stockwell Suzan Almomani Sarah A. Bowden Anna L. Leichter Antonio Ahn Sharon Pattison Sebastian Schmeier Frank A. Frizelle Michael R. Eccles Rachel V. Purcell Euan J. Rodger Aniruddha Chatterjee

Sequencing-based genome-wide DNA methylation, gene expression studies and associated data on paired colorectal cancer (CRC) primary and liver metastasis are very limited. We have profiled the DNA methylome and transcriptome of matched primary CRC and liver metastasis samples from the same patients. Genome-scale methylation and expression levels were examined using Reduced Representation Bisulfite Sequencing (RRBS) and RNA-Seq, respectively. To investigate DNA methylation and expression patterns, we generated a total of 1.01 × 109 RRBS reads and 4.38 × 108 RNA-Seq reads from the matched cancer tissues. Here, we describe in detail the sample features, experimental design, methods and bioinformatic pipeline for these epigenetic data. We demonstrate the quality of both the samples and sequence data obtained from the paired samples. The sequencing data obtained from this study will serve as a valuable resource for studying underlying mechanisms of distant metastasis and the utility of epigenetic profiles in cancer metastasis.

Data, Vol. 9, Pages 7: Data-Driven Analysis of MRI Scans: Exploring Brain Structure Variations in Colombian Adolescent Offenders

Germán Sánchez-Torres — 2023-12-26

Data, Vol. 9, Pages 7: Data-Driven Analysis of MRI Scans: Exploring Brain Structure Variations in Colombian Adolescent Offenders

Data doi: 10.3390/data9010007

Authors: Germán Sánchez-Torres Nallig Leal Mariana Pino

With the advancements in neuroimaging techniques, understanding the relationship between brain morphology and behavioral tendencies such as criminal behavior has garnered interest. This research addresses the investigation of disparities in neuroanatomical structures between adolescent offenders and non-offenders and considers the implications of such distinctions regarding offender behavior within adolescent populations. Employing data-driven methodologies, MRI scans of adolescents from Barranquilla, Colombia, were analyzed to explore morphological variations. Utilizing a 1.5 Tesla Siemens resonator (Siemens Healthineers, Erlangen, Germany), T1-weighted MPRAGE anatomical images were acquired and analyzed using a systematic five-step methodology including data acquisition, MRI pre-processing, feature selection, model selection, and model validation and evaluation. Participants, both offenders and non-offenders, were aged 14–18 and selected based on education, criminal history, and physical conditions. The research identified significant disparities in the volumes of 42 brain structures between adolescent offenders (AOs) and non-offenders (NOs), highlighting particular brain regions potentially associated with offending behavior. Additionally, a considerable proportion of AOs emanated from lower socioeconomic backgrounds and showcased marked substance use. The findings suggest that neuroanatomical disparities potentially correlate with criminal behavior among adolescents at a neurobiological level. Noticeable socio-environmental factors, such as lower socioeconomic status and substance abuse, were substantially prevalent among AOs. Particularly, neurobiological deviations in structures like ctx-lh-rostralmiddlefrontal and ctx-lh-caudalanteriorcingulate perhaps represent a link between neurological factors and external stimuli.

Data, Vol. 9, Pages 6: A Profit Maximization Model for Data Consumers with Data Providers’ Incentives in Personal Data Trading Market

Hyojin Park — 2023-12-25

Data, Vol. 9, Pages 6: A Profit Maximization Model for Data Consumers with Data Providers’ Incentives in Personal Data Trading Market

Data doi: 10.3390/data9010006

Authors: Hyojin Park Hyeontaek Oh Jun Kyun Choi

This paper proposes a profit maximization model for a data consumer when it buys personal data from data providers (by obtaining consent) through data brokers and provides their new services to data providers (i.e., service consumers). To observe the behavioral models of data providers, the data consumer, and service consumers, this paper proposes the willingness-to-sell model of personal data of data providers (which is affected by data providers’ behavior related to explicit consent), the service quality model obtained by the collected personal data from the data consumer’s perspective, and the willingness-to-pay model of service consumers regarding provided new services from the data consumer. Particularly, this paper jointly considers the behavior of data providers and service users under a limited budget. With parameters inspired by real-world surveys on data providers, this paper shows various numerical results to check the feasibility of the proposed models.

Data, Vol. 9, Pages 5: Single-Nucleotide Variants in PADI2 and PADI4 and Ancestry Informative Markers in Interstitial Lung Disease and Rheumatoid Arthritis among a Mexican Mestizo Population

Karol J. Nava-Quiroz — 2023-12-25

Data, Vol. 9, Pages 5: Single-Nucleotide Variants in PADI2 and PADI4 and Ancestry Informative Markers in Interstitial Lung Disease and Rheumatoid Arthritis among a Mexican Mestizo Population

Data doi: 10.3390/data9010005

Authors: Karol J. Nava-Quiroz Jorge Rojas-Serrano Gloria Pérez-Rubio Ivette Buendia-Roldan Mayra Mejía Juan Carlos Fernández-López Espiridión Ramos-Martínez Luis A. López-Flores Alma D. Del Ángel-Pablo Ramcés Falfán-Valencia

Rheumatoid arthritis (RA) is an autoimmune disease mainly characterized by joint inflammation. It presents extra-articular manifestations, with the lungs being one of the affected areas. Among these, damage to the pulmonary interstitium (Interstitial Lung Disease—ILD) has been linked to proteins involved in the inflammatory process and related to extracellular matrix deposition and lung fibrosis establishment. Peptidyl arginine deiminase enzymes (PAD), which carry out protein citrullination, play a role in this context. A genetic association analysis was conducted on genes encoding two PAD isoforms: PAD2 and PAD4. This analysis also included ancestry informative markers and protein level determination in samples from patients with RA, RA-associated ILD, and clinically healthy controls. Significant single nucleotide variants (SNV) and one haplotype were identified as susceptibility factors for RA-ILD development. Elevated levels of PAD4 were found in RA-ILD cases, while PADI2 showed an association with RA susceptibility. This work presents data obtained from previously published research. Population variability has been noticed in genetic association studies. We present data for 14 SNVs that show geographical and genetic variation across the Mexican population, which provides highly informative content and greater intrapopulation genetic diversity. Further investigations in the field should be considered in addition to AIMs. The data presented in this study were analyzed in association with SNV genotypes in PADI2 and PADI4 to assess susceptibility to ILD in RA, as well as with changes in PAD2 and PAD4 protein levels according to carrier genotype, in addition to the use of covariates such as ancestry markers.

Data, Vol. 9, Pages 4: An Urban Traffic Dataset Composed of Visible Images and Their Semantic Segmentation Generated by the CARLA Simulator

Sergio Bemposta Rosende — 2023-12-24

Data, Vol. 9, Pages 4: An Urban Traffic Dataset Composed of Visible Images and Their Semantic Segmentation Generated by the CARLA Simulator

Data doi: 10.3390/data9010004

Authors: Sergio Bemposta Rosende David San José Gavilán Javier Fernández-Andrés Javier Sánchez-Soriano

A dataset of aerial urban traffic images and their semantic segmentation is presented to be used to train computer vision algorithms, among which those based on convolutional neural networks stand out. This article explains the process of creating the complete dataset, which includes the acquisition of the images, the labeling of vehicles, pedestrians, and pedestrian crossings as well as a description of the structure and content of the dataset (which amounts to 8694 images including visible images and those corresponding to the semantic segmentation). The images were generated using the CARLA simulator (but were like those that could be obtained with fixed aerial cameras or by using multi-copter drones) in the field of intelligent transportation management. The presented dataset is available and accessible to improve the performance of vision and road traffic management systems, especially for the detection of incorrect or dangerous maneuvers.

Data, Vol. 9, Pages 3: Unlocking Insights: Analysing COVID-19 Lockdown Policies and Mobility Data in Victoria, Australia, through a Data-Driven Machine Learning Approach

Shiyang Lyu — 2023-12-21

Data, Vol. 9, Pages 3: Unlocking Insights: Analysing COVID-19 Lockdown Policies and Mobility Data in Victoria, Australia, through a Data-Driven Machine Learning Approach

Data doi: 10.3390/data9010003

Authors: Shiyang Lyu Oyelola Adegboye Kiki Adhinugraha Theophilus I. Emeto David Taniar

The state of Victoria, Australia, implemented one of the world’s most prolonged cumulative lockdowns in 2020 and 2021. Although lockdowns have proven effective in managing COVID-19 worldwide, this approach faced challenges in containing the rising infection in Victoria. This study evaluates the effects of short-term (less than 60 days) and long-term (more than 60 days) lockdowns on public mobility and the effectiveness of various social restriction measures within these periods. The aim is to understand the complexities of pandemic management by examining various measures over different lockdown durations, thereby contributing to more effective COVID-19 containment methods. Using restriction policy, community mobility, and COVID-19 data, a machine-learning-based simulation model was proposed, incorporating analysis of correlation, infection doubling time, and effective lockdown date. The model result highlights the significant impact of public event cancellations in preventing COVID-19 infection during short- and long-term lockdowns and the importance of international travel controls in long-term lockdowns. The effectiveness of social restriction was found to decrease significantly with the transition from short to long lockdowns, characterised by increased visits to public places and increased use of public transport, which may be associated with an increase in the effective reproduction number (Rt) and infected cases.

Data, Vol. 9, Pages 2: Medical Opinions Analysis about the Decrease of Autopsies Using Emerging Pattern Mining

Isaac Machorro-Cano — 2023-12-21

Data, Vol. 9, Pages 2: Medical Opinions Analysis about the Decrease of Autopsies Using Emerging Pattern Mining

Data doi: 10.3390/data9010002

Authors: Isaac Machorro-Cano Ingrid Aylin Ríos-Méndez José Antonio Palet-Guzmán Nidia Rodríguez-Mazahua Lisbeth Rodríguez-Mazahua Giner Alor-Hernández José Oscar Olmedo-Aguirre

An autopsy is a widely recognized procedure to guarantee ongoing enhancements in medicine. It finds extensive application in legal, scientific, medical, and research domains. However, declining autopsy rates in hospitals constitute a worldwide concern. For example, the Regional Hospital of Rio Blanco in Veracruz, Mexico, has substantially reduced the number of autopsies at hospitals in recent years. Since there are no documented historical records of a decrease in the frequency of autopsy cases, it is crucial to establish a methodological framework to substantiate any actual trends in the data. Emerging pattern mining (EPM) allows for finding differences between classes or data sets because it builds a descriptive data model concerning some given remarkable property. Data set description has become a significant application area in various contexts in recent years. In this research study, various EPM (emerging pattern mining) algorithms were used to extract emergent patterns from a data set collected based on medical experts’ perspectives on reducing hospital autopsies. Notably, the top-performing EPM algorithms were iEPMiner, LCMine, SJEP-C, Top-k minimal SJEPs, and Tree-based JEP-C. Among these, iEPMiner and LCMine demonstrated faster performance and produced superior emergent patterns when considering metrics such as Confidence, Weighted Relative Accuracy Criteria (WRACC), False Positive Rate (FPR), and True Positive Rate (TPR).

Data, Vol. 9, Pages 1: Expert-Annotated Dataset to Study Cyberbullying in Polish Language

Michal Ptaszynski — 2023-12-20

Data, Vol. 9, Pages 1: Expert-Annotated Dataset to Study Cyberbullying in Polish Language

Data doi: 10.3390/data9010001

Authors: Michal Ptaszynski Agata Pieciukiewicz Pawel Dybala Pawel Skrzek Kamil Soliwoda Marcin Fortuna Gniewosz Leliwa Michal Wroczynski

We introduce the first dataset of harmful and offensive language collected from the Polish Internet. This dataset was meticulously curated to facilitate the exploration of harmful online phenomena such as cyberbullying and hate speech, which have exhibited a significant surge both within the Polish Internet as well as globally. The dataset was systematically collected and then annotated using two approaches. First, it was annotated by two proficient layperson volunteers, operating under the guidance of a specialist in the language of cyberbullying and hate speech. To enhance the precision of the annotations, a secondary round of annotations was carried out by a team of adept annotators with specialized long-term expertise in cyberbullying and hate speech annotations. This second phase was further overseen by an experienced annotator, acting as a super-annotator. In its initial application, the dataset was leveraged for the categorization of cyberbullying instances in the Polish language. Specifically, the dataset serves as the foundation for two distinct tasks: (1) a binary classification that segregates harmful and non-harmful messages and (2) a multi-class classification that distinguishes between two variations of harmful content (cyberbullying and hate speech), as well as a non-harmful category. Alongside the dataset itself, we also provide the models that showed satisfying classification performance. These models are made accessible for third-party use in constructing cyberbullying prevention systems.

Data, Vol. 8, Pages 187: Genome Sequence of the Plant-Growth-Promoting Endophyte Curtobacterium flaccumfaciens Strain W004

Vladimir K. Chebotar — 2023-12-09

Data, Vol. 8, Pages 187: Genome Sequence of the Plant-Growth-Promoting Endophyte Curtobacterium flaccumfaciens Strain W004

Data doi: 10.3390/data8120187

Authors: Vladimir K. Chebotar Maria S. Gancheva Elena P. Chizhevskaya Maria E. Baganova Oksana V. Keleinikova Kharon A. Husainov Veronika N. Pishchik

We report the whole-genome sequences of the endophyte Curtobacterium flaccumfaciens strain W004 isolated from the seeds of winter wheat, cv. Bezostaya 100. The genome was obtained using Oxford Nanopore MinION sequencing. The bacterium has a circular chromosome consisting of 3.63 kbp with a G+C% content of 70.89%. We found that Curtobacterium flaccumfaciens strain W004 could promote the growth of spring wheat plants, resulting in an increase in grain yield of 54.3%. Sequencing the genome of this new strain can provide insights into its potential role in plant–microbe interactions.

Data, Vol. 8, Pages 186: A Qualitative Dataset for Coffee Bio-Aggressors Detection Based on the Ancestral Knowledge of the Cauca Coffee Farmers in Colombia

Juan Felipe Valencia-Mosquera — 2023-12-08

Data, Vol. 8, Pages 186: A Qualitative Dataset for Coffee Bio-Aggressors Detection Based on the Ancestral Knowledge of the Cauca Coffee Farmers in Colombia

Data doi: 10.3390/data8120186

Authors: Juan Felipe Valencia-Mosquera David Griol Mayra Solarte-Montoya Cristhian Figueroa Juan Carlos Corrales David Camilo Corrales

This paper describes a novel qualitative dataset regarding coffee pests based on the ancestral knowledge of coffee farmers in the Department of Cauca, Colombia. The dataset has been obtained from a survey applied to coffee growers with 432 records and 41 variables collected weekly from September 2020 to August 2021. The qualitative dataset includes climatic conditions, productive activities, external conditions, and coffee bio-aggressors. This dataset allows researchers to find patterns for coffee crop protection through the ancestral knowledge not detected by real-time agricultural sensors. As far as we are concerned, there are no datasets like the one presented in this paper with similar characteristics of qualitative value that express the empirical knowledge of coffee farmers used to detect triggers of causal behaviors of pests and diseases in coffee crops.

Data, Vol. 8, Pages 185: Land Cover Classification in the Antioquia Region of the Tropical Andes Using NICFI Satellite Data Program Imagery and Semantic Segmentation Techniques

Luisa F. Gomez-Ossa — 2023-12-04

Data, Vol. 8, Pages 185: Land Cover Classification in the Antioquia Region of the Tropical Andes Using NICFI Satellite Data Program Imagery and Semantic Segmentation Techniques

Data doi: 10.3390/data8120185

Authors: Luisa F. Gomez-Ossa German Sanchez-Torres John W. Branch-Bedoya

Land cover classification, generated from satellite imagery through semantic segmentation, has become fundamental for monitoring land use and land cover change (LULCC). The tropical Andes territory provides opportunities due to its significance in the provision of ecosystem services. However, the lack of reliable data for this region, coupled with challenges arising from its mountainous topography and diverse ecosystems, hinders the description of its coverage. Therefore, this research proposes the Tropical Andes Land Cover Dataset (TALANDCOVER). It is constructed from three sample strategies: aleatory, minimum 50%, and 70% of representation per class, which address imbalanced geographic data. Additionally, the U-Net deep learning model is applied for enhanced and tailored classification of land covers. Using high-resolution data from the NICFI program, our analysis focuses on the Department of Antioquia in Colombia. The TALANDCOVER dataset, presented in TIF format, comprises multiband R-G-B-NIR images paired with six labels (dense forest, grasslands, heterogeneous agricultural areas, bodies of water, built-up areas, and bare-degraded lands) with an estimated 0.76 F1 score compared to ground truth data by expert knowledge and surpassing the precision of existing global cover maps for the study area. To the best of our knowledge, this work is a pioneer in its release of open-source data for segmenting coverages with pixel-wise labeled NICFI imagery at a 4.77 m resolution. The experiments carried out with the application of the sample strategies and models show F1 score values of 0.70, 0.72, and 0.74 for aleatory, balanced 50%, and balanced 70%, respectively, over the expert segmented sample (ground truth), which suggests that the personalized application of our deep learning model, together with the TALANDCOVER dataset offers different possibilities that facilitate the training of deep architectures for the classification of large-scale covers in complex areas, such as the tropical Andes. This advance has significant potential for decision making, emphasizing sustainable land use and the conservation of natural resources.

Data, Vol. 8, Pages 184: An Urban Image Stimulus Set Generated from Social Media

Ardaman Kaur — 2023-12-01

Data, Vol. 8, Pages 184: An Urban Image Stimulus Set Generated from Social Media

Data doi: 10.3390/data8120184

Authors: Ardaman Kaur André Leite Rodrigues Sarah Hoogstraten Diego Andrés Blanco-Mora Bruno Miranda Paulo Morgado Dar Meshi

Social media data, such as photos and status posts, can be tagged with location information (geotagging). This geotagged information can be used for urban spatial analysis to explore neighborhood characteristics or mobility patterns. With increasing rural-to-urban migration, there is a need for comprehensive data capturing the complexity of urban settings and their influence on human experiences. Here, we share an urban image stimulus set from the city of Lisbon that researchers can use in their experiments. The stimulus set consists of 160 geotagged urban space photographs extracted from the Flickr social media platform. We divided the city into 100 × 100 m cells to calculate the cell image density (number of images in each cell) and the cell green index (Normalized Difference Vegetation Index of each cell) and assigned these values to each geotagged image. In addition, we also computed the popularity of each image (normalized views on the social network). We also categorized these images into two putative groups by photographer status (residents and tourists), with 80 images belonging to each group. With the rise in data-driven decisions in urban planning, this stimulus set helps explore human–urban environment interaction patterns, especially if complemented with survey/neuroimaging measures or machine-learning analyses.

Data, Vol. 8, Pages 183: Spectrogram Dataset of Korean Smartphone Audio Files Forged Using the “Mix Paste” Command

Yeongmin Son — 2023-12-01

Data, Vol. 8, Pages 183: Spectrogram Dataset of Korean Smartphone Audio Files Forged Using the “Mix Paste” Command

Data doi: 10.3390/data8120183

Authors: Yeongmin Son Won Jun Kwak Jae Wan Park

This study focuses on the field of voice forgery detection, which is increasing in importance owing to the introduction of advanced voice editing technologies and the proliferation of smartphones. This study introduces a unique dataset that was built specifically to identify forgeries created using the “Mix Paste” technique. This editing technique can overlay audio segments from similar or different environments without creating a new timeframe, making it nearly infeasible to detect forgeries using traditional methods. The dataset consists of 4665 and 45,672 spectrogram images from 1555 original audio files and 15,224 forged audio files, respectively. The original audio was recorded using iPhone and Samsung Galaxy smartphones to ensure a realistic sampling environment. The forged files were created from these recordings and subsequently converted into spectrograms. The dataset also provided the metadata of the original voice files, offering additional context and information that could be used for analysis and detection. This dataset not only fills a gap in existing research but also provides valuable support for developing more efficient deep learning models for voice forgery detection. By addressing the “Mix Paste” technique, the dataset caters to a critical need in voice authentication and forensics, potentially contributing to enhancing security in society.

Data, Vol. 8, Pages 182: An Automated Big Data Quality Anomaly Correction Framework Using Predictive Analysis

Widad Elouataoui — 2023-12-01

Data, Vol. 8, Pages 182: An Automated Big Data Quality Anomaly Correction Framework Using Predictive Analysis

Data doi: 10.3390/data8120182

Authors: Widad Elouataoui Saida El Mendili Youssef Gahi

Big data has emerged as a fundamental component in various domains, enabling organizations to extract valuable insights and make informed decisions. However, ensuring data quality is crucial for effectively using big data. Thus, big data quality has been gaining more attention in recent years by researchers and practitioners due to its significant impact on decision-making processes. However, existing studies addressing data quality anomalies often have a limited scope, concentrating on specific aspects such as outliers or inconsistencies. Moreover, many approaches are context-specific, lacking a generic solution applicable across different domains. To the best of our knowledge, no existing framework currently automatically addresses quality anomalies comprehensively and generically, considering all aspects of data quality. To fill the gaps in the field, we propose a sophisticated framework that automatically corrects big data quality anomalies using an intelligent predictive model. The proposed framework comprehensively addresses the main aspects of data quality by considering six key quality dimensions: Accuracy, Completeness, Conformity, Uniqueness, Consistency, and Readability. Moreover, the framework is not correlated to a specific field and is designed to be applicable across various areas, offering a generic approach to address data quality anomalies. The proposed framework was implemented on two datasets and has achieved an accuracy of 98.22%. Moreover, the results have shown that the framework has allowed the data quality to be boosted to a great score, reaching 99%, with an improvement rate of up to 14.76% of the quality score.

Data, Vol. 8, Pages 181: Internationalization in the Baltic Regional Accounts: A NUTS 3 Region Dataset

Rasmus Bøgh Holmen — 2023-11-30

Data, Vol. 8, Pages 181: Internationalization in the Baltic Regional Accounts: A NUTS 3 Region Dataset

Data doi: 10.3390/data8120181

Authors: Rasmus Bøgh Holmen Nicolas Gavoille Jaan Masso Arūnas Burinskas

Features of internationalization, such as trade, foreign direct investments, and international migration, are crucial for understanding the economic developments of small and open economies. However, studying internationalization at the country level may obscure significant heterogeneity in its relationship with economic growth and other economic and social outcomes. Regional accounts provide insights into the geography of internationalization, but collections of such disaggregated statistics are rarely provided by statistical bureaus. The purpose of this paper is twofold. First, we demonstrate how regional account data, including internationalization indicators, can be constructed to obtain consistent and homogeneous regional-level series using a combination of micro and macro data sources. Second, our aim is to foster spatial research on internationalization and the spatial economy in the Baltics by providing comprehensive data collection of socio-economic variables at the NUTS 3 regional level over time. This collection encompasses trade, FDI, and migration, enabling the study of internationalization and other features of the Baltic economy. We present a series of key features, revealing noticeable correlation patterns between regional development and internationalization.

Data, Vol. 8, Pages 180: Public Perception of ChatGPT and Transfer Learning for Tweets Sentiment Analysis Using Wolfram Mathematica

Yankang Su — 2023-11-28

Data, Vol. 8, Pages 180: Public Perception of ChatGPT and Transfer Learning for Tweets Sentiment Analysis Using Wolfram Mathematica

Data doi: 10.3390/data8120180

Authors: Yankang Su Zbigniew J. Kabala

Understanding public opinion on ChatGPT is crucial for recognizing its strengths and areas of concern. By utilizing natural language processing (NLP), this study delves into tweets regarding ChatGPT to determine temporal patterns, content features, and topic modeling and perform a sentiment analysis. Analyzing a dataset of 500,000 tweets, our research shifts from conventional data science tools like Python and R to exploit Wolfram Mathematica’s robust capabilities. Additionally, with the aim of solving the problem of ignoring semantic information in the LDA model feature extraction, a synergistic methodology entwining LDA, GloVe embeddings, and K-Nearest Neighbors (KNN) clustering is proposed to categorize topics within ChatGPT-related tweets. This comprehensive strategy ensures semantic, syntactic, and topical congruence within classified groups by utilizing the strengths of probabilistic modeling, semantic embeddings, and similarity-based clustering. While built-in sentiment classifiers often fall short in accuracy, we introduce four transfer learning techniques from the Wolfram Neural Net Repository to address this gap. Two of these techniques involve transferring static word embeddings, “GloVe” and “ConceptNet”, which are further processed using an LSTM layer. The remaining techniques center on fine-tuning pre-trained models using scantily annotated data; one refines embeddings from language models (ELMo), while the other fine-tunes bidirectional encoder representations from transformers (BERT). Our experiments on the dataset underscore the effectiveness of the four methods for the sentiment analysis of tweets. This investigation augments our comprehension of user sentiment towards ChatGPT and emphasizes the continued significance of exploration in this domain. Furthermore, this work serves as a pivotal reference for scholars who are accustomed to using Wolfram Mathematica in other research domains, aiding their efforts in text analytics on social media platforms.

Data, Vol. 8, Pages 179: A Tourist-Based Framework for Developing Digital Marketing for Small and Medium-Sized Enterprises in the Tourism Sector in Saudi Arabia

Rishaa Abdulaziz Alnajim — 2023-11-28

Data, Vol. 8, Pages 179: A Tourist-Based Framework for Developing Digital Marketing for Small and Medium-Sized Enterprises in the Tourism Sector in Saudi Arabia

Data doi: 10.3390/data8120179

Authors: Rishaa Abdulaziz Alnajim Bahjat Fakieh

Social media has become an essential tool for travel planning, with tourists increasingly using it to research destinations, book accommodation, and make travel arrangements. However, little is known about how tourists use social media for travel planning and what factors influence their intentions to use social media for this purpose. This thesis aims to understand tourists’ intentions to use social media for travel planning. Specifically, it investigates the factors influencing tourists’ intentions to use social media for planning travel to Saudi Arabia. It develops a machine learning (ML) classification model to assist Saudi tourism SMEs in creating effective digital marketing strategies for social media platforms. A survey was conducted with 573 tourists interested in visiting Saudi Arabia, using the Design Science Research (DSR) approach. The findings support the tourist-based theoretical framework, showing that perceived usefulness (PU), perceived ease of use (PEOU), satisfaction (SAT), marketing-generated content (MGC), and user-generated content (UGC) significantly impact tourists’ intentions to use social media for travel planning. Tourists’ characteristics and visit characteristics influenced their intentions to use MGC but not UGC. The tourist-based ML classification model, developed using the LinearSVC algorithm, achieved an accuracy of 99% when evaluated using the K-Fold Cross-Validation (KF-CV) technique. The findings of this study have several implications for Saudi tourism SMEs. First, the results suggest that SMEs should focus on developing social media content that is perceived as useful, easy to use, and satisfying. Second, the findings suggest that SMEs should focus on using MGC in their social media marketing campaigns. Third, the results suggest that SMEs should tailor their social media marketing campaigns to the characteristics of their target tourists. This study contributes to the literature on tourism marketing and social media by providing a better understanding of how tourists use social media for travel planning. Saudi tourism SMEs can use the findings of this study to develop more effective digital marketing strategies for social media platforms.

Data, Vol. 8, Pages 178: In Vivo Drug Testing during Embryonic Wound Healing: Establishing the Avian Model

Martin Bablok — 2023-11-25

Data, Vol. 8, Pages 178: In Vivo Drug Testing during Embryonic Wound Healing: Establishing the Avian Model

Data doi: 10.3390/data8120178

Authors: Martin Bablok Beate Brand-Saberi Morris Gellisch Gabriela Morosan-Puopolo

The relevance of identifying pathological processes in the context of embryonic development is increasingly gaining attention in terms of professionalized prenatal care. To analyze local effects of prenatally administered drugs during embryonic development, the model organism of the chicken embryo can be used in a first exploratory approach. For the examination of local dexamethasone administration—as an exemplary drug—common bead implantation protocols have been adapted to serve as an in vivo technique for local drug testing during embryonic skin regeneration. For this, acrylic beads were soaked in a dexamethasone solution and implanted into skin incisional wounds of 4-day-old chicken embryos. After further incubation, the effects of the applied substance on the process of embryonic skin regeneration were analyzed using histological and molecular biological techniques. This data descriptor contains a detailed microsurgical protocol, a representative video demonstration, and exemplary results of local glucocorticoid-induced changes during embryonic wound healing. To conclude, this method allows for the analysis of the local effects of a particular substance on a cellular level and can be extended to serve as an in vivo technique for numerous other drugs to be tested on embryonic tissue.

Data, Vol. 8, Pages 177: Dataset: Impact of β-Galactosylceramidase Overexpression on the Protein Profile of Braf(V600E) Mutated Melanoma Cells

Davide Capoferri — 2023-11-24

Data, Vol. 8, Pages 177: Dataset: Impact of β-Galactosylceramidase Overexpression on the Protein Profile of Braf(V600E) Mutated Melanoma Cells

Data doi: 10.3390/data8120177

Authors: Davide Capoferri Paola Chiodelli Stefano Calza Marcello Manfredi Marco Presta

β-Galactosylceramidase (GALC) is a lysosomal enzyme involved in sphingolipid metabolism by removing β-galactosyl moieties from β-galactosyl ceramide and β-galactosyl sphingosine. Previous observations have shown that GALC exerts a pro-oncogenic activity in human melanoma. Here, the impact of GALC overexpression on the proteomic landscape of BRAF-mutated A2058 and A375 human melanoma cell lines was investigated by liquid chromatography–tandem mass spectrometry analysis of the cell extracts. The results indicate that GALC overexpression causes the upregulation/downregulation of 172/99 proteins in GALC-transduced cells when compared to control cells. Gene ontology categorization of up/down-regulated proteins indicates that GALC may modulate the protein landscape in BRAF-mutated melanoma cells by affecting various biological processes, including RNA metabolism, cell organelle fate, and intracellular redox status. Overall, these data provide further insights into the pro-oncogenic functions of the sphingolipid metabolizing enzyme GALC in human melanoma.

Data, Vol. 8, Pages 176: Model Design and Applied Methodology in Geothermal Simulations in Very Low Enthalpy for Big Data Applications

Roberto Arranz-Revenga — 2023-11-23

Data, Vol. 8, Pages 176: Model Design and Applied Methodology in Geothermal Simulations in Very Low Enthalpy for Big Data Applications

Data doi: 10.3390/data8120176

Authors: Roberto Arranz-Revenga María Pilar Dorrego de Luxán Juan Herrera Herbert Luis Enrique García Cambronero

Low-enthalpy geothermal installations for heating, air conditioning, and domestic hot water are gaining traction due to efforts towards energy decarbonization. This article is part of a broader research project aimed at employing artificial intelligence and big data techniques to develop a predictive system for the thermal behavior of the ground in very low-enthalpy geothermal applications. In this initial article, a summarized process is outlined to generate large quantities of synthetic data through a ground simulation method. The proposed theoretical model allows simulation of the soil’s thermal behavior using an electrical equivalent. The electrical circuit derived is loaded into a simulation program along with an input function representing the system’s thermal load pattern. The simulator responds with another function that calculates the values of the ground over time. Some examples of value conversion and the utility of the input function system to encode thermal loads during simulation are demonstrated. It bears the limitation of invalidity in the presence of underground water currents. Model validation is pending, and once defined, a corresponding testing plan will be proposed for its validation.

Data, Vol. 8, Pages 175: Long-Term Spatiotemporal Oceanographic Data from the Northeast Pacific Ocean: 1980–2022 Reconstruction Based on the Korea Oceanographic Data Center (KODC) Dataset

Seong-Hyeon Kim — 2023-11-23

Data, Vol. 8, Pages 175: Long-Term Spatiotemporal Oceanographic Data from the Northeast Pacific Ocean: 1980–2022 Reconstruction Based on the Korea Oceanographic Data Center (KODC) Dataset

Data doi: 10.3390/data8120175

Authors: Seong-Hyeon Kim Hansoo Kim

The Korea Oceanographic Data Center (KODC), overseen by the National Institute of Fisheries Science (NIFS), is a pivotal hub for collecting, processing, and disseminating marine science data. By digitizing and subjecting observational data to rigorous quality control, the KODC ensures accurate information in line with international standards. The center actively engages in global partnerships and fosters marine data exchange. A wide array of marine information is provided through the KODC website, including observational metadata, coastal oceanographic data, real-time buoy records, and fishery environmental data. Coastal oceanographic observational data from 207 stations across various sea regions have been collected biannually since 1961. This dataset covers 14 standard water depths; includes essential parameters, such as temperature, salinity, nutrients, and pH; serves as the foundation for news, reports, and analyses by the NIFS; and is widely employed to study seasonal and regional marine variations, with researchers supplementing the limited data for comprehensive insights. The dataset offers information for each water depth at a 1 m interval over 1980–2022, facilitating research across disciplines. Data processing, including interpolation and quality control, is based on MATLAB. These data are classified by region and accessible online; hence, researchers can easily explore spatiotemporal trends in marine environments.

Data, Vol. 8, Pages 174: Machine Learning Applications to Identify Young Offenders Using Data from Cognitive Function Tests

María Claudia Bonfante — 2023-11-21

Data, Vol. 8, Pages 174: Machine Learning Applications to Identify Young Offenders Using Data from Cognitive Function Tests

Data doi: 10.3390/data8120174

Authors: María Claudia Bonfante Juan Contreras Montes Mariana Pino Ronald Ruiz Gabriel González

Machine learning techniques can be used to identify whether deficits in cognitive functions contribute to antisocial and aggressive behavior. This paper initially presents the results of tests conducted on delinquent and nondelinquent youths to assess their cognitive functions. The dataset extracted from these assessments, consisting of 37 predictor variables and one target, was used to train three algorithms which aim to predict whether the data correspond to those of a young offender or a nonoffending youth. Prior to this, statistical tests were conducted on the data to identify characteristics which exhibited significant differences in order to select the most relevant features and optimize the prediction results. Additionally, other feature selection methods, such as Boruta, RFE, and filter, were applied, and their effects on the accuracy of each of the three machine learning models used (SVM, RF, and KNN) were compared. In total, 80% of the data were utilized for training, while the remaining 20% were used for validation. The best result was achieved by the K-NN model, trained with 19 features selected by the Boruta method, followed by the SVM model, trained with 24 features selected by the filter method.

Data, Vol. 8, Pages 173: Biodiversity of Terrestrial Testate Amoebae in Western Siberia Lowland Peatlands

Damir Saldaev — 2023-11-17

Data, Vol. 8, Pages 173: Biodiversity of Terrestrial Testate Amoebae in Western Siberia Lowland Peatlands

Data doi: 10.3390/data8110173

Authors: Damir Saldaev Kirill Babeshko Viktor Chernyshov Anton Esaulov Xiuyuan Gu Nikita Kriuchkov Natalia Mazei Nailia Saldaeva Jiahui Su Andrey Tsyganov Basil Yakimov Svetlana Yushkovets Yuri Mazei

Testate amoebae are unicellular eukaryotic organisms covered with an external skeleton called a shell. They are an important component of many terrestrial ecosystems, especially peatlands, where they can be preserved in peat deposits and used as a proxy of surface wetness in paleoecological reconstructions. Here, we represent a database from a vast but poorly studied region of the Western Siberia Lowland containing information on TA occurrences in relation to substrate moisture and WTD. The dataset includes 88 species from 32 genera, with 2181 incidences and 21,562 counted individuals. All samples were collected in oligotrophic peatlands and prepared using the method of wet sieving with a subsequent sedimentation of aqueous suspensions. This database contributes to the understanding of the distribution of testate amoebae and can be further used in large-scale investigations.

Data, Vol. 8, Pages 172: Testate Amoebae (Amphitremida, Arcellinida, Euglyphida) in Sphagnum Bogs: The Dataset from Eastern Fennoscandia

Aleksandr Ivanovskii — 2023-11-15

Data, Vol. 8, Pages 172: Testate Amoebae (Amphitremida, Arcellinida, Euglyphida) in Sphagnum Bogs: The Dataset from Eastern Fennoscandia

Data doi: 10.3390/data8110172

Authors: Aleksandr Ivanovskii Kirill Babeshko Viktor Chernyshov Anton Esaulov Aleksandr Komarov Elena Malysheva Natalia Mazei Diana Meskhadze Damir Saldaev Andrey N. Tsyganov Yuri Mazei

The paper describes a dataset, comprising 236 surface moss samples and 143 testate amoeba taxa. The samples were collected in 11 Sphagnum-dominated bogs during frost-free seasons of 2004, 2007, 2009, 2017, and 2022. For the whole dataset, the sampling effort was sufficient in terms of observed species richness (143 species in total), though a regional species pool is deemed to be discovered incompletely (143 species is its lower 95 % confidence limit using Chao’s estimator). The local community composition demonstrated high heterogeneity in a reduced ordination space. It supports the opinion that the high versatility of bog ecosystems should be taken into account during ecological studies.

Data, Vol. 8, Pages 171: ChatGPT across Arabic Twitter: A Study of Topics, Sentiments, and Sarcasm

Shahad Al-Khalifa — 2023-11-14

Data, Vol. 8, Pages 171: ChatGPT across Arabic Twitter: A Study of Topics, Sentiments, and Sarcasm

Data doi: 10.3390/data8110171

Authors: Shahad Al-Khalifa Fatima Alhumaidhi Hind Alotaibi Hend S. Al-Khalifa

While ChatGPT has gained global significance and widespread adoption, its exploration within specific cultural contexts, particularly within the Arab world, remains relatively limited. This study investigates the discussions among early Arab users in Arabic tweets related to ChatGPT, focusing on topics, sentiments, and the presence of sarcasm. Data analysis and topic-modeling techniques were employed to examine 34,760 Arabic tweets collected using specific keywords. This study revealed a strong interest within the Arabic-speaking community in ChatGPT technology, with prevalent discussions spanning various topics, including controversies, regional relevance, fake content, and sector-specific dialogues. Despite the enthusiasm, concerns regarding ethical risks and negative implications of ChatGPT’s emergence were highlighted, indicating apprehension toward advanced artificial intelligence (AI) technology in language generation. Region-specific discussions underscored the diverse adoption of AI applications and ChatGPT technology. Sentiment analysis of the tweets demonstrated a predominantly neutral sentiment distribution (92.8%), suggesting a focus on objectivity and factuality over emotional expression. The prevalence of neutral sentiments indicated a preference for evidence-based reasoning and logical arguments, fostering constructive discussions influenced by cultural norms. Sarcasm was found in 4% of the tweets, distributed across various topics but not dominating the conversation. This study’s implications include the need for AI developers to address ethical concerns and the importance of educating users about the technology’s ethical considerations and risks. Policymakers should consider the regional relevance and potential scams, emphasizing the necessity for ethical guidelines and regulations.

Data, Vol. 8, Pages 170: Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German

Sascha Wolfer — 2023-11-10

Data, Vol. 8, Pages 170: Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German

Data doi: 10.3390/data8110170

Authors: Sascha Wolfer Alexander Koplenig Marc Kupietz Carolin Müller-Spitzer

We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.

Data, Vol. 8, Pages 169: Machine Learning for Credit Risk Prediction: A Systematic Literature Review

Jomark Pablo Noriega — 2023-11-07

Data, Vol. 8, Pages 169: Machine Learning for Credit Risk Prediction: A Systematic Literature Review

Data doi: 10.3390/data8110169

Authors: Jomark Pablo Noriega Luis Antonio Rivera José Alfredo Herrera

In this systematic review of the literature on using Machine Learning (ML) for credit risk prediction, we raise the need for financial institutions to use Artificial Intelligence (AI) and ML to assess credit risk, analyzing large volumes of information. We posed research questions about algorithms, metrics, results, datasets, variables, and related limitations in predicting credit risk. In addition, we searched renowned databases responding to them and identified 52 relevant studies within the credit industry of microfinance. Challenges and approaches in credit risk prediction using ML models were identified; we had difficulties with the implemented models such as the black box model, the need for explanatory artificial intelligence, the importance of selecting relevant features, addressing multicollinearity, and the problem of the imbalance in the input data. By answering the inquiries, we identified that the Boosted Category is the most researched family of ML models; the most commonly used metrics for evaluation are Area Under Curve (AUC), Accuracy (ACC), Recall, precision measure F1 (F1), and Precision. Research mainly uses public datasets to compare models, and private ones to generate new knowledge when applied to the real world. The most significant limitation identified is the representativeness of reality, and the variables primarily used in the microcredit industry are data related to the Demographic, Operation, and Payment behavior. This study aims to guide developers of credit risk management tools and software towards the existing ability of ML methods, metrics, and techniques used to forecast it, thereby minimizing possible losses due to default and guiding risk appetite.

Data, Vol. 8, Pages 168: Applying Eye Tracking with Deep Learning Techniques for Early-Stage Detection of Autism Spectrum Disorders

Zeyad A. T. Ahmed — 2023-11-03

Data, Vol. 8, Pages 168: Applying Eye Tracking with Deep Learning Techniques for Early-Stage Detection of Autism Spectrum Disorders

Data doi: 10.3390/data8110168

Authors: Zeyad A. T. Ahmed Eid Albalawi Theyazn H. H. Aldhyani Mukti E. Jadhav Prachi Janrao Mansour Ratib Mohammad Obeidat

Autism spectrum disorder (ASD) poses a complex challenge to researchers and practitioners, with its multifaceted etiology and varied manifestations. Timely intervention is critical in enhancing the developmental outcomes of individuals with ASD. This paper underscores the paramount significance of early detection and diagnosis as a pivotal precursor to effective intervention. To this end, integrating advanced technological tools, specifically eye-tracking technology and deep learning algorithms, is investigated for its potential to discriminate between children with ASD and their typically developing (TD) peers. By employing these methods, the research aims to contribute to refining early detection strategies and support mechanisms. This study introduces innovative deep learning models grounded in convolutional neural network (CNN) and recurrent neural network (RNN) architectures, employing an eye-tracking dataset for training. Of note, performance outcomes have been realised, with the bidirectional long short-term memory (BiLSTM) achieving an accuracy of 96.44%, the gated recurrent unit (GRU) attaining 97.49%, the CNN-LSTM hybridising to 97.94%, and the LSTM achieving the most remarkable accuracy result of 98.33%. These outcomes underscore the efficacy of the applied methodologies and the potential of advanced computational frameworks in achieving substantial accuracy levels in ASD detection and classification.

Data, Vol. 8, Pages 166: A Scalable Data Structure for Efficient Graph Analytics and In-Place Mutations

Soukaina Firmli — 2023-11-03

Data, Vol. 8, Pages 166: A Scalable Data Structure for Efficient Graph Analytics and In-Place Mutations

Data doi: 10.3390/data8110166

Authors: Soukaina Firmli Dalila Chiadmi

The graph model enables a broad range of analyses; thus, graph processing (GP) is an invaluable tool in data analytics. At the heart of every GP system lies a concurrent graph data structure that stores the graph. Such a data structure needs to be highly efficient for both graph algorithms and queries. Due to the continuous evolution, the sparsity, and the scale-free nature of real-world graphs, GP systems face the challenge of providing an appropriate graph data structure that enables both fast analytical workloads and fast, low-memory graph mutations. Existing graph structures offer a hard tradeoff among read-only performance, update friendliness, and memory consumption upon updates. In this paper, we introduce CSR++, a new graph data structure that removes these tradeoffs and enables both fast read-only analytics, and quick and memory-friendly mutations. CSR++ combines ideas from CSR, the fastest read-only data structure, and adjacency lists (ALs) to achieve the best of both worlds. We compare CSR++ to CSR, ALs from the Boost Graph Library (BGL), and the following state-of-the-art update-friendly graph structures: LLAMA, STINGER, GraphOne, and Teseo. In our evaluation, which is based on popular GP algorithms executed over real-world graphs, we show that CSR++ remains close to CSR in read-only concurrent performance (within 10% on average) while significantly outperforming CSR (by an order of magnitude) and LLAMA (by almost 2×) with frequent updates. We also show that both CSR++’s update throughput and analytics performance exceed those of several state-of-the-art graph structures while maintaining low memory consumption when the workload includes updates.

Data, Vol. 8, Pages 167: Draft Genome Sequence Data of Lysinibacillus sphaericus Strain 1795 with Insecticidal Properties

Maria N. Romanenko — 2023-11-03

Data, Vol. 8, Pages 167: Draft Genome Sequence Data of Lysinibacillus sphaericus Strain 1795 with Insecticidal Properties

Data doi: 10.3390/data8110167

Authors: Maria N. Romanenko Maksim A. Nesterenko Anton E. Shikov Anton A. Nizhnikov Kirill S. Antonets

Lysinibacillus sphaericus holds a significant agricultural importance by being able to produce insecticidal toxins and chemical moieties of varying antibacterial and fungicidal activities. In this study, the genome of the L. sphaericus strain 1795 is presented. Illumina short reads sequenced on the HiSeq X platform were used to obtain the genome’s assembly by applying the SPAdes v3.15.4 software. The genome size based on a cumulative length of 23 contigs reached 4.74 Mb, with a respective N50 of 1.34 Mb. The assembled genome carried 4672 genes, including 4643 protein-encoding ones, 5 of which represented loci coding for insecticidal toxins active against the orders Diptera, Lepidoptera, and Blattodea. We also revealed biosynthetic gene clusters responsible for the synthesis of secondary metabolites with predicted antibacterial, fungicidal, and growth-promoting properties. The genomic data provided will be helpful for deepening our understanding of genetic markers determining the efficient application of the L. sphaericus strain 1795 primarily for biocontrol purposes in veterinary and medical applications against several groups of blood-sucking insects.

Data, Vol. 8, Pages 165: Can We Mathematically Spot the Possible Manipulation of Results in Research Manuscripts Using Benford’s Law?

Teddy Lazebnik — 2023-10-31

Data, Vol. 8, Pages 165: Can We Mathematically Spot the Possible Manipulation of Results in Research Manuscripts Using Benford’s Law?

Data doi: 10.3390/data8110165

Authors: Teddy Lazebnik Dan Gorlitsky

The reproducibility of academic research has long been a persistent issue, contradicting one of the fundamental principles of science. Recently, there has been an increasing number of false claims found in academic manuscripts, casting doubt on the validity of reported results. In this paper, we utilize an adapted version of Benford’s law, a statistical phenomenon that describes the distribution of leading digits in naturally occurring datasets, to identify the potential manipulation of results in research manuscripts, solely using the aggregated data presented in those manuscripts rather than the commonly unavailable raw datasets. Our methodology applies the principles of Benford’s law to commonly employed analyses in academic manuscripts, thus reducing the need for the raw data itself. To validate our approach, we employed 100 open-source datasets and successfully predicted 79% of them accurately using our rules. Moreover, we tested the proposed method on known retracted manuscripts, showing that around half (48.6%) can be detected using the proposed method. Additionally, we analyzed 100 manuscripts published in the last two years across ten prominent economic journals, with 10 manuscripts randomly sampled from each journal. Our analysis predicted a 3% occurrence of results manipulation with a 96% confidence level. Our findings show that Benford’s law adapted for aggregated data, can be an initial tool for identifying data manipulation; however, it is not a silver bullet, requiring further investigation for each flagged manuscript due to the relatively low prediction accuracy.

Data, Vol. 8, Pages 164: Information Competences and Academic Achievement: A Dataset

Jacqueline Köhler — 2023-10-27

Data, Vol. 8, Pages 164: Information Competences and Academic Achievement: A Dataset

Data doi: 10.3390/data8110164

Authors: Jacqueline Köhler Roberto González-Ibáñez

Information literacy (IL) is becoming fundamental in the modern world. Although several IL standards and assessments have been developed for secondary and higher education, there is still no agreement about the possible associations between IL and both academic achievement and student dropout rates. In this article, we present a dataset including IL competences measurements, as well as academic achievement and socioeconomic indicators for 153 Chilean first- and second-year engineering students. The dataset is intended to allow researchers to use machine learning methods to study to what extent, if any, IL and academic achievement are related.

Data, Vol. 8, Pages 163: A Large-Scale Dataset of Search Interests Related to Disease X Originating from Different Geographic Regions

Nirmalya Thakur — 2023-10-26

Data, Vol. 8, Pages 163: A Large-Scale Dataset of Search Interests Related to Disease X Originating from Different Geographic Regions

Data doi: 10.3390/data8110163

Authors: Nirmalya Thakur Shuqi Cui Kesha A. Patel Isabella Hall Yuvraj Nihal Duggal

The World Health Organization (WHO) added Disease X to their shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic. During different virus outbreaks of the past, such as COVID-19, Influenza, Lyme Disease, and Zika virus, researchers from various disciplines utilized Google Trends to mine multimodal components of web behavior to study, investigate, and analyze the global awareness, preparedness, and response associated with these respective virus outbreaks. As the world prepares for Disease X, a dataset on web behavior related to Disease X would be crucial to contribute towards the timely advancement of research in this field. Furthermore, none of the prior works in this field have focused on the development of a dataset to compile relevant web behavior data, which would help to prepare for Disease X. To address these research challenges, this work presents a dataset of web behavior related to Disease X, which emerged from different geographic regions of the world, between February 2018 and August 2023. Specifically, this dataset presents the search interests related to Disease X from 94 geographic regions. These regions were chosen for data mining as these regions recorded significant search interests related to Disease X during this timeframe. The dataset was developed by collecting data using Google Trends. The relevant search interests for all these regions for each month in this time range are available in this dataset. This paper also discusses the compliance of this dataset with the FAIR principles of scientific data management. Finally, an analysis of this dataset is presented to uphold the applicability, relevance, and usefulness of this dataset for the investigation of different research questions in the interrelated fields of Big Data, Data Mining, Healthcare, Epidemiology, and Data Analysis with a specific focus on Disease X.

Data, Vol. 8, Pages 162: The Development of a Water Resource Monitoring Ontology as a Research Tool for Sustainable Regional Development

Assel Ospan — 2023-10-26

Data, Vol. 8, Pages 162: The Development of a Water Resource Monitoring Ontology as a Research Tool for Sustainable Regional Development

Data doi: 10.3390/data8110162

Authors: Assel Ospan Madina Mansurova Vladimir Barakhnin Aliya Nugumanova Roman Titkov

The development of knowledge graphs about water resources as a tool for studying the sustainable development of a region is currently an urgent task, because the growing deterioration of the state of water bodies affects the ecology, economy, and health of the population of the region. This study presents a new ontological approach to water resource monitoring in Kazakhstan, providing data integration from heterogeneous sources, semantic analysis, decision support, and querying and searching and presenting new knowledge in the field of water monitoring. The contribution of this work is the integration of table extraction and understanding, semantic web rule language, semantic sensor network, time ontology methods, and the inclusion of a module of socioeconomic indicators that reveal the impact of water quality on the quality of life of the population. Using machine learning methods, the study derived six ontological rules to establish new knowledge about water resource monitoring. The results of the queries demonstrate the effectiveness of the proposed method, demonstrating its potential to improve water monitoring practices, promote sustainable resource management, and support decision-making processes in Kazakhstan, and can also be integrated into the ontology of water resources at the scale of Central Asia.

Data, Vol. 8, Pages 159: DataPLAN: A Web-Based Data Management Plan Generator for the Plant Sciences

Xiao-Ran Zhou — 2023-10-24

Data, Vol. 8, Pages 159: DataPLAN: A Web-Based Data Management Plan Generator for the Plant Sciences

Data doi: 10.3390/data8110159

Authors: Xiao-Ran Zhou Sebastian Beier Dominik Brilhaus Cristina Martins Rodrigues Timo Mühlhaus Dirk von Suchodoletz Richard M. Twyman Björn Usadel Angela Kranz

Research data management (RDM) combines a set of practices for the organization, storage and preservation of data from research projects. The RDM strategy of a project is usually formalized as a data management plan (DMP)—a document that sets out procedures to ensure data findability, accessibility, interoperability and reusability (FAIR-ness). Many aspects of RDM are standardized across disciplines so that data and metadata are reusable, but the components of DMPs in the plant sciences are often disconnected. The inability to reuse plant-specific DMP content across projects and funding sources requires additional time and effort to write unique DMPs for different settings. To address this issue, we developed DataPLAN—an open-source tool incorporating prewritten DMP content for the plant sciences that can be used online or offline to prepare multiple DMPs. The current version of DataPLAN supports Horizon 2020 and Horizon Europe projects, as well as projects funded by the German Research Foundation (DFG). Furthermore, DataPLAN offers the option for users to customize their own templates. Additional templates to accommodate other funding schemes will be added in the future. DataPLAN reduces the workload needed to create or update DMPs in the plant sciences by presenting standardized RDM practices optimized for different funding contexts.

Data, Vol. 8, Pages 161: Dataset: Biodiversity of Ground Beetles (Coleoptera, Carabidae) of the Republic of Mordovia (Russia)

Leonid V. Egorov — 2023-10-24

Data, Vol. 8, Pages 161: Dataset: Biodiversity of Ground Beetles (Coleoptera, Carabidae) of the Republic of Mordovia (Russia)

Data doi: 10.3390/data8110161

Authors: Leonid V. Egorov Viktor V. Aleksanov Sergei K. Alekseev Alexander B. Ruchin Oleg N. Artaev Mikhail N. Esin Sergei V. Lukiyanov Evgeniy A. Lobachev Gennadiy B. Semishin

(1) Background: Carabidae is one of the most diverse families of Coleoptera. Many species of Carabidae are sensitive to anthropogenic impacts and are indicators of their environmental state. Some species of large beetles are on the verge of extinction. The aim of this research is to describe the Carabidae fauna of the Republic of Mordovia (central part of European Russia); (2) Methods: The research was carried out in April-September 1979, 1987, 2000, 2001, 2005, 2007–2022. Collections were performed using a variety of methods (light trapping, soil traps, window traps, etc.). For each observation, the coordinates of the sampling location, abundance, and dates were recorded; (3) Results: The dataset contains data on 251 species of Carabidae from 12 subfamilies and 4576 occurrences. A total of 66,378 specimens of Carabidae were studied. Another 29 species are additionally known from other publications. Also, twenty-two species were excluded from the fauna of the region, as they were determined earlier by mistake (4). Conclusions: The biodiversity of Carabidae in the Republic of Mordovia included 280 species from 12 subfamilies. Four species (Agonum scitulum, Lebia scapularis, Bembidion humerale, and Bembidion tenellum) were identified for the first time in the Republic of Mordovia.

Data, Vol. 8, Pages 160: Fabaceae: South African Medicinal Plant Species Used in the Treatment and Management of Sexually Transmitted and Related Opportunistic Infections Associated with HIV-AIDS

Nkoana Ishmael Mongalo — 2023-10-24

Data, Vol. 8, Pages 160: Fabaceae: South African Medicinal Plant Species Used in the Treatment and Management of Sexually Transmitted and Related Opportunistic Infections Associated with HIV-AIDS

Data doi: 10.3390/data8110160

Authors: Nkoana Ishmael Mongalo Maropeng Vellry Raletsena

The use of medicinal plants, particularly in the treatment of sexually transmitted and related infections, is ancient. These plants may well be used as alternative and complementary medicine to a variety of antibiotics that may possess limitations mainly due to an emerging enormous antimicrobial resistance. Several computerized database literature sources such as ScienceDirect, Scopus, Scielo, PubMed, and Google Scholar were used to retrieve information on Fabaceae species used in the treatment and management of sexually transmitted and related infections in South Africa. The other information was sourced from various academic dissertations, theses, and botanical books. A total of 42 medicinal plant species belonging to the Fabaceae family, used in the treatment of sexually transmitted and related opportunistic infections associated with HIV-AIDS, have been documented. Trees were the most reported life form, yielding 47.62%, while Senna and Vachellia were the frequently cited genera yielding six and three species, respectively. Peltophorum africanum Sond. was the most preferred medicinal plant, yielding a frequency of citation of 14, while Vachellia karoo (Hayne) Banfi and Glasso as well as Elephantorrhiza burkei Benth. yielded 12 citations each. The most frequently used plant parts were roots, yielding 57.14%, while most of the plant species were administered orally after boiling (51.16%) until the infection subsided. Amazingly, many of the medicinal plant species are recommended for use to treat impotence (29.87%), while most common STI infections such as chlamydia (7.79%), gonorrhea (6.49%), syphilis (5.19%), genital warts (2.60%), and many other unidentified STIs that may include “Makgoma” and “Divhu” were less cited. Although there are widespread data on the in vitro evidence of the use of the Fabaceae species in the treatment of sexually transmitted and related infections, there is a need to explore the in vivo studies to further ascertain the use of species as a possible complementary and alternative medicine to the currently used antibiotics in both developing and underdeveloped countries. Furthermore, the toxicological profiles of many of these studies need to be further explored. The safety and efficacy of over-the-counter pharmaceutical products developed using these species also need to be explored.

Data, Vol. 8, Pages 158: Panel Regression Modelling for COVID-19 Infections and Deaths in Tamil Nadu, India

Rajarathinam Arunachalam — 2023-10-23

Data, Vol. 8, Pages 158: Panel Regression Modelling for COVID-19 Infections and Deaths in Tamil Nadu, India

Data doi: 10.3390/data8100158

Authors: Rajarathinam Arunachalam

The impacts of the coronavirus disease 2019 (COVID-19) pandemic have been extremely severe, with both economic and health crises experienced worldwide. Based on the panel regression model, this study examined the trends and correlations in the number of COVID-19-related deaths and the number of COVID-19-infected cases in all 37 regions of the Tamil Nadu state in India, in August 2020. The fixed effects model had the greatest R2 value of 78% and exhibited significant results. The slope coefficient was also highly significant, showing a considerable variation in the relationship between new COVID-19 cases and deaths. Additionally, for every unit increase in COVID-19-infected cases, the death rate increased by 0.02%.

Data, Vol. 8, Pages 157: Industrial Environment Multi-Sensor Dataset for Vehicle Indoor Tracking with Wi-Fi, Inertial and Odometry Data

Ivo Silva — 2023-10-23

Data, Vol. 8, Pages 157: Industrial Environment Multi-Sensor Dataset for Vehicle Indoor Tracking with Wi-Fi, Inertial and Odometry Data

Data doi: 10.3390/data8100157

Authors: Ivo Silva Cristiano Pendão Joaquín Torres-Sospedra Adriano Moreira

This paper describes a dataset collected in an industrial setting using a mobile unit resembling an industrial vehicle equipped with several sensors. Wi-Fi interfaces collect signals from available Access Points (APs), while motion sensors collect data regarding the mobile unit’s movement (orientation and displacement). The distinctive features of this dataset include synchronous data collection from multiple sensors, such as Wi-Fi data acquired from multiple interfaces (including a radio map), orientation provided by two low-cost Inertial Measurement Unit (IMU) sensors, and displacement (travelled distance) measured by an absolute encoder attached to the mobile unit’s wheel. Accurate ground-truth information was determined using a computer vision approach that recorded timestamps as the mobile unit passed through reference locations. We assessed the quality of the proposed dataset by applying baseline methods for dead reckoning and Wi-Fi fingerprinting. The average positioning error for simple dead reckoning, without using any other absolute positioning technique, is 8.25 m and 11.66 m for IMU1 and IMU2, respectively. The average positioning error for simple Wi-Fi fingerprinting is 2.19 m when combining the RSSI information from five Wi-Fi interfaces. This dataset contributes to the fields of Industry 4.0 and mobile sensing, providing researchers with a resource to develop, test, and evaluate indoor tracking solutions for industrial vehicles.

Data, Vol. 8, Pages 156: Cybersecurity Risk Assessments within Critical Infrastructure Social Networks

Alimbubi Aktayeva — 2023-10-19

Data, Vol. 8, Pages 156: Cybersecurity Risk Assessments within Critical Infrastructure Social Networks

Data doi: 10.3390/data8100156

Authors: Alimbubi Aktayeva Yerkhan Makatov Akku Kubigenova Tulegenovna Aibek Dautov Rozamgul Niyazova Maxud Zhamankarin Sergey Khan

Cybersecurity social networking is a new scientific and engineering discipline that was interdisciplinary in its early days, but is now transdisciplinary. The issues of reviewing and analyzing of principal tasks related to information collection, monitoring of social networks, assessment methods, and preventing and combating cybersecurity threats are, therefore, essential and pending. There is a need to design certain methods, models, and program complexes aimed at estimating risks related to the cyberspace of social networks and the support of their activities. This study considers a risk to be the combination of consequences of a given event (or incident) with a probable occurrence (likelihood of occurrence) involved, while risk assessment is a general issue of identification, estimation, and evaluation of risk. The findings of the study made it possible to elucidate that the technique of cognitive modeling for risk assessment is part of a comprehensive cybersecurity approach included in the requirements of basic IT standards, including IT security risk management. The study presents a comprehensive approach in the field of cybersecurity in social networks that allows for consideration of all the elements that constitute cybersecurity as a complex, interconnected system. The ultimate goal of this approach to cybersecurity is the organization of an uninterrupted scheme of protection against any impacts related to physical, hardware, software, network, and human objects or resources of the critical infrastructure of social networks, as well as the integration of various levels and means of protection.

Data, Vol. 8, Pages 155: A Data-Driven Exploration of a New Islamic Fatwas Dataset for Arabic NLP Tasks

Ohoud Alyemny — 2023-10-19

Data, Vol. 8, Pages 155: A Data-Driven Exploration of a New Islamic Fatwas Dataset for Arabic NLP Tasks

Data doi: 10.3390/data8100155

Authors: Ohoud Alyemny Hend Al-Khalifa Abdulrahman Mirza

Islamic content is a broad and diverse domain that encompasses various sources, topics, and perspectives. However, there is a lack of comprehensive and reliable datasets that can facilitate conducting studies on Islamic content. In this paper, we present fatwaset, the first public Arabic dataset of Islamic fatwas. It contains Islamic fatwas that we collected from various trusted and authenticated sources in the Islamic fatwa domain, such as agencies, religious scholars, and websites. Fatwaset is a rich resource as it does not only contain fatwas but also includes a considerable set of their surrounding metadata. It can be used for many natural language processing (NLP) tasks, such as language modeling, question answering, author attribution, topic identification, text classification, and text summarization. It can also support other domains that are related to Islamic culture, such as philosophy and language art. We describe the methodology and criteria we used to select the content, as well as the challenges and limitations we faced. Additionally, we perform an Exploratory Data Analysis (EDA), which investigates the dataset from different perspectives. The results of the EDA reveal important information that greatly benefits researchers in this area.

Data, Vol. 8, Pages 154: A Dataset of Non-Indigenous and Native Fish of the Volga and Kama Rivers (European Russia)

Dmitry P. Karabanov — 2023-10-18

Data, Vol. 8, Pages 154: A Dataset of Non-Indigenous and Native Fish of the Volga and Kama Rivers (European Russia)

Data doi: 10.3390/data8100154

Authors: Dmitry P. Karabanov Dmitry D. Pavlov Yury Y. Dgebuadze Mikhail I. Bazarov Elena A. Borovikova Yuriy V. Gerasimov Yulia V. Kodukhova Pavel B. Mikheev Eduard V. Nikitin Tatyana L. Opaleva Yuri A. Severov Rimma Z. Sabitova Alexey K. Smirnov Yury I. Solomatin Igor A. Stolbunov Alexander I. Tsvetkov Stanislav A. Vlasenko Irina S. Voroshilova Wenjun Zhong Xiaowei Zhang Alexey A. Kotov

Fish in the Volga-Kama River System (the largest river system in Europe) are important as a crucial food source for local populations; fish have the highest trophic level among hydrobionts. The purpose of this research is to describe the diversity of non-indigenous and native fish in the Volga and Kama Rivers, in the European part of Russia. This dataset encompasses data from June 2001 to September 2021 and comprises 1888 records (36,376 individual observations) for littoral and pelagic habitats from 143 sampling sites, representing 52 species from 42 genera in 22 families. The dataset has a Darwin Core standard format and has been fully released in the Global Biodiversity Information Facility (GBIF) under CC-BY 4.0 International license. The data are validated with several international databases such as FishBase, Eschmeyer’s Catalog of Fishes, the Barcode of Life Data System, and the SAS.Planet geoinformations system. Newly established populations have been found for several species belonging to the following Actinopteri families: Alosidae, Anguillidae, Cichlidae, Ehiravidae, Gobiidae, Odontobutidae, Syngnathidae, and Xenocyprididae. Therefore, this dataset can be used in the particular taxon species distribution analysis, which are especially important for non-indigenous species.