Data | January 2024 - Browse Articles

6 pages, 1393 KB

Open Access | Data Descriptor

Machine Learning Classification Workflow and Datasets for Ionospheric VLF Data Exclusion

by Filip Arnaut, Aleksandra Kolarski and Vladimir A. Srećković

Data 2024, 9(1), 17; https://doi.org/10.3390/data9010017 - 18 Jan 2024

Cited by 3 | Viewed by 2885

Machine learning (ML) methods are commonly applied in the fields of extraterrestrial physics, space science, and plasma physics. In a prior publication, an ML classification technique, the Random Forest (RF) algorithm, was utilized to automatically identify and categorize erroneous signals, including instrument errors, noisy signals, outlier data points, and the impact of solar flares (SFs) on the ionosphere. This data communication includes the pre-processed dataset used in the aforementioned research, along with a workflow that utilizes the PyCaret library and a post-processing workflow. The code and data serve educational purposes in the interdisciplinary field of ML and ionospheric physics, as well as being useful to other researchers for diverse objectives.

(This article belongs to the Section Spatial Data Science and Digital Earth)

9 pages, 5016 KB

Open Access | Editor’s Choice | Data Descriptor

Elliott State Research Forest Timber Cruise, Oregon, 2015–2016

by Todd West and Bogdan M. Strimbu

Data 2024, 9(1), 16; https://doi.org/10.3390/data9010016 - 18 Jan 2024

Cited by 1 | Viewed by 2608

Abstract

The Elliott State Research Forest comprises 33,700 ha of temperate, Douglas-fir rainforest along North America’s Pacific Coast (Oregon, United States). In 2015, naturally regenerated stands at least 92 years old covered 49% of the research area and sawtimber plantations younger than 68 years another 50%. During the winter of 2015–2016, a forest-wide inventory sampled both naturally regenerated and plantation stands, recording 97,424 trees on 17,866 plots in 738 stands. The resulting dataset is atypical for the area as plot locations were not restricted to upland, commercially harvestable timber. Multiage stands and riparian areas were therefore documented along with plantations 2–61 years old and trees retained through clearcut harvests. This dataset constitutes the only open access, stand-based forest inventory currently available for a large area within the Oregon Coast Range. The dataset enables development of suites of models as well as many comparisons across stand ages and types, both at stand level and at the level of individual trees.

(This article belongs to the Section Spatial Data Science and Digital Earth)

10 pages, 1827 KB

Open Access | Data Descriptor

Proteomic and Metabolomic Analyses of the Blood Samples of Highly Trained Athletes

by Kristina A. Malsagova, Arthur T. Kopylov, Vasiliy I. Pustovoyt, Evgenii I. Balakin, Ksenia A. Yurku, Alexander A. Stepanov, Liudmila I. Kulikova, Vladimir R. Rudnev and Anna L. Kaysheva

Data 2024, 9(1), 15; https://doi.org/10.3390/data9010015 - 16 Jan 2024

Cited by 4 | Viewed by 3069

Abstract

High exercise loading causes intricate and ambiguous proteomic and metabolic changes. This study aims to describe the dataset on protein and metabolite contents in plasma samples collected from highly trained athletes across different sports disciplines. The proteomic and metabolomic analyses of the plasma samples of highly trained athletes engaged in sports disciplines of different intensities were carried out using HPLC-MS/MS. The results are reported as two datasets (proteomic data in a derived mgf-file and metabolomic data in processed format), each containing the findings obtained by analyzing 93 mass spectra. Variations in the protein and metabolite contents of the biological samples are observed, depending on the intensity of training load for different sports disciplines. Mass spectrometric proteomic and metabolomic studies can be used for classifying different athlete phenotypes according to the intensity of sports discipline and for the assessment of the efficiency of the recovery period.

28 pages, 1002 KB

Open Access | Article

GeMSyD: Generic Framework for Synthetic Data Generation

by Ramona Tolas, Raluca Portase and Rodica Potolea

Data 2024, 9(1), 14; https://doi.org/10.3390/data9010014 - 11 Jan 2024

Cited by 6 | Viewed by 5399

Abstract

In the era of data-driven technologies, the need for diverse and high-quality datasets for training and testing machine learning models has become increasingly critical. In this article, we present a versatile methodology, the Generic Methodology for Constructing Synthetic Data Generation (GeMSyD), which addresses the challenge of synthetic data creation in the context of smart devices. GeMSyD provides a framework that enables the generation of synthetic datasets, aligning them closely with real-world data. To demonstrate the utility of GeMSyD, we instantiate the methodology by constructing a synthetic data generation framework tailored to the domain of event-based data modeling, specifically focusing on user interactions with smart devices. Our framework leverages GeMSyD to create synthetic datasets that faithfully emulate the dynamics of human–device interactions, including the temporal dependencies. Furthermore, we showcase how the synthetic data generated using our framework can serve as a valuable resource for machine learning practitioners. By employing these synthetic datasets, we perform a series of experiments to evaluate the performance of a neural-network-based prediction model in the domain of smart device interaction. Our results underscore the potential of synthetic data in facilitating model development and benchmarking.

23 pages, 1346 KB

Open Access | Review

Adaptive Forecasting in Energy Consumption: A Bibliometric Analysis and Review

by Manuel Jaramillo, Wilson Pavón and Lisbeth Jaramillo

Data 2024, 9(1), 13; https://doi.org/10.3390/data9010013 - 11 Jan 2024

Cited by 22 | Viewed by 5514

Abstract

This paper addresses the challenges in forecasting electrical energy in the current era of renewable energy integration. It reviews advanced adaptive forecasting methodologies while also analyzing the evolution of research in this field through bibliometric analysis. The review highlights the key contributions and limitations of current models with an emphasis on the challenges of traditional methods. The analysis reveals that Long Short-Term Memory (LSTM) networks, optimization techniques, and deep learning have the potential to model the dynamic nature of energy consumption, but they also have higher computational demands and data requirements. This review aims to offer a balanced view of current advancements and challenges in forecasting methods, guiding researchers, policymakers, and industry experts. It advocates for collaborative innovation in adaptive methodologies to enhance forecasting accuracy and support the development of resilient, sustainable energy systems.

9 pages, 1840 KB

Open Access | Data Descriptor

DeepSpaceYoloDataset: Annotated Astronomical Images Captured with Smart Telescopes

by Olivier Parisot

Data 2024, 9(1), 12; https://doi.org/10.3390/data9010012 - 10 Jan 2024

Cited by 4 | Viewed by 5386

Abstract

Recent smart telescopes allow the automatic collection of a large quantity of data for specific portions of the night sky, with the goal of capturing images of deep sky objects (nebulae, galaxies, globular clusters). Nevertheless, human verification is still required afterwards to check whether celestial targets are effectively visible in the images produced by these instruments. Depending on the magnitude of deep sky objects, the observation conditions and the cumulative time of data acquisition, it is possible that only stars are present in the images. In addition, unfavorable external conditions (light pollution, bright moon, etc.) can make capture difficult. In this paper, we describe DeepSpaceYoloDataset, a set of 4696 RGB astronomical images captured by two smart telescopes and annotated with the positions of deep sky objects that are effectively in the images. This dataset can be used to train detection models on this type of image, enabling better control of the duration of capture sessions, and also to detect unexpected celestial events such as supernovae.

16 pages, 2257 KB

Open Access | Article

ADAS Simulation Result Dataset Processing Based on Improved BP Neural Network

by Songyan Zhao, Lingshan Chen and Yongchao Huang

Data 2024, 9(1), 11; https://doi.org/10.3390/data9010011 - 5 Jan 2024

Cited by 3 | Viewed by 3134

Abstract

The autonomous driving simulation field lacks evaluation and forecasting systems for simulation results. The data obtained from the simulation of target algorithms and vehicle models cannot be reasonably estimated. This problem affects subsequent vehicle improvement and parameter calibration. We relied on the simulation results of the AEB algorithm. We selected the BP neural network as the basis and improved it with a genetic algorithm optimized via a roulette algorithm. The regression evaluation indicators of the prediction results show that the GA-BP neural network has better prediction accuracy and generalization ability than the original BP neural network and other optimized BP neural networks. This GA-BP neural network also fills the gap in evaluation and prediction systems.

15 pages, 5777 KB

Open Access | Article

Experimental Dataset of Tunable Mode Converter Based on Long-Period Fiber Gratings Written in Few-Mode Fiber: Impacts of Thermal, Wavelength, and Polarization Variations

by Juan Soto-Perdomo, Erick Reyes-Vera, Jorge Montoya-Cardona and Pedro Torres

Data 2024, 9(1), 10; https://doi.org/10.3390/data9010010 - 31 Dec 2023

Cited by 1 | Viewed by 2468

Abstract

Mode division multiplexing (MDM) is currently one of the most attractive multiplexing techniques in optical communications, as it allows for an increase in the number of channels available for data transmission. Optical modal converters are one of the main devices used in this technique. Therefore, the characterization and improvement of these devices are of great current interest. In this work, we present a dataset of 49,736 near-field intensity images of a modal converter based on a long-period fiber grating (LPFG) written on a few-mode fiber (FMF). This characterization was performed experimentally at various wavelengths, polarizations, and temperature conditions when the device converted from $LP_{01}$ mode to $LP_{11}$ mode. The results show that the modal converter can be tuned by adjusting these parameters, and that its operation is optimal under specific circumstances which have a great impact on its performance. Additionally, the potential application of the database is validated in this work. A modal decomposition technique based on the particle swarm algorithm (PSO) was employed as a tool for determining the most effective combinations of modal weights and relative phases from the spatial distributions collected in the dataset. The proposed dataset can open up new opportunities for researchers working on image segmentation, detection, and classification problems related to MDM technology. In addition, we implement novel artificial intelligence techniques that can help in finding the optimal operating conditions for this type of device.

26 pages, 6610 KB

Open Access | Article

Wi-Gitation: Replica Wi-Fi CSI Dataset for Physical Agitation Activity Recognition

by Nikita Sharma, Jeroen Klein Brinke, L. M. A. Braakman Jansen, Paul J. M. Havinga and Duc V. Le

Data 2024, 9(1), 9; https://doi.org/10.3390/data9010009 - 30 Dec 2023

Cited by 5 | Viewed by 4846

Abstract

Agitation is a commonly found behavioral condition in persons with advanced dementia. It requires continuous monitoring to gain insights into agitation levels to assist caregivers in delivering adequate care. The available monitoring techniques use cameras and wearables which are distressful and intrusive and are thus often rejected by older adults. To enable continuous monitoring in older adult care, unobtrusive Wi-Fi channel state information (CSI) can be leveraged to monitor physical activities related to agitation. However, to the best of our knowledge, there are no realistic CSI datasets available for facilitating the classification of physical activities demonstrated during agitation scenarios such as disturbed walking, repetitive sitting–getting up, tapping on a surface, hand wringing, rubbing on a surface, flipping objects, and kicking. Therefore, in this paper, we present a public dataset named Wi-Gitation. For Wi-Gitation, the Wi-Fi CSI data were collected with twenty-three healthy participants depicting the aforementioned agitation-related physical activities at two different locations in a one-bedroom apartment with multiple receivers placed at different distances (0.5–8 m) from the participants. The validation results on the Wi-Gitation dataset indicate higher accuracies ($F_{1}$-scores $\geq 0.95$) when employing mixed-data analysis, where the training and testing data share the same distribution. Conversely, in scenarios where the training and testing data differ in distribution (i.e., leave-one-out), the accuracies experienced a notable decline ($F_{1}$-scores $\leq 0.21$). This dataset can be used for fundamental research on CSI signals and in the evaluation of advanced algorithms developed for tackling domain invariance in CSI-based human activity recognition.

(This article belongs to the Section Information Systems and Data Management)

12 pages, 1825 KB

Open Access | Data Descriptor

DNA Methylome and Transcriptome Maps of Primary Colorectal Cancer and Matched Liver Metastasis

by Priyadarshana Ajithkumar, Gregory Gimenez, Peter A. Stockwell, Suzan Almomani, Sarah A. Bowden, Anna L. Leichter, Antonio Ahn, Sharon Pattison, Sebastian Schmeier, Frank A. Frizelle, Michael R. Eccles, Rachel V. Purcell, Euan J. Rodger and Aniruddha Chatterjee

Data 2024, 9(1), 8; https://doi.org/10.3390/data9010008 - 29 Dec 2023

Cited by 1 | Viewed by 3435

Abstract

Sequencing-based genome-wide DNA methylation, gene expression studies and associated data on paired colorectal cancer (CRC) primary and liver metastasis are very limited. We have profiled the DNA methylome and transcriptome of matched primary CRC and liver metastasis samples from the same patients. Genome-scale methylation and expression levels were examined using Reduced Representation Bisulfite Sequencing (RRBS) and RNA-Seq, respectively. To investigate DNA methylation and expression patterns, we generated a total of 1.01 × 10⁹ RRBS reads and 4.38 × 10⁸ RNA-Seq reads from the matched cancer tissues. Here, we describe in detail the sample features, experimental design, methods and bioinformatic pipeline for these epigenetic data. We demonstrate the quality of both the samples and sequence data obtained from the paired samples. The sequencing data obtained from this study will serve as a valuable resource for studying underlying mechanisms of distant metastasis and the utility of epigenetic profiles in cancer metastasis.

19 pages, 4338 KB

Open Access | Article

Data-Driven Analysis of MRI Scans: Exploring Brain Structure Variations in Colombian Adolescent Offenders

by Germán Sánchez-Torres, Nallig Leal and Mariana Pino

Data 2024, 9(1), 7; https://doi.org/10.3390/data9010007 - 26 Dec 2023

Viewed by 2675

Abstract

With the advancements in neuroimaging techniques, understanding the relationship between brain morphology and behavioral tendencies such as criminal behavior has garnered interest. This research addresses the investigation of disparities in neuroanatomical structures between adolescent offenders and non-offenders and considers the implications of such distinctions regarding offender behavior within adolescent populations. Employing data-driven methodologies, MRI scans of adolescents from Barranquilla, Colombia, were analyzed to explore morphological variations. Utilizing a 1.5 Tesla Siemens resonator (Siemens Healthineers, Erlangen, Germany), T1-weighted MPRAGE anatomical images were acquired and analyzed using a systematic five-step methodology including data acquisition, MRI pre-processing, feature selection, model selection, and model validation and evaluation. Participants, both offenders and non-offenders, were aged 14–18 and selected based on education, criminal history, and physical conditions. The research identified significant disparities in the volumes of 42 brain structures between adolescent offenders (AOs) and non-offenders (NOs), highlighting particular brain regions potentially associated with offending behavior. Additionally, a considerable proportion of AOs emanated from lower socioeconomic backgrounds and showcased marked substance use. The findings suggest that neuroanatomical disparities potentially correlate with criminal behavior among adolescents at a neurobiological level. Noticeable socio-environmental factors, such as lower socioeconomic status and substance abuse, were substantially prevalent among AOs. Particularly, neurobiological deviations in structures like ctx-lh-rostralmiddlefrontal and ctx-lh-caudalanteriorcingulate perhaps represent a link between neurological factors and external stimuli.

21 pages, 1931 KB

Open Access | Article

A Profit Maximization Model for Data Consumers with Data Providers’ Incentives in Personal Data Trading Market

by Hyojin Park, Hyeontaek Oh and Jun Kyun Choi

Data 2024, 9(1), 6; https://doi.org/10.3390/data9010006 - 25 Dec 2023

Cited by 1 | Viewed by 3089

Abstract

This paper proposes a profit maximization model for a data consumer when it buys personal data from data providers (by obtaining consent) through data brokers and provides their new services to data providers (i.e., service consumers). To observe the behavioral models of data providers, the data consumer, and service consumers, this paper proposes the willingness-to-sell model of personal data of data providers (which is affected by data providers’ behavior related to explicit consent), the service quality model obtained by the collected personal data from the data consumer’s perspective, and the willingness-to-pay model of service consumers regarding provided new services from the data consumer. Particularly, this paper jointly considers the behavior of data providers and service users under a limited budget. With parameters inspired by real-world surveys on data providers, this paper shows various numerical results to check the feasibility of the proposed models.

(This article belongs to the Section Information Systems and Data Management)

9 pages, 2088 KB

Open Access | Data Descriptor

Single-Nucleotide Variants in PADI2 and PADI4 and Ancestry Informative Markers in Interstitial Lung Disease and Rheumatoid Arthritis among a Mexican Mestizo Population

by Karol J. Nava-Quiroz, Jorge Rojas-Serrano, Gloria Pérez-Rubio, Ivette Buendia-Roldan, Mayra Mejía, Juan Carlos Fernández-López, Espiridión Ramos-Martínez, Luis A. López-Flores, Alma D. Del Ángel-Pablo and Ramcés Falfán-Valencia

Data 2024, 9(1), 5; https://doi.org/10.3390/data9010005 - 25 Dec 2023

Cited by 1 | Viewed by 2960

Abstract

Rheumatoid arthritis (RA) is an autoimmune disease mainly characterized by joint inflammation. It presents extra-articular manifestations, with the lungs being one of the affected areas. Among these, damage to the pulmonary interstitium (Interstitial Lung Disease—ILD) has been linked to proteins involved in the inflammatory process and related to extracellular matrix deposition and lung fibrosis establishment. Peptidyl arginine deiminase enzymes (PAD), which carry out protein citrullination, play a role in this context. A genetic association analysis was conducted on genes encoding two PAD isoforms: PAD2 and PAD4. This analysis also included ancestry informative markers and protein level determination in samples from patients with RA, RA-associated ILD, and clinically healthy controls. Significant single nucleotide variants (SNV) and one haplotype were identified as susceptibility factors for RA-ILD development. Elevated levels of PAD4 were found in RA-ILD cases, while PADI2 showed an association with RA susceptibility. This work presents data obtained from previously published research. Population variability has been noticed in genetic association studies. We present data for 14 SNVs that show geographical and genetic variation across the Mexican population, which provides highly informative content and greater intrapopulation genetic diversity. Further investigations in the field should be considered in addition to AIMs. The data presented in this study were analyzed in association with SNV genotypes in PADI2 and PADI4 to assess susceptibility to ILD in RA, as well as with changes in PAD2 and PAD4 protein levels according to carrier genotype, in addition to the use of covariates such as ancestry markers.

12 pages, 4901 KB

Open Access | Data Descriptor

An Urban Traffic Dataset Composed of Visible Images and Their Semantic Segmentation Generated by the CARLA Simulator

by Sergio Bemposta Rosende, David San José Gavilán, Javier Fernández-Andrés and Javier Sánchez-Soriano

Data 2024, 9(1), 4; https://doi.org/10.3390/data9010004 - 24 Dec 2023

Cited by 4 | Viewed by 4526

Abstract

A dataset of aerial urban traffic images and their semantic segmentation is presented to be used to train computer vision algorithms, among which those based on convolutional neural networks stand out. This article explains the process of creating the complete dataset, which includes the acquisition of the images, the labeling of vehicles, pedestrians, and pedestrian crossings as well as a description of the structure and content of the dataset (which amounts to 8694 images including visible images and those corresponding to the semantic segmentation). The images were generated using the CARLA simulator (but were like those that could be obtained with fixed aerial cameras or by using multi-copter drones) in the field of intelligent transportation management. The presented dataset is available and accessible to improve the performance of vision and road traffic management systems, especially for the detection of incorrect or dangerous maneuvers.

20 pages, 3448 KB

Open Access | Article

Unlocking Insights: Analysing COVID-19 Lockdown Policies and Mobility Data in Victoria, Australia, through a Data-Driven Machine Learning Approach

by Shiyang Lyu, Oyelola Adegboye, Kiki Adhinugraha, Theophilus I. Emeto and David Taniar

Data 2024, 9(1), 3; https://doi.org/10.3390/data9010003 - 21 Dec 2023

Cited by 3 | Viewed by 3414

Abstract

The state of Victoria, Australia, implemented one of the world’s most prolonged cumulative lockdowns in 2020 and 2021. Although lockdowns have proven effective in managing COVID-19 worldwide, this approach faced challenges in containing the rising infection in Victoria. This study evaluates the effects of short-term (less than 60 days) and long-term (more than 60 days) lockdowns on public mobility and the effectiveness of various social restriction measures within these periods. The aim is to understand the complexities of pandemic management by examining various measures over different lockdown durations, thereby contributing to more effective COVID-19 containment methods. Using restriction policy, community mobility, and COVID-19 data, a machine-learning-based simulation model was proposed, incorporating analysis of correlation, infection doubling time, and effective lockdown date. The model result highlights the significant impact of public event cancellations in preventing COVID-19 infection during short- and long-term lockdowns and the importance of international travel controls in long-term lockdowns. The effectiveness of social restriction was found to decrease significantly with the transition from short to long lockdowns, characterised by increased visits to public places and increased use of public transport, which may be associated with an increase in the effective reproduction number (R_t) and infected cases.

14 pages, 283 KB

Open Access | Article

Medical Opinions Analysis about the Decrease of Autopsies Using Emerging Pattern Mining

by Isaac Machorro-Cano, Ingrid Aylin Ríos-Méndez, José Antonio Palet-Guzmán, Nidia Rodríguez-Mazahua, Lisbeth Rodríguez-Mazahua, Giner Alor-Hernández and José Oscar Olmedo-Aguirre

Data 2024, 9(1), 2; https://doi.org/10.3390/data9010002 - 21 Dec 2023

Cited by 2 | Viewed by 2307

Abstract

An autopsy is a widely recognized procedure to guarantee ongoing enhancements in medicine. It finds extensive application in legal, scientific, medical, and research domains. However, declining autopsy rates in hospitals constitute a worldwide concern. For example, the Regional Hospital of Rio Blanco in Veracruz, Mexico, has substantially reduced the number of autopsies at hospitals in recent years. Since there are no documented historical records of a decrease in the frequency of autopsy cases, it is crucial to establish a methodological framework to substantiate any actual trends in the data. Emerging pattern mining (EPM) allows for finding differences between classes or data sets because it builds a descriptive data model concerning some given remarkable property. Data set description has become a significant application area in various contexts in recent years. In this research study, various EPM algorithms were used to extract emergent patterns from a data set collected based on medical experts’ perspectives on reducing hospital autopsies. Notably, the top-performing EPM algorithms were iEPMiner, LCMine, SJEP-C, Top-k minimal SJEPs, and Tree-based JEP-C. Among these, iEPMiner and LCMine demonstrated faster performance and produced superior emergent patterns when considering metrics such as Confidence, Weighted Relative Accuracy Criteria (WRACC), False Positive Rate (FPR), and True Positive Rate (TPR).
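The quality metrics named in this abstract have widely used definitions in terms of how many examples a pattern covers in the target class and in the other class. The following is a minimal sketch of those textbook definitions. The function name, argument names, and counts are hypothetical, the counts are not results from the article, and the article's exact formulations may differ.

```python
def pattern_metrics(tp, fp, fn, tn):
    """Common quality measures for a pattern that targets one class.

    tp / fp: covered examples of the target class / of the other class
    fn / tn: uncovered examples of the target class / of the other class
    """
    n = tp + fp + fn + tn
    pos = tp + fn        # all target-class examples
    neg = fp + tn        # all other-class examples
    covered = tp + fp    # examples matched by the pattern
    confidence = tp / covered if covered else 0.0
    tpr = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    # WRAcc = coverage * (confidence - prior probability of the target class)
    wracc = (covered / n) * (confidence - pos / n) if n else 0.0
    return {"confidence": confidence, "tpr": tpr, "fpr": fpr, "wracc": wracc}


# Made-up counts for illustration only.
m = pattern_metrics(tp=30, fp=10, fn=20, tn=40)
```

Under these definitions, a pattern scores well when it covers many target-class examples (high TPR and confidence) while covering few examples of the other class (low FPR). WRAcc trades off coverage against the gain in confidence over the class prior.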

26 pages, 5854 KB

Open Access | Data Descriptor

Expert-Annotated Dataset to Study Cyberbullying in Polish Language

by Michal Ptaszynski, Agata Pieciukiewicz, Pawel Dybala, Pawel Skrzek, Kamil Soliwoda, Marcin Fortuna, Gniewosz Leliwa and Michal Wroczynski

Data 2024, 9(1), 1; https://doi.org/10.3390/data9010001 - 20 Dec 2023

Cited by 5 | Viewed by 4468

Abstract

We introduce the first dataset of harmful and offensive language collected from the Polish Internet. This dataset was meticulously curated to facilitate the exploration of harmful online phenomena such as cyberbullying and hate speech, which have exhibited a significant surge both within the Polish Internet and globally. The dataset was systematically collected and then annotated using two approaches. First, it was annotated by two proficient layperson volunteers, operating under the guidance of a specialist in the language of cyberbullying and hate speech. To enhance the precision of the annotations, a secondary round of annotations was carried out by a team of adept annotators with specialized long-term expertise in cyberbullying and hate speech annotations. This second phase was further overseen by an experienced annotator, acting as a super-annotator. In its initial application, the dataset was leveraged for the categorization of cyberbullying instances in the Polish language. Specifically, the dataset serves as the foundation for two distinct tasks: (1) a binary classification that segregates harmful and non-harmful messages and (2) a multi-class classification that distinguishes between two variations of harmful content (cyberbullying and hate speech), as well as a non-harmful category. Alongside the dataset itself, we also provide the models that showed satisfying classification performance. These models are made accessible for third-party use in constructing cyberbullying prevention systems.

Data, Volume 9, Issue 1 (January 2024) – 17 articles
