Data | Editor’s choice Articles

9 pages, 5016 KB

Open Access | Editor’s Choice | Data Descriptor

Elliott State Research Forest Timber Cruise, Oregon, 2015–2016

by Todd West and Bogdan M. Strimbu

Data 2024, 9(1), 16; https://doi.org/10.3390/data9010016 - 18 Jan 2024

Cited by 1 | Viewed by 2497

The Elliott State Research Forest comprises 33,700 ha of temperate, Douglas-fir rainforest along North America’s Pacific Coast (Oregon, United States). In 2015, naturally regenerated stands at least 92 years old covered 49% of the research area and sawtimber plantations younger than 68 years another 50%. During the winter of 2015–2016, a forest-wide inventory sampled both naturally regenerated and plantation stands, recording 97,424 trees on 17,866 plots in 738 stands. The resulting dataset is atypical for the area as plot locations were not restricted to upland, commercially harvestable timber. Multiage stands and riparian areas were therefore documented along with plantations 2–61 years old and trees retained through clearcut harvests. This dataset constitutes the only open access, stand-based forest inventory currently available for a large area within the Oregon Coast Range. The dataset enables development of suites of models as well as many comparisons across stand ages and types, both at stand level and at the level of individual trees. Full article

(This article belongs to the Section Spatial Data Science and Digital Earth)

9 pages, 5038 KB

Open Access | Editor’s Choice | Data Descriptor

A Tumour and Liver Automatic Segmentation (ATLAS) Dataset on Contrast-Enhanced Magnetic Resonance Imaging for Hepatocellular Carcinoma

by Félix Quinton, Romain Popoff, Benoît Presles, Sarah Leclerc, Fabrice Meriaudeau, Guillaume Nodari, Olivier Lopez, Julie Pellegrinelli, Olivier Chevallier, Dominique Ginhac, Jean-Marc Vrigneaud and Jean-Louis Alberini

Data 2023, 8(5), 79; https://doi.org/10.3390/data8050079 - 27 Apr 2023

Cited by 40 | Viewed by 10247

Abstract

Liver cancer is the sixth most common cancer in the world and the fourth leading cause of cancer mortality. In unresectable liver cancers, especially hepatocellular carcinoma (HCC), transarterial radioembolisation (TARE) can be considered for treatment. TARE treatment involves a contrast-enhanced magnetic resonance imaging (CE-MRI) exam performed beforehand to delineate the liver and tumour(s) in order to perform dosimetry calculation. Due to the significant amount of time and expertise required to perform the delineation process, there is a strong need for automation. Unfortunately, the lack of publicly available CE-MRI datasets with liver tumour annotations has hindered the development of fully automatic solutions for liver and tumour segmentation. The “Tumour and Liver Automatic Segmentation” (ATLAS) dataset that we present consists of 90 liver-focused CE-MRI covering the entire liver of 90 patients with unresectable HCC, along with 90 liver and liver tumour segmentation masks. To the best of our knowledge, the ATLAS dataset is the first public dataset providing CE-MRI of HCC with annotations. The public availability of this dataset should greatly facilitate the development of automated tools designed to optimise the delineation process, which is essential for treatment planning in liver cancer patients. Full article

(This article belongs to the Topic Advances in Data Analytics with Applications to Health Care)

18 pages, 2885 KB

Open Access | Editor’s Choice | Article

Introducing UWF-ZeekData22: A Comprehensive Network Traffic Dataset Based on the MITRE ATT&CK Framework

by Sikha S. Bagui, Dustin Mink, Subhash C. Bagui, Tirthankar Ghosh, Russel Plenkers, Tom McElroy, Stephan Dulaney and Sajida Shabanali

Data 2023, 8(1), 18; https://doi.org/10.3390/data8010018 - 11 Jan 2023

Cited by 23 | Viewed by 9001

Abstract

With the rapid rate at which networking technologies are changing, there is a need to regularly update network activity datasets to accurately reflect the current state of network infrastructure/traffic. This work is unique in that it presents the first network dataset collected using Zeek and labelled using the MITRE ATT&CK framework. In addition to identifying attack traffic, the MITRE ATT&CK framework allows for the detection of adversary behavior leading to an attack. It can also be used to develop user profiles of groups intending to perform attacks. This paper also outlines how both the cyber range and Hadoop’s big data platform were used for creating this network traffic data repository. The data was collected using Security Onion in two formats: Zeek and PCAPs. Mission logs, which contained the MITRE ATT&CK data, were used to label the network attack data. The data was transferred daily from the Security Onion virtual machine running on a cyber range to the big-data platform, Hadoop’s distributed file system. This dataset, UWF-ZeekData22, is publicly available at datasets.uwf.edu. Full article

15 pages, 5903 KB

Open Access | Editor’s Choice | Data Descriptor

Traffic Sign Detection and Classification on the Austrian Highway Traffic Sign Data Set

by Alexander Maletzky, Nikolaus Hofer, Stefan Thumfart, Karin Bruckmüller and Johannes Kasper

Data 2023, 8(1), 16; https://doi.org/10.3390/data8010016 - 9 Jan 2023

Cited by 5 | Viewed by 8678

Abstract

Advanced Driver Assistance Systems rely on automated traffic sign recognition. Today, Deep Learning methods outperform other approaches in terms of accuracy and processing time; however, they require vast and well-curated data sets for training. In this paper, we present the Austrian Highway Traffic Sign Data Set (ATSD), a comprehensive annotated data set of images of almost all traffic signs on Austrian highways in 2014, and corresponding images of full traffic scenes they are contained in. Altogether, the data set consists of almost 7500 scene images with more than 28,000 detailed annotations of more than 100 distinct traffic sign classes. It covers diverse environments, ranging from urban to rural and mountainous areas, and includes many images recorded in tunnels. We further evaluate state-of-the-art traffic sign detectors and classifiers on ATSD to establish baselines for future experiments. The data set and our baseline models are freely available online. Full article

19 pages, 34962 KB

Open Access | Editor’s Choice | Article

PERSIST: A Multimodal Dataset for the Prediction of Perceived Exertion during Resistance Training

by Justin Amadeus Albert, Arne Herdick, Clemens Markus Brahms, Urs Granacher and Bert Arnrich

Data 2023, 8(1), 9; https://doi.org/10.3390/data8010009 - 28 Dec 2022

Cited by 3 | Viewed by 6459

Abstract

Measuring and adjusting the training load is essential in resistance training, as training overload can increase the risk of injuries. At the same time, too little load does not deliver the desired training effects. Usually, external load is quantified using objective measurements, such as lifted weight distributed across sets and repetitions per exercise. Internal training load is usually assessed using questionnaires or ratings of perceived exertion (RPE). A standard RPE scale is the Borg scale, which ranges from 6 (no exertion) to 20 (the highest exertion ever experienced). Researchers have investigated predicting RPE for different sports using sensor modalities and machine learning methods, such as Support Vector Regression or Random Forests. This paper presents PERSIST, a novel dataset for predicting PERceived exertion during reSIStance Training. We recorded multiple sensor modalities simultaneously, including inertial measurement units (IMU), electrocardiography (ECG), and motion capture (MoCap). The MoCap data has been synchronized to the IMU and ECG data. We also provide heart rate variability (HRV) parameters obtained from the ECG signal. Our dataset contains data from twelve young and healthy male participants with at least one year of resistance training experience. Subjects performed twelve sets of squats on a Flywheel platform with twelve repetitions per set. After each set, subjects reported their current RPE. We chose the squat exercise as it involves the largest muscle group. This paper demonstrates how to access the dataset. We further present an exploratory data analysis and show how researchers can use IMU and ECG data to predict perceived exertion. Full article

14 pages, 13286 KB

Open Access | Editor’s Choice | Data Descriptor

UNIPD-BPE: Synchronized RGB-D and Inertial Data for Multimodal Body Pose Estimation and Tracking

by Mattia Guidolin, Emanuele Menegatti and Monica Reggiani

Data 2022, 7(6), 79; https://doi.org/10.3390/data7060079 - 9 Jun 2022

Cited by 8 | Viewed by 4114

Abstract

The ability to estimate human motion without requiring any external on-body sensor or marker is of paramount importance in a variety of fields, including human–robot interaction, Industry 4.0, surveillance, and telerehabilitation. The recent development of portable, low-cost RGB-D cameras pushed forward the accuracy of markerless motion capture systems. However, despite the widespread use of such sensors, a dataset including complex scenes with multiple interacting people, recorded with a calibrated network of RGB-D cameras and an external system for assessing the pose estimation accuracy, is still missing. This paper presents the University of Padova Body Pose Estimation dataset (UNIPD-BPE), an extensive dataset for multi-sensor body pose estimation containing both single-person and multi-person sequences with up to 4 interacting people. A network with 5 Microsoft Azure Kinect RGB-D cameras is exploited to record synchronized high-definition RGB and depth data of the scene from multiple viewpoints, as well as to estimate the subjects’ poses using the Azure Kinect Body Tracking SDK. Simultaneously, full-body Xsens MVN Awinda inertial suits allow obtaining accurate poses and anatomical joint angles, while also providing raw data from the 17 IMUs required by each suit. This dataset aims to push forward the development and validation of multi-camera markerless body pose estimation and tracking algorithms, as well as multimodal approaches focused on merging visual and inertial data. Full article

(This article belongs to the Special Issue Computer Vision Datasets for Positioning, Tracking and Wayfinding)

11 pages, 17964 KB

Open Access | Editor’s Choice | Data Descriptor

Dataset: Roundabout Aerial Images for Vehicle Detection

by Enrique Puertas, Gonzalo De-Las-Heras, Javier Fernández-Andrés and Javier Sánchez-Soriano

Data 2022, 7(4), 47; https://doi.org/10.3390/data7040047 - 12 Apr 2022

Cited by 18 | Viewed by 7011

Abstract

This publication presents a dataset of aerial images of Spanish roundabouts taken from a UAV, along with annotations in PASCAL VOC XML files that indicate the position of vehicles within them. Additionally, a CSV file is attached containing information related to the location and characteristics of the captured roundabouts. This work details the process followed to obtain them: image capture, processing, and labeling. The dataset consists of 985,260 total instances: 947,400 cars, 19,596 cycles, 2262 trucks, 7008 buses, and 2208 empty roundabouts in 61,896 1920 × 1080 px JPG images. These are divided into 15,474 extracted images from 8 roundabouts with different traffic flows and 46,422 images created using data augmentation techniques. The purpose of this dataset is to help research into computer vision on the road, as such labeled images are not abundant. It can be used to train supervised learning models, such as convolutional neural networks, which are very popular in object detection. Full article
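As a rough illustration of how annotations in the PASCAL VOC XML layout can be read, the following minimal sketch parses one annotation and counts labeled objects per class. The sample file name, class labels, and box coordinates are invented for the example and are not taken from the dataset; the dataset's actual label names may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotation in the PASCAL VOC layout. The file name, labels,
# and coordinates below are made up for illustration only.
SAMPLE_XML = """
<annotation>
  <filename>roundabout_0001.jpg</filename>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>160</xmax><ymax>240</ymax></bndbox>
  </object>
  <object>
    <name>car</name>
    <bndbox><xmin>400</xmin><ymin>520</ymin><xmax>455</xmax><ymax>560</ymax></bndbox>
  </object>
  <object>
    <name>truck</name>
    <bndbox><xmin>900</xmin><ymin>300</ymin><xmax>1010</xmax><ymax>380</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc(xml_text):
    """Return a list of (class_name, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bndbox = obj.find("bndbox")
        box = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((name, box))
    return boxes


def count_by_class(boxes):
    """Count annotated objects per class name."""
    return Counter(name for name, _ in boxes)


if __name__ == "__main__":
    boxes = parse_voc(SAMPLE_XML)
    print(count_by_class(boxes))
```

Running the same two functions over every XML file in a folder would give per-class totals that could be compared against the instance counts quoted in the abstract.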

10 pages, 13286 KB

Open Access | Editor’s Choice | Data Descriptor

Large-Scale Dataset for the Analysis of Outdoor-to-Indoor Propagation for 5G Mid-Band Operational Networks

by Usman Ali, Giuseppe Caso, Luca De Nardis, Konstantinos Kousias, Mohammad Rajiullah, Özgü Alay, Marco Neri, Anna Brunstrom and Maria-Gabriella Di Benedetto

Data 2022, 7(3), 34; https://doi.org/10.3390/data7030034 - 15 Mar 2022

Cited by 19 | Viewed by 6718

Abstract

Understanding radio propagation characteristics and developing channel models is fundamental to building and operating wireless communication systems. Among other uses, channel characterization and modeling can be used for coverage and performance analysis and prediction. Within this context, this paper describes a comprehensive dataset of channel measurements performed to analyze outdoor-to-indoor propagation characteristics in the mid-band spectrum identified for the operation of 5th Generation (5G) cellular systems. Previous efforts to analyze outdoor-to-indoor propagation characteristics in this band were made by using measurements collected on dedicated, mostly single-link setups. Hence, measurements performed on deployed and operational 5G networks are still lacking in the literature. To fill this gap, this paper presents a dataset of measurements performed over commercial 5G networks. In particular, the dataset includes measurements of channel power delay profiles from two 5G networks in Band n78, i.e., 3.3–3.8 GHz. Such measurements were collected at multiple locations in a large office building in the city of Rome, Italy by using the Rohde & Schwarz (R&S) TSMA6 network scanner during several weeks in 2020 and 2021. A primary goal of the dataset is to provide an opportunity for researchers to investigate a large set of 5G channel measurements, aiming at analyzing the corresponding propagation characteristics toward the definition and refinement of empirical channel propagation models. Full article

(This article belongs to the Special Issue Measurements of User and Sensor Data from the Internet of Things (IoT) Devices)

19 pages, 2685 KB

Open Access | Editor’s Choice | Article

A Mixture Hidden Markov Model to Mine Students’ University Curricula

by Silvia Bacci and Bruno Bertaccini

Data 2022, 7(2), 25; https://doi.org/10.3390/data7020025 - 21 Feb 2022

Cited by 3 | Viewed by 3837

Abstract

In the context of higher education, the wide availability of data gathered by universities for administrative purposes or for recording the evolution of students’ learning processes makes novel data mining techniques particularly useful to tackle critical issues. In Italy, current academic regulations allow students to customize the chronological sequence of courses they have to attend to obtain the final degree. This leads to a variety of sequences of exams, with an average time taken to obtain the degree that may significantly differ from the time established by law. In this contribution, we propose a mixture hidden Markov model to classify students into groups that are homogenous in terms of university paths, with the aim of detecting bottlenecks in the academic career and improving students’ performance. Full article

(This article belongs to the Special Issue Education Data Mining)

27 pages, 4208 KB

Open Access | Editor’s Choice | Data Descriptor

#PraCegoVer: A Large Dataset for Image Captioning in Portuguese

by Gabriel Oliveira dos Santos, Esther Luna Colombini and Sandra Avila

Data 2022, 7(2), 13; https://doi.org/10.3390/data7020013 - 21 Jan 2022

Cited by 6 | Viewed by 6387

Abstract

Automatically describing images using natural sentences is essential to visually impaired people’s inclusion on the Internet. This problem is known as Image Captioning. There are many datasets in the literature, but most contain only English captions, whereas datasets with captions described in other languages are scarce. We introduce #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese. In contrast to popular datasets, #PraCegoVer has only one reference per image, and both the mean and variance of reference sentence length are significantly high, which makes our dataset challenging due to its linguistic aspect. We carry out a detailed analysis to find the main classes and topics in our data. We compare #PraCegoVer to the MS COCO dataset in terms of sentence length and word frequency. We hope that the #PraCegoVer dataset encourages more work addressing the automatic generation of descriptions in Portuguese. Full article

(This article belongs to the Section Information Systems and Data Management)

10 pages, 889 KB

Open Access | Editor’s Choice | Data Descriptor

A Repertoire of Virtual-Reality, Occupational Therapy Exercises for Motor Rehabilitation Based on Action Observation

by Emilia Scalona, Doriana De Marco, Maria Chiara Bazzini, Arturo Nuara, Adolfo Zilli, Elisa Taglione, Fabrizio Pasqualetti, Generoso Della Polla, Nicola Francesco Lopomo, Maddalena Fabbri-Destro and Pietro Avanzini

Data 2022, 7(1), 9; https://doi.org/10.3390/data7010009 - 11 Jan 2022

Cited by 3 | Viewed by 4268

Abstract

There is a growing interest in action observation treatment (AOT), i.e., a rehabilitative procedure combining action observation, motor imagery, and action execution to promote the recovery, maintenance, and acquisition of motor abilities. AOT studies employed basic upper limb gestures as stimuli, but—in principle—the AOT approach can be effectively extended to more complex actions like occupational gestures. Here, we present a repertoire of virtual-reality (VR) stimuli depicting occupational therapy exercises intended for AOT, potentially suitable for occupational safety and injury prevention. We animated a humanoid avatar by fitting the kinematics recorded by a healthy subject performing the exercises. All the stimuli are available via a custom-made graphical user interface, which allows the user to adjust several visualization parameters like the viewpoint, the number of repetitions, and the observed movement’s speed. Beyond providing clinicians with a set of VR stimuli promoting via AOT the recovery of goal-oriented, occupational gestures, such a repertoire could extend the use of AOT to the field of occupational safety and injury prevention. Full article

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

14 pages, 4721 KB

Open Access | Editor’s Choice | Article

View VULMA: Data Set for Training a Machine-Learning Tool for a Fast Vulnerability Analysis of Existing Buildings

by Angelo Cardellicchio, Sergio Ruggieri, Valeria Leggieri and Giuseppina Uva

Data 2022, 7(1), 4; https://doi.org/10.3390/data7010004 - 31 Dec 2021

Cited by 27 | Viewed by 4527

Abstract

The paper presents View VULMA, a data set specifically designed for training machine-learning tools for elaborating fast vulnerability analysis of existing buildings. Such tools require supervised training via an extensive set of building imagery, for which several typological parameters should be defined, with a proper label assigned to each sample on a per-parameter basis. Thus, it is clear how defining an adequate training data set plays a key role, and several aspects should be considered, such as data availability, preprocessing, augmentation and balancing according to the selected labels. In this paper, we highlight all these issues, describing the pursued strategies to elaborate a reliable data set. In particular, a detailed description of both requirements (e.g., scale and resolution of images, evaluation parameters and data heterogeneity) and the steps followed to define View VULMA is provided, starting from the data assessment (which reduced the initial sample of about 20,000 images to a subset of about 3,000 pictures), to achieve the goal of training a transfer-learning-based automated tool for fast estimation of the vulnerability of existing buildings from single pictures. Full article

11 pages, 646 KB

Open Access | Editor’s Choice | Data Descriptor

Mobile Apps to Fight the COVID-19 Crisis

by Chrisa Tsinaraki, Irena Mitton, Marco Minghini, Marina Micheli, Alexander Kotsev, Lorena Hernandez Quiros, Fabiano-Antonio Spinelli, Alessandro Dalla Benetta and Sven Schade

Data 2021, 6(10), 106; https://doi.org/10.3390/data6100106 - 8 Oct 2021

Cited by 11 | Viewed by 3609

Abstract

The COVID-19 pandemic led to a multi-faceted global crisis, which triggered the diverse and quickly emerging use of old and new digital tools. We have developed a multi-channel approach for the monitoring and analysis of a subset of such tools, the COVID-19 related mobile applications (apps). Our approach builds on the information available in the two most prominent app stores (i.e., Google Play for Android-powered devices and Apple’s App Store for iOS-powered devices), as well as on relevant tweets and digital media outlets. The dataset presented here is one of the outcomes of this approach, uses the content of the app stores and enriches it, providing aggregated information about 837 mobile apps published across the world to fight the COVID-19 crisis. This information includes: (a) information available in the mobile app stores between 20 April 2020 and 2 August 2020; (b) complementary information obtained from manual analysis performed until mid-September 2020; and (c) status information about app availability on 28 February 2021, when we last collected data from the mobile app stores. We highlight our findings with a series of descriptives, which depict both the activities in the app stores and the qualitative information that was revealed by the manual analysis. Full article

(This article belongs to the Special Issue A European Approach to the Establishment of Data Spaces)

13 pages, 20761 KB

Open Access | Editor’s Choice | Data Descriptor

Experimental Data of Bottom Pressure and Free Surface Elevation including Wave and Current Interactions

by Roman Gabl, Samuel Draycott, Ajit C. Pillai and Thomas Davey

Data 2021, 6(10), 103; https://doi.org/10.3390/data6100103 - 30 Sep 2021

Cited by 1 | Viewed by 2563

Abstract

Force plates are commonly used in tank testing to measure loads acting on the foundation of a structure. These targeted measurements are overlaid by the hydrostatic and dynamic pressure acting on the force plate induced by the waves and currents. This paper presents a dataset of bottom force measurement with a six degree-of-freedom force plate (AMTI OR6-7 1000, surface area 0.464 m × 0.508 m) combined with synchronised measurements of surface elevation and current velocity. The data cover wave frequencies between 0.2 and 0.7 Hz and wave directions between 0° and 180°. These variations are provided for current speeds of 0 and 0.2 m/s and a variation of the current in the absence of waves covering 0 to 0.45 m/s. The dataset can be utilised as a validation dataset for models predicting bottom pressure based on free surface elevation. Additionally, the dataset provides the wave- and current-induced load acting on the specific load cell at a fixed water depth of 2 m, which can subsequently be removed to obtain the often-desired measurement of structural loads. Full article

15 pages, 1409 KB

Open Access | Editor’s Choice | Data Descriptor

A Dataset of Photos and Videos for Digital Forensics Analysis Using Machine Learning Processing

by Sara Ferreira, Mário Antunes and Manuel E. Correia

Data 2021, 6(8), 87; https://doi.org/10.3390/data6080087 - 5 Aug 2021

Cited by 15 | Viewed by 9013

Abstract

Deepfake and manipulated digital photos and videos are being increasingly used in a myriad of cybercrimes. Ransomware, the dissemination of fake news, and digital kidnapping-related crimes are the most recurrent, in which tampered multimedia content has been the primordial disseminating vehicle. Digital forensic analysis tools are being widely used by criminal investigations to automate the identification of digital evidence in seized electronic equipment. The number of files to be processed and the complexity of the crimes under analysis have highlighted the need to employ efficient digital forensics techniques grounded on state-of-the-art technologies. Machine Learning (ML) researchers have been challenged to apply techniques and methods to improve the automatic detection of manipulated multimedia content. However, such methods have not yet been massively incorporated into digital forensic tools, mostly due to the lack of realistic and well-structured datasets of photos and videos. The diversity and richness of the datasets are crucial to benchmark the ML models and to evaluate their appropriateness to be applied in real-world digital forensics applications. An example is the development of third-party modules for the widely used Autopsy digital forensic application. This paper presents a dataset obtained by extracting a set of simple features from genuine and manipulated photos and videos, which are part of state-of-the-art existing datasets. The resulting dataset is balanced, and each entry comprises a label and a vector of numeric values corresponding to the features extracted through a Discrete Fourier Transform (DFT). The dataset is available in a GitHub repository, and the total amount of photos and video frames is 40,588 and 12,400, respectively. The dataset was validated and benchmarked with deep learning Convolutional Neural Networks (CNN) and Support Vector Machines (SVM) methods; however, a plethora of other existing ones can be applied. Generically, the results show a better F1-score for CNN compared with SVM, both for photos and videos processing. CNN achieved an F1-score of 0.9968 and 0.8415 for photos and videos, respectively. Regarding SVM, the results obtained with 5-fold cross-validation are 0.9953 and 0.7955, respectively, for photos and videos processing. A set of methods written in Python is available for the researchers, namely to preprocess and extract the features from the original photos and videos files and to build the training and testing sets. Additional methods are also available to convert the original PKL files into CSV and TXT, which gives more flexibility for the ML researchers to use the dataset on existing ML frameworks and tools. Full article
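The PKL-to-CSV conversion mentioned in the abstract could look roughly like the sketch below. It is not the authors' code: the function name `pkl_to_csv` is hypothetical, and the sketch assumes the pickle file holds a list of `(label, feature_vector)` pairs, which may not match the layout of the published files.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def pkl_to_csv(pkl_path, csv_path):
    """Write one CSV row per entry: the label first, then the feature values.

    Assumes (hypothetically) that the pickle stores a list of
    (label, [f0, f1, ...]) pairs. Only unpickle files from trusted
    sources, since unpickling can execute arbitrary code.
    """
    with open(pkl_path, "rb") as fh:
        entries = pickle.load(fh)
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        for label, features in entries:
            writer.writerow([label, *features])
    return len(entries)


def demo():
    """Round-trip a tiny made-up dataset through a temporary directory."""
    toy = [(0, [0.12, 0.5, 0.33]), (1, [0.9, 0.41, 0.07])]
    with tempfile.TemporaryDirectory() as tmp:
        pkl_file = Path(tmp) / "toy.pkl"
        csv_file = Path(tmp) / "toy.csv"
        pkl_file.write_bytes(pickle.dumps(toy))
        n_rows = pkl_to_csv(pkl_file, csv_file)
        rows = csv_file.read_text().splitlines()
    return n_rows, rows


if __name__ == "__main__":
    print(demo())
```

A flat CSV with the label in the first column is one layout that most ML tools can ingest directly; the column order here is a design choice for the sketch, not something stated in the abstract.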

(This article belongs to the Section Information Systems and Data Management)

30 pages, 4019 KB

Open Access | Editor’s Choice | Review

Machine Learning-Based Algorithms to Knowledge Extraction from Time Series Data: A Review

by Giuseppe Ciaburro and Gino Iannace

Data 2021, 6(6), 55; https://doi.org/10.3390/data6060055 - 25 May 2021

Cited by 36 | Viewed by 14799

Abstract

To predict the future behavior of a system, we can exploit the information collected in the past, trying to identify recurring structures in what happened to predict what could happen, if the same structures repeat themselves in the future as well. A time series represents a time sequence of numerical values observed in the past at a measurable variable. The values are sampled at equidistant time intervals, according to an appropriate granular frequency, such as the day, week, or month, and measured according to physical units of measurement. In machine learning-based algorithms, the information underlying the knowledge is extracted from the data themselves, which are explored and analyzed in search of recurring patterns or to discover hidden causal associations or relationships. The prediction model extracts knowledge through an inductive process: the input is the data and, possibly, a first example of the expected output, the machine will then learn the algorithm to follow to obtain the same result. This paper reviews the most recent work that has used machine learning-based techniques to extract knowledge from time series data. Full article

(This article belongs to the Special Issue Knowledge Extraction from Data Using Machine Learning)

6 pages, 255 KB

Open Access | Editor’s Choice | Data Descriptor

Hand-Washing Video Dataset Annotated According to the World Health Organization’s Hand-Washing Guidelines

by Martins Lulla, Aleksejs Rutkovskis, Andreta Slavinska, Aija Vilde, Anastasija Gromova, Maksims Ivanovs, Ansis Skadins, Roberts Kadikis and Atis Elsts

Data 2021, 6(4), 38; https://doi.org/10.3390/data6040038 - 7 Apr 2021

Cited by 13 | Viewed by 6772

Abstract

Washing hands is one of the most important ways to prevent infectious diseases, including COVID-19. The World Health Organization (WHO) has published hand-washing guidelines. This paper presents a large real-world dataset with videos recording medical staff washing their hands as part of their normal job duties in the Pauls Stradins Clinical University Hospital. There are 3185 hand-washing episodes in total, each of which is annotated by up to seven different persons. The annotations classify the washing movements according to the WHO guidelines by marking each frame in each video with a certain movement code. The intention of this “in-the-wild” dataset is two-fold: to serve as a basis for training machine-learning classifiers for automated hand-washing movement recognition and quality control, and to allow investigation of the real-world quality of washing performed by working medical staff. We demonstrate how the data can be used to train a machine-learning classifier that achieves classification accuracy of 0.7511 on a test dataset. Full article

12 pages, 5849 KB

Open Access | Editor’s Choice | Data Descriptor

Dataset of the Optimization of a Low Power Chemoresistive Gas Sensor: Predictive Thermal Modelling and Mechanical Failure Analysis

by Andrea Gaiardo, David Novel, Elia Scattolo, Alessio Bucciarelli, Pierluigi Bellutti and Giancarlo Pepponi

Data 2021, 6(3), 30; https://doi.org/10.3390/data6030030 - 9 Mar 2021

Cited by 5 | Viewed by 2899

Abstract

Over the last few years, the use of standard silicon microfabrication techniques in gas sensor technology has allowed for the development of ever-smaller, low-cost, and low-power devices. Specifically, the development of silicon microheaters (MHs) has become well established to produce MOS gas sensors. Therefore, the development of predictive models that help to define a priori the optimal design and layout of the device has become crucial in order to achieve both low power consumption and high mechanical stability. In this research dataset, we present the experimental data collected to develop a specific and useful predictive thermal-mechanical model for high-performing silicon MHs. To this aim, three MH layouts over three different membrane sizes were developed by using the standard silicon microfabrication process. The thermal and mechanical performances of the produced devices were experimentally evaluated by using probe stations and mechanical failure analysis, respectively. The measured thermal curves were used to develop the predictive thermal model towards low power consumption. Moreover, a statistical analysis was introduced to cross-correlate the mechanical failure results and the thermal predictive model, aiming at MH design optimization for gas sensing applications. All the data collected in this investigation are shown. Full article

11 pages, 797 KB

Open Access | Editor’s Choice | Data Descriptor

FIKWaste: A Waste Generation Dataset from Three Restaurant Kitchens in Portugal

by Lucas Pereira, Vitor Aguiar and Fábio Vasconcelos

Data 2021, 6(3), 25; https://doi.org/10.3390/data6030025 - 26 Feb 2021

Cited by 3 | Viewed by 4004

Abstract

In the era of big data and artificial intelligence, public datasets are becoming increasingly important for researchers to build and evaluate their models. This paper presents the FIKWaste dataset, which contains time series data for the volume of waste produced in three restaurant kitchens in Portugal. Organic (undifferentiated) and inorganic (glass, paper, and plastic) waste bins were monitored for a consecutive period of four weeks. In addition to the time series measurements, the FIKWaste dataset contains labels for waste disposal events, i.e., when the waste bins are emptied, and technical and non-technical details of the monitored kitchens. Full article

12 pages, 961 KB

Open Access | Editor’s Choice | Data Descriptor

A Long-Term, Real-Life Parkinson Monitoring Database Combining Unscripted Objective and Subjective Recordings

by Jeroen G. V. Habets, Margot Heijmans, Albert F. G. Leentjens, Claudia J. P. Simons, Yasin Temel, Mark L. Kuijf, Pieter L. Kubben and Christian Herff

Data 2021, 6(2), 22; https://doi.org/10.3390/data6020022 - 23 Feb 2021

Cited by 8 | Viewed by 5801

Abstract

Accurate real-life monitoring of motor and non-motor symptoms is a challenge in Parkinson’s disease (PD). The unobtrusive capturing of symptoms and their naturalistic fluctuations within or between days can improve evaluation and titration of therapy. First-generation commercial PD motion sensors show promise for augmenting clinical decision-making in general neurological consultation, but concerns remain regarding their short-term validity and long-term real-life usability. In addition, tools monitoring real-life subjective experiences of motor and non-motor symptoms are lacking. The dataset presented in this paper constitutes a combination of objective kinematic data and subjective experiential data, recorded in parallel in a naturalistic, long-term real-life setting. The objective data consist of accelerometer and gyroscope data, and the subjective data consist of data from ecological momentary assessments. Twenty PD patients were monitored without daily life restrictions for fourteen consecutive days. The two types of data can be used to address hypotheses on naturalistic motor and/or non-motor symptomatology in PD. Full article

(This article belongs to the Special Issue Data from Smartphones and Wearables)

85 pages, 10056 KB

Open Access | Editor’s Choice | Review

A Systematic Survey of ML Datasets for Prime CV Research Areas—Media and Metadata

by Helder F. Castro, Jaime S. Cardoso and Maria T. Andrade

Data 2021, 6(2), 12; https://doi.org/10.3390/data6020012 - 22 Jan 2021

Cited by 3 | Viewed by 6096

Abstract

The ever-growing capabilities of computers have enabled pursuing Computer Vision through Machine Learning (i.e., MLCV). ML tools require large amounts of information to learn from (ML datasets). These are costly to produce but have received reduced attention regarding standardization. This prevents the cooperative production and exploitation of these resources, impedes countless synergies, and hinders ML research. No global view exists of the MLCV dataset tissue. Acquiring it is fundamental to enable standardization. We provide an extensive survey of the evolution and current state of MLCV datasets (1994 to 2019) for a set of specific CV areas as well as a quantitative and qualitative analysis of the results. Data were gathered from online scientific databases (e.g., Google Scholar, CiteSeerX). We reveal the heterogeneous plethora that comprises the MLCV dataset tissue; their continuous growth in volume and complexity; the specificities of the evolution of their media and metadata components regarding a range of aspects; and that MLCV progress requires the construction of a global standardized (structuring, manipulating, and sharing) MLCV “library”. Accordingly, we formulate a novel interpretation of this dataset collective as a global tissue of synthetic cognitive visual memories and define the immediately necessary steps to advance its standardization and integration. Full article

(This article belongs to the Section Information Systems and Data Management)

11 pages, 1508 KB

Open Access | Editor’s Choice | Data Descriptor

Data for Sustainable Platform Economy: Connections between Platform Models and Sustainable Development Goals

by Mayo Fuster Morell, Ricard Espelt and Enric Senabre Hidalgo

Data 2021, 6(2), 7; https://doi.org/10.3390/data6020007 - 20 Jan 2021

Cited by 7 | Viewed by 8005

Abstract

In recent years, the platform economy has been recognised by researchers and governments around the world for its potential to contribute to the sustainable development of society. Yet, platform economy cases such as Uber, Airbnb, and Deliveroo have created a huge controversy over their socioeconomic impact, while other alternative models have been associated with a new form of cooperativism. In parallel, the United Nations are advocating global sustainable development by promoting Sustainable Development Goals (SDGs), considering elements such as decent work, inclusive and sustainable economic growth, and fostering innovation. The SDGs have, however, also been criticised for lacking a digital perspective. This dataset draws on the data collections of two 2020 European projects (DECODE and PLUS) and makes it possible to compare different platform economy models and their connections with the SDGs. Full article

(This article belongs to the Special Issue A European Approach to the Establishment of Data Spaces)

14 pages, 2518 KB

Open Access | Editor’s Choice | Data Descriptor

Aircraft Engine Run-to-Failure Dataset under Real Flight Conditions for Prognostics and Diagnostics

by Manuel Arias Chao, Chetan Kulkarni, Kai Goebel and Olga Fink

Data 2021, 6(1), 5; https://doi.org/10.3390/data6010005 - 13 Jan 2021

Cited by 215 | Viewed by 26007

Abstract

A key enabler of intelligent maintenance systems is the ability to predict the remaining useful lifetime (RUL) of their components, i.e., prognostics. The development of data-driven prognostics models requires datasets with run-to-failure trajectories. However, large representative run-to-failure datasets are often unavailable in real applications because failures are rare in many safety-critical systems. To foster the development of prognostics methods, we develop a new realistic dataset of run-to-failure trajectories for a fleet of aircraft engines under real flight conditions. The dataset was generated with the Commercial Modular Aero-Propulsion System Simulation (CMAPSS) model developed at NASA. The damage propagation modelling used in this dataset builds on the modelling strategy from previous work and incorporates two new levels of fidelity. First, it considers real flight conditions as recorded on board a commercial jet. Second, it extends the degradation modelling by relating the degradation process to its operation history. This dataset also provides the health and fault classes. Therefore, besides its applicability to prognostics problems, the dataset can be used for fault diagnostics. Full article
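
To make the notion of RUL on a run-to-failure trajectory concrete, the following minimal sketch labels every cycle of one unit with the number of cycles left before failure. The function name `rul_labels` is hypothetical, and the sketch assumes failure occurs at the last recorded cycle; the dataset's own RUL definition may differ.

```python
def rul_labels(n_cycles):
    """Return the remaining useful lifetime, in cycles, at each cycle of one unit.

    Assumption: the unit fails at its last recorded cycle, so RUL is 0 there.
    """
    if n_cycles < 1:
        raise ValueError("a trajectory needs at least one cycle")
    return [n_cycles - 1 - t for t in range(n_cycles)]


# A unit observed for four cycles before failure.
labels = rul_labels(4)
# labels == [3, 2, 1, 0]
```

These labels are what a data-driven prognostics model would be trained to predict from the sensor readings at each cycle.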

15 pages, 416 KB

Open Access | Editor’s Choice | Data Descriptor

A Public Dataset of 24-h Multi-Levels Psycho-Physiological Responses in Young Healthy Adults

by Alessio Rossi, Eleonora Da Pozzo, Dario Menicagli, Chiara Tremolanti, Corrado Priami, Alina Sîrbu, David A. Clifton, Claudia Martini and Davide Morelli

Data 2020, 5(4), 91; https://doi.org/10.3390/data5040091 - 25 Sep 2020

Cited by 31 | Viewed by 12454

Abstract

Wearable devices now make it possible to record large quantities of physiological data, which can be used to obtain a clearer view of a person’s health status and behavior. However, to the best of our knowledge, there are no open datasets in the literature that provide psycho-physiological data. The Multilevel Monitoring of Activity and Sleep in Healthy people (MMASH) dataset presented in this paper provides 24 h of continuous psycho-physiological data, that is, inter-beat interval data, heart rate data, wrist accelerometry data, sleep quality index, physical activity (i.e., number of steps per second), psychological characteristics (e.g., anxiety status, stressful events, and emotion declaration), and sleep hormone levels for 22 participants. The MMASH dataset will enable the investigation of possible relationships between the physical and psychological characteristics of people in daily life. Data were validated through different analyses that showed their compatibility with the literature. Full article

(This article belongs to the Special Issue Data from Smartphones and Wearables)

40 pages, 1474 KB

Open Access | Editor’s Choice | Article

Survey of Decentralized Solutions with Mobile Devices for User Location Tracking, Proximity Detection, and Contact Tracing in the COVID-19 Era

by Viktoriia Shubina, Sylvia Holcer, Michael Gould and Elena Simona Lohan

Data 2020, 5(4), 87; https://doi.org/10.3390/data5040087 - 23 Sep 2020

Cited by 58 | Viewed by 16173

Abstract

Some of the recent developments in data science for worldwide disease control have involved research into the large-scale feasibility and usefulness of digital contact tracing, user location tracking, and proximity detection on users’ mobile devices or wearables. A centralized solution relying on collecting and storing user traces and location information on a central server can provide more accurate and timely actions than a decentralized solution in combating viral outbreaks, such as COVID-19. However, centralized solutions are more prone to privacy breaches and privacy attacks by malevolent third parties than decentralized solutions, which store the information in a distributed manner among wireless networks. Thus, it is of timely relevance to identify and summarize the existing privacy-preserving solutions, focusing on decentralized methods, and to analyze them in the context of mobile device-based localization and tracking, contact tracing, and proximity detection. Wearables and other mobile Internet of Things devices are of particular interest in our study, as not only privacy but also energy-efficiency targets are becoming more and more critical to end-users. This paper provides a comprehensive survey of user location-tracking, proximity-detection, and digital contact-tracing solutions in the literature from the past two decades, analyzes their advantages and drawbacks concerning centralized and decentralized solutions, and presents the authors’ thoughts on future research directions in this timely research field. Full article

(This article belongs to the Section Featured Reviews of Data Science Research)

16 pages, 3385 KB

Open Access | Editor’s Choice | Data Descriptor

Experimental Force Data of a Restrained ROV under Waves and Current

by Roman Gabl, Thomas Davey, Yu Cao, Qian Li, Boyang Li, Kyle L. Walker, Francesco Giorgio-Serchi, Simona Aracri, Aristides Kiprakis, Adam A. Stokes and David M. Ingram

Data 2020, 5(3), 57; https://doi.org/10.3390/data5030057 - 30 Jun 2020

Cited by 29 | Viewed by 5819

Abstract

Hydrodynamic forces are an important input value for the design, navigation and station keeping of underwater Remotely Operated Vehicles (ROVs). The experiment investigated the forces imparted by currents (with representative real world turbulence) and waves on a commercially available ROV, namely the BlueROV2 (Blue Robotics, Torrance, USA). Three different distances of a simplified cylindrical obstacle (shading effects) were investigated in addition to the free stream cases. Eight tethers held the ROV in the middle of the 2 m water depth to minimise the influence of the support structure without completely restricting the degrees of freedom (DoF). Each tether was equipped with a load cell and small motions and rotations were documented with an underwater video motion capture system. The paper describes the experimental set-up, input values (current speed and wave definitions) and initial processing of the data. In addition to the raw data, a processed dataset is provided, which includes forces in all three main coordinate directions for each mounting point synchronised with the 6DoF results and the free surface elevations. The provided dataset can be used as a validation experiment as well as for testing and development of an algorithm for position control of comparable ROVs. Full article

18 pages, 501 KB

Open Access | Editor’s Choice | Article

Trend Analysis on Adoption of Virtual and Augmented Reality in the Architecture, Engineering, and Construction Industry

by Mojtaba Noghabaei, Arsalan Heydarian, Vahid Balali and Kevin Han

Data 2020, 5(1), 26; https://doi.org/10.3390/data5010026 - 13 Mar 2020

Cited by 180 | Viewed by 27178

Abstract

With advances in Building Information Modeling (BIM), Virtual Reality (VR) and Augmented Reality (AR) technologies have many potential applications in the Architecture, Engineering, and Construction (AEC) industry. However, the AEC industry, relative to other industries, has been slow in adopting AR/VR technologies, partly due to lack of feasibility studies examining the actual cost of implementation versus an increase in profit. The main objectives of this paper are to understand the industry trends in adopting AR/VR technologies and to identify gaps within the industry. The identified gaps can lead to opportunities for developing new tools and finding new use cases. To achieve these goals, two rounds of a survey at two different time periods (a year apart) were conducted. Responses from 158 industry experts and researchers were analyzed to assess the current state, growth, and saving opportunities for AR/VR technologies for the AEC industry. The findings demonstrate that older generations are significantly more confident about the future of AR/VR technologies and see more benefits in AR/VR utilization. Furthermore, the research results indicate that the residential and commercial sectors have adopted these tools the most compared to other sectors, and that the institutional and transportation sectors had the highest growth from 2017 to 2018. Industry experts anticipated solid growth in the use of AR/VR technologies in 5 to 10 years, with the highest expectations towards healthcare. Ultimately, the findings show a significant increase in AR/VR utilization in the AEC industry from 2017 to 2018. Full article

(This article belongs to the Special Issue Data Sensing and Analysis in Design, Construction, Operation, Monitoring, and Maintenance of Built Environments)

42 pages, 3117 KB

Open Access | Editor’s Choice | Review

Basic Features of the Analysis of Germination Data with Generalized Linear Mixed Models

by Alberto Gianinetti

Data 2020, 5(1), 6; https://doi.org/10.3390/data5010006 - 8 Jan 2020

Cited by 24 | Viewed by 9540

Abstract

Germination data are discrete and binomial. Although analysis of variance (ANOVA) has long been used for the statistical analysis of these data, generalized linear mixed models (GzLMMs) provide a more consistent theoretical framework. GzLMMs are suitable for final germination percentages (FGP) as well as longitudinal studies of germination time-courses. Germination indices (i.e., single-value parameters summarizing the results of a germination assay by combining the level and rapidity of germination) and other data with a Gaussian error distribution can be analyzed too. There are, however, different kinds of GzLMMs: Conditional (i.e., random effects are modeled as deviations from the general intercept with a specific covariance structure), marginal (i.e., random effects are modeled solely as a variance/covariance structure of the error terms), and quasi-marginal (some random effects are modeled as deviations from the intercept and some are modeled as a covariance structure of the error terms) models can be applied to the same data. 
It is shown that: (a) For germination data, conditional, marginal, and quasi-marginal GzLMMs tend to converge to a similar inference; (b) conditional models are the first choice for FGP; (c) marginal or quasi-marginal models are more suited for longitudinal studies, although conditional models lead to a congruent inference; (d) in general, common random factors are better dealt with as random intercepts, whereas serial correlation is easier to model in terms of the covariance structure of the error terms; (e) germination indices are not binomial and can be easier to analyze with a marginal model; (f) in boundary conditions (when some means approach 0% or 100%), conditional models with an integral approximation of true likelihood are more appropriate; in non-boundary conditions, (g) germination data can be fitted with default pseudo-likelihood estimation techniques, on the basis of the SAS-based code templates provided here; (h) GzLMMs are remarkably good for the analysis of germination data except if some means are 0% or 100%. In this case, alternative statistical approaches may be used, such as survival analysis or linear mixed models (LMMs) with transformed data, unless an ad hoc data adjustment in estimates of limit means is considered, either experimentally or computationally. This review is intended as a basic tutorial for the application of GzLMMs, and is, therefore, of interest primarily to researchers in the agricultural sciences. Full article
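
One way to see why means of 0% or 100% are awkward for binomial models is to look at the link function. The sketch below assumes a logit link, which is the canonical link for binomial data; the abstract itself does not name the link, and the helper names and seed counts are hypothetical.

```python
import math


def logit(p):
    """Map a proportion strictly between 0 and 1 to the log-odds scale."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is undefined at 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map a log-odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def germination_logit(germinated, sown):
    """Log-odds of germination for one assay (counts are made-up inputs)."""
    return logit(germinated / sown)


# 45 of 50 seeds germinated: a finite value on the log-odds scale.
eta = germination_logit(45, 50)

# 50 of 50 seeds germinated: the log-odds would be infinite, so this raises.
try:
    germination_logit(50, 50)
except ValueError:
    pass
```

On the log-odds scale a mean of exactly 0% or 100% has no finite value, which is consistent with the review's advice to use alternative approaches or a data adjustment in that case.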
