Data, Volume 7, Issue 7 (July 2022) – 19 articles

Cover Story: Due to vast dynamic data generation, it is impractical to manually filter social media's rich information on events and sentiments, making automated extraction mechanisms invaluable to the community. Real data with ground truth labels are required to build/evaluate such systems. Still, to the best of our knowledge, no available social media dataset covers continuous periods with both event and sentiment labels; existing datasets carry either event labels or sentiment labels, but not both. Filling this gap, TED-S is built with continuous subsets of Twitter streams with both event and sentiment labels to support event sentiment-based research. With TED-S, an automatic data annotation approach appropriate for event sentiment labeling is also proposed, involving several neural networks.
5 pages, 227 KB  
Data Descriptor
First Draft Genome Assembly of Tropical Bed Bug, Cimex hemipterus (F.)
by Li Lim and Abdul Hafiz Ab Majid
Data 2022, 7(7), 101; https://doi.org/10.3390/data7070101 - 21 Jul 2022
Viewed by 2203
Abstract
Cimex hemipterus, a blood-feeding ectoparasite commonly found in tropical regions, is a notorious household pest. The draft genome assembly of C. hemipterus is presented in this study, generated using SPAdes software with Illumina short reads. The obtained genome size was 388.66 Mb with a contig N50 size of 3503 bp. BUSCO assessment indicated that 96.71% of the expected Insecta lineage genes were complete in the genome assembly. Annotation of the C. hemipterus genome assembly identified 2.88% of repetitive sequences and 17,254 protein-coding genes. Functional annotation showed that most gene families are involved in cellular processes and signaling. This first C. hemipterus genome will be helpful in further understanding the bed bug genetics and evolution, while the annotated genome may also help in devising new strategies in bed bug management. Full article
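For readers unfamiliar with the contig N50 figure quoted in this abstract (3503 bp), the following is a minimal sketch of how the statistic is conventionally computed from a list of contig lengths. It is not taken from the paper or from SPAdes; the function name `n50` and the sample lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    # Only reached when no lengths were supplied.
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

A low N50 relative to the total assembly size, as reported here, generally indicates a fragmented assembly made of many short contigs.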
14 pages, 6283 KB  
Data Descriptor
Measured Indoor Environmental Data in a Retrofitted Multiapartment Building to Assess Energy Flexibility and Thermal Safety during Winter Power Outages
by Silvia Erba and Alessandra Barbieri
Data 2022, 7(7), 100; https://doi.org/10.3390/data7070100 - 19 Jul 2022
Cited by 2 | Viewed by 1916
Abstract
The article describes detailed measurements of indoor environmental parameters in a multiapartment housing block located in Milan, Italy, which has recently undergone a deep energy retrofit and is used as a thermal battery during the winter season. Two datasets are provided: one refers to a series of experimental tests conducted by the authors in an unoccupied flat, in which the thermal capacity of the building mass is exploited to act as an energy storage. The dataset reports, with a time step of 10 min, measurements of air temperature, globe temperature and surface temperatures in the analyzed room and data characterizing the adjacent spaces and the outdoor conditions. The second set of data refers to the air temperature monitoring carried out continuously in all the apartments of the apartment block, and hence also during two unplanned heating power outages. The analyzed data show the role of deep renovations in extending the time over which a building can remain in the thermal comfort range after an energy interruption and thus highlight the potential role of retrofitted buildings in delivering energy flexibility services to related stakeholders, such as the occupants, the building manager, the grid operator, and others. Furthermore, the dataset can be used to calibrate an energy simulation model to investigate different demand-side flexibility strategies and evaluate thermal safety under extreme weather events. Full article
27 pages, 2571 KB  
Article
A Cross-Sectional Study on Mental Health of School Students during the COVID-19 Pandemic in India
by Sibnath Deb, Samarjit Kar, Shayana Deb, Sanjib Biswas, Aehsan Ahmad Dar and Tusharika Mukherjee
Data 2022, 7(7), 99; https://doi.org/10.3390/data7070099 - 18 Jul 2022
Cited by 8 | Viewed by 6086
Abstract
The broad objective of the present study is to assess the levels of anxiety and depression of school students during the COVID-19 lockdown phase and their association with students’ background, stress, concerns and social support. In this regard, the present study follows a novel two-stage approach. In the first phase, an empirical survey was carried out, based on multivariate statistical analysis, wherein a group of 273 school students participated in the study voluntarily. In the second phase, a novel Picture Fuzzy FFA (PF-FFA) method was applied for understanding the dynamics of facilitating and prohibiting factors for three categories of focus groups (FG), formulated on the basis of attendance in online classes. Findings revealed a significant impact of anxiety and depression on mental health. Further, PF-FFA examined the impact of the driving forces that steered children to attend class as contrasted with the impact of the restricting forces. Full article
18 pages, 8780 KB  
Article
SBGTool v2.0: An Empirical Study on a Similarity-Based Grouping Tool for Students’ Learning Outcomes
by Zeynab (Artemis) Mohseni, Rafael M. Martins and Italo Masiello
Data 2022, 7(7), 98; https://doi.org/10.3390/data7070098 - 18 Jul 2022
Cited by 8 | Viewed by 2776
Abstract
Visual learning analytics (VLA) tools and technologies enable the meaningful exchange of information between educational data and teachers. This allows teachers to create meaningful groups of students based on possible collaboration and productive discussions. VLA tools also allow a better understanding of students’ educational demands. Finding similar samples in huge educational datasets, however, involves the use of effective similarity measures that represent the teacher’s purpose. In this study, we conducted a user study and improved our web-based similarity-based grouping VLA tool (SBGTool) to help teachers categorize students into groups based on their similar learning outcomes and activities. SBGTool v2.0 differs from SBGTool due to design changes made in response to teacher suggestions, the addition of sorting options to the dashboard table, the addition of a dropdown component to group the students into classrooms, and improvement in some visualizations. To counteract color blindness, we have also considered a number of color palettes. By applying SBGTool v2.0, teachers may compare the outcomes of individual students inside a classroom, determine which subjects are the most and least difficult over the period of a week or an academic year, identify the numbers of correct and incorrect responses for the most difficult and easiest subjects, categorize students into various groups based on their learning outcomes, discover the week with the most interactions for examining students’ engagement, and find the relationship between students’ activity and study success. We used 10,000 random samples from the EdNet dataset, a large-scale hierarchical educational dataset consisting of student–system interactions from multiple platforms at the university level, collected over a two-year period, to illustrate the tool’s efficacy. Finally, we provide the outcomes of the user study that evaluated the tool’s effectiveness. 
The results revealed that even with limited training, the participants were able to complete the required analysis tasks. Additionally, the participants’ feedback showed that the SBGTool v2.0 gained a good level of support for the given tasks, and it had the potential to assist teachers in enhancing collaborative learning in their classrooms. Full article
11 pages, 2166 KB  
Data Descriptor
Dataset: Mobility Patterns of a Coastal Area Using Traffic Classification Radars
by Joaquim Ferreira, Rui Aguiar, José A. Fonseca, João Almeida, João Barraca, Diogo Gomes, Rafael Oliveira, João Rufino, Fernando Braz and Pedro Gonçalves
Data 2022, 7(7), 97; https://doi.org/10.3390/data7070097 - 13 Jul 2022
Viewed by 2387
Abstract
Monitoring road traffic is extremely important given the possibilities it opens up in terms of studying the behavior of road users, road design and planning problems, as well as because it can be used to predict future traffic. Especially on highways that connect beaches and larger urban areas, traffic is characterized by having peaks that are highly dependent on weather conditions and rest periods. This paper describes a dataset of mobility patterns of a coastal area in Aveiro region, Portugal, fully covered with traffic classification radars, over a two-year period. The sensing infrastructure was deployed in the scope of the PASMO project, an open living lab for co-operative intelligent transportation systems. The data gathered includes the speed of the detected objects, their position, and their type (heavy vehicle, light vehicle, two-wheeler, and pedestrian). The dataset includes 74,305 records, corresponding to the aggregation of road information at 10 min intervals. A brief analysis of the dataset shows the highly dynamic nature of traffic during the two-year period. In addition, the existence of meteorological records from nearby stations, and the recording of daily data on COVID-19 infections, make it possible to cross-reference information and study the influence of weather conditions and infections on traffic behavior. Full article
(This article belongs to the Special Issue Knowledge Extraction from Data Using Machine Learning)
17 pages, 2198 KB  
Data Descriptor
SEN2VENµS, a Dataset for the Training of Sentinel-2 Super-Resolution Algorithms
by Julien Michel, Juan Vinasco-Salinas, Jordi Inglada and Olivier Hagolle
Data 2022, 7(7), 96; https://doi.org/10.3390/data7070096 - 13 Jul 2022
Cited by 26 | Viewed by 10130 | Correction
Abstract
Boosted by the progress in deep learning, Single Image Super-Resolution (SISR) has gained a lot of interest in the remote sensing community, who sees it as an opportunity to compensate for satellites’ ever-limited spatial resolution with respect to end users’ needs. This is especially true for Sentinel-2 because of its unique combination of resolution, revisit time, global coverage and free and open data policy. While there has been a great amount of work on network architectures in recent years, deep-learning-based SISR in remote sensing is still limited by the availability of the large training sets it requires. The lack of publicly available large datasets with the required variability in terms of landscapes and seasons pushes researchers to simulate their own datasets by means of downsampling. This may impair the applicability of the trained model on real-world data at the target input resolution. This paper presents SEN2VENµS, an open-data licensed dataset composed of 10 m and 20 m cloud-free surface reflectance patches from Sentinel-2, with their reference spatially registered surface reflectance patches at 5 m resolution acquired on the same day by the VENµS satellite. This dataset covers 29 locations on earth with a total of 132,955 patches of 256 × 256 pixels at 5 m resolution and can be used for the training and comparison of super-resolution algorithms to bring the spatial resolution of 8 of the Sentinel-2 bands up to 5 m. Full article
(This article belongs to the Section Spatial Data Science and Digital Earth)
5 pages, 530 KB  
Data Descriptor
Annotations of Lung Abnormalities in the Shenzhen Chest X-ray Dataset for Computer-Aided Screening of Pulmonary Diseases
by Feng Yang, Pu Xuan Lu, Min Deng, Yì Xiáng J. Wáng, Sivaramakrishnan Rajaraman, Zhiyun Xue, Les R. Folio, Sameer K. Antani and Stefan Jaeger
Data 2022, 7(7), 95; https://doi.org/10.3390/data7070095 - 13 Jul 2022
Cited by 15 | Viewed by 6833
Abstract
Developments in deep learning techniques have led to significant advances in automated abnormality detection in radiological images and paved the way for their potential use in computer-aided diagnosis (CAD) systems. However, the development of CAD systems for pulmonary tuberculosis (TB) diagnosis is hampered by the lack of training data that is of good visual and diagnostic quality, of sufficient size, variety, and, where relevant, containing fine-region annotations. This study presents a collection of annotations/segmentations of pulmonary radiological manifestations that are consistent with TB in the publicly available and widely used Shenzhen chest X-ray (CXR) dataset made available by the U.S. National Library of Medicine and obtained via a research collaboration with No. 3 People’s Hospital Shenzhen, China. The goal of releasing these annotations is to advance the state of the art for image segmentation methods toward improving the performance of the fine-grained segmentation of TB-consistent findings in digital chest X-ray images. The annotation collection comprises the following: (1) annotation files in JavaScript Object Notation (JSON) format that indicate locations and shapes of 19 lung pattern abnormalities for 336 TB patients; (2) mask files saved in PNG format for each abnormality per TB patient; and (3) a comma-separated values (CSV) file that summarizes lung abnormality types and numbers per TB patient. To the best of our knowledge, this is the first collection of pixel-level annotations of TB-consistent findings in CXRs. Full article
30 pages, 799 KB  
Review
A Systematic Review of Deep Knowledge Graph-Based Recommender Systems, with Focus on Explainable Embeddings
by Ronky Francis Doh, Conghua Zhou, John Kingsley Arthur, Isaac Tawiah and Benjamin Doh
Data 2022, 7(7), 94; https://doi.org/10.3390/data7070094 - 12 Jul 2022
Cited by 10 | Viewed by 7755
Abstract
Recommender systems (RS) have been developed to make personalized suggestions and enrich users’ preferences in various online applications to address the information explosion problems. However, traditional recommender-based systems act as black boxes, not presenting the user with insights into the system logic or reasons for recommendations. Recently, generating explainable recommendations with deep knowledge graphs (DKG) has attracted significant attention. DKG is a subset of explainable artificial intelligence (XAI) that utilizes the strengths of deep learning (DL) algorithms to learn, provide high-quality predictions, and complement the weaknesses of knowledge graphs (KGs) in the explainability of recommendations. DKG-based models can provide more meaningful, insightful, and trustworthy justifications for recommended items and alleviate the information explosion problems. Although several studies have been carried out on RS, only a few papers have been published on DKG-based methodologies, and a review in this new research direction is still insufficiently explored. To fill this literature gap, this paper uses a systematic literature review framework to survey the recently published papers from 2018 to 2022 in the landscape of DKG and XAI. We analyze how the methods produced in these papers extract essential information from graph-based representations to improve recommendations’ accuracy, explainability, and reliability. From the perspective of the leveraged knowledge-graph related information and how the knowledge-graph or path embeddings are learned and integrated with the DL methods, we carefully select and classify these published works into four main categories: the Two-stage explainable learning methods, the Joint-stage explainable learning methods, the Path-embedding explainable learning methods, and the Propagation explainable learning methods. 
We further summarize these works according to the characteristics of the approaches and the recommendation scenarios to facilitate the ease of checking the literature. We finally conclude by discussing some open challenges left for future research in this vibrant field. Full article
(This article belongs to the Section Information Systems and Data Management)
20 pages, 456 KB  
Review
The Role of Human Knowledge in Explainable AI
by Andrea Tocchetti and Marco Brambilla
Data 2022, 7(7), 93; https://doi.org/10.3390/data7070093 - 6 Jul 2022
Cited by 26 | Viewed by 9189
Abstract
As the performance and complexity of machine learning models have grown significantly over the last years, there has been an increasing need to develop methodologies to describe their behaviour. Such a need has mainly arisen due to the widespread use of black-box models, i.e., high-performing models whose internal logic is challenging to describe and understand. Therefore, the machine learning and AI field is facing a new challenge: making models more explainable through appropriate techniques. The final goal of an explainability method is to faithfully describe the behaviour of a (black-box) model to users who can get a better understanding of its logic, thus increasing the trust and acceptance of the system. Unfortunately, state-of-the-art explainability approaches may not be enough to guarantee the full understandability of explanations from a human perspective. For this reason, human-in-the-loop methods have been widely employed to enhance and/or evaluate explanations of machine learning models. These approaches focus on collecting human knowledge that AI systems can then employ or involving humans to achieve their objectives (e.g., evaluating or improving the system). This article aims to present a literature overview on collecting and employing human knowledge to improve and evaluate the understandability of machine learning models through human-in-the-loop approaches. Furthermore, a discussion on the challenges, state-of-the-art, and future trends in explainability is also provided. Full article
(This article belongs to the Special Issue Knowledge Extraction from Data Using Machine Learning)
8 pages, 1428 KB  
Data Descriptor
A Database of Topo-Bathy Cross-Shore Profiles and Characteristics for U.S. Atlantic and Gulf of Mexico Sandy Coastlines
by Rangley C. Mickey and Davina L. Passeri
Data 2022, 7(7), 92; https://doi.org/10.3390/data7070092 - 6 Jul 2022
Cited by 3 | Viewed by 3123
Abstract
A database of seamless topographic and bathymetric cross-shore profiles along with metrics of the associated morphological characteristics based on the latest available lidar data ranging from 2011–2020 and bathymetry from the Continuously Updated Digital Elevation Model was developed for U.S. Atlantic and Gulf of Mexico open-ocean sandy coastlines. Cross-shore resolution ranges from 2.5 m for topographic and nearshore portions to 10 m for offshore portions. Topographic morphological characteristics include: foredune crest elevation, foredune toe elevation, foredune width, foredune volume, foredune relative height, beach width, beach volume, beach slope, and nearshore slope. This database was developed to serve as inputs for current and future morphological modeling studies aimed at providing real-time estimates of coastal change magnitudes resulting from imminent tropical storm and hurricane landfall. Beyond this need for model inputs, the database of cross-shore profiles and characteristic metrics could serve as a tool for coastal scientists to visualize and to analyze varying local, regional, and national variations in coastal morphology for varying types of studies and projects related to Atlantic and Gulf of Mexico sandy coastline environments. Full article
13 pages, 437 KB  
Data Descriptor
Unified, Labeled, and Semi-Structured Database of Pre-Processed Mexican Laws
by Bella Martinez-Seis, Obdulia Pichardo-Lagunas, Harlan Koff, Miguel Equihua, Octavio Perez-Maqueo and Arturo Hernández-Huerta
Data 2022, 7(7), 91; https://doi.org/10.3390/data7070091 - 6 Jul 2022
Cited by 1 | Viewed by 2684
Abstract
This paper presents a corpus of pre-processed Mexican laws for computational tasks. The main contributions are the proposed JSON structure and the methodology used to achieve the semi-structured corpus with the selected algorithms. Law PDF documents were transformed into plain text, unified by a deconstruction of law–document structure, and labeled with natural language processing techniques considering part of speech (PoS); a process of entity extraction was also performed. The corpus includes the Mexican constitution and the Mexican laws that were collected from the official site in PDF format repealed before 14 October 2021. The collection has 305 documents, including: the Mexican constitution, 289 laws, 8 federal codes, 3 regulations, 2 statutes, 1 decree, and 1 ordinance. The semi-structured database includes the transformation of the set of laws from PDF format to a digital representation in order to facilitate its computational analysis. The documents were migrated to JSON type files to represent internal hierarchical relations. In addition, basic natural language processing techniques were implemented on laws for the identification of part of speech and named entities. The presented data set is mainly useful for text analysis and data science. It could be used for various legislative analysis tasks including: comprehension, interpretation, translation, classification, accessibility, coherence, and searches. Finally, we present some statistics of the identified entities and an example of the usefulness of the corpus for environmental laws. Full article
16 pages, 1317 KB  
Data Descriptor
TED-S: Twitter Event Data in Sports and Politics with Aggregated Sentiments
by Hansi Hettiarachchi, Doaa Al-Turkey, Mariam Adedoyin-Olowe, Jagdev Bhogal and Mohamed Medhat Gaber
Data 2022, 7(7), 90; https://doi.org/10.3390/data7070090 - 30 Jun 2022
Cited by 3 | Viewed by 4686
Abstract
Even though social media contain rich information on events and public opinions, it is impractical to manually filter this information because data are generated in vast volumes and change rapidly. Thus, automated extraction mechanisms are invaluable to the community. We need real data with ground truth labels to build/evaluate such systems. Still, to the best of our knowledge, no available social media dataset covers continuous periods with both event and sentiment labels; existing datasets carry either event labels or sentiment labels, but not both. Datasets without time gaps are huge due to high data generation and require extensive effort for manual labelling. Different approaches, ranging from unsupervised to supervised, have been proposed by previous research targeting such datasets. However, their generic nature mainly fails to capture event-specific sentiment expressions, making them inappropriate for labelling event sentiments. Filling this gap, we propose a novel data annotation approach in this paper involving several neural networks. Our approach outperforms the commonly used sentiment annotation models such as VADER and TextBlob. Also, it generates probability values for all sentiment categories besides providing a single category per tweet, supporting aggregated sentiment analyses. Using this approach, we annotate and release a dataset named TED-S, covering two diverse domains, sports and politics. TED-S has complete subsets of Twitter data streams with both sub-event and sentiment labels, providing the ability to support event sentiment-based research. Full article
(This article belongs to the Section Information Systems and Data Management)
9 pages, 3492 KB  
Data Descriptor
Goat Kidding Dataset
by Pedro Gonçalves, Maria R. Marques, Ana T. Belo, António Monteiro and Fernando Braz
Data 2022, 7(7), 89; https://doi.org/10.3390/data7070089 - 29 Jun 2022
Cited by 5 | Viewed by 3363
Abstract
The detection of kidding in production animals is of the utmost importance, given the frequency of problems associated with the process, and the fact that timely human help can be a safeguard for the well-being of the mother and kid. The continuous human monitoring of the process is expensive, given the uncertainty of when it will occur, so an autonomous mechanism that detects kidding would make it possible to call the person responsible, who could intervene at the opportune moment. The present dataset consists of data from the sensorization of 16 pregnant and two non-pregnant Charnequeira goats, during a period of four weeks, the kidding period. The data include measurements of neck-to-floor height, obtained by ultrasound, and accelerometry data from an accelerometer mounted on the monitoring collar. Data were sampled continuously throughout the experiment every 10 s. The goats were monitored both in the goat shelter (day and night) and during the grazing period in the pasture. The births were also recorded, including the time at which they took place, details of how they took place, and the number of offspring; notes were also added. Full article
6 pages, 2432 KB  
Data Descriptor
Daily Precipitation Data for the Mexico City Metropolitan Area from 1930 to 2015
by Erika D. López-Espinoza, Oscar A. Fuentes-Mariles, Dulce R. Herrera-Moro, Octavio Gómez-Ramos, David A. Novelo-Casanova and Jorge Zavala-Hidalgo
Data 2022, 7(7), 88; https://doi.org/10.3390/data7070088 - 29 Jun 2022
Cited by 1 | Viewed by 3602
Abstract
The Metropolitan Zone of Mexico City, as well as the associated basin, includes the territories of Mexico City, some municipalities of the State of Mexico and the state of Hidalgo. In addition, this area is the most densely populated in Mexico. The region is influenced by mid-latitude and tropical weather systems and is vulnerable to extreme hydrometeorological events. In this context, we developed a dataset from the records of 136 geolocated sites that includes daily precipitation data from the CLImate COMputing (CLICOM) project and the Mexico City Water System. The data spans the period from 1930 to 2015 for the rainy months (June–October) from stations with records of 20 or more years. In each recording site, automatic and manual data quality control were performed to verify the consistency of the daily precipitation data. We believe that our highly dense precipitation dataset will be useful for climate, trend and extreme events analysis. Additionally, the data will allow validating simulations of numerical atmospheric models. The dataset is public, and it was previously used in other research to determine areas susceptible to flooding due to heavy rain events and to develop a web mapping application of daily precipitation data. Full article
15 pages, 424 KB  
Article
Context Sensitive Verb Similarity Dataset for Legal Information Extraction
by Gathika Ratnayaka, Nisansa de Silva, Amal Shehan Perera, Gayan Kavirathne, Thirasara Ariyarathna and Anjana Wijesinghe
Data 2022, 7(7), 87; https://doi.org/10.3390/data7070087 - 28 Jun 2022
Cited by 3 | Viewed by 5042
Abstract
Existing literature demonstrates that verbs are pivotal in legal information extraction tasks due to their semantic and argumentative properties. However, granting computers the ability to interpret the meaning of a verb and its semantic properties in relation to a given context is a challenging task, mainly due to the polysemic and domain-specific behaviours of verbs. Therefore, developing mechanisms to identify the behaviours of verbs, and to evaluate how artificial models detect their domain-specific and polysemic behaviours, are tasks of significant importance. In this regard, a comprehensive dataset that can be used both as an evaluation resource and as a training data set is a major requirement. In this paper, we introduce LeCoVe, a verb similarity dataset intended to facilitate the process of identifying verbs with similar meanings in a legal domain-specific context. Using the dataset, we evaluated both domain-specific and domain-generic embedding models, which were developed using state-of-the-art word representation and language modelling techniques. As a part of the experiments carried out using the announced dataset, Sense2Vec and BERT models were trained using a corpus of legal opinion texts in order to capture domain-specific behaviours. In addition to introducing LeCoVe, we demonstrate that a neural network model, which was developed by combining semantic, syntactic, and contextual features that can be obtained from the outputs of embedding models, can perform comparatively well, even in a low-resource scenario.
(This article belongs to the Topic Methods for Data Labelling for Intelligent Systems)

15 pages, 1220 KB  
Article
Event Forecasting for Thailand’s Car Sales during the COVID-19 Pandemic
by Chartchai Leenawong and Thanrada Chaikajonwat
Data 2022, 7(7), 86; https://doi.org/10.3390/data7070086 - 25 Jun 2022
Cited by 5 | Viewed by 3447
Abstract
The COVID-19 pandemic that started in 2020 has affected Thailand’s automotive industry, among many others. During the several stages of the pandemic period, car sales figures fluctuated, and hence are difficult to fit and forecast. Due to the trend present in the sales data, Holt’s forecasting method appears to be a reasonable choice. However, the pandemic, or in more general terms, the “event”, requires a subtle method to handle this extra event component. This research proposes a forecasting method based on Holt’s method to better suit time-series data affected by large-scale events. In addition, when combined with seasonality adjustment, three modified Holt’s-based methods are proposed and implemented on Thailand’s monthly car sales covering the pandemic period. Different flags are carefully assigned to each of the sales data points to represent different stages of the pandemic. The results show that Holt’s method with seasonality and events yields the lowest MAPE of 8.64%, followed by 9.47% for Holt’s method with events. Compared to the typical Holt’s MAPE of 16.27%, the proposed methods prove highly effective for time-series data containing the event component.
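For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of standard Holt's linear trend smoothing and the MAPE metric. It does not reproduce the authors' event or seasonality modifications; the function names, the initialisation (first value as level, first difference as trend), and the smoothing parameters `alpha` and `beta` are assumptions chosen for illustration.

```python
from typing import List, Sequence


def holt_one_step(series: Sequence[float], alpha: float, beta: float) -> List[float]:
    """One-step-ahead forecasts for series[1:] using Holt's linear trend method.

    Assumed initialisation: level = first observation, trend = first difference.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    forecasts = []
    for t in range(1, len(series)):
        # Forecast for time t is made before observing series[t].
        forecasts.append(level + trend)
        # Update the level and trend with the new observation.
        new_level = alpha * series[t] + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)
```

On a perfectly linear series the one-step forecasts match the observations, so the MAPE is zero; real sales data with event-driven shocks would not behave this way, which is the gap the abstract describes.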

6 pages, 902 KB  
Data Descriptor
Collection and Processing of Behavioural Data of the Olive Fruit Fly, Bactrocera oleae, When Exposed to Olive Twigs Treated with Different Commercial Products
by Elissa Daher, Elena Chierici, Nicola Cinosi, Gabriele Rondoni, Franco Famiani and Eric Conti
Data 2022, 7(7), 85; https://doi.org/10.3390/data7070085 - 24 Jun 2022
Viewed by 2231
Abstract
The need to develop sustainable control methods for herbivorous insects implies that new molecules are proposed on the market. Among the different effects the new products may have on the target species, the alteration of insect oviposition behaviour might be considered. For this purpose, simple behavioural assays can be conducted in parallel in arenas. Freely available software can be used to track observed events, but it often needs intensive customization to the specific experimental design. Hence, integrating such software with, e.g., the R environment can make protocol development for data collection and analysis much more effective. Here we present a dataset and protocol for processing data on the oviposition behaviour of the olive fruit fly, Bactrocera oleae, when exposed to olive twigs treated with different commercial products. Treatments were rock powder, propolis, a mixture of rock powder and propolis, copper oxychloride, copper sulphate, and water as the experimental control. JWatcher was used to simultaneously collect data from 12 arena assays, and R code developed ad hoc was used to process the raw data for analysis. The procedure described here is novel and represents a valuable and transferable protocol to analyse observational events in B. oleae, as well as other biological systems.

10 pages, 1041 KB  
Data Descriptor
Dataset: Fauna of Adult Ground Beetles (Coleoptera, Carabidae) of the National Park “Smolny” (Russia)
by Alexander B. Ruchin, Sergei K. Alekseev, Oleg N. Artaev, Anatoliy A. Khapugin, Evgeniy A. Lobachev, Sergei V. Lukiyanov and Gennadiy B. Semishin
Data 2022, 7(7), 84; https://doi.org/10.3390/data7070084 - 23 Jun 2022
Cited by 2 | Viewed by 2756
Abstract
(1) Background: Protected areas are “hotspots” of biodiversity in many countries. In such areas, ecological systems are preserved in their natural state, which allows them to protect animal populations. In several protected areas, the Coleoptera biodiversity is studied as an integral part of the ecological monitoring of the ecosystem state. This study aimed to describe the Carabidae fauna in one of the largest protected areas of European Russia, namely National Park “Smolny”. (2) Methods: The study was conducted in April–September 2008, 2009, 2017–2021. Beetles were collected in a variety of ways (by hand, in light traps, in pitfall traps, and others). Seasonal dynamics of the beetle abundance were studied in various biotopes. Coordinates were fixed for each observation. (3) Results: The dataset contains 1994 occurrences. In total, 32,464 specimens of Carabidae have been studied. The dataset contains information about 131 species of Carabidae beetles. In this study, we did not find two species (Carabus estreicheri and Calathus ambiguus) previously reported in the fauna of National Park “Smolny”. (4) Conclusions: The Carabidae diversity in the National Park “Smolny” is represented by 133 species from 10 subfamilies. Ten species (Carabus cancellatus, Harpalus laevipes, Carabus hortensis, Pterostichus niger, Poecilus versicolor, Pterostichus melanarius, Carabus glabratus, Carabus granulatus, Carabus arvensis baschkiricus, Pterostichus oblongopunctatus) constitute the majority of the Carabidae fauna. Ground beetle abundance peaks in spring and decreases across biotopes by autumn.

11 pages, 1212 KB  
Article
Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text
by Reem ALBayari and Sherief Abdallah
Data 2022, 7(7), 83; https://doi.org/10.3390/data7070083 - 22 Jun 2022
Cited by 19 | Viewed by 5628
Abstract
(1) Background: the ability to use social media to communicate without revealing one’s real identity has created an attractive setting for cyberbullying. Several studies targeted social media to collect their datasets with the aim of automatically detecting offensive language. However, the majority of the datasets were in English, not in Arabic. Even among the few Arabic datasets that were collected, none focused on Instagram, despite its being a major social media platform in the Arab world. (2) Methods: we use the official Instagram APIs to collect our dataset. To establish the dataset as a benchmark, we use SPSS (Kappa statistic) to evaluate the inter-annotator agreement (IAA), and we examine and evaluate the performance of various learning models (LR, SVM, RFC, and MNB). (3) Results: in this research, we present the first Instagram Arabic corpus focusing on cyberbullying, with multi-class (sub-class) categorization. The dataset is primarily designed for the purpose of detecting offensive language in texts. We end up with 200,000 comments, of which 46,898 comments were annotated by three human annotators. The results show that the SVM classifier outperforms the other classifiers, with an F1 score of 69% for bullying comments and 85% for positive comments.
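As an illustration of the kind of kappa statistic mentioned in this abstract, the sketch below computes Cohen's kappa for a single pair of annotators. This is an assumption made for illustration: the abstract does not say which kappa variant was computed in SPSS, and with three annotators a multi-rater statistic such as Fleiss' kappa may have been used instead. The function name and the example labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, two annotators who agree on three of four comments, with label counts of 2/2 and 1/3 respectively, have an observed agreement of 0.75 against a chance agreement of 0.5, giving a kappa of 0.5.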
(This article belongs to the Section Information Systems and Data Management)
