Data, Volume 7, Issue 11 (November 2022) – 26 articles

Cover Story: Safe, effective, and affordable COVID-19 treatments could be identified from circa 7817 drugs approved for other indications. Jain et al. (2022) have developed an interactive, open-source interface (CoviRx) that presents each drug's physical and chemical properties, original indication, available data from assays and clinical trials, any red flags, similar drugs, etc. It enables systematic down-selection of repurposed drug candidates that have passed user-driven combinations of up to 11 filters. This platform can be extended to other diseases, and the database can be kept up to date through data mining and verification by authenticated volunteers and superusers. The data can be downloaded in standardized formats or can be accessed via multiple API endpoints.
9 pages, 475 KiB  
Data Descriptor
Database of Metagenomes of Sediments from Estuarine Aquaculture Farms in Portugal—AquaRAM Project Collection
by Teresa Nogueira, Daniel G. Silva, Susana Lopes and Ana Botelho
Data 2022, 7(11), 167; https://doi.org/10.3390/data7110167 - 20 Nov 2022
Cited by 1 | Viewed by 1840
Abstract
Aquaculture farms and estuarine environments close to human activities play a critical role in the interaction between aquatic and terrestrial surroundings and animal and human health. The AquaRAM project aimed to study estuarine aquaculture farms in Portugal as a reservoir of antibiotic resistance genes and the potential for their spread via mobile genetic elements. We have assembled a collection of metagenomic data from 30 sediment samples from oyster, mussel, and gilt-head sea bream aquaculture farms. This collection includes samples of the estuarine environment of three rivers and one lagoon located from the north to the south of Portugal, namely, the Lima River in Viana do Castelo, Aveiro Lagoon in Aveiro, Tagus River in Alcochete, and Sado River in Setúbal. Statistical data from the raw metagenome files, as well as the file sizes of the assembled nucleotide and protein sequences, are also presented. The link to the statistics and the download page for all the metagenomes is also listed below.

24 pages, 3375 KiB  
Article
Forecasting Daily COVID-19 Case Counts Using Aggregate Mobility Statistics
by Bulut Boru and M. Emre Gursoy
Data 2022, 7(11), 166; https://doi.org/10.3390/data7110166 - 20 Nov 2022
Viewed by 1789
Abstract
The COVID-19 pandemic has impacted the whole world profoundly. For managing the pandemic, the ability to forecast daily COVID-19 case counts would bring considerable benefit to governments and policymakers. In this paper, we propose to leverage aggregate mobility statistics collected from Google’s Community Mobility Reports (CMRs) toward forecasting future COVID-19 case counts. We utilize features derived from the amount of daily activity in different location categories such as transit stations versus residential areas based on the time series in CMRs, as well as historical COVID-19 daily case and test counts, in forecasting future cases. Our method trains optimized regression models for different countries based on dynamic and data-driven selection of the feature set, regression type, and time period that best fit the country under consideration. The accuracy of our method is evaluated on 13 countries with diverse characteristics. Results show that our method’s forecasts are highly accurate when compared to the real COVID-19 case counts. Furthermore, visual analysis shows that the peaks, plateaus and general trends in case counts are also correctly predicted by our method.
(This article belongs to the Special Issue Health Informatics in the Age of COVID-19)

18 pages, 2683 KiB  
Article
Density-Based Unsupervised Learning Algorithm to Categorize College Students into Dropout Risk Levels
by Miguel Angel Valles-Coral, Luis Salazar-Ramírez, Richard Injante, Edwin Augusto Hernandez-Torres, Juan Juárez-Díaz, Jorge Raul Navarro-Cabrera, Lloy Pinedo and Pierre Vidaurre-Rojas
Data 2022, 7(11), 165; https://doi.org/10.3390/data7110165 - 18 Nov 2022
Cited by 5 | Viewed by 3255
Abstract
Compliance with the basic conditions of quality in higher education implies the design of strategies to reduce student dropout, and Information and Communication Technologies (ICT) in the educational field have allowed directing, reinforcing, and consolidating the process of professional academic training. We propose an academic and emotional tracking model that uses data mining and machine learning to group university students according to their level of dropout risk. We worked with 670 students from a Peruvian public university, applied 5 valid and reliable psychological assessment questionnaires to them using a chatbot-based system, and then classified them using 3 density-based unsupervised learning algorithms: DBSCAN, K-Means, and HDBSCAN. The results showed that HDBSCAN was the most robust option, obtaining better validity levels in two of the three internal indices evaluated, where the performance of the Silhouette index was 0.6823, the performance of the Davies–Bouldin index was 0.6563, and the performance of the Calinski–Harabasz index was 369.6459. The best number of clusters produced by the internal indices was five. For the validation of external indices, with answers from mental health professionals, we obtained a high level of precision in the F-measure: 90.9%, purity: 94.5%, V-measure: 86.9%, and ARI: 86.5%. This indicates the robustness of the proposed model, which allows us to categorize university students into five levels according to the risk of dropping out.
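Purity is one of the external indices reported in this abstract. As a rough illustration of what that number measures, the sketch below computes purity for a clustering against reference labels. The function name, the cluster assignments, and the expert labels are all hypothetical; this is not the authors' code or data.

```python
from collections import Counter


def purity(cluster_ids, reference_labels):
    """Share of items that carry the majority reference label of their cluster."""
    if len(cluster_ids) != len(reference_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must not be empty")

    # Count reference labels inside each cluster.
    by_cluster = {}
    for cid, label in zip(cluster_ids, reference_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1

    # Each cluster contributes the size of its largest label group.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical cluster assignments and expert-provided risk levels.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
experts = ["low", "low", "mid", "mid", "mid", "mid", "high", "low"]
score = purity(clusters, experts)
print(f"purity = {score:.2f}")
```

Purity rewards clusters that are dominated by a single reference label, but it grows trivially as the number of clusters increases, which is presumably why it is reported alongside other indices such as the V-measure and ARI.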

19 pages, 8434 KiB  
Data Descriptor
CoviRx: A User-Friendly Interface for Systematic Down-Selection of Repurposed Drug Candidates for COVID-19
by Hardik A. Jain, Vinti Agarwal, Chaarvi Bansal, Anupama Kumar, Faheem, Muzaffar-Ur-Rehman Mohammed, Sankaranarayanan Murugesan, Moana M. Simpson, Avinash V. Karpe, Rohitash Chandra, Christopher A. MacRaild, Ian K. Styles, Amanda L. Peterson, Matthew A. Cooper, Carl M. J. Kirkpatrick, Rohan M. Shah, Enzo A. Palombo, Natalie L. Trevaskis, Darren J. Creek and Seshadri S. Vasan
Data 2022, 7(11), 164; https://doi.org/10.3390/data7110164 - 18 Nov 2022
Cited by 3 | Viewed by 3145
Abstract
Although various vaccines are now commercially available, they have not been able to stop the spread of COVID-19 infection completely. An excellent strategy to get safe, effective, and affordable COVID-19 treatments quickly is to repurpose drugs that are already approved for other diseases. The process of developing an accurate and standardized drug repurposing dataset requires considerable resources and expertise due to the numerous commercially available drugs that could potentially be used to address the SARS-CoV-2 infection. To address this bottleneck, we created the CoviRx.org platform. CoviRx is a user-friendly interface that allows analysis and filtering of large quantities of data, which is onerous to curate manually for COVID-19 drug repurposing. Through CoviRx, the curated data have been made open source to help combat the ongoing pandemic and encourage users to submit their findings on the drugs they have evaluated, in a uniform format that can be validated and checked for integrity by authenticated volunteers. This article discusses the various features of CoviRx, its design principles, and how its functionality is independent of the data it displays. Thus, in the future, this platform can be extended to include any other disease beyond COVID-19.

9 pages, 2326 KiB  
Brief Report
An Analysis by State on The Effect of Movement Control Order (MCO) 3.0 Due to COVID-19 on Malaysians’ Mental Health: Evidence from Google Trends
by Nicholas Tze Ping Pang, Assis Kamu, Chong Mun Ho, Walton Wider and Mathias Wen Leh Tseu
Data 2022, 7(11), 163; https://doi.org/10.3390/data7110163 - 17 Nov 2022
Cited by 1 | Viewed by 1771
Abstract
Due to significant social and economic upheavals brought on by the COVID-19 pandemic, there is a great deal of psychological pain. Google Trends data have been seen as a corollary measure to assess population-wide trends via observing trends in search results. Judicious analysis of Google Trends data can have both analytical and predictive capacities. This study aimed to compare nation-wide and inter-state trends in mental health before and after the Malaysian Movement Control Order 3.0 (MCO 3.0) commencing 12 May 2021. This was through assessment of two terms, “stress” and “sleep”, in both the Malay and English languages. Google Trends daily data between 6 March and 31 May in both 2019 and 2021 were obtained, and both series were re-scaled to be comparable. Searches before and after MCO 3.0 in 2021 were compared to searches before and after the same date in 2019. This was carried out using the difference-in-differences (DiD) method. This ensured that seasonal variations between states were not the source of our findings. We found that the DiD estimates, β_3, for “sleep” and “stress” were not significantly different from zero, implying that MCO 3.0 had no effect on psychological distress in all states. Johor was the only state where the DiD estimates, β_3, were significantly different from zero for the search topic ‘Tidur’. For the topic ‘Tekanan’, there were two states with significant DiD estimates, β_3, namely Penang and Sarawak. This study hence demonstrates that there are particular state-level differences in Google Trends search terms, which indicates the states in which to prioritise interventions and increase surveillance for mental health. In conclusion, Google Trends is a powerful tool to examine larger population-based trends, especially in monitoring public health parameters such as population-level psychological distress, which can facilitate interventions.
(This article belongs to the Special Issue Health Informatics in the Age of COVID-19)
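The difference-in-differences comparison described in this abstract can be sketched in a few lines. In the simple two-period, two-group case, the interaction coefficient β_3 equals the change in the 2021 series around the MCO 3.0 date minus the change in the 2019 series around the same calendar date. The function name and the search-interest values below are hypothetical, and the sketch omits the significance testing that the study reports.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences estimate from group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical re-scaled daily search interest for one term in one state.
interest_2021_before = [40, 42, 44]  # before 12 May 2021
interest_2021_after = [50, 52, 54]   # after 12 May 2021
interest_2019_before = [38, 40, 42]  # before 12 May 2019 (comparison year)
interest_2019_after = [41, 43, 45]   # after 12 May 2019

beta_3 = did_estimate(
    interest_2021_before,
    interest_2021_after,
    interest_2019_before,
    interest_2019_after,
)
print(f"DiD estimate: {beta_3}")
```

Subtracting the 2019 change removes any seasonal shift that occurs around mid-May in both years, so only the additional change in 2021 is attributed to the order.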

10 pages, 2740 KiB  
Data Descriptor
Methodology for the Surveillance the Voltage Supply in Public Buildings Using the ITIC Curve and Python Programming
by Javier Fernández-Morales, Juan-José González-de-la-Rosa, José-María Sierra-Fernández, Olivia Florencias-Oliveros, Paula Remigio-Carmona, Manuel-Jesús Espinosa-Gavira, Agustín Agüera-Pérez and José-Carlos Palomares-Salas
Data 2022, 7(11), 162; https://doi.org/10.3390/data7110162 - 17 Nov 2022
Viewed by 1597
Abstract
This paper proposes an easy-to-implement method for detecting and assessing two of the most frequent PQ (Power Quality) problems: voltage sags and swells. These can affect sensitive equipment such as computers, programmable logic controllers, contactors, etc. Therefore, it is of great interest to implement the method in any laboratory, not only for protection reasons but also as a safeguard for claims against the supply company. In the current context, in which it is possible to manage large volumes of data, connect multiple devices with IoT (Internet of Things), etc., it is feasible and of great interest to monitor the voltage at specific points of the network. This makes it possible to detect voltage sags and swells and diagnose which points are more prone to this type of problem. For the detection of sags and swells, a program written in Python crawls all the files in the database and flags those RMS values that fall outside the established limits. Compared to LabVIEW, which might have been the most logical alternative since the acquisition hardware is from the same company (National Instruments), Python has a higher computational performance and is also free of charge. Thanks to the libraries available in Python, it allows a hardware control close to what is possible using LabVIEW. Implemented in MATLAB, the ITIC (Information Technology Industry Council) power acceptability curve reflects the impact of these power quality disturbances in electrical power systems. The results showed that the combined action of Python and MATLAB performed well on a conventional desktop computer.
(This article belongs to the Special Issue Data-Driven Approach on Urban Planning and Smart Cities)
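The detection step described in this abstract, flagging RMS values that fall outside established limits, can be illustrated with a minimal sketch. The nominal voltage of 230 V and the limits of 90% and 110% of nominal are assumptions chosen for illustration, as are the function name and the sample readings; the paper's actual limits and file handling are not given here. The sketch also ignores event duration, which the ITIC curve takes into account.

```python
# Assumed values for illustration only; not taken from the paper.
NOMINAL_V = 230.0
SAG_LIMIT = 0.90 * NOMINAL_V    # below this, a reading is flagged as a sag
SWELL_LIMIT = 1.10 * NOMINAL_V  # above this, a reading is flagged as a swell


def classify_rms(rms_values, sag_limit=SAG_LIMIT, swell_limit=SWELL_LIMIT):
    """Return (index, value, kind) for each RMS reading outside the limits."""
    events = []
    for index, value in enumerate(rms_values):
        if value < sag_limit:
            events.append((index, value, "sag"))
        elif value > swell_limit:
            events.append((index, value, "swell"))
    return events


# Hypothetical RMS voltage readings in volts.
readings = [229.8, 231.2, 198.5, 230.4, 256.0, 230.1]
events = classify_rms(readings)
for index, value, kind in events:
    print(f"sample {index}: {value} V ({kind})")
```

A fuller version would group consecutive out-of-limit samples into events with a duration, since the acceptability of a disturbance on the ITIC curve depends on both its magnitude and how long it lasts.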

8 pages, 1556 KiB  
Data Descriptor
Dataset: Coleoptera (Insecta) Collected from Beer Traps in “Smolny” National Park (Russia)
by Alexander B. Ruchin, Leonid V. Egorov, Oleg N. Artaev and Mikhail N. Esin
Data 2022, 7(11), 161; https://doi.org/10.3390/data7110161 - 15 Nov 2022
Cited by 2 | Viewed by 1589
Abstract
Monitoring Coleoptera diversity in protected areas is part of the global ecological monitoring of the state of ecosystems. The purpose of this research is to describe the biodiversity of Coleoptera studied with the help of baits based on fermented substrate in the European part of Russia (Smolny National Park). The research was conducted April–August 2018–2022. Samples were collected in traps of our own design. Beer or wine with the addition of sugar, honey, or jam was used for bait. A total of 194 traps were installed. The dataset contains 1254 occurrences. A total of 9226 Coleoptera specimens have been studied. The dataset contains information about 134 species from 24 Coleoptera families. The largest numbers of species found in traps belong to the families Cerambycidae (30 species), Nitidulidae (14 species), Elateridae (12 species), and Curculionidae and Coccinellidae (10 species each). The number of individuals in the traps of these families was distributed as follows: Cerambycidae, 1018 specimens; Nitidulidae, 5359; Staphylinidae, 241; Elateridae, 33; Curculionidae, 148; and Coccinellidae, 19. The 10 dominant species accounted for 90.7% of all detected specimens in the traps. The maximum species diversity and abundance of Coleoptera was obtained in 2021. With the installation of the largest number of traps in 2022 and more diverse biotopes (64 traps), a smaller number of species was caught compared to 2021. New populations of the following rare Coleoptera species have been found: Calosoma sycophanta, Elater ferrugineus, Osmoderma barnabita, Protaetia speciosissima, and Protaetia fieberi.

12 pages, 1318 KiB  
Article
Explainable Machine Learning for Financial Distress Prediction: Evidence from Vietnam
by Kim Long Tran, Hoang Anh Le, Thanh Hien Nguyen and Duc Trung Nguyen
Data 2022, 7(11), 160; https://doi.org/10.3390/data7110160 - 14 Nov 2022
Cited by 11 | Viewed by 4375
Abstract
The past decade has witnessed the rapid development of machine learning applied in economics and finance. Recent evidence suggests that machine learning models have produced superior results to traditional statistical models and have become the driving force for dramatic improvement in the financial industry. However, a much-debated question is whether the prediction results from black box machine learning models can be interpreted. In this study, we compared the predictive power of machine learning algorithms and applied SHAP values to interpret the prediction results on the dataset of listed companies in Vietnam from 2010 to 2021. The results showed that the extreme gradient boosting and random forest models outperformed other models. In addition, based on Shapley values, we also found that long-term debts to equity, enterprise value to revenues, account payable to equity, and diluted EPS had greatly influenced the outputs. In terms of practical contributions, the study gives credit rating companies a new method for predicting the possibility of default of bond issuers in the market. The study also provides an early warning tool for policymakers about the risks of public companies in order to develop measures to protect retail investors against the risk of bond default.
(This article belongs to the Special Issue Second Edition of Data Analysis for Financial Markets)
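The Shapley values that underlie SHAP can be illustrated exactly on a toy problem. The sketch below averages each feature's marginal contribution over every ordering of the features, which is the definition of the Shapley value. The scoring function, its weights, and the choice of two feature names borrowed from the abstract are hypothetical; the SHAP library uses faster model-specific algorithms and approximations rather than this brute-force enumeration, and this is not the authors' model.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    contributions = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for feature in order:
            before = value_fn(frozenset(present))
            present.add(feature)
            after = value_fn(frozenset(present))
            contributions[feature] += after - before
    return {f: total / len(orderings) for f, total in contributions.items()}


def toy_score(present):
    """Hypothetical risk score: two additive effects plus one interaction."""
    score = 0.0
    if "debt_to_equity" in present:
        score += 3.0
    if "diluted_eps" in present:
        score += 1.0
    if "debt_to_equity" in present and "diluted_eps" in present:
        score += 2.0
    return score


phi = shapley_values(toy_score, ["debt_to_equity", "diluted_eps"])
print(phi)
```

The attributions sum to the difference between the score with all features present and the score with none, and the interaction term is split equally between the two features involved. Enumerating all orderings grows factorially with the number of features, so this approach is only practical for very small examples.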

19 pages, 543 KiB  
Article
Stance Classification of Social Media Texts for Under-Resourced Scenarios in Social Sciences
by Victoria Yantseva and Kostiantyn Kucher
Data 2022, 7(11), 159; https://doi.org/10.3390/data7110159 - 13 Nov 2022
Cited by 1 | Viewed by 2107
Abstract
In this work, we explore the performance of supervised stance classification methods for social media texts in under-resourced languages and using limited amounts of labeled data. In particular, we focus on the possibilities and limitations of the application of classic machine learning versus deep learning in social sciences. To achieve this goal, we use a training dataset of 5.7K messages posted on Flashback Forum, a Swedish discussion platform, further supplemented with the previously published ABSAbank-Imm annotated dataset, and evaluate the performance of various model parameters and configurations to achieve the best training results given the character of the data. Our experiments indicate that classic machine learning models achieve results that are on par with or even outperform those of neural networks and, thus, could be given priority when considering machine learning approaches for similar knowledge domains, tasks, and data. At the same time, the modern pre-trained language models provide useful and convenient pipelines for obtaining vectorized data representations that can be combined with classic machine learning algorithms. We discuss the implications of their use in such scenarios and outline the directions for further research.

8 pages, 1890 KiB  
Data Descriptor
Measuring and Validating the Factors Influenced the SME Business Growth in Germany—Descriptive Analysis and Construct Validation
by Hosam Azat Elsaman, Nourhan El-Bayaa and Suriyakumaran Kousihan
Data 2022, 7(11), 158; https://doi.org/10.3390/data7110158 - 10 Nov 2022
Cited by 4 | Viewed by 1832
Abstract
In Germany, the medical device industry constitutes a cornerstone of the health sector. In this study, we investigated the challenges and factors affecting the present-day performance of German SMEs concerned with medical devices. The research methodology applied a cross-sectional and correlational research design, with simple random-sampling techniques, to data obtained from 110 mid-level and senior managers in German SMEs by means of an online structured survey in August 2022. We statistically validated our study data using exploratory factor analysis (EFA), Kaiser–Meyer–Olkin (KMO) testing, and Bartlett’s test, to assess the relationship between study variables and measure data adequacy using the R4.1.1(21) software, then carried out principal component analysis (PCA) with varimax factor loading and extracted six factors for use as research variables. The researchers also applied descriptive data analysis techniques using SPSS.21. The main study variables were: (1) the business performance of small and medium businesses (SMP); (2) their financial situation (SMEF); and (3) their implementation of new medical device industry regulations (MDR). By such statistical means, results confirmed poorer business performance and lower anticipated growth amongst SMEs affected by MDR, over and above the impacts of the present-day economic situation. The data can be used by management information systems (MIS) and decision system support professionals for planning and developing practical models about how to cope with current industry challenges. We recommend further research involving inferential analysis and triangulation of these data in the form of a semi-structured qualitative study in the larger scope of the population and different sectors.

11 pages, 4155 KiB  
Data Descriptor
High-Resolution UAV RGB Imagery Dataset for Precision Agriculture and 3D Photogrammetric Reconstruction Captured over a Pistachio Orchard (Pistacia vera L.) in Spain
by Sergio Vélez, Rubén Vacas, Hugo Martín, David Ruano-Rosa and Sara Álvarez
Data 2022, 7(11), 157; https://doi.org/10.3390/data7110157 - 10 Nov 2022
Cited by 8 | Viewed by 3553
Abstract
A total of 248 UAV RGB images were taken in the summer of 2021 over a representative pistachio orchard in Spain (X: 341450.3, Y: 4589731.8; ETRS89/UTM zone 30N). It is a 2.03 ha plot, planted in 2016 with Pistacia vera L. cv. Kerman grafted on UCB rootstock, with a NE–SW orientation and a 7 × 6 m triangular planting pattern. The ground was kept free of any weeds that could affect image processing. The photos (provided in JPG format) were taken using a UAV DJI Phantom Advance quadcopter in two flight missions: one planned to take nadir images (β = 0°), and another to take oblique images (β = 30°), both at 55 metres above the ground. The aerial platform incorporates a DJI FC6310 RGB camera with a 20 megapixel sensor, a horizontal field of view of 84° and a mechanical shutter. In addition, GCPs (ground control points) were collected. Finally, a high-quality 3D photogrammetric reconstruction process was carried out to generate a 3D point cloud (provided in LAS, LAZ, OBJ and PLY formats), a DEM (digital elevation model) and an orthomosaic (both in TIF format). The interest in using remote sensing in precision agriculture is growing, but the availability of reliable, ready-to-work, downloadable datasets is limited. Therefore, this dataset could be useful for precision agriculture researchers interested in photogrammetric reconstruction who want to evaluate models for orthomosaic and 3D point cloud generation from UAV missions with changing flight parameters, such as camera angle.

20 pages, 14661 KiB  
Data Descriptor
Hybrid Wi-Fi and BLE Fingerprinting Dataset for Multi-Floor Indoor Environments with Different Layouts
by Aina Nadhirah Nor Hisham, Yin Hoe Ng, Chee Keong Tan and David Chieng
Data 2022, 7(11), 156; https://doi.org/10.3390/data7110156 - 09 Nov 2022
Cited by 9 | Viewed by 2420
Abstract
Indoor positioning has garnered significant interest over the last decade due to the rapidly growing demand for location-based services. As a result, a multitude of techniques has been proposed to localize objects and devices in indoor environments. Wireless fingerprinting, which leverages machine learning, has emerged as one of the most popular positioning approaches due to its low implementation cost. The prevailing fingerprinting-based positioning mainly utilizes wireless fidelity (Wi-Fi) and Bluetooth low energy (BLE) signals. However, the received signal strength (RSS) of Wi-Fi and BLE signals is very sensitive to the layout of the indoor environment. Thus, any change in the indoor layout could potentially lead to severe degradation in terms of localization performance. To foster the development of new positioning methods, several open-source location fingerprinting datasets have been made available to the research community. Unfortunately, none of these public datasets provides RSS measurements for indoor environments with different layouts. To fill this gap, this paper presents a new hybrid Wi-Fi and BLE fingerprinting dataset for multi-floor indoor environments with different layouts to facilitate the future development of new fingerprinting-based positioning systems that can provide adaptive positioning performance in dynamic indoor environments. Additionally, the effects of indoor layout change on the location fingerprint and localization performance are also investigated.

12 pages, 1754 KiB  
Data Descriptor
Reference-Guided Draft Genome Assembly, Annotation and SSR Mining Data of the Peruvian Creole Cattle (Bos taurus)
by Richard Estrada, Flor-Anita Corredor, Deyanira Figueroa, Wilian Salazar, Carlos Quilcate, Héctor V. Vásquez, Jorge L. Maicelo, Jhony Gonzales and Carlos I. Arbizu
Data 2022, 7(11), 155; https://doi.org/10.3390/data7110155 - 09 Nov 2022
Cited by 1 | Viewed by 3172
Abstract
The Peruvian creole cattle (PCC) is a neglected breed and an essential livestock resource in the Andean region of Peru. To develop a modern breeding program and conservation strategies for the PCC, a better understanding of the genetics of this breed is needed. We sequenced the whole genome of the PCC using a de novo assembly approach with a paired-end 150 strategy on the Illumina HiSeq 2500 platform, obtaining 320 GB of sequencing data. A reference scaffolding was used to improve the draft genome. The obtained genome size of the PCC was 2.81 Gb with a contig N50 of 108 Mb and 92.59% complete BUSCOs. This genome size is similar to the genome references of Bos taurus and B. indicus. In addition, we identified 40.22% of repetitive DNA of the genome assembly, of which retroelements occupy 32.39% of the total genome. A total of 19,803 protein-coding genes were annotated in the PCC genome. For SSR data mining, we detected similar statistics in comparison with other breeds. The PCC genome will contribute to a better understanding of the genetics of this species and its adaptation to tough conditions in the Andean ecosystem.
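The contig N50 quoted in this abstract is a standard assembly statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of the calculation follows; the function name and the contig lengths are hypothetical and unrelated to the PCC assembly.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (arbitrary units).
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))
```

Because N50 depends only on the length distribution, it describes contiguity rather than correctness, which is one reason assemblies are usually reported with a completeness measure such as BUSCO as well.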

12 pages, 4291 KiB  
Data Descriptor
Dataset on Force Myography for Human–Robot Interactions
by Umme Zakia and Carlo Menon
Data 2022, 7(11), 154; https://doi.org/10.3390/data7110154 - 08 Nov 2022
Cited by 4 | Viewed by 1908
Abstract
Force myography (FMG) is a contemporary, non-invasive, wearable technology that can read the underlying muscle volumetric changes during muscle contractions and expansions. The FMG technique can be used in recognizing human-applied hand forces during physical human–robot interactions (pHRI) via data-driven models. Several FMG-based pHRI studies were conducted in 1D, 2D and 3D during dynamic interactions between a human participant and a robot to realize human-applied forces in intended directions during certain tasks. Raw FMG signals were collected via 16-channel (forearm) and 32-channel (forearm and upper arm) FMG bands while interacting with a biaxial stage (linear robot) and a serial manipulator (Kuka robot). In this paper, we present the datasets and their structures, the pHRI environments, and the collaborative tasks performed during the studies. We believe these datasets can be useful in future studies on FMG biosignal-based pHRI control design.

7 pages, 207 KiB  
Data Descriptor
Ground Truth Dataset: Objectionable Web Content
by Hamza H. M. Altarturi and Nor Badrul Anuar
Data 2022, 7(11), 153; https://doi.org/10.3390/data7110153 - 07 Nov 2022
Cited by 2 | Viewed by 1540
Abstract
Cyber parental control aims to filter objectionable web content and prevent children from being exposed to harmful content. Succeeding in detecting and blocking objectionable content depends heavily on the accuracy of the topic model. A reliable ground truth dataset is essential for building effective cyber parental control models and validation of new detection methods. The ground truth is the measurement for labeling objectionable and unobjectionable websites of the cyber parental control dataset. The lack of publicly accessible datasets with a reliable ground truth has prevented a fair and coherent comparison of different methods proposed in the field of cyber parental control. This paper presents a ground truth dataset that contains 8000 labelled websites with 4000 objectionable websites and 4000 unobjectionable websites. These websites consist of more than 2 million web pages. Creating a ground truth objectionable web content dataset involved a few phases, including data collection, extraction, and labeling. Finally, the presence of bias, using kappa coefficient measurement, is addressed. The ground truth dataset is available publicly in the Mendeley repository.
(This article belongs to the Section Information Systems and Data Management)
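The abstract above mentions a kappa coefficient for addressing labelling bias. As an illustration only, here is a minimal sketch of Cohen's kappa for two annotators. The labels are made up, and the listing does not say which kappa variant or how many raters the authors used, so the two-rater form is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[label] / n) * (counts_b[label] / n) for label in labels)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six websites ("obj" = objectionable, "ok" = unobjectionable).
a = ["obj", "obj", "ok", "ok", "obj", "ok"]
b = ["obj", "ok", "ok", "ok", "obj", "ok"]
kappa = cohens_kappa(a, b)
```

Values near 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance.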
17 pages, 5158 KiB  
Data Descriptor
Arabic Twitter Conversation Dataset about the COVID-19 Vaccine
by Huda Alhazmi
Data 2022, 7(11), 152; https://doi.org/10.3390/data7110152 - 04 Nov 2022
Cited by 2 | Viewed by 2397
Abstract
The development and rollout of COVID-19 vaccination around the world offers hope for controlling the pandemic. People turned to social media such as Twitter seeking information or to voice their opinion. Therefore, mining such conversation can provide a rich source of data for different applications related to the COVID-19 vaccine. In this data article, we developed an Arabic Twitter dataset of 1.1 M Arabic posts regarding the COVID-19 vaccine. The dataset was streamed over one year, covering the period from January to December 2021. We considered a set of crawling keywords in the Arabic language related to the conversation about the vaccine. The dataset consists of seven databases that can be analyzed separately or merged for further analysis. The initial analysis depicts the embedded features within the posts, including hashtags, media, and the dynamic of replies and retweets. Further, the textual analysis reveals the most frequent words that can capture the trends of the discussions. The dataset was designed to facilitate research across different fields, such as social network analysis, information retrieval, health informatics, and social science. Full article
11 pages, 1134 KiB  
Article
Isochromatic-Art: A Computational Dataset for Digital Photoelasticity Studies
by Juan-Carlos Briñez-De-Leon, Mateo Rico-Garcia and Alejandro Restrepo-Martínez
Data 2022, 7(11), 151; https://doi.org/10.3390/data7110151 - 01 Nov 2022
Cited by 1 | Viewed by 1837
Abstract
The importance of evaluating the stress field of loaded structures lies in the need for identifying the forces which make them fail, redesigning their geometry to increase the mechanical resistance, or characterizing unstressed regions to remove material. In this line of work, digital photoelasticity stands out for its ability to reveal stress information through isochromatic color fringes and to quantify it through inverse problem strategies. However, the absence of public data with a wide variety of spatial fringe distributions has limited the development of new proposals that generalize stress evaluation to a wider range of industrial applications. This dataset shares a varied collection of stress maps and their respective representations as color fringe patterns. The data were generated following a computational strategy that emulates a dark-field circular polariscope, assuming stress surfaces and patches derived from analytical stress models, 3D reconstructions, saliency maps, and superpositions of Gaussian surfaces. In total, two sets of 101,430 raw images were generated, one for stress maps and one for isochromatic color fringes. This dataset can be valuable for researchers interested in characterizing the mechanical response in loaded models, engineers in computer science interested in modeling inverse problems, and scientists who work on physical phenomena such as 3D reconstruction in visible light, bubble analysis, oil surfaces, and film thickness. Full article
9 pages, 2724 KiB  
Data Descriptor
Manual Conversion of Sadhukarn to Thai and Western Music Notations and Their Translation into a Rhyme Structure for Music Analysis
by Sumetus Eambangyung, Gretel Schwörer-Kohl and Witoon Purahong
Data 2022, 7(11), 150; https://doi.org/10.3390/data7110150 - 31 Oct 2022
Cited by 1 | Viewed by 1922
Abstract
Sadhukarn plays an important role as the most sacred music composition in Thai, Cambodian, and Lao music cultural areas. Due to various versions of unverified Sadhukarn main melodies in three different countries, notating melodies in suitable formats with a systematic method is necessary. This work provides a data descriptor for music transcription related to 25 different versions of the Sadhukarn main melody collected in Thailand, Cambodia, and Laos. Furthermore, we introduce a new procedure of music analysis based on rhyme structure. The aims of the study are to (1) provide Thai/Western musical note comprehension in the forms of Western staff and Thai notation, and (2) describe the procedures for translating from musical note to rhyme structure. To generate a rhyme structure, we apply a Thai poetic and linguistic approach as the method establishment. Rhyme structure is composed of melodic structures, the pillar tones Look-Tok, and melodic rhyming outline. Full article
13 pages, 659 KiB  
Article
Cryptocurrency Price Prediction with Convolutional Neural Network and Stacked Gated Recurrent Unit
by Chuen Yik Kang, Chin Poo Lee and Kian Ming Lim
Data 2022, 7(11), 149; https://doi.org/10.3390/data7110149 - 31 Oct 2022
Cited by 15 | Viewed by 11347
Abstract
Virtual currencies have been declared as one of the financial assets that are widely recognized as exchange currencies. The cryptocurrency trades caught the attention of investors as cryptocurrencies can be considered as highly profitable investments. To optimize the profit of the cryptocurrency investments, accurate price prediction is essential. In view of the fact that the price prediction is a time series task, a hybrid deep learning model is proposed to predict the future price of the cryptocurrency. The hybrid model integrates a 1-dimensional convolutional neural network and stacked gated recurrent unit (1DCNN-GRU). Given the cryptocurrency price data over the time, the 1-dimensional convolutional neural network encodes the data into a high-level discriminative representation. Subsequently, the stacked gated recurrent unit captures the long-range dependencies of the representation. The proposed hybrid model was evaluated on three different cryptocurrency datasets, namely Bitcoin, Ethereum, and Ripple. Experimental results demonstrated that the proposed 1DCNN-GRU model outperformed the existing methods with the lowest RMSE values of 43.933 on the Bitcoin dataset, 3.511 on the Ethereum dataset, and 0.00128 on the Ripple dataset. Full article
(This article belongs to the Special Issue Data Analysis for Financial Markets)
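The results above are reported as root-mean-square error (RMSE). For readers comparing the figures, a minimal sketch of the metric follows; the price values are hypothetical and are not taken from the paper's datasets.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical closing prices and model forecasts.
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 100.0, 101.0, 108.0]
error = rmse(actual, predicted)
```

RMSE is expressed in the units of the predicted quantity, which is likely why the reported values differ by orders of magnitude across Bitcoin, Ethereum, and Ripple: the three assets trade at very different price levels.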
7 pages, 195 KiB  
Data Descriptor
An Open Dataset of Connected Speech in Aphasia with Consensus Ratings of Auditory-Perceptual Features
by Zoe Ezzes, Sarah M. Schneck, Marianne Casilio, Davida Fromm, Antje S. Mefferd, Michael de Riesthal and Stephen M. Wilson
Data 2022, 7(11), 148; https://doi.org/10.3390/data7110148 - 30 Oct 2022
Viewed by 2590
Abstract
Auditory-perceptual rating of connected speech in aphasia (APROCSA) is a system in which trained listeners rate a variety of perceptual features of connected speech samples, representing the disruptions and abnormalities that commonly occur in aphasia. APROCSA has shown promise as an approach for quantifying expressive speech and language function in individuals with aphasia. The aim of this study was to acquire and share a set of audiovisual recordings of connected speech samples from a diverse group of individuals with aphasia, along with consensus ratings of APROCSA features, for future use as training materials to teach others how to use the APROCSA system. Connected speech samples were obtained from six individuals with chronic post-stroke aphasia. The first five minutes of participant speech were excerpted from each sample, and five researchers independently evaluated each sample using APROCSA, rating its 27 features on a five-point scale. The researchers then discussed each feature in turn to obtain consensus ratings. The dataset will provide a useful, freely accessible resource for researchers, clinicians, and students to learn how to evaluate aphasic speech with an auditory-perceptual approach. Full article
41 pages, 2202 KiB  
Article
Thematic Analysis of Indonesian Physics Education Research Literature Using Machine Learning
by Purwoko Haryadi Santoso, Edi Istiyono, Haryanto and Wahyu Hidayatulloh
Data 2022, 7(11), 147; https://doi.org/10.3390/data7110147 - 28 Oct 2022
Cited by 3 | Viewed by 3280
Abstract
Abundant physics education research (PER) literature has been disseminated through academic publications. Over the years, the growing body of literature challenges Indonesian PER scholars to understand how the research community has progressed and possible future work that should be encouraged. Nevertheless, the previous traditional method of thematic analysis possesses limitations when the amount of PER literature exponentially increases. In order to deal with this plethora of publications, one of the machine learning (ML) algorithms from natural language processing (NLP) studies was employed in this paper to automate a thematic analysis of Indonesian PER literature that still needs to be explored within the community. One of the well-known NLP algorithms, latent Dirichlet allocation (LDA), was used in this study to extract Indonesian PER topics and their evolution between 2014 and 2021. A total of 852 papers (~4 to 8 pages each) were collectively downloaded from five international conference proceedings organized, peer reviewed, and published by Indonesian PER researchers. Before their topics were modeled through the LDA algorithm, our data corpus was preprocessed through several common procedures of established NLP studies. The findings revealed that LDA had thematically quantified Indonesian PER topics and described their distinct development over a certain period. The identified topics from this study recommended that the Indonesian PER community establish robust development in eight distinct topics to the present. Here, we commenced with an initial interest focusing on research on physics laboratories and followed the research-based instruction in late 2015. For the past few years, the Indonesian PER scholars have mostly studied 21st century skills which have given way to a focus on developing relevant educational technologies and promoting the interdisciplinary aspects of physics education. 
We suggest that there is room for Indonesian PER scholars to address the qualitative aspects of physics teaching and learning, which are still scant in the literature. Full article
17 pages, 6419 KiB  
Data Descriptor
Predicting Student Dropout and Academic Success
by Valentim Realinho, Jorge Machado, Luís Baptista and Mónica V. Martins
Data 2022, 7(11), 146; https://doi.org/10.3390/data7110146 - 28 Oct 2022
Cited by 9 | Viewed by 22831
Abstract
Higher education institutions record a significant amount of data about their students, representing a considerable potential to generate information, knowledge, and monitoring. Both school dropout and educational failure in higher education are an obstacle to economic growth, employment, competitiveness, and productivity, directly impacting the lives of students and their families, higher education institutions, and society as a whole. The dataset described here results from the aggregation of information from different disjointed data sources and includes demographic, socioeconomic, macroeconomic, and academic data on enrollment and academic performance at the end of the first and second semesters. The dataset is used to build machine learning models for predicting academic performance and dropout, which is part of a Learning Analytic tool developed at the Polytechnic Institute of Portalegre that provides information to the tutoring team with an estimate of the risk of dropout and failure. The dataset is useful for researchers who want to conduct comparative studies on student academic performance and also for training in the machine learning area. Full article
22 pages, 9306 KiB  
Article
In Vitro Major Arterial Cardiovascular Simulator to Generate Benchmark Data Sets for In Silico Model Validation
by Michelle Wisotzki, Alexander Mair, Paul Schlett, Bernhard Lindner, Max Oberhardt and Stefan Bernhard
Data 2022, 7(11), 145; https://doi.org/10.3390/data7110145 - 27 Oct 2022
Cited by 3 | Viewed by 1649
Abstract
Cardiovascular diseases are commonly caused by atherosclerosis, stenosis and aneurysms. Understanding the influence of these pathological conditions on the circulatory mechanism is required to establish methods for early diagnosis. Different tools have been developed to simulate healthy and pathological conditions of blood flow. These simulations are often based on computational models that allow the generation of large data sets for further investigation. However, because computational models often lack some aspects of real-world data, hardware simulators are used to close this gap and generate data for model validation. The aim of this study is to develop and validate a hardware simulator to generate benchmark data sets of healthy and pathological conditions. The development process was led by specific design criteria to allow flexible and physiological simulations. The in vitro hardware simulator includes the major 33 arteries and is driven by a ventricular assist device generating a parametrised in-flow condition at the heart node. Physiologic flow conditions, including heart rate, systolic/diastolic pressure, peripheral resistance and compliance, are adjustable across a wide range. The pressure and flow waves at 17 + 1 locations are measured by inverted fluid-resistant pressure transducers and one ultrasound flow transducer, supporting a detailed analysis of the measurement data even for in silico modelling applications. The pressure and flow waves are compared to in vivo measurements and show physiological conditions. The influence of the degree and location of the stenoses on blood pressure and flow was also investigated. The results indicate decreasing translesional pressure and flow with an increasing degree of stenosis, as expected. The benchmark data set is made available to the research community for validating and comparing different types of computational models. 
It is hoped that the validation and improvement of computational simulation models will provide better clinical predictions. Full article
24 pages, 9445 KiB  
Data Descriptor
Technical Data of In Silico Analysis of the Interaction of Dietary Flavonoid Compounds against Spike-Glycoprotein and Proteases of SARS-CoV-2
by Nurbella Sofiana Altu, Cahyo Budiman, Rafida Razali, Ruzaidi Azli Mohd Mokhtar and Khairul Azfar Kamaruzaman
Data 2022, 7(11), 144; https://doi.org/10.3390/data7110144 - 27 Oct 2022
Cited by 1 | Viewed by 1345
Abstract
The spike glycoprotein (S protein), 3-chymotrypsin-like protease (3CL-Pro), and papain-like protease (PL-Pro) of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus are widely targeted for the discovery of therapeutic compounds against this virus. Dietary flavonoid compounds were proposed as candidates for safe therapy for COVID-19 patients. Nevertheless, wet lab experiments for high-throughput screening of the compounds are undoubtedly time-consuming and costly. This study aims to screen dietary flavonoid compounds that bind to S protein, 3CL-Pro, and PL-Pro of SARS-CoV-2. For this purpose, protein structures of the receptor-binding domain (RBD) of S protein (6M0J), 3CL-Pro (6LU7), and PL-Pro (6W9C) were retrieved from the RCSB Protein Data Bank (PDB). Twelve dietary flavonoid compounds were selected for the studies on their binding affinity to the targeted proteins by global and local docking. The docking and molecular dynamics (MD) simulations were performed using YASARA software. Out of 12 compounds, the highest binding score was observed for hesperidin against RBD S protein (−9.98 kcal/mol), 3CL-Pro (−9.43 kcal/mol), and PL-Pro (−8.89 kcal/mol) in global docking. Interestingly, MD simulation revealed that the complex between 3CL-Pro and RBD S protein has better stability than PL-Pro. This study suggests that hesperidin might have versatile inhibitory properties against several essential proteins of SARS-CoV-2. This study, nevertheless, remains to be confirmed through in vitro and in vivo assays. Full article
12 pages, 1754 KiB  
Article
Assessing the Accuracy of Google Trends for Predicting Presidential Elections: The Case of Chile, 2006–2021
by Francisco Vergara-Perucich
Data 2022, 7(11), 143; https://doi.org/10.3390/data7110143 - 27 Oct 2022
Cited by 2 | Viewed by 1568
Abstract
This article presents the results of reviewing the predictive capacity of Google Trends for national elections in Chile. The electoral results of the elections between Michelle Bachelet and Sebastián Piñera in 2006, Sebastián Piñera and Eduardo Frei in 2010, Michelle Bachelet and Evelyn Matthei in 2013, Sebastián Piñera and Alejandro Guillier in 2017, and Gabriel Boric and José Antonio Kast in 2021 were reviewed. The time series analyzed were organized on the basis of relative searches between the candidacies, assisted by R software, mainly with the gtrendsR and forecast libraries. With the series constructed, forecasts were made using the Auto Regressive Integrated Moving Average (ARIMA) technique to check the weight of one presidential option over the other. The ARIMA analyses were performed on 3 ways of organizing the data: the linear series, the series transformed by moving average, and the series transformed by Hodrick–Prescott. The results indicate that the method offers the optimal predictive ability. Full article
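One of the three data arrangements named above is a moving-average transform applied before ARIMA forecasting. The abstract does not give the window length or say whether the average is trailing or centred, so the sketch below assumes a trailing window of 3 over hypothetical relative search-interest values. It is written in Python for illustration, whereas the study itself used R.

```python
def trailing_moving_average(series, window):
    """Smooth a series with a trailing moving average of the given window."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    # Each output point averages the current value and the (window - 1) before it.
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Hypothetical weekly relative search interest for one candidate.
interest = [40, 46, 52, 49, 55, 61]
smoothed = trailing_moving_average(interest, 3)
```

The smoothed series is shorter than the input by `window - 1` points, which matters when aligning it with election dates.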
6 pages, 1520 KiB  
Data Descriptor
Data of National Dishes in the Developed and Developing Countries in the World, Their Similarity and Trade Flows
by Anne C. Wunderlich and Andreas Kohler
Data 2022, 7(11), 142; https://doi.org/10.3390/data7110142 - 26 Oct 2022
Viewed by 1430
Abstract
This paper presents a database that includes information on national recipes and their ingredients for 171 countries, measures for food taste similarities between all 171 countries as well as bilateral migration and agro-food trade data for 5 years. The database can be used for analyzing e.g., the relation between food preferences and international trade or food preferences and health outcomes (e.g., obesity) across countries. Full article