Open Access Data Descriptor
The #BTW17 Twitter Dataset–Recorded Tweets of the Federal Election Campaigns of 2017 for the 19th German Bundestag
Data 2017, 2(4), 34; doi:10.3390/data2040034 (registering DOI)
Abstract
The German Bundestag elections are the most important elections in Germany. This dataset comprises Twitter interactions related to German politicians of the most important political parties over several months in the (pre-)phase of the German federal election campaigns in 2017. The Twitter accounts of more than 360 politicians were followed for four months. The collected data comprise a sample of approximately 10 GB of Twitter raw data, and they cover more than 120,000 active Twitter users and more than 1,200,000 recorded tweets. Even without sophisticated data analysis techniques, it was possible to deduce a likely political party proximity for more than half of these accounts simply by looking at the re-tweet behavior. This might be of interest for innovative data-driven party campaign strategists in the future. Furthermore, it is observable that, in Germany, supporters and politicians of populist parties make use of Twitter much more intensively and aggressively than supporters of other parties. In addition, established left-wing parties seem to be more active on Twitter than established conservative parties. The dataset can be used to study how political parties, their followers and supporters make use of social media channels in political election campaigns and what kind of content is shared.
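The re-tweet heuristic mentioned in this abstract could look roughly like the following minimal sketch. It is an illustration only, not the authors' procedure: the function name `party_proximity`, the `(user, retweeted_account)` input layout, and the `min_share` threshold are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def party_proximity(retweets, party_of, min_share=0.5):
    """Guess a likely party for each user from the accounts they re-tweet.

    retweets:  iterable of (user, retweeted_account) pairs (assumed layout).
    party_of:  dict mapping a tracked politician account to a party label.
    min_share: hypothetical threshold; a party is assigned only if it
               receives more than this share of the user's counted re-tweets.
    """
    counts = defaultdict(Counter)
    for user, account in retweets:
        party = party_of.get(account)
        if party is not None:  # ignore re-tweets of untracked accounts
            counts[user][party] += 1

    result = {}
    for user, per_party in counts.items():
        party, top = per_party.most_common(1)[0]
        total = sum(per_party.values())
        # No clear majority means no party is assigned (None).
        result[user] = party if top / total > min_share else None
    return result
```

Users who never re-tweet a tracked account do not appear in the result at all, which is one plausible reading of why only part of the accounts can be assigned a party.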
Open Access Article
Temporal Statistical Analysis of Degree Distributions in an Undirected Landline Phone Call Network Graph Series
Data 2017, 2(4), 33; doi:10.3390/data2040033
Abstract
This article aims to provide new results about the intraday degree sequence distribution, considering how a phone call network graph evolves over time. More specifically, it tackles the following problem: given a large number of landline phone call data records, what is the best way to summarize the distinct number of calling partners per client per day? To answer this question, a series of undirected phone call network graphs is constructed based on data from a local telecommunication source in Albania. All network graphs of the series are simplified. Further, a longitudinal temporal study of the degree distributions is made on this series of network graphs. Power law and log-normal distribution fittings on the degree sequence are compared on each of the network graphs of the series. The maximum likelihood method is used to estimate the parameters of the distributions, and a Kolmogorov–Smirnov test associated with a p-value is used to define the plausible models. A direct distribution comparison is made through a Vuong test in the case that both distributions are plausible. Another goal was to describe the shape of the parameters' distributions. A Shapiro–Wilk test is used to test the normality of the data, and measures of shape are used to define the distributions' shape. Study findings suggested that the log-normal distribution better models the intraday degree sequence data of the network graphs. It is not possible to say that the distributions of the log-normal parameters are normal.
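As a rough illustration of the fitting steps this abstract names (maximum likelihood estimation and a Kolmogorov–Smirnov distance), the sketch below uses the standard closed-form estimators for a log-normal and for a continuous power law. It is an assumption-laden sketch rather than the article's code: degree sequences are discrete, so the continuous power law is only an approximation, the cutoff `xmin` is taken as given, and the p-value and Vuong test are left out.

```python
import math


def fit_lognormal(xs):
    """MLE for a log-normal: mean and (population) std of ln(x)."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def fit_power_law(xs, xmin):
    """MLE exponent of a continuous power law p(x) ~ x^(-alpha), x >= xmin.

    Requires at least one value strictly above xmin, otherwise the
    log-sum in the denominator is zero.
    """
    tail = [x for x in xs if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


def ks_distance_power_law(xs, xmin, alpha):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    tail = sorted(x for x in xs if x >= xmin)
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # Compare against the empirical CDF just before and at the step.
        d = max(d, abs(model - i / n), abs(model - (i + 1) / n))
    return d
```

A smaller distance indicates a closer fit; turning that distance into a p-value typically requires a resampling step that is not shown here.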
Open Access Data Descriptor
Wi-Fi Crowdsourced Fingerprinting Dataset for Indoor Positioning
Data 2017, 2(4), 32; doi:10.3390/data2040032
Abstract
Benchmark open-source Wi-Fi fingerprinting datasets for indoor positioning studies are still hard to find in the current literature and existing public repositories. This is unlike other research fields, such as the image processing field, where benchmark test images such as the Lenna image or Face Recognition Technology (FERET) databases exist, or the machine learning field, where huge datasets are available, for example, at the University of California Irvine (UCI) Machine Learning Repository. The purpose of this paper is to present a new openly available Wi-Fi fingerprint dataset, comprising 4648 fingerprints collected with 21 devices in a university building in Tampere, Finland, and to present some benchmark indoor positioning results using these data. The datasets and the benchmarking software are distributed under the open-source MIT license and can be found on the EU Zenodo repository.
Open Access Article
An Improved Power Law for Nonlinear Least-Squares Fitting?
Data 2017, 2(3), 31; doi:10.3390/data2030031
Abstract
Models based on a power law are prevalent in many areas of study. When regression analysis is performed on data sets modeled by a power law, the traditional model uses a lead coefficient. However, the proposed model replaces the lead coefficient with a scaling parameter and reduces uncertainties in best-fit parameters for data sets with exponents close to 3. This study extends previous work by testing each model for a range of parameters. Data sets with known values of scaling parameter and exponent were generated by adding normally distributed random errors with controlled mean and standard deviations to underlying power laws. These data sets were then analyzed for both forms of the power law. For the scaling parameter, the proposed model provided smaller errors in 96/180 cases and smaller uncertainties in 88/180 cases. In most remaining cases, the traditional model provided smaller errors or uncertainties. Examination of conditions indicates that the proposed law has potential in select cases, but due to ambiguity in the conditions which favor one model over the other, an approach similar to the one in this study is encouraged for determining which model will offer reduced errors and uncertainties in data sets where additional accuracy is desired.
Open Access Article
Estimating Cost Savings from Early Cancer Diagnosis
Data 2017, 2(3), 30; doi:10.3390/data2030030
Abstract
We estimate treatment cost-savings from early cancer diagnosis. For breast, lung, prostate and colorectal cancers and melanoma, which account for more than 50% of new incidences projected in 2017, we combine published cancer treatment cost estimates by stage with incidence rates by stage at diagnosis. We extrapolate to other cancer sites by using estimated national expenditures and incidence rates. A rough estimate for the U.S. national annual treatment cost-savings from early cancer diagnosis is in the 11-digit range. Using this estimate and cost-neutrality, we also estimate a rough upper bound on the cost of a routine early cancer screening test.
Open Access Article
Adjustable Robust Singular Value Decomposition: Design, Analysis and Application to Finance
Data 2017, 2(3), 29; doi:10.3390/data2030029
Abstract
The Singular Value Decomposition (SVD) is a fundamental algorithm used to understand the structure of data by providing insight into the relationship between the row and column factors. SVD aims to approximate a rectangular data matrix under some rank restriction, in particular by a lower-rank approximation. In practical data analysis, however, outliers and missing values may exist that restrict the performance of SVD, because SVD is a least squares method that is sensitive to errors in the data matrix. This paper proposes a robust SVD algorithm by applying an adjustable robust estimator. By adjusting the tuning parameter in the algorithm, the method can be both robust and efficient. Moreover, a sequential robust SVD algorithm is proposed in order to decrease the computation volume for sequential and streaming data. The advantages of the proposed algorithms are demonstrated with a financial application.
Open Access Data Descriptor
Development of a Data Set of Pesticide Dissipation Rates in/on Various Plant Matrices for the Pesticide Properties Database (PPDB)
Data 2017, 2(3), 28; doi:10.3390/data2030028
Abstract
Data relating to the rate at which pesticide active substances dissipate on or within various plant matrices are important for a range of different risk assessments; however, despite the importance of these data, dissipation rates are not included in the most common online data resources. Databases have been collated in the past, but these tend not to be maintained or regularly updated. The purpose of the exercise described herein was to collate a new database in a format compatible with the main online pesticide database resource (the Pesticide Properties Database, PPDB), to validate this database in line with the Pesticide Properties Database protocols and thus ensure that the data are maintained and updated in future. Data were collated using a systematic review approach across several scientific databases. Collated literature was subjected to a quality assessment, and then data were extracted into an MS Excel spreadsheet. The outcome of the study is a database based on data collated from 1390 published articles covering over 400 pesticides and over 200 crops across a wide variety of different matrices (leaves, fruits, seeds, etc.) for pesticide residues on the crop surface, as well as residues absorbed within the plant material. These data are now fully incorporated into the PPDB.
Open Access Data Descriptor
A 2001–2015 Archive of Fractional Cover of Photosynthetic and Non-Photosynthetic Vegetation for Beijing and Tianjin Sandstorm Source Region
Data 2017, 2(3), 27; doi:10.3390/data2030027
Abstract
Fractional covers of photosynthetic and non-photosynthetic vegetation are key indicators for land degradation surveillance in the drylands of China. However, no well-validated, multispectral-based products are currently available. To address this gap, we selected the Beijing and Tianjin Sandstorm Source Region as the study area and used a linear spectral mixture model to generate the fractional cover of photosynthetic vegetation (PV), non-photosynthetic vegetation (NPV), and bare soil from MODIS NBAR data for 2001 to 2015, with endmember spectra retrieved from a field-measured endmember spectral library. The unmixing results were validated through comparison with field samples. The results show that the adopted method yields reasonable and accurate estimates of the fractional cover of photosynthetic vegetation (R² = 0.6297, RMSE = 0.2443) and non-photosynthetic vegetation (R² = 0.3747, RMSE = 0.2568). The dataset can provide key data support for users in the field of land degradation surveillance.
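A linear spectral mixture model treats each pixel's reflectance as a weighted sum of endmember spectra, with the weights being the fractional covers. The sketch below solves for those weights by ordinary least squares in plain Python. It is a simplified illustration under stated assumptions: the function names are hypothetical, no sum-to-one or non-negativity constraint is applied (the abstract does not say which constraints the authors used), and any spectra passed in are the caller's own.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def unmix(pixel, endmembers):
    """Least-squares fractions f with sum_k f[k] * endmembers[k] close to pixel.

    pixel:      list of band reflectances for one pixel.
    endmembers: list of spectra (one list of band reflectances per endmember).
    """
    k = len(endmembers)
    # Normal equations (E^T E) f = E^T p, one row/column per endmember.
    gram = [[sum(a * b for a, b in zip(endmembers[i], endmembers[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(a * p for a, p in zip(endmembers[i], pixel)) for i in range(k)]
    return solve(gram, rhs)
```

Calling `unmix(pixel, [pv, npv, soil])` returns three fractions in the same order as the endmember list; in practice the result is often clipped or re-normalized when constraints are needed.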
Open Access Data Descriptor
Chlamydospore Specific Proteins of Candida albicans
Data 2017, 2(3), 26; doi:10.3390/data2030026
Abstract
The polymorphic yeast Candida albicans forms thick-walled structures called chlamydospores in order to survive under adverse conditions. We present proteomic profile changes occurring during chlamydospore formation. Chlamydospores were induced by inoculating C. albicans cells (grown for 48 h) on rice extract and semisolid agar containing Tween 80 (1%), and were overlaid by a polyethene sheet to induce microaerophilic conditions at 30 °C. Proteins extracted from chlamydospores and hyphae (producing chlamydospores) were identified by LC-MS/MS analysis. The present datasets include proteomic data (SWATH spectral libraries) of chlamydospores and yeast-phase cells, as well as methodologies and tools used for the data generation. Further analysis is expected to provide an opportunity to understand modulations in metabolic processes, molecular architecture (i.e., cell wall, membrane, and cytoskeleton) and stress response pathways leading to chlamydospore formation and thus facilitating survival of C. albicans under adverse conditions.
Open Access Data Descriptor
Thermodynamic Data of Fusarium oxysporum Grown on Different Substrates in Gold Mine Wastewater
Data 2017, 2(3), 24; doi:10.3390/data2030024
Abstract
The necessity for sustainable process development has led to an upsurge in bio-based processes, thereby placing a higher demand on the use of suitable microorganisms. Similarly, thermodynamics is a veritable tool that can predict the behavior of any material under well-defined conditions. Thermodynamic data of Fusarium oxysporum used in the bioremediation of gold mine wastewater, for a process supported with different carbon sources, were investigated. The data were obtained using a Discovery DSC® (TA Instruments, Inc., New Castle, DE, USA) equipped with modulated Differential Scanning Calorimeter (MDSC™) software. The data revealed minimal differences in the physical properties of the F. oxysporum used, indicating that the utilisation of agro-waste for microbial proliferation in wastewater treatment is as feasible as when refined carbon sources are used. The data will be helpful for developing environmentally benign process strategies, especially for environmental engineering applications.
Open Access Data Descriptor
A Database of Weekly Sea Ice Parcel Tracks Derived from Lagrangian Motion Data with Ancillary Data Products
Data 2017, 2(3), 25; doi:10.3390/data2030025
Abstract
Arctic sea ice has been on the decline over the past several decades, and multi-year sea ice has decreased significantly in its areal share of the overall sea ice cover. Changes in several key variables such as radiative balances, albedo, ice surface temperature, and ice thickness have driven much of the decline, but the motion of sea ice makes studying the effects of these variables on individual parcels difficult. Previous studies have observed changes in the means of these variables and their impacts on sea ice concentration, but an accessible database of Lagrangian tracked data is not yet available for study. In order to address this, a database has been developed at the University of Colorado Boulder that performs Lagrangian tracking on individual sea ice parcels and saves coincident ancillary thermodynamic and dynamic variables for each parcel on a weekly timescale.
Open Access Data Descriptor
Overview of German Additive Manufacturing Companies
Data 2017, 2(3), 23; doi:10.3390/data2030023
Abstract
This data descriptor presents a curated list of companies involved in additive manufacturing in Germany. The companies included fall into various categories, such as 3D printing providers, hardware manufacturers, software developers and vendors. The list was compiled through literature and Internet-based research, drawing on a number of resources, such as the Bundesanzeiger (Federal Gazette), the Registergerichte (Register Courts), the companies' own websites and a B2B marketplace (Wer liefert Was?). The aim of compiling this list is to provide information to researchers on the current situation of 3D printing in Germany.
Open Access Data Descriptor
A High Resolution Dataset of Drought Indices for Spain
Data 2017, 2(3), 22; doi:10.3390/data2030022
Abstract
Drought indices are essential metrics for quantifying drought severity and identifying possible changes in the frequency and duration of drought hazards. In this study, we developed a new high spatial resolution dataset of drought indices covering all of Spain. The dataset includes seven drought indices, spans the period 1961–2014, and has a spatial resolution of 1.1 km and a weekly temporal resolution. A web portal has been created to enable download and visualization of the data. The data can be downloaded as single gridded points for each drought index, but the entire drought index dataset can also be downloaded in netCDF4 format. The dataset will be updated for complete years as the raw meteorological data become available.
Open Access Article
Using Semantic Web Technologies to Query and Manage Information within Federated Cyber-Infrastructures
Data 2017, 2(3), 21; doi:10.3390/data2030021
Abstract
A standardized descriptive ontology supports efficient querying and manipulation of data from heterogeneous sources across boundaries of distributed infrastructures, particularly in federated environments. In this article, we present the Open-Multinet (OMN) set of ontologies, which were designed specifically for this purpose as well as to support management of life-cycles of infrastructure resources. We present their initial application in Future Internet testbeds, their use for representing and requesting available resources, and our experimental performance evaluation of the ontologies in terms of querying and translation times. Our results highlight the value and applicability of Semantic Web technologies in managing resources of federated cyber-infrastructures.
Open Access Article
Open Source Fundamental Industry Classification
Data 2017, 2(2), 20; doi:10.3390/data2020020
Abstract
We provide complete source code for building a fundamental industry classification based on publicly available and freely downloadable data. We compare various fundamental industry classifications by running a horse race of short-horizon trading signals (alphas) utilizing open source heterotic risk models (https://ssrn.com/abstract=2600798) built using such industry classifications. Our source code includes various stand-alone and portable modules, e.g., for downloading and parsing web data.
Open Access Data Descriptor
Four Datasets Derived from an Archive of Personal Homepages (1995–2009)
Data 2017, 2(2), 19; doi:10.3390/data2020019
Abstract
While data from social media are easily accessible, understanding how individuals expressed themselves on the Internet in its initial years of public availability (the mid-to-late 1990s) has proved difficult. In this data deposit, I describe how archival data from Geocities homepages were retrieved and processed to remove non-text data, then further refined to create separate datasets, each of which provides unique insights into modes of personal expression on the early Internet. The present paper describes four datasets, all of which were derived from a larger collection of personal websites: (1) a large corpus of raw text data from Geocities personal homepages, (2) a linguistic analysis of basic psychological properties of the same Geocities pages, using an open-source implementation of the Linguistic Inquiry and Word Count (LIWC), (3) a dataset of links between homepages (suitable for network analysis), and (4) a manifest dataset summarizing the size and last update date for each file in the dataset. Data from over 378,000 Geocities pages are included. In addition to providing a detailed description of how these datasets were created, I describe how they might be utilized in future research.
Open Access Data Descriptor
Towards Automatic Bird Detection: An Annotated and Segmented Acoustic Dataset of Seven Picidae Species
Data 2017, 2(2), 18; doi:10.3390/data2020018
Abstract
Analysing behavioural patterns of bird species in a certain region enables researchers to recognize forthcoming changes in environment, ecology, and population. Ornithologists spend many hours observing and recording birds in their natural habitat to compare different audio samples and extract valuable insights. This manual process is typically undertaken by highly experienced birders who identify every species and its associated type of sound. In recent years, some public repositories hosting labelled acoustic samples from different bird species have emerged, which has resulted in appealing datasets that computer scientists can use to test the accuracy of their machine learning algorithms and assist ornithologists in the time-consuming process of analyzing audio data. Current limitations in the performance of these algorithms come from the fact that the acoustic samples of these datasets combine fragments with only environmental noise and fragments with the bird sound (i.e., the computer confuses environmental sound with the bird sound). Therefore, the purpose of this paper is to release a dataset lasting more than 4984 s that contains differentiated samples of (1) bird sounds and (2) environmental sounds. This data descriptor releases the processed audio samples (originally obtained from the Xeno-Canto repository) from the seven known species of the Picidae family inhabiting the Iberian Peninsula, which are good indicators of habitat quality and have significant value from the environment conservation point of view.
Open Access Data Descriptor
Transcriptome Dataset of Soybean (Glycine max) Grown under Phosphorus-Deficient and -Sufficient Conditions
Data 2017, 2(2), 17; doi:10.3390/data2020017
Abstract
This data descriptor introduces the transcriptome dataset of the low-phosphorus tolerant soybean (Glycine max) variety NN94-156 under phosphorus-deficient and -sufficient conditions. The data comprise transcriptome datasets (four libraries) acquired from roots and leaves of soybean plants challenged with low phosphorus, which allows further analysis of whether a systemic tolerance response to low-phosphorus stress occurred. We describe in detail how the plants were prepared and treated and how the data were generated and pre-processed. Further analyses of these data would help to improve our understanding of the molecular mechanisms of low-phosphorus stress in soybean.
Open Access Data Descriptor
Long-Term Land Cover Data for the Lower Peninsula of Michigan, 2010–2050
Data 2017, 2(2), 16; doi:10.3390/data2020016
Abstract
Land cover data are often used to examine the impacts of landscape alterations on the environment from the local to global scale. Although various agencies produce land cover data at various spatial scales, data are still limited at the regional scale over extended timescales. This is a critical data gap since decision-makers often use future and long-term land cover maps to develop effective policies for sustainable environmental systems. As a result, land change science incorporates common data mining tools to create future land cover maps that extend over long timescales. This study applied a well-known land cover change model, the Land Transformation Model (LTM), to produce urbanization maps for the Lower Peninsula of Michigan in the United States from 2010 to 2050 at five-year intervals. Long-term urbanization data for the Lower Peninsula of Michigan can be used in various environmental studies, such as assessing the impact of future urbanization on climate change, water quality, food security and biodiversity.
Open Access Article
Demonstration Study: A Protocol to Combine Online Tools and Databases for Identifying Potentially Repurposable Drugs
Data 2017, 2(2), 15; doi:10.3390/data2020015
Abstract
Traditional methods for discovery and development of new drugs can be very time-consuming and expensive because they include several stages, such as compound identification and pre-clinical and clinical trials, before the drug is approved by the U.S. Food and Drug Administration (FDA). Therefore, drug repurposing, namely using currently FDA-approved drugs as therapeutics for diseases other than those for which they were originally prescribed, is emerging as a faster and more cost-effective alternative to current drug discovery methods. In this paper, we describe a three-step in silico protocol for analyzing transcriptomics data using online databases and bioinformatics tools to identify potentially repurposable drugs. The efficacy of this protocol was evaluated by comparing its predictions with the findings of two case studies of recently reported repurposed drugs: the HIV drug zidovudine for the treatment of dry age-related macular degeneration and the antidepressant imipramine for small-cell lung carcinoma. The proposed protocol successfully identified the published findings, thus demonstrating the efficacy of this method. In addition, it yielded several novel predictions that have not yet been published, including the finding that imipramine could potentially treat Severe Acute Respiratory Syndrome (SARS), a disease that currently does not have any treatment or vaccine. Since this in silico protocol is simple to use and does not require advanced computer skills, we believe any motivated participant with access to these databases and tools would be able to apply it to large datasets to identify other potentially repurposable drugs in the future.