
From a Smoking Gun to Spent Fuel: Principled Subsampling Methods for Building Big Language Data Corpora from Monitor Corpora

Department of English, Franklin College of Arts and Sciences, University of Georgia, Athens, GA 30602, USA
Received: 29 November 2018 / Revised: 22 March 2019 / Accepted: 30 March 2019 / Published: 2 April 2019
(This article belongs to the Special Issue Semantics in the Deep: Semantic Analytics for Big Data)
With the influence of Big Data culture on qualitative data collection, acquisition, and processing, it is becoming increasingly important that social scientists understand the complexity underlying data collection and the resulting models and analyses. Systematic approaches for creating computationally tractable models need to be employed in order to create representative, specialized reference corpora subsampled from Big Language Data sources. Even more importantly, any such method must be tested and vetted for its reproducibility and consistency in generating a representative model of a particular population in question. This article considers and tests one such method for Big Language Data downsampling of digitally accessible language data, both to determine how to operationalize this form of corpus model creation and to test whether the method is reproducible. Using the U.S. Nuclear Regulatory Commission’s public documentation database as a test source, the sampling procedure was evaluated to assess variation in the rate at which documents were deemed fit for inclusion in or exclusion from the corpus across four iterations. After performing multiple sampling iterations, the approach pioneered by the Tobacco Documents Corpus creators was deemed to be reproducible and valid using a two-proportion z-test at a 99% confidence level at each stage of the evaluation process, leading to a final mean rejection ratio of 23.5875 and variance of 0.891 for the documents sampled and evaluated for inclusion in the final text-based model. The findings of this study indicate that such a principled sampling method is viable, underscoring the need for an approach to creating language-based models that accounts for both extralinguistic factors and linguistic characteristics of documents.
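The reproducibility check described in the abstract can be sketched as a standard two-proportion z-test comparing rejection rates across sampling iterations. The sketch below uses only the test as named in the abstract; the function names and the document counts are hypothetical illustrations, not the study's actual data.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z-statistic comparing rejection rates x1/n1 and x2/n2
    from two sampling iterations, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def consistent_at_99(x1, n1, x2, n2):
    """True if the two rejection rates are statistically
    indistinguishable at the 99% confidence level
    (two-tailed critical value z = 2.576)."""
    return abs(two_proportion_z(x1, n1, x2, n2)) < 2.576

# Hypothetical counts: 24 of 100 documents rejected in one
# iteration vs. 23 of 100 in the next.
print(consistent_at_99(24, 100, 23, 100))  # prints True
```

A non-significant z-statistic between iterations is what licenses the conclusion that the subsampling procedure behaves consistently from pass to pass.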
Keywords: corpus linguistics; language modeling; big data; language data; databases; monitor corpora; documentary analysis; nuclear power; government regulation; tobacco documents

Tidwell, J.H. From a Smoking Gun to Spent Fuel: Principled Subsampling Methods for Building Big Language Data Corpora from Monitor Corpora. Data 2019, 4, 48.

