4.2. The Proposed Method
Step 1. Corpora compilation
The target corpora contain 615 COVID-19 news reports from the FOX News website, one of the mainstream media organizations in the United States. The corpora are divided into Corpus 1 and Corpus 2, containing news reports from January and February, respectively; together they comprise 16,536 word types and 457,891 tokens, and their diversity indicator, the type/token ratio (TTR), is 3.6% (see
Table 1 and
Figure 5).
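The TTR reported above is simply the number of word types divided by the number of tokens. A minimal sketch, assuming a simple regex tokenizer (AntConc's own tokenization rules differ) and an illustrative toy sentence:

```python
import re

def ttr(text: str) -> float:
    """Type/token ratio of a text, as a percentage."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return len(set(tokens)) / len(tokens) * 100

# With the corpus-level counts reported in the text:
print(round(16_536 / 457_891 * 100, 1))  # → 3.6

# On a toy sentence (10 tokens, 7 types):
sample = "the outbreak spread as the virus spread across the country"
print(round(ttr(sample), 1))  # → 70.0
```

A low TTR, as in the target corpora, reflects heavy repetition of a comparatively small vocabulary across a large token count.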
Step 2. Selecting and inputting the benchmark corpus
To generate a keyword list from the target corpora, the researchers choose the Corpus of Contemporary American English (COCA) as the benchmark corpus, which is input into AntConc 3.5.8 [
30]. In this study, the COCA sample, recently released online, encompasses 123,029 word types and 9,412,521 tokens, and its TTR is 1.3%. The benchmark corpus is much larger than the target corpora and genre-balanced; thus, COCA can be considered an ideal benchmark corpus.
Step 3. Generating a wordlist from the target corpora
The corpora of COVID-19 news reports are analyzed using AntConc 3.5.8 [
30], and the wordlist is generated without any additional adjustments. The raw wordlist covers 16,536 word types, ranked in frequency order. An excerpt of the wordlist (see
Table 2) shows that many function words, meaningless words, and even stray characters are embedded in the wordlist (highlighted in
Table 2). These entries decrease the efficiency of wordlist analysis, since researchers must correct or eliminate them manually.
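The raw wordlist of Step 3 is essentially a frequency-ranked type count. A sketch with a hypothetical mini-text (not the study's corpus):

```python
import re
from collections import Counter

def raw_wordlist(text: str) -> list[tuple[str, int]]:
    """Rank every word type in the text by raw frequency."""
    return Counter(re.findall(r"[a-z]+", text.lower())).most_common()

text = "the coronavirus outbreak: the virus spread and the outbreak grew"
# Function words such as 'the' dominate the top of an unadjusted list.
print(raw_wordlist(text)[0])  # → ('the', 3)
```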
Step 4. Generating a keyword list from the target corpora
Keyword list generation is a corpus software function for filtering out function words, meaningless words, and words used chiefly for general purposes. The corpus software uses a likelihood-ratio algorithm to determine the keyness of each token. Keywords are words that serve specific purposes and indicate the domain characteristics of the target corpora. According to
Table 3, the keyword list shows words that have specific usages from the corpora of COVID-19 news reports, such as
coronavirus,
virus,
outbreak,
infected, and
CDC (Centers for Disease Control and Prevention), which are closer to the discipline of the target corpora than words on the wordlist (also see
Table 2). The corpus software, based on its algorithm, retrieves 1346 keywords; however, function words, meaningless words, or simple letters such as
don, u, didn, doesn, has, the, etc. still exist on the keyword list. In order to enhance the efficiency of corpus analysis, the following procedures are dedicated to optimizing the resulting wordlist and the keyword list.
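The likelihood-ratio keyness mentioned above is commonly computed as Dunning's log-likelihood statistic; the sketch below assumes that formulation (the counts are illustrative, not AntConc's internal code or the study's exact figures):

```python
import math

def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Dunning's log-likelihood keyness.
    a: word freq in target corpus, b: freq in benchmark corpus,
    c: target corpus size (tokens), d: benchmark corpus size (tokens)."""
    e1 = c * (a + b) / (c + d)  # expected freq in target
    e2 = d * (a + b) / (c + d)  # expected freq in benchmark
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# A word frequent in the target corpora but rare in a general benchmark
# corpus receives a high keyness score (hypothetical benchmark count).
print(round(log_likelihood(2425, 10, 457_891, 9_412_521), 1))
```

A word whose relative frequency is about the same in both corpora scores near zero, which is why general-purpose words are pushed down the keyword list.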
Step 5. Establishing a function word list
Based on English linguistic patterns and rules [
56], the researchers categorize function words into fifteen categories, which include auxiliary verbs, conjunctions, determiners (articles, demonstratives, possessive pronouns, and quantifiers), modals, prepositions, pronouns, qualifiers, question words, comparatives, conditionals, concessive clauses, frequencies, other words, negatives, and meaningless words; the total quantity of function words and meaningless characters is 228 tokens (see
Table 4). These 228 tokens are among the most common words used to construct English sentences and have the highest occurrence in articles written in English. Moreover, they commonly reduce the efficiency of corpus analysis in specialized cases.
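The Step 5 function word list can be represented as a category-to-words mapping that is flattened into a single stop list; the entries below are a small illustrative subset of the 228-item list, not the full inventory:

```python
# Illustrative subset of the fifteen-category function word list (Step 5).
FUNCTION_WORDS = {
    "auxiliary verbs": ["be", "have", "do"],
    "conjunctions": ["and", "but", "or"],
    "determiners": ["the", "a", "this", "some"],
    "modals": ["can", "may", "must"],
    "prepositions": ["of", "in", "to"],
    "pronouns": ["it", "they", "he"],
    "negatives": ["not", "no", "never"],
    # ... remaining categories (qualifiers, question words, etc.) omitted
}

# Flatten into the stop list that is later fed to the corpus software.
STOP_LIST = sorted({w for words in FUNCTION_WORDS.values() for w in words})
print(len(STOP_LIST))  # → 22
```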
Step 6. Limiting wordlist range
In this step, the function wordlist compiled in step 5 is transformed into .txt format (UTF-8) and is input into AntConc 3.5.8 [
30] to limit the wordlist range. After the wordlist range is constrained, the keyness calculation (i.e., the likelihood-ratio test) results also change. Thus, the resulting wordlist and keyword list are optimized in this step.
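Step 6 in miniature: every stop-list entry is removed from the ranked wordlist before keyness is recomputed. A sketch with hypothetical frequencies (only the coronavirus count comes from the text):

```python
def refine_wordlist(wordlist, stop_list):
    """Drop function words and meaningless words from a (word, freq) ranked list."""
    stops = set(stop_list)
    return [(w, f) for (w, f) in wordlist if w not in stops]

# Illustrative ranked entries; 'the' and 'of' frequencies are hypothetical.
raw = [("the", 21097), ("of", 9800), ("coronavirus", 2425), ("virus", 1200)]
print(refine_wordlist(raw, ["the", "of", "a"]))
# → [('coronavirus', 2425), ('virus', 1200)]
```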
Step 7. Generating a refined word list from the target corpora
The refined wordlist from the corpora of COVID-19 news reports is downsized to 16,325 word types (original wordlist: 16,536 word types). The 211 eliminated entries were function words, meaningless words, or stray letters. Words that are more meaningful and more specific to the corpora emerge in the top 100 of the refined wordlist (see
Table 5). This allows researchers to further expose the domain knowledge of the corpora of COVID-19 news reports.
Step 8. Generating a refined keyword list from the target corpora
After the optimization process, the corpus software recalculates the keyness of each word and extracts 2149 keywords (original keyword list: 1346 keywords). Without the interference of function words and meaningless words, keyword retrieval and analysis yields more keywords that are closer to the disciplinary purposes and usages of the target corpora (see
Table 6). As the quantity of keywords increases, the chance of information distortion decreases.
Step 9. Providing critical information and reference data for decision-makers
The optimized wordlist and keyword list from the COVID-19 news corpora provide decision-makers with critical reference data for further decisions. The optimized results of steps 7 and 8 supply decision-makers with domain-oriented words, which they can use to extract critical information through the corpus software. Once decision-makers have sufficient information, their decisions are less likely to be distorted or to cause significant errors.
4.3. Comparison and Discussion
To enhance the efficiency of big textual data optimization, this section uses the corpora of COVID-19 news reports as an empirical example and discusses the proposed approach in three respects: optimization efficiency, refined results, and knowledge extraction. Firstly, the proposed approach is compared with three existing corpus-based approaches with regard to the elimination of function words, both manual and machine-based, to explore refining efficiency. Secondly, the differences between the original and refined data are presented quantitatively to verify the proposed approach. Finally, information retrieval (IR) from a big textual dataset is demonstrated to highlight the value of knowledge extraction. Detailed descriptions follow.
(1) Function word elimination and machine-based elimination of function words
This paper compares four ESP cases (see
Table 7). Firstly, the authors compare the elimination of function words. Li’s approach [
29] was used to explore the linguistic and domain usages of vague terms in JRC-Acquis (EN), a corpus of legal documents. Thus, the study did not eliminate function words manually or by machine processing. Vague terms, in Li’s study [
31], were listed as
some, or more, several, about, a period of, a number of, and so on; those terms frequently collocated with quantifiers, prepositions, and other function words. If function words were deleted, the corpus program might be unable to cluster vague-term phrases for linguistic analysis. That is, in some genre or general linguistic analyses, function words are essential because they are critical components for constructing English clauses and sentences. The proposed approach, Todd's approach [
32], and Ross and Rivers' approach [
33] focused on analyzing specific cases of COVID-19 news reports, engineering fields, and President Trump’s tweets, respectively; thus, function words needed to be eliminated because specific words or terms were more relevant for retrieving domain knowledge and critical information from the target corpora.
Secondly, the researchers compare machine-based means of eliminating function words. Todd's approach [
32] created a manual five-stage filtering approach to refine the data (i.e., the keyword list) output by the corpus software in the context of retrieving opaque words in engineering corpora. The first two steps of this manual five-stage filtering approach were removing meaningless words and function words; Todd's approach [
32] did not use a machine (e.g., corpus software or another computer method) to delete function words. If the proposed approach were embedded into Todd's approach [
32], it would save two steps of the manual five-stage filtering process and enhance the efficiency of the wordlist analysis. Ross and Rivers' approach [
33] simply removed function words from the keyword list manually after the corpus software created it and then built a taxonomy of keywords; Ross and Rivers' approach [
33] did not use a machine to remove function words either. The second part of this discussion demonstrates the discrepancy between the original keyword list and the refined keyword list: the refined keyword list in this study increased the coverage of keywords. If the proposed method were embedded into Ross and Rivers' approach [
33], it would enhance the breadth of the keyword list and eliminate the step of removing function words manually. That is, Ross and Rivers' approach [
33] would no longer need to spend time manually adjusting the data output by the corpus software; instead, the proposed approach might allow them to obtain more accurate data and conduct a token taxonomy directly. The proposed method uses a machine-based technique to eliminate function words and meaningless words in order to improve the efficiency and accuracy of the data resulting from corpus analysis.
(2) Data discrepancy
There is a significant discrepancy between the original data and refined data in both the wordlist and keyword list. In terms of the wordlist (see
Table 8), the refined data (word types: 16,325; tokens: 234,612) eliminated 211 function word types compared with the original data (word types: 16,536; tokens: 457,891); the size of the corpora was dramatically reduced by 48.8%, and the TTR rose from 3.6% to 7%. The remaining 51.2% of the corpora contains words that are more meaningful and domain-specific. This confirms that function words are critical elements for constructing English sentences; it also confirms that the proposed approach significantly improves corpus analysis efficiency through machine processing rather than manual optimization. Moreover, only 211 word types occupy 48.8% of the target corpora; it is no wonder that function words cause inconvenience and interference when processing information in ESP cases.
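The Table 8 figures follow directly from the counts given in the text; a quick check:

```python
# Checking the Table 8 discrepancy figures from the counts given in the text:
# original 16,536 types / 457,891 tokens; refined 16,325 types / 234,612 tokens.
orig_types, orig_tokens = 16_536, 457_891
refined_types, refined_tokens = 16_325, 234_612

reduction = (orig_tokens - refined_tokens) / orig_tokens * 100
ttr_orig = orig_types / orig_tokens * 100
ttr_ref = refined_types / refined_tokens * 100

print(orig_types - refined_types)  # word types removed → 211
print(round(reduction, 1), round(ttr_orig, 1), round(ttr_ref, 1))  # → 48.8 3.6 7.0
```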
Keyword extraction is designed to filter out function words, unrelated words, and commonly used words in the target corpora. However, even though the corpus software has sophisticated statistical algorithms, function words can still interfere with the resulting keyword list data. According to
Table 9, the proposed approach makes the corpus software recalculate the keyness of each token: the refined data (keyword types: 2149; tokens: 156,841), compared with the original data (keyword types: 1346; tokens: 226,271), gained 803 keywords and shed 69,430 tokens (a 30.7% reduction). The increase of 803 keywords makes keyword analysis more accurate and embeds more detailed information for analyzing ESP cases.
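The keyword discrepancy figures in Table 9 can also be checked from the counts given in the text:

```python
# Keyword-list discrepancy implied by the counts reported for Table 9.
orig_keywords, refined_keywords = 1_346, 2_149
orig_kw_tokens, refined_kw_tokens = 226_271, 156_841

print(refined_keywords - orig_keywords)    # keywords gained → 803
print(orig_kw_tokens - refined_kw_tokens)  # keyword tokens removed → 69430
print(round((orig_kw_tokens - refined_kw_tokens) / orig_kw_tokens * 100, 1))  # → 30.7
```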
(3) Demonstration of knowledge extraction
Knowledge extraction from big textual data is demonstrated by taking
coronavirus as an example. According to the refined wordlist and keyword list,
coronavirus is ranked first based on both its frequency (=2425) and keyness (=17,780.87). However, before the wordlist was refined, the original wordlist showed
coronavirus is ranked 24th; that is, the top 23 functional tokens (such as
the, to, and, of, etc.) must be eliminated or filtered manually or automatically for
coronavirus to emerge for further ESP analysis. Here, the researchers categorize significant features of the linguistic patterns of
coronavirus using the corpus software platform and extract information about
coronavirus from the big textual data of the COVID-19 news reports. Firstly, the researchers check the Cluster/N-Grams to find linguistic clusters of
coronavirus and, based on the researchers’ information analysis intentions, extract six critical types of information from the top 30 clusters list (see
Table 10). Then, the researchers check the concordance lines of each cluster (see
Figure 6) based on
Table 10 in order to retrieve crucial information.
Example 1. coronavirus outbreak: the cluster coronavirus outbreak can be considered an index to assist in searching for the most relevant content in the target corpus, such as events related to the disease outbreak (1-1, 1-2), the number of infections and deaths (1-3, 1-4, 1-5), and so on.
- 1-1
(retrieved from FOX-Jan-26-5) Coronavirus outbreak spurs Paris to cancel Lunar New Year parade…
- 1-2
(retrieved from FOX-Feb-4-1) A Chinese doctor who claimed he quietly warned of the coronavirus outbreak that has besieged the country…
- 1-3
(retrieved from FOX-Feb-15-6) …medical supplies to help China combat a coronavirus outbreak that has infected over 67,000 people.
- 1-4
(retrieved from FOX-Feb-17-3) …the deadly coronavirus outbreak that’s sickened over 70,500 in the country and killed at least 1,770.
- 1-5
(retrieved from FOX-Feb-17-4) …prayed for a cure to combat the coronavirus outbreak that has killed 1,770 people.
Example 2. coronavirus in: this cluster helps locate where coronavirus incidents happened (2-1, 2-5) and where cases occurred (2-2, 2-3, 2-4).
- 2-1
(retrieved from FOX-Jan-20-2) Human-to-human transmission of coronavirus in China confirmed.
- 2-2
(retrieved from FOX-Feb-3-6) Currently, there are six cases of the novel coronavirus in California, one in Arizona…
- 2-3
(retrieved from FOX-Feb-10-10) Leaving someone with coronavirus in a hallway could expose countless patients…
- 2-4
(retrieved from FOX-Feb-11-16) Evacuee confirmed to have coronavirus in California as US total reaches 13.
- 2-5
(retrieved from FOX-Feb-28-1) Dog tests ‘weak positive’ for coronavirus in Hong Kong, first possible infection in pet.
Example 3. coronavirus cases/coronavirus case: this indicates the reported medical cases that occurred in different areas, states, countries, etc. (3-1, 3-2, 3-3, 3-4, 3-5).
- 3-1
(retrieved from FOX-Jan-24-7) CDC confirms coronavirus case in Illinois, dozens more under investigation.
- 3-2
(retrieved from FOX-Jan-27-13) CDC: 110 suspected coronavirus cases in US under investigation…
- 3-3
(retrieved from FOX-Jan-27-11) Africa investigating first possible coronavirus case in Ivory Coast student: officials.
- 3-4
(retrieved from FOX-Feb-13-24) The number of coronavirus cases in China significantly increased on Thursday…
- 3-5
(retrieved from FOX-Feb-21-13) Coronavirus cases balloon in South Korea as outbreak spreads.
Example 4. coronavirus is: the grammatical structure of noun + auxiliary verb (such as am, is, are, etc.) may carry important information regarding explanations and definitions of the noun. In 4-1, it explains the transmission routes of the coronavirus. In 4-2, it defines the type of coronavirus. In 4-3, it explains the origins of the coronavirus. In 4-4, it shows an expert's point of view on the disease; 4-5 records President Trump's mocking remark about the coronavirus.
- 4-1
(retrieved from FOX-Feb-10-1) The coronavirus is primarily transmitted through respiratory droplets, meaning human saliva and mucus.
- 4-2
(retrieved from FOX-Feb-11-3) The new coronavirus is a respiratory virus, and we know respiratory viruses are often seasonal, but not always.
- 4-3
(retrieved from FOX-Feb-24-3) The strain of coronavirus is believed to have jumped from bats and snakes…
- 4-4
(retrieved from FOX-Feb-26-24) Dr. Marc Siegel said Wednesday that coronavirus is appearing to be more contagious than the flu.
- 4-5
(retrieved from FOX-Feb-29-17) Donald Trump: Coronavirus is Democrats’ ‘new hoax’.
Example 5. coronavirus death: this example helps us extract information about coronavirus death cases chronologically. In 5-1, the news report on 24 January indicated that there had been 41 deaths in China. In 5-2, the first overseas death was reported in the Philippines on 2 February. In 5-3, the second death outside of mainland China was reported in Hong Kong. In 5-4, China's coronavirus death toll rose by 105 on 16 February. In addition, 5-5 reported that Japan announced its first death case.
- 5-1
(retrieved from FOX-Jan-24-2) Coronavirus death toll rises to 41 in China, more than 1200 sickened.
- 5-2
(retrieved from FOX-Feb-2-5) Coronavirus death in Philippines said to be first outside China.
- 5-3
(retrieved from FOX-Feb-4-4) A second coronavirus death outside of China was reported earlier Tuesday by Hong Kong…
- 5-4
(retrieved from FOX-Feb-16-3) China sees coronavirus death toll rise by 105.
- 5-5
(retrieved from FOX-Feb-18-2) Japan announced its first coronavirus death last Thursday.
Example 6. coronavirus patients/coronavirus patient: this indicates related diagnostic tests (6-1, 6-5), medical treatments (6-2), and medical cases (6-3, 6-4).
- 6-1
(retrieved from FOX-Jan-23-8) Suspected coronavirus patients in Scotland being tested: reports.
- 6-2
(retrieved from FOX-Feb-18-3) Japan to trial HIV medications on coronavirus patients.
- 6-3
(retrieved from FOX-Feb-18-15) ‘SARS-like damage’ seen in dead coronavirus patient in China, report says.
- 6-4
(retrieved from FOX-Feb-19-9) Iran’s first 2 coronavirus patients die, state media says.
- 6-5
(retrieved from FOX-Feb-26-20) First US docs to analyze coronavirus patients’ lungs say insight could lead to quicker diagnosis.
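The Cluster/N-Grams and concordance (KWIC) retrieval behind Examples 1-6 can be sketched as follows; the bigram search and line filter are simplified stand-ins for AntConc's interactive tools, and the sample headlines are drawn from the examples above:

```python
import re
from collections import Counter

def clusters(text: str, node: str, n: int = 2):
    """Frequency-ranked n-gram clusters beginning with the node word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(g for g in grams if g.startswith(node + " ")).most_common()

def concordance(lines, cluster):
    """Return every line containing the chosen cluster."""
    return [ln for ln in lines if cluster in ln.lower()]

headlines = [
    "Coronavirus outbreak spurs Paris to cancel Lunar New Year parade",
    "Coronavirus death toll rises to 41 in China",
    "Human-to-human transmission of coronavirus in China confirmed",
]
print(clusters(" ".join(headlines), "coronavirus"))
print(concordance(headlines, "coronavirus outbreak"))
```

Running the cluster search first surfaces candidate phrases (coronavirus outbreak, coronavirus in, etc.); the concordance call then retrieves the lines behind a chosen cluster, mirroring the two-stage reading of Table 10 and Figure 6.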
To conclude this section, the first part of the discussion demonstrated how the proposed approach could be imported into other researchers' corpus-based approaches to replace their manual steps of removing function words, increasing the efficiency of their analyses and saving time. The second part presented quantitative data verifying that the proposed approach removes function words and meaningless words and significantly increases the efficiency of corpus analysis in an ESP case. The final part illustrated IR from big textual data, also called knowledge extraction, to shed light on the value of the proposed optimized corpus-based approach to ESP big textual analysis.