Selection Criteria and Data Collection
To ensure a systematic review process, the guidelines provided by Higgins et al. [10] were followed. The systematic literature review used in this study comprised four primary steps: formulating the research questions, defining the review protocol, analysing the literature, and finally analysing the data and reporting the findings. Many educational research papers published in the 21st century integrate text mining or natural language processing into their methodology, but not all of them relate to our two research questions. The paper selection criteria were therefore designed to ensure that our analysis focuses on peer-reviewed research papers that apply the aforementioned techniques to students’ learning and improved teaching interventions. We aimed to find peer-reviewed research papers published in the 21st century whose objectives relate primarily to our research questions. Furthermore, this systematic review aims to uncover emerging trends in text mining and natural language processing techniques in order to provide researchers with insights for further investigation.
Figure 1 illustrates the process used to identify the papers included in this systematic review, which was guided by the PRISMA guidelines [11]. To collect the papers related to educational text mining, two abstraction and citation databases, Web of Science (Core Collection) and Scopus, were targeted. We selected these two well-established databases [12] because manual inspection of the conference proceedings and journals they cover revealed that their combined use gives the highest degree of coverage of the author keywords related to our study. The initial search term was therefore set to find English peer-reviewed publications, published in the 21st century, that have “text mining”, “text analytics”, “text analysis”, “writing analytics”, “natural language processing”, “NLP”, “language model” or “computational linguistics” in their title, and also have “teach*”, “learn*”, “student”, “educat*”, “university”, “college”, “institution” or “school” in their title, abstract or keywords. To accomplish this, the following initial search terms were used (we thank the anonymous reviewers for their invaluable comments, which enabled a broader keyword search):
Scopus search term: (TITLE (“text mining” OR “text analytics” OR “text analysis” OR “natural language processing” OR “NLP” OR “writing analytics” OR “writing analysis” OR “language model” OR “computational linguistics”) AND TITLE-ABS-KEY (“teach*” OR “learn*” OR “educat*” OR “university” OR “college” OR “institution” OR “school” OR “student”)) AND PUBYEAR > 1999 AND (EXCLUDE (DOCTYPE, “re”)) AND (LIMIT-TO (LANGUAGE, “English”))
Web of Science search term: (TI = (“text mining” OR “text analytics” OR “text analysis” OR “natural language processing” OR “NLP” OR “writing analytics” OR “writing analysis” OR “language model” OR “computational linguistics”)) AND TS = (“teach*” OR “learn*” OR “educat*” OR “university” OR “college” OR “institution” OR “school” OR “student”) and Review Articles (Exclude–Document Types) and English (Languages)
Applying the selection criteria to Scopus and Web of Science returned 4433 and 2331 publications, respectively. Upon closer inspection of the returned papers, we noted that a considerable number of key papers in the field were identified by neither Web of Science nor Scopus. At least two common reasons explain this: first, in some cases there is no explicit mention of the discipline in the title of the paper; instead, the authors used a term representing a broader discipline (for example, “learning analytics” as a discipline instead of “text mining”) or used the formal name of a direct application of text mining in educational settings (e.g., “automated writing evaluation”); second, for some publications the authors put the name of the text analysis technique (e.g., “tf-idf”) and/or a specific technical term (e.g., “recurrent neural network”) used to analyse the educational text data explicitly in the title of the paper. Therefore, we needed to extend our search term so that it caters for publications that are potentially within the scope of this study but are not returned by Scopus or Web of Science under the initial search term.
To tackle these issues, we extracted all the author keywords present in the bibliographic records of the publications identified by the first search and sorted them by frequency. Next, using a z-score transformation of these frequencies, we calculated a z-score for each author keyword, and using a cut-off value of +1.96 we selected those author keywords (n = 41) that are enriched in the bibliographic records returned by our initial search. This gave us a pool of author keywords (see Table 1) that are favourable for this study, providing a basis for extending our search term. Exploring the list of abundant author keywords also made us realise that some highly enriched author keywords are unrelated to our interest (e.g., “electronic health records”). We later used these author keywords when constructing the new search terms to reduce the number of false positives in our new search results. This list also helped us identify variations in author keywords (e.g., “language model” and “language models”) that should be considered when constructing the new search term.
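The enrichment step described above can be sketched as follows. The `enriched_keywords` helper and the frequency values are illustrative stand-ins, not the study’s actual code or full keyword data:

```python
from statistics import mean, stdev

def enriched_keywords(freqs, cutoff=1.96):
    """Return keywords whose frequency z-score exceeds the cutoff,
    sorted by z-score in descending order."""
    mu = mean(freqs.values())
    sigma = stdev(freqs.values())  # sample standard deviation
    z_scores = {kw: (f - mu) / sigma for kw, f in freqs.items()}
    return {kw: round(z, 2)
            for kw, z in sorted(z_scores.items(), key=lambda kv: -kv[1])
            if z > cutoff}

# Illustrative frequencies for a small subset of author keywords
freqs = {
    "natural language processing": 1502, "machine learning": 1031,
    "sentiment analysis": 206, "language model": 165, "text analysis": 125,
    "classification": 119, "learning": 99, "electronic health records": 77,
    "ontology": 68, "big data": 60, "clustering": 57, "lstm": 53,
}
print(enriched_keywords(freqs))
```

With such a small, skewed sample only the single most frequent keyword clears the +1.96 cut-off; on the study’s full keyword list, 41 author keywords cleared it.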
Figure 1. PRISMA [11] guided systematic review procedure used for this study.
Next, we assigned each item in the prepared list of enriched author keywords to one of three groups:
Education-related terms (Group A): words that represent education, teaching, or learning (e.g., “distance learning”, “MOOCs”)
Text-related jargon (Group B): terms that deal with preparing, processing, presenting or analysing text data (e.g., “word embedding”, “sentiment analysis”)
Data analysis technique, jargon or discipline (Group C): terms that represent the name of a technique or part of a process concerned with analysing the data (e.g., “support vector machine”, “neural networks”)
Categorisation of the 41 author keywords into the aforementioned groups resulted in 1, 20 and 18 author keywords for Groups A, B and C, respectively. Since we needed more “education”-related author keywords, we relaxed the z-score cut-off to move further down the list and add more author keywords to the defined groups, focusing in particular on keywords related to Group A. In the end, we collected 60, 167 and 156 author keywords for Groups A, B and C, which could then be considered for constructing the new search terms. Next, we defined two new groups of search terms and used the author keywords to implement them:
Publications that have:
- a text-related term (Group B) as well as an education-related term (Group A) in their title
Publications that have:
- a data analysis related technique, jargon or discipline (Group C) in their title, and
- a text-related term (Group B) in their title, abstract or author keywords, and
- an education-related term (Group A) in their title
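A minimal sketch of how such Scopus-style queries could be assembled from the three groups; the group contents and helper names below are invented for illustration and cover only a handful of the 60/167/156 collected keywords:

```python
# Illustrative samples from each keyword group (not the full lists)
group_a = ["distance learning", "moocs", "e-learning"]               # education
group_b = ["word embedding", "sentiment analysis", "topic modelling"]  # text
group_c = ["support vector machine", "neural networks"]              # analysis

def any_of(terms):
    """OR-join a list of terms as quoted phrases."""
    return " OR ".join(f'"{t}"' for t in terms)

# Strategy 1: a Group B term and a Group A term, both in the title
query1 = f"TITLE (({any_of(group_b)}) AND ({any_of(group_a)}))"

# Strategy 2: a Group C term in the title, a Group B term in the
# title/abstract/keywords, and a Group A term in the title
query2 = (f"TITLE ({any_of(group_c)}) "
          f"AND TITLE-ABS-KEY ({any_of(group_b)}) "
          f"AND TITLE ({any_of(group_a)})")
print(query1)
```

In practice the full keyword lists make these queries far longer, but the structure mirrors the two search strategies above.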
Table 1. Keywords of the papers that were returned when the initial search term was used.
| Author Keyword | Group | Frequency |
| --- | --- | --- |
| Natural language processing | B | 1502 |
| Machine learning | C | 1031 |
| Text mining | B | 1005 |
| Deep learning | C | 401 |
| NLP | B | 294 |
| Sentiment analysis | B | 206 |
| Artificial intelligence | C | 167 |
| Language model | B | 165 |
| Information extraction | C | 128 |
| Text analysis | B | 125 |
| Text classification | B | 124 |
| Classification | C | 119 |
| Social media | C | 115 |
| Data mining | C | 109 |
| Natural language | B | 102 |
| Learning | A | 99 |
| Text analytics | B | 95 |
| Big data | C | 89 |
| Neural networks | C | 85 |
| Information retrieval | C | 78 |
| Electronic health records | NA | 77 |
| Speech recognition | C | 76 |
| Transfer learning | C | 74 |
| Natural language processing (nlp) | B | 73 |
| Topic modelling | B | 73 |
| Processing | C | 71 |
| Ontology | B | 68 |
| BERT | B | 67 |
| Twitter | C | 66 |
| Computational linguistics | B | 65 |
| COVID-19 | NA | 64 |
| Natural | C | 62 |
| Language models | B | 60 |
| Language processing | B | 60 |
| Word embeddings | B | 60 |
| Language modelling | B | 58 |
| Named entity recognition | B | 58 |
| Clustering | C | 57 |
| Text | C | 57 |
| LSTM | B | 53 |
| Neural network | C | 52 |
Using the pool of related author keywords and guided by the aforementioned new search strategies, we next performed searches on Scopus that yielded a set of 9666 papers. Motivated by the richness of the publications returned by Scopus and guided by the findings of [13], we chose not to repeat this comprehensive search on Web of Science, Dimensions or any other citation database. Next, the abstract and title of each publication were manually examined to verify that the paper suits the scope of this study. Papers focused on analysing literature reviews using text mining, conference proceedings, proceedings trend analyses, journal trend analyses, bibliometric analysis papers (systematic reviews), theses, papers introducing new text mining or natural language processing techniques in non-educational settings, and studies examining the application of text mining and natural language processing in a broad sense were removed. In the end, a total of 981 publications were selected and used for analysis in this study (the final search term used for this study, as well as the resulting BibTeX files, are available for download at https://zenodo.org/record/5890421#.Yeu92f5BxjE). It is worth mentioning that the number of accepted papers under the final search term (981) is considerably larger than the number accepted (n = 321) under the initial search terms.
The quantitative analysis in this review employs bibliometric analysis of the selected papers to generate various quantitative results and identify the main research themes. The authors of [14] provide a summary of some widely used tools for bibliometric analysis. We used the Bibliometrix R package [15] to conduct the bibliometric analysis for this paper. The package provides various functions for a comprehensive analysis of the selected literature.
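As an illustration of the kind of tally such tools produce, the following sketch counts publications per year from BibTeX records using only the Python standard library. The records below are invented placeholders, and the study’s actual analysis was performed with Bibliometrix in R:

```python
import re
from collections import Counter

# Invented placeholder records; the study used the downloaded .bib files
bibtex = """
@article{a1, title={Example one}, year={2019}}
@article{a2, title={Example two}, year={2020}}
@inproceedings{a3, title={Example three}, year={2020}}
"""

def publications_per_year(records: str) -> Counter:
    """Count 'year = {...}' fields across all BibTeX entries."""
    years = re.findall(r"year\s*=\s*\{(\d{4})\}", records, flags=re.IGNORECASE)
    return Counter(years)

print(publications_per_year(bibtex))
```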