One-Class Learning for AI-Generated Essay Detection
Abstract
1. Introduction
2. Materials and Methods
2.1. Feature Extraction
- Text: This category comprises measurable surface features of the text that can serve as input to machine learning models. The six Text features in this study are distinct types of punctuation marks, number of Oxford commas, number of paragraphs [35], number of full stops, number of commas [35], and average sentence length [36]. Previous studies have demonstrated that human-generated texts typically contain a higher frequency of punctuation marks [35], although the distribution of distinct punctuation types has been largely overlooked. Similarly, the number of paragraphs, full stops, and commas, as well as the average sentence length, are higher in human texts [35]. The Oxford comma (a comma before the final element in a list) was included as a distinguishing feature because its use is prevalent in certain English dialects but absent in Spanish; it therefore acts as a language transfer feature when analyzing essays written by English-speaking learners of Spanish. Moreover, language bias, and more specifically dialect bias [37], will be evident in AI-generated texts if the Oxford comma is employed.
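As an illustration, the surface features above can be extracted with a few lines of standard-library Python. This is a sketch rather than the authors' implementation: the regular expression used to detect Oxford commas and the blank-line paragraph convention are simplifying assumptions.

```python
import re

def text_features(essay: str) -> dict:
    """Compute the six surface (Text) features from a raw essay string."""
    # Distinct punctuation marks occurring in the essay
    punctuation = set(re.findall(r"[.,;:!?\"'()\-]", essay))
    # Oxford comma: a comma immediately before the final "and"/"or" of a list
    oxford = len(re.findall(r",\s+(?:and|or)\s", essay))
    # Paragraphs are assumed to be separated by blank lines
    paragraphs = [p for p in essay.split("\n\n") if p.strip()]
    full_stops = essay.count(".")
    commas = essay.count(",")
    # Average sentence length in words, splitting naively on . ! ?
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    return {
        "distinct_punctuation": len(punctuation),
        "oxford_commas": oxford,
        "paragraphs": len(paragraphs),
        "full_stops": full_stops,
        "commas": commas,
        "avg_sentence_length": avg_len,
    }
```

A real implementation would need a more robust sentence splitter (abbreviations and decimals break the naive rule), but the feature definitions are the ones listed above.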
- Repetitiveness: Features in this category were compiled because AI-generated texts frequently reuse similar words and phrases, resulting in monotonous writing and a lack of narrative and linguistic diversity [22]. To quantify repetitiveness, the set of unique n-grams is extracted, the set is iterated over, and the number of times each n-gram appears in the text is counted, yielding the unigram, bi-gram, and tri-gram overlap. The overlap is calculated as the ratio between this count and the total number of distinct n-grams [38]. Additionally, the word frequencies in the data are compared against the top 5K and 10K words in each language [24], which determines how closely the lexicon in the dataset matches that of everyday speech and also allows the lexical diversity of human and AI-generated texts to be studied. Punctuation was removed from the essays before tokenization in order to determine the number of matches with the most-used words.
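The n-gram overlap computation can be sketched as follows. Note that the exact ratio definition in [38] is paraphrased above; this sketch adopts one plausible reading, treating overlap as the fraction of n-gram occurrences that repeat an n-gram already seen in the text (0.0 means no repetition at all).

```python
from collections import Counter

def ngram_overlap(tokens: list, n: int) -> float:
    """Fraction of n-gram occurrences that are repeats of an
    n-gram already present in the text."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Occurrences beyond the first appearance of each distinct n-gram
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)
```

For the phrase "only you can only you can", every unigram appears twice, so half of all unigram occurrences are repeats.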
- Emotional Semantics: This set of features represents a text’s emotional tone, expressed through words, phrases, and sentences [39]. Human-written text has been reported to show more subjectivity, complexity, and emotional diversity, whereas AI-generated text is more likely to be consistent in tone and emotionally neutral [40]. Polarity and Subjectivity, two of the emotional semantic features, are extracted using TextBlob, a lexicon-driven open-source Python text-processing package. Polarity expresses sentiment on a scale from −1.0 to 1.0 [−1.0: negative; 0.0: neutral; 1.0: positive], and Subjectivity indicates the degree to which personal feelings, views, beliefs, opinions, allegations, desires, suspicions, and speculations are expressed in the essay, with a range from 0.0 to 1.0 [0.0: very objective; 1.0: very subjective] [41]. Three model-driven techniques were used to capture the semantics of sentiments: Sentiment (ES), Sentiment (Multi-language), and Sentiment Score (Multi-language). For the first, a Naive Bayes model trained on over 800,000 reviews from El Tenedor, Decathlon, Tripadvisor, Filmaffinity, and eBay was used (https://github.com/sentiment-analysis-spanish/sentiment-spanish (accessed on 1 July 2023)). For Sentiment (Multi-language), the bert-base-multilingual-uncased-sentiment model was used, trained on reviews in different languages: English 150K, Dutch 80K, German 137K, French 140K, Italian 72K, and Spanish 50K. From this model, three categories are obtained: 1- and 2-star reviews are considered negative [−1.0], 3-star reviews are neutral [0.0], and 4- and 5-star reviews are positive [1.0]. For the last feature, Sentiment Score (Multi-language), the raw star ratings were normalized to a 0–1 range [0.0: one star; 1.0: five stars], yielding a continuous score.
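The mapping from star ratings to the two multi-language features can be written directly from the rules above. This is a plain re-statement of those rules, not code from the study.

```python
def sentiment_category(stars: int) -> float:
    """Three-way Sentiment (Multi-language) feature:
    1-2 stars -> negative (-1.0), 3 -> neutral (0.0), 4-5 -> positive (1.0)."""
    if stars <= 2:
        return -1.0
    if stars == 3:
        return 0.0
    return 1.0

def sentiment_score(stars: int) -> float:
    """Continuous Sentiment Score (Multi-language) feature:
    normalize a 1-5 star rating to [0.0, 1.0]."""
    return (stars - 1) / 4
```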
- Readability: The fourth set of features captures the complexity of a text’s vocabulary [18,26] through thirteen different indicators, four of them exclusive to the Spanish language. Each indicator possesses its own formula. The Flesch Reading Ease Score indicates how difficult a passage is to understand based on word length and sentence length. Scores range from 0 to 100, with higher scores denoting material that is easier to read and easily understood by elementary school students, while lower scores denote a text that is difficult to read and best understood by people with tertiary education. The Flesch Kincaid Grade Score also determines the readability of a text, but the score it provides is aligned with a US grade level. Given the formula, there is no upper bound; however, the lowest possible score, in theory, is −3.40. This indicator emphasizes sentence length over word length. The SMOG Index Score estimates the years of education necessary to comprehend a text. The Coleman Liau Index Score measures the understandability of a text, and its output indicates the US grade level necessary to comprehend the text; the score ranges from 1 to 11+. Similarly, the Automated Readability Index Score measures the understandability of a text by providing a representation of the US grade level required to understand it. Its scores range from 1 to 14, where 1 indicates kindergarten and 14 a college student. The Dale–Chall Readability Score uses a list of 3000 words that can be easily understood by fourth graders. Scores of 4.9 or lower indicate that the text is understood by an average fourth grader, while scores of 9.0–9.9 denote that the text is understood by college students. The Difficult Words Score provides a value based on the number of syllables each word contains; words with more than two syllables are considered difficult to understand.
The Linsear Write Formula Score indicates the grade level of a text based on sentence length and the number of words with three or more syllables; values range from 0 to 11+. The Gunning Fog Score estimates the years of formal education required to comprehend a text on a first reading, with scores ranging from 6 (sixth grade) to 17 (college graduate). The following four indexes were used for the Spanish dataset. The Fernández-Huerta Score stems from the Flesch Reading Ease Index, with a similar score interpretation: higher scores indicate that a text is easier to understand. The Szigriszt-Pazos Score is also based on the Flesch Reading Ease Index, with a similar score interpretation; however, its difficulty levels can also be associated with a type of text, i.e., a score between 1 and 15 indicates a text that is very hard to understand, typically a scientific or philosophical publication, whereas a score between 86 and 100 indicates a text that is very easy to read, usually a comic book or a short story. The Gutiérrez de Polini Score is not an adaptation of any existing index and was designed for school-level Spanish texts; a low score indicates a more difficult text. The Crawford Score indicates the years of schooling that a person requires in order to understand the text. The utilization of diverse readability indexes provides valuable information about the complexity and accessibility of each text.
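For concreteness, two of the English indicators can be computed as below. The formulas are the standard Flesch ones; the syllable counter is a deliberately naive vowel-group heuristic (libraries such as textstat are more accurate), so outputs are approximate. The theoretical Flesch Kincaid lower bound of −3.40 follows from a one-word, one-syllable sentence: 0.39 + 11.8 − 15.59 = −3.40.

```python
import re

def count_syllables(word: str) -> int:
    """Naive syllable estimate: count groups of consecutive vowels."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def _counts(text: str):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables

def flesch_reading_ease(text: str) -> float:
    n_sent, n_words, n_syll = _counts(text)
    return (206.835 - 1.015 * (n_words / n_sent)
            - 84.6 * (n_syll / n_words))

def flesch_kincaid_grade(text: str) -> float:
    n_sent, n_words, n_syll = _counts(text)
    return (0.39 * (n_words / n_sent)
            + 11.8 * (n_syll / n_words) - 15.59)
```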
- Part-of-Speech (POS): The aforementioned absence of syntactic and lexical variance in machine-generated texts can be quantified by identifying the distribution of parts of speech. These word classes are useful to analyze, understand, and construct sentences [42]. POS tagging has been used in previous work [18,24] to indicate the relative frequency of word types based on the per-sentence count of most classes. Given the complexity of languages, where word order can alter the meaning of a word or where words with different morphemes possess a similar meaning, SpaCy (https://spacy.io/ (accessed on 1 July 2023)) was used to parse and tag the data according to the linguistic context of each token. For morphologically inflected words, i.e., words modified to convey grammatical categories (number, tense, person), the lemma (root form, without inflection) is used as the token. A schematic view of all the features considered in this study is shown in Table 1.
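A per-sentence POS distribution of the kind described above can be computed once tagging is done. The sketch below assumes the tagger (e.g., spaCy's doc.sents with token.pos_) has already produced (token, tag) pairs per sentence, so the aggregation itself stays library-independent.

```python
from collections import Counter

# Universal POS tags tracked as features in this study
TAGSET = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "NOUN", "NUM",
          "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "SPACE"]

def pos_frequencies(tagged_sentences: list) -> dict:
    """Average per-sentence count of each POS tag across an essay.

    tagged_sentences: one list of (token, universal_POS_tag) pairs per
    sentence, e.g. obtained by iterating over spaCy's doc.sents.
    """
    totals = Counter()
    for sent in tagged_sentences:
        totals.update(tag for _, tag in sent)
    n = max(len(tagged_sentences), 1)
    return {tag: totals[tag] / n for tag in TAGSET}
```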
2.2. One-Class Models
- One-Class Support Vector Machines (OCSVMs) [43]: They conceptually operate in a similar way to Support Vector Machines, which identify a hyperplane separating data instances of two classes. The one-class counterpart instead uses a hyperplane to encompass all of the background data instances (human essays). Solving the OCSVM optimization problem corresponds to solving the dual quadratic programming problem:

$$\min_{\alpha} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu n}, \quad \sum_i \alpha_i = 1,$$

where $K(x_i, x_j)$ is a kernel function and $\nu \in (0, 1]$ bounds the fraction of training instances allowed to fall outside the boundary. The adoption of kernel values avoids the explicit computation of feature vectors, with a great improvement in computational efficiency. Common kernels include Linear, Radial Basis Function (RBF), Polynomial, and Sigmoid. Following the training phase, the OCSVM-learned hyperplane can categorize a new data instance (essay) as regular/normal (human) or different/anomalous (AI-generated) with respect to the training data distribution, based on its geometric location relative to the decision boundary.
- Local Outlier Factor (LOF) [44]: The method measures the deviation in local density of data instances (essays) with respect to their neighbors. The anomaly score returned by LOF is based on the ratio between the local density of the data instance and the average local density of its nearest neighbors. Considering the k-distance as the distance of instance A from its k-th nearest neighbor, it is possible to define the notion of reachability distance:

$$\text{reach-dist}_k(A, B) = \max\{\text{k-distance}(B), \, d(A, B)\},$$

so that instances belonging to the k nearest neighbors of B are considered to be equally distant. The local reachability density of an instance A, defined as the inverse of the average reachability distance of A from its neighbors, is:

$$lrd_k(A) = \left( \frac{\sum_{B \in N_k(A)} \text{reach-dist}_k(A, B)}{|N_k(A)|} \right)^{-1},$$

where $N_k(A)$ denotes the set of k nearest neighbors of A. Local reachability densities are then compared with those of the neighbors as:

$$LOF_k(A) = \frac{\sum_{B \in N_k(A)} \frac{lrd_k(B)}{lrd_k(A)}}{|N_k(A)|},$$

where values substantially greater than 1 indicate an outlier.
- Isolation Forest [45]: It uses a group of tree-based models and calculates an isolation score for every data instance (essay). The average path length from the tree’s root to the leaf associated with the data instance, corresponding to the number of partitions required to isolate the instance, is used to compute the anomaly score. Since more noticeable variations in values correspond to shorter paths in the tree, the method uses this information to distinguish abnormal observations (AI-generated essays) from the rest (human essays). The anomaly score is defined as

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}},$$

where $E(h(x))$ is the average path length of instance $x$ over the trees and $c(n)$ is the average path length of an unsuccessful search in a binary search tree built on $n$ instances, used as a normalization factor.
- Angle-Based Outlier Detection (ABOD) [46,47]: It is a widely adopted method that calculates the variance of weighted cosine scores between each data instance and its neighbors and uses that variance as the anomaly score. It reduces the frequency of false-positive detections by effectively identifying relationships in high-dimensional spaces between each instance and its neighbors, rather than interactions among neighbors.
- Histogram-based Outlier Score (HBOS) [48]: It is a straightforward statistical approach to anomaly detection that assumes feature independence. The main concept is the generation of a histogram for each feature of the data. Subsequently, for each instance, the method multiplies the inverse heights of the bins associated with it, providing an assessment of the density across all features. This behavior is conceptually similar to the Naive Bayes classifier, which multiplies all independent feature probabilities. Histograms represent a quick and effective way to identify anomalies (AI-generated essays); even though feature relationships are ignored (i.e., the method assumes feature independence), this simplification allows the method to converge quickly. HBOS builds histograms in two modalities: (i) static bins with a preset bin width, and (ii) dynamic bins with a close-to-equal number of instances per bin.
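In the static-bin modality, HBOS can be sketched in a few lines. The bin count and the normalization of bin heights below are illustrative implementation choices, not parameters reported in the study.

```python
import math

def hbos_scores(data: list, n_bins: int = 5) -> list:
    """Histogram-based Outlier Score with static, equal-width bins.

    data: list of feature vectors (lists of floats).
    Returns one anomaly score per instance; higher = more anomalous.
    """
    n_features = len(data[0])
    histograms = []
    for j in range(n_features):
        col = [row[j] for row in data]
        lo, hi = min(col), max(col)
        width = (hi - lo) / n_bins or 1.0  # guard against constant features
        counts = [0] * n_bins
        for v in col:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
        tallest = max(counts)
        # Normalize so the tallest bin has height 1 (as in the HBOS paper)
        histograms.append((lo, width, [c / tallest for c in counts]))
    eps = 1e-9
    scores = []
    for row in data:
        s = 0.0
        for j, v in enumerate(row):
            lo, width, heights = histograms[j]
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            # Sum of log inverse bin heights == log of the product
            s += math.log(1.0 / (heights[idx] + eps))
        scores.append(s)
    return scores
```

Summing log inverse heights is numerically equivalent to multiplying the inverse heights directly, but avoids underflow when many features are involved.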
- AutoEncoder: A neural network model used to learn efficient representations of unlabeled data in an unsupervised manner. It learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The learned representation can be used for detection tasks by reconstructing new data instances and comparing their reconstruction error with that observed on the learned distribution.
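Three of the detectors above are available directly in scikit-learn and can be trained on human-only feature vectors as follows. The feature matrix here is synthetic stand-in data, and the hyperparameters (nu, number of neighbors, etc.) are placeholders rather than the values tuned in the study.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Stand-in for linguistic feature vectors of human essays (training class only)
X_human = rng.normal(loc=0.0, scale=1.0, size=(300, 6))
# Stand-in for unseen essays to classify: 5 human-like, 5 far-away anomalies
X_new = np.vstack([rng.normal(0.0, 1.0, size=(5, 6)),
                   rng.normal(8.0, 1.0, size=(5, 6))])

detectors = {
    "OneClassSVM (RBF)": OneClassSVM(kernel="rbf", nu=0.1, gamma="scale"),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=10, novelty=True),
    "IsolationForest": IsolationForest(random_state=0),
}
predictions = {}
for name, model in detectors.items():
    model.fit(X_human)                        # train on human essays only
    predictions[name] = model.predict(X_new)  # +1 = human-like, -1 = anomaly
```

Note that LocalOutlierFactor requires novelty=True to expose predict() for unseen instances; by default it only scores the training set.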
2.3. Research Questions
- RQ1: Is it possible to accurately detect human vs. AI-generated essays using one-class learning models, i.e., without exploitation of AI-generated essays for model training?
- RQ2: Do linguistic features allow one-class models to accurately classify human vs. AI-generated essays, and which ones are the most relevant for essay classification?
- RQ3: Are there substantial differences in detection accuracy in essays written in L2 English and L2 Spanish?
2.4. Datasets
- L2 English: the Uppsala Student English Corpus (USE) (https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2457 (accessed on 1 July 2023)), which consists of 1489 essays written by Swedish university students learning English. The majority of the essays belong to students in their first term of full-time studies. Essays found in the ’a2’ folder in the repository were used for two main reasons: the substantial amount of essays in that folder and the nature of the task. Learners wrote argumentative essays on diverse topics. While argumentative essays are linguistically more complex than introductory or personal experience texts, they are easier to generate using AI. The language proficiency of the students (A2) reflects the performance of a basic user of the language. The total number of human-generated essays in English was 335.
- L2 Spanish: To obtain L2 Spanish data, the UC Davis Corpus of Written Spanish, L2 and Heritage Speakers (COWSL2HS (https://github.com/ucdaviscl/cowsl2h (accessed on 1 July 2023))) was used. The corpus includes essays written by students in university-level Spanish courses. Essays written only by L2 learners were considered. Similarly to L2 English essays, the language proficiency of the learners was that of a basic user and topics were varied; 350 essays were compiled for analysis.
2.5. Setup and Metrics
3. Results
4. Discussion
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Sallam, M. ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare 2023, 11, 887. [Google Scholar] [CrossRef] [PubMed]
- Lund, B.D.; Wang, T. Chatting about ChatGPT: How may AI and GPT impact academia and libraries? Libr. Hi Tech News 2023, 40, 26–29. [Google Scholar] [CrossRef]
- King, M.R.; ChatGPT. A conversation on artificial intelligence, chatbots, and plagiarism in higher education. Cell. Mol. Bioeng. 2023, 16, 1–2. [Google Scholar] [CrossRef]
- Slaouti, D. The World Wide Web for academic purposes: Old study skills for new? Engl. Specif. Purp. 2002, 21, 105–124. [Google Scholar] [CrossRef]
- Stapleton, P. Writing in an electronic age: A case study of L2 composing processes. J. Engl. Acad. Purp. 2010, 9, 295–307. [Google Scholar] [CrossRef]
- Crothers, E.; Japkowicz, N.; Viktor, H. Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods. arXiv 2022, arXiv:2210.07321. [Google Scholar]
- Bostrom, N.; Yudkowsky, E. The ethics of artificial intelligence. In Artificial Intelligence Safety and Security; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018; pp. 57–69. [Google Scholar]
- Arbane, M.; Benlamri, R.; Brik, Y.; Alahmar, A.D. Social media-based COVID-19 sentiment classification model using Bi-LSTM. Expert Syst. Appl. 2023, 212, 118710. [Google Scholar] [CrossRef]
- Li, W.; Qi, F.; Tang, M.; Yu, Z. Bidirectional LSTM with self-attention mechanism and multi-channel features for sentiment classification. Neurocomputing 2020, 387, 63–77. [Google Scholar] [CrossRef]
- Kumari, R.; Ashok, N.; Ghosal, T.; Ekbal, A. A multitask learning approach for fake news detection: Novelty, emotion, and sentiment lend a helping hand. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
- Damasceno, L.P.; Shafer, A.; Japkowicz, N.; Cavalcante, C.C.; Boukouvalas, Z. Efficient Multivariate Data Fusion for Misinformation Detection During High Impact Events. In Proceedings of the Discovery Science: 25th International Conference, DS 2022, Montpellier, France, 10–12 October 2022; pp. 253–268. [Google Scholar]
- Jing, Q.; Yao, D.; Fan, X.; Wang, B.; Tan, H.; Bu, X.; Bi, J. TRANSFAKE: Multi-task Transformer for Multimodal Enhanced Fake News Detection. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
- Han, H.; Ke, Z.; Nie, X.; Dai, L.; Slamu, W. Multimodal Fusion with Dual-Attention Based on Textual Double-Embedding Networks for Rumor Detection. Appl. Sci. 2023, 13, 4886. [Google Scholar] [CrossRef]
- Prasad, N.; Saha, S.; Bhattacharyya, P. A Multimodal Classification of Noisy Hate Speech using Character Level Embedding and Attention. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
- Alghamdi, J.; Lin, Y.; Luo, S. Does Context Matter? Effective Deep Learning Approaches to Curb Fake News Dissemination on Social Media. Appl. Sci. 2023, 13, 3345. [Google Scholar] [CrossRef]
- Allouch, M.; Mansbach, N.; Azaria, A.; Azoulay, R. Utilizing Machine Learning for Detecting Harmful Situations by Audio and Text. Appl. Sci. 2023, 13, 3927. [Google Scholar] [CrossRef]
- Rubin, V.L.; Conroy, N.; Chen, Y.; Cornwell, S. Fake news or truth? Using satirical cues to detect potentially misleading news. In Proceedings of the Second Workshop on Computational Approaches to Deception Detection, San Diego, CA, USA, 17 June 2016; pp. 7–17. [Google Scholar]
- Feng, L.; Jansche, M.; Huenerfauth, M.; Elhadad, N. A comparison of features for automatic readability assessment. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China, 23–27 August 2010. [Google Scholar]
- Argamon-Engelson, S.; Koppel, M.; Avneri, G. Style-based text categorization: What newspaper am I reading. In Proceedings of the AAAI Workshop on Text Categorization, Madison, WI, USA, 26–27 July 1998; pp. 1–4. [Google Scholar]
- Koppel, M.; Argamon, S.; Shimoni, A.R. Automatically categorizing written texts by author gender. Lit. Linguist. Comput. 2002, 17, 401–412. [Google Scholar] [CrossRef] [Green Version]
- Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; Mihalcea, R. Automatic detection of fake news. arXiv 2017, arXiv:1708.07104. [Google Scholar]
- Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; Choi, Y. The Curious Case of Neural Text Degeneration. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Ippolito, D.; Duckworth, D.; Callison-Burch, C.; Eck, D. Automatic Detection of Generated Text is Easiest when Humans are Fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 1808–1822. [Google Scholar]
- Fröhling, L.; Zubiaga, A. Feature-based detection of automated language models: Tackling GPT-2, GPT-3 and Grover. Peerj Comput. Sci. 2021, 7, e443. [Google Scholar] [CrossRef] [PubMed]
- Gehrmann, S.; Harvard, S.; Strobelt, H.; Rush, A.M. GLTR: Statistical Detection and Visualization of Generated Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL 2019, Florence, Italy, 28 July–2 August 2019; p. 111. [Google Scholar]
- Crossley, S.A.; Allen, D.B.; McNamara, D.S. Text readability and intuitive simplification: A comparison of readability formulas. Read. Foreign Lang. 2011, 23, 84–101. [Google Scholar]
- Corizzo, R.; Leal-Arenas, S. A Deep Fusion Model for Human vs. Machine-Generated Essay Classification. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Broadbeach, Australia, 18–23 June 2023; pp. 1–8. [Google Scholar]
- Rewicki, F.; Denzler, J.; Niebling, J. Is It Worth It? Comparing Six Deep and Classical Methods for Unsupervised Anomaly Detection in Time Series. Appl. Sci. 2023, 13, 1778. [Google Scholar] [CrossRef]
- Ryan, S.; Corizzo, R.; Kiringa, I.; Japkowicz, N. Pattern and anomaly localization in complex and dynamic data. In Proceedings of the 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; pp. 1756–1763. [Google Scholar]
- Lian, Y.; Geng, Y.; Tian, T. Anomaly Detection Method for Multivariate Time Series Data of Oil and Gas Stations Based on Digital Twin and MTAD-GAN. Appl. Sci. 2023, 13, 1891. [Google Scholar] [CrossRef]
- Corizzo, R.; Ceci, M.; Pio, G.; Mignone, P.; Japkowicz, N. Spatially-aware autoencoders for detecting contextual anomalies in geo-distributed data. In Proceedings of the Discovery Science: 24th International Conference, DS 2021, Halifax, NS, Canada, 11–13 October 2021; Springer: Berlin, Germany, 2021; Volume 24, pp. 461–471. [Google Scholar]
- Herskind Sejr, J.; Christiansen, T.; Dvinge, N.; Hougesen, D.; Schneider-Kamp, P.; Zimek, A. Outlier detection with explanations on music streaming data: A case study with danmark music group ltd. Appl. Sci. 2021, 11, 2270. [Google Scholar] [CrossRef]
- Faber, K.; Corizzo, R.; Sniezynski, B.; Japkowicz, N. Active Lifelong Anomaly Detection with Experience Replay. In Proceedings of the 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA), Shenzhen, China, 13–16 October 2022; pp. 1–10. [Google Scholar]
- Kaufmann, J.; Asalone, K.; Corizzo, R.; Saldanha, C.; Bracht, J.; Japkowicz, N. One-class ensembles for rare genomic sequences identification. In Proceedings of the Discovery Science: 23rd International Conference, DS 2020, Thessaloniki, Greece, 19–21 October 2020; Springer: Berlin, Germany, 2020; Volume 23, pp. 340–354. [Google Scholar]
- Baly, R.; Karadzhov, G.; Alexandrov, D.; Glass, J.; Nakov, P. Predicting factuality of reporting and bias of news media sources. arXiv 2018, arXiv:1810.01765. [Google Scholar]
- Horne, B.D.; Adali, S. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. In Proceedings of the Eleventh International AAAI Conference on Web and Social Media, Montreal, QC, Canada, 15–18 May 2017. [Google Scholar]
- Hube, C.; Fetahu, B. Detecting biased statements in wikipedia. In Companion Proceedings of the Web Conference, Lyon, France, 23–27 April 2018; pp. 1779–1786. [Google Scholar]
- Moroney, C.; Crothers, E.; Mittal, S.; Joshi, A.; Adalı, T.; Mallinson, C.; Japkowicz, N.; Boukouvalas, Z. The case for latent variable vs deep learning methods in misinformation detection: An application to covid-19. In Proceedings of the Discovery Science: 24th International Conference, DS 2021, Halifax, NS, Canada, 11–13 October 2021; Springer: Berlin, Germany, 2021; Volume 24, pp. 422–432. [Google Scholar]
- Wang, W.; Yu, Y.; Sheng, J. Image retrieval by emotional semantics: A study of emotional space and feature extraction. In Proceedings of the 2006 IEEE International Conference on Systems, Man and Cybernetics, Taipei, Taiwan, 8–11 October 2006; Volume 4, pp. 3534–3539. [Google Scholar]
- Vosoughi, S.; Roy, D.; Aral, S. The spread of true and false news online. Science 2018, 359, 1146–1151. [Google Scholar] [CrossRef]
- Bonta, V.; Janardhan, N.K.N. A comprehensive study on lexicon based approaches for sentiment analysis. Asian J. Comput. Sci. Technol. 2019, 8, 1–6. [Google Scholar] [CrossRef]
- Voutilainen, A. Part-of-speech tagging. In The Oxford Handbook of Computational Linguistics; Oxford University Press: Oxford, UK, 2003; pp. 219–232. [Google Scholar]
- Schölkopf, B.; Williamson, R.C.; Smola, A.J.; Shawe-Taylor, J.; Platt, J.C. Support vector method for novelty detection. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; pp. 582–588. [Google Scholar]
- Breunig, M.M.; Kriegel, H.P.; Ng, R.T.; Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 15–18 May 2000; pp. 93–104. [Google Scholar]
- Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar]
- Kriegel, H.; Schubert, M.; Zimek, A. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 444–452. [Google Scholar]
- Pham, N.; Pagh, R. A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China, 12–16 August 2012; pp. 877–885. [Google Scholar]
- Goldstein, M.; Dengel, A. Histogram-based Outlier Score (HBOS): A fast unsupervised anomaly detection algorithm. In Proceedings of the KI-2012: Poster and Demo Track, 35th German Conference on Artificial Intelligence, Saarbrücken, Germany, 24–27 September 2012; pp. 59–63. [Google Scholar]
- Choudhary, A.; Arora, A. Linguistic feature based learning model for fake news detection and classification. Expert Systems with Applications 2021, 169, 114171. [Google Scholar] [CrossRef]
- Zhu, T. From Textual Experiments to Experimental Texts: Expressive Repetition in “Artificial Intelligence Literature”. arXiv 2022, arXiv:2201.02303. [Google Scholar]
- Selinker, L. Language transfer. Gen. Linguist. 1969, 9, 67. [Google Scholar]
- Haspelmath, M.; Michaelis, S.M. Analytic and synthetic. In Proceedings of the Language Variation-European Perspectives VI: Selected Papers from the Eighth International Conference on Language Variation in Europe (ICLaVE 8), Leipzig, Germany, 27–29 May 2017; pp. 3–22. [Google Scholar]
- Filippova, K. Multi-sentence compression: Finding shortest paths in word graphs. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, China, 23–27 August 2010; pp. 322–330. [Google Scholar]
| Type | Feature | Example |
|---|---|---|
| Text | Distinct types of punctuation marks | … |
| | Number of Oxford commas | “…Christmas, Easter, and spring break…” |
| | Number of paragraphs | / |
| | Number of full stops | / |
| | Number of commas | / |
| | Average sentence length | / |
| Repetitiveness | Unigram Overlap | only … only … |
| | Bi-gram Overlap | only you … only you … |
| | Tri-gram Overlap | only you can … only you can … |
| | Matches in the 5K most common words | from, your, have … |
| | Matches in the 10K most common words | scared, infringement, bent … |
| Emotional Semantics | Polarity | |
| | Subjectivity | |
| | Sentiment (ES) | |
| | Sentiment (Multi-language) | |
| | Sentiment Score (Multi-language) | |
| Readability | Flesch Reading Ease Score | (0, 100) |
| | Flesch Kincaid Grade Score | (−3.40, no limit) |
| | Smog Index Score | (1, 240) |
| | Coleman Liau Index Score | (1, 11+) |
| | Automated Readability Index | (1, 14) |
| | Dale–Chall Readability Score | (0, 10) |
| | Difficult Words Score | Varies |
| | Linsear Write Formula Score | (0, 11+) |
| | Gunning Fog Score | (6, 17) |
| | Fernández-Huerta Score (SPAN) | (0, 100) |
| | Szigriszt-Pazos Score (SPAN) | (0, 100) |
| | Gutiérrez de Polini Score (SPAN) | (0, 100) |
| | Crawford Score (SPAN) | (6, 17) |
| Part-of-Speech (POS) | ADJ (Adjective) * | current, long |
| | ADP (Adposition) * | during the week, right there |
| | ADV (Adverb) * | hopefully, differently |
| | AUX (Auxiliary) * | “…as you have seen…” |
| | CCONJ (Coordinating Conjunction) * | and, but, or |
| | DET (Determiner) * | the, a, an |
| | NOUN (Noun) * | teachers, adoption |
| | NUM (Numeral) * | two, hundreds |
| | PRON (Pronoun) * | I, you, him, hers |
| | PROPN (Proper Noun) * | Sweden, United States |
| | PUNCT (Punctuation) * | . , |
| | SCONJ (Subordinating Conjunction) * | although, because, whereas |
| | SYM (Symbol) * | symbol that is not punctuation, e.g., $ |
| | VERB (Verb) * | mirror, use, have |
| | SPACE (Number of spaces) * | count of the number of spaces |
Topic: Capital punishment | |
Human | Generated |
---|---|
…The right to life is the most fundamental part of the Human Rights Agenda. To claim the right to life for innocent people is not very difficult. It goes without saying. It is considerably harder to defend people with horrible crimes on their consciences; mass-murderers, war criminals, paedophiles etc. But that is how universal rights works. They applies to everyone, everywhere and during all circumstances… | …One of the main reasons why capital punishment is not good is that it is not an effective deterrent to crime. Studies have shown that the death penalty does not have a greater deterrent effect on crime than other forms of punishment, such as life imprisonment. Additionally, the death penalty can also have a negative impact on the crime rate, as it can lead to a lack of public trust in the justice system, which can in turn lead to a rise in crime… |
Topic: Una persona famosa | |
Human | Generated |
---|---|
Hay un artista y se llama Kanye West. El tiene muchos dinero. El tiene cuarenta y cuartro anos de edad. El es de estados unicos y es americano. El tiene dos ojos grandes. El lleva una camiseta blanca y pantalones negro o azul. El lleva zapatos blancos. Tiene una esposa y llamada Kim Kardashian West. El tiene una hija y un hijo. Sus nombres son North West y Saint West. El escribe musica El trabaja con artistas de musica … | Una persona famosa que me viene a la mente es Albert Einstein. Einstein es conocido en todo el mundo como uno de los científicos más importantes y brillantes de la historia. Nacido en Alemania en 1879, Einstein comenzó a mostrar una gran aptitud para las matemáticas y la física desde temprana edad. A los 16 años, se matriculó en la Universidad Politécnica de Zurich, donde obtuvo su título en 1900 … |
Setting = Text | Precision | Recall | F1-Score |
---|---|---|---|
OneClassSVM (RBF kernel) | 0.7607 | 0.6915 | 0.6694 |
OneClassSVM (Polynomial kernel) | 0.3092 | 0.3294 | 0.3110 |
OneClassSVM (Linear kernel) | 0.3092 | 0.3294 | 0.3110 |
LocalOutlierFactor (3 neighbors) | 0.5169 | 0.5052 | 0.4084 |
LocalOutlierFactor (5 neighbors) | 0.4817 | 0.4948 | 0.3783 |
LocalOutlierFactor (10 neighbors) | 0.4395 | 0.4903 | 0.3527 |
IsolationForest | 0.6465 | 0.6095 | 0.5836 |
ABOD | 0.3312 | 0.4933 | 0.3324 |
HBOS | 0.5978 | 0.5067 | 0.3587 |
AutoEncoder | 0.4297 | 0.4873 | 0.3534 |
Setting = Repetitiveness | Precision | Recall | F1-Score |
---|---|---|---|
OneClassSVM (RBF kernel) | 0.6141 | 0.6095 | 0.6054 |
OneClassSVM (Polynomial kernel) | 0.5082 | 0.5082 | 0.5082 |
OneClassSVM (Linear kernel) | 0.5067 | 0.5067 | 0.5067 |
LocalOutlierFactor (3 neighbors) | 0.6764 | 0.6051 | 0.5611 |
LocalOutlierFactor (5 neighbors) | 0.7163 | 0.6036 | 0.5446 |
LocalOutlierFactor (10 neighbors) | 0.7113 | 0.6021 | 0.5435 |
IsolationForest | 0.9473 | 0.9419 | 0.9417 |
ABOD | 0.7268 | 0.5618 | 0.4651 |
HBOS | 0.9105 | 0.9031 | 0.9027 |
AutoEncoder | 0.9563 | 0.9553 | 0.9553 |
Setting = Emotional Semantics | Precision | Recall | F1-Score |
---|---|---|---|
OneClassSVM (RBF kernel) | 0.6699 | 0.6528 | 0.6436 |
OneClassSVM (Polynomial kernel) | 0.4918 | 0.4918 | 0.4918 |
OneClassSVM (Linear kernel) | 0.5291 | 0.5291 | 0.5286 |
LocalOutlierFactor (3 neighbors) | 0.6085 | 0.5648 | 0.5168 |
LocalOutlierFactor (5 neighbors) | 0.6065 | 0.5574 | 0.5005 |
LocalOutlierFactor (10 neighbors) | 0.6214 | 0.5708 | 0.5215 |
IsolationForest | 0.6516 | 0.6110 | 0.5835 |
ABOD | 0.6968 | 0.5350 | 0.4155 |
HBOS | 0.5674 | 0.5067 | 0.3657 |
AutoEncoder | 0.6606 | 0.5768 | 0.5138 |

Setting = Readability | Precision | Recall | F1-Score |
---|---|---|---|
OneClassSVM (RBF kernel) | 0.7526 | 0.5112 | 0.3569 |
OneClassSVM (Polynomial kernel) | 0.8318 | 0.7466 | 0.7292 |
OneClassSVM (Linear kernel) | 0.8324 | 0.7481 | 0.7310 |
LocalOutlierFactor (3 neighbors) | 0.8166 | 0.7675 | 0.7582 |
LocalOutlierFactor (5 neighbors) | 0.8408 | 0.8048 | 0.7995 |
LocalOutlierFactor (10 neighbors) | 0.8563 | 0.8271 | 0.8236 |
IsolationForest | 0.8943 | 0.8942 | 0.8942 |
ABOD | 0.7581 | 0.6006 | 0.5291 |
HBOS | 0.8072 | 0.7273 | 0.7085 |
AutoEncoder | 0.8525 | 0.8376 | 0.8358 |

Setting = POS | Precision | Recall | F1-Score |
---|---|---|---|
OneClassSVM (RBF kernel) | 0.8154 | 0.7392 | 0.7223 |
OneClassSVM (Polynomial kernel) | 0.6065 | 0.5618 | 0.5098 |
OneClassSVM (Linear kernel) | 0.5835 | 0.5812 | 0.5781 |
LocalOutlierFactor (3 neighbors) | 0.7457 | 0.6080 | 0.5446 |
LocalOutlierFactor (5 neighbors) | 0.7513 | 0.5991 | 0.5280 |
LocalOutlierFactor (10 neighbors) | 0.7501 | 0.5708 | 0.4776 |
IsolationForest | 0.6678 | 0.5887 | 0.5342 |
ABOD | 0.7326 | 0.5693 | 0.4782 |
HBOS | 0.5932 | 0.5127 | 0.3800 |
AutoEncoder | 0.7014 | 0.6289 | 0.5925 |
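As a point of reference, the scikit-learn detectors named in the tables could be instantiated as below; ABOD, HBOS, and the AutoEncoder come from the PyOD library and are omitted here. Hyperparameters other than those named in the tables are assumed to be library defaults.

```python
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

# One-class detectors from the result tables; each is trained on
# human-written essays only and flags AI-generated essays as outliers.
# novelty=True lets LocalOutlierFactor score unseen data via predict().
detectors = {
    "OneClassSVM (RBF kernel)": OneClassSVM(kernel="rbf"),
    "OneClassSVM (Polynomial kernel)": OneClassSVM(kernel="poly"),
    "OneClassSVM (Linear kernel)": OneClassSVM(kernel="linear"),
    "LocalOutlierFactor (3 neighbors)": LocalOutlierFactor(n_neighbors=3, novelty=True),
    "LocalOutlierFactor (5 neighbors)": LocalOutlierFactor(n_neighbors=5, novelty=True),
    "LocalOutlierFactor (10 neighbors)": LocalOutlierFactor(n_neighbors=10, novelty=True),
    "IsolationForest": IsolationForest(random_state=0),
}
```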
Setting = Text | Precision | Recall | F1-Score |
---|---|---|---|
OneClassSVM (RBF kernel) | 0.6190 | 0.6043 | 0.5917 |
OneClassSVM (Polynomial kernel) | 0.3102 | 0.3329 | 0.3123 |
OneClassSVM (Linear kernel) | 0.3102 | 0.3329 | 0.3123 |
LocalOutlierFactor (3 neighbors) | 0.3641 | 0.4600 | 0.3443 |
LocalOutlierFactor (5 neighbors) | 0.3409 | 0.4657 | 0.3354 |
LocalOutlierFactor (10 neighbors) | 0.2582 | 0.4629 | 0.3187 |
IsolationForest | 0.2511 | 0.4514 | 0.3132 |
ABOD | 0.2489 | 0.4957 | 0.3314 |
HBOS | 0.2475 | 0.4900 | 0.3289 |
AutoEncoder | 0.2426 | 0.4714 | 0.3204 |

Setting = Repetitiveness | Precision | Recall | F1-Score |
---|---|---|---|
OneClassSVM (RBF kernel) | 0.5200 | 0.5200 | 0.5199 |
OneClassSVM (Polynomial kernel) | 0.3988 | 0.4029 | 0.3968 |
OneClassSVM (Linear kernel) | 0.3972 | 0.4014 | 0.3952 |
LocalOutlierFactor (3 neighbors) | 0.5369 | 0.5171 | 0.4426 |
LocalOutlierFactor (5 neighbors) | 0.5697 | 0.5257 | 0.4369 |
LocalOutlierFactor (10 neighbors) | 0.5752 | 0.5229 | 0.4223 |
IsolationForest | 0.3971 | 0.4886 | 0.3424 |
ABOD | 0.2482 | 0.4929 | 0.3301 |
HBOS | 0.2486 | 0.4943 | 0.3308 |
AutoEncoder | 0.4578 | 0.4929 | 0.3599 |

Setting = Emotional Semantics | Precision | Recall | F1-Score |
---|---|---|---|
OneClassSVM (RBF kernel) | 0.4898 | 0.4900 | 0.4881 |
OneClassSVM (Polynomial kernel) | 0.3568 | 0.3614 | 0.3562 |
OneClassSVM (Linear kernel) | 0.3828 | 0.3843 | 0.3824 |
LocalOutlierFactor (3 neighbors) | 0.5305 | 0.5186 | 0.4663 |
LocalOutlierFactor (5 neighbors) | 0.5177 | 0.5100 | 0.4501 |
LocalOutlierFactor (10 neighbors) | 0.5387 | 0.5229 | 0.4684 |
IsolationForest | 0.4859 | 0.4900 | 0.4499 |
ABOD | 0.5947 | 0.5100 | 0.3689 |
HBOS | 0.4830 | 0.4986 | 0.3496 |
AutoEncoder | 0.5561 | 0.5157 | 0.4094 |

Setting = Readability | Precision | Recall | F1-Score |
---|---|---|---|
OneClassSVM (RBF kernel) | 0.7514 | 0.5057 | 0.3459 |
OneClassSVM (Polynomial kernel) | 0.8218 | 0.7357 | 0.7168 |
OneClassSVM (Linear kernel) | 0.8244 | 0.7414 | 0.7238 |
LocalOutlierFactor (3 neighbors) | 0.7134 | 0.6329 | 0.5946 |
LocalOutlierFactor (5 neighbors) | 0.7055 | 0.6129 | 0.5637 |
LocalOutlierFactor (10 neighbors) | 0.6382 | 0.5671 | 0.5033 |
IsolationForest | 0.6904 | 0.6186 | 0.5788 |
ABOD | 0.2486 | 0.4943 | 0.3308 |
HBOS | 0.2482 | 0.4929 | 0.3301 |
AutoEncoder | 0.4845 | 0.4971 | 0.3684 |

Setting = POS | Precision | Recall | F1-Score |
---|---|---|---|
OneClassSVM (RBF kernel) | 0.4971 | 0.4971 | 0.4968 |
OneClassSVM (Polynomial kernel) | 0.4100 | 0.4100 | 0.4100 |
OneClassSVM (Linear kernel) | 0.3787 | 0.3843 | 0.3771 |
LocalOutlierFactor (3 neighbors) | 0.5200 | 0.5029 | 0.3673 |
LocalOutlierFactor (5 neighbors) | 0.5216 | 0.5029 | 0.3652 |
LocalOutlierFactor (10 neighbors) | 0.5311 | 0.5043 | 0.3681 |
IsolationForest | 0.2701 | 0.4757 | 0.3247 |
ABOD | 0.2478 | 0.4914 | 0.3295 |
HBOS | 0.2482 | 0.4929 | 0.3301 |
AutoEncoder | 0.3171 | 0.4671 | 0.3297 |
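The one-class evaluation protocol behind these tables can be illustrated end-to-end with one of the listed detectors. The feature vectors below are synthetic stand-ins for the extracted essay features, not the paper's data; the point is the protocol: fit on human essays only, then score a mixed test set and report macro-averaged precision, recall, and F1-score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
# Synthetic stand-ins for 6-dimensional essay feature vectors: human
# essays cluster around one region, AI-generated essays drift away.
X_human_train = rng.normal(0.0, 1.0, size=(200, 6))
X_human_test = rng.normal(0.0, 1.0, size=(50, 6))
X_ai_test = rng.normal(3.0, 1.0, size=(50, 6))

# One-class training: the model sees only human-written essays.
model = IsolationForest(random_state=0).fit(X_human_train)

X_test = np.vstack([X_human_test, X_ai_test])
y_true = np.array([1] * 50 + [-1] * 50)   # 1 = human (inlier), -1 = AI (outlier)
y_pred = model.predict(X_test)            # IsolationForest also returns 1 / -1

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Precision={prec:.4f} Recall={rec:.4f} F1-Score={f1:.4f}")
```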
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Corizzo, R.; Leal-Arenas, S. One-Class Learning for AI-Generated Essay Detection. Appl. Sci. 2023, 13, 7901. https://doi.org/10.3390/app13137901