Comparison of Grammar Characteristics of Human-Written Corpora and Machine-Generated Texts Using a Novel Rule-Based Parser

Strübbe, Simon; Sidorenko, Irina; Lampe, Renée

doi:10.3390/info16040274


by Simon Strübbe ¹, Irina Sidorenko ¹ and Renée Lampe ¹,²,*

¹ Research Unit of the Buhl-Strohmaier Foundation for Cerebral Palsy and Paediatric Neuroorthopaedics, Department of Orthopaedics and Sports Orthopaedics, Klinikum Rechts der Isar, TUM School of Medicine and Health, Technical University of Munich, Ismaningerstr 22, 81675 Munich, Germany
² Markus Würth Professorship, Technical University of Munich, 81675 Munich, Germany
* Author to whom correspondence should be addressed.

Information 2025, 16(4), 274; https://doi.org/10.3390/info16040274

Submission received: 3 March 2025 / Revised: 26 March 2025 / Accepted: 27 March 2025 / Published: 28 March 2025

(This article belongs to the Special Issue The Role of Artificial Intelligence for Diversity, Equity, and Inclusion)


Abstract

As the prevalence of machine-written texts grows, it has become increasingly important to distinguish between human- and machine-generated content, especially when such texts are not explicitly labeled. Current artificial intelligence (AI) detection methods primarily focus on human-like characteristics, such as emotionality and subjectivity. However, these features can be easily modified through AI humanization, which involves altering word choice. In contrast, altering the underlying grammar without affecting the conveyed information is considerably more challenging. Thus, the grammatical characteristics of a text can be used as additional indicators of its origin. To address this, we employ a newly developed rule-based parser to analyze the grammatical structures in human- and machine-written texts. Our findings reveal systematic grammatical differences between human- and machine-written texts, providing a reliable criterion for the determination of the text origin. We further examine the stability of this criterion in the context of AI humanization and translation to other languages.

Keywords: AI detection; AI humanization; natural language grammar; rule-based parser

1. Introduction

The rapid advancement of artificial intelligence (AI) in the field of natural language processing (NLP) has resulted in the widespread adoption of AI-generated text across diverse domains, including journalism, academia, social media, and content creation. Models such as ChatGPT [1] exhibit remarkable fluency and coherence; however, their increasing prevalence has raised serious concerns regarding misinformation, academic integrity, and the authenticity of digital communication. Consequently, the development of reliable methods for detecting AI-generated text has emerged as a pressing challenge for researchers, educators, and political figures.

In response to these challenges, various AI text detection tools [2,3,4] have been created to distinguish between human- and machine-generated content. These detectors typically focus on analyzing lexical choices, emotional tone, and subjectivity within texts, often neglecting grammatical structure. To counter these detectors, AI humanizers [5] have been developed, which modify texts by incorporating more emotional language and subjective elements, while usually keeping grammatical structures unchanged.

In earlier studies [6,7], we introduced a novel rule-based parsing approach for the German language, which emphasizes the grammatical characteristics of a text. Unlike conventional parsers, such as dependency parsers [8] that rely on statistical models and neural networks, our parser provides more informative grammatical insights. As demonstrated in a previous study [7], our parser not only identifies word classes but also delineates the constituents of main and subordinate clauses and determines the cases of nouns. This enables a comprehensive grammatical analysis, facilitating comparisons between texts from different corpora based on their grammatical components.

In this study, we employed our rule-based parser to extract grammatical features and compare them in human-written and machine-generated texts. These features are inherently syntactic and cannot be effectively identified by less informative parsers, such as dependency parsers. We investigated whether these features of a text remain detectable after an AI humanization process or computer-assisted translation into other languages.

The structure of the paper is organized as follows: In Section 2, we provide an overview of syntactic dependency networks [9], which are similar to our approach. Section 3 introduces our proposed methodology, detailing the process for determining word-to-word relationships. In Section 4, we present the grammatical comparison of human- and machine-written corpora, which is accompanied by numerical results. A detailed discussion of our findings is provided in Section 5, followed by the conclusions in Section 6.

2. Related Work

The rule-based parser introduced earlier [6,7] is designed to extract syntactic and grammatical relationships between words in arbitrary sentences. To achieve this, the parser constructs a highly structured parsing graph. Traditional parsing approaches such as dependency parsing [8] or constituent parsing [10] follow the Robinson Axioms [11] and create tree structures. The Robinson Axioms define the structural properties of such parse trees as follows: (1) these trees possess a single root, typically corresponding to the verb of the main clause; (2) all other words in the sentence are hierarchically dependent on either this root or another word within the tree; (3) each word can have only one direct dependency, ensuring that the tree remains acyclic.

In contrast, our method employs a capsular graph structure and violates these Axioms. For instance, our graphs can have many roots (one for each clause), as described in an earlier paper [6]. This capsular structure enables the extraction of grammatical relationships between every word pair in a sentence, whereas tree structures in dependency and constituent parsing typically capture relationships only between directly dependent words.

Natural language inherently contains ambiguities, which pose challenges for the syntactic analysis. As demonstrated in a previous publication [7], common dependency or constituent parsers do not fully address these ambiguities. The parsing solutions generated by these tools rely on statistical methods and neural networks, often providing the most statistically probable interpretation of an ambiguity, which is not necessarily the correct one.

In Section 3, we detail how the rule-based parser constructs a comprehensive network of word-to-word relationships and stores these relationships in a database. While these word-to-word relationships share similarities with syntactic dependency networks produced by dependency parsers, they differ in several key aspects. For context, we briefly describe syntactic dependency networks and their usage for corpus linguistics. Additionally, we describe the features of texts commonly used to detect AI-generated texts.

2.1. Syntactic Dependency Networks

A syntactic dependency network [9] represents the syntactic and grammatical relationships between the words within the sentences of a comprehensive corpus. Each sentence in the corpus is analyzed using a parser, typically a dependency parser [8], which extracts the grammatical relationships between words based on the resulting parse structure. The parse structure for dependency parsing takes the form of a tree constructed in accordance with the Robinson Axioms. For example, Figure 1 illustrates the dependency parse for the following sentence:

“The cat chased the mouse”.

In this parse tree, the verb of the main clause, “chased”, serves as the root, with all other words either directly or indirectly dependent on it. Specifically, the nouns “cat” and “mouse” directly depend on the root, while the determiners “the” depend on their respective nouns. The dependencies are labeled with tags such as “nsubj” (nominal subject), “dobj” (direct object), and “det” (determiner) to indicate the nature of the grammatical relationship between the words. Excluding determiners, which are high-frequency words with limited semantic value, two primary relationships can be extracted from this sentence:

chased - nsubj -> cat;
chased - dobj -> mouse.

Here, “nsubj” refers to “nominal subject”, indicating that “cat” is the subject of the verb “chased”, while “dobj” denotes “the direct object”, indicating that “mouse” is the direct object of the verb. These two relationships, along with similar relationships from other sentences in the corpus, are stored as part of the syntactic dependency network.

In most cases, the same relationship appears multiple times throughout the corpus. Two primary approaches can be employed to address this repetition. First, the relationships can be treated as binary, meaning they are included in the syntactic dependency network if they occur at least once in the corpus. Alternatively, the relationships can be assigned a weight, where the weight reflects the frequency of occurrence of that relationship within the corpus.
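The two storage options can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular dependency parser: the triple format `(head, label, dependent)`, the function name `build_network`, and the second example sentence are assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical parser output: one list of (head, label, dependent) triples
# per sentence. The second sentence is invented for illustration.
parsed_sentences = [
    [("chased", "nsubj", "cat"), ("chased", "dobj", "mouse")],
    [("chased", "nsubj", "dog"), ("chased", "dobj", "mouse")],
]


def build_network(sentences):
    """Return the binary and the weighted variant of the network."""
    # Weighted variant: every relationship carries its corpus frequency.
    weighted = Counter(triple for sentence in sentences for triple in sentence)
    # Binary variant: a relationship is present if it occurs at least once.
    binary = set(weighted)
    return binary, weighted


binary_edges, weighted_edges = build_network(parsed_sentences)
print(weighted_edges[("chased", "dobj", "mouse")])  # 2
```

The binary variant discards frequency information, whereas the weighted variant keeps it and can be reduced to the binary one at any time.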

Typically, only direct dependencies between words are stored in the database. For instance, in the example sentence above, only the direct dependencies between the verb “chased” and the nouns “cat” and “mouse” are recognized. No relationship is established between the nouns “cat” and “mouse”, even though they are both nouns within the same sentence. However, it is also possible, though less common, to establish indirect relationships between words that do not directly depend on one another. This can be achieved by determining the shortest path between the two words in the parse tree and storing a representation of this path in the database.
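As a rough sketch of the indirect case, the path between two words can be found with a breadth-first search over the parse tree. The tree encoding (`child -> (head, label)`), the indexed determiner names, and the function name `shortest_path` are assumptions of this example and do not reflect a specific parser's data format.

```python
from collections import deque

# Hypothetical encoding of the parse tree for "The cat chased the mouse":
# each child maps to (head, label). Determiners are indexed to stay distinct.
tree = {
    "cat": ("chased", "nsubj"),
    "mouse": ("chased", "dobj"),
    "the_1": ("cat", "det"),
    "the_2": ("mouse", "det"),
}


def shortest_path(tree, start, goal):
    """Breadth-first search over the tree, treated as an undirected graph."""
    adjacency = {}
    for child, (head, _label) in tree.items():
        adjacency.setdefault(child, []).append(head)
        adjacency.setdefault(head, []).append(child)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in adjacency.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the two words are not connected


print(shortest_path(tree, "cat", "mouse"))  # ['cat', 'chased', 'mouse']
```

Because a tree contains exactly one path between any two nodes, the search result is unique; the path through "chased" is what would be stored as the indirect relationship between "cat" and "mouse".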

2.2. AI Detection Methods

Common AI detection mechanisms, including our own, extract a feature vector from the text under analysis and compare it to the feature vectors of known human-written and AI-generated texts.

Opara and Chidimma (2024) [12] identified 31 stylometric features, including lexical attributes such as unique word count, as well as sentiment and subjectivity markers, such as the frequency of emotional words. Those features are relatively easy to determine without a complex syntactic analysis by counting specific words or by recognizing specific patterns within the text.

Similarly, Alamleh et al. (2023) [13] employed the Term Frequency–Inverse Document Frequency (TF-IDF) [14] score of each word as a feature vector. The TF-IDF weights the frequency of a word in a document (TF) by the inverse of the fraction of documents that contain this word (IDF), which is usually taken on a logarithmic scale. The dimension of this feature vector is equal to the number of words in the vocabulary, where each entry of this feature vector represents the TF-IDF of the corresponding word.
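For orientation, one common textbook formulation of TF-IDF is sketched below. It is not necessarily the exact variant used by Alamleh et al.; the function name `tf_idf`, the length-normalized term frequency, and the unsmoothed logarithmic IDF are assumptions of this sketch.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF scores for a list of tokenized documents.

    Returns one {word: score} dictionary per document, using
    tf = count / document length and idf = log(n_docs / doc_freq).
    """
    n_docs = len(documents)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


docs = [["the", "cat"], ["the", "dog"]]
scores = tf_idf(docs)
print(scores[0]["the"])  # 0.0, because "the" occurs in every document
```

A word that appears in every document receives a score of zero, which is why such features emphasize vocabulary that is specific to individual texts.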

A different approach by Guo et al. (2024) [15] used a Bidirectional Encoder Representations from Transformers (BERT) [16] model to determine an embedding vector for every text. The embedding vector of a text is compared to the embedding vectors of known human-written and machine-generated texts. In the next step, the BERT neural network is fine-tuned such that the embedding vectors of human-written texts resemble each other but differ from the embedding vectors of machine-generated texts.

None of the three mentioned methods incorporates a comprehensive syntactic analysis of the texts, as we do with our parser. In this way, our parser extracts grammatical features of a text, which are otherwise difficult to determine. These features alone or in combination with the other mentioned features can then be used to determine the origin of texts.

2.3. Corpus Linguistics

Corpus linguistic analysis examines linguistic patterns at the corpus level. These patterns can be leveraged to cluster sentences within a corpus or to compare different corpora. Syntactic dependency networks are a valuable tool for both purposes. In this study, we focus on comparing corpora based on their grammatical features. This type of comparison typically involves the following steps:

1. Parsing all corpora to be compared using a natural language parser, such as a dependency parser.
2. Extracting sub-graphs from the resulting parse trees.
3. Counting the frequencies of the identified sub-graphs in each corpus.
4. Comparing these frequencies across the corpora.

Commonly, a dependency parser is used to parse the corpora under consideration (step 1). Then, sub-graphs are identified within the resulting parse trees (step 2). A sub-graph represents a specific grammatical relationship, such as nsubj → verb → obj, which captures a relationship between a sentence’s subject and its object. The frequency of these sub-graphs is then counted within each corpus (step 3) and compared across the different corpora (step 4).
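Steps 3 and 4 can be illustrated with a short Python sketch. The sub-graph labels and the two toy corpora are invented for this example, and the function name `relative_frequencies` is an assumption; steps 1 and 2 (parsing and sub-graph extraction) are taken as already done.

```python
from collections import Counter


def relative_frequencies(patterns):
    """Step 3: count sub-graph patterns and normalize by their total number."""
    counts = Counter(patterns)
    total = sum(counts.values())
    return {pattern: count / total for pattern, count in counts.items()}


# Hypothetical sub-graph labels, as they might result from steps 1 and 2.
corpus_a = ["nsubj-verb-obj", "nsubj-verb-obj", "aux-verb"]
corpus_b = ["nsubj-verb-obj", "aux-verb", "aux-verb", "aux-verb"]

freq_a = relative_frequencies(corpus_a)
freq_b = relative_frequencies(corpus_b)

# Step 4: compare the relative frequencies pattern by pattern.
differences = {
    pattern: freq_a.get(pattern, 0.0) - freq_b.get(pattern, 0.0)
    for pattern in set(freq_a) | set(freq_b)
}
print(freq_b["aux-verb"])  # 0.75
```

Normalizing by the total count makes corpora of different sizes comparable, which matters when the corpora under comparison differ in the number of sentences.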

A relevant example of this methodology is provided by Muñoz-Ortiz et al. (2024) [17]. Their study demonstrated that LLM (large language model)-generated corpora exhibit a higher frequency of auxiliary verbs compared to human-written corpora, while human-written corpora contain more adjective modifiers than their LLM-generated counterparts.

In our method, however, the use of sub-graphs is not required to follow the four outlined steps. Instead, our rule-based parser provides a grammatical relationship between every word pair in a sentence, capturing long-term dependencies that correspond to connections spanning multiple levels in a parse tree.

3. Methods

In two previous publications [6,7], we introduced a novel rule-based method for natural language parsing, which we tailored to the German language. This parser operates solely through deterministic rules, deliberately avoiding statistical methods during the parsing process.

In this study, we utilized our parser to analyze datasets comprising hundreds or thousands of sentences to gain insights into the grammatical characteristics of a given text or corpus. Although our parser operates exclusively on a rule-based approach, statistical data can be extracted from the parsing results of sentence collections. In this paper, we leverage this statistical data to determine the origin of a text or corpus. To ensure that statistical analysis provides reliable results, the corpus must contain a sufficiently large number of sentences (see Section 4.3), usually several hundred sentences. However, increasing the number of sentences beyond this threshold does not enhance the detection accuracy, as the statistical data reach saturation at approximately three hundred sentences.

Statistical data are generated by examining correlations among words within the parsed sentences. After parsing each sentence, the program constructs pairs of distinct words from the sentence and identifies a grammatical relationship for each pair, which is referred to as a “grammatical bridge”. For a sentence containing N words, this approach yields

\frac{N^{2} - N}{2}

grammatical bridges (including punctuation), all of which are stored in a database for subsequent analysis.
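The pair count can be checked with a few lines of Python. This sketch only enumerates the word pairs; it does not construct the bridges themselves. The example sentence and the function name `word_pairs` are assumptions made for illustration.

```python
from itertools import combinations


def word_pairs(tokens):
    """Return all unordered pairs of token positions in a sentence.

    Positions are used instead of the tokens themselves so that repeated
    words (such as "Die" and "die") remain distinct.
    """
    return list(combinations(range(len(tokens)), 2))


# Illustrative sentence with N = 6 tokens, counting the final period.
tokens = ["Die", "Katze", "jagte", "die", "Maus", "."]
pairs = word_pairs(tokens)
n = len(tokens)
print(len(pairs), (n * n - n) // 2)  # 15 15
```

The quadratic growth explains why even a few hundred parsed sentences yield several thousand bridges.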

Unlike traditional dependency parsers that generate parse trees, our parser operates using “bubbles” (containers for various linguistic elements, such as words, phrases, or clauses), as detailed in two previous publications [6,7]. According to our framework, each linguistic element occurs in exactly one superordinate bubble, enabling the use of this nested bubble structure to identify keywords for the unique grammatical bridges.

A grammatical bridge is a string that links two words within a sentence. These connected words serve as the endpoints of the string, while intermediate keywords are derived from the nested bubble structure of the sentence. The connection is established by identifying the shortest path between the two words, which traverses several bubbles. The construction of the grammatical bridges is described in detail in Appendix A.

The grammatical bridges identified within a large corpus of sentences constitute a database that bears similarities to a syntactic dependency network produced by a dependency parser (see Section 2.1). As outlined in the previous publication [7], our rule-based parser provides grammatically richer information regarding words and their grammatical relationships. For instance, a dependency parser does not determine the grammatical cases of nouns within a sentence. Furthermore, we demonstrated in the previous publication [7] that while neural parsers resolve ambiguities by selecting the most likely solution, our parser disambiguates these cases explicitly, offering a more precise interpretation.

A notable distinction is that grammatical bridges link every word pair within a sentence, whereas syntactic dependency networks only connect words that are directly dependent on one another. In principle, it is possible to find the shortest path between two words within a parse tree generated by a dependency parser, as is done for grammatical bridges. However, to the best of the authors’ knowledge, none of the existing methods adequately captures such longer implicit dependencies spanning multiple words.

3.1. Source of the Used Corpora

In this study, we analyzed corpora of two distinct types: human- and machine-written corpora. Eight of the human-written corpora were sourced from the Wortschatz Leipzig [18], a database that collects millions of sentences in German and other languages. These corpora vary in size, with sentence packages containing 10,000, 30,000, 100,000, 300,000, or 1,000,000 sentences. We utilized the packages of 10,000 sentences for our analysis because the resulting statistics stabilize after a few hundred parsed sentences. Further increasing the number of analyzed sentences no longer changes the predominant grammatical statistics in the corpus. Three of the eight corpora are written in a foreign language (English or French) to check whether the grammar analysis can be extended to other languages (see Section 3.4).

The sentences in the Wortschatz Leipzig corpora were presented in an arbitrary order and were not linked to one another, although they were derived from continuous text. For instance, in the case of a corpus assembled from Wikipedia, entire articles were extracted and segmented into individual sentences through sentence tokenization. Although sentence tokenization is generally effective, it is not flawless, as the periods of some abbreviations are mistakenly interpreted as sentence boundaries, resulting in incomplete sentences that are discarded later by the parser. According to the Wortschatz Leipzig’s documentation, the collection includes any sentence from an article that can be interpreted as a natural language sentence, with the exception of those containing excessive special characters, such as formulas or cryptic expressions. If such sentences remain in the corpus, they are automatically filtered out by our parser (see Section 3.2).

We incorporated an additional human-written corpus obtained directly from Wikipedia via the internal API. This corpus comprises 100 randomly selected Wikipedia articles, with each sentence from these articles contributing to the dataset. We employed this corpus for comparative analysis, as all machine-generated corpora were gathered from continuous texts.

The machine-generated corpora were produced using two distinct methodologies. First, we randomly selected terms defined on Wikipedia and prompted six LLMs to generate continuous textual descriptions of these terms, typically comprising 10 to 15 sentences each. This process resulted in six sentence-based corpora. Additionally, we prompted one LLM to generate the same term descriptions in English and French. Second, we randomly selected historical figures and instructed the same six LLMs to generate extended articles detailing their lives, with each article generally consisting of approximately 50 sentences. This procedure yielded six additional sentence-based corpora derived from these longer coherent text samples.

3.2. Percentage of Parsed Sentences

The analysis of a text consists of several steps. Prior to parsing, some sentences are filtered out due to factors such as the presence of non-standard characters or grammatical errors, which prevent the parser from successfully analyzing the sentence. In the following, we provide a breakdown of the reasons for non-parsed sentences, noting that the relative proportions of these reasons remain consistent regardless of whether the corpus contains thousands or millions of sentences.

The first filtering criterion pertains to the presence of characters that our parser is currently unable to process. Specifically, 1.0% of the sentences contained unknown characters, while 21.3% included punctuation marks that the parser cannot yet handle. At present, the parser is limited to processing only periods (.) and commas (,), as these punctuation marks are essential for constructing complex sentence structures.

The second filtering criterion concerns the presence of words that are not found in the Morphy lexicon [19]. Such words are interpreted as proper names, and sentences in which such a “proper name” is not capitalized are excluded from further processing. This accounted for 5.2% of all sentences.

The third filtering criterion addresses grammatically incorrect sentences, primarily resulting from errors in sentence tokenization. Some of these errors can be identified before parsing, for instance, in sentences missing a verb. A total of 5.3% of the sentences were filtered out for grammatical incorrectness before parsing.

Following the application of these filtering criteria, 77.2% of the sentences remained available for parsing. Of this subset, the parser successfully processed 54.5%, which corresponds to 42.1% of the initial number of sentences. The primary reason for unsuccessful parsing is an “interpretation explosion” caused by an excessive number of proper names within sentences. News articles and Wikipedia entries often contain numerous proper names, which, in the current version of our parser, are handled in a rudimentary manner. Since the parser lacks specific information about these proper names beyond recognizing them as nouns, it assumes every possible grammatical case for them, leading to an exponential increase in interpretations. If the number of possible interpretations exceeds a preselected cutoff value (512 in this study), the parsing process is terminated.
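The exponential growth can be made concrete with a simplified upper-bound estimate. The sketch below assumes that every proper name independently admits all four German cases and that the interpretations simply multiply; the actual parser may prune combinations through its rules, so this is an illustration rather than a description of its internals. The names `interpretation_count` and `exceeds_cutoff` are hypothetical; only the cutoff of 512 is taken from the text.

```python
GERMAN_CASES = 4  # nominative, accusative, dative, genitive
CUTOFF = 512      # cutoff value reported for this study


def interpretation_count(num_proper_names, other_ambiguity=1):
    """Upper bound on interpretations if each proper name may take any case.

    other_ambiguity is a hypothetical factor for all remaining ambiguities.
    """
    return other_ambiguity * GERMAN_CASES ** num_proper_names


def exceeds_cutoff(num_proper_names, other_ambiguity=1):
    """Return True if parsing would be terminated under this estimate."""
    return interpretation_count(num_proper_names, other_ambiguity) > CUTOFF


for k in range(1, 7):
    print(k, interpretation_count(k), exceeds_cutoff(k))
```

Under these assumptions, four proper names give 256 interpretations and stay below the cutoff, whereas five give 1024 and exceed it, which is consistent with name-rich news and Wikipedia sentences failing most often.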

The current study focuses on substantive–substantive relationships to determine the origin of a text by relating substantives with certain cases (see Section 4.1). However, for most proper names, the case cannot be determined unambiguously, and therefore, a relation between a proper name and another substantive regarding their two cases is ambiguous. This leads to an exclusion of this relationship from the statistics (see Appendix A.5) because solely biunique relations are incorporated in the analysis. Therefore, texts with few proper names, unlike news or Wikipedia articles, would lead to reliable statistics with fewer sentences because more substantive–substantive relationships can be gained from each sentence.

Due to the parser’s ongoing development, large-scale parsing on the order of millions or billions of sentences on a supercomputer has not yet been conducted. Currently, we let the parser process one million sentences of a news corpus of the Wortschatz Leipzig [18]—a workload that can be completed in under 12 h on a single multi-core CPU (12th Gen Intel(R) Core(TM) i5-1235U), utilizing all ten cores in parallel.

3.3. Longer Coherent LLM-Written Texts

In Section 4.1, we compare sentences generated by LLMs with sentences taken from the human-written corpora from the Wortschatz Leipzig, which were both extracted from coherent texts. The Wortschatz Leipzig texts were drawn from sources such as news articles, Wikipedia, and other platforms, while the LLM-generated texts were LLM responses to specific prompts.

We needed several hundred sentences from a single LLM for our corpus analysis. However, an LLM rarely responds with a hundred sentences to a single prompt. Hence, we concatenated the LLM responses to single prompts to obtain the desired number of sentences.

We employed two prompting strategies: one that led to relatively short answers of 10–15 sentences and one that led to longer answers of about 50 sentences per prompt. We analyzed the sentences from both prompting strategies to ensure that the grammar of the LLM under consideration was similar regardless of whether the answer was short or relatively long.

In the first prompting strategy, we requested the LLM to write explanations of random Wikipedia terms with the following prompt (using the Wikipedia term “Auto”):

“Erkläre den Begriff Auto”.
(“Explain the term car”).

On average, this prompting strategy led to 10–15 sentences per prompt.

In the second prompting strategy, we requested the regarded LLM to write essays about historical figures, such as Albert Einstein and Marie Curie:

“Schreibe einen Aufsatz über Albert Einstein, der mindestens 50 Sätze umfasst”.
(“Write an essay about Albert Einstein, which comprises at least 50 sentences”.)

The LLM had more to say about a historical figure than about a Wikipedia term in response to a single prompt, and the longer texts comprised about 50 sentences.

The answers of the LLM were concatenated separately for each prompting strategy to build a sentence corpus. The total number of sentences was the same for both prompting strategies, but in the case of shorter pieces of generated text, more LLM responses were included in the corpus.

3.4. Computer-Assisted Translated Corpora

While the new linguistic theory we proposed in the previous publication [6] is applicable to any natural language, the parser developed based on this theory is currently tailored exclusively for the German language. Nevertheless, it is possible to analyze corpora in other languages by first translating them into German and then parsing the translated sentences with the German-specific parser.

Hence, some of the corpora analyzed in this study were written in English or French. These foreign-language corpora were included in our analysis to verify whether the identified grammatical characteristics of the corpora, whether human-written or machine-generated, would be consistent across different languages.

For the bridges to be valid in the translated corpora, it is important that the relationships between substantives are preserved during translation. To ensure this, we employed a sentence-by-sentence translation approach using DeepL [20], which prevents DeepL from merging two sentences into one or splitting one sentence into two. According to several online sources [21,22], the translation of DeepL is relatively literal and conserves the syntactic roles of the nouns, whether serving as subjects, direct objects, or indirect objects. The word choice could be altered during this translation, affecting common AI detection mechanisms, but the nouns mentioned in the sentence and their cases, which we rely on, are stable characteristics during translation.

3.5. Humanized Corpora

An AI humanizer is an advanced tool designed to enhance text generated by LLMs, transforming it into a more natural and human-like form. The primary objective of an AI humanizer is to make machine-generated content indistinguishable from human writing, focusing on adjusting emotional tone and word choices while mostly preserving the underlying grammatical structure.

Although LLMs can produce grammatically correct and coherent text, their output often lacks the subtle emotional nuances typical of human language. AI detectors identify machine-generated content by detecting patterns, phrasing, and word choices that are characteristic of artificial writing. The AI humanizer addresses this issue by selecting more emotionally charged or contextually appropriate words, replacing neutral or mechanical phrases with those that convey empathy, excitement, warmth, or other human emotions. By enhancing the emotional depth of the text while maintaining most of its grammatical integrity, the humanizer ensures that the content remains clear and structured, yet imbued with a layer of human-like expressiveness.

In our study, we used the AI humanizer developed by “HIX.AI” [5], which is designed to bypass detection by several AI detectors, including “GPTZero” [2], “Copyleaks” [3], and “Crossplag” [4]. Since the underlying grammar remains mostly unaffected by the AI humanizer, we investigated whether the grammatical characteristics of the texts are altered during the humanization process. We also applied AI humanization to human-written texts, not to make the text more human-like but to study whether the underlying grammar changes during the humanization process.

3.6. The Eight Grammatical Markers

Grammatical bridges have two endpoints, which are the two words that are grammatically connected by the bridge and represented by a string. It is possible to truncate these endpoint words, thereby creating a truncated bridge that connects word classes and highlights the grammatical relationship between them expressed by the intermediate keywords. When applied to a whole corpus, these truncated bridges reveal the grammatical characteristics present within the corpus. The cumulative analysis of all truncated bridges indicates the frequency of specific grammatical bridges, offering a method for distinguishing between different corpora.

Any truncated bridge may serve as a marker for corpus linguistics, but for statistical reliability, we focused exclusively on those that were frequent across all corpora. Table 1 specifically presents the truncated bridges between two substantives, limiting the analysis to the frequency of substantive connections within the corpora. While other frequently occurring bridges might also reveal significant differences between the corpora, an examination of these is reserved for future research.

The displayed grammatical markers in Table 1 follow a systematic pattern. The first three bridges (first group) represent connections between pairs of independent (i.e., not enumerated) nominative, accusative, and dative objects. For instance, the sentence

“Die Regierung gewährte dem Unternehmen eine finanzielle Unterstützung”.
(“The government granted the company financial support”.)

contains exactly one subject “the government”, one direct object “financial support”, and one indirect object “the company”. The parser constructs three grammatical bridges from this sentence, one for each substantive pair. Each of these bridges exemplifies one of the first three markers.

The next two bridges (second group) account for the connections between two independent accusative objects and two independent dative objects, respectively. For instance, the sentence

“Der Verein überließ den Forschern für ihre Studie historische Dokumente”.
(“The association provided the researchers with historical documents for their study”.)

includes two independent indirect objects: “the researchers” and “for their study”. Among other syntactic relations, the parser links these two dative objects and constructs a bridge between them, serving as an example of the second group of markers.

The final three bridges (third group) capture cases in which nominative, accusative, or dative objects are connected within an enumeration. A corresponding example is the following sentence:

“Das Gericht prüft die Beweise und die Zeugenaussagen vor der Urteilsverkündung”.
(“The court examines the evidence and the witness statements before pronouncing judgment”.)

Here, the direct object “the evidence and the witness statements” constitutes an enumeration of substantives. In this case, the parser links the individual elements within the enumeration, forming a bridge that exemplifies the third group of markers.

To demonstrate the number of markers that can be extracted from a corpus, we analyzed the GPT-4o corpus, which comprises 2605 sentences (see Section 3.1). Prior to parsing, 632 sentences were excluded. Of the remaining 1973 sentences, the parser successfully processed 1062. In total, 7337 truncated, unambiguously determined bridges were identified, neglecting the ambiguous bridges (see Appendix A.5). Among these, 319 bridges connect two substantives. This study focuses on eight specific bridges within this subset, which are referred to as markers. Table 2 presents the occurrences of these eight marker bridges within the 1062 successfully parsed sentences.

The remaining 311 bridges also connect two substantives; however, the eight markers were the bridges with the most occurrences. To make the occurrences of the markers comparable, we normalized them by the total number of occurrences of all 319 substantive bridges. The 319 different substantive bridges account for 3222 occurrences in total; therefore, the occurrences in Table 2 were divided by this number:

\text{normalized occurrence of substantive bridge } n = \frac{\text{occurrence of substantive bridge } n}{\sum_{i = 1}^{N} \text{occurrence of substantive bridge } i}
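The normalization can be sketched in a few lines of Python. This is only an illustration: the marker labels and the individual counts are hypothetical placeholders (they are not the values of Table 2), and only the total of 3222 occurrences is taken from the text.

```python
from typing import Dict


def normalize_markers(marker_counts: Dict[str, int], total_occurrences: int) -> Dict[str, float]:
    """Divide each marker count by the total number of occurrences
    of all substantive bridges (not only the eight markers)."""
    if total_occurrences <= 0:
        raise ValueError("total_occurrences must be positive")
    return {name: count / total_occurrences for name, count in marker_counts.items()}


# Hypothetical labels for the eight markers; the counts are illustrative only.
example_counts = {
    "NOM-ACC": 300, "NOM-DAT": 250, "ACC-DAT": 120,      # first group
    "ACC-ACC": 40, "DAT-DAT": 90,                        # second group
    "ENUM-NOM": 60, "ENUM-ACC": 110, "ENUM-DAT": 130,    # third group
}

# Total occurrences of all 319 substantive bridges in the GPT-4o corpus (from the text).
TOTAL_SUBSTANTIVE_OCCURRENCES = 3222

normalized = normalize_markers(example_counts, TOTAL_SUBSTANTIVE_OCCURRENCES)
```

Because the denominator also includes the 311 substantive bridges that are not markers, the eight normalized values do not sum to one.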

3.7. Similarity Measurements Between Corpora

We used the eight normalized markers (see Section 3.6) as the feature vector of a corpus. Two corpora are similar when the cosine similarity of their feature vectors is high. A value of one indicates feature vectors that point in the same direction, i.e., identical marker proportions, and a value of zero indicates orthogonal feature vectors. In Section 4.3, we analyze how many sentences must be parsed such that a corpus under investigation can be assigned to either human-written or AI-generated text.

The corpus under investigation was organized into levels representing nested subsets of its sentences. Specifically, the set of sentences was repeatedly divided by a factor of two. If N is the number of sentences in the corpus under investigation and n = 2 is the dividing factor, each level comprises

\frac{N}{n^{\mathrm{level}}}

sentences. This division resulted in a hierarchical structure of sentence sets, in which each set at a lower level consisted half of the sentences of the next-higher-level set and half of new sentences not included in that smaller set.
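A minimal sketch of such a level structure is given below. It assumes that the nested subsets are drawn as prefixes of a randomly shuffled sentence list; the text does not state how the subsets were selected, so the shuffling, the seed, and the function name are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def build_levels(sentences: Sequence[str], max_level: int, seed: int = 0) -> List[List[str]]:
    """Return nested subsets: entry `level` holds N // 2**level sentences.

    Taking prefixes of one shuffled list guarantees that every
    higher-level (smaller) set is contained in the lower-level sets.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # assumption: random selection
    n = len(shuffled)
    return [shuffled[: n // 2 ** level] for level in range(max_level + 1)]


# Toy corpus of 64 placeholder sentences.
levels = build_levels([f"sentence {i}" for i in range(64)], max_level=3)
sizes = [len(level) for level in levels]
```

The analysis then proceeds from the last (smallest) entry toward level zero, which is the full corpus.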

The analysis begins with a high-level set (corresponding to a high dividing factor) that provides five or six parsed sentences. These few sentences are not yet enough to obtain reliable statistics on the percentages of the grammatical markers. Hence, more and more sentences are parsed, corresponding to a lower dividing factor (level), until the statistics converge to stable percentages of the markers in the corpus under investigation.

To assess the similarity between a particular level of the regarded corpus and a comparison corpus, the software computes the cosine similarity between these two corpora:

\cos(\theta) = \frac{\vec{A} \cdot \vec{B}}{\| \vec{A} \| \, \| \vec{B} \|}

where \vec{A} is the eight-dimensional vector of the regarded corpus, in which each dimension represents one of the eight markers and the value of each marker is its percentage of occurrence in the corpus. In the same way, \vec{B} represents the eight-dimensional vector of the comparison corpus.
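The cosine similarity of two marker vectors can be computed as in the following sketch, which uses only the Python standard library. The function name and the error handling are illustrative choices and are not taken from the software used in this study.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Since the marker values are non-negative, the result lies between zero and one. Scaling a vector by a positive constant does not change the result, so only the proportions between the markers matter.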

4. Results

The present paper introduces a novel method, referred to as the generation of grammatical bridges, which can be utilized as a statistical tool for analyzing the grammar present in different corpora. We analyzed diverse corpora, human- and machine-written, on the basis of the eight markers defined in Section 3.6 and show differences in the frequencies of these markers across the two origins. Additionally, we analyze whether these differences persist under several transformations, such as translation into another language or humanization of the corpora.

4.1. Typical Grammatical Characteristics of Human-Written Corpora and LLM-Generated Texts

Figure 2 illustrates the grammatical characteristics of various corpora by comparing the frequencies of frequently occurring truncated bridges, referred to as markers. The bars in Figure 2 are grouped according to the origin of the corresponding corpus. The y axis displays the percentage of the regarded bridge. A value of 12%, for example, means that the bridge represents 12% of all connections between substantives (see Section 3.6).

Some of the bars in Figure 2 represent multiple corpora, while some of the bars were derived from a single corpus. The legend of Figure 2 shows how many corpora were analyzed for a specific bar. If multiple corpora were involved, the mean value is shown along with the standard deviation (black lines).

The first set of bars corresponds to human-written corpora and consists of four bars. The first bar (blue) represents a corpus collected directly via the Wikipedia API. The second bar (orange) illustrates the mean and standard deviation of five German-language corpora from Wortschatz Leipzig. The third bar (green) represents the mean and standard deviation of three foreign-language corpora from Wortschatz Leipzig, comprising two English corpora and one French corpus. The fourth bar (red) corresponds to the same corpus as the first bar; however, this corpus was processed through an AI humanization model prior to parsing (see Section 3.1).

The second set of bars represents marker frequencies of corpora generated by LLMs. The first bar in this group (purple) depicts the mean and standard deviation of the short (10 to 15 sentences) descriptions generated by six LLMs (see Section 3.1) developed by OpenAI [1]: GPT-3.5-Turbo, GPT-4-Turbo, GPT-4o-mini, GPT-4o, GPTo1-mini, and GPTo1-preview. The second bar (brown) shows the mean and standard deviation for the same six LLMs, which were instead prompted to generate longer text passages (50 sentences) about historical figures (see Section 3.3). The third bar (pink) represents the mean and standard deviation of two foreign-language corpora, one in English and one in French, produced by GPT-4o using the same short-term description prompts as the first bar in this group. Finally, the fourth bar (gray) represents the frequencies observed in a humanized corpus generated by GPT-4o, which was also prompted with the short-term descriptions (see Section 3.1).

The heights of the markers are similar within each corpus group but differ significantly between the two groups. The first three markers, representing the connections between nominative, accusative, and dative objects, are notably higher in the human-written corpora than in the machine-generated corpora, indicating that these three objects are used differently depending on the origin of the corpus. The connection between two independent accusative objects is also more frequent in the Wortschatz Leipzig corpora, though the difference is less pronounced than for the connection between two independent dative objects. In contrast, the use of enumerations of substantives reverses this trend. While the machine-generated corpora employ enumerations of nominative objects as frequently as the Wortschatz Leipzig corpora, they use significantly more enumerations of accusative and dative objects.

Figure 2 illustrates that the bars representing corpora created from shorter LLM-generated text snippets (purple) and those representing longer text snippets (brown) exhibit similar distributions. Specifically, the mean value of one bar falls within, or in close proximity to, the standard deviation range of the other. In contrast, both types of text snippets differ significantly from the human-written corpora. This finding suggests that text length has only a minor influence on the grammatical patterns used by the analyzed LLMs.

4.2. Permanence of the Grammatical Features Under Translation and Humanization

In order to study grammatical differences in translated texts, we analyzed three translated human-written corpora from Wortschatz Leipzig, two in English and one in French, as well as two machine-generated corpora produced by GPT-4o in English and French. The mean values of these two sets of corpora (see Figure 2 green and pink bars) exhibit a high degree of similarity to other corpora of the same origin while differing significantly from those of the opposite origin. This suggests that grammatical characteristics, particularly the syntactic role of nouns, remain consistent across translations, allowing the origin to be identifiable regardless of language. Furthermore, these findings provide evidence that the linguistic framework we proposed in the two previous publications [6,7] is language-independent.

Figure 2 demonstrates that AI humanization of a corpus had minimal impact on its grammatical characteristics. For this analysis, we applied the AI humanizer to the human-written corpus assembled via the Wikipedia API and to the GPT-4o-generated corpus consisting of shorter text snippets, in both cases before dividing the text into individual sentences.

The red bar, representing humanized human-written text, and the gray bar, representing humanized machine-generated text, exhibit similar heights within the same origin, while both differ significantly from the opposing origin. These findings indicate that AI humanization, whether applied to machine-generated or human-written text, largely preserves grammatical characteristics. While AI detection systems are misled by AI humanization, our analysis using grammatical markers remains robust, since grammatical characteristics were not substantially altered during the humanization transformation.

4.3. How Many Parsed Sentences Are Needed to Determine the Origin of the Text

This section aims to determine the minimum number of parsed sentences required to reliably identify the origin of a text—either human-written or machine-generated. The same eight markers previously outlined in Section 3.6 served as indicators for the origin in this analysis. One corpus was the corpus under investigation, which has to be classified as either human-written or machine-generated. This corpus was compared with the five German Wortschatz Leipzig corpora and the six German LLM-generated corpora described in Section 3.1.

Figure 3 presents the cosine similarity against several human-written and AI-generated corpora for two corpora under investigation: the human-written News-Crawl corpus from Wortschatz Leipzig [18] (Figure 3A) and the corpus generated by GPT-4o (Figure 3B). Note that the corpus under investigation is not represented as a line in Figure 3. The figure displays the cosine similarity of this corpus to all the other comparison corpora. The corpus under investigation grew in the number of sentences, from left to right on the x axis, whereas the comparison corpora had a fixed number of sentences. The number of sentences in the comparison corpora is high enough that the marker statistics are saturated: additional sentences do not change the statistics of these comparison corpora.

The x axes in Figure 3A and 3B are logarithmic and show the number of successfully parsed sentences and not the number of total sentences at the current dividing factor (level), because the rule-based parser was not able to parse every sentence of the corpus under investigation. Two additional reference lines are displayed in Figure 3A and 3B: the mean cosine similarity across all five Wortschatz Leipzig corpora (represented by the blue line) and the mean cosine similarity across all six machine-generated corpora (represented by the black line).

Figure 3 demonstrates that the cosine similarities against the comparison corpora are similar within the same origin. In Figure 3A, the cosine similarity between the regarded News-Crawl corpus and the machine-generated corpora stabilized between 0.65 and 0.75, while the cosine similarity against the other Wortschatz Leipzig corpora stabilized between 0.9 and 1.0. Conversely, Figure 3B illustrates the opposite trend when the regarded corpus was machine-generated. The cosine similarity of the GPT-4o corpus against the human-written corpora stabilized between 0.6 and 0.75, while the cosine similarity against the machine-generated corpora stabilized between 0.85 and 1.0. Thus, one can conclude that the corpus under investigation has a higher cosine similarity when compared to corpora of the same origin. Notably, Figure 3A shows a slightly different behavior for the oldest language model (GPT-3.5 Turbo), although this behavior remained similar to that of the newer models.

Figure 3A and 3B indicate that the statistics of the corpus under investigation saturated at around 128 parsed sentences. Additional parsed sentences changed the eight-dimensional feature vector of this corpus only slightly. The statistics of the comparison corpora were saturated for every level of the corpus under investigation. The number of 128 sentences is not the minimum number of sentences needed for reliable AI detection; this number may be lower. Rather, about 128 sentences mark the point beyond which additional sentences do not contribute to a more pronounced distinction between the corpora. Note that the total number of sentences needed to saturate the statistics is higher than 128, because not every sentence can be parsed with the current version of the parser (see Section 3.2).

5. Discussion

The present study focused on analyzing the grammatical differences between human-written and machine-generated corpora, as well as the effects of translating the corpora into other languages and of humanizing the corpora with special software. The analysis was performed using a novel method for parsing natural language sentences that we introduced in two previous publications [6,7]. This method is entirely rule-based, relying solely on German grammar and avoiding the use of statistical models or neural networks. This approach contrasts with common techniques such as dependency parsing, which relies on statistical models and generates parses for any sequence of words, regardless of whether the sequence is meaningful or not (as we show in [6]). Conversely, our parser may reject sentences that are grammatically incorrect.

At its current developmental stage, the parser successfully parsed approximately 42.1% of real-world sentences, such as those found in the Wortschatz Leipzig corpus [18]. This is sufficient to extract the syntactic and semantic properties of entire corpora containing thousands, millions, or even billions of sentences. The parsing results are stored in a database that represents word-to-word relationships. Such databases are typically referred to as syntactic dependency networks (SDNs), which are often constructed using a dependency parser. However, our database offers three key advantages over conventional SDNs:

Disambiguation: To provide a grammatical analysis of the corpora, our parser stores word relationships that are unambiguously determined. Dependency parsers typically resolve ambiguities by selecting the most probable solution, whereas our approach includes even less likely solutions if they can be definitively determined (we show this in [7]).
Connectivity: While SDNs connect only directly dependent words, our parser establishes connections between every pair of words in a sentence by identifying the shortest path within the parsing structure, which we call here grammatical bridges.
Informational Depth: SDNs provide limited grammatical information. As we demonstrated earlier in [7], dependency parse trees capture only a fraction of the grammatical structures present in German sentences. For instance, they do not explicitly identify main and subordinate clauses or the grammatical cases of nouns. In contrast, our parser constructs a unique parsing structure that differs fundamentally from conventional parse trees and violates the Robinson Axioms, which underlie such trees. Thus, it determines every grammatical case distinguishable within German grammar, thereby offering a much richer representation of the sentence structure.
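The connectivity idea can be illustrated with a small sketch that finds the path between two words through their closest common container. The sketch assumes, purely for illustration, that the parsing structure can be treated as a hierarchy of nested containers given by a child-to-parent map. The node labels (such as REGNOUN_NOM) and the function names are hypothetical, and the sketch does not reproduce the bridge format used by the parser.

```python
from typing import Dict, List, Optional


def path_to_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """List the node followed by all of its containers up to the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def shortest_path(a: str, b: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Path from a to b through their closest common container."""
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    containers_of_a = set(up_a)
    # First container on b's way up that also contains a.
    meet = next(node for node in up_b if node in containers_of_a)
    return up_a[: up_a.index(meet) + 1] + list(reversed(up_b[: up_b.index(meet)]))


# Hypothetical hierarchy for the three substantives of
# "Die Regierung gewährte dem Unternehmen eine finanzielle Unterstützung".
parent = {
    "ROOT": None,
    "FULLCLAUSE": "ROOT",
    "REGNOUN_NOM": "FULLCLAUSE",
    "REGNOUN_DAT": "FULLCLAUSE",
    "REGNOUN_ACC": "FULLCLAUSE",
    "Regierung": "REGNOUN_NOM",
    "Unternehmen": "REGNOUN_DAT",
    "Unterstützung": "REGNOUN_ACC",
}

bridge = shortest_path("Regierung", "Unterstützung", parent)
```

In a hierarchy of this kind, the path through the closest common container is the only path between two words and is therefore also the shortest one.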

In this study, we focused specifically on relationships between substantives, temporarily disregarding other types of relationships. Among the substantive–substantive relationships, we concentrated on the eight most frequently occurring ones, such as the relationships between the sentence subject and the direct or indirect objects. We called them grammatical markers. Using these markers, we could distinguish whether two corpora stem from the same or different origin.

We have analyzed corpora from two distinct origins, human-written and machine-generated, by counting the marker frequencies of substantive–substantive relationships. The human-written corpora were sourced from Wortschatz Leipzig and Wikipedia, while the machine-written corpora were generated by six different large language models (LLMs) from OpenAI [1], including the older GPT-3.5 model and the more recent o1-preview model. Our analysis revealed differences in the marker frequencies between texts of human and machine origin. Specifically, human-written texts consistently exhibited higher frequencies of nominative–dative, accusative–dative, and dative–dative object pairs, whereas the machine-generated texts showed a greater prevalence of enumerations involving accusative and dative objects. These differences were consistent regardless of the length of the LLM-generated texts.

These findings were also consistent when the corpora underwent certain transformations prior to parsing. We investigated whether the specific natural language of the corpus affects the markers, focusing on German, English, and French. Some Wortschatz Leipzig corpora are written in English and French, and we used an LLM to generate descriptions of Wikipedia terms also in these languages. As our parser is designed for German, we translated all foreign-language corpora into German using DeepL [20], performing sentence-by-sentence translations before parsing. Our findings demonstrated that the original language did not influence the markers, as the relationships between nouns remained consistent across translations.

Additionally, we analyzed the impact of post-processing LLM-generated corpora to make them appear more human-like. This AI humanization process involves modifying sentences to make them more emotional or subjective, often in an effort to deceive AI detectors. We have shown that the grammar remained largely unchanged during the humanization process. This indicates that AI detectors should concentrate on the grammar used rather than on the subjectivity or emotionality of the texts.

The analysis was constrained by two primary factors, both of which contributed to an increased number of required sentences for determining the origin of texts. First, the parser, in its current state of development, successfully processed approximately 42.1% of German sentences. This parsing rate remained consistent across both human-written and AI-generated texts and was not influenced by sentence complexity. We assume that this does not introduce bias into the analysis, specifically, that the parser does not systematically exclude sentences with distinct grammatical structures.

As demonstrated in Section 4.3, about 128 parsed sentences were needed for the marker statistics to saturate. At the current parsing rate, the original text must therefore contain approximately 300 sentences, which is significantly more than the typical length of most texts. Thus, improving the parsing rate is essential for the practical application of this method in real-world scenarios.
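The estimate of roughly 300 sentences follows from dividing the number of parsed sentences by the parsing rate. The short sketch below reproduces this arithmetic with the two figures given in the text (128 parsed sentences and a parsing rate of 42.1%); the function name is illustrative.

```python
import math


def required_total_sentences(parsed_needed: int, parsing_rate: float) -> int:
    """Smallest number of input sentences that is expected to yield
    `parsed_needed` successfully parsed sentences."""
    if not 0.0 < parsing_rate <= 1.0:
        raise ValueError("parsing_rate must lie in (0, 1]")
    return math.ceil(parsed_needed / parsing_rate)


# 128 / 0.421 is about 304.04, so 305 sentences are needed in expectation.
total_needed = required_total_sentences(128, 0.421)
```

A higher parsing rate lowers this number proportionally; with a hypothetical rate of 80%, for example, 160 sentences would suffice.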

The second limiting factor concerns the removal of incorrect sentence interpretations. Our analysis exclusively considered substantive relationships that are unambiguously determined, meaning all possible interpretations of a given sentence exhibit the same relationship. However, the parser identified three times as many relationships overall, many of which were ambiguous. Eliminating these incorrect interpretations increases the proportion of unambiguous relations, and reducing ambiguity by a factor of three would decrease the required number of sentences to fewer than 50.

For most practical applications, the required number of sentences must be reduced to fewer than 50, as many documents in fields such as education and journalism are shorter than this threshold. In education, for instance, students typically write essays consisting of only a few dozen sentences. By applying our software, educators could analyze these essays to assess whether AI assistance was used, even if the text has been humanized by AI. Similarly, in journalism, an increasing number of news articles are generated by AI without being explicitly labeled as such, potentially compromising the quality and authenticity of information. If an article is sufficiently long, editorial offices and readers could use our method to verify whether it was human-written.

In contrast, documents in science and law are typically extensive, containing a sufficient number of sentences for reliable analysis. However, these fields frequently use foreign words and specialized terminology, which our parser can recognize if such terms are included in the underlying lexicon. The current lexicon, Morphy [19], may not be comprehensive enough in this regard. Moreover, legal texts often employ specialized sentence structures that may differ from the grammatical patterns observed in both human-written and machine-generated texts. In such cases, the reference corpora for human-written and machine-generated texts should be specifically tailored to legal discourse, and the set of grammatical features used for analysis may need to be adjusted accordingly.

The identification of AI-generated content becomes more challenging when only certain portions of a text are AI-generated. In these cases, we hypothesize that the grammatical statistics of the text would fall between those of purely human-written and purely AI-generated texts.

Conventional AI detection techniques, such as those proposed by Opara et al. (2024) [12], Alamleh et al. (2023) [13], and Guo et al. (2024) [15], primarily rely on easily extractable textual features, such as emotional tone or subjectivity. These methods typically involve counting specific words or identifying recurring patterns within the text. However, because such features are straightforward to modify, AI systems can readily alter them through techniques like humanization.

In contrast, our approach relies on features that require a complete parsing of sentences. Given that all analyzed large language models (LLMs), regardless of whether they are older or more recent, exhibit consistent grammatical properties, we posit that these properties are characteristic of machine-generated writing. While it may be possible to instruct an LLM to emulate human-like grammar, doing so would fundamentally alter the model’s writing style. Post-processing an existing text to retrospectively modify its grammar could also be feasible, but it presents significant challenges. Specifically, ensuring that the original content remains unchanged while modifying the syntactic structure, particularly the occurrence of nouns and their syntactic relationships, would be difficult to achieve.

In this study, we compared two corpora based on the cosine similarity of their grammatical feature vectors. If the cosine similarity between a given corpus and a machine-generated corpus is higher than that between the given corpus and a human-written corpus, the given corpus can be classified as machine-generated. However, we do not currently compute confidence scores for these classifications, which would be crucial for practical applications. Additionally, we have not yet evaluated the reliability of our method in comparison to other AI-based detection mechanisms. These limitations will be addressed as the method is further refined for real-world implementation.
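The decision rule described above can be sketched as follows. The sketch averages the cosine similarity over several reference corpora of each origin, in analogy to the mean reference lines of Figure 3; this averaging, the tie-breaking, the function names, and the example vectors are assumptions made for illustration, and no confidence score is computed.

```python
import math
from statistics import mean
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def classify_origin(
    vector: Sequence[float],
    human_refs: Sequence[Sequence[float]],
    machine_refs: Sequence[Sequence[float]],
) -> str:
    """Assign the origin whose reference corpora are more similar on average."""
    human_score = mean([cosine_similarity(vector, ref) for ref in human_refs])
    machine_score = mean([cosine_similarity(vector, ref) for ref in machine_refs])
    # Assumption: ties are resolved in favor of "human-written".
    return "machine-generated" if machine_score > human_score else "human-written"


# Hypothetical eight-dimensional marker vectors (not measured values).
human_refs = [[0.12, 0.10, 0.05, 0.02, 0.04, 0.02, 0.01, 0.01]]
machine_refs = [[0.06, 0.04, 0.02, 0.01, 0.01, 0.02, 0.05, 0.06]]

label = classify_origin([0.06, 0.05, 0.02, 0.01, 0.01, 0.02, 0.05, 0.05], human_refs, machine_refs)
```

A practical implementation would additionally need a measure of how far apart the two scores are before a classification is reported.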

Future research will aim to extend the parser’s capabilities to encompass all grammatically correct German sentences rather than only a subset. In combination with a reduction in false interpretations of sentences, the number of sentences needed for AI detection would decrease significantly. Furthermore, integrating the strengths of both rule-based and neural methodologies appears to be a promising direction for future developments.

6. Conclusions

In two preliminary publications, we introduced a novel rule-based parsing approach for natural language, which contrasts with commonly used neural methods such as dependency parsing. We applied a rule-based parser to differentiate between human-written and machine-generated corpora based on the grammar used. Our findings reveal systematic grammatical differences between the analyzed corpora, which remained consistent across various transformations, such as translations into other languages or AI-based humanization of the texts. The identified grammatical differences reflect the writing style of current AI models. Future adversarial humanization techniques may be capable of eliminating these differences, which emphasizes the need to continually re-evaluate grammar-based detection strategies.

Our results indicate that grammar is a reliable feature for determining text origin and complements existing methods based on the subjective elements or emotional content of the text.

Author Contributions

Conceptualization, S.S. and R.L.; Methodology, S.S. and I.S.; Software, S.S.; Validation, S.S. and I.S.; Formal Analysis, S.S.; Investigation, R.L.; Data Curation, S.S.; Writing—Original Draft Preparation, S.S.; Writing—Review and Editing, S.S., I.S. and R.L.; Visualization, S.S.; Supervision, R.L.; Project Administration, R.L.; Funding Acquisition, R.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Edith-Haberland-Wagner-Stiftung, Buhl-Strohmaier-Foundation, and Würth Foundation.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Construction of the Grammar Bridges

To construct the grammar bridges using our previously developed rule-based parser, we extend the methodology outlined in [7]. In Appendix A.1, we introduce an additional parsing step that connects clauses within our parser’s framework. Subsequently, we detail the identification of intermediate keywords essential for the grammar bridges, which are categorized into types (Appendix A.2) and roles (Appendix A.3). The process of constructing these bridges is then elaborated in Appendix A.4. Notably, the parser often generates multiple interpretations for a single sentence, resulting in ambiguities in the constructed bridges. The issue of ambiguous grammar bridges is addressed in Appendix A.5.

Appendix A.1. Methodological Extension of the Parsing Method: Clause Connections

The parsing steps of our parser are the same as we described in [7], with one additional step described in the following.

In a previous publication, we briefly addressed the challenge of connecting multiple clauses within a sentence (see Section 3.1.10 in our previous publication [7]). This problem becomes particularly complex when sentences consist of more than two clauses. In the construction of grammatical bridges, we take into account the role that each clause plays within the overall sentence structure (see Section A.3). As we demonstrated in [6], all clause connections can be reduced to three primary types. The first type connects clauses by sharing a noun. The second type establishes a conditional relationship between the clauses. The third type describes a clause that functions as a noun within another clause.

The type of clause connection can be determined by examining the leading linguistic element of each clause. For instance, in the sentence “The father takes the umbrella because it is raining”, the conjunction “because” indicates a connection of the second type, where a condition is established between the two clauses. The first clause does not feature a distinctive leading element and is thus labeled “NEUTRAL” (we use capital letters for our keywords), while the second clause, introduced by the conjunction, is marked “CONBEGIN”. These tags, “NEUTRAL” and “CONBEGIN”, precisely specify the roles that the respective clauses play within the overall sentence structure. The sentence as a whole is conceptualized as a container, referred to as “ROOT”, which encompasses all of its clauses.

It is important to note that when a sentence contains more than two clauses, such as in a main–subordinate–subordinate clause sequence, the order of clause connections can affect the overall structure. Specifically, there is a distinction between connecting the main clause to the subordinate clause first versus connecting the two subordinate clauses. In this study, we abstracted away from the specific order of clause connections and focused instead on documenting the linguistic element that introduces each clause within the grammatical bridges. Consequently, within the ROOT container, clauses are assigned one of the following roles: “NEUTRAL”, “CONBEGIN”, “DASSBEGIN”, or “PROBEGIN”.

The role “DASSBEGIN” is specific to German. The word “dass” is a conjunction with a special function in German: it introduces a clause that serves as a noun in another clause and thus represents the third type of clause connection. Other languages use other keywords to indicate that a clause functions as a noun, such as the English word “that”.

The role “PROBEGIN” indicates that a clause begins with a pronoun that refers to a noun in a preceding clause, thereby creating the first type of clause connection through shared nouns.
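The assignment of clause roles from the leading element can be sketched as follows. The sketch assumes that the word class of the leading element is already known, uses hypothetical word-class labels ("CONJUNCTION", "PRONOUN", "ARTICLE"), and simplifies in two respects: every conjunction other than “dass” is treated as “CONBEGIN”, and every leading pronoun is treated as “PROBEGIN” without checking that it refers to a noun in a preceding clause.

```python
def clause_role(leading_word: str, leading_word_class: str) -> str:
    """Map the leading element of a clause to one of the four clause roles."""
    word = leading_word.lower()
    if leading_word_class == "CONJUNCTION":
        # "dass" introduces a clause that functions as a noun (third type).
        return "DASSBEGIN" if word == "dass" else "CONBEGIN"
    if leading_word_class == "PRONOUN":
        # Simplification: the reference to a preceding noun is not verified.
        return "PROBEGIN"
    return "NEUTRAL"


# German rendering of the example sentence (our translation, for illustration):
# "Der Vater nimmt den Schirm, weil es regnet."
roles = [
    clause_role("Der", "ARTICLE"),       # first clause, no distinctive leading element
    clause_role("weil", "CONJUNCTION"),  # second clause, introduced by a conjunction
]
```

For the example sentence, the first clause is labeled “NEUTRAL” and the second “CONBEGIN”, in agreement with the description above.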

Appendix A.2. Types

The bubble model distinguishes between several types of “bubbles”, which serve as containers for various linguistic elements. As outlined in two previous publications [6,7], the primary bubble types include those surrounding noun phrases (representing objects in a scene) and those around clauses (representing the scenes themselves).

There are multiple types of noun phrases, which we discussed in detail in [7]. However, a standard type of noun phrase accounts for over 90% of occurrences. We describe the structure of this standard type in [7,23]. In this study, we focused exclusively on this standard noun phrase type and designated it with the keyword “REGNOUN”.

A sentence consists of different parts, which are separated by a comma or conjunction. In this study, we were solely interested in such parts, which form main and sub-clauses, and called the overarching type of these two parts “FULLCLAUSE”. Additionally, certain sentence segments that lack verbs, such as parenthetical insertions, were identified by the current version of the parser but are excluded from the results discussed here.

At each position of a linguistic element, an enumeration of linguistic elements of the same type can occur. As a result, various enumeration types are enclosed in bubbles labeled with keywords such as “NOUNBLOCK”, “VERBBLOCK”, “RPBLOCK”, and “PRPBLOCK” for noun, verb, verb particle, and preposition enumerations, respectively.

In the German language, multi-part nouns like “Apfelbaum” (“apple tree”) are written as a single word. However, this rule does not apply when one part of the multi-part noun is a proper name, as seen in the sentence

“Die Familie Huber macht Ferien”.
(“The Huber family goes on holiday”.)

When such multi-part nouns occur, they are encapsulated in a bubble marked with the keyword “MULTISUB”.

Verb phrases, which include verbs and their associated particles, are analyzed for tense, among other grammatical features [7]. To identify which verbs and particles form a verb phrase, a bubble is drawn around the entire phrase and labeled with the keyword “VERBPHRASE”.

Appendix A.3. Roles

Bubbles are arranged side by side within a superordinate bubble. To differentiate between the bubbles within this structure, each bubble is assigned a specific role based on its function within the superordinate bubble.

The most basic bubble type is the linguistic element “WORD”. Since this applies uniformly to all words in a sentence, it is omitted from the grammatical bridges. Instead, the role of the “WORD” bubble is determined by the word class of the word in question. Thus, the initial keyword in the grammatical bridge always represents the word class, allowing for the classification of which word classes are being connected.

Distinguishing word classes within a “REGNOUN” bubble assigns roles to the words, such as preposition, article, adjective, or noun. Within a “FULLCLAUSE” bubble, regular nouns and pronouns, the two analyzed types of nouns, are further differentiated by grammatical case, namely, nominative “NOM”, accusative “ACC”, dative “DAT”, and genitive “GEN”, which determines their role within the clause.

Clauses located within the “ROOT” bubble are assigned two keywords to specify their role. The first keyword indicates whether the clause is a main clause or subordinate clause, using the terms “MAINCLAUSE” and “SUBCLAUSE”, respectively. The second keyword identifies how the clause connects to other clauses, with roles such as “NEUTRAL”, “CONBEGIN”, “DASSBEGIN”, or “PROBEGIN”, as explained in Section A.1.

Appendix A.4. Connecting the Keywords

Upon parsing a sentence, the relationships between words are represented through structures we call “grammatical bridges”. In theory, each distinct interpretation of a sentence generates its own set of grammatical bridges, as discussed in Section A.5. For the purposes of this discussion, however, we will focus on a single interpretation and the corresponding grammatical bridges.

The core idea is that every content-bearing word in a sentence conveys semantic meaning, and the grammatical relationships between these words illustrate how they are connected structurally. While it is possible to analyze various types of word relationships, this study concentrated specifically on relationships between pairs of words. These relationships are represented by strings, with each string connecting two words as endpoints. The construction of these strings is described in the following.

The grammatical bridge between two words is derived from the hierarchical “bubble structure” corresponding to a particular interpretation of the sentence. Figure A1 and Figure A2 demonstrate the construction of two such grammatical bridges for the word pairs (“Auto, frontal”) and (“Auto, Polizei”), respectively. The shortest path between two words is determined in two steps: (1) From each word, a path is traced to the overarching bubble labeled “ROOT”. For each bubble traversed, the path records first the type of the bubble (see Section A.2) and then the role that the bubble occupies within its superordinate bubble (see Section A.3). This yields a string that begins with the word in question and ends with the keyword “ROOT”. (2) The two strings are compared to find the shared keyword that lies farthest from “ROOT”: starting at “ROOT” and moving toward the words, the strings agree keyword by keyword until they diverge, and the last keyword on which they agree is the connecting keyword. Both strings are truncated after this keyword, one of them is reversed, and the two are concatenated so that the connecting keyword appears once in the middle. The resulting string has the two words as its endpoints.

Figure A1. A bubble graph of a single clause containing the substantive “Auto” and the adjective “frontal”, which are connected by a grammatical bridge.

Figure A2. A bubble graph of two connected clauses, where the substantive “Auto” occurs in one clause and the substantive “Polizei” in the other. These substantives are connected by a grammatical bridge spanning the two clauses.

We illustrate these steps with two examples. Figure A1 shows a simplified version of a bubble graph connecting the word pair (“Auto, frontal”). The bubble graph does not show the parse result of a whole sentence; instead, only the two words in question are shown together with the nested bubble structure. The underlying sentence could be, for example, the following:

“Das Auto prallt frontal gegen die Mauer”.
(“The car crashes frontally against the wall”.)

However, the simplified graph represents every sentence that exhibits the same relationship between the word pair (“Auto, frontal”).

The first step builds a string from each of the two words to “ROOT”, resulting in two strings for this example:

“Auto SUB REGNOUN NOM FULLCLAUSE MAINCLAUSE NEUTRAL ROOT”,
“frontal ADJ FULLCLAUSE MAINCLAUSE NEUTRAL ROOT”.

The shared keyword is “FULLCLAUSE”, resulting in the concatenated string:

“Auto SUB REGNOUN NOM FULLCLAUSE ADJ frontal”.

The grammatical bridge expresses the relationship that “Auto” is the subject of the sentence (in the nominative case) and is associated with a verb modified by the adjective “frontal”.
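The two-step construction described above can be sketched in a few lines of Python. This is a minimal illustration and not the parser's actual implementation: the function name `build_bridge` is hypothetical, and the sketch works purely on the keyword strings, assuming that identical keywords at the same distance from “ROOT” refer to the same bubble (a real implementation would presumably compare the bubbles themselves in the bubble graph).

```python
from typing import List


def build_bridge(path_a: str, path_b: str) -> str:
    """Join two word-to-ROOT strings into one grammatical bridge (sketch)."""
    a: List[str] = path_a.split()
    b: List[str] = path_b.split()

    # Step 2a: walk inward from ROOT and count the keywords both paths share.
    # Index 0 holds the word itself, so it never takes part in the comparison.
    shared = 0
    while (shared < min(len(a), len(b)) - 1
           and a[-1 - shared] == b[-1 - shared]):
        shared += 1
    if shared == 0:
        raise ValueError("paths do not end in a common keyword such as ROOT")

    # Step 2b: truncate both paths after the connecting keyword, i.e. the
    # shared keyword farthest from ROOT. The first path keeps that keyword,
    # the second drops it so that it appears only once in the bridge.
    left = a[:len(a) - shared + 1]
    right = b[:len(b) - shared]

    # Step 2c: reverse the second path and concatenate.
    return " ".join(left + right[::-1])


# Step 1 is assumed to have produced the two strings of the first example.
auto = "Auto SUB REGNOUN NOM FULLCLAUSE MAINCLAUSE NEUTRAL ROOT"
frontal = "frontal ADJ FULLCLAUSE MAINCLAUSE NEUTRAL ROOT"
print(build_bridge(auto, frontal))
# → Auto SUB REGNOUN NOM FULLCLAUSE ADJ frontal
```

Because the type keyword precedes the role keywords in each string, truncating after the connecting keyword automatically discards the role of the connecting bubble.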

The second example is illustrated by the simplified bubble graph in Figure A2. The underlying sentence could be, for example, the following:

“Die Diebe geben dem Auto eine neue Farbe, damit die Polizei es nicht erkennt”.
(“The thieves give the car a new color so that the police don’t recognize it”.)

However, the simplified graph also represents other sentences with the same grammatical relationship between the word pair (“Auto, Polizei”). The string from “Auto” to “ROOT” is

“Auto SUB REGNOUN DAT FULLCLAUSE MAINCLAUSE NEUTRAL ROOT”,

and the string from “Polizei” to “ROOT” is

“Polizei SUB REGNOUN NOM FULLCLAUSE SUBCLAUSE CONBEGIN ROOT”.

The only shared keyword in this example is “ROOT”, resulting in the grammatical bridge:

“Auto SUB REGNOUN DAT FULLCLAUSE MAINCLAUSE NEUTRAL ROOT CONBEGIN SUBCLAUSE FULLCLAUSE NOM REGNOUN SUB Polizei”.

The grammatical bridge illustrates a relationship in which “Auto” is the indirect object (dative case) of a main clause, which is connected to a subordinate clause by a conjunction. In the subordinate clause, “Polizei” functions as the subject (in the nominative case).

In each string, the keyword for the bubble type is noted before the keywords for the role that the bubble plays in its superordinate bubble. As a consequence, the role of the connecting bubble is cut off when the strings are truncated, which reflects the fact that this role is irrelevant to the relationship. For instance, in Figure A1, the relationship would remain the same regardless of whether the connecting bubble of type “FULLCLAUSE” is a main clause, with the corresponding role “MAINCLAUSE”, or a subordinate clause, with the corresponding role “SUBCLAUSE”, because the connection occurs within the “FULLCLAUSE” bubble itself.

Appendix A.5. Sure and Unsure Grammatical Bridges

Parsing a sentence may result in multiple interpretations, of which only one is correct. However, it is not always possible to remove all false interpretations. Each remaining interpretation then generates its own set of grammatical bridges. This raises the concern that some grammatical bridges may be unreliable because they originate from incorrect interpretations. However, not all word-to-word relationships are affected by the differences between interpretations; many remain consistent across all of them.

To address this, the grammatical bridges are divided into two databases, “sure” and “unsure”. The “sure” database contains the grammatical bridges that are consistent across all interpretations, meaning that every interpretation of the sentence in question exhibits the same bridge. Bridges that occur in only some of the interpretations fall into the “unsure” database. Since the parsing algorithm analyzes every ambiguity, the “sure” bridges can be considered unambiguous and reliably present in the sentence: the ambiguities that could not be resolved do not affect them.

For our analysis, we used only the “sure” database, that is, only those bridges that the parser identified unambiguously.
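The separation into “sure” and “unsure” bridges amounts to a set intersection over the interpretations of a sentence. The following Python sketch illustrates the idea under stated assumptions: the function name `split_bridges` is hypothetical, each interpretation is represented simply as a set of bridge strings, the example bridges are invented for illustration, and treating every bridge outside the intersection as “unsure” is our reading of the description rather than a documented detail of the parser.

```python
from typing import Iterable, Set, Tuple


def split_bridges(interpretations: Iterable[Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split the bridges of one sentence into (sure, unsure) sets (sketch)."""
    sets = [set(bridges) for bridges in interpretations]
    if not sets:
        return set(), set()
    # "Sure" bridges appear in every remaining interpretation.
    sure = set.intersection(*sets)
    # Assumption: all other bridges, which appear in only some
    # interpretations, are treated as "unsure".
    unsure = set.union(*sets) - sure
    return sure, unsure


# Two invented interpretations of the same sentence (illustrative only).
interpretation_1 = {
    "Auto SUB REGNOUN NOM FULLCLAUSE ADJ frontal",
    "Auto SUB REGNOUN NOM FULLCLAUSE ACC REGNOUN SUB Mauer",
}
interpretation_2 = {
    "Auto SUB REGNOUN NOM FULLCLAUSE ADJ frontal",
    "Auto SUB REGNOUN NOM FULLCLAUSE DAT REGNOUN SUB Mauer",
}

sure, unsure = split_bridges([interpretation_1, interpretation_2])
print(sorted(sure))
# → ['Auto SUB REGNOUN NOM FULLCLAUSE ADJ frontal']
```

A sentence with a single remaining interpretation contributes all of its bridges to the “sure” set, since the intersection of one set is the set itself.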

References

OpenAI. ChatGPT (GPT-4, GPT-3.5, DALLE, etc.). 2024. Large Language Models. Available online: https://openai.com (accessed on 12 December 2024).
Tian, E. GPTZero: AI Content Detection Tool. 2024. Available online: https://gptzero.me (accessed on 12 December 2024).
Copyleaks Ltd. Copyleaks: AI-Based Plagiarism & AI Content Detection. 2024. Available online: https://copyleaks.com (accessed on 12 December 2024).
Crossplag. Crossplag: AI Content Detection Tool. 2024. Available online: https://crossplag.com (accessed on 12 December 2024).
HIX.AI. AI Humanizer. 2024. Available online: https://bypass.hix.ai (accessed on 12 December 2024).
Strübbe, S.; Sidorenko, I.; Grünwald, A.; Lampe, R. Semantic approach for solving the decision problem in natural language. J. Phys. Conf. Ser. 2023, 2514, 012019.
Strübbe, S.M.; Grünwald, A.T.; Sidorenko, I.; Lampe, R. A Rule-Based Parser in Comparison with Statistical Neuronal Approaches in Terms of Grammar Competence. Appl. Sci. 2024, 15, 87.
Nivre, J. Dependency parsing. Lang. Linguist. Compass 2010, 4, 138–152.
Ferrer i Cancho, R.; Solé, R.V.; Köhler, R. Patterns in syntactic dependency networks. Phys. Rev. E-Stat. Nonlinear Soft Matter Phys. 2004, 69, 051915.
Zhang, M. A survey of syntactic-semantic parsing based on constituent and dependency structures. Sci. China Technol. Sci. 2020, 63, 1898–1920.
Robinson, J.J. Dependency structures and transformational rules. Language 1970, 46, 259–285.
Opara, C. StyloAI: Distinguishing AI-generated content with stylometric analysis. In Proceedings of the International Conference on Artificial Intelligence in Education, Recife, Brazil, 8–12 July 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 105–114.
Alamleh, H.; AlQahtani, A.A.S.; ElSaid, A. Distinguishing human-written and ChatGPT-generated text using machine learning. In Proceedings of the 2023 Systems and Information Engineering Design Symposium (SIEDS), Charlottesville, VA, USA, 27–28 April 2023; pp. 154–158.
Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 1988, 24, 513–523.
Guo, X.; He, Y.; Zhang, S.; Zhang, T.; Feng, W.; Huang, H.; Ma, C. Detective: Detecting ai-generated text via multi-level contrastive learning. Adv. Neural Inf. Process. Syst. 2024, 37, 88320–88347.
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
Muñoz-Ortiz, A.; Gómez-Rodríguez, C.; Vilares, D. Contrasting linguistic patterns in human and llm-generated text. arXiv 2023, arXiv:2308.09067.
Goldhahn, D.; Eckart, T.; Quasthoff, U. Building large monolingual dictionaries at the leipzig corpora collection: From 100 to 200 languages. In Proceedings of the LREC, Istanbul, Turkey, 21–27 May 2012; Volume 29, pp. 31–43.
Lezius, W. Morphy-German morphology, part-of-speech tagging and applications. In Proceedings of the 9th EURALEX International Congress, Stuttgart, Germany, 8–12 August 2000; pp. 619–623.
DeepL GmbH. DeepL Translator. 2024. Available online: https://www.deepl.com (accessed on 12 December 2024).
Schmieder Übersetzungen & Sprachmanagement. DeepL vs. ChatGPT: Wer Liefert Die Bessere Übersetzung? 2025. Available online: https://www.schmieder-uebersetzungen.de/blog/detail/deepl-vs-chatgpt (accessed on 19 March 2025).
DOG GmbH. DeepL—Ein Erfahrungsbericht. n.d. Available online: https://www.dog-gmbh.de/blog/deepl-ein-erfahrungsbericht (accessed on 19 March 2025).
Strübbe, S.; Sidorenko, I.; Roy, S.; Grünwald, A.T.; Lampe, R. Development and verification of a user-friendly software for German text simplification focused on patients with cerebral palsy. Nat. Lang. Process. J. 2023, 3, 100012.

Figure 1. Dependency parse tree for the sentence “The cat chased the mouse”.

Figure 2. Grouped grammatical markers of human-written and LLM-generated corpora. The x-axis shows the eight chosen grammatical markers for the different corpora. The y-axis shows the percentage of the corresponding truncated bridges, normalized to all substantive–substantive bridges found in the corpus.

Figure 3. Cosine similarity between the corpus under investigation and six LLM-generated corpora (green curves) and five Wortschatz Leipzig [18] corpora (red curves). (A) The corpus under investigation is a human-written News-Crawl corpus from the Wortschatz Leipzig [18]. (B) The corpus under investigation was generated by GPT4o.

Table 1. The eight marker bridges and their abbreviations (Explanations of terms: SUB = substantive; NOM = nominative object; ACC = accusative object; DAT = dative object; FULLCLAUSE = connecting bubble is a clause; NOUNBLOCK = connecting bubble is an enumeration of nouns).

Full-Bridge	Abbreviated Bridge
SUB REGNOUN NOM FULLCLAUSE ACC REGNOUN SUB	IND-NOM-ACC
SUB REGNOUN NOM FULLCLAUSE DAT REGNOUN SUB	IND-NOM-DAT
SUB REGNOUN ACC FULLCLAUSE DAT REGNOUN SUB	IND-ACC-DAT
SUB REGNOUN ACC FULLCLAUSE ACC REGNOUN SUB	IND-ACC-ACC
SUB REGNOUN DAT FULLCLAUSE DAT REGNOUN SUB	IND-DAT-DAT
SUB REGNOUN NOM NOUNBLOCK NOM REGNOUN SUB	BLOCK-NOM
SUB REGNOUN ACC NOUNBLOCK ACC REGNOUN SUB	BLOCK-ACC
SUB REGNOUN DAT NOUNBLOCK DAT REGNOUN SUB	BLOCK-DAT
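
Table 1 can be read as a lookup from a truncated bridge to its abbreviation. The short Python sketch below shows how bridges might be counted per marker. It rests on assumptions that are not taken from the article: the names `MARKERS`, `marker_for`, and `count_markers` are hypothetical, a truncated bridge is assumed to be the full bridge without its two endpoint words, only the keyword order listed in Table 1 is matched, and the endpoint words in the example are placeholders.

```python
from collections import Counter
from typing import Iterable, Optional

# The eight marker bridges and abbreviations as listed in Table 1.
MARKERS = {
    "SUB REGNOUN NOM FULLCLAUSE ACC REGNOUN SUB": "IND-NOM-ACC",
    "SUB REGNOUN NOM FULLCLAUSE DAT REGNOUN SUB": "IND-NOM-DAT",
    "SUB REGNOUN ACC FULLCLAUSE DAT REGNOUN SUB": "IND-ACC-DAT",
    "SUB REGNOUN ACC FULLCLAUSE ACC REGNOUN SUB": "IND-ACC-ACC",
    "SUB REGNOUN DAT FULLCLAUSE DAT REGNOUN SUB": "IND-DAT-DAT",
    "SUB REGNOUN NOM NOUNBLOCK NOM REGNOUN SUB": "BLOCK-NOM",
    "SUB REGNOUN ACC NOUNBLOCK ACC REGNOUN SUB": "BLOCK-ACC",
    "SUB REGNOUN DAT NOUNBLOCK DAT REGNOUN SUB": "BLOCK-DAT",
}


def marker_for(bridge: str) -> Optional[str]:
    """Return the Table 1 abbreviation for a full bridge, or None."""
    # Assumption: truncation removes the two endpoint words.
    truncated = " ".join(bridge.split()[1:-1])
    return MARKERS.get(truncated)


def count_markers(bridges: Iterable[str]) -> Counter:
    """Count how often each of the eight markers occurs in a list of bridges."""
    return Counter(m for m in map(marker_for, bridges) if m is not None)


# Placeholder endpoint words; only the keywords in between matter here.
example = [
    "WordA SUB REGNOUN NOM FULLCLAUSE ACC REGNOUN SUB WordB",
    "WordA SUB REGNOUN NOM FULLCLAUSE ADJ WordC",
]
print(count_markers(example))
# → Counter({'IND-NOM-ACC': 1})
```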

Table 2. The eight marker bridges and their occurrence in the GPT4o corpus containing 2605 sentences.

Abbreviated Bridge	Number of Occurrences	Normalized Occurrences in Percent
IND-NOM-ACC	104	3.2%
IND-NOM-DAT	130	4.0%
IND-ACC-DAT	128	4.0%
IND-ACC-ACC	16	0.5%
IND-DAT-DAT	58	1.8%
BLOCK-NOM	47	1.5%
BLOCK-ACC	216	6.7%
BLOCK-DAT	136	4.2%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
