Syllable-, Bigram-, and Morphology-Driven Pseudoword Generation in Greek

Kosmidis, Kosmas; Apostolouda, Vassiliki; Revithiadou, Anthi

doi:10.3390/app15126582


by Kosmas Kosmidis ¹,*,†, Vassiliki Apostolouda ²,† and Anthi Revithiadou ²,†

¹ Department of Physics, Aristotle University of Thessaloniki, University Campus, 54124 Thessaloniki, Greece

² School of Philology, Department of Linguistics, Aristotle University of Thessaloniki, University Campus, 54124 Thessaloniki, Greece

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Appl. Sci. 2025, 15(12), 6582; https://doi.org/10.3390/app15126582

Submission received: 29 April 2025 / Revised: 27 May 2025 / Accepted: 6 June 2025 / Published: 11 June 2025

(This article belongs to the Special Issue Computational Linguistics: From Text to Speech Technologies)



Featured Application

SyBig-r-Morph is a versatile tool for generating pseudowords designed for Greek, but it can be easily modified to work with any language. By allowing researchers to produce phonotactically and morphologically well-formed pseudowords that are specifically tailored to particular morphosyntactic categories, such as nouns or verbs, it overcomes the shortcomings of current multilingual generators. This tool is especially valuable for designing controlled linguistic experiments, including studies on stress assignment, lexical access, and morphophonological and lexical processing. By serving as an important link between orthographic representation and phonological realization—an important step in the text-to-speech pipeline—SyBig-r-Morph offers a valuable tool for psycholinguistic research, computational phonology, and speech synthesis applications that require linguistically authentic pseudoword stimuli.

Abstract

Pseudowords are essential in (psycho)linguistic research, offering a way to study language without meaning interference. Various methods for creating pseudowords exist, but each has its limitations. Traditional approaches modify existing words, risking unintended recognition. Modern algorithmic methods use high-frequency n-grams or syllable deconstruction but often require specialized expertise. Currently, no automatic process for pseudoword generation is designed explicitly for Greek, which is our primary focus. Therefore, we developed SyBig-r-Morph, a novel application that constructs pseudowords using syllables as the main building block, replicating Greek phonotactic patterns. SyBig-r-Morph draws input from word lists and databases that include syllabification, word length, part of speech, and frequency information. It categorizes syllables by position to ensure phonotactic consistency with user-selected morphosyntactic categories and can optionally assign stress to generated words. Additionally, the tool uses multiple lexicons to eliminate phonologically invalid combinations. Its modular architecture allows easy adaptation to other languages. To further evaluate its output, we conducted a manual assessment using a tool that verifies phonotactic well-formedness based on phonological parameters derived from a corpus. Most SyBig-r-Morph words passed the stricter phonotactic criteria, confirming the tool’s sound design and linguistic adequacy.

Keywords:

automatic pseudoword generation; Greek phonotactics; n-gram; syllable; morphosyntactic category; language resources for Greek

1. Introduction

A pseudoword is constructed with proper linguistic structure but lacks meaning [1]. Pseudowords adhere to a language’s phonotactic and orthographic rules and are typically characterized by the absence of frequency statistics since they are not part of the language’s lexicon [2]. Pseudowords are used in linguistic research by being integrated into lexical decision tasks, production and perception experiments, vocabulary evaluations, etc. For instance, in morphology experiments, the renowned pseudoword ‘wug’ is often used to assess the productivity of plural rules [3], revealing the implicit knowledge children have acquired about the morphology of their language, i.e., English. In a broader context, pseudowords have been extensively used in phonetic decoding [4] and visual word recognition, including lexical decision and naming tasks [5]. Furthermore, pseudowords help assess the credibility of learners’ responses in non-native vocabulary tests [6,7,8,9]. They can also function as stimuli to examine the mechanisms within cognitive models of reading [10,11,12,13]. Pseudowords are also employed in standardized tests (e.g., PAT—Phonological Awareness Test, TOWRE—Test of Word Reading Efficiency) to assess a speaker’s skills in phonological processing, decoding ability, and overall literacy [14,15].

Several methods have been proposed for the construction of pseudowords, many of which are accompanied by databases and software applications designed to streamline pseudoword generation on a large scale. However, most of the methods applied for pseudoword construction, whether conducted manually or with the help of specific applications or software, come with certain limitations. For instance, they often require a high level of proficiency in the language and/or an in-depth understanding of its phonotactics, syllable structure, and orthographic conventions that researchers may not have.

In the context of such experimental applications, achieving reliable results depends on the careful development of appropriate pseudowords. The pseudowords must conform to the phonotactic rules of the language in the sense that they could potentially be real words while also meeting the researchers’ experiment-specific criteria. Depending on the nature of the experiment, these pseudowords may need, for example, to closely resemble high or low-frequency words or, conversely, avoid bearing too strong a resemblance to existing words.

After first reviewing some existing approaches for generating pseudowords and the challenges they face at the empirical level, we introduce our proposal for constructing and evaluating pseudowords in Greek, a language lacking a concrete methodology for pseudoword construction. Specifically, we present SyBig-r-Morph, a tool we developed for pseudoword construction that uses syllables, bigrams, orthographic conventions, and morphological information to construct pseudowords. This application can serve as a valuable resource for experimental tasks aimed at studying the phonological characteristics of the language.

In Greek, stress assignment is heavily determined by lexical factors, with any of the last three syllables of a phonological word potentially serving as a stress location, as exemplified by words like [ˈθo.ɾi.vos] ‘noise’, [zo.ˈɣra.fos] ‘painter’, [po.ta.ˈmos] ‘river’ [16,17,18]. For illustrative purposes in this article, we focus on constructing pseudonouns for experimental tasks that explore how young and adult Greek speakers apply stress rules. As suggested by relevant research [19,20,21], we anticipate that the distribution of the three grammatical stress patterns will be affected by the size and morphological class of the noun. Therefore, the following criteria are essential when constructing pseudonouns. Firstly, they must strictly adhere to the phonotactic rules of the specific language to prevent any resemblance to foreign words. Secondly, the constructed nouns should avoid significant similarity to real words, which could influence young and adult participants when choosing stress patterns.

The remainder of the article is organized as follows: Section 2 provides an overview of current methods and applications for creating pseudowords in various languages, including Greek. Section 3 introduces SyBig-r-Morph, highlighting its important innovations and distinct features. Section 4 presents the outcomes of an additional evaluation process that we applied to words generated by SyBig-r-Morph to assess its suitability for the hypothesized experimental task. Section 5 provides a discussion and concludes this article.

2. Related Work

2.1. Exploring Current Methods and Systems for Pseudoword Generation

In the construction of pseudowords, numerous studies have followed a well-established tradition in which pseudowords are generated by modifying one or more letters or segments of an existing word (e.g., CLEAM derived from CLEAN; [22]). Balota et al. [23] developed the English Lexicon Project, which consisted of 40,481 words and was intended for a lexical decision task. An equivalent number of pseudowords was also generated by modifying one or two letters from the original words, such as addomen from abdomen. However, visual similarity between the pseudoword and the original word poses a risk because participants can occasionally recognize the original word within specific pseudowords. New et al. [24] highlight that this unintended recognition of the original word may introduce an experimental priming bias that researchers are unable to control.

In recent years, researchers have developed more sophisticated methods for pseudoword construction. König et al. [25] identified three basic strategies, that is, stimulus manipulation, high-frequency bigrams, and a combination of sub-syllabic elements. The first method, i.e., stimulus manipulation, involves taking a word present in the lexicon and modifying it in various ways, e.g., by changing one or two characters through insertion, deletion, transposition, or replacement, to produce a pseudoword. Alternatively, a pseudoword can be created by adding an affix to the stimulus, provided that the outcome does not match an existing word in the lexicon [26].
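The single-character operations named above (insertion, deletion, transposition, replacement) can be illustrated with a short Python sketch. It is not taken from any of the cited tools; the function names and the toy lexicon are assumptions made for illustration, and the default alphabet is Latin only because the running example is English.

```python
import string


def single_edit_variants(word, alphabet=string.ascii_lowercase):
    """Return every string exactly one insertion, deletion,
    transposition, or replacement away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replacements = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    insertions = {a + c + b for a, b in splits for c in alphabet}
    variants = deletions | transpositions | replacements | insertions
    variants.discard(word)  # the unchanged stimulus is not a candidate
    return variants


def pseudoword_candidates(word, lexicon):
    """Drop variants that are themselves existing words (hypothetical helper)."""
    return sorted(single_edit_variants(word) - set(lexicon))


# Toy lexicon, invented for illustration only.
toy_lexicon = {"clean", "clear", "cleat", "lean"}
candidates = pseudoword_candidates("clean", toy_lexicon)
```

Most strings produced this way are not phonotactically plausible, which is one reason this strategy still depends on the researcher's knowledge of the language.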

The second method creates pseudowords using high-frequency bigrams. This approach employs software applications like WordGen [27] that take into account neighborhood size and orthographic similarity factors. WordGen generates pseudowords in English, Dutch, German, and French using the CELEX [28] and Lexique [29] databases. The primary steps in the pseudoword generation process involve creating random letter sequences and then determining, based on several predetermined characteristics, whether these sequences constitute existing words or not. Any deviation from these qualities leads to the rejection of the letter sequence.

However, strict WordGen parameters can prolong candidate search times. Furthermore, its use may be challenging for researchers unfamiliar with sub-lexical statistics, especially when configuring effective pseudoword parameters for the experimental material’s design [24].

The character-gram chaining algorithm (CGCA, [25]) is another tool that processes wordlists or words from a corpus. It begins by forming a wordlist by extracting all unique tokens from the input (i.e., the original wordlist or corpus). It then identifies character-grams (bigrams, trigrams, four-grams, etc.) and assigns them to three potential positions, namely, the word’s start, middle, or end. Subsequently, it generates and validates chains of character-grams. The outcome is a list of pseudowords linked explicitly to the words used for their construction. CGCA is a script executed via the terminal, which is a feature that makes it less user-friendly for researchers unfamiliar with such interfaces.

UniPseudo also generates pseudowords from words given by the user (in the form of a wordlist or a corpus) and is accessible through a web interface [24,30]. Currently, it can generate pseudowords for 64 languages. Like CGCA, the algorithm is based on Markov chains: it extracts all trigrams (or bigrams, depending on the length of the requested pseudoword) at a given position from the input set of words. It begins by randomly choosing a trigram from words that have the trigram sequence in word-initial position, e.g., gla from glared, while respecting the trigram frequencies in the original corpus. Then, it randomly selects a second trigram that starts with the two letters that conclude the first one (here, la), choosing from input words that have the trigram sequence in the second position, e.g., lar from glared. This iterative procedure continues until the last trigram is reached, e.g., ard from guards and rds from gourds, which yields glards in our example. The generated words are displayed in a downloadable table format (XLS file). The table presents pseudowords in the first column, while the words used to create the pseudowords are listed in subsequent columns. Trigrams (or bigrams) from each input word, which played a role in forming the pseudoword, are color-coded for clarity.
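The chaining procedure just described can be sketched in a few lines of Python. This is a simplified reconstruction from the description above, not UniPseudo's actual code: the function names are hypothetical, trigram frequencies are approximated by keeping duplicates in the sampling lists, and no check against existing words is performed.

```python
import random
from collections import defaultdict


def positional_trigrams(words, length):
    """Map each position to the trigrams observed there in words of the
    given length. Duplicates are kept so that sampling follows frequency."""
    table = defaultdict(list)
    for w in words:
        if len(w) == length:
            for i in range(length - 2):
                table[i].append(w[i:i + 3])
    return table


def chain_pseudoword(words, length, rng=random):
    """Chain position-specific trigrams that overlap by two letters.
    Returns None when the chain cannot be started or completed."""
    table = positional_trigrams(words, length)
    if not table[0]:
        return None
    out = rng.choice(table[0])
    for pos in range(1, length - 2):
        options = [t for t in table[pos] if t[:2] == out[-2:]]
        if not options:
            return None
        out += rng.choice(options)[2]  # append only the new third letter
    return out


source_words = ["glared", "guards", "gourds"]
sample = chain_pseudoword(source_words, 6, random.Random(0))
```

With these three source words the chain can only produce glared, glards, guards, guared, or gourds, which shows why a subsequent filter against the input lexicon is needed.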

The third method for pseudoword construction is based on the prosodic unit of the syllable. A syllable typically consists of a vowel (nucleus) and an onset (the consonant that precedes the vowel). In some languages, it can also include a coda constituent (the consonant following the vowel). In this approach, syllabic components are extracted from existing words and are reassembled to create pseudowords. The ARC Nonword Database [31] includes over 350,000 monosyllabic pseudowords created by a combination of onsets and nucleus-coda sequences based on the sound associations of the CELEX database.

Operating on syllabified word lists, Wuggy is a pseudoword generator designed to create nonwords in Basque, Dutch, English, French, German, Serbian, Spanish, and Vietnamese for (psycho)linguistic research [32]. It breaks down words into their syllabic elements (onset, nucleus, coda) and then reconstructs them to form pseudowords. The program creates monosyllabic pseudowords from monosyllabic words and disyllabic pseudowords from disyllabic words by implementing different grammatical patterns based on syllable count. For instance, from the English word border, syllabified as bor.der, the algorithm generates the pseudowords bowper, bowmer, curder, besder, etc. Wuggy is based on bigram chains. However, it operates on segments rather than individual letters, thus treating sequences like dge in bridge as units. It also maintains frequency-based constraints by excluding elements that deviate significantly from specified reference frequencies [32]. A welcome result of keeping the frequency differences minimal is that high-frequency segments are replaced by other high-frequency segments and low-frequency segments by other low-frequency ones.

However, Wuggy’s limitation compared to CGCA and UniPseudo is its inability to generate words with the same endings as the input, which could be problematic for studies requiring pseudowords within specific morphosyntactic categories. To work around this, Dołżycka et al. [33] presented a method for generating pseudowords using Wuggy. Acknowledging the constraints of the algorithm in constructing pseudowords with suffixes of distinct morphosyntactic categories, they supplement it with language-specific rules grounded in real-language data. To facilitate the process, they provide a Python script designed for the creation of Polish pseudo-verbs and pseudonouns.

It is evident from the discussion above that each of the established methods for generating pseudowords comes with its own set of limitations. Constructing a pseudoword list necessitates a good grasp of the language in question, including a deep knowledge of its morphosyntax and phonology and, especially, of its sound inventory, phonotactics, and syllable structure. Manipulating a word stimulus requires knowledge of which segments can be inserted, deleted, or rearranged in a way that allows the resulting pseudoword to be perceived as phonologically and morphologically plausible in the language at hand. Likewise, using high-frequency bigrams demands familiarity with the permissible phonotactic combinations of a language. Moreover, reconfiguring sub-syllabic components requires a strong understanding of syllabification and the grammatical transitions between adjacent syllables. As König et al. [25] astutely observed, software tools often require researchers to apply criteria, such as the number of neighbors, word frequency, and n-gram frequency, to limit the number of pseudowords generated, which may make the use of such tools quite demanding.

Furthermore, applications like the ones described above do not always prioritize assessing whether the created pseudowords can be readily identified (in terms of phonotactic and orthographic restrictions) as valid or well-formed words in the target language. An exception is the CGCA algorithm, which establishes criteria for verifying the legality of the generated pseudowords [25]. This involves checking whether all letter sequences in a pseudoword exist within words found in the original wordlist.

In conclusion, methods like stimulus manipulation, high-frequency bigrams, and sub-syllabic element combinations have significantly advanced the development of pseudowords for linguistic research. However, researchers must be familiar with a language’s grammar, especially phonology, to use these methods effectively. While some tools, like CGCA, prioritize pseudoword well-formedness, others may not. The research on pseudoword construction methods emphasizes the necessity of developing reliable tools that will be broadly and freely available to researchers.

2.2. Constructing Pseudowords in Greek: Methodological Considerations

While a substantial body of work has been dedicated to developing software tools for languages such as English, French, and German, the same cannot be said for Greek, which presents unique challenges such as unpredictable stress (e.g., [16,17]) and rich affixation. To date, the literature on Greek has presented different manual methods for pseudoword construction, which vary depending on the research focus.

A common practice of pseudoword construction involves changing a real word by altering letters or segments. Tsapkini et al. [34] created pseudo-stems by changing the first or first two consonants of an existing word to examine the lexical access of verbs and nouns in Greek. The same method was employed by Diamanti et al. [35], who investigated the development of morphological awareness in Greek children. Pseudowords were created by replacing vowels and consonants in the stem of actual words while preserving the original existing word’s phonological structure, stress pattern, and inflectional ending. Similarly, in their study on the acquisition of past tense by children with specific language impairment, Varlokosta & Nerantzini [36] constructed pseudo-verbs based on real verbs by changing two consonants in terms of place or manner of articulation. In Manouilidou & Stockall’s [37] research on morphosyntactic processing, pseudowords were created by selecting a noun as the base to form a verb instead of the expected verbal root. Consequently, the root of the constructed word is joined with suffixes that differ from those it usually combines with in existing words.

Protopapas et al. [12,13], in their study on stress assignment in Greek reading, created novel words by modifying vowels and consonants in terms of various features, including place of articulation, manner of articulation, and voicing. They assigned distinct points for each change to assess the degree of modification applied to existing words when creating pseudowords. The cumulative change score served as a metric to evaluate the similarity of resulting pseudowords to actual words, allowing their classification into low- and high-similarity groups. Native speakers then evaluated these words and attempted to identify the original word. Based on the success or failure of the identification task, the researchers selected suitable pseudowords for use in their experimental tasks.

While the primary objective in some of these studies is to reduce the influence of lexical effects by generating pseudowords that bear little resemblance to actual words, an important factor is often overlooked, namely, the phonological structure and, primarily, the frequency of segmental or syllabic strings that the pseudoword consists of [38]. Defining this factor can be challenging without the help of specialized tools or software, impeding the quantification or assessment of phonological well-formedness.

A recent body of research [20,39,40] has considered quantifying measures related to the phonological structure of words. These studies use psycholinguistic resources, such as the Num Tool [41], which is described in Section 4, to assess their manually constructed pseudonouns and pseudo-verbs.

The use of specialized software for pseudoword creation in Greek is practically non-existent except for UniPseudo, as discussed in Section 2.1. This tool was developed to create pseudowords in numerous languages, including Greek, based on a user-provided wordlist or the associated WordLex corpus [42,43].

We used the software and employed the embedded source database, instructing the tool to create forty pseudowords, twenty of which had four characters and the remaining twenty had six characters. The algorithm constructs pseudowords using bigrams and/or trigrams. We also set the algorithm to limit consecutive consonants to two and include only one letter with a diacritic. The latter condition was critical because Greek orthography employs an acute accent to indicate stress on the vowel.

A representative list of the generated pseudowords is provided in Appendix A. However, the pseudowords generated pose several significant issues, which we briefly outline below:

Accent marks: Some pseudowords displayed either two accent marks (e.g., ώμήν [ˈoˈmin], ώμός [ˈoˈmos]) or none at all (e.g., ατμη [atmi], υμοι [imi]), in violation of the orthographic rules of Greek.
Phonotactic challenges: The software occasionally generated pseudowords with marginal or rare phonotactic patterns due to their association with specific lexical items (e.g., ατμη [atmi] based on the source word ατμός [atˈmos] ‘steam’). Often, the somewhat unconventional phonotactics of the produced words were due to their association with a word of Ancient Greek origin (e.g., μάδμοι [ˈmaðmi] based on the Ancient Greek name Κάδμος [ˈkaðmos] ‘Cadmus’).
Orthographic conventions: UniPseudo did not consistently adhere to Greek orthographic conventions. For instance, it cannot recognize that the letter sequences μπ and οι correspond to single segments, [b] and [i], respectively. Consequently, while some six-character pseudowords consisted of six distinct segments, e.g., μαρμών [maɾˈmon], others consisted of four, e.g., μπάμοι [ˈbami].
Phonotactic legality: Automatic bigram settings for four-character words occasionally delivered phonotactically illegal pseudowords, e.g., νωμπ [nob], πλμα [plma], τρμα [tɾma], στμω [stmo]. This problem can be addressed using the ‘trigram setting’ when creating four-character words.
Bias for a specific morphosyntactic category: When tested, UniPseudo showed a strong tendency to generate nouns rather than other parts of speech, particularly verbs. We attempted to correct this imbalance by providing our wordlist, which contained an equal number of nouns and verbs (1000 each), as the input source. Despite this intervention, we observed that the bias persisted, as the software consistently reused the same limited set of base words.
Repeated base-word usage and homonymy: Despite having access to a substantial input lexicon, UniPseudo frequently recycles the same words as the foundation for creating pseudowords, rather than combining elements from different source words. For instance, the pseudoword ταπόρι [taˈpoɾi] is constructed from the words ταπέτο [taˈpeto] ‘mat’ and βαπόρι [vaˈpoɾi] ‘ship’. Interestingly, βαπόρι was used as a base word three times in this process, contributing to the trigrams απο, πορ, and ορι. As a result, the final pseudoword heavily depends on these repeated trigrams and risks resembling real words. Moreover, the tool occasionally generates pseudowords that are homonymous with actual Greek words due to its tendency to reuse identical base elements. For example, it produces the output άκουγε [ˈakuʝe], which is itself an existing word, derived from the base verbs ακούει [aˈkuj] ‘he/she listens to’ and άκουγε [ˈakuʝe] ‘he/she listened to’.
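Two of the issues listed above, the number of accent marks and the mismatch between letters and segments, can be checked mechanically. The following Python sketch is an illustration only and is not part of UniPseudo or SyBig-r-Morph: the function names are hypothetical, the digraph list is deliberately short and non-exhaustive, and the segment count ignores cases where a stressed first vowel blocks a digraph reading.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent, i.e., the Greek stress mark after NFD

# Assumption: a small, non-exhaustive set of digraphs that spell one segment.
DIGRAPHS = ("μπ", "ντ", "γκ", "ου", "αι", "ει", "οι")


def count_accents(word):
    """Count stress diacritics by decomposing accented vowels."""
    return unicodedata.normalize("NFD", word).count(ACUTE)


def strip_accents(word):
    """Remove stress diacritics while keeping any diaeresis in place."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))


def count_segments(word):
    """Rough segment count in which each listed digraph counts once."""
    plain = strip_accents(word.lower())
    count, i = 0, 0
    while i < len(plain):
        i += 2 if plain[i:i + 2] in DIGRAPHS else 1
        count += 1
    return count
```

Applied to the examples above, `count_accents` flags ώμήν (two marks) and ατμη (none), and `count_segments` distinguishes μαρμών (six segments) from μπάμοι (four segments), although both are six characters long.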

In conclusion, UniPseudo is a valuable tool for generating Greek pseudowords, but it exhibits several limitations, particularly regarding phonotactic rules and orthographic consistency.

The following section introduces SyBig-r-Morph, a word generator developed for Greek. The tool combines important linguistic elements, including syllabic structure, n-gram patterns, and morphological structure. By combining these elements, SyBig-r-Morph generates pseudowords that adhere to Greek phonotactics, morphological patterns, and spelling conventions. A distinguishing feature of this system is its built-in evaluation component, which automatically filters out illegitimate sound and letter combinations. This ensures that all generated pseudowords consistently follow the Greek morphophonological structure.

3. Methods

Our research focuses on constructing pseudonouns due to their diverse accentual patterns, which set them apart from other grammatical categories. We aim to build pseudonouns from various inflectional classes to be used in experimental tasks that explore how young and adult Greek speakers apply stress rules. Specifically, we strive to generate pseudowords from a specific lexicon, ensuring they are morphophonologically correct and distinct from existing words.

Currently, specialized software tools for the large-scale generation of Greek pseudowords are lacking, and, as discussed above, existing tools present significant challenges when applied to Greek. To address this gap, we introduce SyBig-r-Morph. The Flask-based web application (v2.3.3) requires Python 3.7+ for implementation and integrates two primary lexical resources: ILSP’s Clean Corpus [41,44] and GreekLex2.1 [45,46].

GreekLex2.1 offers detailed insights into Greek word forms and linguistic patterns by providing part-of-speech information, syllabification, phonetic and orthographic details for each entry, and metrics for word similarity. It was developed using entries from the Dictionary of Standard Modern Greek [47]. The database is also enhanced with a set of statistical metrics for these entries from the Hellenic National Corpus (HNC), an extensive compilation (over 97 million words) of written materials, including newspapers, books, periodicals, and other textual sources, developed by the Institute for Language and Speech Processing/R.C. Athena (ILSP) [48,49]. GreekLex2.1 encompasses around 35,000 lemmas found in both the dictionary and the HNC.

The Clean Corpus is an extensive collection of 217,664 word types and 29.6 million tokens from the HNC journalistic, legal, and literary texts. It provides essential information for each word, including syllabic structure, word length, stress pattern, and frequency. Moreover, it offers useful statistical information on metrics like average letters, phonemes, syllable frequencies, and counts of orthographic and phonological neighbors. Unfortunately, Clean lacks data on the morphological characteristics of words, which is particularly significant for studies focused on specific grammatical categories like nouns. In SyBig-r-Morph, we use the “Ignoring Stress” version, which provides the relevant data, including complete tables of syllables and their associated type and token frequencies, without taking stress diacritics into account.

To ensure optimal performance and access to the necessary linguistic data, both lexical resources are loaded during application startup. This initialization process, illustrated in Figure 1, involves loading the lexical data and accompanying information from GreekLex2.1, along with the Clean Corpus. Although the Clean Corpus is large (217,664 word types), loading typically completes in under a minute, after which users can begin pseudoword generation with full access to the capabilities of both databases.

Building on these lexical resources, the application’s technical workflow consists of two main components, that is, a generation component and an evaluation component. The generation component includes lexicon analysis, syllable segmentation, and the strategic reassembly of syllables based on their original positions within words. On the other hand, the evaluation component applies similarity filtering to ensure that the generated pseudowords are sufficiently distinct from existing Greek lexical items.

At the core of our application lies GreekLex2.1, which provides the words upon which the syllable decomposition mechanism will apply to create the syllable arrays for the pseudoword generation, along with essential morphosyntactic information and metrics for word frequency. However, any corpus, lexical database, or wordlist containing information on syllabic structure, word length, part of speech, and frequency can serve for this purpose.

SyBig-r-Morph’s pseudoword generation process begins by letting users specify the morphosyntactic category of the pseudowords they want, such as nouns, verbs, adjectives, adverbs, or prepositions, as shown in Figure 2. Users can even select all categories if they wish.

Additionally, users can opt to apply a frequency filter, using metrics such as Zipf frequency, which is included in GreekLex2.1. Operating within a range of 1.327 to 7.533 (based on GreekLex2.1 items), this parameter regulates the proximity of generated words to low- and high-frequency words. Lower values indicate low Zipf frequency, while higher values indicate high Zipf frequency.

In the present version of SyBig-r-Morph, we have set the default Zipf frequency value to 5 because we would like our hypothetical experimental materials to be equally distant from existing low- and high-frequency words. Raising this threshold means that fewer words pass the filter, resulting in a smaller pool of syllables available for word generation. Our testing showed that a high Zipf frequency setting resulted in a smaller set of word-final syllables, which led to the exclusion of some less productive inflectional classes. This feature is optional; users who do not wish to control for the frequency of the existing words that supply the syllables can choose not to apply the frequency filter. Once frequency parameters are established (or bypassed), users can define the structural characteristics of their pseudowords by specifying the desired word size and the number of pseudowords to generate. In its current settings, the application can create up to 1000 pseudowords.
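The category and frequency selection can be pictured with the following Python sketch. It assumes, based on the description above, that the Zipf value acts as a lower bound on the source words; the function name, the dictionary keys, and the Zipf values of the toy entries are all invented for illustration and do not reproduce GreekLex2.1 data or the application's code.

```python
def select_source_words(entries, pos=None, min_zipf=None):
    """Keep entries of the requested part of speech whose Zipf value
    is at least `min_zipf`. Passing None disables the respective filter."""
    selected = []
    for entry in entries:
        if pos is not None and entry["pos"] != pos:
            continue
        if min_zipf is not None and entry["zipf"] < min_zipf:
            continue
        selected.append(entry)
    return selected


# Toy entries: the Zipf values below are made up for this example.
toy_lexicon = [
    {"word": "ποταμός", "pos": "noun", "zipf": 5.2},
    {"word": "θόρυβος", "pos": "noun", "zipf": 4.6},
    {"word": "ζωγράφος", "pos": "noun", "zipf": 4.1},
    {"word": "γράφω", "pos": "verb", "zipf": 5.5},
]

frequent_nouns = select_source_words(toy_lexicon, pos="noun", min_zipf=5)
all_nouns = select_source_words(toy_lexicon, pos="noun")
```

In this toy case the threshold of 5 leaves a single noun, which mirrors the observation that a high setting shrinks the pool of syllables available for generation.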

Our application carefully filters the input lexicon and creates a dictionary by keeping only those words that match the specified syllable count. An innovation of SyBig-r-Morph is that it includes a syllabification component that breaks words into their respective syllables (sequences of onset(s) – nucleus – coda(s)). The syllabification stage is crucial because each word from the filtered list is deconstructed into constituent syllables. These syllables are then categorized based on their position in the word, such as first, second, third, etc., thus creating a foundation for constructing plausible pseudowords. Table 1 illustrates a compilation of syllables per position for trisyllabic words:

The syllable compilation shown in Table 1 reveals an important design feature of SyBig-r-Morph: all syllables are delivered without stress diacritics. This reflects the application’s systematic approach to stress handling. As a direct result, all words generated by SyBig-r-Morph are stressless. While this feature works well for studies like our hypothetical scenario that investigates speakers’ stress assignment patterns, it may pose challenges for other research that requires stressed outputs. To address this need, we have incorporated an option for assigning stress to the generated pseudowords. This feature will be discussed in more detail below.

Following the creation of the syllable inventory, SyBig-r-Morph systematically explores all potential syllable combinations to generate pseudoword candidates. However, not all theoretically possible combinations yield linguistically valid or experimentally useful pseudowords. To ensure the quality and appropriateness of generated items, all candidates must undergo rigorous evaluation through the application’s second major component: the evaluation system.
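As a rough illustration of the combination step, the sketch below enumerates every concatenation of one syllable per position. Exhaustive enumeration with itertools.product is an assumption made for clarity; the application may sample combinations instead. The inventory and function name are hypothetical.

```python
from itertools import product

def all_candidates(inventory):
    """Yield every concatenation of one syllable per position."""
    positions = [inventory[pos] for pos in sorted(inventory)]
    for combo in product(*positions):
        yield "".join(combo)

# Hypothetical toy inventory: position index -> syllables seen in that position.
toy_inventory = {0: ["κα", "ου"], 1: ["ρα", "ρε"], 2: ["νος", "κλα"]}
candidates = list(all_candidates(toy_inventory))
print(len(candidates), candidates[:4])
```

Even this tiny inventory yields an existing word (καρεκλα, the stress-free form of καρέκλα ‘chair’) among the candidates, which shows why the evaluation stage is needed.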

The first evaluation phase is criteria-based filtering, which is carried out in several steps. This evaluation is designed to work with any available database, wordlist, or corpus. The current version of SyBig-r-Morph uses both GreekLex2.1 and the Clean Corpus [41]. They are collectively called the “Lexicon” in the remainder of the article.

The filtering process begins with a “Lexicon check” that removes any pseudoword that already exists in the lexicon or lexica consulted. Next, the “Repetition check” discards pseudowords that contain two or more consecutive identical letters or sounds: this way, words like ααα [aaa], ααγη [aaʝi], αριθθον [aɾiθθon] are eliminated. As a side effect, this check also excludes words with double letters that Greek orthography permits, such as ρρ; that is, words like απουρρα [apuɾa] will be excluded. The same filtering mechanism operates on segments, disallowing words with adjacent vowels that are phonetically identical but spelled differently, e.g., εαιγη [eeʝi], παωο [paoo].
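The two checks can be sketched as follows. This is a simplified illustration under stated assumptions: the vowel mapping is a hypothetical, deliberately coarse stand-in for a real grapheme-to-phoneme step (for instance, it does not treat αυ and ευ specially), and the function names are not taken from the application.

```python
import re

# Simplified, hypothetical vowel mapping: digraphs are tried first, then single letters.
VOWEL_DIGRAPHS = {"αι": "e", "ει": "i", "οι": "i", "υι": "i", "ου": "u"}
VOWEL_LETTERS = {"α": "a", "ε": "e", "η": "i", "ι": "i", "υ": "i", "ο": "o", "ω": "o"}

def vowel_sounds(word):
    """Rough transcription: vowels become sounds, every other letter is kept as is."""
    out, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in VOWEL_DIGRAPHS:
            out.append(VOWEL_DIGRAPHS[pair])
            i += 2
        else:
            out.append(VOWEL_LETTERS.get(word[i], word[i]))
            i += 1
    return out

def passes_basic_checks(candidate, lexicon):
    """Apply the Lexicon check, then the letter and vowel-sound repetition checks."""
    if candidate in lexicon:                 # Lexicon check
        return False
    if re.search(r"(.)\1", candidate):       # two identical letters in a row
        return False
    sounds = vowel_sounds(candidate)
    for first, second in zip(sounds, sounds[1:]):
        if first == second and first in "aeiou":  # two identical vowel sounds in a row
            return False
    return True

lexicon = {"καρεκλα"}
for word in ["ααα", "αριθθον", "απουρρα", "εαιγη", "παωο", "καρεκλα", "δηθεκη"]:
    print(word, passes_basic_checks(word, lexicon))
```

Only the last item survives in this sketch; the others fail either the Lexicon check or one of the two repetition checks.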

The process then continues with n-gram validation, during which the application assesses the generated pseudowords for phonotactic acceptability. More precisely, it constructs bigrams and trigrams for each pseudoword. If any of these n-grams is not attested in the words of the lexicon, the pseudoword is marked as phonologically unacceptable and is excluded from further analysis. This filter will rule out pseudowords like αριθδρος [aɾiθðɾos], παιχπουρστο [pexpuɾsto].
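A minimal sketch of the n-gram validation is given below. It assumes letter-based bigrams and trigrams and a plain set-membership test; the three-word lexicon and the function names are hypothetical.

```python
def ngrams(word, n):
    """Return the set of all contiguous letter sequences of length n."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def build_attested(lexicon, sizes=(2, 3)):
    """Collect every bigram and trigram that occurs in the lexicon."""
    attested = {n: set() for n in sizes}
    for word in lexicon:
        for n in sizes:
            attested[n] |= ngrams(word, n)
    return attested

def passes_ngram_check(candidate, attested):
    """A candidate passes only if all of its n-grams are attested."""
    return all(ngrams(candidate, n) <= grams for n, grams in attested.items())

# Hypothetical toy lexicon (stress-free forms).
lexicon = ["αριθμος", "δρομος", "θρονος"]
attested = build_attested(lexicon)
print(passes_ngram_check("αριθδρος", attested))  # contains the unattested bigram θδ
print(passes_ngram_check("θρομος", attested))    # all bigrams and trigrams attested
```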

Subsequently, a comparative analysis stage measures the similarity between each pseudoword and all words in the reference databases, i.e., GreekLex2.1 and Clean, or any other specified lexical resource. SyBig-r-Morph employs a similarity ratio grounded in the Levenshtein distance [50], computed as “1 minus the Levenshtein distance divided by the maximum length of the compared words.” In other words, the Levenshtein distance is first normalized by the length of the longer of the two words, and the resulting ratio is then subtracted from 1, so that a smaller distance yields a higher similarity score. The score thus ranges from 0 (no similarity) to 1 (a perfect match). Pseudowords exhibiting a high similarity ratio to actual words are excluded as being too similar to existing ones, e.g., αυρειο [avɾio] (cf. αύριο [ˈavɾio] ‘tomorrow’), αδεικα [aðika] (cf. άδικα [ˈaðika] ‘unfairly’), αγογος [aɣoɣos] (cf. άγονος [ˈaɣonos] ‘infertile’), οναδα [onaða] (cf. ομάδα [oˈmaða] ‘team’). This way, the application guarantees that the final output consists of pseudowords distinct from actual words in the Lexicon.

In the current version, the cut-off point is set at 0.8. During testing, we observed that lowering this threshold for nouns resulted in pseudonouns significantly dissimilar to existing nouns, while increasing it resulted in pseudowords that closely resembled existing ones. We arrived at this conclusion during a manual evaluation phase discussed in Section 4.
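The similarity ratio and the cut-off can be illustrated with the sketch below. It implements the quoted formula with a textbook dynamic-programming Levenshtein distance. Two points are assumptions rather than documented behavior: words are compared in their stress-free spelling, and a candidate is rejected when its score is greater than or equal to the cut-off.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if letters match)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """1 minus the Levenshtein distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def too_similar(candidate, lexicon, cutoff=0.8):
    """Assumed rule: reject if any lexicon word reaches the cut-off."""
    return any(similarity(candidate, word) >= cutoff for word in lexicon)

print(round(similarity("αυρειο", "αυριο"), 3))  # one edit over six letters
print(too_similar("αυρειο", ["αυριο"]))
print(too_similar("δηθεκη", ["αυριο"]))
```

For αυρειο versus αυριο the distance is 1 and the longer word has six letters, so the score is 1 − 1/6 ≈ 0.833, which lies above the 0.8 cut-off.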

Figure 3 illustrates the application’s selection dashboard, which includes user-friendly predetermined threshold values for frequency and familiarity, as explained above. It also includes several user-defined parameters, such as the selection of part of speech, the number of syllables, and the number of words. The latter value is set to 1000, the maximum number of words allowed in the application’s current settings. Importantly, users can view and select either all syllables or only those appearing in the word-final position, allowing them to determine specific noun classes or verb voice based on word endings. This is a crucial feature because it allows users to precisely control the grammatical properties of generated pseudowords. Furthermore, users can also decide to either include or exclude words with marginal word-final phonotactics, such as those ending in consonants other than /n/ and /s/, which are the only word-final consonants phonotactically permitted in Greek.

As shown in Figure 4, the “Status” panel below the pseudoword generation button provides users with comprehensive information about their selections and the pseudoword generation process. During initialization, it displays the lexicon loading process and parameters, including part-of-speech specification (noun), syllable count (3), and threshold settings (frequency: 4.327, similarity: 0.8). Once processing begins, the system reports the distribution of syllables across three positions, identifying 121 initial, 159 medial, and 71 final syllables. Following Greek phonotactic constraints, the panel indicates that 64 of the 71 final syllables were selected, with endings from loanwords and archaic forms being systematically excluded.

Throughout the generation phase, the interface provides real-time progress updates, tracking incremental production in sets of 5 pseudowords from the initial batch to the final target of 1000. In this demonstration, the entire process was completed in approximately 19 s (01:06:24 to 01:06:43), achieving an average generation rate of 53 words per second. The panel concludes with a completion message confirming the successful generation of all requested pseudowords.

When the set of available syllables is particularly limited, however, identical outputs are more likely. For example, generating 15 non-passive verb forms ending in -o/-zo with the same settings resulted in multiple duplicates, as follows: νοστευω, πιβαιω, προμιω, νοκειω, πιμιω, νοκειζω, πιστευζω, πιβαιω, προστευζω, πιμιζω, πικειω, προβαιζω, νοστευω, νοστευζω, προβαιζω. This occurred because the only available final syllables in the syllable set were ζω (/zo/) and ω (/o/). Since generation relies on random sampling, users who wish to reproduce the same results can ensure consistency by adding the following command at the beginning of the sybig-r-morph-gui.py script:

np.random.seed(seed=12345678)
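The effect of the seed can be demonstrated with a small sketch. It assumes that syllables are drawn with NumPy's global random generator (np.random.choice is a guess at the sampling call, not a confirmed detail of the script), and it uses a hypothetical toy inventory built from the syllables visible in the example above.

```python
import numpy as np

def sample_pseudowords(inventory, n_words, seed=None):
    """Draw one syllable per position at random for each pseudoword."""
    if seed is not None:
        np.random.seed(seed)  # same global-seed mechanism as the command above
    words = []
    for _ in range(n_words):
        parts = [str(np.random.choice(inventory[pos])) for pos in sorted(inventory)]
        words.append("".join(parts))
    return words

# Hypothetical toy inventory with only two final syllables.
toy_inventory = {0: ["νο", "πι", "προ"], 1: ["στευ", "βαι", "μι", "κει"], 2: ["ω", "ζω"]}
first = sample_pseudowords(toy_inventory, 15, seed=12345678)
second = sample_pseudowords(toy_inventory, 15, seed=12345678)
print(first == second)              # identical seed, identical draws
print(len(set(first)), "distinct")  # at most 3 * 4 * 2 = 24 distinct forms exist
```

With only 24 possible forms, drawing 15 items at random makes repeated outputs likely, which mirrors the duplicates reported above.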

The “Generated Pseudowords” panel of the application displays the pseudowords. This section of the interface includes four control buttons: “Sort A-Z”, which organizes the pseudonouns in alphabetical order; “Randomize”, which shuffles their arrangement; “Copy All”, which copies the words to the clipboard; and “Download CSV”, which allows users to export the data. Users can manually reorganize words using a drag-and-drop feature, as indicated by on-screen instructions, to suit experimental needs or categorization schemes. Changes in the order of the constructed words are reported on the “Status” panel.

The “Stress Assignment” module allows users to apply systematic stress patterns to generated pseudowords in accordance with Greek stress rules (Figure 5). The interface is seamlessly integrated with the pseudoword generation process and includes an “Import Generated Words” button that automatically transfers items from the main generation panel into the stress assignment text area.

Users can also manually insert Greek words and choose from all three permissible stress positions, i.e., antepenult (APU, third-to-last syllable), penult (PU, second-to-last syllable), or ultima (U, last syllable). Multiple stress patterns can be selected using checkboxes, and the selected patterns are then evenly distributed across the input words to ensure a balanced set of stimuli. After clicking “Apply Stress”, the words are processed, and the results are displayed in a designated “Stressed Words” output area, where each word is marked with the appropriate accent diacritic according to the selected stress pattern(s). The “Clean” button removes any selected stress pattern options.

Alternatively, when multiple stress patterns are selected, clicking “Apply Greek Stress Rules” distributes them across the input words while respecting specific rules of Greek stress; for instance, it will not assign U stress to passive verb forms since such stress positioning is ungrammatical. When a single stress pattern is selected, “Apply Stress” must be clicked instead. In both cases, the results are displayed in the “Stressed Words” output area.

Users can then export their stressed pseudowords using the “Copy Stressed Words” or “Download Stressed CSV” buttons. This workflow enables researchers to regulate the prosodic characteristics of their experimental materials while ensuring phonotactic precision and a systematic distribution of stress.
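The core of the stress assignment step can be sketched as follows. The sketch assumes syllabified input (a list of syllables per word), places the accent on the last vowel letter of the syllable's first vowel run (so that digraphs such as ου become ού), and distributes the selected patterns round-robin. It does not implement the Greek-specific restrictions, such as the ban on U stress for passive verb forms, and all names are hypothetical.

```python
from itertools import cycle

ACCENTED = {"α": "ά", "ε": "έ", "η": "ή", "ι": "ί", "ο": "ό", "υ": "ύ", "ω": "ώ"}
OFFSET = {"APU": -3, "PU": -2, "U": -1}  # syllable index counted from the end

def stress_syllable(syllable):
    """Accent the last vowel letter of the first vowel run in the syllable."""
    last_vowel = None
    for i, ch in enumerate(syllable):
        if ch in ACCENTED:
            last_vowel = i
        elif last_vowel is not None:
            break  # the vowel run has ended
    if last_vowel is None:
        return syllable
    return syllable[:last_vowel] + ACCENTED[syllable[last_vowel]] + syllable[last_vowel + 1:]

def apply_stress(syllables, pattern):
    """Return the word with the accent on the syllable named by the pattern."""
    index = OFFSET[pattern]
    if -index > len(syllables):
        raise ValueError(f"{pattern} needs at least {-index} syllables")
    stressed = list(syllables)
    stressed[index] = stress_syllable(stressed[index])
    return "".join(stressed)

def distribute(words, patterns):
    """Assign the selected patterns round-robin so each is used about equally often."""
    return [apply_stress(w, p) for w, p in zip(words, cycle(patterns))]

words = [["βου", "στε", "γος"], ["δη", "θε", "κη"], ["κι", "νω", "δα"]]
print(distribute(words, ["APU", "PU", "U"]))
```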

This careful process produces a refined set of pseudowords that follow the morphophonological patterns of the intended grammatical category while still being distinct from real Greek words. Appendix B provides sample experimental pseudonouns.

Figure 6 illustrates the main components of SyBig-r-Morph and outlines the specific operations performed within each module.

The main strength of SyBig-r-Morph is its ability to generate natural-sounding pseudowords that conform to the phonotactic rules of Greek, although these words do not exist in real language. By generating words that adhere to specific linguistic patterns without being actual words, this tool provides researchers with valuable resources for linguistic research and related fields.

4. Evaluation of Pseudowords

We added several steps for phonological evaluation to assess further the suitability of the pseudonouns generated by SyBig-r-Morph for our experimental purposes. These steps aim to ensure that our constructed words not only comply with the syllabic and phonotactic rules of Greek but also meet the specific objectives of our experiment. These objectives include creating pseudowords that neither closely resemble native words nor are too distant from them.

To assess the phonological well-formedness of our pseudonouns, we used the online Num Tool [41,51] and a manually annotated subset of the Clean Corpus compiled by Apostolouda [20], called the Annotated Clean Corpus (A-Clean), which encompasses 71,105 words.

The Num Tool operates online and provides data equivalent to that found in the Clean Corpus for any word submitted to the platform. It furnishes quantitative metrics for each entered letter string, permitting 20 submissions per session. Users can adapt their preferred metrics by selecting the corresponding checkboxes. The results are presented in a tab-separated format, accessible within the user’s web browser or downloadable with software that manages text columns, such as MS Excel. The range of available metrics encompasses length (letters, phones, syllables), frequency, uniqueness/recognition point, cumulative syllable and bigram frequency, count and frequency of orthographic and phonological neighbors, cohort, stress, and Levenshtein distance neighbors, and indices of orthographic transparency.

Moreover, the Num Tool helps evaluate pseudowords based on statistical measurements of real words documented within the Clean Corpus. Specific lexical and sub-lexical parameters can be pivotal for developing Greek pseudowords within this toolkit. For our study, we selected the following ones:

(a) Log-mean bigram token frequency (phonemes only) (BGtokfreqPho) calculates the logarithmic mean of the frequency of appearance of neighboring phonemes across all samples in the corpus. For example, the combination of the phonemes τ [t] + ο [o] results in το [to] with a log-mean frequency of 3.976 per million samples.
(b) Log-mean bigram type frequency (phonemes only) (BGtypfreqPho) computes the logarithmic mean of neighboring phoneme frequency within the entire set of word types in the corpus. For example, the combination of the phonemes τ [t] + ο [o] = το [to] has a log-mean frequency of 1.709 per million types.
(c) N phonological neighbors (standard: replace only) (nNeiPho) provides the count of words that emerge when a phoneme is substituted within a word. For instance, the word τόνος [ˈtonos] ‘stress’ has phonological neighbors created through substitution: τόνος [ˈtonos] → φόνος [ˈfonos] ‘murder’ (τ /t/ → φ /f/) and τόνος [ˈtonos] → τόκος [ˈtokos] ‘interest’ (ν /n/ → κ /k/).
(d) N phonological neighbors (replace, delete, insert, transpose) (nNeiRDITPho) offers the count of words resulting from phoneme replacement, deletion, insertion, or transposition within a word. For example, the word φόρος [ˈfoɾos] ‘tax’ has phonological neighbors created through various operations: (i) deletion: φόρος [ˈfoɾos] → όρος [ˈoɾos] ‘mountain’ (deletion of initial φ /f/); (ii) replacement: φόρος [ˈfoɾos] → χώρος [ˈxoɾos] ‘space’ (replacing φ /f/ with χ /x/); (iii) insertion: φόρος [ˈfoɾos] → φόρτος [ˈfoɾtos] ‘burden’ (insertion of τ /t/ before final ος /os/); (iv) transposition: φόρος [ˈfoɾos] → ροφός [ɾoˈfos] ‘dusky grouper’ (change in the position of ρ /ɾ/).
(e) Phonological Levenshtein distance 20 (PLD20; [50]) quantifies phonological similarity as the mean number of phonological edits (insertions, deletions, substitutions, or transpositions) required to change a word into each of its 20 most similar words [52].
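The replace-only neighbor count (nNeiPho) can be illustrated with the sketch below. It is not the Num Tool's implementation: it works on a hypothetical toy lexicon in a broad, stress-free transcription with one ASCII character per phoneme (r stands in for ɾ), and it ignores stress when comparing words.

```python
def is_substitution_neighbor(a, b):
    """True if the two phoneme strings differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def n_substitution_neighbors(target, lexicon):
    """Count lexicon entries reachable from the target by one phoneme replacement."""
    return sum(is_substitution_neighbor(target, word) for word in lexicon)

# Hypothetical toy lexicon: one character per phoneme, stress omitted.
toy_lexicon = ["fonos", "tokos", "foros", "oros", "xoros"]
print(n_substitution_neighbors("tonos", toy_lexicon))  # fonos and tokos qualify
print(n_substitution_neighbors("foros", toy_lexicon))  # fonos and xoros qualify
```

The deletion neighbor oros is not counted here because this measure allows replacements only; it would count under the broader nNeiRDITPho measure.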

Having outlined these main parameters, we applied them systematically to evaluate pseudowords generated by SyBig-r-Morph. We selected parameters a and b to measure bigram token and type frequency, and parameters c, d, and e to control similarity to real words. This selection guarantees that our pseudowords align with the Greek phonotactic system without closely resembling actual words, thus preventing any direct connections to existing words that might influence speaker choices.

We conducted comprehensive assessments across different grammatical categories (e.g., nouns, verbs) to demonstrate the practical application of these criteria, with a particular focus on noun evaluation reported here. Table 2 presents a sample of the pseudonouns and their corresponding parameter values to illustrate the application of our criteria. All pseudonouns were produced from SyBig-r-Morph with the Similarity Threshold set to 0.8. We decided on this value because, with lower values, the pseudonouns were too dissimilar to existing nouns, and with higher values, they were too similar to existing ones.

While these initial results demonstrated the tool’s effectiveness, given that all nouns fell within Clean’s phonotactic well-formedness range, we recognized that even more rigorous, noun-specific criteria would better ensure experimental validity. To achieve this precision, we developed enhanced evaluation standards tailored specifically to Greek noun phonotactics. We extracted all di- and trisyllabic nouns from A-Clean. Following the methodology of Revithiadou et al. [53] and Apostolouda [20], we calculated the mean (M) and standard deviation (SD) for each of the five parameters for the total number of 5326 trisyllabic nouns across all noun classes present in A-Clean. The lower bound was established using the formula Mean − 2SD, whereas Mean + 2SD defined the upper limit. This process allowed us to precisely determine the numerical ranges in which the values of A-Clean nouns naturally fell. Table 3 shows the admissible ranges of all five parameters.

For trisyllabic nouns, the mean logarithmic bigram token frequency was calculated to be 0.53, with a corresponding standard deviation of 0.365. By applying the formula [M − 2SD to M + 2SD], the acceptable range of values for trisyllabic pseudonouns was determined to lie between 0 and 1.260 (M − 2SD = −0.200, with the negative lower bound set to 0, and M + 2SD = 1.260). Similar calculations were performed for the remaining parameters. The same process was applied to disyllabic nouns. Having established these refined criteria, we applied them to a focused evaluation of trisyllabic pseudonouns ending in -os.
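The range calculation can be reproduced with a few lines. The sketch uses the mean and standard deviation reported above for the bigram token frequency of trisyllabic nouns; replacing a negative lower bound with 0 is an assumption inferred from the stated range, and the function names are hypothetical.

```python
def admissible_range(mean, sd, k=2.0, floor=0.0):
    """Return (lower, upper) = (M - k*SD, M + k*SD), with the lower bound floored."""
    lower = max(mean - k * sd, floor)
    upper = mean + k * sd
    return lower, upper

def within_range(value, bounds):
    """Check whether a parameter value lies inside the admissible range."""
    lower, upper = bounds
    return lower <= value <= upper

bounds = admissible_range(0.53, 0.365)
print(tuple(round(b, 3) for b in bounds))
print(within_range(0.9, bounds), within_range(1.5, bounds))
```

A pseudonoun is retained only if each of its five parameter values passes a check of this kind against the corresponding range.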

Table 4 demonstrates how a representative sample of ten pseudowords performed against these stringent parameters, with shaded rows indicating words that fell outside acceptable ranges. Pseudowords exceeding these thresholds were excluded from experimental materials.

To assess the broader applicability of our established thresholds, we repeated the same evaluation process on a larger scale. Specifically, we submitted pseudonouns from various inflectional classes and word lengths (e.g., disyllabic and trisyllabic nouns, such as -os masculine, -as masculine, -a feminine, -i feminine, -o neuter, -i neuter) to the Num Tool platform for evaluation against the pre-defined parameters and subsequently applied our stricter criteria, as depicted in Table 3. Of the 600 trisyllabic nouns submitted, 521 (86.8%) fell within our stricter parameter ranges. In a separate evaluation, when we repeated the process with 250 disyllabic nouns (produced using a similarity threshold of 0.8 and a frequency threshold of 5.0), 142 (56.8%) fell within acceptable phonotactic ranges. This outcome is expected, as our strict parameter ranges in Table 3 evaluate words using metrics designed to avoid direct resemblance to existing words that could influence speaker judgments. Moreover, the lower success rate for disyllabic nouns reflects the higher similarity threshold used in their generation. These results collectively demonstrate SyBig-r-Morph’s high success rate in producing pseudowords suitable for experimental research.

5. Discussion and Conclusions

This article highlighted the importance of pseudowords in (psycho)linguistic studies and addressed the ongoing need for reliable and accessible pseudoword generation tools across various languages. We also presented conventional approaches to pseudoword construction, which often involve modifying real words by changing letters or phonological features. This strategy carries the risk of accidental recognition of base words. In contrast, modern techniques employ high-frequency bigrams or syllable deconstruction, often requiring a good understanding of a language’s grammar.

To address these challenges and the lack of an automatic pseudoword generation tool in Greek, we developed SyBig-r-Morph, an application that introduces several innovations that distinguish it from existing applications and software.

SyBig-r-Morph constructs pseudowords syllable by syllable rather than using n-grams, creating more phonologically natural outputs. It also employs an internal syllabification system that can process wordlists of any size, from databases with hundreds of thousands to millions of entries. Crucially, the system organizes syllables into arrays based on their position within a word, ensuring that generated words display phonotactic patterns consistent with those observed in the selected category. For example, in Greek, the sequence βδ [vð] never appears at the beginning of verbs; it only occurs in nouns. Likewise, σγ [zɣ] is exclusively found in adjectival bases.

Building on this positional sensitivity, additional application features enable users to specify grammatical categories (nouns, verbs, etc.), select specific word endings, and assign stress patterns to their generated pseudowords. This combination of category-specific phonotactics and user customization is particularly valuable for specialized linguistic tests and experimental tasks because it delivers words that adhere to the specific category’s phonotactic restrictions. Moreover, SyBig-r-Morph automatically detects and filters out phonologically invalid letter/sound combinations by recognizing impermissible sequences in Greek, such as κγχ [kɣx], and excluding pseudowords containing such ill-formed combinations.

Further enhancing its versatility, the application employs a dual-lexicon approach, using separate lexicons for input during word formation and validation. In the current implementation, SyBig-r-Morph uses GreekLex2.1 for input and Clean (together with GreekLex2.1) to validate constructed forms, offering greater flexibility and precision in the creation and validation processes.

Another significant advantage of SyBig-r-Morph is that, while it was initially developed for Greek, it can be easily modified to work with any language, provided the user has access to a syllabification tool and a lexicon for the target language.

SyBig-r-Morph has several advantages over similar tools like UniPseudo. For instance, SyBig-r-Morph is more effective at producing words that adhere to proper phonotactic rules and follow Greek orthographic standards. Additionally, unlike UniPseudo, SyBig-r-Morph determines word length based on the number of syllables rather than the number of characters, allowing it to easily generate words that meet the user’s specified length criteria.

Although designed for different languages, both SyBig-r-Morph and Wuggy employ frequency criteria, but they implement these criteria in different ways. Wuggy focuses on transition frequencies between sub-syllabic components; for example, it compares the i to sk transition in the pseudoword misk with the i to lk transition in the word milk. In contrast, SyBig-r-Morph uses Zipf frequency to choose either low- or high-frequency words as bases for generating pseudowords and further employs Levenshtein distance to assess similarity to existing words. Finally, Wuggy operates on a single-lexicon system, whereas SyBig-r-Morph supports a dual- (or multiple-) lexicon setup.

The features described above render SyBig-r-Morph useful for linguistic studies, as it produces phonologically well-formed pseudowords that maintain the structural properties of specific Greek grammatical categories. However, the system has two notable limitations: it cannot produce pseudowords that violate the morphophonological rules of Greek, and the generation step itself delivers only stressless words, so stress must be added afterwards through the “Stress Assignment” module.

In future research, we plan to enhance SyBig-r-Morph with additional features allowing users to construct pseudowords based on criteria beyond frequency and word similarity. For instance, users will be able to specify the morphosyntactic category and inflectional and derivational affixes for pseudoword beginnings or endings. We also aim to improve the lexicon component by incorporating more comprehensive Greek databases and corpora. These enhancements will increase the tool’s usability and expand its applicability across multiple languages and linguistic contexts.

Author Contributions

Conceptualization, K.K.; methodology, K.K., V.A. and A.R.; software, K.K. and A.R.; validation, V.A. and A.R.; formal analysis, K.K., V.A. and A.R.; writing—original draft preparation, K.K., V.A. and A.R.; writing—review and editing, K.K., V.A. and A.R.; visualization, A.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code for our application and the installation guidelines are available in the GitHub repository at https://github.com/revith71/sybig-r-morph (accessed on 8 June 2025). The application also operates at https://revith.pythonanywhere.com/ (accessed on 8 June 2025).

Conflicts of Interest

During the last two years, from the beginning of the work (which includes conducting the research and preparing the manuscript for submission), there have been no competing interests directly or indirectly related to the current work submitted for publication.

Abbreviations

The following abbreviations are used in this manuscript:

CGCA	character-gram chaining algorithm
HNC	Hellenic National Corpus
ILSP	Institute for Language and Speech Processing
A-Clean	Annotated Clean Corpus
BGtokfreqPho	log-mean bigram token frequency (phonemes only)
BGtypfreqPho	log-mean bigram type frequency (phonemes only)
nNeiPho	N phonological neighbors (standard: replace only)
nNeiRDITPho	N phonological neighbors (replace, delete, insert, transpose)
PLD20	phonological Levenshtein distance 20

Appendix A. Sample of UniPseudo-Generated Pseudowords in Greek

Table A1. Sample of UniPseudo-generated pseudowords in Greek: four-character-long pseudowords (the authors provide IPA transcriptions).

UniPseudo settings:
Pseudoword length	4
Algorithm	Bigram
Use words from database	Yes
Max consecutive consonants	2
Max. same letter consecutive	1
Max. diacritical letters	1
Lexicon	WordLex (integrated)
UniPseudo generated these four-character pseudowords using a bigram algorithm that extracts letter patterns from the WordLex Greek database. The middle columns show source bigrams from real Greek words used to create the pseudowords. Selected constraints and their values (max two consecutive consonants, no repeated letters, max one diacritical mark, such as stress) ensure the stimuli adhere to Greek phonotactic patterns while preserving their pseudoword nature. (IPA transcriptions are provided by the authors.)
τσμό	τσιπ	οσμή	νομό
[ˈtsmo]	[tsip]	[ozˈmi]	[noˈmo]
πλμα	πλην	άλμη	γόμα
[plma]	[plin]	[ˈalmi]	[ˈɣoma]
στμω	στοά	ατμό	νόμω
[stmo]	[stoˈa]	[atˈmo]	[ˈnomo]
τρμα	τρις	όρμο	αίμα
[tɾma]	[tɾis]	[ˈoɾmo]	[ˈema]
νωμπ	νωπά	βωμό	παμπ
[nob]	[noˈpa]	[voˈmo]	[ˈpab]
αμωφ	αμήν	όμως	λεωφ
[amof]	[aˈmin]	[ˈomos]	[leof]
ωμοκ	ωμός	ώμοι	στοκ
[omok]	[oˈmos]	[ˈomi]	[ˈstok]
ατμη	ατμό	ατμό	ζύμη
[atmi]	[atˈmo]	[atˈmo]	[ˈzimi]
υμοι	υμάς	ομού	βίοι
[imi]	[iˈmas]	[oˈmu]	[ˈvii]
ώμήν	ώμοι	ωμής	αμήν
[ˈoˈmin]	[ˈomi]	[oˈmis]	[aˈmin]
ώμός	ώμου	αμόκ	θεός
[ˈoˈmos]	[ˈomu]	[aˈmok]	[θeˈos]

UniPseudo generated these six-character pseudowords using a trigram algorithm that extracts letter patterns from a specific wordlist, consisting of 1000 nouns and 1000 verbs in Greek. The middle columns show source trigrams from real Greek words used to create the pseudowords. Selected constraints and their values (max two consecutive consonants, no repeated letters, max one diacritical mark) ensure the stimuli follow Greek phonotactic patterns without forming real words. (IPA transcriptions are provided by the authors.)

Table A2. Sample of UniPseudo-generated pseudowords in Greek: six-character-long pseudowords.

UniPseudo settings:
Pseudoword length	6
Algorithm	trigram
Use words from database	yes
Max consecutive consonants	2
Max. same letter consecutive	1
Max. diacritical letters	1
Lexicon	Wordlist (1000 nouns and 1000 verbs)
μάδμοι	μάδησα	Κάδμου	Κάδμου	έρημοι
[ˈmaðmi]	[ˈmaðisa]	[ˈkaðmu]	[ˈkaðmu]	[ˈeɾimi]
μαρμών	μαριάμ	γαρμπή	συρμών	ρυθμών
[maɾˈmon]	[maɾiˈam]	[ɣaɾˈbi]	[siɾˈmon]	[ɾiθˈmon]
μπαμοι	μπαγιό	μπαμπά	άγαμοι	δρόμοι
[bami]	[baˈʝo]	[baˈba]	[ˈaɣami]	[ˈðɾomi]

Appendix B. Sample of SyBig-r-Morph-Generated Pseudonouns

Parameter	Value
Part of Speech (PoS)	Noun
Number of Syllables	3
Frequency Threshold	5.00
Similarity Threshold	0.8

Table A3. Sample of SyBig-r-Morph-generated pseudonouns. The three-syllable pseudonouns were created using SyBig-r-Morph with frequency and similarity thresholds of 5.00 and 0.8, respectively. With these parameter settings, the stimuli maintain Greek phonotactic and morphological patterns characteristic of nouns without matching existing words. (IPA transcriptions are provided by the authors.)

δηθεκη	[ðiθeci]
πλαιρειδα	[pleɾiða]
ανδιγη	[anðiʝi]
βουστεγος	[vusteɣos]
κεινογκη	[cinoɟi]
κινωδα	[cinoða]
παιχλανα	[pexlana]
ληνηστρο	[linistɾo]
παιχνοιμη	[pexnimi]
σηπουρκη	[sipuɾci]
ταιρελμη	[teɾelmi]
κειβλιχη	[civliçi]
παιχλειχη	[pexliçi]
παιθρωχη	[peθɾoçi]
δουσωτος	[ðusotos]
ποναζα	[ponaza]
πλαιθεγη	[pleθeʝi]
βουξιμος	[vuksimos]
μελειγκη	[meliɟi]
αυναζα	[avnaza]
τερελχη	[teɾelçi]
κυστηγκη	[cistiɟi]
παιθεγκη	[peθeɟi]
καπουρτη	[kapuɾti]
εγμαστο	[eɣmasto]
δευθουο	[ðefθuo]
ουμιγος	[umiɣos]
κειπουργη	[cipuɾʝi]
γυριγη	[ʝiɾiʝi]
μουταρνη	[mutaɾni]
εκνιμη	[eknimi]
μεδειφο	[meðifo]
σταχειλο	[staçilo]
αριγκη	[aɾiɟi]
δευναιπο	[ðevnepo]
οθεσα	[oθesa]
ογοκα	[oɣoka]
αλωδα	[aloða]
ευδιγκη	[evðiɟi]
σηρελπο	[siɾelpo]
συθουκη	[siθuci]
τεμιδρος	[temiðɾos]

References

1. Nordquist, R. Definition and Examples of Pseudowords. ThoughtCo. 2018. Available online: https://www.thoughtco.com/pseudoword-definition-1691549 (accessed on 1 April 2025).
2. Groff, P. Pseudoword Construction in a Non-Word Repetition Task. Ph.D. Thesis, University of Texas at Austin, Austin, TX, USA, 2003.
3. Berko, J. The Child’s Learning of English Morphology. Word 1958, 14, 150–177.
4. Cardenas, J.M. Phonics Instruction Using Pseudowords for Success in Phonetic Decoding. Ph.D. Thesis, Florida International University, Miami, FL, USA, 2009. Available online: https://digitalcommons.fiu.edu/etd/139 (accessed on 1 April 2025).
5. Balota, D.A.; Cortese, M.J.; Sergent-Marshall, S.D.; Spieler, D.H.; Yap, M. Visual Word Recognition of Single-Syllable Words. J. Exp. Psychol. Gen. 2004, 133, 283–316.
6. Meara, P. EFL Vocabulary Tests, 2nd ed.; Centre for Applied Language Studies, University College of Swansea: Swansea, UK, 2010.
7. Keuleers, E.; Stevens, M.; Mandera, P.; Brysbaert, M. Word Knowledge in the Crowd: Measuring Vocabulary Size and Word Prevalence in a Massive Online Experiment. Q. J. Exp. Psychol. 2015, 68, 1665–1692.
8. Arndt, H.L.; Woore, R. Vocabulary Learning from Watching YouTube Videos and Reading Blog Posts. Lang. Learn. Technol. 2018, 22, 124–142.
9. Elgort, I.; Warren, P. L2 Vocabulary Learning from Reading: Explicit and Tacit Lexical Knowledge and the Role of Learner and Item Variables. Lang. Learn. 2014, 64, 365–414.
10. Coltheart, M.; Rastle, K.; Perry, C.; Langdon, R.; Ziegler, J. DRC: A Dual Route Cascaded Model of Visual Word Recognition and Reading Aloud. Psychol. Rev. 2001, 108, 204–256.
11. Harm, M.W.; Seidenberg, M.S. Computing the Meanings of Words in Reading: Cooperative Division of Labor between Visual and Phonological Processes. Psychol. Rev. 2004, 111, 662–720.
12. Protopapas, A.; Gerakaki, S.; Alexandri, S. Lexical and Default Stress Assignment in Reading Greek. J. Res. Read. 2006, 29, 418–432.
13. Protopapas, A.; Gerakaki, S.; Alexandri, S. Sources of Information for Stress Assignment in Reading Greek. Appl. Psycholinguist. 2007, 28, 695–720.
14. Robertson, C.; Salter, W. The Phonological Awareness Test (PAT); LinguiSystems: East Moline, IL, USA, 1997.
15. Torgesen, J.K.; Rashotte, C.A.; Wagner, R.K. TOWRE: Test of Word Reading Efficiency; Pro-ed: Austin, TX, USA, 1999.
16. Drachman, G.; Malikouti-Drachman, A. Greek Word Accent. In Word Prosodic Systems in the Languages of Europe; van der Hulst, H., Ed.; Mouton de Gruyter: Berlin, Germany; New York, NY, USA, 1999; pp. 897–945.
17. Revithiadou, A. Headmost Accent Wins: Head Dominance and Ideal Prosodic Form in Lexical Accent Systems. Ph.D. Thesis, HIL/Leiden University, Leiden, The Netherlands, 1999.
18. Revithiadou, A. Colored Turbid Accents and Containment: A Case Study from Lexical Stress. In Freedom of Analysis? Blaho, S., Bye, P., Krämer, M., Eds.; Mouton de Gruyter: Berlin, Germany; New York, NY, USA, 2007; pp. 149–174.
19. Revithiadou, A.; Lengeris, A. One or Many? In Search of the Default Stress in Greek. In Dimensions of Stress; Heinz, J., Goedemans, R., van der Hulst, H., Eds.; Cambridge University Press: Cambridge, UK, 2016; pp. 263–290.
20. Apostolouda, V. Experimental Investigations on Greek Stress. Ph.D. Thesis, Aristotle University of Thessaloniki, Thessaloniki, Greece, 2018.
21. Revithiadou, A.; Soukalopoulou, M.; Markopoulos, G.; Argyrakis, P.; Kosmidis, K.; Apostolopoulou, E.; Apostolouda, V.; Paspali, A.; Tsouchnika, M.; Kanetidis, M.; et al. Modeling Young Speakers’ Lexical Stress Grammars: A Case Study from Greek. In Proceedings of the AG 11: Multifaceted and Multifactorial Approaches to Developing Phonological Systems, Mainz, Germany, 4–7 March 2025.
22. Davis, C.J.; Lupker, S.J. Masked Inhibitory Priming in English: Evidence for Lexical Inhibition. J. Exp. Psychol. Hum. Percept. Perform. 2006, 32, 668–687.
23. Balota, D.A.; Yap, M.J.; Hutchison, K.A.; Cortese, M.J.; Kessler, B.; Loftis, B.; Neely, J.H.; Nelson, D.L.; Simpson, G.B.; Treiman, R. The English Lexicon Project. Behav. Res. Methods 2007, 39, 445–459.
24. New, B.; Bourgin, J.; Barra, J.; Pallier, C. UniPseudo: A Universal Pseudoword Generator. Q. J. Exp. Psychol. 2024, 77, 278–286.
25. König, J.; Calude, A.S.; Coxhead, A. Using Character-Grams to Automatically Generate Pseudowords and How to Evaluate Them. Appl. Linguist. 2020, 41, 878–900.
26. Burani, C.; Thornton, A.M. The Root, Suffix, and Whole-Word Frequency Interplay in Processing Derived Words. In Morphological Structure in Language Processing; Baayen, R.H., Schreuder, R., Eds.; De Gruyter Mouton: Berlin, Germany; New York, NY, USA, 2003; pp. 157–208.
27. Duyck, W.; Desmet, T.; Verbeke, L.P.C.; Brysbaert, M. WordGen: A Tool for Word Selection and Nonword Generation in Dutch, English, German, and French. Behav. Res. Methods Instrum. Comput. 2004, 36, 488–499.
28. Baayen, R.H.; Piepenbrock, R.; van Rijn, H. The CELEX Database on CD-ROM; Linguistic Data Consortium, University of Pennsylvania: Philadelphia, PA, USA, 1996.
29. New, B.; Pallier, C.; Brysbaert, M.; Ferrand, L. Lexique 2: A New French Lexical Database. Behav. Res. Methods Instrum. Comput. 2004, 36, 516–524.
UniPseudo. Available online: http://unipseudo.lexique.org/ (accessed on 1 April 2025).
Rastle, K.; Harrington, J.; Coltheart, M. The ARC Nonword Database. Q. J. Exp. Psychol. 2002, 55, 1339–1362. [Google Scholar] [CrossRef]
Keuleers, E.; Brysbaert, M. Wuggy: A Multilingual Pseudoword Generator. Behav. Res. Methods 2010, 42, 627–633. [Google Scholar] [CrossRef]
Dołżycka, J.D.; Nikadon, J.; Formanowicz, M. Constructing Pseudowords with Constraints on Morphological Features—Application for Polish Pseudonouns and Pseudoverbs. J. Psycholinguist. Res. 2022, 51, 1247–1265. [Google Scholar] [CrossRef]
Tsapkini, K.; Jarema, G.; Kehayia, E. Regularity revisited: Evidence from lexical access of verbs and nouns in Greek. Brain Lang. 2002, 81, 103–119. [Google Scholar] [CrossRef]
Diamanti, V.; Benaki, A.; Mouzaki, A.; Ralli, A.; Antoniou, F.; Papaioannou, S.; Protopapas, A. Development of Early Morphological Awareness in Greek: Epilinguistic versus Metalinguistic and Inflectional versus Derivational Awareness. Appl. Psycholinguist. 2018, 39, 545–567. [Google Scholar] [CrossRef]
Varlokosta, S.; Nerantzini, M. The acquisition of past tense by Greek-speaking children with Specific Language Impairment: The role of phonological saliency, regularity, and frequency. In Specific Language Impairment: Current Trends in Research; Stavrakaki, S., Ed.; John Benjamins Publishing: Amsterdam, The Netherlands, 2015; pp. 253–286. [Google Scholar] [CrossRef]
Manouilidou, C.; Stockall, L. Teasing Apart Syntactic Category vs. Argument Structure Information in Deverbal Word Formation: A Comparative Psycholinguistic Study. Ital. J. Linguist. 2014, 26, 71–98. [Google Scholar]
Mastropavlou, M.; Tsimpli, I.M. The Role of Suffixes in Grammatical Gender Assignment in Modern Greek: A Psycholinguistic Study. J. Greek Linguist. 2011, 11, 27–55. [Google Scholar] [CrossRef]
Mitsiaki, M. Phonological Gradience in Greek #CC: Grammatical Modelling and Applications in Teaching Greek as L2. Ph.D. Thesis, Aristotle University of Thessaloniki, Thessaloniki, Greece, 2014. [Google Scholar] [CrossRef]
Soukalopoulou, M. Aspect in L1/L2 Greek: Experimental Investigation and Teaching Applications. Ph.D. Thesis, Aristotle University of Thessaloniki, Thessaloniki, Greece, 2021. [Google Scholar] [CrossRef]
Protopapas, A.; Tzakosta, M.; Chalamandaris, A.; Tsiakoulis, P. IPLR: An Online Resource for Greek Word-Level and Sublexical Information. Lang. Resour. Eval. 2012, 46, 449–459.
Lexique. Available online: http://www.lexique.org/?page_id=250 (accessed on 1 April 2025).
Gimenes, M.; New, B. Worldlex: Twitter and Blog Word Frequencies for 66 Languages. Behav. Res. Methods 2016, 48, 963–972.
IPLR Downloads. Available online: http://speech.ilsp.gr/iplr/downloads.htm (accessed on 1 April 2025).
Kyparissiadis, A.; van Heuven, W.J.; Pitchford, N.J.; Ledgeway, T. GreekLex 2: A Comprehensive Lexical Database with Part-of-Speech, Syllabic, Phonological, and Stress Information. PLoS ONE 2017, 12, e0172493.
GreekLex2.1. Available online: https://psychology.nottingham.ac.uk/greeklex/ (accessed on 1 April 2025).
Institute of Modern Greek Studies (Manolis Triandaphyllidis Foundation). Dictionary of Standard Modern Greek; Aristotle University of Thessaloniki: Thessaloniki, Greece, 1998.
Hatzigeorgiu, N.; Gavrilidou, M.; Piperidis, S.; Carayannis, G.; Papakostopoulou, A.; Spiliotopoulou, A.; Vacalopoulou, A.; Labropoulou, P.; Mantzari, E.; Papageorgiou, H.; et al. Design and Implementation of the Online ILSP Corpus. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC), Athens, Greece, 31 May–2 June 2000; Volume 3, pp. 1737–1740.
Hellenic National Corpus. Available online: https://hnc.ilsp.gr (accessed on 1 April 2025).
Levenshtein, V.I. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Sov. Phys. Dokl. 1966, 10, 707–710.
Num Tool. Available online: http://speech.ilsp.gr/iplr/NumTool.aspx (accessed on 1 April 2025).
Yarkoni, T.; Balota, D.; Yap, M. Moving beyond Coltheart’s N: A new measure of orthographic similarity. Psychon. Bull. Rev. 2008, 15, 971–979.
Revithiadou, A.; Ioannou, D.; Chatzinikolaou, M.; Aivazoglou, K. Constructing Pseudowords for Experimental Research: Problems and Solutions. Sel. Pap. Theor. Appl. Linguist. 2016, 21, 356–365.

Figure 1. Loading the lexical resources.

Figure 2. Pseudoword generation interface. Interface for configuring pseudoword generation parameters, including part of speech selection (Preposition shown selected), syllable count (3), frequency threshold (5.0), and similarity threshold (0.8).

Figure 3. Detailed pseudoword generation interface. Expanded interface showing generation parameters for Greek pseudowords with noun selection (3 syllables), frequency threshold (5.0), and similarity threshold (0.8). The last syllable selection panel displays available Greek syllables (64 of 71 selected) with options to select/deselect all or search specific syllables. The interface is configured to generate up to 1000 pseudowords.

Figure 4. (a,b) Pseudoword generation process and performance metrics. (a) Status panel showing the lexicon loading process and the initialization parameters for pseudoword generation; (b) completion of the pseudoword generation task and the user interface for generated Greek pseudowords with management controls. The run generated 1000 three-syllable noun pseudowords using 64 selected final syllables from a corpus analysis (121 initial, 159 medial, and 71 final syllables identified). Parameters: frequency threshold 4.327, similarity threshold 0.8. Total processing time: 19 s (approximately 53 words/s).

Figure 5. Stress assignment interface. The stress assignment module interface displays the input area for Greek words, stress pattern selection options (antepenult, penult, ultima), and the resulting stressed words with diacritical markings. The “Import Generated Words” button enables seamless transfer of pseudowords from the generation panel for stress processing.
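
The stress placement described in the Figure 5 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_stress`, the syllable-list input, and the heuristic of placing the accent on the last vowel letter of the target syllable (which covers digraphs such as ου or αι) are hypothetical and are not taken from the SyBig-r-Morph implementation.

```python
import unicodedata

GREEK_VOWELS = set("αεηιουω")
# Syllable index counted from the end of the word
POSITIONS = {"ultima": -1, "penult": -2, "antepenult": -3}

def assign_stress(syllables, position):
    """Return the word with an acute accent on the requested syllable.

    Assumption: the accent goes on the last vowel letter of the syllable.
    """
    idx = POSITIONS[position]
    if len(syllables) < -idx:
        raise ValueError("word is too short for the requested stress position")
    target = syllables[idx]
    for i in range(len(target) - 1, -1, -1):
        if target[i] in GREEK_VOWELS:
            # Insert a combining acute accent (U+0301) after the vowel
            stressed = target[: i + 1] + "\u0301" + target[i + 1 :]
            break
    else:
        raise ValueError("target syllable contains no vowel")
    parts = list(syllables)
    parts[idx] = stressed
    # NFC composes vowel + U+0301 into the precomposed accented letter
    return unicodedata.normalize("NFC", "".join(parts))

word = assign_stress(["τε", "μι", "δρος"], "penult")
```

A complete implementation would also need to handle cases this heuristic ignores, such as diaeresis or syllables that already carry an accent.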

Figure 6. SyBig-r-Morph’s pseudoword generation pipeline.

Table 1. Compilation of syllables for trisyllabic words after syllabification.

1: array (49)

α, αι, αλ, αν, αυ, βι, βου, γε, γυ, δευ, δη, δου, δυ, ε, εγ, ει, εκ, εν, ευ, ζη, η, θε, κα, κει, κι, κομ, κυ, λο, με, μει, μου, ο, ου, πα, παι, παιχ, πλαι, πο, προ, ρυθ, ση, στα, στοι, συ, σχε, ται, τε, τρα, υ

2: array (53)

α, βλη, βλι, γι, γο, γρα, γραμ, δει, δι, δο, ε, θε, θου, θρω, θυ, κο, λα, λε, λει, λευ, λω, μα, με, μει, μι, να, ναι, νη, νι, νο, νοι, νω, ξη, ξι, πε, πο, πουρ, ρει, ρελ, ρευ, ρι, ριθ, σι, σο, στε, στη, σω, τα, ταρ, τε, τη, χει, ω

3: array (35)

α, γη, γκη, γος, δα, δι, δρος, ζα, θον, κα, κη, λο, μα, μη, μος, να, νη, νο, νος, ο, ος, πο, πος, ρα, σα, ση, στο, στως, τη, της, τος, τρο, φο, χη, ψη
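
As a rough illustration of how position-specific syllable inventories like those in Table 1 can be combined into candidates, the following Python sketch joins one syllable from each position and discards real words. The function name `sample_pseudowords`, the small inventory excerpts, the `lexicon` argument, and the fixed seed are illustrative assumptions rather than the tool's actual code; the frequency and similarity thresholds are omitted.

```python
import itertools
import random

# Short excerpts of the three Table 1 inventories (illustration only)
INITIAL = ["α", "κα", "πα", "στα", "τρα"]
MEDIAL = ["λα", "μι", "νο", "ρι", "τα"]
FINAL = ["μος", "νος", "πος", "τος", "ος"]

def sample_pseudowords(n, lexicon=frozenset(), seed=0):
    """Return up to n distinct trisyllabic candidates.

    Each candidate takes one syllable per position; strings that occur
    in `lexicon` (a set of real, unaccented word forms) are excluded.
    """
    combos = itertools.product(INITIAL, MEDIAL, FINAL)
    candidates = sorted({"".join(parts) for parts in combos})
    candidates = [w for w in candidates if w not in lexicon]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    rng.shuffle(candidates)
    return candidates[:n]

sample = sample_pseudowords(10, lexicon={"καλαμος"})
```

Enumerating every combination is feasible only for small inventories; with the full lists, drawing syllables at random per position would be the more practical choice.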

Table 2. Sample trisyllabic pseudonouns in -os and their scores in phonological parameters.

Pseudoword	BGtokfreqPho	BGtypfreqPho	nNeiPho	nNeiRDITPho	PLD20
temiðros	1.119	1.307	0	0	2.850
ðusotos	1.167	0.848	0	0	2.650
siɣopos	1.536	1.481	0	0	2.350
plemiɣos	0.749	0.881	0	0	2.850
ciçipos	1.410	1.138	0	0	2.450
travlinos	1.203	1.456	0	0	2.800
umiɣos	0.801	1.024	1	1	1.950
ekaɣos	1.225	1.422	0	0	2.000
vuksimos	0.647	0.909	0	0	2.800
ðupetos	0.524	0.593	0	0	2.900

Table 3. Ranges of phonological parameters for di- and trisyllabic nouns in A-Clean.

Noun type	BGtokfreqPho	BGtypfreqPho	nNeiPho	nNeiRDITPho	PLD20
Disyllabic nouns	0–1.147	0–1.185	0–16	0–20	0.960–2.495
Trisyllabic nouns	0–1.260	0–1.350	0–4.75	0–6.16	1.369–3.254

Table 4. Evaluation of trisyllabic pseudonouns in -os against established phonological parameters. Rejected words, which fall outside the acceptable parameter ranges, are marked in the last column.

Pseudoword	BGtokfreqPho	BGtypfreqPho	nNeiPho	nNeiRDITPho	PLD20	Outcome
Acceptable range	0–1.260 *	0–1.350	0–5	0–7	1.369–3.254	
temiðros	1.119	1.307	0	0	2.850	accepted
ðusotos	1.167	0.848	0	0	2.650	accepted
siɣopos	1.536	1.481	0	0	2.350	rejected
plemiɣos	0.749	0.881	0	0	2.850	accepted
ciçipos	1.410	1.138	0	0	2.450	rejected
travlinos	1.203	1.456	0	0	2.800	rejected
umiɣos	0.801	1.024	1	1	1.950	accepted
vusteɣos	0.995	0.989	0	0	2.950	accepted
vuksimos	0.647	0.909	0	0	2.800	accepted
ðupetos	0.524	0.593	0	0	2.900	accepted

* Although the lower bound of this range is technically negative, we set it to zero because bigram counts are log-transformed in the Num Tool, and the tool does not return negative values.
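
The range check behind Table 4 can be reproduced with a short Python sketch. The values are copied from the table; the names `RANGES`, `SCORES`, and `out_of_range`, the inclusive bounds, and the rule that a word is rejected when any single parameter falls outside its range are assumptions made for illustration.

```python
# Acceptable ranges for trisyllabic nouns, copied from Table 4
RANGES = {
    "BGtokfreqPho": (0.0, 1.260),
    "BGtypfreqPho": (0.0, 1.350),
    "nNeiPho": (0, 5),
    "nNeiRDITPho": (0, 7),
    "PLD20": (1.369, 3.254),
}

# Scores per pseudoword, in the same column order as RANGES
SCORES = {
    "temiðros": (1.119, 1.307, 0, 0, 2.850),
    "ðusotos": (1.167, 0.848, 0, 0, 2.650),
    "siɣopos": (1.536, 1.481, 0, 0, 2.350),
    "plemiɣos": (0.749, 0.881, 0, 0, 2.850),
    "ciçipos": (1.410, 1.138, 0, 0, 2.450),
    "travlinos": (1.203, 1.456, 0, 0, 2.800),
    "umiɣos": (0.801, 1.024, 1, 1, 1.950),
    "vusteɣos": (0.995, 0.989, 0, 0, 2.950),
    "vuksimos": (0.647, 0.909, 0, 0, 2.800),
    "ðupetos": (0.524, 0.593, 0, 0, 2.900),
}

def out_of_range(scores):
    """List the parameters whose value lies outside its range (bounds inclusive)."""
    return [
        name
        for name, value in zip(RANGES, scores)
        if not RANGES[name][0] <= value <= RANGES[name][1]
    ]

# Map each rejected word to the parameters it violates
rejected = {}
for word, scores in SCORES.items():
    violations = out_of_range(scores)
    if violations:
        rejected[word] = violations

for word, violations in rejected.items():
    print(word, "->", ", ".join(violations))
```

Under these assumptions, three of the ten words are rejected: siɣopos exceeds both bigram frequency ranges, ciçipos exceeds the token bigram range, and travlinos exceeds the type bigram range.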

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
