Next Article in Journal
Dataset and AI Workflow for Deep Learning Image Classification of Ulcerative Colitis and Colorectal Cancer
Previous Article in Journal
Collecting and Analyzing IBD Clinical Data for Machine-Learning: Insights from an Italian Cohort
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Data Descriptor

OrthoKnow-SP: A Large-Scale Dataset on Orthographic Knowledge and Spelling Decisions in Spanish Adults

by
Jon Andoni Duñabeitia
Centro de Investigación Nebrija en Cognición (CINC), Universidad Nebrija, 28043 Madrid, Spain
Data 2025, 10(7), 101; https://doi.org/10.3390/data10070101
Submission received: 20 May 2025 / Revised: 19 June 2025 / Accepted: 23 June 2025 / Published: 24 June 2025

Abstract

Orthographic knowledge is a critical component of skilled language use, yet its large-scale behavioral signatures remain understudied in Spanish. To address this gap, we developed OrthoKnow-SP, a megastudy that captures spelling decisions from 27,185 native Spanish-speaking adults who completed an 80-item forced-choice task. Each trial required selecting the correctly spelled word from a pair comprising a real word and a pseudohomophone foil that preserved pronunciation while violating the correct graphemic representation. The stimuli targeted six high-confusability contrasts in Spanish orthography. We recorded response accuracy and reaction times for over 2.17 million trials, alongside demographic and device metadata. Results show robust variability across items and individuals, with item-level metrics closely aligned with independent norms of word prevalence. A composite difficulty index integrating speed and accuracy further allowed fine-grained item ranking. The dataset provides the first population-scale norms of Spanish spelling difficulty, capturing regional and generational diversity absent from traditional lab-based studies. Public release of OrthoKnow-SP enables new research on the cognitive and demographic factors shaping orthographic decisions, and provides educators, clinicians, and developers with a valuable benchmark for assessing spelling competence and modeling written language processing.
Dataset License: CC0

1. Summary

Orthographic processing in skilled adulthood is now understood as a rapid, knowledge-driven operation in which precisely specified letter-string representations give readers almost immediate access to meaning [1]. Within the first few hundreds of milliseconds of viewing a word, adults recruit left-occipitotemporal networks that encode orthographic form independently of articulation, yet they can still fall back on phonological recoding when print-to-sound mappings are unreliable [2]. The efficiency of this system hinges on how tightly orthographic, phonological, morphological, and semantic codes are bonded [3].
Undoubtedly, orthographic knowledge is a core component of the multi-layered process involved in orthographic processing. In adults, it refers to the stored mental representations of whole words, word parts (e.g., affixes), and sublexical elements [4]. Crucially, orthographic knowledge contributes to overall lexical quality, with more robust representations enabling more efficient lexical access [3]. Notably, orthographic knowledge is not static; it is a dynamic system that supports the ongoing acquisition and refinement of lexical representations throughout the lifespan [5]. Experimental evidence confirms that adults continue to fine-tune their orthographic representations over time, especially for low-frequency or morphologically complex items [6].
This dynamic and evolving nature of orthographic knowledge becomes particularly relevant when considering languages like Spanish. Although Spanish exhibits high phonological transparency for reading, its orthographic system retains etymological complexities that demand more than simple phoneme-to-grapheme mappings. Certain graphemic domains remain persistent sources of difficulty, such as the phoneme /b/, which can be represented by either the letter B or V (e.g., the homophones vaca [cow] and baca [luggage rack]), or the phoneme /y/, which is mapped onto both the letter Y and the digraph LL (e.g., malla [mesh] and maya [Mayan]). Consequently, despite the relative consistency of Spanish orthography, adult writers must often rely on lexical knowledge to resolve ambiguities in spelling. Pseudohomophones—nonwords that are phonologically identical to real words (e.g., absolber, which preserves the pronunciation of absolver [to absolve])—have been widely employed to investigate this interplay between phonological and orthographic information during visual word recognition. Research conducted in Spanish consistently shows that such pseudohomophones pose a challenge for lexical decision, highlighting the cognitive effort involved in rejecting plausible-sounding nonwords [7].
Evidence from laboratory studies, classroom corpora, and neurocognitive research consistently points to elevated error rates in these orthographic contrasts [8,9]. Eye-tracking data have revealed increased competition between phonologically similar representations during reading and writing tasks [10], while electrophysiological studies indicate that late-stage verification processes draw primarily on stored lexical knowledge rather than phonological cues alone [11]. Together, these findings emphasize that even fluent adult readers must navigate a complex network of historical orthographic irregularities during written language production. They underscore the central role of orthographic knowledge in supporting efficient spelling and accurate visual word recognition.
To quantify these challenges at scale, we compiled OthoKnow-SP, a megastudy in which 27,185 Spanish adults completed an 80-item two-alternative forced-choice task (2AFC). Each trial presented a correctly spelled word paired with a pseudohomophone created by altering a critical grapheme, exploiting the ambiguities permitted by Spanish orthography. The project yielded 2.17 million observations together with age, sex, education, and device metadata (see Section 3 for details). OthoKnow-SP was designed to offer the possibility to (i) generate population norms of grapheme difficulty, (ii) trace how sociodemographic variables sculpt orthographic decisions, and (iii) furnish a benchmark for computational models trained on quasi-transparent orthographies. Early validation shows that item-level speed and accuracy correlate strongly with existing word knowledge norms [5], confirming sensitivity to established lexical variables.
Megastudies of this sort have transformed psycholinguistics by replacing small, homogeneous samples with crowdsourced datasets large enough to expose subtle interactions between reader characteristics and word properties [12]. Over the past decade Spanish has joined English, Dutch, and French as a major contributor to this movement, spawning large-scale, web-based resources [13,14,15]; however, none of these corpora directly targets the persistent orthographic confusability that surfaces in adult spelling, nor do they pair accuracy with high-resolution timing data in a demographically diverse sample. OthoKnow-SP extends that agenda by sampling beyond university cohorts, capturing regional, generational, and technological diversity that is typically invisible in laboratory work.
The OthoKnow-SP dataset underpins multiple lines of research. One ongoing project integrates OthoKnow-SP with existing crowdsourced lexical decision data to investigate lifespan changes in lexical quality and orthographic knowledge. A secondary objective is to extract systematic error patterns that may inform the development of more sophisticated spell-checking algorithms. In parallel, an international collaborative initiative is deploying the same experimental architecture across multiple languages, enabling the creation of comparable datasets and opening new avenues for cross-linguistic comparisons of orthographic competence and its modulation by demographic variables.
Public release of the dataset in an open repository will magnify these benefits: open access ensures full reproducibility, enables secondary analyses that may reveal novel demographic effects, and provides educators and natural language processing (NLP) developers with fine-grained norms for error-sensitive applications. Moreover, researchers can now investigate how age, gender, educational background, and device ecology influence the micro-dynamics of spelling, questions that earlier, smaller studies could only address in broad strokes.
In sum, OthoKnow-SP offers a richly annotated, population-scaled window onto the enduring irregularities of Spanish orthography and the sociocognitive factors that govern their resolution. We anticipate that its public availability, alongside sister corpora in other languages, will stimulate theoretical advances and practical tools alike.

2. Data Description

The OrthoKnow-SP dataset comprises 2,174,800 raw observations, corresponding to 27,185 participants completing 80 forced-choice spelling trials each. Detailed information about the participants, materials, and procedures is provided in Section 3. The final dataset is distributed through a public repository (https://doi.org/10.6084/m9.figshare.29107727) and consists of three main files: ORTHOKNOW-SP PARTICIPANT DATA.csv, ORTHOKNOW-SP RESPONSE DATA.csv, and OrthoKnow-SP.R.
The first main data file, ORTHOKNOW-SP PARTICIPANT DATA.csv, contains participant-level sociodemographic metadata for each individual in the study. The second main data file, ORTHOKNOW-SP RESPONSE DATA.csv, contains trial-level response data for each participant and each word pair. These files include the variables reported in Table 1. These files are encoded in UTF-8 CSV format and are compatible with standard analysis tools such as R and Python.
The dataset is accompanied by a supporting R script, OrthoKnow-SP.R, which contains all the code used to process, clean, and structure the raw data. This script performs key preprocessing operations, including the import and formatting of the original response logs, the filtering and validation of participant-level entries, the trimming of implausible reaction times, and the computation of accuracy metrics and trial-level summaries. These procedures ensure the integrity and consistency of the final dataset provided. Full methodological details regarding these steps are outlined in the Data Treatment section of the manuscript.

3. Methods

3.1. Participants

A total of 27,185 unique native speakers of Spanish, all of whom reported Spain as their country of residence and were at least 18 years of age, completed the experiment after providing informed consent through an online form that detailed the anonymisation of their data and its exclusive use for research purposes. Ethical clearance was granted by the Ethics Board of Universidad Nebrija (approval code UNNE-2022-0017), and the study was carried out following the rules of the Declaration of Helsinki of 1975, revised in 2008. The sample had a mean age of 40.40 years (SD = 13.0, range = 18–88) and comprised 62.06% women (n = 15,609) and 37.94% men (n = 9544; see Figure 1). Formal education was generally high, with 74.64% of participants holding a university degree, while the remainder had completed secondary school, professional training, high school, or primary school. Most volunteers accessed the task on a mobile phone (88.31%, n = 23,785), and the rest used a computer (11.69%, n = 3148).

3.2. Materials

To operationalize orthographic difficulty, we selected eighty Spanish words that have been consistently identified by teachers, normative guides, and prior behavioral studies as frequent targets of misspelling. The test items were designed to measure knowledge of conventional Spanish orthography by presenting participants with real words in their correct spelling alongside plausible but incorrect alternatives. Importantly, none of the incorrect alternatives involved violations of explicit orthographic rules; instead, they were chosen to reflect theoretically acceptable, but lexically incorrect, spellings for existing words. This approach allowed us to assess word-specific orthographic knowledge rather than rule-based generalizations. This design required participants to rely on stored orthographic knowledge rather than phonological cues alone. The manipulation was structured around six orthographic contrasts that are well-documented sources of confusion in contemporary Spanish. First, we included the historical merger of the bilabial plosives B and V, both pronounced /b/ but orthographically distinct (e.g., absolver [to absolve] vs. absolber). Second, we targeted the silent H, a letter with no phonetic realization in modern Spanish but retained in lexicalized forms (e.g., almohada [pillow] vs. almoada). Third, we manipulated the contrast between Y and LL, which can both represent the palatal approximant /y/ (e.g., boyante [buoyant] vs. bollante). Fourth, we drew on the graphemic ambiguity between C and Z, both of which can encode the /θ/ phoneme before certain vowels in Peninsular Spanish (e.g., cianuro [cyanide] vs. zianuro). Fifth, we addressed the alternation between G and J, where the voiceless velar fricative /x/ can be written with either letter depending on vowel context and other factors (e.g., exagerar [to exaggerate] vs. exajerar). Finally, we included items requiring the diaeresis <ü> in the digraph <gü>, which signals the preservation of the glide /w/ before front vowels (e.g., pingüino [penguin] vs. pingüino; note that this is the only contrast that involved a phonological change).

3.3. Data Collection

Data were gathered between 24 May and 11 June 2024 via a purpose-built website (https://nebrija.com/ortografia). The dissemination of the study was carried out primarily through institutional and personal social media accounts, with LinkedIn and X (formerly Twitter) being the most actively used platforms by the university, the research center, and the author. Additionally, the study received public attention through various media and science communication outlets, including The Conversation and Europa Press, which helped amplify its reach through syndication in general news media such as Yahoo News.
After landing on the site, visitors read a concise description of the study, the inclusion criteria, contact details for the research team, and the informed consent statement. They then completed a brief sociodemographic questionnaire (age, sex, highest level of education, country of residence, and native language) before proceeding to the practice trials that illustrated the two-alternative forced-choice (2-AFC) procedure. During the test, the eighty word–foil pairs were presented in random order. The spatial arrangement of the two spellings was also randomized on every trial. Participants indicated the spelling they believed correct by touching or clicking, and each trial terminated either when a response was registered or after 5000 ms had elapsed. A progress bar at the top of the screen provided continuous feedback on task completion.
Upon finishing, participants received personalised feedback that included their mean reaction time (RT), overall accuracy, a list of incorrectly answered items linked to their definitions in the official Spanish dictionary, and options for sharing their performance on social media or repeating the test.
All front-end interactions were implemented in JavaScript. PHP scripts (version 8.2) handled server-side communication with a MySQL database (version 8.0) that stored item identifiers, responses, reaction times, and session metadata. The full experiment typically required no more than eight minutes.

3.4. Data Treatment

For the present dataset, we retained only the first session of each participant who satisfied the inclusion criteria, yielding 2,174,800 raw observations (27,185 participants × 80 trials).
Participant-level accuracy ranged from 50% (chance) to 100%, with a mean of 90.4% (SD = 5.61%; see Figure 2, left panel). Reaction time (RT) analyses excluded all timeouts (4681 trials, representing 0.22% of the total) and all incorrect responses. For each participant, the remaining RTs were trimmed at the conventional threshold of the individual mean plus three standard deviations, eliminating a further 41,342 outliers (1.90% of the whole dataset). The cleaned RT matrix comprised 1,923,973 observations (M = 1418 ms, SD = 252 ms, Median = 1382 ms, range = 156–3966 ms; see Figure 2, right panel), providing a robust basis for subsequent item-level estimates.
To integrate speed and accuracy into a single descriptive metric, we computed an item-wise difficulty index. Mean RTs were first rescaled to the 0–1 interval, where 0 denotes the fastest and 1 the slowest item; mean accuracies were converted into error proportions, likewise rescaled so that 0 marks flawless performance and 1 the poorest. The two standardized values were averaged and multiplied by 100, producing a percentage score that assigns equal weight to slowness and inaccuracy. This composite facilitates intuitive item ranking while avoiding ceiling effects that would arise from relying on accuracy alone (see Figure 3 for a ranked list of the items based on their estimated difficulty).

Funding

This research was funded by the Spanish Ministry of Science and Innovation, Grant PID2021-126884NB-I00 (MCIN/AEI/10.13039/501100011033) and by a grant funded by the Fundación Ramón Areces.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of Universidad Nebrija (approval code UNNE-2022-0017 dated 2 December 2022).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The dataset and the code used to support reported results can be found at https://doi.org/10.6084/m9.figshare.29107727.

Acknowledgments

During the preparation of the manuscript, AI-based tools were exclusively used to assist with grammar and language refinement, given that the author is a non-native English speaker. The author has reviewed and edited the output and takes full responsibility for the content of this publication.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Grainger, J. Word recognition I: Visual and orthographic processing. In The Science of Reading: A Handbook, 2nd ed.; Snowling, M.J., Hulme, C., Nation, K., Eds.; Wiley Blackwell: Hoboken, NJ, USA, 2022; pp. 60–78. [Google Scholar] [CrossRef]
  2. Glezer, L.S.; Eden, G.; Jiang, X.; Luetje, M.; Napoliello, E.; Kim, J.; Riesenhuber, M. Uncovering phonological and orthographic selectivity across the reading network using fMRI-RA. NeuroImage 2016, 138, 248–256. [Google Scholar] [CrossRef] [PubMed]
  3. Perfetti, C. Reading Ability: Lexical Quality to Comprehension. Sci. Stud. Read. 2007, 11, 357–383. [Google Scholar] [CrossRef]
  4. Chrabaszcz, A.; Gebremedhen, N.I.; Alvarez, T.A.; Durisko, C.; Fiez, J.A. Orthographic learning in adults through overt and covert reading. Acta Psychol. 2023, 241, 104061. [Google Scholar] [CrossRef] [PubMed]
  5. Aguasvivas, J.; Carreiras, M.; Brysbaert, M.; Mandera, P.; Keuleers, E.; Duñabeitia, J.A. How do Spanish speakers read words? Insights from a crowdsourced lexical decision megastudy. Behav. Res. 2020, 52, 1867–1882. [Google Scholar] [CrossRef] [PubMed]
  6. Ziegler, J.C.; Ferrand, L. Orthographic consistency and word-frequency effects in auditory word recognition. Front. Psychol. 2011, 2, 263. [Google Scholar] [CrossRef]
  7. Fariña, N.; Duñabeitia, J.A.; Carreiras, M. Phonological and orthographic coding in deaf skilled readers. Cognition 2017, 168, 27–33. [Google Scholar] [CrossRef] [PubMed]
  8. Afonso, O.; Suárez-Coalla, P.; Cuetos, F. Spelling impairments in Spanish dyslexic adults. Front. Psychol. 2015, 6, 466. [Google Scholar] [CrossRef] [PubMed]
  9. Costello, B.; Caffarra, S.; Fariña, N.; Duñabeitia, J.A.; Carreiras, M. Reading without phonology: ERP evidence from skilled deaf readers of Spanish. Sci. Rep. 2021, 11, 5202. [Google Scholar] [CrossRef] [PubMed]
  10. Dean, C.A.; Kroff, J.R.V. Cross-Linguistic Orthographic Effects in Late Spanish/English Bilinguals. Languages 2017, 2, 24. [Google Scholar] [CrossRef]
  11. Furgoni, A.; Martin, C.D.; Stoehr, A. A cross linguistic study on orthographic influence during auditory word recognition. Sci. Rep. 2025, 15, 8374. [Google Scholar] [CrossRef] [PubMed]
  12. Keuleers, E.; Balota, D.A. Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments. Q. J. Exp. Psychol. 2015, 68, 1457–1468. [Google Scholar] [CrossRef] [PubMed]
  13. Duchon, A.; Perea, M.; Sebastián-Gallés, N.; Martí, A.; Carreiras, M. EsPal: One-stop shopping for Spanish word properties. Behav. Res. 2013, 45, 1246–1258. [Google Scholar] [CrossRef] [PubMed]
  14. Aguasvivas, J.A.; Carreiras, M.; Brysbaert, M.; Mandera, P.; Keuleers, E.; Duñabeitia, J.A. SPALEX: A Spanish lexical decision database from a massive online data collection. Front. Psychol. 2018, 9, 2156. [Google Scholar] [CrossRef] [PubMed]
  15. Buades-Sitjar, F.; Boada, R.; Guasch, M.; Ferré, P.; Hinojosa, J.A.; Duñabeitia, J.A. The predictors of general knowledge: Data from a Spanish megastudy. Behav. Res. 2022, 54, 898–909. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Descriptive visualization of the participant sample. The left panel displays the age distribution of participants, separated by gender. The sample spans a broad age range (from approximately 18 to 85 years), with the modal age group concentrated between 30 and 50 years. Female participants slightly outnumber male participants across most age bands. The right panel shows the number of participants completing the task each day over the data collection period. A sharp peak is observed on 3 June 2024, reflecting the date of the main public dissemination campaign.
Figure 1. Descriptive visualization of the participant sample. The left panel displays the age distribution of participants, separated by gender. The sample spans a broad age range (from approximately 18 to 85 years), with the modal age group concentrated between 30 and 50 years. Female participants slightly outnumber male participants across most age bands. The right panel shows the number of participants completing the task each day over the data collection period. A sharp peak is observed on 3 June 2024, reflecting the date of the main public dissemination campaign.
Data 10 00101 g001
Figure 2. Distributions of participant-level performance metrics. The left panel shows the distribution of participants’ mean accuracy across the 80 spelling trials. The vertical red line marks the overall mean accuracy. The right panel displays the distribution of participants’ mean reaction times, with the mean also indicated by a vertical red line.
Figure 2. Distributions of participant-level performance metrics. The left panel shows the distribution of participants’ mean accuracy across the 80 spelling trials. The vertical red line marks the overall mean accuracy. The right panel displays the distribution of participants’ mean reaction times, with the mean also indicated by a vertical red line.
Data 10 00101 g002
Figure 3. Ranked item difficulty based on a composite index integrating speed and accuracy. Each item’s difficulty score reflects the average of standardized reaction time and error rate, scaled to a 0–100 range. Higher values indicate greater difficulty.
Figure 3. Ranked item difficulty based on a composite index integrating speed and accuracy. Each item’s difficulty score reflects the average of standardized reaction time and error rate, scaled to a 0–100 range. Higher values indicate greater difficulty.
Data 10 00101 g003
Table 1. Description of the variables included in the two files constituting the OrthoKnow-SP dataset. The ORTHOKNOW-SP PARTICIPANT DATA.csv file contains sociodemographic and contextual information for each participant, while the ORTHOKNOW-SP RESPONSE DATA.csv file includes trial-level data capturing spelling decisions and response latencies across 80 trials per participant. Variable names, file origins, and descriptions are provided to facilitate interpretation and reproducibility.
Table 1. Description of the variables included in the two files constituting the OrthoKnow-SP dataset. The ORTHOKNOW-SP PARTICIPANT DATA.csv file contains sociodemographic and contextual information for each participant, while the ORTHOKNOW-SP RESPONSE DATA.csv file includes trial-level data capturing spelling decisions and response latencies across 80 trials per participant. Variable names, file origins, and descriptions are provided to facilitate interpretation and reproducibility.
FileColumn NameDescription
ORTHOKNOW-SP PARTICIPANT DATA.csvPARTICIPANTUnique alphanumeric identifier for each participant.
ORTHOKNOW-SP PARTICIPANT DATA.csvAGEAge of the participant in years.
ORTHOKNOW-SP PARTICIPANT DATA.csvGENDERSelf-reported gender (“Male”, “Female”).
ORTHOKNOW-SP PARTICIPANT DATA.csvEDUCATIONHighest level of education reported.
ORTHOKNOW-SP PARTICIPANT DATA.csvTIMESTAMPDate and time of task completion.
ORTHOKNOW-SP PARTICIPANT DATA.csvDEVICEDevice used to complete the task (“Computer”, “Mobile”).
ORTHOKNOW-SP RESPONSE DATA.csvPARTICIPANTUnique identifier matching the participant file.
ORTHOKNOW-SP RESPONSE DATA.csvCORRECTTarget word with correct spelling (lowercase string).
ORTHOKNOW-SP RESPONSE DATA.csvINCORRECTOrthographic foil (pseudohomophone with one spelling error).
ORTHOKNOW-SP RESPONSE DATA.csvORDERSerial position of the trial (1 to 80).
ORTHOKNOW-SP RESPONSE DATA.csvRTReaction time in milliseconds.
ORTHOKNOW-SP RESPONSE DATA.csvACCURACYBinary variable: 1 for correct responses, 0 for incorrect.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Duñabeitia, J.A. OrthoKnow-SP: A Large-Scale Dataset on Orthographic Knowledge and Spelling Decisions in Spanish Adults. Data 2025, 10, 101. https://doi.org/10.3390/data10070101

AMA Style

Duñabeitia JA. OrthoKnow-SP: A Large-Scale Dataset on Orthographic Knowledge and Spelling Decisions in Spanish Adults. Data. 2025; 10(7):101. https://doi.org/10.3390/data10070101

Chicago/Turabian Style

Duñabeitia, Jon Andoni. 2025. "OrthoKnow-SP: A Large-Scale Dataset on Orthographic Knowledge and Spelling Decisions in Spanish Adults" Data 10, no. 7: 101. https://doi.org/10.3390/data10070101

APA Style

Duñabeitia, J. A. (2025). OrthoKnow-SP: A Large-Scale Dataset on Orthographic Knowledge and Spelling Decisions in Spanish Adults. Data, 10(7), 101. https://doi.org/10.3390/data10070101

Article Metrics

Back to TopTop