Article

Transforming Historical Newspaper Research and Preservation Through AI: A Global Perspective

School of Communication, The Hang Seng University of Hong Kong, Hong Kong, China
* Author to whom correspondence should be addressed.
Journal. Media 2026, 7(1), 10; https://doi.org/10.3390/journalmedia7010010
Submission received: 16 September 2025 / Revised: 25 December 2025 / Accepted: 30 December 2025 / Published: 7 January 2026

Abstract

Artificial intelligence (AI) is transforming the preservation and research of historical newspapers by providing powerful tools that overcome longstanding challenges in terms of digitization, analysis, and access. This study offers a comprehensive global analysis of AI-driven innovations—including advanced Optical Character Recognition (OCR), Large Language Models (LLMs) for post-correction, and Natural Language Processing (NLP) techniques—that significantly enhance text extraction, image restoration, metadata generation, and semantic enrichment. Through qualitative case studies and comparative examinations of projects worldwide, this research demonstrates how AI not only improves the accuracy and efficiency of preservation workflows but also enables novel forms of computational inquiry such as cross-lingual analysis, sentiment detection, and discourse tracking. This study further explores emerging ethical and practical challenges and outlines future directions like multimodal analysis and collaborative digital infrastructures. The findings underscore AI’s transformative role in unlocking historical newspaper archives for both scholarly and public use, thereby fostering a deeper understanding of cultural heritage and historical narratives on a global scale.

1. Introduction

The digitization and preservation of historical newspapers are essential for safeguarding cultural heritage and ensuring long-term access to invaluable historical records. As primary sources, historical newspapers offer unfiltered accounts of past events, capturing cultural practices, societal transformations, and the evolution of media and journalism. They serve as rich resources for researchers, often containing original quotations and firsthand reports that lend authenticity and depth to historical inquiry (Tworek, 2024; Hauswedell et al., 2020). Large-scale digitization initiatives, such as the National Digital Newspaper Program in the United States, have demonstrated the scholarly and public value of these archives by enabling unprecedented access to millions of historical pages (Grohsgal, 2013).
Recent advancements in artificial intelligence (AI) have significantly transformed the landscape of newspaper digitization and preservation. AI technologies—including advanced Optical Character Recognition (OCR), Handwritten Text Recognition (HTR), Natural Language Processing (NLP), and image analysis—now enable more efficient and accurate methods for text recognition, metadata generation, and document restoration (Romein et al., 2025; Fewster & Arias-Hernandez, 2025). These innovations have greatly enhanced the ability to manage, interpret, and analyze large archival collections, thereby expanding the possibilities for historical research and digital humanities scholarship (Colavizza et al., 2021; Teel, 2024). For example, AI-driven semantic search and automated metadata creation have improved discoverability and usability across global archives (Johnson, 2025).
Digital humanities is an interdisciplinary field that merges the traditional methodologies of humanities research with cutting-edge digital technologies, significantly enhancing our understanding of historical newspaper archives. In this context, Artificial Intelligence plays a pivotal role, providing scholars with advanced tools to analyze, preserve, and interpret vast collections of historical newspapers.
This paper presents a comprehensive global analysis of AI-driven innovations in the digitization and preservation of historical newspapers, highlighting their transformative impact through case studies from North America, Europe, Asia, and Oceania. The primary objective of this research is to examine the multifaceted impact of AI technologies on the digitization, preservation, and research accessibility of historical newspapers. Specifically, the study explores how AI enhances digitization processes—improving text recognition, metadata generation, and overall document quality. It also investigates how AI facilitates research and access, focusing on advanced search capabilities and user experience. By addressing challenges such as variable image quality and complex historical document formats, this research aims to provide a comprehensive understanding of AI’s role in ensuring the long-term preservation, accessibility, and scholarly usability of historical newspaper archives.

2. Literature Review

Historical newspapers are widely recognized as indispensable sources for reconstructing sociocultural, political, and economic life, offering granular accounts of events, sentiments, and everyday practices that are often absent from official records. Their seriality and breadth make them particularly suited to tracing long-term transformations in public discourse, linguistic usage, media ecologies, and information infrastructures across time and space (Tworek, 2024). Beyond scholarly utility, newspapers play a constitutive role in cultural memory, preserving community narratives, vernacular voices, and the contingencies of public debate; well-curated collections thereby sustain cultural identity and enable reflexive engagement with the past.

2.1. Traditional Preservation, Digitization, and Their Limits

Conventional preservation strategies—microfilming foremost among them—have historically mitigated handling risks by providing stable surrogates, while environmental controls address the inherent fragility of wood-pulp, acidic paper that is highly sensitive to light, humidity, and temperature fluctuations. Digitization has become the dominant paradigm in the last two decades, promising improved access, storage efficiency, and risk reduction for vulnerable originals. Nevertheless, longstanding constraints persist. Many newspapers were produced as ephemeral commodities, leading to uneven collecting mandates, incomplete runs, and heterogeneous condition profiles across institutions. Moreover, digitization is not synonymous with accessibility or usability: image quality varies, article boundaries are often ambiguous in complex layouts, and downstream discoverability depends on the consistency of structural and descriptive metadata.
Large-scale programs such as the National Digital Newspaper Program (Chronicling America), Europeana Newspapers, the British Library’s newspaper digitization (including the British Newspaper Archive), Trove (Australia), Gallica (BnF, France), Delpher (Netherlands), and the Royal Library of Belgium’s initiatives have transformed availability for both scholars and the public (Ali et al., 2022; Oberbichler et al., 2021). Yet common technical obstacles remain substantial: OCR errors (especially for degraded pages, historical fonts such as Fraktur, or non-Latin scripts), incomplete or inconsistent metadata, inadequate article segmentation, and uneven support for complex features such as advertisements, tables, and images (Hamdi et al., 2020; Järvelin et al., 2015).

2.2. AI-Driven Innovations in Preservation

Artificial Intelligence (AI) technologies are increasingly addressing these limitations, offering innovative solutions for both preservation and research. Optical Character Recognition (OCR) forms the foundational layer of AI-assisted digitization, converting scanned newspaper images into machine-readable text. Advanced OCR systems, augmented by AI and deep learning, outperform traditional methods in handling complex layouts and degraded print quality (Ali et al., 2023; Menhour et al., 2021; Hamdi et al., 2020). Post-OCR correction using Large Language Models (LLMs) further enhances text quality, reducing character error rates significantly.
Metadata enrichment is another critical area where AI demonstrates transformative potential. Computer vision and machine learning approaches enable automated extraction of article authorship, publication dates, and categorization, improving indexing and retrieval capabilities in extensive newspaper archives (Ali et al., 2023; Jaillant & Aske, 2023). This automation reduces manual workload and accelerates archival workflows, allowing institutions to scale digitization efforts efficiently.

2.3. NLP and Computational Research

Natural Language Processing (NLP) techniques amplify the research potential of digitized newspapers. Named Entity Recognition (NER) and Entity Linking (EL) connect historical entities to structured knowledge bases like Wikidata, enriching contextual understanding and supporting longitudinal and geographic trend analysis. For example, the NewsEye project applies NER and EL to newspapers in French, German, Finnish, and Swedish from 1850 to 1950, enabling nuanced annotations and inferred author stances toward named entities.
Beyond preservation, AI-powered tools such as sentiment analysis, topic modeling, and machine translation enable advanced computational research. These methods support large-scale, cross-lingual studies that uncover patterns, biases, and narratives previously inaccessible through manual analysis (Colavizza et al., 2021). Generative AI and LLMs now facilitate corpus creation, semantic search, and multilingual alignment, marking a turning point for historical newspaper research (Oberbichler et al., 2021; Sikora & Haffenden, 2024).

2.4. Global Initiatives and Emerging Trends

International projects illustrate the interdisciplinary nature of AI-driven digitization. Collaborations among librarians, computer scientists, and historians have produced integrated workflows that combine technical innovation with scholarly needs (Oberbichler et al., 2021). AI-driven robotics and digital twins are emerging as tools for cultural heritage preservation, enabling predictive maintenance and immersive access to historical artifacts (Marchello et al., 2023). Similarly, AI-assisted metadata generation for images and documents reduces resource requirements and enhances discoverability (Reiche, 2023).
In sum, historical newspapers represent a vital nexus of preservation, accessibility, and scholarly inquiry. While traditional conservation and digitization have safeguarded fragile materials and broadened access, they remain constrained by issues such as paper degradation, OCR errors, and metadata gaps. AI-driven innovations—ranging from advanced OCR and post-processing to metadata automation and NLP—are dismantling these barriers, transforming static archives into dynamic, searchable resources. These technologies enable richer contextualization, cross-lingual analysis, and predictive preservation strategies, ensuring that historical newspapers not only endure but thrive as tools for cultural memory and interdisciplinary research in the digital age.

3. Research Questions

While existing literature identifies key themes and documents various AI applications in newspaper digitization and analysis, it lacks a comprehensive synthesis of the tools and techniques employed for preservation and research. This gap limits our ability to evaluate the full scope of technological innovation and its scholarly implications. To address this, the present study is guided by the following research questions, which aim to systematically investigate the current applications, transformative impacts, and future potential of artificial intelligence in the preservation and analysis of historical newspapers:
  • What are the primary AI technologies currently employed in the preservation of historical newspapers across different regions and institutions?
  • How are AI technologies enhancing the research capabilities and scholarly use of digitized historical newspaper archives?
  • What future trends and developments are anticipated as AI continues to be integrated into newspaper preservation and research practices?
Together, these questions position the study to examine AI’s role from multiple dimensions: identifying the technologies currently shaping preservation practices, assessing how these tools expand scholarly engagement with digitized archives, and anticipating future developments that may redefine cultural heritage strategies. By addressing these interconnected aspects, the research aims to provide a comprehensive understanding of AI as both a technical enabler and a catalyst for methodological innovation in historical newspaper studies.

4. Research Methods

This study adopts a mixed-methods design to examine the implementation and impact of artificial intelligence (AI) in historical newspaper digitization and research. Two complementary components are pursued in parallel and then integrated: (1) a qualitative multiple–case study analysis of deployed AI applications and (2) a systematic documentary analysis of project documentation and gray literature concerning present practice and future trajectories:
  • Qualitative Multiple–Case Study Analysis
This component examines representative cases of AI applications in newspaper preservation and research to understand real-world practices and outcomes. The process involves three stages:
Global Scan: The third author (research assistant) conducts a structured search of libraries and archives worldwide to compile AI applications used in digitization and restoration.
Classification: The second author, leveraging expertise in computational intelligence and image processing, organizes these applications into two categories: (a) AI tools for preservation and (b) AI tools for research.
In-depth Analysis: The first author studies selected cases in detail, documenting workflows, technologies, and challenges, and synthesizes findings through thematic analysis.
  • Systematic Documentary Review
A systematic review of technical reports, institutional guidelines, and relevant news coverage complements the case studies by identifying trends and future directions. The process includes:
Data Collection: The second and third authors gather authoritative sources on AI adoption and anticipated developments in newspaper preservation and research. Cases were purposively selected based on:
(a) use of AI for newspaper digitization, preservation, or research;
(b) availability of detailed public documentation; and
(c) geographic diversity across North America, Europe, and Asia/Oceania.
Authoritative sources were drawn from peer-reviewed articles, institutional reports, project websites, and gray literature. Searches in Web of Science, Scopus, Google Scholar, and institutional repositories used keywords such as: “historical newspapers” AND (“artificial intelligence” OR AI) AND (digitization OR preservation OR research).
Thematic Analysis: The first author systematically analyzes these materials to identify recurring patterns and themes concerning emerging technologies, institutional strategies, and anticipated innovations. This involves categorization and synthesis of insights, revealing how AI-driven approaches are prioritized and projected within the evolving landscape of historical newspaper preservation and research.
Together, these methods provide a comprehensive perspective on AI’s role in historical newspaper projects, integrating technical evaluation with contextual insights to address the study’s three research questions.

5. Results

The results of this study reveal the transformative impact of artificial intelligence on both the preservation and scholarly analysis of historical newspapers. Through a detailed examination of current technologies and practices, the research identifies key AI-driven innovations that address longstanding challenges in digitization, restoration, and content analysis. The findings are organized into three primary domains: (1) AI technologies for preservation, which focus on enhancing the quality, accessibility, and longevity of historical newspaper archives; (2) AI technologies for research, which enable new forms of computational inquiry, cross-lingual exploration, and large-scale historical analysis; and (3) AI technologies for future development, exploring pathways for advancing AI applications to unlock previously inaccessible historical sources and foster global scholarly collaboration. Collectively, these findings illustrate how AI is not only streamlining technical workflows but also reshaping the methodological foundations of historical scholarship. Building on current trends, this study identifies critical directions for future development—advancements that will extend the boundaries of historical inquiry and enable innovative research paradigms.

5.1. AI Technologies in Historical Newspaper Preservation

AI has fundamentally transformed the global preservation of historical newspapers through initiatives such as Chronicling America (United States), Europeana Newspapers (European Union), the British Newspaper Archive (United Kingdom), Papers Past (New Zealand), and the Chinese Text Project (China). These projects leverage AI technologies—including advanced Optical Character Recognition (OCR), language modeling, image restoration, and automated archiving—to create accurate, searchable, and durable digital collections. This integration significantly enhances accessibility and ensures the long-term safeguarding of cultural heritage for researchers and the public.
This research identifies several transformative AI technologies that are revolutionizing the preservation and digitization of historical newspapers. These innovations address longstanding challenges such as physical degradation, labor-intensive manual processes, and the need for accurate, machine-readable text.
  • AI-powered Optical Character Recognition (OCR) and Post-OCR Correction
Basic OCR itself is not inherently an AI tool but falls within the fields of computer vision and pattern recognition. Traditional OCR systems often struggle with the poor print quality, unconventional fonts, and physical wear that are common in historical newspapers, and early systems faced significant challenges due to irregularities in historical prints (Lehman, 2024). However, recent advancements in deep learning, particularly through the use of convolutional neural networks (CNNs) and long short-term memory (LSTM) models, have greatly enhanced the performance of OCR in these contexts (Mistral AI, 2025). When AI and machine learning are integrated into OCR, such systems become capable of learning from data, understanding context, and improving over time, which classifies them as AI-powered OCR solutions. AI-enhanced OCR engines, such as Google Cloud Vision, utilize these deep learning techniques to boost text recognition accuracy. Powered by Google’s extensive computer vision infrastructure, Google Cloud Vision processes images via API to deliver multilingual OCR results. It automatically detects languages and scripts, including Latin, Japanese, Chinese, and Devanagari, while performing layout analysis that organizes words into lines and paragraphs with precise bounding boxes. This capability is invaluable for researchers working with heterogeneous historical collections, reducing the need for extensive manual preprocessing.
Emerging practices incorporate large language models (LLMs) for post-OCR correction, leveraging their advanced language understanding to analyze contextual information and rectify OCR errors effectively. Fine-tuned LLMs, such as ByT5 and Llama models, significantly reduce character error rates by learning OCR-specific error patterns. These models improve the fidelity, coherence, and usability of digitized texts by correcting errors that traditional methods often miss. The integration of LLMs with OCR enhances text quality across diverse documents, including historical and modern texts, marking a significant advancement in OCR technology (Thomas et al., 2024; Kanerva et al., 2025).
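The character error rate (CER) mentioned above is commonly computed as the edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The following is a minimal sketch of that calculation, not the evaluation code of any cited study; the function names and the sample strings are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: a masthead before and after post-OCR correction.
ground_truth = "The Evening Star"
raw_ocr = "Tbe Evcning Star"
corrected = "The Evening Star"

cer_before = character_error_rate(ground_truth, raw_ocr)   # 2 errors / 16 chars = 0.125
cer_after = character_error_rate(ground_truth, corrected)  # 0.0
```

Comparing `cer_before` and `cer_after` on a held-out set of manually transcribed pages is one way to check whether a correction step helps, and whether it introduces new errors.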
  • Text and Image Restoration
Physical deterioration such as faded ink, stains, and paper damage presents substantial challenges in preserving historical documents. Generative adversarial networks (GANs) have become essential in advancing digital restoration by reconstructing damaged visuals. For example, EA-GAN uses a two-branch structure to restore severely damaged Chinese characters by referencing undamaged examples, improving accuracy and preserving detail (Zheng et al., 2023). GANs also enhance scanned historical newspapers by removing noise, sharpening blurred text, and reconstructing missing elements, which significantly boosts both human readability and OCR performance (Kristianto & Soewito, 2025). Other GAN-based models apply super-resolution and inpainting techniques to improve low-quality document images, enabling the recovery of text even when pages experience substantial information loss. For instance, ESRGAN (Enhanced Super-Resolution GAN) is used to reduce noise, eliminate artifacts such as ink bleed or paper tears, and sharpen faded textures, thereby improving subsequent OCR accuracy. Similarly, Topaz Gigapixel AI employs proprietary deep learning models to upscale low-resolution scans and restore clarity, making faint or damaged visuals suitable for detailed analysis. These AI-powered restoration tools support long-term digital preservation, keeping cultural heritage accessible and usable for researchers and the public. Their ongoing development continues to push the boundary of what is possible in restoring and digitizing fragile, deteriorated archival materials, ultimately safeguarding historical records for future generations (Kaur, 2025; Kovács, 2024; Carneiro, 2024).
  • Automated Digital Archiving and Preservation
AI-driven platforms play a critical role in automating the long-term archiving and preservation of historical newspapers and documents. Beyond initial digitization, these platforms perform automated file integrity checks and monitor for data degradation, ensuring the digital archive remains accurate and uncorrupted over time. They also manage format migrations, converting files to newer formats compatible with evolving software and hardware standards, preventing obsolescence. For example, large-scale projects like Chronicling America and the British Newspaper Archive utilize AI-powered workflows to handle metadata generation, quality assurance, and content indexing, thereby streamlining access and sustainability. Projects such as Historascan exemplify AI’s evolution from an auxiliary tool to a core component in digitizing materials dating back to the 1850s (Costa et al., 2025). Additionally, archival platforms such as Preservica employ AI to automate metadata cleanup, transcribe handwritten annotations, and proactively monitor digital degradation, ensuring long-term integrity while revealing previously obscured content. Likewise, JSTOR Digital Stewardship’s Seeklight AI rapidly generates descriptive, machine-readable metadata and transcripts for both typescript and cursive elements, significantly improving search engine indexing and cross-collection discoverability for researchers worldwide. Through scalable platforms integrated with OCR, natural language processing (NLP), and machine learning, institutions can efficiently preserve vast collections of fragile historical prints. This automation reduces human workload, minimizes errors, and supports ongoing accessibility, making cultural heritage available for future research and public use, while adapting to technological changes in digital preservation standards (Kristianto & Soewito, 2025).
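The automated file integrity checks described above are usually built on checksums: a digest is recorded for each file at ingest and recomputed later to detect silent corruption or loss. The sketch below illustrates the idea with SHA-256 from the Python standard library. It is an assumption-laden illustration, not the mechanism of Preservica or any other named platform; `build_manifest` and `verify_manifest` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large page scans do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Record a checksum for every file under root, keyed by relative path."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare current files against a stored manifest and report problems."""
    report: Dict[str, List[str]] = {"missing": [], "changed": []}
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file():
            report["missing"].append(relative_path)
        elif sha256_of(target) != expected:
            report["changed"].append(relative_path)
    return report
```

In practice the manifest would be stored separately from the files it describes, and `verify_manifest` would be run on a schedule; any entry under `missing` or `changed` would trigger restoration from a replica.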
Table 1 summarizes the key AI tools employed in preservation, which together form a comprehensive pipeline. Advanced OCR systems convert degraded images into searchable text, while large language models (LLMs) refine this output for usability. Image restoration tools improve the quality of source materials, and AI-powered archiving platforms automate functions like metadata management and format migration for digital preservation. This multi-faceted, AI-driven approach produces high-quality digital surrogates of historical newspapers. Projects worldwide illustrate how AI has evolved into a fundamental component of digitization and preservation efforts, significantly enhancing the usability and longevity of historical archives (Carneiro, 2024).
  • Challenges and Ethical Considerations for AI in Historical Newspaper Preservation
Despite significant advancements, several challenges persist in AI-driven newspaper preservation. OCR systems continue to struggle with severely degraded materials and non-standard scripts, often requiring custom models and substantial computational resources (Eureka Network, 2025; Businessware Technologies, 2024). Large Language Models (LLMs) introduce risks of inaccuracies, including fabricated corrections that may distort historical records, underscoring the need for rigorous validation protocols (MacFarlane, 2024; Newman & Cherubini, 2025b). Image restoration technologies, particularly those based on Generative Adversarial Networks (GANs), raise ethical concerns regarding authenticity, as they may inadvertently introduce fictional details that mislead future scholars (Newman, 2025a).
Moreover, archiving platforms face scalability issues when managing vast collections and encounter privacy concerns related to AI-enhanced metadata (Businessware Technologies, 2024). Broader implications include the risk of bias amplification, where models trained on limited datasets undervalue non-Western or underrepresented archives, highlighting the urgent need for more diverse and inclusive training datasets.

5.2. AI Technologies in Historical Newspaper Research

AI technologies are not only improving access to historical newspapers but also enabling new forms of computational scholarship and large-scale analysis that were previously unattainable. By enhancing scholars’ ability to analyze and interpret extensive archival collections with remarkable precision, AI is redefining historical research worldwide. In Europe, projects such as Impresso and NewsEye advance text mining and semantic search, enabling multilingual analysis of 19th-century press archives and uncovering significant themes that deepen understanding of historical contexts. In the United States, Chronicling America uses AI to enhance the searchability of millions of newspaper pages, while Newspaper Navigator applies computer vision to classify visual content like photographs and advertisements, enriching cultural studies through multimodal analysis.
Globally, AI initiatives extend to Transkribus, which specializes in recognizing handwritten texts from Ottoman and Asian archives, bridging linguistic and historical gaps. In Australia, the Trove platform employs AI to refine metadata and improve thematic analysis of regional newspapers. Similarly, Canada’s Digital Newspaper Collection leverages machine learning for accurate text recognition, supporting pattern identification and thematic research. Across Asia, the National Library of Korea’s AI-driven program and Japan’s NDL Lab projects are transforming access to historical newspapers through semantic technologies, enabling deeper insights into cultural and social narratives.
Collectively, these advancements convert static archives into dynamic datasets, fostering cross-cultural studies, comparative analyses, and innovative methodologies. By harnessing AI, scholars can uncover nuanced historical perspectives, contributing to a richer understanding of global heritage and the evolution of media across time.
Three major categories—text analysis and processing, content conversion and generation, and visual and pattern recognition—encapsulate both the diverse capabilities and the transformative potential of AI in historical newspaper research.
  • Text Analysis and Semantic Processing
Text analysis technologies lie at the heart of modern research on digitized newspapers, enabling scholars to move beyond simple keyword searches toward rich semantic exploration. These tools allow researchers to uncover patterns, relationships, and narratives embedded in historical texts, supporting both qualitative and quantitative methodologies.
Content analysis and metadata generation extract keywords, summaries, topics, and entities, enabling efficient navigation across massive corpora. For example, topic modeling in the NewsEye Project (NewsEye, 2025) revealed thematic shifts in European press during World War I, while metadata tagging in Impresso (Impresso Project, 2025) improved interoperability across Swiss and Luxembourgish archives, tracing themes like industrialization and labor movements.
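As a concrete illustration of keyword extraction for metadata, the sketch below ranks the terms of each article by TF-IDF, so that words frequent in one article but rare across the corpus score highest. This is a deliberately simple stand-in and not the topic modeling used by NewsEye or the tagging pipeline of Impresso; the tokenizer and function names are assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic runs only (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(documents: List[str], top_k: int = 3) -> List[List[str]]:
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]

    # Document frequency: in how many documents does each term appear?
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    keywords = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(ranked[:top_k])
    return keywords


# Hypothetical snippets standing in for OCR text of three articles.
articles = [
    "strike strike workers factory the",
    "harvest harvest farmers the",
    "the election vote vote",
]
top_terms = tfidf_keywords(articles, top_k=1)
```

Terms that occur in every document, such as "the" here, receive a score of zero and fall to the bottom of the ranking, which is why TF-IDF is often used as a first-pass filter before heavier topic models.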
Sentiment analysis adds depth by detecting emotional tone and tracking public opinion trends over time. For instance, in the Chronicling America project, sentiment analysis revealed fluctuating attitudes toward U.S. presidential candidates during early 20th-century elections, highlighting regional differences in political discourse and media framing. Similarly, pandemic-era studies applied sentiment analysis to digitized newspapers and social media archives, uncovering contrasting responses to health measures across states—such as varying levels of support for mask mandates and lockdowns—while also identifying shifts in tone as infection rates and government policies evolved. These insights not only enrich historical narratives but also enable researchers to correlate sentiment patterns with socio-economic factors, cultural norms, and policy outcomes.
Named Entity Recognition (NER) identifies people, places, organizations, and dates, supporting event tracking and network analysis. Projects like Impresso and NewsEye use NER to map influential figures and political discourse.
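Once entities have been extracted, network analysis can start from a simple co-occurrence count: two entities are linked each time they appear in the same article. The sketch below assumes that NER has already been run by an upstream tool and that each article is represented as a list of entity strings; the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List


def cooccurrence_network(articles: List[List[str]]) -> Counter:
    """Count how many articles mention each unordered pair of entities."""
    edges: Counter = Counter()
    for entities in articles:
        # Deduplicate within an article and sort so each pair has one canonical order.
        unique = sorted(set(entities))
        for pair in combinations(unique, 2):
            edges[pair] += 1
    return edges


# Hypothetical NER output for three articles.
entity_lists = [
    ["Wilson", "Paris", "Clemenceau"],
    ["Wilson", "Clemenceau", "Wilson"],
    ["Paris"],
]
network = cooccurrence_network(entity_lists)
strongest_pair, weight = network.most_common(1)[0]
```

The resulting edge weights can be exported to a graph tool for visualization, or filtered by date to follow how associations between figures and places change over time.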
Advanced Natural Language Processing (NLP) libraries and models such as spaCy, NLTK, and BERT enable contextual parsing, semantic clustering, and multilingual search. Essential in post-OCR workflows, these tools help correct transcription errors and support analyses like ideological clustering and propaganda detection.
Collectively, these technologies empower historians to uncover patterns, reconstruct networks, and explore cultural and political dynamics, transforming archives into dynamic resources for global research.
  • Content Conversion and Generation
AI-powered content conversion technologies overcome key barriers in historical newspaper research by transforming fragile printed and handwritten materials into searchable digital resources. Automated transcription and translation tools, such as Transkribus OCR/HTR, not only digitize degraded text but also handle complex scripts, including Gothic and cursive handwriting, which were previously inaccessible to large-scale analysis. Beyond basic transcription, Transkribus integrates language models for multilingual access, enabling researchers to work across archives in German, French, Spanish, and other languages without manual translation. A major advancement lies in its ability to support cross-lingual and transnational research: AI models can extract structured event data—identifying who, what, when, where, why, and how—from diverse sources. This capability facilitates comparative studies of phenomena such as epidemics, wars, or natural disasters, revealing how narratives evolved across regions and cultures. For example, researchers can trace how reports of the 1918 influenza pandemic differed between European and American newspapers, uncovering variations in tone, public health messaging, and societal impact. Such functionality transforms fragmented historical records into interconnected datasets, opening new avenues for global historical analysis.
Complementing transcription, Generative AI and Large Language Models (LLMs)—including GPT-4o, Llama 3, and Gemini 1.5 Pro—enhance data quality by correcting OCR errors, restoring context, and summarizing content. These tools also enable entity extraction and thematic clustering, creating structured datasets for advanced historical analysis. Together, these technologies transform static archives into dynamic resources, fostering global scholarship and innovative methodologies in digital humanities.
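One common way to check whether a post-correction step actually improves OCR output is the character error rate (CER), the edit distance between a transcription and a trusted reference divided by the reference length. The sketch below is illustrative only: the sample strings are invented, and the corrected string stands in for the output of whichever model a project uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete a character of a
                               current[j - 1] + 1,       # insert a character of b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)

# Invented sample data: a ground-truth line, raw OCR, and a corrected version.
reference = "the influenza epidemic"
raw_ocr = "tbe influcnza epidcmic"
corrected = "the influenza epidemic"

print(f"CER before correction: {cer(reference, raw_ocr):.3f}")
print(f"CER after correction:  {cer(reference, corrected):.3f}")
```

Comparing CER before and after correction on a held-out sample shows whether a model is repairing errors or introducing new ones.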
3. Visual Categorization and Pattern Recognition
Historical newspapers often feature complex layouts and mixed media, requiring advanced visual and pattern recognition tools to unlock their full research potential. Modern computer vision technologies, such as Newspaper Navigator, Siamese networks, and YOLOv5, go beyond simple OCR by detecting and classifying visual elements like photographs, illustrations, advertisements, and page structures. For example, Newspaper Navigator, developed by the Library of Congress, uses machine learning to extract and categorize millions of images from digitized newspapers, enabling researchers to analyze visual culture trends such as the rise of political cartoons during wartime or the evolution of fashion imagery across decades. Similarly, YOLOv5 excels in object detection, allowing automated identification of recurring motifs, such as brand logos or product packaging, in historical advertisements. Siamese networks support similarity-based retrieval, making it possible to trace design patterns or compare visual layouts across different publications and time periods. These capabilities open new avenues for studying advertising strategies, graphic design evolution, and the interplay between text and imagery in shaping public discourse. For instance, scholars can now quantify how visual propaganda shifted during World War I versus World War II or track the emergence of consumer culture through illustrated ads in early 20th-century newspapers.
Beyond layout analysis, text-mining and deep learning systems scan entire corpora to detect recurring linguistic and structural patterns, revealing shifts in language, themes, and sentiment that reflect cultural and political change. For example, the Impresso project—a collaborative initiative involving Swiss and Luxembourg institutions—applies advanced NLP and machine learning techniques to digitized historical newspapers from multiple European countries. Its tools enable researchers to trace the rise of consumer advertising in early 20th-century publications, analyze wartime propaganda strategies, and study how narratives evolved across borders. A key feature of Impresso is cross-lingual text reuse detection, which identifies replicated or adapted content across newspapers in different languages, such as French, German, and Italian. This capability is crucial for examining competing narratives, misinformation during crises, and the role of press discourse in shaping national identities. By linking articles that share thematic or rhetorical patterns, Impresso allows scholars to map the dissemination of ideas—such as public health messaging during pandemics or political rhetoric during elections—revealing how media ecosystems influenced public opinion and cultural norms over time. Table 2 presents an overview of the most commonly used AI tools in historical newspaper research.
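The basic idea behind text reuse detection can be sketched with word n-gram "shingles" and Jaccard similarity. This is a generic, monolingual illustration and is not presented as Impresso's method; cross-lingual detection would additionally require translation or multilingual embeddings, which are not shown. The sample articles and the shingle size of three words are assumptions chosen for the example.

```python
import re

def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams in a lowercased, punctuation-free text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Share of shingles two texts have in common (0 = none, 1 = identical sets)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Invented sample articles: a and b share a reprinted passage, c is unrelated.
article_a = "The city council voted to close all schools until the epidemic subsides."
article_b = "Officials said the city council voted to close all schools for two weeks."
article_c = "Wheat prices rose sharply on the Chicago exchange yesterday."

sim_ab = jaccard(shingles(article_a), shingles(article_b))
sim_ac = jaccard(shingles(article_a), shingles(article_c))
print(f"a vs b: {sim_ab:.2f}")
print(f"a vs c: {sim_ac:.2f}")
```

Pairs scoring above a chosen threshold become candidate reuse links for closer reading. Because OCR noise lowers overlap, thresholds are usually tuned on a hand-checked sample.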
It is worth noting that AI tools developed for historical newspaper preservation frequently exhibit overlapping functionalities that extend beyond mere digitization, thereby actively supporting scholarly research on these invaluable documents. Generative AI and Large Language Models (LLMs) not only facilitate the correction and extraction of information from historical texts but also significantly contribute to content comprehension and analytical processes through contextual learning. These models are capable of generating coherent text, summarizing extensive datasets, and assisting in the identification of events and entities. For instance, GPT-4o is employed for event detection, Llama 3 enhances OCR accuracy in converting printed characters to digital text, and Gemini 1.5 Pro specializes in article extraction, enabling researchers to efficiently locate pertinent information.
4. Challenges and Ethical Considerations for AI in Historical Newspaper Research
As AI technologies are increasingly employed in historical newspaper research, several challenges and ethical considerations arise. One significant issue is the biases present in AI models, often stemming from training data. Models trained on modern datasets may introduce contemporary biases that misrepresent historical narratives, such as reinforcing gender stereotypes in historical coverage (Bender et al., 2021). Handling input quality, particularly regarding OCR errors, remains a persistent challenge; inaccuracies can lead to false conclusions about historical events or themes when researchers rely on flawed transcriptions (Smith & Cordell, 2018).
Automated processes, including transcription and translation, also face ethical dilemmas concerning cultural sensitivity. Algorithms may inadvertently reinforce cultural prejudices or misinterpret politically sensitive articles, requiring human oversight to ensure fidelity to original content (Floridi & Cowls, 2019). For example, the Transkribus platform, while effective for handwritten text recognition, faces scrutiny over its handling of historical dialects, which may not be adequately represented in its training sets (Muehlberger et al., 2019).
Overall, while AI presents exciting opportunities for advancing research, addressing these challenges and ethical considerations is essential to ensure responsible and accurate scholarship in historical newspaper studies.

5.3. Future Directions

The integration of artificial intelligence into historical newspaper preservation and research signifies a profound transformation in historiography, heralding the emergence of Computational Archival Science (CAS)—a discipline where algorithmic processing intersects with archival theory. Moving beyond the foundational stage of mass digitization, forthcoming developments will pursue innovative trajectories that exploit advanced AI technologies to enable richer, more holistic engagement with historical newspapers.
1. Multimodal Analysis and Volumetric Restoration
Future research will move decisively beyond text-centric approaches toward multimodal interpretation, treating newspapers as complex cultural artifacts rather than mere text repositories. AI systems will increasingly integrate visual, textual, and structural features, enabling scholars to analyze typography, editorial design, advertisements, and illustrations as interconnected historical evidence (Lee et al., 2020).
A critical frontier will involve volumetric restoration techniques, inspired by virtual unrolling methods applied to ancient manuscripts. Advanced 3D imaging combined with machine learning will allow recovery of content from warped, brittle, or otherwise inaccessible newspapers previously considered irretrievable. These innovations will extend preservation beyond flat scanning, reconstructing damaged pages and restoring their visual integrity for computational analysis. Ultimately, multimodal strategies will underpin semantic segmentation across heterogeneous sources, ensuring robust interpretation of complex layouts from diverse eras and regions.
2. End-to-End AI Stewardship Platforms with Human-in-the-Loop
The next generation of AI-powered stewardship platforms will deliver fully integrated workflows, from digitization to long-term preservation. These systems will employ LLMs for automated text correction, semantic enrichment, and metadata generation, while embedding human-in-the-loop mechanisms to ensure quality and accountability. Such platforms will democratize access by enabling institutions of all sizes to digitize localized collections, amplifying historically marginalized voices and fostering inclusive historical narratives (Rijhwani et al., 2020; Colavizza et al., 2021).
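A human-in-the-loop mechanism can be as simple as routing low-confidence output to a reviewer. The sketch below assumes an OCR engine that reports a per-line confidence score between 0 and 1; the `OcrLine` type, the sample lines, and the 0.85 threshold are all hypothetical and would need tuning against a real collection.

```python
from dataclasses import dataclass

@dataclass
class OcrLine:
    text: str
    confidence: float  # hypothetical engine-reported score in [0, 1]

def triage(lines, threshold=0.85):
    """Split lines into an auto-accepted list and a human-review queue."""
    accepted, review = [], []
    for line in lines:
        (accepted if line.confidence >= threshold else review).append(line)
    return accepted, review

# Invented sample output from a recognition pass.
lines = [
    OcrLine("The Daily Gazette", 0.97),
    OcrLine("Tbe wcather tomorow", 0.52),
    OcrLine("Price two cents", 0.91),
]

accepted, review = triage(lines)
print(f"{len(accepted)} lines accepted, {len(review)} sent for review")
```

Keeping the review queue explicit also leaves an audit trail of which passages a person verified.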
3. Global Networks for Cross-Lingual Interoperability
Future infrastructures will prioritize global networks built on interoperable standards such as the International Image Interoperability Framework (IIIF) (Snydman et al., 2015). These networks will support cross-lingual named entity recognition and semantic linking, enabling large-scale comparative analyses that trace transnational patterns and the circulation of ideas across linguistic and cultural boundaries (Ehrmann et al., 2020; Ehrmann et al., 2023). By facilitating the distant reading of interconnected newspaper archives worldwide, these systems are poised to fundamentally transform the scale and scope of historical inquiry (Beals & Bell, 2020).
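To make the interoperability point concrete, the sketch below builds an image request that follows the URL pattern of the IIIF Image API (identifier, region, size, rotation, quality, format). The server address and page identifier are hypothetical, and the `max` size keyword assumes version 3 of the API.

```python
from urllib.parse import quote

def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build a IIIF Image API request URL for one image or image region."""
    # Identifiers must be percent-encoded, including any slashes they contain.
    ident = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Hypothetical server and page identifier.
base = "https://example.org/iiif"
page = "paper/1918-10-05/p1"

full_page = iiif_image_url(base, page)
# Top-left quarter of the page, expressed as percentages of the full image.
masthead = iiif_image_url(base, page, region="pct:0,0,50,50")
print(full_page)
print(masthead)
```

Because every compliant server accepts the same pattern, one viewer or analysis script can request pages from many archives without custom code for each.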
4. Embedding Algorithmic Accountability and Ethical Stewardship
As AI becomes integral to archival practice, ethical stewardship will be paramount. Future frameworks will incorporate transparency tools—such as datasheets for datasets and model cards—to document provenance and biases, fostering critical engagement with algorithmic outputs (Gebru et al., 2021). Explainable AI will empower historians to interrogate computational interpretations, while privacy-preserving techniques will balance historical truth with protections for individuals represented in more recent records (Lemieux et al., 2025; Jaillant & Zhao, 2025).
Evidence of this shift toward ethically grounded systems is visible in regulatory and technical initiatives. For example, the EU AI Act explicitly classifies archival AI applications as high-risk, mandating transparency, bias mitigation, and human oversight (European Parliament, 2024). Broader efforts address biases in historical archives and promote accountability, although pilots in collections such as Chronicling America remain unconfirmed. Beyond academic literature, recent gray literature and policy briefs track the operationalization of these principles in archival workflows, signaling a systemic move toward accountability and inclusivity (Jobin et al., 2019; Whittlestone et al., 2019). The overarching goal is to ensure accountable, inclusive AI systems that respect the diversity of legacies preserved within newspaper archives.
As shown in Table 3, the future of AI in historical newspaper preservation and research is characterized not by incremental improvements, but by a paradigm shift toward more autonomous, collaborative, and ethically grounded systems. These advancements will unlock previously inaccessible historical sources, foster global scholarly collaboration, and enable new forms of inquiry that redefine the boundaries of historical research.

5.4. Implications for Media Narratives, Journalistic Practices, and Media Historiography

AI-driven preservation tools—such as advanced OCR and post-correction language models—do more than digitize newspapers; they actively shape media narratives through selective error correction and metadata generation (Hammarström, 2025). Algorithmic refinements can privilege certain voices while marginalizing others, influencing which discourses remain legible in digital archives (van Otterlo, 2018). These processes introduce new forms of gatekeeping, as systems trained on contemporary data retroactively impose linguistic norms, potentially altering reconstructions of past media ecologies (Garajamirli, 2025). This underscores the need for critical examination of AI as an active agent in preserving journalistic heritage, extending debates on archival bias into the era of computational mediation.
Similarly, AI tools in historical newspaper research—such as NLP-driven entity recognition and semantic processing—enable large-scale analyses of journalistic practices, including shifts in objectivity norms over time. However, underrepresented languages and dialects often remain poorly modeled, risking algorithmic forgetting within digital archives (Ruhil, 2025). In media historiography, platforms like NewsEye illustrate how algorithmic tools trace the evolution of news genres and public sphere dynamics through multilingual and cross-lingual analysis, while inviting reflection on AI’s role in shaping historical interpretation (Oiva et al., 2020; NewsEye Consortium, 2024).

6. Conclusions and Implications

The research clearly shows that artificial intelligence is revolutionizing the preservation, digitization, and accessibility of historical newspapers on a global scale. Through the integration of advanced technologies such as Optical Character Recognition enhanced by deep learning, Large Language Models for correcting OCR errors, Natural Language Processing, and sophisticated image restoration, the digitization process has seen significant improvements in both accuracy and efficiency. These advancements have transformed historical newspaper archives from static collections into dynamic, searchable digital resources that facilitate new computational methods of research. Consequently, AI has enabled scholars to perform cross-lingual analyses, semantic enrichment, and large-scale historical studies with greater precision, thereby expanding the potential for scholarly engagement and public access to cultural heritage. While this study acknowledges ongoing challenges, including limitations in OCR accuracy, ethical concerns around data authenticity, and the need to address biases in AI training datasets, it illustrates how AI serves as a transformative force in historical newspaper research and preservation.
This study advances journalism and media scholarship by conceptualizing AI-enabled historical newspaper archives as mediated spaces that influence the production, accessibility, and interpretation of journalism’s past. Drawing on global case studies, it extends theoretical debates on news temporality, archival gatekeeping, and the circulation of journalistic discourse, positioning AI as a form of algorithmic mediation within media historiography. The proposed tripartite framework—AI for preservation, research, and future stewardship—offers a transferable lens for examining transformations in media memory infrastructures and for bridging computational archival science with media theory. These insights contribute to understanding how AI-driven processes shape archival practices and research methodologies in the evolving landscape of digital media studies.
Building on these findings, the implications of AI’s transformative impact are expansive and deeply intertwined with the field of digital humanities. Theoretically, this research underscores the emergence of a new interdisciplinary field—Computational Archival Science—where AI-driven algorithmic processing intersects meaningfully with archival theory, reshaping the ways historians engage with and interpret archival materials. Practically, AI technologies streamline preservation workflows by automating text correction, semantic tagging, and metadata generation, thus reducing the manual burden on archivists and increasing the accessibility and discoverability of vast archival collections. Importantly, these advancements empower digital humanities scholars to harness computational tools in ways that enable richer and more nuanced cultural and historical analyses. From an ethical and policy perspective, the study highlights the importance of embedding transparency, accountability, and human oversight into AI applications to mitigate risks such as biased or fabricated data and to ensure responsible and culturally sensitive archival practices. Looking towards the future, the research points to exciting directions, including multimodal analysis that treats newspapers as rich cultural artifacts beyond text alone, volumetric restoration of damaged physical materials, and federated global networks that enable cross-lingual and cross-cultural interoperability of archives. Ultimately, the integration of AI within digital humanities and archival infrastructures is not merely an incremental upgrade but a paradigm shift that promises more autonomous, collaborative, and ethically guided systems. These advancements will unlock previously inaccessible sources and broaden the scope and depth of historical scholarship worldwide, fostering a richer, more inclusive, and technologically empowered understanding of cultural and historical narratives.

Author Contributions

Conceptualization, Z.X.S.; Methodology, Z.X.S.; Formal analysis, Z.X.S., K.W.C. and Z.Y.J.; Investigation, Z.X.S., K.W.C. and Z.Y.J.; Resources, Z.X.S., K.W.C. and Z.Y.J.; Data curation, Z.Y.J.; Writing—original draft preparation, Z.X.S. and Z.Y.J.; Writing—review and editing, Z.X.S. and K.W.C.; Supervision, Z.X.S.; Project administration, Z.X.S.; Funding acquisition, Z.X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is part of a larger study on the “Business News Treasures in the Vanishing Old Newspapers in Hong Kong: The History, Characteristics, and Social Impact”, which was fully supported by the Hong Kong Research Grants Council (RGC) 2022/23, Faculty Development Scheme (FDS) (Ref no. UGC/FDS14/H03/22).

Institutional Review Board Statement

The study was conducted in compliance with the Code of Ethics for Research at The Hang Seng University of Hong Kong and was approved by the University Research Committee (URC) on 26 May 2022.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting this study are primarily derived from publicly accessible literature and online reports. Additional information is available from the Principal Investigator upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ali, D., Blyau, T., Weghe, N., & Verstockt, S. (2022). Context-aware querying, geolocalization, and rephotography of historical newspaper images. Applied Sciences, 12(21), 11063. [Google Scholar] [CrossRef]
  2. Ali, D., Milleville, K., Verstockt, S., Weghe, N., Chambers, S., & Birkholz, J. (2023). Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections. Journal of Documentation, 80(5), 1031–1056. [Google Scholar] [CrossRef]
  3. Beals, M., & Bell, E. (2020). The Atlas of digitised newspapers and metadata: Reports from oceanic exchanges (Version 2). Figshare. [Google Scholar] [CrossRef]
  4. Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In FAccT '21: Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (pp. 610–623). Association for Computing Machinery. [Google Scholar] [CrossRef]
  5. Businessware Technologies. (2024). Automatic processing of newspapers and magazines with AI. Available online: https://www.businesswaretech.com/blog/automatic-processing-of-newspapers-and-magazines-with-ai (accessed on 10 January 2025).
  6. Carneiro, A. (2024). Reviving the past: How generative adversarial networks (GANs) are empowering digital archives to Restore Old Documents. Medium. Available online: https://medium.com/@arita111997/reviving-the-past-how-generative-adversarial-networks-gans-are-empowering-digital-archives-to-7bbaea4e4bd1 (accessed on 10 January 2025).
  7. Colavizza, G., Blanke, T., Jeurgens, C., & Noordegraaf, J. (2021). Archives and AI: An overview of current debates and future perspectives. ACM Journal on Computing and Cultural Heritage, 15(1), 1–15. [Google Scholar] [CrossRef]
  8. Costa, B. F., Mateus, B. C., Pinto, H. J., & Tabrizi, M. R. (2025). Looking back to 1850 in 2025: Historascan to digitize historical journals. In A. Rosário, & A. Boechat (Eds.), Impact of digitalization on communication dynamics (pp. 393–420). IGI Global Scientific Publishing. [Google Scholar] [CrossRef]
  9. Ehrmann, M., Hamdi, A., Linhares Pontes, E., Romanello, M., & Doucet, A. (2023). Named entity recognition and classification in historical documents: A survey. ACM Computing Surveys, 56(2), 1–47. [Google Scholar] [CrossRef]
  10. Ehrmann, M., Romanello, M., Flückiger, A., & Clematide, S. (2020). Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers. CEUR Workshop Proceedings, 2696, 255. Available online: http://ceur-ws.org/Vol-2696/paper_255.pdf (accessed on 22 September 2020).
  11. Eureka Network. (2025). Digitising historical documents. Available online: https://www.eurekanetwork.org/impact/digitising-historical-documents/ (accessed on 23 July 2025).
  12. European Parliament. (2024). Artificial Intelligence Act: Briefing—Risk-based classification and high-risk requirements. European Parliamentary Research Service. Available online: https://www.europarl.europa.eu/RegData/etudes/BRIE/2021/698792/EPRS_BRI%282021%29698792_EN.pdf (accessed on 13 June 2024).
  13. Fewster, K., & Arias-Hernandez, R. (2025). AI/ML for processing textual records in archives. InterPARES Trust AI. [Google Scholar]
  14. Floridi, L., & Cowls, J. (2019). A unified framework of five principles for AI in society. Harvard Data Science Review, 1(1), 2–15. [Google Scholar] [CrossRef]
  15. Garajamirli, N. (2025). Algorithmic Gatekeeping and Democratic Communication: Who decides what the public sees? European Journal of Communication and Media Studies, 4(3), 1–10. [Google Scholar] [CrossRef]
  16. Gebru, T., Morgenstern, J., Vecchione, B., Wortman Vaughan, J., Wallach, H., Daumé, H., III, & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. [Google Scholar] [CrossRef]
  17. Grohsgal, L. W. (2013). Preserving America’s historic newspapers: Experiences from the field. NEH Blog. [Google Scholar]
  18. Hamdi, A., Jean-Caurant, A., Sidère, N., Coustaty, M., & Doucet, A. (2020). Assessing and minimizing the impact of OCR quality on named entity recognition. In Proceedings of the 24th international conference on theory and practice of digital libraries, Lyon, France, August 25–28 (pp. 87–101). Springer. [Google Scholar] [CrossRef]
  19. Hammarström, E. (2025). Investigation of OCR model performance and post-OCR correction strategies: A comparative analysis [Master’s thesis, Uppsala University]. Available online: https://www.diva-portal.org/smash/get/diva2:1991132/FULLTEXT01.pdf (accessed on 21 August 2025).
  20. Hauswedell, T., Nyhan, J., Beals, M. H., Terras, M., & Bell, E. (2020). Of global reach yet of situated contexts: An examination of the implicit and explicit selection criteria that shape digital archives of historical newspapers. Archival Science, 20, 139–165. [Google Scholar] [CrossRef]
  21. Impresso Project. (2025). Media monitoring of the past. Available online: https://www.impresso-project.ch (accessed on 28 October 2025).
  22. Jaillant, L., & Aske, K. (2023). Are users of digital archives ready for the AI era? Obstacles to the application of computational research methods and new opportunities. Journal on Computing and Cultural Heritage, 16(4), 1–16. [Google Scholar] [CrossRef]
  23. Jaillant, L., & Zhao, L. (2025). Introduction: When data turns into archives: Making digital records more accessible with AI. AI & Society, 40, 5787–5791. [Google Scholar] [CrossRef]
  24. Järvelin, A., Keskustalo, H., Sormunen, E., Saastamoinen, M., & Kettunen, K. (2015). Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach. Journal of the Association for Information Science and Technology, 67(12), 2928–2946. [Google Scholar] [CrossRef]
  25. Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9), 389–399. [Google Scholar] [CrossRef]
  26. Johnson, S. (2025). How AI is changing digital archives: Possibilities and pitfalls. Historica Blog. [Google Scholar]
  27. Kanerva, J., Ledins, C., Käpyaho, S., & Ginter, F. (2025). OCR error post-correction with LLMs in historical documents: No Free Lunches. arXiv, arXiv:2502.01205. Available online: https://arxiv.org/html/2502.01205v1 (accessed on 3 February 2025).
  28. Kaur, J. (2025). AI-driven image restoration and enhancement: A complete guide. XenonStack. Available online: https://www.xenonstack.com/blog/ai-driven-image-restoration (accessed on 15 April 2025).
  29. Kovács, G. (2024). Revolutionizing archival document processing with AI: Enhancing degraded historical document images. Rényi Institute of Mathematics. Available online: https://ai.renyi.hu/posts/revolutionizing-archival-document-processing-with-ai/ (accessed on 21 November 2024).
  30. Kristianto, Y., & Soewito, B. (2025). Beyond OCR: GAN-Driven restoration of severely degrading document. International Journal of Computer Theory and Engineering, 17(4), 189–201. [Google Scholar] [CrossRef]
  31. Lee, B., Mears, J., Jakeway, E., Ferriter, M., Adams, C., Yarasavage, N., Thomas, D., Zwaard, K., & Weld, D. S. (2020). The newspaper navigator dataset: Extracting and analyzing visual content from 16 million historic newspaper pages in Chronicling America. arXiv, arXiv:2005.01583. [Google Scholar] [CrossRef]
  32. Lehman, B. (2024). Exploring OCR tools with two 19th century documents. University of California, Berkeley Library. Available online: https://update.lib.berkeley.edu/2024/12/03/exploring-ocr-tools-with-two-19th-century-documents/ (accessed on 3 December 2024).
  33. Lemieux, V. L., Gil, R., Molosiwa, F., Zhou, Q., Li, B., Garcia, R., Torre-Cubillo, L. D., & Wang, Z. (2025). Clio-X: A Web3 solution for privacy-preserving AI access to digital archives. arXiv, arXiv:2507.08853. [Google Scholar] [CrossRef]
  34. MacFarlane, J. (2024). Improving OCR results for historical newspapers using LLMs. Medium. Available online: https://medium.com/@jarrettcmac/improving-ocr-results-for-historical-newspapers-using-llms-17900fb9ddb8 (accessed on 26 August 2024).
  35. Marchello, G., Giovanelli, R., Fontana, E., Cannella, F., & Traviglia, A. (2023). Cultural heritage digital preservation through AI-driven robotics. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLVIII-M-2-2023, 995–1000. [Google Scholar] [CrossRef]
  36. Menhour, H., Şahin, H., Sarıkaya, R., Aktaş, M., Sağlam, R., Ekinci, E., & Eken, S. (2021). Searchable Turkish OCRed historical newspaper collection 1928–1942. Journal of Information Science, 49(2), 335–347. [Google Scholar] [CrossRef]
  37. Mistral AI. (2025). Mistral OCR. Available online: https://mistral.ai/news/mistral-ocr (accessed on 6 March 2025).
  38. Muehlberger, G., Seaward, L., Terras, M., Oliveira, S. A., Bosch, V., Bryan, M., Colutto, S., Déjean, H., Diem, M., Fiel, S., Gatos, B., Greinoecker, A., Grüning, T., Hackl, G., Haukkovaara, V., Heyer, G., Hirvonen, L., Hodel, T., Jokinen, M., … Zagoris, K. (2019). Transforming scholarship in the archives through handwritten text recognition: Transkribus as a case study. Journal of Documentation, 75(5), 954–976. [Google Scholar] [CrossRef]
  39. Newman, N. (2025a). Overview and key findings of the 2025 Reuters Institute report. Reuters Institute. Available online: https://reutersinstitute.politics.ox.ac.uk/digital-news-report/2025/dnr-executive-summary (accessed on 17 June 2025).
  40. Newman, N., & Cherubini, F. (2025b). Journalism, media, and technology trends and predictions 2025. Reuters Institute. Available online: https://reutersinstitute.politics.ox.ac.uk/journalism-media-and-technology-trends-and-predictions-2025 (accessed on 14 January 2025).
  41. NewsEye. (2025). The NewsEye project: An overview. Available online: https://www.newseye.eu/blog/news/the-newseye-project-an-overview/?no_cache=1&cHash=9b7bad4b6cc36fa58fd807344230c3ab (accessed on 5 May 2025).
  42. NewsEye Consortium. (2024). NewsEye: A digital investigator for historical newspapers (Project report). European Heritage Awards/Europa Nostra Awards. Available online: https://www.newseye.eu (accessed on 30 May 2024).
  43. Oberbichler, S., Boroş, E., Doucet, A., Marjanen, J., Pfanzelter, E., Rautiainen, J., Toivonen, H., & Tolonen, M. (2021). Integrated interdisciplinary workflows for research on historical newspapers: Perspectives from humanities scholars, computer scientists, and librarians. Journal of the Association for Information Science and Technology, 73(2), 225–239. [Google Scholar] [CrossRef]
  44. Oiva, M., Fridlund, M., & Paju, P. (Eds.). (2020). Digital histories: Emergent approaches within the new digital history. Helsinki University Press. [Google Scholar] [CrossRef]
  45. Reiche, I. (2023). The viability of using an open source locally hosted AI for creating metadata in digital image collections. The Code4Lib Journal, (56). Available online: https://journal.code4lib.org/articles/17186 (accessed on 21 April 2023).
  46. Rijhwani, S., Anastasopoulos, A., & Neubig, G. (2020). OCR post-correction for endangered language texts. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), Online, November 16–20 (pp. 5931–5942). Association for Computational Linguistics. [Google Scholar] [CrossRef]
  47. Romein, C. A., Rabus, A., Leifert, G., & Ströbel, P. B. (2025). Assessing advanced handwritten text recognition engines for digitizing historical documents. International Journal of Digital Humanities, 7, 115–134. [Google Scholar] [CrossRef]
  48. Ruhil, O. (2025). The great forgetting: When AI decides what we do not need to know. AI & Society. [Google Scholar] [CrossRef]
  49. Sikora, J., & Haffenden, C. (2024, January 10–11). AI, data curation and the data readiness of heritage collections: Exploring the Swedish newspaper archive at KBLab. Huminfra Conference (HiC 2024), Gothenburg, Sweden. Available online: https://pdfs.semanticscholar.org/c65e/75bd43e6bf25029cf61002aa10ad8291fe40.pdf (accessed on 22 January 2024).
  50. Smith, D. A., & Cordell, R. (2018). A research agenda for historical and multilingual Optical Character Recognition. NUlab, Northeastern University. Available online: http://hdl.handle.net/2047/D20297452 (accessed on 2 December 2024).
  51. Snydman, S., Sanderson, R., & Cramer, T. (2015). The International Image Interoperability Framework (IIIF): A community & technology approach for web-based images. Archiving Conference, 2015(1), 16–21. [Google Scholar] [CrossRef]
  52. Teel, Z. (2024). Artificial Intelligence’s role in digitally preserving historic archives. Preservation, Digital Technology & Culture, 53(1), 29–33. [Google Scholar] [CrossRef]
  53. Thomas, A., Gaizauskas, R., & Lu, H. (2024). Leveraging LLMs for post-OCR correction of historical newspapers. In Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024 (pp. 116–121). ELRA and ICCL. Available online: https://aclanthology.org/2024.lt4hala-1.14/ (accessed on 25 May 2024).
  54. Tworek, H. J. S. (2024). Digitized newspapers and the hidden transformation of history. American Historical Review, 129(1), 143–147. [Google Scholar] [CrossRef]
  55. van Otterlo, M. (2018). Gatekeeping algorithms with human ethical bias: The ethics of algorithms in archives, libraries and society. arXiv, arXiv:1801.01705. [Google Scholar] [CrossRef]
  56. Whittlestone, J., Nyrup, R., Alexandrova, A., Dihal, K., & Cave, S. (2019). Ethical and societal implications of algorithms, data, and artificial intelligence: A roadmap for research. Nuffield Foundation. Available online: http://www.nuffieldfoundation.org/sites/default/files/files/Ethical-and-Societal-Implications-of-Data-and-AI-report-Nuffield-Foundat.pdf (accessed on 10 January 2025).
  57. Zheng, W., Su, B., Feng, R., Peng, X., & Chen, S. (2023). EA-GAN: Restoration of text in ancient Chinese books based on an example attention generative adversarial network. Heritage Science, 11, 42. [Google Scholar] [CrossRef]
Table 1. Most used AI tools in historical newspaper preservation.
| AI Tool/Technology | Description | Applications | Examples/Providers |
|---|---|---|---|
| AI-powered OCR | Deep learning-enhanced systems (CNNs, LSTMs) for text recognition from degraded scans | Digitizing faded or damaged pages, enabling searchable archives | Google Cloud Vision; Tesseract; Kraken |
| Post-OCR Correction with LLMs | Contextual models that amend OCR errors | Refining garbled text for usable outputs | Meta’s LLaMA 3; GPT variants; ByT5 |
| Image & Text Restoration AI | GAN-based/diffusion-based algorithms to remove noise and enhance visuals | Cleaning stains and creases; improving OCR | Topaz Gigapixel; ESRGAN |
| Handwritten/Printed Text Recognition | Trainable AI for varied scripts and layouts | Automating transcription of printed articles | Transkribus |
| Automated Archiving Platforms | AI systems for metadata and format management | Ensuring long-term accessibility; detecting degradation | Preservica; JSTOR Digital Stewardship |
Table 2. Most used AI tools in historical newspaper research.
| AI Tool/Technology | Description | Applications | Examples/Providers |
|---|---|---|---|
| Content Analysis & Metadata Generation | Analyzes unstructured content to extract keywords, summaries, topics, entities, and roles. | Metadata generation, categorization, enabling efficient searches and insights in archives. | IBM Watson NLU |
| Sentiment Analysis | Rule-based scoring of text valence, considering context and intensifiers for sentiment tone. | Assessing public opinion, emotional tones in event coverage, tracking sentiment trends. | VADER lexicon |
| Named Entity Recognition (NER) | Identifies and categorizes entities like people or locations using pre-trained models. | Extracting references for tracking events, figures, places, enabling network analysis. | spaCy, Gale Digital Scholar Lab |
| Natural Language Processing (NLP) Frameworks | Analyzes text for entities, sentiment, topics, language with pre-trained models. | Metadata extraction, article classification, summarization, semantic search post-OCR. | spaCy, NLTK, BERT (via Hugging Face Transformers) |
| Automated Transcription & Translation | AI-powered recognition of printed/handwritten text with layout analysis and multilingual translation. | Digitizing articles, global translation, creating searchable texts. | Transkribus OCR/HTR, Ottoman prints application |
| Generative AI & Large Language Models (LLMs) | Contextual models for generating, correcting, and extracting information with in-context learning APIs. | Correcting errors, segmenting articles, extracting events/entities, summarizing content, discovering discursive patterns. | GPT-4o, Llama 3, Gemini 1.5 Pro, Historian’s Friend |
| Computer Vision & Image Analysis | Detects objects and layouts in images for similarity searches and categorization. | Categorizing visuals like ads or photos for cultural studies. | Newspaper Navigator, Ultralytics YOLO |
| Text-Mining & Deep Learning Platforms | Processes corpora for pattern recognition including layout detection and frequency analysis. | Identifying themes, sentiments, trends; classifying articles by topic/date; enabling geospatial and temporal analyses. | Impresso project, Elasticsearch integrations |
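
The rule-based sentiment scoring listed in Table 2 can be illustrated with a minimal sketch. It is not VADER's actual algorithm or lexicon: the word lists, weights, and the function name `score_sentence` below are hypothetical and chosen only to show how a lexicon lookup can be combined with intensifiers and simple negation handling.

```python
import re

# Hypothetical toy lexicon mapping words to valence scores.
# Real lexicons (such as VADER's) are far larger and empirically rated.
LEXICON = {
    "good": 2.0, "great": 3.0, "triumph": 3.0,
    "bad": -2.0, "terrible": -3.0, "disaster": -3.0,
}

# Hypothetical intensifiers that scale the valence of the next word.
INTENSIFIERS = {"very": 1.5, "extremely": 2.0, "slightly": 0.5}

# Hypothetical negation cues that flip the sign of a nearby sentiment word.
NEGATIONS = {"not", "no", "never"}


def score_sentence(text: str) -> float:
    """Sum the adjusted valence of every lexicon word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        valence = LEXICON[token]
        # An intensifier directly before the word scales its valence.
        if i >= 1 and tokens[i - 1] in INTENSIFIERS:
            valence *= INTENSIFIERS[tokens[i - 1]]
        # A negation within the two preceding tokens flips the sign.
        if any(t in NEGATIONS for t in tokens[max(0, i - 2):i]):
            valence = -valence
        total += valence
    return total


if __name__ == "__main__":
    for headline in ["The harvest was very good",
                     "The flood was a terrible disaster",
                     "The council met on Tuesday"]:
        print(headline, "->", score_sentence(headline))
```

Applied to OCR output, a scorer of this kind would also inherit recognition errors, since a misread word simply falls outside the lexicon and contributes nothing to the total.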
Table 3. Future directions for AI in historical newspaper research.
| Focus Area | Key Future Direction | Expected Impact |
|---|---|---|
| Multimodal Restoration | Multimodal analysis integrating text, images, and layout; volumetric restoration using 3D imaging and ML | Enables recovery of damaged or warped newspapers; holistic interpretation of visual and textual culture |
| AI Stewardship Platforms | End-to-end digitization workflows powered by LLMs with human-in-the-loop oversight | Democratizes access to localized collections; improves quality and scalability of archival processes |
| Global Interoperability | Federated networks using IIIF standards; cross-lingual entity recognition and semantic linking | Facilitates distant reading and comparative research across borders; reveals transnational historical patterns |
| Ethical and Accountable AI | Transparency tools (datasheets, model cards); explainable AI; privacy-preserving frameworks | Ensures fairness, mitigates bias, and protects sensitive data; fosters trust in AI-driven archival research |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Song, Z.X.; Cheung, K.W.; Jia, Z.Y. Transforming Historical Newspaper Research and Preservation Through AI: A Global Perspective. Journal. Media 2026, 7, 10. https://doi.org/10.3390/journalmedia7010010

AMA Style

Song ZX, Cheung KW, Jia ZY. Transforming Historical Newspaper Research and Preservation Through AI: A Global Perspective. Journalism and Media. 2026; 7(1):10. https://doi.org/10.3390/journalmedia7010010

Chicago/Turabian Style

Song, Zhao Xun, Kwok Wai Cheung, and Zi Yun Jia. 2026. "Transforming Historical Newspaper Research and Preservation Through AI: A Global Perspective" Journalism and Media 7, no. 1: 10. https://doi.org/10.3390/journalmedia7010010

APA Style

Song, Z. X., Cheung, K. W., & Jia, Z. Y. (2026). Transforming Historical Newspaper Research and Preservation Through AI: A Global Perspective. Journalism and Media, 7(1), 10. https://doi.org/10.3390/journalmedia7010010