Transforming Historical Newspaper Research and Preservation Through AI: A Global Perspective
Abstract
1. Introduction
2. Literature Review
2.1. Traditional Preservation, Digitization, and Their Limits
2.2. AI-Driven Innovations in Preservation
2.3. NLP and Computational Research
2.4. Global Initiatives and Emerging Trends
3. Research Questions
- What are the primary AI technologies currently employed in the preservation of historical newspapers across different regions and institutions?
- How are AI technologies enhancing the research capabilities and scholarly use of digitized historical newspaper archives?
- What future trends and developments are anticipated as AI continues to be integrated into newspaper preservation and research practices?
4. Research Methods
1. Qualitative Multiple-Case Study Analysis
2. Systematic Documentary Review

- (a) use of AI for newspaper digitization, preservation, or research;
- (b) availability of detailed public documentation; and
- (c) geographic diversity across North America, Europe, and Asia/Oceania.
5. Results
5.1. AI Technologies in Historical Newspaper Preservation
1. AI-powered Optical Character Recognition (OCR) and Post-OCR Correction
2. Text and Image Restoration
3. Automated Digital Archiving and Preservation
4. Challenges and Ethical Considerations for AI in Historical Newspaper Preservation
5.2. AI Technologies in Historical Newspaper Research
1. Text Analysis and Semantic Processing
2. Content Conversion and Generation
3. Visual Categorization and Pattern Recognition
4. Challenges and Ethical Considerations for AI in Historical Newspaper Research
5.3. Future Directions
1. Multimodal Analysis and Volumetric Restoration
2. End-to-End AI Stewardship Platforms with Human-in-the-Loop
3. Global Networks for Cross-Lingual Interoperability
4. Embedding Algorithmic Accountability and Ethical Stewardship
5.4. Implications for Media Narratives, Journalistic Practices, and Media Historiography
6. Conclusions and Implications
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
| AI Tool/Technology | Description | Applications | Examples/Providers |
|---|---|---|---|
| AI-powered OCR | Deep learning-enhanced systems (CNNs, LSTMs) for text recognition from degraded scans | Digitizing faded or damaged pages, enabling searchable archives | Google Cloud Vision; Tesseract; Kraken |
| Post-OCR Correction with LLMs | Contextual models that amend OCR errors | Refining garbled text for usable outputs | Meta’s Llama 3; GPT variants; ByT5 |
| Image & Text Restoration AI | GAN-based/diffusion-based algorithms to remove noise and enhance visuals | Cleaning stains and creases; improving OCR | Topaz Gigapixel; ESRGAN |
| Handwritten/Printed Text Recognition | Trainable AI for varied scripts and layouts | Automating transcription of printed articles | Transkribus |
| Automated Archiving Platforms | AI systems for metadata and format management | Ensuring long-term accessibility; detecting degradation | Preservica; JSTOR Digital Stewardship |

| AI Tool/Technology | Description | Applications | Examples/Providers |
|---|---|---|---|
| Content Analysis & Metadata Generation | Analyzes unstructured content to extract keywords, summaries, topics, entities, and roles. | Metadata generation, categorization, enabling efficient searches and insights in archives. | IBM Watson NLU |
| Sentiment Analysis | Rule-based scoring of text valence, considering context and intensifiers for sentiment tone. | Assessing public opinion, emotional tones in event coverage, tracking sentiment trends. | VADER lexicon |
| Named Entity Recognition (NER) | Identifies and categorizes entities like people or locations using pre-trained models. | Extracting references for tracking events, figures, places, enabling network analysis. | spaCy, Gale Digital Scholar Lab |
| Natural Language Processing (NLP) Frameworks | Analyzes text for entities, sentiment, topics, and language with pre-trained models. | Metadata extraction, article classification, summarization, semantic search post-OCR. | spaCy, NLTK, BERT (via Hugging Face Transformers) |
| Automated Transcription & Translation | AI-powered recognition of printed/handwritten text with layout analysis and multilingual translation. | Digitizing articles, global translation, creating searchable texts. | Transkribus OCR/HTR, Ottoman prints application |
| Generative AI & Large Language Models (LLMs) | Contextual models for generating, correcting, and extracting information with in-context learning APIs. | Correcting errors, segmenting articles, extracting events/entities, summarizing content, discovering discursive patterns. | GPT-4o, Llama 3, Gemini 1.5 Pro, Historian’s Friend |
| Computer Vision & Image Analysis | Detects objects and layouts in images for similarity searches and categorization. | Categorizing visuals like ads or photos for cultural studies. | Newspaper Navigator, Ultralytics YOLO |
| Text-Mining & Deep Learning Platforms | Processes corpora for pattern recognition including layout detection and frequency analysis. | Identifying themes, sentiments, trends; classifying articles by topic/date; enabling geospatial and temporal analyses. | Impresso project, Elasticsearch integrations |

| Focus Area | Key Future Direction | Expected Impact |
|---|---|---|
| Multimodal Restoration | Multimodal analysis integrating text, images, and layout; volumetric restoration using 3D imaging and ML | Enables recovery of damaged or warped newspapers; holistic interpretation of visual and textual culture |
| AI Stewardship Platforms | End-to-end digitization workflows powered by LLMs with human-in-the-loop oversight | Democratizes access to localized collections; improves quality and scalability of archival processes |
| Global Interoperability | Federated networks using IIIF standards; cross-lingual entity recognition and semantic linking | Facilitates distant reading and comparative research across borders; reveals transnational historical patterns |
| Ethical and Accountable AI | Transparency tools (datasheets, model cards); explainable AI; privacy-preserving frameworks | Ensures fairness, mitigates bias, and protects sensitive data; fosters trust in AI-driven archival research |
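
The Sentiment Analysis row in the research-tools table describes rule-based scoring of text valence that takes intensifiers and context into account. The idea can be illustrated with a minimal sketch. This is not VADER's actual algorithm or lexicon: the word lists, weights, and the function name `score_sentence` are hypothetical and chosen only for illustration, and a real historical-newspaper workflow would also need to handle OCR noise and period vocabulary.

```python
import re

# Hypothetical toy lexicon (word -> valence). Real lexicons are far larger
# and empirically validated; these values are assumptions for illustration.
LEXICON = {
    "good": 1.0, "great": 2.0, "prosperous": 1.5,
    "bad": -1.0, "terrible": -2.0, "riot": -1.5,
}
# Hypothetical intensifier weights that scale the following word's valence.
INTENSIFIERS = {"very": 1.5, "extremely": 2.0, "slightly": 0.5}
# Negation words that flip the sign of the following scored word.
NEGATIONS = {"not", "no", "never"}


def score_sentence(text: str) -> float:
    """Sum the valence of lexicon words, adjusted by a preceding
    intensifier and/or negation (a simple one- or two-token lookback)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        valence = LEXICON[token]
        j = i - 1
        # An intensifier directly before the word scales its valence.
        if j >= 0 and tokens[j] in INTENSIFIERS:
            valence *= INTENSIFIERS[tokens[j]]
            j -= 1
        # A negation before the word (or before its intensifier) flips the sign.
        if j >= 0 and tokens[j] in NEGATIONS:
            valence = -valence
        total += valence
    return total


if __name__ == "__main__":
    for headline in ["A very good harvest", "Not very good", "Terrible riot downtown"]:
        print(headline, "->", score_sentence(headline))
```

A sketch like this makes the trade-off visible: rule-based scoring is transparent and easy to audit, but its output depends entirely on the lexicon, which is one reason such scores on historical text are usually treated as indicative rather than definitive.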
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Song, Z.X.; Cheung, K.W.; Jia, Z.Y. Transforming Historical Newspaper Research and Preservation Through AI: A Global Perspective. Journal. Media 2026, 7, 10. https://doi.org/10.3390/journalmedia7010010

