(Doing) Computational History: The Role of Data Work in Computational Approaches
Abstract
1. Introduction
2. Precision and the Scope of Computational History
3. Resources, Corpus Criticism, and the Politics of Digitisation
4. Case Study: Applying Computer Vision to Alchemical Images
5. A Two-Tier Society: Ethics and the Value of Labour in Computational History
6. Data Work in Computational History
7. Conclusions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| DH | Digital Humanities |
| LLM | Large Language Model |
| ADHO | Alliance of Digital Humanities Organizations |
| GIGO | Garbage In, Garbage Out |
| GLAM | Galleries, Libraries, Archives, and Museums |
| DCAI | data-centric AI |
| OCR | Optical Character Recognition |
| FAIR | Findable, Accessible, Interoperable, Reusable |
| CARE | Collective Benefit, Authority to control, Responsibility, Ethics |
| 1 | Distant reading denotes a widely used approach in the digital humanities that employs computational techniques to analyse textual corpora. The analysis is conducted ‘at a distance,’ often through quantitative or statistical procedures that abstract from individual passages and instead operate at the level of large datasets. While the term is closely associated with Franco Moretti, whose work catalysed its adoption, contemporary practice has diverged considerably from his original formulation, and his fairly black-and-white approach has been subject to sustained critique. It is therefore important not to conflate the distant reading turn in Digital Humanities, its practitioners and methods, with the relatively narrow concept Moretti proposed when he first introduced the term over two decades ago (Moretti 2005, 2013; Primorac et al. 2023). In practice, the majority of digital humanists do not construe distant and close reading as mutually exclusive alternatives. Rather, they adopt mixed-methods hybrid reading approaches that combine computational analysis with more traditional forms of text interpretation, operating at the intersection of distant and close reading (Aledavood 2024). It would therefore be misleading to suggest that practitioners of distant reading do not consult their texts directly, or that they regard engagement with their source materials as unnecessary. In fact, close reading is usually needed to make sense of distant reading results. Accordingly, Matthew Jockers, for example, has reframed Moretti’s original proposal as a process of ‘zooming in and out,’ arguing that the term macroanalysis more accurately describes what computational text analysis can accomplish (Jockers 2013). These methods, of course, do not ‘read’ in a human sense; rather, they analyse text and detect patterns across large corpora, which must then be interpreted within broader historical and literary frameworks. 
It is also important for those unfamiliar with computational humanities to recognise that distant reading now encompasses a wide range of methods that extend far beyond what was possible when the paradigm first emerged. Early computational text analysis often relied on relatively simple techniques, such as counting word frequencies (Sinclair and Rockwell 2016), but the available methods have since diversified considerably. However, due to the slow diffusion of Digital Humanities methods back into their Humanities disciplines of origin, scholars who are new to or unfamiliar with the field often still associate distant reading with these early, sometimes rather naive approaches (e.g., word clouds). Beyond word frequency analysis, methods most commonly used today include, for example, clustering, topic modelling, sentiment analysis, and many more. The field is broad, with significant variation in how methods are implemented and their underlying assumptions. Not all methods are equally effective in every context: for instance, topic modelling is particularly useful for large corpora with diverse themes, but less so for small or homogeneous datasets. Sentiment analysis is also common, although it can be problematic methodologically, as it relies on predefined sentiment dictionaries that can yield divergent results depending on which dictionary is used. Many of these methods are used in digital history (Lässig 2021), adapted to serve historical inquiry, though methodological developments usually originate in computational literary studies. Ultimately, they draw on natural language processing and linguistics but are repurposed for humanistic questions that differ from the concerns of those source disciplines. |
| 2 | Distant viewing (Arnold and Tilton 2019, 2023) as a research paradigm, coined in response to distant reading, emerged from the critique that digital humanities have been overly text-focused, neglecting other media (Binkyte 2023). In response, scholars have called for a visual (Wevers and Smits 2020) or multimodal (Smits and Wevers 2023) turn in computational humanities. The term ‘distant viewing’ refers to the use of computer vision methods to analyse visual cultural artefacts. While common applications such as optical character recognition (OCR), handwritten text recognition (HTR) or, to use the more general term, Automated Text Recognition (ATR) fall under the broader umbrella of computer vision (Hodel 2023), distant viewing typically involves more complex image analysis, such as identifying and comparing features across large image corpora. Examples include the study of image reuse or the reuse of print plates (Dutta et al. 2021; Götzelmann 2022), combining concerns of document analysis and layout recognition. Distant viewing is also applied to film analysis and photographic collections (Arnold and Tilton 2023). These methods are often used by libraries or cultural heritage institutions for large-scale tasks like enhancing recommendation algorithms. However, adapting them for scholarly research questions requires a high degree of precision and effort, as the methods do not generalise easily to materials radically different from those on which they have been trained. |
| 3 | It is important to note that the perceived AI revolution is largely due to increased public awareness and a greater appreciation of AI as a useful tool, rather than any sudden technological leap. Providing a historical perspective that traces AI’s gradual development over the past 80 years, the authors of the excellent book AI Snake Oil (Narayanan and Kapoor 2024) challenge the popular narrative that AI is on the brink of a singularity. They argue that the excitement surrounding tools like ChatGPT in 2023 was driven more by increased visibility, public awareness and public perception of current AI technologies than by the magnitude of genuine breakthroughs. In their view, the widespread perception of an ongoing AI revolution reflects shifting public attention rather than the scale of actual technological advancement—although this heightened interest has certainly led to drastically increased scholarly attention and an onslaught of AI-related papers across all disciplines, which may now actually accelerate progress. The surge of AI-related publications across disciplines further reinforces the impression of a revolution, even if the underlying developments remain incremental. |
| 4 | This surge of work on generative AI has prompted many institutions in the humanities, digital humanities and beyond, to publish statements on the use of AI in research. These statements often focus specifically on generative AI, but also address AI more broadly and the implications of its development in corporate contexts, which is discussed in another contribution to this special issue. One widely circulated example is the manifesto Against the Uncritical Adoption of ‘AI’ Technologies in Academia (Guest et al. 2025). |
| 5 | For example, claims that “the sum of the data points is greater than their parts” in analyzing social media data to make previously unseen patterns visible (Lasser 2023) are frequently invoked in computational social sciences. |
| 6 | The Sphaera project, led by Matteo Valleriani, is a long-term study of multiple editions of the same book, examining how these editions changed over time and were modified by different printers (Valleriani 2020, 2025a). To support this work, the project developed a specialised database based on an extended CIDOC-CRM ontology to represent relationships not typically captured in standard bibliographic metadata. In addition, computer vision analyses were conducted on diagrams within these editions. While the project produced valuable insights, one could argue that, although its computational methods are often associated with large-scale analysis, it centres on a single book, much as a scholarly edition project would. However, in computational contexts, expectations frequently lean towards a broader scope, even when such expectations conflict with the precision and level of detail required to obtain meaningful results using computational methods. The project on lost books applies unseen species models from biology in a cross-disciplinary transfer of methods, aiming to estimate how many books may have been lost over time (Kestemont et al. 2022). While the approach is certainly innovative and the large-scale results are headline-worthy, the method is highly specialised and demands significant effort to adapt. The outcome, though intriguing, remains speculative, as there is no way to verify the estimates. Given these constraints, more traditional historians could question how much this contributes to historical understanding. |
| 7 | Following this line of argument, historical interpretation itself could be described using the metaphor of the black box often applied to algorithms (Schwandt 2022). This raises the question of whether computational approaches differ as fundamentally from traditional historical methods as is often assumed. This could be argued in many respects, ranging from their detail focus to their validity or explainability. |
| 8 | A well-known debate on this question has had a considerable aftermath (Da 2019a, 2019b, 2020; Jannidis 2020; Underwood 2020; Ries et al. 2023; Joyeux-Prunel 2024). |
| 9 | Object detection involves two key steps: locating an object within an image and then labelling or classifying it correctly. While models can readily identify common objects like plants, animals, or humans (i.e., concepts well represented in training data), recognising and classifying specialised historical items is significantly more difficult. |
| 10 | A critical analysis of six widely cited benchmark datasets (Caltech 101, Caltech 256, PASCAL VOC, ImageNet, MS COCO, and Google Open Images) demonstrates how the creators’ subjective choices and the labour of crowd workers shape the datasets: The selection of categories is not grounded in a general notion of visuality but is instead driven by perceived practical applications and the availability of downloadable images (Smits and Wevers 2021). Moreover, the reliance on Flickr and the broader web for data collection has introduced a temporal bias into many computer vision datasets. |
| 11 | The 80 object categories can be explored at https://cocodataset.org/#explore, accessed on 10 November 2025 (Lin et al. 2014). |
| 12 | However, it is important to note that even within the highly technical computational humanities community, where one might expect open-source code sharing to be standard practice, there remains considerable room for improvement. A recent study (Illmer 2025) found that all necessary code was cited in only 40% of the publications examined, and notably, this assessment did not even involve actually attempting to run the code. It merely checked whether the code was theoretically available for replication. This highlights significant shortcomings in current practices and underscores the need for greater transparency and reproducibility in the field. In addition, other documentation and reporting best practices, such as datasheets, data and model audits, or carbon reporting (Lang et al. 2025), have not even entered the broader consciousness of the computational humanities field as measures that should be taken for ensuring transparency. |
| 13 | The myth of the lone male genius is unfortunately prevalent in many disciplines, including in Digital Humanities (Nyhan 2022). |
| 14 | The myth of the ‘objective algorithm’ has been debunked many times. For example, see (Crawford 2021). |
| 15 | On dataset criticism, see (Paullada et al. 2021; Orr and Crawford 2024) or on critical data studies: (Iliadis and Russo 2016). |
| 16 | Many digital humanists regret that terms like ‘distant reading’ became so widely adopted that replacing them with terms more accurately reflecting what computational methods actually do has proven difficult. In fact, few digital humanists would endorse the original claims associated with the term as proposed by its inventor, Franco Moretti (Moretti 2005, 2013). Unfortunately, the critical reception and subsequent reinterpretations of distant reading within the field are often less visible to those outside of digital humanities. As a result, many hold unrealistic assumptions about what digital humanities scholars mean by the term. One attempt to introduce a more suitable alternative is Jockers’ Macroanalysis (Jockers 2013). |
| 17 | Cf. the “Garbage In, Garbage Out” (GIGO) principle. |
| 18 | For instance, the Contributor Role Taxonomy (CRediT, Holcombe 2019) has been proposed to democratize the attribution of credit beyond just the authors of academic papers. |
| 19 | LLMs have attracted a good deal of highly visible criticism (Bender et al. 2021). |
| 20 | The CARE principles (GIDA 2021) are related to the increasingly frequent calls for an ethics of care (Gray and Witt 2021). |
| 21 | This issue is currently being debated within the digital humanities community in discussions on Bluesky sparked by Matthew Wilkens’ claim that digital humanities may have effectively ended, with much of what once fell under its umbrella absorbed into quantitative disciplines (Wilkens 2026). The claim generated considerable debate online: while some agreed, many strongly disagreed. Notably, a GLAM professional pointed out that practitioners in adjacent fields such as digital archives often feel excluded from discussions about the “state of the field,” although even narrow definitions of computational approaches to humanities data accurately describe their work (https://bsky.app/profile/did:plc:f4obdtap2xdezbn73lyo5dlu/post/3mf3cnwi4s22k, accessed on 10 November 2025). This, again, reflects an unequal division and valuation of labour. Those who align themselves more closely with computer science or quantitatively oriented disciplines tend to receive greater visibility and recognition than those whose work remains rooted in more traditional humanities contexts or in forms of labour historically coded as feminised (Lang 2027), such as much of the work carried out in GLAM institutions. This structural imbalance within the field warrants our critical attention. |
| 22 | Research in digital humanities has shown that even in contexts where gender participation is balanced, topics coded as feminine are systematically less recognised. At ADHO conferences, for example, feminised themes often receive less visibility despite equal scholarly merit (Eichmann-Kalwara et al. 2018). |
| 23 | There is ample disciplinary discourse on the role of quantitative methods (Allington 2022; Bernhart 2018; Lang 2021; Lauer 2020; Piotrowski and Neuwirth 2020; Shadrova 2021). |
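
Note 6 mentions the transfer of species-richness estimation to the survival of books (Kestemont et al. 2022). As a rough illustration of how such an estimate works, the sketch below implements Chao1, a classic unseen-species estimator. It is not claimed to be the exact model or code used in the cited study; the function name `chao1` and the toy list of surviving witnesses are hypothetical.

```python
from collections import Counter


def chao1(witnesses):
    """Estimate the total number of distinct works (surviving + lost).

    `witnesses` is a list with one entry per surviving copy, naming the
    work it transmits. Chao1 uses only the number of works seen exactly
    once (f1) and exactly twice (f2) to estimate how many were never seen.
    """
    copies_per_work = Counter(witnesses)            # work -> surviving copies
    freq_of_freq = Counter(copies_per_work.values())  # k -> works with k copies
    s_obs = len(copies_per_work)                    # works observed at least once
    f1 = freq_of_freq.get(1, 0)
    f2 = freq_of_freq.get(2, 0)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used when no work survives in exactly two copies.
    return s_obs + f1 * (f1 - 1) / 2


# Hypothetical toy data: five works, three of them surviving in one copy only.
surviving = ["A", "A", "B", "C", "C", "D", "E"]
print(chao1(surviving))  # 7.25, i.e. roughly two works estimated as lost
```

The estimate depends entirely on how many works survive in one or two copies, which is one reason the note calls such results speculative: small changes in the catalogue of singletons shift the outcome considerably.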
References
- Aledavood, Parham. 2024. Taking the Middle Road: Reflections on Mixed Methodology within the Digital Humanities. Digital Studies/Le Champ Numérique 14: 1–19.
- Alkemade, Henk, Steven Claeyssens, Giovanni Colavizza, Nuno Freire, Jörg Lehmann, Clemens Neudecker, Giulia Osti, and Daniel van Strien. 2023. Datasheets for Digital Cultural Heritage Datasets. Journal of Open Humanities Data 9: 1–11.
- Allington, Daniel. 2022. The Place of Computation in the Study of Culture. In The Bloomsbury Handbook to the Digital Humanities, 1st ed. Edited by James O’Sullivan. Bloomsbury Handbooks. London: Bloomsbury Academic, pp. 373–84.
- Alvarado, Rafael C. 2022. Datawork and the Future of Digital Humanities. In The Bloomsbury Handbook to the Digital Humanities. Edited by James O’Sullivan. London: Bloomsbury Publishing, pp. 361–72.
- Arnold, Taylor, and Lauren Tilton. 2019. Distant Viewing: Analyzing Large Visual Corpora. Digital Scholarship in the Humanities 34: 3–16.
- Arnold, Taylor, and Lauren Tilton. 2023. Distant Viewing: Computational Exploration of Digital Images. Cambridge, MA: The MIT Press.
- Aronova, Elena, Christine von Oertzen, and David Sepkoski. 2017. Introduction: Historicizing Big Data. Osiris 32: 1–17.
- Aydelotte, William O. 1966. Quantification in History. The American Historical Review 71: 803–25.
- Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Paper presented at the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT’21), Virtual, March 3–10; pp. 610–23.
- Bernhart, Toni. 2018. Quantitative Literaturwissenschaft: Ein Fach mit langer Tradition? In Quantitative Ansätze in Literatur- und Geisteswissenschaften: Systematische und Historische Perspektiven. Edited by Toni Bernhart, Marcus Willand, Sandra Richter and Andrea Albrecht. Berlin: De Gruyter, pp. 207–20.
- Berry, David M. 2011. The Computational Turn: Thinking About the Digital Humanities. Culture Machine 12: 1–22.
- Binkyte, Ruta. 2023. Distant Reading and Viewing: ‘Big Questions’ in Digital Art History and Digital Literary Studies. Digital Humanities Quarterly 17: 1. Available online: https://dhq.digitalhumanities.org/vol/17/2/000686/000686.html (accessed on 10 November 2025).
- Birhane, Abeba, Sepehr Dehdashtian, Vinay Prabhu, and Vishnu Boddeti. 2024. The Dark Side of Dataset Scaling: Evaluating Racial Classification in Multimodal Models. In 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT’24), Rio de Janeiro, Brazil, June 3–6. New York: Association for Computing Machinery, pp. 1229–44.
- Birhane, Abeba, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021a. Large Image Datasets: A Pyrrhic Win for Computer Vision? Paper presented at the IEEE Winter Conference on Applications of Computer Vision (WACV 2021), Waikoloa, HI, USA, January 3–8; pp. 1536–46.
- Birhane, Abeba, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021b. Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes. arXiv:2110.01963.
- Boyles, Christina. 2018. Counting the Costs: Funding Feminism in the Digital Humanities. In Bodies of Information: Intersectional Feminism and the Digital Humanities. Edited by Elizabeth Losh and Jacqueline Wernimont. Debates in the Digital Humanities. Minneapolis: University of Minnesota Press, pp. 93–107.
- Brown, Shea, Jovana Davidovic, and Ali Hasan. 2021. The Algorithm Audit: Scoring the Algorithms That Score Us. Big Data & Society 8: 1–8.
- Buolamwini, Joy, and Timnit Gebru. 2018. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In 1st Conference on Fairness, Accountability and Transparency (FAT* 2018), New York, NY, USA, February 23–24. Cambridge, MA: Proceedings of Machine Learning Research, vol. 81, pp. 77–91. Available online: https://proceedings.mlr.press/v81/buolamwini18a.html (accessed on 10 November 2025).
- Crawford, Kate. 2021. Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence. New Haven: Yale University Press.
- Crawford, Kate, and Trevor Paglen. 2019. Excavating AI: The Politics of Images in Machine Learning Training Sets. Essay Published by The AI Now Institute (NYU). September 19. Available online: https://excavating.ai/ (accessed on 10 November 2025).
- Da, Nan Z. 2019a. The Computational Case against Computational Literary Studies. Critical Inquiry 45: 601–39.
- Da, Nan Z. 2019b. The Digital Humanities Debacle: Computational Methods Repeatedly Come Up Short. The Chronicle of Higher Education 27. Available online: https://www.chronicle.com/article/the-digital-humanities-debacle/ (accessed on 10 November 2025).
- Da, Nan Z. 2020. Critical Response III. On EDA, Complexity, and Redundancy: A Response to Underwood and Weatherby. Critical Inquiry 46: 913–24.
- Denton, Emily, Alex Hanna, Razvan Amironesei, Andrew Smart, and Hilary Nicole. 2021. On the Genealogy of Machine Learning Datasets: A Critical History of ImageNet. Big Data & Society 8: 20539517211035955.
- D’Ignazio, Catherine, and Lauren F. Klein. 2020. Data Feminism. Cambridge, MA: MIT Press.
- Dombrowski, Quinn. 2022. Does Coding Matter for Doing Digital Humanities? In The Bloomsbury Handbook to the Digital Humanities, 1st ed. Edited by James O’Sullivan. Bloomsbury Handbooks. London: Bloomsbury Academic, pp. 137–46.
- Dutta, Abhishek, Giles Bergel, and Andrew Zisserman. 2021. Visual Analysis of Chapbooks Printed in Scotland. Paper presented at the 6th International Workshop on Historical Document Imaging and Processing (HIP’21), Lausanne, Switzerland, September 5–6.
- Dziudzia, Corinna, and Mark Hall. 2020. Die Kanonfrage 2.0. DHd 2020. Available online: https://zenodo.org/record/4621782#.Y1vlTeTP2Uk (accessed on 10 November 2025).
- Eichmann-Kalwara, Nickoal, Jeana Jorgensen, and Scott B. Weingart. 2018. Representation at Digital Humanities Conferences (2000–2015). In Bodies of Information: Intersectional Feminism and the Digital Humanities. Edited by Elizabeth Losh and Jacqueline Wernimont. Debates in the Digital Humanities. Minneapolis: University of Minnesota Press, pp. 72–92.
- Frietsch, Ute. 2017. Alchemie-Notationen in IconClass. In Alchemiegeschichtliche Quellen in der Herzog August Bibliothek. Wolfenbüttel: Herzog August Bibliothek Wolfenbüttel. Available online: https://alchemie.hab.de/bilder/ (accessed on 10 November 2025).
- Gaede, Jonathan. 2024. »So nehmet die materia und thut sie in ein solches Glaß«. Gefäßdarstellungen in Destillationsbüchern und alchemistischen Traktaten der frühen Neuzeit. In Die Sprache Wissenschaftlicher Objekte. Interdisziplinäre Perspektiven auf die Materielle Kultur in den Wissenschaften. Edited by Bettina Lindner-Bornemann and Sebastian Kürschner. Lingua Academica. Berlin: De Gruyter, vol. 8, pp. 7–52.
- Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé, III, and Kate Crawford. 2021. Datasheets for Datasets. Communications of the ACM 64: 86–92.
- Global Indigenous Data Alliance (GIDA). 2021. CARE Principles of Indigenous Data Governance. Available online: https://www.gida-global.org/care/ (accessed on 10 November 2025).
- Götzelmann, Germaine. 2022. Bilderschätze, Bildersuchen: Digitale Auswertung von Illustrationswiederverwendungen im Buchdruck des 16. Jahrhunderts. In Wissen und Buchgestalt. Edited by Philipp Hegel and Michael Krewet. Wiesbaden: Harrassowitz Verlag, pp. 323–40.
- Gray, Joanne, and Alice Witt. 2021. A Feminist Data Ethics of Care for Machine Learning: The What, Why, Who and How. First Monday 26.
- Gray, Mary L., and Siddharth Suri. 2019. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Boston: Eamon Dolan Books.
- Guest, Olivia, Marcela Suarez, Barbara Müller, Edwin van Meerkerk, Arnoud Oude Groote Beverborg, Ronald de Haan, Andrea Reyes Elizondo, Mark Blokpoel, Natalia Scharfenberg, Annelies Kleinherenbrink, et al. 2025. Against the Uncritical Adoption of ‘AI’ Technologies in Academia. Zenodo.
- Hall, Mark. 2019. DH Is the Study of Dead Dudes. DHd 2019. Available online: https://zenodo.org/record/4622026#.Y1vlCeTP2Uk (accessed on 10 November 2025).
- Hodel, Tobias. 2023. Konsequenzen der Handschriftenerkennung und des maschinellen Lernens für die Geschichtswissenschaft. Anwendung, Einordnung und Methodenkritik. Historische Zeitschrift 316: 151–80.
- Holcombe, Alex O. 2019. Contributorship, Not Authorship: Use Credit to Indicate Who Did What. Publications 7: 48.
- Houston, Natalie M. 2023. Distant Reading. In Technology and Literature. Edited by Adam Hammond. Cambridge: Cambridge University Press, pp. 361–76.
- Iliadis, Andrew, and Federica Russo. 2016. Critical Data Studies: An Introduction. Big Data & Society 3: 2053951716674238.
- Illmer, Viktor J. 2025. Works on My Machine: A Case Study of Replicability Challenges in Computational Humanities Research. Anthology of Computers and the Humanities 3: 142–48.
- Jakubik, Johannes, Michael Vössing, Niklas Kühl, Jannis Walk, and Gerhard Satzger. 2024. Data-Centric Artificial Intelligence. Business & Information Systems Engineering 66: 507–15.
- Jannidis, Fotis. 2019. Digitale Geisteswissenschaften: Offene Fragen—Schöne Aussichten. ZMK Zeitschrift für Medien-und Kulturforschung. Ontography 10: 63–70.
- Jannidis, Fotis. 2020. On the Perceived Complexity of Literature: A Response to Nan Z. Da. Journal of Cultural Analytics 5.
- Jarrahi, Mohammad Hossein, Ali Memariani, and Shion Guha. 2023. The Principles of Data-Centric AI (DCAI). Communications of the ACM 66: 84–92.
- Jockers, Matthew L. 2013. Macroanalysis. Digital Methods and Literary History. Champaign: University of Illinois Press.
- Joyeux-Prunel, Béatrice. 2024. Digital Humanities in the Era of Digital Reproducibility: Towards a Fairest and Post-Computational Framework. International Journal of Digital Humanities 6: 23–43.
- Kestemont, Mike, Folgert Karsdorp, Elisabeth de Bruijn, Matthew Driscoll, Katarzyna A. Kapitan, Pádraig Ó Macháin, Daniel Sawyer, Remco Sleiderink, and Anne Chao. 2022. Forgotten Books: The Application of Unseen Species Models to the Survival of Culture. Science 375: 765–69.
- Klein, Lauren, and Catherine D’Ignazio. 2024. Data Feminism for AI. In 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT’24), Rio de Janeiro, Brazil, June 3–6. New York: Association for Computing Machinery, pp. 100–12.
- Lang, Sarah A. 2020. The Computational Humanities and Toxic Masculinity? A (Long) Reflection. LaTeX Ninja’ing and the Digital Humanities. Blog Post from 19 April 2020. Available online: https://latex-ninja.com/2020/04/19/the-computational-humanities-and-toxic-masculinity-a-long-reflection/ (accessed on 10 November 2025).
- Lang, Sarah A. 2021. Experiments in the Digital Laboratory. What the Computational Humanities Can Learn About Their Definition and Terminology from the History of Science. In Fabrikation von Erkenntnis: Experimente in den Digital Humanities. Edited by Manuel Burghardt, Lisa Dieckmann, Timo Steyer, Peer Trilcke, Niels-Oliver Walkowski, Joëlle Weis and Ulrike Wuttke. Esch-sur-Alzette: Melusina Press.
- Lang, Sarah A. 2025. Fine-Tuning Machine Learning with Historical Data. An Alchemical Object Detection Dataset for Early Modern Scientific Illustrations. Zeitschrift für Digitale Geisteswissenschaften 10. Available online: https://zfdg.de/2025_002 (accessed on 10 November 2025).
- Lang, Sarah A. 2026. Critical Concerns for Using LLMs in the (Computational) Humanities and Beyond. In Large Language Models for the History, Philosophy, and Sociology of Science: Reflections from a Field in Motion. Edited by Arno Simons, Adrian Wüthrich, Michael Zichert and Gerd Graßhoff. Bielefeld: Transcript.
- Lang, Sarah A. 2027. Two-tier Computational Humanities: A Labour History of Undervalued Contributions in DH. In De Gruyter Handbook of Feminist Digital Scholarship: DS/DH at the Kitchen Table. Edited by Anne Cong-Huyen and Kim Brillante Knight. Berlin and Boston: De Gruyter.
- Lang, Sarah A., and Elena Suárez Cronauer. 2026. Beyond Data Feminism: Toward Ethical Data Work in the (Digital) Humanities. Zeitschrift für digitale Geisteswissenschaften (ZfdG). Available online: https://zfdg.de/wp_2026 (accessed on 10 November 2025).
- Lang, Sarah A., Bernhard Liebl, and Manuel Burghardt. 2023. Toward a Computational Historiography of Alchemy: Challenges and Obstacles of Object Detection for Historical Illustrations of Mining, Metallurgy, and Distillation in 16th–17th Century Print. In Computational Humanities Research Conference 2023 (CHR 2023, Paris, France, December 6–8). Edited by Artjoms Šeļa, Fotis Jannidis and Iza Romanowska. Aachen: CEUR-WS.org, pp. 29–48. Available online: https://ceur-ws.org/Vol-3558/paper342.pdf (accessed on 10 November 2025).
- Lang, Sarah A., Wishyut Pitawanik, Pascal Belouin, Emma Sevink, Jesse Olszynko-Gryn, Alfred Freeborn, and Etienne Benson. 2025. Quantifying the Environmental Footprint of Curating Datasets with LLMs. Zenodo.
- Lasser, Jana. 2023. Computational Modelling of Complex Social Systems. Graz: Graz University of Technology. Available online: https://janalasser.at/habilitation.pdf (accessed on 10 November 2025).
- Lauer, Gerhard. 2020. Über den Wert der exakten Geisteswissenschaften. In Geisteswissenschaft—Was bleibt? Zwischen Theorie, Tradition und Transformation. Edited by Hans Joas and Jörg Noller. Geist und Geisteswissenschaft. Freiburg: Verlag Karl Alber, vol. 5, pp. 152–73.
- Lässig, Simone. 2021. Digital History: Challenges and Opportunities for the Profession. Geschichte und Gesellschaft 47: 5–34.
- Lin, Tsung-Yi, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision—ECCV 2014. Edited by David Fleet, Tomas Pajdla, Bernt Schiele and Tinne Tuytelaars. Lecture Notes in Computer Science. Cham: Springer, vol. 8693.
- Luccioni, Sasha, and Kate Crawford. 2024. The Nine Lives of ImageNet: A Sociotechnical Retrospective of a Foundation Dataset and the Limits of Automated Essentialism. Journal of Data-Centric Machine Learning Research 1: 1–18. Available online: https://data.mlr.press/assets/pdf/v01-4.pdf (accessed on 10 November 2025).
- Luthra, Mrinalini, and Maria Eskevich. 2024. Data-Envelopes for Cultural Heritage: Going beyond Datasheets. In Workshop on Legal and Ethical Issues in Human Language Technologies@LREC-COLING 2024. Edited by Ingo Siegert and Khalid Choukri. Torino: ELRA and ICCL, pp. 52–65. Available online: https://aclanthology.org/2024.legal-1.9/ (accessed on 10 November 2025).
- Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model Cards for Model Reporting. In Conference on Fairness, Accountability, and Transparency (FAT*’19), Atlanta, GA, USA, January 29–31. New York: Association for Computing Machinery, pp. 220–29.
- Moretti, Franco. 2005. Graphs, Maps, Trees. Abstract Models for a Literary History. London: Verso.
- Moretti, Franco. 2013. Distant Reading. London: Verso.
- Narayanan, Arvind, and Sayash Kapoor. 2024. AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference. Princeton: Princeton University Press.
- Neudecker, Clemens. 2023. Digital Curation and Artificial Intelligence—Opportunities and Risks for Cultural Heritage Institutions. In AI in Museums: Reflections, Perspectives and Applications. Edited by Sonja Thiel and Johannes Bernhardt. Bielefeld: Transcript Verlag, pp. 149–62.
- Nyhan, Julianne. 2022. On the Making of the Myth of the Lone Scholar: Digital Humanities as Aetiology. In Hidden and Devalued Feminized Labour in the Digital Humanities: On the Index Thomisticus Project, 1954–67, 1st ed. London: Routledge.
- Nyhan, Julianne. 2023. The History of the ‘Techie’ in the History of Digital Humanities. In On Making in the Digital Humanities: The Scholarship of Digital Humanities Development in Honour of John Bradley. Edited by Julianne Nyhan, Geoffrey Rockwell, Stéfan Sinclair and Alexandra Ortolja-Baird. London: UCL Press, pp. 129–47.
- Odebrecht, Carolin, Lou Burnard, and Christof Schöch, eds. 2021. European Literary Text Collection (ELTeC). version 1.1.0. COST Action Distant Reading for European Literary History (CA16204). Geneva: CERN.
- Orr, Will, and Kate Crawford. 2024. Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators. Journal of Data-Centric Machine Learning Research 1: 1–21. Available online: https://openreview.net/forum?id=6bd8BrRKTW (accessed on 10 November 2025).
- Paullada, Amandalynne, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. 2021. Data and Its (Dis)Contents: A Survey of Dataset Development and Use in Machine Learning Research. Patterns 2: 100336.
- Piotrowski, Michael, and Markus Neuwirth. 2020. Prospects for Computational Hermeneutics. In Atti del IX Convegno Annuale AIUCD. La Svolta Inevitabile: Sfide e Prospettive per l’Informatica Umanistica (Milan, Italy, January 15–17). Edited by Cristina Marras, Marco Passarotti, Greta Franzini and Eleonora Litta. Milan: Associazione per l’Informatica Umanistica e la Cultura Digitale (AIUCD), pp. 204–9.
- Primorac, Antonija, Rosario Arias, Roxana Patraș, Eva Eglāja-Kristsone, Karina van Dalen-Oskam, Berenike Herrmann, Christof Schöch, and Pieter François. 2023. Distant Reading Two Decades On: Reflections on the Digital Turn in the Study of Literature. Digital Studies/Le Champ Numérique 13: 1–24.
- Rajabi, Amirarsalan, Mehdi Yazdani-Jahromi, Ozlem Ozmen Garibay, and Gita Sukthankar. 2022. Through a Fair Looking-Glass: Mitigating Bias in Image Datasets. Paper presented at the AAAI 2023 Workshop on Representation Learning for Responsible Human-Centric AI (R2HCAI), Washington, DC, USA, February 13.
- Raji, Inioluwa Deborah, Timnit Gebru, Margaret Mitchell, Joy Buolamwini, Joonseok Lee, and Remi Denton. 2020. Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing. In AAAI/ACM Conference on AI, Ethics, and Society (AIES ’20), New York, NY, USA, February 7–9. New York: Association for Computing Machinery, pp. 145–51.
- Ries, Thorsten, Karina van Dalen-Oskam, and Fabian Offert. 2023. Reproducibility and Explainability in Digital Humanities. International Journal of Digital Humanities 5: 247–51.
- Riley, Donna. 2017. Rigor/Us: Building Boundaries and Disciplining Diversity with Standards of Merit. Engineering Studies 9: 249–65.
- Ross, Shawna, and Andrew Pilsch. 2022. Labor, Alienation, and the Digital Humanities. In The Bloomsbury Handbook to the Digital Humanities, 1st ed. Edited by James O’Sullivan. Bloomsbury Handbooks. London: Bloomsbury Academic, pp. 335–46.
- Roth, Camille. 2019. Digital, Digitized, and Numerical Humanities. Digital Scholarship in the Humanities 34: 616–32.
- Salari, Aria, Abtin Djavadifar, Xiangrui Liu, and Homayoun Najjaran. 2022. Object Recognition Datasets and Challenges: A Review. Neurocomputing 495: 129–52.
- Schöch, Christof, Roxana Patraș, Diana Santos, and Tomaž Erjavec. 2021. Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives. Modern Languages Open 1.
- Schwandt, Silke. 2022. Opening the Black Box of Interpretation: Digital History Practices as Models of Knowledge. History and Theory 61: 77–85.
- Shadrova, Anna. 2021. Topic Models Do Not Model Topics: Epistemological Remarks and Steps Towards Best Practices. Journal of Data Mining and Digital Humanities 2021: 1–28.
- Sinclair, Stéfan, and Geoffrey Rockwell. 2016. Hermeneutica. Computer-Assisted Interpretation in the Humanities. Cambridge, MA: MIT Press.
- Smits, Thomas, and Melvin Wevers. 2021. The Agency of Computer Vision Models as Optical Instruments. Visual Communication 21: 329–49.
- Smits, Thomas, and Melvin Wevers. 2023. A Multimodal Turn in Digital Humanities: Using Contrastive Machine Learning Models to Explore, Enrich, and Analyze Digital Visual Historical Collections. Digital Scholarship in the Humanities 38: 1267–80.
- Smits, Thomas, and Mike Kestemont. 2021. Towards Multimodal Computational Humanities: Using CLIP to Analyze Late-Nineteenth-Century Magic Lantern Slides. Paper presented at the Computational Humanities Research Conference 2021 (CHR 2021), Amsterdam, The Netherlands, November 17–19; pp. 149–58. Available online: https://ceur-ws.org/Vol-2989/short_paper23.pdf (accessed on 10 November 2025).
- Stachowiak, Herbert. 1973. Allgemeine Modelltheorie. Vienna: Springer.
- Torralba, Antonio, Rob Fergus, and William T. Freeman. 2008. 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 30: 1958–70.
- Underwood, Ted. 2020. Critical Response II. The Theoretical Divide Driving Debates about Computation. Critical Inquiry 46: 900–12.
- Valleriani, Matteo, ed. 2020. De Sphaera of Johannes de Sacrobosco in the Early Modern Period: The Authors of the Commentaries. Cham: Springer.
- Valleriani, Matteo. 2025a. The Sphere: Knowledge System Evolution and the Shared Scientific Identity in Europe. Berlin: Max Planck Institute for the History of Science. Available online: https://www.mpiwg-berlin.mpg.de/project/the-sphere (accessed on 10 November 2025).
- Valleriani, Matteo. 2025b. Large Language Models That Power AI Should Be Publicly Owned. The Guardian, May 26. Available online: https://www.theguardian.com/technology/2025/may/26/large-language-models-that-power-ai-should-be-publicly-owned (accessed on 10 November 2025).
- Vecchione, Briana, Solon Barocas, and Karen Levy. 2021. Algorithmic Auditing and Social Justice: Lessons from the History of Audit Studies. In Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO’21). New York: ACM.
- Wevers, Melvin, and Thomas Smits. 2020. The Visual Digital Turn: Using Neural Networks to Study Historical Images. Digital Scholarship in the Humanities 35: 194–207.
- Wilkens, Matthew. 2026. What Instagram and Community Colleges Tell Us about the Future of Digital Humanities. American Literary History 37: 1095–103.
- Yang, Kaiyu, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Russakovsky. 2020. Towards Fairer Datasets: Filtering and Balancing the Distribution of the People Subtree in the ImageNet Hierarchy. Paper presented at the 2020 Conference on Fairness, Accountability, and Transparency (FAccT 2020), Barcelona, Spain, January 27–30; pp. 547–58.
- Zaagsma, Gerben. 2023. Digital History and the Politics of Digitization. Digital Scholarship in the Humanities 38: 830–51.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Lang, S.A. (Doing) Computational History: The Role of Data Work in Computational Approaches. Histories 2026, 6, 26. https://doi.org/10.3390/histories6020026

