Towards A Taxonomy of Uncertainties: Analysing Sources of Spatio-Temporal Uncertainty on the Example of Non-Standard German Corpora

Abstract: Different types of uncertainties occur in almost all datasets and are an inherent property of data across different academic disciplines, including digital humanities (DH). In this paper, we address, demonstrate and analyse spatio-temporal uncertainties in a non-standard German legacy dataset in a DH context. Although the data collection is primarily a linguistic resource, it contains a wealth of additional, comprehensive information, such as location and temporal detail. The addressed uncertainties have arisen for a variety of reasons, partly as a result of decades of data transformation processes. We propose our own taxonomy for capturing and classifying the various uncertainties, and show with numerous examples how the remedying, but also the re-introduction, of uncertainties affects DH practices.


Introduction
Uncertainty is an inherent aspect of humanity, of our daily activities, actions and interactions. It is typically associated with unknown or lacking information, imprecise or incomplete knowledge, inaccurate measurements and risk. As uncertainty permeates our lives in various forms and ways, it has also become a topic of discussion in policy making and entrepreneurship, and has equally been taken up in the scholarly and scientific discourses.
In the light of recent developments in the European policy landscape regarding science, research and innovation, the boosting of "openness" on several levels (i.e., innovation, data) has also given fresh impetus for uncertainty to resurface in the wider scientific discourse [1,2]. Nowotny [3] (p. 1602) states that "[...] science thrives on the cusp of uncertainty - the rare moments of intense creativity when completely new perspectives and visions open up and lead to new discoveries and insights." Scientific research and innovation processes are thus inherently uncertain, and the more so as they evolve towards ecosystem networks of actor groups with increased inclusion, collaboration and participation of different stakeholders, and with the pressing necessity to meet human needs and face societal challenges. "It is therefore crucial to distinguish when and where science needs time and space when to engage with uncertainty." [3] (p. 1603). As a consequence, embracing uncertainty, creating a culture of learning from errors and allowing for the creation and acknowledgement of serendipitous discovery conditions are key, and they lie at the centre of the ongoing discussion around scientific innovation and progress, not only at the policy level. In academic and scholarly discourses, uncertainty has for decades been a fundamental topic of interest across scientific disciplines, including philosophy [4], psychology [5], physics [6], information science [7], economics [8], law [9] and statistics [10], to name just a few examples (see also [11]). There, uncertainty is typically dealt with from the point of view of risk assessment, measurements of uncertainty or methods of removing uncertainty to attain higher degrees of certainty (cf. [12]).
In this context, a distinction can be made in how uncertainty is dealt with, for example, in the natural sciences in contrast to the humanities. Whilst uncertainties in the natural sciences are mostly related to the expected limits of measurement and to the statistical properties of what can be inferred from empirical samples, uncertainties in the humanities can also involve subjective aspects related to perception, ambiguity, vagueness, incompleteness, credibility, etc.
The research purpose of the present paper thus evolves out of the uniqueness of the legacy collection dealt with here. The types of uncertainties encountered in humanities datasets already differ from those more typical in the natural sciences. Our specific non-standard dataset, and the focus on spatio-temporal aspects, make it necessary to devise our own taxonomy in order to fully capture the relevant uncertainties.
The paper is thus structured as follows: Section 1 (Introduction) presents an overview of existing and relevant taxonomies dealing with one or more of the aspects (temporal, spatial, uncertainties) central to the research purpose, and also an outline of the diverse contexts in which uncertainty has been dealt with in the digital humanities field. In Section 2 (Materials and Methods), the specific dataset is illustrated, followed by a pointer to previous analyses of spatial and temporal aspects. In addition, the methods of devising our taxonomy are outlined. Section 3 (Results) presents the DBÖ (Datenbank der bairischen Mundarten in Österreich/Database of Bavarian Dialects in Austria) taxonomy of uncertainties, illustrated with detailed examples of its specific categories. Finally, Section 4 (Discussion/Conclusion) discusses the newly composed taxonomy against the background of the existing taxonomies reviewed in Section 1. In addition, reflections are provided on how the remedying or re-introduction of uncertainties affects digital humanities (DH) practice.

Research Framework and Context
This study is realised in the context of exploration space. The exploration space was established in the working group 'Methods and Innovation' [13] at the Austrian Centre for Digital Humanities (ACDH) as a virtual and physical environment for fostering experimentation and innovation in the networked humanities, encouraging exchange at the interface of humanities, geography and environmental studies, as well as design and arts. Actors co-design innovation processes and experiment on questions of cultural, linguistic and biological diversity. Subsequently, exploration space has been listed as a best-practice example of open innovation by the Ministries of Transport, Innovation and Technology (bmvit) and Education, Science and Research (bmwfw) [14]. The particular framework and approach taken in the project has given rise to, and enabled, the establishment of the Open Innovation Research Infrastructure (OI-RI) and exploration space at the Austrian Centre for Digital Humanities (ACDH-OeAW) at the Austrian Academy of Sciences, in which exploreAT! is reified.
exploreAT! (exploring Austria's culture through the language glass; cf. [15]) has evolved as a cross-disciplinary project at ACDH-OeAW since 2015. It brings together expertise from different disciplines and collaboration partners in the fields of cultural lexicography and open innovation (ACDH-OeAW, Austria), semantic technologies (ADAPT Centre, Dublin City University, Dublin, Ireland) and human-machine interaction via visualization (VisUSAL, Universidad de Salamanca, Salamanca, Spain) (cf. [16][17][18][19]). More generally, the project aims to make the implicit, and to date only partially unlocked, cultural knowledge contained within a non-standard language legacy dataset accessible, connectable and reusable for different disciplines and actor groups. To achieve this, a variety of cross-disciplinary knowledge and agile research and design thinking methods are drawn upon. Following the principles of the Open Innovation Strategy for Austria [20], it also connects to different actor groups from society and industry.
The exploreAT! project revolves around a digitised non-standard language resource of the Bavarian Dialects in Austria (DBÖ, Datenbank der bairischen Mundarten in Österreich/Database of Bavarian Dialects in Austria) and the related dbo@ema (Database of Bavarian Dialects Electronically Mapped; Wandl-Vogt, E. (2010; Ed.). Datenbank der bairischen Mundarten in Österreich electronically mapped [Database of the Bavarian Dialects in Austria electronically mapped] (dbo@ema). Wien. [Processing status: 2018.01.]) (cf. [21]). This highly heterogeneous collection captures the language, and through it the culture, of the local society in the area of the former Austro-Hungarian empire from the beginnings of the German language up until now. Besides capturing the local non-standard speech of the population, it contains a wealth of cultural information on detailed aspects of the former day-to-day life of the local population, including professions, customs, religious festivities, folklore medicine and food.
Inspiration also came from another DH project, ProvideDH (Progressive Visual Decision Making in Digital Humanities) [22] (cf. [23]), which is developing its own set of metrics based on a distinction between epistemic uncertainty (systematic uncertainty; reducible) and aleatory uncertainty (inherent uncertainty; irreducible), as proposed by [24] (cf. [25]).
The ProvideDH project [23,25] is implemented at the University of Salamanca (GrialUSAL) (ES) in collaboration with exploration space @ ACDH-OeAW Austrian Academy of Sciences (AT), Trinity College Dublin (IE) and the Supercomputing Centre Poznan (PL). The project is also run within a digital humanities environment and seeks to propose and develop innovative progressive visualisation tools that track and convey the degree of uncertainty of a given (humanities) dataset during its evolution, and how these data are affected by the different computational models applied. The different sources of uncertainty introduced over the course of time and affecting DH practice are made visible and can be assessed by the researcher, ultimately supporting the decision-making process (cf. [25]). Tools are developed for different use-cases based on humanities data (XML/TEI) (the 1641 depositions) and scenarios (according to the Open Innovation Research Infrastructure Design © ewv 2017), and aim to enable collaborative and visual annotation in an uncertainty-aware context (cf. [26]) (Figure 1).

Taxonomies of Uncertainty: A Concise Overview
This section presents a review of selected taxonomies dealing with one or all of the categories relevant for the present research purpose of capturing spatial and temporal uncertainties. The chosen taxonomies also offer a broad basis for evaluating whether existing classifications can be readily adopted, or whether categories more specific to lexicographic collections would need to be devised.
Different kinds of taxonomies have been proposed across disciplines to classify uncertainties [27]. Specific taxonomies of uncertainty can be found for various given areas, such as biology [28], health [29] and trading regulations [30]. A very general taxonomy is presented by the New World Encyclopaedia entry on uncertainty [31] (Figure 2). The difference between epistemological and ontological uncertainty was taken into account when devising our own taxonomy.
Smithson [32] presents a comprehensible one (Figure 3), adapted from [33]. In this taxonomy, uncertainty appears as a specific kind of incompleteness, but not as an error.
Shattuck, Lewis Miller, and Kemmerer [34], on the other hand, make the distinction between uncertainty produced by the flow of information and by the individuals dealing with given information (Figure 4). This is the same approach as [35].
Lovell [36], in an extended digression on the topic, presents a detailed compilation of uncertainties of all sources (Figure 5). In this view, which adds another aspect to the previous one, uncertainties can originate in (i) the world itself, (ii) the empirical evidence and (iii) the human subjects that interpret them (decision makers).
Vullings, de Vries, and de Borman [37], based on [24], devised a fairly complete model for dealing with spatial uncertainties (Figure 6). The most salient aspect of this model is the distinction between "data (input/output)" and "planning process".
Temporal uncertainties often come associated with spatial data, as pointed out by [38]. Aigner et al. [39] distinguish time points and time intervals, and also draw attention to the kind of events that are being described when they involve other variables (such as space). Kissling et al. [40] identify the variation in the length of time series and the precision of time in the collection process as sources of temporal uncertainty.

Uncertainty in (Digital) Humanities
In the scope of the present paper, we specifically deal with the exploration of uncertainty in the field of humanities and, in particular, digital humanities (DH). There, uncertainty has in recent years also been in the spotlight of discussion and has generated increased interest, particularly in relation to data and data treatment. Our understanding of uncertainty in this paper concerns the process of data transformation and evolution, and the sources of uncertainty that can affect DH practice, which we illustrate with concrete examples from a case study dataset, focusing on spatio-temporal aspects.
While some have addressed the specific possibilities of encoding and modelling uncertainty in humanities data [41], others have discussed the differentiation between humanistic "fact" and interpretation that shapes the nature of humanistic research questions and attitudes towards sources [42]. Further issues relevant to the DH field were also discussed in different studies in a special track on "Uncertainty in Digital Humanities" at the Conference of Technological Ecosystems for Enhancing Multiculturality (TEEM) in 2018 [43] (see also [25,42,44,45], etc.). The concept of uncertainty in relation to data-driven innovation (DDI), i.e., the production of innovative outputs from data, has been addressed more specifically in [46], whose authors urge the re-thinking and re-organising of views on data collection in the light of open science and cross-organisational collaborations to enable new designs of DDI networks for dealing with aspects of heterogeneity in data.
We follow a data-driven research approach and address aspects of uncertainty in a non-standard language legacy dataset (DBÖ; Österreichische Akademie der Wissenschaften. (1993-). Datenbank der bairischen Mundarten in Österreich [Database of Bavarian Dialects in Austria] (DBÖ). Wien. [Processing status: 2018.01.]) in the context of the DH project exploreAT! [15]. Even though heterogeneous data offer a wide spectrum of different kinds of uncertainties to be addressed, we focus here more specifically on the analysis of uncertainties in spatial and temporal aspects of our legacy language collection.
Uncertainty in data pertaining to geographic information systems (GIS), and spatial information in general, is a frequently explored topic (cf. [47][48][49][50]) and also has its own entry in the GIS dictionary [51].
In relation to our language dataset and its spatio-temporal dimensions, uncertainties that have arisen in the process of data evolution include imprecise or erroneous information and knowledge, incomplete information, spelling mistakes, abbreviations, ambiguous information, missing information, and uncertainties introduced in the process of digital data transformation and standardisation by tools or persons. In particular, in combination with language phenomena and changes in linguistic processes, such as shifts in language borders/boundaries, uncertainties in the spatio-temporal aspects play an important role and can also provide insights. To this end, we have grounded our analysis of the uncertainties in specific existing sets of categories, modified where necessary to include novel aspects found in our data. The model proposed by [37] based on [24], as introduced in Section 1.1, proved fairly successful when used to classify our examples of spatial uncertainties. We have adapted the taxonomies presented and incorporated new elements. Figure 7 depicts the classes of uncertainties used to classify the samples in our collection regarding the spatial, temporal and linguistic dimensions. We will provide some examples of each one in the results section of this paper.
Our selected focus is relevant to the digital humanities and to related disciplines: spatial and temporal information are frequent aspects of data collections, which makes our paper pertinent to other fields as well.

Materials and Methods
This section presents an overview of the materials and methods central to the paper. We describe the exploreAT! project and exploration space as the general context and framework for the scenario described in this study, introduce the indigenous language data collection and discuss sources of uncertainty, focusing on the spatio-temporal dimensions of our linguistic dataset. The methods explain the composition of our specific taxonomy.

Materials and Data Description
A prominent part of the resource consists of digitised data collection questionnaires and the related answers from paper slips, which initially also pertained to a dictionary project (WBÖ [52], Wörterbuch der bairischen Mundarten in Österreich [Dictionary of Bavarian Dialects in Austria]) intended to capture the German language spoken by the local population, including the compilation of a linguistic atlas of the local dialect geography [53]. Aside from this dataset, the DBÖ collection further contains digitised information from excerpts of folklore literature, vernacular dictionaries, and catalogues of plant names and mushrooms. The data follow a lexicographic structure, consisting of lemmas, definitions, sources, time stamps, location information and a variety of other fields. In addition to this rich texture of linguistic, cultural and societal content, detailed information is also available on persons (authors, collectors, editors) (cf. [54]) and on spatio-temporal aspects (places, regions, GIS locations, etc.) of the collection (cf. [55]).
Table 1 presents a numerical overview of a relevant sub-set (linguistic, location, time-related) of the major entities contained in this non-standard language data collection in two of the main current digital sources (XML/TEI files and records from a relational MySQL database). Looking first at the overall number of entries in Table 1, we noted a striking difference in size between the two data collections, with the MySQL database containing considerably fewer entries than the XML/TEI data. Of the different parameters contained in each entry, some, but not all, were relevant for our current context and analysis. Thus, we concentrate here on a sub-set of specific linguistic, spatial and temporal parameters only, and provide an overview of the total number of entries in each of the two datasets, as well as the number of unique (i.e., non-repeated) entries (Table 1, column 1).
The specific linguistic parameters concern headwords/lemmas with the following distinction: original mainlemmas (i.e., headwords transcribed in their original form with special characters, e.g., (zu{o)t-aufen), normalized mainlemmas (i.e., headwords transcribed without special characters, e.g., zuotaufen), original additional lemmas (e.g., (Winkel)êe), normalized additional lemmas (i.e., additional headwords transcribed without special characters, e.g., Winkelêe) and entries with no lemmas.
The specific spatial parameters, on the other hand, are: Bundesland (federal state; e.g., Steiermark/St.), Großregion (big region; e.g., mittelbairische Obersteiermark/mbair. Obst.), Kleinregion (small region; e.g., Erzberger Gegend/Erzbg. Geg.), Gemeinde (municipality; e.g., Radmer), Ort (location; e.g., Radmer) and entries without a given location. The distinction between the different types/sizes of regions was made according to an internally developed system of identifiers for regions, so-called sigles, which consist of a letter-number combination and denote a hierarchical structure, as can be seen in Figure 8.
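The hierarchical location parameters can be pictured as a record in which each level may be individually missing. The following Python sketch is purely illustrative: the class `Location`, its field names and the helper `finest_level` are hypothetical and are not part of the DBÖ data model, and the actual sigle format (Figure 8) is not reproduced here. It only shows how a missing level translates into a coarser, and hence more uncertain, spatial reference.

```python
from dataclasses import dataclass
from typing import Optional

# Levels from coarsest to finest, as named in the text.
LEVELS = ("bundesland", "grossregion", "kleinregion", "gemeinde", "ort")


@dataclass
class Location:
    """Hypothetical location record; every level may be missing (None)."""
    bundesland: Optional[str] = None
    grossregion: Optional[str] = None
    kleinregion: Optional[str] = None
    gemeinde: Optional[str] = None
    ort: Optional[str] = None

    def finest_level(self) -> Optional[str]:
        """Return the name of the most specific level that is filled in."""
        for level in reversed(LEVELS):
            if getattr(self, level) is not None:
                return level
        return None  # corresponds to an entry without a given location


# Example values taken from the parameter list above.
radmer = Location("Steiermark", "mittelbairische Obersteiermark",
                  "Erzberger Gegend", "Radmer", "Radmer")
region_only = Location("Steiermark", "mittelbairische Obersteiermark")
unlocated = Location()

print(radmer.finest_level())       # ort
print(region_only.finest_level())  # grossregion
print(unlocated.finest_level())    # None
```

An entry such as `region_only` can only be placed somewhere within a Großregion, which is one simple way of making the spatial granularity of an entry explicit.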
Overall the number of unique entries was significantly smaller than the overall number of entries.Whereas normalized lemmas (i.e., lemmas written without special characters) were included in the XML data, this parameter did not occur in the lemmas of the MySQL dataset.As for entries lacking lemmas, a considerably high number of such entries were found in the XML dataset, whereas this was not at all the case in the MySQL data.
Comparing next the total and unique numbers of locations, we first noted again considerably higher numbers of entries in the XML dataset compared with MySQL, but also striking structural differences between these two datasets.Whereas the majority of XML entries contained a hierarchical structure of location information (Bundesland > Großregion > Kleinregion > Gemeinde > Ort), some parameters (Bundesland, Großregion, Kleinregion) were not accessible in a structured way, but had been merged in a single column.A noticeable difference between the datasets, however, emerged, in that a higher number of unique location entries was contained in the MySQL dataset.Looking finally at entries that were or were not linked to location parameters, again an overall higher number could be observed for the XML dataset.There, a higher number of entries was also linked to location information, whereas the opposite was the case in the MySQL dataset.
As for the time information of entries in both data sets, the oldest and most recent time information was determined, which most typically refers to the publication year of a source in the case of books, or the year of collection in the case of a questionnaire-related entry.In both datasets, we noted rather large time spans, which again highlights the heterogeneity of the material.Comparing next the numbers of total and unique lemmas in the XML/TEI and MySQL datasets, we again noted an overall higher number of both total and unique lemmas in the TEI/XML dataset.Overall the number of unique entries was significantly smaller than the overall number of entries.Whereas normalized lemmas (i.e., lemmas written without special characters) were included in the XML data, this parameter did not occur in the lemmas of the MySQL dataset.As for entries lacking lemmas, a considerably high number of such entries were found in the XML dataset, whereas this was not at all the case in the MySQL data.
Comparing next the total and unique numbers of locations, we first noted again considerably higher numbers of entries in the XML dataset compared with MySQL, but also striking structural differences between these two datasets.Whereas the majority of XML entries contained a hierarchical structure of location information (Bundesland > Großregion > Kleinregion > Gemeinde > Ort), some parameters (Bundesland, Großregion, Kleinregion) were not accessible in a structured way, but had been merged in a single column.A noticeable difference between the datasets, however, emerged, in that a higher number of unique location entries was contained in the MySQL dataset.
Looking finally at entries that were or were not linked to location parameters, again an overall higher number could be observed for the XML dataset.There, a higher number of entries was also linked to location information, whereas the opposite was the case in the MySQL dataset.
As for the time information of entries in both datasets, the oldest and most recent time information was determined, which most typically refers to the publication year of a source in the case of books, or the year of collection in the case of a questionnaire-related entry. In both datasets, we noted rather large time spans, which again highlights the heterogeneity of the material.
While this numerical overview offers an impression of the type of data contained, it also gives insights into the extent of heterogeneity and the various levels at which uncertainties in this particular dataset can arise. The records were not homogeneous, given differences in the details from the myriad of sources, and also because of differences in the transformation and conversion processes from the legacy sources to the current ones.

Data Transformation and Process Description
The DBÖ collection has undergone many transformations from its beginning in the early 20th century (~1913) until today (2019). Originally, the collection was initiated with paper questionnaires and answers noted on individual paper slips. From these analogue forms, the now digital data has undergone several stages of digitization and digital transformation through the years (Figures 9 and 10) until reaching its current state: partly in XML/TEI format [56] and partly as a MySQL database (dbo@ema) (cf. [57]).
During the turmoil of World War II, the collection suffered some notable losses, and thus the precise extent of the original material can only be speculated upon. In the first stage of digitization (1993-2011), all available information noted on the paper slips (including, for example, headword, meaning, pronunciation, location, date and collector name) was manually entered by several data typists into TUSTEP (Tuebinger System von Textverarbeitungs-Programmen/Tuebingen System of Text Processing tools), resulting in ~2.43 million entries. Additionally, auxiliary databases, persons databases, a literature database, a plant name database and location data were also created in TUSTEP. Towards the end of this first digitization process (2007), part of these TUSTEP data were transferred to a MySQL database as part of the dbo@ema project [57]. For the first time, different separate databases were joined; a geographic visualization interface (maps) and GIS locations were added, and information and data were made publicly accessible and visible on the Internet via a project website [56]. From then on, the heterogeneity of the data increased again, with parts of the original data still being available in TUSTEP, another part having been converted to MySQL and additional newly digitized data being entered directly in MySQL.
Following the dbo@ema project, the next step marked the transformation process towards the networked, open data realm. The conversion of the remaining TUSTEP data to XML/TEI format was one of the first endeavours in exploreAT!, starting in 2015. Data entered in MySQL, however, remained unaltered at first. In the course of exploreAT!, the MySQL data and some of the XML/TEI data were converted to RDF and linked to the Linked Open Data (LOD) Cloud (2017-). In addition, the data is currently in the process of being enriched with lexical concepts and linked to DBpedia concepts. Figures 9 and 10 show an overview of the data transformation process.
As a concrete example, Figure 11 shows the partial steps of conversion, beginning with a paper slip (1), followed by a TUSTEP entry (2) and an XML/TEI file excerpt (3). Regarding the MySQL database, data was added straight from the records, as depicted in Figure 10. An overview of the database schema can be seen in Figure 12.

Spatial and Temporal Dimensions in the DBÖ: A Review
Since the very beginning of the data collection and in the course of the DBÖ data transformation, spatial and geographical concepts contained in dbo@ema have been a fundamental aspect of the data analysis and transformation [58][59][60][61][62].
With the advent of novel digital tools and methods, new ways of exploring these spatial dimensions have become possible, and the data have thus served as the basis for various studies, collaborations and theses in the context of data visualisation [16] (Figure 13), data modelling [60,63] or linked data representations [55,64] (Figure 14). The association of spatial and linguistic features also allowed for the analysis of the areas which yielded more concepts, as shown in the heatmap of Figure 15.

Methods
The composition of our taxonomy of uncertainties was preceded by a review of carefully selected existing models, aiming at covering the spatial, temporal and linguistic aspects across different fields. In spite of the abundance of taxonomies dealing with spatial and temporal aspects, we found almost no coverage of the linguistic aspects. On the basis of this review, and against the background of our dataset, we selected for our taxonomy those entities that were deemed most appropriate for capturing the underlying uncertainties. We tried to reuse the existing spatial and temporal categories to the maximum extent, presenting them in a simple but unified and coherent structure. Additionally, we introduced the linguistic aspects in such a way that the whole set of proposed dimensions fits into a single, all-encompassing instrument that can be applied to our ad hoc collection. Where no match was found, we established our own categories. The devised taxonomy based on uncertainties in the DBÖ is discussed in further detail in Section 3.

Results
As commonly occurs in long data transformation and conversion processes, uncertainties have been both remedied and reintroduced over time, for example through differences in DB schemas due to the assignment of fields during database conversion, and through imperfect matches between lexical and LOD concepts in the enrichment process. Most of these uncertainties are common to a plethora of long-term, data-intensive projects. However, some are very particular to this collection.
To characterize the uncertainty sources that were found, a specific taxonomy was developed, which is now introduced.

Uncertainty in DBÖ: A Specific Taxonomy
In Section 1.2, a non-exhaustive review of attempts to offer taxonomies of uncertainties was presented. In this section, we present a taxonomy of our own, based on specific categories relevant to our dataset. Table 2 below, based on Figure 7 (Section 1.3), depicts the classes and sources of uncertainties found in our collection regarding spatial, temporal and linguistic dimensions, along with some examples of their sources.
While the intrinsic/ontological dimension deals with the problem of indeterminacy that arises from the limits on our capacity to know what exists in reality, the intrinsic/epistemic dimension deals with limits in the human capacity to measure, observe, record or determine precisely what is to be known. Conversely, uncertainties with origins in the extrinsic dimensions would all be avoidable, given a certain effort and control, but such effort and control are often unfeasible. User input is related to the role humans have in the transformation of data and information substrates; data conversion is related to the technological pitfalls when converting automatically from format to format; and, last but not least, data record is inherent to ambiguity in the data per se, and not directly related to the consequences of transformations and users' actions.
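The three-level structure described above (dimension, intrinsic or extrinsic origin, and source) could be captured in a small data structure. The following is a minimal sketch only; the class and attribute names are illustrative assumptions and are not part of the DBÖ data model.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    SPATIAL = "spatial"
    TEMPORAL = "temporal"
    LINGUISTIC = "linguistic"


class Origin(Enum):
    INTRINSIC = "intrinsic"
    EXTRINSIC = "extrinsic"


class Source(Enum):
    ONTOLOGICAL = "ontological"
    EPISTEMIC = "epistemic"
    USER_INPUT = "user input"
    DATA_CONVERSION = "data conversion"
    DATA_RECORD = "data record"


# Which sources belong to which origin, following the prose description.
SOURCES_BY_ORIGIN = {
    Origin.INTRINSIC: {Source.ONTOLOGICAL, Source.EPISTEMIC},
    Origin.EXTRINSIC: {
        Source.USER_INPUT,
        Source.DATA_CONVERSION,
        Source.DATA_RECORD,
    },
}


@dataclass(frozen=True)
class UncertaintyTag:
    """Hypothetical label that could be attached to a single record."""

    dimension: Dimension
    origin: Origin
    source: Source

    def __post_init__(self):
        # Reject combinations the taxonomy does not define,
        # e.g. intrinsic + data record.
        if self.source not in SOURCES_BY_ORIGIN[self.origin]:
            raise ValueError(
                f"{self.source.value!r} is not an {self.origin.value} source"
            )

    def label(self) -> str:
        # Builds a label in the style of the subsection headings.
        return "/".join(
            [
                self.dimension.value.capitalize(),
                self.origin.value.capitalize(),
                self.source.value.title(),
            ]
        )


tag = UncertaintyTag(Dimension.SPATIAL, Origin.EXTRINSIC, Source.USER_INPUT)
print(tag.label())
```

A record affected by several uncertainties could simply carry a set of such tags, which keeps the classification separate from the data values themselves.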

Uncertainty in DBÖ: Examples
Having first established our own taxonomy of uncertainties, we now present concrete examples for the most relevant cases from our data to exemplify and illustrate each category. The concrete examples in the following subsections correspond to the dimensions outlined in Table 2 above.

Spatial/Intrinsic/Ontological
A typical example of this kind of uncertainty source in the DBÖ spatial data is a frequently occurring historical region in the far north-west of Bohemia in the Czech Republic, at the border with Germany, called "Egerland". It was characterised by its German-speaking population until 1945 (cf. [68]), but the names of the places in this region have all changed since. Other common cases in this collection are places that used to belong to Austria but are now in a different country, such as "Stilfs-Stelvio, Vinschgau", which is now "Val Venosta, BZ, TAA, 39029 in Italy", or "Groß Olkowitz", which is now "Oleksovice" in the Czech Republic. Some of these examples are presented in Figure 16, together with their original, corresponding paper slips.

Spatial/Intrinsic/Epistemic
On the epistemic side, spatial examples are many, given the abundance of sources for imprecision, ignorance or incompleteness. Such examples can be differences in detail (a point/coordinate or a region/polygon), as in "Südmähren" (a region) and "Obweg, Landeck, Tirol, 6571, Österreich" (a very well defined point).

Spatial/Extrinsic/User Input
Euphemistically speaking, there is no shortage of cases where the collectors' and data typists' "creativity" introduces uncertainties when putting data into systems. The lack of standards and guidelines, or their changes over time, along with errors that arise from biases or different backgrounds, can produce records that range from "ambiguous" to "data chimera". As an illustration, we can mention typos and abbreviations (e.g., "Kapellen BöW ˆ#ˆ# ON/Gm.?", or "o.Ang.", that mix upper and lowercases, special characters and different types of information, such as the collector initials) in the same data field. There are, however, also other interesting cases, where biases and mistaken interpretations can lead to errors in the chain. Figure 17 shows an entry that the data typist misread twice:
1. The correct location is "Unterinn" and not "Matrei". The error caused a wrong assignment, thus the standardised source in TUSTEP and later in the TEI/XML file (see below) is not correct. The data typist was probably mistaken, as the collectors in Matrei and in Unterinn have similar names (Egger/Paregger).
2. There is also an error in the meaning: it is not "übermächtigen Leuten" (overpowering people), but "übernächtigen Leuten" (people that are tired out). As it almost sounds true and is not all that far-fetched, it is easy to overlook.
These interpretations have introduced both spatial and linguistic errors in the collection.

Spatial/Extrinsic/Data Conversion
Data conversion and enrichment can augment the information conveyed by the systems, but can introduce errors as well. In the specific case of spatial data, wrongly matched geolocations have occurred often with our dataset. We have mainly used Geopy [69] (a Python package) for geolocation, as it offers access to many free and paid geocoders. The default and free option, OSM Nominatim [70], is quite good, but not as reliable as the GoogleV3 or Bing APIs, which are paid. One example of a mismatch is the place "Weikersdorf", which was identified as "Vikýřovice, okres Šumperk, Olomoucký kraj, Střední Morava, 78813, Česko" by OSM geolocation, whilst Google would have pointed to the correct location "Weikersdorf am Steinfelde", in Austria.
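One way to guard against mismatches of this kind is to request several candidates from the geocoder and accept only those that fall in a country the record is expected to belong to. The sketch below is an illustration under stated assumptions, not the procedure used for the DBÖ: the function names, the candidate dictionary layout and the user agent string are hypothetical, and the expected country has to come from other fields of the record. The Geopy adapter needs network access and the third-party `geopy` package, so the example at the bottom runs on hand-written candidates taken from the case above.

```python
from typing import Iterable, Optional


def pick_candidate(candidates: Iterable[dict],
                   expected_countries: set) -> Optional[dict]:
    """Return the first candidate located in an expected country, else None."""
    for cand in candidates:
        if cand.get("country_code", "").lower() in expected_countries:
            return cand
    return None  # nothing plausible: flag the record for manual review


def nominatim_candidates(place: str) -> list:
    """Hypothetical adapter that queries OSM Nominatim through Geopy."""
    from geopy.geocoders import Nominatim  # imported lazily, third-party

    geocoder = Nominatim(user_agent="dboe-uncertainty-sketch")  # assumed name
    results = geocoder.geocode(place, exactly_one=False,
                               addressdetails=True) or []
    return [
        {
            "address": loc.address,
            "country_code": loc.raw.get("address", {}).get("country_code", ""),
            "lat": loc.latitude,
            "lon": loc.longitude,
        }
        for loc in results
    ]


# Offline illustration with the two readings of "Weikersdorf" mentioned above.
candidates = [
    {"address": "Vikýřovice, okres Šumperk, Olomoucký kraj, "
                "Střední Morava, 78813, Česko",
     "country_code": "cz"},
    {"address": "Weikersdorf am Steinfelde", "country_code": "at"},
]
best = pick_candidate(candidates, expected_countries={"at"})
print(best["address"] if best else "no plausible match")
```

Because the collection also contains places that now lie outside Austria, a fixed country filter would be too strict in general; the set of expected countries would have to be derived per record, and unmatched records reviewed by hand.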

Spatial/Extrinsic/Data Record
Even if data are entered correctly, some information may contain ambiguities that lead to uncertainties. In the case of places, the main sources are homographs and differences in detail among records. As an example, we can present the case of "Neumarkt", where the location is not exactly specified and the recorded information can be interpreted as seven different places. We can then only differentiate by referring to additional information, such as the colour of the paper slip, the name of the collector and/or the handwriting (for identification of the collector). Figure 18 shows two examples of paper slips for "Neumarkt".

Temporal/Intrinsic/Ontological and Epistemic
Ontologically speaking, there are many aspects of time events that are uncertain. Examples are events with debatable fiat limits (e.g., the beginning and end of World Wars), and also the difference between punctual events (e.g., the day a person has answered a question) and processes over time (e.g., when a word started to be used). Sometimes it may be the case that the event is ontologically well-defined, but there is no way to get access to the information about it (epistemic uncertainty), which can lead to imprecisions in the records. Both cases can give rise to differences in detail, such as:

• Imprecise reference (e.g., "Mitte d. 15.Jh.", "1608; 1651 [?]", "vor 2. Weltkrieg", "Biedermeierzeit", "Vorkriegszeit" or "nach 1. Weltkrieg") (Figure 19).

Temporal/Extrinsic/User Input
It is often difficult to tell whether a time record is ontologically/epistemically uncertain or whether it is a case of erroneous user input. As an example of the latter, we have references to two (or more) different periods (e.g., "1560 (Kop.18. Jh.)"). Other cases can be regarded as pure naivety of the data typist about the diachronic interpretation of statements (e.g., "heute noch"; "Allerheiligen" or "vor Jahrzehnten"). These were ontologically and epistemically correct at the time they were created, but are almost useless in terms of information.

Temporal/Extrinsic/Data Conversion and Data Record
Data conversion can play a role in introducing uncertainties in time references. One of the most common cases is the difference in date formats [71], but there are also idiosyncrasies in user input that, in spite of not being errors, can cause problems in the conversion, such as the use of Roman notation for numbers, or even a mix of Arabic and Roman (e.g., "O 1670 8.VII"). These will introduce errors in the data records.
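A conversion step that is aware of such mixed notation could normalise it explicitly rather than fail or guess silently. The following is a minimal sketch under loud assumptions: it reads "8.VII" as day 8 of month VII (day before month, as is usual in German dates), it ignores the leading "O", whose meaning is not specified here, and the function name is hypothetical. It also does not address calendar questions for historical dates.

```python
import re
from datetime import date
from typing import Optional

# Roman numerals for the twelve months.
ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}

# Four-digit year, then "day.ROMANMONTH", e.g. "1670 8.VII".
MIXED_DATE = re.compile(
    r"(?P<year>\d{4})\s+(?P<day>\d{1,2})\.\s*(?P<month>[IVX]+)\b"
)


def parse_mixed_date(text: str) -> Optional[date]:
    """Parse an Arabic year with an Arabic day and a Roman month, else None."""
    match = MIXED_DATE.search(text)
    if match is None:
        return None
    month = ROMAN_MONTHS.get(match.group("month"))
    if month is None:
        return None  # e.g. "XIII" is not a month
    try:
        return date(int(match.group("year")), month, int(match.group("day")))
    except ValueError:
        return None  # impossible day for that month


print(parse_mixed_date("O 1670 8.VII"))
```

Returning `None` for anything that does not match keeps unparsed records visible for manual review instead of writing a wrong date into the converted data.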
When trying to extract information from data, we have faced the need to deal with vague records, such as "17.Jh." and "16.Jh.", the two most common references of the collection, with an example in Figure 20.
For characterizing the collection, we decided to assign the year in the middle of the century for each imprecise reference such as these, which led to some mid-century peaks in the temporal depiction of dates, as can be seen in Figure 21. Whether a random assignment would convey more information is a matter of interpretation.
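The mid-century assignment can be sketched as follows. This is an illustration only: it assumes that the n-th century spans the years (n-1)*100+1 to n*100 and that "the middle" is the year ending in 50; the exact convention behind Figure 21 is not stated, and the names used are hypothetical. Returning the full interval alongside the point estimate keeps the original vagueness recoverable.

```python
import re
from typing import NamedTuple, Optional

# Matches references such as "17.Jh." or "16. Jh." (Jh. = Jahrhundert).
CENTURY = re.compile(r"(\d{1,2})\.\s*Jh\.")


class YearEstimate(NamedTuple):
    year: int      # point estimate used for plotting
    earliest: int  # first year of the century
    latest: int    # last year of the century


def century_midpoint(text: str) -> Optional[YearEstimate]:
    """Map a century reference to a mid-century year, or None if absent."""
    match = CENTURY.search(text)
    if match is None:
        return None
    century = int(match.group(1))
    start = (century - 1) * 100 + 1
    end = century * 100
    return YearEstimate(year=start + 49, earliest=start, latest=end)


for reference in ["17.Jh.", "16.Jh.", "heute noch"]:
    print(reference, "->", century_midpoint(reference))
```

Records that combine a year with a century, such as "1560 (Kop.18. Jh.)", would need separate handling, since this sketch only looks at the century part.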

Linguistic/Intrinsic/Ontological
Ontological uncertainties in the linguistic aspects are sometimes hard to tell apart from epistemic ones, but the phenomenon of dialects ceasing to be used is undoubtedly an ontological problem, as a dialect exists as soon as there are at least two individuals using it. It brings uncertainties when the typists or researchers cannot distinguish them from other common errors and typos. They can also be interpreted as different words (homographs or semi-homographs) with different meanings. Figure 22 shows an example of a paper slip that was rejected for the dictionary because of the inherent uncertainty in the interpretation. The meaning is clear, but the word/pronunciation is unusual, so one cannot reliably deduce a lemma (Espetze?) for it or assign it to an already existing lemma. This is another argument for the importance of current research being done in endangered languages [72,73].

Linguistic/Intrinsic/Epistemic
Epistemic uncertainties are common in linguistic records, mainly because of the process of collecting speech samples and choosing the most adequate spellings for the words unknown to the collectors. Even for those that they know, they may lack standard orthographies for certain vocables.

3.2.11. Linguistic/Extrinsic/User Input
Besides the errors in inputting information illustrated in Figure 17 (Section 3.2.3), another emblematic example of uncertainty caused by user input is the observance of different standards for the phonetic representation of words. For the digitization and during the transformation processes of the collection, both the International Phonetic Alphabet (IPA) [74] and (coded) Teuthonista have been used (cf. [75]). Unfortunately, the documents referring to the latter are now available only through the Wayback Machine [76]. Figure 23 shows an example of an entry in the TEI/XML database that contains both the coded Teuthonista (referred to as "tustep" notation) and the IPA format for representing the lemma "Alté". As the collectors' knowledge of the standards was heterogeneous, there are many errors in the representation of words, which have led to uncertainties.
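As an illustration of how the coexistence of two phonetic notations could be audited, the following minimal sketch lists entries that lack one of the two expected notations. It is not the procedure used for the collection: the element and attribute names (`entry`, `pron`, `notation`), the entry identifiers and the transcription strings are assumptions and placeholders, not values taken from the TEI/XML database.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like sample. Element names, attribute names, ids and the
# transcription strings are placeholders, not data from the actual database.
SAMPLE = """
<entries>
  <entry id="e1">
    <form type="lemma"><orth>Alté</orth></form>
    <pron notation="tustep">PLACEHOLDER_TUSTEP</pron>
    <pron notation="ipa">PLACEHOLDER_IPA</pron>
  </entry>
  <entry id="e2">
    <form type="lemma"><orth>Strützel</orth></form>
    <pron notation="tustep">PLACEHOLDER_TUSTEP</pron>
  </entry>
</entries>
"""

# Notations we assume every entry should provide.
EXPECTED_NOTATIONS = {"tustep", "ipa"}


def missing_notations(xml_text):
    """Map each entry id to the sorted list of expected notations it lacks."""
    root = ET.fromstring(xml_text)
    report = {}
    for entry in root.iter("entry"):
        found = {pron.get("notation") for pron in entry.iter("pron")}
        missing = EXPECTED_NOTATIONS - found
        if missing:
            report[entry.get("id")] = sorted(missing)
    return report


report = missing_notations(SAMPLE)
print(report)
```

Such a check only reveals which notation is absent; deciding whether an existing transcription follows its standard correctly still requires expert review.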

3.2.12. Linguistic/Extrinsic/Data Conversion
Perhaps the most common problem of all regarding data conversion from legacy systems is the difference in character encoding. Before the Unicode standard [78] and its UTF-8 encoding [77] were adopted as the preferred choice to represent characters, different systems used one of the many available character encodings [79], leading to undesirable special characters when the conversion is unaware of the previous format. Figure 24 shows an example of such conversion errors in the lemmas of the MySQL database. Words that display the squared character are the result of a character encoding from the legacy system that could not be interpreted in the new one. This kind of error has to be detected during the conversion phase; otherwise, it will be perpetuated and may make some data unusable.
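The encoding problem described for data conversion can be reproduced in a few lines. The sketch below is illustrative only and assumes a legacy export in a single-byte encoding; the candidate list, the function names and the sample word are our assumptions, not a description of the actual conversion pipeline.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, emitted by decoders for undecodable bytes


def decode_legacy(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Try the candidate encodings in order; return (text, encoding_used).

    The candidate list is an assumption. Note that latin-1 accepts any byte
    sequence, so it must come last and its result is only a guess.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the caller passes a list without a catch-all encoding.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def is_damaged(text):
    """Flag strings that already contain the replacement character."""
    return REPLACEMENT in text


# Simulate a lemma stored by a legacy system in a single-byte encoding.
legacy_bytes = "Strützel".encode("latin-1")

# A conversion that is unaware of the source encoding damages the umlaut.
naive = legacy_bytes.decode("utf-8", errors="replace")

# A conversion that tests candidate encodings recovers the original word.
recovered, used = decode_legacy(legacy_bytes)

print(repr(naive), is_damaged(naive))
print(recovered, used)
```

Once the replacement character has been written to the target database, the original byte is lost, which is why the check belongs in the conversion phase itself.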

3.2.13. Linguistic/Extrinsic/Data Record
Data records are the "compulsory depositaries" of all the errors in the data creation and transformation processes, and this includes ontological, epistemic, user input and data conversion errors. However, even when the creation and conversion are impeccable, uncertainties may arise from the records per se. This is the case, in linguistic terms, of homographs and polysemic words. In our collection, which was formed by harvesting multiple sources, differences in the level of detail among records also play a role in creating uncertainty. Figure 25 shows an example of different levels of detail about lemmas originating from TEI-XML entries.
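Differences in the level of detail among records can be made visible with a simple completeness measure. The following sketch is a hypothetical illustration: the field names, the threshold and the sample records are assumptions and do not reflect the actual schema of the TEI-XML entries.

```python
# Hypothetical set of fields a fully detailed record would carry.
FIELDS = ("lemma", "pos", "pronunciation", "sense", "location", "date")


def detail_level(record):
    """Return the fraction of FIELDS that hold a non-empty value."""
    filled = sum(1 for field in FIELDS if str(record.get(field) or "").strip())
    return filled / len(FIELDS)


def sparse_records(records, threshold=0.5):
    """Return the ids of records whose detail level is below the threshold."""
    return [r["id"] for r in records if detail_level(r) < threshold]


# Placeholder records; the values are not taken from the collection.
records = [
    {"id": "r1", "lemma": "Alté", "pos": "PLACEHOLDER", "pronunciation": "PLACEHOLDER",
     "sense": "PLACEHOLDER", "location": "PLACEHOLDER", "date": "PLACEHOLDER"},
    {"id": "r2", "lemma": "Alté", "sense": "  "},
]

print([(r["id"], round(detail_level(r), 2)) for r in records])
print(sparse_records(records))
```

A low score does not mean that a record is wrong, only that less can be verified about it, which is the kind of record-level uncertainty meant here.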

Discussion
Our analysis and exemplification of spatial and temporal aspects of uncertainty in the DBÖ collection have offered an abundance of insights into contributing factors and sources, also highlighting the sheer extent of heterogeneity in this legacy dataset. Although previously established taxonomies for classifying uncertainties were consulted, it was only by devising our own taxonomy that the various dimensions and categories in our data could be fully captured and the necessity of our research purpose confirmed. What has become apparent, however, is the continuous cycle of remedying uncertainties while at the same time introducing new types of them in the data transformation process, despite the availability and use of guidelines, standards and manual corrections. In particular, as far as spatial dimensions are concerned, changes in borders, in the names of places, regions and territories, and other geopolitical changes in the real world may affect not only historical but also current datasets.
We understand that much of what was illustrated in this paper is common and may happen to any collection formed over a considerable amount of time, as is the case with heritage and historical documents. To our knowledge, no comparable taxonomy exists to date for spatial or temporal uncertainties in non-standard lexical collections. As a result, our taxonomy can serve as an example for similar datasets and a starting point for related developments. With regard to similar efforts in the digital humanities field, our taxonomy adds to the pool of existing uncertainty taxonomies (e.g., ProvideDH). It draws its higher classes from wider concepts of uncertainty (i.e., intrinsic, extrinsic), but, unlike other taxonomies, devotes its specific subclasses to the fine-grained categories unique to digitised lexicographic collections (e.g., data conversion, data record, etc.).
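The layered structure of the taxonomy (dimension, higher class, specific subclass) can be expressed as a small data model. The sketch below is one possible encoding, read off the section headings of this paper; the class and attribute names are ours, and the sketch does not claim that every combination of dimension and source is populated in the collection.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    SPATIAL = "Spatial"
    TEMPORAL = "Temporal"
    LINGUISTIC = "Linguistic"


class Nature(Enum):
    INTRINSIC = "Intrinsic"
    EXTRINSIC = "Extrinsic"


class Source(Enum):
    ONTOLOGICAL = "Ontological"
    EPISTEMIC = "Epistemic"
    USER_INPUT = "User Input"
    DATA_CONVERSION = "Data Conversion"
    DATA_RECORD = "Data Record"


# Grouping of sources under the two higher classes, as in the section headings.
ALLOWED = {
    Nature.INTRINSIC: {Source.ONTOLOGICAL, Source.EPISTEMIC},
    Nature.EXTRINSIC: {Source.USER_INPUT, Source.DATA_CONVERSION, Source.DATA_RECORD},
}


@dataclass(frozen=True)
class UncertaintyClass:
    dimension: Dimension
    nature: Nature
    source: Source

    def __post_init__(self):
        # Reject combinations such as Intrinsic/Data Conversion.
        if self.source not in ALLOWED[self.nature]:
            raise ValueError(f"{self.source.value} is not a {self.nature.value} source")

    def label(self):
        """Render the class in the heading style Dimension/Nature/Source."""
        return f"{self.dimension.value}/{self.nature.value}/{self.source.value}"


example = UncertaintyClass(Dimension.LINGUISTIC, Nature.EXTRINSIC, Source.DATA_CONVERSION)
print(example.label())
```

Tagging each corrected or rejected record with such a label would make it possible to count uncertainties per class across the different databases.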
A first (and perhaps the easiest) step in remedying the types of uncertainties identified and classified would be to treat them as simple record errors. "Being certain about our uncertainties", however, is not the same as remedying them, nor does it even come close. Because the collection comprises, as has been shown, many different databases and formats, it would take time, and sometimes a huge amount of manual work, to correct each identified instance of the categories shown. Some uncertainties, such as the spatial ones, would be more easily identified or corrected if we could use a geographical gold standard for addressing ambiguities [60]. The temporal uncertainties, on the other hand, could be analyzed under explicit rules and decisions on how to deal with incomplete information, as in the case of the vague references to centuries. Linguistic uncertainties are harder to fix, as the collection itself contains many indigenous and dialect words, which makes comparison and mapping with well-defined instruments, such as the available dictionaries, rather unsatisfactory and incomplete.
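One way to make such rules for temporal uncertainty explicit is to map every temporal reference to a closed interval of years. The sketch below is a minimal example under our own assumptions: the accepted input formats, the function name and the convention that the n-th century spans the years (n-1)·100+1 to n·100 are illustrative choices, not rules adopted for the collection.

```python
import re

# Accepts e.g. "19. Jh.", "19. Jahrhundert" or "19th century" (assumed formats).
CENTURY = re.compile(
    r"^\s*(\d{1,2})(?:\.|st|nd|rd|th)?\s*(?:Jh\.?|Jahrhundert|century)\s*$",
    re.IGNORECASE,
)
YEAR = re.compile(r"^\s*(\d{4})\s*$")


def to_interval(reference):
    """Return (first_year, last_year) for a temporal reference, or None.

    An exact year yields a degenerate interval; a century yields a
    100-year interval; anything else is left for manual review.
    """
    match = YEAR.match(reference)
    if match:
        year = int(match.group(1))
        return (year, year)
    match = CENTURY.match(reference)
    if match:
        century = int(match.group(1))
        return ((century - 1) * 100 + 1, century * 100)
    return None


for ref in ("1923", "19. Jh.", "19th century", "ca. 1900?"):
    print(ref, "->", to_interval(ref))
```

Keeping the interval rather than a single guessed year preserves the uncertainty in the data instead of hiding it, and unparsed references remain visible as `None`.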

Conclusions
One lesson we have taken from this process concerns the precautions and best practices that may prevent (or at least diminish) such cases in future collections and databases, which are constantly being built for cultural and linguistic purposes under the digital humanities umbrella. Although many processes of data gathering, inputting and conversion are inherently ad hoc, the possible abstractions and generalizations may serve as a warning for the prospective tasks of maintaining huge textual, image and multimedia collections, such as those currently being developed. The majority of computer database collections have been formed in the last few decades, and cases of collections formed over long periods (in our case, a whole century) are key to understanding the long-term consequences of each and every decision regarding data maintenance. One could say that "creating uncertainty is one of our main certainties", but keeping it at its lowest acceptable level is a goal inseparable from the work of data humanists.
DH researchers should distinguish between cases in which uncertainty is a source of insight, allowing a better representation of the complexities of reality in the scope of post-normal science [80], and cases in which uncertainty is solely a by-product of bad decisions and/or a lack of standard procedures for dealing with data sources, processes and transformations. We see our intrinsic dimensions as somewhat related to the former, and our extrinsic dimensions as definitely related to the latter. Interdisciplinary teams that bring together different backgrounds are also key to fulfilling tasks and developing the skills expected in this fascinating data-driven landscape that is being unveiled for humanities research.

Figure 1. The collaborative visual annotation process supporting decision-making in uncertainty-aware contexts. Figure redrawn from [26].

Figure 6. A taxonomy of uncertainty in spatial planning. Figure redrawn from [37].

Figure 8. Example of the nested location codes in an entry of the XML files. Source: the authors.

Figure 9. Timeline of the data transformation process relative to the beginning of the exploreAT! project. Image: Amelie Dorn, Eveline Wandl-Vogt, 2018.

Figure 10. Overview of the data transformation process. Source: the authors.

Figure 11. Examples of the data conversion process on the word "Strützel". Source: the authors.

Figure 12. Excerpt of the MySQL database schema. Source: the authors.

Figure 15. Part of the heatmap of the spatial concentration of collected concepts. Source: the authors.

Figure 16. Examples of spatial/intrinsic/ontological uncertainties on the original paper slips: (a) Grafenried (= Lučina) in the Bohemian Forest has completely vanished as a geographical name or political region; (b) "Oberplan" (= Horní Planá) is an example of a region that used to belong to the Austrian empire, but is now in the Czech Republic. Source: the authors.

Figure 17. A chain of errors in user input: (a) Original record with the correct location "Unterinn"; (b) Location was changed to "Matrei" in TUSTEP; (c) TEI/XML file reproducing the error introduced previously. Source: the authors.

Figure 18. Two examples of paper slips referring to different homograph places: (a) Neumarkt (referring to a location in South Tyrol, now Alto Adige, Italy); (b) Neumarkt (referring to Neumark (Czech: Všeruby) in the Bohemian Forest). Source: the authors.

Figure 21. Temporal distribution of the collection records. Source: the authors.

Figure 22. An example of an ontological uncertainty. Source: the authors.

Figure 24. Errors from conversions of different character encodings in the MySQL database. Source: the authors.

Figure 25. Differences in details about lemmas from the TEI-XML records. Source: the authors.

Table 1. Numerical overview of selected linguistic, spatial and temporal parameters of the collection. Source: the authors.

Table 2. Classes and sources of spatial, temporal and linguistic dimensions of uncertainties.