One Step Is Not Enough: A Multi-Step Procedure for Building the Training Set of a Query by String Keyword Spotting System to Assist the Transcription of Historical Documents

Digital libraries offer access to a large number of handwritten historical documents. These documents are available as raw images, so their content is not searchable. A fully manual transcription is time-consuming and expensive, while a fully automatic transcription is cheaper but not comparable in terms of accuracy. The performance of automatic transcription systems is strictly related to the composition of the training set. We propose a multi-step procedure that exploits a keyword spotting system and human validation to build a training set in a shorter time than a fully manual procedure requires. The multi-step procedure was tested on a data set made up of 50 pages extracted from the Bentham collection. The palaeographer who transcribed the data set with the multi-step procedure instead of the fully manual procedure had a time gain of 52.54%. Moreover, a small-size training set that allowed the keyword spotting system to reach a precision value greater than its recall value was built with the multi-step procedure in a time equal to 35.25% of the time required for annotating the whole data set.


Introduction
In the last decade, significant investments were made in the digital transformation of cultural heritage material. Online digital libraries store and share a huge number of historical books and manuscripts that were scanned to ensure their preservation over the centuries. These digital collections are not searchable because their documents are digital images. Therefore, these images need to be transcribed in order to allow the indexing and querying of the digital libraries.
A fully manual transcription cannot be the solution because it is a time-consuming and expensive process. In fact, a large number of manuscripts need to be digitized, and the difficulty of reading documents written with a lexicon different from the one used nowadays imposes the involvement of highly qualified experts in the transcription process.
On the other hand, a fully automatic transcription is cheaper but not comparable in terms of transcription accuracy. The state-of-the-art technologies for automatic transcription [1][2][3][4] can be grouped into two families: recognition-based and recognition-free approaches.
Based on these considerations, we propose a procedure that exploits a KWS system as a tool for assisting palaeographers in building up a training set that fulfills the recommendations of the performance model in [33]. This procedure involves, at the bootstrap, the manual transcription of a small subset of pages to build an initial training set and, consequently, the definition of an initial keyword list. Then, the training set and the keyword list are updated at each step until the precision of the KWS system exceeds its recall. For this purpose, the documents without transcription are divided into batches that are processed by the KWS system. The palaeographers transcribe the documents in a batch by searching for all the terms in the keyword list and validating the outputs of the KWS system.
The adoption of this procedure reduces the time required for building up the training set, because the samples that are correctly retrieved by the system are transcribed in a shorter time than manual transcription requires. Moreover, this procedure tracks the values of precision and recall of the system, providing a mechanism for evaluating the goodness of the training set.
The experimental results presented in this paper confirm that the iterative construction of the training set is executed in a time that is significantly lower than the time required for manually transcribing the same data collection.
The remainder of the paper is organized as follows: Section 2 describes the KWS system, the multi-step procedure and the data set adopted for the experimentation; Section 3 compares the human effort required for building a training set with a manual and an interactive multi-step procedure; Section 4 concludes the paper, discussing the results and highlighting future research steps.

Materials and Methods
Different tools are available for carrying out manuscript transcription, for example Aletheia [34], a ground truthing tool, and Transkribus [35], a platform for the digitization, transcription, recognition and searching of historical documents. Most of the tools adopt an architecture like the one shown in Figure 1: a collection of documents, the data set DS, is manually transcribed and the annotated word images are included in the training set. Platforms such as Transkribus use previously trained HTR or KWS systems for annotating new documents and allow users to validate the transcriptions at the end of the automatic process.
In this paper, document transcription is carried out with a system that stands out from the others for being based on a multi-step procedure that interleaves a query-by-string retrieval system and human validation [19], as shown in Figure 2. In particular, the system has been designed to pursue two goals: one is to reduce the human time effort for building a TS to be used by any HTR or KWS system; the other is to build up a small-size training set, from here on called the reference set (RS), used by the KWS system we adopted for the assisted transcription of the DS. The RS is built by taking care of including samples that are instances of a variety of keywords and by aiming for a precision value greater than the recall for each keyword. Unlike the RS, a TS has a bigger size and its samples are collected without any selection criteria.
In the next subsections we briefly summarize the architecture of the proposed system and how it has been used in the experimentation.

Figure 1. Architecture adopted by many tools for building a training set. Ground truthing tools offer different functionalities for manually segmenting and annotating the words contained in the documents belonging to the data set. Usually, human beings annotate all the words contained in the documents without any selection criteria.

Figure 2. Architecture of a system for building up a training set through a multi-step procedure that interleaves a query-by-string retrieval system and human validation. The data set is divided into batches. Batch 0 is manually transcribed, while the others are processed by the keyword spotting (KWS) system and the human validation stage. The keyword list contains the transcriptions of the word images included in the reference set, with no repetitions. The keyword list is used for querying the KWS system and spotting word images in the batch under analysis. At each step the reference set is updated with the word images of the batch whose transcription is not yet contained in the keyword list. Thus, at the end of the multi-step procedure, the data set is transcribed and the training set is created.

Query-by-String Retrieval System
The QbS system used during the experimentation is a segmentation-based keyword spotting system [19] that adopts the algorithm in [36] for extracting word images from any processed document.
Each word is binarized adopting the Otsu method [37] and represented by its skeleton. The trajectory executed by the subject for writing the word is recovered by transforming the word's skeleton in a graph that is traversed following criteria derived by handwriting generation [38]. Eventually, each trajectory is segmented in elementary movements named strokes [39].
When a transcription is available for a word image, as in the case of the samples in RS, each stroke is labeled with the ASCII code of the character it belongs to [40]. Figure 3 shows how a word image is processed by the system.

Figure 3. Each word image extracted from a document is processed by the query-by-string (QbS) retrieval system through the following steps: binarization [37], trajectory recovery [38], stroke segmentation [39] and, if a transcription is available for the word image, stroke labeling [40].
When a textual query is executed, documents to be transcribed are scanned looking for word images that are instances of the keyword. The trajectory of a word image extracted from one of these documents is compared with all the trajectories stored in RS looking for sequences of strokes with similar shapes [41]. When two similar sequences of strokes are found, the transcription associated to the matching strokes belonging to the trajectory in RS is assigned to the matching strokes belonging to the trajectory in the document to be transcribed. Because of handwriting variability, different transcriptions could be assigned to the same sequence of strokes.
The ranked list of all the possible interpretations for a word image is obtained by traversing a graph in which the nodes represent the transcriptions associated with the strokes that matched during the comparison with the trajectories in RS [42].
When a subject queries for a keyword, the QbS system outputs all the word images in DS whose ranked list of interpretations includes the desired keyword.

Multi-Step Procedure for Reference Set Construction
The RS is incrementally built by interleaving keyword spotting and human validation. The QbS keyword spotting system described in the previous section is followed by a human validation stage, as shown in Figure 2, with the aim of implementing an interactive multi-step procedure that speeds up the transcription of the DS.
The first step of the procedure involves splitting the DS into batches and manually transcribing one of them in order to build the bootstrap RS. The unique transcriptions of the word images in RS are copied into the keyword list that will be used for querying the system and spotting words in a new batch.
After the bootstrap, word spotting and human validation are alternated to incrementally update the RS until the precision of the system exceeds its recall. Once that condition is reached, the RS is no longer updated and the documents that are not yet transcribed are processed in a final step. If the precision never exceeds the recall, the RS keeps being updated until the last document is transcribed. In either case, the DS is fully transcribed and the TS is created: at the end of each step, all the words included in a batch are transcribed and added to the TS.
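The control flow described above can be sketched as follows. The `spot` and `validate` callables are hypothetical stand-ins for the KWS system and the human validation stage (their actual implementations are described in the cited works), and word images are modelled as dictionaries with a `label` field:

```python
def multi_step_transcription(batches, spot, validate):
    """Sketch of the multi-step procedure.

    batches[0] is manually transcribed to bootstrap the reference set.
    `spot(keywords, batch, reference_set)` returns the retrieved word images;
    `validate(retrieved, batch)` returns (labelled_words, precision, recall).
    """
    reference_set = list(batches[0])                       # bootstrap RS
    keyword_list = sorted({w["label"] for w in batches[0]})
    training_set = list(batches[0])

    for batch in batches[1:]:
        retrieved = spot(keyword_list, batch, reference_set)
        labelled, precision, recall = validate(retrieved, batch)
        training_set.extend(labelled)                      # TS grows every step
        if precision <= recall:
            # keep refining: add annotated OOV words to RS and keyword list
            oov = [w for w in labelled if w["label"] not in keyword_list]
            reference_set.extend(oov)
            keyword_list.extend(sorted({w["label"] for w in oov}))
        # once precision exceeds recall, RS and keyword list stay frozen
    return reference_set, keyword_list, training_set
```

The design choice worth noting is that the TS collects every labelled word regardless of the stopping condition, while the RS only absorbs OOV words and stops growing as soon as precision exceeds recall.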
For each entry in the keyword list, the QbS system retrieves all the word images that contain the desired keyword in their ranked list of interpretations. Depending on the performance of the system, the retrieved images can be instances of the desired keyword, instances of other keywords or even instances of terms not included in the keyword list. Word images that are instances of terms not included in the keyword list of the current step are named Out-Of-Vocabulary (OOV) words. Moreover, because the KWS system has a recall lower than 1, word images that are instances of entries of the keyword list may never be retrieved by the system, even after many steps of keyword spotting and validation. These word images are named missed words.
The GUI shown in Figure 4 allows human beings to validate the output of the KWS system by providing two functionalities:

• To confirm, with a right click of the mouse, the retrieved images that are instances of the query;
• To label a word image by typing its transcription in a text box. To speed up the typing, the text box works in auto-complete mode by suggesting possible transcriptions taken from the keyword list.
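The auto-complete behaviour can be illustrated with a minimal sketch; the function name, the case-insensitive matching and the alphabetical ranking of the suggestions are our own assumptions, not details taken from the GUI:

```python
def suggest(prefix, keyword_list, limit=5):
    """Return up to `limit` keyword-list entries starting with `prefix`.

    Matching is case-insensitive; suggestions are sorted alphabetically.
    """
    p = prefix.lower()
    return sorted(k for k in keyword_list if k.lower().startswith(p))[:limit]

# Typing the first characters narrows the candidates drawn from the keyword list:
# suggest("tr", ["transcription", "training", "trouble", "data"])
# -> ["training", "transcription", "trouble"]
```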
Human validation has the effect of updating the RS and the keyword list with the annotated OOV words. The RS and the keyword list are updated only at the end of a step of the procedure, i.e., when all the entries of the keyword list have been searched in the batch and validated by the human being. The updated RS and keyword list are used for spotting new word images in a new batch of documents.

Figure 4. The keyword spotted by the system is shown in the text box placed in the top left corner. The button "Search" starts the word spotting. The button "Select Data" allows the user to select the data set DS. The slider regulates the dimension of the image. The text box located on the bottom is used for typing the transcription of the word image. The text box supports the auto-completion mode.

Data Set
The experimentation was carried out on handwritten documents extracted from the Bentham collection, a corpus of documents written by the English philosopher Jeremy Bentham (1748-1832) and his secretaries over a period of sixty years [43]. These handwritten documents have been used in competitions on keyword spotting systems [2,4] and handwritten recognition systems [3].
In particular, the data collection used in the experimentation includes 50 pages that are split into 10 batches of 5 pages. One batch is manually transcribed in order to create the RS and the keyword list that will be used during the first step of the KWS system. Batches are comparable in terms of word images and unique words. Table 1 shows the pages assigned to each batch and the number of word images per batch. The bootstrap keyword list contains 354 entries corresponding to the unique words of the bootstrap batch.

Characterization of Human Effort in Transcription
One palaeographer was involved in the experimentation. We asked her to manually transcribe the 5 documents included in the bootstrap batch and to exploit the multi-step procedure and the GUI described in the previous section for transcribing the 45 documents included in the other batches.
The time spent by the palaeographer to manually transcribe the 1089 images included in the bootstrap batch was equal to 10,127.7 s, and a single word was transcribed in a mean time T_word equal to 9.3 s.
During the multi-step procedure we recorded the activities executed by the palaeographer for validating and correcting the output of the KWS system. The mean time T_val required for validating a correctly retrieved image with a simple mouse click was equal to 1 s. When the system retrieved an image that was an instance of another entry of the keyword list, the palaeographer had to correct the transcription by typing the correct label. Thanks to the auto-complete mode, the palaeographer had to write only the first characters of the actual transcription and the system automatically completed it. Therefore, the mean time T_err required for correcting the transcription of a word that is an instance of a keyword was equal to 5 s. When the system retrieved an OOV word, the auto-complete mode did not speed up the manual transcription and the mean time T_OOV was the same as T_word.
Finally, the GUI shows all the word images whose transcription is empty because they were not retrieved by the system. The missed words, which are images that are instances of the keywords but are without a transcription because the recall is lower than 1, were annotated in a mean time T_miss equal to T_err thanks to the auto-complete mode. Table 2 reports the means and standard deviations of the times for annotating the word images during the experimentation.

Table 2. Means and standard deviations of the times measured during the word transcription.

Results
The aim of the experimentation is to evaluate how well the multi-step procedure builds up a training set to be used by a KWS system for document transcription.
As described in the previous section, the multi-step procedure involves, at each step, the word spotting of all the entries in the keyword list and the validation or transcription performed by a human being. N_val(step), N_err(step), N_miss(step) and N_OOV(step) are the correct, wrong, missed and OOV words processed by the system at the end of each step, respectively. N_batch(step) is the number of word images processed at each step. Finally, KW_OOV(step) is the number of unique transcriptions of the OOV word images at each step.
At the bootstrap step (step 0), it is required that the human being manually transcribes all the words in the batch. The word images annotated at step 0 are used for creating the RS, which will be updated during the following steps, and their transcriptions, taken once if many words have the same transcription, populate the keyword list.
At each step, each item of the keyword list is used as a query for the KWS system. OOV words and their labels are used for updating the RS and the keyword list that will be used at the next step. N_RS(step) and N_KL(step) are the size of the reference set RS and the number of terms in the keyword list at the beginning of each step, and they are defined as in Equations (1) and (2), respectively. It is worth noting that at the bootstrap step the keyword list is empty and all the manually transcribed words are considered OOV words.
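As we read Equations (1) and (2) (not reproduced in this excerpt), both quantities accumulate the OOV counts of all the previous steps, with every bootstrap word counting as OOV. The sketch below reproduces the sizes discussed in Section 3.2; the value KW_OOV(1) = 171 is our inference from the reported keyword-list size of 525 at step 2:

```python
def n_rs(step, n_oov):
    """Size of the reference set at the beginning of `step`:
    the sum of the OOV words annotated at all previous steps."""
    return sum(n_oov[s] for s in range(step))

def n_kl(step, kw_oov):
    """Number of keyword-list entries at the beginning of `step`:
    the sum of the unique OOV transcriptions of all previous steps."""
    return sum(kw_oov[s] for s in range(step))

# Figures from the experimentation: 1089 bootstrap words (354 unique),
# 243 OOV words at step 1 (171 unique, inferred).
n_oov = [1089, 243]
kw_oov = [354, 171]
rs_at_step_2 = n_rs(2, n_oov)    # 1089 + 243 = 1332 word images
kl_at_step_2 = n_kl(2, kw_oov)   # 354 + 171 = 525 entries
```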
The metrics adopted for evaluating the procedure are defined in Section 3.1, while Section 3.2 reports the comparison between the multi-step procedure and a fully manual transcription.

Metrics
The multi-step procedure is evaluated in terms of the time saved with respect to the manual transcription of the data set and of the automatic transcription rate. The procedure is compared with a baseline system that allows the manual transcription of the data set. Although the baseline system does not involve a multi-step procedure, Equation (3) computes the time spent for a fully manual transcription of DS as if it were executed in more than one step. This formulation allows the comparison between the baseline system and the KWS system.
Equation (4) defines the time spent by a human being for validating with a mouse click (T_clk) the correct word images retrieved by the system, while Equation (5) defines the time spent for labeling (T_lab) wrong, missed and OOV words at each step of the multi-step procedure.
T_lab(step) = T_err * N_err(step) + T_OOV * N_OOV(step) + T_miss * N_miss(step) (5)

The human time effort for building up the RS is computed as in Equation (6). At the bootstrap step, the manual transcription is required for setting up the starting training set and keyword list.
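Equations (4) and (5) can be written out directly with the mean times of Table 2 (T_val = 1 s, T_err = T_miss = 5 s, T_OOV = T_word = 9.3 s):

```python
# Mean annotation times measured in the experimentation (seconds).
T_VAL, T_ERR, T_OOV, T_MISS = 1.0, 5.0, 9.3, 5.0

def t_clk(n_val):
    """Equation (4): time spent confirming correct retrievals with a mouse click."""
    return T_VAL * n_val

def t_lab(n_err, n_oov, n_miss):
    """Equation (5): time spent typing the transcription of wrong, OOV
    and missed words (OOV words get no auto-complete speed-up)."""
    return T_ERR * n_err + T_OOV * n_oov + T_MISS * n_miss
```

For example, a step with 2 wrong, 1 OOV and 3 missed words costs t_lab(2, 1, 3) = 5*2 + 9.3 + 5*3 = 34.3 s of typing.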
As suggested in [44], we introduce Equation (7) for measuring the time gained with the multi-step procedure with respect to the baseline system. Gain(step) could vary between 0% and 100% and it is strongly related to the values of Precision and Recall, defined in Equations (8) and (9), respectively.
Precision(step) = (N_val(step) / (N_batch(step) − N_miss(step))) * 100, for step > 0 (8)

Recall(step) = (N_val(step) / (N_val(step) + N_miss(step))) * 100, for step > 0 (9)

Finally, we introduce two other metrics for evaluating the system: the reference set updating rate (R_new(step)) and the automatic transcription rate (R_auto(step)), defined by Equations (10) and (11), respectively. R_new(step) measures the percentage of manual transcriptions that contribute to the update of the reference set. It corresponds to the percentage of OOV words with respect to all the images that are manually transcribed up to the current step. R_auto(step) measures the percentage of word images that are correctly transcribed by the KWS system with respect to the images that could be automatically transcribed up to the current step. If a KWS system that never fails were available, both metrics would be equal to 100: missed and wrong word images would be absent and the human being would manually transcribe only the OOV words, which are the words used for updating the RS.
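Equations (8) and (9), as reconstructed here, can be sketched as follows: precision is computed over the images the system actually retrieved (N_batch − N_miss), while recall treats the missed words as the false negatives:

```python
def precision(n_val, n_batch, n_miss):
    """Equation (8): fraction of retrieved images that were correctly
    labelled, as a percentage. Retrieved images = N_batch - N_miss."""
    return 100.0 * n_val / (n_batch - n_miss)

def recall(n_val, n_miss):
    """Equation (9): fraction of keyword instances that the system actually
    retrieved, as a percentage. Missed words are the false negatives."""
    return 100.0 * n_val / (n_val + n_miss)
```

With these definitions, the stopping condition of the multi-step procedure is simply `precision(...) > recall(...)` evaluated on the counts of the current step.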

Multi-Step Procedure vs. Manual Procedure
Fifty handwritten pages were selected as DS to be transcribed for building up a training set. These pages are split into batches, as reported in Table 1.
The multi-step procedure adopts a KWS system for building up an RS step by step. Table 3 shows the number of words that are retrieved by the system, the number of words that are missed and how the size of the training set and of the keyword list vary at each step. Table 4 reports the performance of the system in terms of precision, recall, R_new(step) and R_auto(step) at each step. Finally, Table 5 reports the performance of the system in terms of transcription time.
It is worth noting that the performance of the system at the i-th step is obtained with the RS rebuilt at the end of the (i-1)-th step. For example, the performance at step 2 is obtained on a batch of 1204 word images with an RS of 1332 word images and a keyword list of 525 entries. The RS used at step 2 is made up of the 1089 word images manually transcribed at step 0 and the 243 OOV words transcribed at step 1. The time spent by a human being that uses the GUI described before for building up the RS used at step 2 is equal to 14,916.6 s.
The multi-step procedure makes it possible to compute the values of precision and recall that the KWS system obtains on a batch of documents with the RS built step by step. The best RS configuration is the one that allows the KWS system to obtain a precision value greater than the recall. Our system reaches that condition at step 7, with a recall equal to 59.62% and a precision equal to 63.33%, as shown in Table 4. At the same step, the KWS system reaches the highest value of the automatic transcription rate (57.18%) and the human time effort up to step 6 is equal to 35,223.3 s. These values are obtained on a batch of 1052 word images with an RS made up of 2281 word images and a keyword list of 1279 entries. From step 1 to step 6, 33.08% of the words manually annotated by the palaeographer were used for updating the RS.
Once the precision exceeds the recall, the multi-step procedure ends with a last step on a bigger batch that, in our case, is made up of batch 8 and batch 9 in Table 1. The results show that it is not advantageous to rebuild the RS at the end of step 7 because there is a significant reduction of the precision value due to an increase of N_err.
As regards the baseline system, the time required to manually transcribe all the words in the 50 pages is 99,919.2 s and the time spent up to batch 6 is 68,922.3 s, as shown in Table 5.
Therefore, by using the multi-step procedure, the palaeographer gained 48.89% of the time with respect to the fully manual procedure for building the RS up to step 6 and she gained 52.54% of the time for annotating the whole DS.
The last two columns in Table 5 make it possible to compare the manual and the multi-step procedures in terms of the fraction of time spent in transcribing all the batches with respect to the time required by the fully manual procedure. The multi-step procedure reaches the desired condition of precision greater than recall in a time that is only 35.25% of the total time spent with the manual procedure.
Finally, we notice that the system shows an automatic transcription rate slightly greater than 50% starting from step 3, when the KWS system is equipped with the RS built at the end of step 2.

Table 3. Word images processed by the KWS system step by step. Note that at step 8, batch 8 and batch 9 have been merged into one single batch.

Table 5. Transcription time with the baseline system and the multi-step procedure, and time gain, step by step (columns: step, T_clk, T_lab, T_man, T_hte, Gain, T_man(step)/T_man(8), T_hte(step)/T_man(8)).

Statistical Analysis
The metrics reported in Table 5, for example the time spent in document transcription with the manual and the multi-step procedures, depend on the mean time T_word measured when the palaeographer manually transcribed the five pages included in the bootstrap batch.
The value of T_word is computed at the end of the bootstrap step, and it can be used for computing T_man(step) and T_hte(step) at the following steps only if the word length distribution of the batches transcribed with the multi-step procedure is equal to the word length distribution of the bootstrap batch. In fact, a variation in the word length distribution would affect the mean word transcription time, because the transcription time depends on the word length, i.e., the number of characters in the word: the longer the word, the longer the time required for reading and typing it. Figure 5 shows the word length distribution computed over the bootstrap batch and over the union of the other nine batches.
A statistical test was performed in order to verify that word length is equally distributed between the two samples of words and to validate the comparison between the manual and the multi-step procedure.
We tested the null hypothesis H_0 that both samples have been drawn from the same population with the Epps and Singleton test [45], as implemented in SciPy v. 1.5.2. This test does not assume that the samples are drawn from a continuous distribution and is therefore suitable for our case. The Epps and Singleton test returned a statistic value equal to 0.1131 and a p-value, which gives the probability of falsely rejecting H_0, equal to 0.998.
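The test can be reproduced with SciPy's implementation; the two samples below are synthetic stand-ins for the real word-length samples (not the actual data), drawn from the same distribution so that, as in our experiment, H_0 should not be rejected:

```python
import numpy as np
from scipy.stats import epps_singleton_2samp

# Synthetic word-length samples: lengths are discrete positive integers,
# so a test that does not assume a continuous distribution is needed.
rng = np.random.default_rng(0)
bootstrap_lengths = rng.poisson(5, size=1089) + 1  # stand-in for batch 0
other_lengths = rng.poisson(5, size=9655) + 1      # stand-in for batches 1-9

statistic, p_value = epps_singleton_2samp(bootstrap_lengths, other_lengths)
# A large p-value means H_0 (same word-length distribution) is not rejected,
# which is the condition that validates reusing T_word across batches.
```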
This result confirms that word length is equally distributed between the bootstrap batch and the union of the other nine batches and therefore the values of T man (step) and T hte (step) are valid and can be compared.

Comparison with the State of the Art
HTR and KWS systems have been adopted in different frameworks for transcribing handwritten documents. Human beings and automatic systems work jointly to speed up the manual transcription process and drastically reduce the cost of training data creation. The performance of interactive systems for assisted transcription is usually measured in terms of the number of actions executed by the transcriber, for example keystrokes, in order to correct the automatic transcription. We measured the performance of the system in terms of transcription time because it is a direct measure of the human effort and its cost.
Table 6 lists the papers that adopt the transcription time as a performance measure. It is worth noting that Table 6 does not provide a ranking of the systems for the computer-assisted transcription of historical documents. We cannot fairly compare the systems in Table 6 because the expertise of the transcribers and the legibility of the collection adopted in each experimentation influence the transcription time. Moreover, the papers reported mean transcription times using different units (seconds per word, seconds per line, seconds per group of lines, etc.), and the times were converted by us into seconds per word adopting the approximation reported in the footnotes of Table 6. Therefore, Table 6 provides a rough indication of the transcription time spent by the palaeographers that adopted one of the listed systems.
The experimental studies presented in [46,47] were performed with the same HTR system, which is based on Hidden Markov models and N-gram models, on two different document collections. The HTR system proposed a full transcript of a given text line image and, every time the user amended a wrong word, the following ones were updated by the system. In [46] the user was an expert palaeographer, while in [47] students in History were involved.
In both papers, the authors noticed a significant typing effort reduction that did not result in a net user time effort saving. This counterintuitive finding was explained by taking into account the additional amount of time the user needed to read and understand each system prediction, which might change after each interaction step.
The study in [48] was conducted by using a QbE system coupled with a relevance feedback mechanism that introduced the human being into the retrieval loop. The transcription time was measured on 50 pages extracted from the Bentham collection. The authors measured a transcription time equal to 9.55 s per word when the documents were automatically segmented into words but manually transcribed, while the time was equal to 4.21 s per word when the interactive system was adopted. Overall, their system yielded a time gain of 55.9% with respect to the fully manual transcription.
The transcription times measured in [48] are in line with the ones reported in this paper. The difference between the manual transcription times (9.3 s instead of 9.55 s per word) is negligible, taking into account that the users involved in the two studies are different. The transcription time we measured with the interactive system is slightly greater than the one reported in [48] (4.41 s instead of 4.21 s per word) if we take into account the time spent for the manual transcription of the bootstrap batch, while it is lower (3.86 s instead of 4.21 s per word) if we consider only the steps that exploit our KWS system. In this regard, it is worth noting that the tool based on the QbE system does not require a training step, while the transcription times reported in the other two papers [46,47] do not take into account the time spent for training the HTR system.
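The per-word figures quoted above can be checked against the totals reported in Section 3.2. The total word count is derived here from the fully manual transcription time and T_word, so it is an approximation rather than a figure taken from the paper's tables:

```python
# Totals reported in the Results section (times in seconds).
t_manual_total = 99_919.2        # fully manual transcription of the 50 pages
t_bootstrap = 10_127.7           # manual transcription of the bootstrap batch
words_bootstrap = 1089           # word images in the bootstrap batch
t_word = 9.3                     # mean manual transcription time per word

words_total = round(t_manual_total / t_word)         # ~10,744 word images
t_multistep_total = t_manual_total * (1 - 0.5254)    # 52.54% overall time gain

# Mean interactive time per word, including the manually transcribed bootstrap.
per_word_with_bootstrap = t_multistep_total / words_total
# Mean time per word over the KWS-assisted steps only.
per_word_kws_only = (t_multistep_total - t_bootstrap) / (words_total - words_bootstrap)
```

Both quotients land on the values quoted in the comparison (about 4.41 s and 3.86 s per word), which is a useful sanity check on the reported gain.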
If we compare the four tools without taking into account the time spent for building up the training set, we notice that their transcription times are similar. If we compare the systems in terms of time saved with respect to the manual transcription, our system is the one that obtains the biggest time gain.

Discussion and Conclusions
QbS systems are adopted as tools for assisting human beings in the transcription of historical documents. These systems are beneficial for transcribing documents when they are equipped with a training set in which the samples are instances of as many keywords as possible in order to reduce the occurrence of OOV words [44]. Moreover, these systems are efficient in the automatic transcription of a data collection when their precision value is greater than their recall value.
In this paper, we have compared two procedures for transcribing a collection of handwritten historical documents: the baseline and the multi-step procedure.
The baseline procedure involves the manual transcription of the whole DS and therefore it is a time-consuming and expensive process. Moreover, this procedure does not allow applying any selection criteria during the construction of the training set: all the words in DS are labeled and included in the TS.
The multi-step procedure is based on a loop between a KWS system and a human validation stage. The documents available for the construction of the training set are split into batches that are processed one after the other. This procedure takes advantage of a GUI that reduces the transcription time for the words that are correctly retrieved by the KWS system or that are wrongly retrieved but are instances of another entry of the keyword list. This procedure allows building up two sets of annotated word images: the training set and the reference set. The training set is exactly the same set of annotated word images that is built with the manual procedure, but it is obtained faster. The reference set is a smaller set of annotated word images that is built for training the QbS system adopted in our transcription tool. The images included in the RS are selected with the aim of training the QbS system so that it shows a precision rate greater than the recall rate. If that condition is verified, the performance model in [33] guarantees that it is advantageous to use the QbS system for the transcription of new documents.
We tested the multi-step procedure on 50 pages extracted from the Bentham collection, and during the transcription the condition of a precision value greater than the recall value was reached. The system showed an increase in the automatic transcription rate at each step of the procedure, up to the value of 57.18%. The key idea of retraining the QbS system only with the OOV words found at each step, together with the short time required for validating a correctly retrieved word image, combine to reduce the human time effort, which is the main cost item in the transcription of manuscripts. The palaeographer had a time gain of 48.89% for building up the RS and a time gain of 52.54% for transcribing the whole DS by adopting the multi-step procedure instead of the baseline procedure.
If the QbS system presented in Section 2 is used both for building a training set and for the assisted transcription of a collection of documents, the multi-step procedure allows, at the same time, selecting and transcribing the most useful word images to be used for training the system.
The best size for a batch is an open question. The procedure is less advantageous if the size of the batches is too big, because the number of word images in the bootstrap batch and the number of OOV words that need to be manually transcribed both increase. On the other hand, if the size of the batches is too small, the number of missed words could increase and the RS and the keyword list are not significantly updated at each step. The experimentation presented here suggests that the numbers of OOV and missed words per step are markers of how well the multi-step procedure is working. In our experiment, the average sum of OOV and missed words per step is around 29% of the batch size.
Future work will regard the improvement of the KWS system, in order to reduce the number of wrong words retrieved by the system as well as the number of missed words. We will also investigate new methods for discovering more OOV words at each step. These aspects are of paramount importance for increasing the time gained in the transcription and the automatic transcription rate.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

KWS  keyword spotting system
QbS  query-by-string
QbE  query-by-example
OOV  out-of-vocabulary
TS   training set
RS   reference set
DS   data set
GUI  graphical user interface