Automatic Estimation of Age Distributions from the First Ottoman Empire Population Register Series by Using Deep Learning

Abstract: Recently, an increasing number of studies have applied deep learning algorithms to extract information from handwritten historical documents. To accomplish this, documents must be divided into smaller parts. Page and line segmentation are vital stages in Handwritten Text Recognition (HTR) systems; they directly affect the character segmentation stage, which in turn determines recognition success. In this study, we first applied deep learning-based layout analysis techniques to detect individuals in the first Ottoman population register series, collected between the 1840s and the 1860s. Then, we applied horizontal projection profile-based line segmentation to the demographic information of the detected individuals in these registers. We further trained a CNN model to recognize the automatically detected ages of individuals and estimated the age distributions of people from these historical documents. Extracting age information from these historical registers is significant because it has enormous potential to revolutionize the historical demography of around 20 successor states of the Ottoman Empire, i.e., countries of today. We achieved approximately 60% digit accuracy in recognizing the numbers in these registers and estimated the age distribution with a Root Mean Square Error of 23.61.


Introduction
We have been living in written cultures for ages, and we not only produce vast amounts of documentation but are also governed and ruled by it. The need to analyze documentation and to understand and process its content is perpetual. In the past, the information and correspondence kept in manuscripts were processed manually because of the lack of comprehensive and high-quality digitized datasets on which an automatic method could be employed. Because high-quality digital scanning solutions and devices with high storage capacity were rare, transferring manuscript images from paper to digital form and storing them was difficult. Recently, this task has become more feasible due to dramatic progress in digital scanning and storage solutions.
Nowadays, thanks to the above-mentioned advances in technology, there are many digitized historical documents in Arabic script in national libraries and archives around the world. Investigating data and retrieving information manually are costly and challenging. Therefore, an automatic method is needed to process these documents rapidly. Processing historical manuscripts is an active research topic that has seen dramatic growth recently [1][2][3]. However, historical Arabic document processing remains a difficult research problem. The reasons include the complex nature of Arabic script compared to other scripts and the fragility of ancient documents, which are subject to degradation [4].
When recognizing handwritten documents, segmenting the document images into their primary objects, words, and text lines is a highly complex research issue in Arabic manuscripts because of the various problems faced in both word and text line segmentation. The main difficulties in segmenting Arabic lines are overlapping words, very close neighboring text lines, and variations in skew angle, both within a single text line and between lines on the same page [5]. In this study, we worked with the first population registers of the Ottoman Empire, which were conducted in the 1840s. These registers cover the entire Ottoman Empire in the mid-nineteenth century, which comprises the areas of around twenty successor countries of modern Southeast Europe and the Middle East; they are written in Arabic script. In these censuses, clerks created manuscripts without using hand-drawn or printed tables, and there was no predetermined page structure. The ages are written in the last line of each individual's demographic data, and we aim to recognize them.
Retrieving age information from historical manuscripts has huge potential for the field of historical demography. Although the Ottoman population registers convey minute and diverse detail on individuals such as names, family relations, occupations, physical appearance, and body height for the purposes of historical demography, age data and its distribution are the most telling and the most rewarding ones. The age structure of any population in any given time is a crucial marker of demographic transition [6,7]. Furthermore, age heaping, a directly related phenomenon to registered age distribution, has vast potential for historical demography studies with its connection to human capital formation [8]. A computerized age information retrieval method for handwritten Arabic numbers tailored for these Ottoman population registers does not only have massive potential to revolutionize historical demography of around 20 countries of today, but it also has the potential to improve HTR methods for historical and modern handwritten documentation in Arabic.
In this study, we developed software that automatically segments pages into individual objects and detects the lines in these objects in the population registers of Ottoman populated places or settlements. We automatically obtained the last line, which includes the age information, and estimated the age distribution in these historical registers by recognizing the numbers. For this study, we focused on one location: Manisa town in western Anatolia in Turkey. We first employed a CNN-based page segmentation technique to retrieve the demographic data of individuals by using the models developed in our previous studies [9,10]. After that, we applied horizontal projection profile-based line segmentation to the demographic information of the detected individuals in these registers and obtained the age data in the last line. We further detected and recognized the ages and estimated the age distribution.
The structure of the paper is designed as follows. In Section 2, the related work for line segmentation and Arabic digit recognition studies will be examined. We describe the structure of the population registers in Section 3. Our methods for page segmentation, line detection and digit detection and recognition are described in Section 4. Experimental results and a discussion are presented in Section 5. We present the conclusion and future works of the study in Section 6.

Related Works
Digitization of historical archives and application of information retrieval methods on them have gained pace in recent decades, including non-European handwritten archival collections [11]. Some historical documents might have a tabular structure, which makes it easier to analyze the layout. Zhang et al. [12] developed a system for analyzing Japanese Personnel Record 1956 (PR1956) documents, which includes company information in a tabular structure. They segmented the document by using the tables and applied Japanese OCR techniques to segmented images. Kusetogullari et al. [13] proposed a new automatic historical handwritten digit detection and recognition framework named DIGITNET. To train this network, they created and used a new historical handwritten digit dataset (DIDA), which contains 250,000 single digits, 25,000 images with bounding box annotations, and 200,000 digit strings from the tabular Swedish historical handwritten documents of the 19th century. However, in most of the cases (as in our case) these archival documents do not have a tabular structure. Cheddad et al. [14] created a semi-annotated dataset from the Swedish Historical Birth Records from 1800 to 1840 and made it public for researchers working in this area. They evaluated some deep learning models for word spotting.
In the Arabic manuscript processing literature, a wide variety of segmentation techniques were reported. In [15], the authors started by removing outlier elements by using a threshold; then, the letters linked to two lines at the half distance are recognized and horizontally segmented. For detecting the lines, a rectangular neighborhood on a current element is centered and rises to contain particular conditions. The distance filtered elements to the corresponding lines are then allocated.
In another study, the authors developed an algorithm for detecting lines from Arabic handwritten documents by mentioning the problems of multi-touching and overlapping characters [16]. The unsupervised method depends on the analysis of a block covering. The authors first analyzed a statistical block that computes the specific number into vertical strips of manuscript decomposition. Then, they employed the fuzzy C means method, which accomplishes fuzzy-based line detection. Last, they assigned the blocks to their corresponding lines. They achieved 95% accuracy for detecting lines in Arabic manuscripts in their dataset.
In [17], the researchers applied morphological dilatation and projection profile techniques. In order to estimate the skew of the line, they used horizontal projection profiles. In every zone for smearing, they employed the slope, using dilation with adaptive structuring element to do the changes according to the zone, the slope, and the size. The big blobs are identified in the second stage with a recursive function that searches the cut point.
In [18], the projection profile method is again employed after joining cut characters and eliminating small elements to define the point of division within the horizontal projection profile; the curve of Fourier fitting is employed. The contour is employed for segmenting the baseline of the connected component, which permits defining the cut point between different neighbor lines. The nearest line is approximated by the curve of a polynomial that fits in the baselines of the pixels. As it could be seen from the literature, several techniques were applied to segment lines in Arabic manuscripts. Due to its convenience and wide usage, we selected a projection profile-based technique for our problem.
After detecting the numbers in the last line, they must be recognized for retrieving data from the population registers. Several studies have recognized Arabic digits in contemporary documents [19]. However, they employed recently created datasets that are clean and uncomplicated compared to historical documents. The HODA dataset [20] was formed for creating Persian (an Arabic script-based language) handwritten digit recognition systems. It has 60,000 training and 10,000 test images, but some digits are different from classical Arabic digits. A similar and larger Farsi digit dataset was created in another study [21]. It has around 100,000 digits obtained from university and college students. Another dataset is ADBase [22], which was created in Egypt. It has 60,000 training and 10,000 test images. The CMATERDB 3.3.1 dataset [23] was created at Jadavpur University, India. It has 3000 images. All these datasets were created in recent decades and have been tested with a variety of machine learning algorithms. Different CNN architectures were tried with the HODA dataset [24][25][26], and researchers achieved accuracies over 95%. CNN and CNN + Boltzmann Machine classifiers were evaluated on the CMATERDB 3.3.1 dataset, and accuracies over 99% were obtained [27,28]. However, to the best of our knowledge, there is no publicly available historical Arabic handwritten digit or letter dataset. Training a deep learning model with these modern datasets and testing it with historical digits did not yield high accuracies in the literature [29]. Therefore, we created a dataset containing over 6000 digits, which contributes to the literature in this respect [9]. The dataset can be accessed at https://urbanoccupations.ku.edu.tr/historicalarabic-handwritten-digit-dataset/ (accessed on 13 September 2021). We then trained a CNN model and recognized the digits in this case study.

Dataset Description
The mid-nineteenth century Ottoman population registers resulted from an unprecedented governmental procedure, which aimed to record every male subject of the empire, irrespective of age, ethnic or religious affiliation, or military or financial status. They intended to have universal coverage for the male population. Government officials created these manuscripts without using hand-drawn or printed tables. Therefore, a predetermined page layout did not exist. Page structure can change in different districts, and structural variations occurred depending on the clerk's preferences in different registers. This research focused on the city of Manisa registers, with code names NFS.d. 2865, 2866, 2867, available at the Turkish Presidency State Archives of the Republic of Turkey, Department of Ottoman Archives, in jpeg format, upon request. We aimed to implement a method for recognizing text of similar registers from different regions of the Ottoman Empire conducted between the 1840s and the 1860s.
As we mentioned, these registers contain comprehensive demographic data on male members of the households, i.e., names, family relations, ages, and occupations. Females in the households were not recorded. The registers are provided for research purposes at the Ottoman State Archives in Turkey, as recently as 2011. Their total number is around 11,000. Until now, they have not been subject to any systematic study. Only individual registers were investigated in a piecemeal fashion. The size of the digital images of registers is 2210 × 3000 pixels. A sample register page is demonstrated in Figure 1. For estimating the age distribution, we selected three neighborhoods from Manisa city with respective numbers of registered persons: Sultan (120), Çarşı (68), and Gürhane (225). They contain a total of 413 people and their ages are estimated.

CNN-Based Object Detection Method
First, we developed a deep learning algorithm for detecting individuals in the population registers in our previous studies. We created a manually labeled dataset by using several registers and trained CNN models by using the dhSegment tool [30]. In the CNN-based dhSegment toolbox, paths use pretrained weights from well-known architectures such as Unet and Resnet50, where the system learns high-level features. They improve robustness and generalization. With the pretrained weights in the network, the training time and the number of parameters in the CNN architecture were reduced considerably [30].
We trained different models for different types of layouts. The first model was trained for registers with tightly placed individuals, and the second for registers with loosely placed individuals. We used the former model for the Manisa registers in this study. After detecting the individual objects in these registers, we used their pixelwise locations to crop the demographic data of individuals for use by the line detection algorithms. The detected individual objects can be seen in Figure 2. For more detailed information, see our previous paper on object detection for these population registers [10].

HPP-Based Line Segmentation Method
One widely used method for finding the line height of a document is to examine its horizontal projection profile. The horizontal projection profile (HPP) is the array of the sums of the rows of a two-dimensional image. Wherever there are more void areas between lines, we can observe more peaks. To give an idea of where the segmentation between two lines should be made, a sample HPP is provided in Figure 3.
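To make the definition concrete, the following is a minimal sketch of an HPP computed on a toy binarized image. The function name `horizontal_projection_profile` and the toy pixel layout are illustrative assumptions, not part of the original system. Whether the gaps between lines show up as peaks or as valleys depends on the pixel convention: with ink encoded as 1 (as assumed here) empty rows give low sums, while with a bright background encoded as high values they give high sums.

```python
def horizontal_projection_profile(image):
    """Return one value per image row: the sum of the pixel values in that row."""
    return [sum(row) for row in image]


# Toy binarized entry (1 = ink, 0 = background): two text lines
# separated by one empty row. The layout is made up for illustration.
toy_entry = [
    [0, 1, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
]

profile = horizontal_projection_profile(toy_entry)
print(profile)  # [3, 4, 0, 2]
```

With a NumPy array the same profile can be obtained as `image.sum(axis=1)`.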
In the first Ottoman population register series, the ages are written in the last line of each entry. Therefore, we applied a peak detection algorithm to detect the last peak and separate the age information in the last line. The PeakUtils library for Python is used to find peaks. The void regions between lines appear as valleys, so we invert the profile and find the peaks that correspond to valleys in the original HPP. We adapted the parameters of the PeakUtils library to find the largest peak, which divides the last line from the rest of the individual object. By trial and error, the peak threshold was set to one-fourth of the maximum HPP value, and the minimum distance between a detected peak and the next one was set to 10 data points. As we are searching for the last line, we crop everything under the last and largest peak in the inverted HPP, which provides us with the age of the individual.
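The last-line cropping step can be sketched as follows. This is a simplified pure-Python stand-in for the peak search and is not the PeakUtils implementation; PeakUtils exposes comparable controls through the `thres` and `min_dist` arguments of `peakutils.indexes`. The helper names, the toy profile, and the choice to take the last detected peak as the split row are assumptions made for illustration.

```python
def find_peaks(values, threshold_ratio=0.25, min_distance=10):
    """Simplified peak finder (assumption: not the PeakUtils algorithm).

    A peak is a local maximum above threshold_ratio * max(values);
    peaks closer than min_distance to a stronger peak are dropped.
    """
    threshold = threshold_ratio * max(values)
    candidates = [
        i for i in range(1, len(values) - 1)
        if values[i] > threshold
        and values[i] > values[i - 1]
        and values[i] >= values[i + 1]
    ]
    kept = []
    # Visit the strongest candidates first so weaker neighbours are discarded.
    for i in sorted(candidates, key=lambda i: values[i], reverse=True):
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)


def split_row_for_last_line(profile, threshold_ratio=0.25, min_distance=10):
    """Return the row index where the last line starts, or None if no gap is found."""
    top = max(profile)
    inverted = [top - value for value in profile]  # gaps become peaks
    peaks = find_peaks(inverted, threshold_ratio, min_distance)
    return peaks[-1] if peaks else None


# Toy HPP (ink = high values): three text lines separated by two gaps.
toy_profile = [8] * 12 + [1] * 3 + [9] * 12 + [0] * 4 + [7] * 10

split_row = split_row_for_last_line(toy_profile)
last_line_rows = toy_profile[split_row:]
print(split_row)  # 27
```

Everything from `split_row` downward would then be cropped as the age line.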

Digit Detection and CNN-Based Digit Recognition System
After line segmentation, we automatically crop the last line, which contains the age of individual (see Figure 4). We detected the digits, recognized them and directly assigned the digit value if there is only one digit. If there are two digits, we combined the digits into two-digit numbers by making necessary calculations.

Digit Detection and Training Machine Learning Models for Digit Recognition
For detecting the digits, the OpenCV-Python library is used. We first found contours and created bounding rectangles from them. We then applied a basic size filter: rectangles that were too small were treated as noise and eliminated. The remaining rectangles were provided as input to the digit recognition system. By using our publicly available historical Arabic handwritten digit dataset, we trained machine learning models. We chose four different algorithms: logistic regression (LR), a shallow Neural Network (one input and one output dense layer), a Deep Multilayer Perceptron (one output, one input, and two hidden dense layers), and a CNN. The first three algorithms are applied to the raw image by converting the 2D matrices directly to 1D arrays. The CNN is applied to the 2D images directly. These algorithms were selected because each one represents a different type of model. We used the same CNN architecture as in our previous study [29] (see Figure 5). The accuracy and loss on the training and test data with respect to epochs are provided in Figures 6 and 7. The models are saved as h5 files, and the bounding rectangles are provided as input to these models, which return the predicted digits with their probabilities.
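The size filter on bounding rectangles might look like the sketch below. It assumes rectangles in the `(x, y, w, h)` form returned by OpenCV's `cv2.boundingRect`; the function name `drop_noise_boxes`, the area criterion, and the `min_area` value are hypothetical, since the text does not state the threshold that was used.

```python
def drop_noise_boxes(boxes, min_area=20):
    """Keep (x, y, w, h) rectangles with area >= min_area, ordered left to right.

    min_area is a hypothetical threshold chosen only for this example.
    """
    kept = [box for box in boxes if box[2] * box[3] >= min_area]
    return sorted(kept, key=lambda box: box[0])


# Made-up rectangles: two digit-sized boxes and one 2 x 2 speck.
boxes = [(40, 5, 12, 18), (3, 9, 2, 2), (10, 4, 11, 20)]

digit_boxes = drop_noise_boxes(boxes)
print(digit_boxes)  # [(10, 4, 11, 20), (40, 5, 12, 18)]
```

A filter of this kind is also a place where a very small mark, such as a dot-shaped digit, can be discarded together with genuine noise.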

Transforming Digits into Numbers
The ages in this case study ranged between 0 and 83. Therefore, they are either one-digit or two-digit numbers. If we detect only one digit, its value is assigned as the age. However, if there are two digits, we check their locations, multiply the left digit by ten, add the right digit, and assign this sum as the age of the individual. The predicted ages of individuals are computed in this way. We retrieved the ground truth ages of individuals from our UrbanOccupationsOETR databases, which were entered manually by our project members, who are experts in Ottoman paleography. The metrics were calculated for each age entry, and then we drew the histogram of ages and compared it with the ground truth distribution.
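The combination rule described above can be written as a short sketch. The function name `digits_to_age` and the `(x_position, digit)` input format are assumptions for illustration; returning `None` for any other number of detections is likewise a choice made here, not something stated in the text.

```python
def digits_to_age(detections):
    """Combine recognized digits into an age.

    detections: list of (x_position, digit) pairs for one entry (assumed format).
    Returns None when the number of digits is not one or two.
    """
    ordered = sorted(detections, key=lambda item: item[0])  # left to right
    if len(ordered) == 1:
        return ordered[0][1]
    if len(ordered) == 2:
        left_digit, right_digit = ordered[0][1], ordered[1][1]
        return 10 * left_digit + right_digit
    return None


print(digits_to_age([(40, 5), (10, 3)]))  # 35
print(digits_to_age([(12, 7)]))           # 7
# If the zero of "40" goes undetected, only the 4 remains:
print(digits_to_age([(10, 4)]))           # 4
```

The last call shows why a missed digit changes the age by an order of magnitude rather than by a small amount.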

Experimental Results and Discussion
In this section, we first present our individual object detection results for the Manisa population registers. After that, we present the last line detection accuracies for these registers. We then report the digit recognition results and the estimated age distribution.

Individual Object Detection Results
We tested the performance of our system on Manisa registers (more than 7000 individuals). We employed two different pretrained architectures, namely, Unet and Resnet50, and presented results by using both of them. In order to count individuals in the registers, we defined a high-level metric that can be calculated by dividing the predicted count errors over the ground truth count. We named this metric as Individual Counting Error (ICE).
We used a tightly placed model for Manisa as its structure is more suitable for these models. The results are presented in Table 1.
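One possible reading of the ICE definition is sketched below. It assumes that the count error is the absolute difference between predicted and ground truth counts, accumulated over pages and expressed as a percentage; the per-page aggregation, the function name, and the counts are hypothetical and only illustrate the arithmetic.

```python
def individual_counting_error(predicted_counts, true_counts):
    """ICE in percent: summed absolute count errors over the total true count.

    Assumes per-page counts and absolute errors; the exact definition of
    the count error is not spelled out in the text.
    """
    errors = sum(abs(p - t) for p, t in zip(predicted_counts, true_counts))
    return 100.0 * errors / sum(true_counts)


# Made-up per-page counts, not taken from the registers.
predicted = [20, 18, 22]
truth = [20, 19, 21]

ice = individual_counting_error(predicted, truth)
print(round(ice, 2))  # 3.33
```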

Table 1. Individual Counting Error (ICE) for each pretrained network.

Pretrained Network    ICE (%)
Unet                  0.3836
Resnet50              0.6536

As can be seen from Table 1, the counting errors are around 0.5%. Using a pretrained Unet architecture reduces the error percentage from 0.65% to 0.38%.

Line Detection Results
We evaluated the performance of our line detection system. We again used a high-level metric, which is the line detection accuracy. It could be calculated by dividing the number of correctly detected lines by the total number of lines. In our previous study, we detected all lines with detection accuracies of 80.30% for Manisa registers [31]. As we used only the last line in this study, when we examine the last line detection performance, we successfully detected the last lines with 100% accuracy. As the last line is more distant from the main body than classical line breaks, it makes sense that it could be detected with higher accuracies.

Digit Recognition Results
After the last lines are automatically cropped and the digits in the ages are detected via the OpenCV library, we provided these digits to the trained machine learning models and predicted the results. We compared the machine learning algorithms in two different ways. The first is the digit recognition performance on the test set of our public dataset. We present the results in Table 2. The CNN outperforms the other algorithms in this comparison. The second is the performance of the trained and saved models in recognizing digits in the case study. We measure this performance by digit recognition accuracy and also present the results in Table 2. We further calculated the digit detection accuracy, the root mean square error, and the average error for the best performing CNN architecture and present them in Table 3. In our case study for digit recognition, we selected three neighborhoods from Manisa city, which contain 413 people and 739 digits in their ages. We calculated the digit detection accuracy by dividing the number of correctly predicted digits by the number of all digits. We predicted 60% of the digits correctly with the best machine learning model, which is the CNN. When we examine the errors, 28% of them are caused by 0. In handwritten Ottoman, 0 is written as a dot, and it can easily be confused with noise in the document (see Figure 8). Therefore, it is also hard to detect. Another common error is the distinction between 2 and 3. They are quite similar in handwritten Ottoman (see Figure 8) and can easily be confused by our model.
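The metrics named above can be sketched as follows. The root mean square error follows its standard definition; "average error" is interpreted here as the mean absolute error, which is an assumption, and the function names and sample ages are made up for illustration.

```python
import math


def digit_accuracy(num_correct, num_total):
    """Fraction of digits predicted correctly."""
    return num_correct / num_total


def rmse(predicted, truth):
    """Root mean square error over paired age entries."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, truth)]
    return math.sqrt(sum(squared) / len(squared))


def mean_absolute_error(predicted, truth):
    """Assumed meaning of 'average error': mean of absolute differences."""
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


# Made-up ages, not taken from the registers.
predicted_ages = [35, 4, 62, 23]
true_ages = [35, 40, 62, 33]

print(round(rmse(predicted_ages, true_ages), 2))      # 18.68
print(mean_absolute_error(predicted_ages, true_ages))  # 11.5
print(digit_accuracy(3, 5))                            # 0.6
```

Because the errors are squared, a single entry that loses a digit (40 read as 4) dominates the RMSE in this toy case.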

Age Distribution Estimation
The ultimate aim of this study is to estimate the age distribution of the three selected neighborhoods in the city of Manisa. After combining the predicted digits into numbers, we plotted the distribution of these numbers as a histogram (see Figure 9). The histograms are quite similar, but the number of people whose ages are between 0 and 20 is higher in the predicted numbers than in the ground truth. The reason is the difficulty of predicting the Arabic '0' (which is a dot). The algorithm generally fails to detect the zero in two-digit numbers and predicts them as one-digit numbers, which inflates the 0-20 age group. Apart from the zero problem, the algorithm successfully estimates the age distribution from the historical population registers.
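The effect of missed zeros on the histogram can be illustrated with a small sketch. The bin width of 10 years, the upper bound of 90, the function name, and the sample ages are all assumptions for illustration; the binning used for Figure 9 is not stated in the text.

```python
def age_histogram(ages, bin_width=10, max_age=90):
    """Count ages per bin of bin_width years; ages at or above max_age go in the last bin."""
    counts = [0] * (max_age // bin_width)
    for age in ages:
        index = min(age // bin_width, len(counts) - 1)
        counts[index] += 1
    return counts


# Made-up ages; in the prediction, 40 was read as 4 because its zero was missed.
true_ages = [35, 40, 62, 33, 7]
predicted_ages = [35, 4, 62, 33, 7]

print(age_histogram(true_ages))       # [1, 0, 0, 2, 1, 0, 1, 0, 0]
print(age_histogram(predicted_ages))  # [2, 0, 0, 2, 0, 0, 1, 0, 0]
```

In this toy case, one count moves from the 40-49 bin to the 0-9 bin, which is the same direction of bias described for the youngest age groups.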

Conclusions
In this study, we first used a CNN-based layout analysis technique to detect individual objects in the first Ottoman population register series. Then, we employed a horizontal projection profile algorithm-based line segmentation to the demographic information of these detected individual objects for detecting the last line, which contains the age data. We further detected and recognized digits and converted them to numbers. We focused on Manisa register individual entries. We detected objects that include demographic data of individuals with approximately 0.5% error by using the CNN-based segmentation algorithm. We detected the last lines of information belonging to 413 individuals of our case study from Manisa with 100% success rate. We detected digits by using contour lines and finding bounding rectangles and recognized them with a CNN model trained with our public dataset. We achieved approximately 60% digit detection accuracy. We estimated the age distribution of the selected neighborhoods with promising accuracy. We plan to add word and letter detection systems as future works and develop a recognition system for these registers. This will reveal important demographic information from a wide geographical area in the nineteenth century and will have a significant interdisciplinary impact on historical demography.
Author Contributions: Y.S.C. is the main writer of the manuscript. He performed the curation and development of the dataset and of the software and conducted the analysis. M.E.K. organized the preparation of the archival sources and initial data gathering. He has provided historical context and information regarding late Ottoman population registers, and contributed to the conceptualization of the case study. Both authors have read and agreed to the published version of the manuscript.

Data Availability Statement: UrbanOccupationsOETR_hdr_Nicaea_6k is the first historical Arabic handwritten digit dataset. It is curated from the first series of Ottoman population registers conducted in the mid-nineteenth century. The dataset was controlled manually and cleaned. It has more than 6000 digits. 5000 are divided into the training folder, and the remaining 1000 images are divided into the test folder. The dataset can be accessed at https://urbanoccupations.ku.edu.tr/historical-arabichandwritten-digit-dataset/ (accessed on 13 September 2021).

Conflicts of Interest: The authors declare no conflict of interest.