Horizon Line Detection in Historical Terrestrial Images in Mountainous Terrain Based on the Region Covariance

Horizon line detection is an important prerequisite for numerous tasks, including the automatic estimation of unknown camera parameters for images taken in mountainous terrain. In contrast to modern images, historical photographs contain no color information and have reduced image quality. In particular, missing color information in combination with high alpine terrain, partly covered with snow or glaciers, poses a challenge for automatic horizon detection. Therefore, a robust and accurate approach for horizon line detection in historical monochrome images in mountainous terrain was developed. For the detection of potential horizon pixels, an edge detector is learned based on the region covariance as texture descriptor. In combination with a shortest path search, the horizon in monochrome images is accurately detected. We evaluated our approach on 250 selected historical monochrome images dating back, on average, to 1950. In 85% of the images, the horizon was detected with an error of less than 10 pixels. In order to further evaluate the performance, an additional dataset consisting of modern color images was used. Our method, using only grayscale information, achieves results comparable with methods based on color information. In comparison with other methods using only grayscale information, the accuracy of the detected horizons is significantly improved. Furthermore, the influence of color, the choice of neighborhood for the shortest path calculation, and the patch size for the calculation of the region covariance were investigated. The results show that both the availability of color information and increasing the patch size for the calculation of the region covariance improve the accuracy of the detected horizons.


Introduction
Image archives hold impressive collections of historical terrestrial images from the Alpine regions. These images are an important, yet seldom used, source for identifying and documenting changes in the alpine landscape. Taken by early mountaineers without any auxiliary devices (e.g., global navigation satellite systems), the accurate position and orientation of these images are unknown. With georeferencing, the quantification of visible changes by monoplotting [1] becomes possible. Additionally, instead of relying on missing or wrong metadata, georeferenced images can be queried by their estimated location. Especially for geoscience disciplines (e.g., hydrology, glaciology), where historical archival images predate other sources by nearly 50 years (e.g., aerial imagery), the availability of georeferenced archival images will promote their usage.
In order to estimate the unknown camera parameters by photogrammetric resection, ground control points are necessary. Without the availability of human-made structures (e.g., buildings, streets), and with vast changes in the topography (e.g., glaciers) and varying illumination conditions, their identification in historical terrestrial images is time consuming and requires experience. In combination with the limited field of view of terrestrial images, analysis is often limited to a few hand-selected images. To overcome this limitation and allow the quantification of changes based on archival terrestrial images at larger scales, automatic methods for the estimation of the unknown camera parameters are necessary. Automatic image orientation is generally based on the extraction and matching of feature points (e.g., SIFT [2], SURF [3], ORB [4]). Especially in changing environments (e.g., glaciated mountains), in combination with collections of images captured over a time period of 100 years, image matching based on feature points is challenging [5,6]. Therefore, in natural environments like mountainous terrain, automatic methods are based on the visible horizon (also referred to as apparent horizon, horizon line, or skyline). Assuming that the visible horizon in an image is unique, it can be used for matching with rendered horizons from digital terrain models [7][8][9][10][11]. While these methods vary in the matching stage and the required a priori information (e.g., approximate location, orientation, or focal length), they are all based on the visible horizon. Therefore, the accurate and robust detection of the horizon is a crucial prerequisite for the automatic estimation of the unknown camera parameters.
The apparent horizon is the border separating terrain from sky, which is characterized in images by a change in color and texture. These differences are exploited by methods approaching horizon detection as a segmentation task, where each pixel in the image is assigned to either ground or sky. In [7,12], the problem is formulated as a foreground-background segmentation, using multiple dense feature descriptors within a bag-of-words framework. On the other hand, horizon line detection can be interpreted as an edge detection problem. Ref. [13] proposed a method based on edge detection and a subsequent extraction step using dynamic programming. In [14], the problem of non-continuous edges, often a result of traditional edge detection, is overcome by training a support-vector machine (SVM) on the normalized grayscale values extracted from patches around each pixel. Based on the derived probability map, the horizon line is detected by searching for the minimum cost path. More recently, [15] used a convolutional neural network to classify patches extracted for each pixel. Instead of classifying individual patches, [16] use a fully convolutional network. A different approach was suggested by [17], based on the observation that sky regions are generally more homogeneous than ground regions. The potential horizon line is found by maximizing the difference in variance above and below the horizon.
All of the proposed methods have only been applied to modern images. Except for the approach proposed by [14], color information is used. In contrast, the detection of the visible horizon in historical terrestrial images has not been addressed yet. Besides the unavailability of color information, these images pose additional challenges (e.g., Figure 1):
• Taken up to 100 years ago with the earliest available compact cameras, historical images have a poorer image quality, resulting in blurred images, low contrast, and overexposed regions.
• Most images found in archives were acquired from private collections. As those images have not been stored professionally, they show varying signs of usage (e.g., dirt, scratches, watermarks).
• Alpine environments pose a challenging setting for photography, even with modern cameras. Snow and glacial areas in particular are difficult to photograph due to strongly varying illumination. In combination with rapidly changing weather conditions (e.g., fog, clouds), the visual appearance is heavily affected.
Especially the lack of color information in combination with high alpine terrain, partly covered with snow or glaciers, poses a particular challenge for the detection of the horizon line. Therefore, a robust and accurate method for detecting the horizon in these challenging images is needed. Based on the work of [14,18], a new method is suggested. The region covariance is used in classification to describe the texture within the neighborhood around pixels. Introducing the probabilities obtained from a random forest classifier as weights into a shortest path search, the horizon is obtained. In order to evaluate the performance of the approach, two datasets representing modern and historical images were used. While the appearance of terrain and sky varies strongly, the horizon generally follows a certain spatial layout: the region below the horizon has larger intensity variations, whereas the region above, the sky, appears homogeneous. Following [14], an edge detector describing the spatial distribution of intensities surrounding horizon pixels is learned (Figure 2): horizon (green) and not-horizon (red) patches are extracted from training images (a), and the region covariance features are calculated (b), which are directly used as inputs to a random forest classifier. The probability map (d) is obtained by classifying densely extracted patches from a test image (c). Using shortest path search on a graph defined by the pixel grid (e), the horizon in the test image is detected (f).

Method
Various descriptors have been proposed for describing the appearance and layout of pixels in images (generally referred to as texture), e.g., LBP [19], GLCM [20], or filter banks [21]. Among those, the region covariance proposed in [18] shows interesting properties regarding our problem: Being calculated from various image statistics within a selected region, the covariance matrix is directly used as texture descriptor. Hence, no additional parameters must be defined except for the size of the region of interest. Depending on the features used, it is robust towards illumination, scale, and rotation changes. As it was originally proposed for object detection, the spatial appearance of objects can be easily incorporated. Furthermore, the region covariance is a compact descriptor which can be calculated efficiently using integral images [22].

Region Covariance
The covariance matrix C_R for a region R of a grayscale image is calculated using

$$C_R = \frac{1}{n-1} \sum_{k=1}^{n} (I_k - \mu)(I_k - \mu)^T \tag{1}$$

with I_k being the intensity of each of the n pixels within the region R, and μ the mean intensity. For a monochrome image, C_R contains only the variance σ²_I of the intensity. Therefore, [18] extended the image by additional feature channels, which can be directly derived from the intensity values. With multiple dimensions, the main diagonal of C_R contains the variances of the respective features, whereas the other values represent the pairwise covariances. Among the proposed features are the intensity I, the magnitude of the gradient Mag, the first and second order derivatives, and the x and y positions (Equation (2)).
$$F(x, y) = \left[\; I \;\; \mathrm{Mag} \;\; \frac{\partial I}{\partial x} \;\; \frac{\partial I}{\partial y} \;\; \frac{\partial^2 I}{\partial x^2} \;\; \frac{\partial^2 I}{\partial y^2} \;\; x \;\; y \;\right]^T \tag{2}$$

In case color images are available, the feature channels can be easily changed to include color information instead of the intensity (Equation (3)). The gradient magnitude Mag is calculated as the root of the squared sum of the first order image derivatives in x and y. The usage of the x and y image coordinates seems contradictory: while their variances are always the same, their covariances with the other features represent important information regarding the distribution of those features within the selected patch.
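As a concrete illustration, the feature channels and the covariance of a rectangular region can be computed with a few lines of NumPy. This is a minimal sketch under our own naming: for brevity, only the first order derivatives are stacked (second order derivatives can be added analogously), and no integral-image acceleration is used.

```python
import numpy as np

def feature_channels(img):
    """Stack per-pixel feature channels for a grayscale image:
    x, y, intensity I, gradient magnitude Mag, dI/dx, dI/dy."""
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    dy, dx = np.gradient(img.astype(float))   # derivatives along rows, columns
    mag = np.sqrt(dx ** 2 + dy ** 2)
    return np.stack([x, y, img.astype(float), mag, dx, dy], axis=-1)

def region_covariance(features, r0, r1, c0, c1):
    """Covariance matrix C_R of the feature vectors inside a rectangular region."""
    region = features[r0:r1, c0:c1].reshape(-1, features.shape[-1])
    return np.cov(region, rowvar=False)       # (d x d), unbiased 1/(n-1) estimate
```

For a region extracted from a d-channel feature stack, the result is a d × d symmetric matrix whose diagonal holds the per-channel variances, as described above.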

Patch Representation
While the appearance of horizons varies, their spatial layout is consistent: a smoother sky region above, in contrast to the rougher terrain with stronger variations below. Therefore, not only the intensity values, but especially the image gradients and the gradient magnitude, will show stronger variations for the terrain region in the bottom part of horizon patches. With the inclusion of the x and y coordinates into the calculation of the region covariance, a measure describing the distribution of the features within each patch is already available.
To further emphasize the spatial structure of horizon patches, each patch is represented by five individual parts (Figure 3), which was already suggested by [18]. For each of the five individual regions of a patch, a covariance matrix describing the spatial-textural relationship is calculated. The concatenation of all covariance matrices of one patch then builds the basis within the classification.
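The five-part patch representation can be sketched as follows. The function names are illustrative, and the layout of the five regions is an assumption here (the whole patch plus its four quadrants, as in the original region covariance work [18]); the split actually used is the one shown in Figure 3.

```python
import numpy as np

def _region_cov(features, r0, r1, c0, c1):
    """Covariance of the feature vectors inside one rectangular sub-region."""
    region = features[r0:r1, c0:c1].reshape(-1, features.shape[-1])
    return np.cov(region, rowvar=False)

def five_region_covariances(features, r, c, half=8):
    """Covariance matrices for a patch centered at (r, c): the full patch
    (17 x 17 for half=8) plus four sub-regions (here assumed quadrants)."""
    r0, r1 = r - half, r + half + 1
    c0, c1 = c - half, c + half + 1
    regions = [(r0, r1, c0, c1),                  # full patch
               (r0, r, c0, c), (r0, r, c, c1),    # top-left, top-right
               (r, r1, c0, c), (r, r1, c, c1)]    # bottom-left, bottom-right
    return [_region_cov(features, *reg) for reg in regions]
```

The concatenation of the five matrices, after the transformation described in the following section, forms the input to the classifier.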

Classification
Each covariance matrix is a symmetric square matrix of dimension d × d, with d being the number of features used. Due to symmetry, each matrix contains (d² + d)/2 unique values. As five regions per patch are used, the length of the resulting feature vector is 5 · (d² + d)/2.
One limitation of covariance matrices is that they are not embedded in a Euclidean space [18]. Therefore, the individual elements of the covariance matrices cannot be directly interpreted as features in a subsequent classification task. In order to overcome this limitation, various approaches have been proposed. Ref. [18] used region covariances within a nearest neighbor classification, making use of the distance metric proposed by [23] to calculate the pairwise distances. In the case of support-vector machines, specifically designed kernels were suggested by [24,25]. Both also experimentally evaluated the influence of directly using covariance features in contrast to considering their geometry. For random forests, the log-Euclidean transformed region covariances [26] have been used by various authors [27][28][29].
Following [26], region covariances C_R can be used within Euclidean computations after transformation into their matrix logarithms. One prerequisite for this approach are symmetric positive definite matrices. By definition, C_R are symmetric positive semi-definite matrices, which can be transformed into positive definite matrices by adding a small constant factor ε = 1 × 10⁻⁶:

$$\hat{C}_R = C_R + \epsilon \, I_d \tag{4}$$

where I_d is the d-dimensional identity matrix. After transformation into a positive definite matrix, the logarithm log(Ĉ_R) can be calculated using the eigenvalue decomposition

$$\hat{C}_R = U \, D \, U^T \tag{5}$$

with D being the diagonal matrix of the eigenvalues of Ĉ_R, and finally,

$$\log(\hat{C}_R) = U \log(D) \, U^T \tag{6}$$

where log(D) is the diagonal matrix of the eigenvalue logarithms. The elements of log(Ĉ_R) can now be directly used within the random forest classification. For training, positive training patches are sampled along the ground truth horizon. To account for the imbalance between horizon and not-horizon pixels in images, negative training patches are randomly sampled with a ratio of 4:1 (not-horizon:horizon).
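The matrix-logarithm transformation and the flattening into a feature vector are straightforward with NumPy. This is a minimal sketch with our own function names; each patch then yields a vector of length 5 · (d² + d)/2, which can be fed to any standard random forest implementation.

```python
import numpy as np

def log_euclidean(C, eps=1e-6):
    """Matrix logarithm of a covariance matrix via eigenvalue decomposition.

    A small multiple of the identity is added first to turn the positive
    semi-definite matrix into a positive definite one.
    """
    C = C + eps * np.eye(C.shape[0])
    w, U = np.linalg.eigh(C)          # C = U diag(w) U^T
    return (U * np.log(w)) @ U.T      # log(C) = U log(D) U^T

def to_feature_vector(covs):
    """Stack the (d^2 + d)/2 unique upper-triangle values of each
    log-transformed covariance matrix into one feature vector."""
    iu = np.triu_indices(covs[0].shape[0])
    return np.concatenate([log_euclidean(C)[iu] for C in covs])
```

Since the logarithm acts only on the eigenvalues, symmetry of the input is preserved, so the upper triangle carries all unique values.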

Horizon Line Detection
Following [13,14], the horizon line detection is solved by searching for the shortest path from the left to the right image border. A directed weighted graph G(N, E) is defined, with each node N corresponding to a pixel of the input image and each edge E connecting two pixels of the image in the graph. Each edge has a weight w(E). Additionally, two special nodes called "source" and "sink" are added to the graph G. The source node is connected to all pixels of the first image column, whereas the sink node is connected to all pixels of the last image column. For all edges connected to either source or sink, the weight w(E) is set to zero. As these edges have zero weight, they do not have any influence on the shortest path itself.
On the other hand, as both nodes are connected to all pixels of the first or last column, the shortest path can reach any pixel within the image. Therefore, no approximate location of the horizon must be known a priori.

Neighborhood Definition
Two different neighborhood definitions proposed in [14,16] are considered. The first one, N1, connects each pixel with the direct neighbors in the next image column (Figure 4, left). In contrast, for N3, all pixels in the next image column which are separated by up to 3 rows below or above the current pixel (Figure 4, right) are regarded as neighbors and are connected with an edge in the graph used for the shortest path calculation. According to the authors, this should allow the detected horizon to follow along steeper mountain profiles.

Figure 4. Different neighborhood definitions for the construction of the image grid graph used within the shortest path calculation. All pixels (indicated as circles) of the first column are connected to the source node (blue). For the N1 neighborhood, each pixel is only connected to direct neighbors in the next image column. For the N3 neighborhood, each pixel is connected to all pixels in the next column which are separated by up to three rows. Connections are shown for two exemplary nodes highlighted in red.
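The two definitions differ only in how many rows the path may step up or down between adjacent columns. A small helper (hypothetical naming) makes this explicit, with k = 1 reproducing N1 and k = 3 reproducing N3:

```python
def column_neighbors(row, n_rows, k=1):
    """Rows in the next image column connected to a pixel at `row`.

    k = 1 corresponds to the N1 neighborhood, k = 3 to N3.
    Rows outside the image are discarded.
    """
    return [r for r in range(row - k, row + k + 1) if 0 <= r < n_rows]
```

For a pixel in row 0 of an image with 5 rows, N3 connects it to rows 0 through 3 of the next column, allowing the detected path to drop quickly along steep profiles.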

Edge Weight
The horizon we want to detect corresponds to the shortest path minimizing the sum of the edge weights from the left to the right image border. Therefore, the definition of the edge weights, except for those edges connecting nodes to either source or sink, is essential. The output of the random forest classifier is in the range [0, 1], with 1 indicating a high probability that a pixel is part of the horizon. As we are searching for the path with the minimum sum of edge weights from the left to the right image border, the negative logarithm of the probability derived from the random forest classifier is used as the cost of the corresponding node N:

$$cost(N) = -\log(P) \tag{7}$$

where P is the probability of the pixel. Therefore, a pixel with a high horizon probability corresponds to a node in the graph with a low cost and vice versa. As log(0) is undefined, all probabilities of zero are set to 1 × 10⁻⁶ prior to the logarithmic transformation. Each edge of the graph connects two pixels of the image. Hence, two candidates exist for the definition of the edge weight: the cost of the node the edge originates from (N_f) and the cost of the node the edge points at (N_t). As defined in Equation (8), the cost of the target node is used:

$$w(E) = cost(N_t) \tag{8}$$
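The complete shortest path search can be sketched with SciPy's sparse graph routines. This is an illustrative implementation under our own naming, not the authors' code: each edge carries the cost −log(P) of its target node, source and sink edges get a tiny positive weight (exactly zero entries would be dropped from the sparse matrix), and k selects the neighborhood (1 for N1, 3 for N3).

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import dijkstra

def detect_horizon(prob, k=1):
    """Horizon as the shortest source-to-sink path over the pixel grid.

    prob : (rows, cols) array of horizon probabilities from the classifier.
    Returns a list of (row, col) pixels, one per image column.
    """
    rows, cols = prob.shape
    # clip zero probabilities before the logarithm; the tiny additive term
    # keeps zero-cost edges (P = 1) from being dropped by the sparse matrix
    cost = -np.log(np.clip(prob, 1e-6, 1.0)) + 1e-12
    n = rows * cols
    source, sink = n, n + 1
    idx = lambda r, c: r * cols + c
    g = lil_matrix((n + 2, n + 2))
    for c in range(cols - 1):
        for r in range(rows):
            for rn in range(max(0, r - k), min(rows, r + k + 1)):
                g[idx(r, c), idx(rn, c + 1)] = cost[rn, c + 1]
    for r in range(rows):                      # source/sink edges: ~zero weight
        g[source, idx(r, 0)] = 1e-12
        g[idx(r, cols - 1), sink] = 1e-12
    _, pred = dijkstra(g.tocsr(), indices=source, return_predecessors=True)
    path, node = [], pred[sink]                # backtrack sink -> source
    while node != source and node >= 0:
        path.append(divmod(int(node), cols))   # (row, col) of each pixel
        node = pred[node]
    return path[::-1]
```

Since the source and sink edges are (near-)zero, the path may enter and leave the image at any row, so no approximate horizon location is required, as noted above.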

Evaluation
A commonly used metric [14,16,30], the average vertical deviation from the ground truth horizon, is used for accuracy evaluation:

$$D_{img} = \frac{1}{N_{col}} \sum_{j=1}^{N_{col}} \left| R_j^{gt} - R_j^{est} \right| \tag{9}$$

where N_col is the number of columns in the image, R_j^gt is the row of the ground truth horizon in column j, and R_j^est is the row of the estimated horizon in column j. In order to derive the statistics for the whole dataset, D_img is calculated for each image, and the mean (µ) and standard deviation (σ) over all images are calculated.
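The per-image metric and the dataset-level statistics can be computed directly; a minimal sketch with illustrative function names:

```python
import numpy as np

def vertical_deviation(gt_rows, est_rows):
    """Average absolute vertical deviation D_img for a single image:
    the mean over all columns of |R_gt - R_est|."""
    gt = np.asarray(gt_rows, dtype=float)
    est = np.asarray(est_rows, dtype=float)
    return float(np.mean(np.abs(gt - est)))

def dataset_statistics(per_image_deviations):
    """Mean and standard deviation of D_img over all images of a dataset."""
    d = np.asarray(per_image_deviations, dtype=float)
    return float(d.mean()), float(d.std())
```

For example, a detected horizon that is 2 px too low in one column, 2 px too high in another, and exact in a third yields D_img = 4/3 px.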
In both datasets, the ground truth horizons have no vertical or overhanging geometries, making this metric sufficient to determine accuracy. Only in cases of falsely detected horizons may a column contain multiple horizon pixels. In these rare cases, the position of the highest pixel is used.

Datasets
To evaluate the performance of our approach, two different datasets, CH1 and HIST, are used. For the CH1 dataset, results for various approaches are reported in [16], which makes it suitable for comparison. The HIST dataset is a collection of historical terrestrial images used within our work.

CH1
This dataset (https://cvg.ethz.ch/research/mountain-localization/, accessed on 27 April 2021) was made available by [7] and consists of 203 color images, including ground truth sky masks, captured across Switzerland. All images have similar aspect ratios and dimensions (~1024 × 700 px). It contains modern photographs of mountainous terrain, including challenging situations like cloud occlusions, low contrast horizons, and images taken from within cable cars (Figure 5). The dataset has been used by a wide range of authors (e.g., [16,31]) for evaluation. While it consists of color images, the grayscale images used for the developed method are derived by averaging the color channels.

HIST
The HIST dataset is a selection of 250 historical grayscale images with manually created ground truth horizons. All images were captured in high alpine terrain in the Alps. Compared to the CH1 dataset, many images show detailed views of glaciers and their surroundings. Combined with the reduced image quality and lack of color information, these images present a particular challenge for horizon detection. Some examples are shown in Figure 1.
Collected from various archives, all images had already been digitized. For most, the original format (e.g., film, print) and information regarding the scanning process (e.g., scanner type and model, DPI) are unknown. Some of the images contain borders, which were semi-automatically detected prior to the horizon line detection and excluded from all subsequent steps. In contrast to CH1, image dimensions range from approximately 700 × 1000 px to 2500 × 1800 px, including both portrait and landscape images.
In total, the images span a time period of nearly 100 years. The earliest image dates back to 1895 with the most recent one taken in 1992. The majority of the images were captured between 1920 and 1950.
Results

CH1

From the reported methods, SVM is most similar to our approach, as it is based on normalized pixel intensities extracted from 16 × 16 patches. Therefore, we re-implemented SVM based on the details given in [14]. In contrast to the original approach, patches with an odd dimension of 17 pixels are extracted to symmetrically describe the regions surrounding each pixel. As the CH1 dataset contains color images, two different variants of our method were evaluated: the first one, "cov_gray", uses the feature channels as shown in Equation (2), while "cov_color" uses the R, G, and B channels instead of the intensity (Equation (3)).
The evaluation strategy from [16] is adopted by splitting the dataset into 60% training and 40% testing sets. As no details are given regarding which images were used for training and testing, random splitting of the dataset is repeated five times. µ and σ are calculated from the images in all testing sets.
Based on the mean error and standard deviation, cov_color and cov_gray rank second and third among all reported methods (Table 1). Only DCNN-DS1, based on a fully convolutional network, achieves a higher accuracy and lower standard deviation compared to cov_color. In contrast to the approaches achieving higher accuracy, cov_gray is solely based on the grayscale images. With regard to the HIST dataset, consisting only of historical monochrome images, the comparison with SVM and CNN is of special interest. Using cov_gray, the average error was reduced from 22.57 px (SVM) and 5.78 px (CNN) to 3.51 px. As both SVM and cov_gray make use of exactly the same patches within training, the higher accuracy results exclusively from the improved horizon classification based on the region covariance.

As outlined in Section 2.1, the region covariance matrix can be easily adjusted to monochromatic or color images. As color images are available in the CH1 dataset, it is possible to evaluate the role of color for horizon detection. According to the results in Table 1, cov_color is more accurate than cov_gray. This indicates that color information is beneficial and contains additional information to distinguish the horizon from other apparent silhouettes. Analyzing the results of both variants in more detail revealed that, for most images, the influence of color is minor, as the region covariance based on the grayscale information is already a powerful and descriptive patch descriptor. However, in cases where additional structures with a similar visual appearance in the grayscale domain exist, color information is important. Pixels along the true horizon are classified with higher probability when color information is available, whereas those along structures in the foreground are classified with lower probabilities, leading to reduced errors (Figure 6). In the example shown in the top row of Figure 6, the horizon is correctly detected using cov_gray in the right part of the image.
Rock faces, trees, and the skiing slope in the foreground visually form an apparent ridge line in the left part of the image. Without color information, the shortest path follows these structures instead of the true horizon. In the second example (bottom row), the situation is similar. The tree line visible along the photographed mountain range is falsely detected as the horizon using cov_gray. As can be seen in the probability map for cov_gray, multiple smaller parts of the true horizon are not correctly classified. These gaps along the horizon result in graph nodes with high costs. Hence, the sum of edge weights along the true horizon is higher compared to the path along the tree line.

Influence of the Neighborhood Definition
Refs. [14,16] employed two different neighborhood definitions within their shortest path calculations, as graphically illustrated in Figure 4. While in [14] only pixels in consecutive columns and rows are connected, in [16] pixels in consecutive columns separated by up to 3 rows are also connected. According to [16], this should account for steeper mountain slopes encountered in the CH1 dataset.
For all approaches, the results achieved with the N3 neighborhood are better than those with N1 (Table 2). These results agree with the argumentation of [16] regarding the choice of N3.

Influence of Step-Size
All images in the CH1 dataset have a resolution of ~1024 × 700 px. In general, it can be assumed that patches extracted from neighboring pixels will not differ greatly, as they represent approximately the same image region. Accordingly, the region covariances of these patches will be very similar. With regard to the HIST dataset, where images with resolutions up to ~2500 × 1800 px are available, the question arose whether it is really necessary to extract patches densely for each pixel, or whether they can be extracted more sparsely in order to speed up computations and reduce memory requirements. Therefore, the accuracies of the horizons detected in the test images when classifying every pixel, every 2nd, every 4th, and every 8th pixel were compared (Table 3). The accuracy of the detected horizon is only marginally influenced by extracting and classifying patches for every second pixel, whereas the number of patches is already reduced to 1/4. Further increasing the step size up to 8, which in combination with the used patch size leads to half-overlapping patches, the error grows to ~10 px, three times larger than for the densely extracted patches. While the similar standard deviations for the step sizes 1, 2, and 4 reveal that only minor differences occur, the standard deviation for the step size of 8 is twice as large, indicating gross errors in the detected horizons.
Further results of the horizon detection for more images of the CH1 dataset are given in Appendix A.

HIST
For the HIST dataset, the same evaluation strategy as for the CH1 dataset was used. The 250 manually labeled images were randomly split 5 times into 60% training and 40% testing data. Following the evaluation on the CH1 dataset, a step size of 2 is used to calculate the probability maps of the test images.
As for the CH1 dataset, our method achieves significantly better results compared to SVM (Table 4). In contrast to the results obtained for CH1, the N1 neighborhood results in higher accuracy for both methods. The idea behind using N3 as neighborhood definition was to allow horizon lines to follow steep mountain profiles. In the case of the HIST dataset, this has the opposite effect, as shown in Figure 7. In situations where ridge lines formed, e.g., by rock faces lie close to the true horizon, the increased neighborhood size allows the detected horizon to jump across smaller vertical gaps and follow along these apparent structures. Connecting pixels based on the N3 neighborhood results in detected horizons which jump from the true horizon to ridges with strong visual edges below the true horizon. These ridge lines have high horizon probabilities, as they are often the result of a change from rock to snow or glacier. In contrast, the true horizon, partly also due to the lack of color information, especially in the case of snow coverage, has lower horizon probabilities or gaps.

Influence of the Patch Size
The patch size controls the size of the region surrounding each pixel used within the calculation of the region covariance. While for the CH1 dataset a patch size of 17 × 17 was used, in accordance with the original implementation of SVM and for comparison with CNN, other dimensions might be more suitable. Therefore, additional patch sizes of 9 × 9 and 33 × 33 were evaluated (Table 5). The lowest average error and standard deviation are obtained for a patch dimension of 33 × 33 pixels. For smaller patch sizes, especially 9 × 9, smaller structures are classified with high probability as potential horizon pixels, leading to noisier probability maps. With increasing patch sizes, and therefore the consideration of larger regions surrounding each pixel, the influence of smaller structures is reduced. This leaves fewer possible paths within the probability maps for the shortest path to follow. In the example shown in Figure 8, the error of the detected horizon is reduced from ~156 px for the smallest patch size to ~9 px for the patch size of 33 × 33 pixels. The remaining error is caused by the snow-covered horizon in contrast to an apparent ridge line formed by rock faces below the true horizon (Figure 8, orange rectangle). Without any contrast between the true horizon and the cloudy sky in the background, cov_gray is not able to detect this part of the horizon.

Another benefit of the region covariance not addressed so far is directly related to the used patch size. While for SVM the length of the feature vector used within classification depends on the patch size, the length of the feature vector for the region covariance, as shown in Section 2.1.2, depends only on the number of feature channels and the number of regions used within each patch. Therefore, for all patch sizes used, the length of the feature vector is 135 for cov_gray, whereas for SVM it ranges from 81 (9 × 9) to 1089 (33 × 33).

Discussion
For both datasets, cov_gray achieves the highest accuracy among the methods based on monochrome images. These results demonstrate that the region covariance is a powerful texture descriptor in the context of horizon detection. While for the CH1 dataset an average error of 3.51 px was achieved, the error for the HIST dataset is roughly six times larger for the best performing variant of cov_gray. Analyzing the deviation from the ground truth horizon for each image (Table 6) shows that the increased average error is mainly caused by a larger percentage (14.4%) of images having an error above 10 pixels, whereas for the CH1 dataset there are only 6.6%. Additionally, the maximum error for HIST (~1200 px) is approximately 10 times larger than for the CH1 dataset (~180 px). This increased error is also related to the dimensions of the images contained in the respective datasets. While for CH1 all images have the same dimensions (~1024 × 768 px), in the HIST dataset these range up to ~2500 × 1800 px. To further understand the remaining challenges, selected images from the HIST dataset with deviations larger than 10 pixels are shown in Figure 9. Subfigures (a) and (b) show images where parts of the true horizon are barely distinguishable from the sky above due to the lack of color, low contrast, and reduced image quality. In those regions where the horizon is visible, it is also correctly detected. Due to the formulation of the horizon detection as a shortest path problem from the left to the right image border, a continuous horizon across the whole image is enforced. Therefore, in those cases where parts are not visible, the next best path is chosen within the shortest path search.
In (c) and (d), the situation is slightly different and similar to the example shown in Figure 6. While the horizon is visible, it is completely covered in snow. Due to the visually apparent ridge lines formed by rock faces below the horizon, the detected horizon partly follows along these structures. These situations might be solved if color was available, as was the case for the CH1 dataset.
Nevertheless, with the developed method, it is possible to overcome major problems posed by the challenging situations found in historical terrestrial images captured in high alpine terrain (Figure 10). More results of the horizon detection for images of the HIST dataset, including both successful and unsuccessful examples, are given in Appendix B.

Conclusions
Based on the region covariance as texture descriptor, an improved method for the detection of horizon lines in monochrome images was presented. Especially the comparison with [14] on the CH1 dataset shows the discriminative power of the region covariance matrix. Only approaches using color information achieve higher accuracy, indicating that color is important additional information for horizon line detection in mountainous terrain. With the HIST dataset, the developed method was evaluated on historical images dating back, on average, to 1950. In particular, the combination of high alpine terrain covered with snow and glaciers and the reduced image quality poses extremely challenging situations for horizon line detection. Despite these challenges, in 85% of the images the horizon was detected with an average error smaller than 10 pixels.

Figure A2. Selected images from the HIST dataset with ground truth horizons (red) and detected horizons using SVM (orange) and cov_gray (blue). The left and center columns show correctly detected horizons using cov_gray, the right column horizons which are erroneous.