1. Introduction
Image archives hold impressive collections of historical terrestrial images from the Alpine regions. These images are an important, yet seldom used, source for identifying and documenting changes in the alpine landscape. Because they were taken by early mountaineers without any auxiliary devices (e.g., global navigation satellite systems), the accurate position and orientation of these images are unknown. With georeferencing, the quantification of visible changes by monoplotting [1] becomes possible. Additionally, instead of relying on missing or incorrect metadata, georeferenced images can be queried by their estimated location. Especially for geoscience disciplines (e.g., hydrology, glaciology), where historical archival images predate other sources, such as aerial imagery, by nearly 50 years, the availability of georeferenced archival images will promote their usage.
In order to estimate the unknown camera parameters by photogrammetric resection, ground control points are necessary. Without human-made structures (e.g., buildings, streets), and with vast changes in the topography (e.g., glaciers) and varying illumination conditions, their identification in historical terrestrial images is time consuming and requires experience. In combination with the limited field of view of terrestrial images, analysis is therefore often limited to a few hand-selected images. To overcome this limitation and allow the quantification of changes based on archival terrestrial images at larger scales, automatic methods for the estimation of the unknown camera parameters are necessary. Automatic image orientation is generally based on the extraction and matching of feature points (e.g., SIFT [2], SURF [3], ORB [4]). Especially in changing environments (e.g., glaciated mountains), and with collections of images captured over a time period of 100 years, image matching based on feature points is challenging [5,6]. Therefore, in natural environments like mountainous terrain, automatic methods are based on the visible horizon (also referred to as apparent horizon, horizon line, or skyline). Assuming that the visible horizon in an image is unique, it can be used for matching with rendered horizons from digital terrain models [7,8,9,10,11]. While these methods vary in the matching stage and the required a priori information (e.g., approximate location, orientation, or focal length), they are all based on the visible horizon. Therefore, the accurate and robust detection of the horizon is a crucial prerequisite for the automatic estimation of the unknown camera parameters.
The apparent horizon is the border separating terrain from sky, which is characterized in images by a change in color and texture. These differences are exploited by methods approaching horizon detection as a segmentation task, where each pixel in the image is assigned to either ground or sky. In [7,12], the problem is formulated as a foreground–background segmentation, using multiple dense feature descriptors within a bag-of-words framework. Alternatively, horizon line detection can be interpreted as an edge detection problem. Ref. [13] proposed a method based on edge detection and a subsequent extraction step using dynamic programming. In [14], the problem of non-continuous edges, often a result of traditional edge detection, is overcome by training a support-vector machine (SVM) on the normalized gray-scale values extracted from patches around each pixel. Based on the derived probability map, the horizon line is detected by searching for the minimum cost path. More recently, [15] used a convolutional neural network to classify patches extracted for each pixel. Instead of classifying individual patches, [16] used a fully convolutional network. A different approach was suggested by [17], based on the observation that sky regions are generally more homogeneous than ground regions; the potential horizon line is found by maximizing the difference in variance above and below it.
All of the proposed methods have only been applied to modern images, and all except the approach proposed by [14] use color information. In contrast, the detection of the visible horizon in historical terrestrial images has not been addressed yet. Besides the unavailability of color information, these images pose additional challenges (e.g., Figure 1):
Taken up to 100 years ago with the earliest available compact cameras, historical images have poor image quality, resulting in blurred images, low contrast, and overexposed regions.
Most images found in archives were acquired from private collections. As these images have not been stored professionally, they show varying signs of usage (e.g., dirt, scratches, watermarks).
Alpine environments pose a challenging setting for photography, even with modern cameras. Especially snow and glacial areas are difficult to photograph due to strongly varying illumination. In combination with rapidly changing weather conditions (e.g., fog, clouds), the visual appearance is heavily affected.
Especially the lack of color information, in combination with high alpine terrain partly covered by snow or glaciers, poses a particular challenge for the detection of the horizon line. Therefore, a robust and accurate method for detecting the horizon in these challenging images is needed. Based on the work of [14,18], a new method is suggested: the region covariance is used within the classification to describe the texture in the neighborhood around each pixel, and the horizon is obtained by introducing the probabilities from a random forest classifier as weights into a shortest path search. In order to evaluate the performance of the approach, two datasets representing modern and historical images were used.
While the appearance of terrain and sky varies strongly, the horizon generally follows a certain spatial layout: the region below the horizon has larger intensity variations, whereas the region above, the sky, appears homogeneous. Following [14], an edge detector is learned that describes the spatial distribution of intensities surrounding horizon pixels (Figure 2).
2. Method
Various descriptors have been proposed for describing the appearance and layout of pixels in images (generally referred to as texture), e.g., LBP [19], GLCM [20], or filter banks [21]. Among those, the region covariance proposed in [18] shows interesting properties regarding our problem: being calculated from various image statistics within a selected region, the covariance matrix is directly used as a texture descriptor. Hence, no additional parameters must be defined except for the size of the region of interest. Depending on the features used, it is robust towards illumination, scale, and rotation changes. Originally proposed for object detection, it can easily incorporate the spatial appearance of objects. Furthermore, the region covariance is a compact descriptor which can be calculated efficiently using integral images [22].
2.1. Region Covariance
The covariance matrix $\mathbf{C}_R$ for a region $R$ of a grayscale image is calculated using

$$\mathbf{C}_R = \frac{1}{n-1} \sum_{k=1}^{n} (z_k - \mu)(z_k - \mu)^{T} \tag{1}$$

with $z_k$ being the intensities of each pixel within the region $R$, and $\mu$ the mean intensity. For a monochrome image, $\mathbf{C}_R$ contains only the variance $\sigma^2$ of the intensity. Therefore, [18] extended the image by additional feature channels, which can be directly derived from the intensity values. With multiple dimensions, the main diagonal of $\mathbf{C}_R$ contains the variances of the respective features, whereas the off-diagonal values represent the pairwise covariances. Among the proposed features are the intensity $I$, the magnitude of the gradient $|\nabla I|$, the first- and second-order derivatives, and the $x$ and $y$ positions (Equation (2)):

$$F(x, y) = \begin{bmatrix} x & y & I & |I_x| & |I_y| & |I_{xx}| & |I_{yy}| & |\nabla I| \end{bmatrix}^{T} \tag{2}$$
In case color images are available, the feature channels can easily be changed to include color information instead of the intensity. The gradient magnitude $|\nabla I|$ is calculated as the root of the squared sum of the first-order image derivatives in $x$ and $y$. The usage of the $x$ and $y$ image coordinates may seem contradictory: while their variance is always the same, their covariances with the other features represent important information regarding the distribution of those features within the selected patch.
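To make the descriptor concrete, the following minimal numpy sketch computes the feature channels of Equation (2) and the covariance of a rectangular region. The function names and the choice of np.gradient for the derivatives are ours, not those of the original implementation:

```python
import numpy as np

def feature_image(img):
    """Stack the per-pixel feature channels of Equation (2) for a
    grayscale image: x, y, I, |Ix|, |Iy|, |Ixx|, |Iyy|, |grad I|."""
    I = img.astype(float)
    h, w = I.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    Iy, Ix = np.gradient(I)               # first-order derivatives
    Iyy = np.gradient(Iy, axis=0)         # second-order derivatives
    Ixx = np.gradient(Ix, axis=1)
    mag = np.sqrt(Ix**2 + Iy**2)          # gradient magnitude
    return np.stack([x, y, I, np.abs(Ix), np.abs(Iy),
                     np.abs(Ixx), np.abs(Iyy), mag], axis=-1)

def region_covariance(features, r0, r1, c0, c1):
    """Covariance matrix C_R (Equation (1)) over the rectangular
    region features[r0:r1, c0:c1]; each pixel contributes one
    d-dimensional feature vector."""
    z = features[r0:r1, c0:c1].reshape(-1, features.shape[-1])
    return np.cov(z, rowvar=False)        # d x d, normalized by n - 1
```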
2.1.1. Patch Representation
While the appearance of horizons varies, their spatial layout is consistent: a smoother sky region above, in contrast to the rougher terrain with stronger variations below. Therefore, not only the intensity values, but especially the image gradients and the gradient magnitude, will show stronger variations in the terrain region in the bottom part of horizon patches. With the inclusion of the x and y coordinates in the calculation of the region covariance, a measure describing the distribution of the features within each patch is already available.
To further emphasize the spatial structure of horizon patches, each patch is represented by five individual parts (Figure 3), as already suggested by [18]. For each of the five individual regions of a patch, a covariance matrix describing the spatial–textural relationship is calculated. The concatenation of all covariance matrices of one patch then builds the basis for the classification.
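Continuing the sketch above, such a five-part descriptor could be assembled as follows. The exact split into sub-regions is defined by Figure 3; the full-patch-plus-four-quadrants layout used here is an assumption:

```python
def patch_descriptor(features, row, col, half):
    """Covariance matrices of five sub-regions of the patch centred at
    (row, col): the full patch plus its four quadrants (assumed layout;
    the actual split is defined in Figure 3)."""
    r0, r1, c0, c1 = row - half, row + half, col - half, col + half
    regions = [(r0, r1, c0, c1),      # full patch
               (r0, row, c0, col),    # top-left quadrant
               (r0, row, col, c1),    # top-right quadrant
               (row, r1, c0, col),    # bottom-left quadrant
               (row, r1, col, c1)]    # bottom-right quadrant
    return [region_covariance(features, *reg) for reg in regions]
```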
2.1.2. Classification
Each covariance matrix is a symmetric square matrix of dimension $d \times d$, with $d$ being the number of features used. Due to symmetry, each matrix contains only $d(d+1)/2$ unique values. As five regions are used for one patch, the length of the resulting feature vector is $5 \cdot d(d+1)/2$. For the $d = 8$ features of Equation (2), this amounts to 36 unique values per matrix and a feature vector of length 180.
One limitation of covariance matrices is that they are not embedded in a Euclidean space [18]. Therefore, the individual elements of the covariance matrices cannot be directly interpreted as features in a subsequent classification task. In order to overcome this limitation, various approaches have been proposed. Ref. [18] used region covariances within a nearest-neighbor classification, making use of the distance metric proposed by [23] to calculate the pairwise distances. For support-vector machines, specifically designed kernels were suggested by [24,25]; both works also experimentally evaluated the influence of directly using covariance features in contrast to considering their geometry. For random forests, the log-Euclidean transformed region covariances [26] have been used by various authors [27,28,29].
Following [26], region covariances $\mathbf{C}_R$ can be used within Euclidean computations after transformation into their matrix logarithms. One prerequisite for this approach are symmetric positive definite matrices. Per definition, $\mathbf{C}_R$ are symmetric positive semidefinite matrices, which can be transformed into positive definite matrices by adding a small constant factor $\epsilon$:

$$\tilde{\mathbf{C}}_R = \mathbf{C}_R + \epsilon \, \mathbf{I}_d \tag{3}$$

where $\mathbf{I}_d$ is the $d$-dimensional identity matrix. After transformation into a positive definite matrix, the matrix logarithm $\log(\tilde{\mathbf{C}}_R)$ can be calculated using the eigenvalue decomposition

$$\tilde{\mathbf{C}}_R = \mathbf{U} \mathbf{D} \mathbf{U}^{T} \tag{4}$$

with $\mathbf{D}$ being the diagonal matrix of the eigenvalues of $\tilde{\mathbf{C}}_R$, and finally,

$$\log(\tilde{\mathbf{C}}_R) = \mathbf{U} \log(\mathbf{D}) \mathbf{U}^{T} \tag{5}$$

where $\log(\mathbf{D})$ is the diagonal matrix of the eigenvalue logarithms. The elements of $\log(\tilde{\mathbf{C}}_R)$ can now be directly used within the random forest classification.
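The transformation of Equations (3)–(5) can be sketched with numpy as follows. The regularization value for eps and the use of the upper-triangular elements as the final feature vector are assumptions consistent with the text:

```python
import numpy as np

def log_euclidean_features(cov_matrices, eps=1e-6):
    """Map each covariance matrix to Euclidean space via its matrix
    logarithm (Equations (3)-(5)) and keep the unique upper-triangular
    elements as a flat feature vector; eps (value assumed here) makes
    the matrices positive definite."""
    parts = []
    for C in cov_matrices:
        d = C.shape[0]
        C_pd = C + eps * np.eye(d)                  # Equation (3)
        eigval, U = np.linalg.eigh(C_pd)            # Equation (4)
        logC = U @ np.diag(np.log(eigval)) @ U.T    # Equation (5)
        parts.append(logC[np.triu_indices(d)])     # d(d+1)/2 values
    return np.concatenate(parts)                    # 5 * d(d+1)/2 features
```

The resulting vectors can then be used with any standard random forest implementation (e.g., sklearn.ensemble.RandomForestClassifier).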
For training, positive patches are sampled along the ground truth horizon. To account for the imbalance of horizon and non-horizon pixels in images, negative patches are randomly sampled with a ratio of 4:1 (non-horizon:horizon).
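A hypothetical sketch of this sampling step; the minimum distance keeping negative samples away from the horizon (here 5 px) and the helper names are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_pixels(horizon_rows, shape, ratio=4):
    """Positive pixels along the ground truth horizon and `ratio` times
    as many negatives drawn away from it (4:1 non-horizon:horizon)."""
    H, W = shape
    pos = [(int(horizon_rows[j]), j) for j in range(W)]
    neg = []
    while len(neg) < ratio * len(pos):
        r, c = int(rng.integers(0, H)), int(rng.integers(0, W))
        if abs(r - horizon_rows[c]) > 5:   # margin of 5 px assumed
            neg.append((r, c))
    return pos, neg
```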
2.2. Horizon Line Detection
Following [13,14], the horizon line detection is solved by searching for the shortest path from the left to the right image border. A directed weighted graph $G = (N, E)$ is defined, with each node $n \in N$ corresponding to a pixel of the input image, and each edge $e \in E$ connecting two pixels of the image in the graph. Each edge has a weight $w(e)$. Additionally, two special nodes called “source” and “sink” are added to the graph $G$. The source node is connected to all pixels of the first image column, whereas the sink node is connected to all pixels of the last image column. For all edges connected to either source or sink, the weight is set to zero. As these edges have zero weight, they do not influence the shortest path itself; and as both special nodes are connected to all pixels of the first or last column, the shortest path can start and end at any row of the image. Therefore, no approximate location of the horizon must be known a priori.
2.2.1. Neighborhood Definition
Two different neighborhood definitions proposed in [14,16] are considered. The first one, N1, connects each pixel with its direct neighbors in the next image column (Figure 4, left). In contrast, for N3, all pixels in the next image column that are separated by up to 3 rows below or above the current pixel (Figure 4, right) are regarded as neighbors and are connected with an edge in the graph used for the shortest path calculation. According to the authors, this should allow the detected horizon to follow steeper mountain profiles.
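Both definitions can be expressed with a single helper, where k = 1 reproduces N1 and k = 3 reproduces N3 (a sketch with names of our own choosing):

```python
def next_column_neighbors(row, col, height, k=1):
    """Successors of pixel (row, col) in column col + 1; k = 1 gives
    the N1 neighborhood, k = 3 the N3 neighborhood."""
    return [(r, col + 1) for r in range(max(0, row - k),
                                        min(height, row + k + 1))]
```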
2.2.2. Edge Weight
The horizon we want to detect corresponds to the shortest path minimizing the sum of the edge weights from the left to the right image border. Therefore, the definition of the edge weights, except for those edges connecting nodes to either source or sink, is essential. The output of the random forest classifier is in the range $[0, 1]$, with 1 indicating a high probability that a pixel is part of the horizon. As we are searching for the path with the minimum sum of edge weights from the left to the right image border, the negative logarithm of the probability derived from the random forest classifier is used as the cost of the corresponding node $n$:

$$c(n) = -\log(P) \tag{7}$$

where $P$ is the horizon probability of the pixel. Therefore, a pixel with a high horizon probability corresponds to a node in the graph with a low cost, and vice versa. As $\log(0)$ is undefined, all pixels with a probability of zero are set to a small positive value prior to the logarithmic transformation. Each edge of the graph connects two pixels of the image. Hence, for the definition of the edge weights, two candidates exist: the cost of the node where the edge originates ($c(n_s)$) and the cost of the node the edge points at ($c(n_t)$). As defined in Equation (8), the cost of the target node is used:

$$w(e_{n_s \rightarrow n_t}) = c(n_t) \tag{8}$$
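Putting the pieces together, the following sketch computes the horizon from a probability map. Because every edge points to the next image column, the graph is layered, so the shortest path from source to sink can be found column by column with dynamic programming, which is equivalent to a generic shortest path search on this graph. The clamping value for zero probabilities is an assumption:

```python
import numpy as np

def detect_horizon(prob, k=3, eps=1e-9):
    """Horizon as the shortest path over a probability map `prob`
    (H x W). Node costs are -log(P) (Equation (7)); each edge carries
    the cost of its target node (Equation (8)); k = 1 gives N1 and
    k = 3 gives N3. Edges from the virtual source and to the virtual
    sink have zero weight."""
    H, W = prob.shape
    cost = -np.log(np.maximum(prob, eps))   # clamp P = 0 before the log
    dist = np.zeros(H)                      # source -> column 0, weight 0
    pred = np.zeros((H, W), dtype=int)
    for j in range(1, W):
        new = np.empty(H)
        for r in range(H):
            lo, hi = max(0, r - k), min(H, r + k + 1)
            best = lo + int(np.argmin(dist[lo:hi]))  # cheapest predecessor
            new[r] = dist[best] + cost[r, j]         # add target-node cost
            pred[r, j] = best
        dist = new
    rows = np.empty(W, dtype=int)
    rows[-1] = int(np.argmin(dist))         # sink edges have zero weight
    for j in range(W - 1, 0, -1):           # backtrack the optimal path
        rows[j - 1] = pred[rows[j], j]
    return rows                             # one horizon row per column
```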
2.3. Evaluation
A commonly used metric [14,16,30], the average vertical deviation $\Delta h$ from the ground truth horizon, is used for accuracy evaluation:

$$\Delta h = \frac{1}{m} \sum_{j=1}^{m} \left| r_j^{gt} - r_j^{est} \right| \tag{9}$$

where $m$ is the number of columns in the image, $r_j^{gt}$ is the row of the ground truth horizon in column $j$, and $r_j^{est}$ is the row of the estimated horizon in column $j$. In order to derive the statistics for the whole dataset, $\Delta h$ is calculated for each image, and the mean ($\mu_{\Delta h}$) and standard deviation ($\sigma_{\Delta h}$) are computed.
In both datasets, the ground truth horizons have no vertical or overhanging geometries, making this metric sufficient to determine accuracy. Only in cases of falsely detected horizons may a column contain multiple horizon pixels. In these rare cases, the position of the highest pixel is used.
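The per-image metric and the dataset statistics can be computed directly (a minimal sketch with our own function names):

```python
import numpy as np

def average_vertical_deviation(gt_rows, est_rows):
    """Per-image accuracy (Equation (9)): mean absolute vertical
    deviation between ground truth and estimated horizon rows, given
    one row index per image column."""
    gt = np.asarray(gt_rows, dtype=float)
    est = np.asarray(est_rows, dtype=float)
    return float(np.mean(np.abs(gt - est)))

# Dataset statistics: mean and standard deviation over per-image errors.
# errors = [average_vertical_deviation(gt, est) for gt, est in results]
# mu, sigma = np.mean(errors), np.std(errors)
```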
5. Discussion
For both datasets, cov_gray achieves the highest accuracy among the methods based on monochrome images. These results demonstrate that the region covariance is a powerful texture descriptor in the context of horizon detection. While for the CH1 dataset an average error of 3.51 px was achieved, the error for the HIST dataset is roughly 6 times larger for the best performing variant of cov_gray. Analyzing the deviation from the ground truth horizon for each image of both datasets (Table 6) shows that the increased average error is mainly caused by a larger percentage of images having an error above 10 pixels (14.4%, compared to only 6.6% for the CH1 dataset). Additionally, the maximum error for HIST (~1200 px) is approximately 10 times larger than for the CH1 dataset (~180 px). This increased error is also related to the dimensions of the images contained in the respective datasets: while all CH1 images have the same dimensions (~1024 × 768 px), the HIST images range up to ~2500 × 1800 px.
To further understand the remaining challenges, selected images from the HIST dataset with deviations larger than 10 pixels are shown in Figure 9.
Subfigures (a) and (b) show images where parts of the true horizon are barely distinguishable from the sky above due to the lack of color, low contrast, and reduced image quality. In those regions where the horizon is visible, it is also correctly detected. Since the formulation of horizon detection as a shortest path problem from the left to the right image border enforces a continuous horizon across the whole image, the search selects the next best path in those regions where the horizon is not visible.
In (c) and (d), the situation is slightly different and similar to the example shown in Figure 6: while the horizon is visible, it is completely covered in snow. Due to the visually apparent ridge lines formed by rock faces below the horizon, the detected horizon partly follows these structures. These situations might be resolved if color information were available, as was the case for the CH1 dataset.
Nevertheless, the developed method makes it possible to overcome the major challenges posed by historical terrestrial images captured in high alpine terrain (Figure 10).
More results of the horizon detection for images of the HIST dataset, including both successful and unsuccessful examples, are given in Appendix B.