The proposed sky and ground segmentation framework for the navigation vision of planetary rovers adopts two datasets, the Skyfinder dataset [31] and the Katwijk beach planetary rover dataset [47]. First, this research designs a sky and ground segmentation neural network (Section 2.2) pre-trained on a large, annotated dataset (the Skyfinder dataset). Second, this research proposes a conservative annotation method for labeling sky and ground pixels in the practical navigation vision of planetary rovers (Section 2.3); the Katwijk beach planetary rover dataset serves as the practical planetary scene. Then, this research conducts data augmentation. Finally, this research uses the augmented data to perform weak supervision on the pre-trained sky and ground segmentation network, which transfers the prior knowledge from the pre-training process into the practical navigation scenario.
Figure 2 depicts the overall framework. Semantic segmentation tasks in unstructured scenes usually lack pixel-level annotations, and the Katwijk dataset represents such unstructured scenes in this research. Traditional supervised learning is therefore difficult to apply directly to this type of task. Although methods based on multiple labelers can reduce human error in the annotation, the error is difficult to eliminate in some complex scenes, even with various labelers. On the other hand, the unstructured scenes share similar prior knowledge of sky and ground segmentation with the Skyfinder dataset, and the Skyfinder dataset is well annotated. Consequently, a weakly supervised architecture based on transfer learning becomes very promising.
2.2. Pre-Training Process: Sky and Ground Segmentation Network
The proposed sky and ground segmentation network consists of multiple convolutional networks (ConvNets). It is inspired by the U-shaped network (U-Net) [57,58,59] and the network in network (NIN) [60]. U-Net has been widely used in semantic segmentation applications [61,62].
Figure 4 depicts the structure of the proposed sky and ground segmentation network. Notably, the proposed network does not directly adopt a ready-made ConvNet architecture.
The main inspirations from U-Net [57] are in two aspects. First, the entire network (see Figure 4) adopts the overall encoder-decoder configuration [63,64]. The encoder structure has good information-mining capability: it compresses the scale in the width and height directions and expands the scale in the depth direction. This greatly increases the receptive field of the encoder, thereby enabling a large range of image-information interaction (even between regions originally located in different parts of the image). Second, U-Net constructs a "highway" from the lower-level convolution structure to the upper-level convolution structure through the concatenation (the purple arrows in Figure 4) of encoder and decoder tensors with the same scale (width and height). This ensures stronger gradient feedback, thereby avoiding the vanishing gradients caused by deep architectures.
The inspirations from NIN [60] are in two aspects. First, this research places the micro-networks (the network in network, NIN) after the scale-changing ConvNets (see the orange squares in Figure 4 and Figure 5a). The level of abstraction is low in a traditional CNN (with the typical convolutional and pooling design), whereas the NIN structure can enhance the discriminability of local patches within the receptive field [60]. Second, this research puts two 1 × 1 ConvNets in every NIN. A 1 × 1 ConvNet performs cascaded cross-channel parametric pooling on a normal convolution layer, and this structure allows complex and learnable interactions of cross-channel information. Thus, the NIN has better abstraction ability than a traditional CNN. Moreover, the 1 × 1 ConvNets can adjust the tensor channels, making the structure design more flexible.
This research merges the inspirations from U-Net and NIN to construct the modified sky and ground segmentation network, which integrates the micro-networks in a U-shaped encoder-decoder structure. The proposed architecture is named "network in U-shaped network" (NI-U-Net).
Figure 4 shows the overall structure of the proposed NI-U-Net, and Figure 5 shows the details. The black, orange, green, and blue squares refer to the input image (tensor), the 1 × 1 convolution-based micro-network (see Figure 5a), the stride-convolution-based scale reduction (see Figure 5b), and the upsampling-convolution-based scale expansion (see Figure 5c), respectively. The green and blue braces indicate the encoder part and the decoder part, respectively. The scale reduction (green squares) and the micro-networks (orange squares) appear alternately in the encoder; likewise, the scale expansion (blue squares) and the micro-networks (orange squares) appear alternately in the decoder. The purpose of this design is that the scale changes are performed by stride-convolution and upsampling, while the micro-networks (with a deeper structure) conduct the feature abstraction.
The proposed sky and ground segmentation network has the following highlights. (1) The micro-network (Figure 5a) takes an input tensor with N channels, and the number of channels decreases to N/4 after the first 3 × 3 convolution. The micro-network then applies two 1 × 1 convolutions. Finally, another 3 × 3 convolution restores the number of channels to N. Only the number of channels changes, while the image scale remains invariant. However, the first (far left) and last (far right) micro-networks are slightly different. The first micro-network takes the image as input, with three channels; its first 3 × 3 convolution does not reduce the channels to N/4 but increases them to 32. It is the first convolution of the entire network, and its purpose is to increase the number of channels (a typical operation [57,58,59]). The output of the last micro-network is the prediction (output), so its activation is not "LeakyReLU" but "sigmoid" (for binary classification). (2) The scale reduction (the green squares in Figure 4) uses convolution with a kernel size of 4 and a stride of 2 (Figure 5b). The kernel size is an integer multiple of the stride, which reduces the risk of artifacts [65]. (3) The scale expansion uses upsampling with a kernel of 2 (Figure 5c), which adopts nearest-neighbor interpolation. This research utilizes upsampling plus convolution (rather than deconvolution) to avoid the risk of artifacts [65], although this can make the network more challenging to train.
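For illustration, the following is a minimal TensorFlow/Keras sketch of the three building blocks described above (micro-network, stride-convolution scale reduction, and upsampling-convolution scale expansion). The function names and the LeakyReLU slope are assumptions for illustration, not the authors' released code.

```python
# Sketch of the NI-U-Net building blocks, assuming TensorFlow/Keras.
import tensorflow as tf
from tensorflow.keras import layers

def micro_network(x, leaky_slope=0.2):
    """NIN micro-network: only the channel count changes; spatial scale is fixed."""
    n = x.shape[-1]                                    # input channel number N
    y = layers.Conv2D(n // 4, 3, padding="same")(x)    # 3x3 conv, N -> N/4
    y = layers.LeakyReLU(leaky_slope)(y)
    y = layers.Conv2D(n // 4, 1, padding="same")(y)    # first 1x1 conv
    y = layers.LeakyReLU(leaky_slope)(y)
    y = layers.Conv2D(n // 4, 1, padding="same")(y)    # second 1x1 conv
    y = layers.LeakyReLU(leaky_slope)(y)
    y = layers.Conv2D(n, 3, padding="same")(y)         # 3x3 conv, restore N channels
    return layers.LeakyReLU(leaky_slope)(y)

def scale_reduction(x, out_channels):
    """Stride-convolution scale reduction: kernel 4, stride 2 (kernel size is an
    integer multiple of the stride, which reduces artifact risk [65])."""
    return layers.Conv2D(out_channels, 4, strides=2, padding="same")(x)

def scale_expansion(x, out_channels):
    """Nearest-neighbor upsampling (kernel 2) followed by convolution, used
    instead of deconvolution to avoid checkerboard artifacts [65]."""
    y = layers.UpSampling2D(2, interpolation="nearest")(x)
    return layers.Conv2D(out_channels, 3, padding="same")(y)
```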
The hyper-parameters of the pre-training are as follows. The pre-training uses the Adam optimizer; the learning rate is set to 0.00001; the callback limitation is 50 epochs; the batch size is 32 samples per batch. The loss function is the binary cross-entropy, which translates the segmentation task into a binary classification task (sky or ground) for each pixel and thus brings the segmentation to the pixel level.
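A sketch of this pre-training configuration is given below. The constructor `build_ni_u_net` and the data variables `train_images` and `train_masks` are hypothetical placeholders for the actual architecture of Figure 4 and the annotated Skyfinder data.

```python
# Sketch of the stated pre-training setup: Adam, lr 1e-5, binary cross-entropy,
# batch size 32, up to 50 epochs. `build_ni_u_net` is a hypothetical constructor.
import tensorflow as tf

model = build_ni_u_net()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss="binary_crossentropy",                    # per-pixel sky/ground loss
    metrics=["accuracy",
             tf.keras.metrics.BinaryIoU(target_class_ids=[0, 1]),
             tf.keras.metrics.RootMeanSquaredError()],
)
model.fit(train_images, train_masks, batch_size=32, epochs=50)
```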
This research adopts seven metrics for comparison with the related research [5,26,31,34,35], including accuracy, precision, recall, Dice (F1) score, intersection over union (IoU) [34], misclassification rate (MCR) [31], and root mean squared error (RMSE). Equations (1)–(3) depict the mathematical definitions of IoU, MCR, and RMSE, respectively. Notably, some metrics are used only for comparison with related studies and are not involved in the training process. This research uses IoU and accuracy to indicate the performance, and uses binary cross-entropy and RMSE to monitor the loss trend during the training process. Furthermore, the related advanced studies adopt various metrics to discuss their performance, so this research uses the same metrics (including precision, recall, Dice score, MCR, and RMSE) to compare the performance directly.
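Using the symbols defined below, Equations (1)–(3) take the following standard forms (a reconstruction based on the cited definitions):

$\mathrm{IoU} = \dfrac{TP}{TP + FP + FN}$ (1)

$\mathrm{MCR} = \dfrac{FP + FN}{N}$ (2)

$\mathrm{RMSE} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^{2}}$ (3)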
where $TP$, $FP$, $FN$, $N$, $C$, $y$, and $\hat{y}$ refer to the true-positive pixel number, false-positive pixel number, false-negative pixel number, total pixel number, category number, ground-truth label, and prediction, respectively.
Here is a brief description of the meaning of, and reasons for, the applied evaluation metrics. (1) Pixel-level semantic segmentation is a classification task on pixels, so "Accuracy" is a very intuitive evaluation metric. (2) "Precision" refers to the proportion of annotated sky pixels among the predicted sky pixels; it characterizes whether a large number of ground pixels are predicted as sky pixels. (3) "Recall" refers to whether the annotated sky pixels are also predicted as sky pixels. (4) Although "Accuracy" provides a very intuitive sense, it only works properly with a balanced category distribution, and the ratio of sky to ground pixels is not strictly 1:1. Therefore, the "Dice score" is a more effective metric than "Accuracy". (5) "IoU" is a very important metric in image segmentation and is widely used in general image segmentation studies. (6) "MCR" is a per-pixel performance metric; Mihail et al. [31] used it to propose the benchmark for the Skyfinder dataset. (7) The above metrics are all pixel-classification indicators, while "RMSE" provides a metric based on the Euclidean distance; a smaller RMSE indicates a better result.
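As a concrete reference, the following minimal NumPy sketch computes the three metrics of Equations (1)–(3) for binary masks (0 = ground, 1 = sky); the function and variable names are illustrative.

```python
# Sketch: computing IoU, MCR, and RMSE from binary prediction/label masks.
import numpy as np

def segmentation_metrics(pred: np.ndarray, label: np.ndarray) -> dict:
    tp = np.sum((pred == 1) & (label == 1))    # true-positive pixels
    fp = np.sum((pred == 1) & (label == 0))    # false-positive pixels
    fn = np.sum((pred == 0) & (label == 1))    # false-negative pixels
    n = label.size                             # total pixel number
    return {
        "IoU": tp / (tp + fp + fn),
        "MCR": (fp + fn) / n,                  # misclassification rate
        "RMSE": np.sqrt(np.mean((label.astype(float) - pred) ** 2)),
    }
```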
2.3. Conservative Annotation Strategy
A reasonable and efficient labeling strategy is essential for transferring the pre-trained network (Section 2.2) to the navigation vision of the planetary rover. There is no existing ground-truth sky and ground pixel annotation for the navigation vision of the planetary rover, which is a common challenge in transfer learning tasks. Image labeling is a very complicated task, and manual labeling reduces the reliability of the final result due to human errors. Figure 6 displays some difficult annotation regions in the Katwijk dataset, highlighted with red frames. Therefore, this research proposes a novel labeling method, named the conservative labeling strategy, for the navigation vision of the planetary rover. Figure 7 shows a result sample of the proposed conservative labeling strategy.
Figure 7a displays a typical sample image from the Katwijk dataset. There are roughly four boundaries that require annotation. The boundaries "1", "2", and "3" are easy to locate (they coincide with the image borders) and provide strong constraints (a higher pixel ratio). The only difficult annotation is the skyline "4". Figure 6 indicates that pixel-level annotation of skyline "4" is difficult and can introduce significant human errors. In fact, the pre-trained network achieves very accurate predictions on the Skyfinder dataset (discussed in Section 3.1). Therefore, the target of the proposed conservative labeling strategy for the Katwijk dataset is to fine-tune the pre-trained network to fit the new scene, the navigation vision of the planetary rover.
The conservative labeling strategy preferentially guarantees that the annotated skyline "4" is located inside the corresponding image region (the red or green region in Figure 7c). Then, the conservative labeling method pushes the annotated skyline as close as possible to the natural skyline without sacrificing too much annotation speed. (The annotated and natural skylines refer to the handcrafted label and the actual skyline, respectively.) Because of this conservative skyline-selection criterion, the proposed strategy is named the conservative labeling strategy. In practical operation, the conservative labeling strategy takes about one minute per image. The labeling tool used in this research is Labelme [66]. Approximately 3% of the Katwijk dataset (150 images) has been annotated in this research.
Equations (4)–(13) outline the process of the weak supervision adopted in this research. Notably, the technical terms Domain, Task, Feature Space, and Marginal Probability Distribution follow [67]. A Domain consists of samples, and a single sample is represented by a single feature $x$. All features in a Domain constitute the Feature Space $\mathcal{X}$. It is noteworthy that the Source Domain ($D_S$) and the Target Domain ($D_T$) of the transfer learning refer to the Domains of the pre-training and transfer-training sample spaces, respectively. Each Domain has two components, the Feature Space ($\mathcal{X}$) and the Marginal Probability Distribution ($P(X)$). The Marginal Probability Distribution ($P(X)$) refers to the marginal probabilities over the Feature Space ($X = \{x_1, \ldots, x_n\} \in \mathcal{X}$). $T_S$ and $T_T$ refer to the Tasks; for example, a Task can be image segmentation. Transfer learning aims to achieve the Task $T_T$ in $D_T$ using the Prior Knowledge from $D_S$ and $T_S$ [67].
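In the notation of the cited transfer-learning survey [67], a Domain and a Task take the following standard forms:

$\mathcal{D} = \{\mathcal{X}, P(X)\}, \quad X = \{x_1, \ldots, x_n\} \in \mathcal{X}$

$\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$

where $\mathcal{Y}$ is the label space and $f(\cdot)$ is the predictive function to be learned.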
Frosst and Hinton [68] claimed that a converged neural network should correspond to a Marginal Probability Distribution. In Equation (4), $f_S(x)$ and $f_T(x)$ refer to the same Task under different convergences. Convergence is a concept corresponding to the network. It is noteworthy that the Knowledge is decided by the Domain, and a converged network contains the Knowledge of the specific Domain. $f_S(x)$ applies the Prior Knowledge of $D_S$ to the sample $x$, and $f_T(x)$ refers to the Knowledge of $D_T$. To distinguish these two Knowledges, this research uses the Prior Knowledge to denote the Knowledge of the Source Domain ($D_S$), and the Converged Knowledge to denote the Knowledge of the Target Domain ($D_T$). The sample $x$ is related only to the Marginal Probability Distribution of $D_T$, and the subtraction refers to the difference between any $f_S(x)$ and $f_T(x)$ in a broad sense. This difference should not include any negative representation, so Equation (4) uses the absolute value $|\cdot|$ to express the difference between any $f_S(x)$ and $f_T(x)$ in a broad sense. Therefore, Equation (4) relates the predictions from the Prior Knowledge and the Converged Knowledge of $x$ using $\Delta(x)$, which indicates that the difference between $f_S(x)$ and $f_T(x)$ is a function of $x$. In other words, if $D_S$ and $D_T$ refer to the pre-training and transfer-training Domains, then $\Delta(x)$ refers to the difference obtained by straightforwardly applying the pre-trained model in the transfer-training scenario.
At the beginning of the transfer learning process, the Prior Knowledge prediction $f_S(x)$ should differ from the converged prediction $f_T(x)$ in $D_T$; thus, $\Delta(x)$ does not equal zero at the beginning. However, the essence of the transfer learning process is to fine-tune the Prior Knowledge from the Marginal Probability Distribution of $D_S$ to that of $D_T$. After convergence, $f_S(x)$ and $f_T(x)$ should produce the same prediction, and $\Delta(x)$ should equal zero.
The above discussions are expressed from the view of a single sample, while the Prior Knowledge and the Converged Knowledge should also be valid for the entire Domain. This research defines an Extension operator ($E$) in a broad sense, which refers to the process from a single sample $x$ to the whole sample space $X$. Equation (5) depicts an example of the Extension from $x$ to $X$. Equation (6) is obtained by conducting $E$ on Equation (4).
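Based on the descriptions above, a plausible sketch of Equations (4)–(6) is as follows; the exact symbol forms are assumptions reconstructed from the surrounding prose:

$\left| f_S(x) - f_T(x) \right| = \Delta(x)$ (4)

$E: x \rightarrow X$ (5)

$E\left(\left| f_S(x) - f_T(x) \right|\right) = \left| f_S(X) - f_T(X) \right| = \Delta(X)$ (6)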
If the transfer learning is considered an ongoing process, an intermediate feature space lies between $X_S$ and $X_T$. When this intermediate feature space approaches a point close to $X_T$, $\Delta(X)$ should be a small value close to zero, so that the difference between $f_S(X)$ and $f_T(X)$ becomes small; this can be expressed as "$\Delta(X) \rightarrow 0$". Therefore, if the annotation and learning are perfect, the Prior Knowledge represented by $f_S$ should converge to the same result as $f_T$ in $D_T$.
The above discussions assume that the difference between the Prior Knowledge and the Converged Knowledge only comes from the difference between $P(X_S)$ and $P(X_T)$. Under this assumption, $f_T$ can represent the ground-truth Marginal Probability Distribution, which is the condition of supervised learning. However, the conservatively annotated dataset is not fully supervised. Thus, the ground truth $f_T$ should be divided into two parts, $f_T^{c}$ and $f_T^{u}$, where $f_T^{c}$ refers to the weak supervision from the conservative annotations, and $f_T^{u}$ refers to the difference between $f_T^{c}$ and $f_T$. Equation (7) replaces the $f_T(x)$ in Equation (4) with $f_T^{c}(x)$ and $f_T^{u}(x)$. $f_T^{u}(x)$ straightforwardly corresponds to the unsupervised pixels in the conservative annotation. Notably, $f_T^{c}$ corresponds to the conservative annotations, and $f_T$ corresponds to the completed annotations.
Equation (8) moves the $f_T^{u}(x)$ term to the left side and packs it with $\Delta(x)$ as a new value, $\Delta'(x)$. Equation (9) performs $E$ on Equations (4) and (8) over $X$, which achieves Equation (10). Equation (11) then assumes that $\Delta'(X)$ is a very small value close to zero because of the Prior Knowledge and the Extension from $x$ to $X$. Therefore, Equation (12) can be obtained from Equations (6), (10), and (11). Equation (13) eliminates the common term from both sides in Equation (11). Equation (12) justifies the weak supervision from the theoretical aspect.
To verify the above process and assumptions, this research presents detailed experiments in Section 3.3.
2.4. Transfer Training Process: Sky and Ground Segmentation Network for the Navigation Visions of Planetary Rover
The transfer-training process is carried out on the Prior Knowledge of the pre-training process. This project proposes Hypothesis 1:
Hypothesis 1. $\Delta$ consists of two parts, the task-based loss ($\Delta_{task}$) and the environment-based loss ($\Delta_{env}$) (see Equation (14)).
The difference between the Source Domain ($D_S$) and the Target Domain ($D_T$) is the fundamental reason for $\Delta$, and it can be divided into two parts: the difference caused by the Task change and the difference caused by the Environment change. This project uses $\Delta_{task}$ to characterize the loss related to the Task, while $\Delta_{env}$ characterizes the loss related to environmental changes. Equation (6) can be transformed into Equation (15) according to Hypothesis 1, where $\Delta_{task}$ is a function of the Task and $\Delta_{env}$ is a function of the Environment.
Equation (16) represents the pre-training process based on supervised learning. The transfer-learning process is a fine-tuning process based on the Prior Knowledge of the pre-training process. Hypothesis 2 is:
Hypothesis 2. If the pre-training has obtained superior sky and ground segmentation Prior Knowledge, it is considered that $\Delta_{task}$ has approached zero (see Equation (17)).
This research uses Hypothesis 2 to assume that the pre-trained model already holds superior Prior Knowledge of recognizing the sky pixels, ground pixels, and skylines. The $\Delta$ of using the pre-trained model in the planetary rover scene then comes from the $\Delta_{env}$. Therefore, Equation (17) can be substituted into Equation (15) to obtain Equation (18). Equation (18) carries the same meaning as Equation (10), where the environment-based loss ($\Delta_{env}$) plays the role of the remaining difference term in the transfer-training process.
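Under Hypotheses 1 and 2, the key relations can be sketched as follows (Equation (16) is omitted here; the exact symbols are assumptions consistent with the notation above):

$\Delta = \Delta_{task} + \Delta_{env}$ (14)

$\left| f_S(X) - f_T(X) \right| = \Delta_{task}(X) + \Delta_{env}(X)$ (15)

$\Delta_{task}(X) \approx 0$ (17)

$\left| f_S(X) - f_T(X) \right| \approx \Delta_{env}(X)$ (18)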
It is essential to transfer the pre-trained achievement to the planetary rover scenario. Although the proposed NI-U-Net shows superior performance on the Skyfinder benchmark, Figure 3 and Figure 7a illustrate the distinctions between the Katwijk dataset and the Skyfinder dataset, which indicate a different data distribution. However, the conservative labeling method can only generate limited samples. Thus, this research first performs data augmentation and then conducts the transfer learning.
This research adopts 22 augmentation schemes, including flip, brightness adjustment, contrast adjustment, crop, rotation, and color-channel shifting (see Figure A4 for more details). The augmentation increases the sample space from 150 to 3300 images. Notably, all augmented conservative data are used for transfer training; there are no validation and testing sets in the transfer training process. Validation and testing sets aim to evaluate the overfitting rate. However, transfer learning is generally adopted to solve problems with insufficient data; if there were enough data, there would be no need for a transfer training strategy. Therefore, overfitting in transfer learning is inevitable to some extent. On the other hand, the essence of transfer learning is to fine-tune the pre-trained network with a small amount of data. Compared to dividing validation and testing sets from a small dataset, using all available data for training can achieve better transfer training performance. The quantitative evaluation of the transfer learning result should use a separate set of accurately labeled data. The transfer learning directly loads the weights of the computation graph from the pre-trained network as the initialization. The learning rate is set to 0.00001; since the starting point is already close to the eventual convergence, the learning rate should be close to the value used in pre-training. The optimizer adopts Adam. The epoch callback is set to 300 epochs, and the batch size is 15 images per batch.
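The following is a minimal sketch of the augmentation mechanism, assuming the scheme types listed above; the parameter ranges, the composition of the 22 schemes, and the `dataset` variable are illustrative assumptions. Geometric transforms must be applied identically to the image and its conservative label.

```python
# Sketch: one randomized augmentation applied to an image/label pair.
import tensorflow as tf

def augment_once(image, mask, seed):
    """Apply photometric changes to the image and geometric changes to both."""
    image = tf.image.stateless_random_brightness(image, 0.2, seed)
    image = tf.image.stateless_random_contrast(image, 0.8, 1.2, seed)
    if seed[0] % 2 == 0:                      # flip image and mask together
        image = tf.image.flip_left_right(image)
        mask = tf.image.flip_left_right(mask)
    return image, mask

# 150 conservative annotations x 22 schemes -> 3300 training pairs
augmented = [augment_once(img, msk, (i, 0))
             for img, msk in dataset for i in range(22)]
```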
This research further proposes a weakly supervised loss and accuracy for the conservative labeling method, named the conservative binary cross-entropy ($CBCE$) and the conservative binary accuracy ($CBA$). Equations (19) and (20) are the mathematical expressions of the traditional binary cross-entropy ($BCE$) and binary accuracy ($BA$), respectively. The $p$, $\hat{y}$, $y$, $C$, and $N_{FN}$ refer to the possibility, predictions, ground-truth labels, class number, and false-negative pixel number, respectively. Notably, Equations (19) and (20) express the $BCE$ and $BA$ in a function format (as in programming). For example, the "$BCE$" (see Equation (19)) has the function declaration "$BCE(y, \hat{y})$": the first formal parameter is the "$y$", and the second formal parameter is the "$\hat{y}$". Equation (20) follows the same pattern as Equation (19).
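A standard formulation consistent with this function format is sketched below; the exact normalization and symbol usage in the original equations may differ:

$BCE(y, \hat{y}) = -\dfrac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log p(\hat{y}_i) + (1 - y_i)\log\left(1 - p(\hat{y}_i)\right)\right]$ (19)

$BA(y, \hat{y}) = \dfrac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left[\operatorname{round}(\hat{y}_i) = y_i\right]$ (20)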
Pre-training is a binary classification task, but the transfer learning becomes an "incomplete" multi-classification task: there are three types of pixels in the conservative annotations (sky (red), ground (green), and unannotated (black) pixels in Figure 7). The sky and ground pixels are annotated "easy" pixels, while the unlabeled pixels are unannotated "hard" pixels. The transfer learning in this research can only rely on the "easy" pixels rather than all pixels. Equations (21) and (22) are the proposed conservative (binary) cross-entropy ($CBCE$) and conservative (binary) accuracy ($CBA$).
where $M_s$, $L_s$, $M_g$, $L_g$, $w_s$, $w_g$, $N_s$, and $N_g$ refer to the conservative sky-mask, conservative sky-labels, conservative ground-mask, conservative ground-labels, weight for the sky cross-entropy, weight for the ground cross-entropy, sky-pixel number, and ground-pixel number, respectively.
Algorithm A1 in Appendix A, together with Equations (21) and (22), explains the procedure of calculating the $CBCE$ and $CBA$ in one backpropagation through the following eight steps (a code sketch follows the list):
- (1)
The sky and ground segmentation network inputs a batch of images and outputs a prediction ($\hat{y}$), while the corresponding conservative label is $y_c$. The $y_c$ can be divided into sky, ground, and unannotated pixels using two thresholds ($t_s$ and $t_g$). (Notably, the subscripts $s$ and $g$ indicate sky and ground.)
- (2)
This research calculates the number of sky ($N_s$) and ground ($N_g$) pixels in $y_c$, while $r_s$ and $r_g$ refer to the sky and ground pixel ratios in $y_c$, respectively.
- (3)
This research produces an $M_s$ and an $M_g$, where $M_s$ has all conservatively annotated sky pixels with value one and all others with value zero, and $M_g$ has all conservatively annotated ground pixels with value one and all others with value zero.
- (4)
This research conducts a pointwise multiplication between $M_s$ and $\hat{y}$ to achieve the filtered conservative sky-pixel prediction (the "$M_s \odot \hat{y}$" term in Equation (21)), which only retains the predictions at the exact locations of the conservatively annotated sky pixels.
- (5)
This research uses the value one to pointwise-subtract $\hat{y}$ (the "$1 - \hat{y}$" term in Equation (21)) because the situations of sky and ground are opposite. This research then conducts a similar process (as in step (4)) to achieve the filtered conservative ground-pixel prediction (the "$M_g \odot (1 - \hat{y})$" term in Equation (21)).
- (6)
This research generates $L_s$ and $L_g$ to indicate the annotated sky and ground pixels only. $L_s$ has sky pixels with value one and others with zero; $L_g$ has ground pixels with value one and others with zero.
- (7)
This research calls the $BCE$ function (Equation (19)) with the inputs from step (4) and step (6) to achieve the conservative sky cross-entropy, and calls the $BCE$ function (Equation (19)) with the inputs from step (5) and step (6) to achieve the conservative ground cross-entropy. This research further adds a weight parameter "$w$" to balance the two cross-entropies.
- (8)
Equation (22) calls the $BA$ function (Equation (20)) with the inputs from step (4) and step (6) to achieve the conservative sky accuracy, and calls the $BA$ function (Equation (20)) with the inputs from step (5) and step (6) to achieve the conservative ground accuracy.
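The following TensorFlow sketch illustrates steps (1)–(8). The function name, the float mask encoding, and the single balancing weight `w` are assumptions for illustration rather than the authors' implementation; in particular, the mean in `BinaryCrossentropy` also runs over the zero-contribution unannotated pixels, which the original normalization may handle differently.

```python
# Sketch of the conservative loss (CBCE) and accuracy (CBA) computation.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def conservative_loss_and_accuracy(y_pred, sky_mask, ground_mask, w=1.0):
    # Steps (2)-(3): the masks select only the conservatively annotated pixels.
    n_sky = tf.reduce_sum(sky_mask)
    n_ground = tf.reduce_sum(ground_mask)

    # Step (4): keep predictions only at annotated sky locations.
    sky_pred = sky_mask * y_pred
    # Step (5): sky and ground are opposite, so invert before masking.
    ground_pred = ground_mask * (1.0 - y_pred)

    # Steps (6)-(7): labels are one at the annotated pixels; w balances the
    # sky and ground cross-entropy terms.
    cbce = bce(sky_mask, sky_pred) + w * bce(ground_mask, ground_pred)

    # Step (8): accuracy is counted over the annotated pixels only.
    correct = (tf.reduce_sum(tf.cast(sky_pred > 0.5, tf.float32))
               + tf.reduce_sum(tf.cast(ground_pred > 0.5, tf.float32)))
    cba = correct / (n_sky + n_ground)
    return cbce, cba
```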