1. Introduction
Planetary exploration has recently made encouraging progress and continues to attract broad community interest, driven by advances in robotics, artificial intelligence, computer vision, sensors, and space science. For example, the Zhurong and Perseverance rovers landed on Mars in 2021 and carried out many scientific missions [
1,
2], while Canada, Japan, Saudi Arabia, and Russia announced their ambitious lunar (Moon) rover projects [
3,
4,
5]. The planetary rover is an essential platform for planetary exploration, widely involved in various scientific activities, including geological exploration [
6], planetary history investigation [
7], extraterrestrial water resource exploration [
8], unknown environment perception [
9], life exploration [
10], and Mars sample return [
11]. However, terrains are significantly complicated and hazardous in these activities, possibly bringing rovers into the “mission-ending scenario” [
12]. For example, the Spirit and Opportunity Mars rovers became stuck in rugged terrain [
13,
14], the loose and granular terrains can cause their wheels to slip and sink [
15], and large rocks can block their paths. Therefore, terrain recognition is important for planetary exploration because terrains differ considerably in accessibility. 
The planetary rover is a sophisticated research platform. In this review, vision-based technology is limited to image or video signals captured by a pinhole camera, which defines the scope of this study. The sensors used for autonomous rover navigation can be classified into active and passive [
16], and vision-based technology mainly refers to passive sensors. Active sensors (such as radar, laser scanner [
17], structured light [
18], time-of-flight (ToF) [
19], etc.) are sensitive to environmental changes, heavy, large, and energy-hungry [
16,
20,
21]. Millimeter-wave radar relies on millimeter waves, which have wavelengths between 1 and 10 mm. The emitted wave is reflected back when it reaches an obstacle, and the radar uses the reflected signal to estimate the distance. However, the resolution of radar is relatively low. Lidar (laser scanner) measures the distance between the lidar sensor and the target and estimates the 3D structure of the target from the energy magnitude, frequency, and phase of the reflected spectrum. Lidar can be used in autonomous driving for distance estimation, in 3D reconstruction for obtaining accurate point clouds, and in information fusion to improve 3D estimates. Structured light projects a pattern onto the target surface and estimates the surface structure from the distortion of the pattern. ToF works on a principle similar to radar, except that light replaces the millimeter wave. The pinhole camera model is a typical example of passive sensors [
22]. Compared with active sensors, the pinhole camera is lightweight, small, and has low energy consumption. Thus, this study only surveys passive sensors.
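The ranging principle shared by radar and ToF sensors can be illustrated with a minimal sketch. It assumes the standard round-trip relation (distance equals propagation speed times echo delay, divided by two); the function name and the example delay are hypothetical and are not taken from any of the reviewed studies.

```python
# Speed of light in vacuum (m/s); exact by SI definition.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def range_from_round_trip(round_trip_time_s: float) -> float:
    """Distance to a reflecting target from the echo's round-trip time."""
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The signal travels to the target and back, hence the division by two.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


if __name__ == "__main__":
    # Hypothetical echo delay of 2 microseconds.
    print(f"{range_from_round_trip(2e-6):.3f} m")
```

The same relation holds whether the emitted signal is a millimeter wave or light; only the emitter and detector hardware differ.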
Vision-based navigation built on the pinhole camera is important because it offers a lightweight, small, and low-energy solution for rover autonomy. The pinhole camera model can be divided into mono- and multi-camera systems [
23,
24]. The mono-camera system acquires single images, while the multi-camera system acquires image pairs. Furthermore, the visual signal (image, image pair, or video) is essential for planetary rover navigation. Extracting, understanding, and delivering the information in visual signals efficiently is both attractive and challenging. Semantic segmentation is essential in visual understanding [
25], and vision-based rover autonomy relies on semantic understanding [
13]. To a machine, an image is a matrix of numbers, and visual understanding groups its pixels into clusters associated with categories. Therefore, this study limits the candidate studies to camera system-based works, whose data format is the image, image pair, or video.
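As a toy illustration of an image as a numeric matrix and of segmentation as a per-pixel category assignment, the sketch below applies a single intensity threshold. The threshold value and the labels "rock" and "soil" are hypothetical choices for illustration only; real terrain segmentation is far less trivial.

```python
from typing import List

Image = List[List[int]]     # grayscale intensities, 0-255
LabelMap = List[List[str]]  # one category name per pixel


def threshold_segment(image: Image, threshold: int,
                      bright_label: str = "rock",
                      dark_label: str = "soil") -> LabelMap:
    """Assign each pixel a category by comparing it with one threshold."""
    return [
        [bright_label if pixel >= threshold else dark_label for pixel in row]
        for row in image
    ]


if __name__ == "__main__":
    toy_image = [[10, 200], [180, 30]]  # hypothetical 2x2 image
    for row in threshold_segment(toy_image, threshold=128):
        print(row)
```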
Terrain semantic segmentation is a highly interdisciplinary topic that has been greatly influenced by recent developments in deep learning and computer vision. For example, references [
26,
27,
28] adopt image processing-based methods (superpixel and threshold), references [
29,
30,
31] apply machine learning algorithms (K-means clustering or SVM), and references [
11,
13,
32] adopt advanced neural networks (DeepLabv3, Mask R-CNN, or U-Net). However, a comprehensive discussion is lacking, which makes it difficult for future studies to choose quantitative metrics, available datasets, or qualitative demonstrations. This study reviews these studies through a methodical, replicable, and transparent survey for future research and the corresponding community.
This study conducts a rigorous systematic literature review of semantic terrain segmentation in the navigation vision of planetary rovers. The authors of [
33,
34] recommend the systematic literature review as the approach for comprehensively implementing “Evidence-based Software Engineering” (EBSE). The EBSE emphasizes an evidence-based review strategy to ensure methodological rigor [
34]. The systematic literature review is a methodical, replicable, and transparent survey that achieves robust and broad conclusions and implications by summarizing, synthesizing, and evaluating individual studies [
35]. The necessity of conducting this systematic literature review includes the following four aspects: Firstly, the topic of this study is a specific application, while the narrative review tends to survey a broad topic [
36,
37]. Secondly, the unified and precise criteria are the basis for ensuring that study selection is comprehensive and fair, while the narrative review does not require such criteria [
36,
37]. Thirdly, this study made a rigorous statistical analysis of the 
Included Study, while a narrative review focuses on qualitative discussion [
36,
37]. Finally, it is essential to draw conclusions based on the statistical analysis to guide future research, rather than relying on the authors’ subjective analysis and judgment in the narrative review [
36,
37].
The contributions of this study can be summarized as below:
- This study provides a methodical, replicable, and transparent survey for semantic terrain segmentation in the navigation vision of planetary rovers. It provides robust and broad conclusions and implications for communities by summarizing, synthesizing, and evaluating individual studies. 
- This study discusses and summarizes existing research results through a systematic literature review and accordingly identifies potential gaps and challenges for future study. 
- As far as the authors are aware, this study is the first review and the first systematic literature review on the topic of semantic terrain segmentation in the navigation vision of planetary rovers. 
The structure of this study is as follows: 
Section 2 describes the method of this systematic literature review. 
Section 3 depicts the results of the review process. 
Section 4 discusses the research questions proposed in 
Section 2 and the limitations. The conclusion is then presented in 
Section 5.
  2. Method
This study adopts a rigorous systematic literature review following the guideline in references [
33,
34] and uses references [
34,
38,
39,
40,
41] as examples. This study has two objectives: Firstly, to identify, classify, and summarize current studies. Secondly, to analyze and locate the potential gaps and opportunities for future studies.
Figure 1 depicts the flow diagram for conducting this systematic literature review. The search process is divided into three phases: the identification phase, the eligibility phase, and the inclusion phase. Firstly, the identification phase identifies the candidate studies from the three databases. Secondly, the eligibility phase applies the screening conditions according to the inclusion and exclusion criteria. The quality assessment is conducted as the last step of the eligibility phase. Finally, the inclusion phase classifies the 
Included Study into three categories. The italic “
Included Study” specifically represents the studies selected after four screenings and one quality assessment.
 This study defines three categories to classify the Included Study, and they are represented using the italic “Classical Image Processing-based Study”, “Machine Learning-based Study”, and “Deep Learning-based Study”. The classical image processing-based study does not involve a learning process. Note that this study regards the learning process as the iteration used to achieve a valid semantic segmentation. For example, neural networks use iteration to update the computation graph [
42], and SVMs apply iteration to minimize loss [
43]. Therefore, the classical image processing-based study refers to the study whose semantic segmentation model does not require such iteration. The machine learning-based study and deep learning-based study both require iteration in the learning process. Although deep learning is generally considered a sub-discipline in machine learning [
44], this study separates the deep learning-based study into an individual category using the iteration kernel. Deep learning has recently begun to dominate semantic visual segmentation [
45,
46], and 
Section 4.1.4 also indicates that the community’s attention to the deep learning-based study increased significantly. Specifically, the deep learning-based study refers to the study that uses the programmable iteration kernel, while the machine learning-based study refers to the study that applies the unprogrammable iteration kernel. For example, the neural network can be significantly different using different layer combinations and structures [
42], while the SVM can only choose a few options (linear, polynomial, and sigmoid) [
43].
  2.1. Research Question
Table 1 describes the research questions (RQ) addressed by this study. The key research question (KeyRQ) is divided into six sub-research questions (SubRQs), sub-research question 1 (SubRQ1) to sub-research question 6 (SubRQ6). This study aims to determine the current benefits and the future potential that computer vision and deep learning have and can bring to semantic terrain segmentation. The KeyRQ is framed with the guideline of the PICOC criteria (population, intervention, comparison, outcome, and context) [
33]. In the KeyRQ, the “navigation vision of the planetary rovers” refers to the population, the “computer vision and artificial intelligence” refers to the intervention, the “What...?” represents the comparison, the “achievements” stands for the outcome, and the “semantic segmentation” refers to the context.
 - 1)
- Sub-research question 1 (SubRQ1) addresses the importance of terrain segmentation in planetary explorations from motivation and impact. This study addresses the importance of terrain segmentation by summarizing the undergoing projects and clarifying current research interests and motivations worldwide. 
- 2)
- Sub-research question 2 (SubRQ2) then explores the terrains targeted in current research and the corresponding reasons. It is important to identify the research targets and the reasons behind them because planetary exploration takes place in a highly unstructured environment. 
- 3)
- Sub-research question 3 (SubRQ3) describes the data from the perspective of the sensor and data format, which addresses the research scope from the aspect of hardware and collected data. 
- 4)
- Sub-research question 4 (SubRQ4) discusses the existing solutions and their characteristics. This study identifies the drawbacks of current research and draws inspiration for potential improvement. 
- 5)
- Sub-research question 5 (SubRQ5) depicts the data availability from the perspective of data science. 
- 6)
- Sub-research question 6 (SubRQ6) addresses the evaluation metrics and corresponding state-of-the-art performance. 
  2.2. Search Strategy
The identification phase brings candidate studies into the scope of the systematic literature review as comprehensively as possible. Three databases were used, IEEE Xplore [
47], Web of Science [
48], and Scopus [
49], and the identification phase was completed in May 2022. There are two reasons for using these three databases: First, these three databases cover most of the relevant literature. Second, these three databases all provide advanced search functions based on logical operators, making retrieval comprehensive and efficient. The specific search method is the keyword search in the advanced search function.
The search commands are identical in the three databases. Each search command consists of four parallel search conditions joined by the “AND” logic operator, and these four conditions correlate to the PICOC composition of the KeyRQ (see 
Section 2.1). Note that “*” in the search command is a wildcard. For example, the search command “terrain*” matches “terrain”, “terrains”, and other spellings that begin with “terrain”.
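The wildcard behavior can be sketched with Python's standard `fnmatch` module, in which “*” likewise matches any run of characters (including none). This only illustrates the matching rule; the databases' own engines may differ in details such as case handling, and the helper name `matches` is hypothetical.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, word: str) -> bool:
    """Case-insensitive match where '*' stands for any run of characters.

    Note: fnmatch also gives '?' and '[...]' special meaning, which the
    search commands in this review do not use.
    """
    return fnmatchcase(word.lower(), pattern.lower())


if __name__ == "__main__":
    for pattern, word in [("terrain*", "terrains"),
                          ("*rock*", "bedrock"),
                          ("sky*", "skyline"),
                          ("segment*", "classification")]:
        print(pattern, word, matches(pattern, word))
```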
- (i)
- The intervention, comparison, and outcome represent the specific details in studies, which can only be achieved via full-text screening. Thus, this study left them to be investigated in the full-text screening. 
- (ii)
- The context is divided and correlated into two conditions. - (ii-a)
- The first condition restricts the candidate studies to terrain-targeted ones, corresponding to the “terrain” in the context. Note that rocks and sky are also included in the terrain category because terrain is a broad concept. The authors of [13, 50] discuss the important semantic terrain in the navigation vision of planetary rovers, claiming that various types of rocks play a critical role in planetary exploration missions. Rock can be any concept related to rocks, such as bedrock, rocks, etc. The sky also covers the non-sky area (ground) and the “skyline”. Skyline refers to the boundary between sky and non-sky regions. 
- Therefore, the first condition is searched for in the scope of the title, abstract, and keywords, and the search command is “terrain* OR *rock* OR sky*”.  
- (ii-b)
- The second condition restricts the candidate studies to segmentation-related ones, corresponding to the “segmentation” in the context. Some studies only work on terrain classification or path planning, which are not considered proper candidate studies in the search strategy. - The second condition is searched for in the scope of the title, abstract, and keywords, and the search command is “segment*”. 
 
- (iii)
- The “Population” in KeyRQ is also divided and correlated into two conditions. - (iii-a)
- The third condition restricts the candidate studies to planetary exploration-related ones, corresponding to the “planetary rovers” in the population. Some studies address autonomous car driving or moon detection, which are not considered valid candidate studies in the search strategy. - The third condition is searched for in the scope of the title, abstract, and keywords, and the search command is “planetary OR mars OR lunar OR Martian OR moon”. 
- (iii-b)
- The fourth condition scopes the candidate study into image or video data format, corresponding to the “navigation vision” in the population. - The fourth condition is searched in the scope of the full-text, and the search command is “image* OR vision OR visual”. 
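The four conditions above can be assembled into one command, sketched below. The term lists are copied from the text; the dictionary keys, the function name, and the parenthesized output format are assumptions for illustration, and the field scoping (title, abstract, and keywords versus full text) is omitted because its syntax is specific to each database.

```python
from typing import Dict, List

# Search terms as given in the four conditions; the keys are hypothetical.
CONDITIONS: Dict[str, List[str]] = {
    "terrain": ["terrain*", "*rock*", "sky*"],
    "segmentation": ["segment*"],
    "planetary": ["planetary", "mars", "lunar", "Martian", "moon"],
    "vision": ["image*", "vision", "visual"],
}


def build_query(conditions: Dict[str, List[str]]) -> str:
    """Join terms with OR inside a condition and conditions with AND."""
    groups = ["(" + " OR ".join(terms) + ")" for terms in conditions.values()]
    return " AND ".join(groups)


if __name__ == "__main__":
    print(build_query(CONDITIONS))
```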
 
  2.3. Inclusion and Exclusion Criteria
The exclusion and inclusion criteria refer to the “Eligibility phase” screening conditions in 
Figure 1. The eligibility phase uses five screenings to select the 
Included Study from the candidate studies. There are 320 candidate studies in the identification phase, consisting of 73 from IEEE Xplore, 64 from Web of Science, and 183 from Scopus.
- 1)
- Language screening: exclude non-English documents. 
Non-English documents are prone to errors in reading and comprehension. Eight candidate studies were screened out, and 312 eligible studies were left.
- 2)
- Duplication and document type screening: remove duplicated documents and keep only conference or journal publications. 
This step screened out 95 candidate studies, and 217 eligible studies were left.
- 3)
- The abstract and title screening: only screened according to the abstract and title - a)
- The screened study is not in the scope of computer vision or image processing. Computer vision is the “Intervention” in the PICOC criteria [33] for the KeyRQ. 
- b)
- The studied scenario is not planetary exploration. Planetary exploration is related to the “Population” in the PICOC criteria [33] for the KeyRQ. 
- c)
- The target is not related to terrain. - This abstract and title screening removed 125 studies, and 92 studies were left. 
 
- 4)
- The full-text screening: screened according to the full-texts, and the following four types of results are removed: - a)
- The data format is neither image nor video. Image and video are two typical data formats from the passive visual sensor, while other data formats are very different from what this study addressed. 
- b)
- The image or video is in neither color nor grayscale format. Some studies use disparity images or infrared images, which are not in line with the objectives of this review. 
- c)
- The study is not semantic terrain segmentation. Terrain segmentation refers to the “Context” in the PICOC criteria [33] for the KeyRQ. 
- d)
- The camera is not the navigation vision of the planetary rovers. Some studies use satellite or telescope images. - This step removed 58 results, and 33 results were left. 
 
- 5)
- The quality assessment screening: screened according to the quality assessment result. 
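The screening counts can be tallied with a short consistency check. The sketch below uses only the figures quoted in this section; the stage names and the function name are hypothetical. It flags any stage whose reported remainder is not the previous count minus the removals, without deciding which figure is the correct one.

```python
from typing import Dict, List, Tuple

# Candidate studies per database, as reported for the identification phase.
IDENTIFIED: Dict[str, int] = {"IEEE Xplore": 73, "Web of Science": 64, "Scopus": 183}

# (stage, number removed, number reported as remaining), as quoted in the text.
STAGES: List[Tuple[str, int, int]] = [
    ("language", 8, 312),
    ("duplication and document type", 95, 217),
    ("abstract and title", 125, 92),
    ("full text", 58, 33),
]


def check_funnel(start: int,
                 stages: List[Tuple[str, int, int]]) -> List[Tuple[str, int, int]]:
    """Return (stage, computed, reported) wherever the two counts differ."""
    mismatches = []
    remaining = start
    for name, removed, reported in stages:
        computed = remaining - removed
        if computed != reported:
            mismatches.append((name, computed, reported))
        remaining = reported  # continue from the reported figure
    return mismatches


if __name__ == "__main__":
    total = sum(IDENTIFIED.values())
    print("identified:", total)
    print("mismatches:", check_funnel(total, STAGES))
```

With the figures as quoted, the first three stages reconcile, while the full-text stage is flagged (92 minus 58 gives 34 against the quoted 33), which is worth checking against Figure 1.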
  2.4. Quality Assessment
This study conducted the quality assessment following the guidance in reference [
51], and the quality assessment criteria can be found in 
Appendix A of reference [
51]. Five elements are assessed, the “theory robustness”, the “implication for practice”, the “methodology, data supporting arguments”, “generalizability”, and the “contribution plus a short statement summarizing the article’s contribution” [
51]. There are five levels to indicate the assessment results:
- “0” level stands for “Absence”, which refers to “the article does not provide enough information to assess this criterion”. 
- “1” level stands for “Low”. 
- “2” level stands for “Medium”. 
- “3” level stands for “High”. 
- “Not applicable” level stands for “This element does not apply to the document or study”. 
  2.5. Data Collection
The following information was extracted from every study involved in the search process. The records of the extracted information were described in brackets. The “or”, “and”, and “others” refer to single choice, multiple-choice, and omitted details, respectively.
- The title 
- Digital object identifier (DOI) 
- The authors 
- The country of the corresponding author 
- The publication time by year 
- Publication type (conference, journal, or book) 
- The source databases (IEEE Xplore, Web of Science, and Scopus) 
- Main research topics (computer vision, image processing, planetary exploration, semantic segmentation) 
- Studied targets (terrain, rock, soil, craters-related terrain, hazard/safe area, obstacle, horizon/skyline, shadow, sample tube, sky/ground, slippage, wheel sinkage, unknown, and others) 
- Data format (color image, gray image, infra-red spectrum image, or depth image) 
- The research data source (customized dataset, specific public dataset, or unknown) 
- The number of images in the research data source 
- Sensor type (whether it is the rover navigation camera) 
- Camera model (stereo camera, mono camera, or unknown) 
- Classification of the research method according to Figure 1 (classical image processing-based study, machine learning-based study, or deep learning-based study) 
- The details of the corresponding solution (edge detection, Canny operator, DeepLabv3+ model, superpixel, support vector machine, and others) 
- The applied metrics and corresponding quantitative results 
- The qualitative results (yes or no) 
The data collection follows the suggestions from references [
33,
34] and uses references [
34,
38,
39,
40,
41,
52,
53] as examples.
  4. Discussion
This section describes the potential research gaps and challenges through the KeyRQ and SubRQs. Firstly, 
Section 4.1 extensively discusses the six SubRQs. Then, 
Section 4.2 depicts the answer to the KeyRQ. Finally, the limitations of this study are addressed in 
Section 4.3.
  4.1. The Answer to the Sub-Research Questions
  4.1.1. SubRQ1: Why Is Terrain Segmentation Important for Planetary Explorations?
Semantic terrain segmentation is a basic function for planetary exploration missions that underpins many practical applications. 
Table 7 divides the field of application in the 
Included Study into four categories: navigation, geological analysis, exploration efficiency, and other particular purposes (finding water or returning Mars samples).
More than half of the 
Included Study addressed navigation. Current planetary rovers rely mostly on remote control from Earth bases, while autonomous navigation systems work only under limited conditions and for limited periods [
13]. As the mission distance increases, planetary rovers increasingly require safer, more real-time, and more accurate navigation systems [
29,
86,
87]. According to experience with Earth-based navigation, intelligent navigation relies heavily on semantic information [
13]. Therefore, exploring semantic segmentation technology for planetary rovers is essential. The navigation category covers the broad scopes of regular rover navigation, path planning, obstacle avoidance, and autonomous navigation.
Geological analysis is another critical mission for planetary exploration, and 
Table 7 indicates that about 20% of the 
Included Study focused on geological analysis. For example, studying the geology of other planets can help us understand the history and development of the Earth [
28]. Some studies analyze geological information to trace the existing water on the planet [
8]. However, most geological analyses are conducted manually through remote communication [
28]. The data acquisition speed has increased dramatically and is faster than the manual analysis speed [
28]. Some geological information is sequential and may appear at different locations and times within a rover mission [
28]. Thus, it is easy to miss important geological information during planetary rover operations [
59]. One solution is to analyze the data automatically, and another is to select the important data and filter out the unimportant data for human researchers. Both rely strongly on semantic information. What counts as “important” depends on the specific mission, for example, rocks for reference [
30] and water for reference [
8].
  4.1.2. SubRQ2: What Targets Does Current Research Pay Attention To?
The studied targets in the 
Included Study involve the sample tube, terrains, obstacle, skyline, sky (and ground), and rock (the sample tube is the target of reference [
11] in the 
Included Study). 
Figure 6 illustrates the ratio of these studied targets, and 
Table 8 depicts the reference list to corresponding targets in 
Figure 6. “Terrains” refers to studies focusing on multiple terrain types instead of one. “Obstacles” refers to obstacle avoidance, which only concerns whether the rover can pass the target. Skyline is a similar target to the sky (and ground). Sky and ground are two common semantic labels in planetary exploration, and the boundary between sky and ground is the skyline. The skyline can be used to identify the rover’s location by matching skylines and measuring the rover’s position. Sky and ground regions can also be used for further processing. Rock is a very common target in the 
Included Study, and 54% of studied targets are rocks.
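The relation between a sky/ground segmentation and the skyline can be sketched as follows. The example assumes a binary mask in which `True` marks sky pixels and row 0 is the top of the image; the function name and the toy mask are hypothetical, and the sketch is not the method of any particular reviewed study.

```python
from typing import List, Optional


def skyline_rows(sky_mask: List[List[bool]]) -> List[Optional[int]]:
    """For each column, the row of the first non-sky pixel from the top.

    Returns None for a column that contains only sky.
    """
    if not sky_mask:
        return []
    height, width = len(sky_mask), len(sky_mask[0])
    rows: List[Optional[int]] = []
    for col in range(width):
        boundary = None
        for row in range(height):
            if not sky_mask[row][col]:
                boundary = row
                break
        rows.append(boundary)
    return rows


if __name__ == "__main__":
    toy_mask = [
        [True, True, True],
        [True, False, True],
        [False, False, True],
    ]
    print(skyline_rows(toy_mask))
```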
Although 54% of the 
Included Study targeted rock, rock segmentation is still challenging. Rocks have significantly different appearances, and it is challenging to use unified properties to identify the background and rocks [
59]. However, identifying the rocks in the navigation vision is essential for path planning and geological analysis. The shape, weathering, and location of rocks contain information on the environmental properties and historical processes. Therefore, it is necessary to segment rocks to identify their geological properties [
59].
  4.1.3. SubRQ3: What Have Visual Sensors Been Applied to for Obtaining Data?
Sensors can be classified into two categories, exteroceptive and proprioceptive sensors [
13,
88]. The exteroceptive sensors conduct localization using data from the surrounding environment. They include global navigation satellite systems, range sensors, vision sensors, 3D to 2D perspective projection, and vehicular network sensors, and most of them are not suited for planetary rover autonomy. (1) Planetary rover exploration is a global navigation satellite system-denied scenario [
89]. (2) The range sensors (like laser scanners and radar) are heavy and have high energy consumption, which increases the load on the planetary rover. (3) The 3D to 2D perspective projection relies on knowledge of the camera parameters and transformation matrix, which is only measurable when the image is in focus. Planetary exploration takes place in a complex environment with challenging illumination, noise, and reflection conditions, which decreases the reliability of 3D to 2D perspective projection. In contrast, the vision sensor (like a camera system) is lightweight, has low energy consumption, and operates robustly, and this study only addresses vision-based sensors (camera system).
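The dependence of the 3D to 2D perspective projection on camera parameters can be made concrete with the ideal pinhole model. The sketch below assumes a point already expressed in the camera frame, square pixels, and no lens distortion; the parameter values are hypothetical.

```python
from typing import Tuple


def project_point(point_cam: Tuple[float, float, float],
                  focal_px: float,
                  principal_point: Tuple[float, float]) -> Tuple[float, float]:
    """Project a 3D point (camera frame, metres) to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    cx, cy = principal_point
    # Ideal pinhole model: scale by focal length over depth, then shift
    # by the principal point.
    u = focal_px * x / z + cx
    v = focal_px * y / z + cy
    return u, v


if __name__ == "__main__":
    # Hypothetical camera: 500 px focal length, 640x480 image centre.
    print(project_point((1.0, 0.5, 2.0), 500.0, (320.0, 240.0)))
```

If the focal length or principal point is poorly known, every projected pixel shifts accordingly, which is why the reliability of this projection depends on well-calibrated parameters.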
The proprioceptive sensors rely on internal measurements (such as velocity and steering angle). The proprioceptive sensors comprise vehicle motion sensors and inertial sensors. They are used in data fusion to support the localization task. However, this review concentrates on visual semantic segmentation, so the proprioceptive sensors are not within its scope.
Therefore, the sensors are limited to the stereo camera and monocular camera, which are passive optical camera systems. 
Figure 7a indicates the distribution of sensors in the 
Included Study, and 
Table 9 depicts the study of the distribution by camera model as in 
Figure 7a. The stereo image pairs refer to the data obtained from the stereo camera system, while the monocular images refer to the data from the monocular camera. Furthermore, 80% of the 
Included Study used monocular images, and only 20% used stereo image pairs. The multi-camera system has higher power consumption and device weight, and this distribution indicates that the navigation system of the planetary rovers would most likely prefer a mono-camera system over a stereo one.
Figure 7b and the “F_data” column in 
Table 2 further analyze the distribution by the image type in the 
Included Study, and 
Table 10 indicates the study of the distribution by image format in 
Figure 7b. A grayscale image has a single channel, while a color image has three channels (red, green, and blue). The depth image comes from the stereo camera and can be calculated from the disparity image. Sixty-three percent of the 
Included Study used the grayscale image because most information of the visual signal can be well-contained using grayscales. Depth image usually requires considerable memory and computation power, and only 3% of the studies applied it.
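The step from a disparity value to a depth value can be sketched with the standard relation for a rectified stereo pair, where depth equals focal length times baseline divided by disparity. The rectification assumption, the function name, and the example numbers are not taken from the reviewed studies.

```python
def depth_from_disparity(disparity_px: float,
                         focal_px: float,
                         baseline_m: float) -> float:
    """Depth (metres) of a pixel in a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    # Larger disparity means the point is closer to the cameras.
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Hypothetical rig: 500 px focal length, 0.2 m baseline, 10 px disparity.
    print(depth_from_disparity(10.0, focal_px=500.0, baseline_m=0.2))
```

Applying this per pixel turns a disparity image into a depth image, which adds one floating-point value per pixel to store and compute.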
   4.1.4. SubRQ4: What Solution Does Current Research Have?
This section further classifies the methods in the included study as the classical image processing-based study, machine learning-based study, and deep learning-based study. The classical image processing-based study refers to the traditional methods. This review classifies any study that applies neither a machine learning nor a deep learning method as a classical image processing-based study. The defining attribute of machine learning and deep learning is learning, which corresponds to the iteration process. In other words, an image processing method involves no iteration for the purpose of segmentation. Deep learning is usually considered a subfield of machine learning. However, since Hinton proposed the deep belief networks (DBN) in 2006 [
90], the deep network-based method has accomplished significant achievements in various sectors [
91,
92,
93,
94]. Therefore, this review separates the deep learning-based study as an individual category. This review distinguishes the machine learning-based study from the deep learning-based study through the operation kernel of the solution. If the kernel requires programming, the study belongs to the deep learning-based study. Otherwise, it belongs to the machine learning-based study.
The “S_cate” and “Algorithm” columns in 
Table 2 list the statistical results of the method classification and the specific algorithm used in the corresponding 
Included Study. The “ip”, “ml”, and “dl” in the “Solution category” column refer to the 
Classical Image Processing-based Study, 
Machine Learning-based Study, and 
Deep Learning-based Study, respectively. Note that the “SIFT”, “SVM”, “TDEL”, and “UNet” in the “Algorithm” column refer to the scale-invariant feature transform, support vector machine, template dilatation edge linking [
71], and U-shaped network [
74], respectively.
Figure 8 illustrates the relationships and classification rules among the three proposed categories for the 
Included Study in 
Section 2. Thus, 
Figure 8 firstly applies the discriminant condition of “whether the 
Included Study applied the iteration process for the terrain segmentation purpose?” The “No” studies go to the “
Classical Image Processing-based Study”, and the “Yes” studies introduce the second discriminant condition. The second discriminant condition is “whether the iteration kernel of the 
Included Study is programmable?” The “No” studies go to the “
Machine Learning-based Study”, and the “Yes” studies go to the “
Deep Learning-based Study”. It is noteworthy that the classification rules for the proposed three categories in 
Figure 1 and 
Figure 8 are only valid under the conditions of this review.
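The two discriminant conditions described for Figure 8 amount to a small decision rule, sketched below. The function and argument names are hypothetical; the category names and the order of the two questions follow the text.

```python
def classify_study(uses_iteration: bool, kernel_programmable: bool) -> str:
    """Apply the two discriminant conditions of this review in order."""
    # First condition: is an iteration process applied for segmentation?
    if not uses_iteration:
        return "Classical Image Processing-based Study"
    # Second condition: is the iteration kernel programmable?
    if kernel_programmable:
        return "Deep Learning-based Study"
    return "Machine Learning-based Study"


if __name__ == "__main__":
    print(classify_study(uses_iteration=False, kernel_programmable=False))
    print(classify_study(uses_iteration=True, kernel_programmable=False))
    print(classify_study(uses_iteration=True, kernel_programmable=True))
```

The second argument is only consulted when the first condition is met, mirroring the order of the two questions in the figure.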
 Figure 9 illustrates the accumulation trend of 
Classical Image Processing-based Study, 
Machine Learning-based Study, and 
Deep Learning-based Study in the 
Included Study. The specific publication date can be found in 
Table 2. The number of 
Classical Image Processing-based Studies increased in 1998, 1999, 2000, 2011, 2012, 2015, and 2016, while the slopes are slight. The number of 
Machine Learning-based Studies increased from 2004 to 2013, and the slopes are more significant. The number of 
Deep Learning-based Studies has rapidly increased since 2018. 
Figure 9 indicates that the extension of machine learning and deep learning technologies to planetary exploration has lagged. The DBN was proposed in 2006, while related studies only began to adopt deep learning rapidly 12 years later (in 2018). Although the accumulated number of studies using deep learning is lower than for the 
Machine Learning-based Study, their increasing slope is significantly high. 
Deep Learning-based Studies caught up with the number of 
Classical Image Processing-based Studies in only four years. It is reasonable to expect that the 
Deep Learning-based Study will contribute further to semantic terrain segmentation.
 The learning-based studies can be divided into supervised and unsupervised learning [
95]. The training process of supervised learning depends on the difference between the prediction and ground-truth label, and the loss function measures their difference. In contrast, unsupervised learning is usually used in difficult-to-label cases. The unstructured environment in planetary exploration is difficult to label, which seems to be suited for unsupervised learning. Twenty-six percent of the 
Included Study used unsupervised learning (K-means clustering and SVM), but their performance is not competitive with that of supervised learning (see 
Table 3 for the method of the 
Included Study). For example, the precision in reference [
59] is only about 65%, while reference [
65] achieved precision of more than 99% (see 
Table 3 for the quantitative results). Rock-based terrain segmentation is a typical difficult case in an unstructured environment, highly influenced by irregular and changing rock texture, size, and outline. Supervised learning requires many pixel-level labels, but manual annotation is inefficient and prone to human error. Therefore, it is promising to utilize transfer learning and weak supervision. For example, the transfer learning in reference [
65] used synthetic data to acquire prior knowledge, after which only a small amount of labeling is required to fine-tune that prior knowledge. Moreover, reference [
67] utilized weak supervision to significantly decrease human error and labeling difficulty by using the proposed “conservative annotation method” in combination with transfer learning.
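To make the contrast with supervised learning concrete, the following minimal sketch clusters pixel intensities with K-means and uses no labels at all. It is an illustration under stated assumptions, not the method of any Included Study: the function name `kmeans_1d`, the use of raw intensity as the only feature, the evenly spaced initial centers, and the toy patch are all hypothetical.

```python
import numpy as np

def kmeans_1d(values, k=2, iters=20):
    """Plain K-means on scalar pixel intensities (no labels are used)."""
    values = np.asarray(values, dtype=float).ravel()
    # Assumed deterministic initialisation: spread centers over the intensity range.
    centers = np.linspace(values.min(), values.max(), k)
    labels = np.zeros(values.shape, dtype=int)
    for _ in range(iters):
        # Assignment step: each pixel joins its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels, centers

# Hypothetical 3x4 grayscale patch: dark soil (~0.2) and a bright rock (~0.8).
patch = np.array([[0.20, 0.22, 0.80, 0.82],
                  [0.18, 0.21, 0.79, 0.81],
                  [0.19, 0.20, 0.22, 0.78]])
labels, centers = kmeans_1d(patch, k=2)
segmentation = labels.reshape(patch.shape)  # cluster 1 = brighter pixels
```

Because no ground truth guides the clusters, the result depends entirely on how well the chosen feature separates the terrain classes, which is one plausible reason such methods struggle on irregular rock textures.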
  4.1.5. SubRQ5: What Data Have Been Used?
Data are one of the essential driving powers for artificial intelligence technologies [
96]. Data are also an essential factor for research reimplementation. The “S_data” and “N_img” columns in 
Table 2 list the source data used and the number of images, respectively. 
Figure 10a uses a pie chart to classify the datasets used as open, private, or unknown, and 
Table 11 lists the studies by dataset type, corresponding to 
Figure 10a. The open dataset refers to data that are available online, while the private data refers to source data that are not available to the public. Sixty percent of the data utilized in the 
Included Study were open datasets or were built from open source data. 
Figure 10b further analyzes the distribution by the number of images in the source data. 
Table 12 lists the studies by the number of images, corresponding to 
Figure 10b. Fifty-three percent of the open-source datasets have fewer than 1000 images, indicating that the data available for planetary exploration research are limited. Although past planetary rovers provided many images and videos, these are unlabeled raw data, which are difficult to use directly.
Figure 11 illustrates the usage of the open datasets in the 
Included Study, and 
Table 13 lists the studies that used each of the open datasets in 
Figure 11. The NASA image set [
97] is the most popular dataset, while the European Space Agency (ESA) Katwijk beach planetary rover navigation dataset [
98] is the second most popular. The data in the NASA image set are individual images, while the ESA Katwijk dataset provides navigation video.
   4.1.6. SubRQ6: What Metrics Have Been Utilized for Evaluation?
The “Quanti.” column in 
Table 2 lists the evaluation metrics in the 
Included Study, including accuracy, precision, recall, Dice score (F1), IoU, and inference time. The qualitative results refer to visualizations, which give readers an intuitive impression. The “Quali.” column in 
Table 2 indicates whether qualitative results are reported: “Yes” and “No” mean that the qualitative results are provided and not provided, respectively. 
Table 3 lists the numerical results for the “Quanti.” column in 
Table 2. It is noteworthy that the absolute values of these quantitative results are not comparable because they were obtained under different environments, source data, and experimental settings.
Equations (1)–(5) refer to accuracy, precision, recall, Dice score (F1), and IoU, respectively [
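95]. The character “N” refers to the number of samples in the corresponding category. The subscripts “TP”, “TN”, “FP”, and “FN” refer to the true-positive, true-negative, false-positive, and false-negative categories, respectively. The “T” and “F” stand for “true” and “false”, i.e., whether the prediction agrees with the ground-truth label, while “P” and “N” stand for the “positive” and “negative” predicted classes. The accuracy represents the rate of correct predictions among all samples. The precision refers to the rate of correct predictions among the samples predicted as positive. The recall is the rate of correct predictions among the samples that are positive in the ground truth. The IoU is a popular metric in image segmentation research.
$$\mathrm{Accuracy} = \frac{N_{TP} + N_{TN}}{N_{TP} + N_{TN} + N_{FP} + N_{FN}} \tag{1}$$
$$\mathrm{Precision} = \frac{N_{TP}}{N_{TP} + N_{FP}} \tag{2}$$
$$\mathrm{Recall} = \frac{N_{TP}}{N_{TP} + N_{FN}} \tag{3}$$
$$\mathrm{Dice} = \frac{2 N_{TP}}{2 N_{TP} + N_{FP} + N_{FN}} \tag{4}$$
$$\mathrm{IoU} = \frac{N_{TP}}{N_{TP} + N_{FP} + N_{FN}} \tag{5}$$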
Accuracy (Equation (1)) indicates the proportion of correct predictions among all pixels and is an overall indicator for terrain and background predictions. Precision (Equation (2)) indicates the correct ratio within the predicted terrain pixels, and recall (Equation (3)) indicates the correct ratio within the terrain pixels in the ground truth. The Dice score (Equation (4)) uses one value to cover both recall and precision; if either recall or precision is small, the Dice score is also small. IoU (Equation (5)) penalizes a skewed prediction in which all pixels are predicted as terrain to achieve high recall.
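As a worked illustration of the five metrics, the sketch below computes them from binary masks using their standard confusion-matrix definitions. The function name `segmentation_metrics`, the convention that empty ratios are reported as 0, and the toy 4x4 mask are assumptions made for this example only; they are not taken from any Included Study.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Accuracy, precision, recall, Dice, and IoU for binary masks (1 = terrain)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    n_tp = int(np.sum(pred & truth))    # predicted terrain, truly terrain
    n_tn = int(np.sum(~pred & ~truth))  # predicted background, truly background
    n_fp = int(np.sum(pred & ~truth))   # predicted terrain, truly background
    n_fn = int(np.sum(~pred & truth))   # predicted background, truly terrain

    def ratio(num, den):
        # Assumption: a ratio with an empty denominator is reported as 0.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(n_tp + n_tn, n_tp + n_tn + n_fp + n_fn),
        "precision": ratio(n_tp, n_tp + n_fp),
        "recall": ratio(n_tp, n_tp + n_fn),
        "dice": ratio(2 * n_tp, 2 * n_tp + n_fp + n_fn),
        "iou": ratio(n_tp, n_tp + n_fp + n_fn),
    }

# Hypothetical ground truth: 4 terrain pixels out of 16.
truth = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
# Skewed prediction: every pixel is labeled terrain.
all_terrain = np.ones_like(truth)
skewed = segmentation_metrics(all_terrain, truth)
```

For this skewed prediction the recall is 1.0, while the precision and IoU are both 0.25 and the Dice score is 0.4, which shows why a single high metric should not be read in isolation.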
  4.2. The Answer to the KeyRQ: What Achievements Do Computer Vision and Artificial Intelligence Bring to the Terrain Segmentation in the Navigation Vision of Planetary Rovers?
This study summarizes the answer to the KeyRQ in the following four aspects: data, solution, application, and performance.
- 1)
- Three prior open datasets and four new datasets are found in the Included Study as a result of advances in computer vision and artificial intelligence, and the new datasets bring more inspiration and possibilities to future studies. The prior datasets are the NASA image album [99], the ESA Katwijk beach planetary rover navigation dataset [98], and the Devon Island rover navigation dataset [100]. The newly proposed datasets are the conservative annotation dataset [65], the synthetic rock segmentation dataset [67], the generated OAISYS dataset [66], and the Mars-Seg dataset [70]. It is noteworthy that the newly proposed datasets all used the prior datasets as sources to create new data. 
- 2)
- The computer vision and artificial intelligence findings are widely utilized in the Included Study (see Table 3 for details). K-means clustering, the Deeplab family, the U-Net family, the Mask-R-CNN family, and classical image processing algorithms made considerable contributions to the semantic terrain segmentation topic. 
- 3)
- The Included Study is used for many practical applications. For example, reference [29] applied K-means clustering to ensure safe wandering for the planetary rover; reference [11] utilized Mask-R-CNN to support the Mars sample return mission; and reference [28] used the mean-shift algorithm for geological analysis. 
- 4)
- The Included Study claimed superior performance by applying different computer vision and artificial intelligence technologies. Table 3 describes the details from the perspective of metrics; accuracy and IoU are the most used criteria for performance. For example, reference [67] achieved an accuracy of 99.58% by applying the modified U-Net++, and reference [66] achieved an IoU of 0.9066 for the sky region. 
  4.3. Challenges and Corresponding Research Gaps
The challenges for semantic segmentation in the navigation vision of planetary rovers lie mainly in the following three aspects.
- 1)
- Data with pixel-level annotation are insufficient. Although much data on planetary rover navigation vision exist, most are unannotated raw images or videos. It is difficult to use these images and videos effectively, considering the unsatisfactory performance of unsupervised and self-supervised solutions. 
- 2)
- The pixel-level accuracy of semantic segmentation needs to be improved. Pixel-level accuracy is a broad idea, and no unified metric for its evaluation exists in current studies. Nevertheless, the pixel-level accuracy in most studies is not ideal, even though it is highly important for further functionalities built on semantic segmentation. For example, errors in obstacle contours can greatly affect the safety of path planning, and errors in rock detection may lead to misjudging the hazards that rocks pose to wheels. 
- 3)
- Existing methods lack discussion of real-time performance on onboard hardware. Some studies have obtained excellent segmentation accuracy, but they were all tested on offline hardware. The results of existing studies are still far from practical rover applications. 
Accordingly, the research gaps follow from the corresponding challenges:
- 1)
- An open, pixel-level annotated dataset based on real-world images is highly required, which may involve considerable effort and time in data annotation. 
- 2)
- Based on the open dataset, a unified metric for the evaluation benchmark is needed, which can form a standard for comparison among related studies. 
- 3)
- The onboard hardware test is essential for evaluating the practical performance of the corresponding solution. 
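The onboard test called for in the third gap ultimately reduces to measuring per-frame inference time on the target processor. The following is a minimal sketch of such a measurement, assuming a generic callable model; `measure_latency`, `dummy_segmenter`, the warm-up count, and the synthetic frames are hypothetical placeholders, and a real evaluation would run the actual segmentation model on the rover-class hardware.

```python
import statistics
import time

def measure_latency(infer, frames, warmup=2):
    """Return per-frame latencies in milliseconds for a callable `infer`."""
    for frame in frames[:warmup]:
        infer(frame)  # warm-up runs are excluded from the statistics
    latencies = []
    for frame in frames:
        start = time.perf_counter()
        infer(frame)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

def dummy_segmenter(frame):
    # Hypothetical stand-in for a segmentation model: threshold each pixel.
    return [[1 if px > 0.5 else 0 for px in row] for row in frame]

# Ten synthetic 64x64 frames with intensities in [0, 0.9].
frames = [[[0.1 * ((r + c) % 10) for c in range(64)] for r in range(64)]
          for _ in range(10)]
latencies = measure_latency(dummy_segmenter, frames)
summary = {"mean_ms": statistics.mean(latencies), "max_ms": max(latencies)}
```

Reporting the maximum alongside the mean is a deliberate choice here, since worst-case latency matters more than the average for a rover that must react to hazards in time.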
  4.4. Limitations of This Study
The limitations of this systematic literature review arise from the following three points, based on the guidelines in references [
33,
34]:
- 1)
- The manual identification step is conducted in the identification step of the search process, which is recommended for software technology evaluation. 
- 2)
- The candidate studies are identified by a single researcher, while the research questions, search strategy, exclusion criteria, and quality assessment are reviewed by other researchers. 
- 3)
- The definitions of the “Classical Image Processing-based Study”, “Machine Learning-based Study”, and “Deep Learning-based Study” are proposed only in this study; they are not common practice, but they highlight the impact of technologies on terrain segmentation. 
The first point indicates that some relevant research might not be identified by the search strategy, especially studies that are archived in national journals or conferences and studies that are not written in English. Therefore, this study should be regarded as a systematic literature review of the major international journals and conferences written in English.
The second point implies that the search fields chosen for different keywords might introduce some bias. For example, regarding the keywords “terrain*”, “*rock*”, and “sky*”, the search fields of IEEE Xplore, Web of Science, and Scopus are set to “All Metadata”, “AB (abstract)”, and “TITLE-ABS-KEY (title, abstract, and keywords)”, respectively. The three databases have different settings for the search field, and the choice of search field was made by a single researcher and reviewed by another researcher.
The definitions of the “
Classical Image Processing-based Study”, “
Machine Learning-based Study”, and “
Deep Learning-based Study” are introduced in 
Section 1 and 
Section 4.1.4. However, image processing, machine learning, and deep learning are not independent topics in common practice, so the boundaries between them cannot always be drawn precisely.