Remote-Sensing Data and Deep-Learning Techniques in Crop Mapping and Yield Prediction: A Systematic Review

Reliable and timely crop-yield prediction and crop mapping are crucial for food security and decision making in the food industry and in agro-environmental management. The global coverage, rich spectral and spatial information and repetitive nature of remote sensing (RS) data have made them effective tools for mapping crop extent and predicting yield before harvesting. Advanced machine-learning methods, particularly deep learning (DL), can accurately represent the complex features essential for crop mapping and yield prediction by accounting for the nonlinear relationships between variables. DL algorithms have attained remarkable success in different fields of RS and their use in crop monitoring is also increasing. Although a few reviews cover the use of DL techniques in broader RS and agricultural applications, they make only a small number of references to RS-based crop-mapping and yield-prediction studies. A few recent reviews attempted to provide overviews of the applications of DL in crop-yield prediction. However, they did not cover crop mapping and did not consider some of the critical attributes that reveal the essential issues in the field. This study is one of the first in the literature to provide a thorough systematic review of the important scientific works related to state-of-the-art DL techniques and RS in crop mapping and yield estimation. This review systematically identified 90 papers from databases of peer-reviewed scientific publications and comprehensively reviewed the aspects related to the employed platforms, sensors, input features, architectures, frameworks, training data, spatial distributions of study sites, output scales, evaluation metrics and performances. The review suggests that multiple DL-based solutions using different RS data and DL architectures have been developed in recent years, thereby providing reliable solutions for crop mapping and yield prediction. However, challenges related to scarce training data, the development of effective, efficient and generalisable models and the transparency of predictions should be addressed to implement these solutions at scale for diverse locations and crops.


Introduction
The global population is growing rapidly and the demand for food is increasing with it. Food production needs to increase by 50% compared with 2013 to meet the needs of approximately 9.1 billion people in 2050 [1]. Sustainable agricultural management has become essential to addressing this challenge. Reliable and timely information on crop type and yield, along with their spatial components, plays an important role in sustainable agricultural-resource management [2]. The accurate mapping of heterogeneous agricultural landscapes is an integral part of agricultural management. Crop maps are useful for [...] The aim of this study is to summarise the existing research related to crop mapping and yield prediction using DL and RS and to highlight the gaps between them. This review aims to reduce the barriers to implementing these solutions and provides recommendations for future research directions.

Overview of Deep Learning (DL)
Deep learning is an ML method inspired by the structure of the human brain [39]. It involves the training of neural networks with many layers. Machine learning is a branch of artificial intelligence (AI) that allows a computer to perform tasks by learning from a large amount of data without being explicitly programmed. Machine learning is valuable when the relationships between variables cannot be efficiently described using the traditional linear, deterministic model-building approach [54]. In DL, multiple layers learn data representations at various abstraction levels. Complex functions can be learned using DL with sufficient data and many layers that represent features at various abstractions [45]. The mainstream DL models include the multilayer perceptron (MLP) [55], convolutional neural network (CNN) [55], recurrent neural network (RNN) [56] and autoencoder (AE) [39]. Recently, a self-attention-based DL model architecture called the Transformer [57] has also emerged as a key advancement in DL architecture. Figure 1 summarises the architectures of some of the leading DL models.
An MLP is the simplest form of a feedforward artificial neural network (ANN). Each MLP layer applies a set of nonlinear functions to the weighted sum of all the inputs from the previous layer [55]. MLPs are also called deep neural networks (DNNs). We use both terms interchangeably in this review. Autoencoders are neural networks used for unsupervised learning [58]. They have an encoder-decoder structure, in which the encoder represents the input data in a compressed form and the decoder reconstructs the original data from that representation [39]. The encoding and decoding are performed automatically, without any feature engineering. One of the purposes of AEs in RS is to reduce the dimensionality of the data.
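The layer-by-layer computation described above can be illustrated with a minimal sketch in plain Python. The layer sizes, the random initialisation, the ReLU activation and the four input values are assumptions chosen for illustration only; they are not taken from any of the reviewed studies.

```python
import random

def relu(z):
    """Rectified linear unit, a common nonlinear activation."""
    return max(0.0, z)

def identity(z):
    """Linear output, as typically used for a regression target."""
    return z

def dense(inputs, weights, biases, activation):
    """One MLP layer: a nonlinear function of the weighted sum of all inputs."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (illustrative initialisation only)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp_forward(x, layers):
    """Pass x through each (weights, biases, activation) layer in turn."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x

rng = random.Random(0)
w1, b1 = init_layer(4, 8, rng)   # 4 hypothetical band reflectances -> 8 hidden units
w2, b2 = init_layer(8, 1, rng)   # 8 hidden units -> 1 output (e.g. a yield value)
layers = [(w1, b1, relu), (w2, b2, identity)]
prediction = mlp_forward([0.05, 0.08, 0.06, 0.40], layers)
```

In practice the weights would be learned by backpropagation rather than drawn at random; the sketch only shows the forward pass.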
A CNN is also a type of feedforward neural network and is commonly used in computer vision [45]. It primarily consists of filters (convolution and pooling) that are used to extract features from the original image [39]. Different features are highlighted in various layers, thereby providing hierarchical representations of data [59]. The convolution layer acts as a feature extractor and the pooling layer reduces dimensionality [39]. The pooling layer also helps to reduce overfitting. Fully connected layers, which act as classifiers by using the high-level features learned, are often found at the end of the network. One of the key advantages of CNNs is parameter sharing: the number of parameters remains fixed irrespective of the size of the input grid. The successful application of CNNs dates back to the late 1990s, with LeNet [60]. However, they did not gain momentum until the development of core-computing systems. A significant landmark in the development of CNNs was AlexNet [61], which won the ImageNet competition by a large margin. Convolutional neural networks have been highly successful in computer-vision problems, such as image classification [62], object detection [63], the neural-style transfer of images [59] and image segmentation [64].
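To make the roles of convolution, pooling and parameter sharing concrete, the following is a minimal sketch in plain Python. The 3 x 3 edge kernel and the toy 6 x 6 single-band image are assumptions for illustration; in a trained CNN the kernel weights are learned from data.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation (the 'convolution' used in CNNs), stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; shrinks each spatial dimension by `size`."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [[max(feature_map[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(out_w)]
            for i in range(out_h)]

# A vertical-edge kernel: the same 9 weights are reused at every image position,
# so the parameter count does not depend on the image size.
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

# Toy 6x6 single-band image with a bright left half and a dark right half.
image = [[1, 1, 1, 0, 0, 0] for _ in range(6)]

features = conv2d(image, kernel)        # 4x4 feature map, strongest at the edge
pooled = max_pool2d(features, size=2)   # 2x2 map after pooling
```

Applying the same `kernel` to a larger image would produce a larger feature map without adding any parameters, which is the parameter-sharing property noted above.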
Recurrent neural networks can take inputs and generate outputs of different lengths. This type of network is suitable for modelling sequential data and is widely used in speech recognition [65], natural language processing [56] and time-series analysis [66]. In RNNs, each layer comprises a set of nonlinear functions of the weighted sum of all the inputs from the previous layer and a 'state vector' [45], which contains information about the history of all the past elements in the sequence. A well-known problem with RNNs is that gradients can vanish or explode during backpropagation [67]. Long short-term memory (LSTM) was developed to address the vanishing-gradient problem in RNNs [68]. The hidden layers of an LSTM have memory cells that model temporal sequences and their long-range dependencies more accurately. The gated recurrent unit (GRU) is another RNN variant developed to solve vanishing- or exploding-gradient problems [69]. It has an update gate and a reset gate, which decide which information should be passed to the output.
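The recurrence over a state vector can be sketched as follows. This is a plain (Elman-style) recurrent cell rather than an LSTM or GRU, and the weights, hidden size and input series are hypothetical values chosen only to show that sequences of different lengths are reduced to a state of fixed size.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    return [math.tanh(sum(wx * x for wx, x in zip(w_x[k], x_t))
                      + sum(wh * h for wh, h in zip(w_h[k], h_prev))
                      + b[k])
            for k in range(len(b))]

def rnn_encode(sequence, w_x, w_h, b):
    """Run the cell over a sequence of any length and return the final state."""
    h = [0.0] * len(b)          # initial state vector
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
    return h

# Hypothetical weights: 1 input feature (e.g. a vegetation index), 2 hidden units.
w_x = [[0.8], [-0.5]]
w_h = [[0.1, 0.3], [-0.2, 0.4]]
b = [0.0, 0.1]

short_series = [[0.2], [0.5], [0.7]]
long_series = [[0.2], [0.3], [0.5], [0.7], [0.6], [0.4]]

h_short = rnn_encode(short_series, w_x, w_h, b)   # state after 3 steps
h_long = rnn_encode(long_series, w_x, w_h, b)     # state after 6 steps
```

Because the same weights are multiplied into the state at every step, gradients propagated back through many steps can shrink or grow geometrically, which is the motivation for the gated variants mentioned above.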
Transformer architectures are also used to process sequential data. Unlike RNNs, they receive the entire input at once and use attention mechanisms to capture the relation between the input and output [57]. The Transformer allows the parallel processing of entire sequences, thereby enabling very large networks to be trained significantly faster. The Transformer, which was initially developed for natural language processing and is the current state-of-the-art method in that field, has also gained recognition in computer vision [70,71]. The Architecture section further describes the DL architectures used for crop mapping and yield prediction.
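The attention mechanism at the core of the Transformer can be sketched as scaled dot-product self-attention. The sketch below is a simplification: it omits the learned query, key and value projections, multiple heads and positional encodings, and the three-step toy sequence is an assumption for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)          # how much this step attends to each step
        all_weights.append(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs, all_weights

# Toy sequence: 3 time steps with 2 features each. Q = K = V = inputs here,
# i.e. no learned projections (a simplification of the full architecture).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(x, x, x)
```

Each output row depends on all input steps at once, and the rows can be computed independently of one another, which is what permits the parallel processing noted above.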

Literature Identification
A systematic search of the literature on DL- and RS-based crop mapping and yield estimation was conducted on the 'Scopus' and 'Web of Science' databases. These databases index peer-reviewed scientific publications. The titles, abstracts and keywords were searched using the following search string: (('Deep Learning') AND ('Remote Sensing' OR 'Satellite Imag*') AND ('Agri*' OR 'Crop') AND ('Yield' OR 'Production' OR 'Mapping' OR 'Classification')).
The list was then filtered to exclude review papers and book chapters, works written in languages other than English, papers published in 2023, papers with fewer than five citations and duplicate papers.
Only English-language articles were considered due to the linguistic abilities of the authors and the lack of translation resources. Some of the important works in the field are presented at conferences. Hence, we also included conference proceedings in the review. Citation-based exclusion criteria were used to reduce the number of articles whilst ensuring that the most important and influential studies were included in the review. The citation constraint might have excluded some high-quality and influential research. Therefore, we read the abstracts and scanned through all the articles that were removed due to this criterion to identify any significant studies and added them to the list.
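The exclusion criteria listed above can be expressed as a simple screening function. This is only a sketch: the record fields (`title`, `doc_type`, `language`, `year`, `citations`), the example records and the title-based duplicate check are hypothetical and do not reflect the export format of Scopus or Web of Science or the exact procedure followed in this review.

```python
def screen_records(records, min_citations=5, excluded_year=2023):
    """Apply the stated exclusion criteria to a list of bibliographic records."""
    excluded_types = {"review", "book chapter"}
    seen_titles = set()
    kept = []
    for rec in records:
        # Normalise case and whitespace so the same paper from two databases matches.
        title_key = " ".join(rec["title"].lower().split())
        if title_key in seen_titles:
            continue                      # duplicate record
        seen_titles.add(title_key)
        if rec["doc_type"] in excluded_types:
            continue                      # review papers and book chapters
        if rec["language"] != "English":
            continue                      # non-English works
        if rec["year"] == excluded_year:
            continue                      # papers published in 2023
        if rec["citations"] < min_citations:
            continue                      # fewer than five citations
        kept.append(rec)
    return kept

# Hypothetical records for illustration only.
records = [
    {"title": "Crop mapping with CNNs", "doc_type": "article",
     "language": "English", "year": 2019, "citations": 40},
    {"title": "Crop Mapping with CNNs ", "doc_type": "article",
     "language": "English", "year": 2019, "citations": 38},
    {"title": "A review of DL in RS", "doc_type": "review",
     "language": "English", "year": 2020, "citations": 120},
    {"title": "Yield prediction with LSTM", "doc_type": "conference paper",
     "language": "English", "year": 2023, "citations": 12},
    {"title": "Rice yield from UAV imagery", "doc_type": "article",
     "language": "English", "year": 2021, "citations": 3},
    {"title": "Soybean yield from MODIS", "doc_type": "conference paper",
     "language": "English", "year": 2020, "citations": 7},
]
selected = screen_records(records)
```

Records removed by the citation threshold would still need the manual abstract check described above, which a rule-based filter of this kind cannot replace.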
The initial search yielded 800 articles from the Scopus database and 268 articles from the Web of Science database. After applying all the exclusion criteria and removing duplicates, 267 articles were considered for further analysis, out of which 81 papers were selected by reading the abstracts and skimming through the contents. Nine relevant articles, which were identified from the bibliography or from the exclusion list, were added to the list. In this way, 90 publications were selected for review. Figure 2 graphically describes the steps involved in the identification of the publications for this study.
Figure 3 presents the five most important sources of the reviewed research. Four sources were peer-reviewed journals and the fifth was the IGARSS proceedings. The IGARSS proceedings also provided significant contributions related to methods and applications in RS. Figure 4 shows the most frequently used terms in the titles, keywords and abstracts of the reviewed studies. Font size corresponds to frequency. The tag cloud provides a comprehensive overview of the topics covered in these papers. It shows that Sentinel, Landsat and SAR were the most frequently mentioned input data. The use of temporal data seems to have been popular in these studies. Regarding the crop types, soybeans, wheat, corn and rice were the most popular. The figure also shows that CNN was more popular than other DL methods for crop mapping and yield prediction. Furthermore, attention methods elicited a degree of research focus.

Analysis of the Literature
A full-text read was conducted on the 90 articles that were identified. The articles were analysed to determine and explore their essential aspects, including the architecture of the DL, DL frameworks, RS data, training data, site and scale, assessment measures and performance and findings. These are summarised in the section below.

Sensors and Platforms Used
Satellite, aerial or UAV sensors were used to capture RS data for crop mapping and yield prediction. As shown in Figure 5, approximately 81% of the crop-mapping and yield-prediction studies used satellite-based sensors, followed by UAVs (12%). A few of the crop-mapping studies (four) used both satellite and aerial imagery to test the robustness of their developed models. Satellite imagery is easily accessible because satellites are already in orbit and regularly capture data. Further, the data provider conducts the initial pre-processing of satellite imagery. Thus, the user can focus on the development of the application rather than on pre-processing. UAVs were used more frequently in the yield-prediction studies [72][73][74][75][76][77] than in the crop-mapping studies, although UAVs can be equally beneficial in providing data for precise crop-boundary mapping.

Table 1 summarises the main sensors used in the crop-mapping and yield-prediction studies. Apart from the sensors mentioned in the table, PlanetScope, AVHRR, UAV-based hyperspectral and UAV-based multispectral (MS) and thermal sensors were also used in a few of the yield-prediction studies, while Quickbird, SPOT, VENµS, OHS-2A, PlanetScope, RADARSAT-2, EO-1 Hyperion, Formosat-2, GF-1, DigitalGlobe, ROSIS-03, WV-2, NAIP, RapidEye, Aerial (UCMerced) and Aerial (hyperspectral) were also used in the crop-mapping studies. The Moderate Resolution Imaging Spectroradiometer (MODIS) was the most frequently used sensor and was used exclusively in the yield-prediction studies. Its high temporal resolution and sufficient spatial resolution for regional studies could have made MODIS a preferred choice for regional-level yield-prediction studies. Sentinel-1, Landsat and Sentinel-2 were the most commonly used sensors in the crop-mapping studies. A common merit of all the aforementioned RS data is that they are freely available.
These data are also available through Google Earth Engine, which makes data management and pre-processing more accessible. In fact, some crop-monitoring studies have used Google Earth Engine as a data-management and -processing platform [78][79][80]. The high temporal resolution of the MODIS and Sentinel sensors also allows the study of crop phenology at a finer level. Radar sensors, such as Sentinel-1 and RADARSAT-2, can also work in cloudy weather. This could be the reason for the broader use of these sensors in the study of the phenological characteristics of crops, including rice. The number of UAV-based optical and multispectral sensors is also significant. Notably, hyperspectral sensors were less explored, despite their ability to provide better spectral range and precision, which are required for crop monitoring [81].

Input Features
As input features for the DL architecture, crop-mapping studies typically used optical data (RGB), multispectral data, radar data, thermal data, or a combination of these data. Some of the reviewed studies used the time-series enhanced vegetation index (EVI) [82,83] and normalised difference vegetation index (NDVI) [84] derived from the RS data as inputs in their crop-mapping models. Bhosle and Musande [85] reduced the dimensionality of the hyperspectral image using principal component analysis before feeding it to the CNN model. Traditionally, computer-vision CNN models were designed for three-channel red, green and blue (RGB) images. When models developed for computer vision are transferred to RS applications, the data should be prepared in a three-channel RGB format, so additional multispectral bands cannot be used [86]. For instance, Li et al. [87] followed this approach and used only the RGB channels of the multispectral Quickbird image to feed the LeNet model.
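For reference, the two vegetation indices mentioned above can be computed from band reflectances as in the following sketch. The formulas are the standard NDVI definition and the EVI with the coefficients commonly used for MODIS (gain 2.5, C1 = 6, C2 = 7.5, L = 1); the reflectance values and the choice to return 0.0 for a zero denominator are assumptions for illustration.

```python
def ndvi(nir, red):
    """Normalised difference vegetation index: (NIR - Red) / (NIR + Red)."""
    denom = nir + red
    # Returning 0.0 for a zero denominator is an illustrative convention only.
    return (nir - red) / denom if denom != 0 else 0.0

def evi(nir, red, blue, gain=2.5, c1=6.0, c2=7.5, l=1.0):
    """Enhanced vegetation index with the commonly used MODIS coefficients."""
    denom = nir + c1 * red - c2 * blue + l
    return gain * (nir - red) / denom if denom != 0 else 0.0

# Hypothetical surface reflectances (NIR, red, blue) for four acquisition dates.
series = [
    (0.25, 0.12, 0.06),
    (0.45, 0.08, 0.04),
    (0.50, 0.05, 0.03),
    (0.30, 0.10, 0.05),
]
ndvi_series = [ndvi(n, r) for n, r, _ in series]
evi_series = [evi(n, r, b) for n, r, b in series]
```

A time series of this kind, one value per acquisition date, is the form of input that the time-series EVI and NDVI studies cited above feed to their models.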
For crop-yield studies, environmental data, such as climate and soil data, are increasingly integrated with RS data [88][89][90][91][92][93][94][95][96]. Optical, multispectral, radar or thermal data, or their combination, were the RS input features commonly used in the yield-prediction studies. Vegetation indices were used more often in the yield-prediction studies than in the mapping studies. Approximately 40% of the yield-prediction studies used vegetation indices as inputs for their models. However, Nevavuori, Narra and Lipping [73] and Yang et al. [74] found that optical and/or multispectral images performed better than vegetation indices for yield prediction in the CNN model. The authors of [90] attempted to predict yield using satellite-derived climate and soil data without using spectral or VI information, but the model only achieved a coefficient of determination of 0.55.
In crop-yield-prediction studies at the administrative unit (county/district) scale, satellite imagery has a higher resolution than target data. A typical approach in such a scenario is to aggregate each target area's values (county/district) using mean or weighted means. You et al. [42] proposed a histogram method to reduce the dimensionality of RS data whilst preserving key information for yield prediction. The approach was adopted in several subsequent studies [5,78,97].
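The difference between mean aggregation and a histogram representation can be sketched as follows. This is an illustration in the spirit of the histogram approach, not a reproduction of the method of You et al. [42]: the number of bins, the value range, the normalisation and the pixel values are all assumptions.

```python
def band_histogram(pixels, n_bins, lo, hi):
    """Normalised histogram of the pixel values of one band within one region."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for p in pixels:
        if lo <= p <= hi:
            idx = min(int((p - lo) / width), n_bins - 1)  # put p == hi in last bin
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins

def region_histograms(time_series, n_bins=4, lo=0.0, hi=1.0):
    """One histogram per acquisition date: a (time x bins) feature matrix."""
    return [band_histogram(pixels, n_bins, lo, hi) for pixels in time_series]

def region_means(time_series):
    """Baseline aggregation: one mean value per acquisition date."""
    return [sum(pixels) / len(pixels) for pixels in time_series]

# Hypothetical pixel values of one band inside one county on three dates.
county_pixels = [
    [0.10, 0.15, 0.20, 0.30],
    [0.40, 0.55, 0.60, 0.80],
    [0.20, 0.30, 0.35, 0.60],
]
hist_features = region_histograms(county_pixels)
mean_features = region_means(county_pixels)
```

The histogram keeps the distribution of values within the region at a fixed size regardless of how many pixels the region contains, whereas the mean collapses that distribution to a single number per date.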
Multitemporal data, which are necessary to distinguish between crop types and estimate yield reliably, capture information on different crop-growth stages [98][99][100]. Interestingly, in 59% of the studies, the input features had multitemporal dimensions. Although many studies used multitemporal data, none of them intuitively modelled temporal dependencies. A few studies used 'Explainable AI (XAI)' techniques to understand the significance of different input features in model prediction. For instance, Wolanin et al. [101] visualised and interpreted features and yield drivers using regression-activation mapping to determine the impact of different drivers in their yield-prediction study. They found that downward shortwave radiation flux was the most influential meteorological variable in yield prediction. The most important variables that influence yields were also identified using attention mechanisms [102]. From an interpretability analysis, Xu et al. [103] found that increasing the time-series length increased the classification confidence in an in-season-classification scenario.

Architecture
The deep-learning crop-mapping and yield-prediction applications were typically built using CNN, RNN, DNN, AE, Transformer and hybrid architectures (Table 2). The CNN was the most popular architecture, with usage in approximately 58% of the reviewed studies. The CNN architecture is well suited to array data, such as RS data. Kuwata and Shibasaki [104] were among the pioneers in the field of crop-yield estimation using DL. They used a CNN network with a single and two fully connected layers (inner product layer) to extract features that affected the crop yield and estimated the yield index. They used satellite, climate and environmental data as inputs for the model. Similarly, for crop mapping, an early approach was applied by Nogueira et al. [105], who used a CNN for feature extraction from RS scenes and classified the scenes into coffee and non-coffee. Kussul et al. [106], in a pioneering work, proposed a CNN-based model to classify multitemporal, multisource RS data for crop mapping. CNN-based methods for semantic segmentation can be broadly categorised into patch-based approaches and fully convolutional networks (FCNs) [107]. In patch-based approaches, the imagery is sliced into patches. Each patch is fed to the CNN model, which assigns the central pixel or the whole patch to one target value. Examples of patch-based CNNs for crop mapping include the 2D CNN classification model by Kussul et al. [106], oil-palm-tree and citrus-tree detection studies [87,108,109], the classification of PolSAR data by Chen and Tao [110] and the study by Nogueira, Miranda and Santos [105], who classified SPOT scenes as coffee or non-coffee scenes.
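The patch-based approach described above can be sketched as a sliding window that labels the central pixel of each patch. In the sketch below, a mean-threshold rule stands in for a trained CNN; that rule, the class names and the toy single-band image are hypothetical, and an odd patch size is assumed.

```python
def extract_patches(image, patch_size):
    """Yield ((row, col), patch) for every pixel whose patch fits in the image.

    Assumes an odd patch_size so that each patch has a central pixel.
    """
    half = patch_size // 2
    for r in range(half, len(image) - half):
        for c in range(half, len(image[0]) - half):
            patch = [row[c - half:c + half + 1]
                     for row in image[r - half:r + half + 1]]
            yield (r, c), patch

def classify_image(image, patch_size, classify_patch):
    """Assign each central pixel the label predicted for its surrounding patch."""
    return {centre: classify_patch(patch)
            for centre, patch in extract_patches(image, patch_size)}

def mean_threshold_classifier(patch, threshold=0.5):
    """Hypothetical stand-in for a trained CNN: label by mean patch value."""
    values = [v for row in patch for v in row]
    return "crop" if sum(values) / len(values) > threshold else "other"

# Toy 5x5 single-band image: high values on the left, low values on the right.
image = [[0.9, 0.9, 0.9, 0.1, 0.1] for _ in range(5)]
labels = classify_image(image, patch_size=3, classify_patch=mean_threshold_classifier)
```

In this toy case the pixel at the class boundary, (2, 2), takes the label of the majority of its neighbourhood, which illustrates how patch-based labelling can smooth over small features.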
In yield prediction, Tri et al. [75] used LeNet and a version of Inception to extract features from the image patch. In the inception module, rather than selecting a single filter size, multiple convolution filters of different sizes are applied (to learn features at different scales) and all the outputs are concatenated [111]. To reduce the number of parameters, 1 × 1 convolution is used along the depth. Nevavuori et al. [73] designed a CNN architecture similar to that presented by Krizhevsky et al. [46] to predict yield from UAV-based RGB images. The model predicted wheat and barley yield at the field scale with satisfactory accuracy. Jiang, Liu and Wu [83] used a CNN based on LeNet-5 to classify time-series EVI curves. Using parameter-based transfer learning, the authors fine-tuned a model that had been trained to recognise handwritten digits in the MNIST database on the time-series EVI curves. One of the major disadvantages of the patch-based crop-classification approach is that small features may be smoothed and misclassified in the final classification.
In 2015, a novel CNN approach called FCN, which uses convolutional layers to process the input image and generate an output image of the same size, was introduced. It takes the entire image as the input, extracts features at different levels of abstraction and upsamples the features to restore the input resolution in the next part of the network using techniques such as bilinear interpolation, deconvolution layers and features from earlier, more spatially accurate layers. The U-net [112] is a widely used FCN architecture with skip connections. It was also used quite often for crop mapping. Typical examples of FCN usage for crop mapping are those of Du et al. [113] and Saralioglu and Gungor [114]. Adrian, Sagan and Maimaitijiang [80] used a 3D U-net to extract features from the temporal and spatial dimensions. Notably, FCN is mainly used with higher-resolution imagery. Mullissa et al. [115], La Rosa et al. [116], Chamorro Martinez et al. [117] and Wei et al. [118] implemented FCNs in the classification of crops in synthetic-aperture radar (SAR) images. Chew et al. [119] used a VGG16 architecture and the publicly available ImageNet dataset to pretrain their model before feeding it UAV images.
A 1D CNN applied along the temporal or spectral dimension is also used for crop mapping and yield prediction. Zhong, Hu and Zhou [82] demonstrated that a 1D CNN can be effectively and efficiently used to classify multitemporal imagery. The authors compared the output of a 1D CNN with that of an RNN to classify summer crops using multitemporal Landsat EVI data. In the experiment, the 1D CNN exhibited a higher accuracy and F1 score than RNN-, RF- and SVM-based methods. This experiment demonstrated the ability of the 1D CNN to represent temporal features to classify crops. However, the limitation of this approach is that it fails to consider the spectral and spatial information of satellite imagery. Zhou et al. [120] used object-based image analysis, which is well-recognised as a classification method for high-resolution images, for crop mapping. The authors used a segmentation algorithm to create segments from Sentinel-2 imagery and used a 1D CNN to classify the mean spectral vector of each segment. This approach used the mean of the segments to capture spatial information but still did not model the temporal relation, which would have improved the accuracy.
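The temporal convolution described above can be illustrated with a minimal sketch. It is not the architecture of any reviewed study: the EVI values and the kernel weights are hypothetical, and a trained 1D CNN would learn many such kernels rather than use a hand-set one. The sketch only shows how a single filter slides along a per-pixel time series and responds to a temporal pattern (here, green-up).

```python
def conv1d(series, kernel, bias=0.0):
    """Valid (unpadded) 1D cross-correlation followed by a ReLU activation."""
    k = len(kernel)
    feature_map = []
    for start in range(len(series) - k + 1):
        window = series[start:start + k]
        activation = sum(w * x for w, x in zip(kernel, window)) + bias
        feature_map.append(max(0.0, activation))  # ReLU
    return feature_map


def global_max_pool(feature_map):
    """Summarise the whole season by the strongest filter response."""
    return max(feature_map)


# Hypothetical EVI values of one pixel over eight acquisition dates.
evi = [0.2, 0.25, 0.4, 0.6, 0.7, 0.65, 0.4, 0.25]

# Hand-set kernel (an assumption for illustration): positive when EVI is rising.
green_up_kernel = [-1.0, 0.0, 1.0]

feature_map = conv1d(evi, green_up_kernel)
strongest_green_up = global_max_pool(feature_map)
```

The feature map is non-zero during green-up and zero during senescence, which is the kind of temporal feature a classifier head can then use to separate crops with different phenology.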
The RNN models were the second most widely used models, since they were applied in more than 22% of the reviewed studies. In fact, RNNs were preferred for yield prediction. More than 40% of the reviewed yield-prediction studies used RNNs. The RNN is the preferred method for agricultural monitoring when temporal dimensions are involved [47]. The LSTM, a type of RNN, was effective in learning temporal characteristics from multitemporal images for crop mapping [121,122] and yield estimation [5,79,88]. Rußwurm and Körner [123] proposed an LSTM model for temporal-feature extraction to classify multiclass crop types. Xu et al. [122] used the LSTM model to learn time-series spectral features for crop mapping. The authors also studied the spatial transfer of the model amongst six sites within the US corn-maturity zone and found that the approach can learn generalisable feature representations across regions. However, the RNN model typically cannot be used to learn spatial-feature representation.
Although MLPs are not particularly suitable for array data, such as RS and environmental data, a few studies also used MLPs. For instance, Maimaitijiang et al. [72] used a fully connected feedforward neural network and data fusion for yield prediction. They studied the impact of fusing data, such as spectral, crop-height, crop-density, temperature and texture data, at the input and intermediate stages for soybean yields. Chamorro Martinez et al. [117] used a Bayesian neural network to predict the yield and find the uncertainty associated with the prediction. In a Bayesian neural network, the weights of the network are not fixed, but are represented by probability distributions. The use of AE, which is an unsupervised DL technique, featured in some of the crop-mapping applications [116,124,125]. In these studies, the AEs were used to learn a compressed and improved representation of satellite data, which was then classified using other methods.
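The idea that Bayesian weights are distributions rather than fixed values can be sketched as follows. This is an illustrative simplification, not the method of the cited study: it assumes a single linear layer, independent Gaussian weight distributions whose means and standard deviations are already known (all values hypothetical), and Monte Carlo sampling to turn weight uncertainty into a predictive mean and spread.

```python
import random
import statistics


def sample_prediction(features, weight_means, weight_stds, rng):
    """Draw one weight vector from independent Gaussians and predict with it."""
    weights = [rng.gauss(m, s) for m, s in zip(weight_means, weight_stds)]
    return sum(w * x for w, x in zip(weights, features))


def predict_with_uncertainty(features, weight_means, weight_stds,
                             n_samples=1000, seed=0):
    """Monte Carlo estimate of the predictive mean and standard deviation."""
    rng = random.Random(seed)
    draws = [sample_prediction(features, weight_means, weight_stds, rng)
             for _ in range(n_samples)]
    return statistics.mean(draws), statistics.stdev(draws)


# Hypothetical inputs: two normalised predictors for one field.
features = [0.6, 0.3]
# Hypothetical posterior over the two weights (mean and standard deviation).
weight_means = [5.0, 2.0]
weight_stds = [0.5, 0.2]

yield_mean, yield_std = predict_with_uncertainty(features, weight_means, weight_stds)
```

A wider weight distribution produces a wider spread of predictions, which is how such a model attaches an uncertainty estimate to each yield value; with zero standard deviations the sketch collapses to an ordinary fixed-weight network.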
Hybrid models that contain more than one architecture were used to learn spatial, spectral and temporal features for improved decision making. In the hybrid models, different architectures were used to learn features in various domains. These models either merged higher-level features obtained from two networks or used the output feature of one architecture as an input for another. To model the spatial context and temporal information from the multitemporal images jointly, combinations of RNN and CNN were used [42,117]. Ghazaryan et al. [126] found that a hybrid model provided the highest accuracy out of 3D CNN, LSTM and a combination of CNN and LSTM when predicting the yield from multitemporal, multispectral and multisource images. Zhao et al. [127] used a LeNet-5 [128] model and a transfer-learning approach to classify red, green and infrared images in a first step and then improved the classification results using the DT model with phenological information in a second step. Although this hybrid approach improved the accuracy of the rice mapping, it could be challenging to implement on a larger scale because the decision rules are local and have to be determined from the field survey.
Attention mechanisms have also become popular in recent years in DL crop-mapping and yield-prediction models. An LSTM model with an attention mechanism was used to improve the generalisability of a yield-prediction model and identify the contribution of different variables to the yield [102]. The attention mechanism was also used to identify important features for crop mapping [103]. Wang et al. [129] claimed that crop mapping can be improved by integrating an attention mechanism and geographic information because it reduces the effects of geographic heterogeneity and prevents irrelevant information from being considered. Seydi, Amani and Ghorbanian [84] implemented spatial- and spectral-attention mechanisms to extract hidden features relevant to crop mapping. Self-attention-based transformers, which have been found to be effective for processing sequential data, were also applied for crop mapping. Rußwurm and Körner [130] concluded that transformers were more robust in handling the noise present in raw time-series RS data and were more effective for their classification. Reedha et al. [131] employed the Vision Transformer (ViT) model to classify aerial images captured by UAVs and achieved a similar degree of accuracy to a CNN. The authors also claimed that when the labelled training dataset is small, ViT models can be better than state-of-the-art CNN classification. To evaluate their performance, DL models are typically benchmarked against ML methods, such as SVM, RF and DT.
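As a rough illustration of how an attention mechanism weights time steps or features, the sketch below implements generic scaled dot-product attention for a single query, the form used in transformers. It is an assumption made for illustration and does not reproduce the specific attention formulations of the studies cited above; the vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the context vector (weighted sum of values) and the
    attention weights, one per key.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


# Hypothetical encodings of three acquisition dates (keys) and their values.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
query = [1.0, 0.0]  # most similar to the first date

context, weights = attend(query, keys, values)
```

The weights sum to one and can be inspected directly, which is why attention is often used to indicate which dates or variables contributed most to a prediction.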

Frameworks
Deep-learning frameworks are software libraries with pre-built components for implementing DL models. They have made the implementation of DL architectures easier and more accessible. The most popular DL frameworks are Convolutional Architecture for Fast Feature Embedding (Caffe) [132], Theano [133], TensorFlow [134], PyTorch [135], CNTK [136] and MatConvNet [137]. These frameworks have a robust GPU backend that allows the training of networks with billions of parameters.
TensorFlow was the most widely used framework for crop mapping and yield prediction with DL (Figure 6). TensorFlow is primarily used through its Python API, and interfaces in R and JavaScript are also available. TensorFlow was developed by researchers on the Google Brain Team as a ML and DNN framework. It supports multiple GPUs and CPUs. Keras was also frequently used, with a total of 19 mentions. Keras is a high-level neural-network API that is written in Python and runs on top of TensorFlow or Theano. Of the Keras-based implementations, 11 used TensorFlow as the backend, one used Theano, and the remainder did not report a backend. Keras APIs are intuitive and straightforward, resulting in their rapid growth. TensorFlow version 2 completely integrates Keras, thus providing a versatile library with a simple interface.
Furthermore, PyTorch was used relatively frequently, with nine mentions. PyTorch was developed by Facebook's AI-research laboratory. Providing flexibility, speed and deeper integration with Python, PyTorch has gained a user community in recent years. Caffe is written in C++ with a Python interface and is also popular in computer vision because it incorporates various CNN frameworks and datasets.
Deep neural networks are also built in Scikit-learn [138], a ML library. Mu et al. [139] used Scikit-learn to develop a DNN for yield prediction in their study, and Ma et al. [91] developed a Bayesian neural network. Scikit-learn does not support GPU implementation. Furthermore, DL4J [140], which is suitable for distributed computation, was also used in a study.

Crop Type
In the crop-yield-prediction studies, DL was most frequently applied to corn and soybeans (Table 3). Although most of the yield-prediction studies used a single crop, some also approached the prediction of the yield of more than one crop without distinguishing between the crops [73,76,96]. For crop mapping, most of the studies detected multiple crops. Rice was the most commonly mapped single crop. The wide use of rice as a staple food crop and the distinct phenological characteristics of rice fields reflected in the sensor data could be the primary reasons for its high rate of detection.

Training Data
A DL model's accuracy and generalisation ability are determined by the quality and quantity of the training data [39]. Insufficient training data cause models to overfit and affect their prediction accuracy. In most of the crop-mapping studies, the training labels were obtained by collecting the crop types of the area of concern through field visits (Table 4). A field visit is a labour- and time-intensive process. After field surveys, the Cropland Data Layer (CDL) was the primary source of training data for the crop-classification models. The CDL is a georeferenced, crop-specific land-cover map of the United States [141]. It is prepared using ground-truth data and moderate-resolution imagery, has a resolution of 30 m and is published annually by the United States Department of Agriculture (USDA). It can be inferred that conducting such studies in other parts of the world is challenging, since such standard data are unavailable. Only three studies used government-supplied data other than the CDL. Another source of training data was the visual interpretation of higher-resolution images. Benchmark data, such as the UC Merced land-use dataset, the NWPU-RESISC45 dataset, the Campo Verde dataset and Breizhcrops, are also available for RS analysis and were used to test models in some of the crop-mapping studies. Crowdsourcing is another source of training data. Wang et al. [142] used crowdsourced crop-type data from farmers to train a network. Saralioglu and Gungor [114] created a web interface to collect training data for crop mapping. However, the crowdsourcing method requires an incentive or motivation for the contributors, and the validation of the collected data is a further challenge. Google Street View images can also provide an efficient, cost-effective source of ground reference data to train a DNN for crop-type mapping [143].
The county-level yield statistics provided by the USDA National Agricultural Statistics Service and the yield data collected from fields were the most commonly used data for training DL yield-prediction models (Table 5). The USDA yield statistics are separated from other government data sources in the table to highlight the frequency of their use. The target data for yield prediction at the field scale are collected during harvesting, either by the harvester [73] or by weighing the grain from each yield plot [72]. Field data are essential for field-level predictions. The data prepared by local governance bodies may not provide confidence in their accuracy. The USDA county-level yields are available for the USA, while CISA data are available for Canada. However, such data are not available for many other parts of the world.
Table 5. Source of training data for crop-yield-prediction studies.

Data Source (For Crop-Yield Prediction) | Number of Studies
Field data | 11
Government data (excluding USDA) | 9
Government data (USDA) | 11
Government data (USDA) and field data | 1
Data-augmentation techniques, such as rotation and flips, were also used in the crop-mapping [113] and yield-prediction studies [75] to further enlarge the data and ensure that the model was independent of rotation and flips. Some of the crop-mapping studies used the domain-adaptation technique [144] and weakly supervised learning [145] to address the scarcity of training data. Wang et al. [145] concluded that CNN can perform better than other ML methods for crop mapping, even when the training data are scarce, if the training labels are used efficiently. The authors used two types of training data, a single geotagged point (pixel) and an image-level label, to train a U-net. This training approach gave satisfactory results that demonstrated the applicability of weak supervision. This model should be further validated for different areas and crop types and can be improved using multitemporal features and a DL model with temporal-learning capabilities. The scarcity of crop-type labels and historical yield data is a major barrier to the development of DL models for reliable and accurate crop mapping and yield prediction.
Figure 7 shows the spatial distribution of the study sites of the reviewed studies. Some of the studies were conducted in more than one area, such as those in which the experiments were conducted in one area, while a transfer-learning technique was used to perform estimates at another location [5]. In such cases, both are included on the list. The map shows that the studies were concentrated in only some parts of the world. In total, 37% of all the studies were conducted in the USA and 15% were conducted in China. Only 3% of the studies were conducted in Africa, despite the fact that Africa holds 60% of the world's arable land [146].
Agriculture accounted for 55% of Australian land use and 11% of goods-and-services exports in 2019-2020 [147], but only one reviewed study used Australia as its study site. One of the reasons for the skewed distribution of the sites could be the unavailability of target data. Another reason could be the locations of the research institutes. The areas of the study sites varied from 65 hectares to as large as the Indian wheat belt and the entire USA. In all the UAV-based studies that reported the area, fewer than 200 hectares were covered. The reason for this could be the high cost of data capture with UAVs. Studies of larger areas provide confidence in a model's applicability to diverse landscapes.

Scale of the Output
The crop-mapping and yield-prediction studies were implemented at different scales. The application of crop-monitoring studies depends on the output's scale. Although regional studies help to monitor crop production at a national and regional scale, within-field variability is necessary to inform field-specific decision making [147,148]. The scale of the output depends on the resolution of the input and target data. In most of the crop-mapping studies, each pixel or pixel group was assigned a crop class. The precision of the field boundary and the degree of generalisation depend on the spatial resolution of the RS data. We categorised the yield-prediction studies into two classes, namely, field-level and county- or district-level. Almost 70% of the yield-prediction studies were county-level and the remainder were field-level. The county- or district-level crop-yield statistics were typically used in the county-scale yield-prediction studies. In contrast, the field data collected from farmers and harvesters were used for field-scale studies. Precise yield data can be used to make predictions at the best possible scale. Notably, the platforms used and the scales of the studies were correlated. The field-level yield was estimated in all the UAV-based yield-prediction studies. The county-level yield-prediction studies were predominantly conducted in the USA. The reason for this could be the availability of USDA yield data.

Evaluation Metrics and Performance
The most commonly used evaluation metrics in the reviewed crop-mapping studies were the overall accuracy, kappa statistics, precision, recall and F1 score. The majority of the studies used more than one metric to evaluate performance. Approximately 87% of all the crop-mapping studies used overall accuracy to assess their model's performance. Overall accuracy is the most intuitive evaluation measure. It is the ratio of the correct predictions to the total number of predictions made. It is proportional to the area that is correctly mapped. Along with overall accuracy, kappa statistics were often computed in the crop-mapping studies. Precision refers to the ratio of correctly predicted positive observations to the total positive predictions made by the model. The recall is the number of correct positive results divided by the number of all the samples that should have been identified as positive. The F1 score is the harmonic mean of the precision and the recall [149]. The studies reported either the F1 score of individual classes or the mean (weighted or unweighted) of the F1 scores. Notably, the overall accuracy of a dataset can be misleading when the class distribution is uneven. The precision, recall and F1 score might be more useful in such studies.
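The definitions above can be made concrete with a short sketch that computes the metrics from a confusion matrix. The matrix is hypothetical and deliberately imbalanced, and the convention that rows hold reference labels and columns hold predictions is an assumption of this example. Kappa is computed as Cohen's kappa.

```python
def overall_accuracy(cm):
    """Correct predictions (diagonal) divided by all predictions."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Agreement corrected for the agreement expected by chance."""
    total = sum(sum(row) for row in cm)
    observed = overall_accuracy(cm)
    expected = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(len(cm))
    ) / total ** 2
    return (observed - expected) / (1 - expected)


def precision_recall_f1(cm, cls):
    """Per-class precision, recall and their harmonic mean (F1)."""
    true_positive = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column sum
    actual = sum(cm[cls])                    # row sum
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical map: 95 majority-class samples and 5 minority-class samples,
# of which only 1 minority sample is detected.
cm = [[95, 0],
      [4, 1]]

oa = overall_accuracy(cm)                                  # 0.96
kappa = cohen_kappa(cm)                                    # about 0.32
precision, recall, f1 = precision_recall_f1(cm, cls=1)     # 1.0, 0.2, about 0.33
```

In this example the overall accuracy is 96%, yet the recall of the minority crop is only 20%, which illustrates why overall accuracy alone can mislead when the class distribution is uneven.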
The mean squared error (MSE), root MSE (RMSE), coefficient of determination (R²), mean absolute error (MAE) and mean absolute percentage error (MAPE) were the commonly used metrics in assessing the reviewed yield-prediction models. The MSE is the average of the squared differences between the observed values and the predicted values. The MSE penalises larger errors more heavily because each difference is squared. The RMSE is the square root of the MSE and is therefore expressed in the same units as the yield. The problem with using MAE, MSE and RMSE is that their values depend on the units and the scale of the residuals. The MAPE attempts to solve this issue. It transforms the errors into percentages; ideally, the MAPE should be as close to 0 as possible. The R² is the degree of agreement between the true value and the predicted value. It measures the proportion of variance in the dependent variable explained by the independent variables. The R² was the most frequently used evaluation metric in the yield-prediction studies, since it was used in 65% of the studies. Most of the yield-prediction studies also computed multiple metrics. After R², RMSE and MAPE were the most commonly used. In some studies [75,77], the yield prediction was approached as a classification task. Each image segment was assigned to a yield class and classified within a particular yield category. These studies used the overall accuracy and F1 score as evaluation metrics.
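A minimal sketch of these regression metrics follows. The yield values are hypothetical, and R² is computed as one minus the ratio of the residual to the total sum of squares, which is an assumption of this example: some studies instead report the squared correlation coefficient, and the two can differ.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, MAPE (in percent) and R² as a dictionary."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mse = ss_res / n
    mae = sum(abs(r) for r in residuals) / n
    # MAPE is undefined when any observed value is zero.
    mape = 100.0 * sum(abs(r / t) for r, t in zip(residuals, y_true)) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        "mape": mape,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical observed and predicted yields in t/ha.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
metrics_t_ha = regression_metrics(observed, predicted)

# The same data expressed in kg/ha: RMSE and MAE scale with the unit,
# whereas MAPE and R² do not.
metrics_kg_ha = regression_metrics([1000 * v for v in observed],
                                   [1000 * v for v in predicted])
```

Changing the unit from t/ha to kg/ha multiplies the RMSE by 1000 but leaves the MAPE and R² unchanged, which is the unit dependence noted above.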
Comparing model performance is not easy when models use different evaluation metrics. We prepared box plots that show the distribution of the achieved R² and the overall accuracy percentage in the yield-prediction and crop-mapping studies, respectively (Figure 8). The data for these plots were obtained from the studies that reported the performances of the yield-prediction and crop-mapping models using R² and overall accuracy, respectively. The value of the best-performing model was selected to make the graph. Most of the crop-mapping studies reported very high classification accuracies, up to 99.7%, with a mean value of 90.0%. The R² of the yield prediction was distributed in the range of 0.5 to 0.96, with a mean value of 0.77.

Discussion
Deep learning and RS have emerged as promising techniques for crop mapping and yield prediction in recent years. A typical approach to DL- and RS-based crop mapping and yield prediction is summarised in Figure 9. The main platforms for capturing data are UAVs, satellites and aeroplanes. The input data can be raw spectral values from multispectral, optical, hyperspectral, radar or thermal sensors or derived features, such as the vegetation index, histograms of pixel intensities or even graphs of phenological characteristics. In yield-prediction models, RS data are often integrated with environmental data, such as climate and soil data. Although multitemporal data are popular, data from a single date can also be used. The models are built using CNN, RNN, MLP, Transformer or hybrid architectures. The models are usually implemented using a standard framework, such as TensorFlow, PyTorch or Caffe. Target labels (crop labels/yield values) are the most important components in the process and are often the limiting factors in the development of models. Trained models are evaluated using one or more evaluation metrics to assess their performance and fitness for use. In the section below, these aspects of the study are discussed.
This review shows that satellite-based sensors are the most commonly used RS-data sources for crop mapping and yield prediction. This preference could be due to the ease of access to data, the availability of multiple spectral and spatial resolutions, the availability of historical data, the global coverage of satellite data or the fewer pre-processing steps that they require. Additionally, platforms such as Google Earth Engine make the handling of large amounts of satellite data easier. Unmanned aerial vehicles are less often used as RS-data sources, although they provide flexibility in terms of the choice of sensors, spatial resolution and data-capture time. Unmanned aerial vehicles are the preferred platforms when greater precision is needed (e.g., in precision agriculture). With regard to sensors, MODIS is the most commonly used sensor for yield prediction, whereas Sentinel-1, Sentinel-2 and Landsat 2 are frequently used for crop mapping. Although hyperspectral sensors can provide better spectral ranges and precision for crop monitoring, their application is yet to be fully explored. Researchers and practitioners currently have access to RS data at varied resolutions (e.g., spatial, spectral and temporal), obtained from various platforms and sensors, thereby allowing the selection of the most suitable sensor based on the specific needs of the study. This study also summarised the attributes of commonly used RS data for crop mapping and yield prediction (Table 1).
The choice of input features can significantly affect how well a DL model represents underlying phenomena. The input features also constrain the model's architecture. Commonly used RS-derived input features in crop mapping and yield prediction are optical, multispectral, radar or thermal data. The integration of environmental data and RS data for yield prediction is becoming increasingly popular. Environmental data provide additional valuable information for yield prediction beyond RS data [89,150,151]. Vegetation indices, which are compact summaries of vegetation condition derived from spectral values, were also commonly used as inputs in the crop-mapping and yield-prediction studies. Some of the yield-prediction studies used the histogram method and summarisation techniques, such as the mean or weighted mean, to simplify input variables while retaining the most important information. Although these methods can allow the training of DL models with limited labelled data, they have the drawback of generalising the input information, so that spatial detail is lost. Furthermore, although the use of multitemporal and multispectral data for crop mapping and yield prediction is increasing, most existing approaches do not account for temporal, spatial and spectral dependencies concurrently.
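The input summarisation discussed above can be sketched briefly. The example uses the normalised difference vegetation index (NDVI), chosen here only because its formula is simple; the reviewed studies used various indices, including EVI. The reflectance values, the number of bins and the value range are hypothetical.

```python
def ndvi(nir, red):
    """Normalised difference vegetation index from NIR and red reflectance."""
    denom = nir + red
    return (nir - red) / denom if denom else 0.0


def normalised_histogram(values, n_bins, lo, hi):
    """Fraction of values falling in each of n_bins equal-width bins on [lo, hi].

    Values outside [lo, hi] are ignored.
    """
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v <= hi:
            index = min(int((v - lo) / width), n_bins - 1)
            counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


# Hypothetical (NIR, red) reflectances of the pixels inside one county or field.
pixels = [(0.5, 0.1), (0.5, 0.2), (0.3, 0.2), (0.4, 0.1)]

ndvi_values = [ndvi(nir, red) for nir, red in pixels]
mean_ndvi = sum(ndvi_values) / len(ndvi_values)             # single-number summary
ndvi_histogram = normalised_histogram(ndvi_values, 4, 0.0, 1.0)
```

Both the mean and the histogram discard the position of each pixel, which is the loss of spatial detail noted above, although the histogram retains more of the value distribution than the mean does.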
The use of DL architectures for crop mapping and yield prediction has shown significant progress due to the development of new architectures in the DL field. Several DL methods, including MLPs, CNNs, RNNs, Transformers and AEs, have been used for crop mapping and yield prediction. The CNNs are the most widely used architectures for crop mapping and yield prediction. They were specifically designed for processing data with grid-like topologies. Furthermore, 2D CNNs, 3D CNNs and FCNs can effectively capture and analyse the spatial features of satellite imagery. A 1D CNN can be used along temporal or spectral dimensions to capture the respective dependencies in crop-mapping and yield-prediction tasks. Furthermore, RNNs can be used to model temporal dependencies in crop mapping and yield prediction, but they suffer from the vanishing- and exploding-gradient problem, are not particularly effective at capturing long-term dependencies and cannot be easily parallelised across time steps. The Transformer is known for its ability to capture long-range dependencies in input data and its ability to train very large models efficiently. The reviewed studies also suggested that, for crop mapping, the Transformer is more robust in handling noise in raw time series and can perform well when the labelled training dataset is small. The MLPs are not particularly efficient when used to process high-dimensional-array data such as RS and environmental data and they have limited utility for crop mapping and yield prediction. Considering that different architectures have varying strengths and limitations, there is no single best architecture for crop mapping and yield prediction. Rather, the best option depends on the amount, nature and quality of the data, the complexity of their representation and the available computational resources, amongst other factors.
This review suggests that crop-mapping and yield-prediction research is currently skewed towards certain crops and locations. The results of research conducted on a certain area or crop type may not be generalisable to other regions or crops. Hence, studies must be extended to other widely consumed crops and other regions to ensure the usability of models in varied conditions. One of the reasons for the skewed study distribution could be the limited availability of training labels. Scarce historical yield data and crop-type labels are significant limiting factors in the development of DL models for crop mapping and yield prediction. The data prepared by government bodies are amongst the most widely used for training yield-prediction and crop-mapping models. Such data are not available in most of the world, especially in developing countries.
With regard to the performances of the models, the evaluation metrics used in different crop-mapping and yield-prediction studies are not uniform, which makes the comparison of the models challenging. The main evaluation metrics used for the crop-mapping models were the overall accuracy (OA), kappa statistics and F1 score, whereas those in the crop-yield-prediction studies were the R², RMSE and MAPE. The reviewed yield-prediction studies reported R² values that ranged from 0.5 to 0.96, with a median of 0.78, which suggests moderate-to-high accuracy. In the crop-mapping studies, generally, a higher accuracy was achieved, with a range of 51% to 99.7% and a median of 92%. However, a higher performance metric does not necessarily mean that the research problem relating to accuracy is solved. Accuracy is affected by various factors, including the choice of evaluation metrics and the training- and test-data selection. For instance, using an overall performance metric in imbalanced class-distribution scenarios can be misleading because the model may perform poorly in identifying minority classes, even though the OA is high. Similarly, a model's performance can be overestimated due to data leakage [152]. Moreover, the performance of a CNN model can be overestimated when test data unintentionally fall within the receptive fields of the training data, which biases the evaluation [153]. Another consideration could be the spatial distribution of the model performance. Although developed models may exhibit satisfactory performances overall, they may not perform well at specific locations. Moreover, the accuracy of predictions tends to improve as the season progresses, with early-season predictions generally less accurate than later predictions.
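One common way to reduce the overlap between training receptive fields and test data is to split samples by spatial block and discard training samples that lie too close to a test block. The sketch below is a hypothetical illustration of that idea and is not taken from the cited works; the grid size, block size and patch radius are assumptions.

```python
def block_id(row, col, block_size):
    """Identify the square spatial block that contains a pixel."""
    return (row // block_size, col // block_size)


def spatial_block_split(pixels, block_size, test_blocks, patch_radius=0):
    """Split labelled pixels into train and test sets by spatial block.

    A training pixel is dropped when a square patch of half-width
    patch_radius centred on it would cover any test pixel.
    """
    test = [(r, c) for r, c in pixels
            if block_id(r, c, block_size) in test_blocks]
    train = []
    for r, c in pixels:
        if block_id(r, c, block_size) in test_blocks:
            continue
        overlaps_test = any(
            max(abs(r - tr), abs(c - tc)) <= patch_radius for tr, tc in test
        )
        if not overlaps_test:
            train.append((r, c))
    return train, test


# Hypothetical 8 x 8 grid of labelled pixels, split into 4 x 4 blocks.
pixels = [(r, c) for r in range(8) for c in range(8)]
train, test = spatial_block_split(pixels, block_size=4,
                                  test_blocks={(0, 0)}, patch_radius=1)
```

With a patch radius of one pixel, the nine training candidates bordering the test block are discarded, so no training patch contains a test pixel. The cost is a smaller training set, which matters when labels are already scarce.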

Future Work
In this review, several important aspects of crop mapping and yield prediction based on DL and RS, which provide a foundation for future research, were discussed. The following section highlights a few avenues for future research in this field.
The most prominent issue is the availability of target data related to different crops in various parts of the world at diverse times. Although DL can learn nonlinear patterns between input and output data, it requires a large amount of training data. However, the availability of target data for crop mapping and yield prediction is limited. In yield prediction, the data are even more scarce. Further research should be carried out on making data available, learning from limited data [154,155], transfer-learning approaches [156], unsupervised learning [157] and the quantification of the uncertainties in predictions [158,159]. Alternative methods, such as crowdsourcing, using close-range oblique images, including those obtained from Google Street View, geotagged social media images and interviews with farmers, should be further explored to collect training data. More benchmark datasets must be developed and used to produce a standardised measure of comparison and to allow researchers to evaluate their proposed architectures fairly and consistently. These standard data should be available for varied times, locations, resolutions and sensor types. Data availability can make the development of a global model that encompasses varying times and locations possible. The domain-adaptation technique [144,160] and weakly supervised learning [145] are useful for training DL models in scarce-target-data scenarios. These techniques need to be further validated and explored for multi-temporal scenarios and large spatial extents with differing environmental conditions and cropping practices. Curriculum learning, the multi-stage transfer-learning approach and few-shot learning also have the potential to improve existing crop-mapping and yield-prediction models. Unsupervised-learning techniques can reveal hidden patterns and structures within data without pre-existing labelled data [157].
Furthermore, unsupervised learning can be used to study the applicability of abundant unlabelled RS imagery to crop mapping and yield prediction. In addition, predictive uncertainty can be modelled by combining DL models with Bayesian statistics, which can be beneficial when training data are scarce.
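One simple way to combine a DL model with Bayesian statistics is a Bayesian last layer, in which the features produced by a trained network are kept fixed and a closed-form Bayesian linear regression is placed on top of them. The sketch below is only an illustration under stated assumptions: the feature matrix stands in for the output of a hypothetical feature extractor, the yields are synthetic, and the prior precision `alpha` and noise precision `beta` are assumed values. It is not the approach of any specific reviewed study.

```python
import numpy as np


def fit_bayesian_last_layer(features, targets, alpha=1.0, beta=25.0):
    """Closed-form Gaussian posterior over linear weights on fixed features.

    alpha: assumed precision of the zero-mean Gaussian prior on the weights.
    beta:  assumed precision (1 / variance) of the observation noise.
    """
    n_features = features.shape[1]
    precision = alpha * np.eye(n_features) + beta * features.T @ features
    cov = np.linalg.inv(precision)
    mean = beta * cov @ features.T @ targets
    return mean, cov


def predict_with_uncertainty(features, mean, cov, beta=25.0):
    """Predictive mean and standard deviation for each row of `features`."""
    pred_mean = features @ mean
    # Noise variance plus the variance caused by weight uncertainty.
    pred_var = 1.0 / beta + np.einsum("ij,jk,ik->i", features, cov, features)
    return pred_mean, np.sqrt(pred_var)


# Synthetic stand-in for features from a hypothetical trained network
# (20 training fields, 3 features each) and their yields.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(20, 3))
true_weights = np.array([1.5, -0.5, 2.0])
train_yield = train_feats @ true_weights + rng.normal(scale=0.2, size=20)

w_mean, w_cov = fit_bayesian_last_layer(train_feats, train_yield)

# One query close to the training distribution and one far outside it.
near = np.array([[0.1, 0.2, -0.1]])
far = 10.0 * near
_, std_near = predict_with_uncertainty(near, w_mean, w_cov)
_, std_far = predict_with_uncertainty(far, w_mean, w_cov)
print("std near training data:", std_near[0])
print("std far from training data:", std_far[0])
```

The predictive standard deviation grows for the query that lies far from the training features, which is the behaviour that makes such estimates useful for flagging unreliable predictions when labelled data are scarce.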
Another potential avenue for future research in this domain is to investigate how to develop a more efficient, effective and generalisable method whilst making the best use of the available spatial, spectral and temporal richness. Although target data are scarce, satellite imagery is abundant, but it remains underutilised. Most of the existing crop-mapping and yield-prediction applications fail to model temporal, spatial and spectral dependencies simultaneously. Additionally, DL models require substantial computing resources and long training times, which may not always be accessible to or affordable for every institution. Thus, designing DL models that can optimally use available RS resources to improve generalisability and accuracy whilst considering the computational constraints of implementation is one promising area of research.
Deep-learning algorithms are often considered complex black-box models that pose challenges in terms of interpretability [161]. Interpreting the opaque decision-making processes of DL approaches in crop mapping and yield prediction is crucial to improve transparency, accountability and trust in predictions. Further, the interpretability of crop-yield-prediction models could provide valuable insights into the factors that contribute to crop-yield variability and could help to improve it. Explainable AI [162] should be further explored in crop mapping and yield prediction to address the challenges of model interpretability. A further avenue of research could be the integration of crop models with DL. Deep-learning models can learn complex patterns and relationships from large datasets, whereas crop models provide structured representations of the growth and development of crops. The combination of these two approaches can improve the accuracy, efficiency and interpretability of yield prediction.

Conclusions
In this systematic review, 90 papers related to DL- and RS-based crop-mapping and crop-yield-prediction studies were reviewed. The review provided an overview of the approaches used in these studies and presented important observations regarding the employed platforms, sensors, input features, architectures, frameworks, training data, spatial distributions of study sites, output scales, assessment criteria and performances. This review suggests that DL provides a promising solution for crop mapping and yield estimation at different scales. However, the mapping of crops and prediction of yields in new locations for new crops at the desired scale are still challenging. The issues include scarce target data, optimal model designs, generalisability across different domains and transparency. The resolution of these issues will better prepare us to apply these methods at scale and thus address the problems of food security and decision making in the food industry and agro-environmental management.