Investigating the Use of Street-Level Imagery and Deep Learning to Produce In-Situ Crop Type Information

Abstract: The creation of crop type maps from satellite data has proven challenging and is often impeded by a lack of accurate in situ data. Street-level imagery represents a new potential source of in situ data that may aid crop type mapping, but it requires automated algorithms to recognize the features of interest. This paper aims to demonstrate a method for crop type (i.e., maize, wheat and others) recognition from street-level imagery based on a convolutional neural network using a bottom-up approach. We trained the model with a highly accurate dataset of crowdsourced labelled street-level imagery using the Picture Pile application. The classification results achieved an AUC of 0.87 for wheat, 0.85 for maize and 0.73 for others. Given that wheat and maize are two of the most common food crops grown globally, combined with an ever-increasing amount of available street-level imagery, this approach could help address the need for improved global crop type monitoring. Challenges remain in addressing the noise aspect of street-level imagery (i.e., buildings, hedgerows, automobiles, etc.) and uncertainties due to differences in the time of day and location. Such an approach could also be applied to developing other in situ data sets from street-level imagery, e.g., for land use mapping or socioeconomic indicators.


Introduction
The spatial extent of cropland areas has been mapped extensively since the mid-eighties, as increasing numbers of satellites have been launched and higher-spatial- and temporal-resolution imagery has become available. For example, cropland is provided as a land cover class in many global land cover products such as GLC-2000 [1], MODIS land cover [2] and ESA-CCI [3], among many others. More recently, a time series of higher-resolution cropland extent products has been produced using Landsat [4] at a 30 m resolution, while Sentinel-2 is now also being used for land cover mapping, including cropland extent at a 10 m resolution [5]. However, for monitoring food security at national-to-global scales, spatially explicit crop type maps are urgently needed [6]. With recent advances in analytical methods, data infrastructure and the availability of higher-resolution imagery, several recent studies have applied machine learning techniques to crop type recognition. Some of the most successful include support vector machines (SVMs) [7][8][9], random forests [9][10][11][12], decision trees [12][13][14], the maximum likelihood classifier (MLC) [11,15,16], artificial neural networks (ANNs) [11,17] and minimum distance (MD) [11]. Another example is the work undertaken by Mou et al. in [16], where the authors proposed a deep recurrent neural network (RNN) for hyperspectral image classification. The RNN model effectively analyzed hyperspectral pixels as sequential data and determined information categories via network reasoning. The specific application of convolutional neural networks (CNNs) in remote sensing for crop type recognition has also shown excellent performance [13,16,18-24].
Several recent research studies have achieved a higher accuracy in the learning phase because of the implementation of CNNs. For instance, Cai et al. in [18] introduced a methodology for the cost-effective and in-season classification of field-level crop types using common land units (CLUs) from the United States Department of Agriculture (USDA) to aggregate spectral information based on a time series. The authors built a deep-learning-based classification model based on deep neural networks (DNNs). The research aimed to understand how different spatial and temporal features affected the classification performance. Their experiments also evaluated which input features were the most helpful in training the model and how various spatial and temporal factors affected the crop type classification. Castro et al. in [19] explored three approaches to improve the classification performance for land cover and crop type recognition in tropical areas using an image-stacking approach in combination with a CNN. Their approach outperformed the traditional system based on image stacking alone in terms of overall and class accuracy.
However, all of these applications require training data on the presence of different crop types, and a lack of such data is a challenge identified in many research papers. Many studies collect field-based data as part of the development of a training data set (see, e.g., [25]), which is costly and often not shared with the broader remote sensing community. For example, in the study by Wang et al. [20], local farmers contributed training data for crop type mapping by using a mobile application to take pictures and assign a label according to the crop type. However, these examples are generally limited to small data sets. Another source of in situ data is the LUCAS survey, which collects information at around 300 K locations across Europe [26]. However, the data are only collected every three years, they cover all land cover and land use types, not just crop types, and the data collection exercise is costly [27].
More recently, street-level imagery has become available for many areas around the world, e.g., from Google Street View and Baidu or as crowdsourced contributions through sites like Mapillary. However, much of the research involving street-level imagery has involved applications related to urban areas [28]. In contrast, D'Andrimont et al. (2018) [29] compared the amount of street-level imagery available for Europe with imagery available from the LUCAS survey as a potential source of training data, in particular for cropland mapping. They found that street-level imagery was available within 300 m of a LUCAS survey point for 9.4% of the EU territory, so it could provide additional training data. Focusing on the Netherlands, the authors then examined photographs from the Mapillary database for in situ crop type information. Of the 785 K photographs available, it was possible to identify some crops and to link these to agricultural parcels. However, the authors did not attempt to automatically classify the photographs for crop type.
The use of computer vision and the segmentation of street-level photographs is an active area of research. For example, Kang et al. (2018) [30] used images from Google Street View to classify building types, Cao et al. (2018) [31] created a land cover map of New York by combining street-level and aerial imagery, while a detailed urban map (of local climate zones) was developed by Cao et al. (2023) [32] using Google Street View. These and other similar studies are largely focused on the mapping of urban areas or features and have used pre-trained deep learning (DL) networks such as Places-CNN to first classify the images. The outputs from the pre-trained network are then often further classified for the urban features of interest. However, no such pre-trained network exists that predicts crop types or features that allow crop types to be identified. Moreover, the crop type labels that would allow such an existing pre-trained network to be used are also missing.
To fill this gap, the aim of this paper is to determine the feasibility of using a tool like Picture Pile for the rapid labelling of geo-tagged street-level photographs for crop types in combination with a CNN utilizing a deep learning architecture [33] to classify the images. The novelty lies in combining these two tools, where the first provides high-quality image labels and the second uses this information for the automatic classification of street-level photos for crop types. The images were labelled using Picture Pile as part of the Earth Challenge Food Insecurity crowdsourcing campaign [34]. In terms of crop species of global importance to food security, both maize and wheat (and related wheat crops) are crucial to meeting the global food demand [35]. Hence, the CNN was trained to recognize maize and wheat, referred to here as the Maize-Wheat-Other CNN (MWO CNN). Geo-tagged street-level images are noisy because, in addition to maize and wheat, many objects such as cars, streets, buildings and people present in the images make crop type classification more complex. Finally, we present the results from the CNN model regarding its performance in predicting crop types. Such a trained model could potentially generate a large in situ training data set on crop types given the large volume of street-level imagery now available. This, in turn, could then be used in classification algorithms to produce wall-to-wall crop type maps. The model is openly available at: https://github.com/iiasa/CropTypeRecognition (accessed on 25 August 2023).

Crowdsourced Labelling of Street-Level Imagery from Google Street View and Mapillary
A total of 10,776 street-level photographs were selected for use in this study, the majority of which were taken from Google Street View and a small number taken from Mapillary. The bulk of the images were from France (Figure 1), as France is both representative of central European agriculture and provides an openly available land parcel information database for benchmarking. These images were then placed into the Picture Pile rapid image classification app [36] and labelled by volunteers. The quality of the images varied across the data set. There were excellent images that contained very clear, unobstructed pictures of roadside crops. There were also poorer-quality or noisier images, which contained objects such as cars, houses, etc., in addition to a crop field (Figure 2). Table 1 lists the total number of images used in this study along with the number of images used in the model training, test and validation data sets.
In order to ensure the accuracy of the crowdsourced image classifications, we created a set of 867 control-point images for the crop types of wheat, maize, sunflower, vineyards, sorghum, olive trees and other crops. Each of these images was classified between 5 and 8 times by different individuals. If a minimum of 5 classifications agreed, then we marked that image as a crowdsourced control image. At the end of the campaign, we compared the crowdsourced results with the Land Parcel Information System (LPIS) of France.

Development of a Deep Learning Model for Crop Type Detection
CNNs are a popular data mining technique for image recognition, first introduced by Fukushima [37]. The use of CNNs for object classification has been implemented in many domains, achieving a high efficiency and accuracy [38][39][40]. Figure 3 presents the CNN architecture used in the MWO model. A CNN (and any neural network) requires what is referred to as hyperparameter tuning, i.e., the determination of parameters such as the number of convolution layers (and hence the number of filters applied), the size of the filter, the stride length and the pooling method. These different settings for the MWO model are explained in the sections that follow.
In the proposed CNN architecture used in this study, we applied two convolution layers and the maximum operator for pooling. To arrive at this configuration, we tested different-sized filters, different stride lengths and different numbers of convolution layers (from two to a maximum of four due to the computational cost). The final architecture with the best performance had two convolution layers. Figure 3 shows the first convolutional layer with 16 units, followed by a second convolutional layer with 32 units. The experiments also gave better results with a small filter and a smaller stride size. The final filter size used was a 2 by 2 matrix, and the stride length was set to 2.
For our architecture, we set the total number of hidden units in the dense layer to double the output size of the last convolution layer, i.e., 64 units. We then experimented with different loss and activation functions to find the combination that yielded the best performance, as shown in Table 2. After experimentation, we chose SOFTMAX as the last activation function and Y as the loss function to provide the best performance. Taking the output from the last dense layer as the input, the SOFTMAX function normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers, where K is the number of classes (i.e., 3). After applying SOFTMAX, each component lies in the interval (0,1), and the sum of all the values is equal to 1. Once normalized, they can be interpreted as probabilities, so larger input components correspond to larger probabilities. Finally, we experimented with the number of training epochs at which the learning process stabilizes. We found that this occurred after 15 epochs and therefore used this as the maximum value.
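To make this configuration concrete, the following is a minimal sketch of such an architecture in Keras (Python): two convolution layers with 16 and 32 filters, 2 by 2 filters with a stride of 2, maximum pooling, a 64-unit dense layer and a 3-class SOFTMAX output trained for up to 15 epochs. The input image size, the intermediate activation functions, the optimizer and the loss function are not fully specified in the text and are assumed here for illustration; this is a sketch, not the authors' released implementation (which is available at the GitHub link above).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 3  # maize, wheat, other

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),                                 # assumed input size
    layers.Conv2D(16, kernel_size=2, strides=2, activation="relu"),    # 1st conv layer: 16 units, 2x2 filter, stride 2
    layers.MaxPooling2D(pool_size=2),                                  # maximum operator for pooling
    layers.Conv2D(32, kernel_size=2, strides=2, activation="relu"),    # 2nd conv layer: 32 units
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),                               # dense layer: double the last conv output size
    layers.Dense(NUM_CLASSES, activation="softmax"),                   # SOFTMAX over the 3 classes
])

model.compile(optimizer="adam",                                        # optimizer is an assumption
              loss="categorical_crossentropy",                         # placeholder: the loss function is not named clearly in the text
              metrics=["accuracy"])

# Training would then stop at the point where learning stabilizes, e.g.:
# model.fit(train_images, train_labels, epochs=15, validation_data=(val_images, val_labels))
```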

Evaluation Methods
We evaluated the results from the crowdsourcing exercise using a standard confusion matrix and the overall accuracy. We evaluated the MWO CNN using three measures of accuracy: the precision, which evaluates how many of the positive classifications made by the model were actually correct; the recall, which measures the proportion of actual positives that were correctly identified by the model [41]; and the F1-score, which combines the precision and recall into an aggregated accuracy measure [42]. We also assessed the area under the receiver operating characteristic (ROC) curve [41], or the AUC, where the ROC is a plot of the recall (true-positive rate) against the false-positive rate (i.e., 1 − specificity). It allows one to determine how well the model performs across all classification thresholds. We split the data set into 80% for training, 10% for testing and 10% for validation.
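As an illustration of this evaluation protocol, the sketch below (Python with scikit-learn) performs the 80/10/10 split and computes the per-class precision, recall, F1-score and one-vs-rest AUC. The variable names (`images`, `labels`, `model`) and the class index order are placeholders and not the authors' actual code.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
from sklearn.preprocessing import label_binarize

# Placeholders: `images` holds the street-level photos and `labels` the
# integer class per image (0 = maize, 1 = wheat, 2 = other).
# 80% training, then the remaining 20% split equally into test and validation.
X_train, X_rest, y_train, y_rest = train_test_split(images, labels, test_size=0.2, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

y_prob = model.predict(X_test)        # softmax probabilities, shape (n_samples, 3)
y_pred = y_prob.argmax(axis=1)        # predicted class index per image

# Per-class precision, recall and F1-score
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average=None)

# One-vs-rest AUC per class, from the ROC curve of each class against the rest
y_bin = label_binarize(y_test, classes=[0, 1, 2])
auc_per_class = [roc_auc_score(y_bin[:, k], y_prob[:, k]) for k in range(3)]
```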

Crowdsourcing
A total of 10,776 street-level images were classified for this study during a crowdsourcing campaign with the intention of creating an accurate crop type training dataset. Approximately 600 people contributed around 76.5 K classifications, with many images classified multiple times in order to measure agreement. The participants classified the following crop types: wheat, maize, sunflower, vineyards, sorghum, olive trees and other crops.
Using the crowdsourced classifications and the parcel information from the official French 2016-2019 Land Parcel Information System (LPIS), we computed a confusion matrix to examine the performance of the crowd [43]. Since each image was labelled by more than one person, we selected a sample of classifications from the database in which a minimum of eight classifications per location were collected and where there was a majority agreement, i.e., at least five classifications were of the same crop type. Finally, a total of 2049 images were used for the comparison, yielding an overall accuracy of 98.7%. The final confusion matrix is shown in Table 3.
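A simple illustration of this agreement filtering and comparison is sketched below in Python with pandas and scikit-learn. The file and column names are hypothetical, and this is not the authors' processing code; it only shows the majority-vote logic described above (at least eight classifications per image, at least five agreeing) and the resulting confusion matrix against the LPIS labels.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical inputs: one row per crowdsourced classification, and one LPIS
# reference label per image.
votes = pd.read_csv("crowd_classifications.csv")  # columns: image_id, crop_label
lpis = pd.read_csv("lpis_labels.csv")             # columns: image_id, lpis_label

grouped = votes.groupby("image_id")["crop_label"]
summary = pd.DataFrame({
    "n_votes": grouped.size(),                                     # classifications per image
    "majority": grouped.agg(lambda s: s.value_counts().index[0]),  # most frequent label
    "top_count": grouped.agg(lambda s: s.value_counts().iloc[0]),  # votes for that label
})

# Keep images with at least 8 classifications of which at least 5 agree
selected = summary[(summary["n_votes"] >= 8) & (summary["top_count"] >= 5)].reset_index()

# Compare the majority crowd label with the LPIS reference label
merged = selected.merge(lpis, on="image_id")
print("Overall accuracy:", accuracy_score(merged["lpis_label"], merged["majority"]))
print(confusion_matrix(merged["lpis_label"], merged["majority"]))
```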

MWO CNN
Table 4 shows the overall evaluation results for the MWO CNN, using noisy street-level images and recognizing three kinds of crop types (i.e., maize, wheat and other). The overall accuracy was 75.93%. Examining the F1-score (combining precision and recall), we found that wheat had the highest value (82.04), followed by maize (79.72) and other (64.88). Figure 4 depicts the ROC curves for each class. The ROC curve shows the trade-off between the true-positive rate (recall) and the false-positive rate. As a reference, an ideal classifier has a high true-positive rate and a low false-positive rate. The area under the ROC curve (AUC) is a measure of a classifier's overall performance, where a value of 1 indicates a perfect classifier (i.e., no wrong classification in any of the test samples) and a value of 0.5 indicates that the classifier performs no better than random chance. As Figure 4 shows, the model was most accurate when classifying pictures of wheat crops, which was the class with the highest AUC of 0.87. The second-best performance corresponded to the class "Maize", with an AUC value of 0.85. As can be seen, the model struggled to classify images of the "Other" class, but still achieved values above 0.5.
After analyzing the nature of the images corresponding to the other class, we noticed that many contained non-crop objects and crop types other than maize or wheat. The similarity between these additional crops and those of interest (i.e., maize and wheat) made it difficult for the model to distinguish between them. As a result, we achieved a lower AUC performance of 0.73 for the other class.
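For reference, a per-class ROC plot like the one in Figure 4 can be produced as sketched below, reusing the test labels and softmax probabilities from the evaluation sketch above. This is an illustrative reconstruction, not the authors' plotting code.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

class_names = ["maize", "wheat", "other"]          # assumed class index order
y_bin = label_binarize(y_test, classes=[0, 1, 2])  # one-vs-rest binary labels per class

for k, name in enumerate(class_names):
    fpr, tpr, _ = roc_curve(y_bin[:, k], y_prob[:, k])
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")

plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False-positive rate")
plt.ylabel("True-positive rate (recall)")
plt.legend()
plt.show()
```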

Discussion
As crop type detection from satellite data has proven challenging due to a lack of training data, we explored the use of alternative methods for generating in situ data from street-level imagery. The first part of the method involved using Picture Pile to rapidly label the images using crowdsourcing. Picture Pile has been used in many different rapid image classification crowdsourcing campaigns [34], so considerable experience has been gained in producing a high-quality labelled data set. Hence, a high level of accuracy was achieved using the crowdsourcing approach (>95%). This rapid labelling approach could be used to build a very large image data set and then be used to create pre-trained networks such as those that already exist. The advantage would be that rather than the quite generic features that are currently identified by these pre-trained networks (e.g., grass), this network could focus specifically on major crop types.
We then introduced a deep learning architecture to classify noisy street-level images according to the following three classes: maize, wheat and other objects. In addition to the crops of interest, street-level imagery may include objects such as cars, roads, buildings, trees, people, other crops and more. Because of the nature of the viewing angle of street-level imagery, automatic classification can prove challenging, as the above-mentioned objects often obscure the view. This study differs from many others that have used street-level imagery, as those studies first used a pre-trained classifier to segment the images into features. These features are then input to a neural network to learn other specific features of interest, e.g., building types [30] or local climate zones [32]. In contrast, in this study, the images were fed directly into a CNN and classified by crop type in one system. Moreover, such an open-source classifier does not currently exist, as much of the focus of street-level classification to date has been on urban areas [28].
Another limitation of the current approach is that the street-level imagery used here was obtained opportunistically from existing sources such as Google Street View and Mapillary and was thus taken at different times of the day and from different geographical locations. Hence, there were additional uncertainties due to the effects of shading, the sun angle, the camera angle or differences in brightness. However, the CNN model still performed well despite these uncertainties. While it was not possible to replicate the level of accuracy achieved with the crowdsourcing approach, the MWO CNN model nevertheless produced initial results that are still promising (an AUC of 0.87 for wheat and an AUC of 0.85 for maize).
In the future, the model could be extended to other crop types, which may improve its ability to predict the 'other' class. Moreover, using a larger labelled image set may help to further reduce these uncertainties and improve the model performance.
These initial results are promising given the vast potential of these data as an in situ data set for crop types. With additional improvements, classified street-level imagery could provide a powerful training data set for global satellite mapping. Crop type information combined with the image acquisition dates could be ingested into various global land products. For example, the World Cereal system for the high-resolution mapping of cereals and maize globally [44], which currently lacks in situ data in many parts of the world, particularly from Africa, South America and parts of Asia, would greatly benefit from such a model. Street-level imagery is increasing in volume, and there are other providers such as Baidu that have yet to be used in this context.

Conclusions
Here, we introduced a convolutional neural network (CNN) architecture for crop type recognition, using deep learning to classify two specific crop types in street-level images. The architecture demonstrated the application of CNN methods to recognize the maize, wheat and other classes in street-level images.
The MWO CNN model was trained using more than 8000 crowdsourced street-level images from a Picture Pile campaign over France, where citizens contributed to labelling more than 10,000 images. The crowdsourced images were classified with an accuracy of >95%, ensuring that the model was trained on high-quality data. The MWO CNN model achieved an AUC of 0.87 for wheat and 0.85 for maize, two of the most predominant crops grown globally. The other class achieved an AUC of 0.73. Given the specific viewing angle of street-level imagery, various non-crop structures impeded the view, which could have confounded the algorithms. In addition, street-level imagery is an opportunistic form of data, collected infrequently at different times of the day with varying sun and sensor angles. Nonetheless, this method holds great potential to massively increase our ability to track important crop types globally as the amount of street-level imagery continues to increase.
Such an approach can also be used to classify other types of in situ features from street-level imagery, e.g., socioeconomic indicators. Although street-level imagery has been used in land cover mapping, in particular in the mapping of urban features, land use remains a difficult area to classify from remote sensing (satellite) imagery alone. Given the possibility of recognizing different types of land use from street-level images and the advent of new hyperspectral satellites coming online in the next few years, this may greatly improve our ability to create detailed land use maps of the world.

Figure 1. Locations of the majority of the 10,776 street-level images classified via crowdsourcing in Picture Pile as either maize, wheat or other, in France.

Figure 2. Typical noisy street-level images containing crop species and additional non-crop objects such as roads, buildings, vehicles and trees (a-f). These images were classified by the crowd as (a) maize; (b) maize; (c) wheat; (d) wheat; (e) other and (f) wheat.

Figure 3. The CNN architecture used in the Maize-Wheat-Other (MWO) model.

Figure 4. The receiver operating characteristic (ROC) curve for the MWO CNN model for the maize, wheat and other classes.

Table 1. The total number of classified street-level photographs used in the study, separated by crop type and usage by the CNN.

Table 2. Different activation and loss functions used in the experiments.

Table 3. Confusion matrix with land parcel information in columns and volunteer classifications as rows.

Table 4. Evaluation results for the MWO CNN for the maize, wheat and other classes.