Cotton Stand Counting from Unmanned Aerial System Imagery Using MobileNet and CenterNet Deep Learning Models

An accurate stand count is a prerequisite to determining the emergence rate, assessing seedling vigor, and facilitating site-specific management for optimal crop production. Traditional manual counting methods in stand assessment are labor intensive and time consuming for large-scale breeding programs or production field operations. This study aimed to apply two deep learning models, the MobileNet and CenterNet, to detect and count cotton plants at the seedling stage with unmanned aerial system (UAS) images. These models were trained with two datasets containing 400 and 900 images with variations in plant size and soil background brightness. The performance of these models was assessed with two testing datasets of different dimensions, testing dataset 1 with 300 by 400 pixels and testing dataset 2 with 250 by 1200 pixels. The model validation results showed that the mean average precision (mAP) and average recall (AR) were 79% and 73% for the CenterNet model, and 86% and 72% for the MobileNet model with 900 training images. The accuracy of cotton plant detection and counting was higher with testing dataset 1 for both CenterNet and MobileNet models. The results showed that the CenterNet model had a better overall performance for cotton plant detection and counting with 900 training images. The results also indicated that more training images are required when applying object detection models on images with different dimensions from training datasets. The mean absolute percentage error (MAPE), coefficient of determination (R²), and the root mean squared error (RMSE) values of the cotton plant counting were 0.07%, 0.98 and 0.37, respectively, with testing dataset 1 for the CenterNet model with 900 training images. Both MobileNet and CenterNet models have the potential to detect and count cotton plants accurately and in a timely manner based on high-resolution UAS images at the seedling stage.
This study provides valuable information for selecting the right deep learning tools and the appropriate number of training images for object detection projects in agricultural applications.


Introduction
An accurate plant stand count is a prerequisite to evaluating emergence rate, assessing seedling vigor and facilitating site-specific management. Stand count is required to measure crop density and uniformity of seedlings for breeding programs [1,2,3]. Stand count is critical for growers to make decisions for replanting and other site-specific management to avoid yield loss [4,5]. For example, cotton (Gossypium hirsutum L.) yield rapidly decreases if plant density is below five plants per linear meter of a row in the Texas High Plains [6]. The traditional method for determining plant stand count is to manually count the number of plants within a unit area, which is time consuming, labor intensive and subject to sampling bias. Efficient and accurate stand counting methods are needed to expedite breeding pipelines or improve decision support in precision crop management. Technological innovations in unmanned aerial systems (UAS) and advances in image processing provide opportunities to enhance high-throughput plant phenotyping,

Remote Sens. 2021, 13, x FOR PEER REVIEW
cotton stand counting, can provide valuable information about selecting the appropriate deep learning tools for the right tasks. These two models can separate and count individual cotton plants at the seedling stage. Therefore, the objective of this study was to assess the application of MobileNet and CenterNet models in cotton stand counting at the seedling stage. These models were evaluated for their performance in terms of the number of training images and dimensions of training and testing images.

Experimental Sites
This study was conducted in a research field (33°35′50.53″ N, 101°54′27.30″ W) in Lubbock County, Texas, in 2020. The climate in this region is semiarid, with an average annual rainfall of 487 mm, mostly falling between May and September, frequently as a result of convective thunderstorms [36]. The dominant soil type at the study site is Pullman clay loam (fine, mixed, superactive, thermic Torrertic Paleustolls), which has fine and mixed textures, good drainage and moderately high saturated hydraulic conductivity [37]. Three cotton varieties, FM 1911GLT, FM 1830GLT, and ST 4946GLB2 (BASF, Ludwigshafen, Germany), were planted on 28 May 2020. In total, there were 208 plots, each 8 m long and eight rows wide in a north-south direction. A 1.5-m alley was arranged between plots. A subsurface drip system was used to irrigate the crop during the growing season.
Figure 1 shows the general procedure of image acquisition, data processing and cotton stand counting using the CenterNet and MobileNet models. After the UAS images were captured, training images were randomly chosen from two flight dates. Two training datasets containing 400 and 900 images were prepared and used to train the CenterNet and MobileNet models separately. The trained model for each combination of model and training dataset was saved separately. Testing images in two datasets were passed through each trained model to detect and count cotton plants. The final output included bounding boxes of detected cotton plants, the detection class and the corresponding F1-score.
The algorithms of these two models were implemented using the TensorFlow [38] high-level application programming interface (API). TensorFlow is an end-to-end open-source platform developed by Google (Google Inc., Mountain View, CA, USA) for machine learning and deep learning applications. A Python (Version 3.7, Python Software Foundation) script was developed to run the algorithms on the Google Colaboratory [39] platform, a web-based integrated development environment (IDE) built on Jupyter notebooks that runs in the cloud. The training process was performed on a computer with a 12 GB GPU.

UAS Image Acquisition
A DJI Phantom 4 Pro (DJI, Shenzhen, China) with a 4K RGB camera was used for image acquisition. The UAS has a two-axis gimbal that can maintain the orientation of the camera independently from the movement of the aircraft. The UAS is controlled through a 2.4 GHz bidirectional transmission link that reports the battery voltage, Global Positioning System (GPS) reception, and the distance and height difference from the home point. The maximum flight duration of the UAS is about 30 min. The flight plan was created using the DJI GSPro software (Version 2.0.15, DJI, Shenzhen, China). The flight plan included 80% front overlap and 80% side overlap. The angle of the camera was set at 90 degrees to the land surface during flight. The UAS flew at an altitude of 20 m at a speed of 2.4 m s⁻¹, resulting in an image resolution of 3.3 mm. Images were acquired on 8 June and 14 June 2020. All image acquisitions were conducted on clear days with light to moderate wind conditions around the local solar noon. For each dataset, the raw images were stitched into an orthomosaic image using the Pix4DMapper software (Version 4.6.4, Pix4D S.A., Prilly, Switzerland).

Training and Testing Images
The training images were prepared using randomly cropped raw UAS images (Figure 2). The dimension of the training images was 300 by 400 pixels. For each training image, the LabelImg tool [40], a free and open-source image labeling tool, was applied to label individual cotton plants with two to four leaves with rectangular bounding boxes. Each output training image had a corresponding xml file containing the image filename, the path, the coordinates of the top left and bottom right corners of the bounding box for each labeled cotton plant, and the height and width of the image. Both the training images and their corresponding xml files were used in the model training process. Two training datasets were prepared. The first training dataset included 400 images randomly selected from the datasets acquired on 8 June and 14 June. The second training dataset included 900 images. Lin and Guo [24] found that the CNN model's performance was not stable with fewer than 500 training images, but its accuracy was high and similar with 900 and 1000 training images. Oh et al. [25] used 200, 400, 600, and 800 manually labeled training images for cotton stand count with YOLOv3 deep learning models. Therefore, we chose 400 and 900 training images to test the number of training images required for deep learning cotton stand count. The number of cotton plants in the training images was manually counted.
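The annotation files described above can be read back with a few lines of Python. The following is a minimal sketch, assuming the Pascal VOC style XML layout that LabelImg writes by default; the sample file name, the class label "cotton", the box coordinates and the width/height orientation are hypothetical values chosen for illustration, not data from this study.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the Pascal VOC layout that LabelImg writes by default.
SAMPLE_XML = """<annotation>
  <filename>plot_001.jpg</filename>
  <size><width>400</width><height>300</height><depth>3</depth></size>
  <object>
    <name>cotton</name>
    <bndbox><xmin>35</xmin><ymin>120</ymin><xmax>78</xmax><ymax>160</ymax></bndbox>
  </object>
  <object>
    <name>cotton</name>
    <bndbox><xmin>210</xmin><ymin>115</ymin><xmax>251</xmax><ymax>158</ymax></bndbox>
  </object>
</annotation>"""


def read_boxes(xml_text):
    """Return (filename, (width, height), [(xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = int(size.findtext("width"))
    height = int(size.findtext("height"))
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        # Top left corner is (xmin, ymin); bottom right corner is (xmax, ymax).
        boxes.append(tuple(int(bndbox.findtext(tag))
                           for tag in ("xmin", "ymin", "xmax", "ymax")))
    return root.findtext("filename"), (width, height), boxes


filename, image_size, boxes = read_boxes(SAMPLE_XML)
print(filename, image_size, len(boxes))
```

The number of labeled boxes per file gives the manual plant count for that training image, which is one way the labels and the manual counts could be cross-checked.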
To test the effect of testing image dimension on cotton stand counting accuracy, we created two testing datasets, each containing 100 randomly selected images. The first dataset, referred to as TD1, contained testing images with the same dimension as the training images (300 × 400 pixels). The second dataset, referred to as TD2, contained testing images with a dimension of 250 by 1200 pixels. Each testing image covered one row of cotton plants, 1.2 m for TD1 and 3.6 m for TD2, in the field. Cotton plants in each testing image were manually counted, and the number of cotton plants in TD1 and TD2 varied from one to eight and eight to 23, respectively.

MobileNet
The MobileNet is based on a streamlined architecture that uses depthwise separable convolutions, each consisting of a depthwise convolution followed by a pointwise convolution, to build lightweight deep neural networks. The SSD-MobileNet V2 model was applied in this study. The single-shot detector (SSD) architecture aims to predict bounding box locations and classify these boxes in a single network. The SSD uses a modified VGG-16 [41] model pre-trained on ImageNet [42] as its backbone, with additional convolutional feature layers of progressively decreasing size. VGG-16 is a commonly used base feature extractor with 16 weight layers. ImageNet is a large image database for visual object recognition research. The MobileNetV2 uses only a single convolution network applied to all the channels of the input image and slides the weighted sum to the next pixel. Compared with MobileNetV1, it introduces two new features: linear bottlenecks between layers and shortcut connections between bottlenecks [43]. The MobileNetV2 has two types of blocks, one with a stride of two for downsizing, and the other a residual block with a stride of one.
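To illustrate why depthwise separable convolutions make the network lightweight, the sketch below compares the weight count of a standard convolution with that of a depthwise convolution followed by a pointwise (1 × 1) convolution. The kernel size and channel numbers are arbitrary illustrative values, not the layer sizes of the model used in this study, and bias terms are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 32 input channels, 64 output channels.
std = standard_conv_params(3, 32, 64)
sep = depthwise_separable_params(3, 32, 64)
print(std, sep, round(std / sep, 2))
```

For this example layer the separable form needs roughly one eighth of the weights, which is the kind of saving that keeps MobileNet-style models small.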
The input image resolution was 320 × 320. The hyperparameters used for training the MobileNet model were a random normal initializer, a momentum optimizer (momentum = 0.9), a cosine decay learning rate schedule with a base rate of 0.1, a training batch size of 16 and 30,000 total training steps.
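The cosine decay schedule named above can be sketched as follows. This is a plain cosine decay written from its general definition, assuming the base rate of 0.1 and the 30,000 training steps reported here; any warm-up phase is omitted because the text does not describe one, so the function is an illustration rather than the exact schedule used in training.

```python
import math


def cosine_decay_lr(step, base_lr=0.1, total_steps=30000):
    """Learning rate at a given step under plain cosine decay (no warm-up assumed)."""
    step = min(max(step, 0), total_steps)  # clamp to the training range
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


for s in (0, 7500, 15000, 22500, 30000):
    print(s, round(cosine_decay_lr(s), 4))
```

The rate starts at the base value, falls to half of it at the midpoint and approaches zero at the final step. The same function with `base_lr=0.001` would correspond to the base rate reported for the CenterNet model.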

CenterNet
The second pre-trained model was the CenterNet ResNet50 from the TensorFlow Object Detection API. The CenterNet is a state-of-the-art object detection model based on deep convolutional neural networks that detects each object as a triplet, rather than a pair, of keypoints [29]. It focuses on the center region information of each target rather than the overlap with the object, making this approach cost-efficient. In contrast to the SSD MobileNet model, CenterNet models an object as a single point at the center of its bounding box. It uses keypoints to find center points and regresses all other object properties from them. The backbone of this model is the ResNet50, a 50-layer Residual Network. Center pooling helps to better detect center keypoints in both horizontal and vertical directions and aims to capture richer and more recognizable visual patterns [29]. Cascade corner pooling focuses on determining the corners of the bounding box by finding the maximum values on the boundary. Both cascade corner pooling and center pooling can be computed by combining corner pooling in different directions based on various situations [44].
The input image resolution was 320 × 320. The hyperparameters used for training the CenterNet model were a random normal initializer, the Adam optimizer, a cosine decay learning rate schedule with a base rate of 0.001, a training batch size of 8 and 30,000 total training steps.
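The idea of modeling each object as a single center point can be illustrated with a small peak-finding routine on a center heatmap. This is a simplified sketch and not the TensorFlow Object Detection API implementation: the heatmap values, the 3 × 3 neighbourhood and the 0.5 score threshold are assumptions made for illustration.

```python
def find_center_peaks(heatmap, threshold=0.5):
    """Return (row, col, score) for cells that are 3 x 3 local maxima above a threshold."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Collect the 3 x 3 neighbourhood, clipped at the image border.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ]
            if score >= max(neighbours):
                peaks.append((r, c, score))
    return peaks


# Hypothetical 6 x 8 center heatmap with two plants.
heatmap = [[0.0] * 8 for _ in range(6)]
heatmap[1][2] = 0.9   # first plant center
heatmap[1][3] = 0.6   # weaker response next to a stronger peak, suppressed
heatmap[4][6] = 0.7   # second plant center
heatmap[3][1] = 0.3   # below the threshold, ignored

peaks = find_center_peaks(heatmap)
print(peaks)
```

In a full detector, the box width and height would then be regressed at each retained center location; that step is left out here.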

Counting and Evaluations
The testing images were run through the models to detect and determine the number of cotton plants. A bounding box was applied around each detected cotton plant. Therefore, the number of bounding boxes represented the number of detected cotton plants in each testing image.
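Counting plants from the detector output then reduces to counting the retained boxes. The sketch below assumes each detection is a dictionary with a `box` and a `score`, and that boxes below a confidence threshold of 0.5 are discarded; both the data layout and the threshold are hypothetical, since the text does not state them.

```python
def count_plants(detections, score_threshold=0.5):
    """Count detections whose confidence score reaches the (assumed) threshold."""
    return sum(1 for det in detections if det["score"] >= score_threshold)


# Hypothetical detector output for one testing image: boxes are (xmin, ymin, xmax, ymax).
detections = [
    {"box": (30, 110, 75, 160), "score": 0.92},
    {"box": (120, 112, 160, 158), "score": 0.81},
    {"box": (205, 118, 250, 161), "score": 0.47},  # discarded: below the threshold
    {"box": (300, 109, 342, 155), "score": 0.66},
]

print(count_plants(detections))
```

Raising or lowering the threshold trades missed plants against false positives, so the value would need to be chosen on validation images.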
A set of metrics, including precision, recall, and F1-score, was applied to assess the performance of cotton plant detection and counting. Precision and recall are the most commonly used indicators for evaluating object detection methods. Precision indicates the proportion of predicted positives that are correct, and recall indicates the proportion of actual positives that the trained model captured [45]. The F1-score balances the two indicators [8]. Intersection over Union (IoU) measures how much the predicted cotton plant bounding boxes overlap with the manually labeled ground truth. The average recall (AR) is the recall averaged over all IoU ∈ [0.5, 1.0]. The interpolated average precision (AP) summarizes the shape of the precision/recall curve and is defined as the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ..., 1] [46]. The precision, recall, F1-score, and AP are computed as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{AP} = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} P_{interp}(r)$$

$$P_{interp}(r) = \max_{\tilde{r} \geq r} p(\tilde{r})$$

where true positive (TP) denotes the number of pixels predicted as cotton plants when these pixels are actually cotton plants; false positive (FP) denotes the number of pixels predicted as cotton plants when these pixels are actually soil or other features; false negative (FN) denotes the number of pixels predicted as other features when these pixels are cotton plants; $P_{interp}(r)$ is the interpolated precision, taken as the maximum precision at any recall level of at least $r$; and $p(\tilde{r})$ is the measured precision at recall $\tilde{r}$.
The mean absolute error (MAE), mean absolute percentage error (MAPE), coefficient of determination (R²), and the root mean squared error (RMSE) were used as evaluation metrics to assess the performance of the models in cotton plant counting:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| m_i - c_i \right|$$

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{\left| m_i - c_i \right|}{m_i}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (m_i - c_i)^2}{\sum_{i=1}^{n} (m_i - \bar{m})^2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (m_i - c_i)^2}$$

where $m_i$, $\bar{m}$, and $c_i$ represent the manually counted cotton plants for the ith image, the mean manual count, and the predicted count for the ith image, respectively, and $n$ is the number of testing images.
Table 1 shows the mAP and AR with IoU greater than 50% for the two models. The mAP, AR, and mean F1-score were 71, 48 and 75% for the CenterNet model, and 67, 39 and 63% for the MobileNet model with 400 training images. These results are similar to a study that reported an mAP of 86% using YOLOv3 with 200 labeled training images in predicting cotton stand count using UAS images [25]. The values of mAP, AR and mean F1-score increased by 8% and 19%, 25% and 33%, 12% and 18%, respectively, for the CenterNet and MobileNet models with 900 training images. These results demonstrate that a greater number of training images results in a more accurate model. The CenterNet model had higher mAP, AR and F1-score values than the MobileNet model for both training datasets, except for the mAP value with 900 training images. Therefore, the CenterNet model had a relatively better performance in training than the MobileNet model. With training images increasing from 400 to 900, the improvement in mAP, AR and F1-score values was less obvious for the CenterNet model than for the MobileNet model. The smaller improvement for the CenterNet model shows that the model already had a relatively stable and accurate performance with 400 training images. This indicates that the MobileNet model requires a higher number of training images to achieve acceptable mAP, AR and F1-score compared with the CenterNet model.
33, and the MAPE values were 0.26% and 0.11% with 400 and 900 training images, respectively. For TD2, both models performed adequately, although the accuracies were substantially lower than those for the corresponding models for TD1.
The MAE value decreased from 9.03 to 5.39 for the CenterNet model, and from 7.48 to 6.22 for the MobileNet model, as the number of training images increased from 400 to 900. The MAPE value dropped from 6.57% to 4.73% for the CenterNet model, and from 7.83% to 5.61% for the MobileNet model, as the number of training images increased from 400 to 900. These results were similar to previous studies that reported MAPE ranging from 4.3% to 9.8% [8,9,12,26,33,47]. Therefore, the MAPE results indicate that these two models had better accuracy in cotton plant detection and counting for testing images with the same dimension as the training images. For TD2, the MAPE results suggest that both models had accuracy similar to that of previous studies.
Figure 4 demonstrates a false-positive detection and the resulting overestimation error. The bright soil near the cotton plants was detected as a single cotton plant.
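The counting metrics reported here (MAE, MAPE, RMSE and R²) can be computed from paired manual and predicted counts as sketched below. The sketch assumes the standard definitions of these metrics, with R² taken as one minus the ratio of the residual sum of squares to the total sum of squares; the five count pairs are made-up values for illustration, not data from this study.

```python
import math


def counting_metrics(manual, predicted):
    """Return MAE, MAPE (%), RMSE and R2 for paired manual and predicted counts."""
    n = len(manual)
    errors = [m - c for m, c in zip(manual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e) / m for e, m in zip(errors, manual)) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_manual = sum(manual) / n
    ss_res = sum(e * e for e in errors)                      # residual sum of squares
    ss_tot = sum((m - mean_manual) ** 2 for m in manual)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MAPE": mape, "RMSE": rmse, "R2": r2}


# Hypothetical counts for five testing images.
manual = [4, 6, 8, 5, 7]
predicted = [4, 5, 8, 6, 7]
metrics = counting_metrics(manual, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```

Because MAPE divides by the manual count, images with very few plants weigh heavily in it, which is worth keeping in mind when comparing TD1 (one to eight plants per image) with TD2 (eight to 23 plants per image).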
Figure 5 shows an example of cotton plant detection by the two models on images acquired on 8 June 2020, with testing images of the same dimension as the training images. The CenterNet model accurately detected eight cotton plants with 900 training images and six cotton plants with 400 training images. The MobileNet model underestimated the number of cotton plants by three and two with 400 and 900 training images, respectively. Both models had higher accuracy for the testing images acquired on 14 June than for those acquired on 8 June (data not shown), because the cotton plants were relatively larger and easier to detect. However, the CenterNet model performed better with smaller cotton plants. As shown in Figure 5, the blue arrows indicate that the CenterNet model could detect and separate smaller cotton plants while the MobileNet model failed to detect them. The red arrow indicates that neither model could completely detect overlapping cotton plants in high-density situations with 400 training images, but both models successfully separated and detected the overlapping cotton plants with 900 training images.
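The detection metrics used in this evaluation (IoU, precision, recall, F1-score and the eleven-point interpolated AP) can be sketched in a few functions. The code follows the general definitions of these metrics; the boxes, the TP/FP/FN counts and the precision/recall points are hypothetical examples, and the matching of predictions to ground truth is not shown.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def eleven_point_ap(recalls, precisions):
    """Mean of the interpolated precision at recall levels 0, 0.1, ..., 1."""
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: the best precision at any recall >= level.
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11


overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
precision, recall, f1 = precision_recall_f1(tp=6, fp=0, fn=2)
ap = eleven_point_ap([0.25, 0.5, 0.75], [1.0, 1.0, 0.75])
print(round(overlap, 3), precision, recall, round(f1, 3), round(ap, 3))
```

A prediction would typically be counted as a true positive when its IoU with a labeled plant is at least 0.5, in line with the IoU threshold used for the mAP and AR values in this study.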

Model Evaluation in Stand Counting
The results also demonstrated that the accuracy of cotton plant detection and counting was higher when the soil was relatively dark. Figure 6 shows an example of testing images acquired on 8 June 2020 and processed by a model trained with 900 training images of the same dimension. The algorithm successfully detected and counted cotton plants against a darker and wetter soil background. The mean F1-score for this scene was 100%. For cotton plants against a dry and brighter soil background, the algorithm underestimated the count by two cotton plants. The mean F1-score was 54.5% for this scene. Previous studies had similar findings; images with darker soil color and less soil crusting had higher accuracy in cotton plant detection and counting [25,26]. The dataset in this study was too limited to support a systematic evaluation of the effects of soil background on cotton plant detection. Further studies on cotton stand counting need to incorporate soil moisture, soil color, soil texture and soil roughness conditions that directly affect soil reflectance.
Figure 6. Examples of cotton plant detection and counting results using unmanned aerial system images at dry and wet soil conditions. Percentage labels around bounding boxes represent F1-scores for detected cotton plants.

Discussion
The results on training validation and testing showed that the CenterNet model had an overall superior performance. The cotton plants in our study were small, with an average plant diameter of 2.4 cm and 3.5 cm on 8 June and 14 June, respectively. All object detection models have similar challenges when the target object is too small [8,21,22]. The CenterNet model is more sensitive to small objects [48,49]. The CenterNet model uses keypoint estimation to locate the center of each bounding box, and other object properties, such as orientation, location, and size, are regressed from image features at the center location [29,44]. This explains the superior performance of the CenterNet model on cotton stand counting in this study. On the other hand, the backbone of the MobileNet was VGG-16, which had far fewer convolutional layers, causing lower detection accuracy [50,51]. However, the MobileNet model is simpler, faster and more accurate than the two-stage detector models [48,49,52]. Some real-time object detection tasks have been tested using MobileNet models with smartphone applications [53,54]. Since the model is small and has low latency, the cost of data transfer among UAS sensors, cloud databases, and deep learning inference could be minimized. In this study, the trained model size was only about 30 MB and the average prediction time was 0.2 s per image, which makes it promising for on-site real-time cotton phenotyping with the MobileNet model using images acquired with UAS platforms or smartphones in the future.
Few studies have assessed the performance of deep learning object detection models with limited training images in agricultural studies. The mAP values in this study appeared relatively low (67-86% for the MobileNet model and 71-79% for the CenterNet model), but they are higher than the mAP values of many previous studies on object detection. For example, CoupleNet, Faster R-CNN, Mask R-CNN, RetinaNet, and CornerNet algorithms reported mAP values ranging from 28% to 62% [29,43,55,56]. It should be noted that mAP and AR are applied to evaluate the object detection performance in the training process and are not a measure of object detection accuracy. Deep learning tasks require abundant training images [16,57], especially for complex object detection tasks such as cotton plant detection and counting. Lin and Guo [24] provided useful information regarding the required number of training images for sorghum panicle detection. They found that a deep learning algorithm had low and inconsistent accuracies with fewer than 500 training images, but had accurate results with 1000 training images. The results on TD1 and TD2 showed a similar trend for both CenterNet and MobileNet models, in that the overall accuracy increased with the number of training images. In addition, the dimension of the testing images and its consistency with that of the training images played a role in this study. Previous studies showed that agricultural object detection tasks could be achieved successfully and accurately with limited training images when the dimensions of testing and training images were the same [9,25]. However, robust and practical models are needed to detect objects when the training and testing datasets have different image dimensions.
Because of the small plant size, most plant phenotyping tasks require high-resolution UAS images acquired at a low altitude [25,58,59]. This study applied a 4K RGB sensor to acquire images at a relatively low altitude to detect and count cotton plants at the seedling stage. Because plant seedlings are relatively small, a fine ground sample distance (GSD) is required to detect cotton plants from UAS images. This, in turn, requires image acquisition at low altitudes, possibly below 10 m. UAS platforms typically do not fly automatically at such a low altitude. Therefore, researchers have to fly the UAS manually, which may cause many errors during image collection. To overcome this challenge, various sensors and post-processing procedures were evaluated in recent studies. For example, Feng et al. [15] applied multispectral sensors to capture additional image datasets for cotton plant detection and counting. Another study improved the overall quality of RGB images acquired at a relatively high altitude (50 m) by combining high-resolution RGB images, relatively low-resolution multispectral images, different vegetation indices and a digital surface model (DSM) [52]. It is practical and reasonable to use only the 4K RGB sensor flying at an optimal altitude to capture high-quality images for agricultural object detection tasks in commercial fields. More studies and observations are required to examine the optimal image resolutions and corresponding flight altitudes for cotton stand count in future work.
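The trade-off between flight altitude and GSD discussed above can be approximated with a simple proportional scaling. The sketch below assumes that the 3.3 mm image resolution reported at 20 m is the GSD, and that the same camera is flown nadir-looking over flat terrain so that GSD grows linearly with altitude; it is a rough planning aid, not a substitute for the camera specifications.

```python
def scaled_gsd_mm(altitude_m, ref_altitude_m=20.0, ref_gsd_mm=3.3):
    """Estimate GSD at a new altitude by linear scaling from a reference flight."""
    return ref_gsd_mm * altitude_m / ref_altitude_m


for altitude in (10, 20, 50):
    print(altitude, "m ->", round(scaled_gsd_mm(altitude), 2), "mm per pixel")
```

Under these assumptions, halving the altitude to 10 m would halve the GSD, while a 50 m flight would give a GSD about 2.5 times coarser than the one used in this study.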
Various environmental factors, such as wind, cloud and light conditions, affect UAS image quality, which can influence the execution and performance of deep learning algorithms. In this study, the UAS images were acquired around local solar noon on clear days. As a result, the cotton plants and soil surface were relatively bright and had low contrast in strong sunlight. Both models performed better on cotton plant detection and counting with images having a darker soil background. Similar results were found in other studies in crop monitoring and analyses [8,47]. The effects of field conditions, such as soil color and brightness, on the accuracy of cotton plant detection and counting were similar to those reported in previous studies [25,26,60,61]. In future studies, one may consider acquiring UAS images under relatively soft light conditions, such as late afternoon or early morning, to obtain better contrast between cotton plants and soil background for accurate plant stand counting results. In addition, the images were captured on two dates during the seedling stage in this study. The differences in cotton plant size and canopy cover might have resulted in different results for the two models. Therefore, more studies are required to examine the effect of plant size on stand counting accuracy by acquiring images on different dates in the early growing season.

Conclusions
Two deep learning models, the MobileNet and CenterNet, were applied to detect and count cotton plants at the seedling stage from UAS images. These models were trained with two datasets containing 400 and 900 images. The performance of these models was assessed with two testing datasets of different dimensions, 300 by 400 pixels and 250 by 1200 pixels. The CenterNet model had a better overall performance on cotton plant detection and counting, indicated by greater values of mAP, recall and R², and lower RMSE, MAE and MAPE values. The MobileNet model was more efficient in training time and required less disk space. When the training and testing image dimensions were the same, the accuracy of cotton stand counting was acceptable (R² = 0.86 and MAPE = 0.26% for the CenterNet model; R² = 0.89 and MAPE = 0.10% for the MobileNet model) with 400 training images. With 900 training images, the cotton plant counting had better performance for both models (R² = 0.96 and MAPE = 0.11% for the CenterNet model; R² = 0.98 and MAPE = 0.07% for the MobileNet model). Cotton stand counting for testing images with larger dimensions required more training images to achieve high accuracy. Therefore, this study helps to determine the right deep learning tools and an appropriate number of training images under certain conditions for object detection in agricultural applications.
Both the CenterNet and MobileNet models have the potential to accurately detect and count cotton plants at the seedling stage. However, there are challenges in detecting small cotton plants under high brightness and low contrast conditions. Therefore, further studies need to investigate cotton plant detection accuracy as influenced by environmental factors, including image resolution, soil background and illumination levels. Further studies are also needed to evaluate the optimal image resolutions and corresponding flight altitudes for accurate cotton stand counting.

Data Availability Statement:
The data generated from this study are available from the corresponding author on request.