
High-Performance and Lightweight AI Model with Integrated Self-Attention Layers for Soybean Pod Number Estimation

School of Architecture, Southern Illinois University, Carbondale, IL 62901, USA
AI 2025, 6(7), 135; https://doi.org/10.3390/ai6070135
Submission received: 6 May 2025 / Revised: 10 June 2025 / Accepted: 20 June 2025 / Published: 24 June 2025

Abstract

Background: Soybean is a globally important crop for food security and agricultural economics. Accurate estimation of soybean pod counts is critical for yield prediction, breeding programs, and precision farming. Traditional methods, such as manual counting, are slow, labor-intensive, and prone to errors. With rapid advancements in artificial intelligence (AI), deep learning combined with unmanned aerial vehicle (UAV) imagery has enabled automatic pod number estimation. However, existing AI models are computationally demanding and require significant processing resources (e.g., memory), which are often unavailable in rural regions and on small farms. Methods: To address these challenges, this study presents a set of lightweight, efficient AI models designed to overcome these limitations. By integrating model simplification, weight quantization, and squeeze-and-excitation (SE) self-attention blocks, we develop compact AI models capable of fast and accurate soybean pod count estimation. Results and Conclusions: Experimental results show a comparable estimation accuracy of 84-87%, while model size is reduced by a factor of 9-65, making the models suitable for deployment on edge devices such as Raspberry Pi. Compared to existing models such as YOLO POD and SoybeanNet, which rely on over 20 million parameters to achieve approximately 84% accuracy, our proposed lightweight models deliver comparable or higher accuracy (84.0-86.76%) while using fewer than 2 million parameters. In future work, we plan to expand the dataset with more diverse soybean images to enhance model generalizability, explore more advanced attention mechanisms, such as CBAM or ECA, to further improve feature extraction, and implement the complete system on edge devices for real-world testing in soybean fields.

1. Introduction

As a common ingredient in food products, soybean is a key crop for human society, serving as a major source of oil and protein [1]. For soybean farmers, soybean pod estimation is critical as it directly influences yield prediction, labor and resource allocation, and farm profit forecasts [2]. In addition, early and precise yield estimation enables better financial planning by providing insights into expected harvest quantities, allowing farmers to efficiently strategize sales contracts, storage space, and transportation arrangements. With the increasing demand for food production and sustainable agriculture, the need for data-driven, efficient methods to support crop management has become increasingly urgent. As a result, driven by the rise of smart agriculture, soybean pod number estimation has become an active research topic.
Traditionally, soybean farmers estimate soybean yield by manually counting pods in the field. This method relies on visual inspection, which is labor-intensive, time-consuming, and prone to inconsistencies due to variability in human judgement and environmental factors such as varying light conditions [3]. Recent developments in computer vision, unmanned aerial vehicles (UAVs), and artificial intelligence (AI) have improved object detection and counting tasks in agriculture. Convolutional neural networks (CNNs) have been widely used for plant feature extraction, but they often struggle with complex backgrounds and varying lighting conditions in real-world fields [4,5]. In contrast, transformer-based architectures, particularly self-attention mechanisms, have recently become popular in vision tasks due to their ability to capture global dependencies more effectively than traditional CNNs [6]. Despite their success, these models are often computationally intensive and less accessible to small farms with limited infrastructure and technicians.
Despite technological advancements, soybean pod number estimation remains a challenging task, especially during the soybean harvesting season [7,8,9]. When soybean plants are close to maturity, their color transitions to grayish-brown, which closely resembles the color of the surrounding soil clumps, withered weeds, and fallen leaves. This color similarity creates significant visual ambiguity, making it hard to differentiate between soybean pods and other objects (e.g., soil clumps, withered weeds, fallen leaves). Moreover, as soybean plants mature, they tend to bend or fall over, which causes them to spread across the farmland and mix with nearby objects. This physical overlap further complicates segmentation and object recognition in UAV-based imaging systems. The combination of soil clumps, fallen leaves, withered weeds, shadows, bright sun, and plant bending makes it more difficult for AI models to accurately detect and classify soybean pods against such backgrounds. These visual and spatial challenges require AI models that can extract essential features in noisy and complex environments.
Another key challenge arises from the operational constraints of UAVs. Because batteries add weight and offer limited capacity, UAVs must fly at higher altitudes to cover large farmland areas before their power is depleted. Flying at higher altitudes reduces the number of trips needed for a complete survey. However, this altitude increase results in lower effective image resolution, as the UAV camera captures a wider field of view. The increased background noise and loss of fine detail make it more difficult to differentiate soybean pods from the background, accurately identify pod structures, and assess crop conditions [10]. This limitation is further exacerbated by shadows, lighting variations, and soil moisture content, which introduce additional noise into the images. As a result, distinguishing between soybean pods and background objects remains a major challenge in UAV-based applications [11].
Furthermore, farmers in rural regions often face barriers to adopting advanced agricultural technologies due to limited access to computing resources and technical experts. Many of these farms operate in areas with inadequate internet connectivity and lack the infrastructure needed to process large datasets efficiently. This technological gap restricts their ability to implement sophisticated AI-driven solutions for precision agriculture. Due to investment limitations, farmers are unlikely to invest in expensive AI machines (such as NVIDIA GPUs) for running AI models. As a result, the AI models targeted in this research topic should be lightweight and optimized to run efficiently on edge devices [12,13,14,15]. For example, devices such as Raspberry Pi 4 or 5—commonly used for agricultural purposes—are available at prices ranging from USD 50 to USD 120. These edge devices offer an affordable platform for deploying AI solutions directly in the field, without the need for cloud connectivity or high-end hardware. This low price ensures that farmers can leverage AI-driven insights without requiring high-end computing resources, making the technology more accessible for their rural farmlands. Therefore, it is essential to develop lightweight AI models that require affordable computational power and are user-friendly, enabling farmers to benefit from data-driven insights without needing extensive technical expertise or expensive hardware. This is the focus of this paper.
The main contributions of this research are summarized below:
(1) Development of a set of lightweight AI models for soybean pod estimation: This research introduces a set of AI models that combine CNNs and transformer mechanisms. CNNs effectively capture spatial features such as texture and structure from soybean field images, while self-attention mechanisms capture long-range dependencies and global context. This hybrid design enables accurate estimation of soybean pod density against complex backgrounds with varying lighting conditions and clutter such as fallen leaves, soil clumps, and withered weeds.
(2) Weight quantization of our lightweight AI models: To make AI models accessible to farmers in rural areas with limited computational resources, this research demonstrates the use of weight quantization techniques to reduce the memory footprint and computational overhead. The quantized models retain high accuracy while being optimized to run efficiently on edge devices.
(3) Evaluation of the proposed model performance on Raspberry Pi 4 and 5: The research also evaluates our model performance on edge devices with the Edge Impulse platform, a leading online platform for edge AI evaluation [16]. The evaluation results show inference speeds of 0.26-0.89 frames per second on Raspberry Pi 4 and 4.5-25 frames per second on Raspberry Pi 5, depending on the model variant. In addition, their memory footprints range from 0.27 MB to 1.91 MB, leaving ample space within each Raspberry Pi's memory for the operating system, camera services, and image preprocessing.
To conclude, this study proposes a lightweight AI framework for soybean pod estimation that integrates model simplification techniques, transformer-inspired attention mechanisms, and edge-compatible quantization strategies. This paper is organized as follows: Section 2 reviews the related work and analyzes existing AI models. Section 3 discusses the design considerations and presents the proposed AI models along with their optimization strategies. Section 4 provides a discussion of the results and research limitations, and Section 5 concludes the paper.

2. Related Work

In this section, we review the existing state-of-the-art AI models used for soybean pod counting or estimation. These AI models include the following:
In 2019, inspired by the multi-column CNN, the authors of [17] designed a simplified two-column CNN model for soybean seed counting. They used 3 × 3 and 5 × 5 filters to capture features at multiple scales. While both branches have a similar structure, their outputs are combined before producing the final results. In this study, a soybean dataset was built against a black cloth background, which is quite different from real-world field conditions. This paper did not disclose the memory size of their AI model or provide actual counting accuracy under realistic scenarios.
In 2023, in order to overcome the challenges of manual counting and localization in field conditions, the authors of [10] presented the P2PNet-Soy model for field-based soybean counting. Several techniques were utilized, such as multi-scale feature extraction and unsupervised clustering. The VGG-16 network was used as the base framework for this model. Even though the authors did not report the exact model size, the use of VGG-16, which has 138 million parameters, indicates that the model is computationally heavy.
In 2023, to effectively detect soybean pods, the authors of [18] introduced another AI model, YOLO POD, which chose the YOLOX architecture as its foundation [19]. Since pod images often contain considerable background noise, the authors introduced the CBAM module [20] to help the AI model focus on relevant regions, as CBAM can reweight important spatial and channel features. However, since all pictures in the dataset were taken against a black cloth background, this study did not train or evaluate the model on field-based images.
In 2024, the authors of [21] presented a soybean pod counting model named SPCN, which integrates dilated convolutions with an attention mechanism. The model architecture consists of three main parts: a front end built on the VGG-19 network for feature extraction, a CBAM module to highlight important spatial and channel features, and a back end that utilizes dilated convolutions. However, this SPCN model is quite complex, as the VGG-19 network alone contains over 144 million parameters.
In 2024, the authors of [7] proposed SoybeanNet, a transformer-based model aimed at point-based counting using real-world field images. This model achieves a test accuracy of 84.51%, which is remarkable progress towards in-field soybean yield estimation. Even though the total number of parameters is not reported, this model relies on Swin transformers [22] as a backbone, which contain parameters in the range of 29 to 88 million. Such high computational demands make this model hard to deploy on edge devices.
In 2025, the authors of [23] introduced PodNet, a lightweight detection network designed with edge device applications in mind. This model consists of an encoder and a decoder. The encoder maps input images into feature maps, with the final stage incorporating the SPPF module [24] along with the CSPDarknet structure [25] to enhance feature representations. Then, the decoder merges and interprets these features to produce the final detection output. The authors claimed that the model has about 2.48 million parameters; however, this study did not provide details on its inference performance on edge devices.
In 2025, the authors of [26] introduced GenPoD, a framework designed to address on-branch soybean pod detection with occlusion and class imbalance. This approach combines synthetic image generation with multi-stage transfer learning and is based on a YOLOv7-tiny model, which contains 6.2 million parameters. By gradually training on a mixed dataset, this framework improves feature extraction and achieves a mean average precision of 81.1%.
Table 1 provides an overview of the existing AI models for soybean pod number estimation in the literature. While these models have shown progress in counting capability, most of them involve a large number of trainable parameters. As a result, their practical usability on rural farms is limited by the demand for computational resources. To address this challenge, this work focuses on the development of lightweight, high-performance AI models that can run smoothly on edge devices such as Raspberry Pi.

3. Proposed AI Design and Evaluation

3.1. Design Considerations

To tackle this problem, we have several key design considerations and principles.
First, it is not necessary to know the exact location of each soybean pod for effective yield prediction and harvesting. Unlike tasks such as fruit picking or precision seed planting, where accurate spatial positioning is crucial, soybeans are typically harvested using combine harvesters [27]. These combine harvesters can process entire rows of soybean plants without targeting individual pods. Therefore, the main goal is to estimate pod counts within an area instead of identifying their exact locations. This focus reduces the computational burden of AI models by allowing them to prioritize density estimation over precise detection. For this reason, we do not consider models that first locate individual pods before counting, as this additional step adds unnecessary complexity and computational overhead.
Second, instead of counting each pod individually, it is often more practical to classify in-field soybean images into pod density ranges [17]. For example, an AI model could estimate whether an area contains between 100 and 120 pods, rather than pinpointing an exact number. This principle reduces the computational burden and also fits real-world farming needs, where farmers want yield estimates rather than precise counts. Treating pod estimation as a density classification problem allows for quicker processing and helps mitigate real-world challenges such as variable lighting and complex backgrounds. The model simplification also supports the use of low-cost edge devices for deployment.
Third, inspired by the ideas from [7,21], we aim to develop a hybrid model that combines the strengths of convolutional neural networks (CNNs) and the attention mechanisms of transformers. CNNs are well-suited for capturing spatial features and structural details from soybean field images, but they often struggle with understanding the broader context, especially in agricultural scenes with occlusions and uneven lighting. In contrast, attention mechanisms are effective at capturing relationships across an entire image and help the AI model understand how different regions of an image relate to their neighbors. By combining these two approaches, our model can more accurately estimate pod density in real-world field conditions. Meanwhile, it can maintain a good balance between efficiency and accuracy.
Fourth, we plan to simplify the model and make it lightweight [28,29]. For example, starting from a baseline neural network architecture, we may reduce the number of layers, channels, or connections. In this way, the network gradually becomes narrower, reducing the number of parameters and operations required on edge devices.
Last, we plan to make the model lightweight by applying techniques like weight quantization [30,31,32]. This method reduces the precision of the weights in neural networks, typically converting floating-point values into lower bit-width representations, such as 4-bit or 8-bit integers. This change can dramatically lower the memory requirements of AI models and speed up their inference process. As a result, the quantized version becomes more efficient and a better fit for resource-constrained edge devices, such as Raspberry Pi, where memory and computational power are limited.
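To make the idea concrete, the following minimal sketch illustrates affine (asymmetric) 8-bit weight quantization with NumPy. It is purely illustrative; the actual conversion pipeline used in this work is the TensorFlow Lite workflow described in Section 3.6.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Affine (asymmetric) quantization: map float32 weights to int8."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0                 # one int8 step in float units
    zero_point = int(round(-128 - w_min / scale))   # int8 value representing 0.0
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float32 weights from int8 values."""
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(256).astype(np.float32)
q, s, z = quantize_int8(w)
print("max quantization error:", np.abs(w - dequantize(q, s, z)).max())  # ~scale/2
```

Each weight is thus stored in one byte instead of four, which is the source of the memory savings discussed above; the reconstruction error is bounded by about half a quantization step.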

3.2. Dataset Construction and Preprocessing

The dataset used in this study was derived from an open-source UAV image dataset publicly available on Kaggle [33]. The original images were collected under real-world field conditions and included pod count annotations. From these images, we randomly cropped 33,809 sub-images of 300 × 300 pixels (as shown in Figure 1) to build a new dataset tailored for classification based on pod count ranges. Each image was manually verified and assigned to one of eight pod count categories, as detailed in Table 2. The eight classification categories used in this study (<40, 41–80, …, >281 pods) were designed to represent a wide range of pod densities observed in the field dataset. While these categories are not directly aligned with specific crop phenological stages or agronomic stress indicators, they were chosen to enable a meaningful and practical classification of yield potential. The thresholds were selected based on the empirical distribution of pod counts in the data to ensure balanced representation across categories and effective model training. Figure 1 illustrates the appearance of actual soybean fields during the harvest season. It clearly depicts the complex background, including soil clumps, dried weeds, and fallen leaves. Their color and shape similarities introduce ambiguity and make it hard to distinguish soybean pods.
The resulting dataset was split into training (70%), validation (15%), and testing (15%) subsets. To improve model robustness and generalization, we applied a series of preprocessing steps, including brightness normalization and data augmentation techniques, such as random rotations and flips. These procedures enriched the dataset's diversity without requiring additional manual labeling. The eight pod count categories ensured broad exposure to varying pod densities, and the images within each class captured a wide range of real-world visual conditions, including different lighting, backgrounds, and environmental factors. As a result, the dataset supports the development and evaluation of AI models that generalize effectively for deployment in practical agricultural settings.
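As an illustration of this preprocessing, the sketch below shows one plausible way to bin verified pod counts into the eight categories of Table 2 and to apply the flip and rotation augmentations described above using TensorFlow. The inclusive upper-edge boundary handling and the function names are assumptions, as the exact preprocessing code is not reproduced in this paper.

```python
import tensorflow as tf

# Upper edges of the first seven categories from Table 2
# (<=40, 41-80, ..., 241-280); larger counts fall into class 7.
# Treating each upper edge as inclusive is an assumption.
BIN_EDGES = [40, 80, 120, 160, 200, 240, 280]

def pod_count_to_class(count: int) -> int:
    """Map a verified pod count to one of the eight density categories."""
    for cls, edge in enumerate(BIN_EDGES):
        if count <= edge:
            return cls
    return len(BIN_EDGES)  # class 7: highest density range

def augment(image: tf.Tensor, label: tf.Tensor):
    """Random flips and 90-degree rotations, matching the augmentations above."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    return image, label

# Hypothetical usage on a dataset of (300x300 crop, class label) pairs:
# train_ds = train_ds.map(augment).shuffle(1024).batch(32)
```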

3.3. Baseline AI Model Performance

Experiments have been performed on several baseline AI models in the literature, including ShuffleNet, MobileNet, EfficientNet, NasNetMobile, DenseNet, Xception, Inception, and ResNet. Table 3 provides a comparative summary of existing AI baseline models used for soybean pod number estimation, highlighting key metrics such as the number of trainable parameters, model accuracy, and memory footprint. Among these models, DenseNet121 and ResNet50 achieve the highest accuracy, at 87.14% and 87.1%, respectively. However, these models also have significantly larger memory footprints (54.95 MB or more) and over 7 million trainable parameters, which makes them less practical for deployment on resource-constrained edge devices. The most lightweight model is ShuffleNet, which has the smallest memory usage and the fewest parameters. Unfortunately, its classification accuracy is relatively low, at only 77.69%. In contrast, MobileNet and MobileNetV2 offer a strong balance between performance and efficiency, with an accuracy of about 86% and a memory footprint under 25 MB for each. Figure 2 plots the classification accuracy versus the AI model size. Overall, Table 3 and Figure 2 show the trade-offs between model complexity, resource consumption, and estimation performance, which are critical considerations when selecting AI architectures for deployment in real-world agricultural settings.

3.4. Baseline AI Models with Simplification

From the previous section, we identified MobileNet and MobileNetV2 as the most promising baseline architectures due to their balance between accuracy and computational efficiency. In this subsection, we explore strategies to further reduce their complexity while preserving their core structures and performance characteristics.
One potential approach to reduce model size in MobileNet architectures is to adjust the width multiplier, known as the alpha value. This parameter uniformly scales the number of channels in each layer, allowing control of the overall size of the model while preserving its structural integrity. Lowering the alpha value can significantly reduce both the number of trainable parameters and computational costs, making these models more suitable for deployment on resource-constrained edge devices. In this study, we selected alpha values of 0.25, 0.50, and 0.75 to represent lightweight, medium, and near-full capacity configurations, respectively. These values are commonly supported in deployment platforms such as TensorFlow Lite and Edge Impulse, ensuring practical feasibility for real-world use. As shown in Table 4 and Figure 3, model size reduction comes at the cost of classification accuracy. For example, when alpha is set to 0.5, the model's memory footprint shrinks by roughly a factor of four, and the accuracy drops by 1-3%. Hence, we will select an appropriate alpha value to make a good trade-off between model size and accuracy. In resource-constrained deployment environments, a smaller alpha value may be preferred to meet memory and speed requirements, even with a slight drop in performance accuracy.
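For reference, selecting the width multiplier is a single argument when instantiating MobileNet in Keras. The sketch below assumes training from scratch on the 300 × 300 crops with eight output classes; the optimizer and loss settings are assumptions, not the paper's exact configuration.

```python
import tensorflow as tf

# The width multiplier (alpha) uniformly scales the channel count of every
# layer; 0.25, 0.5, and 0.75 give the lightweight, medium, and near-full
# capacity variants evaluated in Table 4.
model = tf.keras.applications.MobileNet(
    input_shape=(300, 300, 3),  # matches the 300x300 sub-images
    alpha=0.25,                 # swap in 0.5 or 0.75 for the larger variants
    weights=None,               # no pretrained weights; train on the pod dataset
    classes=8,                  # eight pod count categories
)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
print(f"{model.count_params():,} parameters")  # roughly 220K for alpha=0.25
```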

3.5. AI Models with Integrated Self-Attention Layers

While full self-attention mechanisms like transformers are powerful tools for capturing global contexts, they come with substantial computational demands [34]. To maintain a balance between performance and efficiency, we opted to enhance the MobileNet architecture by using a lightweight self-attention module: the Squeeze-and-Excitation (SE) block. The SE block offers a compelling compromise between complexity and performance. It requires only a small number of additional operations and integrates easily into existing architectures, such as MobileNet, typically inserted after convolutional layers with minimal modification. As shown in Table 5, the inclusion of SE blocks increases memory usage by only 3.5% to 4.0% compared to the original MobileNet architectures across various alpha values. Its low memory footprint and straightforward implementation make it particularly well-suited for deployment in resource-constrained environments. Although SE blocks focus solely on channel-wise attention and do not capture broader spatial dependencies like more advanced mechanisms such as CBAM [20], they are well-matched to our application. In soybean pod density estimation, model performance depends more on recognizing local and regional visual cues—such as the presence and clustering of pods—than on long-range dependencies across the image. In this context, the SE block provides meaningful performance improvements while keeping computational costs low. Given its simplicity, efficiency, and effectiveness, the SE block is a practical and reliable choice—especially when minimal computational overhead is a priority. Its balance of lightweight design and accuracy enhancement makes it particularly suitable for real-time applications and edge device deployment.
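A minimal Keras sketch of an SE block is given below, using the default reduction ratio of 16 from Table 5. The exact insertion points within MobileNet (here, applied to the output of a convolutional stage) follow the description above but are an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x: tf.Tensor, reduction: int = 16) -> tf.Tensor:
    """Squeeze-and-Excitation: channel-wise attention at negligible cost."""
    channels = x.shape[-1]
    # Squeeze: collapse each HxW feature map to one descriptor per channel.
    s = layers.GlobalAveragePooling2D()(x)
    # Excitation: a bottleneck MLP learns per-channel importance weights.
    s = layers.Dense(max(channels // reduction, 1), activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    # Scale: reweight the original feature maps channel by channel.
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])
```

Because the block only adds two small dense layers per insertion, its parameter and memory cost stays in the few-percent range reported in Table 5.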
Figure 4 presents the training and validation loss and accuracy curves for two AI models, which are based on the same MobileNet_alpha0p25 architecture. One model includes SE blocks, while the other does not. Both models show effective learning over 200 epochs, with steadily increasing accuracy. The model with SE blocks performs better overall, showing higher validation accuracy and smoother convergence. This suggests that the inclusion of SE blocks helps the model generalize more effectively and extract more relevant features for the soybean pod estimation task. The SE-enhanced model also reaches stable performance more quickly, indicating better convergence behavior.

3.6. Simplified Baseline AI Models with Weight Quantization

In this subsection, we present the performance of our AI models after applying weight quantization, a technique that further reduces model size and speeds up inference [35]. We used TensorFlow Lite [36] to carry out post-training quantization, as it offers reliable support for a range of platforms, including mobile and edge devices. Then, we used Edge Impulse [16] to evaluate the performance of our simplified and quantized AI models. Edge Impulse is a leading development platform for embedded machine learning, enabling developers to design, train, and deploy AI models directly on edge devices. It supports a wide range of hardware platforms—including Raspberry Pi and Arduino—and provides on-device performance evaluation.
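A representative conversion script is sketched below, assuming a trained Keras model held in a variable named model; the output file name and the commented-out calibration generator are placeholders.

```python
import tensorflow as tf

# Post-training quantization with TensorFlow Lite (dynamic-range by default).
converter = tf.lite.TFLiteConverter.from_keras_model(model)  # trained Keras model
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# For full integer quantization, a representative dataset is also required,
# e.g. a generator yielding ~100 sample batches (hypothetical `calib_ds`):
# def representative_data():
#     for images, _ in calib_ds.take(100):
#         yield [tf.cast(images, tf.float32)]
# converter.representative_dataset = representative_data

tflite_model = converter.convert()
with open("mobilenet_alpha0p25_se.tflite", "wb") as f:
    f.write(tflite_model)
```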
Table 6 shows the performance of various AI models deployed on Raspberry Pi 4 and Raspberry Pi 5. These models are based on different configurations of the MobileNet architecture, differing in width multiplier (alpha value) and in whether SE blocks (with their best-performing reduction parameter) are included. All models underwent post-training weight quantization to minimize memory size. The footprints of these compact models range from 0.27 MB (MobileNet_alpha0p25) to 1.91 MB (MobileNet_alpha0p75). From Table 6, we can see that the inclusion of SE blocks significantly improves the classification accuracy while introducing minimal memory overhead. For example, MobileNet_alpha0p25 achieves 84.0% accuracy with self-attention compared to 82.87% without it, at a memory overhead of only 5 KB. These results indicate that lightweight models with self-attention can offer a favorable balance between performance and efficiency for edge computing.
As shown in Table 6, the inference speed on Raspberry Pi 5 is much faster than on Raspberry Pi 4. The smallest AI model runs at 22.73-25 frames per second on the Raspberry Pi 5, whereas the same model runs at only 0.84-0.89 frames per second on Raspberry Pi 4. As expected, both classification accuracy and inference time increase with model size and complexity. These results indicate that Raspberry Pi 5 is a good choice among edge devices to support this classification task at an affordable cost of less than USD 100. In agricultural settings, an inference speed of around 25 frames per second is generally sufficient for real-time processing: many tasks, such as crop monitoring and yield estimation, rely on periodic image capture from drones at intervals of seconds, so a Raspberry Pi 5 processing about 25 frames per second is fast enough for real-time soybean pod number estimation.
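On the Raspberry Pi itself, per-frame latency can be measured with the lightweight tflite-runtime interpreter, as in the sketch below. The model file name is a placeholder, and in the field the input frame would come from the Pi camera rather than a zero-filled tensor.

```python
import time
import numpy as np
import tflite_runtime.interpreter as tflite  # pip install tflite-runtime

interpreter = tflite.Interpreter(model_path="mobilenet_alpha0p25_se.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Dummy frame with the model's expected shape and dtype (camera frame in practice).
frame = np.zeros(inp["shape"], dtype=inp["dtype"])

start = time.time()
interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()
probs = interpreter.get_tensor(out["index"])[0]
elapsed = time.time() - start
print(f"class {int(np.argmax(probs))}, {1.0 / elapsed:.1f} frames per second")
```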
Figure 5 clearly highlights how AI model size and the use of self-attention layers affect classification accuracy and inference time on Raspberry Pi 4 and 5 devices. The left graph shows that accuracy consistently improves with larger model sizes, and the inclusion of SE blocks enhances performance across all model sizes. For example, the smallest model (around 0.3 MB) achieves 82.87% accuracy without self-attention and 84% with it. The middle and right graphs display inference times for the two Raspberry Pi devices, where the inclusion of self-attention slightly increases inference time. Overall, Figure 5 demonstrates that integrating self-attention offers a favorable trade-off between improved accuracy and computational cost.

4. Results and Discussion

Figure 6 presents the confusion matrix for the classification performance of the MobileNet_alpha0p25 model with SE blocks. We can see that most predictions closely match the true labels, as shown by the strong diagonal values. This model performs well on classes 1, 5, 6, 7, and 8, each of which achieves at least 88% prediction accuracy. This indicates that the model is generally reliable in distinguishing between these eight classes, thanks in part to the added self-attention mechanism.
Figure 7 compares the classification accuracy and model size of existing baseline AI models with three proposed models: MobileNet_alpha0p25, MobileNet_alpha0p5, and MobileNet_alpha0p75, all enhanced with SE blocks. The x-axis (logarithmic scale) shows the model size in megabytes, while the y-axis represents the classification accuracy. Compared to the best-performing baseline model, MobileNetV2, our MobileNet_alpha0p75 model with SE blocks reduces memory usage by a factor of 9, with only a slight (0.15%) drop in accuracy. Similarly, the MobileNet_alpha0p25 version achieves a 65× reduction in model size, with a modest 3.36% decrease in accuracy. These results show that our models maintain competitive performance while significantly reducing memory requirements by one to two orders of magnitude. This balance makes them well-suited for deployment in edge computing environments where both accuracy and efficiency are critical.
Figure 8 illustrates a comparative analysis of soybean pod number estimation models in terms of accuracy versus the number of trainable parameters on a logarithmic scale. The plot includes the main existing studies: YOLO POD [18], PodNet [23], GenPoD [26], and SoybeanNet [7]. While SoybeanNet and YOLO POD achieve an accuracy of about 84%, they rely on larger architectures with parameter counts of more than 20 million. In contrast, our proposed lightweight models achieve competitive accuracy (84.0% to 86.76%) using fewer than 2 million parameters. This clearly demonstrates the advantages of our models in terms of high accuracy and small memory size.
While this study demonstrates the practicality of deploying lightweight AI models for soybean pod number estimation, several limitations should be noted. First, the dataset was sourced from a single publicly available UAV image collection, which may restrict the generalizability of the models to other geographic locations or crop conditions. Second, although the models performed well under the evaluated field scenarios, they may be sensitive to environmental variations—such as differing soil textures, lighting conditions, or crop phenotypes—that were not captured in the dataset. Third, given the constraints of deploying on resource-limited hardware, this study did not explore alternative attention mechanisms, such as the Convolutional Block Attention Module (CBAM) or Efficient Channel Attention (ECA), which may lead to further performance enhancements. These three limitations will be addressed in future work.

5. Conclusions

This study introduces a set of lightweight and efficient AI models designed for estimating soybean pod numbers in real-world field scenarios. By integrating MobileNet architectures with Squeeze-and-Excitation (SE) blocks and applying post-training weight quantization, we significantly reduce model size while maintaining competitive accuracy. Our most compact model achieves a 65× reduction in size with only a 3.36% drop in accuracy, making it suitable for deployment on resource-constrained devices like Raspberry Pi 4 and 5; another proposed model achieves a 9× reduction in size with only a 0.15% drop in accuracy. We evaluated these models on Raspberry Pi 5, where they reach inference speeds of 4.5-25 frames per second. SE blocks enhance model generalization in challenging field environments. Compared to existing models like YOLO POD and SoybeanNet, which require over 20 million parameters for ~84% accuracy, our proposed lightweight models achieve comparable or higher accuracy (84.0-86.76%) with fewer than 2 million parameters, highlighting their superior efficiency. Our results indicate that careful model simplification, attention integration, and quantization can yield compact AI models that are affordable and practical for farmers, especially in rural areas with limited infrastructure.
In future work, we plan to expand the dataset by incorporating soybean images from different geographic regions, crop varieties, and environmental conditions to enhance model generalizability. Additionally, we aim to explore more advanced attention mechanisms—such as CBAM or ECA—to further improve feature extraction and model performance. Finally, we aim to implement the complete system in edge devices and conduct real-world testing in soybean fields.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

These data were derived from the following resources available in the public domain at https://www.kaggle.com/datasets/jiajiali/uav-based-soybean-pod-images (accessed on 1 May 2025).

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Medic, J.; Atkinson, C.; Hurburgh, C.R. Current knowledge in soybean composition. J. Am. Oil Chem. Soc. 2014, 91, 363–384.
2. He, H.; Ma, X.; Guan, H.; Wang, F.; Shen, P. Recognition of soybean pods and yield prediction based on improved deep learning model. Front. Plant Sci. 2023, 13, 1096619.
3. Zhang, C.; Kovacs, J.M. The application of small unmanned aerial systems for precision agriculture: A review. Precis. Agric. 2012, 13, 693–712.
4. Hu, C.; Sapkota, B.B.; Thomasson, J.A.; Bagavathiannan, M.V. Influence of image quality and light consistency on the performance of convolutional neural networks for weed mapping. Remote Sens. 2021, 13, 2140.
5. Gao, J.; French, A.P.; Pound, M.P.; He, Y.; Pridmore, T.P.; Pieters, J.G. Deep convolutional neural networks for image-based Convolvulus sepium detection in sugar beet fields. Plant Methods 2020, 16, 29.
6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017.
7. Li, J.; Magar, R.T.; Chen, D.; Lin, F.; Wang, D.; Yin, X.; Zhuang, W.; Li, Z. SoybeanNet: Transformer-based convolutional neural network for soybean pod counting from Unmanned Aerial Vehicle (UAV) images. Comput. Electron. Agric. 2024, 220, 108861.
8. Sarkar, S.; Zhou, J.; Scaboo, A.; Zhou, J.; Aloysius, N.; Lim, T.T. Assessment of soybean lodging using UAV imagery and machine learning. Plants 2023, 12, 2893.
9. Zhou, L.; Zhang, Y.; Chen, H.; Sun, G.; Wang, L.; Li, M.; Sun, X.; Feng, P.; Yan, L.; Qiu, L.; et al. Soybean yield estimation and lodging classification based on UAV multi-source data and self-supervised contrastive learning. Comput. Electron. Agric. 2025, 230, 109822.
10. Zhao, J.; Kaga, A.; Yamada, T.; Komatsu, K.; Hirata, K.; Kikuchi, A.; Hirafuji, M.; Ninomiya, S.; Guo, W. Improved field-based soybean seed counting and localization with feature level considered. Plant Phenomics 2023, 5, 0026.
11. Adedeji, O.; Abdalla, A.; Ghimire, B.; Ritchie, G.; Guo, W. Flight Altitude and Sensor Angle Affect Unmanned Aerial System Cotton Plant Height Assessments. Drones 2024, 8, 746.
12. Sonmez, D.; Cetin, A. An End-to-End Deployment Workflow for AI Enabled Agriculture Applications at the Edge. In Proceedings of the 2024 6th International Conference on Computing and Informatics (ICCI), New Cairo, Cairo, Egypt, 6–7 March 2024; pp. 506–511.
13. Zhang, X.; Cao, Z.; Dong, W. Overview of edge computing in the agricultural internet of things: Key technologies, applications, challenges. IEEE Access 2020, 8, 141748–141761.
14. Joshi, H. Edge-AI for Agriculture: Lightweight Vision Models for Disease Detection in Resource-Limited Settings. arXiv 2024, arXiv:2412.18635.
15. Lv, Z.; Yang, S.; Ma, S.; Wang, Q.; Sun, J.; Du, L.; Han, J.; Guo, Y.; Zhang, H. Efficient Deployment of Peanut Leaf Disease Detection Models on Edge AI Devices. Agriculture 2025, 15, 332.
16. Edge Impulse. Available online: https://edgeimpulse.com/ (accessed on 1 May 2025).
17. Li, Y.; Jia, J.; Zhang, L.; Khattak, A.M.; Sun, S.; Gao, W.; Wang, M. Soybean seed counting based on pod image using two-column convolution neural network. IEEE Access 2019, 7, 64177–64185.
18. Xiang, S.; Wang, S.; Xu, M.; Wang, W.; Liu, W. YOLO POD: A fast and accurate multi-task model for dense Soybean Pod counting. Plant Methods 2023, 19, 8.
19. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
20. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
21. Li, X.; Zhuang, Y.; Li, J.; Zhang, Y.; Wang, Z.; Zhao, J.; Li, D.; Gao, Y. SPCN: An Innovative Soybean Pod Counting Network Based on HDC Strategy and Attention Mechanism. Agriculture 2024, 14, 1347.
22. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
23. Yu, Z.; Wang, Y.; Ye, J.; Liufu, S.; Lu, D.; Zhu, X.; Yang, Z.; Tan, Q. Accurate and fast implementation of soybean pod counting and localization from high-resolution image. Front. Plant Sci. 2024, 15, 1320109.
24. Douillard, A.; Cord, M.; Ollion, C.; Robert, T.; Valle, E. Podnet: Pooled outputs distillation for small-tasks incremental learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XX 16; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 86–102.
25. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
26. Wu, K.; Wang, T.; Rao, Y.; Jin, X.; Wang, X.; Li, J.; Zhang, Z.; Jiang, Z.; Shao, X.; Zhang, W. Practical framework for generative on-branch soybean pod detection in occlusion and class imbalance scenes. Eng. Appl. Artif. Intell. 2025, 139, 109613.
27. Combine Harvester. Available online: https://www.deere.com/assets/pdfs/common/qrg/x9-rth-soybeans.pdf (accessed on 1 May 2025).
28. Huang, Q. Towards indoor suctionable object classification and recycling: Developing a lightweight AI model for robot vacuum cleaners. Appl. Sci. 2023, 13, 10031.
29. Tang, Z.; Luo, L.; Xie, B.; Zhu, Y.; Zhao, R.; Bi, L.; Lu, C. Automatic sparse connectivity learning for neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 7350–7364.
30. Yang, Z.; Wang, Y.; Han, K.; Xu, C.; Xu, C.; Tao, D.; Xu, C. Searching for low-bit weights in quantized neural networks. Adv. Neural Inf. Process. Syst. 2020, 33, 4091–4102.
31. Chenna, D. Edge AI: Quantization as the key to on-device smartness. J. ID 2023, 4867, 9994.
32. Kulkarni, U.; Meena, S.M.; Gurlahosur, S.V.; Benagi, P.; Kashyap, A.; Ansari, A.; Karnam, V. AI model compression for edge devices using optimization techniques. In Modern Approaches in Machine Learning and Cognitive Science: A Walkthrough: Latest Trends in AI, Volume 2; Springer International Publishing: Cham, Switzerland, 2021; pp. 227–240.
33. Soybean Pod Images from UAVs. Available online: https://www.kaggle.com/datasets/jiajiali/uav-based-soybean-pod-images (accessed on 1 May 2025).
34. Xie, W.; Zhao, M.; Liu, Y.; Yang, D.; Huang, K.; Fan, C.; Wang, Z. Recent advances in Transformer technology for agriculture: A comprehensive survey. Eng. Appl. Artif. Intell. 2024, 138, 109412.
35. Huang, Q.; Tang, Z. High-performance and lightweight ai model for robot vacuum cleaners with low bitwidth strong non-uniform quantization. AI 2023, 4, 531–550.
36. TensorFlow Lite. Available online: https://ai.google.dev/edge/litert (accessed on 1 May 2025).
Figure 1. Randomly selected samples from our dataset, each with dimensions of 300 × 300 pixels.
Figure 2. Summary of classification accuracy versus baseline AI model size.
Figure 3. MobileNet and MobileNetV2 architectures: model size and accuracy versus alpha values.
Figure 4. Training and validation loss and accuracy curves for two AI models. (a) MobileNet_alpha0p25 without SE blocks. (b) MobileNet_alpha0p25 with SE blocks.
Figure 5. Performance comparison of quantized MobileNet AI models with and without SE blocks.
Figure 6. Confusion matrix of the MobileNet_alpha0p25 model with self-attention.
Figure 7. Performance comparison of test classification accuracy versus AI model size between this work and existing baseline AI models.
Figure 8. Performance comparison of test accuracy versus AI model size between this work and existing state-of-the-art AI models in the literature.
Table 1. Summary of existing AI baseline models for soybean pod number estimation.

| AI Model Architecture | Number of Parameters (Unit: Million) | Accuracy (Unit: %) | Dataset | Inference Speed (Unit: Frames per Second) |
|---|---|---|---|---|
| Two-column CNN [17] | N/A | N/A | With a black cloth background | N/A |
| YOLO POD [18] | 78.6 | 83.9 | With a black cloth background | 2.16 on GeForce 2080 Ti GPU |
| SPCN [21] | >144 | N/A | With a black cloth background | N/A |
| PodNet [23] | 2.48 | 82.8 | With a black cloth background | 43.48 on GTX 1080 Ti GPU |
| GenPoD [26] | >6.2 | 81.1 | With a black cloth background | N/A |
| P2PNet-Soy [10] | >138 | N/A | With in-field background | N/A |
| SoybeanNet [7] | 29–88 | 84.51 | With in-field background | N/A |
Table 2. Statistics of the dataset created in this study consisting of 8 categories.

| Dataset | Total | #1 (<40) | #2 (41–80) | #3 (81–120) | #4 (121–160) | #5 (161–200) | #6 (201–240) | #7 (241–280) | #8 (>281) |
|---|---|---|---|---|---|---|---|---|---|
| Training | 23,662 | 3027 | 2928 | 2972 | 2984 | 2958 | 2916 | 2937 | 2940 |
| Validation | 5068 | 648 | 627 | 636 | 639 | 634 | 625 | 629 | 630 |
| Testing | 5079 | 650 | 629 | 638 | 641 | 635 | 626 | 630 | 630 |
Table 3. Summary of existing baseline AI models for soybean pod number estimation using our constructed dataset.

| AI Model Architecture | Number of Trainable Parameters | Memory Footprint (Unit: MB) | Accuracy (Unit: %) |
|---|---|---|---|
| ShuffleNet | 1,482,368 | 11.45 | 77.69 |
| MobileNetV2 | 2,268,232 | 17.80 | 86.91 |
| MobileNet | 3,237,064 | 24.95 | 85.96 |
| EfficientNetB0 | 4,059,819 | 31.65 | 85.69 |
| NasNetMobile | 4,278,172 | 35.26 | 86.14 |
| DenseNet121 | 7,045,704 | 54.95 | 87.14 |
| Xception | 20,877,872 | 159.61 | 87.01 |
| InceptionV3 | 21,819,176 | 167.44 | 86.28 |
| ResNet50V2 | 23,581,192 | 180.41 | 86.73 |
| ResNet50 | 23,604,104 | 180.58 | 87.1 |
Table 4. Model simplification of two selected top candidate AI models.

| AI Model Architecture | Alpha | Number of Trainable Parameters | Memory Footprint (Unit: MB) | Accuracy (Unit: %) |
|---|---|---|---|---|
| MobileNet | 0.25 | 220,600 | 2.00 | 82.97 |
| MobileNet | 0.5 | 833,640 | 6.66 | 85.67 |
| MobileNet | 0.75 | 1,839,128 | 14.31 | 86.83 |
| MobileNet | 1 | 3,237,064 | 24.95 | 85.96 |
| MobileNetV2 | 0.25 | 259,016 | 2.56 | 79.27 |
| MobileNetV2 | 0.5 | 716,472 | 6.02 | 85.00 |
| MobileNetV2 | 0.75 | 1,292,312 | 11.15 | 86.02 |
| MobileNetV2 | 1 | 2,268,232 | 17.80 | 86.91 |
Table 5. Characteristics of SE blocks (with a default reduction parameter of 16) when added to MobileNet architectures.

| AI Model Architecture | Alpha | Number of Trainable Parameters | Memory Footprint (Unit: MB) | Additional Memory Due to SE Blocks (Unit: %) |
|---|---|---|---|---|
| MobileNet | 0.25 | 228,792 | 2.07 | 3.5 |
| MobileNet | 0.5 | 866,408 | 6.91 | 3.8 |
| MobileNet | 0.75 | 1,912,856 | 14.88 | 4.0 |
Table 6. Performance evaluation of quantized AI models on Raspberry Pi edge devices.

| AI Model Architecture | SE Blocks | Memory Footprint (Unit: MB) | Classification Accuracy (Unit: %) | Inference Speed on Raspberry Pi 4 (Unit: Frames per Second) | Inference Speed on Raspberry Pi 5 (Unit: Frames per Second) |
|---|---|---|---|---|---|
| MobileNet_alpha0p25 | No | 0.27 | 82.87 | 0.89 | 25 |
| MobileNet_alpha0p25 | Yes | 0.275 | 84.0 | 0.84 | 22.73 |
| MobileNet_alpha0p5 | No | 0.9 | 85.44 | 0.40 | 7.52 |
| MobileNet_alpha0p5 | Yes | 0.92 | 85.56 | 0.40 | 7.25 |
| MobileNet_alpha0p75 | No | 1.87 | 86.04 | 0.27 | 4.81 |
| MobileNet_alpha0p75 | Yes | 1.91 | 86.76 | 0.26 | 4.50 |