1. Introduction
With the significant increase in computer processing power, modern image recognition technology has developed rapidly. Among such technologies, the YOLO [1] model is well known for its fast and efficient object recognition capabilities and has become a commonly used model among researchers. Increases in computational speed and advances in AI techniques have made image recognition more widespread and easier to implement than before. New models and technical tools continue to emerge, allowing even simple tools to perform efficient and accurate image recognition. Applications of the technology have permeated many aspects of daily life, from facial recognition on smartphones to thermal imaging for temperature detection, significantly enhancing the safety and convenience of everyday life. In particular, facial recognition on smartphones improves the convenience and security of authentication and unlocking, and during the pandemic, thermal imaging was used for temperature monitoring to detect abnormal temperatures and protect public health. Different images have distinct features, so it is important to find an appropriate image recognition system to effectively identify different images.
In this study, we explored the performance of the YOLO image recognition system. YOLO has strong object recognition capabilities in various scenarios. However, there are differences among its versions, such as YOLOv3 [2] and YOLOv4 [3], in terms of model structure, computational speed, accuracy, and adaptability, and each version has its specific applicability and issues. We systematically compared different versions of YOLO to identify the most appropriate version for different object recognition tasks, providing a reference for developing and optimizing image recognition techniques and technical recommendations for different uses of YOLO.
2. Literature Review
2.1. YOLO
The YOLO algorithm, proposed by Joseph Redmon and his team in 2016, is a real-time object detection system known for its speed and efficiency. It enables object detection by classifying and bounding objects in an image to identify the objects within the bounding boxes. Unlike traditional methods that perform detection in multiple stages, YOLO processes an image in a single pass, which is appropriate for time-sensitive applications such as autonomous vehicles, surveillance, and robotics.
The input image is divided into an A × A grid, and each grid cell predicts B bounding boxes; the values of A and B vary across versions. Each bounding box is described by five values (x, y, w, h, confidence), where (x, y) is the object’s center position within the cell, (w, h) is the object’s width and height relative to the image, and the confidence score reflects how certain the model is that the box contains an object.
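As a minimal sketch of this encoding (using the original YOLOv1-style convention; the grid size, number of boxes per cell, and image dimensions below are illustrative values rather than settings from our experiments), the following Python snippet converts one cell’s prediction into pixel coordinates:

```python
# Illustrative settings: an A x A grid where each cell predicts B bounding boxes.
# The actual values differ between YOLO versions (e.g., a 7 x 7 grid with B = 2 in YOLOv1).
A, B = 7, 2
IMG_W, IMG_H = 640, 640

def decode_cell(pred, row, col):
    """Convert one cell's (x, y, w, h, confidence) values to pixel coordinates.

    x and y are the box-center offsets within the cell (0..1);
    w and h are the box size relative to the whole image (0..1).
    """
    x, y, w, h, conf = pred
    cx = (col + x) / A * IMG_W          # absolute center x in pixels
    cy = (row + y) / A * IMG_H          # absolute center y in pixels
    bw, bh = w * IMG_W, h * IMG_H       # absolute width and height in pixels
    return cx, cy, bw, bh, conf

# Example: a box centered in grid cell (row 3, column 4) covering
# about 20% x 30% of the image, predicted with confidence 0.9.
print(decode_cell((0.5, 0.5, 0.2, 0.3, 0.9), row=3, col=4))
```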
YOLOv3 is the third iteration of the YOLO object detection algorithm. It introduces significant improvements, such as multi-scale predictions and the Darknet-53 backbone, which enhance accuracy without sacrificing real-time performance. This version is particularly effective for detecting small and overlapping objects, making it a versatile choice for various applications. YOLOv4 builds upon the strengths of its predecessors while addressing their limitations. By incorporating the CSPDarknet53 backbone [4] together with training optimizations (Bag of Freebies) and detection improvements (Bag of Specials), YOLOv4 achieves state-of-the-art accuracy on benchmark datasets such as Microsoft Common Objects in Context (MS COCO) [5]. It is also fast and scalable across different hardware configurations, making it well suited to real-world deployments such as traffic monitoring and surveillance.
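To make the multi-scale prediction structure concrete, the short sketch below lists the three output grids YOLOv3 produces. The 640 × 640 network input is an assumption chosen to match the test image size used later in this study, and the 80-class setting follows MS COCO; the structure is the same at other input resolutions.

```python
# Sketch of YOLOv3's three detection scales, assuming a 640 x 640 network input
# and the 80 MS COCO classes. Each scale uses 3 anchor boxes, and every anchor
# predicts 4 box coordinates + 1 objectness score + 80 class scores = 85 values.
INPUT_SIZE = 640
NUM_CLASSES = 80
ANCHORS_PER_SCALE = 3
STRIDES = (32, 16, 8)  # coarse grid for large objects ... fine grid for small objects

for stride in STRIDES:
    grid = INPUT_SIZE // stride
    channels = ANCHORS_PER_SCALE * (5 + NUM_CLASSES)
    boxes = grid * grid * ANCHORS_PER_SCALE
    print(f"stride {stride:2d}: {grid:3d} x {grid:<3d} grid, {channels} channels, {boxes} candidate boxes")
```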
2.2. YOLO-Related Research
Scholars often use YOLOv3 and YOLOv4 as comparison models to evaluate their performance across various application domains. For instance, in animal recognition [6], mineral classification [7], and inspection systems [8], the differences in accuracy and efficiency between these two models have been discussed to highlight technological advancements and applicability. Through such comparisons, researchers identify which version is more appropriate for recognizing specific categories, thereby providing optimal solutions for practical applications.
3. Proposed Method
3.1. Experimental Procedure
In our experiment, we first collected the dataset to ensure sufficient and relevant data for the analysis. We then set up YOLOv3 and YOLOv4, configuring the necessary libraries, dependencies, and parameters to optimize the training and testing processes. Once the setup was complete, the experiments were conducted to evaluate the performance of both models on the collected dataset. Finally, the results were analyzed to assess the effectiveness of the methods, their strengths, and potential areas for improvement.
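As a hedged illustration of such a setup, the sketch below runs a pretrained YOLOv4 Darknet model through OpenCV’s DNN module; the configuration and weight file names, the confidence threshold, and the test image path are assumptions for illustration, not the exact configuration used in this study.

```python
import cv2
import numpy as np

# File names are assumptions; the official Darknet releases provide
# yolov3.cfg / yolov3.weights and yolov4.cfg / yolov4.weights.
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")

image = cv2.imread("test_image.jpg")
# Build a 640 x 640 input blob to match the test images used in this study;
# pixel values are scaled to 0..1 and BGR is swapped to RGB as Darknet expects.
blob = cv2.dnn.blobFromImage(image, scalefactor=1 / 255.0, size=(640, 640),
                             swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

# Each output row is (x, y, w, h, objectness, class scores ...);
# keep detections whose best class score exceeds an assumed 0.5 threshold.
for out in outputs:
    for det in out:
        class_scores = det[5:]
        class_id = int(np.argmax(class_scores))
        if class_scores[class_id] > 0.5:
            print(class_id, float(class_scores[class_id]))
```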
3.2. Dataset
3.2.1. Microsoft COCO
We used MS COCO as the training dataset. MS COCO is a large-scale image dataset that is widely used for research and development in computer vision (a brief sketch of how its annotations can be loaded is given after the list below). It has the following characteristics:
Diverse data: It contains images from everyday scenes, covering 80 object categories ranging from humans to animals, furniture, tools, and more, reflecting real-world contexts.
Rich annotations:
Object detection: Various bounding boxes and polygon annotations.
Image segmentation: Precise segmentation masks for each object.
Image captioning: textual descriptions of the content of entire images.
Key-point detection: Key point annotation for human poses, such as joint positions.
Wide applications: It supports research in object detection, semantic segmentation, image captioning, and key-point detection, making it a popular dataset for training and evaluating deep learning models.
Challenging scenarios: It features complex scenes with multiple objects and occlusions, making it suitable for testing advanced computer vision models.
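As a brief illustration of these annotations, the sketch below loads the MS COCO instance annotations with the pycocotools COCO API and queries the three categories later used in testing; the annotation file path assumes the standard COCO 2017 release layout.

```python
from pycocotools.coco import COCO

# Assumed path: the standard COCO 2017 instance annotation file.
coco = COCO("annotations/instances_train2017.json")

# The three object categories later used as test subjects in this study.
cat_ids = coco.getCatIds(catNms=["car", "bicycle", "cat"])
for cat_id, cat in zip(cat_ids, coco.loadCats(cat_ids)):
    img_ids = coco.getImgIds(catIds=[cat_id])
    ann_ids = coco.getAnnIds(catIds=[cat_id])
    print(f"{cat['name']:>8s}: {len(img_ids)} images, {len(ann_ids)} annotated instances")

# Bounding boxes are stored as [x, y, width, height] in pixel coordinates.
example = coco.loadAnns(coco.getAnnIds(catIds=[cat_ids[0]]))[0]
print(example["bbox"], example["category_id"])
```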
3.2.2. Testing Dataset
The Pexels website (https://www.pexels.com/zh-tw/, accessed on 25 December 2024) [9] was used as the source of the testing dataset for the experiments. Pexels provides a wide variety of high-quality images to evaluate the performance of the proposed methods in diverse and realistic scenarios. The testing dataset consists of 100 images of objects commonly seen in daily life, including cats, cars, and bicycles, with each image resized to a resolution of 640 × 640 pixels.
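As a small preprocessing sketch (the directory names are placeholders, not the actual paths used in this study), the downloaded Pexels images can be resized to the uniform 640 × 640 resolution with Pillow:

```python
from pathlib import Path
from PIL import Image

# Placeholder directories for the raw and resized Pexels test images.
SRC_DIR, DST_DIR = Path("pexels_raw"), Path("pexels_640")
DST_DIR.mkdir(exist_ok=True)

for path in sorted(SRC_DIR.glob("*.jpg")):
    img = Image.open(path).convert("RGB")
    # Resize every test image to the uniform 640 x 640 model input size.
    img.resize((640, 640)).save(DST_DIR / path.name)
```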
3.3. Validation Index
When using machine learning for classification, it is essential to rely on statistical indicators to assist in accurately evaluating classification results, quickly identifying and addressing issues, and improving accuracy. The confusion matrix is divided into four quadrants: true positives, true negatives, false positives, and false negatives.
Table 1 lists the related information for the confusion matrix, which is described in detail as follows:
TP (true positive): the model predicts the positive class, and the prediction is correct. For example, a sick individual is diagnosed as having the disease.
TN (true negative): the model predicts the negative class, and the prediction is correct. For example, a healthy individual is diagnosed as not having the disease.
FP (false positive): the model predicts the positive class, but the prediction is incorrect; this is also known as a Type I error. For example, a healthy individual is incorrectly diagnosed as having the disease.
FN (false negative): the model predicts the negative class, but the prediction is incorrect; this is also known as a Type II error. For example, a sick individual is incorrectly diagnosed as not having the disease.
For the evaluation standards, we used accuracy [10], precision, recall, intersection over union (IoU), average precision (AP), and mean average precision (mAP) [11] to evaluate the performance of the proposed method. They are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

IoU = Area of Overlap / Area of Union

AP = ∫₀¹ p(r) dr, where p(r) is the precision at recall level r

mAP = (1/N) Σᵢ APᵢ

where N is the total number of categories.
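The minimal sketch below implements the IoU and AP definitions above for a single category; it uses a simple trapezoidal integration of the precision–recall curve rather than the exact interpolation protocol of any particular benchmark, and the toy detection flags at the end are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(tp_flags, num_gt):
    """AP for one category from detections sorted by descending confidence.

    tp_flags[i] is 1 if the i-th detection matches a ground-truth box
    (e.g., IoU >= 0.5) and 0 otherwise; num_gt is the number of ground truths.
    """
    tp = fp = 0
    points = []  # (recall, precision) after each detection
    for flag in tp_flags:
        tp += flag
        fp += 1 - flag
        points.append((tp / num_gt, tp / (tp + fp)))
    # Trapezoidal integration of the precision-recall curve.
    ap, (prev_r, prev_p) = 0.0, (0.0, 1.0)
    for r, p in points:
        ap += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = r, p
    return ap

# Toy example: 4 detections for one category, 3 ground-truth objects.
ap_car = average_precision([1, 1, 0, 1], num_gt=3)
print(round(ap_car, 4))
# mAP is simply the mean of the per-category AP values over all N categories.
```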
4. Results and Discussion
In this experiment, common objects from daily life, including cars, bicycles, and cats, were used as test subjects. We selected 100 images for each category and resized all of the images to a uniform size of 640 × 640 pixels before testing to ensure consistency in model input. Additionally, we precisely annotated each image, including bounding boxes and corresponding class labels for the target objects. This process facilitated model training and testing and enabled the calculation of AP values to evaluate the accuracy and performance of the model (Figure 1).
Table 2 lists the comparison results of YOLOv3 and YOLOv4. From Table 2, the experimental results reveal significant differences in AP values across categories. YOLOv4 performed better in the car and bicycle categories, with an AP value as high as 0.9601 for cars, whereas YOLOv3 outperformed YOLOv4 in the cat category. In testing, the selection of test images significantly affected the results; for instance, several images that YOLOv3 failed to recognize were successfully identified by YOLOv4. Therefore, when selecting future test datasets, careful attention must be paid to ensuring that the target objects are clearly visible. This approach minimizes errors during annotation and discrepancies between test results and annotations, thereby improving the overall accuracy and reliability of the experiment.
5. Conclusions
To ensure the continuous improvement and effectiveness of object detection models, we diversified and expanded the training and testing datasets. By significantly increasing the number of images and incorporating a wide range of object types, sizes, and environments, the model’s ability to generalize to real-world scenarios was significantly enhanced. This process involved curating datasets that included challenging cases, such as occluded objects, varying lighting conditions, and diverse angles, to simulate practical conditions more effectively.
We explored additional versions of YOLO, such as YOLOv5 and YOLOv7, to leverage advancements in their architectures. Each iteration of YOLO introduced refinements in speed, accuracy, and functionality, thus necessitating a systematic evaluation of their contributions. For instance, YOLOv5 is recognized for its user-friendly implementation and scalability, while YOLOv7 [12] offers enhancements that optimize detection in real-time applications. An extensive comparison across these versions provides data on their respective strengths and trade-offs.
In addition to internal evaluations, we analyzed how YOLO performs compared with other cutting-edge image recognition systems to determine whether alternative systems show higher precision in specific scenarios. Based on the metrics obtained in this study, such as mAP, inference speed, and computational efficiency, users can identify the most accurate and reliable solutions for their needs. This approach emphasizes practical decision-making over developing new systems.
The results of this study also guide users to select the most appropriate system or version for their specific purposes. By combining data augmentation, version evaluations, and external benchmarking, well-informed recommendations can be made to enhance precision, reliability, and practical applicability in object detection technologies.
Author Contributions
Conceptualization, Y.-S.C. and Y.-X.C.; methodology, Y.-S.C.; software, Y.-X.C.; validation, Y.-S.C.; formal analysis, Y.-S.C.; investigation, Y.-S.C.; resources, Y.-S.C. and Y.-X.C.; data curation, Y.-S.C.; writing—original draft preparation, Y.-S.C. and Y.-X.C.; writing—review and editing, Y.-S.C. and Y.-X.C.; visualization, Y.-S.C.; supervision, Y.-S.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Data are contained within the article.
Acknowledgments
The authors sincerely thank the National Science and Technology Council of Taiwan for the financial support provided through grant number 111-2221-E-167-036-MY2.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. 2018. Available online: http://arxiv.org/abs/1804.02767 (accessed on 25 December 2024).
- Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Wang, J.; Chen, K.; Yang, S.; Loy, C.C.; Lin, D. Region Proposal by Guided Anchoring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2965–2974. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2014; Volume 8693, pp. 740–755. [Google Scholar] [CrossRef]
- Ibraheam, M.; Li, K.F.; Gebali, F. An Accurate and Fast Animal Species Detection System for Embedded Devices. IEEE Access 2023, 11, 23462–23473. [Google Scholar] [CrossRef]
- Liu, Q.; Li, J.; Li, Y.; Gao, M. Recognition Methods for Coal and Coal Gangue Based on Deep Learning. IEEE Access 2021, 9, 77599–77610. [Google Scholar] [CrossRef]
- Chethan Kumar, B.; Punitha, R.; Mohana. YOLOv3 and YOLOv4: Multiple Object Detection for Surveillance Applications. In Proceedings of the 3rd International Conference on Smart Systems and Inventive Technology, ICSSIT 2020, Tirunelveli, India, 20–22 August 2020; pp. 1316–1321. [Google Scholar] [CrossRef]
- Pexels. Available online: https://www.pexels.com/zh-tw/ (accessed on 25 December 2024).
- Alireza, B.; Mostafa, H.; Ahmed, N.; Gehad, E.A. Part 1: Simple Definition and Calculation of Accuracy, Sensitivity, and Specificity. Emergency 2015, 3, 48–49. [Google Scholar]
- Henderson, P.; Ferrari, V. End-to-End Training of Object Class Detectors for Mean Average Precision. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2017; Volume 10115, pp. 198–213. [Google Scholar] [CrossRef]
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).