A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS
1. Introduction
2. YOLO Applications across Diverse Fields
3. Object Detection Metrics and Non-Maximum Suppression (NMS)
3.1. How AP Works?
3.2. Computing AP
3.2.1. VOC Dataset
- For each category, calculate the precision–recall curve by varying the confidence threshold of the model’s predictions.
- Calculate each category’s average precision (AP) using an interpolated 11-point sampling of the precision–recall curve.
- Compute the final average precision (AP) by taking the mean of the APs across all 20 categories.
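The VOC procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the official VOC evaluation code: the function names `eleven_point_ap` and `mean_ap` are hypothetical, and the toy precision–recall values are made up.

```python
def eleven_point_ap(recalls, precisions):
    """Interpolated 11-point AP for one category.

    recalls and precisions are parallel lists describing the
    precision-recall curve (one point per confidence threshold).
    """
    total = 0.0
    for i in range(11):
        r = i / 10  # recall levels 0.0, 0.1, ..., 1.0
        # Interpolated precision: best precision at any recall >= r.
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11


def mean_ap(per_class_ap):
    """Mean of the per-category APs (20 categories for VOC)."""
    return sum(per_class_ap) / len(per_class_ap)


# Toy curve with made-up numbers, not taken from any benchmark.
recalls = [0.2, 0.4, 0.6, 0.8]
precisions = [1.0, 0.8, 0.6, 0.5]
ap = eleven_point_ap(recalls, precisions)
```

The interpolation step takes, at each recall level, the best precision achieved at that recall or higher, which smooths out the zigzag pattern of the raw curve.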
3.2.2. Microsoft COCO Dataset
- For each category, calculate the precision–recall curve by varying the confidence threshold of the model’s predictions.
- Compute each category’s average precision (AP) using 101 recall thresholds.
- Calculate AP at different Intersection over Union (IoU) thresholds, typically from 0.5 to 0.95 with a step size of 0.05. A higher IoU threshold requires a more accurate prediction to be considered a true positive.
- For each IoU threshold, take the mean of the APs across all 80 categories.
- Finally, compute the overall AP by averaging the AP values calculated at each IoU threshold.
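As a rough sketch of the COCO procedure above, the following Python code computes the IoU between two boxes, a 101-point AP for one category, and the final average over IoU thresholds. It is an illustration under simplifying assumptions, not the official COCO evaluation code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the names `iou`, `ap_101`, and `coco_ap` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 and recall levels 0.00, 0.01, ..., 1.00.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]
RECALL_LEVELS = [i / 100 for i in range(101)]


def ap_101(recalls, precisions):
    """AP for one category at one IoU threshold, using 101 recall levels."""
    total = 0.0
    for r in RECALL_LEVELS:
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / len(RECALL_LEVELS)


def coco_ap(ap_per_threshold):
    """ap_per_threshold maps each IoU threshold to a list of per-category APs.

    First average over categories for each threshold, then over thresholds.
    """
    means = [sum(aps) / len(aps) for aps in ap_per_threshold.values()]
    return sum(means) / len(means)
```

A detection counts as a true positive at a given threshold only when its IoU with a ground-truth box meets that threshold, so the same set of predictions yields a different precision–recall curve for each of the ten thresholds.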
3.3. Non-Maximum Suppression (NMS)
Algorithm 1. Non-Maximum Suppression Algorithm
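A minimal Python sketch of greedy NMS is given below to make the idea concrete. It assumes boxes are `(x1, y1, x2, y2)` tuples; the helper `_iou`, the function name `nms`, and the default thresholds are illustrative choices rather than values prescribed by the text.

```python
def _iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.0):
    """Greedy NMS. Returns the indices of the boxes that are kept."""
    # 1. Discard low-confidence boxes and sort the rest by score.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)

    keep = []
    while candidates:
        # 2. Keep the highest-scoring remaining box.
        best = candidates.pop(0)
        keep.append(best)
        # 3. Drop every remaining box that overlaps it too much.
        candidates = [
            i for i in candidates
            if _iou(boxes[best], boxes[i]) < iou_threshold
        ]
    return keep
```

For example, with boxes `[(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]` and scores `[0.9, 0.8, 0.7]`, the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box does not overlap and is kept.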
4. YOLO: You Only Look Once
4.1. How Does YOLOv1 Work?
4.2. YOLOv1 Architecture
4.3. YOLOv1 Training
4.4. YOLOv1 Strengths and Limitations
- It could only detect at most two objects of the same class in the grid cell, limiting its ability to predict nearby objects.
- It struggled to predict objects with aspect ratios not seen in the training data.
- It learned from coarse object features due to the downsampling layers.
5. YOLOv2: Better, Faster, and Stronger
- Batch normalization on all convolutional layers improves convergence and acts as a regularizer to reduce overfitting.
- High-resolution classifier. Like YOLOv1, they pre-trained the model with ImageNet at 224 × 224. However, this time, they fine-tuned the model for ten epochs on ImageNet with a resolution of 448 × 448, improving the network performance on higher-resolution input.
- Fully convolutional. They removed the dense layers and used a fully convolutional architecture.
- Use anchor boxes to predict bounding boxes. They use a set of prior boxes or anchor boxes, which are boxes with predefined shapes used to match prototypical shapes of objects as shown in Figure 7. Multiple anchor boxes are defined for each grid cell, and the system predicts the coordinates and the class for every anchor box. The size of the network output is proportional to the number of anchor boxes per grid cell.
- Dimension clusters. Picking good prior boxes helps the network learn to predict more accurate bounding boxes. The authors ran k-means clustering on the training bounding boxes to find good priors. They selected five prior boxes, providing a good tradeoff between recall and model complexity.
- Direct location prediction. Unlike other methods that predicted offsets [3], YOLOv2 followed the same philosophy as YOLOv1 and predicted location coordinates relative to the grid cell. The network predicts five bounding boxes for each cell, each with five values $t_x$, $t_y$, $t_w$, $t_h$, and $t_o$, where $t_o$ is equivalent to the confidence score $P_c$ from YOLOv1, and the final bounding box coordinates are obtained as shown in Figure 8.
- Finer-grained features. YOLOv2, compared with YOLOv1, removed one pooling layer to obtain an output feature map or grid of 13 × 13 for input images of 416 × 416. YOLOv2 also uses a passthrough layer that takes the 26 × 26 × 512 feature map and reorganizes it by stacking adjacent features into different channels instead of losing them via spatial subsampling. This generates 13 × 13 × 2048 feature maps, which are concatenated in the channel dimension with the lower-resolution 13 × 13 × 1024 maps to obtain 13 × 13 × 3072 feature maps. See Table 2 for the architectural details.
- Multi-scale training. Since YOLOv2 does not use fully connected layers, the inputs can be of different sizes. To make YOLOv2 robust to different input sizes, the authors trained the model while randomly changing the input size, from 320 × 320 up to 608 × 608, every ten batches.
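The direct location prediction described above can be illustrated with a short Python sketch. It follows the decoding equations from the YOLOv2 paper, $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$, with confidence $\sigma(t_o)$, where $(c_x, c_y)$ is the grid-cell offset and $(p_w, p_h)$ is the prior box size. The function names and the choice to return values in grid units are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Decode one raw prediction into a box, in grid-cell units.

    t     : (t_x, t_y, t_w, t_h, t_o) raw network outputs
    cell  : (c_x, c_y) top-left corner of the grid cell
    prior : (p_w, p_h) width and height of the anchor (prior) box
    """
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x   # center stays inside its grid cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # prior size rescaled by the prediction
    b_h = p_h * math.exp(t_h)
    confidence = sigmoid(t_o)
    return b_x, b_y, b_w, b_h, confidence


def output_channels(num_anchors, num_classes):
    """Channels of the output map: each anchor predicts 5 values plus classes."""
    return num_anchors * (5 + num_classes)
```

With five anchors and the 20 VOC classes, `output_channels(5, 20)` gives 125, which illustrates why the size of the network output is proportional to the number of anchor boxes per grid cell.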
5.1. YOLOv2 Architecture
5.2. YOLO9000 Is a Stronger YOLOv2
6. YOLOv3
- Bounding box prediction. Like YOLOv2, the network predicts four coordinates for each bounding box, $t_x$, $t_y$, $t_w$, and $t_h$; however, this time, YOLOv3 predicts an objectness score for each bounding box using logistic regression. This score is 1 for the anchor box with the highest overlap with the ground truth and 0 for the rest of the anchor boxes. YOLOv3, as opposed to Faster R-CNN [3], assigns only one anchor box to each ground-truth object. Also, an anchor box that is not assigned to any ground-truth object incurs no localization or classification loss, only objectness (confidence) loss.
- Class Prediction. Instead of using a softmax for the classification, they used binary cross-entropy to train independent logistic classifiers and pose the problem as a multilabel classification. This change allows assigning multiple labels to the same box, which may occur on some complex datasets [56] with overlapping labels. For example, the same object can be a Person and a Man.
- New backbone. YOLOv3 features a larger feature extractor composed of 53 convolutional layers with residual connections. Section 6.1 describes the architecture in more detail.
- Spatial pyramid pooling (SPP). Although not mentioned in the paper, the authors also added to the backbone a modified SPP block [57] that concatenates multiple max pooling outputs without subsampling (stride = 1), each with a different kernel size k × k, where k = 1, 5, 9, 13, allowing a larger receptive field. This version is called YOLOv3-spp and was the best-performing version, improving the AP50 by 2.7%.
- Multi-scale Predictions. Similar to feature pyramid networks [58], YOLOv3 predicts three boxes at three different scales. Section 6.2 describes the multi-scale prediction mechanism in more detail.
- Bounding box priors. Like YOLOv2, the authors also use k-means to determine the bounding-box priors of anchor boxes. The difference is that in YOLOv2, they used a total of five prior boxes per cell, and in YOLOv3, they used three prior boxes for three different scales.
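To illustrate the class prediction change described above, the following sketch contrasts a softmax with independent logistic classifiers trained with binary cross-entropy. The logits and the three class names (Person, Man, Car) are invented for the example, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Softmax forces the class probabilities to compete and sum to 1."""
    shifted = [z - max(logits) for z in logits]  # for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent logistic classifiers."""
    loss = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        loss -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return loss / len(logits)


# Made-up logits for the classes (Person, Man, Car).
logits = [4.0, 3.5, -5.0]
independent = [sigmoid(z) for z in logits]  # both Person and Man can be high
competing = softmax(logits)                 # Person and Man must share mass
```

With independent sigmoids, Person and Man can both receive a probability above 0.9, whereas the softmax splits the probability mass between them.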
6.1. YOLOv3 Architecture
6.2. YOLOv3 Multi-Scale Predictions
6.3. YOLOv3 Results
7. Backbone, Neck, and Head
8. YOLOv4
- An Enhanced Architecture with Bag-of-Specials (BoS) Integration. The authors tried multiple architectures for the backbone, such as ResNeXt50 [68], EfficientNet-B3 [69], and Darknet-53. The best-performing architecture was a modification of Darknet-53 with cross-stage partial connections (CSPNet) [70] and the Mish activation function [66] as the backbone (see Figure 12). For the neck, they used the modified version of spatial pyramid pooling (SPP) [57] from YOLOv3-spp and multi-scale predictions as in YOLOv3, but with a modified version of path aggregation network (PANet) [71] instead of FPN as well as a modified spatial attention module (SAM) [72]. Finally, for the detection head, they used anchors, as in YOLOv3. Therefore, the model was called CSPDarknet53-PANet-SPP. The cross-stage partial connections (CSP) added to the Darknet-53 help reduce the computation of the model while keeping the same accuracy. The SPP block, as in YOLOv3-spp, increases the receptive field without affecting the inference speed. The modified version of PANet concatenates the features instead of adding them as in the original PANet paper.
- Integrating Bag of Freebies (BoF) for an Advanced Training Approach. Apart from the regular augmentations such as random brightness, contrast, scaling, cropping, flipping, and rotation, the authors implemented mosaic augmentation that combines four images into a single one, allowing the detection of objects outside their usual context and also reducing the need for a large mini-batch size for batch normalization. For regularization, they used DropBlock [73], which works as a replacement for Dropout [74] but for convolutional neural networks as well as class label smoothing [75,76]. For the detector, they added CIoU loss [77] and cross-mini-batch normalization (CmBN) for collecting statistics from the entire batch instead of from single mini-batches as in regular batch normalization [78].
- Self-adversarial Training (SAT). To make the model more robust to perturbations, an adversarial attack is performed on the input image to create a deception that the ground-truth object is not in the image but keeps the original label to detect the correct object.
- Hyperparameter Optimization with Genetic Algorithms. To find the optimal hyperparameters used for training, they used genetic algorithms on the first 10% of the training periods, along with a cosine annealing scheduler [79] to alter the learning rate during training. The scheduler starts by reducing the learning rate slowly, follows with a quick reduction halfway through the training process, and ends with a slight reduction.
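The slow, fast, slow behavior of the cosine annealing scheduler mentioned above can be seen in a small sketch. It uses the standard cosine annealing formula without warm restarts; the function name `cosine_lr` and the learning-rate values are illustrative assumptions, not the settings used by the YOLOv4 authors.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing: lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Illustrative schedule over 100 steps with a made-up peak learning rate.
schedule = [cosine_lr(step, 100, lr_max=0.01) for step in range(101)]
early_drop = schedule[0] - schedule[10]    # small change at the start
middle_drop = schedule[45] - schedule[55]  # largest change around the midpoint
late_drop = schedule[90] - schedule[100]   # small change at the end
```

Because the cosine curve is flat near its endpoints and steepest at its midpoint, the learning rate changes little during the first and last tenths of training and most rapidly around the halfway point.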
9. YOLOv5
YOLOv5 Architecture
10. Scaled-YOLOv4
11. YOLOR
12. YOLOX
- Anchor-free. Since YOLOv2, all subsequent YOLO versions were anchor-based detectors. YOLOX, inspired by anchor-free state-of-the-art object detectors such as CornerNet [92], CenterNet [93], and FCOS [94], returned to an anchor-free architecture, simplifying the training and decoding process. The anchor-free design increased the AP by 0.9 points relative to the YOLOv3 baseline.
- Multi positives. To compensate for the large imbalances produced by the lack of anchors, the authors use center sampling [94], in which the center area is assigned as positives. This approach increased AP by 2.1 points.
- Decoupled head. In [95,96], it was shown that there could be a misalignment between the classification confidence and localization accuracy. Due to this, YOLOX separates these two into two heads (as shown in Figure 14), one for classification tasks and the other for regression tasks, improving the AP by 1.1 points and speeding up the model convergence.
- Advanced label assignment. In [97], it was shown that the ground-truth label assignment could have ambiguities when the boxes of multiple objects overlap, and the assignment procedure was formulated as an Optimal Transport (OT) problem. YOLOX, inspired by this work, proposed a simplified version called simOTA. This change increased AP by 2.3 points.
- Strong augmentations. YOLOX uses MixUP [86] and Mosaic augmentations. The authors found that ImageNet pretraining was no longer beneficial after using these augmentations. The strong augmentations increased AP by 2.4 points.
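As a rough illustration of the multi-positives idea above, the sketch below marks the grid cell containing the center of a ground-truth box, together with its immediate neighbors, as positive locations. The 3 × 3 neighborhood (`radius=1`), the box format `(x1, y1, x2, y2)`, and the function name are assumptions made for this example and should not be read as the exact YOLOX implementation.

```python
def center_sampling_cells(gt_box, stride, grid_w, grid_h, radius=1):
    """Return the (row, col) grid cells treated as positives for one box.

    gt_box : (x1, y1, x2, y2) in image pixels
    stride : pixels per grid cell at this prediction scale
    radius : neighborhood half-size in cells (1 gives a 3 x 3 area)
    """
    x1, y1, x2, y2 = gt_box
    center_x = (x1 + x2) / 2
    center_y = (y1 + y2) / 2
    col = int(center_x // stride)
    row = int(center_y // stride)

    cells = []
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            # Skip neighbors that fall outside the feature map.
            if 0 <= r < grid_h and 0 <= c < grid_w:
                cells.append((r, c))
    return cells
```

Assigning several locations per object, rather than only the single center cell, gives the detector more positive samples per ground-truth box, which is the imbalance that this item aims to reduce.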
13. YOLOv6
- Label assignment using the Task alignment learning approach introduced in TOOD [101].
- A self-distillation strategy for the regression and classification tasks.
14. YOLOv7
- Extended efficient layer aggregation network (E-ELAN). ELAN [109] is a strategy that allows a deep model to learn and converge more efficiently by controlling the shortest longest gradient path. YOLOv7 proposed E-ELAN that works for models with unlimited stacked computational blocks. E-ELAN combines the features of different groups by shuffling and merging cardinality to enhance the network’s learning without destroying the original gradient path.
- Model scaling for concatenation-based models. Scaling generates models of different sizes by adjusting some model attributes. The architecture of YOLOv7 is a concatenation-based architecture in which standard scaling techniques, such as depth scaling, cause a ratio change between the input channel and the output channel of a transition layer, which, in turn, leads to a decrease in the hardware usage of the model. YOLOv7 proposed a new strategy for scaling concatenation-based models in which the depth and width of the block are scaled with the same factor to maintain the optimal structure of the model.
- Planned re-parameterized convolution. Like YOLOv6, the architecture of YOLOv7 is also inspired by re-parameterized convolutions (RepConv) [99]. However, they found that the identity connection in RepConv destroys the residual in ResNet [62] and the concatenation in DenseNet [110]. For this reason, they removed the identity connection and called it RepConvN.
- Coarse label assignment for auxiliary head and fine label assignment for the lead head. The lead head is responsible for the final output, while the auxiliary head assists with the training.
- Batch normalization in conv-bn-activation. This integrates the mean and variance of batch normalization into the bias and weight of the convolutional layer at the inference stage.
- Implicit knowledge inspired by YOLOR [90].
- Exponential moving average as the final inference model.
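The batch normalization folding mentioned in the list above can be checked numerically with a short sketch. It treats a convolution at a single spatial position as a dot product per output channel, which is enough to show the algebra; the function names and the numeric values are made up for the example.

```python
import math


def fuse_conv_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN statistics into conv weights and biases (one entry per output channel)."""
    fused_w, fused_b = [], []
    for w_c, b_c, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        fused_w.append([w * scale for w in w_c])
        fused_b.append((b_c - m) * scale + bt)
    return fused_w, fused_b


def conv_at_one_position(patch, weights, bias):
    """Convolution output at one location: a dot product per output channel."""
    return [sum(w * x for w, x in zip(w_c, patch)) + b_c
            for w_c, b_c in zip(weights, bias)]


def batch_norm(values, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization using the stored running statistics."""
    return [g * (y - m) / math.sqrt(v + eps) + bt
            for y, g, bt, m, v in zip(values, gamma, beta, mean, var)]


# Made-up parameters for two output channels and a two-element input patch.
weights, bias = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.2, -0.1], [0.9, 1.1]
patch = [0.7, -0.3]

separate = batch_norm(conv_at_one_position(patch, weights, bias), gamma, beta, mean, var)
fused_w, fused_b = fuse_conv_bn(weights, bias, gamma, beta, mean, var)
fused = conv_at_one_position(patch, fused_w, fused_b)
```

Both paths produce the same output, so the fused layer can replace the conv and BN pair at inference time and save one operation per layer.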
Comparison with YOLOv4 and YOLOR
15. DAMO-YOLO
- A neural architecture search (NAS). They used a method called MAE-NAS [112] developed by Alibaba to find an efficient architecture automatically.
- A small head. The authors found that a large neck and a small head yield better performance, and they only left one linear layer for classification and one for regression. They called this approach ZeroHead.
- AlignedOTA label assignment. Dynamic label assignment methods, such as OTA [97] and TOOD [101], have gained popularity due to their significant improvements over static methods. However, the misalignment between classification and regression remains a problem, partly because of the imbalance between classification and regression losses. To address this issue, their AlignedOTA method introduces focal loss [6] into the classification cost and uses the IoU of prediction and ground-truth box as the soft label, enabling the selection of aligned samples for each target and solving the problem from a global perspective.
- Knowledge distillation. Their proposed strategy consists of two stages: the teacher guiding the student in the first stage and the student fine-tuning independently in the second stage. Additionally, they incorporate two enhancements in the distillation approach: the Align Module, which adapts student features to the same resolution as the teacher’s, and Channel-wise Dynamic Temperature, which normalizes teacher and student features to reduce the impact of real value differences.
16. YOLOv8
YOLOv8 Architecture
17. PP-YOLO
- A ResNet50-vd backbone replacing the DarkNet-53 backbone with an architecture augmented with deformable convolutions [118] in the last stage and a distilled pre-trained model, which has a higher classification accuracy on ImageNet. This architecture is called ResNet50-vd-dcn.
- A larger batch size to improve training stability; they went from 64 to 192, along with an updated training schedule and learning rate.
- Maintained moving averages for the trained parameters, used instead of the final trained values.
- DropBlock is applied only to the FPN.
- An IoU loss is added in another branch along with the L1-loss for bounding-box regression.
- An IoU prediction branch is added to measure localization accuracy along with an IoU-aware loss. During inference, YOLOv3 multiplies the classification probability and objectness score to compute the final detection. PP-YOLO additionally multiplies by the predicted IoU to account for localization accuracy.
- A grid-sensitive approach, similar to YOLOv4, is used to improve the bounding-box center prediction at the grid boundary.
- Matrix NMS [119] is used, which can be run in parallel, making it faster than traditional NMS.
- CoordConv [120] is used for the convolution of the FPN and on the first convolution layer in the detection head. CoordConv allows the network to learn translational invariance, improving the detection localization.
- Spatial Pyramid Pooling is used only on the top feature map to increase the receptive field of the backbone.
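The moving averages of the trained parameters mentioned in the list above can be sketched as a simple exponential moving average (EMA) update. The parameter dictionary, the function name `ema_update`, and the decay value are assumptions made for this example, not the settings reported for PP-YOLO.

```python
def ema_update(ema_params, params, decay=0.9998):
    """Return updated EMA weights: ema = decay * ema + (1 - decay) * current."""
    return {
        name: decay * ema_params[name] + (1.0 - decay) * value
        for name, value in params.items()
    }


# Toy example with a single scalar parameter and a small decay for visibility.
ema = {"w": 0.0}
for current in [1.0, 1.0, 1.0]:
    ema = ema_update(ema, {"w": current}, decay=0.9)
```

The averaged weights change more smoothly than the raw weights after each optimizer step, and the averaged copy, rather than the final trained values, is the one used for evaluation.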
17.1. PP-YOLO Augmentations and Preprocessing
- Mixup Training [86] with a weight sampled from a Beta(α, β) distribution, where α = 1.5 and β = 1.5.
- Random Color Distortion.
- Random Expand.
- Random Crop and Random Flip with a probability of 0.5.
- RGB channel z-score normalization with a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225].
- Multiple image sizes evenly drawn from [320, 352, 384, 416, 448, 480, 512, 544, 576, 608].
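Two of the preprocessing steps above, mixup and per-channel z-score normalization, can be sketched with the Python standard library. The Beta parameters (1.5, 1.5) and the ImageNet channel statistics used below are assumptions of this example, as are the function names and the choice to scale pixels to [0, 1] before normalizing.

```python
import random

# Commonly used ImageNet channel statistics (assumed here), for pixels in [0, 1].
CHANNEL_MEAN = (0.485, 0.456, 0.406)
CHANNEL_STD = (0.229, 0.224, 0.225)


def sample_mixup_weight(alpha=1.5, beta=1.5, rng=random):
    """Draw the mixing weight from a Beta(alpha, beta) distribution."""
    return rng.betavariate(alpha, beta)


def mixup(pixels_a, pixels_b, lam):
    """Blend two images (flat lists of pixel values) with weight lam."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(pixels_a, pixels_b)]


def normalize_rgb(pixel, mean=CHANNEL_MEAN, std=CHANNEL_STD):
    """Z-score normalize one (R, G, B) pixel given in the 0-255 range."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(pixel, mean, std))


lam = sample_mixup_weight()
mixed = mixup([0.0, 0.0], [4.0, 8.0], 0.25)
```

In a detection setting, the ground-truth boxes of both images are kept for the blended image; that bookkeeping is omitted here to keep the sketch short.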
17.2. PP-YOLOv2
- Backbone changed from ResNet50 to ResNet101.
- Path aggregation network (PAN) instead of FPN, similar to YOLOv4.
- Mish activation function. Unlike YOLOv4 and YOLOv5, they only applied the Mish activation function in the detection neck to keep the backbone unchanged with the ReLU.
- Larger input sizes help to increase performance on small objects. They expanded the largest input size from 608 to 768 and reduced the batch size from 24 to 12 images per GPU. The input sizes are evenly drawn from [320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768].
- A modified IoU-aware branch. They modified the calculation of the IoU-aware loss to use a soft label format instead of a soft weight format.
17.3. PP-YOLOE
- New backbone and neck. Inspired by TreeNet [123], the authors modified the architecture of the backbone and neck with RepResBlocks, combining residual and dense connections.
- Task alignment learning (TAL). YOLOX was the first to bring up the problem of task misalignment, where the classification confidence and the location accuracy do not agree in all cases. To reduce this problem, PP-YOLOE implemented TAL as proposed in TOOD [101], which includes a dynamic label assignment combined with a task-alignment loss.
- Efficient task-aligned head (ET-head). Unlike YOLOX, where the classification and localization heads were decoupled, PP-YOLOE instead used a single head based on TOOD to improve speed and accuracy.
- Varifocal (VFL) and distribution focal loss (DFL). VFL [102] weights loss of positive samples using target score, giving higher weight to those with high IoU. This prioritizes high-quality samples during training. Similarly, both use IoU-aware classification score (IACS) as the target, allowing for joint learning of classification and localization quality, leading to consistency between training and inference. On the other hand, DFL [115] extends focal loss from discrete to continuous labels, enabling successful optimization of improved representations that combine quality estimation and class prediction. This allows for an accurate depiction of flexible distribution in real data, eliminating the risk of inconsistency.
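The asymmetric weighting of the varifocal loss described above can be sketched for a single prediction. The formula follows the definition in the VarifocalNet paper as recalled here: positives (target score q > 0) use a binary cross-entropy weighted by q, and negatives (q = 0) are down-weighted by α·p^γ. The default values α = 0.75 and γ = 2.0 and the function name are assumptions of this example.

```python
import math


def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-9):
    """Varifocal loss for one predicted score p and one target score q.

    p : predicted IoU-aware classification score in (0, 1)
    q : target score (IoU with the ground truth for positives, 0 for negatives)
    """
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    if q > 0:
        # Positive sample: cross-entropy weighted by the target score q,
        # so high-IoU samples contribute more to the loss.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negative sample: down-weighted by alpha * p**gamma, so easy
    # negatives (small p) contribute very little.
    return -alpha * (p ** gamma) * math.log(1.0 - p)
```

For a positive sample with q = 0.9, a prediction of 0.9 yields a much smaller loss than a prediction of 0.1, while a confidently rejected negative (p close to 0) contributes almost nothing.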
18. YOLO-NAS
- Quantization-aware modules [126], called QSP and QCI, that combine re-parameterization for 8-bit quantization to minimize the accuracy loss during post-training quantization.
- Automatic architecture design using AutoNAC, Deci’s proprietary NAS technology.
- Hybrid quantization method to selectively quantize certain parts of a model to balance latency and accuracy instead of standard quantization, where all the layers are affected.
- A pre-training regimen with automatically labeled data, self-distillation, and large datasets.
19. YOLO with Transformers
20. Discussion
- Anchors: The original YOLO model was relatively simple and did not employ anchors, while the state of the art relied on two-stage detectors with anchors. YOLOv2 incorporated anchors, leading to improvements in bounding-box prediction accuracy. This trend persisted for five years until YOLOX introduced an anchorless approach that achieved state-of-the-art results. Since then, subsequent YOLO versions have abandoned the use of anchors.
- Framework: Initially, YOLO was developed using the Darknet framework, with subsequent versions following suit. However, when Ultralytics ported YOLOv3 to PyTorch, the remaining YOLO versions were developed using PyTorch, leading to a surge in enhancements. Another deep learning framework utilized is PaddlePaddle, an open-source framework initially developed by Baidu.
- Backbone: The backbone architectures of YOLO models have undergone significant changes over time. Starting with the Darknet architecture, which comprised simple convolutional and max pooling layers, later models incorporated cross-stage partial connections (CSP) in YOLOv4, reparameterization in YOLOv6 and YOLOv7, and neural architecture search in DAMO-YOLO and YOLO-NAS.
- Performance: While the performance of YOLO models has improved over time, it is worth noting that they often prioritize balancing speed and accuracy rather than solely focusing on accuracy. This tradeoff is essential to the YOLO framework, allowing for real-time object detection across various applications.
Tradeoff between Speed and Accuracy
21. The Future of YOLO
22. Conclusions
Type | Filters | Size/Stride | Output | |
Conv | 64 | |||
Max Pool | ||||
Conv | 192 | |||
Max Pool | ||||
Conv | 128 | |||
Conv | 256 | |||
Conv | 256 | |||
Conv | 512 | |||
Max Pool | ||||
Conv | 256 | |||
Conv | 512 | |||
Conv | 512 | |||
Conv | 1024 | |||
Max Pool | ||||
Conv | 512 | |||
Conv | 1024 | |||
Conv | 1024 | |||
Conv | 1024 | |||
Conv | 1024 | |||
Conv | 1024 | |||
FC | 4096 | 4096 | ||
Dropout 0.5 | 4096 | |||
FC |
Num | Type | Filters | Size/Stride | Output |
1 | Conv/BN | 32 | ||
2 | Max Pool | |||
3 | Conv/BN | 64 | ||
4 | Max Pool | |||
5 | Conv/BN | 128 | ||
6 | Conv/BN | 64 | ||
7 | Conv/BN | 128 | ||
8 | Max Pool | |||
9 | Conv/BN | 256 | ||
10 | Conv/BN | 128 | ||
11 | Conv/BN | 256 | ||
12 | Max Pool | |||
13 | Conv/BN | 512 | ||
14 | Conv/BN | 256 | ||
15 | Conv/BN | 512 | ||
16 | Conv/BN | 256 | ||
17 | Conv/BN | 512 | ||
18 | Max Pool | |||
19 | Conv/BN | 1024 | ||
20 | Conv/BN | 512 | ||
21 | Conv/BN | 1024 | ||
22 | Conv/BN | 512 | ||
23 | Conv/BN | 1024 | ||
24 | Conv/BN | 1024 | ||
25 | Conv/BN | 1024 | ||
26 | Reorg layer 17 | |||
27 | Concat 25 and 26 | |||
28 | Conv/BN | 1024 | ||
29 | Conv | 125 |
Backbone | Detector |
Bag of Freebies | Bag of Freebies |
Data augmentation | Data augmentation |
- Mosaic | - Mosaic |
- CutMix | - Self-adversarial training |
Regularization | CIoU loss |
- DropBlock | Cross-mini-batch normalization (CmBN) |
Class label smoothing | Eliminate grid sensitivity |
Multiple anchors for a single ground truth | |
Cosine annealing scheduler | |
Optimal hyperparameters | |
Random training shapes | |
Bag of Specials | Bag of Specials |
Mish activation | Mish activation |
Cross-stage partial connections | Spatial pyramid pooling block |
Multi-input weighted residual connections | Spatial attention module (SAM) |
Path aggregation network (PAN) | |
Distance-IoU non-maximum suppression |
Version | Date | Anchor | Framework | Backbone | AP (%) |
YOLO | 2015 | No | Darknet | Darknet24 | 63.4 |
YOLOv2 | 2016 | Yes | Darknet | Darknet24 | 78.6 |
YOLOv3 | 2018 | Yes | Darknet | Darknet53 | |
YOLOv4 | 2020 | Yes | Darknet | CSPDarknet53 | |
YOLOv5 | 2020 | Yes | PyTorch | YOLOv5CSPDarknet | |
PP-YOLO | 2020 | Yes | PaddlePaddle | ResNet50-vd | |
Scaled-YOLOv4 | 2021 | Yes | PyTorch | CSPDarknet | |
PP-YOLOv2 | 2021 | Yes | PaddlePaddle | ResNet101-vd | |
YOLOR | 2021 | Yes | PyTorch | CSPDarknet | |
YOLOX | 2021 | No | PyTorch | YOLOXCSPDarknet | |
PP-YOLOE | 2022 | No | PaddlePaddle | CSPRepResNet | |
YOLOv6 | 2022 | No | PyTorch | EfficientRep | |
YOLOv7 | 2022 | No | PyTorch | YOLOv7Backbone | |
DAMO-YOLO | 2022 | No | PyTorch | MAE-NAS | |
YOLOv8 | 2023 | No | PyTorch | YOLOv8CSPDarknet | |
YOLO-NAS | 2023 | No | PyTorch | NAS |
