The Efficiency of YOLOv5 Models in the Detection of Similar Construction Details

Abstract: Computer vision solutions have become widely used in various industries and in everyday applications. One task of computer vision is object detection. With the development of object detection algorithms and the growing volume and variety of image data, different problems arise in building models suitable for particular solutions. This paper investigates the influence of the parameters used in the training process, i.e., the hyperparameters of the algorithm and the training parameters, on the detection of similar kinds of objects. The experimental investigation focuses on the widely used YOLOv5 algorithm and analyses the performance of different YOLOv5 models (n, s, m, l, x). A newly collected dataset of construction details (22 categories) is used. Experiments are performed using pre-trained YOLOv5 models. A total of 185 YOLOv5 models are trained and evaluated. All models are tested on 3300 images photographed on three different backgrounds: mixed, neutral, and white. Additionally, the best models are evaluated using 150 new images, each of which contains several dozen construction details and is photographed against different backgrounds. An in-depth analysis of the different YOLOv5 models and hyperparameters shows the influence of various parameters on the detection of similar objects. The best model was obtained using YOLOv5l with the following parameters: coloured images; image size 320; batch size 32; epoch number 300; layers freeze option 10; data augmentation on; learning rate 0.001; momentum 0.95; and weight decay 0.0007. These results may be useful for various tasks in which small and similar objects are analysed.


Introduction
Over the past decade, the application of artificial intelligence has grown in various areas. Many different methods of artificial intelligence can be used for different types of data analysis, such as those involving numbers, texts, sounds, and images. Deep learning methods play a significant role in many areas of scientific research. This is due to the possibility of using not only the CPU, but also the GPU in model building. One field of artificial intelligence based on deep learning methods is computer vision. The most popular computer vision tasks are image classification, segmentation, and object detection. For example, in medicine, image data can be used to predict different diseases, such as cancer [1,2], glaucoma [3,4], and pneumonia [5,6]. Object detection models can be used in systems for travel direction recommendation [7], in industry for solutions to robotization tasks [8,9], in face detection for different applications [10,11], or in other fields [12-16]. Usually, various computer vision methods or their combinations are used to solve a specific problem. This is because no single method is appropriate for all problems and, as a result, the results may depend on various factors.
One of the factors in building successful artificial intelligence models is properly prepared data for model training. Much research has analysed data in which the objects have distinctly different features and are large in the real world [17,18]. In such cases, the object detection results are high. A more complex task is to detect similar objects within an image, especially when the objects are small in the real world. Objects can be similar in the following characteristics: shape, colour, size, etc. For example, in medical pill detection, some of the pills can look identical in different images, depending on the angle, distance, lighting, shadows, and other external factors. At self-service checkouts [19], object detection methods have been implemented to detect fruits. It is difficult for models to determine what type of apple the customer is trying to buy due to the similarity between fruits. The same problem occurs in construction detail analysis because some details can look identical. Detecting similar objects requires a much deeper analysis.
In this investigation, the efficiency of YOLOv5 has been investigated using the newly collected construction details dataset [20]. In construction detail analysis, datasets have a large number of categories, and items have similar features. It is therefore important to investigate the parameters of the dataset and find which of them has the highest influence on object detection. Additionally, one must take into account the size of the chosen YOLOv5 model and the selected training hyperparameters. The training dataset used in the experimental investigation consisted of 440 images (22 construction details on a white background, with 20 images in each category). Additionally, a test dataset of 3300 images was used (22 construction details on 3 different backgrounds, with 50 images of each category on each background). Because training each model takes a large amount of time, the experimental investigation was performed in two stages. In the first stage, primary research was performed to determine the influence of epoch number, image size, batch size, layer freeze option, and data augmentation on object detection results. A total of 50 experiments were performed using the two models most widely used by other researchers, YOLOv5s and YOLOv5m. During the primary experiments, the best parameters were found and used in the second stage. In the second stage, a total of 135 experiments were performed using all five models of YOLOv5 (n, s, m, l, x). The main aim was to find the best training hyperparameters, such as learning rate, weight decay, and momentum.
The main contributions of the paper are as follows:
(1) The newly collected dataset has been prepared, is publicly available, and can be used in various computer vision tasks.
(2) The five YOLOv5 models of different sizes have been experimentally investigated using the newly collected construction details. A total of 185 experiments have been performed, in which various combinations of the training and algorithm parameters have been analysed.
(3) The results of the experimental investigation have shown the efficiency of different models, which allows us to see which nondefault parameters help to achieve higher object detection results. This could be useful for other researchers when analysing data with similar features.
(4) The models could be used in recommendation systems that suggest a possible construction by detecting several dozen construction details in one image.
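The dataset sizes and experiment counts quoted above follow from simple products and sums. The short Python sketch below only re-derives them from the figures given in the text; the variable names are illustrative and are not taken from the authors' code.

```python
# Counts quoted in the text; names are illustrative, not from the authors' code.
CATEGORIES = 22
TRAIN_IMAGES_PER_CATEGORY = 20
TEST_IMAGES_PER_CATEGORY_PER_BACKGROUND = 50
BACKGROUNDS = ("white", "neutral", "mixed")

# Training set: 22 categories x 20 images on a white background.
train_total = CATEGORIES * TRAIN_IMAGES_PER_CATEGORY

# Test set: 22 categories x 50 images x 3 backgrounds.
test_total = (
    CATEGORIES * TEST_IMAGES_PER_CATEGORY_PER_BACKGROUND * len(BACKGROUNDS)
)

# Experiments: 50 in the primary stage plus 135 in the main stage.
experiments_total = 50 + 135

print(train_total, test_total, experiments_total)
```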
The structure of this paper is as follows. In Section 2, the related works are reviewed. No research has yet solved the problem addressed here, namely the detection of similar construction details; nevertheless, the most closely related research is reviewed. In addition, a brief overview of the most popular object detection algorithms is presented. In Section 3, an experimental investigation scheme is presented and described in detail. All steps from data collection and data preprocessing to model training and evaluation are presented. In Section 4, the discussion and limitations of the research are presented. Section 5 concludes the paper.

Related Works
The literature analysis has shown that, due to the complexity of the task, there is a lack of research that focuses on the detection of similar objects. Therefore, it is difficult to perform a comparative analysis of such research results. Usually, in such types of object detection tasks, the accuracy of the obtained model is lower compared with other types of data. Therefore, in various investigations, different object detection algorithms and their parameters are changed in order to increase the model's accuracy. Several research studies have been published that deal with the problem of similar object detection, though they have used different kinds of data.
In the investigation by Kwon et al. [21], the detection of medical pills was analysed using a deep learning algorithm. The authors proposed a two-step model based on a mask region-based convolutional neural network (Mask R-CNN) [22] that improved the detection performance of medical pills. In the first step, the object localization problem was solved in order to detect the medical pill in the image, and, in the second step, the multiclass classification was solved in order to detect the possible type of the medical pill. According to the testing results, the accuracy of the proposed Mask R-CNN model (91%) is 18 percentage points higher than that obtained using YOLOv3 [23] (73%). The results have shown that the proposed model can be applied in cases when a small amount of data is used to train the object detection models. Another study, which also focused on the real-time detection of medical pills, was performed by Tan et al. [24]. In this research, the efficiencies of the following three object detection algorithms were investigated: RetinaNet, Single Shot Multi-Box Detector (SSD), and YOLOv3. The results of the experimental investigation show that RetinaNet is not suitable for real-time medical pill detection due to its slow performance (17 FPS), although its accuracy was the highest of the analysed algorithms (82.89%). The highest speed was obtained by YOLOv3 (51 FPS), but its accuracy (80.69%) is lower than that of RetinaNet and SSD. Intermediate performance was obtained by the SSD algorithm, with an accuracy of 82.71% (slightly lower than that of RetinaNet) and a speed of 32 FPS. In conclusion, the authors state that YOLOv3 is more suitable for similar object detection tasks when medical pills are analysed.
In the research by Ou et al., models based on convolutional neural networks were used to detect and classify medical pills in images. In 2018 [25], an improved model of Inceptionv3 [26] was used, wherein models were trained using a newly collected dataset. The prepared dataset consisted of more than 470,000 images, where each category (different types of medical pills, for a total of 131 categories) had approximately 3600 images, taken from various angles. During the experimental research, the resolution of the images was transformed to 299 × 299. The accuracy of the model was evaluated using an additional set of 400 medical pill images with 2825 annotations. The proposed model achieved 79.4% accuracy. Later, in 2020, Ou et al. [27] used Inception-ResNetv2 for the medical pill classification task due to its experimental performance. The same type of dataset was used, but with a larger number of medical pill images (612 categories) prepared for the model training process. Furthermore, the authors analysed the efficiency of various classifiers (VGG-16, VGG-19, ResNet-50, ResNet-101, Inceptionv3, Inceptionv4, Xception, Inception-ResNetv1, Inception-ResNetv2). The highest accuracy (82.1%) was achieved using Inception-ResNetv2, and the lowest accuracy was obtained using VGG-16 (40.5%).
Saeed et al. [28] proposed an approach for the detection of small industrial objects using an improved faster regional convolutional neural network (Improved Faster RCNN). The main aim of their research was to detect and recognize screws in images. This problem is also related to the problem of similar object detection because, in some images taken from different angles, the various screw types may look the same. To train the models, the authors collected a new dataset from many images of industrial products in which screws could be found. A total of 917 original images of four different types (325, 163, 251, 178) of screws were taken. An augmentation of the dataset was applied and a total of 63,013 images were used in the experimental investigation. The efficiency of the proposed improved model of Faster RCNN was compared with RCNN, Fast RCNN, and Faster RCNN. The experimental results show that the highest accuracy was achieved using the improved Faster RCNN (~91%), followed by the Faster RCNN (~89%), Fast RCNN (~84%), and RCNN (~83%).
In the research by Yildiz et al. [29], the authors proposed the combination of the Xception and Inceptionv3 models in order to detect screws in automated disassembly processes. The main objective of the research was to detect screws during hard disk disassembly. All images analysed in the training process were transformed to greyscale. In the research, the efficiencies of Xception, Inceptionv3, ResneXt101, InceptionResnetv2, Densenet201, and Resnet101v2 were evaluated. All analysed models achieved an accuracy greater than 96%, but the highest accuracy was obtained by Inceptionv3 (98.8%), followed by InceptionResnetv2 (98.6%), and ResneXt101 with Xception (98.5%). The lowest accuracy was obtained using Resnet101v2 (96.9%). The authors decided to combine two models with the highest accuracy to increase the accuracy of the combined classifier. For this reason, the results of the models were combined using some chosen weights, and the final prediction results were calculated. The combination of the proposed models achieved 99% accuracy when analysing the selected dataset.
In the research by Mangold et al. [30], the YOLOv5 models were used to detect the screw head for automated disassembly and remanufacturing. The authors investigated two models of the YOLOv5 group, YOLOv5s and YOLOv5m. The dataset used in the investigation was pre-processed, and the size of the images was reduced to 640 × 640 (the original size of the images was 1200 × 1200). During model training, the batch size was equal to 32. The results of the experimental investigation show that the highest accuracy was obtained using the YOLOv5s model (mAP@0.5 of 98.4% and mAP@0.5:0.95 of 83.4%). A slightly lower accuracy was obtained using YOLOv5m (mAP@0.5 of 98% and mAP@0.5:0.95 of 82.6%), but the difference between the accuracies of the two models is not significant. The trained YOLOv5s model was evaluated in a real environment, where 20 small and 7 large motor images were passed to the model in order for it to detect screws. The testing results show that 39 out of 45 screws were correctly detected in the images of the small motors and 15 out of 17 screws were correctly detected in the images of the large motors.
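The mAP@0.5 and mAP@0.5:0.95 metrics quoted above are both built on the intersection over union (IoU) between a predicted and a ground-truth box. As a minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) form and the usual COCO-style thresholds from 0.50 to 0.95 in steps of 0.05, the following Python code shows how a single prediction counts as a match at some thresholds but not at others. The function names and the example boxes are illustrative; this is not a full mAP implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so that disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds over which mAP@0.5:0.95 is averaged: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matches_at(pred_box, true_box):
    """Map each IoU threshold to whether the prediction counts as a match."""
    score = iou(pred_box, true_box)
    return {t: score >= t for t in THRESHOLDS}


# Illustrative boxes: a 100 x 100 prediction shifted by 10 pixels in x and y.
example = matches_at((10, 10, 110, 110), (20, 20, 120, 120))
print(example)
```

A prediction that is accepted at the 0.5 threshold can still be rejected at stricter thresholds, which is why mAP@0.5:0.95 values are lower than mAP@0.5 values for the same model.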
This literature review has shown that many object detection algorithms exist and are used in various fields, for example, RCNN, Faster RCNN, SSD, YOLO, etc. [31-33]. Nowadays, one of the most popular object detection algorithm groups is YOLO, which can be used in real-time object detection tasks and whose algorithms allow one to obtain promising results in different areas. Of course, there exist many versions of the YOLO algorithm, from the first original version of YOLO to YOLOv8, YOLO-NAS, and YOLO with transformers [34]. The newest versions of YOLO, starting from YOLOv6, are still in the development process, so there are different issues with their practical use. One of the most stable recent versions is YOLOv5, which is widely used in scientific research, such as small and similar object detection [35]. YOLOv5 differs from previous versions of the YOLO algorithm because it uses the PyTorch framework, rather than Darknet, and because it uses CSPDarknet53 as the backbone. The YOLOv5 architecture uses the path aggregation network (PANet) as a neck by which to increase the flow of information. The head of YOLOv5 is the same as that of YOLOv3 and YOLOv4, which generates three different feature map outputs to achieve multiscale prediction. This helps to effectively improve the prediction of small and large objects in the model. The output layer generates the results.
In the manuscript by Dlužnevskij et al. [36], experimental research has been performed to investigate the efficiency of YOLOv5 using a mobile device with real-time object detection tasks. Four different models of YOLOv5 have been analysed (YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x). The experiments were conducted using the original COCO dataset, reducing it to fit the requirements of the mobile environment. The results of the experimental investigation show that the performance of the model is highly influenced by the hardware architecture and the system in which the model is used.
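To make the multiscale prediction of the YOLOv5 head more concrete, the following Python sketch computes the sizes of the three output feature maps for a given input resolution. It assumes the standard YOLOv5 configuration with detection strides of 8, 16, and 32 and three anchor boxes per scale; these values are not stated in the text and are used here only for illustration.

```python
# Assumed standard YOLOv5 settings (not stated in the text): three detection
# scales with strides 8, 16 and 32, and three anchor boxes per grid cell.
STRIDES = (8, 16, 32)
ANCHORS_PER_SCALE = 3


def grid_sizes(image_size):
    """Side length of each square output feature map for a square input."""
    return [image_size // stride for stride in STRIDES]


def candidate_boxes(image_size):
    """Total number of candidate boxes produced before filtering."""
    return sum(ANCHORS_PER_SCALE * g * g for g in grid_sizes(image_size))


for size in (320, 640):
    print(size, grid_sizes(size), candidate_boxes(size))
```

Under these assumptions, a 320 × 320 input gives grids of 40, 20, and 10 cells per side, so the finest grid cell covers 8 × 8 pixels. Small objects are mainly handled by the finest scale and large objects by the coarsest one.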
In our previous research [37], the influence of training parameters on the detection of real-time construction details using YOLOv5s was analysed. Parameters such as image resolution, batch size, iteration number, and colour of images were investigated. The focus was only on the YOLOv5s model, which is usually suitable for real-time object detection in a limited technical environment, such as mobile phones. The results of the experimental investigation have shown that, in many cases, the optimal resolution of the construction detail images should be 320 × 320 or 640 × 640 and that colour images allow slightly better results compared with greyscale images. Choosing a higher image resolution leads to a lower accuracy of construction detail detection. Furthermore, during model training, the batch size should be chosen as 16 or 32 to achieve higher model accuracy. The limitation of the research was that the other versions of YOLOv5 (n, m, l, x) were not analysed. Additionally, the hyperparameters were not changed during the model training process; instead, only the best hyperparameters based on the analysis of related works were used. The results of related works [38,39] have shown that, generally, other similar research has focused only on a small number of YOLOv5 hyperparameters, such as learning rate, momentum, augmentation parameters, and weight selection, and that other hyperparameters are usually not changed. In that research, the dataset used in the training process was not balanced, and this is important to consider when evaluating the models. Therefore, it is necessary to investigate the influence of different versions of YOLOv5 and hyperparameters on the detection of construction details.
Related works have shown that there is no single best model for object detection and that the results depend on various factors. One of the most important factors is the dataset being analysed. When analysing similar objects, such as medical pills, screws, or construction details, correct detection depends on the angle of the camera, the lighting, and the position. In some cases, one object can look similar to another. Image pre-processing, such as changing the colour of the image or the resolution, also influences the detection results. During the training of the models, it is important to select suitable hyperparameters. However, in computer vision, each new combination of hyperparameters costs a lot of training time because of the image analysis tasks and the model complexities. The selection of the object detection model is also one of the hardest parts, because related works have shown that older models, such as RCNN, Fast RCNN, or Faster RCNN, can be used to obtain an accuracy that is not inferior to the latest models. In addition, there are many versions or modifications of the object detection models in the scientific literature. All these facts show that it is important to investigate the efficiency of the object detection models using various factors and to find the best combination for each specific domain.

Experimental Investigation
To investigate the influence of various training parameters on different models of YOLOv5, an experimental investigation was performed. YOLOv5 is a large step forward in object identification algorithms, departing from its predecessors by leveraging the PyTorch framework and incorporating the CSPDarknet53 backbone with a new pooling architecture. This architecture addresses feature fusion and computational efficiency concerns, improving object localisation accuracy while reducing model size. The focus layer enhances memory use and propagation efficiency [40]. Different combinations of training parameters have been used to find the highest accuracy in construction detail detection and, for this reason, a total of 185 models have been created and evaluated. The research workflow is presented in Figure 1.
The research was performed in two stages. The first stage focuses on training parameters and the second stage focuses on the hyperparameters of the pre-trained YOLOv5 models [40]. All of the YOLOv5 models presented in Table 1 have been trained using the well-known COCO2017 dataset, which was collected and prepared for object detection and segmentation tasks. The COCO2017 dataset is a subset of the MS COCO dataset (containing 164,000 images of 80 different objects with bounding boxes and segmentation masks for each data item). The models were trained using 118,000 images, and the remainder of the dataset was used for validation (5000) and testing (41,000) of the models. All of the steps of the experimental investigation that were performed, from data preparation to model training and evaluation, are described in this section in more detail.
During the experimental investigation, all models were trained in an environment with the following specifications: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 Threads, 10 Cores). The environment used a Linux operating system with 32 GB DDR4 RAM and a Tesla P100 PCIe 12GB GPU.

Results of the Primary Research
The newly collected construction details dataset was used in the experimental investigation [20]. The dataset was constructed in such a way that it is divided into three parts in order to be used in three stages. In the first stage, the dataset of 440 images was collected to train the models. The dataset consists of 22 different categories of images of construction details that were photographed on a white background. Each construction detail has been rotated 20 times in order for each picture to show a new angle. Each item of the dataset has been manually annotated. The number of images in the dataset is not high because the pre-trained models of YOLOv5 have been used. A sample of the analysed dataset is presented in Figure 2.
To evaluate the efficiency of the models, a larger number of construction details have been prepared. Each construction detail used in the training process has been photographed 50 times from different angles using 3 different backgrounds: white (W), neutral (N), and mixed (M) (Figure 3). The main reason for using three different backgrounds is to simulate the efficiency of the models in a real environment. On a neutral background, all analysed construction details can be clearly observed. In contrast, on a white background, all details usually stand out and are highlighted from the background, except the construction details of the white colour. On the third background, which is mixed, the pattern can be considered as noise. In this case, it is more difficult to correctly detect the object compared with the white and neutral backgrounds. A total of 3300 images were prepared (1100 images on each background).
The last part of the dataset which was used in the experimental investigation is formed by 150 images. In these, several dozen construction details are placed on the three backgrounds (Figure 4). The main idea of these images is to evaluate the efficiency of the YOLOv5 models that obtained the highest accuracy.
As mentioned above, selecting different parameters during the training process can lead to different results. It is important to investigate not only the hyperparameters of the YOLOv5 models, but also the other important parameters that could influence the final detection results. Training each model can be time consuming, so the experimental study was divided into two stages. The first was called primary research, and the second was called main research. In the primary research, the influence of the following five parameters was investigated: epoch number (one epoch is one complete forward and backward pass of all training examples), batch size (number of images processed simultaneously in a forward pass), image size, layer freezing option, and different data augmentation options. Related works have shown the efficiency of YOLOv5s and YOLOv5m when used for the detection of small objects with similar features. For this reason, these two models have been chosen for the primary research. Various combinations of training parameters have been used in the training process (Table 2) and a total of 50 models were created and evaluated.
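The relationship between epoch number, batch size, and the amount of training work can be sketched with the figures given in the text (440 training images; batch sizes 16 and 32; 300 and 600 epochs). The Python snippet below assumes that the last, incomplete batch of each epoch is still processed, which is an assumption about the data loader and not something stated in the paper; the function name is illustrative.

```python
import math

TRAIN_IMAGES = 440  # size of the training set given in the text


def total_iterations(epochs, batch_size, n_images=TRAIN_IMAGES):
    """Number of weight updates, assuming the last partial batch is kept."""
    batches_per_epoch = math.ceil(n_images / batch_size)
    return epochs * batches_per_epoch


for epochs in (300, 600):
    for batch_size in (16, 32):
        print(epochs, batch_size, total_iterations(epochs, batch_size))
```

Under this assumption, doubling the batch size halves the number of weight updates per epoch, so 600 epochs at batch size 32 involve the same number of updates as 300 epochs at batch size 16.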

Table 2. Parameters which were investigated in primary research.

| Name of the Parameter | Value of the Parameters | Comment |
|---|---|---|
| Epoch number | 300, 600 | The results of our previous research [37] have shown that these parameter options allow for the highest object detection results. |
| Image size | 320, 640 (pixels) | Same comment as for the epoch number. |
| Batch size | 16, 32 | Same comment as for the epoch number. |
| Layers freeze option | 10 | The layer freeze option [41] allows backbone and head layers to be frozen, i.e., excluded from weight updates during training. Primary research has shown that after 10 backbone layers were frozen, training time was reduced by a factor of approximately 2 and construction detail recognition accuracy improved by a factor of approximately 1.5. |
| Augmentation | 13 options | The different options for data augmentation have been experimentally chosen and analysed [42-44]. |

To find the best parameter combination, the 50 models have been evaluated using 3300 images. The experiments have been named according to the parameters used in the training process. For example, the name of the model Yolov5s_320_16_300_DefAugm means that the YOLOv5s model has been used and that the parameters are as follows: image size 320; batch size 16; epoch number 300; default parameters of data augmentation. The results of the experimental investigation show that, without data augmentation, the detection results are much lower when compared with the results using augmentation. This is true regardless of whether the default or custom options of the data augmentation have been used. Additionally, the first experiments have shown that object detection accuracy increases significantly when the first 10 backbone layers are frozen.
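The experiment naming convention described above (model, image size, batch size, epoch number, augmentation option, joined by underscores) can be sketched as a small helper. The class and function names below are hypothetical and are not taken from the authors' code; only the example name Yolov5s_320_16_300_DefAugm comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    """One training configuration; field names are illustrative."""
    model: str
    image_size: int
    batch_size: int
    epochs: int
    augmentation: str

    def name(self):
        # Same field order as in the example name given in the text.
        return (
            f"{self.model}_{self.image_size}_{self.batch_size}"
            f"_{self.epochs}_{self.augmentation}"
        )


def parse_name(name):
    """Split an experiment name back into its five fields."""
    model, image_size, batch_size, epochs, augmentation = name.split("_", 4)
    return Experiment(
        model, int(image_size), int(batch_size), int(epochs), augmentation
    )


exp = parse_name("Yolov5s_320_16_300_DefAugm")
print(exp)
print(exp.name())
```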
The influence of different combinations of data augmentation options has been analysed. The results of the experiment show that the best detection ratio achieved is equal to 0.4 (40% of correct detection). In this case, the highest number of construction details has been detected on every background (322 construction details on the mixed (M) background; 509 construction details on the neutral (N) background; 485 construction details on the white (W) background). Overall, the results show that almost every model detects the construction details best on a neutral background. It is worth noting, however, that the models have been trained with construction details placed on a white background. During the primary research, the parameters that achieve the highest ratio were found; they are presented in Table 3. These parameters are used in the main research. The results of all of the experiments are presented in Table 4.
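The reported detection ratio of 0.4 can be checked against the per-background counts given above. The Python sketch below assumes that the ratio is the number of correctly detected construction details divided by the number of test images (1100 per background, one detail per image); this definition is inferred from the figures in the text and is not stated explicitly by the authors.

```python
# Correct detections of the best primary-stage model, as given in the text.
DETECTED = {"mixed": 322, "neutral": 509, "white": 485}
IMAGES_PER_BACKGROUND = 1100  # 22 categories x 50 images

# Assumed definition: correct detections divided by test images.
per_background = {
    background: count / IMAGES_PER_BACKGROUND
    for background, count in DETECTED.items()
}
overall = sum(DETECTED.values()) / (IMAGES_PER_BACKGROUND * len(DETECTED))

for background, ratio in per_background.items():
    print(f"{background}: {ratio:.3f}")
print(f"overall: {overall:.3f}")
```

Under this assumption, the overall ratio is 1316 / 3300, which rounds to the reported 0.4, and the neutral background gives the highest per-background ratio.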
Table 3. The best parameters obtained in the primary research.

Image size 320
Batch size 32

Results of the Main Research
During the training process of YOLOv5, various hyperparameters can be chosen that could influence the results of object detection. The analysis of related works has shown that many researchers have focused on the following three main hyperparameters: learning rate, momentum, and weight decay. Various values of these hyperparameters are used in scientific papers. Based on other research, our main experiments investigate several combinations of hyperparameters. In the main research, five versions of YOLOv5 have been trained using the parameters obtained from the primary research results. The hyperparameters used in the main research are presented in Table 5. A total of 135 models have been trained and evaluated.
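The experiment count can be reproduced with a small grid sketch. It assumes three candidate values per hyperparameter, which is consistent with 135 = 5 × 3 × 3 × 3; only the values 0.001, 0.95, and 0.0007 come from the text, and all other values are hypothetical placeholders rather than the contents of Table 5. The key names `lr0`, `momentum`, and `weight_decay` follow the naming used in YOLOv5 hyperparameter files.

```python
from itertools import product

MODELS = ["yolov5n", "yolov5s", "yolov5m", "yolov5l", "yolov5x"]

# Placeholder grids: only 0.001, 0.95 and 0.0007 are taken from the text.
LEARNING_RATES = [0.001, 0.01, 0.1]       # hypothetical except 0.001
MOMENTA = [0.90, 0.95, 0.99]              # hypothetical except 0.95
WEIGHT_DECAYS = [0.0003, 0.0005, 0.0007]  # hypothetical except 0.0007

runs = [
    {"model": m, "lr0": lr, "momentum": mo, "weight_decay": wd}
    for m, lr, mo, wd in product(MODELS, LEARNING_RATES, MOMENTA, WEIGHT_DECAYS)
]
print(len(runs))  # 5 models x 27 hyperparameter combinations = 135
```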
The results of the main research show that, across the various combinations of hyperparameters, the highest correct detection ratio of construction details is equal to 0.5012 (50%). This result was obtained with the YOLOv5l model trained with a learning rate of 0.001, a momentum of 0.95, and a weight decay of 0.0007. In some cases, the correct detection ratio is equal to 0. The lowest correct detection ratio was obtained using YOLOv5n. The highest correct detection ratios obtained for each YOLOv5 model (n, s, m, l, x) are presented in Figure 5.
As one can see in Figure 5, the lowest correct detection ratio was obtained using YOLOv5n (22%). YOLOv5m (40%) outperformed YOLOv5s (34%) by 6 percentage points. The results obtained using YOLOv5x (48%) are slightly lower than those of YOLOv5l (50%). In addition, the curves of precision, recall, mAP@0.5, and mAP@0.5:0.95 are presented in Figure 6.
One can see (Figure 6) that the model is still learning until approximately epoch 200, after which there is no further progress. The recall and precision of the model are close to 1. The mAP@0.5 metric is close to 1 after 100 epochs. The mAP@0.5:0.95 metric continues to increase throughout all 300 training epochs.
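The different behaviour of the two mAP curves follows from their definitions: mAP@0.5 is measured at a single IoU threshold of 0.5, whereas mAP@0.5:0.95 is the mean over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05. The sketch below shows this averaging; the per-threshold values are invented for illustration and are not results from this study.

```python
def iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP@0.5:0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def map_50_95(map_at):
    """Average the per-threshold mAP values over all ten thresholds."""
    thresholds = iou_thresholds()
    return sum(map_at[t] for t in thresholds) / len(thresholds)


# Invented per-threshold mAP values: stricter thresholds score lower.
illustrative = [0.99, 0.98, 0.97, 0.95, 0.92, 0.88, 0.82, 0.72, 0.55, 0.30]
map_at = dict(zip(iou_thresholds(), illustrative))

print(f"mAP@0.5      = {map_at[0.5]:.3f}")
print(f"mAP@0.5:0.95 = {map_50_95(map_at):.3f}")
```

Because the stricter thresholds require more precise bounding boxes, the averaged metric can keep improving after mAP@0.5 has already saturated.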
The results of all of the main research experiments are presented in Table 6. The confusion matrices of the best model on the three different backgrounds are presented in Figure 7. As one can see, the smallest number of correct detections was obtained on the mixed background (497). On this background, two details were never detected correctly and were instead recognized as different construction details. Furthermore, many details were not assigned to any class at all, which indicates that details merge with the mixed background. On the neutral background, the number of correct detections is larger (562), although the same construction detail as on the mixed background was still recognized incorrectly. On the white background, every detail has been correctly recognized at least once and the number of correct detections is the largest (595); in this case, 54% of the construction details were recognized correctly.
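The reported ratios can be checked from the correct detection counts. The sketch below assumes that the 3300 test images are split equally across the three backgrounds (1100 each) and that each test image contains one construction detail; both assumptions are inferred from the reported 54% on the white background rather than stated explicitly.

```python
TOTAL_IMAGES = 3300
IMAGES_PER_BACKGROUND = TOTAL_IMAGES // 3  # assumption: equal split

# Correct detections of the best model, as read from the confusion matrices.
correct = {"mixed": 497, "neutral": 562, "white": 595}

per_background = {bg: n / IMAGES_PER_BACKGROUND for bg, n in correct.items()}
overall = sum(correct.values()) / TOTAL_IMAGES

for bg, ratio in per_background.items():
    print(f"{bg}: {ratio:.1%}")
print(f"overall: {overall:.4f}")
```

Under these assumptions, the pooled ratio reproduces the reported value of 0.5012.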
Appl. Sci. 2024, 14, x FOR PEER REVIEW 13 of 19

Discussion
This experimental investigation has shown the importance of training parameter and hyperparameter selection in the model training process. In this investigation, a total of 185 models were trained. The main problem with object detection tasks is that there are many different options for how to train the models, so it is hard to consider all of them. This is especially so when training each model takes a long time. In this research, many different parameter combinations were evaluated. The results may be useful for tasks related to the detection of objects with similar features. The analysed dataset has 22 categories. Some construction details from different categories can look identical when photographed from certain angles. For this reason, the results are lower than they might otherwise be, but they are still promising and valuable for future research. Because of its complexity, the construction detail detection task may be useful for evaluating the efficiency and performance of detection models.
The model obtained in the main research could be used to develop a recommendation system for building constructions: it would detect details in an image and suggest possible constructions. The system or application could be implemented in a mobile environment. An additional experiment was performed in which 150 new images were fed to the best obtained models. As mentioned, each image contains several dozen construction details placed and photographed on the three different backgrounds that were used in the primary and main research. A sample of the construction detail detection results is presented in Figure 8.
The five best YOLOv5 models obtained in the main research (Figure 5) were evaluated using the 150 images. The full results of the correct detection ratio of each construction detail are presented in Table 7. As one can see, the worst detection results are obtained when the mixed background is used. On this background, only YOLOv5m achieved a correct detection ratio larger than 0 for almost all details, i.e., recognized almost every detail at least once. The highest correct detection ratio is obtained on the white background. Overall, the results show that some construction details, such as 1x1_h2_round, 2x3_h1, and 2x3_h2, were not detected at all or were detected by only a few models. The details 2x3_h1 and 2x3_h2 differ in height but can look identical from some angles. Some of the construction details, for example 2x8_h1 and 4x4_h1 on the neutral background, were recognized correctly every time by the YOLOv5m model. The summarized results of the additional research show that the YOLOv5m model recognizes the highest number of construction details compared with the other four models.
Table 7 (fragment as extracted; the column headers and the label of the first row are missing, and the last row is truncated):
(label missing) 0.00 0.00 0.13 0.00 0.04 0.00 0.07 0.14 0.19 0.00 0.00 0.00 0.01 0.00 0.00
1x2_h2 0.00 0.00 0.01 0.00 0.01 0.05 0.01 0.01 0.04 0.00 0.02 0.00 0.00 0.02 0.00
2x3_h1 0.00 0.00 0.00 0.00 0.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2x4_h1 0.02 0.02 0.
This research has some limitations. The results have not been compared with other object detection models, because the experimental investigation has been based only on the YOLOv5 algorithm. Additionally, it is not possible to ensure that the same results could be obtained using another dataset that has similar features. The results can still depend on many different aspects, for example, the angle from which the image is taken and the noise appearing around the construction details. However, the results may still be useful to other researchers. The experimental investigation has shown the importance of the freeze option in the training process and of the use of non-default parameters to obtain higher object detection results.

Conclusions
In this paper, the influence of the training parameters and hyperparameters of YOLOv5 on the detection of construction details was analysed. Construction details were chosen because detecting objects with similar features is a complex task. In some cases, different construction details appear identical, depending on the angle from which the camera captures them. During the research, five models of YOLOv5 were analysed. A total of 185 models were trained and evaluated. Model efficiency was tested using a total of 3300 images photographed on three backgrounds of different complexity. The influence of five training parameters (image size, batch size, epoch number, layer freeze option, and data augmentation) and three hyperparameters (learning rate, momentum, and weight decay) was analysed. All of the parameters mentioned were used in various combinations.
The results of the experimental investigation show that the best parameters for the detection of construction details are as follows: coloured images; image size of 320; batch size of 32; epoch number of 300; layer freeze option of 10; data augmentation on; learning rate of 0.001; momentum of 0.95; and weight decay of 0.0007. In this case, the percentage of correct detections is equal to 50% when all three backgrounds are considered together. On the white background alone, the percentage of correct detections is equal to 54%. The experimental investigation has shown that the lowest detection results are obtained when the mixed background is used. The main reason for this is that some details merge with the background and, therefore, the models cannot detect them. The additional research using several dozen construction details in the same image (on three different backgrounds) has shown that the YOLOv5m model correctly recognizes the highest number of construction details.
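As an illustration, the best configuration could be expressed as a training command. The sketch below assumes the `train.py` script of the Ultralytics YOLOv5 repository and its command-line flags; the two YAML file names are hypothetical, and the flags should be checked against the repository version actually used.

```python
import shlex

# Best training parameters reported in this paper.
best_options = {
    "weights": "yolov5l.pt",
    "imgsz": 320,
    "batch-size": 32,
    "epochs": 300,
    "freeze": 10,
}

# Best hyperparameters; these would be stored in the hyperparameter YAML file.
best_hyp = {"lr0": 0.001, "momentum": 0.95, "weight_decay": 0.0007}


def build_command(options, data_yaml, hyp_yaml):
    """Assemble the argument list for a YOLOv5 training run."""
    cmd = ["python", "train.py", "--data", data_yaml, "--hyp", hyp_yaml]
    for key, value in options.items():
        cmd += [f"--{key}", str(value)]
    return cmd


# Both file names are hypothetical placeholders.
cmd = build_command(best_options, "construction_details.yaml", "hyp.construction.yaml")
print(shlex.join(cmd))
for key, value in best_hyp.items():
    print(f"{key}: {value}")
```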
The number of correct detections could be increased if the YOLOv5 model were used only to localize the construction details in the image. As a second step, an additional binary classification could be used to determine the correct construction detail. This could be implemented in the future to find the best way to detect similar construction details at different angles.
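One possible shape of such a two-step pipeline is sketched below. It is only an assumption about how the idea might be organized, not a description of an implemented system: the `Detection` type, the look-alike pairing, and the toy classifier are all hypothetical, and the detector output is supplied by hand.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    x1: int
    y1: int
    x2: int
    y2: int
    label: str


# Hypothetical pairing of classes that look alike from some angles.
LOOKALIKE = {"2x3_h1": "2x3_h2", "2x3_h2": "2x3_h1"}


def crop(image, det):
    """Cut the detected region out of a toy image given as rows of pixels."""
    return [row[det.x1:det.x2] for row in image[det.y1:det.y2]]


def refine(image, detections, is_first_of_pair):
    """Step 2: re-decide look-alike labels with a binary classifier."""
    refined = []
    for det in detections:
        if det.label in LOOKALIKE:
            first, second = sorted((det.label, LOOKALIKE[det.label]))
            patch = crop(image, det)
            label = first if is_first_of_pair(patch, first, second) else second
            det = det._replace(label=label)
        refined.append(det)
    return refined


# Toy stand-ins: a blank 4x4 image, hand-written detections (step 1), and a
# classifier that picks the lower detail when the patch is at most 2 rows tall.
image = [[0] * 4 for _ in range(4)]
detections = [Detection(0, 0, 2, 4, "2x3_h1"), Detection(0, 0, 2, 2, "4x4_h1")]
refined = refine(image, detections, lambda patch, a, b: len(patch) <= 2)
print([d.label for d in refined])
```

Only detections whose labels belong to a look-alike pair are passed to the second step, so the classifier is not applied to classes that the detector already separates well.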

Figure 1. The workflow of the experimental investigation.
been rotated 20 times in order for each picture to show a new angle. Each item of the dataset has been manually annotated. The number of images in the dataset is not high because the pre-trained models of YOLOv5 have been used. A sample of the analysed dataset is presented in Figure 2.

Figure 2. A sample of the dataset used to train the YOLOv5 models.

Figure 3. A sample of the dataset used to evaluate the YOLOv5 models.

Figure 4. A sample of the dataset used to evaluate the best YOLOv5 model, with several dozen details in one image.

As mentioned above, selecting different parameters during the training process can lead to different results. It is important to investigate not only the hyperparameters of the YOLOv5 models, but also the other important parameters that could influence the final detection results. Training each model can be time consuming, so the experimental study was divided into two stages. The first was called primary research, and the second was called main research. In the primary research, the influence of the following five parameters was investigated: epoch number (an epoch being one complete forward and backward pass over all training examples), batch size (the number of images processed simultaneously in a forward pass), image size, layer freeze option, and different data augmentation options. Related works have shown the efficiency of YOLOv5s and YOLOv5m when used for the detection of small objects with similar features. Therefore, these two models have been chosen for the primary research. Various combinations of training parameters have been used in the training process (Table 2), and a total of 50 models were created and evaluated.
Data augmentation options [42][43][44]:
hsv_h: HSV hue augmentation of the image.
hsv_s: HSV saturation augmentation of the image.
hsv_v: HSV value augmentation of the image.
degrees: rotation (+/− degrees) of the image.
translate: shifting or moving the objects within the image.
scale: resizing the input images to different scales.
shear: geometric deformations by tilting or skewing the images along the x or y axes.
perspective: simulates perspective changes.
flipud: flips the image vertically; the top becomes the bottom, and vice versa.
fliplr: flips the image horizontally; the left side becomes the right side, and vice versa.
mosaic: combines several images to create a single training sample with a mosaic-like appearance.
mixup: combines pairs of images and their corresponding object labels to create new training examples.
copy_paste: randomly selects a portion of one image and pastes it onto another image while maintaining the corresponding object labels.
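The 13 options can be viewed as keys of a hyperparameter configuration. The sketch below shows how a configuration with augmentation switched off might be derived from one with augmentation on; the numeric values are illustrative assumptions in the style of a YOLOv5 hyperparameter file and are not the settings used in this study.

```python
AUGMENTATION_KEYS = [
    "hsv_h", "hsv_s", "hsv_v", "degrees", "translate", "scale", "shear",
    "perspective", "flipud", "fliplr", "mosaic", "mixup", "copy_paste",
]

# Illustrative values only; verify against the hyperparameter file in use.
with_augmentation = {
    "lr0": 0.01, "momentum": 0.937, "weight_decay": 0.0005,
    "hsv_h": 0.015, "hsv_s": 0.7, "hsv_v": 0.4, "degrees": 0.0,
    "translate": 0.1, "scale": 0.5, "shear": 0.0, "perspective": 0.0,
    "flipud": 0.0, "fliplr": 0.5, "mosaic": 1.0, "mixup": 0.0,
    "copy_paste": 0.0,
}


def without_augmentation(hyp):
    """Return a copy with every augmentation option set to 0 (disabled)."""
    return {k: (0.0 if k in AUGMENTATION_KEYS else v) for k, v in hyp.items()}


no_augmentation = without_augmentation(with_augmentation)
print({k: no_augmentation[k] for k in AUGMENTATION_KEYS})
```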

Figure 5. The highest correct detection ratio of each YOLOv5 model.

Figure 6. The evaluation of the YOLOv5l model.

Figure 7. The confusion matrices of the best obtained model.

Figure 8. A sample of construction detail detection in real-world simulation.

Table 2. Parameters which were investigated in primary research.

Table 5. Hyperparameters used in the main research.

Table 7. Correct detection ratio of each model type on white (W), neutral (N), and mixed (M) backgrounds using 150 images.