Automated Counting of Steel Construction Materials: Model, Methodology, and Online Deployment

Abstract: Construction material management is crucial for promoting intelligent construction methods. At present, the manual inventory of materials is inefficient and expensive. Therefore, an intelligent counting method for steel materials was developed in this study using an object detection algorithm. First, a large-scale image dataset consisting of rebars, circular steel pipes, square steel tubes, and I-beams on construction sites was collected and constructed to promote the development of intelligent counting methods. A vision-based, accurate counting model for steel materials was then established by improving the YOLOv4 detector in terms of its network structure, loss function, and training strategy. The proposed model achieves a maximum average precision of 91.41% and a mean absolute error of 4.07 when counting square steel tubes. Finally, a mobile application and a WeChat mini-program were developed using the proposed model, allowing users to count materials accurately in real time by taking and uploading photos. Since its release, the application has attracted more than 28,000 registered users.


Introduction
The construction industry is a critical pillar of China's national economy. However, significant improvements meant to address low efficiency, a lack of environmental protection, and high energy consumption [1] are required to realize high-quality development. Compared with developed countries and regions, there is an urgent need to solve these challenges in China by enhancing the construction industry's level of intelligence through technological innovation. Intelligent construction is an innovation mode that integrates new information technology and engineering construction [2], fundamentally improving productivity and construction processes [3]. Within civil engineering, it is of great significance to cultivate innovative engineering talents in intelligent construction. Therefore, "AI in Civil Engineering" was proposed as a new area and has gained considerable social recognition ever since. Indeed, the integration of AI with civil engineering will be one of the primary goals for the coordinated development of the construction industry over the next fifteen years [4].
The intelligentization of construction processes is an essential application of AI in construction and could help save time and costs. According to [5], a warehouse with RFID smart tags demonstrated time savings of 81-99% in joint ordering and 100% in processing in warehouse management. The survey by [6] revealed that IoT could optimize space and significantly reduce total storage costs by up to 16.84%, with an average of 9.95%. In material management, counting primary building materials such as rebars for structures and steel pipes for scaffolding and shoring systems represents a critical link between process management and cost control. However, the current management of steel materials on construction sites still relies on manual counting methods such as inspection when delivering and allocation when constructing, as shown in Figure 1, which is inefficient, costly, and unautomated. Therefore, it is necessary to develop intelligent counting methods to solve these challenges and promote intelligent material management.
Buildings 2024, 14, x FOR PEER REVIEW
Figure 1. Manual counting of steel materials on site: (a) inspection; (b) allocation.

To date, image processing has been widely adopted in material counting on construction sites. For example, Zhang et al. [7] proposed an online counting and automatic separation system based on template matching and mutative threshold segmentation. However, this method required auxiliary light sources from appropriate angles when capturing images, limiting its application to controlled lighting environments, such as factories. Ying et al. [8] combined edge detectors and image processing algorithms to separate rebars from the background and adopted an improved Hough transform to localize them. Zhao et al. [9] used improved edge detection, image processing, and edge clustering algorithms to detect the number of rebars, but their approach required a stable detection environment. Su et al. [10] adopted an improved gradient Hough circle transform combined with the radius captured by a maximum inscribed circle algorithm to localize rebars in captured images. Wu et al. [11] proposed an online rebar counting method utilizing concave dot matching for segmentation, K-level fault tolerance for counting, and visual feedback for multiple splitting. Liu et al. [12] combined the Canny operator with a morphological edge enhancement algorithm to extract the region of interest and remove noise for automatic counting of circular steel pipes.
The core concept of the above image-processing-based counting methods has been to segment each bar in the image. These approaches have strict requirements in terms of the lighting conditions, material section, and background. However, images captured on construction sites often include various interference factors, such as uneven indentations, oxidation, corrosion, occlusions on bar ends, and nonuniform lighting, that make image-processing-based methods impractical. Additionally, little research has been conducted on counting square steel tubes because the human process of stacking square tubes sometimes results in random rotation, which makes image processing for them more challenging than for rebars and pipes. Figure 2 shows the random rotation of square tubes in sparse and dense arrangement scenarios.
Recently, deep learning has attracted significant attention and has been applied in various areas of civil engineering. By combining simple nonlinear modules, deep learning achieves highly complex functions with stronger feature extraction and generalization capabilities than traditional machine learning methods, enabling the identification of complex contents in massive datasets [13]. Among the many deep learning networks that have been proposed, convolutional neural networks offer significant advantages for image processing. Object detection algorithms based on deep learning can rapidly and accurately determine the positions and categories of objects in images [14]. Currently, object detection frameworks based on deep learning can be divided into one-stage detectors and two-stage detectors [15]. One-stage detectors are more time-efficient without a significant decrease in accuracy and are more suitable for real-time detection than two-stage detectors [16]. Among one-stage detectors, the YOLOv4 [17] algorithm has been widely applied to solve problems in civil engineering owing to its excellent performance [18][19][20]. This study proposes a new counting method based on an improved YOLOv4 model to count square tubes on construction sites in real time, which can be extended to address the counting of rebars, circular pipes, and I-beams. The proposed method was subsequently applied by developing a mobile application and a WeChat mini-program for practical use on construction sites.
The remainder of this paper is organized as follows. Section 2 introduces an image dataset of steel materials and evaluation metrics. Section 3 explains the square tube counting model and the proposed improvements to the original YOLOv4 model. Section 4 interprets several training strategies and their implementation. Section 5 presents testing results and extensions of the counting method. Section 6 illustrates the deployment of the counting models to mobile devices. Conclusions are presented in Section 7.

Steel Cross-Section Image Dataset
Object detection based on deep learning is a typical data-driven method that requires numerous real samples for training and evaluation. Conventional horizontal object detection is discouraged when objects are exhibited in arbitrary directions [21]. Instead, oriented bounding boxes can effectively detect objects with arbitrary orientations and cluttered arrangements [22] and compactly enclose each object [23]. Oriented object detection has been applied in remote sensing [24], autonomous driving [25], and power grid maintenance [26]. Therefore, this study adopts oriented object detection to count square tubes. Cross-section images of square tubes at actual construction sites were captured using a typical smartphone camera. The dataset consists of 602 images and 71,887 square tube instances. The annotation tool roLabelImg v3.0 was used to assign rotated rectangular ground-truth bounding boxes to the cross-sections of the square tubes, and the center coordinates, dimensions, and angles of all cross-sections were saved in the corresponding XML annotation files, as shown in Figure 3.
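The rotated-box annotations described above can be read back programmatically. The sketch below is a minimal parser, not the authors' tooling; it assumes roLabelImg's usual `robndbox` element with `cx`, `cy`, `w`, `h`, and `angle` fields inside each `object`:

```python
import xml.etree.ElementTree as ET

def parse_rolabelimg(xml_text):
    """Extract (cx, cy, w, h, angle) tuples for every rotated box
    in a roLabelImg-style annotation file."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        rb = obj.find("robndbox")
        if rb is None:
            continue
        boxes.append(tuple(float(rb.find(tag).text)
                           for tag in ("cx", "cy", "w", "h", "angle")))
    return boxes

# A hypothetical one-object annotation for illustration:
sample = """<annotation>
  <object><name>square_tube</name>
    <robndbox><cx>120.5</cx><cy>88.0</cy><w>30.0</w><h>29.0</h>
    <angle>0.52</angle></robndbox>
  </object>
</annotation>"""
print(parse_rolabelimg(sample))  # → [(120.5, 88.0, 30.0, 29.0, 0.52)]
```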
To further promote the development and evaluation of intelligent counting methods, datasets of other steel materials such as rebars, circular pipes, and I-beams were also established in this study. Figure 4 shows several representative images of these steel materials with annotations. Images of rebars and circular pipes were manually annotated with rectangular ground-truth bounding boxes, and images of I-beams were annotated with polygon ground-truth bounding boxes. As listed in Table 1, the final dataset used in this study comprised 991 images of rebars, 1019 images of circular pipes, 602 images of square tubes, and 501 images of I-beams. The total instances of rebars, circular pipes, square tubes, and I-beams were 181,375, 154,044, 71,887, and 18,578, respectively.

Anchor boxes are conducive to accelerating model training and improving detection accuracy. k-means clustering was applied to obtain anchor boxes for each type of steel material using the ground-truth boxes in the corresponding dataset. The results are shown in Table 1 and Figure 5. Data points with the same color belong to the same cluster, and the red stars represent anchor boxes.
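The anchor-clustering step can be illustrated with a small sketch. This is not the authors' implementation; it assumes the common YOLO-style variant that clusters (width, height) pairs using 1 − IoU as the distance (with both boxes centered at the origin) and takes per-cluster medians as the new centroids. Initialization here is deterministic (evenly spaced picks from the sorted boxes) for reproducibility:

```python
def iou_wh(a, b):
    """IoU of two boxes sharing the same centre, given as (w, h)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=100):
    """YOLO-style k-means: distance = 1 - IoU, centroid = median w/h."""
    boxes = sorted(boxes)
    centroids = [boxes[i * len(boxes) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[best].append(b)
        new = []
        for i, c in enumerate(clusters):
            if not c:                       # empty cluster: keep old centroid
                new.append(centroids[i])
                continue
            ws = sorted(w for w, _ in c)
            hs = sorted(h for _, h in c)
            new.append((ws[len(ws) // 2], hs[len(hs) // 2]))
        if new == centroids:                # converged
            break
        centroids = new
    return sorted(centroids)

# Two hypothetical size groups cluster into two anchors:
sizes = [(23, 24), (24, 25), (25, 24), (40, 41), (41, 42), (42, 41)]
print(kmeans_anchors(sizes, 2))  # → [(24, 24), (41, 41)]
```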

Table 1 (excerpt). Circular pipe: 154,044 instances; anchor boxes (25, 24) and (42, 41).

Evaluation Metrics
Average precision (AP) [27] is a commonly used metric for evaluating object detection model performance. However, a single metric is unsuitable for the intelligent object counting model developed in this study because AP is a comprehensive metric that includes both localization and counting information, making it difficult to separate the impacts of these two factors. The mean absolute error (MAE) [28] and root-mean-square error (RMSE) [29] are useful as evaluation metrics for counting. However, the MAE and RMSE metrics only include counting information and may overlook false positives, in which the model correctly counts a certain number of the intended identification targets but mistakenly counts other objects as targets. Therefore, this study comprehensively adopted the AP, MAE, and RMSE metrics. The AP metric is defined as

AP = (1/n) Σ_{r ∈ {r_1, …, r_n}} p_interp(r), with p_interp(r) = max_{r′ ≥ r} p(r′),

where r represents the recall value, n represents the number of interpolated points, and p(r) represents the precision at the recall of r. This study used AP50, which refers to an intersection over union (IoU) threshold of 0.5, to evaluate the detection performance of the model. The MAE is used to evaluate the counting accuracy of the model, whereas the RMSE is used to evaluate the counting robustness of the model. The RMSE metric assigns greater weights to larger errors and is more susceptible to outliers; hence, it is used to measure counting robustness. The MAE and RMSE metrics are calculated as

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|,
RMSE = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²),

where n represents the number of images in the test set, y_i is the actual number of instances in the image, and ŷ_i is the number of detections.
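The counting metrics are straightforward to compute from per-image counts. A minimal sketch with hypothetical ground-truth and detected counts:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error of per-image counts."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error; penalises large miscounts more heavily."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

counts_gt  = [120, 95, 240, 60]   # hypothetical actual instances per image
counts_det = [118, 95, 250, 59]   # hypothetical model detections per image
print(mae(counts_gt, counts_det), rmse(counts_gt, counts_det))  # MAE = 3.25, RMSE ≈ 5.12
```

Note how a single large miscount (240 vs. 250) dominates the RMSE, which is exactly why the paper uses it as a robustness indicator.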


Square Tube Counting Model
As mentioned in Section 2.1, the random rotation of square tubes may cause shape orientation changes, making them difficult to detect. Figure 6 shows the differences between square tube detection and circular pipe detection: the overlapping areas and cluttered regions of bounding boxes of square tubes are larger than those of circular pipes. Larger overlapping areas result in the abandonment of certain objects after non-maximum suppression. Furthermore, cluttered regions introduce a great deal of noise, causing interference or even the disappearance of image information features. Therefore, this study proposed an improved YOLOv4 method for square tube counting. The establishment of the counting model adopting oriented object detection is discussed in detail below.

Improvements in Network Architecture
For object detection tasks in computer vision, once an object is associated with a specific feature map, the corresponding positions in other feature maps are considered the background. This leads to conflicts and interference during feature extraction, reducing the effectiveness of the model. To address this issue, an attention mechanism is typically applied to focus the network on essential features, thereby improving model accuracy. Adaptive spatial feature fusion (ASFF) [30] can fuse feature maps of different resolutions into a fixed-resolution feature map, reducing the inconsistency between differently scaled features caused by the correlation between large objects and low-resolution feature maps, as well as between small objects and high-resolution feature maps. Therefore, ASFF was adopted in this study to directly select and combine effective information from different resolutions, thereby enhancing model performance. The performance of different attention mechanisms was compared through experiments. The model performed best using the squeeze-and-excitation (SE) module [31] in combination with ASFF. The final overall network architecture is shown in Figure 7, the SE module is depicted in Figure 8, and the ASFF is described in Figure 9.
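The squeeze-and-excitation idea can be sketched independently of the full network. The NumPy toy below is an illustration, not the model's implementation: it assumes a single (C, H, W) feature map and random fully connected weights `w1`/`w2` standing in for learned parameters:

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-excitation on a (C, H, W) feature map.
    w1: (C//r, C) reduction weights; w2: (C, C//r) expansion weights."""
    z = x.mean(axis=(1, 2))                 # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)             # excitation: FC + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))     # FC + sigmoid -> channel weights in (0, 1)
    return x * s[:, None, None]             # recalibrate each channel

rng = np.random.default_rng(0)
x  = rng.standard_normal((8, 4, 4))        # toy feature map, C=8
w1 = rng.standard_normal((2, 8))           # reduction ratio r=4
w2 = rng.standard_normal((8, 2))
y = se_block(x, w1, w2)
print(y.shape)  # (8, 4, 4)
```

Because the sigmoid outputs lie in (0, 1), the block can only attenuate channels, which is how it suppresses feature maps that behave as background for a given object.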

Improvements to the Loss Function
The original YOLOv4 model divides the output feature map into different grid cells. Each grid cell predicts three bounding boxes containing coordinates, confidence scores, and class information. The complete intersection over union (CIoU) loss is used to calculate the localization loss [32] by considering the overlapping area of the target boxes, the center distance, and the aspect ratio in the bounding box regression, and is defined as

L_CIoU = 1 − IoU + ρ²(b_p, b_t)/c² + αv,
v = (4/π²)(arctan(w_t/h_t) − arctan(w_p/h_p))²,
α = v/((1 − IoU) + v),

where b_p and b_t are the centers of the predicted and ground-truth boxes, ρ(·) is the Euclidean distance, c is the diagonal length of the smallest box enclosing both boxes, and v measures the consistency of the aspect ratios. As the CIoU loss function does not include angle information, to perform regression on rotated bounding boxes, the angle of rotation must be defined and applied using a new localization loss function. In a two-dimensional Cartesian coordinate system, rotated rectangular bounding boxes are typically defined by the OpenCV definition method or the long-edge definition method [33]. The OpenCV definition method defines the edge that forms an acute angle with the x-axis as the box width, with an angle range of [−90°, 0°], as shown in Figure 11a. The long-edge definition method defines the longer edge as the box width, with an angle range of [−90°, 90°], as shown in Figure 11b. As the descriptions of the parameters in the long-edge definition method are clearer than those in the OpenCV definition method, the long-edge definition method was adopted in this study.
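For reference, the horizontal CIoU loss can be computed directly from box parameters. A minimal sketch for axis-aligned boxes given as (cx, cy, w, h):

```python
import math

def ciou_loss(box_p, box_t):
    """CIoU loss for axis-aligned boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = box_p
    tx, ty, tw, th = box_t
    # intersection and union
    ix = max(0.0, min(px + pw/2, tx + tw/2) - max(px - pw/2, tx - tw/2))
    iy = max(0.0, min(py + ph/2, ty + th/2) - max(py - ph/2, ty - th/2))
    inter = ix * iy
    iou = inter / (pw*ph + tw*th - inter)
    # squared centre distance over squared enclosing-box diagonal
    cw = max(px + pw/2, tx + tw/2) - min(px - pw/2, tx - tw/2)
    ch = max(py + ph/2, ty + th/2) - min(py - ph/2, ty - th/2)
    rho2 = (px - tx)**2 + (py - ty)**2
    # aspect-ratio consistency term
    v = (4 / math.pi**2) * (math.atan(tw/th) - math.atan(pw/ph))**2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / (cw**2 + ch**2) + alpha * v

print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # → 0.0 (perfect overlap)
```

As the section notes, this loss has no angle term, which is exactly why the rotated-box losses below are needed.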
The localization loss function is used to measure the difference between the predicted bounding box and the ground-truth box. In this implementation, the parameters of the rotated bounding box were converted into Gaussian distribution features comprising the mean and variance. The Kullback-Leibler (KL) divergence [34] and the Gaussian Wasserstein distance (GWD) [35] were used to quantify the difference between two two-dimensional Gaussian distributions as the localization loss in the proposed counting model. As shown in Figure 12, the defined parameters of the rotated bounding box can be converted into Gaussian distribution features as follows:

µ = (x, y)ᵀ,  Σ^{1/2} = R S Rᵀ,  R = [[cos θ, −sin θ], [sin θ, cos θ]],  S = diag(w/2, h/2),    (7)

where R represents the rotation matrix; S represents the diagonal matrix of the eigenvalues; µ and Σ represent the mean vector and covariance matrix, respectively, of the two-dimensional Gaussian distribution; and x, y, w, h, and θ represent the horizontal and vertical coordinates, width, height, and angle of the rotated box, respectively. This transformation effectively solves the issues of loss discontinuity caused by the periodicity of angles and boundary discontinuity caused by the interchange of the long and short sides [34].
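The box-to-Gaussian conversion above can be sketched directly. The function below is an illustration assuming angles in radians and the R S² Rᵀ covariance construction (Σ = (R S Rᵀ)(R S Rᵀ) = R S S Rᵀ):

```python
import numpy as np

def rbox_to_gaussian(x, y, w, h, theta):
    """Convert a rotated box (centre, size, angle in radians) to the mean
    vector and covariance matrix of a 2-D Gaussian."""
    mu = np.array([x, y])
    r = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    s = np.diag([w / 2, h / 2])      # diagonal eigenvalue matrix S
    sigma = r @ s @ s @ r.T          # Sigma = R S S R^T
    return mu, sigma

mu, sigma = rbox_to_gaussian(10.0, 5.0, 4.0, 2.0, 0.0)
print(mu, sigma)  # unrotated box: mu = [10, 5], Sigma = diag(4, 1)
```

Rotating the box by 90° simply swaps the eigenvalues of Σ, which is the continuity property the text highlights: the Gaussian is unchanged when the long and short sides (and the angle) are exchanged consistently.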
After transforming to the two-dimensional Gaussian distribution, the losses corresponding to the KL divergence and GWD were calculated as

D_kl(N_p ∥ N_t) = (1/2)[Tr(Σ_t⁻¹ Σ_p) + (µ_p − µ_t)ᵀ Σ_t⁻¹ (µ_p − µ_t) − 2 + ln(|Σ_t|/|Σ_p|)],
D_gw² = ∥µ_p − µ_t∥₂² + Tr(Σ_p + Σ_t − 2(Σ_p^{1/2} Σ_t Σ_p^{1/2})^{1/2}),
loss_kl = 1 − 1/(τ + ln(D_kl + 1)),  loss_gw = 1 − 1/(τ + ln(D_gw + 1)),    (10)

where D_kl and D_gw represent the KL divergence and GWD, respectively; N(µ, Σ) represents the two-dimensional normal distribution; µ and Σ represent the mean vector and covariance matrix, respectively, of the corresponding distribution; subscripts p and t represent the predicted bounding box and ground-truth box, respectively; Tr represents the trace of a matrix; ∥ • ∥₂² denotes the squared Euclidean norm of a vector; loss_kl and loss_gw represent the loss functions based on the KL divergence and GWD, respectively; and τ is a tunable parameter, which was set to two in this study.
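Of the two distances, the KL divergence has a simple closed form for 2-D Gaussians. The sketch below assumes the bounded 1 − 1/(τ + ln(D + 1)) mapping of the distance, a common choice in this line of work, with τ = 2 as in the paper:

```python
import numpy as np

def kl_divergence(mu_p, sig_p, mu_t, sig_t):
    """KL divergence D_kl(N_p || N_t) between two 2-D Gaussians."""
    inv_t = np.linalg.inv(sig_t)
    d = mu_t - mu_p
    return 0.5 * (np.trace(inv_t @ sig_p) + d @ inv_t @ d - 2
                  + np.log(np.linalg.det(sig_t) / np.linalg.det(sig_p)))

def kl_loss(mu_p, sig_p, mu_t, sig_t, tau=2.0):
    """Bounded localization loss derived from the KL divergence."""
    return 1.0 - 1.0 / (tau + np.log1p(kl_divergence(mu_p, sig_p, mu_t, sig_t)))

mu = np.zeros(2)
sig = np.diag([4.0, 1.0])        # e.g. a 4x2 box at angle 0
print(kl_loss(mu, sig, mu, sig))  # identical boxes → 1 - 1/tau = 0.5
```

Note that with τ = 2 the loss floor is 0.5 rather than 0; τ only rescales the gradient, so the minimizer is still the perfectly matched box.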
The original YOLOv4 model has three horizontal anchor boxes, making it difficult to fit the rotated ground-truth boxes of square tubes and leading to a decrease in detection accuracy. This issue was addressed in this study by augmenting each anchor box with six angles {−60°, −30°, 0°, 30°, 60°, 90°}, resulting in 18 anchor boxes at each grid on the feature map. Although this approach increases the thickness of the network detection head, it also effectively improves the detection accuracy of the model.
In addition, positive and negative samples from all the anchor boxes were differentiated during model training. A positive sample at each grid must have an IoU with the corresponding ground-truth box that is greater than a certain threshold. It must also have the highest IoU value among all the anchor boxes at that grid. While calculating the IoU for horizontal object detection is simple and fast, calculating the IoU between rotated bounding boxes during the training phase is more time-consuming. Therefore, an approximate IoU, called the ArIoU, was used during the training phase as follows:

ArIoU(A, T) = IoU(A*, T) · |cos(θ_A − θ_T)|,

where T represents the ground-truth box; A represents the anchor box; θ_T and θ_A, respectively, represent the angles of the ground-truth box and anchor box; and A* represents anchor box A with the angle adjusted to θ_T. The definition of positive samples was modified as follows:

label(A, T) = 1 if ArIoU(A, T) ≥ α and |θ_A − θ_T| < γ;  label(A, T) = 0 if ArIoU(A, T) < β,

where 1 represents a positive sample; 0 represents a negative sample; and α, β, and γ are adjustable parameters that were, respectively, set to 0.6, 0.4, and 15° in this paper.
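The ArIoU is cheap to compute because A* shares the ground-truth angle: after rotating both centers into the ground-truth frame, the two boxes are axis-aligned and an ordinary IoU applies. A sketch follows; the exact positive/negative threshold semantics in `assign_label` are an assumption for illustration, not the paper's verbatim rule:

```python
import math

def aligned_iou(ca, wa, ha, ct, wt, ht):
    """Axis-aligned IoU of two boxes given centres and sizes."""
    ix = max(0.0, min(ca[0] + wa/2, ct[0] + wt/2) - max(ca[0] - wa/2, ct[0] - wt/2))
    iy = max(0.0, min(ca[1] + ha/2, ct[1] + ht/2) - max(ca[1] - ha/2, ct[1] - ht/2))
    inter = ix * iy
    return inter / (wa*ha + wt*ht - inter)

def ariou(anchor, gt):
    """ArIoU: align the anchor angle to the ground truth (A*), compute the
    now axis-alignable IoU in the gt frame, and weight by |cos(dtheta)|."""
    ax, ay, aw, ah, ath = anchor
    tx, ty, tw, th, tth = gt
    c, s = math.cos(-tth), math.sin(-tth)      # rotate centres into the gt frame
    rot = lambda x, y: (c*x - s*y, s*x + c*y)
    return aligned_iou(rot(ax, ay), aw, ah, rot(tx, ty), tw, th) * abs(math.cos(ath - tth))

def assign_label(ariou_val, dtheta, alpha=0.6, beta=0.4, gamma=math.radians(15)):
    """1 = positive, 0 = negative, None = ignored during training."""
    if ariou_val >= alpha and abs(dtheta) < gamma:
        return 1
    if ariou_val < beta:
        return 0
    return None
```

A perfectly matching anchor yields ArIoU = 1 and a positive label; a distant anchor yields ArIoU = 0 and a negative label.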
The original YOLOv4 model used binary cross-entropy loss as the confidence loss during training. To reduce the impact of the foreground-background class imbalance on model training and enhance the sensitivity of the model to difficult samples, the original confidence function was replaced with the focal loss (FL) [36], which is defined as

L_conf = −y(1 − y′)^δ log(y′) − (1 − y)(y′)^δ log(1 − y′),

where L_conf is the confidence loss with FL; y and y′ represent the ground-truth and predicted values of confidence, respectively; and the adjustable parameter δ is used to balance the importance of easy and difficult samples. During the training process, the FL automatically reduces the contribution of easy background examples to the training weights and rapidly focuses on learning difficult negative samples. The focus parameter δ was set to two for all experiments in this study.
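The down-weighting behavior of the focal loss is easy to demonstrate numerically; a minimal sketch of the standard binary form with δ = 2:

```python
import math

def focal_loss(y, y_pred, delta=2.0, eps=1e-9):
    """Binary focal loss: down-weights easy examples by (1 - p_t)^delta."""
    p_t = y_pred if y == 1 else 1.0 - y_pred
    return -((1.0 - p_t) ** delta) * math.log(p_t + eps)

# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1):
print(focal_loss(1, 0.9), focal_loss(1, 0.1))
```

With δ = 0 this reduces to plain cross-entropy; raising δ pushes the optimizer's attention toward the hard, misclassified samples described in the text.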

Model Training Strategy Selection and Implementation
Different deep learning models adopt different training strategies, and an effective training strategy can significantly improve the detection performance of a model. Commonly used training strategies include data augmentation, learning rate schedules, transfer learning, and multi-scale training.

Data Augmentation
The purpose of data augmentation is to enhance the original images and enrich the training dataset, making the resulting model more robust to different images. Geometric and photometric distortions are two commonly used types of augmentation: flipping, translation, and rotation are geometric distortions, while brightness, contrast, and saturation adjustments are photometric distortions. Figure 13 shows the augmentation effect of geometric distortions. Because the number of collected images of square tubes in this study was limited, and such objects can be placed at various angles in actual scenarios, rotation was applied for data augmentation to help the model learn to detect square tubes at many different angles using only limited data, thereby improving its generalization ability. In this study, random geometric transformations were applied to half the images in the dataset before model training, followed by random photometric transformations.
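A minimal sketch of such a pipeline on an image array is shown below. The flip probability, 90° rotation steps, and brightness range are illustrative stand-ins for the paper's transformations; in the real pipeline, arbitrary-angle rotation must also rotate the annotated bounding boxes accordingly:

```python
import random
import numpy as np

def augment(image, seed=None):
    """Toy augmentation: random flip and 90-degree rotation (geometric),
    then a random brightness scale (photometric). Ranges are illustrative."""
    rng = random.Random(seed)
    img = np.asarray(image, dtype=float)
    if rng.random() < 0.5:                           # geometric: horizontal flip
        img = np.fliplr(img)
    img = np.rot90(img, rng.choice([0, 1, 2, 3]))    # geometric: rotation
    img = img * rng.uniform(0.8, 1.2)                # photometric: brightness
    return np.clip(img, 0, 255)                      # keep valid pixel range
```

Applying geometric transforms to only half the images, as the paper does, would amount to wrapping the geometric steps in one more coin flip per image.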

Learning Rate Schedule
The learning rate is a hyperparameter that controls the magnitude of the gradient update during training. An optimal learning rate not only ensures the convergence of the model but also improves its training efficiency. Therefore, an appropriate learning rate schedule should be devised for the training process. This study adopted cosine annealing [37] with warmup [38] for the learning rate schedule as follows:

η_t = η_min + (η_max − η_min) · T_cur / T_warm,  for T_cur < T_warm
η_t = η_min + (1/2)(η_max − η_min)[1 + cos(π(T_cur − T_warm)/(T_total − T_warm))],  for T_cur ≥ T_warm

where η_max and η_min represent the maximum and minimum values of the learning rate, respectively; η_t is the learning rate of the current epoch; T_cur represents the current epoch of the training phase; T_warm represents the total epochs of the warmup phase; and T_total represents the total epochs of the training phase. In this study, the maximum and minimum learning rates were set to 0.0002 and 0, respectively. The total number of training epochs was set to 120; when using cosine annealing with warmup, the warmup and cosine annealing phases were set to 12 and 108 epochs, respectively. The Adam optimizer was applied to update the parameters, the batch size was set to six, and the Mish function was adopted in the activation layers. Figure 14 depicts the evolution of the learning rate over time using different learning rate schedules.
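A per-epoch sketch of cosine annealing with warmup under the stated settings is given below; the linear shape of the warmup ramp is an assumption, since only the cosine phase is standard:

```python
import math

def lr_schedule(t, eta_max=2e-4, eta_min=0.0, t_warm=12, t_total=120):
    """Learning rate at epoch t: linear warmup for t_warm epochs, then
    cosine annealing from eta_max back down to eta_min by t_total."""
    if t < t_warm:
        return eta_min + (eta_max - eta_min) * t / t_warm
    progress = (t - t_warm) / (t_total - t_warm)  # 0 -> 1 over cosine phase
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))
```

The schedule starts at 0, peaks at 0.0002 at epoch 12, passes through half the peak at the midpoint of the cosine phase, and decays to 0 at epoch 120.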

Transfer Learning
The ideal scenario for traditional machine learning involves abundant labeled training data with the same distribution as the test dataset. However, in many scenarios, collecting sufficient training data is expensive, time-consuming, or impractical. Transfer learning, which extracts useful features from a task in one domain and applies them to a new task, represents a promising strategy for solving such problems. Directly training deep learning models without transfer learning often results in suboptimal performance. Therefore, this study attempted to use the pre-trained weights from the MS-COCO dataset [40] for weight initialization when training the square tube counting model.
However, the test results obtained by the counting model were not as accurate as expected, since the square tube dataset differs significantly from the COCO dataset. Furthermore, owing to the limited number of images in the square tube dataset and the significant differences among the types and shapes of square tubes, few improvements in the accuracy of the counting model were observed after retraining. Therefore, considering the features shared by circular pipes and square tubes, as well as the greater abundance of images in the circular pipe dataset, the weights trained on the circular pipe dataset were used as the initial weights for the square tube counting model.
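The weight-transfer step can be sketched as follows; the state dicts, layer names, and shapes below are hypothetical stand-ins for the circular pipe and square tube models, not the paper's actual parameters:

```python
import numpy as np

def transfer_weights(target, source):
    """Copy every source parameter whose name and shape match into the
    target state dict (both are plain name -> array dicts standing in for
    model state dicts). Mismatched layers, e.g. a re-sized rotated
    detection head, keep their fresh initialization.
    """
    transferred = []
    for name, value in source.items():
        if name in target and target[name].shape == value.shape:
            target[name] = value
            transferred.append(name)
    return transferred

# The circular-pipe weights initialize the shared backbone; the rotated
# detection head (different shape) is trained from scratch.
square_tube = {"backbone.conv1": np.zeros((3, 3)), "head.conv": np.zeros((18, 5))}
circular_pipe = {"backbone.conv1": np.ones((3, 3)), "head.conv": np.ones((3, 5))}
moved = transfer_weights(square_tube, circular_pipe)
```

Filtering by both name and shape is what lets the 18-anchor rotated head coexist with backbone weights inherited from a 3-anchor horizontal detector.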

Multi-Scale Training
Before being input into the model, the images are generally adjusted to a fixed scale. If this scale is too large, an out-of-memory error may occur, whereas, if it is too small, the training accuracy requirements may not be satisfied. Therefore, the images must be adjusted to an appropriate scale, i.e., resolution.
Empirical evidence has shown that models trained on a fixed scale exhibit poor generalization [41]. As the multi-scale training strategy has proven effective in practice [42], this study employed multi-scale training to enhance the generalization ability of the model. Every ten batches, a random scale was selected from {416, 448, 480, 512, 544, 576, 608}, and the images were adjusted to that scale for training.
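This batch-wise scale switching can be sketched as a generator; the helper name and seed are illustrative, not from the paper's code:

```python
import random

SCALES = (416, 448, 480, 512, 544, 576, 608)

def scale_per_batch(num_batches, period=10, seed=0):
    """Yield one input scale per batch, redrawing a random scale from
    SCALES every `period` batches, as in multi-scale training."""
    rng = random.Random(seed)
    scale = rng.choice(SCALES)
    for i in range(num_batches):
        if i > 0 and i % period == 0:   # redraw at every period boundary
            scale = rng.choice(SCALES)
        yield scale
```

Holding each scale for ten batches amortizes the cost of resizing while still exposing the network to the full range of resolutions over an epoch.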

Analysis of the Results
Table 3 compares the performance of the square tube counting model under different configurations. In this experiment, 602 images of square tubes were randomly split into training and test sets in proportions of 80% and 20%, respectively. The same data augmentation and learning rate schedules were used for all models. During the model testing phase, the image input scale was set to 608, and the confidence loss was calculated using FL. The results in Table 3 indicate significant differences in accuracy when using different attention mechanisms and loss functions: the largest difference in AP was 8.21, and the largest difference in MAE was 1.05. The AP values were generally higher when GWD was used as the loss function than when KL divergence was used. Comparing the equations for the two loss functions presented in Section 3.3, the lower AP values obtained with the KL divergence can be attributed to its asymmetry: the KL divergence between two distributions differs from the reverse KL divergence between the same distributions. Thus, even when the predicted bounding box and ground-truth box remain the same, the loss value changes when their roles are swapped.
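This asymmetry is easy to verify numerically. The sketch below models each box as a one-dimensional Gaussian for simplicity (the paper's losses use two-dimensional Gaussians) and uses the closed-form KL divergence between Gaussians to show that swapping the two distributions changes the value:

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """KL divergence KL(N(mu1, s1^2) || N(mu2, s2^2)) between 1-D Gaussians."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Two boxes with different centers and spreads: swapping the roles of
# prediction and ground truth changes the loss value.
forward = kl_gauss(0.0, 1.0, 0.5, 2.0)
reverse = kl_gauss(0.5, 2.0, 0.0, 1.0)
```

A symmetric distance such as the GWD would return the same value in both directions, which is consistent with the higher AP values observed for GWD in Table 3.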
When the model adopted the SE attention mechanism (Figure 8) and ASFF (Figure 9) with GWD as the loss function, the highest AP value of 91.41% was achieved, and the MAE value was 4.07. These results indicate that the improvements to the original YOLOv4 model proposed in this study were highly effective. The detection results with and without the improved loss function are compared in Figure 15, which shows that false detections and significant errors in the sizes and angles of the detection boxes occurred before the loss function was changed.
After the modification, the counting accuracy of the model was significantly improved, and the instances of false and missed detections decreased considerably. Examples of the counting performance of the final model in two challenging scenarios (dense and scattered arrangements) are shown in Figure 16.

Extension of Model to Rebar, Circular Pipe, and I-Beam Counting
The approach used to establish the square tube counting model can be directly extended to rebars, circular pipes, and I-beams. As real-time counting models for the individual material types have been developed separately [46,47], the modeling process is not described in detail in this paper. Examples of counting results are shown in Figure 17.


Discussion
To validate the proposed counting model, a comparison between this study and similar research was conducted. Hernández-Ruiz et al. [48] designed an SA-CNN-DC model, adopting binary classification and distance clustering to automatically count squared steel bars from images. Ghazali et al. [49] proposed a steel detection and counting algorithm adaptable to rectangular steel bars; this method utilized the Hough transform, followed by a postprocessing stage and a series of morphological operations. The comparison results are presented in Table 4. As shown in Table 4, the accuracy of the improved YOLOv4 model was lower than those of the SA-CNN-DC model and the Hough transform model. However, the number of test images differed significantly among the studies. Additionally, the test images in the other studies were collected at a warehouse, which provides stable lighting and a simple background. By contrast, the improved YOLOv4 model counts square tubes directly on construction sites and is therefore more robust. These factors also explain the higher MAE value of the improved YOLOv4 model compared with the SA-CNN-DC model. In terms of the RMSE metric, the proposed methodology demonstrated a lower value, indicating that it performed favorably. Notably, the inference time of the improved YOLOv4 model was significantly shorter than those of the other two studies, making it suitable for real-time counting applications.
The proposed model still demonstrated suboptimal detection and counting performance in some challenging scenarios at construction sites. Moreover, counting different types of steel materials currently requires separate detection models, which is impractical and can be a significant drain on computing and hardware resources.
In the future, images from more construction sites and of other building materials should be collected to cover a more diverse range of images and scenarios. More contemporary detection algorithms and networks should be considered to improve the model's performance. Additionally, a unified counting model adaptable to different kinds of steel materials should be studied and developed, ultimately enhancing its applicability in practice.

Counting Model Deployment
An "Intelligent Steel Counting" smartphone application was developed to practically apply the proposed counting method and address problems encountered on actual construction sites. Users of this application need only download it from a mobile application store and register to use it. The application homepage and an example of the counting results provided by the application are shown in Figure 18.
When using the application, users upload end-face images of steel materials to the server by taking photos with their smartphones to complete quantity calculations. The entire calculation and result feedback process generally takes 1-2 s, effectively meeting the requirements for real-time counting. Since its launch, this application has attracted over 28,000 registered users and has completed counting tasks for approximately 180,000 images.
In addition, a WeChat mini-program called "Intelligent Steel Counting" was developed and launched, which users can employ without downloading anything. The functions and usage of the WeChat mini-program are similar to those of the mobile application described above. The homepage of the WeChat mini-program is shown in Figure 19, and Figure 20 shows examples of its counting results.

Conclusions
This study proposed an intelligent counting method for different steel materials on construction sites and developed a practical mobile application and a WeChat mini-program to minimize manual counting. The following conclusions were drawn from the results.

1.
To count square tubes at different angles, this study adopted oriented object detection, which can compactly enclose each object, instead of horizontal object detection.

2.
This study incorporated the SE attention mechanism, the ASFF module, and a loss function specifically designed for angled objects into the YOLOv4 model to improve the performance of the square tube counting model. Furthermore, the accuracy of the model was significantly improved by combining strategies, including data augmentation and learning rate schedules. In ordinary scenarios, the square tube counting model achieved an AP of greater than 90% and an MAE of 4.07.

3.
The research findings were implemented in a practical mobile application and a WeChat mini-program that have gained a significant user base as they can reduce the need for manpower and resources in actual construction projects.
Notably, this study was subject to several limitations. Owing to the limited training data, the counting capabilities of the square tube model require improvement in complex visual scenarios. Although the intelligent real-time counting of major steel materials in construction was achieved, new models are required to count other related materials, such as scaffolding couplers, templates, and blocks. A unified model for counting different components should be investigated in the future to further promote practicality.

Data Availability Statement: Some or all data, models, or codes supporting the findings of this study are available from the corresponding authors upon reasonable request. The datasets are available online at https://github.com/H518123 after the paper is published.

Conflicts of Interest:
Author Yang Li was employed by the company China United Engineering Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Figure 3. A raw image with annotations and an annotation file: (a) a raw image with annotations; (b) an XML annotation file.
Figure 6. Horizontal bounding boxes for detecting square tubes and circular pipes: (a) rotated square tubes; (b) circular pipes.

Figure 7. Network of the square tube counting model.

Figure 8. Network of the SE module.

L_CIoU = 1 − IoU + d_c^2/d_e^2 + αv, with v = (4/π^2)[arctan(w_g/h_g) − arctan(w_p/h_p)]^2 and α = v/[(1 − IoU) + v],

where L_CIoU represents the CIoU loss; b_p and b_g represent the center points of the predicted bounding box (the gray shaded area) and the ground-truth box (the blue shaded area), respectively; d_c represents the distance between b_p and b_g; d_e represents the diagonal length of the smallest box (the red dotted rectangle) enclosing the two boxes; v represents the consistency of the aspect ratio; and (w_p, h_p) and (w_g, h_g) represent the widths and heights of the predicted bounding box and the ground-truth box, respectively, as shown in Figure 10.

Figure 10. Schematic of the CIoU loss.

Figure 11. Different definition methods for rotated rectangular bounding boxes.

Figure 12. Modeling a rotated bounding box using a two-dimensional Gaussian distribution.

Figure 13. Schematic views of data augmentation.

Table 2 lists the AP values of the counting model using different learning rate schedules. The results indicate that cosine annealing with warmup provided the highest AP value.

Figure 15. Comparison of the results under different loss functions.

Figure 16. Square tube counting results using the proposed counting model.

Figure 18. Functions of the Intelligent Steel Counting smartphone application.

Figure 19. Homepage of the Intelligent Steel Counting WeChat mini-program.

Figure 20. Results provided by the Intelligent Steel Counting WeChat mini-program.

Table 1. Details of steel material datasets.


Table 2. AP values of models using different learning rate schedules.

Table 3. Testing results of different models.

Table 4. Comparison of this study with similar research.