Steel Bar Counting from Images with Machine Learning

Abstract: Counting has become a fundamental task for data processing in areas such as microbiology, medicine, agriculture and astrophysics. The SA-CNN-DC (Scale Adaptive-Convolutional Neural Network-Distance Clustering) methodology proposed in this paper is designed for automated counting of steel bars from images. Its design combines two machine learning techniques: neural networks and clustering. The system has been trained to count round and squared steel bars, obtaining an average detection accuracy of 98.81% and 98.57%, respectively. In the steel industry, counting steel bars is a time-consuming task that relies heavily on human labour and is prone to errors. In a steel warehouse, the advantages of the proposed methodology include reduced counting time and resources, improved safety and productivity of employees, and higher confidence in the inventory.


Introduction
Counting is a time-consuming task and a key factor in keeping track of the inventory of any material. When the objects have different shapes and sizes, the task becomes more challenging. In the steel industry, the steel bar is one of the most widely used products in the world for building construction and forging. During the manufacturing process, the bars are usually counted by using images, which allows distance, lighting and angle to be controlled. However, once they leave the factory, these heavy and large materials must be stacked and stored in warehouses or retail outlets, where a hostile environment makes it difficult to keep a smooth and reliable inventory.
Traditional steel bar counting is done by hand; however, due to the shifting conditions and low manoeuvrability, manual counting is slow and labour-intensive, with a low accuracy rate. Therefore, an automatic system capable of counting these materials regardless of the physical conditions is required in order to improve the effectiveness and reliability of the steel bar counting process.
Two main approaches to counting tasks can be distinguished. Image processing techniques implement algorithms based on mathematical functions to transform an image. Filters, threshold segmentation, edge detection and matching are commonly used techniques [1][2][3][4]. Although these techniques are highly accurate, they are bound to specific conditions such as constant lighting and background, or special camera requirements. Moreover, they are limited to round steel bars with fixed shape and size, assuming their shape is quasi-circular, and thus lack robustness [5]. Note that some of these methods are bound to the production line in steel factories, where physical separation of the materials is viable [6,7].
Similarly, other steel bar counting algorithms based on image processing rely mainly on area and template algorithms. Both methods are feasible, but each has disadvantages. The area method cannot directly locate the steel bars in the counting result, which makes error analysis of the algorithm very inconvenient [8,9]. The template matching method depends heavily on the shape of the template and of the target object, so its adaptive ability is limited [10,11].
In the case of machine learning (ML) techniques, convolutional neural networks (CNN) have proven to be highly accurate and fast enough for image processing. Well-known CNN architectures such as Feature Pyramid Network, Visual Geometry Group (VGG) and Inception-ResNet are used as classifiers and regressors in order to obtain the total number of elements [12][13][14][15][16]. As for industrial applications, neural networks have proven to achieve high accuracy and provide a fast-track solution for problems in hostile environments [17,18].
A methodology that resembles how humans count is presented in [19]. A CNN implemented as a Binary Classifier is used to detect each bar in the image and mark it as a candidate centre. Once every candidate centre is detected, a clustering technique is applied in order to extract the actual geometric centre of the bars. The final count is the number of centres detected. Although this methodology, referred to as CNN-DC, achieves 99.26% accuracy in 3.58 s, it is restricted to a constant background among the images and a fixed patch size, which implies that the distance from the camera is constant.
On the other hand, a deep learning fusion model for detecting objects is proposed in [20]. Localisation and segmentation of steel bars are done by using a combined model. The model achieves an F1 score (the harmonic mean of precision and recall) of 98.17% in object detection in 0.03 s; however, the proposed Inception-RFB-FPN architecture is quite complex, with many layers, and requires high computational resources for its deployment, which makes it unaffordable for an embedded and portable system.
Most of the implementations focus on counting a single kind of object with specific characteristics and a constant background. Although some machine learning based works have a high performance, a portable system might not be viable due to the amount of computational resources required. By considering these issues, a CNN-based counting methodology, namely Scale Adaptive Convolutional Neural Network Distance Clustering (SA-CNN-DC), is presented in this paper in order to count steel bars regardless of their size and shape by adopting a compact design with a minimum number of parameters. The proposed counting methodology is described in Section 2. Section 3 summarises the image processing techniques applied to generate the dataset for each network in the system. The implemented neural network architectures and the distance clustering algorithm are described in Section 4, where the training performance of each network is also presented. The most relevant results and the validation of the overall SA-CNN-DC performance are presented in Section 5, together with a comparison with similar implementations. The desktop app implementation is described in Section 6. Finally, conclusions are drawn in Section 7.

Proposed SA-CNN-DC Counting Approach
A variety of steel bar shapes can be found in the steel industry, and each kind of bar is usually manufactured with different dimensions. The diversity of sizes and shapes is a challenge that the proposed methodology addresses by adding an extra input called density. This parameter is fundamental to obtaining an effective bar classification as well as an accurate localisation.
The proposed methodology attempts to improve the CNN-DC framework presented in [19], which requires specific conditions in the images considered, as well as bars of fixed size and shape. In addition, the proposed method is designed to be portable, replicable and robust to the natural variability of the conditions in the warehouse, such as noise, light and scale.
The proposed SA-CNN-DC methodology automatically classifies the steel bar type, locates each bar centre and provides an output with the total number of bars. Both the material image and the density are the inputs required by the system. It is worth mentioning that, based on an analysis of the physical storage conditions of the material within the warehouse, three types of density were defined (low, medium and high) in order to ensure robustness to the variability of the steel bar dimensions. Figure 1 shows an example of each of the possible steel bar densities. The SA-CNN-DC main core consists of three neural networks and a clustering technique. Each network solves a specific task: bar classification, image resizing and centre localisation. The modular design is not only portable but also replicable, thus providing the capability to add new materials by using the same methodology. Bearing this in mind, round and squared bars were considered for counting purposes; however, angled and rectangular bars were also used for training.
A graphical representation of the proposed methodology is shown in Figure 2. Its main stages are described as follows:


1. Preprocessing: The input image is resized to fit within a size limit of between 1700 and 3500 pixels. According to its density type, five crops of different sizes are extracted from the centre of the image and resized to the CNN input size. For high density the extracted crops are smaller, and vice versa for low density.

2. Classification: The five crops are fed forward through the classifier to obtain the softmax outputs. The class is chosen by voting over the five predictions, so the class with the highest number of predictions is selected. Among the predictions that agree with the selected class, the one with the highest probability is kept. A higher probability means a more reliable prediction and implies that the network is able to detect features with high certainty. The flatten vector of the response with the highest probability, which is generated after the convolutional layers and contains all the information of the image in condensed form, is stored as the optimal flatten for the next step.

3. Linear Regressor: This network outputs a resizing factor, f_R, which is used to resize the image. This factor determines how much an image must be resized so that a single bar completely fits in a patch of a fixed size, as seen in Figure 3. In this way, the scale of the resulting image is adapted for the next stage. The convolutional layers of the classifier are reused, and a simple multilayer perceptron with a linear output is trained on top of them to act as a regressor for the resizing factor. This process, called transfer learning, helps to drastically reduce the training data and time of the regressor network.

4. Binary Classification: A fixed-size sliding window moves through the scale-adapted image with a small stride. A binary-output network classifies whether the resulting patch contains a bar or not. If a positive detection is made, the centre coordinate of the patch is stored as a candidate centre.

5. Distance Clustering: Candidate centres are filtered according to their horizontal and vertical proximity to other candidate centres. If a candidate centre is completely isolated, it is deleted. The distance clustering algorithm measures the Euclidean distance between candidate centres and groups them within a threshold distance. For each cluster, a centre coordinate is stored; ideally, this is the geometric centre of the material [19].

6. Output: The final count is the total number of centre coordinates. For visualisation purposes, the centres are marked with a coloured dot in the image.
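The voting rule of the Classification stage can be illustrated with a short sketch. It is only a minimal interpretation of the description above: the function name `vote_on_crops`, the class order and the tie-breaking rule (lowest class index wins) are assumptions, and the softmax values are made up for the example.

```python
import numpy as np

def vote_on_crops(softmax_outputs):
    """Pick a class by majority vote over per-crop softmax outputs.

    softmax_outputs: array-like of shape (n_crops, n_classes).
    Returns (winning_class, index_of_most_confident_agreeing_crop).
    """
    probs = np.asarray(softmax_outputs, dtype=float)
    predicted = probs.argmax(axis=1)                      # class predicted by each crop
    votes = np.bincount(predicted, minlength=probs.shape[1])
    winner = int(votes.argmax())                          # ties go to the lowest index (assumption)
    # Among the crops that voted for the winner, keep the most confident one.
    agreeing = np.flatnonzero(predicted == winner)
    best_crop = int(agreeing[probs[agreeing, winner].argmax()])
    return winner, best_crop

# Five hypothetical softmax outputs; assumed class order:
# 0 = angled, 1 = rectangular, 2 = round, 3 = squared.
example = [
    [0.05, 0.05, 0.80, 0.10],
    [0.10, 0.10, 0.30, 0.50],
    [0.02, 0.03, 0.90, 0.05],
    [0.10, 0.20, 0.60, 0.10],
    [0.25, 0.25, 0.20, 0.30],
]
winner, best_crop = vote_on_crops(example)
```

In this example three of the five crops vote for class 2, and the third crop (index 2) has the highest probability for that class, so its flatten vector would be the one passed on to the regressor.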

Dataset Acquisition
Since each neural network has a specific task within the proposed system, three different datasets were built, one for each network: Classifier-Dataset, Regressor-Dataset and Binary-Dataset. It is worth mentioning that a single input size was fixed for the networks, thus obtaining a modular design. This input size determines how much GPU memory will be required for the kernel weights in the network layers, as well as the batch size and the training time. When an image with large dimensions and high resolution is resized to a considerably smaller size, it ends up distorted and the steel bars lose their shape. As a consequence, the CNN is unable to learn the characteristics of the materials; instead, it learns to extract information from the noise in the image. By considering a trade-off between the required GPU memory and image distortion, the input size was set to 64 × 64 pixels. Moreover, since the RGB channels do not contain relevant information about the shape and require more memory, the images were converted to grayscale.

Classifier-Dataset
The classifier was built to distinguish between steel bar types. For this network it is important that the samples contain the relevant shape characteristics of the bars. Note that angled, rectangular, round and squared steel bars with different sizes and colours, as shown in Figure 4, were considered as possible classes. It is important to emphasise that only the round and squared bars were used for counting, while the angled and rectangular categories were introduced as rejection classes; they will be considered for counting in future research. First, steel bar photographs with different dimensions were manually collected in an ordinary warehouse. To ensure an appropriate representation of the possible conditions in the place, it was necessary to collect several images with variations in light, frame, angle and position. In this way, around 400 photographs were taken of each material pile, thus obtaining a small set of 12,793 images.
From these collected images, and by considering the aforementioned density parameter, variable-size crops (with a random increase or decrease) were extracted from different coordinates within the images. Once extracted, they were resized to 64 × 64 pixels and converted to grayscale. This technique helps to increase the amount of data and prevents distortion due to resizing. Figure 5 shows an example of the most appropriate resized crop (Figure 5d) obtained from the original image (Figure 5a). The crop size was selected according to both the density and the image dimensions. Table 1 shows the dataset created for the CNN. The categories are quite balanced and the amount of data is enough to avoid overfitting when using a CNN of suitable size.
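A minimal sketch of this crop extraction is given below, using only NumPy. The base crop sizes per density, the ±20% size jitter, the nearest-neighbour resize and the luma weights are all assumptions made for illustration; the text does not specify these values, and the random array only stands in for a warehouse photograph.

```python
import numpy as np

# Hypothetical base crop sizes in pixels; high density uses smaller crops.
BASE_CROP = {"low": 400, "medium": 250, "high": 150}

def to_grayscale(rgb):
    """Convert an (H, W, 3) RGB array to (H, W) with ITU-R BT.601 luma weights."""
    return rgb[..., :3] @ np.array([0.299, 0.587, 0.114])

def resize_nearest(img, size=64):
    """Nearest-neighbour resize of a 2-D array to (size, size)."""
    rows = np.arange(size) * img.shape[0] // size
    cols = np.arange(size) * img.shape[1] // size
    return img[np.ix_(rows, cols)]

def random_crop_sample(rgb, density, rng, jitter=0.2):
    """Extract one randomly placed, randomly sized square crop as 64x64 grayscale."""
    side = int(BASE_CROP[density] * (1 + rng.uniform(-jitter, jitter)))  # random increase or decrease
    side = min(side, rgb.shape[0], rgb.shape[1])
    top = rng.integers(0, rgb.shape[0] - side + 1)
    left = rng.integers(0, rgb.shape[1] - side + 1)
    crop = rgb[top:top + side, left:left + side]
    return resize_nearest(to_grayscale(crop), 64)

rng = np.random.default_rng(0)
photo = rng.integers(0, 256, size=(600, 800, 3)).astype(float)  # stand-in for a photograph
sample = random_crop_sample(photo, "medium", rng)
```

Cropping before resizing keeps the downscaling ratio small, which is the distortion argument made in the text: a 250-pixel crop shrinks by roughly 4×, whereas resizing the whole photograph to 64 × 64 would shrink it by more than 10×.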

Regressor-Dataset
As mentioned before, only round and squared bars were considered for counting. For the Linear Regressor, new photographs with variable dimensions were taken horizontally in front of material piles in which the steel bars were uniformly painted. The resulting 263 images were manually labelled with the bar size in pixels and their corresponding density. The bar size is a single number determined by the height and the diameter or width of the bar's cross-section. In this way, the resizing factor could be computed with Equation (1). Moreover, rotations and shifts were used to considerably increase the amount of data. The resulting sample distribution is summarised in Table 2 and some samples are presented in Figure 6.

Binary-Dataset
The Binary Classifier classifies each bar in the image. Input patches generated by a sliding window were classified into two classes: zeros and ones. The first one refers to images containing background, incomplete elements or joints between them, as shown in Figure 7a, and the group labelled as ones contains images with centred and complete elements, as shown in Figure 7b.
One image of each size of the round and squared steel bars was considered for building the dataset (12 images in total). Bar centres were manually marked by using an image editor with a brush size of 10% of the bar size. Note that this size determines how many pixels will be considered as centre. Next, a sliding window of the steel bar size passes through the image with a stride of 5% in order to extract the required patches. If the centre of the patch matches the centre of an element, which is recognisable by its marker colour, the patch is saved in grayscale as a one. Otherwise, if the patch centre is not a steel bar centre, the patch is saved as a zero. The sample distribution for the Binary-Dataset is presented in Table 3. It is worth mentioning that the unbalanced data shown for the round bars is not an issue for this network because the data can be chosen randomly in order to balance both classes. More importantly, new materials are easily added because the two required datasets are automatically processed with the implemented algorithms. Labelling the images is also a simple task that only requires changing the filename and using a simple raster graphics editor, such as Microsoft Paint. This ensures the replicability of the methodology and provides a fast-track addition of different steel bar types.
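The patch extraction and labelling procedure can be sketched as follows. This is an illustrative reading, not the authors' implementation: the painted centres are assumed to be available as a boolean mask, the 5% stride is assumed to be relative to the bar size, and the function names are hypothetical.

```python
import numpy as np

def extract_labelled_patches(gray, centre_mask, bar_size, stride_frac=0.05):
    """Slide a bar_size x bar_size window and label each patch by its centre pixel.

    gray:        (H, W) grayscale image.
    centre_mask: (H, W) boolean array, True where a bar centre was painted.
    Returns (patches, labels); label 1 means the patch is centred on a bar.
    """
    stride = max(1, int(round(bar_size * stride_frac)))
    half = bar_size // 2
    patches, labels = [], []
    for top in range(0, gray.shape[0] - bar_size + 1, stride):
        for left in range(0, gray.shape[1] - bar_size + 1, stride):
            patches.append(gray[top:top + bar_size, left:left + bar_size])
            labels.append(int(centre_mask[top + half, left + half]))
    return np.stack(patches), np.array(labels)

def balance_classes(patches, labels, rng):
    """Randomly subsample the larger class so both classes have the same size."""
    ones = np.flatnonzero(labels == 1)
    zeros = np.flatnonzero(labels == 0)
    n = min(len(ones), len(zeros))
    keep = np.concatenate([rng.choice(ones, n, replace=False),
                           rng.choice(zeros, n, replace=False)])
    return patches[keep], labels[keep]

# Toy image: one 40-pixel bar whose centre is marked with a 4-pixel square
# (10% of the bar size), as in the labelling procedure described above.
gray = np.zeros((120, 120))
mask = np.zeros((120, 120), dtype=bool)
mask[58:62, 58:62] = True
patches, labels = extract_labelled_patches(gray, mask, bar_size=40)
balanced_patches, balanced_labels = balance_classes(
    patches, labels, np.random.default_rng(0))
```

The toy example yields 1681 patches of which only 4 are ones, which shows why the zeros class dominates and why random subsampling is a natural way to balance the two classes.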

Neural Networks Performance
The proposed networks were designed with the smallest possible number of neurons and layers in order to create a modular and portable architecture. The more convolutional layers are added, the more abstract the extracted information becomes. However, if the number of convolutional layers exceeds the number required, no features with new information are created because there is nothing further to learn. Therefore, it is not recommended to add a large number of convolutional layers. In 1987, Lippmann demonstrated that a multilayer perceptron with two hidden layers is enough to form arbitrary decision regions [21]. These simple guidelines were taken into consideration for the architecture design.
The classifier provides the convolutional layers for the other two networks, the Linear Regressor and the Binary Classifier, which were trained for each steel bar type with two new datasets. These two networks are smaller and require less data and training time. This characteristic is an advantage if new steel bar types need to be added: instead of training the whole model, only two new multilayer perceptrons need to be trained. Reusing the feature-extraction part of a model trained for one goal in another model with a different goal is commonly known as transfer learning [22,23]. Figure 8 shows how this process is carried out: the classifier convolutional layers remain the same and only the small multilayer perceptrons corresponding to the Regressor and the Binary Classifier are trained.

For each network, the data was divided into 70% for training, 20% for validation and 10% for testing in order to cross-validate results. Several simulations were carried out to determine the best architecture and hyperparameters for each network by using an NVIDIA GeForce GTX 1050 GPU with Keras [24,25].
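A 70/20/10 split of this kind can be sketched in a few lines. The shuffling, the seed and the helper name `split_indices` are assumptions; the text only gives the proportions.

```python
import numpy as np

def split_indices(n_samples, rng, train=0.7, val=0.2):
    """Shuffle sample indices and split them into train/validation/test parts."""
    order = rng.permutation(n_samples)
    n_train = int(round(n_samples * train))
    n_val = int(round(n_samples * val))
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])          # the remainder is the test set

train_idx, val_idx, test_idx = split_indices(1000, np.random.default_rng(42))
```

The three index arrays are disjoint by construction, so no sample seen during training leaks into the validation or test sets.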

Classifier
The classifier was designed and trained to classify the four different steel bar types. This network also works as a feature extractor or encoder, which means it reduces the information of a larger input into a compact (flatten) vector.
It was trained for 12 epochs with a batch size of 108 samples, and its architecture consists of three convolutional layers and two fully connected layers, as shown in Figure 9. The validation and test accuracies are 99.28% and 99.35%, respectively, and the training time was just 84 s. The confusion matrix is used as the performance metric for the classification network. In Table 4, the class predictions made by the network on the test data are compared with the actual classes. Note that round and squared bars are the classes most likely to be confused.
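For readers unfamiliar with the metric, the sketch below builds a confusion matrix from a list of actual and predicted labels. The labels are invented for the example, and the layout (rows are actual classes, columns are predicted classes) and class order are assumptions that may differ from Table 4.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions; rows are actual classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

# Hypothetical labels: 0 = angled, 1 = rectangular, 2 = round, 3 = squared.
y_true = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [0, 0, 1, 1, 2, 2, 3, 3, 3, 2]
cm = confusion_matrix(y_true, y_pred, 4)
accuracy = np.trace(cm) / cm.sum()    # correct predictions lie on the diagonal
```

Off-diagonal entries such as `cm[2, 3]` and `cm[3, 2]` are where a confusion between round and squared bars would show up.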

Linear Regressor
The selected architecture for the Linear Regressor network consists of one hidden layer with four neurons and an output layer with a single neuron, as shown in Figure 10. The input corresponds to the flatten vector generated by the convolutional encoder of the classifier and the linear output is a numerical value which represents the resizing factor. The loss is calculated by using the Mean Squared Error (MSE), while the Mean Absolute Error (MAE) is the metric considered for this network. The networks for round and squared bars were trained with a batch size of 256 samples for 80 and 100 epochs, respectively. Once trained, predictions for the validation and test sets were analysed by using Scikit-Learn's linear regression algorithm [26], thus obtaining both the linear regression coefficient (W) and the coefficient of determination (R^2). The best possible score for the R^2 coefficient is 1, which means that the predictions are equal to the real values. It is calculated by:

$$R^2 = 1 - \frac{\sum_i \left(y_{true,i} - y_{pred,i}\right)^2}{\sum_i \left(y_{true,i} - \bar{y}_{true}\right)^2} \qquad (2)$$

where y_true corresponds to the real values, y_pred are the values predicted by the network and \bar{y}_true is the mean of the real values. Figure 11 shows the real and predicted values of the test set for both material types, while Table 5 summarises their resulting metrics and coefficients.
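The three quantities used for the regressor (MSE as loss, MAE as metric and the coefficient of determination) can be computed directly with NumPy, as in the sketch below. The resizing factors are invented for the example, and R^2 is computed with its standard definition, one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for real and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    mse = np.mean(residual ** 2)                   # training loss
    mae = np.mean(np.abs(residual))                # reported metric
    ss_res = np.sum(residual ** 2)                 # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2) # total sum of squares
    return mse, mae, 1.0 - ss_res / ss_tot

# Hypothetical resizing factors, for illustration only.
y_true = [0.5, 1.0, 1.5, 2.0]
y_pred = [0.6, 0.9, 1.6, 1.9]
mse, mae, r2 = regression_metrics(y_true, y_pred)
```

With every prediction off by 0.1, the MAE is 0.1, the MSE is 0.01 and R^2 is 0.968; a perfect regressor would give R^2 = 1.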

Binary Classifier
Finally, the Binary Classifier consists of a dense hidden layer with six neurons and an output layer with a single neuron and sigmoid activation function, as shown in Figure 12. This kind of output separates the two classes by a threshold, i.e., an output value smaller than 0.5 corresponds to one class while an equal or greater value is another. If required, the threshold can work as a variable hyperparameter, thus adding flexibility to the material detection. The Receiver Operating Characteristics curve (ROC) shown in Figure 13 represents the metric considered to validate the round-bar network performance. The area under the curve (AUC) defines the separability degree of the classes [27]. Note that the trained model shows a remarkable performance, thus achieving an AUC of 0.9977. The confusion matrix for the round-bar Binary Classifier network is shown in Table 6. Each class in the Table is defined as follows:
• TP = True positive "1" (correctly identified)
• TN = True negative "0" (correctly rejected)
• FP = False positive "1" (incorrectly identified)
• FN = False negative "0" (incorrectly rejected)
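The effect of the decision threshold on these four counts can be seen in a small sketch. The sigmoid scores and labels below are invented; only the rule that outputs equal to or greater than the threshold count as class "1" comes from the text.

```python
import numpy as np

def binary_confusion(y_true, scores, threshold=0.5):
    """Return (TP, TN, FP, FN) for sigmoid scores at a given threshold."""
    actual = np.asarray(y_true).astype(bool)
    predicted = np.asarray(scores) >= threshold   # >= threshold means class "1"
    tp = int(np.sum(predicted & actual))
    tn = int(np.sum(~predicted & ~actual))
    fp = int(np.sum(predicted & ~actual))
    fn = int(np.sum(~predicted & actual))
    return tp, tn, fp, fn

# Hypothetical patch labels and network outputs.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.55, 0.2, 0.1]
default_counts = binary_confusion(y_true, scores)        # threshold 0.5
strict_counts = binary_confusion(y_true, scores, 0.7)    # stricter threshold
```

Raising the threshold from 0.5 to 0.7 removes the false positive in this example but adds a false negative, which is the trade-off that the ROC curve summarises over all thresholds.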
Portability and processing time depend heavily on the number of network parameters. Deep learning networks tend to have millions of parameters; however, the proposed methodology has a total of just 149,356. Table 7 summarises the number of parameters of each network making up the system. It is worth mentioning that, since the design is modular, the number of parameters increases with each material added. However, if counting four different materials were required, the number of parameters would be only 195,480, because the classifier design already considers the four material classes.
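The two totals quoted above allow a quick consistency check of the per-material cost. The sketch below assumes that each material adds exactly one Linear Regressor (four hidden neurons) and one Binary Classifier (six hidden neurons), both fully connected with biases. The flatten length of 2304 is not stated in the text; it is inferred here because it is the value that makes the arithmetic match, so it should be read as a plausible assumption rather than a reported figure.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

def head_params(flatten_len, hidden_units):
    """One hidden layer followed by a single output neuron."""
    return dense_params(flatten_len, hidden_units) + dense_params(hidden_units, 1)

TOTAL_TWO_MATERIALS = 149_356     # reported: round and squared bars
TOTAL_FOUR_MATERIALS = 195_480    # reported: four materials

# Two extra materials account for the difference between the totals.
per_material = (TOTAL_FOUR_MATERIALS - TOTAL_TWO_MATERIALS) // 2
classifier_only = TOTAL_TWO_MATERIALS - 2 * per_material   # inferred, not reported

FLATTEN_LEN = 2304                # assumption inferred from the totals
regressor = head_params(FLATTEN_LEN, 4)
binary = head_params(FLATTEN_LEN, 6)
```

Under these assumptions each new material costs 23,062 parameters (9,225 for the regressor and 13,837 for the binary network), which is small compared with the shared classifier.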

Distance Clustering
The implemented distance clustering algorithm is used to group the candidate centre coordinates. The algorithm, described in [19], is summarised in the following steps:
1. The Euclidean distance between the coordinates of every pair of candidate centres is calculated.
2. The group for each candidate centre point is created by using a distance threshold as a reference.
3. Groups that contain close elements are merged.
4. The mean point of each group of candidate centres is calculated by averaging the maximum and minimum coordinate values of the group.
Once the clusters are found, the number of centre points obtained is the total count of materials. It is worth mentioning that this technique is the same regardless of the material type.
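The four steps can be sketched with NumPy as follows. This is a simplified interpretation rather than the algorithm of [19]: groups are merged whenever they share a pair of points closer than the threshold (implemented with a union-find structure), isolated candidates are dropped when they have no neighbour within the threshold, and the threshold value and coordinates are invented for the example.

```python
import numpy as np

def distance_clustering(points, threshold, min_neighbours=1):
    """Group candidate centres and return one centre per group."""
    pts = np.asarray(points, dtype=float).reshape(-1, 2)
    # Step 1: pairwise Euclidean distances between candidate centres.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    close = dist <= threshold
    # Filter: drop isolated candidates (no neighbour other than themselves).
    keep = close.sum(axis=1) - 1 >= min_neighbours
    pts, close = pts[keep], close[np.ix_(keep, keep)]
    # Steps 2-3: build groups from the threshold and merge linked groups.
    parent = list(range(len(pts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, j in zip(*np.nonzero(np.triu(close, k=1))):
        parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(len(pts))], dtype=int)
    # Step 4: centre = average of the minimum and maximum coordinates.
    centres = [(pts[roots == r].min(axis=0) + pts[roots == r].max(axis=0)) / 2
               for r in np.unique(roots)]
    return np.array(centres).reshape(-1, 2)

# Two hypothetical bars plus one isolated false detection.
candidates = [(9, 10), (11, 10), (10, 12), (49, 50), (51, 52), (200, 200)]
centres = distance_clustering(candidates, threshold=5)
final_count = len(centres)
```

Here the isolated point at (200, 200) is discarded and the remaining candidates collapse into two centres, (10, 11) and (50, 51), so the final count is 2.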

Proposed Methodology Validation
Final tests of the whole SA-CNN-DC were made in order to verify its proper performance. In the first test, round and squared bar counting efficiency was measured and analysed. Then, the proposed methodology was compared with the existing frameworks and, finally, a particular comparison with the top-performance methodology was made.

SA-CNN-DC Performance
A new set of 21 images of round bars with different dimensions was collected from two steel warehouses for testing purposes. In addition, the only test image available in [19] was also included. Furthermore, 10 images of squared bars were collected for testing. Note that these 32 images were not seen by the networks during the training phase, and the number of bars per image ranges from 26 to 1108.
Precision, recall, F1-score and relative accuracy were the metrics evaluated to verify the performance during the test. Precision is the ratio of correct instances among the retrieved instances, and recall is the true positive rate or sensitivity. As shown in Equations (3) and (4), these two metrics are based on the TP, FP and FN values. On the other hand, the F1-score, defined in Equation (5), is calculated from the precision and recall values, while the relative accuracy simply considers the final count (FC), which is the output of the methodology, and the ground truth (GT), as shown in Equation (6).
The results of the methodology for the round and squared test images are shown in Table 8. After testing, a general analysis of the results was made and common errors were detected. For the neural networks, a noisy background with cardboard or tags may lead to false positives. Similarly, when elements show variations in paint and lighting, it becomes more challenging to detect them. Stacking disorder (cluttering) is the main cause of false negatives because the elements appear partially hidden. Finally, if the image has a low resolution, the results become uncertain due to the blurry patches. For the clustering technique, scattered candidate centres for the same bar lead to a cluster division. In contrast, merged clusters with a rectangular shape are formed by the centre groups of two bars which are too close or joined by a point. Results also show that the bar detection is independent of the material shape and dimension. Even when the round bars have deformed tips, the proposed SA-CNN-DC is able to detect them. Moreover, the detection is correct even when the image shows variations in the bar orientation. It is worth mentioning that some of the partially hidden or cluttered elements are correctly counted, and that the clustering technique works successfully even though the material size is sometimes not constant throughout the image.
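The metrics can be sketched as small functions. Precision, recall and F1-score follow their standard definitions; the form of the relative accuracy (one minus the relative counting error between FC and GT) is an assumption, since the text only states that it depends on these two quantities. The counts are invented for the example.

```python
def precision(tp, fp):
    """Fraction of detected bars that are real bars."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of real bars that were detected (true positive rate)."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def relative_accuracy(final_count, ground_truth):
    """Assumed form: one minus the relative counting error."""
    return 1 - abs(final_count - ground_truth) / ground_truth

# Hypothetical image with 100 bars: 98 found, 1 false detection, 2 missed.
tp, fp, fn = 98, 1, 2
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
acc = relative_accuracy(final_count=tp + fp, ground_truth=tp + fn)
```

Note that the relative accuracy looks only at the totals, so a false positive and a false negative cancel each other out in the count, whereas precision and recall penalise both.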
Some test images for round and squared bars are presented in Tables 9 and 10, respectively. The first column shows the output obtained by applying the clustering technique, while the other two show the system output and the main remarks. The resulting mistakes are marked in bounding boxes and some of them are described in the last column. Note that the density type, final count, ground truth and inference time are also shown in each image. It is worth mentioning that the inference time value includes not only the network inference time, but every step of the methodology, from preprocessing to the output.

General Comparison
For round steel bars, the proposed methodology was compared with other frameworks, and their main performance features are summarised in Table 11. As mentioned before, computational resources were limited during the development and testing of the proposed methodology, which resulted in a higher processing time. However, the obtained results are competitive, with an accuracy of 98.72% and a precision of 0.9926. In addition, the SA-CNN-DC offers robustness to variations in image dimensions, bar size and conditions such as lighting and background noise.

Specific Comparison
Finally, the proposed methodology was compared with the framework described in [19], which shows the best metric results. The same images and computational resources (CPU and GPU) were used for the comparison. It is worth mentioning that the trained model was used and the test images were fitted to the patch size used in [19]. From the results presented in Table 12, the accuracy of the proposed methodology is 7% higher than that of Fan's work, and the average processing time is almost four times shorter. It is worth mentioning that the processing time of the proposed methodology covers the execution of three networks, while the CNN-DC [19] uses just a single network but with 14 times more parameters. Every bar is detected correctly.

Desktop App
A user-friendly desktop app was developed in order to implement the proposed methodology. Minor improvements were included in the app to create a practical user interface. The possibility of defining the element size as well as the area of interest within the raw image was added. The patch size used for the binary network can be defined just by selecting the element size with a bounding box in the image (Figure 14a). When the area of interest is reduced (Figure 14b), the image is cropped so that the possible noise around the edges of the original image is reduced.
Once the processing is completed and the final count is obtained, the app shows the possible counting errors highlighted in red (Figure 14c), so the user can easily verify whether those errors are related to cluttering or clustering problems. Figure 15 shows the flow diagram of the desktop app implementation. First, the app asks for the username and password to log in. The user can then access the file directory and select the image. The image is displayed in the density selection window, where the user must choose the image density. At this stage, the user has the option to select a bounding box for an element, for the stack or pile of bars, or for both. Once the processing is completed, an output window provides the image with marked centres (where the possible errors also appear), the material type, the final count and the processing time, as shown in Figure 16.

Conclusions
Bar counting is a time-consuming and tedious task for the workers in the steel warehouse, and it becomes more complicated due to the different sizes of each bar type. Convolutional neural networks are suitable for this task because they are able to act as feature extractors, exploiting their inherent capability to learn abstract concepts and detect objects of different shapes. Therefore, the proposed work is a machine learning based methodology capable of identifying the bar type and counting the number of elements in an image, and it has been used in a real steel warehouse with a high level of user satisfaction. The addition of challenging steel bar shapes, such as angled and rectangular bars, is planned as future work, together with the collection of more images, general improvements and a reduction of the inference time. More importantly, the implementation of the current networks in an embedded system is a high-priority goal.
The SA-CNN-DC methodology shows a modular and portable design implemented by three multilayer networks, processing the responses of a common convolutional encoder and a clustering technique. A compact design approach was adopted with a minimum number of parameters in order to reduce the computational resources but without compromising its performance and accuracy.
Test simulations were carried out in order to validate its performance, and images from different warehouses were considered to verify the generalisation capability. Simulation results showed a good performance in counting round and squared bars, with an accuracy of 98.81% and 98.57%, respectively. Compared with the implementations found in the literature, the proposed methodology is capable of achieving competitive results with minimum computational resources. Moreover, its modular design allows new bar types to be added in a simple way, thus meeting the expectations of real warehouses.
The use of the desktop app in the steel warehouse has drastically reduced the counting time for round and squared bars. Moreover, there is a higher confidence in the inventory due to the low error rate and the marking of possible errors. Painting materials for manual counting is no longer necessary, which results in a significant reduction of resources and environmental damage. Finally, productivity is improved since employees can focus on other activities and avoid the exhausting activity of counting a large number of elements in a hostile environment.