A New Loss Function for Simultaneous Object Localization and Classification

Abstract: Robots play a pivotal role in the manufacturing industry. This has led to the development of computer vision. Since AlexNet won ILSVRC, convolutional neural networks (CNNs) have achieved state-of-the-art status in this area. In this work, a novel method is proposed to simultaneously detect objects and predict their location using a custom training loop and a CNN, performing two of the most important tasks in computer vision with a single method. Two different loss functions are proposed to evaluate the method and compare the results. The obtained results show that the network is able to perform both tasks accurately, classifying images correctly and locating objects precisely. Regarding the loss functions, when the target classification values are used to compute the loss, the network performs better in the localization task. Following this work, improvements are expected to be made in the localization task by refining the training process of the network and the loss functions.


Introduction
Nowadays, robots are essential for the manufacturing industry. The use of robots has helped the industry to manufacture products more efficiently, saving both costs and time. Although an increasing need for robots has been observed in all industrial sectors in recent years, the electronics industry has been the main customer of industrial robots since 2020, when it overtook the automotive industry. However, the latter still demands 80,000 robots a year; hence, it is still an important sector for robot manufacturers. Industrial robot manufacturers are making every effort to design and develop safe and human-friendly robots. This is spurred on by the fact that small- and medium-sized companies are increasing their use of industrial robots due to the availability of affordable solutions and easy-to-use collaborative robots. Hence, collaborative solutions, where humans and robots work together, are becoming the new frontier in industrial robotics [1,2]. The use of collaborative robots is also supported by the current trend of automation and data exchange in manufacturing industries, also called Industry 4.0 [3].
In the case of the automotive industry, robots are used mainly in the manufacturing process. At the beginning of the 20th century, when chain production was introduced by the Ford Model T, cars were handmade. Nowadays, this process is mainly automatic. However, there are still tasks where humans need to intervene. In this context, collaborative robots can help workers improve the efficiency and reduce the manufacturing faults of production.
1. Selectively reuse the set of the most important features from preceding layers;
2. Actively update the set of preceding features to increase their utility for later layers, achieving promising performance in image classification (ImageNet) and object detection (MS COCO) in terms of both theoretical efficiency and practical speed.
In recent years, large advancements have been made in image classification tasks [18]. Therefore, CNNs have great value when there is a need to identify images. However, classification is normally of limited use on its own. It can be combined with a region proposal network (RPN) to perform traditional object detection.
The traditional object detection method consists of generating region proposals first using an RPN and then classifying each proposal into different object categories [19]. This is the case of R-CNN [20]. Nevertheless, this process is normally very computationally costly. In order to tackle this issue, different iterations of R-CNN have been proposed. Girshick et al. [21] improved their original R-CNN to be faster and more accurate. Ren et al. [22] improved this by introducing an RPN that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. The Faster R-CNN architecture has achieved good results in object detection tasks. For example, Fu et al. [23] and Song et al. [24] used the Faster R-CNN based on ZFNet and VGG16, respectively, to detect kiwifruits in order to enable robots to pick them up.
The other family of object detection methods treats detection as a regression or classification problem and adopts a unified framework to obtain the final results (categories and locations) directly. Redmon et al. [25] predicted bounding boxes and their associated class probabilities directly from full images in one evaluation. They called this new approach to object detection You Only Look Once (YOLO). The Single-Shot Multibox Detector (SSD) [26] discretizes the output space of bounding boxes into a set of default boxes, adjusting them by the scores generated for the presence of each object category in each default box in order to better match the object shape. CenterNet, proposed by Duan et al. [27], presents an efficient solution based on the detection of each object as a triplet of key points rather than a pair, improving both precision and recall.
However, each of these methods has its own issues: the traditional object detection techniques require high computational power, whereas the single-stage methods do not reach the same level of accuracy as the traditional techniques. In 2017, Li et al. [28] proposed a two-stage object detector based on ResNet-101 [12] to address the main shortcoming of this type of detector, that is, the slow speed caused by its heavy-head design. In 2018, Zhang et al. [29] proposed a novel single-shot-based detector that achieves a better accuracy than the two-stage methods and maintains an efficiency comparable to that of the one-stage methods. Examples of the advancements that have been made in object detection tasks in recent years can be found in reference [30].
In 2019, EfficientDet [31] was introduced as a new family of object detectors based on EfficientNet backbones, together with an optimized weighted bi-directional feature pyramid network (BiFPN) and a compound scaling method. In particular, the model EfficientDet-D7 achieved state-of-the-art results on MS COCO. Another example appeared in 2022, when Liu et al. [32] presented a network called ConvNeXts, constructed entirely from standard ConvNet modules. These modules are ResNet modules modernized towards the design of a vision transformer, and they compete favorably with transformers in terms of accuracy and scalability.
Additionally, the networks observed in the literature focus on single-task problems: image classification, object detection, image recognition, etc. To the best of our knowledge, there are no or very few examples of CNNs that have been used to simultaneously perform different tasks. Therefore, we see the need for exploring image classification and object localization tasks using the same CNN. The objective of this article is to determine whether both tasks can be performed accurately with a single CNN. Therefore, we propose a custom evaluation loop that merges the cross-entropy loss (Ex) for the classification task and the half mean square error (mse) for the regression task (object localization). We also compare two different loss functions using different Ex and mse loss proportions and determine which method is the best.

Convolutional Neural Network
A CNN is a type of deep neural network that uses convolutional layers to extract feature maps from the input image. Usually, the network consists of one input layer, one or more convolutional layers, one fully connected layer and one output layer [33]. In this case, the network has two fully connected layers after the convolutional layers, each forming a separate branch from the main trunk. This allows the network to perform two different tasks using the same convolutional layers. A softmax layer is connected to the end of one of the fully connected layers; this branch performs the classification task, while the other performs the localization task. The structure of the network is shown in Figure 1. The input layer has a dimension of 100 × 100 × 1. Therefore, the input data consist of a single 100 × 100 matrix containing the gray-scale value of each pixel, from 0 (black) to 255 (white).
The convolutional layer is the characteristic layer of the CNN. The convolution operation used is shown in Equation (1):
$$y_j = \sum_i w_{ij} * x_i + b_j \tag{1}$$
where $y_j$ is the result, $w_{ij}$ is the filter matrix, $x_i$ is the input of the convolutional layer, and $b_j$ is the bias term. In this case, the network features three convolutional layers. The first one has 16 filters with a 5 × 5 size. The second one has 32 filters with a 3 × 3 size. Finally, the third one also has 32 filters with a 3 × 3 size, although it has a stride of one, instead of two like the second layer. Furthermore, each convolutional layer is followed by a nonlinear activation function (ReLU), as shown in Equation (2):
$$f(x) = \max(0, x) \tag{2}$$
In order to speed up the training and reduce the sensitivity to network initialization, a batch normalization layer is included between each convolutional layer and the ReLU layer. This layer normalizes a mini-batch of data across all observations for each channel independently. The parameters of the model are listed in detail in Table 1.
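As an illustration of the two operations placed after each convolution, the following minimal Python sketch normalizes the values of a single channel across a mini-batch and then applies the ReLU. It is not the authors' MATLAB implementation: the function names, the stabilizing constant `eps` and the sample values are assumptions, and the learnable scale and offset of a full batch normalization layer are omitted.

```python
import math


def relu(x):
    """Rectified linear unit: returns x for positive inputs and 0 otherwise."""
    return max(0.0, x)


def batch_norm_channel(values, eps=1e-5):
    """Normalize the values of one channel, gathered across a mini-batch,
    to zero mean and (almost) unit variance.

    eps is an assumed small constant that avoids division by zero.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# Hypothetical activations of one channel for a mini-batch of four observations.
channel = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm_channel(channel)
activated = [relu(v) for v in normalized]
```

After normalization the values are centered on zero, so the ReLU suppresses the half of them that lies below the mini-batch mean.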
The proposed neural network contains 334.4 k parameters and has a model size of 1.22 MB after training. This network was used because it showed good results in similar tasks. Other convolutional neural networks, such as VGG-16 and ZFNet, were considered, but these networks have too many parameters for this application; training them would take too long for the purpose of evaluating the proposed custom training loop with a custom loss function.
The basic structure of the proposed convolutional neural network has been described; in the next subsection, the learning process of the network is discussed.

Learning Process
The learning process of a deep learning network consists of three steps: data acquisition, data preparation and model training. In the current work, the first step consists of capturing images of the surroundings of the pin. The images are taken using a camera that captures images of 612 × 512 pixels. The images are in gray-scale and are saved as TIFF files.
The second step consists of preparing the data to train the network. The first task is to label the images. The pin is identified manually in each image by drawing a rectangle around it. The center pixel of the marked rectangle is taken as the location of the pin, which is then used in the training process as the target value. The labeled images serve as seed images from which further images are generated: from each seed image, 5 images are obtained. In these images, the position of the pin is the same, but the contrast and the brightness are randomly modified using Equations (3)-(5), which define a contrast factor, a brightness factor and the resulting pixel value, where rand is a random value between 0 and 1, $I^{s}_{ij}$ is the seed pixel value, and $I_{ij}$ is the resulting pixel value. This is applied to all seed images to obtain 1620 images. These images, however, still have a size of 612 × 512. In order to train the network, the images need to be transformed so that their size is 100 × 100. Therefore, a random 100 × 100 region is cropped from each image. Whether the cropped region contains the pin is selected at random, with a 50/50 chance. At the end of the transformation, there are 810 images with a pin and 810 without a pin.
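The data preparation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the linear contrast/brightness form in `adjust_pixel` is assumed, since Equations (3)-(5) are not reproduced here, and the names `adjust_pixel` and `sample_crop`, the pin coordinates and the use of rejection sampling are hypothetical choices, not the authors' procedure.

```python
import random

IMAGE_W, IMAGE_H = 612, 512  # seed image size given in the text
CROP = 100                   # network input size given in the text


def adjust_pixel(seed_value, contrast, brightness):
    """Assumed linear contrast/brightness change, clamped to the range [0, 255]."""
    return min(255.0, max(0.0, contrast * seed_value + brightness))


def sample_crop(pin_x, pin_y, with_pin, rng):
    """Return the (left, top) corner of a CROP x CROP window.

    with_pin=True  -> the pin center lies inside the window
    with_pin=False -> the pin center lies outside the window
    Candidate windows are drawn until one satisfies the request.
    """
    while True:
        left = rng.randint(0, IMAGE_W - CROP)
        top = rng.randint(0, IMAGE_H - CROP)
        inside = left <= pin_x < left + CROP and top <= pin_y < top + CROP
        if inside == with_pin:
            return left, top


rng = random.Random(0)
pin = (300, 250)               # hypothetical labeled pin center
with_pin = rng.random() < 0.5  # 50/50 choice described in the text
left, top = sample_crop(pin[0], pin[1], with_pin, rng)

# Pin position relative to the crop; how the target is encoded for images
# without a pin is not stated in the text, so None is used as a placeholder.
target_position = (pin[0] - left, pin[1] - top) if with_pin else None
```

Drawing the 50/50 choice per image gives a balanced dataset only on average; the exact 810/810 split reported above suggests the choice was balanced explicitly, which this sketch does not attempt.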
The final step of the learning process is the model training itself. In this case, a custom training loop is used. MATLAB is the software chosen to develop the different algorithms that are involved in this work. This software has different tools to develop and train deep neural networks. One of these functionalities is the use of custom training loops, which update the learnable parameters of the network using different solvers. In this case, the Adam (adaptive moment estimation) solver is used [34].
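For readers unfamiliar with the solver, a single Adam update for one scalar parameter can be sketched as below. This is a generic Python sketch of the standard Adam rule, not the MATLAB routine used in this work; the hyperparameter values are common defaults taken as assumptions, and the toy objective $w^2$ is purely illustrative.

```python
import math


def adam_step(param, grad, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    m and v are running averages of the gradient and of its square,
    t is the iteration counter starting at 1.
    The default hyperparameters are assumed, not taken from the paper.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy example: minimize w**2, whose gradient is 2 * w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t)
```

In a custom training loop, the same update is applied to every learnable parameter once per mini-batch, using the gradients returned by the loss evaluation.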
In this process, each mini-batch of data is evaluated using the modeloGradients function. The modeloGradients function takes the following as inputs: the network and a mini-batch of input data, with the corresponding targets T1 and T2 containing the labels and positions, respectively. Then, it returns the gradients of the loss with respect to the learnable parameters, the updated network state and the corresponding loss.
The loss for each mini-batch θ is calculated by adding the cross-entropy loss of the classification task and the half mean squared error of the regression task, with the latter multiplied by a factor λ = 0.1, following Equation (6):
$$\mathrm{loss}_{\theta} = \mathrm{loss}_{Ex,\theta} + \lambda \, \mathrm{loss}_{mse,\theta} \tag{6}$$
The cross-entropy loss (Ex) for each mini-batch θ is calculated using Equation (7):
$$\mathrm{loss}_{Ex,\theta} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{K} t_{ni} \ln y_{ni} \tag{7}$$
where $N$ is the number of samples, $K$ is the number of classes, $t_{ni}$ is the indicator showing that the $n$th sample belongs to the $i$th class, and $y_{ni}$ is the output for sample $n$ for class $i$. That is, $y_{ni}$ is the probability that the network associates the $n$th input with class $i$. The half mean squared error (mse) operation computes the half mean squared error loss between the network predictions and target values for regression tasks. The loss for each mini-batch θ is calculated using Equation (8):
$$\mathrm{loss}_{mse,\theta} = \frac{1}{2N} \sum_{i=1}^{M} (X_i - T_i)^2 \tag{8}$$
where $X_i$ is the network prediction, $T_i$ is the target value, $M$ is the total number of responses in $X$ (across all observations), and $N$ is the total number of observations in $X$.
Afterwards, the calculated gradients are used to update the learnable parameters of the network. This process continues until the training ends, which is when the training reaches 200 epochs. Each mini-batch consists of 60 elements, so the 1620 training images give 27 iterations per epoch. Therefore, 5400 iterations are performed. The parameters of the Adam solver are listed in Table 2.
During the training, a validation evaluation is performed. This is carried out to ensure that the training is performing well and that the results are converging. To perform this task, a new dataset is created following the same steps as those used for the training data. In this case, 3 images are obtained from each seed image in order to speed up the validation process. Like the training dataset, this dataset is evaluated in groups of 60 data samples. At the end of each training epoch, all the validation data are evaluated, and the average loss value is returned by the algorithm.
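The first loss function, that is, the cross-entropy loss plus λ = 0.1 times the half mean squared error, can be sketched in plain Python as follows. The function names and the toy mini-batch are assumptions made for illustration; the authors' MATLAB function modeloGradients is not reproduced here.

```python
import math


def cross_entropy(Y, T):
    """Cross-entropy between predicted probabilities Y and one-hot targets T.

    Both are lists of N rows with K entries; terms with a zero target are
    skipped, which also avoids evaluating log(0).
    """
    n = len(Y)
    total = 0.0
    for y_row, t_row in zip(Y, T):
        for y, t in zip(y_row, t_row):
            if t > 0:
                total += t * math.log(y)
    return -total / n


def half_mse(X, T):
    """Half mean squared error: sum of squared errors divided by 2N."""
    n = len(X)
    total = 0.0
    for x_row, t_row in zip(X, T):
        for x, t in zip(x_row, t_row):
            total += (x - t) ** 2
    return total / (2 * n)


def combined_loss(Y, T1, X, T2, lam=0.1):
    """Classification loss plus lam times the localization loss."""
    return cross_entropy(Y, T1) + lam * half_mse(X, T2)


# Toy mini-batch of two images (hypothetical values).
Y = [[0.5, 0.5], [0.5, 0.5]]   # predicted class probabilities
T1 = [[1, 0], [0, 1]]          # one-hot class targets
X = [[1.0, 1.0], [0.0, 0.0]]   # predicted positions
T2 = [[0.0, 0.0], [0.0, 0.0]]  # target positions
loss = combined_loss(Y, T1, X, T2)

# Iteration count implied by the training setup in the text.
iterations = (1620 // 60) * 200
```

With these toy values the classification term equals ln 2 and the regression term equals 0.5, so the factor λ reduces the contribution of the localization error to 0.05.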
The first loss function is based on a constant ratio between the two different losses. With the second loss function, the localization task is only taken into account when the image contains an object, in order to evaluate whether this approach improves the effectiveness of the network. This new loss function is also based on the cross-entropy loss of the classification task and the half mean square error of the regression task. However, the combination of both is not a simple constant ratio, as with the first loss function. At first, we thought that the loss function only needed to take into account the cross-entropy loss when the classification was not performed correctly, because trying to locate a pin in an image that does not have one would not be correct. Therefore, the loss function that was proposed included the target values of the classification task, as well as the network prediction. However, the use of the predictions to calculate the loss led the network to classify all images in one group, because the predictions depend on the learnable parameters. Because of this, it was decided that the network predictions should not be used. Consequently, only the target values of the classification task are used. In the images where there is no pin, only the cross-entropy loss is used to calculate the overall loss. In the other case, the half mean squared error is also computed. This is carried out with the objective of only taking into account the localization task when there is a pin to locate. All this is performed for each image µ of the mini-batch θ using Equation (9):
$$\mathrm{loss}_{\theta} = \mathrm{loss}_{Ex,\theta} + \frac{1}{N} \sum_{\mu=1}^{N} t^{\mu}_{pc1} \, \mathrm{loss}_{mse,\mu} \tag{9}$$
where $t^{\mu}_{pc1}$ is the target probability that image µ contains a pin, $\mathrm{loss}_{Ex,\theta}$ is the cross-entropy loss of the mini-batch θ (Equation (7)), $\mathrm{loss}_{mse,\mu}$ is the half mean square error loss of the image µ (Equation (8)), and $N$ is the number of images in the mini-batch θ.
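The idea behind the second loss function, where the localization error of an image counts only if its classification target says that a pin is present, can be sketched as follows. The averaging over the N images of the mini-batch, the function names and the toy values are assumptions of this sketch; the exact weighting used by the authors may differ.

```python
import math


def cross_entropy(Y, T):
    """Cross-entropy between predicted probabilities Y and one-hot targets T."""
    n = len(Y)
    total = 0.0
    for y_row, t_row in zip(Y, T):
        for y, t in zip(y_row, t_row):
            if t > 0:  # skip zero targets, which also avoids log(0)
                total += t * math.log(y)
    return -total / n


def target_gated_loss(Y, T1, X, T2, pin_index=0):
    """Cross-entropy plus the per-image half squared error, where the latter
    is multiplied by the classification TARGET for the 'pin' class.

    The network prediction is deliberately not used for the gating.
    pin_index (assumed) selects the column of T1 that encodes 'pin present'.
    """
    n = len(Y)
    regression = 0.0
    for x_row, t_row, label in zip(X, T2, T1):
        has_pin = label[pin_index]  # 1 if the image contains a pin, else 0
        image_half_mse = 0.5 * sum((x - t) ** 2 for x, t in zip(x_row, t_row))
        regression += has_pin * image_half_mse
    return cross_entropy(Y, T1) + regression / n


# Toy mini-batch: the first image contains a pin, the second does not.
Y = [[0.5, 0.5], [0.5, 0.5]]
T1 = [[1, 0], [0, 1]]
X = [[1.0, 1.0], [5.0, 5.0]]   # the second prediction is far off, but ignored
T2 = [[0.0, 0.0], [0.0, 0.0]]
loss = target_gated_loss(Y, T1, X, T2)
```

In this example the large position error of the second image does not contribute to the loss, because its target indicates that there is no pin to locate.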
As with the first proposed loss in this article, this loss is used to calculate the gradients of the loss with respect to the learnable parameters in order to update the latter to improve the predictions of the network. The same base network is used to compare the obtained results.
After finalizing the training, the same validation data are used to evaluate the training. At this point, 10 randomly selected images are chosen to evaluate the network performance. The same images are used to evaluate the training of the second loss function. Therefore, both results are directly comparable and allow one to conclude whether the proposed method is effective and which loss function has the best performance.

Results
In this section, the results of the investigation are presented. First, the network is trained using the first proposed loss function. The loss during the training and the average validation loss are presented in Figure 2. The quick drop that appears in the first iterations suggests that the classification of the images is optimized early in the training. The values obtained at the end of the training are collected in Table 3. In the reported values, Loss is the total loss, Classification is the cross-entropy error, and Regression is the half mean square error; all values smaller than $10^{-4}$ are considered null.
After analyzing all the images in the validation dataset, 10 images were randomly selected to illustrate the training results. Looking at the results in Table 4, it can be seen that the network achieves very good results in the classification task, labeling most of the images correctly. Regarding the pin localization, the results can be improved. Most of the time, the network is able to locate the pin with decent precision. However, the localization task fails when there is no pin in the image, for example, as shown in images 816, 165 and 836 in Figure 3. It can also be noted that image 357 is not classified correctly, although the localization task is performed accurately.
Figure 3. Images used to analyze the performance of the network. The red marking shows the real position of the pin (manually labeled), whereas the blue marking shows the prediction of the network.
Table 4. The values of the analyzed validation images (Figure 3). Loss is the total loss; Classification is the cross-entropy error; Regression is the half mean square error. All the values smaller than $10^{-4}$ are considered null.
The same observation can be made with the second proposed loss. In Figure 4, the loss during the training and the average validation loss are presented. Compared with Figure 2, the loss value is higher at the beginning, although at the end of the training process, the values converge, as can be seen in Table 5. After analyzing all the images in the validation dataset, 10 images were randomly selected to present the training results. Looking at the results in Table 6, this network, like the first one, performs well on the classification task, while improving its performance in the localization task. In images 816 and 836 in Figure 5, it can be seen that the network predicts the positions of the pins, although they are not in the images. However, in the other images, the network predicts the positions of the pins more accurately than the previous network. This can be the result of giving the regression task more influence when there is a pin. This can also provide an explanation for the overall loss values being higher than those in the first network.

Discussion
The improvements of convolutional neural networks in image classification [18] and object detection [19,30] have positioned CNNs as the best solution to perform these tasks. Most of the reviewed literature suggests that a single task is performed by each network. Although single-stage object detectors, such as YOLO [25] and SSD [26], perform both the region proposal task and the detection task as a single task, they do not classify the input image. In this work, the objective was to first classify the input image into two groups, namely, images that contain a pin and images that do not contain a pin, and to then position the pin inside the image, all while using the same simple network.
In this work, we proposed two different approaches to perform a custom training loop. The two approaches differ from each other in the loss function that is used in each case. After analyzing the results, we could see that the classification task was performed accurately in both trained networks. Regarding the localization of the pin, the second network achieved better results. The differences in the approaches of the two methods led us to think that the second network performed better in the localization task because it gives more weight to the half mean squared error when it needs to perform the localization task. However, both networks underperformed in the localization task when there was no pin in the image. Therefore, in following iterations of this network, we will attempt to improve the localization task of the network by refining the training processes of the network and the loss function.

Conclusions
The objective of this research is to determine whether image classification and object localization tasks can be performed using a single CNN. In this work, two new loss functions are added to a custom training loop. Both loss functions combine the cross-entropy loss (Ex) for the classification task and the half mean square error (mse) for the regression task (object localization). The main differences between the two networks are as follows:
1. The first network always computes the loss by adding the classification task loss and the localization task loss (Equation (6)), whereas the second only takes into account the localization task loss when there is a pin in the image (Equation (9)).
2. In the first network, the localization task loss is multiplied by factor λ (Equation (6)), reducing the importance that this loss has in the overall loss of the network.
The first loss function is very simple, which makes the results easier to analyze and the network easier to fine-tune. The second loss function, in contrast, differentiates between the two types of losses and computes the total loss depending on whether the image contains an object to localize: the half mean square error is computed only when the target indicates that there is an object in the image. This is the key aspect of the second loss and, in our opinion, the biggest contribution of this work, because the first loss does not make any such differentiation when computing the total loss. Based on the results, this approach shows better performance than the first loss function, paving the way for future research.
These results show that computer vision can benefit the manufacturing industry. The tasks that require some type of visual recognition, detection or classification can now be performed efficiently using neural networks. This can lead to more automated manufacturing processes. Therefore, workers can focus less on automatic tasks and more on other tasks, such as problem solving and organizational tasks.
Funding: The current study was sponsored by the Government of the Basque Country-ELKARTEK21/10 KK-2021/00014 ("Estudio de nuevas técnicas de inteligencia artificial basadas en Deep Learning dirigidas a la optimización de procesos industriales") research program.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.