A Product Pose Tracking Paradigm Based on Deep Points Detection

Abstract: This paper presents a novel and versatile method for tracking the pose of varying products during their manufacturing procedure. By using modern Deep Neural Network techniques based on Attention models, the most representative points for tracking an object can be automatically identified from its drawing. Then, during manufacturing, the body of the product is processed on those points with Aluminum Oxide, which is unobtrusive in the visible spectrum but easily distinguishable by infrared cameras. Our proposal allows for the inclusion of Artificial Intelligence in Computer-Aided Manufacturing to assist the autonomous control of robotic handlers.

Author Contributions: Conceptualization, L.B. and S.G.M.; methodology, L.B.; software, L.B.; validation, L.B., S.G.M. and A.G.; formal analysis, L.B. and A.G.; investigation, L.B. and S.G.M.; resources, A.G. and S.G.M.; data curation, L.B.; writing (original draft preparation), L.B., S.G.M. and A.G.; writing (review and editing), L.B., S.G.M. and A.G.; visualization, L.B.; supervision, A.G.; project administration, manuscript.


Introduction
During the last decade, significant effort has been devoted to integrating contemporary digital technologies into the manufacturing procedure to comply with the Industry 4.0 scheme through Vertical Networking, Horizontal Integration, Through-Engineering, and early adoption of Exponential Technologies [1,2]. Smart systems have already been introduced into the manufacturing procedure to increase flexibility and productivity at different levels of infrastructure, such as remote control through the Internet of Things (IoT) [3], predictive maintenance [4], failure recovery by non-expert personnel [5,6], low-volume or high-variance production [7], workload scheduling [8], and more. With the means mentioned above, Computer-Aided Manufacturing (CAM) systems can be significantly improved to increase their autonomy and provide real-time analysis of their subjects using Artificial Intelligence (AI).
One of the most promising technologies that can be adopted in an Industry 4.0 ecosystem is the processing of visual data from low-cost camera sensors. During the last two decades, computer vision has reported tremendous achievements in automation and manufacturing, spanning from landmark/keypoint detection and extraction [9,10] to exploration [11] and novelty detection [12]. Combined with the cognitive capabilities of AI and Deep Learning (DL), such technologies enable smart systems to better interpret and interact with their environment. Thus, modern automation can perform complex tasks while also handling unexpected events.
Pose estimation and tracking remain challenging tasks that have attracted the attention of many researchers. Accurately identifying the position and orientation of an object allows autonomous machines (e.g., robotic manipulators in a production line) to grasp and adequately handle it without compromising its integrity. Traditionally, the whole 3D figure of an item is estimated by using depth visual sensors, such as stereo or RGB-D cameras [13-15]. More recently, though, in order to provide cost-efficient solutions, the related literature has focused on identifying an object's pose through single image instances, which can be acquired by monocular sensors [16]. This is achieved either by using Structure from Motion techniques on frames captured at different time instances [17,18], or through DL methods [9,19-22], which can identify the full pose or representative local points of a given object.
In this paper, we present a solution for improving the operation of a smart assembly line, which requires the detection and tracking of products during manufacturing (e.g., [23]). Our proposal is an AI-enabled tool for generating representative tracking points to capture the position of different products and close the control loop of CAM systems. Instead of relying on fiducial [24] or other markers that affect an object's appearance, our concept applies highly infrared (IR) reflective materials, such as Aluminum or Magnesium Oxide, to regions of each product that are strategically selected by the AI [25]. Such materials can be easily perceived by an IR camera sensor without significantly interfering with the final form of the product, facilitating the tracking and handling procedures of a modern production line with multiple robotic manipulators. Within the scope of this work, we consider the use of Aluminum Oxide, which constitutes a low-cost solution with high reflectivity in the IR spectrum [26]. Our approach identifies the points within a pre-processing step, without explicitly requiring the online deployment of the DL architecture during the manufacturing procedure. This allows the Aluminum Oxide markings to be planned beforehand and incorporated in the design procedure, and it permits the use of direct methods for retrieving the pose of each object based on well-established 3D geometry techniques.

Proposed Approach
A DL network is proposed to automatically identify representative points to be processed with Aluminum Oxide (e.g., injections, paint, coating [27-30]) based on the architecture presented in [25]. An open source implementation of the used network can be found in the publicly available repository https://tinyurl.com/githubAttention, accessed on 25 May 2021. In this work, the DenseNet [31] model is used, according to which the information extracted by previous layers is accumulated in the following ones by propagating their respective feature maps. To reduce the computational complexity of the original design, standard convolutional layers are substituted by depthwise separable ones [32]. Specifically, the model's stem consists of Dense Blocks with Inverted Residuals and the Mish activation function [33]. Moreover, for sub-sampling feature maps, an antialiasing Blur Pooling filter is introduced, allowing the use of different kernel sizes [34]. The network's main building block, shown in Figure 1, consists of an Attention-Augmented Inverted Residual Block constructed around a standard residual one. Such a mechanism aggregates the similarity between prominent query characteristics, and thus, multiple subspaces and spatial positions can be monitored by using a series of attention blocks. Finally, downsampling between Dense Blocks is achieved by a Transition Layer with pointwise convolutions to reduce the feature maps' depth, Blur Pooling, and batch normalization. The network's output corresponds to a set of k coordinates, each denoting a specific point on an object. Besides its competitive performance, this network is appealing for a dynamic pose detection system due to its single-stage end-to-end architecture and its capability to regress the representative points' coordinates within a pre-processing step, before the items reach the production stage.
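The following is a minimal NumPy sketch of three of the building blocks named above: the Mish activation, the parameter saving of a depthwise separable convolution, and anti-aliased (blur) down-sampling. It is not taken from the referenced implementation; the function names, the 3x3 kernel with 64 input and 128 output channels, and the 1D binomial blur kernel are illustrative assumptions.

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    x = np.asarray(x, dtype=float)
    # logaddexp(0, x) = log(1 + exp(x)) is a numerically stable softplus.
    return x * np.tanh(np.logaddexp(0.0, x))

def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution followed by a pointwise 1 x 1 one."""
    return k * k * c_in + c_in * c_out

def blur_pool_1d(signal, stride=2):
    """Low-pass filter with a binomial kernel, then subsample (1D illustration)."""
    kernel = np.array([1.0, 2.0, 1.0]) / 4.0
    padded = np.pad(np.asarray(signal, dtype=float), 1, mode="reflect")
    blurred = np.convolve(padded, kernel, mode="valid")
    return blurred[::stride]

# Hypothetical layer size, chosen only to illustrate the saving.
standard = conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128)
print("standard:", standard, "separable:", separable)
print("mish:", mish([-1.0, 0.0, 1.0]))
print("blur pool of a constant signal:", blur_pool_1d(np.ones(8)))
```

For this hypothetical layer, the separable variant needs roughly one eighth of the weights of the standard convolution, which illustrates why the substitution reduces complexity.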
Our approach trains the above network on a dataset of known manufacturing objects using their respective Computer-Aided Design (CAD) models. Consequently, the same network can automatically indicate representative tracking points for other products with different shapes. Our concept is based on the notion that the DL architecture has been trained to detect the most appropriate points on the learning objects and can transfer this knowledge to unknown items as well. Given that CAD learning models can be transformed with respect to any desired viewing angle, the trained network will be able to identify the required number of tracking points from multiple views. Thus, Aluminum Oxide can be placed accordingly on different sides of an object, allowing an IR camera to accurately identify its position.
More specifically, the proposed concept assumes a basic setup, such as the one presented in Figure 2, including one or more IR camera sensors mounted on the ceiling, depending on the size of the production line that needs to be monitored. When an IR light source is aimed at the line, the Aluminum Oxide points strongly reflect the light and are captured by the sensor. This allows for their detection and tracking on every frame of the respective video stream. Finally, the above 2D image points are associated with the known 3D object's mesh (given by its CAD model), and the Perspective-n-Point (PnP) algorithm [35] is applied to estimate the relative 6-degrees-of-freedom transformation between the camera and each product. By assuming n = 3 for the PnP algorithm, up to four different solutions are obtained, which can be disambiguated by a fourth point association. The P3P algorithm, applied on the simplistic setup presented in Figure 3 among the camera center $P_0$ and three 3D points $P_i$, $i \in \{1, 2, 3\}$, adheres to the following equation system:

$$
\begin{aligned}
Y^2 + Z^2 - YZp - a^2 &= 0 \\
Z^2 + X^2 - XZq - b^2 &= 0 \\
X^2 + Y^2 - XYr - c^2 &= 0
\end{aligned}
\tag{1}
$$

where:

$$
\begin{aligned}
X &= |P_1 - P_0|_2, & Y &= |P_2 - P_0|_2, & Z &= |P_3 - P_0|_2, \\
a &= |P_2 - P_3|_2, & b &= |P_1 - P_3|_2, & c &= |P_1 - P_2|_2, \\
p &= 2\cos\angle(P_2 P_0 P_3), & q &= 2\cos\angle(P_1 P_0 P_3), & r &= 2\cos\angle(P_1 P_0 P_2).
\end{aligned}
$$

In the above, p, q, and r can be computed based on the 2D image point correspondences with the world's points $P_i$, while a, b, and c are known from the given CAD model. Furthermore, the notation $|\cdot|_2$ denotes the L2-norm. By solving the equation system (1), we can obtain X, Y, and Z, which correspond to the depth information of the reference world points. Then, given the intrinsic camera parameters, the 3D points' coordinates ($P_i^{c}$) are computed with respect to the IR sensor's frame of reference. Finally, the relative rotation (R) and translation (T) between the camera and the product are recovered by solving:

$$
P_i^{c} = R\,P_i + T, \quad i \in \{1, 2, 3\}.
$$

Considering that the IR sensor's position is known and fixed, each object's pose on the production line can be accurately retrieved and transformed into a global frame of reference.
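As a worked illustration of the geometry described above, the sketch below builds a synthetic three-point configuration, checks that the true point distances satisfy the P3P constraints in their standard law-of-cosines form (Y² + Z² − YZp − a² = 0, and cyclically for the other two pairs), and then recovers R and T from the camera-frame points with an SVD-based rigid alignment (Kabsch). It does not solve the P3P system itself: the ground-truth distances stand in for the output of a P3P solver. The intrinsic matrix, the pose, and all function names are hypothetical.

```python
import numpy as np

def p3p_residuals(X, Y, Z, p, q, r, a, b, c):
    """Left-hand sides of the three P3P constraints; all zero for a valid solution."""
    return np.array([
        Y**2 + Z**2 - Y * Z * p - a**2,
        Z**2 + X**2 - X * Z * q - b**2,
        X**2 + Y**2 - X * Y * r - c**2,
    ])

def rigid_transform(world_pts, cam_pts):
    """R, T such that cam = R @ world + T, via the Kabsch (SVD) method."""
    cw, cc = world_pts.mean(axis=0), cam_pts.mean(axis=0)
    H = (world_pts - cw).T @ (cam_pts - cc)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cc - R @ cw

# Hypothetical intrinsics and ground-truth pose (assumptions for the demo).
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
ang = np.deg2rad(30.0)
R_true = np.array([[np.cos(ang), -np.sin(ang), 0.0],
                   [np.sin(ang),  np.cos(ang), 0.0],
                   [0.0, 0.0, 1.0]])
T_true = np.array([0.1, -0.2, 2.0])
world = np.array([[0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [0.0, 0.2, 0.1]])  # P1..P3
cam_true = world @ R_true.T + T_true

# Project to pixels, then back-project to unit bearing vectors using K.
uvw = cam_true @ K.T
pixels = uvw[:, :2] / uvw[:, 2:3]
rays = np.column_stack([pixels, np.ones(3)]) @ np.linalg.inv(K).T
rays /= np.linalg.norm(rays, axis=1, keepdims=True)

# p, q, r from the image; a, b, c from the (CAD) model.
p, q, r = 2 * rays[1] @ rays[2], 2 * rays[0] @ rays[2], 2 * rays[0] @ rays[1]
a = np.linalg.norm(world[1] - world[2])
b = np.linalg.norm(world[0] - world[2])
c = np.linalg.norm(world[0] - world[1])

# Ground-truth distances stand in for the output of a P3P solver.
X, Y, Z = np.linalg.norm(cam_true, axis=1)
residuals = p3p_residuals(X, Y, Z, p, q, r, a, b, c)

# Camera-frame points from distances and bearings, then the pose.
cam_rec = rays * np.array([X, Y, Z])[:, None]
R_est, T_est = rigid_transform(world, cam_rec)
print("max |residual|:", np.abs(residuals).max())
print("pose recovered:", np.allclose(R_est, R_true), np.allclose(T_est, T_true))
```

In a deployed system the distances would come from an actual P3P solver, which returns up to four candidates; the alignment step would then be run once per candidate.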
Please note that in the above procedure, one-to-one associations between the images' 2D and the products' 3D points need to be known. There are several approaches for obtaining this information (e.g., [9,36]); however, an efficient scheme is presented in [37], where a brute-force search strategy was constructed by inspecting every possible combination and permutation among 3 point detections. The resulting solutions of the P3P algorithm can then be ranked based on the re-projection error of the rest of the detected 2D points. At first glance, such an approach may seem computationally intensive; however, one needs to consider that the number of image points is low (k = 50 in our case) and that each regressed coordinate set is associated with a specific point of an object determined in the training samples.
The item is processed with an IR-reflective material on the points proposed by the Attention model [25]. Using the 3D-to-2D point associations, the relative rotation (R) and translation (T) between the sensor and an item can be retrieved, allowing accurate pose estimation and effective manipulation in a production line.
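Ranking pose hypotheses by re-projection error can be sketched as follows. This is not the scheme of [37]: it only shows the scoring step, assuming a pinhole camera without distortion, and it uses hand-made candidate poses instead of actual P3P outputs. The intrinsics, the model points, and the helper names are hypothetical.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def project(points_3d, R, T, K):
    """Pinhole projection of model points under pose (R, T)."""
    uvw = (points_3d @ R.T + T) @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def reprojection_error(points_3d, points_2d, R, T, K):
    """Mean pixel distance between projected model points and detections."""
    diff = project(points_3d, R, T, K) - points_2d
    return float(np.mean(np.linalg.norm(diff, axis=1)))

def select_pose(candidates, points_3d, points_2d, K):
    """Index of the candidate with the lowest re-projection error, plus all errors."""
    errors = [reprojection_error(points_3d, points_2d, R, T, K)
              for R, T in candidates]
    return int(np.argmin(errors)), errors

# Hypothetical camera, model points, and observed detections.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
model = rng.uniform(-0.2, 0.2, size=(6, 3))
R_true, T_true = rot_z(0.3), np.array([0.0, 0.0, 2.0])
detections = project(model, R_true, T_true, K)

# Hand-made hypotheses standing in for the multiple P3P solutions.
candidates = [
    (rot_z(-0.3), T_true),
    (R_true, T_true),
    (R_true, T_true + np.array([0.1, 0.0, 0.0])),
]
best, errors = select_pose(candidates, model, detections, K)
print("best candidate:", best)
print("errors (px):", np.round(errors, 3))
```

The same scoring function could be wrapped in a loop over point triplets (for example with `itertools.permutations`) to reproduce a brute-force association search, at the cost of one P3P solve per triplet.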

Preliminary Results
A representative example of the obtained results is depicted in Figure 4. The network can automatically produce the heat-map of the products' most prominent tracking points through the feature maps of the Attention-Augmented Inverted Residual Block. The presented outcome is obtained using the network provided in [25], which was originally trained on the PANOPTIC [38] dataset. More specifically, 165,000 learning samples were used with the Stochastic Gradient Descent optimizer under a triangular Cyclical Learning Rate policy [39]. The proposed DL architecture identifies regions of increased saliency on a novel object outside the training space, highlighting its versatility and generalization properties. To further evaluate the performance of our proposed framework, we conducted an additional round of experiments that measures, through repeatability, the network's robustness in detecting salient points. To that end, we applied a series of transformations (including rotation, translation, and affine) to the item depicted in Figure 4, resulting in 300 different instances. Then, we deployed the Attention model on each of them and measured the detected points' repeatability using the Intersection over Union (IoU) metric. The obtained results are presented in Table 1 and indicate that most points can be effectively re-detected over multiple viewing angles. Please note that the network was set to detect the most prominent k = 50 points from each image. It is worth noting that in real-world applications, the learning process needs to be carried out over CAD samples of the products targeted by the specific production line; however, such a learning set is expected to be small. To that end, datasets such as ABC [40] can be used within a pre-training step to transfer-learn the general characteristics (e.g., geometry and appearance) of production items and limit over-fitting.
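One possible way to compute a point-set IoU for such a repeatability test is sketched below. The exact protocol used for Table 1 is not specified in the text, so the matching rule (greedy one-to-one matching within a pixel tolerance after mapping the detections back to the reference frame), the tolerance value, and the synthetic grid of k = 50 points are all assumptions made for illustration.

```python
import numpy as np

def point_iou(ref_pts, det_pts, tol=2.0):
    """IoU of two 2D point sets; points closer than `tol` pixels count as shared."""
    dists = np.linalg.norm(ref_pts[:, None, :] - det_pts[None, :, :], axis=2)
    used, matched = set(), 0
    for i in range(len(ref_pts)):
        for j in np.argsort(dists[i]):
            if dists[i, j] > tol:
                break                      # nothing close enough remains
            if int(j) not in used:
                used.add(int(j))           # greedy one-to-one assignment
                matched += 1
                break
    union = len(ref_pts) + len(det_pts) - matched
    return matched / union if union else 1.0

# Reference detections: k = 50 points on a regular grid (synthetic stand-in).
xs, ys = np.meshgrid(np.arange(10) * 20.0, np.arange(5) * 20.0)
ref = np.column_stack([xs.ravel(), ys.ravel()])

# Known transformation applied to the item (rotation plus translation).
ang = np.deg2rad(20.0)
A = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
t = np.array([15.0, -8.0])

# Simulated re-detection: 45 points are found again, 5 land elsewhere.
moved = ref.copy()
moved[:5] += np.array([7.0, 7.0])
det_transformed = moved @ A.T + t

# Map the detections back to the reference frame before comparing.
det_back = (det_transformed - t) @ np.linalg.inv(A).T
iou = point_iou(ref, det_back)
print("repeatability IoU:", round(iou, 3))
```

With 45 of 50 points re-detected, the intersection is 45 and the union is 50 + 50 − 45 = 55, so the score is 45/55. Averaging this value over all transformed instances would give one repeatability figure per transformation type.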
Furthermore, the lightweight nature of the network, totaling only 1.9M mixed-precision parameters [41], offers real-time performance of approximately 20 ms per frame when deployed on a Titan Xp GPU. In comparison, a hardware-accelerated version of hand-crafted keypoints, which do not incorporate any learned qualities for proper pose tracking, has been reported in [42] to be computed at 11 ms per input sample. This real-time performance allows the points that are to be processed with Aluminum Oxide to be detected even during the manufacturing procedure; however, one can assign this process to earlier stages, such as the products' design.

Conclusions
Our approach proposes the application of a state-of-the-art deep network to manufacturing procedures. Treating the identification of points for product pose tracking as a highly dynamical system, we evaluate our technique in terms of repeatability, establishing its robustness against different viewing angles. This robustness is due to the rich information contained within the DL model's multi-layer structure, its ability to describe highly correlated input-output variables, and the use of high-dimensional learning data. Such a framework paves the way to increasing a manufacturing facility's level of automation, leading to more effective control in modern industrial environments and remote inspection of the production line within an IoT paradigm. As part of our future work, we plan to evaluate a complete system based on our proposal to showcase fully solidified results in an industrial environment.

Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.