Article

A Multiview Recognition Method of Predefined Objects for Robot Assembly Using Deep Learning and Its Implementation on an FPGA

by Victor Lomas-Barrie 1,*, Ricardo Silva-Flores 1, Antonio Neme 2 and Mario Pena-Cabrera 1
1 Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico
2 Unidad Académica Mérida, IIMAS, Universidad Nacional Autónoma de México, Mérida 97290, Mexico
* Author to whom correspondence should be addressed.
Electronics 2022, 11(5), 696; https://doi.org/10.3390/electronics11050696
Submission received: 9 January 2022 / Revised: 21 February 2022 / Accepted: 22 February 2022 / Published: 24 February 2022
(This article belongs to the Section Artificial Intelligence)

Abstract

The process of recognizing manufacturing parts in real time requires fast, accurate, small, and low-power-consumption sensors. Here, we describe a method to extract descriptors from several objects observed from a wide range of angles in three-dimensional space. These descriptors define the dataset used to train and validate a convolutional neural network. The classifier is implemented in reconfigurable hardware as an embedded system that integrates an RGB sensor and the processing unit. The system achieved an accuracy of 96.67% and a speed 2.25 times faster than that reported for state-of-the-art solutions; it is also 655 times faster than the equivalent implementation on a PC. The presented embedded system meets the criteria of real-time video processing and is suitable as an enhancement for the hand of a robotic arm in an intelligent manufacturing cell.

1. Introduction

In the field of manufacturing robotics, it is of the highest relevance to have optical sensors that provide the system with sensory feedback on the actions of one or more robotic arms collaborating or interacting with a human. Tasks such as welding, machining, painting, or simply clamping and assembly are common in this type of industry, which is why intelligent optical sensing systems are essential for carrying them out. Furthermore, robot vision allows automatic learning to achieve fast and accurate object or pattern recognition, a task that is often complicated, dangerous, and strenuous and in which humans are usually not kept in the loop. Therefore, these automatic detection mechanisms must be precise when selecting the proper object, which is key to the success of the manufacturing process [1].
In manufacturing cells, it is common for components or objects (e.g., screws, nuts, washers, motors, autoparts, assemblies, and fasteners) to approach the assembly area via conveyors. These objects are usually placed on the worktable of a different assembly robot. They do not necessarily arrive with the same orientation in each assembly process, and it is even possible that, by some mistake, a part falls and rotates, changing the configuration that was initially planned. Ensuring that parts arrive in a specific order and correctly aligned from the previous stage, so that the robot can quickly decide the best way to pick them through a deterministic process, is highly demanding; this is known as the bin-picking problem [2,3]. A robotic vision system must be able to recognize and discriminate parts or objects from any angle and indicate their position to the robot. These vision systems are either fixed to the manufacturing cell [4] or mounted on the end of the robotic arm [5]. Fixed systems have the advantage that, regardless of their overall weight (camera, communications, and computer), they can easily be installed on tripods or on the walls of the cell enclosure; moreover, power consumption is not critical, although the price is usually high. Vision systems fixed on the wrist of the robotic arm offer not only higher image resolution (closer proximity to the object, higher data quality) but also the possibility of dynamically changing the camera angle. However, in order not to add more weight to the robot, embedded systems able to acquire the image, preprocess it, and make inferences are required. In the context of Industry 4.0, priority is given to low-power, stand-alone, low-cost, multimodal, digitally robust, intercommunicable, and, above all, compact systems [6]. This paper focuses on the second scenario, that is, vision systems attached to a robotic arm.
In contrast to what is commonly found in automatic object recognition in other fields, in intelligent manufacturing the attributes of the components are perfectly known in advance. Features such as color, geometry, texture, and dimensions are highly relevant to the system so that it can act on the basis of those descriptions. Hence, customized procedures are necessary, since pretrained machine-learning models built on large datasets with hundreds of object categories are not functional in this setting.
In this contribution, we propose an object recognition method implemented in an FPGA in which tasks such as image capture and preprocessing are automated and, at a further stage, an object classification process is conducted. All these tasks are carried out in an embedded system placed on the wrist of a robotic arm. Previous pattern recognition works implemented in FPGAs can be divided into those based on feature extraction and those based on artificial neural networks (ANNs). In the former group, there is work on the Speeded-Up Robust Features (SURF) algorithm [7] and its implementation on FPGA [8], and on Features from Accelerated Segment Test [9] combined with Binary Robust Independent Elementary Features [10] (FAST + BRIEF) [11]. The ANN approach includes relevant contributions such as [12], which describes an improved version of YOLOv2 implemented on a ZYNQ FPGA. Convolutional neural networks (CNNs) such as VGG16 and MobileNetV2 were implemented in hardware [13] using a mixed-precision overlay processor. In [14], several CNN models (VGG16, MobileNetV2, MobileNetV3, and ShuffleNetV2) were implemented in hardware using a hybrid vision processing unit (hybrid VPU).
The boundary object function (BOF) [15] is a descriptor vector that characterizes an object by extracting some of its attributes, as detailed in the following sections. It has demonstrated potential in terms of being invariant to rotation, scale, and displacement, and in condensing the information coming from a 2D image into a one-dimensional array. In [16], the BOF and a classifier based on the fuzzy ARTMAP neural network [17] were implemented on an FPGA with very favorable results. However, complications arise when the object angle on the z axis exceeds 15 degrees; object recognition therefore becomes difficult whenever the capture angle differs from the angle considered during network training.
We constructed our dataset by joining the descriptors of all objects. First, a family of BOF vectors linked to a given object was obtained by systematically rotating the camera plane around a surrounding sphere centered on the object. Then, a BOF was obtained from each rotation angle, and the family of BOFs could be tuned by setting the list of rotation angles. Lastly, all objects of interest were characterized in the same form, and the family of BOFs associated with each constituted the complete dataset, as shown in Figure 1.
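As a reference for how a single descriptor in such a family can be obtained, the following sketch computes one BOF from a segmented contour, assuming (following the description of [15] and Figure 5) that the descriptor is the centroid-to-contour distance sampled every 2 degrees and normalized; the helper name and the normalization choice are illustrative rather than the authors' exact code.

    import numpy as np

    def bof_from_contour(contour_xy, step_deg=2):
        """contour_xy: (N, 2) array of boundary pixels of a segmented object view."""
        centroid = contour_xy.mean(axis=0)
        d = contour_xy - centroid
        radii = np.hypot(d[:, 0], d[:, 1])
        angles = np.mod(np.degrees(np.arctan2(d[:, 1], d[:, 0])), 360.0)
        order = np.argsort(angles)
        # keep one boundary point every 2 degrees -> a 180 x 1 descriptor
        sampled = np.interp(np.arange(0, 360, step_deg),
                            angles[order], radii[order], period=360.0)
        return sampled / sampled.max()   # normalization for scale invariance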
Although CNNs were originally designed for digital image processing, they were successfully applied in several other contexts. Applications in other areas include, for example, genomic data processing [18], text classification [19], and sentiment analysis [20].
In our contribution, the challenge was to find a CNN model that could classify the BOF vector families of different objects. Table 1 shows the pixel input dimensions of the most common CNN models and of the BOF and CNN combination introduced in this work. The input dimension in our case is 180 × (n bytes of the fixed-point representation), since the BOF consists of 180 values, one per sampled contour angle, each represented in fixed point.
As stated earlier, in this contribution, we describe a method to classify objects using their BOF vector family as features. The external label or class is given by the object type, such as screws or gears. A deep learning neural network then approximates a function between BOF family vector and object class. The main contributions are:
  • we designed a manufacturing part recognition method capable of classifying multiple part views from their unique descriptors (BOF);
  • we designed a CNN from an existing model, but instead of directly using images, we trained the network with the unique descriptors of each part view; and
  • we implemented a CNN model on a field-programmable gate array (FPGA).
In this contribution, the hypothesis we aimed to prove is that objects of interest in automatic manufacturing can be recognized from their BOF descriptions instead of relying on their images. Since BOFs are vectors that represent objects with a lower dimension than images, we aimed to successfully train a CNN as a classifier on the basis of BOFs. At the same time, and equally relevant, we wanted to verify whether training based on BOFs is faster. The third relevant aspect we investigated is the method’s speed gain when implemented in an FPGA (Figure 2).
Related works [13,14] relied on image detection algorithms based on CNNs. However, their networks are large: MobileNetV2 alone has more than 2 million parameters, and the two implementations consume 279,000 LUTs and 3600 DSPs, and 146,000 LUTs and 212 DSPs, respectively. In contrast, our model consumes only 31,336 LUTs and 116 DSPs. As an additional advantage, and as shown in the next sections, our proposal is also faster.
The rest of this study is organized as follows. Section 2 describes the entire process of building the dataset, the design of the CNN, and its implementation on the FPGA. Section 3 presents the results in two parts: those of the CNN implemented in Python, used to verify network performance, and those of the FPGA implementation. Section 4 discusses the main findings by comparing the results achieved by our contribution with those of similar works. Lastly, Section 5 presents some concluding remarks.
The method presented in this paper aims to answer these questions.

2. Materials and Methods

The methodology supporting this work consists of the following stages:
  • Select a set of manufacturing parts and print them with additive manufacturing.
  • Obtain images of each object varying the rotation angle on the z axis and the azimuthal angle.
  • Extract the BOF descriptive vector from each image.
  • Create a dataset with data from the previous point.
  • Select a suitable CNN and adjust its parameters, including its architecture, to achieve the best performance and minimize the loss.
  • Implement the CNN architecture on the FPGA.
  • Compare the results.

2.1. BOF Extraction and Dataset Conformation

We designed 19 assembly objects in a 3D rendering tool. Elements were printed using a 3D printer, and we conducted the training and testing phases over those objects. Some of those objects are shown in Figure 3a.
We built a small photographic studio (35 × 35 × 31 [cm]) to take pictures of the objects at different angles (Figure 3b). A stepper motor rotated the object in parameterizable steps of Δα degrees, and the camera tilt angle θ could be adjusted by means of a 90-degree arc-shaped bracket graduated every 5 degrees, so the manual step Δθ was also parameterizable.
The dataset (“3DBOFD”, 3D BOF Dataset) was generated as follows (a sketch of this acquisition loop is given after the list):
  • Set α and θ to initial position (see Figure 4).
  • Obtain object image.
  • Calculate the BOF. Figure 5 shows the process of extracting a BOF vector from an object [15]. Reduce the dimension of this vector by keeping only the contour points separated by 2 degrees, so that the BOF vector’s dimensions are 180 × 1.
  • Rotate the object Δ α degrees on the z axis to complete 360 degrees and repeat Steps 2 and 3.
  • Tilt the camera Δ θ and repeat Steps 2–4 until completion from 0 to 90 degrees.
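The acquisition loop above can be sketched as follows. The turntable, tilt, and camera helpers (rotate_table, set_camera_tilt, grab_frame) are hypothetical stand-ins for the rig in Figure 3b, the contour extraction stands for the segmentation of Figure 5, and bof_from_contour is the descriptor sketch from Section 1.

    import numpy as np

    DELTA_ALPHA, DELTA_THETA = 10, 10    # degrees per step, as set in Section 2.2

    def build_3dbofd(objects, extract_contour, rotate_table, set_camera_tilt, grab_frame):
        samples, labels = [], []
        for label, obj in enumerate(objects):
            # theta = 0 views were later discarded (Section 2.2), so the sweep starts at DELTA_THETA
            for theta in range(DELTA_THETA, 91, DELTA_THETA):
                set_camera_tilt(theta)
                for alpha in range(0, 360, DELTA_ALPHA):
                    rotate_table(alpha)
                    contour = extract_contour(grab_frame(obj))  # segmentation as in Figure 5
                    samples.append(bof_from_contour(contour))   # 180-value descriptor
                    labels.append(label)
        return np.array(samples), np.array(labels)              # 19 objects x 324 views = 6156 samples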

2.2. CNN Model

A neural network is a universal classifier. In our context, this property translates into the capability of finding a function that relates two sets: on the one hand, the attributes describing an object and, on the other hand, the type or class of that object. Convolutional neural networks perform an iterative transformation of vectors from the initial attribute space to new spaces. The core idea is that several attributes are extracted in each stage and, at the last stage of this class of neural networks, classification is performed in the new feature space via a classifier that may itself be a neural network.
As a first approach to model the convolutional neural network, we used the LeNet-5 network [21] due to its small size (number of layers, kernel size, and number of parameters). Table 2 shows LeNet-5 architecture. Its original purpose was to recognize handwritten or printed digits. Therefore, input images must be rescaled to 32 × 32 pixels.
Since the dimensions of the BOF (180 × 1) were incompatible with the 32 × 32 pixel input of LeNet-5, a modification of the layers was proposed as follows (see Table 3).
We initially set Δθ = 10 and Δα = 10, so 360 images were generated for each object. However, when θ = 0, the obtained perspectives of several objects were very similar for most values of α. This was a consequence of how the objects were arranged when their images were taken: each object was placed with its characteristic face (cross, semicircle, or triangle) along the z axis, while its other face (the height) was parallel to the camera when θ = 0, making every object look like a rectangular outline. Consequently, all images taken at θ = 0 were discarded, leaving 324 images per object and 6156 images in total.
We implemented the LeNet-5 network (see Table 2) in Python through Keras, considering the input vector size imposed by the network design. In order to obtain the best performance of the network, several regularization and optimization techniques were applied.
Two new convolutional layers of 12 and 25 filters with 1 × 5 kernels were added. In addition, the number of filters was reduced to 5 in the first convolutional layer and to 8 in the second. The first dense layer was also reduced to 90 neurons. The CNN model implemented in the FPGA is shown in Table 4.
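A minimal Keras sketch of this architecture is given below; the layer sizes follow Table 4 and the activation choices follow Table 7 and Section 2.3.4, but the code itself is illustrative and not the authors' training script.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_modified_lenet5(num_classes=19):
        # Layer sizes reproduce Table 4: four 1 x 5 convolutions with 5, 8, 12, and 25
        # filters, each followed by 1 x 2 max pooling, then dense layers of 90, 84, and 19.
        return models.Sequential([
            layers.Input(shape=(180, 1)),              # one BOF vector
            layers.Conv1D(5, 5, activation="relu"),    # -> (176, 5), 30 parameters
            layers.MaxPooling1D(2),                    # -> (88, 5)
            layers.Conv1D(8, 5, activation="relu"),    # -> (84, 8), 208 parameters
            layers.MaxPooling1D(2),                    # -> (42, 8)
            layers.Conv1D(12, 5, activation="relu"),   # -> (38, 12), 492 parameters
            layers.MaxPooling1D(2),                    # -> (19, 12)
            layers.Conv1D(25, 5, activation="relu"),   # -> (15, 25), 1525 parameters
            layers.MaxPooling1D(2),                    # -> (7, 25)
            layers.Flatten(),                          # -> 175 values
            layers.Dense(90, activation="relu"),
            layers.Dense(84, activation="relu"),
            layers.Dense(num_classes, activation="softmax"),
        ])

    model = build_modified_lenet5()
    model.summary()   # 27,354 trainable parameters, matching Table 4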

2.3. CNN FPGA Implementation

BOF and CNN implementation was performed on a Zybo Z7-20 Digilent board with the Zynq-7000 SoC (XC7Z020-1CLG400C) from Xilinx. In addition, we used Digilent’s Pcam 5C module for image capture, which is based on Omnivision’s OV5640 RGB sensor.
Figure 6 shows the architecture of the BOF implementation on the FPGA.

2.3.1. Fixed-Point Determination

We opted for a fixed-point representation for the BOF and CNN implementation, which carries a cost in accuracy since it is necessary to round or truncate the result at the end of an operation; this rounding or truncation produces a quantization error [22]. Therefore, we tested different fixed-point formats to establish the best representation for the BOF and CNN implementation on the FPGA. Since our reference is the implementation of the BOF and CNN on a CPU using Python and Keras, each layer of the CNN implemented in Python was compared with its VHDL implementation. With the fxpmath library [23], the fixed-point numbers were represented and the error was estimated.
For the representation at each stage (BOF, CNN convolutional layers, CNN dense layers, and CNN classification layer), classifications were performed on 100 samples with different fixed-point resolutions. Table 5 shows the accuracy obtained with different numbers of bits for each stage. For instance, the accuracy with 7 bits for the BOF representation was 92%, whereas with 9 bits it improved to 96%. The same was observed for the convolutional layers, where using only 9 bits yielded an accuracy of barely 40%, whereas 16 bits improved it to 96%. Results improve as more bits are added; however, the resources used in the FPGA then increase considerably. Thus, 9 bits were used for the BOF extraction stage, 16 bits for the convolutional and dense layers of the CNN, and up to 32 bits for the classification layer.
Table 6 shows the number of bits assigned to each part of the fixed-point representation (sign, integer, and fraction) for each stage of the process.
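The kind of per-stage check performed here can be sketched as follows; the snippet uses a plain NumPy rounding helper instead of the authors' fxpmath calls, and the random vector is only a stand-in for a real BOF.

    import numpy as np

    def to_fixed_point(x, int_bits, frac_bits, signed=True):
        """Quantize x to a sign/integer/fraction grid and return it as float."""
        scale = 2.0 ** frac_bits
        q = np.round(np.asarray(x, dtype=float) * scale)
        hi = 2 ** (int_bits + frac_bits) - 1
        lo = -(hi + 1) if signed else 0
        return np.clip(q, lo, hi) / scale          # saturate, then back to real values

    bof = np.random.rand(180)                                  # stand-in BOF vector
    err_bof = np.abs(bof - to_fixed_point(bof, 1, 7)).max()    # BOF stage: 1 + 1 + 7 bits (Table 6)
    err_conv = np.abs(bof - to_fixed_point(bof, 3, 12)).max()  # conv layers: 1 + 3 + 12 bits
    print(f"worst-case quantization error: BOF {err_bof:.4f}, conv {err_conv:.6f}")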

2.3.2. BOF Stage Implementation

The authors in [16] described how the BOF elements are extracted. These values are first stored in a BRAM; they are then read five at a time and fed to the input of the first layer of the CNN, which has a 1 × 5 kernel.

2.3.3. Convolutional Layer Implementation

Each convolutional layer comprises the following stages: convolution, activation, pooling, and memory storage. In the first stage, convolution is performed on the incoming data. In the second stage, the chosen activation function is applied; in the third stage, the most representative values are selected according to the pooling strategy; and in the fourth stage, the resulting values are stored. Table 7 shows the parameters for the convolutional and max-pooling stages.
The first layer consisted of five filters of five weights each (25 weights in total); each filter was multiplied by the five incoming BOF values, the products were summed, and the corresponding bias was added at the end of the stage (Figure 7). Subsequently, the activation stage determined the values passed on to the pooling stage, which consisted of 5 data. Meanwhile, 5 new data were received from the BRAM where the BOF was stored, and the previous procedure was repeated. Lastly, a 1 × 2 filter with a stride of 2 was applied in the max-pooling stage, so the output dimension was half of the input dimension.
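A behavioural sketch of this datapath (not the VHDL itself) is given below; it emulates the five 1 × 5 filters, the bias and ReLU, and the 1 × 2 max pooling over a whole BOF at once, whereas the hardware processes the data five values at a time.

    import numpy as np

    def conv1_stage(bof, weights, biases):
        """bof: (180,) BOF; weights: (5, 5), five filters of five taps; biases: (5,)."""
        out = np.empty((176, 5))
        for i in range(176):                        # slide the 1 x 5 window over the BOF
            window = bof[i:i + 5]
            out[i] = weights @ window + biases      # 25 multiplications plus bias
        out = np.maximum(out, 0.0)                  # ReLU activation stage
        return out.reshape(88, 2, 5).max(axis=1)    # 1 x 2 max pooling, stride 2 -> (88, 5)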
Due to the accumulated latency, data were stored in a buffer at the output of the first, second, and third convolutional layers. However, this buffer was no longer necessary in the fourth convolutional layer as the data were passed directly to the dense layer.

2.3.4. Dense Layer Implementation

Dense layers have three stages: sum of products, bias, and activation. In total, 175 values were received from the fourth convolutional layer by the first dense layer. In the sum-of-products and bias stage, 25 operations were performed per cycle to reduce the use of FPGA resources; the remaining 150 values were then operated on consecutively until the end. At the output of this stage, 90 values were processed in the activation stage through a ReLU function, and a buffer passed the data to the next layer. The second dense layer received these 90 data and delivered 84, which were also buffered.
The third dense layer was similar to the previous ones; however, the applied activation function was the softmax function. Due to the complexity of implementing the exponential function in VHDL, high-level synthesis (HLS) was used for its implementation.
In particular, we relied on the hls_math.h library, which contains the exponential function. The fixed-point data from the previous step were converted into floating point, because using fixed point in the HLS exp function caused synchronization issues; the ap_fixed.h library was called for this conversion. Once the exponentials had been obtained, the data were returned to a 32-bit fixed-point format. We used this particular length because the exponential output was expected to range from hundreds of millions down to values close to zero.
Lastly, the classification layer determined the class associated with the input BOF. The procedure for this final stage was as follows (a sketch is given after the list):
  • All exponentials of the previous layer are stored and added up.
  • Each exponential is divided by the total of the exponentials, and the one with the highest value is stored.
  • The class index with the highest value is the class with the highest classification probability.
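The three steps above amount to a softmax followed by an argmax; a plain Python sketch is shown below, whereas in hardware the exponentials come from the HLS exp core and the division is deferred to this final stage.

    import numpy as np

    def classify(exponentials):
        total = exponentials.sum()            # step 1: accumulate all exponentials
        probs = exponentials / total          # step 2: divide each by the total
        winner = int(np.argmax(probs))        # step 3: index with the highest value
        return winner, probs[winner]

    cls, p = classify(np.exp(np.random.randn(19)))   # stand-in for the 19 class outputs
    print(f"class {cls} with probability {p:.3f}")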

2.3.5. FPGA Processing System

BOF extraction and the CNN use the programmable logic (PL) of the XC7Z020, while the processing system (PS) performs the following tasks (a sketch of the result message is given after the list):
  • Communication with the Kuka KRC5 robot controller through the MQTT protocol.
  • The PS receives control commands and configuration data for both the board components and the acquisition and classification process implemented in the PL.
  • The PS also sends status data and can send an image or the data of the detected object, such as its centroid, the class to which it belongs, and even the classification percentage for every class, so that the robot can grasp the part.
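As an illustration only, the result message sent to the robot controller could look like the following sketch; the topic name and payload fields are assumptions, not taken from the actual KRC5 integration.

    import json

    def result_message(class_index, probability, centroid_xy, per_class_scores):
        # Hypothetical MQTT topic and payload for the detected part.
        return "cell/vision/result", json.dumps({
            "class": class_index,            # winning class index
            "probability": probability,      # classification percentage of the winner
            "centroid": centroid_xy,         # object centroid in image coordinates
            "scores": per_class_scores,      # optionally, the percentage of every class
        })

    topic, payload = result_message(7, 0.97, [312, 245], [0.0] * 19)
    # payload would then be published on `topic` with any MQTT client (e.g., paho-mqtt).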

3. Results

This section first shows the main results of the CNN model implemented in Python and Keras. Then, we present the results obtained from the implementation of the CNN on the FPGA.

3.1. Results for CNN Model Selection

Several tests were performed to obtain the best CNN architecture for the task described in the previous section. First, the dataset was randomly shuffled and split: 60% (3694 samples) was selected for training, 20% (1231) for validation, and 20% (1231) for testing the model.
In order to achieve a robust model, the whole process was performed three times. The following four stages describe the flow of the process: (1) training the network and conducting a classification to validate performance; (2) calculating several performance metrics (error, loss, and accuracy); (3) adjusting the number of layers, size, and parameters; (4) redesigning in order to improve the considered metrics. Table 8 shows the intermediate loss and accuracy results for each stage.
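A sketch of this split and training flow is shown below, assuming the dataset arrays produced earlier (X with the BOF vectors, y with the class labels) and scikit-learn for the split; the file names, optimizer, and epoch count are assumptions, not the authors' settings.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.utils import to_categorical

    X = np.load("3dbofd_bofs.npy")        # (6156, 180) BOF vectors (hypothetical file name)
    y = np.load("3dbofd_labels.npy")      # (6156,) class indices 0..18

    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, shuffle=True)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5)

    model = build_modified_lenet5()       # from the sketch in Section 2.2
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(X_train[..., None], to_categorical(y_train, 19), epochs=50,
              validation_data=(X_val[..., None], to_categorical(y_val, 19)))
    model.evaluate(X_test[..., None], to_categorical(y_test, 19))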
Table 9 shows the precision and recall for each class, obtained by the selected CNN architecture. The two error metrics indicated that the achieved results were satisfactory for all classes in the described dataset.
The confusion matrix (see Figure 8) offers more evidence that the last modification improved the results.

3.2. Results of CNN Model Implemented in FPGA

Table 10 shows the consumed hardware resources (slice LUTs, slice registers, slices, BRAMs, and DSPs) in the FPGA for the CNN implementation. The report was obtained from Xilinx Vivado.
Table 11 shows the number of clock cycles and the latency for each layer and its stages. The main clock was 100 MHz, and although all layers were synchronized, it was sometimes necessary to wait for one process to complete before moving to the next.
Some layers were executed in parallel. Figure 9 shows the timing diagram of the process. The time consumed for one classification was 57.95 μs, including BOF extraction.

3.3. FPGA vs. Python Implementation Results

Python and FPGA implementations were tested with 5% of the data (60 BOFs) from the test set. Figure 10 shows the confusion matrices for the (a) Python and (b) FPGA implementations. In Python, there was only one classification error (class 7). In the FPGA, there were two classification errors (classes 0 and 13), although class 7, the class misclassified in the Python implementation, was correctly classified.
Table 12 shows the accuracy and average latency of both implementations for 60 samples. These results refer to the classification process only: in the Python case, the image had already been loaded in RAM, and in the FPGA case, the image was loaded into a BRAM used exclusively for performance-measurement purposes, clocked at the same frequency as the rest of the implementation (100 MHz). The CPU on which the Python implementation was tested was an AMD Ryzen™ 5 3400G processor with Radeon™ RX Vega 11 graphics and 16 GB of RAM at 3000 MHz in dual channel.

4. Discussion

Having fast, accurate, and reliable systems is of paramount relevance in the manufacturing industry. In this contribution, we described a method that fulfills those three attributes with acceptable performance. The presented method and its implementation in an FPGA could lead to an advanced manufacturing cell able to capture video, preprocess and postprocess video frames, and classify objects commonly involved in a production line, with real-time processing, low cost, low power consumption, and a size and weight small enough to be attached to the tip of a robotic arm.
Analysis of the confusion matrix of the system implemented in Python (see Figure 8) showed that the classification errors of classes 0 and 1 were caused by the similarity of those objects in the computed view when 0 < θ < 20. In addition, the system on the FPGA had an accuracy loss of 2%, which is quite acceptable in manufacturing; moreover, it is more than 655 times faster than its implementation on a PC.
The BOF had not been used before to extract information from an object viewed from different angles and to train a neural network. To validate this approach, we conducted a series of tests demonstrating that using the BOF instead of the image itself is a significant contribution.
In order to conduct a fair comparison between our proposal and state-of-the-art CNN models, we trained the latter with the image dataset that we constructed for the case under study, feeding those CNN models with the images and their rotations along the two described axes, α and θ.
CNNs are usually trained with datasets containing standard object classes (people, animals, and vehicles, among others). In this contribution, however, we used the images obtained in Step 2 of the methodology (Section 2), which we call the 3D Images Objects Dataset (3DIOD). Briefly, each image was segmented by selecting the region of interest of the object, scaled, and transformed into grayscale; with this, we formed the image dataset. Lastly, we trained two CNN models (ResNET50 [24] and MobileNetV2 [25]) with the image dataset. It was necessary to modify the output layer of one of them because the number of classes changed to 19. The reason for choosing these CNN models was their relatively small size compared with that of other CNNs. Table 13 shows the number of parameters, number of layers, average latency for classification, and accuracy for different known CNN models. This experiment was programmed only in Python and Keras.
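The image-based baselines can be set up along the following lines, assuming the Keras applications API with a 19-class head; the input size, channel handling, and training settings are assumptions rather than the authors' exact configuration.

    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), weights=None, include_top=False)
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(19, activation="softmax"),   # output layer resized to 19 classes
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(images_3diod, labels_3diod, ...)  # 3DIOD frames, e.g., grayscale stacked to 3 channels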
From the results shown in Table 13, we can affirm that our hypothesis is valid: our proposal (modified LeNet-5 and the BOF dataset) is faster and requires fewer parameters than the ResNET50 or MobileNetV2 CNN models, which take raw images as their input data.
Table 14 and Table 15 show the comparison of our FPGA implementation with state-of-the-art models. We split the comparison because Table 14 reports only the time consumed by the classification process, assuming that the image has already been loaded on the FPGA, whereas Table 15 focuses on fully implemented systems, in which the image is captured by a 60 FPS sensor and then transferred to the FPGA for inference.
Even though the cited works use images as input, in contrast with our approach in which we feed BOF vectors as inputs, the comparison is valid. The reason is that the transfer of knowledge from the trained CNN models (shown in Table 13) to this FPGA implementation does not degrade the performance.
The works cited in Table 14, Refs. [13,14], describe FPGA architectures for the MobileNetV2 and Tiny YOLOv3 (a modified YOLOv3 [26]) CNN models. The MobileNetV2 implemented in the first cited work is faster than the one implemented in the second, but at the cost of more consumed resources. Compared to these two systems, our work is faster and, at the same time, requires far fewer parameters to be trained.
Continuing with Table 15, the first of the cited works, Ref. [27], implemented a CNN model (YOLOv2 Tiny) directly in hardware. The second case listed in Table 15 is an implementation of the BOF and Fuzzy ARTMAP in hardware [16], using the same board as ours, the Zybo Z7-20. The dataset for that model consisted of 100 images of the objects of interest, one per category.
The classification latency of our model is 0.04995 ms; however, loading the image into the FPGA takes 6.152 ms. Therefore, the total latency is 6.2 ms, equivalent to 161.23 FPS, which is 2.25 times faster than [27]. The difference in logic resources consumed by each project is remarkable: ours uses fewer LUTs, BRAMs, and DSPs, and the same holds for power consumption and board price. About half of the XC7Z020 logic resources were used (LUTs, 58.90%; DSPs, 52.73%), so increasing the number of categories and images is feasible.
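The throughput figure quoted above follows directly from the two latency components; the short check below only restates that arithmetic.

    classification_ms = 0.04995          # BOF extraction + CNN inference on the FPGA
    image_load_ms = 6.152                # loading the frame into the FPGA
    total_ms = classification_ms + image_load_ms
    # ~161.2 FPS, i.e., about 2.25 times the 71.4 FPS reported for [27]
    print(f"total latency {total_ms:.3f} ms -> {1000.0 / total_ms:.2f} FPS")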

5. Conclusions

Manufacturing requires adaptable systems that are able to cope with several configurations and orientations of components to be assembled. Real-time responses are a critical attribute in manufacturing systems, and conducting efforts to efficiently implement classifiers in hardware is, beyond doubt, a path that offers stable and reliable results. In this contribution, we described such a system and showed that results are encouraging and competitive compared to state-of-the-art solutions.
In this contribution, we proved that, in the robotic assembly context, using families of BOFs as input to convolutional neural networks leads to faster and computationally more efficient training than using multiview images as input to the vision system. We based our findings on a modified LeNet-5 CNN model. The BOF case led to a latency of 37.97 ms for one inference and required 27,354 parameters, versus 57.14 ms and more than 2 million parameters for the multiview images. For the study case of interest in this contribution, both cases considered 19 classes of objects.
Continuing along the path of computational efficiency, the convolutional neural network implemented in the FPGA recognized the relevant objects with high accuracy 655 times faster than its implementation on a general-purpose computer running Python.
Compared with the state of the art, the latency for a single inference was 0.049 ms in our proposal (modified LeNet-5 trained with 3DBOFD) versus 0.34 ms in the cited work [13]. The latency for the entire process (capturing the image, transferring it, and classifying it in the FPGA) was 6.2 ms, versus 14 ms in the cited work [27].
We opened a path to further investigate the impact of reliable hardware for implementing complex classifiers, such as convolutional neural networks, and achieve competitive results. We plan to extend the number of classes, and adjust and improve a larger CNN model as future work. We also plan to build a more comprehensive dataset for additional experiments with the most representative objects in assembly tasks.

Author Contributions

Conceptualization, V.L.-B. and R.S.-F.; methodology, V.L.-B. and R.S.-F.; software, R.S.-F.; validation, V.L.-B., R.S.-F. and A.N.; formal analysis, R.S.-F. and A.N.; investigation, V.L.-B., R.S.-F. and M.P.-C.; resources, V.L.-B. and R.S.-F.; data curation, R.S.-F.; writing—original draft preparation, V.L.-B. and A.N.; writing—review and editing, M.P.-C.; visualization, V.L.-B.; supervision, V.L.-B. and M.P.-C.; project administration, V.L.-B.; funding acquisition, V.L.-B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw/processed data required to reproduce these findings are in the next link https://github.com/rich-sil/CNN_BOF.git, accessed on 8 January 2022.

Acknowledgments

Sincere thanks to Victor Martinez Pozos for his kind support in testing the system, and to Alejandra Cervera for her kind support in proofreading the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ANN       Artificial neural network
BOF       Boundary object function
BRAM      Block RAM
ES-BOF and CNN    BOF and CNN embedded system
PL        Programmable logic
PS        Processing system
3DBOFD    3D BOF dataset
3DIOD     3D Images Objects Dataset

References

  1. Björnsson, A.; Jonsson, M.; Johansen, K. Automated material handling in composite manufacturing using pick-and-place systems—A review. Robot. Comput.-Integr. Manuf. 2018, 51, 222–229. [Google Scholar] [CrossRef]
  2. Tella, R.; Birk, J.R.; Kelley, R.B. General Purpose Hands for Bin-Picking Robots. IEEE Trans. Syst. Man Cybern. 1982, 12, 828–837. [Google Scholar] [CrossRef]
  3. Iriondo, A.; Lazkano, E.; Ansuategi, A. Affordance-based grasping point detection using graph convolutional networks for industrial bin-picking applications. Sensors 2021, 21, 816. [Google Scholar] [CrossRef] [PubMed]
  4. Lopez-Juarez, I.; Howarth, M. Knowledge acquisition and learning in unstructured robotic assembly environments. Inf. Sci. 2002, 145, 89–111. [Google Scholar] [CrossRef]
  5. Shih, C.L.; Lee, Y. A simple robotic eye-in-hand camera positioning and alignment control method based on parallelogram features. Robotics 2018, 7, 31. [Google Scholar] [CrossRef] [Green Version]
  6. Fernandez, A.; Souto, A.; Gonzalez, C.; Mendez-Rial, R. Embedded vision system for monitoring arc welding with thermal imaging and deep learning. In Proceedings of the 2020 International Conference on Omni-Layer Intelligent Systems, COINS 2020, Barcelona, Spain, 31 August–2 September 2020. [Google Scholar] [CrossRef]
  7. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar]
  8. Cizek, P.; Faigl, J. Real-Time FPGA-Based Detection of Speeded-Up Robust Features Using Separable Convolution. IEEE Trans. Ind. Inform. 2018, 14, 1155–1163. [Google Scholar] [CrossRef]
  9. Rosten, E.; Drummond, T. Machine learning for high-speed corner detection. Lect. Notes Comput. Sci. 2006, 3951 LNCS, 430–443. [Google Scholar] [CrossRef]
  10. Calonder, M.; Lepetit, V.; Strecha, C.; Fua, P. BRIEF: Binary robust independent elementary features. Lect. Notes Comput. Sci. 2010, 6314 LNCS, 778–792. [Google Scholar] [CrossRef] [Green Version]
  11. Ulusel, O.; Picardo, C.; Harris, C.B.; Reda, S.; Bahar, R.I. Hardware acceleration of feature detection and description algorithms on low-power embedded platforms. In Proceedings of the FPL 2016—26th International Conference on Field-Programmable Logic and Applications, Lausanne, Switzerland, 29 August–2 September 2016. [Google Scholar] [CrossRef]
  12. Zhang, N.; Wei, X.; Chen, H.; Liu, W. FPGA implementation for CNN-based optical remote sensing object detection. Electronics 2021, 10, 282. [Google Scholar] [CrossRef]
  13. Wu, C.; Zhuang, J.; Wang, K.; He, L. MP-OPU: A Mixed Precision FPGA-based Overlay Processor for Convolutional Neural Networks. In Proceedings of the 31st International Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, 30 August–3 September 2021; pp. 33–37. [Google Scholar] [CrossRef]
  14. Liu, P.; Song, Y. A hybrid vision processing unit with a pipelined workflow for convolutional neural network accelerating and image signal processing. Electronics 2021, 10, 2989. [Google Scholar] [CrossRef]
  15. Pena-Cabrera, M.; Lopez-Juarez, I.; Rios-Cabrera, R.; Corona-Castuera, J. Machine vision approach for robotic assembly. Assem. Autom. 2005, 25, 204–216. [Google Scholar]
  16. Lomas-Barrie, V.; Pena-Cabrera, M.; Lopez-Juarez, I.; Navarro-Gonzalez, J.L. Fuzzy artmap-based fast object recognition for robots using FPGA. Electronics 2021, 10, 361. [Google Scholar] [CrossRef]
  17. Carpenter, G.A.; Grossberg, S.; Markuzon, N.; Reynolds, J.H.; Rosen, D.B. Fuzzy ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of Analog Multidimensional Maps. IEEE Trans. Neural Netw. 1992, 3, 698–713. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Sharma, A.; Vans, E.; Shigemizu, D.; Boroevich, K.A.; Tsunoda, T. DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture. Sci. Rep. 2019, 9, 11399. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  19. Zhang, X.; Zhao, J.; LeCun, Y. Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems 28 (NIPS 2015); MIT Press: Cambridge, MA, USA, 2015; pp. 3057–3061. Available online: http://xxx.lanl.gov/abs/1502.01710 (accessed on 8 January 2022).
  20. Ankita, A.; Rani, S.; Bashir, A.K.; Alhudhaif, A.; Koundal, D.; Gunduz, E.S. An efficient CNN-LSTM model for sentiment detection in BlackLivesMatter. Expert Syst. Appl. 2022, 193, 116256. [Google Scholar] [CrossRef]
  21. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  22. Analog Devices. Fixed-Point vs. Floating-Point Digital Signal Processing. Available online: https://www.analog.com/en/technical-articles/fixedpoint-vs-floatingpoint-dsp.html (accessed on 8 January 2022).
  23. Alcaraz, F.; Justin, J.; Eric, B. GitHub-Francof2a/Fxpmath: A Python Library for Fractional Fixed-Point (Base 2) Arithmetic and Binary Manipulation with Numpy Compatibility. Available online: https://github.com/francof2a/fxpmath#readme (accessed on 8 January 2022).
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. Available online: http://xxx.lanl.gov/abs/1512.03385 (accessed on 8 January 2022). [CrossRef] [Green Version]
  25. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. Available online: http://xxx.lanl.gov/abs/1801.04381 (accessed on 8 January 2022). [CrossRef] [Green Version]
  26. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. 2018. Available online: http://arxiv.org/abs/1804.02767 (accessed on 8 January 2022).
  27. Xu, K.; Wang, X.; Liu, X.; Cao, C.; Li, H.; Peng, H.; Wang, D. A dedicated hardware accelerator for real-time acceleration of YOLOv2. J. Real-Time Image Process. 2021, 18, 481–492. [Google Scholar] [CrossRef]
Figure 1. Obtaining the BOF dataset and training the network.
Figure 2. Part recognition process implemented on hardware.
Figure 3. (a) Example of manufacturing parts. (b) Studio.
Figure 4. Camera angles.
Figure 5. BOF extraction process for each object view.
Figure 6. BOF and CNN FPGA implementation.
Figure 7. Diagram of first convolution.
Figure 8. Confusion matrix of CNN last version model.
Figure 9. CNN diagram time performed in FPGA. (a) Whole process; (b) first μs of the process.
Figure 10. Confusion matrix of implemented CNN in (a) Python and (b) FPGA.
Table 1. Image size for every model compared with BOF vector size.
Model | LeNet-5 | AlexNet | VGG-16 | GoogLeNet | ResNet-50 (v1) | BOF and CNN
Size (pixels) | 32 × 32 | 227 × 227 | 231 × 231 | 224 × 224 | 224 × 224 | N/A
Total (bytes) | 1024 | 51,529 | 53,361 | 50,176 | 50,176 | 180 × (nbytes fixed point)
Table 2. LeNet-5 CNN model.
Layer | Size
Input | 1@32 × 32
Convolution | 6@28 × 28
Average pooling | 6@14 × 14
Convolution | 16@10 × 10
Average pooling | 16@5 × 5
Dense | 120
Dense | 84
Dense | 10
Table 3. First CNN model adaptation.
Layer | Conv1 | Max-Pooling1 | Conv2 | Max-Pooling2
Original size | 5 × 5 | 2 × 2 | 5 × 5 | 2 × 2
Adapted size | 1 × 5 | 1 × 2 | 1 × 5 | 1 × 2
Table 4. Parameters of final CNN model.
Layer | Out Dimension | Number of Parameters
Conv1 | (176, 5) | 30
Max_pooling1 | (88, 5) | 0
Conv2 | (84, 8) | 208
Max_pooling2 | (42, 8) | 0
Conv3 | (38, 12) | 492
Max_pooling3 | (19, 12) | 0
Conv4 | (15, 25) | 1525
Max_pooling4 | (7, 25) | 0
Dense1 | (1, 90) | 15,840
Dense2 | (1, 84) | 7644
Dense3 | (1, 19) | 1615
Table 5. Accuracy of n-bit fixed points per stage.
Stage | # of Bits → Accuracy
BOF | 7 → 92.0% | 9 → 96.0% | 11 → 98.0%
Conv. and fully connected layers | 9 → 40% | 16 → 96%
Classif. layer | 32 → 97%
Table 6. Number of bits for every fixed-point part in every stage.
Stage / # of Bits | Sign | Integer | Decimal
BOF | 1 | 1 | 7
Conv. layers | 1 | 3 | 12
Fully C. layers | 1 | 7 | 8
Classif. layer | - | 24 | 8
Table 7. Parameters for convolutional and max-pooling stages.
Stage | Kernel | Stride | Padding | Activation
Conv. | 1 × 5 | 1 | None | ReLU
Max pooling | 1 × 2 | 2 | None | -
Table 8. Accuracy and loss for modified CNN model.
CNN Model | Loss or Acc. | Training | Validation | Test
LeNet-5 | Loss | 0.0111 | 0.1657 | 0.1361
LeNet-5 | Accuracy | 0.9996 | 0.9594 | 0.9594
1st version | Loss | 0.1731 | 0.1277 | 0.1407
1st version | Accuracy | 0.9409 | 0.9602 | 0.9521
Last version | Loss | 0.312 | 0.0545 | 0.0845
Last version | Accuracy | 0.9897 | 0.9405 | 0.9781
Table 9. Precision and recall per class.
Class | 0 | 1 | 2 | 3 | 4 | 5 | 6
Precision | 0.9935 | 0.9705 | 0.9939 | 0.9818 | 0.9786 | 0.9969 | 0.9729
Recall | 0.9444 | 0.9136 | 1.0 | 0.997 | 0.9876 | 0.9876 | 0.997
Class | 7 | 8 | 9 | 10 | 11 | 12 | 13
Precision | 0.9068 | 1.0 | 0.9667 | 0.9938 | 1.0 | 0.9878 | 0.9937
Recall | 0.991 | 1.0 | 0.9846 | 0.997 | 0.961 | 0.997 | 0.9691
Class | 14 | 15 | 16 | 17 | 18
Precision | 0.967 | 1.0 | 0.991 | 0.9818 | 0.9906
Recall | 0.9938 | 0.9938 | 0.9784 | 1.0 | 0.9753
Table 10. FPGA resource consumption for layers.
Module | Slice LUTs | Slice Registers | BRAM | DSPs
BOF | 1075 | 468 | 35 | 2
Conv1 | 706 | 174 | - | -
Conv2 | 461 | 272 | - | 10
Conv3 | 558 | 399 | - | 16
Conv4 | 2993 | 814 | - | 24
Max_pool1 | 83 | 162 | - | -
Max_pool2 | 130 | 258 | - | -
Max_pool3 | 194 | 386 | - | -
Max_pool4 | 403 | 802 | - | -
Mem1 | 33 | 33 | 2 | -
Mem2 | 25 | 24 | 2 | -
Mem3 | 22 | 22 | 3 | -
Blk_mem0 | 7 | 2 | 7 | -
Blk_mem1 | 7 | 2 | 3 | -
Blk_mem2 | 7 | 2 | 2 | -
Dense1 | 5648 | 3209 | - | 25
Dense2 | 5140 | 2941 | - | 18
Dense3 | 749 | 344 | - | 14
Exp. | 1330 | 588 | - | 7
Classif. | 11,765 | 1675 | - | -
Total | 31,336 | 12,577 | 53 | 116
Available | 53,200 | 106,400 | 4900 | 220
Utilization | 58.90% | 11.82% | 1.08% | 52.73%
Table 11. Latency of CNN implemented in FPGA.
Layer | Clock Cycles | Time [μs]
Conv1 | 1056 | 10.56
Conv2 | 3786 | 37.86
Conv3 | 2480 | 24.80
Conv4 | 1843 | 18.43
Max_pool1 | 1051 | 10.51
Max_pool2 | 3736 | 37.36
Max_pool3 | 2406 | 24.06
Max_pool4 | 1691 | 16.91
Mem1 | 3810 | 38.10
Mem2 | 3722 | 37.22
Mem3 | 2577 | 25.77
Blk_mem0 | 1653 | 16.53
Blk_mem1 | 418 | 4.18
Blk_mem2 | 112 | 1.12
Dense1 | 1989 | 19.89
Dense2 | 516 | 5.16
Dense3 | 439 | 4.39
Exp. | 324 | 3.24
Classif. | 11 | 0.11
Table 12. Python vs. FPGA latency and accuracy.
 | Latency Mean for One Sample | Accuracy
Python | 37.97 [ms] | 98.33%
FPGA | 49.95 [μs] | 96.67%
Table 13. Latency of several CNN models, their number of parameters, number of layers, and achieved accuracy using the dataset obtained in Step 2 of the methodology.
CNN Model | Dataset | No. of Parameters | No. of Layers | Latency * [ms] | Accuracy
ResNET50 | 3DIOD | 23,626,643 | 50 | 118.02 | 97.9%
MobileNetV2 | 3DIOD | 2,282,323 | 53 | 57.14 | 98.5%
Modified LeNet-5 | 3DBOFD | 27,354 | 7 | 37.97 | 98.33%
* Latency mean for one sample.
Table 14. Latency comparison with related works for one classification.
 | [13] | [14] | Our Work
Year | 2021 | 2021 | 2022
Platform | XC7VX690T | XC7K325T | XC7Z020
Network | Mobilenet V2 / Tiny YoloV3 | Mobilenet V2 | Modified LeNet-5
LUTs | 279 K | 146 K | 31.3 K
BRAM | 912 | 92 | 53
DSPs | 3072 | 212 | 116
Classification latency [ms] | 0.34 / 2.91 | 3.25 | 0.049
Table 15. Latency comparison with related works for the entire classification process.
 | [27] | [16] | Our Work (CPU) | Our Work (Zybo)
Platform | Arria-10 GX1150 | Zybo Z7-20 (XC7Z020) | Ryzen 5 3400G, 16 GB | Zybo Z7-20 (XC7Z020)
Frequency [MHz] | 190 | 100 | CPU@4200 | 100
Network | YOLOv2 Tiny | FAM | Modified LeNet-5 | Modified LeNet-5
LUTs [K] | 145 * | 3.8 | N/A | 31.3
BRAM | 1027 ** | 50 | N/A | 53
DSPs | 1092 | 3 | N/A | 116
FPS | 71.4 | 160 | 26.33 | 161.23
Latency [ms] | 14 | 7.61 | 39 | 6.2
Power [W] | 26 | 11.1 | 100 | 7
Price [USD] † | 5520 | 199 | 736 | 199
† Approximate price found on the manufacturer's website. * Logic cells are ALMs. ** Converted to 36 Kb blocks each.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
