CNN Training Using 3D Virtual Models for Assisted Assembly with Mixed Reality and Collaborative Robots

Abstract: The assisted assembly of customized products supported by collaborative robots combined with mixed reality devices is a current trend in the Industry 4.0 concept. This article introduces an experimental work cell implementing the assisted assembly process for customized cam switches as a case study. The research aims to design a methodology for this complex task with full digitalization and transformation of data from all vision systems into digital twin models. The position and orientation of assembled parts during manual assembly are marked and checked by a convolutional neural network (CNN) model. Training of the CNN was based on a new approach using virtual training samples with single shot detection and instance segmentation. The trained CNN model was transferred to an embedded artificial processing unit with a high-resolution camera sensor. The embedded device redistributes the detected part position and orientation data to the mixed reality device and the collaborative robot. This approach to assisted assembly using mixed reality, a collaborative robot, vision systems, and CNN models can significantly decrease assembly and training time in real production.


Introduction and Related Works
Collaborative robots and their implementation in the assisted assembly process are an important part of the Industry 4.0 concept. They can work in the same workspace as human workers and perform basic manipulations or simple monotonous assembly tasks. This area is open to new research, methodology development and definition of basic requirements, because real applications in production processes are currently still limited.
The main advantage of using collaborative robots in the assembly process is the minimal transport delay of assembly parts between manual and automated operations. Other benefits are, for example, an integrated vision system for additional inspection of manual operations, and the possibility of providing an interface for digital data collection from sensors and communication with external cloud platforms.
Appropriate human-robot cooperation can significantly improve assembly time, but both sides must have exactly defined methods of communication. For example, the collaborative robot can check the success of a worker's assembly operation with its integrated camera, and the worker can get information about this status through a mixed reality device. The mixed reality device can also shorten staff training time. Configuration principles of a collaborative robot in an assembly task were introduced in [1], a framework to implement collaborative robots in manual assembly in [2], and a human-robot collaboration framework for improving ergonomics in [3]. An important condition for the assisted assembly process is the synchronization of augmented (AR), virtual (VR), or mixed reality (MR) devices with the real assembly process. The main novelty and innovation contribution of this article is a complex methodology for CNN training with virtual 3D models and the design of a communication framework for assisted assembly devices such as a collaborative robot and a mixed reality device.

Methodology of Deep Learning Implementation into the Assisted Assembly Process
A methodology of CNN training using 3D virtual models for deep learning implementation into the assisted assembly process is based on the automated generation of input sample data for learning without any monotonous manual work. All tasks, such as object positioning, background and material changes, can be automated by a scripting language. The methodology can be divided into eight steps:
(1) Creation of 3D virtual models of the experimental assembly in any 3D design software, or point-cloud creation by laser scanning technology with conversion to a standard 3D format (OBJ, FBX, STL, IGES, etc.);
(2) Import of the 3D models into software with cinematic rendering and dynamics simulation;
(3) Design of algorithms for an automated data queue of part positioning, rotation, and camera setup according to part size;
(4) Rendering of two sets of images: the first for CNN teaching and the second for the automated annotation algorithm;
(5) Creation of XML files for single shot detection and JSON files for instance segmentation;
(6) Automated ratio sorting into training and testing samples and moving them to separate folders;
(7) Training of the convolutional neural network for part classification and localization (using single shot detection and instance segmentation);
(8) Transfer of the CNN models to embedded devices for inference of the trained model and distribution of the detected position data to the assisted assembly systems: the collaborative robot's internal Cartesian system and the mixed reality device's anchoring system.
The detected objects are placed on the floor and can be rotated only around one axis, with a chosen increment from 20° to 360°, i.e., in 1 to 18 steps. The translation and rotation of virtual parts in the scene are computed by standard translation and rotation matrices for placement in a 3D environment [24].
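The placement math behind step (8) and the rotation increments can be sketched in pure Python (a minimal illustration of the standard homogeneous matrices; the actual generator applies the same transforms through the engine's scripting API):

```python
import math

def rotation_z(angle_deg):
    """4x4 homogeneous rotation matrix about the Z axis (row-major nested lists)."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0, 0],
            [s,  c, 0, 0],
            [0,  0, 1, 0],
            [0,  0, 0, 1]]

def translation(tx, ty, tz):
    """4x4 homogeneous translation matrix."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def apply(m, p):
    """Apply a 4x4 homogeneous matrix to a 3D point (x, y, z)."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(3))

# Rotate a part 90 degrees about Z, then place it at (100, 50, 0) on the floor:
rotated = apply(rotation_z(90), (10.0, 0.0, 0.0))   # ~ (0, 10, 0)
placed = apply(translation(100, 50, 0), rotated)    # ~ (100, 60, 0)
```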
The detected object can be too small (for example, nuts or washers), so it is necessary to apply magnification to change the field of view of the camera according to Equation (1), where FOV_{H,V} is the field of view (horizontal or vertical), D_{X,Y} is the object dimension in the X or Y axis, H is the distance to the object [mm], M is the magnification (set to 0.5× or 0.25×), and f_L is the focal length [mm]; the magnification is applied when the object dimension D_{X,Y} is small relative to FOV_{H,V}. The number of generated 2D images from all imported parts is counted by the simple Equation (2):

n = p · f · n_α,   (2)

where p is the number of imported parts, f is the number of used floors, and n_α is the number of rotations for every part (in the range from 1 to 18). Figure 1 presents a diagram of the proposed methodology of automatic data preparation for automated CNN training using experimental values, evaluation, and execution in the embedded device.
Appl. Sci. 2021, 11, x FOR PEER REVIEW 4 of 16
Figure 1. The simplified algorithm for samples generation from 3D virtual models of assembly parts.
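Equation (2) is simple enough to check directly; a minimal sketch using the experimental setup described later (31 parts, 5 floor textures, 18 rotation steps at a 20° increment):

```python
def sample_count(parts, floors, rotations):
    """Number of generated 2D images per Equation (2): n = p * f * n_alpha."""
    assert 1 <= rotations <= 18, "rotation steps are limited to the range 1..18"
    return parts * floors * rotations

# Cam-switch setup: 31 parts, 5 floors, 18 rotations (20 degree increment)
print(sample_count(31, 5, 18))  # → 2790
```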

Experimental Platform
The research has been conducted in the SmartTechLab for Industry 4.0 at the Faculty of Manufacturing Technologies of the Technical University of Kosice. An experimental SMART manufacturing system is installed there, established primarily for research purposes, but also for collaboration with companies and for teaching. An important part of this system is a work cell for assisted assembly with incorporated technologies for parts recognition, mixed reality and collaborative robotics (Figures 2 and 3).
Figure 2. A scheme of the experimental assisted assembly work cell with CNN processing unit, mixed reality device and collaborative robot.
The point of interest for the experimental assembly is a cam switch consisting of 31 parts made from different materials: plastic, rubber, stainless steel, and brass. The disassembled parts are shown in Figure 4a, and the assembled product in Figure 4b.

Input Data Preparation for CNN Training
CNN models can work reliably for assembly parts recognition, but a problem is the preparation of input data for their training. A very large quantity of input samples needs to be prepared, usually several hundred per assembly part, because each part has to be captured with different angular/translation variations and also with different backgrounds and materials. Replacing real images of assembly parts with their 3D virtual models can significantly accelerate this process. Applying virtual models is also a trend of the Industry 4.0 concept, as they can represent the real production process or product. Such virtual models digitally replicate all aspects of real products and are called digital twins. They consist of 3D models of parts grouped into assemblies with the possibility of data synchronization with the real product, so the first step in the methodology of deep learning implementation into the assisted assembly process is the creation of a digital twin of the assembled product, in our case a cam switch (Figure 5). This digital twin will also serve as an input model for the mixed reality device used in staff training.


Automated 2D Image Generation by the Unreal Engine
In previous research [24], Blender 3D software was combined with the Python scripting language for automated generation of the CNN training set, but the Blender rendering engine does not provide cinematic quality in the generated samples. This disadvantage decreases the classification precision of the trained convolutional networks significantly, by about 20 to 30%. The new approach is based on cinematic rendering in the Unreal Engine combined with the Blueprint and Python scripting languages. Another new feature is dynamic collision, used for realistic shadow rendering under the generated samples on different surfaces. An example of a part's position in the Unreal Engine editor before dynamic simulation (left) and the part with its shadow after simulation (right) is shown in Figure 6.

The basic algorithm is coded in the Blueprint scripting language; an example of a subprogram for rotating a 3D virtual part around the Z axis is shown in Figure 7. The initial parameters for 2D sample generation are the rotation angle, the type of CNN model (which determines the base image resolution), the number of generated backgrounds (floors), and the annotation file type. The selection of basic parameters in the Unreal HUD menu before the automated generation starts is shown in Figure 8. The full assembly of the cam switch consists of 31 different parts, the basic setup uses 5 different floor textures, and the rotation angle in the Z axis can be set from 20° to 360°.
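The Blueprint generation loop can be paraphrased in Python to show the shape of the render-job queue over parts, floors and Z rotations (part and texture names here are hypothetical placeholders):

```python
def generation_queue(part_names, floor_textures, angle_step_deg):
    """Enumerate every (part, floor, Z-rotation) render job; the Z angle runs
    from 0 up to (not including) 360 in the chosen increment."""
    if not 20 <= angle_step_deg <= 360:
        raise ValueError("angle step must be between 20 and 360 degrees")
    jobs = []
    for part in part_names:
        for floor in floor_textures:
            for angle in range(0, 360, angle_step_deg):
                jobs.append((part, floor, angle))
    return jobs

# Illustrative names only: 2 parts x 2 floors x 4 rotations = 16 rendered samples
jobs = generation_queue(["nut_m4", "cam_disc"], ["wood", "steel"], 90)
print(len(jobs))  # → 16
```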



An Automated Annotation by the OpenCV Algorithms
The input condition for automated annotation is a black background and a binary threshold to get clear object edges. Two basic annotation methods were selected for the 2D images generated from 3D virtual models:
• Single Shot Detection (SSD) annotation by a basic unrotated bounding box in XML format for LabelImg;
• Instance segmentation annotation by a polygon with variable approximation in JSON format for LabelMe.
The process of automated SSD annotation and evaluation is shown in Figure 9. The resolution of the generated images can be matched exactly to the CNN model used. The first tested model was Faster R-CNN with Inception v2 with a default resolution of 600 × 1024 × 3. Test samples are separated randomly from the training set for every generated part, by default 25%. Segmentation needs much more precise thresholding than single shot detection, so a morphological closing algorithm was used to get a precise contour of the object. An example of thresholding with object contour closing is shown in Figure 10.
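The SSD annotation step can be sketched as follows; for portability this illustration computes the bounding box in pure Python instead of cv2.threshold/cv2.boundingRect, and emits a reduced version of the Pascal VOC XML layout that LabelImg reads (field set trimmed for brevity):

```python
import xml.etree.ElementTree as ET

def bounding_box(mask):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of nonzero pixels in a
    binary mask given as a list of rows; stands in for cv2.boundingRect."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    return min(xs), min(ys), max(xs), max(ys)

def voc_xml(filename, label, box, width, height):
    """Minimal Pascal VOC annotation for one object."""
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = filename
    size = ET.SubElement(ann, "size")
    for tag, val in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(val)
    obj = ET.SubElement(ann, "object")
    ET.SubElement(obj, "name").text = label
    bb = ET.SubElement(obj, "bndbox")
    for tag, val in zip(("xmin", "ymin", "xmax", "ymax"), box):
        ET.SubElement(bb, tag).text = str(val)
    return ET.tostring(ann, encoding="unicode")

mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0]]
box = bounding_box(mask)
print(box)  # → (1, 1, 2, 2)
xml_text = voc_xml("part_sample.png", "nut", box, 1024, 600)
```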


The XML format for SSD is accepted as a standard, but instance segmentation has many formats: COCO JSON, CSV, LabelMe JSON, RLE, etc. The simplest JSON structure is the LabelMe format with a polygon shape, which can easily be incorporated into the automated contour annotation process with OpenCV. Automated contour detection by OpenCV can provide a better contour than the manual process, which significantly improves CNN instance segmentation after training, as can be seen in Figure 11.
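A minimal sketch of the LabelMe-style output, assuming the polygon has already been extracted by the OpenCV contour step (cv2.findContours + cv2.approxPolyDP); the version string and field set are illustrative, not an exhaustive reproduction of the format:

```python
import json

def labelme_annotation(image_name, label, polygon, width, height):
    """Minimal LabelMe-style JSON document with one polygon shape."""
    return json.dumps({
        "version": "4.5.6",           # illustrative version string
        "shapes": [{
            "label": label,
            "points": [[float(x), float(y)] for x, y in polygon],
            "shape_type": "polygon",
        }],
        "imagePath": image_name,
        "imageHeight": height,
        "imageWidth": width,
    }, indent=2)

doc = labelme_annotation("cam_disc_sample.png", "cam_disc",
                         [(10, 10), (60, 12), (58, 55), (12, 50)], 1024, 600)
```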

The Generated Training and Testing Sample Set
An example of the results of automated sample generation with XML annotations converted to CSV files is shown in Figure 12.


To start the CNN model training process, it is only necessary to copy the train and test folders and the two cumulative CSV annotation files into the TensorFlow folder and create TF_record files for training and testing.
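The grouping step that precedes TF_record creation can be sketched as follows (a stdlib-only illustration of splitting the cumulative CSV by image file, as the common TensorFlow generate_tfrecord scripts do; the column names and file names are assumptions):

```python
import csv
import io
from collections import defaultdict

def group_annotations(csv_text):
    """Group cumulative annotation CSV rows by image file name, the usual
    split performed before each image is written as one TF_record example."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["filename"]].append(row)
    return dict(groups)

# Hypothetical rows for illustration:
csv_text = """filename,class,xmin,ymin,xmax,ymax
part01_f1_a000.png,nut,10,12,40,44
part01_f1_a000.png,washer,50,50,90,92
part02_f1_a000.png,shaft,5,8,200,60
"""
groups = group_annotations(csv_text)
print(len(groups))  # → 2 images
```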

All the information necessary to identify a sample is encoded in the sample image name:
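As an illustration only — the actual encoding is not reproduced here — a parser for a hypothetical name pattern like part_f<floor>_a<angle>.png could look like:

```python
import re

# Hypothetical naming scheme for illustration; the generator's real encoding
# may differ.
NAME_RE = re.compile(r"(?P<part>[a-z0-9_]+)_f(?P<floor>\d+)_a(?P<angle>\d{3})\.png")

def parse_sample_name(name):
    """Decode part class, floor index and Z-rotation angle from a file name."""
    m = NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognized sample name: {name}")
    return m["part"], int(m["floor"]), int(m["angle"])

print(parse_sample_name("cam_disc_f3_a140.png"))  # → ('cam_disc', 3, 140)
```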

Experimental Results and Implementation into the Assembly Process
An initial experiment on cam switch parts recognition was executed using a small set of training samples (five per part, 155 samples altogether) with different floors. Considering this small teaching set, the obtained results are acceptable (see Table 1). The training process for Inception V2 is shown in Figure 13, where the X axis is the number of training cycles and the Y axis is mAP. CNN models with single shot detection can be retrained very quickly by transfer learning, with acceptable results within less than 2 h of training without a dedicated GPU (the pretrained Faster R-CNN Inception V2 SSD reached the required accuracy within 1.35 h). CNN segmentation models need 4-5 times longer for successful training on the same input samples (the pretrained Mask R-CNN Resnet101 reached the required accuracy in up to 5.20 h). In contrast to SSD, where the model is saved after an unpredictable number of iterations, segmentation models are stored after each epoch. The results of recognition on other images (not from the training set) of some parts for both tested CNN models (SSD and instance segmentation) are shown in Figure 14.

The inference-time experiments with the trained CNN models (SSD and Mask) were performed on many different platforms. The delay results presented in Tables 2 and 3 are average inference times from multiple test loops recognizing 40 sample images. The results for the SSD Inception V2 model with TensorFlow 1 are in Table 2, and for the Resnet101 segmentation model with TensorFlow 2 and Pixellib in Table 3. The FP16 SSD Inception V2 model can reach about 3 FPS, which is an acceptable part-identification delay for checking worker assembly tasks and collaborative-robot assembly status. The experiment with the Mask R-CNN segmentation model reached about a 700 ms delay on the AGX device, which is acceptable in comparison to a desktop PC with a high-performance CPU.
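The averaging procedure behind Tables 2 and 3 can be sketched as follows (a hedged illustration with a stub in place of the trained model; discarding a few warm-up runs is an assumption about how model-initialisation cost is excluded):

```python
import time

def average_inference_ms(infer, images, warmup=3):
    """Average per-image delay in milliseconds over a recognition loop,
    after a few warm-up inferences to exclude one-time model setup cost."""
    for img in images[:warmup]:
        infer(img)
    start = time.perf_counter()
    for img in images:
        infer(img)
    return (time.perf_counter() - start) * 1000.0 / len(images)

# Stub standing in for a trained model's forward pass:
def fake_infer(img):
    return {"class": "nut", "bbox": (0, 0, 10, 10)}

delay = average_inference_ms(fake_infer, [object()] * 40)
```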
The mixed reality device is based on the ARM64 architecture, which does not provide enough power for executing the trained inference model. The new approach is to stream video data to the NVIDIA Xavier APU, which runs the inference model and sends back only the extracted data as feedback: the bounding box, contour polygon, and classification result.
The collaborative work cell contains a SMART vision system consisting of three cameras. The primary camera is connected to the NVIDIA Xavier AGX (Figure 15a), where the trained CNN model is uploaded. The second camera is integrated in the mixed reality device (Figure 15b), and the third camera is integrated in the right hand of the collaborative robot (Figure 15c). The principle of parts detection can be described in these steps: (1) Get images from the mixed reality device and the collaborative robot hand; (2) Send the images to the front of the data queue; (3) Receive the images on the NVIDIA Xavier embedded device; (4) Run inference on the images with the selected CNN model and acquire the part position data; (5) Send the bounding box (contour) and classification value by TCP communication to both devices (mixed reality device and collaborative robot).
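Step (5) can be sketched as a loopback demonstration (the JSON payload shape and newline framing are illustrative assumptions, not the exact protocol of the work cell):

```python
import json
import socket
import threading

def encode_detection(label, bbox, contour):
    """Serialize one detection result (classification, bounding box, contour
    polygon) as a newline-terminated JSON line; payload shape is illustrative."""
    return (json.dumps({"class": label, "bbox": bbox, "contour": contour}) + "\n").encode()

# Loopback demo: the embedded device pushes the detection to a connected
# client (mixed reality device or robot controller).
srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # ephemeral port for the demo
srv.listen(1)
msg = encode_detection("cam_disc", [120, 80, 310, 260],
                       [[120, 80], [310, 80], [310, 260], [120, 260]])

def push():
    conn, _ = srv.accept()
    conn.sendall(msg)
    conn.close()

t = threading.Thread(target=push)
t.start()
cli = socket.create_connection(srv.getsockname())
received = json.loads(cli.makefile().readline())
cli.close()
t.join()
srv.close()
print(received["class"])  # → cam_disc
```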
The implementation principle of parts recognition in the collaborative work cell is shown in Figure 16. Images captured by all vision systems (static dual 4K e-con cameras connected to the NVIDIA Xavier AGX with JetPack 4.4, the Cognex 7200 camera integrated in the right hand of the ABB YuMi, and the internal head camera of the Microsoft Hololens 2) are shown in Figure 17. The application for the mixed reality device Hololens 2 is coded in the same software (Unreal Engine) as the sample generation software, but it does not use Python and is coded only in the Blueprint programming language with the UXTool library.

The designed calibration principle of all the used vision systems to one Cartesian coordinate system is shown in Figure 18, and their synchronization is realized in these steps:
• The NVIDIA Xavier AGX system with the e-con dual 4K camera is static, and its default zero position is set as a fixed X, Y offset in pixels;
• The ABB YuMi collaborative robot position is realized by an offset from the home point to the axis of the left-hand vision system in the Rapid programming language;
• The mixed reality device Microsoft Hololens 2 is synchronized by a QR code placed in the lower-left corner of the assembly table and by rotation measured by the integrated MEMS sensors.

• The information about the part position adjusted by the QR code is shared among all devices as a part identification, where the position is X, Y; the size is W, H; and the contour of the detected part is represented by an array of points C.
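The fixed-offset mapping in the first synchronization step can be sketched as follows (the offset and scale values are hypothetical placeholders, not the calibrated constants of the work cell):

```python
def pixel_to_cartesian(px, py, offset_mm=(412.0, 166.0), mm_per_px=0.5):
    """Map a pixel coordinate from the static camera into the shared Cartesian
    frame using a fixed X, Y offset and a scale factor; values are illustrative,
    the real offsets come from the QR-code / home-point calibration."""
    return offset_mm[0] + px * mm_per_px, offset_mm[1] + py * mm_per_px

x, y = pixel_to_cartesian(100, 50)
print((x, y))  # → (462.0, 191.0)
```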

Current deep learning frameworks provide only image augmentation, which merely reduces the number of images that need to be prepared. This means the most monotonous work in implementing deep learning in real applications still exists, and it is the main reason why deep learning is not profitable in small-series assembly tasks. On the other hand, the automated generation of training samples from CAD models, which are available before production starts, can help bring more assisted assembly solutions into practice.
The research field of automated generation of training samples for CNN models from 3D virtual models has great potential to expand. Current progress in GPUs with real-time ray tracing can provide new rendering possibilities to reach cinematic quality in object visualization and fast preparation of virtual samples. An interesting project is Kaolin from NVIDIA, a modular differentiable renderer for applications like high-resolution simulation environments, though it is still only a library under research.
Early research progress is improving the presented solution with parts-overlay recognition implemented in the Unreal Engine. An example of the first test implementation with overlays is shown in Figure 19.
Early research progress includes improving the presented solution with parts overlay recognition implemented in the Unreal Engine. An example of the first test implementation with overlays is shown in Figure 19.

Figure 19. Parts overlay early research implemented in the Unreal Engine.

Conclusions
An automated generation of training samples based on 3D virtual models is a new approach in the field of deep learning that can save many hours of manual work. The research presented in this article introduces a methodology of CNN training for deep learning implementation in the assisted assembly process. This methodology was evaluated in an experimental SMART manufacturing system with an assisted assembly work cell, using a cam switch as the chosen assembly product from real production, where a fully manual assembly process is still used [31].
To summarize, the following experiments have been performed and the main research results acquired in the field of CNN training for parts recognition in the assisted assembly process:
• A Blueprint program for the automated generation of 2D images from 3D virtual models as a CNN training set has been created in the Unreal Engine 4 software;
• A Python algorithm with the OpenCV library has been implemented for automated image annotation, producing XML for single shot detection and JSON for instance segmentation;
• Two separate CNN models (SSD and Mask R-CNN) have been trained in the TensorFlow 2 framework to evaluate the proposed methodology of deep learning implementation in the assisted assembly process;
• Inference experiments with the trained CNN models have been performed and evaluated on different platforms, including an embedded APU;
• Parts recognition data transfer among mixed reality devices, a collaborative robot, and an embedded APU device has been designed and implemented.
Future work can be divided into several steps, with plans focused mainly on extending the current software:

• To increase the rendered image quality to cinematic level by real-time ray tracing implemented in the new generation of GPUs;
• To implement automated parts overlay recognition, which can be solved simply using Unreal Newton physics and automatic switching to a black texture for overlapping objects as input for OpenCV annotation;
• To transform the current experimental project into a free plugin for the Unreal Engine software.
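The planned black-texture overlay idea implies that, for each object, both a full silhouette mask and a visible (non-occluded) mask are available from the renderer. A minimal sketch of how the occlusion ratio could then be computed is shown below; the function name and the masks are hypothetical, not part of the article's software.

```python
import numpy as np

def occlusion_ratio(full_mask, visible_mask):
    """Fraction of an object's silhouette hidden by overlapping parts.

    Assumes the engine renders occluded regions black, so the visible
    mask is a subset of the full silhouette mask.
    """
    full = np.count_nonzero(full_mask)
    if full == 0:
        return 0.0
    visible = np.count_nonzero(visible_mask)
    return 1.0 - visible / full

# Toy example: a 6x6 object with its right half hidden by another part
full = np.zeros((10, 10), dtype=np.uint8)
full[2:8, 2:8] = 1
vis = full.copy()
vis[2:8, 5:8] = 0
ratio = occlusion_ratio(full, vis)
```

Heavily occluded samples could then be filtered out of the training set, or annotated with only their visible contour.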