# Trajectory Prediction of Assembly Alignment of Columnar Precast Concrete Members with Deep Learning


## Abstract


## 1. Introduction

#### 1.1. Assembly Technologies

#### 1.2. Computer Vision

#### 1.3. Modeling and Prediction Based on Deep Learning

## 2. Our Approach

Take the images {$I_{yz}$, $I_{xz}$} in the X and Y directions of the assembly coordinate system, define the embedded hole at the bottom of the precast concrete member as moving target C, and define the reinforcement as fixed target B. From the object detection model we obtain bounding boxes of the fixed target B = ($x_{b}$, $y_{b}$, $w_{b}$, $h_{b}$) and the moving target C = ($x_{c}$, $y_{c}$, $w_{c}$, $h_{c}$), where ($x_{b}$, $y_{b}$) is the center point coordinate and $w_{b}$, $h_{b}$ are the width and height of the bounding box of B; ($x_{c}$, $y_{c}$), $w_{c}$, and $h_{c}$ are defined analogously for C.

The coordinate images $I_{yz}$ and $I_{xz}$ are thus obtained. We formalize the original scene into a grid diagram, as shown in Figure 3. $D_{s}$ is composed of a series of adjacent position blocks, so that each block corresponds to a specific position $S_{i}$ of the scene; $S_{i}$ represents the position of that block in the image corresponding to the scene. We call $D_{s}$ the decision area, as it represents the future appearance positions of the moving target: the higher the excitation value of a position, the higher the matching probability between the moving target and that position. According to Figure 3a, the future movement direction of the moving target in the X direction can be obtained; similarly, Figure 3b gives the movement direction in the Y direction. The end point of the moving target's trajectory is matched with the fixed target: trajectory prediction terminates when the center abscissa of the moving target's bounding box equals the center abscissa of a fixed target's bounding box.
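The termination rule and the grid decision area can be sketched in a few lines. The box format (x_center, y_center, w, h) follows the definitions above, while the grid-mapping helper and the tolerance parameter are illustrative additions, not part of the paper:

```python
def trajectory_terminated(moving_box, fixed_boxes, tol=0.0):
    """Stop prediction when the center abscissa of the moving target's
    bounding box equals (within tol) that of some fixed target's box.
    Boxes are (x_center, y_center, w, h)."""
    xc = moving_box[0]
    return any(abs(xc - fb[0]) <= tol for fb in fixed_boxes)

def position_index(x, y, img_w, img_h, grid_w, grid_h):
    """Map an image coordinate to the index of its block S_i in the
    decision-area grid D_s (row-major order)."""
    col = min(int(x / img_w * grid_w), grid_w - 1)
    row = min(int(y / img_h * grid_h), grid_h - 1)
    return row * grid_w + col
```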

#### 2.1. Object Detection Network

$s_{min}$ is the minimum scale parameter and $s_{max}$ is the maximum scale parameter. Since the detection targets in this paper are circular holes and rebar, the aspect-ratio parameter of the default bounding boxes is set to ${a}_{r}\in \left\{2,3,\frac{1}{2},\frac{1}{3}\right\}$. Each bounding box carries the center point coordinates (x, y), the width and height (w, h), and a confidence score.
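In the standard SSD default-box scheme these parameters combine as $w = s\sqrt{a_r}$, $h = s/\sqrt{a_r}$, with scales spaced linearly between $s_{min}$ and $s_{max}$. A minimal sketch, where the number of feature maps and the default scale values are illustrative rather than taken from the paper:

```python
import math

def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Linearly spaced default-box scales for m feature maps:
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1)
            for k in range(1, m + 1)]

def default_box_shapes(scale, aspect_ratios=(2, 3, 1/2, 1/3)):
    """Width/height pairs for one scale: w = s*sqrt(a_r), h = s/sqrt(a_r),
    so every box at a given scale covers the same area s**2."""
    return [(scale * math.sqrt(a), scale / math.sqrt(a))
            for a in aspect_ratios]
```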

#### 2.2. Trajectory Prediction Network

The reward of the moving target $I_{c}$ is predicted by inputting the sliding window $p({S}_{i})$ into the network: ${R}_{reward}({S}_{i})={F}_{S}({I}_{c},p({S}_{i}),{r}_{t};\epsilon )$, where ${R}_{reward}({S}_{i})$ is the reward ${r}_{c,t}^{m}$ for each position ${S}_{i}$ and $p({S}_{i})$ represents a sliding window. A larger value means a higher reward for that position, namely a higher probability that the object will reach that position in the future. ${F}_{S}$ represents the forward propagation function, and $\epsilon $ is the learning rate. ${R}_{reward}({S}_{i})$ is then used to establish the cost equation.
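The sliding-window scoring can be sketched as follows. The scoring function stands in for the forward propagation $F_{S}$, and the reward-to-cost conversion shown here (subtracting each reward from the maximum) is only an illustrative stand-in for the paper's cost equation:

```python
def build_cost_map(scene, window, stride, score_fn):
    """Slide a window p(S_i) over the scene, score each position with
    score_fn (a stand-in for the forward propagation F_S), and convert
    rewards into costs: the highest-reward position gets cost 0."""
    h, w = len(scene), len(scene[0])
    wh, ww = window
    rewards = {}
    for top in range(0, h - wh + 1, stride):
        for left in range(0, w - ww + 1, stride):
            patch = [row[left:left + ww] for row in scene[top:top + wh]]
            rewards[(top, left)] = score_fn(patch)
    max_r = max(rewards.values())
    return {pos: max_r - r for pos, r in rewards.items()}
```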

## 3. Experiments

#### 3.1. Dataset

#### 3.2. Preprocessing

#### 3.3. Training

1. Input the image sequence with target annotations into the object detection network; supervised learning is used to train it, and it outputs images with significant target boxes.
2. After grouping and sorting the significant-target image sequences, input them into the trajectory prediction network for training and feature extraction. Calculate the reward with the reward estimator, Equation (3), which measures how close the prediction region is to the ground-truth region, and generate the cost map.
3. Obtain the motion directions through the direction estimator.

#### 3.4. Results

## 4. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest


**Figure 2.** Total process of our approach: (**a**) obtain the coordinate images $I_{yz}$ in the X direction and $I_{xz}$ in the Y direction; (**b**) preprocessing: mask processing and histogram equalization; (**c**) object detection framework; (**d**) trajectory prediction framework; (**e**) direction estimator. The assembly coordinate system X-Y-Z is established with the steel bar as the origin.

**Figure 3.** Coordinate images $I_{yz}$ in the X direction and $I_{xz}$ in the Y direction. The embedded hole at the bottom of the precast member is defined as moving target C, and the reinforcement is defined as fixed target B.

**Figure 4.** The object detection network includes conv1, pool1, conv2_2, conv3_2, conv4_2, conv5_2, and DSSD layers. A pooling layer follows conv1. The conv2_2 layer uses a 3 × 3 × 256-s2 convolution kernel, where s2 means the convolution stride is 2. The convolution here uses the atrous algorithm.

**Figure 5.** Prediction network: after features are extracted from the target position image by the convolutional layers, the two-dimensional array is flattened into a one-dimensional array. A Long Short-Term Memory (LSTM) layer processes the one-dimensional data, which have sequence characteristics. After processing, the data are reshaped into a two-dimensional array and output as the cost map.

**Figure 7.** Location of reinforcement and hole. The blue dotted line represents the location of the moving target (hole); the red dots represent the fixed targets (rebar). Camera 1 and Camera 2 are placed in the X and Y directions of the assembly coordinate system, respectively.

**Figure 8.**The left side of the graph is the image and histogram before processing, and the right side is the image and histogram after processing. In the first line, the maximum number of pixels before processing is 14,739, and the maximum number of pixels after processing is 4734. In the second line, the maximum number of pixels before processing is 14,739, and the maximum number of pixels after processing is 4734. As can be seen from the figure, the processed image has a more balanced distribution of pixels. The yellow arrow indicates that an image block whose average pixel value exceeds the intensity limit allocates redundant pixels to its adjacent image blocks.
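The clip-and-redistribute behavior described in the caption, where blocks exceeding the intensity limit pass excess pixels on, resembles contrast-limited histogram equalization. A minimal single-pass sketch of that step on a histogram (real implementations typically iterate until no bin exceeds the limit, and the clip limit here is illustrative):

```python
def clip_and_redistribute(hist, clip_limit):
    """Clip histogram bins at clip_limit and spread the excess counts
    uniformly over all bins, preserving the total pixel count. A single
    pass: clipped bins may end slightly above the limit afterwards."""
    excess = sum(max(0, c - clip_limit) for c in hist)
    clipped = [min(c, clip_limit) for c in hist]
    share, remainder = divmod(excess, len(hist))
    return [c + share + (1 if i < remainder else 0)
            for i, c in enumerate(clipped)]
```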

**Figure 9.** (**a**) The original image after mask processing; (**b**) the gray image after histogram equalization; (**c**) the cost map of the motion trajectory; (**d**) the result of the direction estimator.

**Figure 10.** The rows of the matrix represent the ground truth, and the columns represent the predicted values. TP = number of true positives, TN = number of true negatives, FN = number of false negatives, and FP = number of false positives.
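From the four counts in the confusion matrix, the standard summary metrics follow directly; a minimal sketch:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```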

**Figure 11.** Closed-loop trajectories of two groups of assembly motions: (**a**) image of $I_{yz}$; (**b**) image of $I_{xz}$; (**c**) ground-truth trajectory of $I_{yz}$; (**d**) ground-truth trajectory of $I_{xz}$; (**e**) predicted trajectory of $I_{yz}$; (**f**) predicted trajectory of $I_{xz}$; (**g**) closed-loop trajectory.

**Figure 12.** Qualitative comparison results. Each column represents a sample. The first row shows the image after mask processing; the second row shows the direction discrimination result; the third row shows the ground-truth trajectory. The remaining rows show the predicted trajectories generated by different approaches: ours, deep reinforcement learning (DRL), and Visual Value Prediction (VVP).

**Algorithm 1:** Trajectory prediction

```
Input: scene images {I_1, I_2, ..., I_t}, learning rate ψ, and the
       ground-truth positions of the moving target I_c:
       {(x_1, y_1), ..., (x_M, y_M)}
Initialize global-shared parameters θ⁻ ← θ
Initialize network gradients dθ ← 0
for t = 1 to M do
    Crop out the D_s positions according to the scene images and the
    moving target (x_{c,t}, y_{c,t})
    for i = 1 to N do
        Extract prediction positions according to (x̂_{c,t}, ŷ_{c,t})
        and the scene patches with an overlapped sliding window
        R_reward(S_i) = F_S(I_c, p(S_i), r_t; ψ)
        i ← i + 1
    end for
    dθ ← dθ + ∇_θ Σ_{t=m..M} r^m_{c,t}
    Obtain the cost map according to Equations (6)–(8)
    Calculate the Hausdorff distance
    Calculate D_r, the direction discrimination
end for
```

| Frames No. | 331# | 585# | 148# | 472# | 981# | 688# | Mean Value |
|---|---|---|---|---|---|---|---|
| Moving direction | +X | −Y | −X | −Y | +X | +Y | |
| Hausdorff distance, DRL | 29.13 | 16.75 | 27.48 | 28.18 | 14.27 | 21.42 | 22.87 |
| Hausdorff distance, VVP | 32.48 | 19.72 | 31.15 | 31.20 | 19.88 | 24.19 | 26.44 |
| Hausdorff distance, Ours (w/o) | 24.54 | 15.53 | 26.83 | 29.47 | 15.57 | 20.61 | 22.09 |
| Hausdorff distance, Ours | 12.28 | 10.15 | 11.57 | 13.09 | 10.55 | 11.62 | 11.54 |
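The Hausdorff distance used above to score predicted trajectories against ground truth can be computed directly from the two point sets; a minimal Euclidean version (quadratic in the number of points, which is adequate for short trajectories):

```python
import math

def hausdorff(traj_a, traj_b):
    """Symmetric Hausdorff distance between two 2-D point sets
    (trajectories), using Euclidean point-to-point distances."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    def directed(src, dst):
        # Worst-case distance from a point of src to its nearest point in dst.
        return max(min(dist(p, q) for q in dst) for p in src)
    return max(directed(traj_a, traj_b), directed(traj_b, traj_a))
```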

| Wind Speed (m/s) | 0 | 1.5 | 3.3 | 5.4 |
|---|---|---|---|---|
| Number of tests | 500 | 500 | 500 | 500 |
| Success rate, DRL | 21.2% | 10.2% | 1.4% | 0.2% |
| Success rate, VVP | 18.6% | 5.8% | 0.4% | 0% |
| Success rate, Visual servo | 53.6% | 21.2% | 15.8% | 8.6% |
| Success rate, Ours | 62.0% | 37.6% | 20.8% | 12.4% |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Zhang, K.; Tong, S.; Shi, H.
Trajectory Prediction of Assembly Alignment of Columnar Precast Concrete Members with Deep Learning. *Symmetry* **2019**, *11*, 629.
https://doi.org/10.3390/sym11050629
