Dimension Measurement and Key Point Detection of Boxes through Laser-Triangulation and Deep Learning-based Techniques

Abstract: Dimension measurement is of utmost importance in the logistics industry. This work studies a hand-held structured light vision system for boxes. The system measures dimension information through laser triangulation and deep learning using only two laser-box images from a camera and a cross-line laser projector. The structured edge maps of the boxes are detected by a novel end-to-end deep learning model based on a trimmed holistically nested edge detection network. The precise geometry of the box is calculated from the 3D coordinates of the key points in the laser-box image through laser triangulation. An optimization method for effectively calibrating the system through maximum likelihood estimation is then proposed. Results show that the proposed key point detection algorithm and the designed laser-vision-based visual system can locate and measure boxes with high accuracy and reliability. The experimental outcomes show that the system is suitable for portable, automatic, online box dimension measurement.


Introduction
The dimensional inspection of 3D objects is an important capability in many intelligent systems. In the logistics industry, specifically for boxes used in package distribution, dimension information drives rational packing. This task usually requires real-time dimension measurement of the object in advance. Therefore, a box dimension measuring system must deliver excellent flexibility, measurement speed, measurement accuracy, and automation.
Box dimension measurement is meaningful in the logistics industry and has gradually attracted the attention of researchers. In industrial applications, various computer vision methods, such as stereo vision [1,2], depth cameras [3][4][5][6], and structured light [7,8], have been developed for 3D information measurement. Two effective approaches [5,6] automatically detect box objects and estimate their dimensions using depth cameras based on TOF technology, with average errors of 8 mm and 5 mm, respectively. Peng et al. [7] presented a box dimension measurement system based on multi-line structured light vision, with errors of less than 5 mm. Gao et al. [8] developed an airline baggage dimension detection approach using a 3D point cloud obtained through a 2D laser rangefinder measurement system. Recently, line-structured light detection has become one of the most common methods for measuring the geometric parameters of objects and for 3D reconstruction.
Line-structured laser light is the most widely used method and generates robust measurement results in practical industrial applications [9][10][11]. This technology is non-contact, wide-range, highly flexible, fast, and precise, with stable algorithms, a simple structure, and good anti-interference. Moreover, the sensor is simple, economical, and easy to implement; thus, this technology has been widely used in many industrial fields, such as 3D shape measurement [12,13], vision navigation [10], quality control [14][15][16][17], and automatic inspection [18]. Some outstanding applications have used this technology in different fields. Li et al. [15] proposed a measurement and defect detection system for weld beads based on a line-structured vision sensor, and the vision inspection system achieved satisfactory results for online inspection. Zhou et al. [19] proposed a quality inspection system for steel rails based on a structured light measurement approach that intersects the rail with the structural light planes projected by inner and outer laser sensors. Miao et al. [20] proposed a flatness detection apparatus based on multi-line structured light imaging, achieving a detection accuracy of 99.74% for various computer keyboards on real production lines.
Research on the dimension measurement of boxes in the field of computer vision has been published [3][4][5][6][7] and has achieved good results. In the present study, we use line-structured light vision to measure box dimensions. A novel 3D measurement scheme based on key points was developed rather than directly applying complete 3D surface reconstruction. A portable, low-cost, and real-time dimension measurement system for boxes based on a hand-held visual sensor is proposed, as shown in Figure 1(a). Based on laser triangulation and the detection of the adjacent-face key points in the laser-box images, the system computes the dimension parameters of the box from the 3D coordinates of these key points.
The two main difficulties in the system are the detection of the structured edges and key points in the laser-box image and the system calibration. Most existing systems [14,18,19] must operate in a fixed scenario to obtain excellent measurement results. When structured light sensors are used, robust light strip segmentation is the key step in detecting and precisely positioning the structured edges of the laser-box image because they contain the local 3D information of the box dimensions. In this work, however, we consider using the line laser vision device in a natural environment for box dimension measurement. The laser light sources are disturbed by various noises, such as sunlight, shadows, and the appendages on the box surface, as shown in Figures 1(b) and (c). Therefore, a robust algorithm is needed to accurately detect the structured edges and key point information in the image.
The excellent performance of deep learning in image edge detection has made our study possible. Convolutional neural networks (CNNs) are effective for edge detection tasks. Xie et al. [21] developed the holistically nested edge detection (HED) network for edge and object boundary detection through rich hierarchical representations guided by deep supervision on side responses. Liu et al. [22] developed an accurate edge detector using richer convolutional features (RCF), which combines all the meaningful convolutional features in a holistic manner. Shen et al. [23] proposed effective multi-stage multi-recursive-input fully convolutional networks to address neuronal boundary detection in electron microscopy images. He et al. [24] proposed a bi-directional cascade network that encourages the learning of multi-scale representations in different layers and detects edges well delineated by their scales, thereby achieving state-of-the-art results. In general, different layers of a convolutional neural network learn different semantic levels [25]: shallow layers learn local texture features, middle layers extract primitive features such as shapes and lines, and deep layers learn high-level features of objects and categories. HED provides an effective deep learning network with deep supervision for edge detection. Inspired by HED, a novel end-to-end trimmed holistically nested network is designed in this study to detect the structured edge maps of the laser-box images.
The calibration parameters of the visual measurement system are another problem that must be solved. The calibration of the proposed measurement system can be decomposed into the camera intrinsic parameters and the external parameters.
Camera intrinsic parameters are unique to a particular camera, and many excellent camera calibration algorithms have been proposed [26][27][28]. The external parameters describe the relative position and orientation of the camera and the laser projector, and excellent calibration approaches can be learned from [29][30][31][32][33]. However, noise in the calibration image data can affect the robustness and accuracy of machine vision, leading to uncertainties in the calibration parameters. These parameters are also affected by systematic errors. These issues may become an obstacle to industrial applications. Thus, for the accurate calibration of the proposed visual system, a novel calibration method is proposed in this paper based on the maximum likelihood estimation of the probability distributions of the internal and external parameters and the filtering of outliers.
In this work, we propose a hand-held box dimension measurement system based on a moving coordinate system, combining laser triangulation and deep learning. A novel 3D measurement scheme based on key points is developed instead of complete 3D surface reconstruction. Measurement efficiency is maximized by detecting only the key points in the two adjacent-face images of the box instead of all the information on the laser stripes. We performed research and related experiments on system modeling, system calibration, measurement methods, structured edge map detection, and key point detection in the proposed visual measurement system. The main contributions of this paper are summarized as follows: (1) a hand-held visual sensor and an online measurement system based on laser triangulation and deep learning for box dimension measurement are proposed; (2) a valid dataset of laser-box images is created, and an effective structured edge detection and key point detection approach based on a trimmed-HED network and straight-line processing is proposed; and (3) an optimization method is proposed to achieve robust calibration of the visual sensor. This paper is organized as follows: Section 2 briefly introduces the box dimension measurement system. Section 3 reports the measurement procedure, the visual sensor's calibration method, laser-box image processing, and the detection algorithms for laser stripes and key points. Section 4 analyzes the performance of the measurement system. Finally, conclusions and future work are presented.

Materials and Methods
The portable dimension measurement system for boxes proposed in this paper is shown in Figure 2. The system comprises a cross-line laser projector (power: 10 mW, wavelength: 670 nm), a high-resolution digital color camera, and a compact housing. Two laser stripes are projected from the laser projector, forming a cross-line laser stripe on the box face. The visible cross-line laser stripes in the laser-box images (Figure 3(a)) are captured by a 2592 × 1944 pixel camera with a 3.6 mm lens. The size of our portable visual sensor is 120 mm × 35 mm × 35 mm, making the system suitable for portable box dimension measurement. The system takes two laser-box images to measure the dimensions of a box. Figure 4(a) shows the mutual position of the sensor and the box, as well as the reference system adopted in the problem. The visual sensor projects cross laser beams onto the box, forming cross stripes that are captured in laser-box images by the camera for measurement. The metric information of the box is encoded in the center lines of the laser stripes. The values are expressed directly in mm with respect to a camera coordinate system centered on the device. The visual sensor system computes the dimensions of the box by combining the detection of the inspected box's structured edge map and 2D key points (Figure 3(b)) in the laser-box images with the calibrated visual sensor. The system workflow is presented in Figure 5, showing the application operation. With the calibrated device, the two laser-box images of the adjacent faces of the box are captured by the system. The precise geometry of the structured edges of the box face is detected by the trimmed-HED network, and the 2D key points are detected by applying the Hough transformation to the box's silhouette edges. Then, the transformation from the 2D image points to the 3D space points of the key points, combined with the calibration parameters, is used to fit the plane equations of the box faces. Finally, the side lengths of the measured box face are obtained by computing the 3D coordinates of its four vertices. The following sections describe the whole procedure in detail.

Dimension Measurement Principle
We model boxes as parallelepipeds in the present work, although real boxes may present bent edges, missing corners, and asymmetries. The dimensions of a box can be computed from the 3D coordinates of the key points (points V1, V2, V3, V4, D1, D2, D3, D4, and O, as shown in Figure 6(b)) on the silhouette edges of the two captured laser-box images. Thus, before the dimensions of the measured box can be obtained, the box silhouettes (Section 3.3.1) and the 2D coordinates of these key points (Section 3.3.2) should be extracted. The camera and the cross-line laser are used to acquire laser-box images of the measured box, forming a cross-line laser stripe that embeds the profile structured edge information of the box face, as shown in Figure 6(b). If the parameter equations of the laser planes and the camera (Section 3.2) are known, the equations of the box faces can be computed by intersecting the image rays with the laser planes. Thus, the 3D coordinates of the key points can also be obtained easily.
Here, we define the camera coordinate system as the fiducial coordinate system. The equations of the two laser light planes, which describe the locations of the laser planes in the camera coordinate system, are assumed as follows:

$$a_1 x + b_1 y + c_1 z + 1 = 0 \quad (1)$$
$$a_2 x + b_2 y + c_2 z + 1 = 0 \quad (2)$$

where $a_i$, $b_i$, and $c_i$ ($i = 1, 2$) are the coefficients of the two laser planes in our system. The camera is modeled via the usual pinhole model to describe the projection relation between the 3D object space and the 2D image [23]. Thus, as shown in Figure 6(a), four coordinate systems are established: the image pixel coordinate system (unit: pixel), the image physical coordinate system (unit: mm), the camera coordinate system (unit: mm), and the world coordinate system (unit: mm). The relationship between a 3D point $P(X_w, Y_w, Z_w)$ and its image projection $p(u, v)$ is given by:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A \begin{bmatrix} R & t \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}, \qquad A = \begin{bmatrix} \alpha & \gamma & u_0 \\ 0 & \beta & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$

where $s$ is an arbitrary scale factor; $(R, t)$, called the extrinsic parameters, are the rotation matrix and translation vector that relate the world coordinate system to the camera coordinate system; $A$ is the camera intrinsic matrix; $(u_0, v_0)$ is the principal point in the image pixel coordinate system; $\alpha$ and $\beta$ are the scale factors along the image $u$ and $v$ axes; and $\gamma$ describes the skew of the two image axes. Equation (3) represents the transformation of the point $P$ between the camera coordinate system $(X_c, Y_c, Z_c)$ and the image pixel coordinate system $(u, v)$:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} \quad (3)$$

If a 3D point $(X_c, Y_c, Z_c)$ also lies on one of the laser stripes, its camera coordinates can be derived from its pixel coordinates $(u, v)$ through Equations (1) and (3):

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = Z_c A^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}, \qquad Z_c = \frac{-1}{a_i m_1 + b_i m_2 + c_i} \quad (4)$$

where $[m_1, m_2, 1]^T = A^{-1} [u, v, 1]^T$ is the viewing ray of the pixel. Therefore, the 3D coordinates $P(X_c, Y_c, Z_c)$ of every point on the laser stripes in the image can be computed in the camera coordinate system. Thus, with the 2D coordinates of the points on the laser stripes of the box face (points D1, D2, D3, D4, and O), the box face plane equation can easily be fitted with the least-squares method in the camera coordinate system:

$$A x + B y + C z + 1 = 0 \quad (7)$$

where $A$, $B$, and $C$ are the coefficients of the box face plane equation. The 3D coordinates of the key points (V1, V2, V3, and V4) on the silhouette edges of the laser-box image can then be derived from Equations (3) and (7). We denote the 3D coordinates of the vertices of a box face by $V_1(x_{v1}, y_{v1}, z_{v1})$, $V_2(x_{v2}, y_{v2}, z_{v2})$, $V_3(x_{v3}, y_{v3}, z_{v3})$, and $V_4(x_{v4}, y_{v4}, z_{v4})$. The length and the width of the box face can be computed as the Euclidean distances between adjacent vertices:

$$\mathrm{length} = \sqrt{(x_{v1} - x_{v2})^2 + (y_{v1} - y_{v2})^2 + (z_{v1} - z_{v2})^2}, \quad \mathrm{width} = \sqrt{(x_{v2} - x_{v3})^2 + (y_{v2} - y_{v3})^2 + (z_{v2} - z_{v3})^2} \quad (8)$$

Through the same strategy, the length′ and width′ of an adjacent box face can be obtained by processing the second laser-box image. Because the two measured faces share one edge, the height of the measured box is the dimension of the adjacent face that does not coincide with the shared edge:

$$\mathrm{height} = \mathrm{width}' \quad \text{when} \quad \mathrm{length} = \mathrm{length}' \quad (9)$$

The box volume can be computed as follows:

$$\mathrm{volume} = \mathrm{length} \times \mathrm{width} \times \mathrm{height} \quad (10)$$

However, when the two measured faces have identical dimensions $w \times l$, the shared edge cannot be identified from Equation (9), and the height of the box must be selected manually. This is a shortcoming of our system.
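To make the back-projection concrete, the following minimal Python sketch intersects the viewing ray of a laser-stripe pixel with a calibrated laser plane and computes face dimensions from the recovered vertices. All numeric values (intrinsics, plane coefficients) are placeholders, the skew γ is assumed to be zero, and the function names are ours for illustration, not the paper's implementation.

```python
import numpy as np

# Hypothetical intrinsics and laser-plane coefficients; real values come
# from the calibration of Section 3.2 (skew assumed zero for simplicity).
alpha, beta = 2500.0, 2500.0              # focal lengths in pixels (assumed)
u0, v0 = 1296.0, 972.0                    # principal point (assumed)
plane = np.array([0.002, -0.001, -0.0008])  # (a, b, c) of a*X + b*Y + c*Z + 1 = 0

def pixel_to_3d_on_laser(u, v, plane):
    """Back-project a laser-stripe pixel to camera coordinates by
    intersecting its viewing ray with the calibrated laser plane."""
    a, b, c = plane
    xn, yn = (u - u0) / alpha, (v - v0) / beta  # normalized ray direction
    Zc = -1.0 / (a * xn + b * yn + c)           # from a*X + b*Y + c*Z + 1 = 0
    return np.array([xn * Zc, yn * Zc, Zc])

def face_dimensions(V1, V2, V3, V4):
    """Equation (8)-style face dimensions from adjacent 3D vertices."""
    length = np.linalg.norm(V1 - V2)
    width = np.linalg.norm(V2 - V3)
    return length, width
```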
The box dimension measurement is thus completed by capturing two laser-box images with the visual sensor. However, the visual sensor must be calibrated in advance, as discussed in Section 3.2. A robust algorithm is also needed to precisely detect the key points in the collected laser-box images. In this work, the 2D coordinates of the key points are obtained by processing the structured edge maps of the laser-box images, and the process is described in detail in Section 3.3.2.

Parameter Calibration of the Visual Sensor
Camera resolution and measurement device calibration are factors that affect the accuracy of length measurement. The internal and external parameters of the system must be calibrated in advance to achieve sufficient measurement accuracy from the visual sensor. The internal parameters are unique to a particular camera and include the camera intrinsic matrix A and the distortion parameters k1 and k2. The external parameters describe the relative pose between the camera and the laser projector; the two laser planes projected from the laser projector are defined by Equations (1) and (2) relative to our fiducial coordinate system. The external and internal parameters jointly determine the geometric interpretation of the measurements.
In the camera calibration stage, the camera's intrinsic parameters are calculated using Zhang's method [26]. We employ a planar calibration pattern viewed simultaneously by the camera and the laser projector. The laser light is projected onto the planar calibration pattern, forming a light strip at the intersection of the laser plane and the calibration pattern plane. When collecting images, we move the camera to observe the calibration pattern from different positions and ensure that the calibration board fills the entire field of view. The internal parameters of the visual sensor, the camera's extrinsic parameters R, t with respect to the calibration pattern, and the plane equation of the calibration pattern can be determined using Zhang's method. Then, we extract all the intersections of the fitted laser strip line with the horizontal and vertical fitted lines of the feature points (Figure 7(c)) on the calibration pattern as calibration points on the laser strips. We collected N calibration images to obtain a sufficient number of calibration points on the laser stripes. In accordance with Equation (4), the 3D coordinates P(Xc, Yc, Zc) of the calibration points can be computed in the camera coordinate system from their 2D coordinates p(u, v) on the fitted laser lines. The objective function is the sum of the squares of the Euclidean distances from the calibration points to the laser plane, and the laser plane equations can be fitted via the Levenberg-Marquardt method [34,35] with these 3D calibration points:

$$\min_{a, b, c} \sum_{k=1}^{N} \frac{(a X_{ck} + b Y_{ck} + c Z_{ck} + 1)^2}{a^2 + b^2 + c^2}$$

where a, b, and c are the parameters of the laser plane equation, N is the number of placements of the calibration pattern (N = 15 in this work), and (Xck, Yck, Zck), k = 1, 2, ..., N, are the coordinates of the calibration points on the laser stripe. Figure 7(a) shows the general setup of our calibration approach. Figure 7(b) shows an example of a set of images used in the calibration; the pattern measures 1300 mm × 1200 mm, has 19 × 19 corners with a square size of 57.0 mm, and is affixed to glass. Figure 7(d) shows a laser plane fitted during calibration.
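As an illustration, the plane-fitting step can be prototyped with SciPy's Levenberg-Marquardt solver. This is a sketch of the objective above under our own naming and initial guess, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_laser_plane(points, p0=(0.001, 0.001, -0.001)):
    """Fit a*X + b*Y + c*Z + 1 = 0 to 3D calibration points by minimizing
    the point-to-plane Euclidean distances (the objective above)."""
    X, Y, Z = points.T

    def residuals(p):
        a, b, c = p
        # Signed point-to-plane distance for each calibration point.
        return (a * X + b * Y + c * Z + 1.0) / np.sqrt(a * a + b * b + c * c)

    sol = least_squares(residuals, p0, method="lm")  # Levenberg-Marquardt
    return sol.x  # (a, b, c)

# Usage: stack the laser-stripe calibration points of all N images
# (camera frame, shape (K, 3)), then:  a, b, c = fit_laser_plane(pts)
```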

Optimization of the Calibration Parameters via Probability Distribution Analysis and Outlier Removal
In practical applications, the internal and external parameters of the same line-structured visual device calculated from different calibration image sets will differ. The output of the camera and the laser projector exhibits noise, and erroneous estimates of the internal and external parameters will affect the final measurement. In this study, we assume that each calibration parameter follows a normal probability distribution. If calibration is repeated n times, the true values of these parameters can be recovered.
For improved application in engineering projects, a robust approach has been developed that iteratively drops parameter sets with excessive errors. The internal and external parameters are assumed to obey a normal distribution, expressed as N(m, σ²). Then, 99.7% of the data should lie inside the range [m-3σ, m+3σ]. Data lying outside this range can be culled, given their large error with respect to the true value. Therefore, the internal and external parameters of the visual sensor are re-estimated in this work from the remaining parameter sets that meet the ±3σ criterion. The detailed processing steps of the proposed algorithm are as follows (a code sketch of the outlier-removal step follows the list):
(1) Acquire N calibration pattern images with laser stripes in different positions, from which the feature points and calibration points can be detected successfully.
(2) Randomly select M images from the N calibration pattern images of Step (1) and calibrate the visual sensor with them, yielding one parameter set (α,β,u0,v0,k1,k2,a1,b1,c1,a2,b2,c2); repeat this for every such subset to obtain C_N^M parameter sets.
(3) Compute the mean m and the standard deviation σ of each parameter over all parameter sets.
(4) Discard every parameter set containing a value outside [m-3σ, m+3σ], and recompute m and σ from the remaining sets until no outlier remains.
(5) The mean of the remaining parameter sets is used as the final internal and external parameters of the visual sensor.
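The ±3σ culling of Steps (3)-(5) can be expressed compactly. The sketch below assumes each calibration run yields one 12-parameter row (α, β, u0, v0, k1, k2, a1, b1, c1, a2, b2, c2) and iterates until no outlier remains; it is a minimal illustration, not the authors' exact code.

```python
import numpy as np

def refine_parameters(param_sets, n_sigma=3.0, max_iter=10):
    """Iteratively drop calibration runs whose parameters fall outside
    [m - 3*sigma, m + 3*sigma] and re-estimate the mean."""
    data = np.asarray(param_sets, dtype=float)  # shape (runs, 12)
    for _ in range(max_iter):
        m, s = data.mean(axis=0), data.std(axis=0)
        keep = np.all(np.abs(data - m) <= n_sigma * s, axis=1)
        if keep.all():          # no outliers left: converged
            break
        data = data[keep]       # cull outlying parameter sets
    return data.mean(axis=0)    # re-estimated "true" parameters
```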

Experimental Verification and Accuracy Assessment
In the present study, N = 20 calibration images are acquired, M = 15 images are selected for each subset, and the number of subsets is C_N^M = 15504.
Similar to Zhang's algorithm [26], the root mean square (RMS) projection error between the real pixel coordinates $(x_i, y_i)$ and the projected pixel coordinates $(x_i^{proj}, y_i^{proj})$ is calculated to assess the accuracy of the parameters:

$$\mathrm{RMS} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left[ (x_i - x_i^{proj})^2 + (y_i - y_i^{proj})^2 \right]}$$

Ten sets of images were captured by our visual sensor with the same calibration pattern in different orientations, each containing 25 calibration images, to evaluate the accuracy of the proposed optimization algorithm. As shown in Table 1, the average RMS over the 10 image sets is compared with the RMS calculated using the true internal and external parameters obtained via the optimization method. The optimization algorithm proposed in this paper is effective and obtains parameter values close to the true values.
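A minimal sketch of this RMS criterion using OpenCV's point projector; the helper name is ours, and the distortion model follows the calibration convention of [26].

```python
import numpy as np
import cv2

def rms_projection_error(obj_pts, img_pts, rvec, tvec, A, dist):
    """RMS reprojection error over one calibration image: project the 3D
    pattern points with the estimated parameters and compare with the
    detected pixel coordinates."""
    proj, _ = cv2.projectPoints(obj_pts, rvec, tvec, A, dist)
    err = img_pts.reshape(-1, 2) - proj.reshape(-1, 2)
    return np.sqrt(np.mean(np.sum(err ** 2, axis=1)))
```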

Table 1. RMS projection error of the calibration results.

Measurement | Value
Mean RMS calculated using 10 sets of calibration pattern images | 0.0568
RMS calculated using the true parameters | 0.0215

Table 2 shows the minimum and maximum values of each parameter over the given original calibration image sets, together with the corresponding true values ultimately calculated via the optimization algorithm. The proposed method provides a robust solution for calibrating line-structured light visual sensors: the internal and external parameters of the visual sensor are the mean and standard deviation over the C_N^M calibrated image subsets after outlier removal, rather than a fixed result calibrated from a single image set. The proposed optimization algorithm is therefore useful in engineering applications.

Structured Edge Map Detection Based on the Trimmed-HED Network
The structured silhouette edges of the captured image should be obtained to compute the 2D image coordinates of the intersections and vertices crossed by the laser planes and the edges of the box. This section presents an automated and effective deep learning method for detecting the structured edge map and extracting straight lines from it. We propose a novel trimmed-HED network built on a modified VGG16 [25] backbone; this structure produced the best edge predictions in our repeated tests. Our trimmed-HED model improves on HED in three aspects: (1) the laser-box image dataset we built; (2) the first two side-output layers of HED are cut to ignore fine detail in the image; and (3) the loss function is slightly simplified by computing only the fusion-layer output, which progressively improves the coarse-to-fine structure of the edge map prediction.
(1) Laser-box image dataset
The problem of structured edge map detection is solved by learning from diverse samples. In building the dataset, the best-fit rectangle is annotated manually. We labeled each laser-box image with nine 2D coordinates, including the four intersections of the box face edges with the laser lines and four other points on the box face edges; the ground truth is obtained by drawing straight lines through these 2D coordinates. Figure 8 shows sample images and ground-truth structured edge maps from our dataset. Data augmentation is an effective way to generate sufficient training data for learning a robust deep network: we rotate the images to seven different angles (45°, 90°, 135°, 180°, 225°, 270°, and 315°), as sketched below. In total, our dataset comprises 96,000 training images and 500 testing images.
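The rotation augmentation can be reproduced with a few lines of OpenCV. This sketch rotates the image and its ground-truth edge map identically about the image center; the border handling and the fixed output size are our assumptions.

```python
import cv2

ANGLES = (45, 90, 135, 180, 225, 270, 315)

def augment_rotations(image, label):
    """Generate the seven rotated copies used for augmentation; the
    ground-truth edge map is rotated with the same transform."""
    h, w = image.shape[:2]
    mats = [cv2.getRotationMatrix2D((w / 2, h / 2), a, 1.0) for a in ANGLES]
    return [(cv2.warpAffine(image, M, (w, h)),
             cv2.warpAffine(label, M, (w, h))) for M in mats]
```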
(2) Trimmed-HED network
Figure 9. Trimmed-HED network architecture. The green cubes represent convolution layers, and the blue cubes represent pooling layers. The prediction stage is a feed-forward network for generating initial predictions, and its architecture is divided into three stages. The final prediction output is obtained by the weighted fusion of side-output (3), side-output (4), and side-output (5).
Figure 9 shows an overview of the proposed trimmed-HED network for structured edge detection. The original HED network was designed with five side-output layers and one fuse-output, and the final output was obtained through the weighted-fusion and average layers. HED and RCF indicate that the side-output layers at the front of the network (low-level layers) focus on extracting the detail edges of the image, whereas the high-level layers focus on extracting the target contour. However, the overall structured edge of the box face and the laser straight lines are the main concerns of the present work. Therefore, trimmed-HED cuts the first two side-output layers of HED.
The total cross-entropy loss in HED is minimized via standard stochastic gradient descent over the sum of the loss functions at the side outputs and the fusion layer, as shown in the following equation:

$$(W, w, h)^* = \arg\min \left( L_{side}(W, w) + L_{fuse}(W, w, h) \right) \quad (13)$$

where $L_{fuse}(W, w, h)$ denotes the loss function at the fusion layer and $L_{side}(W, w)$ denotes the loss function at the side-output layers. $W$ denotes the standard network layer parameters, $w$ denotes the parameters of the side-output layers, and $h$ denotes the fusion coefficients of the side-output layers. The entire HED network was trained with both weighted-fusion supervision and side-output supervision; compared with training under weighted-fusion supervision only, the edge map predictions then progress coarse-to-fine and local-to-global. In trimmed-HED, however, training with weighted-fusion supervision only already provides the complete structural information that the network must learn from the image. Therefore, our loss function in trimmed-HED becomes:

$$(W, w, h)^* = \arg\min L_{fuse}(W, w, h) \quad (14)$$

The final edge map prediction ($Y_{output}$) is computed by further aggregating the edge maps of the side-output layers and the weighted-fusion layer:

$$Y_{output} = \mathrm{Average}\left( Y_{fuse}, Y_{side}^{(3)}, Y_{side}^{(4)}, Y_{side}^{(5)} \right) \quad (15)$$

where $Y_{side}^{(3)}$, $Y_{side}^{(4)}$, and $Y_{side}^{(5)}$ are the outputs of side-output layers (3), (4), and (5), respectively. The parameters of our network include the mini-batch size (10), the learning rate (1e-3), the loss weight for each side-output (1), the weight decay (0.0002), and the number of training iterations (1e+5; the learning rate is divided by 10 after 1,000 iterations).
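A minimal PyTorch sketch of the fusion-only objective of Equation (14) and the aggregation of Equation (15); the class-balanced cross-entropy weighting follows HED's formulation, and all tensor names and shapes are our assumptions rather than released code.

```python
import torch
import torch.nn.functional as F

def fusion_only_loss(fuse_logits, target):
    """Equation (14)-style objective: supervise only the weighted-fusion
    output with HED's class-balanced binary cross-entropy. The retained
    side outputs exist but receive no direct loss."""
    pos = target.sum()
    neg = target.numel() - pos
    # Positive pixels weighted by the negative fraction, and vice versa.
    w = torch.where(target > 0.5, neg / target.numel(), pos / target.numel())
    return F.binary_cross_entropy_with_logits(fuse_logits, target, weight=w)

def final_edge_map(side3, side4, side5, fuse):
    """Equation (15)-style aggregation: average the retained side outputs
    with the fused prediction."""
    return torch.stack([side3, side4, side5, fuse]).mean(dim=0)
```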
The performance of the structured edge detection algorithm was evaluated using three standard measures: the best F-measure at a fixed threshold over the dataset (ODS), the best F-measure with a per-image threshold (OIS), and the average precision (AP). The trimmed-HED method was compared with the original HED method and with trimmed-HED with and without deep supervision. The detailed experimental results are shown in Table 3. The results of the original HED in Table 3 are unsatisfactory, which is expected because its training dataset was not designed for the problem in this study. The comparison of the first two rows of Table 3 shows the advantage of creating a dedicated dataset: compared with the original HED, HED trained on our dataset increases ODS by 0.131, OIS by 0.096, and AP by 0.109. The trimmed-HED without deep supervision achieved the best structured edge map detection in the experiment, with an ODS of 0.803, an OIS of 0.816, and an AP of 0.809. Table 3. Performance of alternative network architectures. The "without deep supervision" results are trained using Equation (14); the "with deep supervision" results are trained using Equation (13).

Detecting 2D Key Points via the Hough Transformation
The structured edge map, extracted from the laser-box images by the proposed deep learning network, shows the four edges of the measured box face and the projected intersecting laser lines. The straight lines and their intersections must then be extracted from the structured edge maps to locate the 2D coordinates of the key points. In this work, three steps were performed as follows (see the code sketch below):
Step 1. The Hough line transform was used to detect straight lines ρ = x cos(θ) + y sin(θ) in the structured edge maps of the laser-box images, mapping each straight line to the parameter space.
Step 2. The (ρ, θ) space was quantized into cells, and an accumulator was created for each cell. For every edge pixel (x, y) in the structured edge map of the laser-box image, the quantized (ρ, θ) values were computed, and nearly collinear line segments were clustered using suitable thresholds for ρ and θ.
Step 3. The image-space lines corresponding to the N strongest (ρ, θ) cells from Step 2 were obtained and fitted via the least-squares method (LSM); N is 6 in this study.
Figure 11 presents the key point detection results on the raw input images. The OpenCV function cornerSubPix() was used to refine the key points to sub-pixel coordinates; its winSize parameter, the radius of the search window, was set to 4 × 4 in this study. For each image, the detected 2D coordinates of the key points are overlaid on the raw images in Figure 11(c) to illustrate the experimental results intuitively. The locations of the key points were precisely detected by the proposed approach.
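The three-step procedure maps naturally onto OpenCV primitives. The condensed sketch below lets the strongest Hough peaks stand in for the clustered, least-squares-fitted lines of Steps 2-3, then intersects the lines pairwise and refines the intersections with cornerSubPix(); the accumulator threshold is our assumption.

```python
import cv2
import numpy as np

def detect_key_points(edge_map, gray, n_lines=6):
    """Condensed key-point sketch: top-N Hough peaks, pairwise line
    intersections, then sub-pixel refinement of the intersections."""
    lines = cv2.HoughLines(edge_map, 1, np.pi / 180, threshold=120)
    lines = [] if lines is None else lines[:n_lines]
    pts = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            (r1, t1), (r2, t2) = lines[i][0], lines[j][0]
            M = np.array([[np.cos(t1), np.sin(t1)],
                          [np.cos(t2), np.sin(t2)]])
            if abs(np.linalg.det(M)) < 1e-6:  # skip near-parallel pairs
                continue
            pts.append(np.linalg.solve(M, np.array([r1, r2])))
    corners = np.array(pts, dtype=np.float32).reshape(-1, 1, 2)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3)
    cv2.cornerSubPix(gray, corners, (4, 4), (-1, -1), criteria)  # winSize 4x4
    return corners.reshape(-1, 2)
```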

Experimental Results
The overall measurement system is shown in Figure 1. The vision sensor is connected to a mobile device via a USB cable. The effective measurement distance of the vision sensor is 0.1-2.5 m. The normal operating temperature of the system is -15 °C to 60 °C. Before these experiments, the vision sensor was calibrated in advance with the method described in Section 3.2. The detailed parameters are shown in Table 2.
A few operational experiments under varying operating conditions were carried out to evaluate the performance and effectiveness of the proposed system and the validity of the corresponding algorithms derived above. Five experimental phases were conducted to evaluate system performance. 1) The relative angle of the boxes measured and the visual sensor was changed; 2) the distance between the visual sensor and boxes was changed; 3) systematic error and measurement uncertainty analyses experiments of the system were performed; 4) the measurement accuracy of the system on various boxes in different scenarios was verified; and 5) some online test experiments were performed.

Statistical Analysis of Measurements at Varying Orientations of the Measured Object
In this experiment, the robustness of the proposed system with varied box orientation with respect to the visual sensor was evaluated.
The visual sensor was placed at five different positions with different orientations, and the angles between the face of the measured box and the z-axis of the sensor's reference system (see Figure 4(a)) were 30°, 45°, 60°, 75°, and 90°, respectively.
In this experiment, the estimated values are reported as the average of 30 experimental sessions on the same box (Figures 12(a) and (b)). Table 4 shows the measurement results in terms of W, L, and H of the boxes. As shown in Figure 13, the average absolute errors (over L, W, and H of the two standard boxes) at 90°, 75°, 60°, 45°, and 30° were 0.867, 1.333, 1.633, 2.533, and 3.083 mm, respectively, indicating that the orientation of the measured box with respect to the visual sensor has only a limited effect on the measurement results. At an angle of 90°, the system obtains the best measurement results, achieving a 0.26% average relative error. The maximum error in this experiment was 3.8 mm, indicating that the measurement system has good applicability in practical measurement.

Statistical Analysis of Measurements at Changing Distances Between the Visual Sensor and the Measured Box
The box was measured at five increasing distances dis1-dis5 between the visual sensor and the box (dis1 = 0.8 m; the five configurations are shown in Figure 14). The measured values were recorded, and the measurement error was computed as the relative error. The measurement results of this experiment are recorded in Table 5. Figure 15 shows that at sensor-box distances dis1, dis2, dis3, dis4, and dis5, the average absolute errors are 0.411, 0.844, 1.478, 3.111, and 4.689 mm, respectively. The table shows that the measurement error increases with the measurement distance. The maximum measurement error was 5.8 mm, which remains within ±6 mm. The analyzed data show that our system has good accuracy over the normal measurement range between the box and the vision sensor.

Stability Analysis and Evaluation of Measurement Uncertainty of the Measurement System
In this experiment, we evaluated the stability of the measurement system by making repeated measurements of the box dimensions. Four standard boxes were used to increase the credibility of the experiment. As shown in Figures 16(a)-(d), the side lengths of the boxes are evenly distributed over the measuring range; the L × W × H of these standard boxes are 110.6 × 410.5 × 620.8 mm, 390.8 × 240.6 × 530.7 mm, 1110.7 × 750.8 × 880.9 mm, and 690.7 × 570.5 × 1500.0 mm, respectively. The position of the box relative to the vision sensor was changed for each shot, so the measured values can be used to verify the measurement accuracy of the system. The experimental results (L, W, and H) for the standard boxes are shown in Table 6. The average estimated values were recorded as the average of 15 experimental sessions on each standard box at the best box-sensor distance. The stability and the measurement uncertainty of the system were evaluated by computing statistics over the 15 measurements: the mean (Mean), the average absolute error (Ave_Err), the standard deviation (Std), and the Type A uncertainty (μA). The formula for calculating μA is as follows:

$$\mu_A = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n(n-1)}}$$

where $x_i$ is the i-th measured value, $\bar{x}$ is the mean of the measured data, and $n$ is the number of measurements, which is 15 in the experiment.
Table 6. Stability analysis and evaluation of uncertainty of the measurement system.
The standard deviations of the measurement results were analyzed against the actual lengths of the measured boxes. The maximum standard deviation was less than 2.68 mm, and the minimum was less than 1.01 mm, indicating that the box measurement system has reliable repeated-measurement accuracy.
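The Type A uncertainty above is straightforward to compute; a one-function sketch (the example readings in the comment are hypothetical):

```python
import numpy as np

def type_a_uncertainty(x):
    """Type A standard uncertainty of n repeated measurements:
    sqrt(sum((x_i - mean)^2) / (n * (n - 1)))."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return np.sqrt(np.sum((x - x.mean()) ** 2) / (n * (n - 1)))

# Example with hypothetical repeated length readings (mm):
# mu_A = type_a_uncertainty([620.3, 621.0, 620.6, 620.9])
```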
Ave_Err and μA of the computed dimensions increased as the measurement range of the system increased. This phenomenon is attributed to the relative error, which tends to decrease as the distance between the visual sensor and the box becomes small. Figures 17(a) and (b) show the average error distribution and the measurement uncertainty of the system over these 15 measurements, which are consistent with the theoretical analysis. The maximum absolute error of the side length of the standard boxes is 4.7 mm. The measurement uncertainty of the measuring system is ±1.05 mm to ±2.77 mm within the range of 110-1500 mm, indicating that the measurement system has high reliability in actual measurement and strong practical applicability.

Statistical Analysis of Measurements for Various Boxes in Different Scenarios
In this experiment, eight different boxes in different scenarios were measured. Figure 18(a) shows a box with a red surface; Figure 18(b) shows a box with a highly reflective area on its surface, which affects the imaging of the laser stripes; Figure 18(c) shows a box in an ideal state; Figure 18(d) shows a box with a complex pattern and appendages on its surface; Figure 18(e) shows a box with surface variation (not an ideal plane); Figures 18(g) and (h) show the measurement of a target box with several boxes positioned in one plane; and Figures 18(f) and (g) show the measurement of the same box in different scenarios. The raw laser-box image, the edge map, and the key points measured by the system are listed in Table 7. The width W, length L, and height H of each box are recorded, with the absolute errors in brackets.
The experimental results were analyzed as follows. Box (c) is an ideal box, with excellent measurement results. The results for boxes (a) and (d) indicate that the surface color of the box and the complexity of its pattern have no effect on the dimensional measurement. The results for box (b) show that the system is slightly negatively affected by the optical quality of the surface, which affects key point detection and length measurement. The absolute measurement errors of the L, W, and H of box (e) are 1.3, 9.3, and 11.3 mm, respectively; although the maximum error reaches 11.3 mm, our algorithm still handles the edge and key point detection of such a box with an uneven surface well, and such results are acceptable for most logistics operations. The measurements of boxes (f) and (g) are almost identical, suggesting that our system works well in complex situations where multiple boxes lie in the same plane. The results for box (h) also verify the effectiveness of the measurement system in a complex environment. Overall, the experimental results show that the system is only slightly affected by the color, the pattern, and the optical quality of the surface, whereas a considerable measurement error occurs for boxes with uneven surfaces. The results also show that the network designed in this paper can accurately locate the key points in the laser-box image even in the complex environment of multiple boxes. The measurements in Table 7 show that the errors of the box side lengths are between -2.2 and +3.8 mm (excluding box (e) with its irregular surfaces). This finding shows that the designed system has a wide range of applicability.
Table 7. Measurement results for various boxes in different scenarios, with the measurement errors (mm) in brackets.
Table 8 shows the experimental measurement results for eight standard boxes. The absolute error and the relative error of the measurement results were analyzed against the actual volumes and lengths of the measured boxes. The data in Table 8 indicate that the maximum relative error of length in the experiment was 0.575% and the maximum measurement error of length was 7.6 mm, indicating good dimension measurement accuracy of the system.

Conclusions
A portable online dimension measurement system for boxes is required by the logistics industry to meet the challenging demands of intelligent logistics. In this work, the proposed dimension measurement system takes advantage of the 3D reconstruction of box vertices to provide online dimension measurement. The system is based on laser triangulation and deep learning, using a cross-line laser stripe cast onto two adjacent faces of the box to be inspected. The method can accurately compute the 3D dimensions of boxes under adverse environmental conditions. The 2D coordinates of the key points in the laser-box images are detected by a novel end-to-end deep learning network with excellent performance. An effective optimization algorithm for structured light vision calibration was presented, in which the camera intrinsic and extrinsic parameters of the device were refined by maximum likelihood estimation based on their probability distributions. Experimental results show that the physical design of the proposed visual sensor is rational and that the dimension measurement of boxes is effective. Our approach is readily applicable to future automated systems, which can integrate box targeting with the measurement method presented here. In the future, our work will continue to focus on an intelligent and portable online box dimension measuring equipment system.
Funding: This work was supported by the National Natural Science Foundation of China (61572307).

Conflicts of Interest:
The authors declare no conflict of interest.