Portable System for Box Volume Measurement Based on Line-Structured Light Vision and Deep Learning

Portable box volume measurement has long been a popular topic in the intelligent logistics industry. This work presents a portable system for box volume measurement that is based on line-structured light vision and deep learning. The system consists of a novel 2 × 2 laser line grid projector, a sensor, and software modules, with which only two laser-modulated images of a box are required for volume measurement. For laser-modulated images, a novel end-to-end deep learning model is proposed that uses an improved holistically nested edge detection network to extract edges. Furthermore, an automatic one-step calibration method for the line-structured light projector is designed for fast calibration. The experimental results show that the measuring range of our proposed system is 100–1800 mm, with errors less than ±5.0 mm. Theoretical analysis indicates that within the measuring range of the system, the measurement uncertainty of the measuring device is ±0.52 mm to ±4.0 mm, which is consistent with the experimental results. The device size is 140 mm × 35 mm × 35 mm and the weight is 110 g; thus, the system is suitable for portable automatic box volume measurement.


Introduction
Box volume measurement is important for many sectors, including logistics, transportation, and production, and it can assist in designing, packaging, and allocating strategies. Fast, intelligent, accurate, and automatic volume measurement can improve efficiency and reduce labor intensity. User-friendly and cost-effective systems are also vital for box volume measurement.
Accordingly, a practical measurement system for box volume should have the following characteristics: (1) small enough to be handled easily, (2) wide measuring range, (3) high measurement accuracy, (4) stable and robust, and (5) easy to use and flexible.
At present, research on large-scale measurement of three-dimensional (3D) geometric dimensions focuses on non-contact 3D measurement methods based on computer vision technology. These methods have a rigorous theoretical basis, a wide and flexible measuring range, high measurement accuracy and efficiency, no rigid requirement for the spatial relationship between the measuring device and the measured object, and good robustness, and they operate without contact. Thus, they are a feasible solution for large-scale 3D geometric measurement.
With the development of computer vision technology, object volume can be calculated by using new technology and sensors [1][2][3]. Many advanced sensors, such as stereovision, time-of-flight (ToF) cameras, and structured-light vision sensors, can represent spatial and color information from natural objects, thereby playing a crucial role in the development of industrial automation measurement.
A method for the dimension measurement and inspection of cuboidal objects (boxes) with a ToF camera was described in [4], with an average error of 5 mm. The same ToF camera was used in [5] to build a system for computing the volume of cuboidal objects with an accuracy of 8 mm. ToF technology obtains depth information in real time by measuring the time that a pulse of energy takes to travel from its transmitter to the object surface and then back to the receiver. Owing to its robustness and popularity, the ToF camera technique has been widely studied and applied in industry [6,7]. Dimensional measurement methods that are based on stereovision have also been widely used. A stereovision technique for accurately measuring the distance and size (height and width) of an object in view was introduced in [8]. Ge et al. [9] proposed a method of broccoli seedling recognition in natural environments based on binocular stereovision. As binocular cameras rely heavily on image feature matching, they perform poorly under dark or overexposed lighting. In addition, if the measured scene lacks texture, then extracting and matching features is difficult. Moreover, a binocular stereo camera uses a complex correlation algorithm, which is time consuming. By contrast, the depth calculation of ToF is unaffected by the grayscale and features of the object surface, and ToF can accurately perform 3D detection. The depth calculation accuracy of ToF does not change with distance. The measurement accuracy can reach the millimeter level by using an advanced ToF camera and algorithm, as previously mentioned [4,5].
Recently, the technique of computer vision and structured light (SL) measurement has been widely applied in many fields of high-precision measurement, due to its simple structure. Triangulation-based visual sensors are popular for measurement in various industries. They have many advantages, such as non-contact, high-precision, rapid, and automated measurements [10][11][12][13]. Fernandes et al. [14] presented an approach that is based on projective geometry; they computed the box dimensions by using data that were extracted from the box silhouette and the projection of two parallel laser beams on one of the imaged faces of the box. Wang et al. [15] proposed a handheld 3D laser scanning system that consists of a binocular stereovision and line laser projector for measuring large-sized objects on site. Pan et al. [16] proposed a wheel size measurement framework that is based on a structured-light vision sensor, which has high precision and reliability and is suitable for highly reflective conditions. In the present study, we develop a novel volume measurement system for a box that contains high-resolution color digital cameras and line-structured lights and that works indoors and outdoors. Figure 1c shows the designed device for box volume measurement. The device size is 140 mm × 35 mm × 35 mm and the weight is 110 g, thereby easily meeting the requirements of stability and portability. The line-structured light projectors emit laser planes onto the box face, and the laser planes intersect with the face of the measured box and form laser stripes in the laser-modulated image. As the face of the measured box modulates the laser stripes, the image processing algorithm can calculate the dimension information of the box on the basis of the laser triangulation principle and some key points. 
Thus, our method calculates the volume of boxes from two laser-modulated images (two adjacent faces of the box), and the technique mainly includes two aspects: (1) calibration technology of the vision sensor and (2) the extraction of the box silhouette to obtain the key points from the laser-modulated images.
The paper is organized as follows. Section 2 presents a brief overview and operating instructions of the visual sensor of the system. Our new approach for measuring the box volume is investigated in detail in Section 3. The experimental results and discussions are presented in Section 4. Finally, conclusions are drawn in Section 5.

Overview
Figure 1c displays the proposed system. High-precision sensors and strict measurement rules achieve high-accuracy measurement. Figure 2b shows the measurement method of the visual sensor and measured box. The detailed workflow is as follows:

1. Solving parameters: Before using the system, we obtain the parameters by using our calibration method (Section 3.3).

2. Data collection: The visual sensor connected to a portable mobile device is used. Two images of any two adjacent faces of the box are obtained. The four modulated laser stripes should intersect the four edges of the box face, as shown in Figure 2c,d.

3. Volume measurement: The system automatically processes the collected images and then obtains the box length, width, and height. Finally, the system automatically obtains the volume of the measured box.

The volume of a regular logistics box is an important indicator of the freight charges collected in the logistics industry. The box length, width, and height should be measured to determine the box volume. Volume measurement systems face certain difficulties, which are reflected in the following four aspects: (1) The environment inside the distribution center is complex and the illumination varies (Figure 3a,c,g,h). (2) Logistics boxes have varied sizes, and the box length ranges from 10 mm to 1800 mm (Figure 3a,b,e,f,j). (3) Laser-modulated images are influenced by variations in box materials, color, and appendages (Figure 3b,d,f,h,i). (4) Non-contact and portable measurements are required.
To solve the abovementioned problems, we model the boxes as parallelepipeds, as shown in Figure 2a. The volume of a parallelepiped can be calculated by using the 3D coordinates of the vertices of two arbitrary adjacent faces of the box. The 3D coordinates of a box face can be obtained on the basis of the intersections of the laser lines and the edges of the box face. Thus, the laser lines and box edges on the laser-modulated images must be extracted before we can calculate the volume of the measured box (Section 3.4), and the equations of the laser planes of the laser projector and the camera parameters must be obtained (Section 3.3).
Our portable system for box volume measurement that is based on line-structured light vision and deep learning only requires two laser-modulated box images for the measurement. Figure 4 depicts the scheme behind the proposed solution. Before the measurement, we obtain the parameters by using our calibration method and write the parameters to the device. We input the two laser-modulated images into the designed network to generate the edge probability map. Subsequently, we obtain the coordinates of key points of the box face through a simple image processing of the edge probability map. We can obtain the box volume combined with the calibration parameters and key points.

Design of the Visual Sensor Measurement System
The portable volume measurement system that was proposed in this work consists of a 2 × 2 laser line grid projector and high-resolution camera, as shown in Figure 5b; it has a low computational cost. Table 1 lists the detailed parameters of the visual sensor. The size of the designed device is 140 mm × 35 mm × 35 mm, and the weight is 110 g. The baseline length of the device is 120 mm, thereby easily meeting the requirements of stability and portability. Furthermore, connection to other mobile devices, such as a mobile phone or pad, is convenient.

Figure 5a presents the measurement schematics of the proposed volume measurement system. $O_w$-$X_wY_wZ_w$ is the world coordinate system (WCS), and $O_c$-$X_cY_cZ_c$ is the camera coordinate system (CCS). The laser stripes are projected onto the box face through a laser projector. The camera captures the laser stripes that are modulated by the box faces, which yields the laser-modulated images. Note that the four modulated laser stripes must intersect the four edges of the box face.

Geometric Model
Camera mapping coordinate points in a 3D world to a two-dimensional (2D) image plane can be described while using a pinhole model [32]. Figure 6 shows the perspective projection relationship between 3D space point and 2D image point in the pinhole camera model.

The projection from a 3D point $P(x_w, y_w, z_w)$ in the WCS to a 2D image point $p(u, v)$ in the image plane is expressed by the following equation:

$$\rho \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha & \delta & u_0 \\ 0 & \beta & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \tag{1}$$

where $T$ and $R$ represent the translation vector and rotation matrix from the WCS to the CCS, respectively. $\alpha$ and $\beta$ are the scale factors in the $u$ and $v$ axes of the camera, respectively, and $\delta$ is the skew of the two image axes. $\rho$ is a nonzero factor, and $(u_0, v_0)$ is the principal point. The rotation matrix $R$ and translation vector $T$, which transform a world point to a 3D point $P_c(x_c, y_c, z_c)$ in the CCS, encapsulate the camera orientation and position. The transformation from the CCS to the image coordinate system can be written as

$$\rho \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha & \delta & u_0 \\ 0 & \beta & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \tag{2}$$

Equation (2) is the expression of a straight line in space, which connects the point in the CCS with the point in the image plane. In practice, radial and tangential distortions of the lens are inevitable. In our practical engineering application, the tangential distortion of the lens has a minimal effect on the result. In this study, we only consider the radial distortion, and we have the following equations:

$$\tilde{x} = x\,(1 + k_1 r^2 + k_2 r^4), \qquad \tilde{y} = y\,(1 + k_1 r^2 + k_2 r^4) \tag{3}$$

where $r^2 = x^2 + y^2$, $(\tilde{x}, \tilde{y})^T$ is the distorted image coordinate, and $(x, y)^T$ is the idealized one. $k_1$ and $k_2$ are the radial distortion coefficients of the lens.

The laser light plane that is emitted from the visual sensor intersects with the box face and forms laser stripes in the image plane captured by the camera. Point $D_1$ in the image belongs not only to the intersection line with the surface to be digitized but also to the laser light plane, and it must fulfil the camera model equations. Once the perspective projection matrix of the camera and the equations of the planes containing the sheets of light relative to a global coordinate frame are obtained from the calibration, the triangulation for computing the 3D coordinates of object points simply involves finding the intersection of a ray from the camera and a plane from the projector. Thus, the equation of the laser plane in the CCS is as follows:

$$a_i x_c + b_i y_c + c_i z_c + d_i = 0 \tag{4}$$

where $i$ is the laser stripe number and $a_i$, $b_i$, $c_i$, and $d_i$ are the coefficients. The number of plane equations equals the number of light stripes. The laser plane contributes the additional information that is necessary for completing the equation of the straight line of the camera model, such that the 3D coordinates of a stripe point can be extracted from its 2D image coordinates $(u, v)$. A 3D point $P(x_c, y_c, z_c)$ at the intersection of the viewing ray from the camera and the laser plane from the projector is triangulated by using the camera and projector parameters.
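The pinhole projection with radial distortion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `project_point` and its argument layout are hypothetical, the distortion is applied to the normalized coordinates $(x, y) = (x_c/z_c, y_c/z_c)$, and only the two radial terms $k_1$ and $k_2$ are modeled.

```python
def project_point(p_world, R, T, alpha, beta, delta, u0, v0, k1=0.0, k2=0.0):
    """Project a 3D world point to pixel coordinates (u, v).

    R is a 3x3 rotation matrix (nested lists) and T a 3-vector, both mapping
    world coordinates to camera coordinates. All names are illustrative.
    """
    # World -> camera: P_c = R * P_w + T
    xc, yc, zc = (
        sum(R[i][j] * p_world[j] for j in range(3)) + T[i] for i in range(3)
    )
    if zc <= 0:
        raise ValueError("point is not in front of the camera")

    # Ideal normalized image coordinates
    x, y = xc / zc, yc / zc

    # Two-term radial distortion (tangential terms are ignored, as in the text)
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    xd, yd = x * factor, y * factor

    # Intrinsic mapping with scale factors, skew, and principal point
    u = alpha * xd + delta * yd + u0
    v = beta * yd + v0
    return u, v
```

With an identity rotation, zero translation, and zero distortion, the function reduces to the familiar $u = \alpha x_c/z_c + u_0$, $v = \beta y_c/z_c + v_0$.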
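On the basis of Equations (2) and (4), we derive a set of linear equations in $[x_c/z_c,\ y_c/z_c,\ 1/z_c]$, as follows:

$$\begin{bmatrix} \alpha & \delta & 0 \\ 0 & \beta & 0 \\ a_i & b_i & d_i \end{bmatrix} \begin{bmatrix} x_c/z_c \\ y_c/z_c \\ 1/z_c \end{bmatrix} = \begin{bmatrix} u - u_0 \\ v - v_0 \\ -c_i \end{bmatrix}$$

Therefore, $P(x_c, y_c, z_c)$ in the CCS can be expressed as

$$z_c = \frac{-d_i}{a_i x_n + b_i y_n + c_i}, \qquad x_c = x_n z_c, \qquad y_c = y_n z_c$$

where $y_n = (v - v_0)/\beta$ and $x_n = (u - u_0 - \delta y_n)/\alpha$ are the normalized image coordinates of the pixel. On the basis of the intersection of lines $D_1D_3$ and $D_5D_7$ in the CCS, the coordinates of intersection point $A$ are obtained as $A(X_{ca}, Y_{ca}, Z_{ca})$. Similarly, we can generate the 3D coordinates of $B$, $C$, and $D$ in the CCS: $B(X_{cb}, Y_{cb}, Z_{cb})$, $C(X_{cc}, Y_{cc}, Z_{cc})$, and $D(X_{cd}, Y_{cd}, Z_{cd})$. Thus, we derive the length and width of this box face as the distances between adjacent vertices.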
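The triangulation step and the vertex computation can be sketched as follows. The sketch rests on several assumptions that are not taken from the paper: the laser plane is written as $a x + b y + c z + d = 0$ in the CCS, the pixel has already been corrected for lens distortion, and the function names are hypothetical. Because two 3D lines reconstructed from noisy points rarely meet exactly, the second function returns the midpoint of their shortest connecting segment, which is a common substitute for an exact intersection.

```python
def triangulate_pixel(u, v, plane, alpha, beta, delta, u0, v0):
    """Intersect the camera ray through pixel (u, v) with a laser plane.

    plane = (a, b, c, d) with a*x + b*y + c*z + d = 0 in camera coordinates
    (assumed sign convention). The pixel is assumed to be undistorted.
    """
    a, b, c, d = plane
    # Normalized image coordinates x_n = x_c / z_c and y_n = y_c / z_c
    y_n = (v - v0) / beta
    x_n = (u - u0 - delta * y_n) / alpha
    denom = a * x_n + b * y_n + c
    if abs(denom) < 1e-12:
        raise ValueError("camera ray is parallel to the laser plane")
    z_c = -d / denom
    return (x_n * z_c, y_n * z_c, z_c)


def _dot(m, n):
    return sum(i * j for i, j in zip(m, n))


def line_intersection_3d(p1, p2, p3, p4):
    """Approximate intersection of line p1-p2 with line p3-p4.

    Returns the midpoint of the shortest segment between the two lines, which
    coincides with the true intersection when the lines actually meet.
    """
    d1 = [q - p for p, q in zip(p1, p2)]
    d2 = [q - p for p, q in zip(p3, p4)]
    w = [p - q for p, q in zip(p1, p3)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("lines are parallel")
    s = (b * e - c * d) / denom  # parameter along the first line
    t = (a * e - b * d) / denom  # parameter along the second line
    return tuple(
        0.5 * ((p + s * u1) + (q + t * u2))
        for p, u1, q, u2 in zip(p1, d1, p3, d2)
    )
```

Given two vertices computed this way, an edge length is simply the Euclidean distance between them, for example `math.dist(A, B)` in Python 3.8 or later.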
Similarly, we capture the image of the box face that is adjacent to the face in the first image. On the basis of Equation (9), we can measure the length and width of the second face. Hence, the box height can be calculated (Equation (10)).
Therefore, we can obtain the box volume as the product of the box length, width, and height.
However, a measured box with dimensions A × A × B poses a problem. If both captured faces have dimensions A × B, then our algorithm does not work properly. In this case, we obtain the box length and width from the first image, but we cannot calculate the box height from the second image through Equation (10), because both A and B calculated from the second image satisfy Equation (10). We must then manually select A or B as the box height in our system.
Thus far, a box volume measurement approach that requires only two laser-modulated images of boxes has been introduced. Section 3.3 presents a one-step calibration method for the camera and laser projector. Section 3.4 describes how the coordinates of the key points are automatically obtained from the laser-modulated images by deep learning.
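The ambiguity of an A × A × B box can be made concrete with a short sketch. The matching rule below is an assumption chosen for illustration and is not the paper's Equation (10): a dimension of the second face is treated as the shared edge if it agrees with a dimension of the first face within a tolerance, and the remaining dimension becomes a height candidate. The function names and the default tolerance of 5 mm (taken from the reported error bound) are hypothetical.

```python
def height_candidates(face1, face2, tol=5.0):
    """Return the possible box heights given two adjacent faces.

    face1 and face2 are (length, width) pairs in millimeters. A dimension of
    face2 is a height candidate when the other dimension of face2 matches one
    of the dimensions of face1 (the shared edge) within tol.
    """
    l1, w1 = face1
    l2, w2 = face2

    def matches_face1(x):
        return abs(x - l1) <= tol or abs(x - w1) <= tol

    candidates = []
    if matches_face1(w2):
        candidates.append(l2)
    if matches_face1(l2):
        candidates.append(w2)
    # Two candidates that agree within tol describe the same height
    if len(candidates) == 2 and abs(candidates[0] - candidates[1]) <= tol:
        candidates = [0.5 * (candidates[0] + candidates[1])]
    return candidates


def box_volume(face1, face2, tol=5.0, chosen_height=None):
    """Volume from two adjacent faces; needs chosen_height if ambiguous."""
    l1, w1 = face1
    candidates = height_candidates(face1, face2, tol)
    if not candidates:
        raise ValueError("the two faces do not share an edge within tol")
    if len(candidates) > 1:
        if chosen_height is None:
            raise ValueError("ambiguous height, select one manually")
        height = chosen_height
    else:
        height = candidates[0]
    return l1 * w1 * height
```

For two faces that both measure 300 mm × 200 mm, the candidates are 300 mm and 200 mm, which mirrors the manual selection step described above.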

Calibration Method for the Camera and 2 × 2 Laser Line Grid Projector
In this work, we present a one-step intrinsic and extrinsic calibration method for the line-structured light projector that is based on a circle calibration target. The coordinates of the key points are solved by adding the equation of the laser plane as a constraint.
Zhang et al. [17] provided an excellent method for camera calibration. Line-structured light projector calibration involves determining the camera's intrinsic and extrinsic parameters. Equation (1) represents a camera perspective projection model. The 3 × 3 rotation matrix $R$ and 3 × 1 translation vector $T$ are the external parameters of the camera. The laser plane (Equation (4)) in this coordinate system is obtained during line-structured light projector calibration. Here, we simultaneously generate the system parameters of the camera and the laser projector. Figure 8a shows the circle target that is used in this paper. The visual sensor is placed at a distance from the target board similar to the nominal working distance. $N$ images at different positions, which contain the laser line corresponding to the intersection of the laser plane with the calibration board, are captured (Figure 8b). We select the first local WCS as the absolute WCS from the $N$ local WCSs previously established. The X and Y axes of each moving target are used as the local WCS to calculate the relative position between the CCS and the local WCS, $R_i$ and $T_i$. The laser plane (Equation (4)) is fitted in the absolute CCS (Figure 8c).
Therefore, the equation coefficients of the $i$th plane ($a_i$, $b_i$, $c_i$, and $d_i$) can be computed by using the least squares method. We obtain the line-structured light projector parameters on the basis of the circle calibration target in one step. Moreover, the proposed approach does not need to extract the standard points; instead, it takes as input all coordinates of the laser stripes converted into the CCS. Therefore, the number of calibrated points is sufficient for the calibration of the laser plane. Subsequently, the equation of the laser plane is fitted to reduce the error.
The calibration board is 1300 × 1200 × 5.0 mm, and $N$ = 28 images with different poses are used to calibrate the system. The circle calibration target is printed with a high-quality printer and then placed on glass. Table 1 lists the detailed parameters of the camera and laser projector. Table 2 presents the calibration parameters.
Variations in box materials, color, appendages, and texture influence the laser-modulated images. The actual box edges and laser center lines are difficult to distinguish from other lines in the laser-modulated images in complex scenarios. Although edge detection technology [33,34] can be used to find the box contour, these algorithms often perform particularly poorly in practical applications. Recently, fully convolutional neural networks (FCNNs) have advanced the detection of edges and object boundaries in natural images. Inspired by holistically nested edge detection (HED), we adopt a structure similar to the HED network and continuously inherit and learn the precise edge in the generated output process through the side output layers. We also design our network by modifying the VGG16 [35] network. Figure 9 displays the developed improved HED (IHED) network for edge detection. In comparison with HED, our modifications can be described as follows:

1. To achieve the best edge detection effect, we build our own laser-modulated image dataset.

2. We cut the first two side output layers. Such an operation can remove considerable low-level edge information.

3. A cross-entropy loss/sigmoid layer is connected to the up-sampling layer in each stage without deep supervision.

In total, 40,000 training images are obtained to determine the IHED network parameters, and 1500 images are provided for testing. We manually mark the coordinates of the eight key points of the laser-modulated images and then draw straight lines to obtain the ground truth. Figure 10 shows two example images and the ground-truth edge results of the developed dataset.

Figure 10. Two example images and ground-truth edge results for our dataset: (a,c) Input images; (b,d) ground-truth edges by human annotation of (a,c), respectively.
In our IHED network, as in HED, we consider the following objective function:

$$L_{side}(W, w) = \sum_{m=1}^{M} \alpha_m \, l_{side}^{(m)}\left(W, w^{(m)}\right),$$

where $l_{side}$ denotes the image-level loss function for side outputs and $\alpha_m$ is the loss weight of the $m$-th side output layer. $W$ is the set of all standard network layer parameters. The parameters of the side outputs are denoted as $w = (w^{(1)}, \ldots, w^{(M)})$, and the network has $M$ side output layers. In our network architecture, the loss function is computed over all the pixels in a training image $X = (x_j, j = 1, \ldots, |X|)$ and edge map $Y = (y_j, j = 1, \ldots, |Y|)$, $y_j \in \{0, 1\}$. In the training process, this cost function traverses every pixel of the input image and of the output probability map. For each image, this function is defined as

$$l_{side}^{(m)}\left(W, w^{(m)}\right) = -\beta \sum_{j \in Y_+} \log \Pr\left(y_j = 1 \mid X; W, w^{(m)}\right) - (1 - \beta) \sum_{j \in Y_-} \log \Pr\left(y_j = 0 \mid X; W, w^{(m)}\right),$$

where $\beta = |Y_-|/|Y|$ and $1 - \beta = |Y_+|/|Y|$. $Y_+$ and $Y_-$ denote the edge and non-edge ground-truth label sets, respectively. At each side output layer, we obtain the edge probability map prediction $\hat{Y}_{side}^{(m)} = \sigma\left(\hat{A}_{side}^{(m)}\right)$, where $\hat{A}_{side}^{(m)} \equiv \{a_j^{(m)}, j = 1, \ldots, |Y|\}$ are the activations of the side output of layer $m$. Thus, the loss function for the "weighted-fusion" layer is as follows:

$$L_{fuse}(W, w, h) = Dis\left(Y, \hat{Y}_{fuse}\right), \qquad \hat{Y}_{fuse} \equiv \sigma\left(\sum_{m=1}^{M} h_m \hat{A}_{side}^{(m)}\right),$$

where $\sigma(\cdot)$ is the sigmoid function, $h = (h_1, \ldots, h_M)$ are the fusion weights, and $Dis(\cdot)$ is the distance between the fused predictions and the ground-truth label map. All of these parameters are simultaneously optimized through standard backpropagation:

$$(W, w, h)^{*} = \arg\min \left( L_{side}(W, w) + L_{fuse}(W, w, h) \right).$$

Hence, in the testing stage, given an image $X$, the final edge probability map can be defined as

$$\hat{Y} = \mathrm{Average}\left(\hat{Y}_{fuse}, \hat{Y}_{side}^{(1)}, \ldots, \hat{Y}_{side}^{(M)}\right).$$

The network parameter settings are as follows: input image size (512 × 512), mini-batch size (9), learning rate ($1 \times 10^{-3}$), loss weight for each side output layer (1), weight decay ($2 \times 10^{-4}$), and number of training iterations ($1 \times 10^{5}$; the learning rate is divided by 10 after 1000). This network design can not only realize high-precision and high-sensitivity edge detection, but also suppress internal texture edges.
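As a minimal sketch of the class-balanced cross-entropy described above, the following pure-Python function evaluates the loss of one side output on a flattened edge map. The function name `class_balanced_bce`, the clamping constant `eps`, and the flat-list input layout are illustrative assumptions, not part of the original implementation.

```python
import math


def class_balanced_bce(probs, labels, eps=1e-12):
    """Class-balanced cross-entropy over one flattened edge map (illustrative).

    probs  -- predicted edge probabilities in [0, 1], one per pixel
    labels -- ground-truth labels (1 = edge, 0 = non-edge), same length
    eps    -- hypothetical clamp that keeps log() finite
    """
    if len(probs) != len(labels) or not labels:
        raise ValueError("probs and labels must be non-empty and equal in length")

    n_total = len(labels)
    n_edge = sum(labels)
    beta = (n_total - n_edge) / n_total  # beta = |Y-| / |Y|

    loss = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            # Edge pixels are weighted by beta (the non-edge fraction).
            loss -= beta * math.log(p)
        else:
            # Non-edge pixels are weighted by 1 - beta (the edge fraction).
            loss -= (1.0 - beta) * math.log(1.0 - p)
    return loss
```

Because edge pixels are rare, beta is close to 1, so the few edge pixels contribute about as much to the loss as the many non-edge pixels.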
A total of 1500 testing images are used to verify the effectiveness of our algorithm. This study uses the precision, recall, and F-measure to evaluate the edge detection performance on the laser-modulated images. The precision-recall curve relates the recall and the precision of the detection result. The precision is the ratio of the correctly extracted structure-edge pixels (TP) to all detected edge pixels. The recall is the ratio of TP to the ground-truth edge pixels. The F-measure is a comprehensive evaluation indicator that combines recall and precision in a fixed relationship. The recall, precision, and F-measure are calculated as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2PR}{P + R},$$

where FP is the number of wrongly extracted edge pixels and FN is the number of ground-truth edge pixels that were not extracted. The proposed IHED network without deep supervision is compared with the HED algorithm on structure-edge extraction to show its effectiveness. Figure 11 shows a performance comparison of these detection algorithms on our dataset with respect to the precision, recall, and F-measure of the extracted edges. The IHED without deep supervision has a better edge extraction performance than the other three network models.

Figure 12 shows several examples of edge detection on the dataset for the HED and IHED networks (network parameters are consistent). Rows 1, 2, 3, and 4 in Figure 12 show that IHED is more advantageous than HED in detecting the structural edges of the box. The HED network detects other non-box structure edges, which are avoided by the improved network (IHED). This result is consistent with the original design intention of the edge detection structure.
Figure 11. Performance comparison of the IHED and holistically nested edge detection (HED) networks with/without deep-supervision with respect to edge extraction.
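The three scores used in this evaluation can be computed from pixel counts. The sketch below assumes that both the predicted and the ground-truth edge maps have already been binarized and flattened, and that pixels are matched strictly position by position (no spatial tolerance); the function name `edge_scores` is hypothetical.

```python
def edge_scores(pred, truth):
    """Return (precision, recall, F-measure) for two flattened binary edge maps."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")

    tp = sum(1 for p, t in zip(pred, truth) if p and t)      # correctly extracted edges
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)  # wrongly extracted edges
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)  # missed ground-truth edges

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

Sweeping the binarization threshold applied to the edge probability map and recording the resulting pairs would trace out a precision-recall curve.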


Method for Extracting the 2D Coordinates of the Key Points of the Laser-Modulated Image
We must obtain the supporting lines for the edge probability maps to obtain the 2D coordinates of the box vertices. The edge probability map of the laser-modulated image has been obtained by our network (Section 3.4.1). By using the center coordinate of the image as the origin coordinate, we use the Hough line transform [36] to detect all the straight lines on the edge probability map. Equation (20) is used to represent them.
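To illustrate how image points can be derived from detected lines, the sketch below assumes that each line is given in the Hough normal form ρ = x cos θ + y sin θ, which is the usual output of a Hough line transform; whether this matches the representation in Equation (20) is an assumption. The function name `intersect` and the parallelism tolerance `eps` are hypothetical.

```python
import math


def intersect(line_a, line_b, eps=1e-9):
    """Intersect two lines given as (rho, theta) in Hough normal form.

    Each line satisfies x*cos(theta) + y*sin(theta) = rho.
    Returns (x, y), or None when the lines are (nearly) parallel.
    """
    rho1, theta1 = line_a
    rho2, theta2 = line_b
    a1, b1 = math.cos(theta1), math.sin(theta1)
    a2, b2 = math.cos(theta2), math.sin(theta2)

    det = a1 * b2 - a2 * b1  # equals sin(theta2 - theta1)
    if abs(det) < eps:
        return None

    # Cramer's rule for the 2 x 2 linear system.
    x = (rho1 * b2 - rho2 * b1) / det
    y = (a1 * rho2 - a2 * rho1) / det
    return x, y
```

For example, the vertical line x = 3 is `(3.0, 0.0)` and the horizontal line y = 4 is `(4.0, math.pi / 2)`; their intersection is the point (3, 4) up to floating-point rounding.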
We separately obtain the fitted line equations of the laser line and the edges of the measured box. Figure 13 shows the operation process. By finding the intersection points of these lines, the coordinates of the eight key points on the 2D image can be deduced. Finally, we can easily locate the eight key points (D1-D8) on the laser line through the geometric relationships between the edges of the box face and the laser line in the 2D image, as shown in Figure 14.

The original image resolution is 2592 × 1944 pixels and the size of the edge probability map output by the network is 512 × 512 pixels.
Automatically extracting the eight key points in the collected box image with laser line has an important influence on the accuracy and automatic operation of the proposed system. We conduct a pixel-level coordinate error analysis between the raw image and the edge probability image obtained through the IHED network. We convert the coordinates of the eight key points obtained to the camera resolution of 2592 × 1944. Here, we consider the maximum measuring range of the system to be 1800 mm. Thus, we can roughly estimate the actual physical distance of each pixel as 1800/1944 mm. Assume that the maximum error allowed by the system is 5.0 mm. We can obtain the maximum pixel error that is allowed by the system as 5 × 1944/1800 = 5.40 pixels. We analyze the pixel values of 1500 images in the test dataset.
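The pixel budget above, and one plausible form of the key-point error statistic, can be sketched as follows. The exact error formula is not recoverable from the text; the mean Euclidean distance over all M × N labeled key points is an assumption, and the function names are hypothetical.

```python
import math


def max_pixel_error(max_error_mm=5.0, image_height_px=1944, range_mm=1800.0):
    """Largest tolerable pixel error when range_mm spans image_height_px pixels."""
    return max_error_mm * image_height_px / range_mm


def mean_keypoint_error(labels, preds):
    """Mean Euclidean pixel distance over all key points (assumed statistic).

    labels, preds -- one list per image (M images), each holding N (u, v) tuples
    """
    total, count = 0.0, 0
    for image_labels, image_preds in zip(labels, preds):
        for (u, v), (u_hat, v_hat) in zip(image_labels, image_preds):
            total += math.hypot(u - u_hat, v - v_hat)
            count += 1
    if count == 0:
        raise ValueError("no key points supplied")
    return total / count
```

With the default arguments, `max_pixel_error()` reproduces the 5.40-pixel limit derived above.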
In this analysis, M is the number of test images and N is the number of key points in each image. In the experiment, M is 1500 and N is 8. (u, v) is the labeled pixel coordinate and (u′, v′) is the pixel coordinate obtained by our approach. The pixel coordinate error of the key points is 1.96 pixels, which is below the 5.40-pixel limit and therefore meets our requirements.

Figure 1c illustrates the system, wherein the device is connected to an Android phone (HUAWEI honor Play) through a USB cable. The measurement environment parameters are as follows: temperature (−15 to 60 °C), measured distance from the visual sensor to the measured box (0.1-2.5 m), and measuring range of the box length, width, and height (10-1800 mm). The initial status calibration is performed before the experiment. Table 2 lists the calibration parameters of the visual sensor.

Experiments
Various experimental tests are conducted under varying operating conditions to test the robustness of the proposed system. Five experimental phases are performed to evaluate the system performance: (1) In Section 4.1, the measurement statistical analysis of boxes in complex scenes is conducted. (2) In Section 4.2, the stability of the proposed system is verified. (3) In Section 4.3, the statistical analysis on real boxes is performed and the measurement uncertainty is evaluated by using the expression of uncertainty in measurement [37]. (4) In Section 4.4, the measurement error analysis of the optical quality of the box surface and the surface variation is performed. (5) In Section 4.5, the practical performance of the proposed system is evaluated in real-world tests.


Measurement Statistical Analysis of Boxes in Complex Scenarios
The experiment tests the accuracy of the system's measurements in complex and outdoor environments. Figure 15a shows a single box captured indoors, with dimensions of 490.7 mm × 560.5 mm × 651.0 mm. Figure 15b presents the box measurement in a complex indoor environment, with multiple interfering boxes near the measured box. Figure 15c exhibits the image captured outdoors, in which the laser line is dim due to the influence of strong illumination.

Figure 16 shows the measurement results of the box that was acquired in Figure 15. The edge probability map is obtained after processing by the IHED network, and the coordinates of the eight key points are determined. Even if the box images (Figure 16c) are collected outdoors, the edge probability map can be efficiently processed by our system.
The final estimated values are recorded as the average of three experimental sessions on the box. Figure 17 shows the measurement results and actual dimensions of the measured box under different scenarios. The maximum average absolute error is 1.3 mm. Hence, our volume measurement system can accurately measure the length of each side of the box in a complex environment, which can meet the actual measurement requirements.


Pose Stability Testing
This experiment aims to verify the stability of the measurement from different viewpoints. As shown in Figure 18, the box is measured from different angles with nine poses to simulate the pose difference in actual measurement. In this experiment, the volume measurement system is used to obtain the box length and width under different poses. Only one face of the standard box (800 mm × 600 mm) is measured in this experiment to facilitate measurement and comparison. Estimated values are reported in Table 3 as the average of 30 experimental sessions on the same surface (800 mm × 600 mm). The relative errors are generally small. The deviation between the estimated and actual values is within ±5.0 mm at each pose. On the basis of the mean error analysis in Table 3, the pose of the visual device appears to have minimal effect on the measurement accuracy of the proposed system. The proposed system gives consistent results regardless of the view from which the images are captured, provided that the measurement rules are followed. The standard deviations are 1.7521 and 1.7175 mm, respectively, which indicates that the box measurement system has reliable repeated measurement accuracy. Figure 19 shows that the length errors of the box dimensions are within 5.0 mm. The results show that the system is stable across poses.

Figure 19. Errors between the standard box and the measured result.


Error Analysis on Real Box and the Evaluation of Uncertainty in the Measurement Result of Box Volume
This volume measurement system can calculate the dimension parameters of the box simply via laser triangulation and deep learning technology; thus, the entire system maintains the advantages of simple configuration and low cost. However, this method includes three main factors that affect the measurement accuracy of the box length: the measurement error of the visual sensor and the position error of the box (the distance and pose between the measured box and visual sensor). We conduct statistical experiments to evaluate the effectiveness of the method.
As shown in Figure 20, three standard boxes (#1, #2, and #3) are selected in the experiment. Their length, width, and height are 330.4 × 110.3 × 440.6 mm, 690.7 × 570.5 × 1500.0 mm, and 900.0 × 400.0 × 1800.0 mm, respectively. We use our system to collect 15 measurements for each of the three standard boxes (Table 4). We utilize these data to calculate the mean and standard deviation of each box's side length.

The data of the measurement results in Table 4 are statistically analyzed to evaluate the measurement accuracy, and the uncertainty of class A ($\mu_A$) is calculated from the repeated measurements, where $x_i$ is the estimated length, $\bar{x}$ is the mean value of the measured data, and $n$ is the number of measurements, which is 15 in this study. Table 4 shows the measurement results, with a minimum uncertainty of ±0.52 mm and a maximum uncertainty of ±4.0 mm. The measurement uncertainty in the estimated length increases with the length. The measurement uncertainty is in accordance with the experiment that is described in Table 4. Figure 21 shows that the length errors of the box dimensions are within ±5.0 mm. The results show that the system has good accuracy.
Figure 22 shows the measurement uncertainty of the measuring device, which is consistent with the experimental results.
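A Type A evaluation of repeated measurements can be sketched as follows. The text does not state whether the reported uncertainty is the sample standard deviation or the standard uncertainty of the mean, so the sketch returns both; the function name `type_a_uncertainty` is hypothetical, and the formula s/√n follows the usual convention for Type A evaluation rather than anything confirmed by the paper.

```python
import math
import statistics


def type_a_uncertainty(values):
    """Return (mean, sample standard deviation, standard uncertainty of the mean).

    values -- repeated measurements x_1 ... x_n of one side length, in mm
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two measurements are required")

    mean = statistics.mean(values)
    s = statistics.stdev(values)  # divides by n - 1
    u_mean = s / math.sqrt(n)     # Type A standard uncertainty of the mean
    return mean, s, u_mean
```

Applied to the 15 repeated measurements of one side length, the function would return that side's mean, its spread, and the uncertainty of the mean in millimetres.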


Measurement Error Analysis of the Optical Quality of the Boxes Surface and the Surface Variation
The experiment tests the effect of the optical quality of the box surface and of surface variation on the system's measurements. Figure 23a-c exhibit the images captured at different optical quality. Figure 23d-f exhibit the images of boxes with surface variation. The second row in Figure 23 shows the image processing results of the box faces.

Figure 24a shows the measurement results for the optical quality of the box surface, with a minimum measurement error of 0.2 mm and a maximum error of 1.3 mm. Figure 24b shows the measurement results for the surface variation, with a minimum measurement error of 2.0 mm and a maximum error of 7.6 mm. The results show that the system is little affected by the optical quality of the surface, but its measurement error increases markedly (up to 7.6 mm) for boxes with surface variation.


Online Measurement Testing
Six standard boxes with different sizes and volumes are selected for measurement to evaluate the measurement accuracy, as shown in Figure 25. Table 5 displays the corresponding experimental results. The final measurement of the box length is highlighted in bold. We estimate the relative measurement error of the volume as $\varepsilon = |v_e - v_a| / v_a$, where $v_e$ is the estimated volume and $v_a$ is the actual volume. The results in Table 5 indicate that the error of the measurement system increases with the side length of the measured box, but the error between the measured and actual values of each side length of every standard box is within ±5.0 mm. The maximum relative measurement error of the volume (ε) of the measured box is 2.27% and the mean relative error is 0.83%, which indicates good precision.
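The relative volume error can be computed directly from the three side lengths. The sketch below is illustrative only; the function name `relative_volume_error` and the example dimensions in the usage note are hypothetical and are not taken from Table 5.

```python
import math


def relative_volume_error(estimated_dims, actual_dims):
    """Relative volume error |v_e - v_a| / v_a from (length, width, height) tuples."""
    if len(estimated_dims) != 3 or len(actual_dims) != 3:
        raise ValueError("each box needs exactly three side lengths")

    v_e = math.prod(estimated_dims)  # estimated volume
    v_a = math.prod(actual_dims)     # actual volume
    if v_a <= 0:
        raise ValueError("actual volume must be positive")
    return abs(v_e - v_a) / v_a
```

For instance, a hypothetical box of 100 × 100 × 100 mm measured as 101 × 100 × 100 mm yields a relative volume error of about 1%, which shows how a side-length error of 1 mm propagates to the volume.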


Conclusions
This research presents a line-structured light-based 3D measuring sensor and a deep-learning-based box volume measuring method. Our box volume measurement method only requires two laser-modulated images. We propose a novel end-to-end edge detection architecture based on an IHED network to extract the straight structure edge lines in laser-modulated images. By cutting the first two side output layers and training without the deep supervision of HED, our network can learn robust straight line features from laser-modulated images. Moreover, we present a one-step calibration method to calibrate our portable measuring sensor automatically. Experimental results show that the measuring range of our proposed system is 100-1800 mm with errors less than ±5.0 mm. Our system is suitable for portable automatic box volume measurement, and it is useful for warehouses and distribution and logistics companies. Our future work will focus on small portable measuring devices.