A Numerical Measurement Method for Dynamic Granular Materials Based on Computer Vision

Granular materials are widespread in nature and in human production, and their macro-mechanical behavior is significantly affected by granule movement. The development of computer vision has brought new ideas for measuring the numerical information (including the amount of translation, the rotation angle, velocity, acceleration, etc.) of dynamic granular materials. In this paper, we propose a numerical measurement method for dynamic granular materials based on computer vision. Firstly, an improved video instance segmentation (VIS) network is introduced to perform end-to-end multi-task learning; its temporal feature fusion module and tracking head with long-sequence external memory mitigate the problems of poor video data quality and the high similarity in appearance of granular materials, respectively. Secondly, the numerical information is extracted through a series of post-processing steps. Finally, the effectiveness of the measurement method is verified by comparing the numerical measurement results with the true values. The experimental results indicate that our improved VIS obtains an average precision (AP) of 76.6, that the relative errors and standard deviations remain at a low level, and that this method can be used effectively to measure the numerical information of dynamic granular materials. This study provides an intelligent approach to the task of measuring the numerical information of dynamic granular materials, which is of great significance for studying the spatial distribution, motion modes and macro-mechanical behavior of granular materials.


Introduction
Granular materials (such as coarse-grained soil) change from a loose state to a dense one, which is the result of internal mesoscopic structure changes caused by granule movement [1][2][3]. Meanwhile, translation and rotation [4], as the two main forms of movement, have a significant impact on the macroscopic mechanical behavior (especially deformation) of dynamic granular materials [5]. Hence, it is necessary to measure the numerical information of dynamic granular materials (including the amount of translation, the rotation angle, velocity, acceleration, etc.). At present, the main method for obtaining and analyzing the numerical information of dynamic granular materials is the discrete element method (DEM), which was first proposed by Cundall and Strack in the 1970s [6] and has been continuously refined and developed by many scholars [7]. However, the numerical model itself and its statistical results lack effective data verification, so it is difficult for them to be universally accepted.

The main contributions of this paper are as follows:

1. We combine computer vision and the numerical measurement task to propose a numerical measurement method for dynamic granular materials. This method is mainly based on VIS, which realizes end-to-end multi-task learning and simultaneously detects, segments and tracks dynamic granular materials.
2. We analyze the properties of video data and granular materials to improve the VIS network. A temporal feature fusion module and a tracking head with long-sequence external memory are introduced to make the VIS network more suitable for the numerical measurement of dynamic granular materials.
3. A variety of effective post-processing steps, such as the extraction of the centroid and long axis, ellipse fitting, and pixel-actual distance calibration, are used to obtain the amount of translation, the rotation angle, the velocity and the acceleration of dynamic granular materials.
4. A set of experimental equipment is designed to collect videos of dynamic granules, and the numerical results of dynamic granular materials are then measured by the proposed method. The amount of translation, the rotation angle, the velocity and the acceleration of granular materials are compared with the true values to verify the effectiveness of the proposed method.

Method Framework
The overall method framework is illustrated in Figure 1. Firstly, videos of granular materials are collected and annotated to create a dataset. Secondly, the improved VIS network is trained by end-to-end multi-task learning, and dynamic granular materials are detected, segmented and tracked simultaneously. Then, the centroids of the granules are extracted and ellipse fitting is performed on the masks; the amount of translation and the rotation angle are calculated from the changes of the centroids and of the fitted ellipses' major-axis angles, respectively. Further, the velocity and acceleration of the dynamic granular materials can also be extracted. In addition, it is necessary to calibrate the pixel distance against the actual distance when measuring the translation, velocity and acceleration of granular materials.

Figure 1. Method framework. The main processes of our proposed method are: collecting videos, creating a dataset, end-to-end multi-task learning, VIS and post-processing steps. VIS is the key process for the numerical information measurement of granular materials.


An Improved Video Instance Segmentation Network

Overall Network Architecture
The overall architecture of our improved VIS, which simultaneously detects, segments and tracks objects in videos through a two-stage multi-task learning approach, is shown in Figure 2. In the first stage, feature maps are produced from the input video frames by ResNet [15], and the temporal feature fusion module aggregates the feature information of the support frames to enhance the feature response in the current frame. Then, multi-scale feature maps are generated through FPN [16] and multiple candidate objects are extracted with RPN [17] to generate a series of candidate proposals. In the second stage, the aligned RoI features are input into the box head, mask head and tracking head. The box head and mask head are inherited from MaskTrack R-CNN [10] and achieve bounding box regression and mask generation, respectively. In the following two subsections, we analyze the design motivation and detailed structure of the temporal feature fusion module and of the tracking head with long-sequence external memory.

Figure 2. The overall architecture of our improved VIS. Our improved VIS consists of two stages. The first stage is composed of ResNet, the temporal feature fusion module, FPN and RPN, where the added temporal feature fusion module aggregates the feature information of the support frames to enhance the feature response in the current frame. The second stage extracts features by RoIAlign, and then the box head, mask head and tracking head achieve bounding box regression, mask generation and tracking, respectively. RoIAlign is omitted here.

Temporal Feature Fusion Module
Video data are often fuzzy and low in quality because of equipment factors such as lens defocus and motion blur, so there is a large quality gap between video data and ordinary static image data. In addition, unfavorable conditions in the collection environment for granular materials, such as vibration and uneven lighting, can also affect the quality of the data. In response to the above disadvantages, a temporal feature fusion module is added to our VIS model. As shown in Figure 3, this module enriches the features of the current frame with support frames. Firstly, the feature map A output from ResNet is converted into a new feature map q by a 1 × 1 convolution and nonlinear activation; this new feature map q encodes the key information (object category, object location and mask) in the current frame. Secondly, the feature map B of the support frames is encoded into k and v by two parallel 1 × 1 convolutions and nonlinear activations. The attention matrix S is obtained by computing the inner product of q and k, so S relates each position in q to each position in k. Then, the attention matrix S is used to aggregate the features of v to obtain a new feature map that fuses temporal information from the support frames. Finally, this new feature map is transformed into the feature map W by a 1 × 1 convolution and nonlinear activation, and W is added to the original feature map A to acquire the enhanced feature map Z. The overall process can be summarized in the following equations:

S = q ⊙ k (1)

W = F( (v ⊙ S(:, j)) / Σ_{j=1}^{N_all} S(:, j) ) (2)

Z = A ⊕ W (3)

where ⊙ is the inner product; i and j are the indices of each position in the similarity matrix and feature map, respectively; N_all is the total number of positions; F is the transformation function which corresponds to a 1 × 1 convolution and nonlinear activation; and ⊕ is element-wise summation. The enhanced feature map Z not only preserves the informative key visual semantics of the current frame, but also incorporates useful contextual information from the support frames in the same video.
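As a concrete illustration, the attention-style aggregation described above can be sketched in NumPy. This is a minimal sketch, not the paper's implementation: the weight matrices Wq, Wk, Wv and Wo stand in for the 1 × 1 convolutions, feature maps are flattened over their spatial positions, and ReLU is assumed as the nonlinear activation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax used to normalize the attention matrix.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_fusion(A, B, Wq, Wk, Wv, Wo):
    """Fuse support-frame features B into current-frame features A.

    A: (C, N) current-frame feature map flattened over its N spatial positions.
    B: (C, M) support-frame feature map flattened over M positions.
    Wq, Wk, Wv, Wo: (C, C) weights standing in for the 1 x 1 convolutions.
    """
    relu = lambda x: np.maximum(x, 0.0)
    q = relu(Wq @ A)                    # encode current frame       -> (C, N)
    k = relu(Wk @ B)                    # encode support frames      -> (C, M)
    v = relu(Wv @ B)                    # values from support frames -> (C, M)
    S = q.T @ k                         # attention matrix           -> (N, M)
    agg = v @ softmax(S, axis=1).T      # normalized aggregation     -> (C, N)
    W = relu(Wo @ agg)                  # final 1 x 1 conv + activation
    return A + W                        # residual sum gives the enhanced map Z
```

A 1 × 1 convolution acts independently on every spatial position, which is why it reduces to a channel-wise matrix multiplication once the feature map is flattened.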


Tracking Head with Long-Sequence External Memory
Granular materials are often densely packed and the similarity in appearance between granules is high, which causes difficulties in tracking. Recently, per-clip models [19][20][21] were reported to obtain better VIS results by aggregating multi-frame information. Inspired by these models, we design a tracking head that comprehensively compares instance similarity across multiple frames to enhance tracking performance. The structure is shown in Figure 4; this tracking head mainly includes two fully connected layers and a long-sequence external memory. The two fully connected layers map features for candidate objects, and the long-sequence external memory stores the features of previous instances. We use inner products to represent the correlation between a candidate object and previous instances, and each previous instance in memory can hold features of at most L sequences. Specifically, for a candidate object i, its similarity with a previous instance j already existing in the long-sequence external memory can be expressed as a sequence-weighted inner product:

sim(i, j) = Σ_{l=1}^{L} γ_l φ_i^T φ_j^l

where l is the sequence index, γ_l is the sequence discount factor at l, φ_i is the feature of candidate object i, φ_j^l is the feature of instance j at l, and φ_i^T φ_j^l is the inner product of φ_i and φ_j^l. For those instances that do not yet have L sequences in memory, we only compute the inner product over the existing sequences for fair comparison.

Materials 2021, 14, x FOR PEER REVIEW

Figure 4. Tracking head structure. Our tracking head consists of two main parts: fully connected layers and a long-sequence external memory, which assigns instance IDs to candidate objects in the current frame by calculating and comparing sequence-weighted inner products.
In the training phase, we use L_tr reference frames and a query frame to train our tracking head. For the reference frames, we extract features from their ground-truth instance regions and save them to the long-sequence external memory. Instances between reference frames are also matched by their ground-truth regions. Because the reference frames are randomly selected from the video frames during training, the sequence discount factor γ_l is the average over the number of reference frames:

γ_l = 1 / L_tr

In the inference phase, we sequentially process each frame in an online fashion. Each current frame has L_in corresponding sequences, and the sequence discount factor γ_l is related to the frame sequence numbers of the sequences in the video:

γ_l = f_l / f_{L_in}

where f_l is the frame sequence number of the l-th sequence in the video and f_{L_in} is the frame sequence number of the L_in-th sequence.

Finally, the probability of assigning instance ID x to candidate object i is calculated by Softmax:

p_i(x) = e^{sim(i, x)} / (1 + Σ_{j=1}^{N} e^{sim(i, j)}), x ∈ [1, N]

p_i(0) = 1 / (1 + Σ_{j=1}^{N} e^{sim(i, j)})

where N is the number of previous instances; x = 0 means that object i is a new instance and x ∈ [1, N] means that object i belongs to one of the previous N instances. The external memory is dynamically updated when an instance ID is successfully assigned to a new candidate object. If the candidate object belongs to an existing instance ID, we replace the feature of the farthest sequence in memory with the feature of the new candidate object. If the candidate object does not have a corresponding instance ID that can be assigned, the feature of the candidate object is inserted into the external memory and a new instance ID is created. Our tracking head can fully consider the features within L sequences during instance ID assignment, increasing the robustness of tracking in multi-instance environments and for granular materials with high feature similarity.
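The assignment and memory-update logic described above can be sketched in NumPy. This is an illustrative sketch under our own conventions (a dict keyed by instance ID holding up to L feature vectors, label 0 for a new instance), not the paper's implementation:

```python
import numpy as np

def assign_instance_id(phi_i, memory, gammas):
    """Match candidate feature phi_i against the long-sequence external memory.

    memory: dict {instance_id: [feature at sequence 1, ..., up to L entries]}
    gammas: per-sequence discount factors gamma_l.
    Returns (probabilities over [new, id_1, ..., id_N], assigned label),
    where label 0 denotes a new instance.
    """
    ids = sorted(memory)
    sims = []
    for j in ids:
        seqs = memory[j]  # an instance may hold fewer than L sequences
        # sequence-weighted inner product over the existing sequences
        sims.append(sum(g * float(phi_i @ f) for g, f in zip(gammas, seqs)))
    logits = np.array([0.0] + sims)      # logit 0 for the "new instance" slot
    e = np.exp(logits - logits.max())
    probs = e / e.sum()                  # Softmax over [new, existing IDs]
    x = int(np.argmax(probs))
    return probs, (0 if x == 0 else ids[x - 1])

def update_memory(memory, phi_i, label, L):
    """Dynamically update the external memory after an assignment."""
    if label == 0:                       # unmatched: create a new instance ID
        new_id = max(memory, default=0) + 1
        memory[new_id] = [phi_i]
        return new_id
    seqs = memory[label]
    if len(seqs) == L:
        seqs.pop(0)                      # drop the farthest sequence
    seqs.append(phi_i)                   # store the newest candidate feature
    return label
```

Repeating assign-then-update per frame mirrors the online inference procedure, with the memory bounded at L features per instance.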

Loss Function
The loss function of the VIS model consists of four sub-task losses: classification, detection box regression, segmentation and tracking, which can be expressed as:

L = L_cls + L_box + L_mask + L_track

where L_cls, L_box and L_mask are the same losses as in Mask R-CNN [22], and L_track is the cross-entropy loss, similar to MaskTrack R-CNN [10].

Post-Processing Steps
To measure the amount of translation, velocity and acceleration, the centroids of granules need to be extracted first. We determine the abscissa and ordinate of the centroid independently in the x and y directions because the segmented mask is two-dimensional. Specifically, the coordinates of centroids in the x (y) direction are calculated by bisecting the number of pixels on the left and right (up and down) sides.
The centroids of granular materials can be extracted by the above operation, and the centroid of the first frame is then subtracted to acquire the amount of translation. The values of velocity and acceleration in the x and y directions can be obtained by taking the first and second derivatives. It is necessary to perform pixel-actual distance calibration, because the unit of the above numerical values is pixels. We complete the calibration in a simple way, which can be expressed as follows:

k = S_act / S_pix

where k is defined as the actual distance corresponding to a pixel, S_act is the actual distance and S_pix is the pixel distance.

The measurement of the rotation angle is more complicated, so the granular materials need to be fitted. There are many fitting methods, and the ellipse fitting method is the most suitable one for the task of movement information detection [23]. Figure 5 shows the effect of ellipse fitting. The rotation angle can then be approximated from the changes in the major-axis angles of the fitted ellipses.
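The post-processing steps can be sketched as follows. The centroid is the per-axis median of the mask pixel coordinates (the coordinate that bisects the pixel count), and the major-axis angle is obtained here from the mask's second-order moments, which give the orientation of the best-fit ellipse; this moment-based shortcut and the function names are our illustration, not the paper's exact procedure:

```python
import numpy as np

def centroid(mask):
    """Centroid of a binary mask: the x (y) coordinate that bisects the
    pixel count into equal left/right (up/down) halves, i.e. the median."""
    ys, xs = np.nonzero(mask)
    return float(np.median(xs)), float(np.median(ys))

def major_axis_angle(mask):
    """Orientation (radians) of the best-fit ellipse's major axis,
    computed from the mask's second-order central moments."""
    ys, xs = np.nonzero(mask)
    x = xs - xs.mean()
    y = ys - ys.mean()
    mu20, mu02, mu11 = (x * x).mean(), (y * y).mean(), (x * y).mean()
    return 0.5 * np.arctan2(2 * mu11, mu20 - mu02)

def kinematics(centroids, k, fps):
    """Translation, velocity and acceleration from per-frame centroids.
    k: actual distance per pixel (k = S_act / S_pix); fps: frames per second."""
    c = np.asarray(centroids, dtype=float) * k          # pixel -> actual distance
    translation = c - c[0]                              # displacement from first frame
    velocity = np.gradient(c, 1.0 / fps, axis=0)        # first derivative
    acceleration = np.gradient(velocity, 1.0 / fps, axis=0)  # second derivative
    return translation, velocity, acceleration
```

Tracking the angle returned by major_axis_angle across frames then approximates the rotation angle, as described above.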



Experimental Equipment and Parameters

As shown in Figure 6, we designed a set of experimental equipment to monitor and record videos of granular materials. It includes an experimental table, coarse granular materials, fine granular materials, a vision sensor and a sensor bracket. The experimental table is a circular table with a diameter of 32 cm, which has two different modes: vibration and rotation. In rotation mode, the speed can be set to 0-1.71 rad/min. The vision sensor is located 50 cm above the experimental table and is fixed by the sensor bracket. The sizes of the coarse and fine granular materials range from 20 mm to 30 mm and from 2.5 mm to 7.5 mm, respectively.

The entire VIS network was trained for 12 epochs with an NVIDIA GeForce RTX 3070. The backbone of our VIS network is ResNet-101 [15] with FPN [16], pretrained on the MSCOCO dataset [24] to quicken convergence. During the training phase, the model also needs to sample other frames in the video to help train the temporal feature fusion module and the tracking head. For each input frame, we randomly selected five frames from the same video, two of which were chosen as support frames according to CompFeat [18]. If a video frame belongs to both the support frames and the reference frames, the probability of assigning instance IDs will be affected by this frame, so three frames serve as reference frames for the tracking head and L is set to 3 during the training phase. The weights of both the pretrained backbone network and the sub-task heads were updated during training.
During the inference phase, four additional frames from the test video are treated as support frames and the number of sequences is five, because testing with more information can help improve VIS performance [18]. In addition, the tracking of the evaluation process also incorporates other cues, such as semantic consistency, spatial correlation and detection confidence, as powerful post-processing techniques to improve the robustness of the tracking [10].


Dataset
We utilized the degree of mixing to express the distribution of coarse granules and fine granules, and divided the degree of mixing into four levels. Figure 7 presents the different mixing degrees: Figure 7a represents a mixing degree of 100%, which means that the coarse granules and fine granules are uniformly mixed; Figure 7b presents a mixing degree of 0%; Figure 7c represents a degree of mixing between 0% and 50%, which means that a small part of the granules are mixed; and Figure 7d presents a mixing degree of between 50% and 100%, which indicates that most of the granules are mixed.

As shown in Table 1, we collected videos of dynamic granular materials with a total duration of 29,092 frames (about 970 s) using the experimental equipment. Considering that the movement amplitude of vibrating granular material is low, we selected one frame for labeling from every 60 frames of the vibrating videos. However, videos of rotating granular materials contain a large amount of movement, so we marked one frame from every 30 frames of the rotating videos. The duration of each video varied from 5 to 45 s and the label files followed MSCOCO's style [24]. We only performed VIS on the coarse granular materials in this experiment, because labeling the fine granular materials is too difficult. In addition, about one-third of the videos had problems such as lens defocus and uneven lighting; these videos were kept to enhance the robustness of the network and to verify the model's adaptability to image quality problems. We marked 706 frames in total, and all videos were randomly divided into training videos and validation videos at a ratio of about 6:1.
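The frame-selection rule for labeling can be sketched as follows; starting indices at frame 0 is our assumption, not stated in the text:

```python
def label_frame_indices(n_frames, mode):
    """Frames selected for annotation: every 60th frame of a vibration video,
    every 30th frame of a rotation video (0-based frame indices assumed)."""
    step = 60 if mode == "vibration" else 30
    return list(range(0, n_frames, step))
```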


We set up the evaluation indicators on the basis of two aspects: visual processing and numerical measurement. The common average precision (AP) is used to reflect the effect of visual processing. Our AP is calculated in the same way as for images, except for the intersection-over-union (IoU): the IoU is extended from the image to the video sequence, representing the degree of overlap between the predicted mask sequence and the real mask sequence over the entire video [10].

The numerical measurement evaluation indicators can be divided into two parts: relative error and standard deviation. The relative error is the ratio of the absolute error caused by the measurement to the true value, and it consists of two parts, the relative error of translation $E_T$ and the relative error of rotation $E_R$, which reflect the confidence of the measurement results obtained using our method:

$$E_T = \frac{1}{VMN}\sum_{v=1}^{V}\sum_{m=1}^{M}\sum_{n=1}^{N}\frac{\left|U^n_{m,v}-u^n_{m,v}\right|}{u^n_{m,v}}, \qquad E_R = \frac{1}{VMN}\sum_{v=1}^{V}\sum_{m=1}^{M}\sum_{n=1}^{N}\frac{\left|W^n_{m,v}-w^n_{m,v}\right|}{w^n_{m,v}}$$

where $V$ is the number of videos in the validation set, $M$ is the number of frames, and $N$ is the number of granular materials in the frames, so that $VMN$ represents the total number of measurements performed on the validation set. $U^n_{m,v}$ and $W^n_{m,v}$ respectively refer to the amount of translation and the rotation angle of the $n$-th granule in the $m$-th frame of the $v$-th video, calculated by our method; $u^n_{m,v}$ and $w^n_{m,v}$ are the true amount of translation and the true rotation angle.

In addition, we calculate the standard deviation of the absolute errors of the numerical measurements, which reflects the stability of our proposed measurement method. Similarly, the standard deviation can be divided into two parts, the standard deviation of translation $\sigma_T$ and the standard deviation of rotation $\sigma_R$, which can be expressed as:

$$\sigma_T = \sqrt{\frac{1}{VMN}\sum_{v=1}^{V}\sum_{m=1}^{M}\sum_{n=1}^{N}\left(\left|U^n_{m,v}-u^n_{m,v}\right|-A_T\right)^2}, \qquad \sigma_R = \sqrt{\frac{1}{VMN}\sum_{v=1}^{V}\sum_{m=1}^{M}\sum_{n=1}^{N}\left(\left|W^n_{m,v}-w^n_{m,v}\right|-A_R\right)^2}$$

where $A_T$ and $A_R$ represent the average values of the absolute errors of the $VMN$ measurements of translation and rotation, respectively, which can be expressed as:

$$A_T = \frac{1}{VMN}\sum_{v=1}^{V}\sum_{m=1}^{M}\sum_{n=1}^{N}\left|U^n_{m,v}-u^n_{m,v}\right|, \qquad A_R = \frac{1}{VMN}\sum_{v=1}^{V}\sum_{m=1}^{M}\sum_{n=1}^{N}\left|W^n_{m,v}-w^n_{m,v}\right|$$
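The metrics above reduce to simple list operations. The sketch below, assuming the measured and true values have been flattened into parallel sequences over all V·M·N measurements, computes the relative error (E_T or E_R) and the standard deviation of the absolute error (σ_T or σ_R); the function names are ours, not from the paper's code.

```python
import math

def relative_error(measured, true):
    """Mean relative error over all measurements (E_T or E_R)."""
    errs = [abs(U - u) / u for U, u in zip(measured, true)]
    return sum(errs) / len(errs)

def std_of_absolute_error(measured, true):
    """Standard deviation of the absolute errors (sigma_T or sigma_R)."""
    abs_errs = [abs(U - u) for U, u in zip(measured, true)]
    A = sum(abs_errs) / len(abs_errs)   # mean absolute error (A_T or A_R)
    return math.sqrt(sum((e - A) ** 2 for e in abs_errs) / len(abs_errs))

measured = [10.2, 9.8, 10.5]   # e.g. translations in mm, our method
true     = [10.0, 10.0, 10.0]  # manually calibrated true values
print(round(relative_error(measured, true), 4))         # 0.03 (i.e., 3%)
print(round(std_of_absolute_error(measured, true), 4))  # 0.1414
```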

Visual Processing Experiment
We designed a series of visual processing experiments to demonstrate the effectiveness of the improved VIS network. Firstly, the evaluation indicators of visual processing were calculated to verify the effect of granular material VIS and then compared with several methods on a self-created dataset, as presented in Table 2. Secondly, we conducted ablation experiments to investigate the temporal feature fusion module and the tracking head with long-sequence external memory. Finally, qualitative experimental results on different videos are presented in Figure 8.

As shown in Table 2, our method achieves the best results in the visual processing metrics. All baselines follow the idea of "tracking-by-detection", but IoUTracker+ and Deep SORT are not trained end-to-end. These methods use an instance segmentation algorithm to segment out the mask independently on each frame and then link instances across frames by means of an object tracking algorithm. To compare fairly with the end-to-end methods, the instance segmentation part of IoUTracker+ and Deep SORT was Mask R-CNN. Obviously, the overall performance of the end-to-end methods is better than that of the non-end-to-end methods. This is because the end-to-end approach can integrate the detection, segmentation, and tracking tasks in one VIS framework and optimize them jointly. In addition, the AP of our method is 2.1% higher than that of MaskTrack R-CNN, which shows that the temporal feature fusion module and the new tracking head bring advantages to the VIS of granular materials.
As shown in Table 3, we designed a series of ablation experiments to verify the impact of each component on the visual processing results. The temporal feature fusion module has the greater impact on visual processing, because the module can make full use of the contextual information of the other video frames. It is worth noting that the tracking head with long-sequence external memory also brings a certain improvement to visual processing. This is because the IoU in VIS is extended from static images to videos, which associates the tracking effect with the AP. In summary, adding the temporal feature fusion module and improving the tracking head achieve better visual processing results for granular materials.

Table 3. Comparison of ablation experiment results. "TF" refers to the temporal feature fusion module and "LM" refers to the tracking head with long-sequence external memory. "√" means adding the corresponding component to the VIS network.

Figure 8 shows the qualitative results of granular material VIS.
We selected one image from every 90 frames of each video for display and annotated the instance ID of each object inside its bounding box. Most granules in the videos can be segmented and tracked at the instance level. The segmented masks overlay the objects well, and most granules show no evident under-segmentation or over-segmentation. We also show the processing results for the videos with uneven illumination and lens defocus in the validation set. It can be seen that our VIS model also achieves good instance segmentation and tracking for these two adversely affected videos. The VIS of granular materials described above yields complete mask chains, and the numerical information of the granular materials can be obtained by further post-processing.

Numerical Measurement Experiment
As shown in Table 4, we measured the numerical information of the granular materials in the validation set and calculated the measurement errors to verify the effectiveness of our proposed numerical measurement method. Calculating the measurement errors first requires extracting the true numerical results of the granular materials. For granular materials in the vibrating state, a method of marking the long axes of the original granules was developed in order to collect the true movement information. One frame per 5 s of video was selected, and the LabelMe data labeling tool was utilized to manually mark the long axes of the granules; the long-axis coordinates were then obtained from the corresponding .json file. The amount of translation of a granule can be obtained from the change of the center location of the manually marked long-axis coordinates, and the rotation angles can be approximated on the basis of the rotation angles of the long axes. Finally, the movement information extracted with this manual method was regarded as the true values. For rotating granular materials, we directly calculated the true amount of translation and the true rotation angle from the experimental equipment.

During the experiment, we found that for a small number of granular materials, mask trajectory interruptions occur, or masks are associated with other IDs because of detection or tracking errors, which may invalidate the entire data chain. The translation and rotation errors of such granules are often huge, so we exclude these granular materials when calculating the measurement error and only count the measurement errors of the effective data chains. Effective data chains are selected by setting a monitoring threshold on the per-frame displacement; the threshold is the average of the minor-axis diameters of all fitted ellipses. When the displacement exceeds this value, the granule is considered to have an ID assignment error, and the data chain is discarded.
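The effective-data-chain screening described above can be sketched as follows. `filter_chains` and `is_effective_chain` are illustrative helpers operating on centroid trajectories, under the assumption that each trajectory is a list of per-frame (x, y) centroids in a common unit:

```python
import math

def is_effective_chain(centroids, threshold):
    """A trajectory (list of per-frame (x, y) centroids) is kept only if
    no frame-to-frame displacement exceeds the monitoring threshold."""
    for (x0, y0), (x1, y1) in zip(centroids, centroids[1:]):
        if math.hypot(x1 - x0, y1 - y0) > threshold:
            return False   # likely an ID assignment error
    return True

def filter_chains(chains, minor_axis_diameters):
    """Threshold = average minor-axis diameter of all fitted ellipses."""
    threshold = sum(minor_axis_diameters) / len(minor_axis_diameters)
    return [c for c in chains if is_effective_chain(c, threshold)]

chains = [
    [(0, 0), (1, 1), (2, 2)],     # smooth track: kept
    [(0, 0), (1, 1), (40, 40)],   # sudden jump: data chain discarded
]
print(len(filter_chains(chains, [8.0, 10.0, 12.0])))  # 1
```

Only the measurement errors of the chains that survive this filter enter the statistics in Table 4.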
It can be seen from Table 4 that the relative errors of translation and rotation are kept at a low level, which shows the effectiveness of the proposed method. The relative errors in the vibrating-type videos are larger because the true values of these videos are obtained by manual calibration. The relative error of the rotation angle in the vibrating-type videos is the largest among all errors, at 16.43%. This is because the rotation angles of the granular materials are calculated from the fitted ellipses and long axes, which requires more approximation. On the other hand, the standard deviations are also maintained at a low level, whether the video is of the vibrating or rotating type, which reflects the stability of our proposed measurement method. In addition, the standard deviation for the vibrating videos is greater than that for the rotating videos, also because the true values of the translation and rotation angles of the vibrating videos are obtained by manual marking. In general, our improved VIS network and the series of post-processing steps can accurately measure the amount of translation and the rotation angle of dynamic granular materials while maintaining high numerical measurement stability.
As shown in Figure 9, we plot the amount-of-translation curves and rotation angle curves of the granules as functions of time and compare them with the true values. Considering that velocity and acceleration also have a large effect on the macroscopic mechanical behavior of dynamic granular materials, we also plot these curves (Figure 10).

In terms of general laws, the translation and rotation of the granular materials in the vibrating-type video are irregular, while the translation of the granular materials in Figure 9c follows trigonometric functions with the same period, because these granular materials rotate at a constant speed around the center of the experimental table. The rotation angle in Figure 9d is a linear function, which is also because these granular materials rotate uniformly around the center of the experimental table. The motion laws of the above granular materials are in line with expectations.

It can be observed that the curves and the real scatter points are basically consistent in Figure 9a,b, and it can also be seen from Figure 9c,d that the measured curves and the corresponding true curves are generally consistent, which shows that our proposed numerical measurement method can accurately measure the translation and rotation of granular materials in both the vibrating and rotating cases.
It is worth noting that the trend of the amount-of-translation curve of granule 2 in Figure 9a is generally consistent with the trends of the curves of the other granules under the same vibrational load, but the rotation angle curve of granule 2 in Figure 9b shows some differences compared with the rotation angle curves of granule 1 and granule 3. To explain this phenomenon, we searched for granule 2 in the corresponding video and found its shape to be close to that of a standard circle. In our proposed method, the rotation angle of a granular material is calculated by fitting the segmented mask to an ellipse and then using the rotation angle of the long axis to approximate the rotation angle of the granule. Since the shape of granule 2 is close to a standard circle, this method may generate a certain error in measuring its rotation angle, resulting in the rotation curve of granule 2 in Figure 9b not matching those of granule 1 and granule 3.
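The dependency-free sketch below illustrates why a near-circular mask defeats the long-axis approximation. It computes the orientation of a binary mask from its second-order central moments (theta = 0.5 * atan2(2*mu11, mu20 - mu02), the same long-axis angle an ellipse fit yields); it is a stand-in for the actual ellipse-fitting step, not the paper's implementation.

```python
import math

def mask_orientation(mask):
    """Long-axis angle (radians) of a binary mask via central moments.

    For near-circular masks mu20 ~ mu02 and mu11 ~ 0, so the angle is
    ill-conditioned -- the source of the granule-2 rotation error."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    n = len(pts)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    mu20 = sum((x - cx) ** 2 for x, _ in pts) / n
    mu02 = sum((y - cy) ** 2 for _, y in pts) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in pts) / n
    return 0.5 * math.atan2(2 * mu11, mu20 - mu02)

# A horizontal 1x5 bar of foreground pixels has its long axis at 0 rad.
bar = [[1, 1, 1, 1, 1]]
print(mask_orientation(bar))  # 0.0
```

Tracking this angle from frame to frame gives the approximate rotation of an elongated granule, while for a circular one the numerator and denominator of the atan2 both vanish.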
Figure 10a,b reflect the velocity and acceleration of the granular materials in the vibrating-state video, with the true values calculated from the manual measurements of translation and rotation. Figure 10c,d show the velocity and acceleration of the rotating granular materials, with the true results calculated from the parameters of the experimental equipment. It can be seen that the velocity errors and acceleration errors are maintained at a low level, which demonstrates the effectiveness of our method in measuring the velocity and acceleration of granular materials. It is worth noting that the range of the ordinate in Figure 10d is small, which makes the trends of the measured results and the true values appear inconsistent.
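Velocity and acceleration follow from the measured translation (or rotation) sequence by first-order finite differences; the 0.25 s sampling interval and the numbers below are illustrative, not taken from the experiments.

```python
def finite_difference(values, dt):
    """First-order finite difference: rate of change between samples."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]

# Cumulative translation sampled every 0.25 s (mm), after
# pixel-actual distance calibration has been applied.
translation = [0.0, 1.0, 3.0, 6.0]
velocity = finite_difference(translation, 0.25)      # [4.0, 8.0, 12.0] mm/s
acceleration = finite_difference(velocity, 0.25)     # [16.0, 16.0] mm/s^2
print(velocity, acceleration)
```

Applying the same operator to the rotation angle sequence gives angular velocity and angular acceleration.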

Conclusions and Outlook
In this study, a numerical measurement method for dynamic granular materials based on an improved video instance segmentation (VIS) network is proposed. Firstly, the improved VIS network realizes multi-task learning based on the data annotations and simultaneously detects, segments, and tracks dynamic granular materials. Secondly, the adverse effects of lens defocus, uneven lighting, and the high appearance similarity between different granular materials are effectively dealt with by the temporal feature fusion module and the new tracking head with long-sequence external memory. Finally, the numerical measurement of the amount of translation, the rotation angle, velocity, and acceleration of dynamic granular materials is achieved through post-processing steps including centroid extraction, long-axis extraction, ellipse fitting, and pixel-actual distance calibration. The experimental results show that the improved VIS achieves an average precision (AP) of 76.6. The measurement errors of the translation and rotation angle are 8.95% and 16.43%, respectively, in the vibrating videos, and 5.67% and 9.51%, respectively, in the rotating videos. The standard deviations of the absolute errors of translation and rotation are maintained at a low level, demonstrating the stability of our numerical measurement method.
The method in this study can be used to accurately measure the translation, rotation, velocity, and acceleration information of dynamic granular materials, and it has great advantages and good application prospects in the calibration of the discrete element method. We believe that this study is of great significance for studying the spatial distribution, motion mode, and macro-mechanical behavior of granular materials. However, it is worth pointing out that the method in this paper has some shortcomings. Firstly, it is difficult to measure the numerical information of occluded granular materials, because our method relies on a visual sensor to capture videos. Secondly, our method approximates the motion space of the granular materials as a two-dimensional plane when extracting their numerical information. Thirdly, the method approximately measures the rotation angles of granular materials by fitting ellipses and extracting the rotation angles of the long axes, which is challenging to apply to granules whose shapes are close to standard circles. Finally, as is a common risk with deep neural networks, the VIS part of our method is difficult to support with a detailed theoretical derivation, so our approach has poor interpretability compared with traditional mathematical models.
The shortcomings of the method proposed in this study will be further investigated. Firstly, we will implement the numerical measurement of occluded granular materials using occluded object detection methods from computer vision. Secondly, depth information in the experimental environment will be extracted using a depth camera, and we will combine this depth information to extend the study of granular materials from the two-dimensional plane into three-dimensional space. Thirdly, to address the difficulty of measuring the rotation angles of granular materials with shapes close to standard circles, we will extract finer texture information to obtain a more accurate representation of the angles. Finally, the important metric of measurement speed is not considered in this study; we will complete the VIS task with a more lightweight neural network model to meet the real-time requirements of real-world measurements. We also hope to strengthen the study of interpretability in future research.

Data Availability Statement:
All the research data used in this manuscript will be made available upon request.

Conflicts of Interest:
The authors declare no conflict of interest.