2.2.2. Architecture Implementation
(1) Data acquisition layer
The function of the data acquisition layer was to ensure the successful input of optical information and the transmission of the original data. In practice, the optical information was captured in the form of data frames, and the video stream comprised a rapid succession of these frames, which contained the information needed to detect the gait.
In the computer, the data frames captured by the camera took the form of a matrix array, so the data extraction process obtained the coordinate positions of the marker points in each frame by mathematical processing and conversion of this matrix array. The data acquisition layer was not responsible for extraction; instead, it invoked the camera and connected the camera data flow with the data extraction program. The data frames, as matrix arrays, were passed continuously up the hierarchical structure of the model.
In general, by default, the data captured by a camera were represented in the BGR color space format. BGR had three channels corresponding to the three matrices in the array, representing the blue, green, and red components. The color of each pixel could be decomposed into a mixture of these three colors in different proportions and, after decomposition, the proportion of each color was preserved at the corresponding position of its channel. In order to retain most of the original information, instead of processing the matrix array, the data acquisition layer passed it directly to the upper model.
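As a concrete illustration (a minimal Python sketch in which a frame is represented as nested lists of (B, G, R) tuples rather than a real camera buffer), the three channels can be separated like this:

```python
def split_channels(frame):
    """Separate a BGR frame (rows of (b, g, r) pixels) into the three
    per-channel matrices described above, preserving each color's
    proportion at the corresponding position of its channel."""
    blue = [[p[0] for p in row] for row in frame]
    green = [[p[1] for p in row] for row in frame]
    red = [[p[2] for p in row] for row in frame]
    return blue, green, red
```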
(2) Desaturation layer
In the BGR format, the image data retained most of the optical information, but not every operation required all of it when searching for the marker points. Thus, the useful information could be represented more compactly by compressing the three-channel matrix array into a single channel, which sped up the search for markers at the cost of some data loss: the color image was compressed into a grayscale image. The matrix conversion was performed as follows:

Y = 0.114B + 0.587G + 0.299R,

where Y is the matrix representing the grayscale image; a larger value indicates that the pixel at this position is whiter, whereas a lower value denotes a darker pixel. The desaturation layer sent both the BGR image and the processed grayscale image to the upper layer, which could select and process these two types of data according to its requirements.
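The conversion can be sketched in Python using the standard BT.601 luminance weights (the same coefficients common imaging libraries apply); frames are again nested lists of (B, G, R) tuples rather than real camera data:

```python
def bgr_to_gray(frame):
    """Collapse a 3-channel BGR frame into a single grayscale channel.

    Each output value Y is a weighted mix of the blue, green, and red
    components, so brightness information relevant to the marker search
    is preserved while the data volume drops to one third."""
    return [[round(0.114 * b + 0.587 * g + 0.299 * r)
             for (b, g, r) in row]
            for row in frame]

# A 1x2 toy frame: one pure-blue pixel and one pure-white pixel.
print(bgr_to_gray([[(255, 0, 0), (255, 255, 255)]]))  # [[29, 255]]
```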
(3) Smooth layer
Each input frame contained noise generated by natural vibrations, changes in illumination, or hardware problems. Smoothing the data frame could effectively remove this noise and prevent it from interfering with the detection process.
Applying a Gaussian blur to each pixel in the image was an effective smoothing method. For any pixel, the Gaussian blur took the weighted average of all of the surrounding pixels, where the weights were distributed according to a normal distribution: the weight was larger for closer points and smaller as the distance between points increased. In practice, we found that taking the target pixel as the center and employing a 21 × 21 matrix provided the best performance when using the weighted average of the surrounding pixels in the matrix to remove the noise.
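The weighting scheme can be sketched as follows; the kernel size of 21 matches the text, while the sigma value is an illustrative choice, since the paper does not state one:

```python
import math

def gaussian_kernel(size=21, sigma=3.5):
    """Build a normalized size x size matrix of Gaussian weights.

    Weights follow a normal distribution around the center pixel, so
    closer neighbors contribute more to the average. sigma = 3.5 is an
    illustrative assumption; the paper only fixes the 21 x 21 window."""
    c = size // 2
    k = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2.0 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    total = sum(sum(row) for row in k)
    return [[w / total for w in row] for row in k]

def blur_pixel(gray, y, x, kernel):
    """Smooth one pixel: the weighted average of its neighborhood,
    clamping coordinates at the image border."""
    size = len(kernel)
    c = size // 2
    h, w = len(gray), len(gray[0])
    return sum(kernel[dy][dx]
               * gray[min(max(y + dy - c, 0), h - 1)]
                     [min(max(x + dx - c, 0), w - 1)]
               for dy in range(size) for dx in range(size))
```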
The desaturation layer and the smoothing layer could be collectively regarded as a pretreatment layer, which was convenient for the data processing described in the following.
(4) Foreground split layer
In the real world, testers walked through the field of view of the monocular camera from left to right, exposing the marks on their sides to the camera during the process. When capturing optical information using a monocular camera, the three-dimensional environment was mapped onto a two-dimensional plane. From the perspective of the two-dimensional video, the camera was still, so the whole video could easily be divided into the foreground and background, where the foreground was the area containing the movement of the tester. The tester walked into the frame from one end of the lens and left from the other end, so the foreground area also slid continuously across the video. Excluding the testers, the background comprised the areas that filled the whole environment, that is, the parts other than the foreground area. These areas remained stationary throughout the video and were obscured by the movements in the foreground area. The foreground area occupied a smaller portion of the whole video frame, whereas the background occupied a larger one. Clearly, the area where a target marker was located had to be in the foreground. Scanning the background area would not only consume power and waste time but could also lead to incorrect results when the matching threshold was set low, further disturbing the extraction of the marker point data. If the foreground region could be split from the entire data frame, then the search could focus on this area alone and the search efficiency could be improved greatly, regardless of the large background area.
The foreground region was divided from the data frame by the foreground segmentation layer. In actual situations, at the beginning of the video test, the testers had not yet entered the shot, so all of the image components in the first frame of the video stream belonged to the background region; thus, this frame was designated as the reference background frame and taken as the basis for dividing the foreground and background in the subsequent data frames.
The basis of foreground division was threshold binarization. First, for any data frame M, the calculation conformed to the following rule:

absdiff(I) = |M(I) − B(I)|,

where B is the reference background frame in the video stream, I represents a specific position in the data frame, and absdiff is the resulting matrix array.
In a grayscale image, the element values ranged from 0 to 255, so the elements of absdiff also lay within 0–255 and absdiff could itself be regarded as a grayscale image. We set a threshold and conducted binarization to transform absdiff into a purely black and white image, where the white and black areas roughly represented the foreground and background distributions, respectively. Inevitably, a large amount of noise was present in the image after binarization because, at the junctions of the foreground and background, pixel values lay close to the threshold on either side, and it was difficult to obtain an ideal black and white demarcation line. Dilation could be used to trim burrs and eliminate noise points. First, we defined a 5 × 5 structure matrix K, where the rule for generating the structure matrix was as follows:

K(i, j) = 1, 1 ≤ i, j ≤ 5.
The structure matrix size could be adjusted according to the actual situation, with the bounds in the generating rule changed to the appropriate number. After the structure element had been generated, it was used to traverse the image, and a dilated binary image was obtained after processing according to the following rule:

dilate(x, y) = max over (i, j) with K(i, j) = 1 of src(x + i − 3, y + j − 3),

i.e., a pixel was set to white if any pixel under the structure element was white.
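The whole segmentation step, i.e., absolute difference against the reference background, threshold binarization, dilation with an all-ones structure element, and extraction of the enclosing rectangle, can be sketched in pure Python on grayscale matrices; the threshold value in the test below is an illustrative assumption:

```python
def absdiff(frame, background):
    """Per-pixel absolute difference between a grayscale frame and the
    reference background frame."""
    return [[abs(f - b) for f, b in zip(fr, br)]
            for fr, br in zip(frame, background)]

def binarize(img, thresh):
    """Threshold binarization: white (255) where the difference exceeds
    the threshold, black (0) elsewhere."""
    return [[255 if v > thresh else 0 for v in row] for row in img]

def dilate(img, ksize=5):
    """Dilation with a ksize x ksize all-ones structure element: a pixel
    becomes white if ANY pixel under the element is white, which trims
    burrs and makes the foreground boundary contiguous."""
    c = ksize // 2
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(img[yy][xx] == 255
                   for yy in range(max(y - c, 0), min(y + c + 1, h))
                   for xx in range(max(x - c, 0), min(x + c + 1, w))):
                out[y][x] = 255
    return out

def bounding_rect(mask):
    """Rectangle (x0, y0, x1, y1) enclosing all white pixels, taken as
    the foreground region; None if no foreground was detected."""
    pts = [(x, y) for y, row in enumerate(mask)
           for x, v in enumerate(row) if v == 255]
    if not pts:
        return None
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))
```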
In the dilated image, the white part expanded into a contiguous area and the boundary became more obvious. Finally, according to the distribution of the white part, a rectangle enclosing the complete white area was formed in the data frame and taken as the foreground region, with the remaining area as the background, as shown in Figure 5.
The foreground segmentation layer passed the data frames in the two color formats to the upper level, but it also provided the rectangular coordinates of the foreground area, so the upper layer could focus on the foreground region.
(5) Template matching layer
The template matching layer provided the core function of the program, where it filtered and extracted the marker points from the image information and recorded their coordinates. Before processing, the template matching layer was preloaded with the three marker point templates T, which then traversed the foreground area as benchmarks. When a template traversed to a region of the foreground F centered at (x, y), the following formula was applied to obtain the normalized difference squared sum:

R(x, y) = Σ_{x′,y′} [T(x′, y′) − F(x + x′, y + y′)]² / √( Σ_{x′,y′} T(x′, y′)² · Σ_{x′,y′} F(x + x′, y + y′)² ),

where R was regarded as a variable reflecting the matching degree: the smaller the value of R, the better the match between the surrounding area and the template.
In practice, the marker points on the tester did not always face the camera directly, so a marker point captured by the camera could appear as an irregular oval rather than a circle, as shown in Figure 6. In addition to markers facing the camera directly, they could be tilted to the left or right, and three templates were designed for these three cases. Thus, the matching degree was calculated for each of the three templates as the frame was traversed, and the lowest result was retained, as follows:

R(x, y) = min{ R₁(x, y), R₂(x, y), R₃(x, y) }.
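A minimal sketch of this matching step, with patches and templates as equal-sized grayscale matrices; `best_match` implements the three-template minimum described above:

```python
import math

def sqdiff_normed(patch, tmpl):
    """Normalized sum of squared differences between an image patch and
    a template of the same size; smaller values mean a better match."""
    num = den_p = den_t = 0.0
    for prow, trow in zip(patch, tmpl):
        for p, t in zip(prow, trow):
            num += (t - p) ** 2
            den_p += p * p
            den_t += t * t
    if den_p == 0 or den_t == 0:
        return 1.0  # degenerate all-black patch or template
    return num / math.sqrt(den_p * den_t)

def best_match(patch, templates):
    """Score a patch against the left-tilted, frontal, and right-tilted
    marker templates and keep only the lowest (best) result."""
    return min(sqdiff_normed(patch, t) for t in templates)
```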
After traversing the image, the R values corresponding to most of the pixel points were fairly large, indicating that these regions did not match the template, whereas the R values for some pixels fell within a very small range, indicating that these areas were close to the template. Taking 0.1 as a threshold, the pixel areas with R values greater than this value were discarded, whereas the center pixel coordinates were recorded for the regions where a marker point was considered to be located. Finally, the template matching layer passed the image data and the marker point coordinates to the upper layer for further processing.
(6) Non-maximal inhibition layer
When we checked the coordinate data obtained by the template matching layer, there was obvious overlap, where points with similarly good matching degrees often appeared together: their difference squared sums were all less than the threshold value, so they were all recorded, but the matching windows corresponding to these points clearly overlapped with each other. The center of the overlapping area was generally a marker point. This cluster of coordinates all represented the same marker point, so the best representative was selected from the cluster to denote the marker point and the remaining coordinates were discarded as noise:

(x*, y*) = argmin_{(x, y) ∈ Range} R(x, y),

where Range was the region where a cluster of coordinates was located, and the coordinate with the lowest R value was selected as the best matching result for the cluster at a certain position. The non-maximal inhibition layer passed the filtered coordinate points and image data to the upper layer.
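This suppression step can be sketched as a greedy filter; the cluster `radius` is an illustrative assumption, since the paper does not state how window overlap was measured:

```python
def non_max_suppress(candidates, radius=10):
    """candidates: list of (x, y, r) tuples recorded by the template
    matching layer. Within every cluster of overlapping windows, keep
    only the coordinate with the lowest R value (the best match) and
    discard the rest as noise."""
    kept = []
    for x, y, r in sorted(candidates, key=lambda c: c[2]):
        # A candidate survives only if no better point was already kept
        # within the cluster radius.
        if all((x - kx) ** 2 + (y - ky) ** 2 > radius ** 2
               for kx, ky, _ in kept):
            kept.append((x, y, r))
    return kept
```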
(7) Color gamut filter layer
The coordinate points obtained at this point should, in theory, comprise the locations of the marker points, but false matches caused by changes in other areas of the foreground could not be excluded by the template matching layer. Because the foreground area changed constantly, an incorrectly identified area did not continue to cause interference, so such errors were random and transient. Thus, color gamut filtering was employed as a simple additional discriminant rule for each target area to eliminate this interference.
Color gamut filtering employed the color images, which had not previously been used for recognizing the marker points. The earlier screening steps were based on the grayscale image, ignoring the color information, which increased the computational efficiency. In the color gamut filter layer, this information was used to compare the colors of the coordinate points and to optimize the recognition results further.
First, color gamut filtering converted an image in BGR format to HSV (hue (H), saturation (S) and lightness (V)) format, with B, G, and R scaled to [0, 1], using the following formulae:

V = max(R, G, B),
S = (V − min(R, G, B)) / V if V ≠ 0, otherwise S = 0,
H = 60(G − B) / (V − min(R, G, B)) if V = R,
H = 120 + 60(B − R) / (V − min(R, G, B)) if V = G,
H = 240 + 60(R − G) / (V − min(R, G, B)) if V = B,

with H increased by 360 if it is negative.
In the HSV color space, these three parameters are more suitable than those of the BGR space for judging how closely two colors approximate each other. A point to be measured passed the test if its core color was white; otherwise, it failed the test and its coordinates were discarded. We determined whether a color belonged to the white range based on the S and V parameters; a point was considered white if the following conditions were satisfied:

S ≤ S₀ and V ≥ V₀,

where S₀ and V₀ were chosen thresholds, since white pixels have low saturation and high lightness regardless of hue.
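The test can be sketched with the standard library's `colorsys` module; the threshold values `s_max` and `v_min` below are illustrative assumptions, not the paper's settings:

```python
import colorsys

def is_white(b, g, r, s_max=0.2, v_min=0.7):
    """Decide whether a BGR pixel belongs to the white range using its
    HSV saturation (S) and value (V). White pixels have low saturation
    and high lightness, so the hue component is ignored."""
    _, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return s <= s_max and v >= v_min
```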
This method could be used as an auxiliary means of quickly filtering the locations of the marker points. The data had been filtered relatively well by the previous layers, so the accuracy requirement was not very demanding. The non-maximal inhibition layer and the color gamut filter layer both aimed to further refine the matching coordinates, so they could be collectively referred to as the post-processing layer.
(8) Data storage layer
Corresponding to the data acquisition layer, the data storage layer aimed to ensure that the data passed to it were preserved correctly. These data comprised the color picture and the grayscale image corresponding to the matrix array, together with the final marker point coordinates determined by the successive filters. During video shooting, the incoming data stream was stored quickly and effectively in different locations in the data storage layer, where the pictures were stored in video frame format and the coordinate information was written in chronological order. If the other layers were interrupted for various reasons during the transmission process, the data storage layer had to keep waiting and could not be disconnected, and data stored previously could not be overwritten by subsequent storage tasks.
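The chronological, append-only coordinate log can be sketched as follows; the CSV format, file path, and record layout are illustrative assumptions, not the paper's storage scheme:

```python
import csv
import time

def append_coordinates(path, frame_index, marker_points):
    """Append one frame's marker coordinates in chronological order.

    Opening the file in append mode ('a') guarantees that records from
    earlier frames or earlier storage tasks are never overwritten."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for x, y in marker_points:
            writer.writerow([frame_index, time.time(), x, y])
```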