1. Introduction
By processing the perception data generated by the shipborne visible light camera (SVLC), shipborne infrared camera (SIRC), automatic identification system, millimeter-wave radar, marine radar, LiDAR, and other sensing equipment, an unmanned ship can continuously perceive the surrounding navigation situation and achieve autonomous navigation. At the same time, the massive shipborne sensing video must be compressed and transmitted to the shore-based console so that the ship's real-time dynamics can be monitored. By collecting navigation environment information from inside and around the ship, the shore-based console can intervene in the ship's systems and perform remote control operations at any time. Therefore, video compression technology for unmanned ships needs to be studied and optimized: it fills the gap in compressing ship perception video, relieves the pressure on data storage space, improves the utilization efficiency of limited maritime bandwidth, and provides richer visual information for remote piloting, thereby ensuring the navigation safety of intelligent ships while reducing the resources occupied by redundant data during transmission.
In recent years, researchers from various countries have begun to apply high-performance video compression algorithms in the field of intelligent driving, for example to compress the perception video generated by vehicles, traffic systems, aircraft, and ships for storage or transmission. In the research on compressing shipborne radar video, Lu et al. [
1] proposed a deep learning-based compression method for shipborne radar digital video: by combining the HEVC technique with deep learning algorithms, the method reduces the complexity of encoding shipborne radar digital video while relieving hardware storage pressure and improving the efficiency of ship-to-shore communication. In reference [
2], HEVC is used to compress intelligent surveillance video and achieve real-time classification of “human-vehicle” objects. Reference [
3] proposed a method that uses Bayesian networks to reduce the complexity of the prediction part of HEVC, and applied the improved HEVC coding scheme to vehicular ad hoc networks (VANETs) to improve road safety. Reference [
4] considered applying HEVC to vehicle surveillance in order to guarantee surveillance video quality while controlling storage costs, since the resolution of VANET cameras continues to increase and commercial VANETs do not contain large hard-disk arrays. In order to relieve the pressure of storing video recordings in vehicles and to prevent problems such as falsification of digital recording evidence, Kim et al. adopted the Advanced Video Coding (H.264/AVC) scheme, processing its I-frames and adding an anti-forgery watermark [
5]. Reference [
6] provided a fast transcoding solution for video in the vehicular network (LoV) field, intended to address the coexistence of AVC and HEVC technologies in vehicular networks. By exploiting the mapping relationship between the decoding information in AVC and the CUs in HEVC, the authors improve the transmission efficiency of real-time video communication systems in vehicular networks. Reference [
7] incorporated the ISODATA clustering algorithm and frame-correlation analysis into HEVC for UAV applications to enable customized keyframes. Reference [
8] used the H.264/AVC technique to compress video stream data in intelligent traffic systems and combined it with deep learning methods for real-time monitoring of vehicles in the traffic stream.
Machine learning methods have provided researchers with new directions in problems related to optimizing the efficiency of HEVC compression. Among them, the processing schemes represented by convolutional neural networks (CNN) are the most popular. Reference [
9] proposed fast CU division algorithms based on machine learning: one CU division prediction algorithm using an online Support Vector Machine (SVM) and another CNN-based prediction algorithm named DeepCNN. The authors compared the two algorithms side by side and showed that the CNN method is more effective in reducing the computational complexity of HEVC. Reference [
10] designed a variable-filter-size Residue-learning CNN (VRCNN) and proposed a CNN-based post-processing algorithm for HEVC. To solve the problem that the traditional rate distortion cost calculation is too complicated during Coding Tree Unit (CTU) division, reference [
11] proposed a low-complexity shallow asymmetric-kernel CNN for intra-frame mode prediction and designed a fast learning framework for HEVC intra-frame coding. Reference [
12] proposed a combined CNN and LSTM structure for CU segmentation prediction, where the LSTM network is responsible for solving the time-domain correlation in the CU segmentation process. The scheme solves the problem of overly complex recursive CU segmentation search based on quadtrees in the CTU segmentation process to a certain extent and significantly reduces the coding complexity of HEVC. Reference [
13] proposed a CNN-based intra-frame partition decision network and a CNN-LSTM inter-frame partition decision network, so that deep learning methods can replace the CU partition search by predicting the intra-frame and inter-frame CU partition results; the authors also established a large CU partition data set built from common coding test sequences and released it for open use. In references [
14,
15,
16], a CNN-based PU angle prediction model, named AP-CNN, was proposed to replace the original PU angle prediction in the HEVC lossless coding model. In reference [
16], an optimized architecture LAP-CNN based on AP-CNN was designed to further reduce the complexity of the model. In reference [
17], the authors combined CNN with image recovery techniques for the loop filtering segment in HEVC to improve the overall performance of coding. In reference [
18], CNNs were used to filter the video luminance and chrominance components separately in the intra-frame coding mode instead of the traditional loop filtering mode. Reference [
19] introduced CNN into inter-frame coding and designed a block-level up/down sampling model to improve the interframe coding performance of HEVC. To reduce the distortion of video image compression at low bit rates, reference [
20] designed a quality-enhanced convolutional neural network (QE-CNN) and proposed a time-constrained quality-enhanced optimization (TQEO) scheme based on HEVC without any modification of the HEVC encoder, and the results of the intra-frame coding test also proved the effectiveness of the scheme. In reference [
21], in order to ensure stable transmission of compressed video in unstable network environments, a neural network-based low-complexity fault-tolerant coding algorithm named LC-MSRE was designed to improve HEVC coding efficiency while reducing the bit error rate. With the popularity of HD 3D video, ISO and ITU jointly introduced 3D-HEVC to support 3D video compression; as an extension of the HEVC standard, 3D-HEVC adds depth map coding techniques. In reference [
22], based on 3D-HEVC, a CNN technology was used to reduce the computational complexity of 3D video compression, and a depth edge classification CNN (DEC-CNN) framework was designed to classify the depth map edges. In reference [
23], the researchers designed a LeNet-5-based CNN optimization model with an early-termination CU division strategy to reduce the computational complexity of rate distortion cost (RD cost) calculation in intra-frame prediction. In addition to the above approaches that focus on CU division decisions to reduce HEVC computational complexity, some researchers have considered the relationship between CTU division depth and HEVC computational complexity. Reference [
24] set two CTU depth ranges and used the texture complexity of the currently encoded CTU to decide which depth range the CTU recursively computes the RD cost over, thereby determining the best CTU division. Reference [
25] transformed the division mode decision problem into a classification problem and proposed a convolutional neural network-based fast CU classification algorithm that learns image texture, shape, and other features for fast encoding.
Although all of the above methods are effective in reducing the complexity of video coding, the geographical peculiarities of unmanned ships lead to low transmission bandwidth while the ship is underway. At the same time, shipborne vision sensor video (SVSV) differs considerably from ordinary video, and specialized data sets are required to train the network and limit the distortion of small objects at sea, so these methods are poorly suited to compressing maritime targets (especially small targets). Until now, no accelerated HEVC coding scheme has been developed for SVSV in the maritime domain. Therefore, analyzing the characteristics of SVSV, combining them with SVSV compression results, and optimizing the stages with high compression delay is a key step toward real-time transmission and storage of SVSV from unmanned ships.
In this paper, by analyzing the characteristics of SVSVs (visible light video, thermal imaging video) and combining the results of video compression, a deep learning-based algorithm for optimizing the compression latency of shipboard vision sensor videos without affecting the video compression quality is proposed.
By collecting the partition results of shipboard vision sensor video compression, a CTU partition structure data set based on SVSV is built and used to train the proposed CU partition prediction model, improving its prediction accuracy.
In the process of compressing the SVSV, the encoder predicts the CTU division in advance by invoking the trained CU division prediction model, which reduces the time-consuming cost of calculating the CU rate distortion, decreases the coding complexity of the shipboard vision sensor video, and significantly reduces the compression latency.
This paper is divided into six parts, and the specific organization is as follows.
Section 2 details the image characteristics of shipboard vision sensor video, which includes SVLC video and thermal imaging video.
Section 3 analyzes the results of HEVC compression of SVSV, including the relevant parameters used in the experiments, the time consumed in each step of the SVSV compression process, and the final results. In
Section 4, we optimize the CTU partitioning process for compressing shipboard vision sensor videos, design a hierarchical model of CU partitioning based on SVSV compression, and propose a CNN-based CU partition prediction model that is trained on a large collected SVSV data set to improve prediction accuracy.
Section 5 discusses our completed and future work. Finally, in
Section 6 a conclusion is made on the work of this paper.
3. Compression Experiment of Shipborne Vision Sensor Video
According to the analysis of the ship vision sensor video characteristics, the redundant data in ship vision sensor video mainly consist of intra-frame spatial-domain redundancy and inter-frame temporal-domain redundancy. In this section, in order to test how well HEVC video compression technology actually controls the redundant data within ship vision sensor video, a large number of HD perception video sequences captured from a real-ship experiment are compressed and the measured compression effects are analyzed.
3.1. SVSV Sequence Parameters
In the process of compressing SVSV, a total of 63 uncompressed SVLC and SIRC video sequences covering different navigation scenes are acquired in the RGB color space, and their main parameters are given in Table 1. Meanwhile, in order to match the input characteristics of HEVC, the SVSV color space is preprocessed and converted to the YUV format with 4:2:0 sampling, which removes part of the color information redundancy. The conversion and inverse conversion processes are shown in Equations (3) and (4).
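For reference, a commonly used RGB↔YUV (BT.601-style) conversion has the following form; the exact coefficients of Equations (3) and (4) in this paper may differ slightly.

```latex
\begin{aligned}
Y &= 0.299R + 0.587G + 0.114B, \\
U &= -0.169R - 0.331G + 0.500B + 128, \\
V &= 0.500R - 0.419G - 0.081B + 128,
\end{aligned}
\qquad
\begin{aligned}
R &= Y + 1.402\,(V - 128), \\
G &= Y - 0.344\,(U - 128) - 0.714\,(V - 128), \\
B &= Y + 1.772\,(U - 128).
\end{aligned}
```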
3.2. Coding Complexity Analysis
The complexity of the coding computation is one of the main consumers of a ship's computational resources and a main contributor to compression delay. Excessive compression delay degrades the real-time storage or transmission of ship vision sensor videos and creates hidden dangers for intelligent ship navigation. In order to analyze the share of computational resources consumed by each stage when compressing ship vision sensor video, this section selects two coding modes, All Intra (AI) and Low Delay P (LDP), and conducts coding complexity analysis experiments on the captured ship vision sensor video to measure the time consumed by the different stages under the two configurations, taking the average of the measured results as the final result. The AI mode is used to test the computational complexity of each stage of intra-frame prediction coding (PC), and the LDP mode tests inter-frame PC; within a Group of Pictures (GOP), the first frame is an I-frame and the rest are P-frames.
The experimental platform uses the 64-bit Ubuntu 20.04.2 LTS operating system, and the test environment is built on the official HEVC test platform HM16.17, compiled from C++; the performance analysis tool bundled with Visual Studio is used to diagnose and analyze the coding time consumed by each stage. The hardware configuration of the experimental platform is an Intel Core i7-8700 CPU @ 3.20 GHz with 16 GB RAM, and the main configuration parameters of the encoder are given in Table 2. The percentage of CPU time spent executing the code of each major encoding stage during compression of the shipboard vision sensor video is given in Table 3.
In the process of compressing the SVSV, it can be seen that the TComTrQuant class, which implements the transform and quantization functions in HEVC, takes the most time in AI mode, accounting for 17.69% on average, mainly because this part has to calculate the rate distortion cost metrics needed for optimal quantization. It is followed by the TComDataCU section (which stores CU data information), the TComPrediction and TEncSearch sections (which perform intra-frame prediction and search), and the TComRdCost section (which implements the rate distortion cost calculation), accounting for 14.44%, 11.95%, 11.6%, and 3.79%, respectively. The intra-frame PC mode requires iterative computation of the rate distortion cost for all possible division modes in order to determine the optimal CTU division, which consumes most of the computational resources. When compressing the SVSV in LDP mode, the motion estimation and compensation parts consume most of the computational resources, with the TComInterpolationFilter class, which implements the interpolation filtering function, accounting for the highest share, occupying 19.91% of the total coding time on average. It is followed by the TComDataCU class and the TComRdCost part, which calculates SAD and other rate distortion optimization metrics, consuming 15.61% and 12.25% of the coding time on average.
From the above results it can be seen that when compressing the SVSV in AI mode, the highest computational complexity lies in the CTU division part, mainly because the intra-frame search and rate distortion cost calculations consume most of the coding time. When compressing the SVSV in LDP mode, the motion estimation and motion compensation parts consume more time, which raises the computational complexity of the inter-frame prediction part and increases the compression latency.
3.3. Compression Performance Analysis
In order to test the performance of HEVC in compressing SVSV, and the impact of different scenes, resolutions, and Quantization Parameter (QP) values on the compression performance, the LDP mode was used to analyze the performance of compressed SVSV. The evaluation metrics include bit rate, Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM), where the calculation formulas of PSNR and SSIM are given in Equations (5) and (6). The experimental platform is the same as in Section 3.2. For the encoder configuration, the LDP default configuration is used for encoding, and the average of all measured results is taken as the final result.
Table 4 shows the performance comparison results of compressing different scenes and different resolutions of SVSVs with the same QP value.
Figure 7 shows the R-D plots of SVLC video with resolution 1920 × 1080 and SIRC video with resolution 704 × 576 in the anchoring scenario with different QP values (22, 27, 32 and 37).
Figure 8 shows the quality comparison of the first frame image in the SVLC video with 1920 × 1080 resolution in the anchoring scene before and after encoding at the four QPs.
where W and H are the width and height of the video image, X and Y are the video images before and after compression, respectively, and MAX is the maximum grayscale value of the video image. In the SSIM formula, l(x, y) is the brightness comparison function, whose expression is given in Equation (7); c(x, y) is the contrast comparison function, given in Equation (8); s(x, y) is the structure comparison function, given in Equation (9). μx and μy are the average intensities before and after compression, whose expressions are given in Equation (10); σx and σy are the standard deviations, given in Equation (11); σxy is the covariance, given in Equation (12); C1, C2, and C3 are constants.
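For reference, the standard forms of these metrics (assumed here to correspond to Equations (5) and (6); the symbol names above are ours) are:

```latex
\mathrm{PSNR} = 10\log_{10}\!\left(\frac{MAX^{2}}{\frac{1}{W H}\sum_{i=1}^{W}\sum_{j=1}^{H}\bigl[X(i,j)-Y(i,j)\bigr]^{2}}\right),
\qquad
\mathrm{SSIM}(x,y) = l(x,y)\,c(x,y)\,s(x,y)
= \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}.
```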
As can be seen from the results in Table 4, at the same QP value the bit rate of the compressed SVSV is affected by both the video resolution and the navigation scene: the higher the resolution and the more complex the video content, the higher the bit rate. The SSIM results also show that when the QP value is 32, the compression effect of HEVC is relatively good, and the compressed video shows no obvious perceptual difference from the original video. At the same time, for the same video content, the higher the resolution of the SVSV, the higher the compression ratio, which indicates that the larger CTU quadtree partition structure of HEVC adapts well to the more complex content areas in high-resolution SVSV while also compressing flat areas effectively. The comparison results in
Figure 8 show that different QP values have different effects on the compression performance of different types of SVSVs: the larger the QP value, the more the content details are distorted, and the lower the QP value, the clearer the video. In the R-D curves of Figure 7, when the QP value is set above 27 the slope of the curve is larger, indicating that the video quality improves significantly while the bit rate changes slowly; when the QP value is below 27 the slope decreases, indicating that the bit rate grows faster while the peak signal-to-noise ratio grows more slowly. It can also be seen from
Figure 8e that when the QP value is 37, the edge contour of the target in the visible image becomes relatively blurred compared with the original image, but it does not affect the discrimination of the target on the sea surface.
The experimental results of compressing SVSV show that HEVC intra-frame prediction coding largely reduces the spatial-domain redundant data in the SVSV. However, during intra-frame prediction, HEVC recursively divides the CU downward and traverses the rate distortion cost calculation to obtain the best division according to the QP and the texture complexity of the CTU, and this CU division decision brings high computational complexity to intra-frame prediction coding. Therefore, optimizing the computational complexity of the compression process for the characteristics of SVSV is a critical step in further upgrading the intelligent ship remote piloting system.
Based on the characteristics and usage scenarios of SVSV images, and in combination with deep learning methods, an algorithm acting in the intra-frame PC stage is therefore proposed that compresses SVSV quickly by predicting the CU division structure in advance without affecting the compression quality.
4. Time Delay Optimization of Compressing SVSV Based on Deep Learning
4.1. HEVC Intraframe Prediction of CU Partitioning Mode
The specific CU division process for HEVC intra-frame PC is shown in
Figure 9. The CU division process is a hierarchical, recursive search, and a CTU can be regarded as a combination of one or more CUs of different sizes. According to the quadtree division rule in HEVC, a CU is divided according to the texture features of the image content into four possible depths, or sizes: 64 × 64, 32 × 32, 16 × 16 and 8 × 8. In the process of CU division, HEVC has to traverse a total of 85 CUs ranging from 64 × 64 down to 8 × 8 in size and select the CU division scheme with the lowest rate distortion cost as the actual coding structure. The partitioning process is as follows.
(1) Calculate the rate distortion cost of the 64 × 64 CTU and continue to divide it downward into four 32 × 32 sub-CUs.
(2) Calculate the rate distortion cost of each of the four 32 × 32 sub-CUs and continue to divide them downward into 16 × 16 sub-CUs.
(3) During the downward division, calculate the rate distortion cost of the current four 16 × 16 sub-CUs in turn and continue dividing them downward into 8 × 8 sub-CUs.
(4) During the downward division, calculate the rate distortion cost of the current four 8 × 8 sub-CUs in turn.
(5) Compare the combined rate distortion cost of the four 8 × 8 sub-CUs with that of the current 16 × 16 CU, and select the scheme with the lower cost as the division structure of the current CU.
(6) Similar to step (5), compare the combined rate distortion cost of the four 16 × 16 sub-CUs with that of the current 32 × 32 CU, and select the scheme with the lower cost as the division structure of the current CU.
(7) Compare the rate distortion cost of the 32 × 32 sub-CUs with that of the current CTU to obtain the lowest rate distortion cost and the final division structure of the current CTU.
The above description of the CU division process shows that, in order to decide the final CTU division structure, it is necessary to calculate and compare the rate distortion cost for all possible CU sizes. This exhaustive search approach of HEVC not only has high computational complexity but also generates a large number of redundant calculations, which is very unfavorable for intelligent ships with limited computational resources.
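To make the recursion concrete, the following Python sketch mimics the exhaustive quadtree search described above; `rd_cost_of_block` is a hypothetical placeholder for the encoder's rate distortion cost evaluation, not an HM API.

```python
def best_partition(block_x, block_y, size, depth, rd_cost_of_block, max_depth=3):
    """Exhaustive HEVC-style quadtree search (simplified sketch).

    Returns (best_cost, split_flags), where split_flags loosely mirrors the
    hierarchical labels D(x, i, j): 0 = keep the CU whole, otherwise a nested
    structure describing the split.
    """
    # Cost of keeping the current CU undivided.
    cost_whole = rd_cost_of_block(block_x, block_y, size)
    if depth == max_depth:                       # 8x8 CUs are not split further
        return cost_whole, 0

    # Cost of splitting into four sub-CUs and recursing on each of them.
    half = size // 2
    cost_split, child_flags = 0.0, []
    for dy in (0, half):
        for dx in (0, half):
            c, f = best_partition(block_x + dx, block_y + dy, half,
                                  depth + 1, rd_cost_of_block, max_depth)
            cost_split += c
            child_flags.append(f)

    # Keep whichever option has the lower rate distortion cost.
    if cost_split < cost_whole:
        return cost_split, {"split": 1, "children": child_flags}
    return cost_whole, 0
```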
4.2. Modeling of CU Partition Structure
In order to be able to describe the different ways of dividing different CUs in the ship vision sensor images, a hierarchical structure model of CU division based on the ship vision sensor images is proposed, as shown in
Figure 10, where CUs of different sizes and coordinates are uniformly represented by D(x, i, j): x is the depth corresponding to the current CU size, and i and j are the coordinates of the current sub-CU within its parent 64 × 64 and 32 × 32 CU, respectively; the binary labels 0 and 1 indicate whether the current CU is divided. When x = 0 or 1, j (or both i and j) is omitted. The specific expressions are given in Table 5. The 64 × 64 CU (i.e., the CTU is not divided and the depth is 0) is denoted D(0), a 32 × 32 CU is denoted D(1, i), and a 16 × 16 CU is denoted D(2, i, j). For example, D(1,3) = 0 means that the current 32 × 32 sub-CU has coordinate 3 within its parent CU and is not divided further; D(2,3,4) = 1 means that the current 16 × 16 sub-CU has coordinates (3,4) within its parent CUs and is further divided into 8 × 8 CUs.
By using neural networks to predict all possible ways of dividing the current CTU, the structured output of a total of 21 CUs can be directly derived, saving the computational time consumed by predicting them one by one. Finally, based on the predicted CTU division structure, the step of calculating and comparing the CU rate distortion cost is skipped, and the computational redundancy caused by recursively traversing the CU rate distortion cost is avoided to a certain extent, which helps to reduce the complexity of intra-frame PC.
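As an illustration of this 21-label structured output (1 label for D(0), 4 for D(1,i), 16 for D(2,i,j)), one possible flattened encoding is sketched below; the index layout is our assumption for clarity, not the paper's exact format.

```python
import numpy as np

def flatten_ctu_labels(d0, d1, d2):
    """Pack the hierarchical split flags of one 64x64 CTU into a
    21-element binary vector: [D(0), D(1,1..4), D(2,1..4,1..4)].

    d0 : int             split flag of the 64x64 CTU
    d1 : list of 4 ints  split flags of the 32x32 sub-CUs
    d2 : 4x4 nested list split flags of the 16x16 sub-CUs
    """
    labels = np.zeros(21, dtype=np.int8)
    labels[0] = d0
    labels[1:5] = d1
    labels[5:] = np.asarray(d2, dtype=np.int8).reshape(16)
    return labels

# Example: the CTU is split, sub-CU 3 stays whole, and one 16x16 CU splits to 8x8.
vec = flatten_ctu_labels(
    d0=1,
    d1=[1, 1, 0, 1],
    d2=[[0, 0, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 0]],
)
```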
4.3. Establishment of Data Sets
High-quality data sets are the basis for training and validating the neural network model and also help improve the efficiency of the algorithm. The collected high-definition SVLC and SIRC video sequences are first processed, and the corresponding classification labels are set for training the network model. The large amount of CU division data supports the training of the SVSV CU division prediction model. Each sample contains the Y component of a CU and the corresponding binary (0/1) division label. All SVLC video samples constitute a CU division data set based on SVLC video, and all SIRC video samples constitute a CU division data set based on SIRC video.
All video samples are compressed with the official HEVC test platform HM16.17. In the encoder configuration, the QP is set to the commonly used test values 22, 27, 32 and 37, the default AI coding mode configuration is adopted, and the Y component and division information of all CUs are recorded and saved. Because consecutive frames of shipborne vision sensor video are strongly correlated, the frame-level interval sampling (FLIS) method is used in the data processing phase to enhance the differences between samples and avoid overfitting during training caused by highly repetitive data. Considering that the frame rate of the SVLC video differs from that of the SIRC video, the CU division data of one frame is kept for every 10 encoded frames when compressing SVLC video and for every 5 encoded frames when compressing SIRC video. The CU division data set of the SVLC video comes from 41 SVLC video sequences, totaling 2460 images; the CU division data set of the SIRC video comes from 22 SIRC video sequences, totaling 3300 images. The specific parameters of the data set are given in Table 6. After the data set is produced, three subsets are obtained by random sampling: the training set accounts for 80% of the total data set, the validation set for 10%, and the test set for 10%.
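A minimal sketch of the frame-level interval sampling and the random 80/10/10 split might look as follows; the function and variable names are illustrative only.

```python
import random

def frame_level_interval_sampling(frame_indices, interval):
    """Keep one frame out of every `interval` encoded frames (FLIS)."""
    return [idx for idx in frame_indices if idx % interval == 0]

def split_dataset(samples, seed=0):
    """Randomly split samples into 80% train, 10% validation, 10% test."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# SVLC video: keep every 10th frame; SIRC video: keep every 5th frame.
svlc_frames = frame_level_interval_sampling(range(300), interval=10)
sirc_frames = frame_level_interval_sampling(range(300), interval=5)
```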
4.4. CU Division Prediction Model for Shipboard Vision Sensor Video Based on Deep Learning
The CNN is an efficient prediction approach developed in recent years. Its structure mainly consists of three parts: convolutional layers, pooling layers and fully connected layers. A CNN extracts features by multiplying convolution kernels of a given size with the corresponding positions of each layer's input, and the kernel parameters are shared as weights across the feature map, which continuously improves learning efficiency. According to the characteristics of shipboard vision sensor video, and drawing on the convolutional neural network (CNN) of reference [13], this paper proposes an intra-frame coding delay optimization algorithm for SVSV, designs a deep learning-based CU partition prediction model for shipboard vision sensor video, optimizes the complexity of the neural network model, and adds a threshold termination mechanism; the improved convolutional network structure is shown in
Figure 11.
The convolutional network structure is composed of two preprocessing layers, three convolutional layers, one pooling layer, one merging layer and three fully connected layers. Before performing prediction coding, the encoder first divides the vision sensor image into N CTUs of 64 × 64 size. The preprocessing layer extracts the luminance matrix, which carries the main visual information of the image, as the input to the CU division prediction model for SVSV, and then performs global homogenization, 32 × 32 local homogenization, 16 × 16 local homogenization and normalization to speed up gradient descent toward the optimal solution.
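One possible interpretation of this preprocessing is sketched below in Python with NumPy, assuming that "homogenization" means subtracting the mean luminance over the stated window; this interpretation and the helper names are assumptions on our part.

```python
import numpy as np

def preprocess_ctu(luma_ctu):
    """Produce three mean-removed views of a 64x64 luminance CTU.

    Assumption: 'homogenization' is interpreted as mean removal over the
    whole CTU and over 32x32 / 16x16 local windows, after scaling pixel
    values to [0, 1] (normalization).
    """
    ctu = luma_ctu.astype(np.float32) / 255.0          # normalization
    global_view = ctu - ctu.mean()                      # global mean removal

    def local_mean_removed(block, win):
        out = block.copy()
        for y in range(0, 64, win):
            for x in range(0, 64, win):
                patch = out[y:y + win, x:x + win]
                out[y:y + win, x:x + win] = patch - patch.mean()
        return out

    return global_view, local_mean_removed(ctu, 32), local_mean_removed(ctu, 16)
```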
Convolutional layer: The convolutional layers are mainly responsible for extracting local features from the SVSV. Each convolutional layer performs the convolution operation on the feature maps input to it, and in order to obtain more features from the CU at low complexity, the convolution scheme of Inception Net [26] is adopted: by using convolution kernels of different sizes, receptive fields of different sizes are obtained in the CU, and the resulting features are of a higher level. Therefore, three convolution kernels of different sizes, 8 × 8, 4 × 4 and 2 × 2, are used in the first convolutional layer to extract the low-level CU division features on three separate branches. In the second and third convolutional layers, 2 × 2 kernels of the same size are used to extract higher-level features on the three branches, and 64 feature maps are finally obtained on the three branches simultaneously. In order to fit the mutually non-overlapping rule of CU quadtree division specified by HEVC, the stride of all convolution operations is set equal to the kernel edge length, giving non-overlapping convolutions.
The merged feature vectors are processed in turn by three fully connected layers on the three branches, comprising two hidden layers and one output layer, and the final outputs are the CU division prediction values. According to the experimental results in Section 3, QP is one of the main factors affecting the bit rate and the CU division size when compressing shipboard vision sensor video. Therefore, the selected QP value is added as an external feature to the feature vectors of the first and second fully connected layers to improve the adaptability of the model to different QP values and the accuracy of the CU division prediction.
In the training and testing phases, to prevent the vanishing gradient problem, all convolutional layers and the first and second fully connected layers are activated with the leaky rectified linear unit (Leaky-ReLU), as shown in Equation (13). As a variant of the rectified linear unit (ReLU), Leaky-ReLU introduces a fixed slope in the negative interval and thus solves the problem that ReLU neurons stop learning there. The output layer is activated by the sigmoid function to ensure that the model output lies within the (0, 1) interval, and its expression is shown in Equation (14). The cross-entropy loss used in the training phase is calculated as shown in Equation (15), where N denotes the number of samples and Hi denotes the cross-entropy between the true value and the predicted value of the ith sample.
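The following tf.keras sketch illustrates one way the described three-branch network could be assembled (first-layer kernels of 8 × 8, 4 × 4 and 2 × 2 with stride equal to kernel size, 2 × 2 non-overlapping convolutions in the deeper layers, QP concatenated into the first two fully connected layers, Leaky-ReLU activations, and 21 sigmoid outputs). Layer widths and other unspecified hyperparameters are assumptions, not the paper's exact settings.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_cu_partition_model(num_outputs=21):
    """Three-branch CNN for CU partition prediction (illustrative sketch)."""
    luma = layers.Input(shape=(64, 64, 1), name="ctu_luma")  # preprocessed Y component
    qp = layers.Input(shape=(1,), name="qp")                  # QP as external feature

    branches = []
    for k in (8, 4, 2):                                       # first-layer kernel sizes
        x = layers.Conv2D(16, k, strides=k)(luma)             # non-overlapping convolution
        x = layers.LeakyReLU(0.1)(x)
        for _ in range(2):                                     # second and third conv layers
            x = layers.Conv2D(64, 2, strides=2, padding="same")(x)
            x = layers.LeakyReLU(0.1)(x)
        x = layers.GlobalAveragePooling2D()(x)                 # stands in for the pooling layer
        branches.append(x)

    merged = layers.Concatenate()(branches)                    # merging layer
    h = layers.Concatenate()([merged, qp])                     # QP into 1st FC layer
    h = layers.LeakyReLU(0.1)(layers.Dense(128)(h))
    h = layers.Concatenate()([h, qp])                          # QP into 2nd FC layer
    h = layers.LeakyReLU(0.1)(layers.Dense(64)(h))
    out = layers.Dense(num_outputs, activation="sigmoid")(h)   # 21 split probabilities

    model = Model([luma, qp], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```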
4.5. Intra-Frame Coding Delay Optimization Algorithm Flow for Shipboard Vision Sensor Video
The specific flow of the intra-frame coding delay optimization algorithm for SVSV is shown in
Figure 12. It can be seen that the algorithm avoids the rate distortion cost calculations and comparisons performed when deciding the CTU division structure in HEVC by directly predicting the CU division result of the shipboard vision sensor video image. Meanwhile, a threshold termination mechanism is added to the model: when the predicted values fall below the threshold, the model stops predicting the CU division structure and outputs the current hierarchical CU division prediction result, which avoids wasting computational resources to a certain extent. The specific workflow is as follows.
(1) Input the shipboard vision sensor video signal to be compressed into the encoder and, if the color space of the vision sensor video is RGB, preprocess it into the YUV format. Before encoding formally starts, the encoder splits each frame of the vision sensor video into N CTUs to be encoded.
(2) The Y component of each CTU is fed into the trained prediction model with the network structure shown in Figure 11, and the pixel matrix is normalized to speed up convergence. The model outputs the probabilities of D(x,i,j), i.e., of the corresponding binary labels 0 and 1.
(3) The probability output by the model for D(0) is compared with the set threshold (in this paper the threshold is set to 0.5) to determine whether the CUs in the current CTU continue to be divided downward. If the probability of D(0) is less than the threshold, the current CTU is directly determined not to be divided, i.e., D(0) = 0, and the result is output; otherwise D(0) = 1 and the division continues downward.
(4) Determine whether the probability of D(1,i) is greater than the set threshold; if so, D(1,i) = 1 and the current sub-CU continues to be divided downward. Otherwise, D(1,i) = 0, the current sub-CU is recorded, and the downward division stops. If the prediction results of all four current sub-CUs are less than the threshold, i.e., D(1,i) = 0 for i = 1, …, 4, the division structure of the current CTU is determined directly and output.
(5) Similar to Step (4), determine whether the probability of D(2,i,j) is greater than the set threshold; if so, D(2,i,j) = 1 and the current sub-CU continues to be divided downward. Otherwise, D(2,i,j) = 0, the current sub-CU is recorded, and the downward division stops. If the prediction results of all 16 current sub-CUs are less than the threshold, i.e., D(2,i,j) = 0 for all i and j, the division structure of the current CTU is determined directly and output.
(6) Integrate all recorded sub-CU division structures according to the CU hierarchical division structure model, and determine and output the final division structure of the current CTU.
In the process of compressing the SVSV, the calculation and comparison of the rate distortion cost is avoided by directly predicting the CU division structure of the shipboard vision sensor video image. For example, when the prediction result is D(1,i) = 0 for all four 32 × 32 sub-CUs, the network model terminates the prediction of the division structure of the 16 × 16 CUs and directly outputs a CTU division structure containing four 32 × 32 CUs, which reduces the compression delay to a certain extent.
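A condensed sketch of this thresholded, early-terminating decoding of the model output into a CTU partition might look as follows; the 21-element probability layout matches the earlier illustrative encoding and is an assumption.

```python
def decode_ctu_partition(probs, threshold=0.5):
    """Turn the 21 predicted split probabilities into hierarchical labels,
    stopping early when a level predicts no further splits."""
    d0 = int(probs[0] >= threshold)
    if d0 == 0:                               # CTU kept whole: stop immediately
        return {"D0": 0}

    d1 = [int(p >= threshold) for p in probs[1:5]]
    if sum(d1) == 0:                          # no 32x32 CU splits: early termination
        return {"D0": 1, "D1": d1}

    d2 = [[int(probs[5 + 4 * i + j] >= threshold) if d1[i] else 0
           for j in range(4)] for i in range(4)]
    return {"D0": 1, "D1": d1, "D2": d2}
```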
4.6. Analysis of Experimental Results
In order to verify the effectiveness of the intra-frame coding delay optimization algorithm in reducing the compression delay of shipboard vision sensor video, this paper uses TensorFlow 1.13.0 to build the convolutional neural network model, embeds the trained model into the HEVC dedicated testbed HM16.17, and compiles and tests it under Ubuntu; the hardware configuration of the test environment is the same as in Section 3.2. For the encoder configuration, the AI default configuration was used for encoding. The test set contains shipboard vision sensor videos from three different navigation scenarios and at different resolutions. In this paper, the coding time-saving ratio ΔT, together with the BD-BR and BD-PSNR metrics of the VCEG-M33 proposal, is used to evaluate the effectiveness of the algorithm. The smaller the value of BD-BR, the lower the bit rate of the compressed video, and the larger the value of BD-PSNR, the higher the quality of the compressed video. The smaller the value of ΔT, the less time the encoding consumes. Equation (16) lists the calculation method of ΔT, and BD-PSNR is calculated in the same way as PSNR, as shown in Equation (5).
where T_HM represents the encoding elapsed time of the HM16.17 encoder, T_pro represents the encoding elapsed time of the method in this section, and n is the total number of encodings.
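For reference, using the symbols introduced above (ΔT, T_HM, T_pro and n are our notation, not necessarily the paper's), one common definition of the time-saving ratio that is consistent with "a smaller ΔT means less encoding time" is:

```latex
\Delta T = \frac{1}{n}\sum_{i=1}^{n}\frac{T_{pro}(i) - T_{HM}(i)}{T_{HM}(i)} \times 100\%.
```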
The results show that, compared with HM16.17 in AI coding mode, the proposed intra-frame coding delay optimization algorithm reduces the total compression time by about 45.49% on average, while BD-BR changes by 1.92% on average and BD-PSNR by 0.14 dB on average. In particular, the compression time decreases by about 50.70% on average when compressing SVLC video with a resolution of 1920 × 1080 and by about 43.97% on average when compressing SVLC video with a resolution of 1280 × 720. This fully illustrates that the higher the resolution of the SVSV, the better the proposed intra-frame coding delay optimization algorithm adapts and the stronger its compression performance.
From
Table 7, it can also be seen that the setting of the QP value is another major factor affecting the performance of the intra-frame coding delay optimization algorithm for SVSV. The larger the QP value, the fewer CTUs are divided to high depths and the coarser the CU division. For example, when compressing a SIRC video with a resolution of 704 × 576, the average compression time saving at a QP value of 22 is 38.55%; when the QP value is 37, the average saving increases by nearly 6.11 percentage points, up to 44.66%. For a given QP value, the performance of the algorithm is also affected by the actual content of the video: the fewer the targets in the video and the flatter the content, the better the compression performance of the algorithm. For example, when compressing a shipboard visible-light camera video with a resolution of 1280 × 720 at a QP value of 32, the average compression time saving in the port-resting environment is 3.12% higher than in the sailing state and 1.36% higher than in the anchoring state. From the above analysis it can be concluded that the proposed intra-frame coding delay optimization algorithm greatly reduces the computational complexity of intra-frame PC and performs well in compressing SVSV, especially high-resolution SVSV, significantly reducing the compression time.
5. Discussion
In our research, the characteristics of the SVSV are analyzed in detail, including the video image characteristics of the SVLC and the SIRC. Compression experiments are then carried out, and the computational complexity of each encoding stage is analyzed, together with the main causes of compression delay. The performance of HEVC in compressing SVSV at different resolutions and in different scenes is summarized, a deep learning-based intra-frame coding delay optimization algorithm for SVSV is proposed, and the CTU division process in HEVC is analyzed. A hierarchical CU division model based on SVSV is designed by combining the characteristics of SVLC and infrared video. The proposed deep learning-based CU partition prediction model for SVSV is built, and a CU partition data set based on SVSV is constructed by using the official HEVC testbed HM16.17 to encode a large number of collected high-definition shipboard vision sensor video sequences. The data set contains the CU partition information of the SVLC video and the infrared camera video under different QP values, and the obtained samples are used to train the CU partition prediction model for SVSV. The final results show that the proposed algorithm performs well when compressing SVSV: with little impact on the overall clarity of the video, the total compression time is reduced by about 45.49% on average compared with the official high efficiency video coding test platform HM16.17.
Finally, in order to intuitively show the performance differences among the proposed algorithm, the traditional method, and the general optimization algorithm [13] when compressing SVSV, the results of compressing SVLC video under different QP values (22, 27, 32, 37) are visualized, as shown in Figure 13. The three methods are shown in different colors. It can be seen that the R-D curve of the proposed method differs little from that of HM16.17, and its bit rate and PSNR results are better than those of reference [13], indicating that the proposed algorithm performs well in compressing SVSV.