A Novel Video Transmission Latency Measurement Method for Intelligent Cloud Computing

Abstract: Low-latency video transmission is gaining importance in time-critical applications that use real-time cloud-based systems. Cloud-based Virtual Reality (VR), remote control, and AI response systems are emerging use cases that demand low latency and good reliability. Although many video transmission schemes claim low latency, their latency varies with network conditions. It is therefore necessary to develop methods that can accurately measure end-to-end latency online and continuously, without any content modification. This research brings these applications one step closer to addressing these next-generation use cases. This paper analyzes the causes of end-to-end latency within a video transmission system and then proposes three methods to measure the latency: timecode, remote online, and lossless remote video online. The corresponding equipment was designed and implemented. Measurements made with the three methods and the related equipment show that the proposed methods can accurately and effectively measure the end-to-end latency of a video transmission system.


Introduction
With the rapid development of cloud-based real-time AI, network video services are now available for interactive applications requiring a closed-loop response [1]. Cloud VR and autonomous systems, including drones, vehicles, and robots, can achieve the same level of performance at a fraction of the cost through cloud processing. Vehicles that carry advanced accident detection systems, micro drones navigating complex forests at high speed, VR users wearing lightweight goggles without suffering dizziness, and robots that recover from trips all require reliable, low-latency delivery of content to and from the cloud. One critical content component is video.
Different video applications require different video transmission latencies to ensure QoS (quality of service). One-way video transmission applications such as live video and VOD (video on demand) only require a latency within 2-5 s [2]. Interactive video applications such as video conferencing have become increasingly popular. The latency of video software such as Zoom is between 300 ms and 1 s [3]. For emerging applications that require high interactivity, such as cloud-based real-time AI control, VR, and cloud games, the latency must be less than 20 ms in order to maintain control over vehicles in their dynamic environments or to avoid users' feelings of vertigo during use [4]. With the rise of new applications, the requirements for real-time video are also increasing. The transmission mode of video applications has gradually developed from one-way video transmission to interactive video transmission. In order to guarantee the QoS of video latency in interactive video scenes, it is necessary to monitor the video transmission delay of each application online and in real time.
There exists a significant amount of redundancy in raw digitized video, including spatial redundancy, temporal redundancy, and coding redundancy. In order to ensure the efficient remote transmission of video, it is necessary to apply video compression technology to the video transmission process. This paper defines a video transmission system with five processing stages: capture, compression, transmission, decompression, and display. Each stage introduces its own latency.

Capture
Almost all cameras on the market have a certain latency between capturing and outputting video information. Sven Ubik et al. [5] proposed a method to measure the latency of cameras and tested a series of cameras. The measurement results show that the latency generated by the Blackmagic URSA Mini Pro 4.6K (South Melbourne, Victoria, Australia) is the smallest, between 4 and 8 ms, whereas most camera latencies range between 1 and 3 video frames.

Compression/Decompression
The latency introduced by video compression and decompression is related to the complexity of the video encoding and decoding algorithms. According to processing complexity, video compression can be divided into two types: light-weight compression and hybrid compression.
Light-weight compression is mainly based on image coding. For example, Apple ProRes is a variable bit rate video codec that can independently encode and decode each video frame with a compression ratio between 4 and 6 for real-time video editing [6,7]. TICO is a codec that uses only intra-frame compression with a compression ratio of 4:1; it is used in the 4K UHD TV industry [8]. JPEG-XS is a light-weight codec that can achieve a compression ratio of 6:1 and is used in live broadcast and AR/VR systems [9][10][11]. Light-weight compression is mostly used in professional fields. Although the compression ratio is small, the latency is greatly reduced.
Due to limited bandwidth, videos on the internet are usually transmitted after hybrid compression, with a compression ratio in the range of 250-500 [12][13][14]. The current international mainstream video coding standards include H.264/AVC, H.265/HEVC, H.266/VVC, AVS [15], and VP9/AV1 [16]. These compression standards adopt a hybrid coding framework composed of prediction coding, transform coding, quantization, filter processing, and entropy coding modules. As the technology has developed, many extended processing methods have been added within each coding module. For example, predictive coding has forward prediction, backward prediction, and bidirectional prediction. Although these optimized encoding methods improve encoding efficiency, they come at the cost of increased computational complexity [17][18][19].

Transmission
Network transmission is subject to real-time traffic conditions. Thus, it is necessary to set buffers at the coding end, at the decoding end, and within the network routers to limit data loss. Network congestion has a significant impact on the network. Bufferbloat [20] leads to long queuing delays [21,22]. A fluctuating video bit rate will also cause network congestion, resulting in transmission delay and buffering delay of subsequent video streams [23]. To alleviate these issues, congestion control mechanisms have been developed in transmission protocols [24]. The introduction of these buffers directly increases the overall latency of the transmission system.

Display
The latency of most current monitors can be found on the corresponding website [25]. Due to the influence of E-sports games, monitors with a latency of only a few milliseconds have been introduced into the market.

Analysis of Video Transmission Latency
The latency caused by the camera and the display is relatively independent. Proper selection of the camera and the display minimizes their impact on the overall delay. The key to reducing real-time video transmission latency lies in the video codecs and the video transmission.
Transmission and codecs are interrelated. On one hand, a burst data stream output by the encoder into the transmission network will increase network congestion. On the other hand, it is necessary to reduce the video bit rate to ensure real-time stream transmission when the transmission conditions are limited. Hence, caches and rate control are introduced throughout the transmission path. The total latency of the transmission system, including encoding, decoding, and transmission, is denoted D_latency. This system delay is effectively the accumulation of the following components: encoder latency (D_enc), transmission latency (D_net), decoding latency (D_dec), and the latency caused by the buffer of each stage (D_cache). This is shown in Figure 1 and expressed in Equation (1):
$$D_{\mathrm{latency}} = D_{\mathrm{enc}} + D_{\mathrm{net}} + D_{\mathrm{dec}} + D_{\mathrm{cache}} \tag{1}$$

Video Codec Latency
Video coding standards have evolved over many generations. Codecs are also becoming more complex in order to deliver higher quality video with limited bandwidth. The codec latency is composed of encoding frame reordering latency, encoding processing latency, encoding buffering latency, decoding processing latency, and decoding frame reordering latency.
The video to be encoded is fed into the encoder in chronological order. Video coding prediction includes intra-prediction, inter-prediction, and bidirectional prediction. Intra-prediction exploits the correlation between neighboring pixels by using the reconstructed pixels within the frame to derive predicted values through extrapolation from already coded pixels. Inter-prediction uses block-based motion compensation to generate a prediction model from one or more previously encoded image frames. Bidirectional prediction allows an encoder to obtain information from frames that lie ahead in the video stream, instead of just previously compressed frames. In typical applications, intra-prediction, inter-prediction, and bidirectional prediction are mixed to perform frame-level prediction and achieve a balance between latency and quality. As a result, the order in which video enters the encoder is inconsistent with the actual encoding and decoding order, and some frames need to wait for encoding and decoding, resulting in frame reordering latency. Figure 2 describes the reference relationship of the RA (Random Access) mode of HEVC. In this case, a GOP contains eight video frames and the picture number (Picture Order Count, POC) is 0-7. Due to the predictive reference relationship, the playback order of images is inconsistent with the encoding order, so the POC 1 frame is encoded after the POC 4 and POC 2 frames. Therefore, a frame reordering latency of three frames is introduced in the encoding stage. At 50 frames/s, a latency of 3 frames is 60 ms. Likewise, the frame cannot be decoded until those two frames are decoded. The overall DOC (decoding order count) is shown in Figure 2. This part of the latency depends on the data arrival speed and decoding speed because the encoding of the current frame needs to refer to the data of the encoded frame. The GOP structure can be dynamic. The theoretical maximum latency caused by reordering can reach the size of the GOP.
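The reordering wait described above can be sketched numerically. The coding order used here is an assumption chosen only to be consistent with the text (POC 1 coded after POC 4 and POC 2); it is not read from Figure 2, and the function name is hypothetical. The idea is that a frame cannot be encoded before every frame coded ahead of it has been captured, so its wait is the gap to the latest POC seen so far in coding order.

```python
def reorder_wait_frames(coding_order):
    """Frames each picture waits at the encoder input before it can be coded.

    coding_order lists POC values in the order they are encoded.
    A picture waits until the latest-captured picture coded before it
    (or itself) has arrived.
    """
    waits = {}
    latest_poc = -1
    for poc in coding_order:
        latest_poc = max(latest_poc, poc)
        waits[poc] = latest_poc - poc
    return waits


# Assumed coding order for an 8-frame GOP (illustrative only).
ASSUMED_ORDER = [0, 4, 2, 1, 3, 6, 5, 7]
FRAME_RATE = 50  # frames per second, as in the text

waits = reorder_wait_frames(ASSUMED_ORDER)
worst_frames = max(waits.values())
worst_ms = worst_frames * 1000 / FRAME_RATE
print(waits[1], worst_frames, worst_ms)
```

Under this assumed order, POC 1 waits three frame periods, which at 50 frames/s corresponds to the 60 ms quoted in the text.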

To support low-latency applications, HEVC introduced the LDP (low-latency P) mode, in which bidirectional prediction frames are removed. Each incoming frame can be encoded immediately without waiting for other reference frames, so no reordering delay is introduced. This mode eliminates the reordering latency, but the encoding performance decreases by 9% to 42% [31].
In the encoding process, a large number of redundant searches need to be performed in the current frame and the reference frame to determine the most efficient encoding mode. At the same time, in order to adapt to the local characteristics of the video frame content, the image to be encoded is divided into a series of encoding blocks. Fixed-size macroblocks are used in AVC. HEVC uses a quad-tree division method. In addition to the quad-tree division method, VVC also introduces multi-type tree division. Adding division methods improves the coding efficiency of image details, but the computational complexity of the corresponding division and search also increases significantly. Compared with HEVC, VVC increases the coding complexity by a factor of about 7. The increase in computational complexity extends the encoding time. In actual implementations, parallel operations can be used to optimize the encoding structure and reduce encoding time, but parallelism is essentially a scheduling optimization and does not reduce the processing volume of encoding and decoding itself. Recent methods use machine learning to reduce the complexity of codec slicing and motion search operations, lowering the coding complexity by 20% to 70% with a compression performance loss of 5% or less [32][33][34][35]. However, the time required for entropy encoding and related data access is unavoidable. Similarly, the decoding process also introduces some latency due to the large amount of computation.

Network Transmission Latency and Code Rate Control
The network transmission latency is mainly determined by the network bandwidth and buffer size. In order to reduce this part of the latency, a stable bit rate and a buffer as small as possible are required.
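As a rough illustration of why buffer size and bandwidth dominate this term, the time needed to drain a full buffer over a link is the buffered data divided by the link rate. The sketch below uses hypothetical numbers and a hypothetical function name; it ignores propagation delay and protocol overhead.

```python
def queuing_delay_ms(buffered_bytes, bandwidth_bps):
    """Time in milliseconds to drain `buffered_bytes` at `bandwidth_bps`."""
    return buffered_bytes * 8 * 1000 / bandwidth_bps


# Hypothetical example: 250 kB queued ahead of a frame on a 20 Mbit/s link.
delay_full = queuing_delay_ms(250_000, 20_000_000)
# Halving the buffer occupancy halves the queuing delay.
delay_half = queuing_delay_ms(125_000, 20_000_000)
print(delay_full, delay_half)
```

The same buffer that absorbs bit rate fluctuations therefore adds latency in direct proportion to how full it is.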
In video transmission, the available transmission bandwidth is usually limited to a certain range and is easily reduced by various interferences, especially in a wireless environment. Under the hybrid coding structure, the RC (rate control) mechanism adjusts a series of coding parameters, usually including the partition model, prediction model, and QP (quantization parameter), to keep the compressed code rate within the available bandwidth. All video coding standards have their own recommended RC models. MPEG-2 adopts TM5 [36], H.263 adopts TMN8 [37], H.264/AVC adopts JVT-G012 [38], and H.265/HEVC adopts JCTVC-H0213 [39] and JCTVC-K0103 [40]. The latest coding standard, H.266/VVC, adopts JVET-K0390 [41]. The research and implementation of rate control are mainly based on rate control algorithms in the Q domain [42], ρ domain [43], and λ domain [44]. These algorithms have high precision for average rate control. However, their bit-count control accuracy at the frame level or for finer-grained coding units is not sufficient for small network buffers, especially when the encoded video contains fast-moving objects, significant object occlusions, and scene changes.
The compressed video data stream needs to be sent using a network transmission protocol. In the early days, TCP or UDP was used for transmission. TCP uses sequence numbers to track the receiving status of the receiving end. Its retransmission mechanism can ensure no packet loss, but it easily causes long delays and network congestion. UDP is a connectionless protocol commonly used in video broadcasting. It does not guarantee that all packets reach the receiving end. A lost packet is not retransmitted, so the latency is lower, but the stream is exposed to packet loss. To mitigate this, techniques such as forward error correction are sometimes adopted to protect the transmission by adding guard bytes at the cost of part of the bandwidth [45]. Some newer transport schemes, such as WebRTC, take full advantage of the uncontrolled nature of UDP. The bottom layer of the transmission uses UDP, and the upper layer uses SRTP (Secure Real-Time Transport Protocol) and automatic rate adaptation to improve the video transmission quality. However, in WebRTC the encoding rate is changed on a time scale of seconds. When the network is congested, some of the encoded frames are still sent out due to the response lag. The encoder is then paused until the congestion ends. The transport layer and the encoding layer of current video transmission systems are loosely coupled. For the transport layer, the information of the coding layer is lagging. The congestion control of the transport layer packets does not exactly match the rate control of the encoding layer. For this reason, it is necessary to configure corresponding buffers in each stage of transmission to alleviate the problem. However, these buffers introduce queuing latency. The work in [46] evaluates the current network capacity and optimizes the compressed length of each frame to achieve lower video latency and better video quality over variable network paths. The limitation of this method is that intermediate information from the encoding process needs to be obtained in real time, which makes it unsuitable for most encoders.
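The trade-off behind forward error correction over UDP, spending extra bandwidth to avoid retransmission delay, can be illustrated with the simplest possible code: one XOR parity packet per group. This is a generic textbook sketch, not the scheme evaluated in [45]; the function names are hypothetical, and equal-length packets with at most one loss per group are assumed.

```python
def xor_parity(packets):
    """Build one parity packet as the byte-wise XOR of equal-length packets."""
    parity = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received, parity):
    """Rebuild the single lost packet (marked None) from the survivors."""
    missing = bytearray(parity)
    for packet in received:
        if packet is not None:
            for i, byte in enumerate(packet):
                missing[i] ^= byte
    return bytes(missing)


group = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(group)  # one extra packet for every three data packets
restored = recover_missing([b"abcd", None, b"ijkl"], parity)
print(restored)
```

In this sketch the receiver repairs one loss per group without waiting a round trip, at the price of one additional packet for every three sent.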

Methodology
The measurement accuracy of the end-to-end delay of video transmission systems in previous work is generally in units of frames. This accuracy is insufficient for interactive video applications with very low latency requirements. To this end, we designed and implemented three methods to accurately measure the end-to-end latency for different application scenarios.

Method 1: Timecode Latency Measurement
The measurement scheme is shown in Figure 3a. The camera captures raw video and feeds it into the latency measurement device. The latency measurement device takes the arrival time of the synchronization pulse of the video frame as the input time of each frame. The frame input time and frame number are injected at a fixed position in the original video picture in the form of a barcode, and the video with the timecode is encoded, transmitted, and decoded to obtain a reconstructed video. The reconstructed video is also sent to the latency measurement device, which records the arrival time of the synchronization pulse of each reconstructed video frame as the time when that frame reaches the display end. The barcode is read at the fixed position of the reconstructed video to obtain the frame number and frame input time. The arrival time and input time of the same frame are then compared to obtain the end-to-end latency of each frame of the video transmission system.
The timecode injected into the original video appears as a short barcode at the bottom of the screen, as shown in Figure 3b.
The timecode contains the frame number and frame input time. A color block of 16 × 16 pixels represents one bit, and the bit values 0 and 1 are represented by the colors YCbCr (0, 80, 80) and YCbCr (80, 80, 80), respectively. Taking a 1280 × 720 video frame as an example, there are 80 horizontal blocks in total, which can represent data with a bit width of up to 80 bits. Concatenating the data to be represented (e.g., frame number and timestamp) gives {Frame_cnt [7:0], Time_cnt [31:0]}, which uses a total of 40 bits.
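The barcode layout described above can be sketched in a few lines. This is a minimal illustration on a single luma row, assuming most-significant-bit-first ordering from the left edge and a mid-level threshold of 40 for reading the blocks back; neither detail is stated in the text, and all function names are hypothetical.

```python
BLOCK = 16            # block width in pixels, from the text
Y_ZERO, Y_ONE = 0, 80  # luma values for bit 0 and bit 1, from the text
NBITS = 40            # 8-bit frame count + 32-bit time count


def pack_timecode(frame_cnt, time_cnt):
    """Concatenate {Frame_cnt[7:0], Time_cnt[31:0]} into 40 bits, MSB first."""
    word = ((frame_cnt & 0xFF) << 32) | (time_cnt & 0xFFFFFFFF)
    return [(word >> (NBITS - 1 - i)) & 1 for i in range(NBITS)]


def render_luma_row(bits, width=1280):
    """Draw each bit as a BLOCK-wide run of luma samples on one row."""
    row = [Y_ZERO] * width
    for i, bit in enumerate(bits):
        if bit:
            row[i * BLOCK:(i + 1) * BLOCK] = [Y_ONE] * BLOCK
    return row


def read_timecode(row):
    """Average each block and threshold it to recover the two fields."""
    word = 0
    for i in range(NBITS):
        block = row[i * BLOCK:(i + 1) * BLOCK]
        bit = 1 if sum(block) / BLOCK > (Y_ZERO + Y_ONE) / 2 else 0
        word = (word << 1) | bit
    return word >> 32, word & 0xFFFFFFFF


row = render_luma_row(pack_timecode(0x2A, 0x12345678))
print(read_timecode(row))
```

Averaging over a whole block before thresholding is one plausible way to tolerate the sample-level distortion that lossy coding introduces; the text does not specify how the device reads the barcode.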
The time value in the actual measurement comes from a 32-bit counter driven by a 25 MHz clock. When the frame synchronization signal of the original video is detected, the counter value is saved in the latency measurement device as the input time T_i(n) of that frame, where n is the frame number. After encoding, transmission, and decoding, the reconstructed video arrives at the receiving end, and the frame synchronization signal time of each reconstructed video frame is recorded as T_r(n). The delay D(n) of the nth frame in the video transmission system is then given by Equation (2):
$$D(n) = T_r(n) - T_i(n) \tag{2}$$
where T_i(n) is the time at which frame n of the original video enters the latency measurement device, T_r(n) is the frame synchronization signal time of reconstructed frame n, and n is the frame number.
The input time of the original video and the arrival time of the reconstructed video are both read from the same clock source, and each reading may be off by ±1 count because of overcounting or undercounting. The variance S_D of the latency measurement error is the sum of the variance S_i of the original video input time acquisition error and the variance S_r of the arrival time acquisition error of the reconstructed video, as described by Equation (3):
$$S_D = S_i + S_r \tag{3}$$
Since the unit time of the 25 MHz clock is 40 ns, each acquisition error variance is (40 ns)^2 = 1600 ns^2, so the variance S_D of the latency measurement is 3200 ns^2, and the theoretical standard deviation of the latency measurement can be deduced to be 56.6 ns [47], which meets the requirements of high-precision online latency measurement.
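The arithmetic behind Equations (2) and (3) can be checked with a short sketch. The modulo handling of 32-bit counter wraparound is an assumption added here for robustness and is not described in the text; the function names are hypothetical.

```python
import math

CLOCK_HZ = 25_000_000
TICK_NS = 1e9 / CLOCK_HZ  # 40 ns per count
COUNTER_MOD = 1 << 32     # 32-bit counter


def frame_delay_ns(t_in, t_out):
    """Equation (2) in counter ticks, converted to nanoseconds.

    The modulo is an assumed guard for a counter that wraps between
    the two readings.
    """
    ticks = (t_out - t_in) % COUNTER_MOD
    return ticks * TICK_NS


def delay_std_ns():
    """Equation (3): add the two one-count error variances, then take the root."""
    s_i = TICK_NS ** 2  # 1600 ns^2
    s_r = TICK_NS ** 2  # 1600 ns^2
    return math.sqrt(s_i + s_r)


print(frame_delay_ns(1_000, 501_000))  # 500,000 ticks apart
print(round(delay_std_ns(), 1))
```

A gap of 500,000 counts corresponds to 20 ms, and the combined standard deviation rounds to the 56.6 ns stated in the text.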
Compared with the side-by-side shooting method, the measurement accuracy of Method 1 is significantly improved, but there are still shortcomings. First, the original video content is partially destroyed in the measurement process. Although the impact is small, this method cannot be used for online measurement in production video applications. Second, since both the original video and the decoded reconstructed video are sent to the latency measurement device, this method is limited to local measurement and cannot be used for long-distance online real-time measurement.

Method 2: Remote Online Measurement
Although Method 1 has higher measurement accuracy, it requires both the sender's video and the receiver's video to be connected to the same latency measurement device. Due to this limitation, remote video transmission cannot be measured. To this end, we designed Method 2, whose measurement scheme is shown in Figure 4. The solution includes a flashing light and a latency measurement device at the remote receiver. The remotely located measuring equipment is time-synchronized using GPS signals. The flashing light is controlled by the GPS PPS (pulse per second) signal to flash once per second, and the camera sends the captured pictures, which include the flashing light, to the latency measurement device through the video transmission system. The device records the time when each frame enters the device and finds the frame in which the flashing light first turns on. The fractional part of the time (in milliseconds) at which that frame enters the device is the end-to-end latency of the video transmission.
The PPS signal that controls the flash has precise clock synchronization, with a timing accuracy better than 50 ns. The PPS signal of the GPS module we use is a pulse signal with a period of one second and a pulse width of 200 ms. The PPS signal controls the power switch that drives the light, so that the light flashes once per second and stays on for 200 ms each time. The latency measurement device uses a 32-bit counter driven by a 25 MHz clock as the timing value. When the device is powered on, the 32-bit timing counter is reset by the rising edge of the PPS signal. In normal operation, the rising edge of the PPS signal is used to reset the timing counter once every 10 min to keep the flashing light and the latency measurement device synchronized.
Taking a 720p@60 camera as an example, when the light is on for 200 ms, approximately 12 consecutive frames of the captured video show the light on. The video reaches the latency measurement device after encoding, transmission, and decoding. The measurement device compares the brightness values of the current and previous frames over five consecutive frames. When the device detects a local increase in brightness in a frame, that is, the light is on, it checks whether the brightness at that position persists over the next 4 frames. If it does, the device identifies that frame as the one in which the light first turned on. A 32-bit counter value $\mathrm{Count}(i)$ is recorded, representing the time at which that frame entered the latency measurement device, where $i$ is the index of the measurement. $\mathrm{Count}(i)$ can be converted into a time $T_i$ in seconds, as expressed by Equation (4):

$$T_i = \frac{\mathrm{Count}(i)}{25 \times 10^{6}} \qquad (4)$$

The fractional part of $T_i$ is the end-to-end latency $D_i$ of the video transmission, as expressed by Equation (5):

$$D_i = T_i - \lfloor T_i \rfloor \qquad (5)$$

Compared with Method 1, Method 2 is simple to operate and does not require the original video and the reconstructed video to be connected to the measurement device at the same time. It only requires placing the flash device in the camera's field of view and connecting the decoded reconstructed video to the latency measurement device. Method 2 realizes remote measurement, but it is still a lossy video measurement, because the flashing light must appear in the video content, which partially destroys the original video picture. Because Method 2 relies on the camera capturing the flashing light to identify the frame, the capture may lag by up to one frame, so the measurement error is comparatively large. The biggest limitation is that, because GPS signals cannot be received indoors, this method can only be used for outdoor measurements.
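As a rough illustration of the counter-to-latency conversion and the five-frame onset check described above, the following Python sketch converts a counter value into a latency and locates the first lit frame. The function names, the per-frame brightness list, and the `rise_threshold` parameter are assumptions made for this example; the text does not specify how the device thresholds brightness.

```python
from typing import Optional, Sequence

CLOCK_HZ = 25_000_000  # 25 MHz counter clock stated in the text


def counter_to_seconds(count: int) -> float:
    """Convert a counter value into seconds since the last PPS reset."""
    return count / CLOCK_HZ


def latency_ms(count: int) -> float:
    """Fractional part of the arrival time, expressed in milliseconds."""
    ticks_into_second = count % CLOCK_HZ  # integer remainder avoids float error
    return ticks_into_second * 1000.0 / CLOCK_HZ


def find_flash_onset(levels: Sequence[float],
                     rise_threshold: float,
                     hold_frames: int = 4) -> Optional[int]:
    """Return the index of the first frame where brightness rises and stays up.

    levels holds one brightness value per frame for the watched region
    (hypothetical input); rise_threshold is an assumed tuning parameter.
    """
    for k in range(1, len(levels) - hold_frames):
        baseline = levels[k - 1]
        if levels[k] - baseline < rise_threshold:
            continue
        # The rise must persist over the next hold_frames frames.
        if all(levels[k + h] - baseline >= rise_threshold
               for h in range(1, hold_frames + 1)):
            return k
    return None
```

A single bright frame followed by dark frames is rejected by the persistence check, which is the purpose of observing the following 4 frames.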

Method 3: Lossless Remote Video Online Measurement
When using Method 2 for measurement, both the sender and the receiver must have GPS signals, and the measurement accuracy is low. For this reason, we developed a method, based on the IEEE 1588 time synchronization protocol, that can accurately measure the end-to-end latency of remote video without altering the video. The measurement scheme is shown in Figure 5. A latency measurement device is placed at both the transmitter and the receiver of the remote video transmission system, and the devices use the built-in IEEE 1588 protocol to achieve time synchronization. The original video and the reconstructed video are copied and connected to the transmitter latency measurement device and the receiver latency measurement device, respectively, and no changes are made to the video transmission system. The sending-end latency measurement device calculates the hash value of each input frame, packages it with the entry time of the frame, and sends it to the receiving-end latency measurement device through the network. The latency measurement device at the receiving end calculates the hash value of each frame of the reconstructed video using the same algorithm and stamps it with a timestamp. At the receiving end, the latency measurement device matches the hash values of the original video frames against the hash values of the reconstructed video frames; a successful match identifies the corresponding frame. The end-to-end latency of the frame is then obtained by comparing the timestamps of the corresponding frames.
In order to accurately find the correspondence between the reconstructed video and each frame of the original video, this method uses video-aware hashing technology, which comprises three processing steps: feature information extraction, video-aware hash extraction, and hash code matching. Most video codecs based on hybrid coding frameworks use motion search for predictive coding; therefore, this method selects the motion information features of video frames as the hash feature information, to reflect changes in video content. Unlike the commonly used video-aware hashing algorithm based on gradient orientation centroids, this method uses a difference-valued hash-aware compression algorithm. In order to ensure the accuracy of the measurement, the measuring device at the sending end and the measuring device at the receiving end must maintain time synchronization. We adopt the IEEE 1588 protocol and implement it in software, which achieves a synchronization accuracy of 20 ns under a 100 ms synchronization period and thus meets the requirements for accurate online measurement of video transmission delay [48]. The specific processing flow is shown in Figure 6.
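For readers unfamiliar with IEEE 1588, the sketch below shows the standard offset and path-delay calculation of its delay request-response exchange. It is not the authors' software implementation, whose details are not given; it assumes a symmetric network path, and the function and variable names are chosen only for this example.

```python
def ptp_offset_and_delay(t1: float, t2: float, t3: float, t4: float):
    """Estimate slave clock offset and one-way path delay.

    t1: master sends Sync (master clock)
    t2: slave receives Sync (slave clock)
    t3: slave sends Delay_Req (slave clock)
    t4: master receives Delay_Req (master clock)
    Assumes the forward and return path delays are equal.
    """
    forward = t2 - t1    # path delay + offset
    backward = t4 - t3   # path delay - offset
    offset = (forward - backward) / 2
    delay = (forward + backward) / 2
    return offset, delay


# Example in nanoseconds: slave runs 500 ns ahead, path delay is 2000 ns.
offset_ns, delay_ns = ptp_offset_and_delay(1_000_000, 1_002_500,
                                           1_010_000, 1_011_500)
```

The slave subtracts the estimated offset from its clock; repeating the exchange every synchronization period keeps the two measurement devices aligned.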
Taking an original video sequence $O_n(i, j)$ with a resolution of 1280 × 720 as an example, the reconstructed video sequence after encoding and decoding is $R_m(i, j)$, where $(i, j)$ are the pixel coordinates in the frame, and $n$ and $m$ are the frame indices of the original video and the reconstructed video, respectively. The luminance components $Y_{On}(i, j)$ and $Y_{Rm}(i, j)$ of the original video $O_n(i, j)$ and the reconstructed video $R_m(i, j)$ are selected for subsequent processing.
The latency measurement devices at the sending end and at the receiving end record the input time of each frame of the original video and the reconstructed video, respectively. When a device detects the frame synchronization pulse of a frame, it records the value of the 32-bit counter driven by its internal 25 MHz clock as the timestamp at which the frame entered the latency measurement device. The IEEE 1588 protocol running inside the sender and receiver devices keeps the time represented by the two counters synchronized.
In order to reduce the amount of data to be processed while retaining useful information, the original and reconstructed video frames are divided into blocks and downsampled. The device divides the $Y_{On}(i, j)$ and $Y_{Rm}(i, j)$ frames into 16 × 16-pixel blocks and assigns the average pixel value of each block to $Y'_{On}(i', j')$ and $Y'_{Rm}(i', j')$, as shown in Equation (6):

$$Y'_{On}(i', j') = \frac{1}{256} \sum_{p=0}^{15} \sum_{q=0}^{15} Y_{On}(16 i' + p,\ 16 j' + q), \qquad Y'_{Rm}(i', j') = \frac{1}{256} \sum_{p=0}^{15} \sum_{q=0}^{15} Y_{Rm}(16 i' + p,\ 16 j' + q) \qquad (6)$$

This downsamples the original and reconstructed video frames, giving the original video frame sequence $Y'_{On}(i', j')$ and the reconstructed video frame sequence $Y'_{Rm}(i', j')$ with a resolution of 80 × 45, where $i' \in \{0, \dots, 44\}$, $j' \in \{0, \dots, 79\}$, $i = 16 \times i'$, and $j = 16 \times j'$.
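The block-averaging step can be sketched in plain Python as follows. The frame is assumed to be a list of rows of luminance values (a representation chosen for this example; the device itself performs this step in the FPGA), and `block_average` is a hypothetical helper name.

```python
from typing import List, Sequence


def block_average(luma: Sequence[Sequence[float]],
                  block: int = 16) -> List[List[float]]:
    """Downsample a luminance frame by averaging each block x block tile."""
    rows, cols = len(luma), len(luma[0])
    out = []
    # Step over the top-left corner of every complete tile.
    for top in range(0, rows - block + 1, block):
        out_row = []
        for left in range(0, cols - block + 1, block):
            total = sum(luma[top + p][left + q]
                        for p in range(block)
                        for q in range(block))
            out_row.append(total / (block * block))
        out.append(out_row)
    return out
```

For a 1280 × 720 luminance frame (720 rows of 1280 samples), the result has 45 rows of 80 values, matching the 80 × 45 resolution stated in the text.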
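The downsampled original video sequence $Y'_{On}(i', j')$ and the downsampled reconstructed video sequence $Y'_{Rm}(i', j')$ are subjected to frame difference processing according to Equation (7), yielding the sequences $\delta_{O(n)}(i', j')$ and $\delta_{R(m)}(i', j')$, which reflect the pixel changes between the current and previous frames:

$$\delta_{O(n)}(i', j') = \left| Y'_{On}(i', j') - Y'_{O(n-1)}(i', j') \right|, \qquad \delta_{R(m)}(i', j') = \left| Y'_{Rm}(i', j') - Y'_{R(m-1)}(i', j') \right| \qquad (7)$$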
A threshold value $\varepsilon$ is selected to binarize $\delta_{O(n)}(i', j')$ and $\delta_{R(m)}(i', j')$ according to Equation (8). Pixel values greater than the threshold are set to 1 and pixel values less than the threshold are set to 0, giving the binarized images $B_{O(n)}(i', j')$ and $B_{R(m)}(i', j')$:

$$B(i', j') = \begin{cases} 1, & F(i', j') > \varepsilon \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

where $F(i', j')$ represents the pixel value of the pixel with coordinate $(i', j')$. The binarized image is expanded line by line to obtain the hash code of each frame of the original video and the reconstructed video. The hash code and frame timestamp of each frame of the original video are assembled into hash packets, which are sent to the latency measurement device at the receiver, as shown in Figure 7.
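The frame-difference, binarization, and line-by-line expansion steps can be sketched as below. The use of the absolute difference is an assumption of this example, as are the helper names; the inputs are two downsampled frames given as lists of rows.

```python
from typing import List, Sequence


def frame_hash_bits(curr: Sequence[Sequence[float]],
                    prev: Sequence[Sequence[float]],
                    epsilon: float) -> List[int]:
    """Binarize the frame difference and flatten it row by row."""
    bits = []
    for row_curr, row_prev in zip(curr, prev):
        for c, p in zip(row_curr, row_prev):
            # Assumed: absolute change compared against the threshold.
            bits.append(1 if abs(c - p) > epsilon else 0)
    return bits


def bits_to_int(bits: Sequence[int]) -> int:
    """Pack the bit list into one integer, first bit most significant."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


# Tiny 2 x 2 example: only the top-right block changed by more than 8.
bits = frame_hash_bits([[10, 50], [30, 30]], [[10, 10], [35, 30]], epsilon=8)
code = bits_to_int(bits)
```

With 80 × 45 downsampled frames, each hash code has 3600 bits, or 450 bytes, so the out-of-band hash traffic stays small compared with the video itself.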
The latency measurement device at the receiver performs hash matching to find the original video frame that matches each reconstructed video frame. In the hash library, HashBD, at the receiver, $H_{O(n)}$ and $H_{R(m)}$ denote the hash codes of the original video frame and the reconstructed video frame, respectively, and $H^{k}_{O(n)}$ and $H^{k}_{R(m)}$ are the values of the $k$th bit of each hash code. The Hamming distance is used to measure the distance between two code strings: the corresponding bits of the reconstructed video frame hash and the original video frame hash are XORed and the number of 1s is counted, as shown in Equation (9):

$$d\left(H_{O(n)}, H_{R(m)}\right) = \sum_{k} H^{k}_{O(n)} \oplus H^{k}_{R(m)} \qquad (9)$$

The distance determines the correspondence between two frames. The smaller the distance, the stronger the correlation between the two frames; the larger the distance, the weaker the correlation. In this way, the original video frame that matches the reconstructed video frame can be found, and the end-to-end latency of video transmission can be obtained from the timestamps carried by the two video frames.
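A minimal sketch of the matching step follows, assuming hash codes are stored as Python integers and timestamps as 25 MHz counter ticks. The `max_distance` acceptance threshold and the function names are hypothetical, and counter resets or wrap-around are ignored for simplicity.

```python
from typing import Iterable, Optional, Tuple

CLOCK_HZ = 25_000_000  # counter clock from the text; one tick is 40 ns


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """XOR the two hash codes and count the 1 bits."""
    return bin(hash_a ^ hash_b).count("1")


def match_latency_ms(rec_hash: int,
                     rec_ticks: int,
                     sender_packets: Iterable[Tuple[int, int]],
                     max_distance: int) -> Optional[float]:
    """Find the closest original frame and return the latency in ms.

    sender_packets yields (timestamp_ticks, hash_code) pairs received
    from the sending-end device. Returns None if no packet is close enough.
    """
    best_distance, best_ticks = None, None
    for sent_ticks, sent_hash in sender_packets:
        distance = hamming_distance(rec_hash, sent_hash)
        if best_distance is None or distance < best_distance:
            best_distance, best_ticks = distance, sent_ticks
    if best_distance is None or best_distance > max_distance:
        return None
    return (rec_ticks - best_ticks) * 1000.0 / CLOCK_HZ
```

Rejecting matches above a distance threshold guards against pairing a reconstructed frame with an unrelated original frame when the true match has not arrived or was lost.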
Method 3 can carry out high-precision, real-time latency monitoring online without damaging the transmitted video. It balances the tradeoff between computational demand, out-of-band bandwidth utilization, and robustness across different content. Its accuracy is similar to that of Method 1, it provides the remote measurement capability of Method 2, and it is not limited to scenarios in which a GPS signal is available. It is a convenient, continuous, and reliable method that is suitable for wide deployment.

Evaluation
Based on the requirements that the three latency measurement methods place on the latency measurement device, we designed a device that meets the requirements of all three methods and implemented it in hardware. The design framework and a photograph of the device are shown in Figure 8; the device is named CC3030. The device is a heterogeneous system built around a Xilinx ARTIX-7 series FPGA and an STM32F7 series MCU. The FPGA is mainly responsible for image information processing, and the STM32 is mainly responsible for data processing and external interaction. The information processed by the FPGA is transmitted to the STM32 microcontroller, which calculates the latency. The device provides two HDSDI ports and two HDMI ports for video input, and one HDSDI port and one HDMI port for OSD (On Screen Display) menu display. It can receive external GPS signals, and its configuration and measurement results can be viewed from other devices through a web page over the network port.

Method 1: Timecode Method
This method requires a camera, two display screens, an allocator (splitter), a codec, and a CC3030. The camera is used for real-time video acquisition, and the captured video is passed through a 12 G SDI one-in, two-out video splitter to obtain two video channels with zero relative latency. One output feeds the raw video to a display screen, and the other feeds CC3030. The original video sequence input to CC3030 is called the source video, and the corresponding video input is called the source entry (Ori). The source video is output from CC3030 and, after passing through the codec, becomes the reconstructed video. The reconstructed video is fed back into CC3030 through the input called the reconstructed entry (Rec). The FPGA in CC3030 finds the timestamp in the reconstructed video, and the transmission latency of the video sequence is determined by comparing it with the local time. To make the latency measurement results easy to display on the reconstructed video, the displayed time resolution is limited to 0.1 ms.
The video input and output in the measurement process pass through a video interface (HDSDI or HDMI). The latency caused by these interfaces is on the order of microseconds, which is negligible compared to the video transmission latency being measured. Since the timestamp is embedded in the existing video source, the latency caused by camera acquisition is not included in the latency measurement results.
Figure 9a shows the latency measurement process for video transmission through the H.264 codec. In the image, the right screen shows the original video and the left screen shows the reconstructed video with the measurement results; the timestamp is presented as a spline at the bottom of the screen. The real-time video acquisition format in this measurement is 720p at 60 Hz, and the latency measurement result is 51.5 ms. Figure 9b shows the latency measurement process for video transmission through the H.265 codec. Again, the right screen shows the original video and the left screen shows the reconstructed video with the measurement results. The real-time video acquisition format in this measurement is 1080i at 50 Hz, and the latency measurement result is 436.6 ms.

Method 2: Remote Online Measurement
This method requires a camera, a display, a flashing light, a codec, and a CC3030. In this experiment, the light flashed once every second for 200 ms. When capturing video with the camera, the flashing light must be within the captured picture.
When the PPS signal from GPS arrives, the light starts flashing and CC3030 starts its timing operation. The picture of the flash captured by the camera is transmitted, and the reconstructed video is obtained at the decoding end. The reconstructed video is input to CC3030, where the FPGA searches for the frame containing the flash and stops the timing operation once it is found. Finally, the measurement result is overlaid on the reconstructed video and displayed on the screen.
The flashing period in this measurement is 1 s. The flashing period determines the range of latency that can be measured. If the flashing period is shorter than the latency of the transmission system, the FPGA may match a flash picture to the wrong flash because successive flashes overlap in the system, and the latency measurement will be in error. Therefore, the flashing period must be lengthened when measuring a transmission system with large latency.
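The range limit described above can be illustrated with a short sketch: because every flash looks the same, the device can only observe the latency modulo the flashing period. The function name and the example values are assumptions made for illustration.

```python
def apparent_latency_ms(true_latency_ms: float, flash_period_ms: float) -> float:
    """Latency as it would be read when flashes repeat every flash_period_ms.

    A flash that arrives after the next one has already fired is
    indistinguishable from an earlier arrival of that next flash.
    """
    return true_latency_ms % flash_period_ms


# With a 1 s flashing period, a 1200 ms system would read as 200 ms.
short_period = apparent_latency_ms(1200.0, 1000.0)
# Lengthening the period to 2 s removes the ambiguity for this system.
long_period = apparent_latency_ms(1200.0, 2000.0)
```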
Figure 10a shows the latency result obtained by inputting the flash picture captured by the camera directly into CC3030 without going through the codec, which gives the latency of the camera itself. The video acquisition format is 720p at 60 Hz, and the latency of the camera used for the measurement is 34.9 ms. Figure 10b shows the latency result after the video source is transmitted through the codec. The codec selected for the measurement is the H.264 video codec, and the video acquisition format is 720p at 60 Hz. The measured latency is 96.4 ms; removing the 34.9 ms caused by camera video acquisition leaves 61.5 ms for the video transmission process.

Method 3: Lossless Remote Video Online Measurement
One CC3030 must be prepared at the transmitter and one at the receiver before the measurement. Since it cannot be guaranteed that both the transmitter and the receiver of a remote video transmission can receive GPS signals, the IEEE 1588 protocol is used to realize time synchronization between them. During the measurement, two transmission paths run in parallel. One path carries the original video through the transmission system to obtain the reconstructed video. The other path processes the original video and the reconstructed video to obtain their hash codes. At the sending end, the original video is input to CC3030 and the hash code of the original video is obtained. At the receiver, the reconstructed video is input to CC3030 and the hash code of the reconstructed video is obtained. The original video hash codes, with their timestamps, are transmitted to the receiver CC3030 through the network, where the original video frame hash code matching each reconstructed video frame hash code is found. The video transmission latency is then obtained from the timestamps associated with the matched hash codes.
The codec system used in this measurement is the H.264 codec. Table 1 lists the timestamps of the reconstructed video frames and the matched original video frames for the first 10 s, together with the calculated delay values. Because the sender and the receiver are far apart, the latency measurement results are output in real time as a web page, where the horizontal axis is time in seconds (the latency is sampled once per second in actual operation) and the vertical axis is the latency in milliseconds, as shown in Figure 11.

Comparison of Proposed Methods with Existing Methods
Based on the actual evaluation results, the advantages and disadvantages of the three measurement methods are listed in Table 2. The measurement principle of Method 1 is the same as that of Method 3; in addition, Method 3 uses out-of-band hash data and IEEE 1588 time synchronization to realize lossless remote measurement. The time value in the actual measurement comes from a 32-bit counter driven by a 25 MHz clock. Therefore, the minimum time unit of the timestamp in Method 1 and Method 3 is 40 ns. According to Equation (2), the maximum error between the arrival time of the reconstructed video and the arrival time of the original video due to the ±1 count error is 80 ns, and the accuracy is stated as 0.1 µs for statistical convenience.

The measurement principle of Method 2 is similar to that of Reference [30]. Reference [30] measures the delay that the video transmission system under test adds to the propagation of light from a light-emitting diode (LED) to a phototransistor (PT). The resistance of the PT decreases when the LED lights up in the displayed image. The sampling rate of the PT is 2 kHz, yielding a precision of 0.5 ms. However, the measurement relies on capturing the LED with a video camera and rendering it on a display, so its precision is limited to, e.g., 16.7 ms for a 60 Hz video camera. The actual precision of Reference [30] is subframe, the same as Method 2. The work in [26] uses an embedded barcode to measure the end-to-end delay of computer video chat applications. The precision of that method is also subframe, owing to the influence of screen refresh and software execution speed. Reference [5] measures latency from the waveform shift on an oscilloscope. This guarantees the measurement accuracy, but the approach is very inconvenient and cannot be used for online measurement.
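The timestamp arithmetic described above can be made concrete with a minimal sketch. The clock frequency and counter width are taken from the text; the function names and the modulo-based handling of counter rollover are assumptions made for illustration, since the paper does not state how rollover is treated. The sketch also assumes that the measured delay is shorter than one full counter period (about 172 s at 25 MHz).

```python
CLOCK_HZ = 25_000_000                 # counter clock, from the text
COUNTER_BITS = 32                     # counter width, from the text
TICK_NS = 1_000_000_000 // CLOCK_HZ   # 40 ns per count
MODULUS = 1 << COUNTER_BITS           # counter rolls over at 2**32 counts

# Each of the two timestamps may be off by one count, so the delay
# can be off by at most two counts (80 ns).
MAX_ERROR_NS = 2 * TICK_NS


def delay_ticks(t_original: int, t_reconstructed: int) -> int:
    """Counts elapsed between the two timestamps.

    Assumption (not from the paper): rollover is handled by taking the
    difference modulo 2**32, valid while the delay is below one period.
    """
    return (t_reconstructed - t_original) % MODULUS


def delay_ns(t_original: int, t_reconstructed: int) -> int:
    """Delay in nanoseconds between the original and reconstructed frame."""
    return delay_ticks(t_original, t_reconstructed) * TICK_NS


if __name__ == "__main__":
    # 25,000 counts at 40 ns per count correspond to 1 ms.
    print(delay_ns(0, 25_000), "ns, error bound", MAX_ERROR_NS, "ns")
```

Keeping all quantities as integer nanoseconds avoids floating-point rounding when the 40 ns resolution is compared with the 0.1 µs accuracy figure.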

Summary
In this paper, the generation of latency in video transmission is analyzed and three latency measurement methods are proposed: the timecode method, the remote online measurement method, and the lossless remote video online measurement method. Different latency measurement methods suit different deployment scenarios. The measurement accuracy of Method 1 and Method 3 is similar, but Method 1 is only suitable for local measurement, where the source content is modified with a timestamp. This paper proposes a synchronization framework for Method 3 that balances the out-of-band bandwidth, the measurement accuracy, and the ease of deployment using video-aware hash tables. Both Methods 2 and 3 can be used to measure the latency of remote video transmission, but Method 3 does not alter the original video content. One unique value of Method 2 is that it captures the latency of the capturing and rendering devices (e.g., camera, display). Compared with Methods 1 and 2, Method 3 can achieve high-precision lossless remote video online latency measurement. Method 3 is scalable across today's networks and provides continuous monitoring. This will bring the latency QoS to a level that is acceptable for next-generation applications, including VR and real-time AI solutions.
In the future, Method 3 can be integrated into cameras to measure the camera latency directly. If the camera contains an encoding and decoding system, the video transmission latency can be obtained directly. Furthermore, Method 3 can be extended not only to measure the video transmission latency, but also to detect frame loss during transmission.
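As a purely illustrative sketch of the frame-loss extension mentioned above, the following code matches per-frame hashes recorded at the sender against those observed at the receiver. It is not the implementation used in this paper. The function name `match_frames`, the (hash, timestamp) record layout, and the default 40 ns tick are hypothetical, and the sketch assumes that every frame has a distinct hash and that the counter does not roll over within the observation window.

```python
from typing import Dict, Iterable, List, Tuple

FrameRecord = Tuple[str, int]  # (frame hash, timestamp in counter ticks); hypothetical layout


def match_frames(
    sent: Iterable[FrameRecord],
    received: Iterable[FrameRecord],
    tick_ns: int = 40,
) -> Tuple[Dict[str, int], List[str]]:
    """Return per-frame delays in ns and the hashes that never arrived."""
    # Index the receiver side by hash, keeping the first arrival of each hash.
    arrivals: Dict[str, int] = {}
    for frame_hash, t_recv in received:
        arrivals.setdefault(frame_hash, t_recv)

    delays_ns: Dict[str, int] = {}
    lost: List[str] = []
    for frame_hash, t_sent in sent:
        if frame_hash in arrivals:
            delays_ns[frame_hash] = (arrivals[frame_hash] - t_sent) * tick_ns
        else:
            # No matching hash at the receiver: treat the frame as lost.
            lost.append(frame_hash)
    return delays_ns, lost


if __name__ == "__main__":
    sent = [("a", 0), ("b", 1000), ("c", 2000)]
    received = [("a", 500), ("c", 2600)]
    print(match_frames(sent, received))
```

A practical version would have to handle static scenes, in which consecutive frames can produce identical hashes, and would need a timeout before a frame is declared lost.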

Figure 1. Video transmission delay distribution. The squares 1-3 marked in the figure are the video frames transmitted.

Figure 2. Frame reference relationship of HEVC RA mode. In the figure, frame I is an intra-coded frame, frame P is a forward reference frame, and frame B is a bidirectional reference frame.

Figure 3. (a) Timecode latency measurement scheme; (b) Spline bar on the screen. "bar" indicates the timecode. The picture contains Chinese words: "School of Information Engineering, Jimei University".

Figure 5. Lossless remote video online measurement scheme.

Figure 8. Design framework and physical picture of the delay measurement equipment. The picture contains Chinese words: "School of Information Engineering, Jimei University".

The following are examples of the three latency measurement methods using different configurations of this self-made CC3030 equipment. The camera model used in the measurement is the SHD60, and the two codec systems are an H.264 codec and an H.265 codec. The H.264 codec is from Sculpture Networks (San Diego, CA, USA); the encoder model is Snenc1000 and the decoder model is Sncupid1000. The H.265 codec is the codec used in the Huawei HiSilicon HI3519 video codec solution.

Figure 9. (a) Delay measurement of H.264 codec; (b) Delay measurement of H.265 codec. The Chinese characters in the two pictures are "School of Information Engineering, Jimei University".

Figure 11. The delay results of Method 3 in the form of web pages. The Chinese words in the picture are "unsafe".

Table 1. Latency results in the first 10 s of video transmission.

Table 2. Pros and cons of the three proposed methods.