1. Introduction
As the automotive industry progresses toward full automation (SAE level 5), a practical hybrid model combining autonomous driving with human teleoperation is emerging as a critical solution [1]. This approach allows a remote human operator to intervene and safely control a vehicle when its onboard automation encounters complex scenarios it cannot resolve, such as unexpected road work, system malfunctions, or dense traffic [2]. The viability of this model, however, depends on providing the remote driver with a high-quality, real-time video feed over variable and demanding cellular networks [3].
For teleoperation to be a safe and scalable solution, strict performance requirements must be met. The total human–machine loop, often measured as the Glass-to-Glass (G2G) delay, must be minimized. Experiments over 5G have reported G2G delays of around 202 ms, which was deemed safe only for low speeds [4], while other studies suggest that a target of under 100 ms is crucial for a driver to respond effectively [5]. Meeting this low-latency requirement necessitates aggressive video compression, but this creates a critical trade-off with image quality.
This trade-off becomes particularly dangerous when considering how a remote operator perceives the scene. Standard video compression can degrade the legibility of small but vital objects. Even conventional region of interest (ROI) video compression, which partitions a frame into a simple foreground and background, often fails in this context. Critical objects like traffic signs and lights are frequently overshadowed by larger foreground elements like the road or other vehicles, preventing them from receiving sufficient bandwidth to ensure they are clearly visible. This perceptual ambiguity introduces an unacceptable safety risk. To address this specific challenge, this paper introduces an enhanced multi-category ROI compression framework. Our approach moves beyond a binary ROI partition by using semantic segmentation to classify the scene into three distinct tiers based on driving importance:
Strong ROI: Containing the most critical objects for driver decision-making, specifically traffic signs and traffic lights.
Weak ROI: Containing other essential contextual elements, such as the road, lane markings, pedestrians, and other vehicles.
Background: Containing non-essential elements like sky, trees, and buildings.
By dynamically allocating bandwidth according to this three-tiered classification, our method significantly improves the perceptual quality of the most critical objects without degrading the surrounding context. This study makes the following contributions:
We design and implement a multi-category ROI compression system that prioritizes traffic signs and lights to improve safety in remote driving teleoperation.
Using a photo-realistic driving simulator, we demonstrate that our method improves the perceptual quality of traffic signs and lights by up to 5.5 dB compared to a conventional two-category ROI approach.
We validate that this improvement is achieved while maintaining the visual quality of other essential driving elements in the weak ROI, ensuring a comprehensive and clear view for the remote operator.
2. Background
Developing a fully automated self-driving car beyond Society of Automotive Engineers (SAE) level 3 is challenging [1]. Designing a fully self-maneuverable vehicle that can drive safely under any conditions, including fog, heavy rain, road obstacles, complex roadside works, impolite or incapable drivers, and pedestrians on the road, is not easy. The problem has proven far more complicated than many designers initially thought.
A fallback solution allows a remote driver to control the vehicle when the self-driving car’s automation encounters difficulty in handling the driving task independently. Most of the time, driving is autonomous; only in rare cases, such as extreme weather, very dense traffic, or obstacles, will the remote driver take control of the vehicle. To make such a system cost-effective, in most interventions the remote driver drives the car to a nearby safe location, where support then arrives at the vehicle.
Reliable and safe teleoperation by a remote driver in a distant control room requires high-quality, low-delay, and reliable video communication. The available cellular 4G and 5G network bandwidth varies significantly when a car moves and switches from one cell to another. Furthermore, varying distances between the vehicle and the network antennas and network congestion make a video broadcast over the cellular network demanding [3].
The required Glass-to-Glass (G2G) delay for the video transmission is measured as the time that passes from the rays of an event hitting the camera’s lens until the event is displayed on the teleoperator’s monitor [4,6,7]. The authors of [4], in an experimental evaluation of a Kia Soul EV with a teleoperated driving interface over 5G, reported an average G2G delay of 202 ms and a round-trip time (RTT) of 47 ms. They concluded that their teleoperation setup is safe at low speeds of less than 20 km/h.
A study of the influence of delay in a multiplayer racing car game shows that a player can respond well if the delay is smaller than 100 ms. Therefore, reliable low-delay video transmission to the teleoperator is crucial for safe driving [8].
This study proposes improving video quality by applying multi-category ROI compression. Instead of parsing the image into only an ROI and background, we rank ROIs according to their importance for remote driving, using two ROI categories: weak and strong. The critical objects for driving, traffic signs and lights, are assigned a large share of bandwidth relative to their size, enhancing their visibility, while at least maintaining the quality of the weak ROI and migrating bandwidth away from the background.
The video encoder compresses video coding blocks to a higher quality if they contain more critical elements for safe driving. Conversely, coding blocks that are less important for safe driving are considered as the background and are compressed with a lower quality.
The requirement for a fast teleoperator response and low-delay video forces the granularity of the rate control (RC) to be at the level of a single frame. In contrast, many video applications are less sensitive to real-time constraints and apply RC at the granularity of a Group of Pictures (GOP) or a time window (e.g., one second).
The car’s edge computer analyzes the video and detects which parts are more important for remote driving and which are less critical. Several image parsing and comprehension methods are available. First, deep learning (DL) semantic segmentation tools [9] can classify every image pixel by its class, after which an image block can be assigned to an ROI category based on the block content. Second, DL object detection methods [9] can be trained to find the bounding boxes of essential objects such as vehicles, traffic lights, and traffic signs; coding blocks are then assigned ROI priorities only if they contain parts of these crucial objects. Third, the car’s Lidar (light detection and ranging) depth maps [10] can distinguish between ROI and non-ROI coding blocks according to their distances from the vehicle. After mapping the image into ROI and non-ROI coding blocks, the encoder allocates bandwidth to both regions. Finally, it assigns the bitrates according to the temporal bandwidth available from the network operator, the relative sizes of the ROI and non-ROI, and a predefined weight that defines the normalized bitrate ratio of the non-ROI to the ROI, as sketched below.
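For illustration, this final allocation step can be sketched in a few lines of Python. The closed-form split and the variable names are our illustrative assumptions, not the formula of any specific encoder:

```python
def allocate_bitrates(total_bps, roi_fraction, w):
    """Split the momentary bandwidth between ROI and non-ROI regions.

    total_bps    -- bandwidth currently available from the network (bits/s)
    roi_fraction -- fraction of the frame area covered by ROI blocks (0..1)
    w            -- normalized bitrate ratio of non-ROI to ROI (0 < w <= 1);
                    w = 1 reduces to uniform allocation.
    """
    non_roi_fraction = 1.0 - roi_fraction
    # Per-area bit density: ROI areas get density d, non-ROI areas get w * d.
    # Solve d * roi_fraction + w * d * non_roi_fraction = total_bps for d.
    d = total_bps / (roi_fraction + w * non_roi_fraction)
    roi_bps = d * roi_fraction
    non_roi_bps = w * d * non_roi_fraction
    return roi_bps, non_roi_bps

# Example: a 1 Mbps link, ROI covering 30% of the frame, non-ROI weight 0.25.
# The ROI then receives roughly 63% of the total bits.
print(allocate_bitrates(1_000_000, 0.30, 0.25))
```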
This work relies on several concurrent technologies: self-driving cars, the teleoperation of autonomous vehicles when necessary, complementary metal-oxide-semiconductor (CMOS) video cameras, Lidar sensors, radars, the Global Positioning System (GPS), High-Efficiency Video Coding (HEVC, also known as the H.265 video compression standard), the Cognata [11] photo-realistic autonomous vehicle simulator, and deep neural networks that can analyze the video image for tasks such as object classification, object recognition, object tracking, semantic segmentation, and instance segmentation.
2.1. Self-Driving Cars
Self-driving cars will change public and private attitudes toward transportation, driving, car ownership, car safety, car accidents, and the loss of working hours.
The SAE defines six levels (0–5) of driving automation [12]. Level 0 represents a conventional car in which the human driver controls the vehicle; most cars available today are at level 0. As the SAE level increases, human driver functionalities migrate into the car’s automation. At level 1, the vehicle has driving assistance such as automatic braking, lane assistance, and adaptive cruise control. The car’s automation can control its speed or the steering, but not both. The driver is responsible for monitoring the road and must immediately take over if the assistance system fails.
Level 2 refers to partial automation. The car can control steering, acceleration, and braking where possible. The automation handles more complicated tasks such as changing lanes, responding to traffic signals, and watching for hazardous conditions. The driver must be ready to take back control of the vehicle immediately when requested. Examples of level 2 systems are Tesla Autopilot and the Mercedes-Benz Driver Assistance Systems. From level 3 and above, the car’s automation can take over in certain traffic conditions, such as assisting in traffic jams or acting as a highway chauffeur. At level 3, the driver can perform tasks other than driving: texting on a cell phone, reading, writing, or relaxing is possible. However, the driver should be prepared to take control of the vehicle if the automation cannot handle a complex driving situation or malfunctions.
Level 4 indicates that the car is highly automated. It can handle all driving tasks, even under challenging driving conditions, with no need for human driver intervention. The driver can still take control and steer the car by hand in case of a system malfunction or personal preference; however, the vehicle generally operates as a self-driving device. Currently, the only level 4 self-driving cars in operation are the Waymo fleet. The Waymo case is still level 4 only in a narrow sense because it is available in limited locations, the system does not function in bad weather, and remote human assistance can guide the automated vehicles under certain conditions [2].
Finally, level 5 represents an autonomous vehicle without human interaction. At this level, cars are autonomous robots that transport passengers and goods independently.
2.2. Teleoperation of Almost Self-Driving Cars
The highest level of automation available for public sale is level 3, and even its availability is minimal [13]: it is limited to a few mapped highways, speeds below 40 mph, daytime, and good weather conditions. Migrating from a level 3 car to a fully automated level 4 vehicle, or to a completely autonomous level 5 vehicle, is challenging. Teleoperation can fill the technological gap between an autonomous vehicle that is imperfect and requires human assistance on rare occasions and the desired fully robotic system. The remote driver controls the computerized vehicle if the car’s automation malfunctions or an unexpected driving situation arises that the vehicle cannot handle safely. The idea is to let the automation control the vehicle fleet 99.9% of the time, while remote drivers intervene in the remaining 0.1% of the driving hours [14]. The teleoperation of the car requires high-quality video with minimal delay between the camera and the remote driver’s monitor. The round-trip signal delay comprises the car’s camera-to-remote-monitor delay, the teleoperator’s response time, and the time required to transmit the steering wheel and pedal commands from the distant driver to the autonomous car. This human–machine closed-loop time should be at most 170 ms [15], leaving much less time for signal travel between the camera and the teleoperator’s monitor.
2.3. Use of 4G and 5G Cellular Networks for Teleoperation
Experiments with teleoperation over 4G networks [3,16] showed that the upload and download capacities are adequate under many conditions; however, the network is not robust enough to enable smooth teleoperation. These 4G experiments suggest that the 5G network may be acceptable for reliable teleoperation. A remote driver should receive continuous, high-quality HD video with a maximum delay between the camera sensor and the operator’s monitor of no more than 100 ms. This short interval also includes the video encoding, decoding, and 5G packet transmission time. Therefore, successful remote teleoperation requires high-quality video transmission with minimal delay over a 5G network. This task is very challenging because the network quality is not uniform, and there are many driving conditions under which the network bandwidth degrades significantly.
2.4. The Cognata Driving Simulator
Driving scenarios were generated using the Cognata driving simulator, which produces driving movies for autonomous vehicles through controlled simulations [17]. The Cognata simulator can create photo-realistic videos that are close to reality. Photo-realistic driving scenarios are essential for video compression research because their compression results are close to those of real-life scenarios rich in fine details; non-photo-realistic simulators yield simplistic scenes with low spatial complexity, leading to much smaller compressed files than actual driving scenes. The user can decide where to place cameras, Lidars, radars, GPS, and lane detectors on the car’s layout. After running a driving simulation according to a modified driving scenario, the user can download the response of each sensor as a contiguous video or a collection of non-compressed images in Portable Network Graphics (PNG) format.
Figure 1a shows a snapshot of a driving scenario from the car’s front camera, and Figure 1b shows the ground-truth semantic segmentation map of Figure 1a according to the pixel classes of the Cognata simulator. The semantic segmentation sensor has around 50 pixel classes, each with its unique color; for example, sky pixels are colored blue, and road pixels are colored gray. The Cognata pixel classes are based on earlier works, a key example being the Cityscapes dataset [18], which is notable for scene understanding and semantic segmentation for self-driving cars. It comprises an extensive collection of urban street scenes captured from different cities, with pixel-level annotations that label various objects and areas within the images. The dataset is frequently used to train and evaluate algorithms designed to understand and interpret the visual information present in urban environments. Cityscapes has 30 pixel classes.
2.5. Contemporary Video Codecs
A crucial component of a teleoperation system is a video encoder/decoder. Therefore, it is essential to select an appropriate video compression standard. An aging standard may be robust and rich in practical software and hardware implementations, but it yields poor rate-distortion (RD) performance. Conversely, a very recent standard may not yet be mature and may lack practical, available real-time deployments.
In this research, we chose HEVC (H.265) [19] over Advanced Video Coding (AVC, H.264) [20] because of its superior bitrate savings. The results in [21] show the clear benefits of HEVC over AVC and Google’s open-source VP9 [22]. The comparative study in [23] compares H.265, H.264, and VP9 for low-delay real-time video conferencing applications; its results show that, at the same peak signal-to-noise ratio (PSNR), H.265 saves 42% and 32.5% in bitrate relative to H.264 and VP9, respectively.
In addition, contemporary hardware can manage the higher coding complexity of HEVC. Although there are newer compression standards, such as Versatile Video Coding (VVC/H.266) [24], AV1 [25], and MPEG-5 [26], their practical real-time encoding implementations are yet to become available. According to Bitmovin’s annual Video Developers Report [27], AVC is still the most dominant video codec in current video applications, with HEVC second. However, among the codecs that developers plan to use in future designs, HEVC surpasses AVC.
The H.265 encoder partitions an image into Coding Tree Units (CTUs), whose size can be as large as 64 × 64 pixels [19]. The encoder can recursively subdivide each CTU into Coding Units (CUs), ranging from 32 × 32 down to 8 × 8 pixels. Each Quantization Group (QG) has a CU size according to the CTU partitions and can be assigned a separate QP [19]. Varying the QP controls the compression quality. The standard allows for fine control such that each CU has a separate QP; however, for simplicity, in this study we limited the compression quality granularity to the size of the entire CTU, as illustrated below.
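To make this granularity concrete, the following minimal Python sketch computes the CTU grid of an HD frame and holds one QP per CTU. It is an illustration of the bookkeeping only, not encoder code:

```python
import math

WIDTH, HEIGHT, CTU = 1920, 1080, 64

cols = math.ceil(WIDTH / CTU)   # 30 CTUs per row
rows = math.ceil(HEIGHT / CTU)  # 17 CTU rows (the last row is only 56 pixels tall)

# One QP value per CTU: the coarsest granularity used in this study.
base_qp = 32
qp_map = [[base_qp for _ in range(cols)] for _ in range(rows)]
print(rows, cols)  # 17 30
```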
2.6. Rate Control (RC)
The RC algorithm is a lossy compression optimization method that fits the compressed video size into a bandwidth constraint. The target is to minimize the distortion (D) of the images reconstructed from the lossy compressed representation at a given encoder rate (R). One of the most common measures of D is the mean square error (MSE). There is always an RD trade-off: when the network allows a higher R, the encoder achieves smaller D values. RC methods have changed significantly along the long journey toward modern video compression since the 1980s; with the development of new video encoding standards, rate-control methods have been adjusted to meet new requirements.
In [28], the refresh rate, QP, and movement detection thresholds were adjusted to regulate the very coarse bitrate of the early H.261 standard. The authors of [29] proposed a quadratic relationship between QP and R for the MPEG-2 and MPEG-4 compression standards:
\[ R = \frac{a}{QP} + \frac{b}{QP^{2}}, \tag{1} \]
where a and b are calculated based on the encoding of previous frames.
The authors of [30] showed that the R-λ RC method proposed for HEVC outperformed the previous HEVC RC, which is based on R-Q model dependencies and is part of the HM reference software (the R-λ model became a prominent part of the HM rate-control mechanisms around HM 10.0 and later versions).
Lagrange multipliers help select an optimal point along the monotonically decreasing convex RD function [31]. In the R-λ method, the Lagrange multiplier λ depends on the frame complexity. The granularity of this algorithm can be an entire GOP, a unit of time, a frame, or the CTU level, and it can assign a separate QP to each CTU. The Joint Collaborative Team on Video Coding (JCT-VC) included this method in its HEVC Test Model. The R-λ method is also implemented in Kvazaar [32] as the default RC method, and we review it below.
The objective of RC is to minimize the overall distortion D at a target bitrate R:
\[ \min_{\{R_i\}} \sum_{i=1}^{M} D_i \quad \text{subject to} \quad \sum_{i=1}^{M} R_i \le R, \tag{2} \]
where D_i and R_i are the distortion and target bits, respectively, for the i’th CTU, and M is the total number of CTUs in a frame. The Lagrange method converts the constrained Equation (2) into an unconstrained optimization:
\[ \min_{\{R_i\}} \; J = \sum_{i=1}^{M} D_i + \lambda \sum_{i=1}^{M} R_i. \tag{3} \]
Equation (3) can be solved by taking the derivatives with respect to R_i and comparing the values to zero. Li et al. [30] showed that the hyperbolic RD model fits better than the exponential model, and the resulting λ can be written as follows:
\[ \lambda = \alpha \cdot \mathrm{bpp}^{\beta}, \tag{4} \]
where α and β are determined by the video content. Because Equation (4) has two variables, α and β, they cannot be deduced directly. Li developed an updating algorithm based on the collocated CTU values of α, β, and bpp (bits per pixel) using the following equations. The encoded bits for an entire frame or a CTU are known after the encoding; α_old and β_old belong to the previous frame or the collocated CTU at the same hierarchical level:
\[ lr = \ln \lambda_{\mathrm{real}} - \ln\!\left(\alpha_{\mathrm{old}} \cdot \mathrm{bpp}_{\mathrm{real}}^{\;\beta_{\mathrm{old}}}\right), \tag{5} \]
where lr is the log ratio between the λ actually used for encoding and the λ estimated from Equation (4) once the actual bpp is known. Then, from the log ratio, the α and β parameters that define the updated λ according to the video complexity are updated as follows:
\[ \alpha_{\mathrm{new}} = \alpha_{\mathrm{old}} + \delta_{\alpha} \cdot lr \cdot \alpha_{\mathrm{old}}, \qquad \beta_{\mathrm{new}} = \beta_{\mathrm{old}} + \delta_{\beta} \cdot lr \cdot \ln \mathrm{bpp}_{\mathrm{real}}, \tag{6} \]
where δ_α and δ_β are constants that govern the updating rate. Finally, the QP is calculated from λ, with c_1 = 4.2005 and c_2 = 13.7122, by the following:
\[ QP = c_1 \cdot \ln \lambda + c_2. \tag{7} \]
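The update cycle of Equations (4)–(7) can be summarized in a short Python sketch. The update rates δ_α = 0.1 and δ_β = 0.05 are the commonly cited defaults for this model and should be treated as assumptions here, not as values mandated by the text:

```python
import math

def update_rlambda(alpha, beta, lam_real, bpp_real,
                   delta_alpha=0.1, delta_beta=0.05):
    """One R-lambda model update after a frame or CTU has been encoded."""
    lam_est = alpha * (bpp_real ** beta)          # Equation (4) with the actual bpp
    lr = math.log(lam_real) - math.log(lam_est)   # Equation (5): log ratio
    alpha += delta_alpha * lr * alpha             # Equation (6): model update
    beta += delta_beta * lr * math.log(bpp_real)
    return alpha, beta

def qp_from_lambda(lam, c1=4.2005, c2=13.7122):
    """Equation (7): map lambda to an HEVC QP."""
    return round(c1 * math.log(lam) + c2)
```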
The authors of [33] proposed an Optimal Bit Allocation (OBA) RC method, in which an RD estimation is used instead of an R–λ estimation for better D minimization. The authors argue that the original R–λ optimization method proposed for HEVC in [34] is not optimal, and that the direct calculation of the Lagrange multiplier yields higher distortion. Using this RD method, an accurate estimate of λ achieves the RC target. The method obtains a closed-form solution by iterating over a Taylor expansion; the iterative procedure converges quickly and requires no more than three iterations.
2.7. ROI Compression
ROI compression is a lossy video technique that favors certain image areas over others. The research described in [35] implemented skin tone filtering to distinguish between the humans in a conversational video and the background. The video encoder adaptively adjusts the coding parameters, such as the QP, search range, candidates for mode decision, accuracy of motion vectors, and block search at the Macro Block (MB) level, according to the relative importance of each MB. Finally, the encoder assigns more resources, bits, and computation power to MBs that reside inside an ROI and fewer resources to MBs that reside in non-ROI areas.
RD optimization for conversational video detects the key features of faces, allowing separation between the human faces and the background [36]. The ROI is located around the detected faces. Largest Coding Units (LCUs) containing parts of faces are searched more deeply during the LCU encoding process and assigned a lower QP, while non-ROI LCUs receive a shallower search and higher QP values. The model uses a quadratic equation that relates the desired QP to the bit budget, the Mean Average Difference (MAD) between the original LCU and the predicted LCU, model parameters, and the weighted ratio between the bits allocated for the ROI and those assigned for the non-ROI.
A novel ROI compression scheme for the H.264 Scalable Video Coding (SVC) extension was proposed and studied in [37]; its ROI is dynamically adjustable in location, size, resolution, and bitrate. The authors of [10] allocated the Lidar bitrate budget between the ROI and non-ROI, significantly improving the quality of self-driving car depth maps. In [38], the encoder allocated the CTU bits according to their visual saliency; the QP for each CTU is then calculated using Lagrange multipliers.
An object-based ROI method using an unsupervised saliency learning algorithm for saliency detection was proposed in [39]. This study detected salient areas in two steps: detection and validation.
The authors of [40] proposed ROI compression for static camera scenes, considering as ROI those blocks that differ significantly from the previous frame and applying the discrete Chebyshev transform (DTT) instead of the conventional discrete cosine transform (DCT).
The authors of [41] critique traditional ROI image compression, in which separately predicting and coding the image is suboptimal. They propose a unified deep learning framework that performs these tasks simultaneously in an end-to-end manner for direct rate-distortion optimization. The novel encoder network concurrently generates an implicit ROI mask and multi-scale image representations to guide the bit allocation process. This jointly optimized approach yields significantly better objective performance and visual quality compared to traditional cascaded methods. In [42], ROI rate control for H.265 video conferencing was solved by applying ideas from game theory; the encoder then allocates bits between the ROI and non-ROI CTUs using Newton’s method.
2.8. HEVC Codec Applications
The most popular HEVC codec for research purposes is the HEVC Test Model (HM) reference software (the latest release is HM 18.0), developed by the Joint Collaborative Team on Video Coding (JCT-VC). It is a reliable source for ensuring compatibility and adherence to the HEVC standard. Although its time performance improved from HM-16.6 to HM-16.12, its encoding is still far too slow for real-time requirements [43]. The x265 is among the most popular HEVC codecs; it is continuously developed and improved by MulticoreWare and used or supported by many popular video applications such as FFmpeg, HandBrake, and Amazon Prime Video. It also has fast coding modes suitable for real-time applications. However, providing a QP map for each image’s CTUs is not supported.
We selected the Kvazaar [32] HEVC codec for this study. Kvazaar is an open-source academic video codec specializing in high-performance, low-latency encoding using the HEVC (H.265) standard. It is suitable for real-time applications and offers both a command-line interface and an API for flexibility. Kvazaar enables the definition of a QP ROI map per image at CTU granularity.
Teleoperation requires minimal G2G delay, necessitating low-delay video encoding. This can be achieved by limiting the encoded video GOP sequence to intra (I) and predictive (P) frames. Highly coding-efficient bi-directional (B) frames are not allowed here because their encoding process introduces an additional delay of several frames.
3. Closely Related Research Works
Several studies have explored region of interest (ROI)-based video compression and its applications, particularly in contexts relevant to autonomous vehicles and teleoperation. While these works provide a valuable foundation, they also reveal certain limitations that our current research aims to address.
The thesis in [44] explores reducing data transmission by compressing a video stream based on the importance of specific features, using static and dynamic feature selection, with more critical components receiving less compression by segmenting the image into an ROI and a region of non-interest (RONI). While this work mentions the concept of multi-category ROI compression by allowing traffic signs a lower QP, no practical demonstration is given for this use case, and it primarily focuses on a binary ROI/RONI distinction. Such a two-category approach, as our preliminary findings show, may not sufficiently enhance the quality of small, critical elements like traffic signs and lights when the general ROI is large.
Tong’s Ph.D. dissertation [45] proposed an ROI rate control for H.263 video conferencing, segmenting faces using skin color and motion vectors. First, the face in a video conversation is segmented from the image by skin color and the mosaic rule in the I frames, followed by motion vectors only for the P frames. Then, after the segmentation maps are created, ROI quadratic RC methods are applied to the H.263 video compression. This work, while pioneering for its time and application, is specific to face detection in video conferencing and utilizes older compression standards.
Bagwe’s master’s thesis [46] focuses on bandwidth saving for autonomous vehicles (AVs) by skipping redundant frames. While relevant to overall data reduction, this approach does not specifically address the perceptual quality of different regions within the transmitted frames, which is crucial for a remote operator’s ability to interpret the scene accurately.
The work by Hofbauer et al. [47] presents two techniques to improve the video quality for the remote driver. The first scheme focuses on adapting a traffic-aware multi-view video stream, beginning with traffic-aware view prioritization. This approach determines the importance of individual camera views based on the current driving situation. The second scheme reduces the bandwidth of a separate camera stream by applying dynamic elliptical masks over the ROI and blurring the area outside the ROI. This causes the video encoder to use fewer bits for the background image blocks. However, the elliptical masking and general blurring approach for a single stream, while reducing background bits, may not offer the fine-grained control needed to specifically enhance the legibility of diverse and irregularly shaped critical objects like textual traffic signs within the primary forward-facing view.
Skupin et al. [48] address rate assignment in a distributed tile encoding system for multi-resolution tiled streaming services based on the emerging VVC standard. A model for rate assignment is derived based on random forest regression using spatio-temporal activity and encodings of a learning dataset.
In a preliminary study [49], it was demonstrated that a content-adaptive encoding scheme using semantic segmentation can improve video quality within a region of interest (ROI) for remote driving applications when compared to standard encoders. The authors noted, however, that the performance gains were limited due to their method’s simplistic partitioning of the video frame into only two regions: the ROI and the background.
Many existing ROI-based encoding methods, while improving general ROI quality, often fall short in scenarios requiring highly granular prioritization. For instance, a simple two-category (ROI/background) segmentation often fails to substantially improve the legibility of small but vital elements like traffic signs and lights when these are part of a larger, general ROI. This limitation underscores the need for a more nuanced approach. Our work seeks to address this gap by proposing a multi-category ROI compression scheme. This scheme specifically differentiates between ‘strong ROIs’ (e.g., traffic signs and lights, which are often difficult to discern under compression yet are paramount for safety) and ‘weak ROIs’ (other important driving elements like roads and pedestrians), alongside the background. By allocating bandwidth with greater granularity, particularly by giving strong ROIs a significantly higher share of bits relative to their size, our method aims to demonstrably improve the perceptual quality of these most critical elements, a challenge not fully resolved by the broader or less granular ROI techniques discussed in prior works.
4. Method: Three Categories of ROI Compression
In this work, we divided the pixel classes into three separate categories. Category 2 contains the classes that are most important for the teleoperation of self-driving cars: traffic signs (standard and large) and traffic lights. During video compression, our goal is to compress CTUs in which the pixel counts of such classes exceed a certain threshold at the best video quality possible under our constraints of low bandwidth and ultra-low delay; these classes constitute the strong ROI. Category 1 contains all the pixel classes that are also important for driving, among them the roads, lane markings of various types, cars, pedestrians, and bicycles. Our ROI compression must not degrade the quality of these objects compared with standard compression. The third, category 0, comprises the background classes, including the sky, trees, vegetation, and buildings. These classes are least important for remote driving, and we can allow them to be transmitted at a lower quality.
The Cognata [9] driving simulator generates driving scenarios. Among the many sensors and maps that this simulator can simulate, we are interested in the semantic segmentation maps of each simulated video frame. Since the photo-realistic video image sequence and the related semantic segmentation images are generated by the same tool, the semantic segmentation maps are an almost perfect ground truth of their photo-realistic counterparts.
The division of the classes into the three ROI categories used in this analysis is described in Table 1. It is a specific proposal tailored to a particular driving environment and may be varied for a different one. For example, suppose there is no separation, such as a sidewalk, between the buildings and the road; in that case, it may be more practical to move the near-building pixel class from the background category to the weak ROI category. Conversely, if the acoustic wall is always well separated from the road, moving its class from the weak ROI to the background may be more effective in reducing background bandwidth.
Figure 2 explains the encoding process. Image (a) shows a non-compressed ‘png’ image taken from a photo-realistic simulated driving scenario; the frame rate is 30 fps, and the resolution is HD, 1080 × 1920 pixels. Image (b) is the semantic segmentation map of (a), where each pixel class has its unique color. Image (c) shows the grouping of the semantic map pixel classes into pixel categories according to Table 1: pixels that belong to the background classes are colored black, pixels that belong to the weak ROI classes (classes that are important for teleoperation and road understanding) are colored gray, and pixels that belong to the strong ROI, containing the traffic signs and traffic lights, are colored white. According to the HEVC standard [50], the image is divided into CTU rectangles of a constant size; in our case, the 1080 × 1920 HD image has 17 × 30 CTUs, each of 64 × 64 pixels (the CTUs of the last row are 56 × 64 pixels). Image (d) shows the conversion from pixel categories into CTU category assignments according to the following:
\[ A_{i,j} = \begin{cases} 2, & n_{2}(i,j) > T_{2} \\ 1, & n_{1}(i,j) > T_{1} \ \text{and} \ n_{2}(i,j) \le T_{2} \\ 0, & \text{otherwise,} \end{cases} \tag{8} \]
where A_{i,j} is the category assignment of the CTU at row i and column j, n_c(i,j) is the number of pixels of category c in that CTU, and T_1 and T_2 are the threshold values for categories 1 and 2, respectively.
In this analysis, T_1 and T_2 were set to 512, which is one-eighth of the number of pixels in our CTUs. We can observe in (c) that some of the small traffic signs colored white are converted into background CTUs because their category 2 pixel count does not exceed the threshold value. This thresholding saves bandwidth by not encoding at a lower QP (higher bandwidth) CTUs that are far away and contain barely visible traffic signs and traffic lights. Also, CTUs with only a few weak-ROI pixels are assigned as background CTUs. Thresholding is also essential when encoding real rather than simulated video, where the semantic segmentation maps are produced by a trained DNN that has segmentation errors, so a small number of erroneously classified pixels can be neglected. A sketch of this step follows.
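A minimal NumPy sketch of Equation (8); the class-ID sets are placeholders for the groupings of Table 1, with T_1 = T_2 = 512 as in the text:

```python
import numpy as np

CTU = 64
T1 = T2 = 512  # one-eighth of the 64 * 64 = 4096 pixels in a CTU

def ctu_category_map(seg, strong_ids, weak_ids):
    """Map a per-pixel class map (H x W int array) to per-CTU categories 0/1/2."""
    rows = -(-seg.shape[0] // CTU)  # ceiling division; 17 for 1080 rows
    cols = -(-seg.shape[1] // CTU)  # 30 for 1920 columns
    cat = np.zeros((rows, cols), dtype=np.int8)
    for i in range(rows):
        for j in range(cols):
            block = seg[i * CTU:(i + 1) * CTU, j * CTU:(j + 1) * CTU]
            n_strong = np.isin(block, strong_ids).sum()
            n_weak = np.isin(block, weak_ids).sum()
            if n_strong > T2:        # traffic signs / traffic lights dominate
                cat[i, j] = 2
            elif n_weak > T1:        # road, vehicles, pedestrians, ...
                cat[i, j] = 1
            # otherwise category 0: background
    return cat
```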
The next step is to take the 17 × 30 CTU category map of every image and convert it into ΔQP values. A category 2 CTU is assigned ΔQP = −q, category 1 is assigned ΔQP = 0 (no change from the base QP), and category 0 is assigned ΔQP = +q, where q is an integer value between one and ten; the video encoder does not allow higher values due to stability issues and the overconsumption of bandwidth. The higher the QP, the stronger the DCT/DST quantization, resulting in a coarser image and a reduced bandwidth. The non-compressed movie sequence of ‘png’ images was converted into YUV420p raw video by FFmpeg [51]; the ΔQP maps of 17 × 30 values, one map per image, were represented in binary form, and both the raw video and the ROI description were fed into the Kvazaar HEVC codec. The bitrate was set to 1 Mbps, and the low-delay operation mode was selected. The RC used is a frame-level λ-domain rate control that adjusts the frame λ while advancing along the video frames according to the frame bitrate undershoot or overshoot. Then, the base frame QP is calculated as a function of λ, and the actual QP of each CTU is calculated by the following:
\[ QP_{i,j} = QP_{\mathrm{base}} + \Delta QP_{i,j}. \tag{9} \]
To avoid out-of-range values, the result is clipped to the HEVC QP range:
\[ QP_{i,j} = \mathrm{clip}\!\left(QP_{\mathrm{base}} + \Delta QP_{i,j},\, 0,\, 51\right). \tag{10} \]
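Combining Equations (9) and (10), the ΔQP map and the clipped per-CTU QP reduce to a few array operations. The commented encoder invocation below is a sketch only; the exact Kvazaar flag names (e.g., --roi) are assumptions based on its documented ROI support and may differ between versions:

```python
import numpy as np

def delta_qp_map(cat, q):
    """Category 2 -> -q, category 1 -> 0, category 0 -> +q (Section 4)."""
    return np.where(cat == 2, -q, np.where(cat == 1, 0, q)).astype(np.int8)

def ctu_qp(base_qp, dqp):
    """Equations (9)-(10): add the offset and clip to the HEVC range [0, 51]."""
    return np.clip(base_qp + dqp.astype(np.int32), 0, 51)

# Hypothetical low-delay encoder invocation (flag names are assumptions):
#   kvazaar -i scenario.yuv --input-res 1920x1080 --input-fps 30 \
#           --bitrate 1000000 --gop 0 --roi roi_maps.txt -o scenario.hevc
```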
The video sequence was also compressed by Kvazaar without an ROI map. In this case, the rate-control method applied is an implementation of Li’s [30] λ-domain rate control, where the rate control is also optimized over the individual CTUs.
The metrics we use to evaluate the performance of the ROI compression over the three ROI categories are based on the PSNR and SSIM. Since we are interested in the mean PSNR (MPSNR) of each of the three ROI categories, we define the categorical MPSNR by the following:
\[ \mathrm{MPSNR}_{c} = \frac{1}{M_{c}} \sum_{i=1}^{M_{c}} \mathrm{PSNR}_{i}, \tag{11} \]
where M_c is the number of CTUs of category c and PSNR_i is the PSNR of the i’th CTU of that category. Similarly, we define the categorical MSSIM (mean Structural Similarity Index) by the following:
\[ \mathrm{MSSIM}_{c} = \frac{1}{M_{c}} \sum_{i=1}^{M_{c}} \mathrm{SSIM}_{i}. \tag{12} \]
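A short sketch of the categorical averaging in Equations (11) and (12), assuming per-CTU PSNR/SSIM arrays and the CTU category map from the previous step are available (the helper name is ours, for illustration):

```python
import numpy as np

def categorical_mean(metric_per_ctu, cat_map, category):
    """Equations (11)-(12): average a per-CTU metric (PSNR or SSIM) over one category."""
    values = metric_per_ctu[cat_map == category]
    return float(values.mean()) if values.size else float("nan")

# mpsnr_strong = categorical_mean(psnr_map, cat_map, 2)  # strong ROI
# mssim_weak   = categorical_mean(ssim_map, cat_map, 1)  # weak ROI
```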
8. Future Work
While this study successfully demonstrates the objective benefits of a multi-category ROI compression framework, several avenues for future research are apparent. First, a crucial next step is to conduct a task-based subjective evaluation with human teleoperators. Such a study would measure the real-world impact of our method on driver performance, reaction time, and cognitive load, providing a more definitive assessment of its effectiveness than objective metrics like PSNR and SSIM alone.
Second, a detailed analysis of the temporal dynamics of quality variation is warranted. Future work should investigate methods to ensure temporal stability, minimizing perceptible quality fluctuations or flickering between the ROI categories as they change size and position, which can be detrimental to the operator’s visual experience. A related direction is the development of accurate, practical methods for varying the baseline QP.
Finally, while this work utilized a simulator with perfect ground-truth segmentation, future research will focus on integrating our compression framework with a state-of-the-art, real-time semantic segmentation model operating on real-world video feeds. This will involve addressing the challenges of segmentation errors and ensuring the end-to-end system remains robust and efficient for deployment in live teleoperation scenarios.