1. Introduction
Global internet traffic has been experiencing a steady growth rate of approximately 22% annually, currently surpassing 33 exabytes per day [1]. This rapid increase is largely driven by the rising demand for high-definition video across various applications, such as video conferencing, security surveillance, medical care, agriculture, forestry, and online video streaming platforms like YouTube and Netflix. Despite advancements in hardware storage and network transmission technologies, the sheer size of uncompressed raw video files continues to pose significant challenges in terms of storage capacity and bandwidth requirements. As a result, video compression has emerged as a critical area of research, focused on developing methods that reduce the volume of video data while preserving as much visual quality as possible after reconstruction.
Traditionally, video encoding has relied on techniques such as the discrete cosine transform (DCT) [2] and predictive coding across spatial and temporal domains. However, deep learning-based video compression algorithms offer considerable advantages, particularly in terms of end-to-end optimization, improved quality retention, and enhanced compression ratios. Prominent works in this domain include learning-based modules for adapting conventional codecs [3,4,5,6,7,8] and end-to-end video compression models [9,10,11,12,13,14,15,16]. Moreover, Neural Representations for Videos (NeRV) models [17,18,19,20], which are based on implicit neural representations, have garnered widespread attention due to their simplicity, high adaptability, and exceptionally fast decoding speeds. Notable recent advancements include the Expedite Neural Representation for Videos (E-NeRV) [18] and Hybrid Neural Representation for Videos (HNeRV) [19], which offer significant improvements in the efficient reconstruction of video frames with superior quality compared to the original NeRV model [17].
Although E-NeRV [18] and HNeRV [19] have achieved promising results, research on NeRV still faces several limitations and challenges.
Firstly, while both E-NeRV [18] and HNeRV [19] achieve marginal improvements by adjusting the number of channels in NeRV blocks, their superior performance primarily arises from the optimization of the input embeddings in the NeRV network. In [17], Chen et al. used frame indices, which are simple scalar values, as temporal input embeddings. E-NeRV [18] further enhanced this approach by incorporating spatial coordinates as spatial embeddings. HNeRV [19] enriches the spatial embeddings by extracting feature maps from the ground-truth video frames, employing ConvNeXt [21] (a regular Convolutional Neural Network (CNN)) as an encoder. While improving the quality of input embeddings is a highly effective strategy for enhancing model performance, increasing the efficiency of the NeRV block itself remains a critical concern.
Secondly, the current best-performing model, HNeRV [19], exhibits limitations in generating visually coherent images, leading to the loss of texture and edges. Figure 1 provides an illustrative example. HNeRV [19] fails to capture the edge details of the nose and mouth when reconstructing a character’s face, and introduces noise points that affect color uniformity across the face. We hypothesize that the narrow receptive field and the absence of high-frequency information are the primary causes of this phenomenon. First, small convolutional kernels are limited in the range of features they can capture, which can lead to incorrect pixel values being generated by the network. Although increasing the kernel size effectively expands the receptive field and improves performance, it also causes the number of network parameters to grow quadratically. Second, convolution is a weighted summation operation that tends to produce smooth, low-frequency information over broad regions rather than high-frequency signals with sharp local variations. This limitation hinders the network’s ability to accurately reconstruct object edges and texture details. Although high-frequency details may be sacrificed under a constrained compression ratio, the human visual system remains highly sensitive to such details, particularly textures and edges. Loss of these elements causes videos to appear blurred, which is especially noticeable in scenes requiring fine detail, such as satellite imagery, medical videos, and game streaming.
In light of these challenges, our research is motivated by the following considerations:
Existing NeRV-type methods primarily focus on incorporating multimodal or enhanced input data to improve video reconstruction, rather than enhancing the intrinsic performance of the network modules themselves. Although modifying the input data rarely changes the number of model parameters, and therefore the compression rate, it remains essential to design a new core module that strengthens the intrinsic performance of the network.
Although current NeRV-type approaches can learn implicit representations of video frames, they lack dedicated modeling of high-frequency information, resulting in insufficient detail reconstruction. Therefore, a novel fundamental module capable of reconstructing high-frequency content is required.
Based on the aforementioned motivations, we propose an innovative approach called High-frequency Spectrum Hybrid Neural Representation for Video (HFS-HNeRV).
Figure 2 illustrates the primary architecture and workflow of HFS-HNeRV. To address the first challenge, we introduce the HFS-HNeRV block, which enhances the basic NeRV module by incorporating a high-frequency spectrum convolution module (HFSCM). HFSCM includes a high-spectral attention mechanism based on the channel–spatial attention structure of CBAM [22] and GAM [23], along with an additional convolutional layer. Channel attention reweights each channel in the feature map by integrating the information of all channels for each pixel, encouraging the model to focus on channels that are most critical to overall semantics. Spatial attention allows the model to highlight regions that are vital to global semantics along the spatial dimension. Moreover, since the channel dimension can be greatly reduced in spatial attention, a larger receptive field (such as a large convolution kernel) can be applied without substantially increasing the number of parameters. This design allows the model to integrate a wider range of local contextual information with only a minimal increase in parameter count, thereby considering more global semantic information when redistributing weights. After the attention module accentuates the important feature information, the subsequent convolutional layers not only expand the receptive field but also further fuse these attention-weighted features to generate richer and higher-quality feature representations. This modification significantly improves video frame reconstruction while maintaining a stable parameter count. The HFS-HNeRV block also exhibits excellent compatibility and generalizability, making it easily integrable into a wide range of NeRV networks without necessitating significant changes to the original architecture.
To address the second challenge, the proposed HFSCM includes a novel high-frequency enhancement attention mechanism, which leverages the Haar wavelet transform to strengthen high-frequency components. This technique effectively captures the high-frequency features within the feature map, facilitating the restoration of edge details and textures, thereby enhancing the overall image quality. Additionally, the attention mechanism enables the module to better extract and fuse global information, partially mitigating the issue of insufficient receptive fields. Furthermore, HFSCM incorporates a dual convolutional layer structure, which further refines the features enhanced by the high-frequency spectrum attention mechanism (HFSAM), resulting in richer feature representations.
We also propose a high-frequency spectrum loss function to aid in the training of the model. This loss function extracts high-frequency signals from both the predicted and ground-truth images via Fourier transform and high-pass filters and then computes the mean square error (MSE) between them. The high-frequency spectrum (HFS) loss is integrated into the overall loss function alongside the MSE loss, with a hyperparameter introduced to adjust its weight relative to the total error. This adjustment allows the model to reduce the disproportionate influence of low-frequency components, thereby encouraging greater focus on generating finer image details, such as edges and textures.
Finally, inspired by classical image and video super-resolution networks, we introduce several modifications to the decoder’s structure. Specifically, we incorporate a multi-scale feature reuse path (MSFRP), which enriches the final output feature representations by fusing feature maps from different scale layers.
In summary, our work makes the following contributions:
We propose a novel NeRV module, HFS-HNeRV block, which can be easily integrated into various NeRV networks without substantial modifications to the network architecture.
We introduce a new loss function specifically designed for high-frequency information generation, enhancing the model’s capacity to reconstruct image details.
We optimize the NeRV network design by incorporating MSFRP into the current NeRV framework.
3. Proposed Method
Figure 3a,b show the overall structure of the HFS-HNeRV network. In Section 3.1, we explain the structure and function of the key parts of the HFS-HNeRV block. Section 3.2 introduces MSFRP. Finally, the HFS loss function is described in Section 3.3.
3.1. HFS-HNeRV Block
As can be seen in Figure 3c, the HFS-HNeRV block is composed of a sub-pixel convolution module and the HFSCM.
3.1.1. Sub-Pixel Convolution Module
For the first half of the HFS-HNeRV block, we retain the sub-pixel convolution (SPC) module, which has been employed as a basic module in previous NeRV-type works. Detailed information can be found in [33]; here, we only give a brief introduction.
The SPC module integrates a convolutional layer with a pixel shuffle layer. As shown in Figure 3c, the convolutional layer expands the channel dimension of the input feature map so that the subsequent pixel shuffle layer can rearrange the additional channels into a higher spatial resolution. This dimensionality enhancement can be interpreted as the network layer extracting features, which subsequently serve as references for generating more contextually relevant features. A reduction in the number of input or output channels significantly degrades the performance of this network layer. Although increasing the size of the convolutional kernel can improve network efficiency, it also results in a considerable increase in the number of model parameters. Therefore, to ensure parameter stability, we retain the original kernel size and channel configuration.
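A minimal PyTorch sketch of such an SPC block is given below; the channel widths, kernel size, and upscale factor are illustrative assumptions rather than the configuration used in HFS-HNeRV.

```python
import torch
import torch.nn as nn

class SubPixelConv(nn.Module):
    """Convolution followed by pixel shuffle, as used in NeRV-style decoders (illustrative sketch)."""
    def __init__(self, in_channels, out_channels, upscale=2, kernel_size=3):
        super().__init__()
        # The convolution expands the channel dimension so that pixel shuffle
        # can rearrange the extra channels into a higher spatial resolution.
        self.conv = nn.Conv2d(in_channels, out_channels * upscale ** 2,
                              kernel_size, padding=kernel_size // 2)
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

# Example: a 64-channel 9x16 feature map upsampled to 32 channels at 18x32.
x = torch.randn(1, 64, 9, 16)
y = SubPixelConv(64, 32, upscale=2)(x)   # -> (1, 32, 18, 32)
```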
3.1.2. High-Frequency Spectrum Convolution Module
HFSCM is primarily composed of two components: a high-frequency spectrum attention mechanism and an additional convolutional layer. As depicted in Figure 3c, the entire module adopts a residual block structure.
The high-frequency spectrum attention mechanism (HFSAM) consists of two key parts: the channel attention layer and the frequency domain spatial attention layer. As demonstrated in Figure 4a, the channel attention layer employs a dual multi-layer perceptron structure to produce a channel attention map by extracting global information from the feature vectors at each position within the feature map. The attention map is computed with a convolutional layer $\mathrm{Conv}(\cdot)$ and a multi-layer perceptron $\mathrm{MLP}(\cdot)$ with GeLU activations, passed through the sigmoid function $\sigma(\cdot)$, and applied to the input via element-wise multiplication $\otimes$.
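A plausible sketch of such a per-pixel channel attention layer is shown below; the realisation of the MLP as 1×1 convolutions, the reduction ratio, and the activation placement are assumptions, not the exact layout of the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Per-pixel channel attention (illustrative sketch): a two-layer MLP, realised as
    1x1 convolutions, mixes the channels of every spatial position and produces a
    sigmoid-gated attention map that reweights the input feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        attn = self.sigmoid(self.mlp(x))   # channel attention map, same shape as x
        return x * attn                    # element-wise reweighting (the "⊗" above)

x = torch.randn(1, 64, 18, 32)
out = ChannelAttention(64)(x)              # -> (1, 64, 18, 32)
```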
Before introducing the frequency domain spatial attention layer, we briefly describe how the Haar wavelet transform processes the feature map. The Haar wavelet basis functions are defined by the scaling function $\phi(t)$ and the wavelet function $\psi(t)$:
$$\phi(t)=\begin{cases}1, & 0 \le t < 1,\\ 0, & \text{otherwise},\end{cases}\qquad \psi(t)=\begin{cases}1, & 0 \le t < \tfrac{1}{2},\\ -1, & \tfrac{1}{2} \le t < 1,\\ 0, & \text{otherwise}.\end{cases}$$
For a one-dimensional signal $x$ of length $N$, the corresponding low- and high-frequency operators are denoted $\mathcal{L}$ and $\mathcal{H}$, respectively:
$$(\mathcal{L}x)_k=\frac{x_{2k-1}+x_{2k}}{\sqrt{2}},\qquad (\mathcal{H}x)_k=\frac{x_{2k-1}-x_{2k}}{\sqrt{2}},$$
where $k = 1, 2, \ldots, N/2$.
Since the Haar wavelet transform is applied to the feature map on a channel-by-channel basis, only a two-dimensional Haar wavelet transform is required. For a single-channel feature map $X \in \mathbb{R}^{H \times W}$, the operators first transform each row to obtain a new matrix $X_{r}$:
$$X_{r}=\begin{bmatrix}\mathcal{L}_{r}(X) & \mathcal{H}_{r}(X)\end{bmatrix},$$
where $\mathcal{L}_{r}$ and $\mathcal{H}_{r}$ apply $\mathcal{L}$ and $\mathcal{H}$ to every row of $X$. Next, the columns of $X_{r}$ are transformed to yield the matrix $X_{rc}$:
$$X_{rc}=\begin{bmatrix}\mathcal{L}_{c}(X_{r})\\ \mathcal{H}_{c}(X_{r})\end{bmatrix},$$
where $\mathcal{L}_{c}$ and $\mathcal{H}_{c}$ apply $\mathcal{L}$ and $\mathcal{H}$ to every column of $X_{r}$. Finally, the matrix $X_{rc}$ is partitioned into four sub-regions:
$$X_{rc}=\begin{bmatrix}X_{LL} & X_{HL}\\ X_{LH} & X_{HH}\end{bmatrix},$$
where $X_{LL}$ denotes the low–low subband, representing the approximation coefficients after applying low-pass filtering in both horizontal and vertical directions; $X_{LH}$ represents the low–high subband, containing vertical detail coefficients obtained by low-pass filtering horizontally and high-pass filtering vertically; $X_{HL}$ represents the high–low subband, containing horizontal detail coefficients obtained by high-pass filtering horizontally and low-pass filtering vertically; and $X_{HH}$ denotes the high–high subband, capturing diagonal detail coefficients after applying high-pass filtering in both horizontal and vertical directions. More detailed information about the Haar wavelet transform can be found in [54].
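For illustration, a minimal channel-wise one-level 2D Haar decomposition can be written with plain tensor slicing, as in the sketch below (orthonormal $1/\sqrt{2}$ scaling is assumed; the normalisation used in the paper may differ).

```python
import torch

def haar_dwt2d(x):
    """One-level 2D Haar transform applied channel-wise to a (B, C, H, W) tensor.
    Returns the LL, LH, HL, HH subbands, each of shape (B, C, H//2, W//2).
    H and W are assumed to be even."""
    # Split rows and columns into even/odd samples.
    x00 = x[..., 0::2, 0::2]
    x01 = x[..., 0::2, 1::2]
    x10 = x[..., 1::2, 0::2]
    x11 = x[..., 1::2, 1::2]
    # Combine low-/high-pass responses in the horizontal and vertical directions.
    ll = (x00 + x01 + x10 + x11) / 2.0   # approximation (low-H, low-V)
    lh = (x00 + x01 - x10 - x11) / 2.0   # vertical details (low-H, high-V)
    hl = (x00 - x01 + x10 - x11) / 2.0   # horizontal details (high-H, low-V)
    hh = (x00 - x01 - x10 + x11) / 2.0   # diagonal details (high-H, high-V)
    return ll, lh, hl, hh

ll, lh, hl, hh = haar_dwt2d(torch.randn(1, 16, 18, 32))  # each (1, 16, 9, 16)
```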
In the frequency domain spatial attention layer, as depicted in Figure 4b, the initial convolutional layer reduces the number of channels in the input feature map. This reduction primarily aims to minimize the number of parameters in the attention layer, ensuring computational efficiency. Following this, the feature map is decomposed into frequency component sub-maps through the Haar wavelet transform, producing the low frequency–low frequency (LL) map, low frequency–high frequency (LH) map, high frequency–low frequency (HL) map, and high frequency–high frequency (HH) map. After decomposition, the sub-maps are upsampled to match the original feature map’s dimensions. Each of these four sub-maps is then multiplied by a distinct enhancement weight, followed by element-wise addition with the original input feature map. The four enhanced sub-maps are concatenated with the input feature map, forming an enriched feature representation. The concatenated feature map is subsequently processed through a sequence of normalization, activation, convolution, and sigmoid layers, resulting in the spatial attention map. The incorporation of the Haar wavelet transform enables the analysis of frequency domain information, allowing the HFSCM to capture high-frequency features more effectively. This leads to the restoration of edge details and textures within the image, thereby improving the overall quality of image generation. Additionally, the attention mechanism strengthens the module’s ability to extract and integrate global information, partially alleviating the issue of an insufficient receptive field. The spatial attention computation involves the bilinear interpolation operation $\mathrm{Up}(\cdot)$, the Haar wavelet transform $\mathrm{HWT}(\cdot)$, an enhancement factor applied to each frequency map, the sub-band feature maps $F_{LL}$, $F_{LH}$, $F_{HL}$, and $F_{HH}$ for the LL, LH, HL, and HH components, respectively, and the concatenation operation $\mathrm{Concat}(\cdot)$ that merges multiple tensors.
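The following sketch assembles these steps; it reuses the haar_dwt2d helper from the previous code block, and the reduced channel width, learnable per-band enhancement weights, normalisation choice, and kernel size are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencySpatialAttention(nn.Module):
    """Frequency-domain spatial attention (illustrative sketch).
    Requires the haar_dwt2d function defined in the previous code block."""
    def __init__(self, channels, reduced=8, kernel_size=7):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1)   # shrink channels first
        # One learnable enhancement weight per subband (LL, LH, HL, HH).
        self.band_weights = nn.Parameter(torch.ones(4))
        self.norm = nn.GroupNorm(1, 5 * reduced)
        self.act = nn.GELU()
        # A large kernel is affordable here because the channel count is small.
        self.conv = nn.Conv2d(5 * reduced, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        f = self.reduce(x)                                   # (B, r, H, W)
        bands = haar_dwt2d(f)                                # four (B, r, H/2, W/2) subbands
        enhanced = []
        for w, band in zip(self.band_weights, bands):
            up = F.interpolate(band, size=f.shape[-2:], mode="bilinear", align_corners=False)
            enhanced.append(w * up + f)                      # enhance and add back the input
        cat = torch.cat([f] + enhanced, dim=1)               # (B, 5r, H, W)
        attn = self.sigmoid(self.conv(self.act(self.norm(cat))))   # spatial map (B, 1, H, W)
        return x * attn
```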
In addition, we introduce extra convolutional layers following the HFSAM (as shown in Figure 3c) to allow the network to better focus on and exploit the high-frequency features enhanced by the HFSAM. The additional convolutional layers expand the receptive field, thereby enhancing the model’s ability to represent intricate high-frequency details. The shortcut connection contributes to the overall stability of the module during training and avoids the problem of vanishing gradients. The output feature map of the HFSCM is obtained by adding the output of these convolutional layers to the module’s input via the shortcut connection, where $\oplus$ denotes element-wise addition.
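A sketch of how the pieces above might fit together inside the residual HFSCM is given below; the layer counts and kernel sizes are assumptions, and the ChannelAttention and FrequencySpatialAttention classes are the sketches from the preceding code blocks.

```python
import torch.nn as nn

class HFSCM(nn.Module):
    """High-frequency spectrum convolution module (illustrative sketch):
    HFSAM (channel + frequency-domain spatial attention) followed by extra
    convolutions, wrapped in a shortcut connection."""
    def __init__(self, channels):
        super().__init__()
        self.channel_attn = ChannelAttention(channels)            # sketched above
        self.spatial_attn = FrequencySpatialAttention(channels)   # sketched above
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        attn = self.spatial_attn(self.channel_attn(x))   # HFSAM output
        return x + self.convs(attn)                      # shortcut: the "⊕" in the text
```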
3.2. Multi-Scale Feature Reuse Path
MSFRP, whose structure is shown in Figure 5, enables the model to capture information at multiple scales and further enhances the model’s expressiveness. In a NeRV-based network, avoiding growth in the number of parameters is an essential prerequisite. Therefore, we upsample the feature maps produced by the later decoder layers (beginning with the model’s third-to-last layer) to the same size as the model’s final output feature map via bilinear interpolation. Specifically, the feature maps from the selected layers are first reduced to 3 channels through a convolutional layer. Then, they are resized to a common spatial resolution via bilinear interpolation. Finally, the aligned feature maps are fused by element-wise addition. Bilinear interpolation performs two successive linear interpolations within a two-dimensional grid cell. Assuming the four corners of a unit grid cell are located at $(0,0)$, $(1,0)$, $(0,1)$, and $(1,1)$, the bilinear interpolation polynomial can be expressed as
$$f(x,y) \approx f(0,0)\,(1-x)(1-y) + f(1,0)\,x(1-y) + f(0,1)\,(1-x)y + f(1,1)\,xy.$$
Detailed information on bilinear interpolation can be found in [55].
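A minimal sketch of the MSFRP fusion path is shown below; the number of reused layers, their channel widths, and the 1×1 projection kernel are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFRP(nn.Module):
    """Multi-scale feature reuse path (illustrative sketch): project each selected
    intermediate feature map to 3 channels, resize it to the output resolution via
    bilinear interpolation, and fuse everything by element-wise addition."""
    def __init__(self, in_channels_list):
        super().__init__()
        self.projections = nn.ModuleList(
            [nn.Conv2d(c, 3, kernel_size=1) for c in in_channels_list]
        )

    def forward(self, features, out_size):
        fused = 0
        for proj, feat in zip(self.projections, features):
            x = proj(feat)                                           # reduce to 3 channels
            x = F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
            fused = fused + x                                        # element-wise fusion
        return fused

# Example: fuse feature maps from three decoder stages into a 720x1280 output.
feats = [torch.randn(1, 96, 90, 160), torch.randn(1, 48, 180, 320), torch.randn(1, 3, 720, 1280)]
out = MSFRP([96, 48, 3])(feats, out_size=(720, 1280))   # -> (1, 3, 720, 1280)
```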
3.3. High-Frequency Spectrum Loss
The MSE loss function is widely used in various downstream tasks within computer vision. To further direct the model’s focus towards high-frequency features in images, we introduce the HFS loss, which is based on the Fourier transform and high-pass filters, and incorporate it into the total loss function.
Specifically, we first transform both the predicted and ground-truth images into the frequency domain by employing the 2D discrete Fourier transform (DFT), implemented via PyTorch’s torch.fft.fft2. The zero-frequency component is shifted to the center of the spectrum to facilitate the application of a high-pass filter.
The high-pass filter is constructed as a binary mask that suppresses low-frequency components. Specifically, for a frequency spectrum of size $H \times W$, we set a square region whose side length is controlled by a tunable cutoff parameter m, centered at the zero-frequency point, to zero. The mask is broadcast across the batch and channel dimensions to match the shape of the input tensors.
After masking, we apply an amplification factor g to the remaining high-frequency components to emphasize fine-grained details such as edges and textures. Then, the filtered and enhanced frequency spectra are transformed back to the spatial domain by using the inverse 2D DFT. The HFS loss is defined as the mean square error between the spatial domain reconstructions derived from the high-frequency components of the predicted and ground-truth images.
The DFT and inverse DFT of a two-dimensional image can be represented as follows:
$$F(u,v)=\sum_{x=0}^{H-1}\sum_{y=0}^{W-1} f(x,y)\, e^{-j2\pi\left(\frac{ux}{H}+\frac{vy}{W}\right)},$$
$$f(x,y)=\frac{1}{HW}\sum_{u=0}^{H-1}\sum_{v=0}^{W-1} F(u,v)\, e^{\,j2\pi\left(\frac{ux}{H}+\frac{vy}{W}\right)},$$
where $H$ and $W$ represent the height and width of the image, respectively, $x$ and $y$ denote the spatial coordinates within the image, and $u$ and $v$ represent the frequency coordinates within the spectrum.
Given a video sequence $V=\{v_1, v_2, \ldots, v_T\}$ and a frame index $t$, we have a predicted image $\hat{Y}_t$ and a ground-truth image $Y_t$. The formulae for the MSE loss and the HFS loss are expressed as
$$\mathcal{L}_{\mathrm{MSE}}=\frac{1}{HWC}\left\|\hat{Y}_t-Y_t\right\|_2^2,$$
$$\mathcal{L}_{\mathrm{HFS}}=\frac{1}{HWC}\left\|\mathcal{F}^{-1}\!\left(g\cdot M\odot\mathcal{F}(\hat{Y}_t)\right)-\mathcal{F}^{-1}\!\left(g\cdot M\odot\mathcal{F}(Y_t)\right)\right\|_2^2,$$
where $H$ and $W$ represent the height and width of the image, respectively, $C$ represents the number of channels, $\hat{Y}_t$ and $Y_t$ represent the predicted and ground-truth images, and $\mathcal{F}$, $\mathcal{F}^{-1}$, $M$, and $g$ denote the discrete Fourier transform, the inverse discrete Fourier transform, the binary high-pass filter mask, and the high-frequency amplification factor, respectively.
The total loss function can be expressed as
$$\mathcal{L}=\mathcal{L}_{\mathrm{MSE}}+\lambda\,\mathcal{L}_{\mathrm{HFS}},$$
where $\lambda$ is the weight that controls the influence of the HFS loss; it is set to 0.12 in the experiments.
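A compact sketch of the loss computation is given below; the cutoff convention (a square of half-size m around the shifted zero frequency) and the placement of the amplification factor g follow our reading of the description above and are assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def hfs_loss(pred, target, m=16, g=2.0):
    """High-frequency spectrum loss (illustrative sketch).
    pred/target: (B, C, H, W) images. m: half-size of the suppressed low-frequency
    square. g: amplification of the surviving high-frequency components."""
    B, C, H, W = pred.shape
    # Binary high-pass mask: zero out a square around the (shifted) zero frequency.
    mask = torch.ones(H, W, device=pred.device)
    cy, cx = H // 2, W // 2
    mask[cy - m:cy + m, cx - m:cx + m] = 0.0
    mask = mask.view(1, 1, H, W)                       # broadcast over batch and channels

    def high_freq_part(img):
        spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
        spec = spec * mask * g                         # suppress low, amplify high frequencies
        return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

    return F.mse_loss(high_freq_part(pred), high_freq_part(target))

def total_loss(pred, target, lam=0.12):
    """Total loss: L = L_MSE + lambda * L_HFS, with lambda = 0.12 as in the paper."""
    return F.mse_loss(pred, target) + lam * hfs_loss(pred, target)
```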
5. Conclusions
In this paper, we present HFS-HNeRV, a NeRV network optimized for learning high-frequency features in the frequency domain. To support its training, we introduce a specialized loss function that targets high-frequency features, thereby improving the model’s ability to reproduce fine details such as edges and textures. Specifically, we propose the HFSCM and the HFS loss, which enable the model to focus on and learn high-frequency information in the frequency domain more effectively.
Quantitative results reveal that HFS-HNeRV significantly outperforms other NeRV-based networks, including NeRV, E-NeRV, and HNeRV, achieving improvements in PSNR of +5.75 dB, +4.53 dB, and +1.05 dB, respectively. In terms of visual reconstruction quality, HFS-HNeRV demonstrates superior performance in restoring edge textures and produces images with more cohesive and natural color distributions. Importantly, both HFSCM and HFS loss exhibit a high degree of flexibility, allowing them to be easily integrated into a variety of NeRV architectures, thus offering substantial benefits for tasks related to video compression and reconstruction.