Article

Compression of High-Component Gaussian Mixture Model (GMM) Based on Multi-Scale Mixture Compression Model †

School of Automation, Pukou Campus, Nanjing University of Information Science and Technology, Nanjing 210031, China
*
Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in the Special Issue “New Trends in Distributed Estimation and Control of Network Autonomous Systems” of Electronics.
Electronics 2025, 14(24), 4858; https://doi.org/10.3390/electronics14244858
Submission received: 3 November 2025 / Revised: 5 December 2025 / Accepted: 8 December 2025 / Published: 10 December 2025

Abstract

This study addresses the redundancy problem caused by an excessive number of components in Gaussian mixture models (GMMs) in practical applications, together with the resulting issues of overfitting and exponential growth in computational complexity, and proposes a component reduction method based on the GMM multi-scale mixture compression model (GMMultiMixer). Traditional GMM compression methods are limited by local optima, which can lead to model distortion and difficulty in handling complex multi-peak distributions. Drawing on the multi-scale hybrid architecture and dynamic feature extraction capabilities of the TimeMixer++ model, this paper proposes the GMMultiMixer model to reconstruct the weights, means, and covariance parameters of a GMM, thereby achieving an optimal approximation of the original model. Experimental results demonstrate that this method significantly outperforms traditional strategies in terms of the KL divergence metric, particularly when fitting multi-modal, high-dimensional complex distributions, and it can also handle the compression of two-dimensional GMMs. Additionally, when combined with Kalman filtering for unmanned aerial vehicle (UAV) state estimation, this compression strategy effectively improves the system's computational efficiency and state estimation accuracy.

1. Introduction

In the field of data science, real-world data distributions often exhibit high complexity: from Red–Green–Blue (RGB) image pixel probability distributions and semantic segmentation feature maps in computer vision to Bidirectional Encoder Representations from Transformers (BERT) word vector sequences and syntactic dependency tree structures in natural language processing, multi-modal characteristics have become a typical feature of data distributions [1,2]. However, single-Gaussian distribution models can only describe uni-modal data patterns, and their representational capabilities are fundamentally limited when dealing with multi-modal data [3,4].
As an important statistical modeling framework, the GMM is widely applied in fields such as deep learning [5], signal processing [6,7], and image processing [8]. By combining multiple Gaussian distributions through weighted combinations, Gaussian mixture models (GMMs) can accurately capture multi-modal and nonlinear data distributions, making them a foundational tool in data science research [9,10]. For example, in speech recognition, a GMM can model the features of speech signals: the voice features of each speaker or word are represented as a weighted combination of Gaussian distributions, so each speaker or word corresponds to its own Gaussian mixture model, whose parameters are continuously adjusted on training data to optimize the probability distribution of the speech signals [11]. GMMs also have important applications in drone state estimation. Estimating the drone state involves multiple parameters such as position, velocity, acceleration, and direction; however, environmental uncertainties and noise affect the sensor data, and traditional Kalman filters may perform poorly due to nonlinearities [12,13]. A GMM can model the various state parameters of a drone as multiple Gaussian distributions, with each Gaussian distribution representing a potential flight state. By continuously updating the weights and parameters of each component based on flight data, the GMM achieves dynamic state estimation [6,7].
In principle, a GMM can approximate any continuous probability distribution to arbitrary accuracy by increasing the number of components K [14,15]. However, in practical applications, high-component GMMs face the following problems:
  • Overfitting problem: High-component GMM is prone to overfitting on training data, especially when the data itself is not that complex. Excessive components can cause the model to model noise or random features in the data, thereby losing its true reflection of the underlying structure of the data [16,17];
  • Redundant component problem: GMM with a high number of components may exhibit redundant Gaussian components, i.e., some Gaussian components do not make a significant contribution, or multiple Gaussian components overlap in the same region. These redundant components make the model more complex without contributing to the expression of the data features [16,18];
  • Computational complexity issue: The total number of parameters in a GMM depends on both the feature space dimensionality D and the number of components M. Specifically, each Gaussian component contains one weight parameter, D mean parameters, and $\frac{D(D+1)}{2}$ unique covariance parameters (due to the symmetry of the covariance matrix). Thus, the total number of parameters of a GMM is
    $M \times \left(1 + D + \frac{D(D+1)}{2}\right) + (M-1)$
    where the additional $(M-1)$ term accounts for the constraint that the component weights sum to 1. As M or D increases, the total number of parameters grows faster than linearly, leading to high computational complexity for high-component or high-dimensional GMMs [6,19]; a short worked example follows this list. For example, in the subsequent UAV state estimation application scenario (Section 5), the number of GMM components grows exponentially with the time step (Equation (21)), which directly leads to explosive growth in computational complexity.
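As a quick illustration of the counting expression above, the following minimal Python helper (not from the original paper) evaluates it for given M and D, using exactly the counting convention stated in the bullet:

```python
def gmm_param_count(M: int, D: int) -> int:
    """Parameter count of an M-component, D-dimensional GMM, using the
    counting convention stated above (per-component weight, mean, and unique
    covariance entries, plus the (M - 1) term for the sum-to-one constraint)."""
    per_component = 1 + D + D * (D + 1) // 2
    return M * per_component + (M - 1)

# Example: the 2D, 128-component GMM used later in Experiment 3
print(gmm_param_count(128, 2))   # 128 * 6 + 127 = 895
```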
To avoid these situations, we can compress the high-component GMM. The essence of GMM compression is to find the optimal approximation of the original model with a lower-complexity probability distribution. Merging strategies based on distance metrics [20,21] formed the core framework of early GMM compression. Ref. [22] proposed a greedy merging algorithm using the Kullback–Leibler divergence (KL divergence) as the metric: it iteratively selects a pair of Gaussian terms from the original GMM to merge so as to maximize the similarity between the simplified and original models, establishing a statistical-distance optimization paradigm between the simplified model and the original distribution. However, since each merge inevitably increases the KL divergence, the simplified model becomes distorted whenever the locally optimal merge still causes a substantial increase in KL divergence.
Subsequent research [6] introduced a variable-step truncation–fusion mechanism to mitigate this issue to some extent. This method introduces a truncation strategy to reduce the number of components in the original GMM by discarding the least influential Gaussian components, then dynamically selects the better strategy by comparing the KL divergence increments between the fusion strategy and the truncation strategy. However, it does not fundamentally resolve the issue that each iteration of the KL divergence greedy merging algorithm inevitably causes KL divergence increases. This is because both methods are constrained by the local optimality trap of gradually reducing components during simplification. Essentially, they pursue local optimal solutions in each iteration and cannot find the optimal approximation scheme for the original distribution from a global perspective. This makes them prone to model distortion when facing complex GMM with multiple peaks or high dimensions, due to the cumulative error of local decisions [23].
Additionally, traditional methods have poor adaptability to complex distributions, as their design logic relies on simple merging or truncation methods, lacking the ability to deeply capture the global features of GMM [24]. When the original model contains a large number of overlapping components, multi-scale fluctuations, or high-dimensional features, traditional methods either lose critical distribution details due to over-merging similar components or ignore potential important features due to blindly truncating low-weight components [25,26]. Subsequent experimental results also demonstrate the limitations of traditional methods in handling complex Gaussian mixture functions.
In recent years, the development of artificial intelligence methods has provided new ideas for solving the GMM simplification problem. Among them, deep learning has shown significant advantages due to its ability to model complex data distributions and efficiently learn features [27]. In this context, the TimeMixer++ model, a general time series analysis framework with a unique multi-scale hybrid architecture and dynamic modeling capabilities, has provided important inspiration for dealing with complex distribution problems [28]. Although TimeMixer++ was originally designed for time series tasks, a GMM is essentially a mixture of independent components [9], so the same multi-scale architecture can be adapted to capture the multi-level features of different GMMs.
For this reason, this paper focuses on the core challenge of high-component GMM simplification and achieves a key result: a higher-fidelity approximation of the original GMM with far fewer components, drawing on the architectural design of TimeMixer++ and proposing the GMMultiMixer simplified model, with the KL divergence as the standard indicator for evaluating the simplification effect. Experimental results verify this core advantage explicitly: for a 1D GMM with 128 components, when simplified to only 8 components (a 16:1 compression ratio), the KL divergence of GMMultiMixer is only on the order of 0.0001, while that of traditional methods reaches about 0.87 (Figure 1), representing an approximation accuracy improvement of over 99%. For 2D high-component GMMs, the model maintains extremely low KL divergence while reducing computational complexity. Compared with traditional fusion and truncation strategies, this model reduces the KL divergence between the simplified GMM and the original model, achieving a better approximation of the original distribution. At the same time, for complex GMM data with multiple peaks, multiple fluctuations, or high dimensionality, the model breaks through the local-optimum trap of traditional methods by fusing and modeling local and global features, capturing the key features of the original distribution and handling complex data. In addition, this paper combines the model with Kalman filtering and applies it to unmanned aerial vehicle (UAV) state estimation, which improves the efficiency and accuracy of UAV state estimation.

2. Problem Formulation

The compression problem of GMM essentially involves the simplified representation of high-dimensional complex probability distributions: given an original GMM with K Gaussian components, it needs to be compressed into a simplified model containing only M Gaussian components (where K > M ), aiming to reduce the model complexity by decreasing the number of components while preserving as many key features of the original GMM as possible.

2.1. Notation and Preliminaries

The probability density function of a GMM with K Gaussian components is defined as
$P(x) = \sum_{k=1}^{K} w_k \cdot \mathcal{N}(x; \mu_k, P_k),$
where K denotes the number of components in the original Gaussian mixture model; $w_k$, $\mu_k$, and $P_k$ are, respectively, the weight (satisfying $\sum_{k=1}^{K} w_k = 1$), mean vector, and covariance matrix of the k-th Gaussian component, and $\mathcal{N}(\cdot)$ denotes the Gaussian distribution.
The compression goal is to construct a simplified GMM:
$Q(x) = \sum_{m=1}^{M} \hat{w}_m \cdot \mathcal{N}(x; \hat{\mu}_m, \hat{P}_m),$
where $M < K$, and $\hat{w}_m$, $\hat{\mu}_m$, and $\hat{P}_m$ are the parameters of the simplified model (satisfying $\sum_{m=1}^{M} \hat{w}_m = 1$). The component weights, means, and covariances of the simplified model must be adjusted to minimize the distribution difference between the simplified and original models.
A commonly used method to measure the difference between these two GMMs (P and Q) is the Kullback–Leibler divergence (KL divergence), whose core idea is to quantify the difference between two probability distributions. A smaller value indicates a better compression effect, and the goal is to make the distribution of the simplified model as close as possible to that of the original model. Its calculation formula is
$D_{\mathrm{KL}}(P \| Q) = \int_{\mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} \, dx$
By further substituting the probability density functions of the original and simplified GMMs, we obtain the parameterized optimization objective for minimizing the KL divergence, which is expressed as follows:
$\arg\min_{\hat{w}_m, \hat{\mu}_m, \hat{P}_m} D_{\mathrm{KL}}(P \| Q) = \arg\min_{\hat{w}_m, \hat{\mu}_m, \hat{P}_m} \int_{\mathcal{X}} \sum_{k=1}^{K} w_k \, \mathcal{N}(x; \mu_k, P_k) \log \frac{\sum_{k=1}^{K} w_k \, \mathcal{N}(x; \mu_k, P_k)}{\sum_{m=1}^{M} \hat{w}_m \, \mathcal{N}(x; \hat{\mu}_m, \hat{P}_m)} \, dx$
Subject to the constraints:
$\sum_{m=1}^{M} \hat{w}_m = 1, \quad \hat{w}_m \geq 0, \quad P_k \succ 0, \quad \hat{P}_m \succ 0$
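As a concrete illustration of the notation above, the following minimal Python sketch (not part of the paper's code; it assumes full covariance matrices and relies on scipy.stats.multivariate_normal) evaluates the density of a GMM $P(x)$ from its weights, means, and covariances:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, weights, means, covs):
    """Evaluate the GMM density P(x) = sum_k w_k * N(x; mu_k, P_k).

    weights: (K,)      component weights summing to 1
    means:   (K, D)    component mean vectors
    covs:    (K, D, D) component covariance matrices
    x:       (N, D) array of evaluation points (or (N,) when D = 1)
    """
    D = means.shape[1]
    x = np.asarray(x, dtype=float).reshape(-1, D)
    density = np.zeros(x.shape[0])
    for w, mu, P in zip(weights, means, covs):
        density += w * multivariate_normal.pdf(x, mean=mu, cov=P)
    return density
```

The same routine can be used for both the original model P and a candidate simplified model Q when evaluating the objective above numerically.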

2.2. Existing Challenges in GMM Compression

Theoretically, an explicit analytical solution exists only when compressing to a single component:
mean: $\mu_{\mathrm{opt}} = \sum_{k=1}^{K} w_k \mu_k$,
covariance: $P_{\mathrm{opt}} = \sum_{k=1}^{K} w_k \left[ P_k + (\mu_k - \mu_{\mathrm{opt}})(\mu_k - \mu_{\mathrm{opt}})^{\top} \right]$
However, it is evident that a single-component Gaussian model cannot fully capture the characteristics of complex data.
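For reference, the moment-matching solution of Equations (6) and (7) can be written in a few lines of NumPy; the sketch below is a minimal illustration under the stated assumptions (full covariance matrices, weights already normalized):

```python
import numpy as np

def moment_match_single_gaussian(weights, means, covs):
    """Collapse a GMM into one Gaussian by moment matching (Equations (6)-(7)).

    weights: (K,), means: (K, D), covs: (K, D, D)
    Returns the optimal single-component mean and covariance.
    """
    mu_opt = np.einsum("k,kd->d", weights, means)              # weighted mean
    diffs = means - mu_opt                                      # (K, D)
    spread = np.einsum("k,ki,kj->ij", weights, diffs, diffs)    # between-component spread
    P_opt = np.einsum("k,kij->ij", weights, covs) + spread      # within + between covariance
    return mu_opt, P_opt
```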
For the case where M > 1 , traditional methods mainly rely on iterative component merging or truncation. Early work [22] established a greedy merging algorithm using KL divergence as a metric, iteratively merging the most similar pair of components. Although this method can compress GMM components, it is trapped in a local optimum because each merge inevitably increases the KL divergence, leading to accumulated errors and model distortion, especially for complex multi-modal distributions [23].
Subsequent studies, such as the variable-step truncation–fusion mechanism [6], attempt to alleviate this problem by dynamically choosing to merge or discard low-weight components. Although this represents a more advanced non-deep-learning solution, it still operates within the paradigm of local, greedy optimization. It fails to escape the fundamental trap of irreversible local decisions, thus being unable to find a globally optimal approximation scheme [23,24]. Therefore, when dealing with complex GMM with a large number of overlapping components or high-dimensional features, these methods often oversimplify due to over-merging, losing key details, or retaining redundancy due to blind component truncation [16,18].

2.3. Overview of the Proposed Method

To address the local optimality trap and the poor adaptability of traditional methods to complex distributions, this paper adopts a deep learning approach, particularly leveraging the ideas of the TimeMixer++ model in multi-scale feature fusion and dynamic modeling [28], and proposes the GMMultiMixer compression model. Our goal is to directly learn the nonlinear mapping from the original high-component GMM to the simplified low-component GMM through a data-driven and globally-aware network architecture, thereby breaking through the limitations of traditional methods and achieving more accurate and efficient model compression.

3. Simplification of Complex GMM Based on the GMMultiMixer Model

Traditional methods for simplifying GMMs suffer from high computational complexity, model distortion, and an inability to fit complex GMMs. The introduction of the GMMultiMixer model provides a new approach to solving these problems. This paper proposes the GMMultiMixer model by referring to the framework of the TimeMixer++ model; GMMultiMixer inherits the logical ideas of the multi-scale hybrid architecture and the dynamic feature extraction mechanism. This inspiration is not a direct migration, but rather a set of adaptive improvements made in combination with the characteristics of GMM data (which lacks the periodicity of time series).

3.1. Overview of the TimeMixer++ Model

TimeMixer++ is a general-purpose time series analysis model that processes multi-scale time series data to capture multi-scale and multi-periodic features. Its architecture primarily consists of an input projection, a MixerBlock stack, and an output projection.
  • The input time series data is first down-sampled, converting the original time series into a sequence of $(M+1)$ scales;
  • The input projection then captures the interaction information between variables in the time series data through channel mixing and embedding operations;
  • The MixerBlock then extracts features from the time series from the perspectives of seasonal and trend characteristics through modules such as Multi-Resolution Time Imaging (MRTI), Time Image Decomposition (TID), Multi-Scale Mixing (MCM), and Multi-Resolution Mixing (MRM);
  • The output projection makes predictions based on features at different scales and weights the integrated results of predictions at various scales to enhance the robustness of the model’s predictions.
Drawing on the architectural ideas of TimeMixer++, the core insight of GMMultiMixer lies in leveraging a data-driven deep learning framework to fundamentally address the local optimum trap and insufficient global feature capture capability faced by traditional GMM compression methods. Compared with existing models, the innovation and differentiation of this study are mainly reflected in the shift from local iteration to global mapping. Existing traditional methods rely on greedy, step-by-step merging or truncation strategies, where each decision is based on local information, making them prone to error accumulation and model distortion. In contrast, GMMultiMixer directly learns the complex nonlinear mapping from high-component GMM to low-component GMM through an end-to-end neural network. This global optimization paradigm enables it to consider the interactions of all components at once, thereby finding a simplification scheme closer to the global optimum.

3.2. Principles and Improvements of GMMultiMixer

First, there is an essential difference between GMM data and time series data: most GMM data has neither periodicity nor trend characteristics. For example, the GMM in [6] describes the current state data of a drone, and these data have no time series characteristics. Therefore, the GMMultiMixer model replaces the extraction of temporal features used in the TimeMixer++ model with an analysis of the features of individual Gaussian components and of how these components interact to form the features of the overall GMM.
To intuitively present the core architecture of the GMMultiMixer model, Figure 2 shows the overall framework of the model. This framework clearly reflects how the model converts high-component GMM parameters into a unified input format, captures local and global features through key modules (MRTI, TID, MCM, MRM), and finally parses them into valid simplified GMM parameters, laying the foundation for the subsequent sections to further elaborate on the principles and functions of each module.

3.2.1. Multi-Resolution Time Imaging

In the step where the Multi-Resolution Time Imaging (MRTI) module extracts the Top-K significant periods, the extracted data represents features derived from the parameters of the original GMM. We change the number of extracted high-frequency periods to the number of target components after simplification, so as to learn the features of the target components from the start. The core logic of this adjustment draws an analogy between the periodic attention to time series data in TimeMixer++ and the component features of a GMM. In TimeMixer++, high-frequency periods are crucial for capturing the dynamics of a time series; in GMMultiMixer, the M target components are the key to approximating the original distribution. By setting the number of extracted features to M, the model is guided to focus on the most representative parameter patterns during the initial feature extraction stage, laying a foundation for subsequent component fusion and reconstruction.

3.2.2. Time Image Decomposition

In the subsequent Time Image Decomposition (TID) module, since GMM data does not possess periodicity or trend characteristics—instead, GMM is characterized by the relative independence of its components while their mutual interaction forms the global data features—this module no longer learns the seasonal or trend features of the data. Instead, it adopts column-wise attention and row-wise attention to replace the original learning objectives, aiming to learn the local features of individual Gaussian components and the global features of the Gaussian mixture model.
Column-axis attention (focusing on the local features of a single Gaussian component) models the local features of the parameters (weights w k , mean μ k , and covariance Σ k ) of the k-th Gaussian component in the GMM by calculating the correlation weights of the internal parameters of the component and aggregating the unique features of that component. Let the parameter vector of a single Gaussian component be
$f_k = \left[ w_k, \mu_k^{(1)}, \mu_k^{(2)}, \ldots, \mu_k^{(D)}, \Sigma_k^{(1,1)}, \ldots, \Sigma_k^{(D,D)} \right]$
where D is the data dimension. The column-axis attention weight $\alpha_{k,i,j}$ represents the correlation strength between the i-th parameter and the j-th parameter in the k-th component, and it is calculated as
$\alpha_{k,i,j} = \mathrm{Softmax}\!\left( \frac{q_{k,i} \cdot k_{k,j}}{\sqrt{d}} \right)$
where $q_{k,i}$ and $k_{k,j}$ are the query vector and key vector of the i-th and j-th parameters in the k-th component, respectively, and d is the dimension of the parameter vector. The local feature vector of the k-th component is obtained by the weighted sum
$l_k = \sum_{j=1}^{L} \alpha_{k,i,j} \cdot v_{k,j}$
where $L = 1 + D + \frac{D(D+1)}{2}$ (the total length of the parameters of a single component), $v_{k,j}$ is the value vector of the j-th parameter in the k-th component, and $l_k$ is the output that focuses on the local features of that component. Row-axis attention is modeled across all Gaussian components by calculating the correlation weights between different components and aggregating global distribution statistics (such as the distribution trend of component weights and the overall shift in the mean). Let the parameter matrix of all components be
$F = \left[ f_1, f_2, \ldots, f_K \right]^{\top}$
where K is the original number of components. To compute global features, we first derive global query, key, and value vectors for each component through linear transformations of the local feature vectors $l_k$:
$Q_k = W_Q l_k, \quad K_l = W_K l_l, \quad V_l = W_V l_l$
where $W_Q$, $W_K$, and $W_V$ are learnable weight matrices. The row-axis attention weight $\beta_{k,l}$ represents the global correlation strength between the k-th component and the l-th component, and it is calculated as
$\beta_{k,l} = \mathrm{Softmax}\!\left( \frac{Q_k \cdot K_l}{\sqrt{K}} \right)$
where $Q_k$ and $K_l$ are the global query vector and key vector of the k-th and l-th components, respectively, and $\sqrt{K}$ is the normalization factor. The global statistical feature vector is obtained by averaging the weighted value vectors over all components:
$g = \frac{1}{K} \sum_{k=1}^{K} \sum_{l=1}^{K} \beta_{k,l} \cdot V_l$
where $V_l$ is the global value vector of the l-th component, and $g$ is the global feature output that captures the overall distribution characteristics of all components.
This design enables the module to map the probability distribution parameters of the GMM to local and global features, where local patterns preserve the uniqueness of each component, and global trends reflect the overall distribution pattern.
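The following PyTorch sketch illustrates one possible way to realize the column-axis and row-axis attention described above. It is a simplified, assumption-laden illustration (the embedding size, number of heads, and pooling layer are our choices, not the authors'), not the published implementation:

```python
import torch
import torch.nn as nn

class GMMParameterAttention(nn.Module):
    """Column-axis (intra-component) and row-axis (inter-component) attention
    over a flattened GMM parameter matrix F of shape (K, L),
    where L = 1 + D + D(D+1)/2. Hidden size d_model is an assumption."""

    def __init__(self, param_len: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(1, d_model)          # embed each scalar parameter
        self.col_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.row_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.pool = nn.Linear(param_len * d_model, d_model)

    def forward(self, F: torch.Tensor):
        K, L = F.shape
        tokens = self.embed(F.unsqueeze(-1))                # (K, L, d_model)
        # Column-axis attention: parameters attend to each other within a component.
        local, _ = self.col_attn(tokens, tokens, tokens)    # (K, L, d_model)
        l_k = self.pool(local.reshape(K, -1))               # (K, d_model) local features
        # Row-axis attention: components attend to each other across the mixture.
        comp = l_k.unsqueeze(0)                             # (1, K, d_model)
        global_feats, _ = self.row_attn(comp, comp, comp)   # (1, K, d_model)
        g = global_feats.mean(dim=1).squeeze(0)             # (d_model,) global feature
        return l_k, g
```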

3.2.3. Multi-Scale Mixing

In the Multi-Scale Mixing module (MCM), ‘scales’ refer to the granularity of parameter features: fine scales correspond to detailed parameter variations of individual components, while coarse scales correspond to aggregated features of similar components.
Local features are processed using a top-down approach. Transposed convolutions are used to propagate coarse-scale features to fine scales, and similar components are fused to reduce redundancy. Let the representation of the local feature at scale m in the k-th cycle of the l-th layer be $l_m^{(l,k)}$; then
$l_m^{(l,k)} = l_m^{(l,k)} + \mathrm{2D\text{-}TransConv}\!\left( l_{m+1}^{(l,k)} \right), \quad m \in \{M-1, \ldots, 0\}$
where $l_{m+1}^{(l,k)}$ is the local feature at the coarse scale $(m+1)$, and $\mathrm{2D\text{-}TransConv}(\cdot)$ uses a transposed convolution to upsample the coarse-scale feature to the fine scale m, thereby merging similar local components and reducing redundancy.
Global features (the overall statistical patterns of all components) are processed using a bottom-up approach, aggregating fine-scale features into coarse-scale patterns through convolution, so that the global features serve as constraints. Let the representation of the global feature at scale m in the k-th cycle of layer l be $g_m^{(l,k)}$; then
$g_m^{(l,k)} = g_m^{(l,k)} + \mathrm{2D\text{-}Conv}\!\left( g_{m-1}^{(l,k)} \right), \quad m \in \{1, \ldots, M\}$
where $g_{m-1}^{(l,k)}$ is the global feature at the fine scale $(m-1)$, and $\mathrm{2D\text{-}Conv}(\cdot)$ down-samples the fine-scale feature to the coarse scale m via convolution, aggregating it into a global trend that serves as a constraint to prevent model distortion during simplification.
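A minimal PyTorch sketch of this top-down/bottom-up mixing is given below; the channel count, kernel size, and the assumption that adjacent scales differ by a factor of two in spatial resolution are ours, not taken from the paper:

```python
import torch
import torch.nn as nn

class MultiScaleMixing(nn.Module):
    """Sketch of the MCM idea: top-down mixing of local features with
    transposed convolutions and bottom-up mixing of global features with
    strided convolutions. Channel count and kernel sizes are assumptions."""

    def __init__(self, channels: int = 32, num_scales: int = 4):
        super().__init__()
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
             for _ in range(num_scales - 1)]
        )
        self.down = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=2, stride=2)
             for _ in range(num_scales - 1)]
        )

    def forward(self, local_feats, global_feats):
        # local_feats / global_feats: lists of tensors, scale 0 is the finest;
        # each coarser scale has half the spatial resolution of the previous one.
        M = len(local_feats) - 1
        # Top-down: propagate coarse local features to finer scales (Eq. (15)).
        for m in range(M - 1, -1, -1):
            local_feats[m] = local_feats[m] + self.up[m](local_feats[m + 1])
        # Bottom-up: aggregate fine global features into coarser scales (Eq. (16)).
        for m in range(1, M + 1):
            global_feats[m] = global_feats[m] + self.down[m - 1](global_feats[m - 1])
        return local_feats, global_feats
```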

3.2.4. Multi-Resolution Mixing

The Multi-Resolution Mixing (MRM) module achieves component aggregation and feature retention through dynamic weight allocation. Specifically, component weights and KL divergence are used as contribution metrics: weight values represent the influence of the component on the mixed model, while the KL divergence value between the component and the global distribution represents the uniqueness of the component.
Calculation of component significance weights:
$\hat{W}_k = \frac{w_k \cdot \exp\left(\lambda \cdot D_{\mathrm{KL}}(k, \mathrm{global})\right)}{\sum_{k'=1}^{K} w_{k'} \cdot \exp\left(\lambda \cdot D_{\mathrm{KL}}(k', \mathrm{global})\right)}$
where D K L ( k , global ) is the KL divergence between the k-th component and the global GMM distribution, w k is the original weight of the k-th component, and  λ is a regularization parameter. After generating the retention priority weights through Softmax normalization, the similar components are then subjected to weighted parameter fusion (weighted average of mean and covariance), which reduces the number of compressed components while maintaining the statistical characteristics of the original distribution.
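In NumPy, the significance weights of Equation (17) can be computed as in the following sketch, assuming the per-component KL divergences to the global distribution have already been evaluated (the value of λ is an assumption):

```python
import numpy as np

def significance_weights(weights, kl_to_global, lam=1.0):
    """Component significance weights of Equation (17).

    weights:       (K,) original component weights w_k
    kl_to_global:  (K,) KL divergence of each component to the global GMM
    lam:           regularization parameter lambda (value is an assumption)
    """
    scores = weights * np.exp(lam * kl_to_global)   # heavy and unique components score high
    return scores / scores.sum()                    # normalize so the weights sum to 1
```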

3.2.5. Parameter Parsing and Model Reconstruction of the Simplified GMM

In the final output processing stage, the fused features output by the MixerBlock are parsed into simplified GMM weights, means, and covariance parameters to reconstruct the compressed GMM.

3.2.6. Integration Mechanism of GMMultiMixer and Kalman Filter

The core advantage of the GMMultiMixer model lies in its ability to compress high-component GMMs into low-component ones globally and efficiently. This enables it to integrate seamlessly into the GMM-based Kalman filter framework, addressing issues such as the exponential explosion in the number of components encountered in state estimation for unmanned aerial vehicles (UAVs) over Non-Acknowledgment (NACK) networks. This subsection elaborates on the integration steps and the collaborative working mechanism of the two. The integration of GMMultiMixer and the Kalman filter follows a "Monitor-Compress-Update" cyclic process, the core of which is embedding GMMultiMixer as an efficient compression module into the recursive estimation loop of the Kalman filter. The specific steps are as follows:
  • Forward Prediction and Update: At time step k, based on the posterior GMM $GM(x_{k-1}, M)$ at the previous time step, the posterior GMM $GM(x_k, K)$ at the current time step is computed through the prediction and update equations of the Kalman filter, taking into account all possible data packet loss scenarios. This process leads to a sharp increase in the number of components K of the GMM compared with that at time step $k-1$; for the detailed formulas, refer to Equations (21) and (22) in the subsequent sections.
  • Component Number Monitoring and Compression Triggering: The system continuously monitors the number of components K of the current posterior GMM in real time. Once K exceeds the preset real-time computing capacity threshold $K_{\mathrm{threshold}}$, the GMMultiMixer compression module is triggered. This design ensures that the system's computational load remains controllable at all times.
  • Global Compression: The parameters (weights $\{w_k\}$, means $\{\mu_k\}$, and covariances $\{\Sigma_k\}$) of the high-component GMM $GM(x_k, K)$ to be compressed are fed as inputs to the GMMultiMixer model. After the multi-scale feature extraction and fusion process described earlier, the model directly outputs all parameters of the compressed low-component GMM $GM(x_k, M)$, where $M \ll K$. This step replaces the computationally intensive iterative merging or truncation strategies of traditional methods.
  • Prior Transmission: The compressed GMM $GM(x_k, M)$ serves as the prior distribution for the recursive computation of the Kalman filter at the next time step $(k+1)$. Since the number of components has been reduced from K to M, the computational complexity of the prediction and update in the next round is significantly reduced.
  • Cyclic Execution: The above process is executed cyclically at each estimation time step, thus keeping the number of GMM components stable around the controllable value M during long-term estimation and avoiding the exponential growth of computational complexity.
This integration strategy fundamentally improves the performance of the traditional GMM-Kalman filter in two respects: computational efficiency and estimation accuracy. In terms of computational efficiency, GMMultiMixer achieves compression through a single forward propagation, and its approximately linear computational complexity completely avoids the iterative KL divergence calculations on the order of $O(K^2)$ required by traditional methods. This shifts the complexity of the filter update from depending on K to depending on the fixed M, achieving a stepwise improvement. In terms of estimation accuracy, the Multi-Scale Mixing architecture of GMMultiMixer can globally capture and retain the key features of the original distribution, overcoming the model distortion caused by the local optimality of traditional methods. This effectively suppresses error accumulation, significantly improves the accuracy and robustness of long-term estimation, and performs particularly well in phases with strong nonlinear dynamics such as UAV sharp turns.
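The following Python sketch summarizes the "Monitor-Compress-Update" cycle; `kalman_gmm_step` and `gmmultimixer_compress` are hypothetical placeholders for the GMM-based Kalman recursion and the trained compression network, and the threshold values are illustrative only:

```python
# Sketch of the "Monitor-Compress-Update" cycle described above.
# `kalman_gmm_step` and `gmmultimixer_compress` are hypothetical placeholders.
K_THRESHOLD = 64   # real-time computing capacity threshold (assumed value)
M_TARGET = 8       # number of components after compression (assumed value)

def estimate_trajectory(posterior_gmm, measurements, kalman_gmm_step, gmmultimixer_compress):
    estimates = []
    for y_k in measurements:
        # 1. Forward prediction and update: the number of components grows
        #    because every packet-loss hypothesis spawns new components.
        posterior_gmm = kalman_gmm_step(posterior_gmm, y_k)
        # 2-3. Monitor the component count and trigger global compression.
        if posterior_gmm.num_components > K_THRESHOLD:
            posterior_gmm = gmmultimixer_compress(posterior_gmm, M_TARGET)
        # 4-5. The (possibly compressed) GMM is the prior for the next step.
        estimates.append(posterior_gmm.mean())   # point estimate of the state
    return estimates
```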

3.3. Implementation of a Simplified Model Based on GMMultiMixer

The original Gaussian mixture model (GMM) is determined by three types of parameters: component weights, component means, and component covariances. Therefore, the original GMM data can be flattened and converted into an input format suitable for the model's input layer. Assuming the original GMM consists of K Gaussian components, each with independent normalized weights, means, and covariances, these parameters are arranged in a fixed order and then flattened into a one-dimensional vector of length $K \times \left(1 + D + \frac{D(D+1)}{2}\right)$ (comprising K component weights, the D-dimensional means, and the $\frac{D(D+1)}{2}$-dimensional covariances). This fixed order ensures that subsequent modules can accurately locate all parameters of a single component and thus correctly compute the correlations between the parameters within a component. This design linearly maps all GMM parameters into a unified one-dimensional structure, which is then fed into the input projection.
Subsequently, Channel Mixing maps the flattened parameter sequence into a high-dimensional dense feature space; while preserving the intrinsic associations of each component, it enhances the discriminability of subtle differences between components. The subsequent Feature Embedding reshapes the mixed high-dimensional features into a tensor structure compatible with the core modules (MRTI, TID, MCM, and MRM), ensuring that the core modules can efficiently parse local and global features. These fused features are finally passed to the Output Projection module, which parses the high-dimensional fused features back into a structured parameter sequence matching the dimensional requirements of the simplified GMM. The final output dimension of the model is $M \times \left(1 + D + \frac{D(D+1)}{2}\right)$ (where M is the number of components in the simplified GMM, and $M < K$). Specifically, the first M elements serve as the weights of the simplified model, the next $M \times D$ elements are reshaped into a matrix to act as the means, and the last $M \times \frac{D(D+1)}{2}$ elements are reshaped into a tensor to function as the covariances.
To ensure the physical validity of the simplified GMM, the reconstructed covariance matrices must be symmetric and positive definite. For each output covariance matrix $\hat{P}_m$, we first enforce symmetry by computing $\hat{P}_m \leftarrow \frac{1}{2}\left(\hat{P}_m + \hat{P}_m^{\top}\right)$. We then ensure positive definiteness by performing an eigenvalue decomposition $\hat{P}_m = V \Lambda V^{\top}$ and replacing any non-positive eigenvalues in the diagonal matrix $\Lambda$ with a small positive value $\epsilon = 1 \times 10^{-6}$ before reconstructing the matrix. This two-step correction guarantees that all covariance matrices in the simplified GMM are valid, as illustrated by the sketch below.
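The two-step covariance correction can be written compactly in NumPy, as in the following sketch (for a symmetric matrix the eigenvector matrix is orthogonal, so its inverse equals its transpose):

```python
import numpy as np

def correct_covariance(P_hat, eps=1e-6):
    """Make a reconstructed covariance matrix symmetric and positive definite.

    P_hat: (D, D) raw covariance output of the network
    eps:   floor for the eigenvalues (1e-6, as in the text)
    """
    P_sym = 0.5 * (P_hat + P_hat.T)                 # enforce symmetry
    eigvals, eigvecs = np.linalg.eigh(P_sym)        # eigendecomposition of the symmetric matrix
    eigvals = np.clip(eigvals, eps, None)           # replace non-positive eigenvalues
    return eigvecs @ np.diag(eigvals) @ eigvecs.T   # reconstruct a valid covariance
```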
In this paper, the KL divergence is still used as the loss function to measure the distribution difference between the simplified and original GMMs. Its core role is to minimize this loss during the simplification of GMM components, ensuring that the simplified GMM retains distribution characteristics similar to those of the original GMM and avoiding the loss of data features and distribution information.
However, due to the infeasibility of analytical calculation of the KL divergence between GMM, a deterministic grid sampling method is adopted to achieve numerical approximation. A uniform grid with a fixed spacing of 0.01 between points is defined over the primary probability region of the original GMM distribution P ( x ) . This region is set to cover all component means ±3 standard deviations to ensure the capture of the vast majority of the distribution’s mass. The probability densities under both P ( x ) and the simplified model Q ( x ) are evaluated at these predetermined grid points. The final KL divergence is approximated by summing over these discrete grid points:
$D_{\mathrm{KL}}(P \| Q) \approx \sum_{i=1}^{N} P(x_i) \log \frac{P(x_i)}{Q(x_i)} \, \Delta x$
where $P(x_i)$ is the probability density of point $x_i$ under the original GMM distribution P, $Q(x_i)$ is the probability density of point $x_i$ under the simplified GMM distribution Q, $\{x_i\}_{i=1}^{N}$ are the points of the predefined grid, and $\Delta x$ is the volume element, which is constant (= 0.01) for the uniform grid.
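A minimal NumPy/SciPy sketch of this grid-based approximation for one-dimensional GMMs is shown below (the clipping of the densities away from zero is a numerical safeguard we add, not part of the formula):

```python
import numpy as np
from scipy.stats import norm

def grid_kl_1d(w_p, mu_p, sigma_p, w_q, mu_q, sigma_q, dx=0.01):
    """Grid approximation of D_KL(P || Q) for two 1D GMMs (Equation (18)).

    The grid covers all component means of P plus/minus 3 standard deviations.
    All inputs are 1D NumPy arrays of per-component parameters.
    """
    lo = (mu_p - 3 * sigma_p).min()
    hi = (mu_p + 3 * sigma_p).max()
    x = np.arange(lo, hi, dx)
    p = sum(w * norm.pdf(x, m, s) for w, m, s in zip(w_p, mu_p, sigma_p))
    q = sum(w * norm.pdf(x, m, s) for w, m, s in zip(w_q, mu_q, sigma_q))
    p = np.maximum(p, 1e-300)                      # numerical safeguards against log(0)
    q = np.maximum(q, 1e-300)
    return np.sum(p * np.log(p / q) * dx)
```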
For multivariate GMMs, direct sampling-based estimation of the KL divergence may introduce significant errors due to the high dimensionality. We therefore adopt an analytical approximation of the KL divergence that leverages the closed-form expressions of Gaussian distributions:
$D_{\mathrm{KL}}(P \| Q) \approx \sum_{k=1}^{K} w_k \left[ \log \frac{w_k}{\sum_{m=1}^{M} \hat{w}_m \, \mathcal{N}(\mu_k; \hat{\mu}_m, \hat{\Sigma}_m)} + \frac{1}{2} \, \mathbb{E}_{x \sim \mathcal{N}(\mu_k, \Sigma_k)}\!\left[ \log \frac{\mathcal{N}(x; \mu_k, \Sigma_k)}{Q(x)} \right] \right]$
where $\mathcal{N}(\mu_k; \hat{\mu}_m, \hat{\Sigma}_m)$ is the probability density of the k-th original component's mean under the m-th simplified component. This analytical approximation balances computational efficiency and accuracy and avoids the limitations of sampling-based methods in high dimensions. In practice, we verified that the approximation error is below 1% for $D \leq 10$, which is negligible for our compression task.

3.4. The Pseudocode Algorithm

To systematically present the implementation process of the GMMultiMixer-based high-component GMM compression method, the following pseudocode details the key steps from input parameter flattening to simplified model reconstruction. Algorithm 1 integrates the multi-scale feature fusion mechanism and covariance correction strategy described in Section 3.2 and Section 3.3, ensuring that each operation corresponds to the theoretical framework and maintains the global optimization paradigm of the model.
Algorithm 1 High-Component GMM Compression Algorithm
Require: Original GMM parameters $\{w_k, \mu_k, \Sigma_k\}$ for $k = 1, \ldots, K$; target component count $M$ ($M < K$); data dimension $D$
Ensure: Compressed GMM parameters $\{\hat{w}_m, \hat{\mu}_m, \hat{\Sigma}_m\}$ for $m = 1, \ldots, M$
1: Step 1: Input Representation and Flattening
2: Initialize input_matrix $\leftarrow [\,]$
3: for $k = 1$ to $K$ do
4:    Flatten the parameters of the $k$-th component (Equation (9)):
5:       $f_k = [w_k, \mu_k^{(1)}, \ldots, \mu_k^{(D)}, \mathrm{vec}(\Sigma_k)]$
6:    input_matrix.append($f_k$)
7: end for
8: $F \leftarrow$ stack(input_matrix)    (shape: $K \times (1 + D + D^2)$)
9: Step 2: Multi-Resolution Time Imaging (MRTI)
10: Extract $M$ target component features (Section 3.2.1):
11:    $H = \mathrm{MRTI}(F, M)$    (output shape: $M \times \mathrm{feature\_dim}$)
12: Step 3: Time Image Decomposition (TID)
13: Column-axis attention for local features (Equations (10) and (11)):
14: for $k = 1$ to $K$ do
15:    $\alpha_{k,ij} = \mathrm{Softmax}\!\left( \frac{q_{k,i} \cdot k_{k,j}}{\sqrt{d}} \right)$
16:    $l_k = \sum_{j=1}^{L} \alpha_{k,ij} \cdot v_{k,j}$
17: end for
18: Row-axis attention for global features (Equations (13) and (14)):
19:    $\beta_{k,l} = \mathrm{Softmax}\!\left( \frac{Q_k \cdot K_l}{\sqrt{K}} \right)$
20:    $g = \frac{1}{K} \sum_{k=1}^{K} \sum_{l=1}^{K} \beta_{k,l} \cdot V_l$
21: Step 4: Multi-Scale Mixing (MCM)
22: Top-down processing of local features (Equation (15)):
23: for $m = M-1$ down to $0$ do
24:    $l_m^{(l,k)} = l_m^{(l,k)} + \mathrm{2D\text{-}TransConv}(l_{m+1}^{(l,k)})$
25: end for
26: Bottom-up processing of global features (Equation (16)):
27: for $m = 1$ to $M$ do
28:    $g_m^{(l,k)} = g_m^{(l,k)} + \mathrm{2D\text{-}Conv}(g_{m-1}^{(l,k)})$
29: end for
30: Step 5: Multi-Resolution Mixing (MRM)
31: Calculate component significance weights (Equation (17)):
32: for $k = 1$ to $K$ do
33:    $\hat{W}_k = \frac{w_k \cdot \exp(\lambda \cdot D_{\mathrm{KL}}(k, \mathrm{global}))}{\sum_{k'=1}^{K} w_{k'} \cdot \exp(\lambda \cdot D_{\mathrm{KL}}(k', \mathrm{global}))}$
34: end for
35: Perform weighted parameter fusion:
36: for $m = 1$ to $M$ do
37:    $\hat{\mu}_m = \mathrm{weighted\_average}(\mu_k, \hat{W}_k)$    (mean fusion)
38:    $\hat{\Sigma}_m = \mathrm{weighted\_average}(\Sigma_k, \hat{W}_k)$    (covariance fusion)
39:    $\hat{w}_m = \hat{W}_k$ for the fused components    (weight aggregation)
40: end for
41: Step 6: Parameter Parsing and Model Reconstruction
42: output_vector $\leftarrow$ Model_Output    (shape: $M \times (1 + D + D^2)$)
43: $\hat{w} = \mathrm{output\_vector}[0{:}M]$    (first $M$ elements are the weights)
44: $\hat{\mu} = \mathrm{reshape}(\mathrm{output\_vector}[M : M + M \times D], (M, D))$
45: $\hat{\Sigma} = \mathrm{reshape}(\mathrm{output\_vector}[M + M \times D :\,], (M, D, D))$
46: Step 7: Covariance Matrix Correction (Section 3.3)
47: for $m = 1$ to $M$ do
48:    Enforce symmetry: $\hat{\Sigma}_m = 0.5 \times (\hat{\Sigma}_m + \hat{\Sigma}_m^{\top})$
49:    Ensure positive definiteness:
50:       $V, \Lambda = \mathrm{eigen\_decomposition}(\hat{\Sigma}_m)$
51:       $\Lambda_{\mathrm{corrected}} = \max(\Lambda, 1 \times 10^{-6})$
52:       $\hat{\Sigma}_m = V \, \Lambda_{\mathrm{corrected}} \, V^{-1}$
53: end for
54: Step 8: Loss Calculation and Optimization
55: if $D == 1$ then
56:    1D case: use the discrete grid approximation (Equation (18)):
57:       $D_{\mathrm{KL}} = \sum_{i=1}^{N} P(x_i) \log \frac{P(x_i)}{Q(x_i)} \, \Delta x$
58: else
59:    High-dimensional case: use the analytical approximation (Equation (19)):
60:       $D_{\mathrm{KL}} = \sum_{k=1}^{K} w_k \left[ \log \frac{w_k}{\sum_{m=1}^{M} \hat{w}_m \mathcal{N}(\mu_k; \hat{\mu}_m, \hat{\Sigma}_m)} + \frac{1}{2} \mathbb{E}_{x \sim \mathcal{N}(\mu_k, \Sigma_k)}\!\left[ \log \frac{\mathcal{N}(x; \mu_k, \Sigma_k)}{Q(x)} \right] \right]$
61: end if
62: Optimize the model parameters: $\theta \leftarrow \theta - \eta \nabla_{\theta} D_{\mathrm{KL}}$
63: return Compressed GMM parameters $\{\hat{w}_m, \hat{\mu}_m, \hat{\Sigma}_m\}$ for $m = 1, \ldots, M$
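To connect the algorithm to an implementation, the following PyTorch sketch shows one training step covering Steps 1 and 6-8; the `model`, `optimizer`, and `kl_loss_fn` objects are assumed to exist (the network architecture itself is not reproduced here), and the softmax used to normalize the output weights is our assumption for enforcing the sum-to-one constraint:

```python
import torch

def training_step(model, optimizer, gmm_params, M, D, kl_loss_fn):
    """One optimization step of the compression model (Steps 1 and 6-8 of Algorithm 1).

    model:       network mapping the flattened GMM parameters to an output vector
                 of length M * (1 + D + D*D)  (architecture not shown here)
    gmm_params:  tuple (w, mu, Sigma) of tensors with shapes (K,), (K, D), (K, D, D)
    kl_loss_fn:  differentiable KL approximation (Equation (18) or (19))
    """
    w, mu, Sigma = gmm_params
    # Step 1: flatten each component into [w_k, mu_k, vec(Sigma_k)] and stack.
    F = torch.cat([w.unsqueeze(1), mu, Sigma.reshape(len(w), -1)], dim=1)
    out = model(F.flatten())                              # forward pass
    # Step 6: parse the output vector into simplified GMM parameters.
    w_hat = torch.softmax(out[:M], dim=0)                 # weights sum to 1 (assumption)
    mu_hat = out[M:M + M * D].reshape(M, D)
    Sigma_hat = out[M + M * D:].reshape(M, D, D)
    # Step 7: symmetrize the covariances (eigenvalue flooring omitted for brevity).
    Sigma_hat = 0.5 * (Sigma_hat + Sigma_hat.transpose(1, 2))
    # Step 8: KL loss between the original and the simplified GMM, then backprop.
    loss = kl_loss_fn((w, mu, Sigma), (w_hat, mu_hat, Sigma_hat))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```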

4. Comparative Verification Experiments and Expansion

Firstly, this experiment utilizes Python 3.9.16's built-in random functions to construct a parameter generation function for the GMM. This function generates the weights, means, and covariances of the GMM within specified ranges to ensure data randomness, and the resulting parameters are used to construct sample data for the Gaussian mixture function. The weights are generated as K random numbers following a standard normal distribution, which are then normalized with the softmax function so that the weights of all components sum to 1. The means are generated uniformly at random within the interval [−10, 10], and the covariance matrices are diagonal, with their diagonal elements (variances) randomly sampled from the interval [0.5, 2.0], as sketched below.
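The paper states that Python's built-in random functions are used; the NumPy sketch below reproduces the same generation procedure (softmax-normalized weights, means uniform in [−10, 10], diagonal covariances with variances in [0.5, 2.0]) and is an equivalent illustration rather than the original script:

```python
import numpy as np

def generate_random_gmm(K, D, seed=None):
    """Generate random GMM parameters following the setup described above.

    Weights:     softmax of K standard-normal draws (so they sum to 1).
    Means:       uniform in [-10, 10] per dimension.
    Covariances: diagonal, with variances uniform in [0.5, 2.0].
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(K)
    weights = np.exp(z) / np.exp(z).sum()                       # softmax normalization
    means = rng.uniform(-10.0, 10.0, size=(K, D))
    variances = rng.uniform(0.5, 2.0, size=(K, D))
    covs = np.stack([np.diag(v) for v in variances])            # (K, D, D) diagonal covariances
    return weights, means, covs
```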
Experiment 1 simplifies a GMM with dimension $D = 1$ and $K = 128$ Gaussian components, generating simplified GMMs with component counts $M_1 = 8$, $M_2 = 16$, $M_3 = 32$, and $M_4 = 64$. Experiment 2 simplifies high-component GMMs with dimension $D = 1$ and $K_1 = 1024$ and $K_2 = 512$ Gaussian components, generating a simplified GMM with $M = 64$ components. Experiment 3 simplifies a two-dimensional GMM with dimension $D = 2$ and $K = 128$ Gaussian components, generating simplified GMMs with component counts $M_1 = 8$, $M_2 = 16$, and $M_3 = 32$. The main metric for evaluating compression quality is the KL divergence between the original GMM $P(x)$ and the simplified GMM $Q(x)$, as defined in Equation (18).
In all subsequent experiments, the so-called “traditional model” or “traditional method” specifically refers to the variable-step truncation–fusion algorithm proposed in [6]. This method is selected as a representative of traditional GMM compression methods because it represents a relatively advanced, non-deep-learning solution, capable of dynamically choosing between truncation and fusion strategies. Compared with simple greedy merging or pure truncation methods, it provides a more robust and advanced comparison benchmark for this study.

4.1. Experiment 1: Comparative Verification of Simplification Effects for One-Dimensional GMM

The results of Experiment 1 (Figure 1 and Table 1) show that GMMultiMixer exhibits accuracy advantages across all compressed component counts. When M = 64, the KL divergence of GMMultiMixer is only 0.000035, a reduction of approximately 99.93% compared with the traditional model's 0.051326. When M = 32, the KL divergence of the traditional model surges to 0.4198759, while that of GMMultiMixer drops further to 0.000026, a reduction of 99.994%. Even under strong compression (M = 8), the KL divergence of GMMultiMixer is still kept within 0.000110. Additionally, Table 1 reveals an interesting phenomenon: with the GMMultiMixer simplified model, the KL divergence for M = 32 is actually lower than that for M = 64. This reflects that the model does not simply reduce components proportionally during simplification, but instead analyzes the characteristics of the GMM data in depth and learns the distribution of the original model. The model can adaptively adjust its simplification strategy based on the intrinsic characteristics of the data: whether simplified to 64 or 32 components, it optimizes according to the actual data. This adaptability and flexibility are among its core advantages.

4.2. Experiment 2: Verification of Extreme Simplification Capability for One-Dimensional High-Component GMM

The results of Experiment 2 (Figure 3) show that even when original high-component GMMs with 1024 or 512 components are aggressively compressed to only 64 components, GMMultiMixer can still capture their fundamental distribution characteristics. The KL divergence between the simplified and original models remains below $10^{-3}$ in both cases. This demonstrates the model's robustness and extreme simplification capability, effectively alleviating the overfitting and redundancy issues associated with high-component GMMs, and proves that the method scales to application scenarios that may initially contain a very large number of components.

4.3. Experiment 3: Verification of Simplification Effects for Two-Dimensional GMM

For the simplification problem of two-dimensional GMM, the computational complexity of traditional methods is particularly prominent. Each Gaussian component consists of one weight parameter, two-dimensional mean parameters, and a 2 × 2 symmetric covariance matrix, with a single component involving a total of six parameters. When the original model contains K components, the total parameter scale amounts to 6K. Since traditional fusion and truncation strategies require frequent calculations of the KL divergence between components to determine the optimization direction, the computation of KL divergence in the two-dimensional case involves numerical integration of multi-dimensional probability density functions, whose complexity is much higher than that in the one-dimensional scenario. This leads to a sharp increase in the overall computation time, a viewpoint that has also been confirmed in experiments. This drawback greatly limits its application in practical scenarios with high real-time requirements.
In contrast, the deep learning method based on GMMultiMixer proposed in this paper avoids the exponentially increasing computational burden of traditional methods by virtue of its multi-scale hybrid architecture and dynamic feature fusion mechanism. The results of Experiment 3 (Figure 4 and Table 2) show that, under different values of M, GMMultiMixer consistently achieves a 35% to 45% reduction in KL divergence compared with the traditional model, highlighting its advantage in fitting complex multi-dimensional distributions. More importantly, when traditional methods struggle to run in real time due to their computational cost and latency, GMMultiMixer can handle 2D compression tasks efficiently, thereby extending GMM compression to higher-dimensional problems.

4.4. Analysis of the Significance of Experimental Results

Based on the data from the three groups of experiments above, the GMMultiMixer model demonstrates significant advantages over traditional GMM compression methods in three key aspects: distribution approximation accuracy, adaptability to extremely complex scenarios, and computational efficiency.
The model avoids the local optimal trap of traditional methods and, by leveraging its multi-scale hybrid architecture, truly captures the features of GMM data from a global perspective to accomplish the component compression task.

5. Application of the GMMultiMixer Model in Unmanned Aerial Vehicle (UAV) State Estimation

5.1. System Setup and Problem Description

In a Non-Acknowledgment (NACK) wireless network system based on the User Datagram Protocol (UDP), the state estimation of hydrogen-powered aircraft with ultra-long endurance capabilities must achieve accurate inference of system states under complex network constraints. During ultra-long endurance operations, the significant increase in signal transmission distance exacerbates signal attenuation and noise interference, which in turn leads to more severe network-induced data packet loss. Meanwhile, the system lacks an acknowledgment mechanism and thus cannot promptly obtain information about the data transmission status, rendering traditional state estimation algorithms difficult to apply directly to such long-distance flight control scenarios. The system dynamics are described by discrete-time linear equations, and the state equation is
$x_k = A x_{k-1} + \alpha_k B u_k + \omega_k$
where $x_k \in \mathbb{R}^n$ is the UAV state vector (e.g., position, velocity) at time k, A is the state transition matrix, B is the control input matrix, $u_k \in \mathbb{R}^m$ is the control input vector, $\alpha_k$ is a Bernoulli random variable characterizing the control input packet loss status ($P(\alpha_k = 1) = \alpha$ indicates successful transmission, $P(\alpha_k = 0) = 1 - \alpha$ indicates loss), and $\omega_k \sim \mathcal{N}(0, Q)$ is zero-mean Gaussian process noise with covariance matrix $Q > 0$. The observation equation is
$y_k = \begin{cases} C x_k + v_k, & \beta_k = 1 \\ \varnothing, & \beta_k = 0 \end{cases}$
where $y_k \in \mathbb{R}^d$ is the observation vector (e.g., sensor measurements), C is the observation matrix, $\beta_k$ is a Bernoulli random variable characterizing the observation packet loss status ($P(\beta_k = 1) = \beta$ indicates successful reception, $P(\beta_k = 0) = 1 - \beta$ indicates loss), $\varnothing$ denotes the absence of observation data, and $v_k \sim \mathcal{N}(0, R)$ is zero-mean Gaussian measurement noise with covariance matrix $R > 0$. This system exhibits two key characteristics: first, the independence of packet loss, i.e., $\alpha_k$ and $\beta_k$ are independent and identically distributed (i.i.d.) random variables with constant loss rates $1 - \alpha$ and $1 - \beta$ independent of the system state; second, the lack of an acknowledgment mechanism, i.e., there is no dedicated channel through which the estimator can receive the true packet loss sequences $\{\alpha_1, \alpha_2, \ldots, \alpha_k\}$ and $\{\beta_1, \beta_2, \ldots, \beta_k\}$, which presents a fundamental challenge for state estimation.
The objective of UAV state estimation is to compute the optimal state estimate $\hat{x}_k$ based on the historical observation data $Y_k \triangleq \{y_1, y_2, \ldots, y_k\}$, minimizing the estimation error covariance $P_k \triangleq \mathbb{E}\left[ (x_k - \hat{x}_k)(x_k - \hat{x}_k)^{\top} \mid Y_k \right]$, where $\mathbb{E}[\cdot]$ denotes the mathematical expectation. However, in the NACK system, the unknown packet loss status necessitates describing the posterior probability density function $f(x_k \mid Y_k)$ by a GMM. Specifically, each possible packet loss sequence corresponds to one Gaussian component, leading to an exponential increase in the number of components over time:
$f(x_k \mid Y_k) = \sum_{i=1}^{2^k} \pi_k^i \, G_{x_k}\!\left( \hat{x}_k^i, P_k^i \right) \triangleq GM\!\left(x_k, 2^k\right)$
where $G_{x_k}(\hat{x}_k^i, P_k^i)$ represents the i-th Gaussian component (with mean $\hat{x}_k^i$ and covariance $P_k^i$) and $\pi_k^i$ is its weight, satisfying $\sum_{i=1}^{2^k} \pi_k^i = 1$. This exponential growth makes the optimal estimator practically infeasible. Therefore, the core problem in UAV state estimation is how to compress the number of components while preserving the essential characteristics of the GMM distribution.
However, it should be noted that although the number of components of the GMM data evolves dynamically, the GMMs processed by the GMMultiMixer in subsequent experiments are still static. Specifically, the GMMultiMixer performs component compression on the GMM data only when the number of components of the GMM to be compressed reaches a predefined threshold.

5.2. Experimental Results and Analysis

To validate the effectiveness of GMMultiMixer in practical engineering scenarios, the experiment strictly adheres to the same hardware platform and network environment parameters as those in ref. [6] to ensure a fair comparison. Based on this benchmark setup, the detailed experimental parameters are as follows:

5.2.1. System Model and Initial Conditions

The motion of the hydrogen-powered UAV in the horizontal X-Y plane is modeled as a four-dimensional linear discrete system with state vector $x_k = [x_k, V_{x_k}, y_k, V_{y_k}]^{\top}$ (where $x_k$, $y_k$ are the horizontal positions and $V_{x_k}$, $V_{y_k}$ are the corresponding velocities). The system state equation is
$\begin{bmatrix} x_k \\ V_{x_k} \\ y_k \\ V_{y_k} \end{bmatrix} = \begin{bmatrix} 1 & T & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & T \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{k-1} \\ V_{x_{k-1}} \\ y_{k-1} \\ V_{y_{k-1}} \end{bmatrix} + \begin{bmatrix} 0.5T^2 & 0 \\ T & 0 \\ 0 & 0.5T^2 \\ 0 & T \end{bmatrix} \omega_{k-1}$
where the sampling period is $T = 0.01\,\mathrm{s}$, and the initial state is set to $x_0 = [0, 10, 0, 10]^{\top}$ (position in m, velocity in m/s). The process noise $\omega_k$ and measurement noise $v_k$ are zero-mean Gaussian noises with covariance matrices $Q > 0$ and $R > 0$, respectively.
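For reference, the state transition matrix and the noise input matrix of this model can be assembled as follows; the name G for the matrix multiplying the process noise is ours (the paper writes it inline in the state equation):

```python
import numpy as np

T = 0.01  # sampling period in seconds

# State transition matrix A and noise input matrix G of the 4D constant-velocity
# model with state [x, Vx, y, Vy], as given in the state equation above.
A = np.array([
    [1.0, T,   0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, T  ],
    [0.0, 0.0, 0.0, 1.0],
])
G = np.array([
    [0.5 * T**2, 0.0       ],
    [T,          0.0       ],
    [0.0,        0.5 * T**2],
    [0.0,        T         ],
])

x0 = np.array([0.0, 10.0, 0.0, 10.0])   # initial state: position in m, velocity in m/s

def step(x_prev, omega):
    """Propagate the UAV state one step: x_k = A x_{k-1} + G omega_{k-1}."""
    return A @ x_prev + G @ omega
```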

5.2.2. Network and Detection Settings

The experiment adopts a Non-Acknowledgment (NACK) wireless network based on UDP, in which both control-input and observation packets suffer from random dropouts. The packet loss rate (PLR) is set to 20% and characterized by independent and identically distributed (i.i.d.) Bernoulli random variables: the successful transmission probability of control-input packets is $\alpha = 0.8$ (PLR $\bar{\alpha} = 0.2$) and that of observation packets is $\beta = 0.8$ (PLR $\bar{\beta} = 0.2$). No dedicated acknowledgment channel is configured, and packet dropouts are implemented by randomly discarding 20% of the transmitted data.

5.2.3. Experimental Environment and Cruising Route

The outdoor experiment is conducted in a test field with dimensions 300 × 300 × 1000 cm (length × width × height). The UAV cruises at a fixed altitude of 10 m, following a circular route with a radius of 2.5 m. The real-time position of the UAV is recorded by flight control logs, which serve as the reference for evaluating estimation accuracy.

5.2.4. Comparison and Analysis of Results

As shown in Figure 5, the obtained flight trajectory is projected onto the X-Y plane from a top-down view, and the estimation results of the GMMultiMixer model (yellow) are compared with those of the optimal estimator (red), the one-step estimator (gray), and the variable step-size estimator (blue). Although the optimal estimator provides the best estimation performance, its computational burden increases rapidly over time, which makes it impractical for real-time tracking. The one-step estimator approximates the Gaussian mixture model by a single Gaussian function, which leads to serious distortion of the probability density function when fitting complex data (refer to Figure 1). The computational complexity of the variable step-size estimator [6] is lower than that of the optimal estimator, and its fitting ability is stronger than that of the one-step estimator; however, when dealing with complex and two-dimensional GMMs, it suffers from accuracy degradation and increased computational complexity. In contrast, GMMultiMixer has a stronger ability to handle complex data and a lower computational burden.
In terms of estimation accuracy, the estimation error of the GMMultiMixer model is significantly smaller than that of the traditional variable-step method, both in smooth trajectory segments and abrupt curvature segments. At the sharp turning segments of the trajectory, the system’s dynamic nonlinearity intensifies, and the posterior probability distribution often exhibits a complex multi-modal shape. Due to the loss of key modal information during the compression process, the traditional variable-step method results in obvious lag and deviation in its estimated trajectory at the turning points. In contrast, relying on its multi-scale feature fusion capability, GMMultiMixer better preserves the components describing different possible states during compression, thereby achieving accuracy closer to that of the optimal estimator at these key nodes with a significant reduction in error. The Mean Absolute Error (MAE) metric fully confirms the above observations. When compressed to 64 components, the accuracy of GMMultiMixer is 50.18% higher than that of the traditional method; when subjected to extreme compression to 8 components, this advantage further expands to 61.54%. This proves that GMMultiMixer is not only effective under conventional compression but also better able to retain the essential characteristics of the distribution under extreme compression, avoiding the dramatic performance degradation of traditional methods caused by over-simplification.
Furthermore, in terms of computational efficiency, the single estimation time of the GMMultiMixer model is significantly shorter than that of the variable-step method, which is mainly attributed to its compression mechanism. The traditional variable-step method must perform, online and at each estimation step, complex component evaluation, Kullback–Leibler (KL) divergence calculations, and iterative merging/truncation decisions, with a single-step computational complexity that scales with the square of the original number of components K or a higher power. In contrast, the proposed GMMultiMixer model learns a nonlinear mapping from the high-component GMM to the low-component GMM through offline training. In the deployment phase, the model directly and quickly generates a simplified GMM with only M components through a single forward propagation, thereby reducing the computational burden of the Kalman filter updates in state estimation from depending on K to depending on M. Since $M \ll K$, this direct reduction in the number of components achieves a stepwise decrease in computational complexity. Even during continuous estimation in long-endurance cruising, the efficient compression mechanism maintains a stable and fast response, avoiding the computational delays caused by repeated iterative calculations in traditional methods and thus better meeting the real-time requirements of UAVs.
In summary, by optimizing the GMM compression strategy, the proposed model achieves both smaller estimation errors and shorter computation times in UAV state estimation, providing a practical solution for high-precision, high-efficiency state estimation in real-world scenarios.
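As background for the KL figures reported in Tables 1 and 2 and in Section 6, note that the KL divergence between two GMMs has no closed form and is typically estimated by Monte Carlo sampling from the original mixture. The sketch below is a generic one-dimensional version of such an estimator; it is not claimed to be the evaluation code used in this work, and the example mixtures are arbitrary.

```python
import numpy as np

def gmm_logpdf(x, weights, means, stds):
    """Log-density of a one-dimensional GMM at the points x."""
    x = np.asarray(x)[:, None]
    comp = -0.5 * ((x - means) / stds) ** 2 - np.log(stds * np.sqrt(2 * np.pi))
    return np.log(np.exp(comp) @ weights + 1e-300)

def gmm_sample(n, weights, means, stds, rng):
    """Draw n samples from a one-dimensional GMM."""
    idx = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[idx], stds[idx])

def kl_monte_carlo(p, q, n=200_000, seed=0):
    """Estimate KL(p || q) for two 1-D GMMs by sampling from p."""
    rng = np.random.default_rng(seed)
    x = gmm_sample(n, *p, rng)
    return np.mean(gmm_logpdf(x, *p) - gmm_logpdf(x, *q))

# p: original 3-component mixture; q: a coarser 2-component surrogate.
p = (np.array([0.5, 0.3, 0.2]), np.array([-2.0, 0.0, 3.0]), np.array([0.7, 1.0, 0.9]))
q = (np.array([0.6, 0.4]), np.array([-1.5, 2.5]), np.array([1.2, 1.1]))
print(kl_monte_carlo(p, q))
```

Because the estimate is stochastic, a large sample size (or averaging over several random seeds) is needed before comparing values at the 10⁻⁵ level reported in Table 1.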

6. Conclusions

This study proposes the GMMultiMixer model by drawing on the basic framework of the TimeMixer++ model and improves the simplification of GMMs. By integrating multi-scale features and adopting a dynamic weight-allocation mechanism, it addresses the core problems faced by traditional methods, such as model distortion and the inability to simplify complex GMMs. Compared with traditional methods, the simplified model based on GMMultiMixer achieves a significant quantitative improvement in distribution approximation accuracy: in the one-dimensional GMM simplification task of Experiment 1, the KL divergence of GMMultiMixer is reduced by more than 99% compared with that of the traditional model; even in the ultra-high-component scenario of Experiment 2, its KL divergence remains below 10⁻³. In the two-dimensional GMM compression task, the KL divergence of GMMultiMixer is still 35% to 45% lower than that of the traditional model, reflecting its advantage in fitting complex multi-dimensional distributions. More importantly, the method avoids the explosive growth of computational complexity that traditional methods exhibit on two-dimensional GMMs, expanding the applicable scope of two-dimensional GMM compression. This paper also combines the method with Kalman filtering for UAV state estimation, where the performance gain is even more pronounced: in terms of estimation accuracy, the result is 50.18% better than the traditional variable-step method when compressed to 64 components, and the advantage widens to 61.54% when compressed to 8 components. Experimental results show that the method effectively mitigates the effects of sensor noise and nonlinear dynamics and improves the real-time performance and robustness of the system.
This research breaks through the limitations of traditional GMM compression in terms of the number of components and distribution complexity, providing a new technical pathway for multi-modal data modeling. Although this study has achieved positive outcomes, several directions remain worthy of in-depth exploration.
Firstly, the model structure of GMMultiMixer can be further optimized, for instance by introducing sparsification mechanisms or hierarchical feature-interaction modules, to enhance its efficiency and adaptability when processing higher-dimensional data. Secondly, the current method compresses the instantaneously generated high-component GMM independently at each estimation time step; in the future, temporal attention mechanisms could be incorporated to capture the temporal evolution patterns of components, extending the method to truly online or sequential GMM simplification. In addition, efficient algorithms should be designed to directly handle the streaming growth of GMM components while fully exploiting temporal correlations. Furthermore, the rise of quantum computing holds promising potential for GMMs and related fields such as machine learning, data analysis, and signal processing. In terms of computational foundations, quantum linear-algebra algorithms could accelerate operations involving large-scale covariance matrix inversion and eigenvalue decomposition, which are core bottlenecks in GMM parameter estimation and inference. In terms of application scenarios, quantum-enhanced GMMs may open up new possibilities for real-time processing of extremely complex data, such as hyperspectral remote sensing images, brain–computer interface signals, and multi-modal risk distributions in financial markets; this computing capability may also lay the foundation for dynamically tracking and processing streaming GMM data, breaking through the limitations of classical computers. Naturally, this will also bring new challenges, such as the design of quantum–classical hybrid algorithms, practical solutions in the noisy-qubit era, and the quantum-state representation of GMMs themselves. Exploring this interdisciplinary field will be an exciting frontier direction for future GMM research.

Author Contributions

Conceptualization, L.Z. and S.L.; data curation, L.Z. and J.Z.; formal analysis, L.Z. and M.T.; funding acquisition, L.Z.; investigation, L.Z., J.Z., M.T. and S.L.; methodology, L.Z., J.Z. and S.L.; project administration, L.Z.; resources, L.Z., M.T. and S.L.; software, L.Z. and J.Z.; supervision, S.L.; validation, L.Z., J.Z., M.T. and S.L.; visualization, L.Z. and M.T.; writing—original draft, L.Z.; writing—review and editing, L.Z., J.Z., M.T. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Acknowledgments

We used ChatGPT (GPT-4o) for English correction and translation.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
2. Ramachandram, D.; Taylor, G.W. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Process. Mag. 2017, 34, 96–108.
3. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
4. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012.
5. Guo, C.; Zhou, J.; Chen, H.; Ying, N.; Zhang, J.; Zhou, D. Variational autoencoder with optimizing Gaussian mixture model priors. IEEE Access 2020, 8, 43992–44005.
6. Liang, S.; Cai, C.; Xia, M.; Lin, H. Variable step-size estimation over UDP-based wireless networks with application to a hydrogen-powered UAV. IET Control Theory Appl. 2024, 18, 865–876.
7. Liang, S.; Lin, H.; Lu, S.; Su, H. Single target tracking with unknown detection status: Filter design and stability analysis. IEEE Trans. Ind. Electron. 2023, 71, 11316–11325.
8. Wang, Q.; Zhang, L.; Liu, J. Image Denoising Using Asymmetric Gaussian Mixture Models in Wavelet Domain. Electronics 2021, 10, 345.
9. Roweis, S.T.; Ghahramani, Z. A unifying review of linear Gaussian models. Neural Comput. 1999, 11, 305–345.
10. Tipping, M.E.; Bishop, C.M. Mixtures of probabilistic principal component analyzers. Neural Comput. 1999, 11, 443–482.
11. Zhang, H.; Wu, X.; Chen, Y. GMM-Based Voice Conversion with Spectral Smoothing for Low-Resource Languages. Electronics 2022, 11, 789.
12. Julier, S.J.; Uhlmann, J.K. A new extension of the Kalman filter to nonlinear systems. In Proceedings of the 1997 American Control Conference, Albuquerque, NM, USA, 6 June 1997; IEEE: New York, NY, USA, 1997; Volume 1, pp. 477–481.
13. Wan, E.A.; Van Der Merwe, R. The unscented Kalman filter for nonlinear estimation. In Proceedings of the 2000 IEEE Adaptive Systems for Signal Processing, Communications, and Control Symposium, Lake Louise, AB, Canada, 4 October 2000; IEEE: New York, NY, USA, 2000; pp. 153–158.
14. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2000.
15. Webb, A.R. Statistical Pattern Recognition, 3rd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2011.
16. McLachlan, G.J.; Rathnayake, S. On the number of components in a Gaussian mixture model. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2014, 4, 341–355.
17. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2009.
18. Figueiredo, M.A.T.; Jain, A.K. Unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 381–396.
19. McLachlan, G.J.; Peel, D. Finite Mixture Models; John Wiley & Sons: Hoboken, NJ, USA, 2004.
20. Jordan, M.I.; Ghahramani, Z.; Jaakkola, T.S.; Saul, L.K. An introduction to variational methods for graphical models. Mach. Learn. 1999, 37, 183–233.
21. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
22. Chen, L.; Wang, H.; Zhang, X. Efficient Compression of Gaussian Mixture Models via KL Divergence Minimization. Electronics 2023, 12, 1876.
23. Ueda, N.; Nakano, R. Deterministic annealing EM algorithm. Neural Netw. 1998, 11, 271–282.
24. Liu, Y.; Zhao, P.; Li, W. Challenges of Traditional Gaussian Mixture Model Optimization in High-Dimensional Spaces. Electronics 2020, 9, 1890.
25. Yang, M.S.; Sinaga, K.P. Federated Multi-View K-Means Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2446–2459.
26. Ioannou, I.; Gregoriades, A.; Christophorou, C.; Raspopoulos, M.; Vassiliou, V. Implementing a Cell-Free 6G Distributed AI Network with the Use of Deep ML Under a Traditional Multi-Cell Mobile Network. In Proceedings of the 2025 5th IEEE Middle East and North Africa Communications Conference (MENACOMM), Byblos, Lebanon, 20–22 February 2025; IEEE: New York, NY, USA, 2025; pp. 1–8.
27. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
28. Wang, S.; Li, J.; Shi, X.; Ye, Z.; Mo, B.; Lin, W.; Ju, S.; Chu, Z.; Jin, M. TimeMixer++: A general time series pattern machine for universal predictive analysis. arXiv 2024, arXiv:2410.16032.
Figure 1. Experiment 1: Comparison diagram of simplification effects for a one-dimensional GMM with 128 original components under different numbers of simplified components: (a) M = 64; (b) M = 32; (c) M = 16; (d) M = 8.
Figure 2. Framework of GMMultiMixer.
Figure 3. Experiment 2: Verification diagram of simplification effects of GMMultiMixer on high-component one-dimensional GMM (K = 1024, 512). (a) Simplification effect when the original number of components K = 1024 is reduced to M = 64; (b) Simplification effect when the original number of components K = 512 is reduced to M = 64.
Figure 4. Experiment 3: Comparison diagram of simplification effects for a two-dimensional GMM with 128 original components under different numbers of simplified components (M = 32, 16, 8). (a) Simplification effect when simplified to M = 32; (b) Simplification effect when simplified to M = 16; (c) Simplification effect when simplified to M = 8.
Figure 5. Comparison of tracking performance for Hydrogen-Powered UAV with simplified target components of 64 and 8. (a) Tracking performance when the number of simplified target components is 64; (b) Tracking performance when the number of simplified target components is 8.
Table 1. Comparison of KL divergence among three different GMM simplification schemes (traditional simplified model, GMMultiMixer model, and single-component analytical solution) for a one-dimensional GMM with 128 original components under different numbers of simplified components (where M = 64, 32, 16, 8).
Simplification Scheme      M = 64      M = 32      M = 16      M = 8       M = 1
Traditional Model          0.051326    0.4198759   0.619011    0.8678492   None
GMMultiMixer               0.000035    0.000026    0.000091    0.000110    None
Analytical Solution        None        None        None        None        0.045774
Table 2. Comparison of KL divergence among three different GMM simplification schemes (traditional simplified model, GMMultiMixer model, and single-component analytical solution) for a two-dimensional GMM with 128 original components under different numbers of simplified components M (where M = 32, 16, 8).
Simplification Scheme      M = 32      M = 16      M = 8       M = 1
Traditional Model          0.009894    0.048143    0.137103    None
GMMultiMixer               0.006537    0.025016    0.076115    None
Analytical Solution        None        None        None        0.277146