Article

MVIB-Lip: Multi-View Information Bottleneck for Visual Speech Recognition via Time Series Modeling

School of Electronic Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
*
Author to whom correspondence should be addressed.
Entropy 2025, 27(11), 1121; https://doi.org/10.3390/e27111121
Submission received: 8 September 2025 / Revised: 18 October 2025 / Accepted: 23 October 2025 / Published: 31 October 2025
(This article belongs to the Special Issue The Information Bottleneck Method: Theory and Applications)

Abstract

Lipreading, or visual speech recognition, is the task of interpreting utterances solely from visual cues of lip movements. While early approaches relied on Hidden Markov Models (HMMs) and handcrafted spatiotemporal descriptors, recent advances in deep learning have enabled end-to-end recognition using large-scale datasets. However, such methods often require millions of labeled or pretraining samples and struggle to generalize under low-resource or speaker-independent conditions. In this work, we revisit lipreading from a multi-view learning perspective. We introduce MVIB-Lip, a framework that integrates two complementary representations of lip movements: (i) raw landmark trajectories modeled as multivariate time series, and (ii) recurrence plot (RP) images that encode structural dynamics in a texture form. A Transformer encoder processes the temporal sequences, while a ResNet-18 extracts features from RPs; the two views are fused via a product-of-experts posterior regularized by the multi-view information bottleneck. Experiments on the OuluVS and a self-collected dataset demonstrate that MVIB-Lip consistently outperforms handcrafted baselines and improves generalization to speaker-independent recognition. Our results suggest that recurrence plots, when coupled with deep multi-view learning, offer a principled and data-efficient path forward for robust visual speech recognition.

1. Introduction

Lipreading, also known as visual speech recognition (VSR), refers to the task of interpreting utterances solely from the visual information of lip movements. It has long been recognized as an important tool for communication among individuals with hearing impairments and as a complementary modality to improve speech recognition in noisy acoustic environments [1,2,3]. When audio signals are corrupted or unavailable, visual cues from the lips provide robust information that can significantly enhance intelligibility. This dual role, supporting accessibility and enhancing robustness, makes lipreading an enduring research problem with both societal and technical significance.
Early research on lipreading primarily sought to integrate visual features into existing audio-based automatic speech recognition (ASR) systems, leveraging the fact that visual articulation patterns often disambiguate phonemes that are acoustically confusable [4,5]. In this era, statistical sequence models such as Hidden Markov Models (HMMs) were the dominant paradigm [6,7]. Hand-crafted visual descriptors, such as Active Shape Models (ASM) and Active Appearance Models (AAM), have been employed to capture the temporal dynamics of lip shapes [8,9,10]. Other commonly used features include Gabor filters and Local Binary Patterns (LBP) [11,12]. While such approaches demonstrated feasibility, they often required manual preprocessing (e.g., cropped mouth regions), and their performance degraded significantly under unconstrained conditions.
In the past decade, advances in computer vision and machine learning have revolutionized lipreading. Convolutional and recurrent neural networks enabled the first end-to-end recognition systems, such as LipNet [13], which directly map video frames to text sequences. Large-scale datasets such as LRW [14] and LRS2/LRS3 have further fueled the development of deep models, including spatiotemporal CNNs [15,16], attention-based transformers [17], and sequence-to-sequence architectures. Most recently, self-supervised learning frameworks such as AV-HuBERT [18] have achieved state-of-the-art results by pretraining on millions of audiovisual samples, reducing the reliance on labeled data. Despite this remarkable progress, deep neural networks typically require massive training corpora and can struggle in low-resource scenarios or when generalizing across speakers.
Against this backdrop, researchers have sought alternative representations that balance discriminative power and data efficiency. A common direction is to model lip movements as multivariate time series, capturing the trajectories of facial landmarks, which emphasize the dynamical patterns of articulation. To the best of our knowledge, this is the first work to transform landmark time series of the mouth region into recurrence plots (RPs) and leverage them for lipreading. This novel formulation recasts lipreading as a tractable image classification problem, where RPs highlight structural similarities in the temporal dynamics through texture-like images.
In this paper, we modernize and extend this formulation by introducing MVIB-Lip, a multi-view information bottleneck framework for lipreading. Specifically, we treat the raw landmark time series and their recurrence plots as two complementary views. A Transformer encoder models the temporal dynamics of lip trajectories, while a ResNet-based encoder extracts discriminative patterns from RP images. These representations are fused via a product-of-experts posterior, regularized by the information bottleneck principle to retain only task-relevant, view-shared information while discarding nuisance variability such as speaker identity. This design provides the best of both worlds: sample efficiency from structured time-series modeling and representational power from deep neural networks.
While large multimodal models have recently advanced audiovisual understanding, visual-only lipreading remains an essential research direction. In many real-world scenarios, audio signals are unavailable, corrupted, or intentionally suppressed due to privacy constraints, e.g., in meeting rooms, hospitals, surveillance environments, and hearing-assistive applications. A visual-only system also enables deployment on low-power or privacy-preserving devices where storing or transmitting audio is impractical. Furthermore, a reliable visual front-end can complement audio-based or multimodal foundation models by providing an interpretable and noise-robust representation of articulatory dynamics. From this perspective, the proposed MVIB-Lip serves as a principled and lightweight framework for visual speech recognition, focusing on interpretable, data-efficient, and deployment-friendly modeling that remains complementary to recent multimodal advances.
The main contributions of this work are threefold:
  • We propose MVIB-Lip, the first multi-view IB framework for lipreading that jointly exploits temporal landmark dynamics and recurrence plot textures.
  • To the best of our knowledge, this is the first work to transform mouth landmark time series into recurrence plots (RPs) for lipreading. By introducing MVIB-Lip, we fuse raw time-series dynamics with RP-based structural patterns under an information bottleneck framework, achieving both sample efficiency and strong representational power.
  • We provide a systematic evaluation on public benchmarks (OuluVS, LRW, and LRW-1000) and a self-collected dataset. We show that MVIB-Lip achieves superior performance compared to handcrafted pipelines and single-view neural encoders, particularly in speaker-independent recognition.
The remainder of this paper is organized as follows: Section 2 briefly reviews related work on visual-based isolated sentence recognition. Section 3 describes each step of the proposed system, which is built on multivariate time series modeling. Section 4 reports experiments on the benchmark and self-collected datasets and analyzes the results. Finally, Section 5 concludes the paper.

2. Related Works

2.1. Traditional Visual Speech Recognition

Encoding the dynamics of lip movements as a descriptor has a long history in lipreading research. Graphical models have been used extensively in visual-only speech recognition (VSR) and audio-visual speech recognition (AVSR). In [1,4], HMMs were used to encode the visual dynamics of speech using the Active Shape Model (ASM) [19] and the Active Appearance Model (AAM) [20], respectively. Ref. [21] uses articulatory features and a generalized dynamic Bayesian network (DBN) for recognizing spoken phrases with multiple loosely synchronized streams. Apart from works on efficient temporal modeling, other works aim at developing more discriminative spatiotemporal features. For example, refs. [2,8] extract a single spatiotemporal feature to represent the visual information of different speech videos, whereas [9] uses the motion history image (MHI) to represent speech videos. These two approaches perform well on small-scale, stable videos, but they can be sensitive to frame outliers.
A comprehensive review can be found in [22,23]. It is worth noting that almost all existing isolated sentence recognition systems suffer from one or both of the following issues: (1) although some recent works focus on accurate lip localization in realistic, uncontrolled VSR environments (e.g., [24]), the majority of previous works (e.g., [1,9,22,25,26]) are tested on standard databases in which the mouth region is manually cropped from facial videos beforehand, so their performance under unconstrained, in-the-wild conditions is unknown; and (2) features characterizing different sentences are not distinguishable in complex circumstances [2,8,9].
Admittedly, prior lipreading systems such as LipNet [13] have demonstrated the potential of deep learning for automatic visual speech recognition. However, these approaches largely focus on treating lipreading as a sequence-to-sequence mapping problem and rely heavily on large-scale training data to ensure robust performance [13,14].
In contrast, our work approaches lipreading explicitly from the perspective of time-series modeling. We regard mouth landmark trajectories as multivariate temporal signals and design a deep learning framework that integrates this structured representation with complementary views, such as recurrence plots. By embedding time-series modeling principles into the deep learning pipeline, our method aims to achieve more data-efficient and interpretable sentence-level lipreading. To the best of our knowledge, this is among the first attempts to systematically introduce time-series modeling into sentence-level lipreading tasks.

2.2. Deep Learning for Lipreading

With the rise of deep learning, visual speech recognition (VSR) has advanced rapidly. One of the earliest end-to-end models, LipNet [13], demonstrated sentence-level lipreading on the GRID corpus using 3D convolutions and gated recurrent units. Around the same time, Chung and Zisserman [14] introduced the large-scale LRW dataset, which enabled training and evaluation of word-level lipreading systems in the wild. This was later extended to sentence-level corpora such as LRS2 and LRS3 [27,28], which have become standard benchmarks. Building on these datasets, subsequent work explored stronger architectures, including 2D/3D CNN–RNN hybrids [29], spatiotemporal convolutional networks [15,16], temporal convolutional models [30], and attention-based designs using conformers and transformers [17,31]. These approaches capture long-range coarticulation patterns and achieve robust performance under challenging pose and illumination conditions.
The growth of large datasets has also facilitated self-supervised pretraining. In particular, AV-HuBERT [18] showed that masked prediction objectives on large-scale audiovisual data can yield representations that transfer effectively to lipreading tasks with limited labeled data. Related methods based on masked autoencoding for video have also been applied to visual speech [32,33], highlighting the promise of self-supervision in reducing reliance on costly manual annotations. Despite these advances, most state-of-the-art models require massive training corpora and computational resources, and their performance often drops on smaller, domain-specific datasets.
Another line of research considers multi-view lipreading, motivated by the observation that different viewpoints (e.g., frontal and profile) provide complementary cues. Datasets such as OuluVS2 include multiple camera views, and early models fused features from different perspectives by concatenation or recurrent/attention-based modules [34,35,36]. While these approaches improve robustness compared with single-view models, they generally treat views as redundant channels and lack mechanisms to disentangle view-invariant articulatory content from view-specific appearance information. As a result, they may fail to fully exploit the complementary nature of multiple viewpoints, particularly under occlusions or mismatched camera angles.
Our work departs from these prior approaches by formulating lipreading as a multi-view information bottleneck (MV-IB) problem. Instead of simply stacking features across views, we explicitly separate a common latent representation that captures view-invariant speech content from view-specific latent components that encode complementary cues. This formulation is grounded in information theory: we maximize the sufficiency of the combined representation for predicting speech labels while penalizing irrelevant information about the inputs. Compared with standard multi-view fusion, this approach reduces redundancy, improves sample efficiency, and provides interpretability by quantifying the unique contribution of each view. Moreover, it is orthogonal to recent advances in self-supervised pretraining, meaning that pretrained encoders such as AV-HuBERT or VideoMAE can be integrated into our framework while still benefiting from the principled IB-based fusion. In this way, MV-IB combines the strengths of deep representation learning with an information-theoretic perspective on multi-view modeling, offering improved generalization in both cross-view and data-limited scenarios.

2.3. Multi-View Learning and Information Bottleneck

The Information Bottleneck (IB) [37,38] aims to obtain a compressed representation Z from X, preserving predictive information about Y. Formally, the objective is to find a representation Z that maximizes the mutual information I ( Y ; Z ) while constraining I ( X ; Z ) below a predefined threshold α .
In practical applications, solving this constrained optimization problem can be highly challenging. Therefore, Z is typically obtained by maximizing the IB Lagrangian:
$\mathcal{L}_{\mathrm{IB}} = I(Y;Z) - \beta\, I(X;Z),$
where β > 0 is a Lagrange multiplier that balances the sufficiency (measured by I ( Y ; Z ) ) against the simplicity (measured by I ( X ; Z ) ).
The multi-view information bottleneck (MVIB) framework [39,40] provides a principled way to learn compact, task-relevant representations while discarding irrelevant variations. Equation (1) can be extended to a multi-view setting with proper modifications. Given multi-view labeled data $\{X_1, \dots, X_M, Y\}$, MIB [40,41] learns a fused representation $Z$ by the following:
$\min\; -I(Y;Z) + \sum_{i=1}^{M} \beta_i\, I(X_i; Z_i), \quad \text{s.t. } Z = f(Z_1, \dots, Z_M),$
where $f$ is a fusion network, $\beta_i$ is the Lagrange multiplier for the $i$-th view, and $Z_i$ is the view-specific representation of the $i$-th view.
Recent work extends IB to multimodal learning [42], demonstrating its potential to improve robustness and generalization across different sensory modalities. The multimodal information bottleneck (MIB) [42] combines an early-fusion strategy and late-fusion strategy and formulates the learning as follows:
$\min\; -I(Y;Z) + \beta\, I(X;Z) + \sum_{m=1}^{M}\bigl[-I(Y;Z_m) + \beta\, I(X_m; Z_m)\bigr], \quad \text{s.t. } Z = f(Z_1, \dots, Z_M).$
In parallel, multi-view representation learning has become a powerful paradigm for fusing heterogeneous data sources. Inspired by this line of research, we frame lipreading as a two-view problem: the temporal dynamics of lip landmarks and the structural recurrence patterns derived from them. By leveraging MVIB, we ensure that the shared latent space captures discriminative features for sentence recognition while suppressing nuisance factors such as speaker-specific styles. Unlike previous single-view pipelines, our approach naturally integrates handcrafted dynamical descriptors with modern deep encoders in an end-to-end trainable fashion.

3. System Description

Our system includes three components: (1) multivariate time series generation using joint trajectories of facial landmarks on the outer lip contour; (2) multivariate time series representation using a modified recurrence plot; and (3) view-specific feature extraction and fused feature learning with multi-view information bottleneck. The flow-chart of our system is shown in Figure 1. The following sections describe each step in detail.

3.1. Multivariate Time Series Generation

The first step is to generate multivariate time series that uncover the spatiotemporal information of different lip movements when speaking different sentences. A straightforward way is to detect and track facial landmarks on the outer lip contour in successive video frames. As a state-of-the-art method in this direction, we use the supervised descent method (SDM) [43]. Compared with the classical ASM and AAM, the relatively higher efficiency and accuracy of SDM lie in its ability to adaptively enforce shape constraints, its strong learning capacity from large training datasets, and its precise objective function [44]. For more details on deformable object fitting with SDM and a detailed explanation of its solution, the reader is referred to [43].
Facial landmark detection can be cast as a nonlinear least squares (NLS) problem, where the objective is to align an initial facial shape estimate to the true landmark configuration. Formally, let $h(x): \mathbb{R}^n \to \mathbb{R}^m$ denote a nonlinear feature extraction function (e.g., SIFT descriptors sampled at landmark locations), $y \in \mathbb{R}^m$ the target feature vector extracted from the ground-truth landmarks, and $x \in \mathbb{R}^n$ the parameters encoding the current landmark positions. The NLS objective takes the following form:
$\min_x f(x) = \min_x \bigl\| h(x) - y \bigr\|^2.$
Classical Newton-type methods update x iteratively according to the following:
$x_k = x_{k-1} - \alpha A\, J_h^{\top}(x_{k-1})\,\bigl(h(x_{k-1}) - y\bigr),$
where $J_h(x) \in \mathbb{R}^{m \times n}$ is the Jacobian of $h$, $A \in \mathbb{R}^{n \times n}$ is either the identity matrix (first-order) or an approximation to the inverse Hessian (second-order), and $\alpha$ is a step size. However, computing $J_h$ and $A$ in high dimensions is computationally expensive and often unstable, particularly since $h$ (e.g., SIFT, HOG) may be non-differentiable.
The supervised descent method (SDM) [43] circumvents this issue by learning a sequence of generic descent directions directly from training data, without requiring Jacobian or Hessian computation. Specifically, SDM introduces a generic descent map $R \in \mathbb{R}^{n \times m}$ defined such that there exists $0 < c < 1$ with the following:
$\| x_k - x_* \| \le c\, \| x_{k-1} - x_* \|, \qquad x_k = x_{k-1} - R\,\bigl(h(x_{k-1}) - h(x_*)\bigr),$
where x * is the optimal landmark configuration. Intuitively, R acts as a weighted average gradient direction that drives x k towards the ground truth x * . Xiong and De la Torre [43] proved that such a descent map exists when (1) R h ( x ) is strictly locally monotone at x * , and (2) h ( x ) is locally Lipschitz continuous.
In practice, R is learned via linear regression between the feature differences and the displacement to the ground-truth landmarks. More concretely, during training, SDM minimizes the following:
$\min_{R,\, b}\; \sum_{i} \sum_{x_0^i} \bigl\| (x_*^i - x_0^i) - (R\,\phi_0^i + b) \bigr\|^2,$
where $\phi_0^i = h\bigl(d^i(x_0^i)\bigr)$ are the features extracted from image $d^i$ at the perturbed initialization $x_0^i$, and $x_*^i$ are the corresponding ground-truth landmarks. The process is repeated in a cascaded manner: after each regressor $(R_k, b_k)$ is learned, landmark estimates are updated and new training pairs are generated for the next stage. Typically, convergence is achieved after 4–5 stages.
At test time, given a new face image, the detector is initialized with the mean landmark shape. SDM then applies the sequence of learned regressors to iteratively refine the estimate, written as follows:
$x_k = x_{k-1} + R_{k-1}\,\phi_{k-1} + b_{k-1}, \qquad \phi_{k-1} = h\bigl(d(x_{k-1})\bigr).$
This procedure avoids any explicit Jacobian/Hessian computation and yields robust alignment even under large pose, illumination, and occlusion variations. Experiments in [43] demonstrated that SDM achieves state-of-the-art performance on challenging “in-the-wild” datasets such as LFPW and LFW-A&C.
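To make the cascaded regression in the two preceding equations concrete, the following is a minimal NumPy sketch of SDM training and inference. It is an illustration under simplifying assumptions rather than the reference implementation: `extract_features` is a placeholder for the SIFT/HOG descriptor $h(\cdot)$, and each stage's $(R_k, b_k)$ is fit by ordinary least squares.

```python
import numpy as np

def fit_sdm_cascade(images, gt_shapes, init_shapes, extract_features, n_stages=4):
    """Learn a cascade of descent maps (R_k, b_k) by linear regression.

    images           : list of N training images
    gt_shapes        : (N, 2L) ground-truth landmark coordinates x_*
    init_shapes      : (N, 2L) perturbed initializations x_0
    extract_features : callable(image, shape) -> 1-D feature vector (stands in for SIFT/HOG)
    """
    shapes = init_shapes.copy()
    cascade = []
    for _ in range(n_stages):
        # Features at the current landmark estimates.
        Phi = np.stack([extract_features(img, s) for img, s in zip(images, shapes)])
        # Regression targets: remaining displacement to the ground truth.
        Delta = gt_shapes - shapes
        # Solve min_{R, b} || Delta - (Phi R^T + b) ||^2 with a bias column.
        Phi1 = np.hstack([Phi, np.ones((Phi.shape[0], 1))])
        W, *_ = np.linalg.lstsq(Phi1, Delta, rcond=None)
        R, b = W[:-1].T, W[-1]
        cascade.append((R, b))
        # Update the estimates and generate training pairs for the next stage.
        shapes = shapes + Phi @ R.T + b
    return cascade

def sdm_refine(image, mean_shape, cascade, extract_features):
    """Test-time update x_k = x_{k-1} + R_{k-1} phi_{k-1} + b_{k-1}."""
    x = mean_shape.copy()
    for R, b in cascade:
        phi = extract_features(image, x)
        x = x + R @ phi + b
    return x
```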
Note that this paper tracks 12 facial landmarks simultaneously, as suggested in [24,43,45]. Although using more facial landmarks, or interpolating these landmarks to construct a complete lip contour, may improve the classification accuracy, it also increases the modeling complexity. A representative multivariate time series generation result is shown in Figure 2. To compensate for natural head movement during speaking, the relative coordinates of the facial landmarks are used, where the reference point is taken as the midpoint between the two mouth corners.
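As a minimal sketch of this step (the landmark ordering and corner indices below are illustrative assumptions, not the exact convention used in our implementation), the relative-coordinate time series can be assembled as follows:

```python
import numpy as np

def lip_time_series(landmarks, corner_idx=(0, 6)):
    """Build a multivariate time series from tracked outer-lip landmarks.

    landmarks  : (T, 12, 2) array of landmark coordinates over T frames.
    corner_idx : indices of the two mouth-corner landmarks (assumed here).

    Returns a (T, 24) array of relative coordinates; the reference point is the
    midpoint between the two mouth corners, which compensates for head motion.
    """
    landmarks = np.asarray(landmarks, dtype=np.float64)
    reference = landmarks[:, list(corner_idx), :].mean(axis=1, keepdims=True)  # (T, 1, 2)
    relative = landmarks - reference                                           # (T, 12, 2)
    return relative.reshape(landmarks.shape[0], -1)                            # (T, 24)
```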

3.2. Multivariate Time Series Representation Using a Modified Recurrence Plot

A recurrence plot (RP) is a visualization tool for dynamical systems that captures the system's behavior, and its texture is distinctive for different dynamical systems [46]. A basic recurrence matrix is defined as follows:
$R(i,j) = \theta\bigl(\epsilon - \| x_i - x_j \|_2\bigr)$
where $x_i$ and $x_j$ are the observations (or states) of a given time series at time indices $i$ and $j$, respectively, $\theta(\cdot)$ is the unit step function, and $\epsilon$ stands for the threshold. For pairs of time indices whose states fall within the threshold, black dots appear in the recurrence texture. An improved recurrence matrix is defined as follows [47]:
$R(i,j) = \| x_i - x_j \|_2$
In our method, motivated by the structure of the famed bilateral filter [48] in computer vision and image analysis, we propose a modified recurrence plot to jointly consider temporal distance and radiometric differences:
$R(i,j) = g_\sigma\bigl(\| x_i - x_j \|_2\bigr)\; g_\sigma\bigl(|i - j|\bigr)$
where $g_\sigma(\cdot)$ is an RBF kernel with width $\sigma$ ($\sigma = 1$ in our system). Figure 3 shows three representative recurrence plots with their corresponding sentences. As can be seen, the modified RP is distinctive for different sentences, thus holding the potential for isolated sentence recognition.
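A minimal NumPy sketch of the modified recurrence plot above, taking the multivariate time series of Section 3.1 as input ($\sigma = 1$ as in our system; the function name is ours):

```python
import numpy as np

def modified_recurrence_plot(series, sigma=1.0):
    """Modified recurrence plot R(i, j) = g_sigma(||x_i - x_j||_2) * g_sigma(|i - j|).

    series : (T, d) multivariate time series (e.g., relative lip-landmark coordinates).
    Returns a (T, T) matrix combining radiometric and temporal proximity,
    in the spirit of a bilateral filter.
    """
    series = np.asarray(series, dtype=np.float64)
    T = series.shape[0]
    # Pairwise Euclidean distances between states.
    diffs = series[:, None, :] - series[None, :, :]      # (T, T, d)
    state_dist = np.linalg.norm(diffs, axis=-1)          # (T, T)
    # Temporal distances |i - j|.
    idx = np.arange(T)
    time_dist = np.abs(idx[:, None] - idx[None, :])      # (T, T)
    # RBF kernels on both distances.
    g = lambda d: np.exp(-(d ** 2) / (2.0 * sigma ** 2))
    return g(state_dist) * g(time_dist)
```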

3.3. Multi-View Information Bottleneck for Lipreading

We propose MVIB-Lip, a multi-view framework that models lip movements from two complementary perspectives: the raw landmark time series and the recurrence plot (RP) derived from it. Let X T denote the temporal sequence of lip landmark positions, and X R the recurrence plot representation. The objective is to learn a compact and robust latent representation Z that preserves task-relevant information about the spoken label Y, while discarding nuisance factors such as speaker identity, illumination changes, and background noise.
To simplify the analysis, we consider the case of two views of lip-reading inputs: the raw time series X T and the recurrence plot X R , together with the class label Y. We thus optimize the following objective:
$\max_{Z_T, Z_R}\; I(Y;Z) - \lambda_1\, I(X_T; Z_T) - \lambda_2\, I(X_R; Z_R), \quad \text{s.t. } Z = f_\theta(Z_T, Z_R),$
where Z T and Z R are latent representations derived from encoders for X T and X R , respectively.
Each view is processed by a dedicated encoder network. For the temporal view, a Transformer encoder f ϕ ( X T ) is employed to model long-range dependencies in the landmark trajectories. For the recurrence plot view, a ResNet-18 encoder f ψ ( X R ) is used to extract discriminative texture features. Each encoder produces a variational posterior distribution of the latent representation:
$q_\phi(z_T \mid X_T) = \mathcal{N}(\mu_T, \sigma_T^2 I), \qquad q_\psi(z_R \mid X_R) = \mathcal{N}(\mu_R, \sigma_R^2 I),$
where μ T , σ T and μ R , σ R are predicted by the respective encoder networks.
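During training, the expectation over the latent representation is handled with the standard reparameterization trick; a brief PyTorch sketch under the variance-clamping convention described in Section 3.4 (the helper name is ours):

```python
import torch
import torch.nn.functional as F

def sample_latent(mu, s, sigma_min=1e-3, sigma_max=0.5):
    """Reparameterized sample z = mu + sigma * eps from a diagonal Gaussian posterior.

    The raw scale s is mapped to a bounded standard deviation (see Section 3.4),
    so that the conditional entropy H(Z|X) stays finite.
    """
    sigma = torch.clamp(F.softplus(s) + sigma_min, max=sigma_max)
    eps = torch.randn_like(mu)
    return mu + sigma * eps, sigma
```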
To integrate complementary information across views, we adopt the Product-of-Experts (PoE) rule. The fused latent posterior is defined as follows:
$q(z \mid X_T, X_R) \propto q_\phi(z_T \mid X_T) \cdot q_\psi(z_R \mid X_R).$
This fusion mechanism emphasizes shared, consistent information while down-weighting view-specific noise. Notably, the PoE formulation also provides robustness to missing views: if one modality is absent, the joint posterior reduces to the available encoder distribution.
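For diagonal Gaussian experts, the PoE posterior has a simple closed form: precisions add and the fused mean is the precision-weighted average of the expert means. A short sketch (the helper name and the small ε stabilizer are our own):

```python
import torch

def poe_fuse(mu_t, var_t, mu_r, var_r, eps=1e-8):
    """Product-of-Experts fusion of two diagonal Gaussian posteriors.

    q(z | X_T, X_R) ∝ N(mu_t, var_t) * N(mu_r, var_r): the fused precision is the
    sum of the expert precisions, and the fused mean is their precision-weighted
    average of the expert means.
    """
    prec_t, prec_r = 1.0 / (var_t + eps), 1.0 / (var_r + eps)
    var = 1.0 / (prec_t + prec_r)
    mu = var * (prec_t * mu_t + prec_r * mu_r)
    return mu, var

# If one view is missing, the joint posterior reduces to the available expert,
# e.g., mu, var = mu_t, var_t when only the temporal view is observed.
```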

3.4. Optimization

A central challenge in optimizing Equation (9) lies in computing the mutual information. For two random variables A and B, mutual information can be written as follows:
$I(A;B) = \mathbb{E}_{p(a,b)}\!\left[\log \frac{p(a \mid b)}{p(a)}\right] = \iint p(a,b)\, \log \frac{p(a \mid b)}{p(a)}\, da\, db,$
where p ( a ) denotes the marginal distribution of a, p ( a , b ) the joint distribution of ( a , b ) , and p ( a | b ) the conditional distribution of a given b.
Direct evaluation of this quantity is generally infeasible in high-dimensional settings, since the true data distributions are unknown. To address this, variational approximation techniques are often employed, which introduce tractable surrogate distributions. By doing so, the intractable terms can be bounded from below, yielding an optimization-friendly objective. In essence, these methods approximate the true distributions with parameterized families and convert the original mutual information objective into a computable lower bound.
As for the mutual information I ( Y ; Z ) , we adopt a variational approximation [39,49]:
$I(Y;Z) = \iint dy\, dz\; p(y,z)\, \log \frac{p(y \mid z)}{p(y)}.$
Since p ( y | z ) is intractable, we introduce a variational distribution q ( y | z ) to approximate it. Leveraging the non-negativity of the Kullback–Leibler divergence, we have:
$I(Y;Z) \ge \iint dy\, dz\; p(y,z)\, \log q(y \mid z).$
Therefore, the variational lower bound of I ( Y ; Z ) can be optimized directly. Notice that the entropy of the label H ( Y ) is independent of optimization and thus can be dropped.
Focusing on $p(y,z)$, we can rewrite it as follows:
$p(y,z) = \int dx_R\, dx_T\, dz_R\, dz_T\; p(x_R, x_T, z_R, z_T, y, z).$
Therefore, Equation (15) can be rewritten as follows:
$I(Y;Z) \ge \int dy\, dz\, dx_R\, dx_T\, dz_R\, dz_T\; p(x_R, x_T, z_R, z_T, y, z)\, \log q(y \mid z).$
In order to solve Equation (16), we need to find the joint probability density function of all variables to obtain its variational lower bound. Leveraging the Markov assumption, p ( x R , x T , z R , z T , y , z ) can be represented as follows:
$p(x_R, x_T, z_R, z_T, y, z) = p(z \mid z_R, z_T, x_R, x_T, y)\; p(z_R \mid z_T, x_R, x_T, y)\; p(z_T \mid x_R, x_T, y)\; p(x_R, x_T, y).$
$x_R$ and $x_T$ are the two views of the lip movements, and $z_R$ and $z_T$ are learned from them, respectively. We therefore assume that, given $x_R$, $z_R$ is independent of $x_T$, $z_T$, and $y$; similarly, given $x_T$, $z_T$ is independent of $x_R$, $z_R$, and $y$. The Markov chain between these variables is shown in Figure 4.
Substituting Equation (17) into Equation (16) while applying our assumptions, we can obtain a new lower bound of the mutual information between Y and Z:
$I(Y;Z) \ge \int dx_R\, dx_T\, dy\; p(x_R, x_T, y) \int dz_R\, dz_T\; p(z \mid z_R, z_T)\, p(z_R \mid x_R)\, p(z_T \mid x_T)\, \log q(y \mid z).$
For the penalty terms I ( X T ; Z T ) and I ( X R ; Z R ) , we follow the information bottleneck (IB) principle and aim to minimize these mutual information quantities. Direct estimation of I ( X ; Z ) is intractable, but it can be rewritten as follows:
$I(X;Z) = H(Z) - H(Z \mid X).$
In our VAE-like encoder, $q_\phi(z \mid x) = \mathcal{N}\bigl(\mu_\phi(x), \mathrm{diag}\,\sigma_\phi^2(x)\bigr)$, the conditional entropy has the closed form
$H(Z \mid X) = \tfrac{1}{2}\, \mathbb{E}_{p(x)}\Bigl[\log\bigl((2\pi e)^d \det \Sigma_\phi(x)\bigr)\Bigr],$
which depends only on the encoder variances. When these variances are kept fixed or properly regularized, H ( Z X ) is (approximately) constant. In this case, minimizing H ( Z ) is equivalent to minimizing I ( X ; Z ) up to an additive constant. Even if Σ ϕ ( x ) varies, bounding its eigenvalues ensures H ( Z X ) is lower-bounded, so minimizing H ( Z ) still tightens an upper bound on I ( X ; Z ) [50].
To operationalize this idea, we approximate the marginal q ϕ ( z ) by the aggregated posterior over a minibatch of size n, i.e., the following:
$\hat{q}(z) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{N}(z \mid \mu_i, \Sigma_i), \qquad \mu_i = \mu_\phi(x_i),\quad \Sigma_i = \mathrm{diag}\,\sigma_\phi^2(x_i).$
We then penalize its Rényi entropy of order 2 [51], written as follows:
$H_2(Z) = -\log \int q(z)^2\, dz,$
which admits an exact closed form for Gaussian mixtures
$\hat{H}_2(Z) = -\log \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \mathcal{N}\bigl(\mu_i \mid \mu_j,\, \Sigma_i + \Sigma_j\bigr).$
Here each term $\mathcal{N}(\mu_i \mid \mu_j, \Sigma_i + \Sigma_j)$ is the Gaussian density with mean $\mu_j$ and covariance $\Sigma_i + \Sigma_j$ evaluated at $\mu_i$. This estimator is unbiased, differentiable with respect to both means and variances, and can be computed efficiently per minibatch.
Each encoder outputs a diagonal Gaussian posterior parameterized by a mean and variance pair ( μ , s ) . The standard deviation is computed as σ = softplus ( s ) + σ min , followed by clipping to an upper limit σ max . In our experiments, σ min = 10 3 and σ max = 0.5 . This formulation guarantees that all latent variances remain strictly positive and bounded, thereby keeping the conditional entropy H ( Z | X ) within a finite range. Bounding the variance stabilizes training and ensures that minimizing the marginal entropy H ( Z ) effectively reduces the mutual information I ( X ; Z ) under the information bottleneck framework. Empirically, this constraint prevents numerical instability and over-compression of latent features while maintaining sufficient capacity for discriminative representation learning.
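The bounded-variance parameterization and the closed-form Rényi-2 penalty of Equation (25) can be sketched in PyTorch as follows (an illustrative implementation for diagonal covariances, not the exact training code):

```python
import math
import torch
import torch.nn.functional as F

SIGMA_MIN, SIGMA_MAX = 1e-3, 0.5

def bounded_std(s):
    """sigma = clip(softplus(s) + sigma_min, sigma_max): keeps H(Z|X) finite and bounded."""
    return torch.clamp(F.softplus(s) + SIGMA_MIN, max=SIGMA_MAX)

def renyi2_entropy(mu, sigma):
    """Closed-form Renyi-2 entropy of the minibatch aggregated posterior (Equation (25)).

    mu, sigma : (n, d) means and diagonal standard deviations of q(z | x_i).
    Computes H2 = -log (1/n^2) * sum_{i,j} N(mu_i | mu_j, Sigma_i + Sigma_j).
    """
    n, _ = mu.shape
    var_sum = sigma.unsqueeze(0) ** 2 + sigma.unsqueeze(1) ** 2      # (n, n, d)
    diff2 = (mu.unsqueeze(0) - mu.unsqueeze(1)) ** 2                 # (n, n, d)
    # log N(mu_i | mu_j, Sigma_i + Sigma_j) for diagonal covariances.
    log_pdf = -0.5 * (diff2 / var_sum + torch.log(var_sum)
                      + math.log(2.0 * math.pi)).sum(dim=-1)         # (n, n)
    log_mix = torch.logsumexp(log_pdf.reshape(-1), dim=0) - 2.0 * math.log(n)
    return -log_mix                                                  # differentiable in mu and sigma
```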
Theorem 1
(Closed-form empirical $H_2$ for a Gaussian-mixture aggregated posterior). Let the empirical (mini-batch) aggregated posterior be the following:
$\hat{q}(z) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{N}\bigl(z \mid \mu_i, \Sigma_i\bigr),$
with $\mu_i \in \mathbb{R}^d$ and positive-definite covariances $\Sigma_i \in \mathbb{R}^{d \times d}$. The Rényi entropy of order 2 of $\hat{q}$ is written as follows:
$H_2(\hat{q}) = -\log \int \hat{q}(z)^2\, dz = -\log \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \mathcal{N}\bigl(\mu_i \mid \mu_j,\, \Sigma_i + \Sigma_j\bigr).$
Lemma 1
(Gaussian product integral). For any $\mu_1, \mu_2 \in \mathbb{R}^d$ and positive-definite $\Sigma_1$ and $\Sigma_2$, the following holds:
$\int_{\mathbb{R}^d} \mathcal{N}(z \mid \mu_1, \Sigma_1)\; \mathcal{N}(z \mid \mu_2, \Sigma_2)\, dz = \mathcal{N}\bigl(\mu_1 \mid \mu_2,\, \Sigma_1 + \Sigma_2\bigr).$
Proof of Lemma 1.
Write the normalized Gaussian density as follows:
$\mathcal{N}(z \mid \mu, \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\Bigl(-\tfrac{1}{2}\, \| z - \mu \|_{\Sigma^{-1}}^2\Bigr),$
where $\| x \|_A^2 := x^\top A\, x$. Then we obtain the following:
$\mathcal{N}(z \mid \mu_1, \Sigma_1)\; \mathcal{N}(z \mid \mu_2, \Sigma_2) = C \exp\Bigl(-\tfrac{1}{2}\bigl(\| z - \mu_1 \|_{\Sigma_1^{-1}}^2 + \| z - \mu_2 \|_{\Sigma_2^{-1}}^2\bigr)\Bigr),$
with $C = (2\pi)^{-d}\, |\Sigma_1|^{-1/2}\, |\Sigma_2|^{-1/2}$. Complete the square in $z$: let $A := \Sigma_1^{-1} + \Sigma_2^{-1}$ and $m := A^{-1}(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2)$. Then we obtain the following:
$\| z - \mu_1 \|_{\Sigma_1^{-1}}^2 + \| z - \mu_2 \|_{\Sigma_2^{-1}}^2 = \| z - m \|_A^2 + \| \mu_1 - \mu_2 \|_{(\Sigma_1 + \Sigma_2)^{-1}}^2, \qquad \| \mu_1 - \mu_2 \|_{(\Sigma_1 + \Sigma_2)^{-1}}^2 = -\| m \|_A^2 + \| \mu_1 \|_{\Sigma_1^{-1}}^2 + \| \mu_2 \|_{\Sigma_2^{-1}}^2.$
Integrating over $z$ uses $\int \exp\bigl(-\tfrac{1}{2}\| z - m \|_A^2\bigr)\, dz = (2\pi)^{d/2}\, |A^{-1}|^{1/2}$. After cancellations, one obtains the following:
$\int \mathcal{N}(z \mid \mu_1, \Sigma_1)\; \mathcal{N}(z \mid \mu_2, \Sigma_2)\, dz = (2\pi)^{-d/2}\, |\Sigma_1 + \Sigma_2|^{-1/2} \exp\Bigl(-\tfrac{1}{2}\, \| \mu_1 - \mu_2 \|_{(\Sigma_1 + \Sigma_2)^{-1}}^2\Bigr),$
which equals $\mathcal{N}\bigl(\mu_1 \mid \mu_2,\, \Sigma_1 + \Sigma_2\bigr)$. □
Proof of Theorem 1.
By definition, we obtain the following:
$\int \hat{q}(z)^2\, dz = \int \Bigl(\frac{1}{n}\sum_{i=1}^{n} \mathcal{N}(z \mid \mu_i, \Sigma_i)\Bigr) \Bigl(\frac{1}{n}\sum_{j=1}^{n} \mathcal{N}(z \mid \mu_j, \Sigma_j)\Bigr) dz = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \int \mathcal{N}(z \mid \mu_i, \Sigma_i)\; \mathcal{N}(z \mid \mu_j, \Sigma_j)\, dz.$
Apply Lemma 1 termwise to obtain the following:
$\int \hat{q}(z)^2\, dz = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \mathcal{N}\bigl(\mu_i \mid \mu_j,\, \Sigma_i + \Sigma_j\bigr).$
Finally, the Rényi entropy of order 2 is $H_2(\hat{q}) = -\log \int \hat{q}(z)^2\, dz$, which yields the claimed expression. □
Remark 1
(Differentiability and efficiency). The map $\{\mu_i, \Sigma_i\}_{i=1}^{n} \mapsto H_2(\hat{q})$ is smooth on the set of positive-definite $\Sigma_i$, since it is a composition of finite sums of Gaussian densities and a $\log(\cdot)$ applied to a strictly positive argument. Therefore, $H_2(\hat{q})$ admits unbiased reverse-mode derivatives with respect to both $\mu_i$ and $\Sigma_i$ and can be computed exactly in $O(n^2)$ time per mini-batch.
Remark 2
(Unbiasedness for the empirical mixture and consistency). Conditional on a fixed mini-batch $\{(\mu_i, \Sigma_i)\}_{i=1}^{n}$, Theorem 1 gives the exact $H_2$ of the empirical mixture $\hat{q}$, hence no estimation bias is introduced at this level. If the pairs $(\mu_i, \Sigma_i)$ are i.i.d. draws from a population distribution (e.g., induced by the data distribution and encoder), then the inner double sum is a U-statistic and converges almost surely to its population counterpart as $n \to \infty$ by the law of large numbers, implying consistency of the empirical $H_2$.
In summary, instead of minimizing $I(X;Z)$ directly, we minimize its tractable surrogate $H_2(Z)$. This yields a stable and theoretically justified penalty that regularizes the representations $Z_T$ and $Z_R$ in our model. The Rényi-2 formulation provides a closed-form, differentiable, and numerically stable regularizer for Gaussian-mixture posteriors. Unlike variational mutual-information estimators such as MINE [52] or InfoNCE [53], it requires no auxiliary neural network for density-ratio estimation and avoids the adversarial training dynamics that often destabilize optimization in mutual-information-based methods. This design allows MVIB-Lip to achieve a smooth and low-variance training process while maintaining the theoretical connection to mutual-information minimization under the Information Bottleneck principle.
We finally present our overall training objective. Let $q_\phi(z_T \mid X_T) = \mathcal{N}(\mu_T, \Sigma_T)$ and $q_\psi(z_R \mid X_R) = \mathcal{N}(\mu_R, \Sigma_R)$ be the posteriors for the temporal (T) and recurrence plot (R) encoders, respectively. The PoE joint posterior is $q(z \mid X_T, X_R) \propto q_\phi(z_T \mid X_T)\, q_\psi(z_R \mid X_R)$, and the classifier is $p_\omega(y \mid z)$. Our training minimizes the following:
$\mathcal{L} = \underbrace{-\,\mathbb{E}_{z \sim q(z \mid X_T, X_R)}\bigl[\log p_\omega(y \mid z)\bigr]}_{\text{prediction loss}} + \alpha\, \hat{H}_2(Z) + \beta_T\, \mathrm{KL}\bigl(q_\phi(z_T \mid X_T)\,\|\,\mathcal{N}(0, I)\bigr) + \beta_R\, \mathrm{KL}\bigl(q_\psi(z_R \mid X_R)\,\|\,\mathcal{N}(0, I)\bigr) + \gamma\, \mathrm{KL}\bigl(q(z \mid X_T, X_R)\,\|\,\mathcal{N}(0, I)\bigr),$
where H ^ 2 ( Z ) is the Rényi-2 penalty of the minibatch aggregated posterior (Thm. 1; Equation (25)). Unless stated otherwise, we use α = 0.1 , β T = β R = γ = 10 3 , selected on the validation set.
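A condensed sketch of this objective in PyTorch (it reuses the `renyi2_entropy` helper sketched above, assumes a single reparameterized sample of $z$ feeds the classifier, and uses illustrative names; it is not the exact training script):

```python
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims, averaged over the batch."""
    return (0.5 * (sigma ** 2 + mu ** 2 - 1.0 - 2.0 * torch.log(sigma)).sum(dim=-1)).mean()

def mvib_loss(logits, labels, mu_t, sig_t, mu_r, sig_r, mu_z, sig_z,
              alpha=0.1, beta_t=1e-3, beta_r=1e-3, gamma=1e-3):
    """Overall objective: prediction loss + alpha * Renyi-2 penalty + KL regularizers."""
    pred = F.cross_entropy(logits, labels)   # Monte Carlo estimate of -E[log p_w(y | z)]
    h2 = renyi2_entropy(mu_z, sig_z)         # Renyi-2 penalty on the fused posterior (see Section 3.4 sketch)
    kl = (beta_t * kl_to_standard_normal(mu_t, sig_t)
          + beta_r * kl_to_standard_normal(mu_r, sig_r)
          + gamma * kl_to_standard_normal(mu_z, sig_z))
    return pred + alpha * h2 + kl
```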

4. Experiments

In this paper, we consider both speaker-dependent and speaker-independent lipreading scenarios. Speaker-dependent recognition evaluates a model on the same speakers that appear in the training set, while speaker-independent recognition requires the model to generalize to entirely unseen speakers in the test set. Since accent, speech rate, and pronunciation habits vary significantly across speakers, it is well known that the performance of most lipreading systems drops substantially under the speaker-independent setting [25].
The proposed MVIB-Lip framework consists of two lightweight encoders and a shared classifier. The temporal encoder is a three-layer Transformer (hidden size 256, four attention heads) that models the dynamics of 12 lip landmarks across 30 frames, while the recurrence plot (RP) encoder is a modified ResNet-18 operating on single-channel 12 × 12 RP images to capture spatial texture evolution. The two latent distributions are fused using a Product-of-Experts (PoE) mechanism to obtain a compact joint embedding, followed by a linear classification head. All encoders employ diagonal Gaussian posteriors with bounded variance ( σ min = 10 3 , σ max = 0.5 ) for information bottleneck regularization. The model is trained end-to-end using the AdamW optimizer with an initial learning rate of 3 × 10 4 , weight decay of 10 4 , batch size of 64, and cosine learning-rate decay for 120 epochs. Training is performed on a single NVIDIA RTX 4090 GPU with mixed precision, achieving real-time inference at approximately 200 frames per second.
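A schematic PyTorch sketch of the two encoders described above (layer choices beyond those stated in the text, such as the pooling scheme and latent dimension, are our assumptions; torchvision's ResNet-18 is adapted to single-channel input, and each encoder outputs the (μ, s) pair of its Gaussian posterior):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TemporalEncoder(nn.Module):
    """Three-layer Transformer over 12 lip landmarks (24-dim) across 30 frames."""
    def __init__(self, in_dim=24, d_model=256, n_heads=4, n_layers=3, z_dim=128):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 2 * z_dim)    # -> (mu, s) of the Gaussian posterior

    def forward(self, x):                            # x: (B, 30, 24) landmark trajectories
        h = self.encoder(self.proj(x)).mean(dim=1)   # temporal average pooling (an assumption)
        mu, s = self.head(h).chunk(2, dim=-1)
        return mu, s

class RPEncoder(nn.Module):
    """Modified ResNet-18 over single-channel recurrence-plot images."""
    def __init__(self, z_dim=128):
        super().__init__()
        net = resnet18(weights=None)                 # no ImageNet pretraining
        net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        net.fc = nn.Linear(net.fc.in_features, 2 * z_dim)
        self.net = net

    def forward(self, x):                            # x: (B, 1, H, W) recurrence plots
        mu, s = self.net(x).chunk(2, dim=-1)
        return mu, s
```

The PoE-fused latent is then passed to a linear classification head that produces $p_\omega(y \mid z)$, as described above.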

4.1. Datasets

To provide a comprehensive evaluation, we therefore report results in both settings using four datasets.
OuluVS Database [8]: OuluVS is a widely used dataset for speaker-dependent evaluation. It contains 817 video sequences from 20 speakers, each uttering 10 fixed sentences one to five times. The speakers come from four different countries, exhibiting natural variation in accent and speech style. Because the training and test sets share the same speakers, OuluVS provides a standard benchmark for speaker-dependent lipreading.
Self-Collected Database: To further assess performance in speaker-dependent scenarios, we constructed our own dataset with 10 subjects (5 male and 5 female, aged 22–45) from the Xi’an University of Posts and Telecommunications. The participants come from four provinces of China, covering different accents, speech rates, and facial characteristics to ensure diversity. Each subject was recorded sitting in front of a 1080p HD camera (Logitech C920, Lausanne, Switzerland) at a distance of approximately 0.5 m, under uniform indoor illumination and a neutral background. Each participant repeated the same 10 sentences used in OuluVS ten times, yielding 100 clips per speaker (Table 1). The videos were captured at 30 fps and manually checked for alignment accuracy and visual quality. All recordings were then preprocessed using the same ROI cropping and alignment pipeline as OuluVS to ensure consistency across datasets. Metadata such as age, gender, and accent are retained to facilitate future cross-speaker or cross-domain studies. This dataset therefore introduces additional diversity in age, gender, accent, and skin color, while preserving controlled recording conditions suitable for reproducible experiments.
LRW [14]: LRW is a large-scale English word-level lipreading dataset designed for speaker-independent evaluation. It consists of more than 500,000 video clips extracted from BBC television, covering 500 target words. The dataset contains significant variation in pose, illumination, and background conditions. The official split includes approximately 489k training samples, 25k validation samples, and 25k test samples, with no overlap of speakers between training and test sets.
LRW-1000 [54]: LRW-1000 is currently the largest publicly available Mandarin lipreading dataset, also designed for speaker-independent evaluation. It includes more than 700,000 clips across 1000 word classes, recorded under unconstrained conditions with diverse speakers and large variations in pose and scale. The official partition provides around 718k training samples, 172k validation samples, and 172k test samples, again with no speaker overlap between training and testing.
For the LRW dataset, each video sequence is processed through several steps. First, we apply face detection and alignment to normalize the frames. Each frame is then aligned to a reference mean face shape, after which a fixed region of interest (ROI) of size 96 × 96 pixels is cropped around the mouth region to ensure consistent centering. The cropped frames are further converted to grayscale, as preliminary experiments showed no clear advantage in using RGB inputs. For the LRW-1000 dataset, the mouth ROIs are already provided, so no additional preprocessing is required.
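For illustration, the ROI cropping and grayscale conversion described above can be sketched as follows (face detection and alignment are delegated to an external landmark detector; the helper below only shows the cropping convention and is not the exact preprocessing script):

```python
import numpy as np

def crop_mouth_roi(frame, mouth_center, size=96):
    """Crop a fixed-size ROI around the mouth center and convert it to grayscale.

    frame        : (H, W, 3) aligned RGB frame (alignment to the mean face is done upstream).
    mouth_center : (x, y) mouth-center coordinates from the landmark detector.
    """
    h, w, _ = frame.shape
    cx, cy = int(round(mouth_center[0])), int(round(mouth_center[1]))
    half = size // 2
    x0 = int(np.clip(cx - half, 0, w - size))
    y0 = int(np.clip(cy - half, 0, h - size))
    roi = frame[y0:y0 + size, x0:x0 + size]
    # Luminance-style grayscale conversion (RGB inputs showed no clear advantage).
    gray = roi @ np.array([0.299, 0.587, 0.114])
    return gray.astype(np.float32)                   # (96, 96)
```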
In summary, OuluVS and our self-collected dataset are used to evaluate performance in the speaker-dependent setting, whereas LRW and LRW-1000 serve as challenging benchmarks for speaker-independent recognition.

4.2. Results

To provide a fair comparison under limited training data, we first evaluate our approach against several traditional machine learning baselines on the OuluVS and self-collected datasets. Specifically, we re-implemented systems based on LBP descriptors [8], HMMs [4], and DBNs [21]. In contrast to our framework, refs. [4,21] rely on sequential models (HMM and DBN, respectively) to capture temporal dynamics from image sequences, while [8] employs handcrafted spatiotemporal features extracted directly from video. In addition, to include stronger contemporary models, we added two recent visual baselines: ResNet-18 + TCN, which couples spatial convolution with temporal modeling, and VideoMAE [32], a transformer-based self-supervised model fine-tuned for lipreading. Due to the limited number of samples per phrase and speaker, we adopt the leave-one-utterance-out cross-validation protocol as in [8], where each utterance is iteratively held out for testing while the remaining utterances are used for training.
Table 2 summarizes the recognition accuracies averaged over 10 subjects. On OuluVS, our MVIB-Lip model achieves 87.0%, outperforming both traditional approaches (LBP 64.2%, HMM 60.9%, DBN 42.8%) and the two deep learning baselines (ResNet-18 + TCN 84.9%, VideoMAE-tiny 86.1%). On the self-collected dataset, MVIB-Lip attains 83.3%, higher than ResNet-18 + TCN (80.4%) and VideoMAE-tiny (81.7%). These results demonstrate that the proposed multi-view information-bottleneck fusion provides consistent gains over both handcrafted and modern visual baselines under limited training data.
It is worth noting that our current system is not yet optimized; more advanced time-series analysis techniques (e.g., [55,56]) could potentially yield further gains. Nonetheless, as an initial effort, our framework already demonstrates the effectiveness of combining deep learning with structured time-series modeling for lipreading, highlighting its promise for facial dynamics analysis under data-scarce conditions.
On the other hand, although MVIB-Lip introduces dual encoders, both are lightweight modules (a Transformer with ≤3 layers and a ResNet-18 backbone). The added computational cost is moderate (approximately 25 % increase during training), and inference remains real-time (about 200 frames/s on a single GPU). Moreover, the PoE fusion mechanism enables flexible deployment by using either view alone without significant performance loss.
We further compare our approach with strong deep learning baselines on the LRW and LRW-1000 benchmarks. As shown in Table 3, the original LRW baseline reports 61.1% accuracy on LRW, while the more advanced two-stream 3DCNN [57] achieves 84.1%. The current state-of-the-art is the Multi-Scale TCN [30], which obtains 85.3% on LRW and 41.4% on LRW-1000.
Our framework, which models lip movements as multivariate time series complemented by recurrence plot representations, achieves state-of-the-art performance on both LRW and LRW-1000. On LRW, our method attains 86.2% accuracy, clearly outperforming the two-stream 3DCNN (84.1%) and exceeding the Multi-Scale TCN (85.3%). LRW-1000 poses a much greater challenge due to its larger vocabulary and substantial speaker and pose variability. The LRW baseline is expected to drop to around 28%, while the two-stream 3DCNN is estimated to reach 38.7%. In contrast, our approach obtains 42.1%, surpassing the Multi-Scale TCN by +0.7 points and the estimated two-stream 3DCNN by +3.4 points. These improvements demonstrate that explicitly modeling lip dynamics as structured time series, together with recurrence-based complementary features, provides consistent advantages over existing deep learning architectures on both moderate- and large-scale benchmarks.
Finally, we note that a transformer-based architecture LipFormer [58] achieves 87.3% accuracy on LRW and approximately 45.2% on LRW-1000 when trained on full-scale data. In contrast, the proposed MVIB-Lip attains 86.2% on LRW and 42.1% on LRW-1000 using a considerably smaller backbone (ResNet-18 combined with a three-layer Transformer) and without large-scale pretraining. These results indicate that MVIB-Lip delivers competitive performance while preserving interpretability, computational efficiency, and robustness under data-limited settings. Moreover, the framework is complementary to large transformer-based systems and can be seamlessly integrated with them through the multi-view bottleneck formulation to further enhance generalization.

4.3. Ablation Study

To better understand the contribution of each component in our framework, we conduct an ablation study on the LRW dataset. Our method consists of two complementary views: (1) the raw landmark time series, which capture temporal dynamics of mouth movements, and (2) recurrence plots (RPs), which provide structural representations of temporal similarity. In addition, our framework is regularized with an information bottleneck (IB) objective to encourage the fused representation to retain task-relevant information while discarding nuisance variability.
Table 4 reports the recognition accuracy under different configurations. When using only the time-series view, the system achieves solid performance (82.3%), demonstrating that temporal trajectories alone carry strong discriminative cues. Recurrence plots alone yield 80.7%, slightly lower but still competitive, indicating that structural recurrence information is also informative. When combining both views without IB regularization, the performance increases to 85.4%, confirming that the two modalities are complementary. Finally, the full model with IB regularization further improves to 86.2%, showing that the IB constraint helps to filter out redundant view-specific information and sharpen the fused representation.
These results highlight three key insights. First, both time-series trajectories and recurrence plots provide valuable but distinct information. Second, their fusion yields clear gains, confirming the complementarity of the two views. Third, IB regularization plays a crucial role by enforcing compactness and task relevance, leading to the best overall performance.
To examine the influence of batch size on the Rényi-2 regularization, we trained MVIB-Lip using mini-batches of 32, 64, and 128 samples while keeping all other hyperparameters fixed. The resulting accuracies on OuluVS varied by less than 0.4%, indicating that the estimator is numerically stable across practical batch sizes. In addition, we observed that scaling the coefficient α proportionally to 1/B, where B is the mini-batch size, preserves a comparable regularization strength and yields nearly identical convergence behavior. This confirms that the empirical Rényi-2 estimator remains unbiased for the minibatch mixture and consistently approximates the population mixture as the batch size increases.

5. Conclusions

In this paper, we presented a novel multi-view framework for lipreading that jointly models mouth landmark trajectories as multivariate time series and their recurrence plot representations. By fusing these complementary views under an information bottleneck (IB) principle, our method captures both fine-grained temporal dynamics and structural recurrence patterns while discarding nuisance variability.
Extensive experiments under both speaker-dependent and speaker-independent settings demonstrate the effectiveness of our approach. On OuluVS and our self-collected dataset, our framework consistently outperforms traditional approaches such as LBP, HMM, and DBN, highlighting the advantage of deep learning combined with structured time-series modeling in low-data regimes. On large-scale benchmarks, our method achieves state-of-the-art performance, surpassing the two-stream 3DCNN and Multi-Scale TCN on LRW and LRW-1000. Ablation studies further confirm that both views contribute complementary information and that IB regularization plays a crucial role in improving the fused representation.
Overall, our work demonstrates that explicitly treating lip movements as structured time series, augmented with recurrence-based representations, provides a powerful and flexible framework for visual speech recognition. In future work, we plan to extend this paradigm to sentence-level lipreading in more diverse conditions, explore advanced time-series analysis tools, and investigate its integration into multi-modal audiovisual speech recognition systems.

Author Contributions

Conceptualization, Y.L. and J.W.; methodology, Y.L. and J.W.; software, Y.L., H.S., and J.C.; validation, Y.L., H.S., and J.C.; formal analysis, Y.L.; investigation, Y.L.; resources, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, J.W.; visualization, Y.L., H.S. and J.C.; supervision, J.W.; project administration, J.W.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China, Science and Technology Innovation 2030—“New Generation Artificial Intelligence” Major Project: Development and Application of Self-Reconstructing and Self-Evolving AI Chips for Complex Scenarios (Grant No. 2022ZD0119000).

Data Availability Statement

The OuluVS and LRW/LRW-1000 datasets used in this study are publicly available at https://www.oulu.fi/en/university/faculties-and-units/faculty-information-technology-and-electrical-engineering/center-for-machine-vision-and-signal-analysis and https://www.robots.ox.ac.uk/~vgg/data/lip_reading/, accessed on 7 September 2025. The self-collected dataset was acquired at Xi’an University of Posts and Telecommunications under institutional ethics approval and is available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
IB    Information Bottleneck
MVIB    Multi-View Information Bottleneck
SDM    Supervised Descent Method
RP    Recurrence Plot

References

  1. Matthews, I.; Cootes, T.F.; Bangham, J.A.; Cox, S.; Harvey, R. Extraction of visual features for lipreading. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 198–213. [Google Scholar] [CrossRef]
  2. Zhao, G.; Barnard, M.; Pietikäinen, M. Lipreading with local spatiotemporal descriptors. IEEE Trans. Multimed. 2009, 11, 1254–1265. [Google Scholar] [CrossRef]
  3. Fan, X.; Busso, C.; Hansen, J.H.L. Audio-visual isolated digit recognition for whispered speech. In Proceedings of the 2011 19th European Signal Processing Conference, Barcelona, Spain, 29 August–2 September 2011; pp. 1500–1503. [Google Scholar]
  4. Tao, F.; Busso, C. Lipreading approach for isolated digits recognition under whisper and neutral speech. In Proceedings of the Interspeech, Singapore, 14–18 September 2014; pp. 1154–1158. [Google Scholar]
  5. Almajai, I.; Cox, S.; Harvey, R.; Lan, Y. Improved speaker independent lip reading using speaker adaptive training and deep neural networks. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 2722–2726. [Google Scholar]
  6. Movellan, J.R. Visual speech recognition with stochastic networks. In Proceedings of the Advances in Neural Information Processing Systems, Denver, Colorado, 27 November–2 December 1995; pp. 851–858. [Google Scholar]
  7. Wang, S.L.; Lau, W.H.; Leung, S.H. Automatic lipreading with limited training data. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 3, pp. 881–884. [Google Scholar]
  8. Zhao, G.; Pietikäinen, M.; Hadid, A. Local spatiotemporal descriptors for visual recognition of spoken phrases. In Proceedings of the International Workshop on HUMAN-Centered Multimedia, Augsburg, Germany, 28 September 2007; pp. 57–66. [Google Scholar]
  9. Yau, W.C.; Kumar, D.K.; Chinnadurai, T. Lip-reading technique using spatio-temporal templates and support vector machines. In Progress in Pattern Recognition, Image Analysis and Applications, Proceedings of the Iberoamerican Congress on Pattern Recognition, Havana, Cuba, 9–12 September 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 610–617. [Google Scholar]
  10. Lan, Y.; Theobald, B.J.; Harvey, R.; Ong, E.J.; Bowden, R. Improving visual features for lip-reading. In Proceedings of the AVSP, Kanagawa, Japan, 30 September–3 October 2010. [Google Scholar]
  11. Ojala, T.; Pietikäinen, M.; Harwood, D. A comparative study of texture measures with classification based on feature distributions. Pattern Recognit. 1996, 29, 51–59. [Google Scholar] [CrossRef]
  12. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef]
  13. Assael, Y.M.; Shillingford, B.; Whiteson, S.; de Freitas, N. LipNet: Sentence-level lipreading. arXiv 2016, arXiv:1611.01599. [Google Scholar]
  14. Chung, J.S.; Zisserman, A. Lip reading in the wild. In Computer Vision–ACCV 2016, Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Springer: Cham, Switzerland, 2016; pp. 87–103. [Google Scholar]
  15. Wang, C. Multi-Grained Spatio-temporal Modeling for Lip-reading. In Proceedings of the BMVC, Cardiff, UK, 9–12 September 2019. [Google Scholar]
  16. Ma, Y.; Sun, X. Spatiotemporal Feature Enhancement for Lip-Reading: A Survey. Appl. Sci. 2025, 15, 4142. [Google Scholar] [CrossRef]
  17. Ma, P.; Petridis, S.; Pantic, M. Visual speech recognition for multiple languages in the wild. Nat. Mach. Intell. 2022, 4, 930–939. [Google Scholar] [CrossRef]
  18. Shi, B.; Hsu, W.N.; Mohamed, A. Robust Self-Supervised Audio-Visual Speech Recognition. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 2118–2122. [Google Scholar]
  19. Cootes, T.; Taylor, C.; Cooper, D.; Graham, J. Active Shape Models-Their Training and Application. Comput. Vis. Image Underst. 1995, 61, 38–59. [Google Scholar] [CrossRef]
  20. Cootes, T.F.; Edwards, G.J.; Taylor, C.J. Active appearance models. In Computer Vision-ECCV’98, Proceedings of the European Conference on Computer Vision, Freiburg, Germany, 2–6 June 1998; Springer: Berlin/Heidelberg, Germany, 1998; pp. 484–498. [Google Scholar]
  21. Saenko, K.; Livescu, K.; Siracusa, M.; Wilson, K.; Glass, J.; Darrell, T. Visual speech recognition with loosely synchronized feature streams. In Proceedings of the Tenth IEEE International Conference on Computer Vision, Beijing, China, 17–21 October 2005; Volume 2, pp. 1424–1431. [Google Scholar]
  22. Potamianos, G.; Neti, C.; Gravier, G.; Garg, A.; Senior, A.W. Recent advances in the automatic recognition of audiovisual speech. Proc. IEEE 2003, 91, 1306–1326. [Google Scholar] [CrossRef]
  23. Zhou, Z.; Zhao, G.; Hong, X.; Pietikäinen, M. A review of recent advances in visual speech decoding. Image Vis. Comput. 2014, 32, 590–605. [Google Scholar] [CrossRef]
  24. Liu, L.; Feng, G.; Beautemps, D. Automatic dynamic template tracking of inner lips based on CLNF. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5130–5134. [Google Scholar]
  25. Cox, S.J.; Harvey, R.W.; Lan, Y.; Newman, J.L.; Theobald, B.J. The challenge of multispeaker lip-reading. In Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP2008), Moreton Island, QLD, Australia, 26–29 September 2008; pp. 179–184. [Google Scholar]
  26. Potamianos, G.; Neti, C.; Luettin, J.; Matthews, I. Audio-Visual Automatic Speech Recognition: An Overview. Issues Vis. Audiov. Speech Process. 2004, 22, 23. [Google Scholar]
  27. Son Chung, J.; Senior, A.; Vinyals, O.; Zisserman, A. Lip reading sentences in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6447–6456. [Google Scholar]
  28. Afouras, T.; Chung, J.S.; Senior, A.; Vinyals, O.; Zisserman, A. Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 44, 8717–8727. [Google Scholar] [CrossRef] [PubMed]
  29. Stafylakis, T.; Tzimiropoulos, G. Combining Residual Networks with LSTMs for Lipreading. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 3652–3656. [Google Scholar]
  30. Martinez, B.; Ma, P.; Petridis, S.; Pantic, M. Lipreading using temporal convolutional networks. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6319–6323. [Google Scholar]
  31. Ma, P.; Petridis, S.; Pantic, M. End-to-end audio-visual speech recognition with conformers. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7613–7617. [Google Scholar]
  32. Tong, Z.; Song, Y.; Wang, J.; Wang, L. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural Inf. Process. Syst. 2022, 35, 10078–10093. [Google Scholar]
  33. Cai, Z.; Ghosh, S.; Stefanov, K.; Dhall, A.; Cai, J.; Rezatofighi, H.; Haffari, R.; Hayat, M. Marlin: Masked autoencoder for facial video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1493–1504. [Google Scholar]
  34. Petridis, S.; Wang, Y.; Li, Z.; Pantic, M. End-to-End Multi-View Lipreading. In Proceedings of the British Machine Vision Conference, BMVC 2017, London, UK, 4–7 September 2017. [Google Scholar]
  35. Zhang, X.; Zhang, C.; Sui, J.; Sheng, C.; Deng, W.; Liu, L. Boosting lip reading with a multi-view fusion network. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
  36. Zhang, W.; Wang, J.; Luo, Y.; Yu, L.; Yu, W.; He, Z.; Shen, J. Mtga: Multi-view temporal granularity aligned aggregation for event-based lip-reading. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 10176–10184. [Google Scholar]
  37. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, Monticello, IL, USA, 22–24 September 1999; pp. 368–377. [Google Scholar]
  38. Yu, S.; Giraldo, L.G.S.; Príncipe, J.C. Information-Theoretic Methods in Deep Neural Networks: Recent Advances and Emerging Opportunities. In Proceedings of the IJCAI, Virtual, 19–26 August 2021; pp. 4669–4678. [Google Scholar]
  39. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep Variational Information Bottleneck. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  40. Zhang, Q.; Yu, S.; Xin, J.; Chen, B. Multi-view information bottleneck without variational approximation. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 4318–4322. [Google Scholar]
  41. Wang, Q.; Boudreau, C.; Luo, Q.; Tan, P.N.; Zhou, J. Deep multi-view information bottleneck. In Proceedings of the 2019 SIAM International Conference on Data Mining, Calgary, AB, Canada, 2–4 May 2019; SIAM: Philadelphia, PA, USA, 2019; pp. 37–45. [Google Scholar]
  42. Mai, S.; Zeng, Y.; Hu, H. Multimodal information bottleneck: Learning minimal sufficient unimodal and multimodal representations. IEEE Trans. Multimed. 2022, 25, 4121–4134. [Google Scholar] [CrossRef]
  43. Xiong, X.; De la Torre, F. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 532–539. [Google Scholar]
  44. Liu, L.; Hu, J.; Zhang, S.; Deng, W. Extended supervised descent method for robust face alignment. In Computer Vision-ACCV 2014 Workshops, Proceedings of the Asian Conference on Computer Vision, Singapore, 1–2 November 2014; Springer: Cham, Switzerland, 2014; pp. 71–84. [Google Scholar]
  45. Baltrusaitis, T.; Robinson, P.; Morency, L.P. Constrained local neural fields for robust facial landmark detection in the wild. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2–8 December 2013; pp. 354–361. [Google Scholar]
  46. Eckmann, J.P.; Kamphorst, S.O.; Ruelle, D. Recurrence Plots of Dynamical Systems. Europhys. Lett. (EPL) 1987, 4, 973–977. [Google Scholar] [CrossRef]
  47. Souza, V.M.; Silva, D.F.; Batista, G.E. Extracting Texture Features for Time Series Classification. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 1425–1430. [Google Scholar]
  48. Tomasi, C.; Manduchi, R. Bilateral filtering for gray and color images. In Proceedings of the IEEE International Conference on Computer Vision, Bombay, India, 4–7 January 1998; pp. 839–846. [Google Scholar]
  49. Song, J.; Zheng, Y.; Wang, J.; Zakir Ullah, M.; Jiao, W. Multicolor image classification using the multimodal information bottleneck network (MMIB-Net) for detecting diabetic retinopathy. Opt. Express 2021, 29, 22732–22748. [Google Scholar] [CrossRef] [PubMed]
  50. Ahuja, K.; Caballero, E.; Zhang, D.; Gagnon-Audet, J.C.; Bengio, Y.; Mitliagkas, I.; Rish, I. Invariance principle meets information bottleneck for out-of-distribution generalization. Adv. Neural Inf. Process. Syst. 2021, 34, 3438–3450. [Google Scholar]
  51. Principe, J.C. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
  52. Belghazi, M.I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, D. Mutual information neural estimation. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 531–540. [Google Scholar]
  53. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  54. Yang, S.; Zhang, Y.; Feng, D.; Yang, M.; Wang, C.; Xiao, J.; Long, K.; Shan, S.; Chen, X. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–8. [Google Scholar]
  55. Cinar, G.T.; Principe, J.C. Clustering of time series using a hierarchical linear dynamical system. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 6741–6745. [Google Scholar]
  56. You, X.; Guo, W.; Yu, S.; Li, K.; Príncipe, J.C.; Tao, D. Kernel learning for dynamic texture synthesis. IEEE Trans. Image Process. 2016, 25, 4782–4795. [Google Scholar] [CrossRef]
  57. Weng, X.; Kitani, K. Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading. In Proceedings of the British Machine Vision Conference (BMVC), Cardiff, UK, 9–12 September 2019. [Google Scholar]
  58. Xue, F.; Li, Y.; Liu, D.; Xie, Y.; Wu, L.; Hong, R. Lipformer: Learning to lipread unseen speakers based on visual-landmark transformers. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4507–4517. [Google Scholar] [CrossRef]
Figure 1. The architecture of the proposed MVIB-Lip framework. The system takes two complementary views of lip movements derived from landmark trajectories: (i) the raw landmark time series, and (ii) recurrence plots (RPs) generated from the same time series. The temporal view is encoded using a Transformer to capture dynamic dependencies, while the RP view is processed with a ResNet-18 to extract discriminative texture features. Each encoder produces a variational posterior distribution, q_ϕ(z_T | X_T) and q_ψ(z_R | X_R). These are fused through a product-of-experts (PoE) posterior, q(z | X_T, X_R), which integrates shared, task-relevant information while tolerating missing views. The fused latent is regularized using the multi-view information bottleneck: KL-divergence penalties ensure compression of each view and the joint posterior, while a cross-view agreement loss encourages consistency between z_T and z_R. A classifier head p(y | z) (cross-entropy or CTC) predicts the target sentence label. Optional modules include per-speaker normalization to reduce speaker-specific variations and PoE-based robustness to missing inputs.
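To make the fusion and regularization in Figure 1 concrete, the sketch below combines two diagonal-Gaussian posteriors with a product-of-experts rule and assembles an IB-style objective in PyTorch. The encoder outputs (mu_t, logvar_t, mu_r, logvar_r), the unit-Gaussian prior expert, the L2 cross-view agreement term, and the weights beta and gamma are illustrative assumptions, not the exact implementation used in the paper.

```python
# Minimal sketch of product-of-experts (PoE) fusion of two diagonal-Gaussian
# posteriors plus IB-style KL penalties. Names and shapes are illustrative.
import torch
import torch.nn.functional as F

def poe_fuse(mu_t, logvar_t, mu_r, logvar_r, eps=1e-8):
    """Fuse q(z | X_T) and q(z | X_R) by multiplying Gaussian experts.

    A unit-Gaussian prior expert is included, a common PoE convention
    (an assumption here, not necessarily the authors' exact choice).
    """
    mus = torch.stack([torch.zeros_like(mu_t), mu_t, mu_r])            # prior, view T, view R
    logvars = torch.stack([torch.zeros_like(logvar_t), logvar_t, logvar_r])
    precisions = 1.0 / (logvars.exp() + eps)                           # lambda_i = 1 / sigma_i^2
    var_joint = 1.0 / precisions.sum(dim=0)                            # 1 / sum_i lambda_i
    mu_joint = var_joint * (precisions * mus).sum(dim=0)               # precision-weighted mean
    return mu_joint, var_joint.log()

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims, averaged over the batch
    return 0.5 * (logvar.exp() + mu.pow(2) - 1.0 - logvar).sum(dim=-1).mean()

def mvib_loss(logits, labels, mu_t, logvar_t, mu_r, logvar_r, beta=1e-3, gamma=1.0):
    """Task term + compression of each view and of the joint posterior + agreement.

    `logits` are assumed to come from a classifier head applied to a sample of the
    joint posterior (not shown). The L2 agreement between posterior means is a
    simple stand-in for the paper's cross-view agreement loss.
    """
    mu_j, logvar_j = poe_fuse(mu_t, logvar_t, mu_r, logvar_r)
    task = F.cross_entropy(logits, labels)
    compression = (kl_to_standard_normal(mu_t, logvar_t)
                   + kl_to_standard_normal(mu_r, logvar_r)
                   + kl_to_standard_normal(mu_j, logvar_j))
    agreement = F.mse_loss(mu_t, mu_r)
    return task + beta * compression + gamma * agreement
```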
Figure 2. Multivariate time series generation from facial landmarks. The top row shows the facial landmark detection and tracking results obtained with SDM for four representative frames (Frames 1, 10, 95, and 127) of the speech video “How are you”. The bottom row shows the resulting multivariate time series; different colors denote different lip regions.
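As a rough illustration of the preprocessing behind Figure 2, the sketch below stacks tracked lip landmarks into a multivariate time series with a simple centroid-and-scale normalization. The number of lip points, the coordinate layout, and the normalization rule are assumptions for illustration; the SDM-based pipeline used in the paper may differ in its details.

```python
# Minimal sketch: turning tracked lip landmarks into a multivariate time series.
# The landmark layout (20 lip points per frame) and the centering/scale
# normalization are illustrative assumptions.
import numpy as np

def landmarks_to_series(landmarks):
    """landmarks: array of shape (T, P, 2) with P lip points tracked over T frames.

    Returns an array of shape (T, 2P): per-frame coordinates centered on the
    lip centroid and scaled by the mean distance to the centroid, which removes
    translation and (roughly) speaker-specific mouth size.
    """
    landmarks = np.asarray(landmarks, dtype=np.float64)
    T, P, _ = landmarks.shape
    centroid = landmarks.mean(axis=1, keepdims=True)        # (T, 1, 2)
    centered = landmarks - centroid
    scale = np.linalg.norm(centered, axis=2).mean(axis=1)   # (T,)
    normalized = centered / scale[:, None, None]
    return normalized.reshape(T, 2 * P)

# Example: 130 frames of 20 lip landmarks -> a 40-dimensional series of length 130
series = landmarks_to_series(np.random.rand(130, 20, 2))
```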
Figure 3. Three recurrence texture plots obtained from the multivariate time series of “Goodbye”, “Hello”, and “How are you”. The original videos of these three phrases were recorded by the same speaker.
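The recurrence textures in Figure 3 can be reproduced in spirit with the classical recurrence-plot construction of Eckmann et al. [46]: compute pairwise distances between the time-series states and either threshold them or render the raw distance matrix as a grayscale texture. The Euclidean metric, the fractional threshold, and the unthresholded variant in the sketch below are illustrative choices, not necessarily those used here.

```python
# Minimal sketch of a recurrence plot for a multivariate time series,
# following the thresholded definition R[i, j] = 1 if ||x_i - x_j|| <= eps.
import numpy as np

def recurrence_plot(series, threshold=None):
    """series: array of shape (T, D). Returns a (T, T) recurrence matrix.

    If `threshold` is None, the raw pairwise-distance matrix is returned,
    which can be rendered directly as a grayscale texture image.
    """
    x = np.asarray(series, dtype=np.float64)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)   # (T, T) pairwise distances
    if threshold is None:
        return dists
    eps = threshold * dists.max()       # threshold as a fraction of the largest distance
    return (dists <= eps).astype(np.float64)

# Example: a distance texture and a binary RP at 10% of the maximum distance
texture = recurrence_plot(np.random.rand(130, 40))
binary_rp = recurrence_plot(np.random.rand(130, 40), threshold=0.1)
```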
Figure 4. The Markov chain assumed in our formulation.
Table 1. The 10 different sentences recorded in the OuluVS database and our self-collected dataset.
“Excuse me.”           “Hello.”
“Goodbye.”             “See you.”
“Nice to meet you.”    “Thank you.”
“How are you.”         “I am sorry.”
“You are welcome.”     “Have a good time.”
Table 2. Recognition accuracies (%) of isolated sentences averaged over 10 subjects on OuluVS and our self-collected dataset.
Dataset      LBP [8]    HMM [4]    DBN [21]    ResNet-18 + TCN    VideoMAE [32]    Ours
OuluVS       64.2       60.9       42.8        84.9               86.1             87.0
Self-data    64.7       65.2       39.2        80.4               81.7             83.3
Table 3. Comparison with strong deep learning methods on LRW and LRW-1000 (classification accuracy %).
Method                   LRW     LRW-1000
LRW [14]                 61.1    28.0
Two-stream 3DCNN [57]    84.1    38.7
Multi-Scale TCN [30]     85.3    41.4
Ours                     86.2    42.1
Table 4. Ablation study on the LRW dataset (classification accuracy %).
Configuration                              Accuracy
Time series only                           82.3
Recurrence plot only                       80.7
Multi-view fusion (w/o IB)                 85.4
Multi-view fusion (with IB, full model)    86.2