Article

A Vision-Based Subtitle Generator: Text Reconstruction via Subtle Vibrations from Videos

School of Mechanical Engineering, Beijing Institute of Technology, Haidian District, Beijing 100081, China
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(5), 1407; https://doi.org/10.3390/s26051407
Submission received: 8 December 2025 / Revised: 7 February 2026 / Accepted: 10 February 2026 / Published: 24 February 2026
(This article belongs to the Section Physical Sensors)

Abstract

Subtle vibrations induced in everyday objects by ambient sound, especially speech, carry rich acoustic cues that can be transformed into meaningful text, with implications for monitoring and security-related scenarios. This paper presents a Vision-based Subtitle Generator (VSG), the first attempt to recover text directly from high-speed videos of sound-induced object vibrations using a generative approach. To this end, VSG introduces a phase-based motion estimation (PME) technique that treats each pixel as an “independent microphone” and extracts thousands of pseudo-acoustic signals from a single video. Meanwhile, the pretrained Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) model serves as the encoder of the proposed VSG-Transformer architecture, effectively transferring large-scale acoustic representation knowledge to the vibration-to-text task. These strategies significantly reduce reliance on large volumes of video data. Experimentally, text was generated from vibrations induced in a bag of chips by AISHELL-1 audio samples. Two VSG-Transformer variants with different parameter scales (Base and Large) achieved character error rates of 13.7% and 12.5%, respectively, demonstrating the effectiveness of the proposed generative approach. Furthermore, experiments using signal upsampling techniques show that the VSG-Transformer maintains effective performance when operating on videos with limited temporal sampling, indicating robustness to lower sampling rates.

1. Introduction

When sound interacts with an object, it induces subtle vibrations on the object's surface. The resulting motion patterns partially preserve informative features of the surrounding acoustic environment. Remarkably, if speech is a component of the ambient sound, the vibrations may encode semantic content that can be extracted and reconstructed into text. This is analogous to subtitle generation for a silent film, and we term the system that we built a Vision-based Subtitle Generator (VSG).
In recent years, subtle sound-induced vibrations of object surfaces have exhibited significant potential in terms of surveillance and security [1]. Such vibrations can reveal human activity in the environment and even enable remote eavesdropping, as speech-induced mechanical responses can be sensed without direct acoustic access. This capability raises important privacy and security considerations and motivates systematic investigation of vibration-based sensing [1,2,3,4,5,6,7,8,9]. Most prior work in this area falls into two main categories [2]: audio recovery and acoustic source classification.
Audio recovery methods reconstruct acoustic signals in a form that can be played and recognized by the human ear [1,2,3,4,5,6,7,8,9]. One classical example involves the use of laser Doppler vibrometers (LDVs) [3], which recover sound by measuring the Doppler shift in laser light reflected as the surface of an object vibrates. In recent years, researchers have explored a range of alternative physical mechanisms for audio recovery. Lamphone [2,4] uses telescopes and optical sensors to detect minute optical variations of distant objects that are induced by sound-driven vibrations. The intensities of speaker light-emitting diode (LED) indicators [5] and vibrations of the read/write heads of hard disk drives [6] can also “accidentally” leak acoustic information that enables remote audio recovery. In addition, camera-based vibration measurement has attracted considerable research interest [10,11,12], demonstrating that subtle vibrations can be reliably extracted from video using image analysis, thereby enabling new opportunities for audio reconstruction from visual signals. For example, Davis et al. [1] were the first to introduce a phase-based algorithm that retrieved acoustic signals from human-imperceptible motions in high-speed videos. Zhang et al. [7] developed a more computationally efficient method using singular value decomposition (SVD) techniques to enhance the robustness and accuracy of sound reconstruction from video data.
In contrast, acoustic source classification techniques aim to extract information about the sound sources in a given scene, such as the number and/or gender of speakers; some methods also link signals to particular words in a precompiled dictionary [13,14,15,16,17]. For example, Gyrophone [13] uses the micro-electro-mechanical system (MEMS) gyroscopes embedded in modern smartphones to capture signals, enabling speaker identification and digit-level speech recognition, albeit within a limited vocabulary (“one”, “two”, “three”, etc.). Side Eye [14] leverages the rolling shutter and movable lens mechanisms of mainstream smartphone cameras to create a point-of-view (POV), optical-acoustic side channel that accurately identifies spoken digits, speakers, and genders. As the latter technique does not focus on human auditory perception, a broad range of unintelligible or rough signals can be utilized. Table 1 presents a summary of related work.
However, both mainstream approaches are associated with distinct challenges [2]:
(1)
Audio recovery techniques focus on human auditory perception, imposing strict requirements on the complex signal processing pipelines that are often built using expert prior knowledge. Recognition performance can vary significantly with the hearing capacity and training level of the listener.
(2)
The principal limitation of classification-based techniques is the restricted output space of existing models. Classification is often limited to isolated words or digits drawn from precompiled dictionaries, making it difficult to adapt such word lists to new, task-specific scenarios.
Innovatively, VSG represents the first attempt to recover text directly from high-speed videos of sound-induced object vibrations via a generative approach. Compared to existing constrained approaches—either human-auditory-dependent audio recovery or fixed-vocabulary acoustic source classification—VSG transcends their limitations by formulating the task as an open-ended generative problem. To tackle Challenge 1, VSG introduces a phase-based motion estimation (PME) technique that treats each pixel as an “independent microphone”, extracting thousands of pseudo-acoustic signals (PASs) from a single video. Large volumes of human-unintelligible PASs are directly utilized during training. This also avoids any need for the complex, task-specific, audio preprocessing steps that were traditionally used to enhance the auditory experience of the listener. For Challenge 2, VSG employs an autoregressive generative architecture (termed VSG-Transformer) rather than classification-based designs. The state-of-the-art pretrained Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) serves as the encoder in this architecture. In recent years, the original BERT [18] and variants thereof, such as HuBERT [19], XLNet [20], and RoBERTa [21], have become the dominant paradigms of natural language processing. Through transfer learning, the pretrained HuBERT model can be modified to handle downstream tasks, effectively transferring large-scale acoustic representation knowledge to the vibration-to-text conversion. Moreover, these strategies significantly reduce reliance on large volumes of video data.
Experimentally, everyday objects were excited using standard audio samples from the AISHELL-1 corpus, a widely used, open-source Mandarin speech dataset that serves as a speech recognition benchmark [22]. The evaluation metric is the character error rate (CER), the standard measure for assessing automatic speech recognition (ASR) models. The results validate the effectiveness of the proposed VSG approach: text is reconstructed from subtle object vibrations captured on video. Specifically, VSG-Transformer-Base and VSG-Transformer-Large (similar models of different scales) exhibited CERs of 13.7% and 12.5%, respectively. In addition, performance variations and limiting cases under diverse acquisition conditions were analyzed to assess robustness. Such analysis helps characterize when vibration-based text reconstruction is more likely to be effective, providing a basis for informed consideration of its broader implications, such as potential monitoring- or eavesdropping-related applications. The applicability of the VSG-Transformer to videos with limited temporal sampling was further investigated. Signal upsampling techniques were employed to mitigate the reliance on high-speed imaging devices. When appropriate interpolation strategies were applied, the VSG-Transformer maintained acceptable recognition performance under limited temporal resolution. These results suggest that the proposed VSG approach remains applicable when video data are captured with lower temporal sampling.
The remainder of the paper is organized as follows: Section 2 describes the proposed method. Section 3 presents the experimental validation and discusses the results. Section 4 concludes the paper.

2. Methods

This section presents the general framework of the proposed VSG and the key related concepts, including PAS synthesis and the VSG-Transformer architecture. The theoretical background and implementation details follow.

2.1. General Framework of the Method

VSG senses ambient sound purely through vision: the semantic content encoded in pixel signals is used to reconstruct the output text. Figure 1 illustrates the overall flow, which consists of the following steps. A full demonstration animation of VSG is available at [https://youtu.be/FLw-quDNizQ] (accessed on 9 February 2026) (Animation_S1 in Supplementary Materials).
  • Video capture: The response of any object to sound is purely physical. As the acoustic excitations vary, the resulting object surface vibrations are captured by a camera, effectively transforming physical displacements into pixel-level signals within the video frames;
  • PAS acquisition: PASs are obtained via phase-based processing of the pixel signals;
  • VSG-Transformer training and testing: Large-scale PASs that encode rich acoustic features are used to construct the PAS dataset employed to train and evaluate VSG-Transformer. A multi-stage transfer learning strategy effectively links the pretrained acoustic representations of HuBERT to the PAS-driven VSG task;
  • Text reconstruction: The trained VSG-Transformer reconstructs text based on the PASs extracted from new videos.
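The steps above can be sketched end-to-end as follows. All function bodies are illustrative stubs, not the authors' implementation; only the data flow (video, then PAS, then text) mirrors the actual pipeline.

```python
# Minimal sketch of the VSG pipeline. Every function body is a
# placeholder stub; only the ordering of the four steps is faithful.

def capture_video(scene):
    """Step 1: the camera turns sound-induced surface motion into pixel signals."""
    return scene["frames"]

def acquire_pas(frames):
    """Step 2: phase-based processing of pixel signals (stubbed as a frame mean)."""
    return [sum(frame) / len(frame) for frame in frames]

def vsg_transformer(pas):
    """Steps 3-4: a trained VSG-Transformer maps a PAS to text (stubbed)."""
    return "<reconstructed text>"

def vsg_pipeline(scene):
    frames = capture_video(scene)
    pas = acquire_pas(frames)
    return vsg_transformer(pas)

print(vsg_pipeline({"frames": [[0.0, 1.0], [1.0, 1.0]]}))
```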

2.2. Extraction of PAS

As outlined above, VSG requires audio-like signals that effectively encode acoustic features of the input video V(x,y,t). Crucially, such signals must be widely accessible. This section details the principles and procedures involved. The process can be broadly divided into two stages. First, local motion signals are computed at each pixel location. Next, these signals are transformed into pixel-level representations that approximate audio waveforms. These are the PASs.
(a)
Computation of Local Motion Signals
Fleet and Jepson were the first to use spatio-temporal bandpass filters for PME in image sequences [23]. The phase gradient of the complex-valued output is a reliable approximation of the motion field. Building on this, subsequent studies used more advanced filtering strategies [24,25,26]. For example, Gautama and Van Hulle [24] employed a set of quadrature Gabor filter pairs when computing the temporal phase gradient of a spatially bandpassed video to estimate the motion field. In recent years, many studies have bypassed the explicit computation of optical-flow vector fields by, rather, directly leveraging phase variations when estimating the displacements of image textures in video sequences [1,27].
Following the approach of [1], local motion is here computed by analyzing the phase variations within a complex, steerable pyramid representation of the input video V(x,y,t). A complex steerable pyramid is a multi-scale, multi-orientation filter bank (see [27] for details) that decomposes each video frame into complex-valued sub-bands, indexed by the scale r and orientation θ. At each pixel location (x,y), the sub-band output can be represented in terms of amplitude A and phase φ, as follows:
$S(r,\theta;x,y,t) = A(r,\theta;x,y,t)\, e^{i\varphi(r,\theta;x,y,t)}$  (1)
Traditionally, a complex steerable pyramid is subjected to downsampling-based decomposition as in [1,27]. Here, however, the resolution across all filters is uniform (Figure 2). This yields sub-band outputs that are perfectly aligned in terms of spatial resolution. The amplitude and phase at each pixel location (x,y) across the different sub-bands exhibit a direct correspondence. This facilitates later pixel-level signal synthesis.
The phase variations $\varphi_v(r,\theta;x,y,t)$ are then computed by subtracting the phase of each pixel in the reference frame—typically the first frame of the video—from the corresponding phase in the current frame. The formal expression is:
$\varphi_v(r,\theta;x,y,t) = \varphi(r,\theta;x,y,t) - \varphi_{\mathrm{ref}}(r,\theta;x,y,t_0)$  (2)
As shown in [24], the computed phase variations provide very good approximations of image texture displacements, especially those of subtle motions, along the corresponding orientation and scale.
To provide intuition for the relationship between phase variations and physical displacement, a simple one-dimensional example is considered. Let $f(x)$ denote a 1D image intensity profile, and suppose the signal undergoes a global translation over time, resulting in $f(x+\delta(t))$, where $\delta(t)$ denotes the displacement. Using Fourier series decomposition, the translated signal can be expressed as a sum of complex sinusoids:
$f(x+\delta(t)) = \sum_{\omega=-\infty}^{\infty} A_\omega e^{i\omega(x+\delta(t))}$  (3)
Each term corresponds to a frequency component $\omega$, which can be interpreted as a particular scale $r$ of the signal. For a given frequency band, the signal can be written as:
$S_\omega(x,t) = A_\omega e^{i\omega(x+\delta(t))}$  (4)
Since $S_\omega(x,t)$ is a complex sinusoid, its phase term $\omega(x+\delta(t))$ varies linearly with the displacement $\delta(t)$. Therefore, temporal changes in phase directly encode the physical motion. In two-dimensional images, this principle allows pixel-wise phase differences to precisely describe local displacements, enabling each pixel to act as an independent vibration sensor.
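This linear relationship can be checked numerically. The sketch below constructs a complex sinusoid before and after a small translation and verifies that the per-pixel phase variation equals the displacement scaled by the band frequency; all numeric values are arbitrary test inputs.

```python
import numpy as np

w = 3.0                          # spatial frequency of the band
delta = 0.01                     # sub-pixel displacement between t0 and t
x = np.linspace(0.0, 2.0, 5)     # a few pixel locations

S_ref = np.exp(1j * w * x)               # reference frame (delta = 0)
S_cur = np.exp(1j * w * (x + delta))     # displaced frame

# Pixel-wise phase variation, computed as in the subtraction step above
phase_var = np.angle(S_cur * np.conj(S_ref))

# Every pixel reports the same displacement-proportional phase shift,
# so each pixel can recover delta independently (about 0.01 here).
assert np.allclose(phase_var / w, delta)
```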
(b)
PAS Synthesis
A large number of PASs can be independently constructed by utilizing the phase variations $\varphi_v(r,\theta;x,y,t)$ at each pixel. Specifically, within each sub-band, each local motion signal is weighted by the squared amplitude at time $t_0$. This can be formulated as:
$\Phi_i(r,\theta;x,y,t) = A(r,\theta;x,y,t_0)^2\, \varphi_v(r,\theta;x,y,t)$  (5)
where i denotes the sub-band index. Subsequently, the outputs across all orientations θ and scales r are aggregated via summation:
$PAS(x,y,t) = \sum_i \Phi_i(r,\theta;x,y,t)$  (6)
Finally, each PAS is normalized by scaling and centering to within the range [−1, 1]. Figure 3 compares the spectrogram of the original audio [Figure 3b] to those of the PASs [Figure 3c–e] derived from different surface regions ($p_1$, $p_2$, $p_3$ in Figure 3a) of the object. To provide an intuitive understanding of signal quality, the PASs were rendered as audible waveforms using the method proposed in [1]. Audio samples corresponding to Figure 3b–e are also available at [https://youtu.be/rMcX9ofYY68] (accessed on 9 February 2026) (Animation_S2 in Supplementary Materials), enabling subjective comparison of the reconstructed PASs with the original audio. Given the inherent ambiguities of local phases in regions where image texture is weak, motion signals extracted from such pixel locations tend to be very noisy and/or unreliable.
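The weighting, aggregation, and normalization steps above can be reproduced on synthetic data as a minimal sketch; the dimensions and random inputs below are illustrative only, not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 4 sub-bands (scale/orientation pairs), 8x8 pixels, 100 frames.
n_bands, H, W, T = 4, 8, 8, 100
amp0 = rng.uniform(0.1, 1.0, size=(n_bands, H, W))          # amplitudes at t0
phase_var = rng.normal(0.0, 0.01, size=(n_bands, H, W, T))  # phi_v per sub-band

# Weight each local motion signal by the squared amplitude at t0 ...
phi = amp0[..., None] ** 2 * phase_var

# ... then aggregate across all scales and orientations
pas = phi.sum(axis=0)            # shape (H, W, T): one PAS per pixel

# Normalize each pixel's signal to [-1, 1] by centering and scaling
pas -= pas.mean(axis=-1, keepdims=True)
peak = np.abs(pas).max(axis=-1, keepdims=True)
pas /= np.where(peak > 0.0, peak, 1.0)

assert pas.shape == (H, W, T) and np.abs(pas).max() <= 1.0
```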
In most areas, the resulting PASs are too coarse to be intelligible to the human ear. Current end-to-end ASR models also exhibit limited recognition performance on these signals, as detailed in Section 3.3. However, the advantage of PASs lies in their abundance—they can be densely extracted at the pixel level and collectively capture rich semantic information embedded in surface vibrations. Experimental results confirm their substantial training potential when combined with dedicated network architectures and learning strategies, as further demonstrated in Section 3.3.
Note that variations in texture, brightness, and material properties across different pixel locations that correspond to distinct regions of the object surface introduce diverse noise characteristics and frequency attenuations. Such variations contribute to the robustness of model training using large-scale PAS inputs.

2.3. Proposed Model: The VSG-Transformer

Figure 4 shows the architecture of the VSG-Transformer. An overview of each component follows, and the design rationale is explained. In broad terms, the model features three key modules: (1) a convolutional shrinkage module that suppresses noise and enhances features; (2) a pretrained HuBERT-based encoder that captures high-level acoustic representations; and (3) an attention-based decoder that generates text autoregressively.
(a)
The convolutional shrinkage module
As defined by Equation (6) of Section 2.2, PASs are combinations of $\Phi_i$ derived from the amplitude and phase outputs across all orientations $\theta$ and scales $r$ of the corresponding filters. For a given pixel location, filters that are mismatched in scale and/or orientation typically deliver weak sub-band amplitudes. When such amplitudes are used to compute the squared weightings, the PASs contain noise components that are substantially weaker than the actual signal. Therefore, soft thresholding, a core feature of classical signal denoising algorithms [28], was used here to eliminate noise: features with magnitudes near zero are suppressed, effectively reducing them to zero. The nonlinear transformation is formally defined as:
$y = \begin{cases} x - \tau, & x > \tau \\ 0, & -\tau \le x \le \tau \\ x + \tau, & x < -\tau \end{cases}$  (7)
where x denotes the input feature, y is the output, and τ is a positive threshold parameter. However, determination of an appropriate threshold τ remains challenging in practice.
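The shrinkage rule above can be written as a single vectorized function. This is a minimal sketch with a fixed threshold; in the actual module, as described next, $\tau$ is learned per channel by the network.

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink inputs toward zero; values with |x| <= tau are set to 0."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

x = np.array([-2.0, -0.5, 0.0, 0.3, 1.5])
# -2 shrinks to -1, everything inside [-1, 1] is zeroed, 1.5 shrinks to 0.5
print(soft_threshold(x, tau=1.0))
```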
As shown in [29], soft thresholding can be integrated into the deep architecture as a nonlinear transformation layer. The threshold parameter τ is then automatically learned by the network, not manually specified. Here, soft thresholding was employed as the activation function that was integrated with convolutional layers to form a convolutional shrinkage module that suppressed noise-related information and enhanced feature discriminability.
Specifically, a PAS dataset for a VSG task with $N$ training samples is denoted $D_{VSG} = \{PAS^{(j)}, W^{(j)}\}_{j=1}^{N}$, where $PAS$ is a one-dimensional waveform input and $W$ a sequence of $L$ tokens $\{w_1, \ldots, w_L\}$ (Figure 4). The convolutional shrinkage module features two 1D convolutional layers that flank a shrinkage unit; their output channel counts are 32 and 1, respectively, and both use a stride of 1. Initial processing through the first convolutional layer expands the feature channels of the waveform input to 32. The subsequent shrinkage layer first dynamically computes channel-specific $\tau$ values, generating a 32-channel threshold vector, and then subjects each feature channel to the nonlinear soft-thresholding operation, effectively suppressing noise while preserving critical signal components. A residual connection integrates the soft-thresholded features. The final 1D convolutional layer compresses the feature channels back to a single channel, ensuring that the output is compatible with the subsequent processing stages.
(b)
The pretrained HuBERT encoder
In the second module, the pretrained HuBERT encoder extracts high-level acoustic representations from the input waveform. HuBERT has consistently attracted considerable research attention; it captures complex temporal dependencies and semantic features effectively because it was trained on large datasets, including 10,000 h of speech from the WenetSpeech corpus [30]. The pretrained model leverages this extensive prior knowledge during transfer learning and therefore adapts rapidly to PAS inputs. In this study, two pretrained HuBERT models with different parameter scales—HuBERT-Base (95 million parameters) and HuBERT-Large (317 million parameters)—were integrated with VSG-Transformer, yielding VSG-Transformer-Base and VSG-Transformer-Large, respectively.
As illustrated in Figure 4, the pretrained HuBERT model features a convolutional neural network (CNN) encoder followed by a Transformer. The CNN encoder contains seven convolutional layers, each with 512 output channels. The first layer receives a single-channel input that directly matches the one-dimensional waveform output by the convolutional shrinkage module; the subsequent layers process 512-channel inputs. The CNN encoder delivers a downsampled two-dimensional feature sequence $\{p_1, \ldots, p_n\}$, wherein each feature vector $p_i$ has dimensionality 512 and is computed by aggregating information across all input channels. The feature sequence is subjected to local masking using the strategy described in [31,32] before transmission to the Transformer module. The Transformer features a configurable number of blocks with predefined embedding dimensions, inner feed-forward network (FFN) dimensions, and attention heads. The detailed architectural specifications of the HuBERT-Base and HuBERT-Large models are listed in Table 2.
Both models output a sequence of hidden units $\{h_1, \ldots, h_n\}$ that serve as high-level acoustic representations. These are next fed to the downstream decoder module to improve text generation.
(c)
The decoder
The decoder module of VSG-Transformer transforms the high-level acoustic representations produced by the HuBERT encoder into coherent text outputs. The decoder employs a self-attention mechanism to effectively capture long-range dependencies within the sequence. The decoder operates in an autoregressive manner through a stack of attention blocks, integrating both acoustic features from the HuBERT encoder and text-level information from previous decoding steps to enhance transcriptional accuracy.
The structure of the decoder is illustrated in Figure 4. A character-embedding layer is first used to convert the character sequence into an output encoding $\{ce_1, \ldots, ce_L\}$, which is then combined with a positional encoding $\{pe_1, \ldots, pe_L\}$. The dimensionality of both embeddings matches that of HuBERT. The combined embeddings are then passed through a stack of $N_d = 6$ decoder blocks that generate the final decoder outputs. Each decoder block features three sub-blocks, of which the first is a masked multi-head self-attention block wherein the queries, keys, and values are identical. Masking is used to ensure that the prediction at position $j$ depends only on the outputs at positions earlier than $j$. The second sub-block is a multi-head attention block in which the keys and values are the encoder outputs $\{h_1, \ldots, h_n\}$; here, the queries are derived from the output of the preceding sub-block. The third sub-block is a position-wise feed-forward network. Each sub-block in the decoder features a residual connection and layer normalization. The latter uses a pre-norm rather than the conventional post-norm structure. Given input $x$ to a sub-block, the corresponding output is:
$x + \mathrm{SubBlock}(\mathrm{LayerNorm}(x))$  (8)
Finally, the decoder outputs are computed via a linear projection followed by a softmax function.
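The pre-norm residual form above can be sketched as follows. The toy feed-forward sub-block and all dimensions are arbitrary stand-ins, not the actual decoder configuration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last (feature) dimension to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_sub_block(x, sub_block):
    """Pre-norm residual: normalize first, transform, then add the input back."""
    return x + sub_block(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) * 0.1
ffn = lambda h: np.maximum(h @ W, 0.0)   # toy position-wise feed-forward layer

x = rng.normal(size=(4, 16))             # (sequence length, model dimension)
y = pre_norm_sub_block(x, ffn)
assert y.shape == x.shape
```

A practical advantage of pre-norm over post-norm is that the residual path stays an identity mapping, which tends to stabilize training of deep stacks.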
In general, VSG-Transformer effectively integrates noise suppression, acoustic representation, and autoregressive text generation, leveraging transfer learning to fully exploit the capabilities of HuBERT and enable reliable PAS-to-text conversion. The combination of modules described above is well-suited to the VSG task, which demands accurate modeling of complex acoustic and semantic patterns embedded in input PAS signals.
Notably, VSG-Transformer does not rely heavily on semantic features within PASs per se. Rather, transfer learning enables the model to leverage knowledge acquired in the source domain (the speech corpora) via the HuBERT module and then apply that knowledge to tasks in the target domain (the PAS dataset). This significantly reduces the model's dependency on large volumes of PAS data: the model can be trained using only a few high-speed videos, which is important because data acquisition from such videos is computationally demanding.

3. Experimental Validation

The VSG technique was evaluated in a series of experiments. The setup included a common object (a bag of potato chips), a loudspeaker (KRK Rokit 5), and a high-speed camera (i-SPEED 230). All video recordings were captured in a typical meeting room (Figure 5). Videos were captured at high frame rates (about 16 kHz; the spatial resolution was 128 × 128 pixels). Standard audio files from the AISHELL-1 corpus—an open-source Mandarin speech dataset with a speech recognition baseline—were played through the loudspeaker to excite sound-induced vibrations on the object surface. A total of 500 video clips were recorded, each corresponding to a distinct speech utterance randomly sampled from the AISHELL-1 corpus, with no sentence or audio clip repeated across the dataset. The speech signals were played at an approximate sound pressure level (SPL) of 80 dB, comparable to that of an unamplified stage actor. Experimental evidence suggests that this dataset size is adequate, as the model's semantic understanding primarily originates from the pretrained module. The PASs extracted from the videos expose the model to variations in texture, brightness-related noise, and frequency attenuation—factors that enhance robustness during training. Subsequently, initial and final non-informative frames were manually truncated to retain only valid segments.
A chip bag was selected as the primary object for evaluation due to its strong and consistent vibration response under acoustic excitation. Compared to other common materials such as foam cups, plants, tissue boxes, and other everyday items, it exhibits more favorable motion characteristics, as demonstrated in [1]. Since the current experiment represents an initial feasibility study of VSG, the evaluation focused on objects with a higher likelihood of producing reliable results. Given the substantial time and computational demands of high-speed video acquisition and processing, large-scale testing across a broader range of materials is being conducted in follow-up work and will be published in future updates.

3.1. Dataset Generation

Figure 6 shows frames from videos captured during the experiments. Each frame is fully occupied by the object, effectively eliminating background interference. During recording, the object pose and the camera viewing angle were randomly varied while maintaining an approximately constant distance between the object and the camera. This significantly enhanced the robustness and generalizability of the experimental results.
PASs were extracted from the videos as described in Section 2.2. Motion signals from pixels on the object surface that exhibited clear textural and phase variations were employed. Specifically, all PASs within each video—128 pixels (height) × 128 pixels (width) = 16,384 pixels in total—were extracted for subsequent processing. A percentile-based pooling strategy was employed, analogous to the pooling layers used in deep learning. Specifically, each input frame of size 128 (height) × 128 (width) pixels was partitioned into 256 regions, each measuring 8 (height) × 8 (width) pixels. Within each region, the PAS corresponding to the pixel whose amplitude was closest to $A_{\min} + t \times (A_{\max} - A_{\min})$ was selected, where $A_{\max}$ and $A_{\min}$ are the maximum and minimum amplitudes within the region, respectively, and $t$ is a user-defined threshold (here, 0.8). This identified a representative PAS within each local region, suppressed any effects of noise and outliers, and thereby improved the general quality of the selected PASs.
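The selection rule above can be sketched as a minimal re-implementation of the described pooling; the random inputs stand in for real amplitude maps and PASs.

```python
import numpy as np

def percentile_pool(amplitude, pas, region=8, t=0.8):
    """For each region x region block, keep the PAS of the pixel whose
    amplitude is closest to A_min + t * (A_max - A_min) in that block."""
    H, W = amplitude.shape
    selected = []
    for i in range(0, H, region):
        for j in range(0, W, region):
            blk = amplitude[i:i + region, j:j + region]
            target = blk.min() + t * (blk.max() - blk.min())
            r, c = np.unravel_index(np.abs(blk - target).argmin(), blk.shape)
            selected.append(pas[i + r, j + c])
    return np.stack(selected)

rng = np.random.default_rng(0)
amp = rng.uniform(size=(128, 128))       # per-pixel amplitudes
sig = rng.normal(size=(128, 128, 10))    # per-pixel PASs (10 samples each)
pooled = percentile_pool(amp, sig)
assert pooled.shape == (256, 10)         # 16 x 16 regions of 8 x 8 pixels
```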
After application of the percentile-based pooling strategy, each 128 (height) × 128 (width) video frame yielded 256 PASs. A total of 500 videos were randomly split into training, validation, and test sets in a 70:15:15 ratio (350, 75, and 75 videos, respectively) at the video level, such that no video appears in more than one split. The speech samples were drawn from a subset of speakers in AISHELL-1, and individual speakers may appear in more than one split. However, the utterance content across different speakers is strictly non-overlapping, as the task focuses on text reconstruction rather than speaker-specific characteristics. In total, the PAS dataset contained 128,000 PAS samples (500 × 256). Detailed statistics are listed in Table 3.
Additionally, the publicly available AISHELL-1 corpus, which contains 178 h of Mandarin speech, was employed during certain stages of training. All recordings were sampled at 16 kHz. To facilitate comparisons, each training sample in the PAS dataset is here termed an “utterance”, following the AISHELL-1 convention. However, as detailed in Section 2.2, PASs are not equivalent to semantically intelligible speech in the traditional sense.

3.2. Training and Testing

Section 2.3 described the architecture of the proposed framework. There are three key modules: a convolutional shrinkage module for noise suppression, a pretrained HuBERT encoder for acoustic representation, and an attention-based decoder for text generation. A transfer learning strategy was used to retain as much as possible of the knowledge of the pretrained HuBERT when effectively training our model across all components. This involved stage-wise adjustments of the training parameters, the use of module freezing policies, and employment of different dataset inputs. This section outlines these in detail.
During training stage 1, both the shrinkage layer of the convolutional shrinkage module and the entire HuBERT encoder were frozen, as illustrated in Figure 7a. During this stage, only AISHELL-1 was used for training over 130 epochs. Two parallel experiments were conducted using HuBERT-Base and HuBERT-Large. Stage 1 sought to establish an automatic baseline ASR capability based on the AISHELL-1 corpus. In stage 2, the shrinkage layer was thawed but the HuBERT encoder remained frozen [Figure 7b]. Training in this stage employed only the PAS dataset (40 epochs). The two parallel training tracks were based on the respective models obtained during stage 1. Stage 2 aimed to optimize the shrinkage layer in terms of task-specific noise suppression while allowing transfer of knowledge from AISHELL-1 to the recognition tasks posed by the PAS dataset.
All VSG-Transformer modules were implemented in PyTorch 2.1.0 [33]. Each training batch (size = 24) contained approximately 100 s of speech or PAS and the corresponding transcriptions. The Adam optimizer [34] was employed; the parameters were $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. The learning rate was varied during training using the warm-up schedule of:
lrate = D_m^(−0.5) · min(step^(−0.5), step · warmup^(−1.5))
where warmup was set to 12,000. To prevent overfitting, the neighborhood label smoothing strategy of [35] was employed, with the probability of the correct label set to 0.8. The token vocabulary comprised 4231 characters from the training set plus two special symbols: “&lt;unk&gt;” for unseen tokens and “&lt;eos&gt;” to pad the ends of token sequences. Both the residual dropout and attention dropout rates were 0.1; residual dropout was applied to each sub-block prior to addition of the residual connections, and attention dropout was applied to the softmax activation within each attention mechanism. Finally, the model parameters from the last 10 epochs were averaged to obtain the final model. During inference, beam search decoding employed a beam width of 10 and a length penalty of 1.0 [36]. When multiple PASs were extracted from the same video, a majority voting strategy was used to aggregate the recognition results: token-wise voting was performed across all sequences, and the most frequent token at each position was selected to form the final transcription. All experiments were conducted on an NVIDIA RTX 4090 GPU, and the reported CER results are averages over five independent runs. The overall pipeline of VSG is visualized in Figure 8.
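As a sketch, the token-wise majority voting described above might look as follows in Python. Padding shorter sequences with “&lt;eos&gt;” so that voting stays position-aligned is an assumption; the paper does not specify how length mismatches between transcriptions are handled:

```python
from collections import Counter
from itertools import zip_longest

def majority_vote(transcripts, eos="<eos>"):
    """Aggregate transcriptions decoded from multiple PASs of one video.

    At each token position, the most frequent token across all sequences
    wins; sequences are padded with <eos> to equal length, and trailing
    <eos> padding is stripped from the final transcription.
    """
    voted = []
    for tokens in zip_longest(*transcripts, fillvalue=eos):
        winner, _ = Counter(tokens).most_common(1)[0]
        voted.append(winner)
    while voted and voted[-1] == eos:  # drop padding at the end
        voted.pop()
    return voted
```

For example, three decoded sequences `["a","b","c"]`, `["a","b","d"]`, `["a","b","c"]` would vote to `["a","b","c"]`.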

3.3. Results and Ablation Studies

The results are presented in Table 4. Using the transfer learning strategy, the VSG-Transformer successfully achieved PAS-to-text conversion from high-speed video recordings of common vibrating objects, without any reliance on human auditory perception. Specifically, VSG-Transformer-Base achieved a CER of 13.7% on the VSG task, while VSG-Transformer-Large achieved 12.5%. In contrast, on the AISHELL-1 ASR task of stage 1, the CERs were 6.4% and 6.1%, respectively, indicating that the VSG-Transformer is also capable of end-to-end ASR. VSG-Transformer-Large outperformed VSG-Transformer-Base on both the ASR and VSG tasks, showing that increasing the model scale enhances the capacity to extract rich acoustic representations for downstream tasks.
After multi-stage training of the multi-module VSG-Transformer network, ablation studies were conducted to quantitatively assess the contribution of each module in the architecture and to isolate the impact of the stage-wise training strategy by selectively removing individual training components. Six ablation tests were conducted using the VSG-Transformer-Large model; the results are summarized in Table 5.
Specifically, freezing the convolutional shrinkage layer during stage-2 training results in a clear degradation in recognition accuracy, with the CER increasing from 12.5% to 19.1%. This observation indicates that adaptive noise suppression is essential for mitigating PAS-specific artifacts and stabilizing the representations prior to decoding.

Varying the depth of the decoder shows that architectural capacity plays an important but non-linear role. Increasing the number of decoder blocks generally improves recognition performance, suggesting that deeper decoders better capture long-range semantic dependencies in PAS sequences. However, the gains become marginal once the number of blocks exceeds six, indicating diminishing returns from excessive depth.

The aggregation strategy at inference time is also critical. Removing majority voting causes a substantial performance drop, with the CER rising to 24.5%. This result confirms that individual PASs often contain incomplete or noisy vibration cues due to spatial variability across pixels, and that majority voting effectively integrates complementary information from multiple PASs to suppress local disturbances and improve transcription robustness.

Performance is further influenced by the PAS selection strategy. Both overly aggressive pooling and excessively coarse or fine spatial partitioning lead to degraded accuracy. The baseline configuration, using 8 × 8 regions with a percentile threshold of t = 0.80, achieves the best balance between preserving informative vibration cues and suppressing noise, whereas weighted averaging and extreme region sizes introduce additional interference or dilute useful information.

Comparisons with conventional baselines further reinforce the effectiveness of the proposed design. A visual-microphone-based pipeline directly followed by an ASR model, without stage-2 adaptation, performs poorly (CER = 40.3%). A PAS + ASR baseline without stage-2 training degrades even further (CER = 56.7%).
These results indicate that the proposed VSG-Transformer, together with the stage-wise transfer learning strategy, is essential for reliable PAS-to-text conversion.
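One plausible reading of the percentile-based PAS selection examined in the ablation (8 × 8 regions, t = 0.80) is sketched below. The RMS-energy score and the per-tile thresholding are assumptions made for illustration; the paper does not specify the exact selection metric:

```python
import numpy as np

def select_pas(signals, region=8, t=0.80):
    """Hypothetical sketch of percentile-based PAS selection.

    `signals` is an (H, W, T) array of per-pixel pseudo-acoustic signals,
    with H and W assumed divisible by `region`. The frame is split into a
    region x region grid of tiles; within each tile, only pixels whose RMS
    amplitude is at or above the t-quantile of that tile are kept,
    suppressing weakly vibrating (noisy) locations.
    """
    H, W, _ = signals.shape
    rms = np.sqrt((signals ** 2).mean(axis=-1))  # (H, W) per-pixel energy map
    keep = np.zeros((H, W), dtype=bool)
    hs, ws = H // region, W // region            # tile sizes
    for i in range(region):
        for j in range(region):
            tile = rms[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
            thr = np.quantile(tile, t)
            keep[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws] = tile >= thr
    return signals[keep]                         # (N, T) selected PASs

# toy example: 16 x 16 pixel grid with 10-sample signals
rng = np.random.default_rng(0)
selected = select_pas(rng.random((16, 16, 10)))
```

With 2 × 2 tiles and t = 0.80, each tile keeps only its strongest pixel, so the toy example yields 64 selected PASs.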

3.4. Computational Cost and Runtime Analysis

The proposed VSG framework focuses on reconstructing textual content directly from subtle vibrations captured in high-speed video. Beyond recognition accuracy, understanding the computational characteristics of the processing pipeline and its behavior under challenging acquisition conditions is important for assessing the practical scope of the approach. This section therefore analyzes the runtime of the main processing stages.
The overall processing pipeline consists of two major stages: (1) PME-based generation of PASs, and (2) inference using the VSG-Transformer. Among these stages, PME constitutes the dominant computational overhead, as it operates at the pixel level on high-frame-rate video sequences. The runtime of this stage scales approximately linearly with video duration and spatial resolution, and further increases with the number of pyramid scales and orientations employed in the steerable pyramid decomposition. As reported in Table 6, for a representative 5-s video clip with a spatial resolution of 128 × 128 pixels, the PME preprocessing time ranges from 129.79 s when using a shallow configuration (1 scale, 2 orientations) to 212.38 s for a deeper and more directional configuration (3 scales, 4 orientations). This increase reflects the additional computational burden introduced by multi-scale and multi-orientation filtering. All experiments were conducted on a workstation equipped with an NVIDIA RTX 4090 GPU and an Intel Core i9-12900K CPU. All runtimes reported in Table 6 are averaged over 10 independent runs and are provided as representative references.
In contrast, the inference stage based on the VSG-Transformer is comparatively lightweight once PASs have been extracted and cached offline. The inference time is approximately 0.16 s per PAS for the Base model and 0.43 s per PAS for the Large model. When aggregating 256 PASs per video using batch inference followed by majority voting, the total transcription time is approximately 27.12 s for the Base model and 75.51 s for the Large model per video.
These results indicate that, while PME preprocessing remains computationally intensive in the current implementation, the downstream text inference stage does not constitute a bottleneck.

3.5. Failure Modes and Operational Limitations

In addition to computational cost, the robustness of the proposed framework under diverse acquisition conditions is an important practical consideration. To this end, we analyzed the performance variations and limiting cases observed in the experiments, as summarized in Table 7. Figure 9 presents representative frames extracted from videos captured under different acquisition and scene conditions.
Performance degradation is primarily associated with conditions that weaken vibration-induced motion cues or reduce their reliability in PAS extraction. Specifically, increasing the camera–object distance and decreasing the SPL make vibration-induced motion cues harder to capture, resulting in lower signal-to-noise ratios in the extracted PASs and correspondingly higher CERs. Similarly, adverse imaging conditions—such as reduced illumination or defocus—negatively affect the quality of phase-based motion estimation by diminishing local texture contrast and phase stability, thereby introducing additional noise into PAS extraction. Scene composition also plays an important role. As the object occupancy ratio within the field of view decreases, the amount of usable vibration information becomes limited, which degrades transcription accuracy. Furthermore, under complex acoustic conditions involving overlapping speech, recognition performance deteriorates markedly as interference increases, reflecting the inherent difficulty of isolating target speech content from visually induced vibrations alone.
In extreme cases—including very low SPL, severe defocus, strong interfering speech, or minimal object occupancy—PAS extraction becomes unstable, leading to partial or complete recognition failure. These failure modes delineate the current operational boundaries of the proposed method and highlight scenarios in which visual vibration-based speech recovery is intrinsically challenging.
From a broader perspective, the identified performance boundaries also provide practical guidance for assessing the feasibility of vision-based vibration sensing in monitoring or eavesdropping-related applications. The results indicate that successful text reconstruction relies on sufficiently strong vibration responses, adequate visual resolution, and limited acoustic interference. These findings offer practical insights for determining when visual vibration analysis may yield usable semantic information, and when it is unlikely to be effective. As such, the present study contributes not only a proof of concept, but also a set of empirically grounded conditions that can inform future system design and feasibility assessment in surveillance-oriented scenarios.

3.6. VSG Operation on Videos with Limited Temporal Sampling

The VSG framework and its experimental validation have been described above. One key constraint remains: the raw VSG input must be high-speed video captured at a frame rate matching 16 kHz, because both the HuBERT-Base and HuBERT-Large models [19] were pretrained on speech data encoded as 16-kHz mono WAV files. The PASs must match this sampling rate to align with the expected input format; otherwise, feature extraction errors that significantly compromise model performance are to be expected.
However, unlike audio recording, high-speed video recording is resource-intensive in both memory consumption and computational cost. This section therefore investigates whether VSG can accept lower-frame-rate videos, leveraging audio waveform upsampling techniques to build a lightweight VSG variant.
The experiments investigated this issue using four upsampling methods. The first was a traditional signal processing technique, Bandlimited Sinc Interpolation (BSI), as implemented in the widely adopted PyTorch-based Torchaudio library [37]. Three deep learning approaches were also assessed: a deep neural network (DNN) [38], attention-based feature-wise linear modulation (AFiLM) [39], and Phase-Net [40]. The DNN implicitly predicts missing high-frequency components in bandwidth-expanded speech, improving recognition performance when speech signals are acquired at lower temporal sampling rates; it also addresses the discontinuity between the narrowband input and the reconstructed high-frequency spectrum. AFiLM uses a self-attention mechanism to model long-range temporal dependencies, with the aim of increasing the fidelity of waveform upsampling. Both of these methods operate directly on 1D PAS waveforms. Phase-Net, in contrast, takes phase data as its primary input. Originally designed for video frame interpolation, it leverages phase and amplitude features from complex-valued sub-bands obtained by decomposing each frame with a multi-scale, multi-orientation filter bank (Section 2.2 above). Phase-Net can integrate the resulting phase interpolations with the PAS computation, thereby enabling effective PAS upsampling.
BSI upsampling used the parameters recommended in the Torchaudio audio-processing guidelines [37], which leverage GPU-based acceleration to balance computational efficiency with spectral fidelity. The structural configurations of the deep learning models followed the original publications [38,39,40]; hyperparameter tuning and training details can be found in the respective references. All four upsampling methods were applied to signals derived from videos sampled at 8, 4, and 2 kHz, generating 16-kHz PASs via 2×, 4×, and 8× interpolation, respectively. The upsampled PASs were used to construct the PAS dataset and then for stage-2 training and evaluation of VSG-Transformer-Large. The CER results are summarized in Table 8.
As shown in Table 8, Phase-Net consistently yielded the best VSG performance at all upsampling ratios, with all CERs below 20%. Remarkably, Phase-Net achieved test CERs of 13.6% and 15.3% under 2× (original 8 kHz) and 4× (original 4 kHz) upsampling, respectively, only slightly worse than direct training on native 16-kHz data; performance did not collapse even at the highest ratio.
DNN and AFiLM performed very similarly in terms of CER, particularly at original sampling rates of 4 kHz or higher. In contrast, BSI performed poorly: after 8× upsampling, the VSG-Transformer failed to converge during stage-2 training. This is likely because, at a 2-kHz sampling rate, the Nyquist limit restricts the signal bandwidth to below 1 kHz. Such a restricted bandwidth typically captures only the fundamental frequency (F0) and part of the first formant (F1) of vowels and voiced consonants, entirely missing higher formants such as F2 and F3. Moreover, high-frequency energy from unvoiced consonants such as /f/, /s/, and /k/, often above 2 kHz, is completely absent from such low-bandwidth signals [14], significantly impairing recognition accuracy.
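This Nyquist argument can be made concrete with a toy calculation. The band center frequencies below are approximate illustrative values, not measurements from this study:

```python
# approximate center frequencies (Hz) of key speech cues;
# illustrative values only, not measurements from this study
BANDS = {"F0": 150, "F1": 700, "F2": 1800, "F3": 2800, "/s/ frication": 5000}

def representable(fs_hz):
    """Cues lying below the Nyquist limit fs/2 at sampling rate fs."""
    return [name for name, f in BANDS.items() if f < fs_hz / 2]

# at 2 kHz only F0 and F1 survive; at 16 kHz every listed cue does
low = representable(2000)
full = representable(16000)
```

This mirrors the text: a 2-kHz signal retains F0 and (part of) F1 but loses the higher formants and frication energy that distinguish many consonants.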
In essence, all three deep learning-based upsampling methods seek to restore speech features in the high-frequency spectrum that are missing from the narrowband signal: during training of Phase-Net, DNN, and AFiLM, 16-kHz signals served as the training targets. In contrast, BSI, a purely numerical interpolation method, cannot recover lost high-frequency features during upsampling. This fundamental distinction largely explains the performance gap between BSI and the deep learning methods. Effective VSG on video with limited temporal sampling thus appears to require guidance from high-frame-rate signals during upsampling, and deep networks with learnable upsampling parameters are a promising direction toward lightweight VSG implementations.

4. Conclusions and Future Work

This paper shows that subtle vibrations of everyday objects, caused by ambient sound, can be extracted from video recordings and used to generate text, effectively transforming the objects into vision-based subtitle generators. The VSG pipeline first acquires pixel-level PASs from object surfaces via PME. These are scalable, abundant, and robust signals that effectively represent the underlying acoustic features. A pretrained HuBERT-based generative model, termed the VSG-Transformer, is then employed for the PAS-to-text task. The architecture integrates three key components: a convolutional shrinkage module for noise suppression, a pretrained HuBERT-based encoder for extraction of high-level acoustic representations, and an attention-based decoder for autoregressive text generation. Training employs multi-stage transfer learning, whereby the model leverages the knowledge embedded in the pretrained HuBERT and adapts it to recognition of the PAS dataset.
Experimentally, the Base and Large variants of VSG-Transformer successfully accomplished VSG tasks, with CERs of 13.7% and 12.5%, respectively. Both models also performed well on the conventional ASR task trained on AISHELL-1, with CERs of 6.4% and 6.1%, respectively. Comprehensive ablation studies quantitatively evaluated the individual contributions of each module and the overall effectiveness of multi-stage training. Finally, the possibility of a lightweight VSG implementation was explored by upsampling low-sampling-rate PASs. Four upsampling methods (BSI, DNN, AFiLM, and Phase-Net), covering both traditional signal processing and deep learning approaches, were tested. Phase-Net upsampling consistently yielded the best performance: the VSG-Transformer-Large CERs were 13.6%, 15.3%, and 19.6% when trained and evaluated on PAS datasets upsampled from videos originally sampled at 8, 4, and 2 kHz, respectively. In addition, this study systematically examined performance variations and limiting cases under diverse acquisition conditions. The results help characterize when vibration-based text reconstruction is more likely to be effective, providing a principled basis for considering its broader implications.
In conclusion, VSG fully leverages the flexibility of deep networks and the strength of transfer learning, effectively bridging computer vision-based measurement and natural language processing. The VSG represents a promising new research direction with considerable potential. Future work should explore signal combinations from multiple objects and the integration of inputs from multiple synchronized cameras. Such temporally aligned signals might contain implicit correlations across different sources, further enhancing VSG performance. The rapid advances in smartphone electronic components—particularly the significant increase in camera frame rates (approaching 2 kHz in certain devices)—suggest that VSG deployment in such smartphones is feasible.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/s26051407/s1.

Author Contributions

Conceptualization, Y.W. (Yan Wang) and X.Z.; methodology, Y.W. (Yan Wang); formal analysis, X.Z.; investigation, Y.W. (Yingchong Wang); writing—original draft preparation, Y.W. (Yan Wang) and X.D.; writing—review and editing, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. U2141217).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available upon reasonable request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Davis, A.; Rubinstein, M.; Wadhwa, N.; Mysore, G.; Durand, F.; Freeman, W. The visual microphone: Passive recovery of sound from video. ACM Trans. Graph. 2014, 33, 1–10. [Google Scholar] [CrossRef]
  2. Nassi, B.; Pirutin, Y.; Swissa, R.; Shamir, A.; Elovici, Y.; Zadov, B. Lamphone: Real-time passive sound recovery from light bulb vibrations. Cryptol. ePrint Arch. 2020, 2020, 4401–4417. [Google Scholar]
  3. Rothberg, S.; Baker, J.; Halliwell, N. Laser vibrometry: Pseudo-vibrations. J. Sound Vib. 1989, 135, 516–522. [Google Scholar] [CrossRef]
  4. Nassi, B.; Swissa, R.; Shams, J.; Zadov, B.; Elovici, Y. The little seal bug: Optical sound recovery from lightweight reflective objects. In Proceedings of the IEEE Security and Privacy Workshops, San Francisco, CA, USA, 25 May 2023; IEEE: New York, NY, USA, 2023; pp. 298–310. [Google Scholar] [CrossRef]
  5. Nassi, B.; Pirutin, Y.; Galor, T.; Elovici, Y.; Zadov, B. Glowworm attack: Optical tempest sound recovery via a device’s power indicator led. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, 15–19 November 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1900–1914. [Google Scholar] [CrossRef]
  6. Kwong, A.; Xu, W.; Fu, K. Hard drive of hearing: Disks that eavesdrop with a synthesized microphone. In Proceedings of the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 19–23 May 2019; IEEE: New York, NY, USA, 2019; pp. 905–919. [Google Scholar] [CrossRef]
  7. Zhang, D.; Guo, J.; Jin, Y.; Zhu, C. Efficient subtle motion detection from high-speed video for sound recovery and vibration analysis using singular value decomposition-based approach. Opt. Eng. 2017, 56, 094105. [Google Scholar] [CrossRef]
  8. Zhang, D.; Guo, J.; Lei, X.; Zhu, C. Note: Sound recovery from video using svd-based information extraction. Rev. Sci. Instrum. 2016, 87, 086111. [Google Scholar] [CrossRef] [PubMed]
  9. Guri, M.; Solewicz, Y.; Daidakulov, A.; Elovici, Y. SPEAKE(a) R: Turn speakers to microphones for fun and profit. In Proceedings of the 11th USENIX Workshop on Offensive Technologies, Vancouver, BC, Canada, 14–15 August 2017. [Google Scholar] [CrossRef]
  10. Chen, J.; Davis, A.; Wadhwa, N.; Durand, F.; Freeman, W.; Büyüköztürk, O. Video camera–based vibration measurement for civil infrastructure applications. J. Infrastruct. Syst. 2017, 23, B4016013. [Google Scholar] [CrossRef]
  11. Davis, A.; Bouman, K.; Chen, J.; Rubinstein, M.; Durand, F.; Freeman, W. Visual vibrometry: Estimating material properties from small motion in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015; pp. 5335–5343. [Google Scholar] [CrossRef]
  12. Zona, A. Vision-based vibration monitoring of structures and infrastructures: An overview of recent applications. Infrastructures 2021, 6, 4. [Google Scholar] [CrossRef]
  13. Michalevsky, Y.; Boneh, D.; Nakibly, G. Gyrophone: Recognizing speech from gyroscope signals. In Proceedings of the 23rd USENIX Security Symposium, San Diego, CA, USA, 20–22 August 2014; USENIX Association: Berkeley, CA, USA, 2014; pp. 1053–1067. [Google Scholar]
  14. Long, Y.; Naghavi, P.; Kojusner, B.; Butler, K.; Rampazzi, S.; Fu, K. Side eye: Characterizing the limits of pov acoustic eavesdropping from smartphone cameras with rolling shutters and movable lenses. In Proceedings of the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 21–25 May 2023; IEEE: New York, NY, USA, 2023; pp. 1857–1874. [Google Scholar] [CrossRef]
  15. Zhang, L.; Pathak, P.; Wu, M.; Zhao, Y.; Mohapatra, P. AccelWord: Energy efficient hotword detection through accelero-meter. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services, Florence, Italy, 18–22 May 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 301–315. [Google Scholar] [CrossRef]
  16. Han, J.; Chung, A.J.; Tague, P. Pitchln: Eavesdropping via intelligible speech reconstruction using non-acoustic sensor fusion. In Proceedings of the 16th ACM/IEEE International Conference on Information Processing in Sensor Networks, Pittsburgh, PA, USA, 18–20 April 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 181–192. [Google Scholar] [CrossRef]
  17. Wang, G.; Zou, Y.; Zhou, Z.; Wu, K.; Ni, L. We can hear you with Wi-Fi! IEEE Trans. Mob. Comput. 2016, 15, 2907–2920. [Google Scholar] [CrossRef]
  18. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  19. Hsu, W.; Bolte, B.; Tsai, Y.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
  20. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 5754–5764. [Google Scholar]
  21. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  22. Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In Proceedings of the 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, Seoul, Republic of Korea, 1–3 November 2017; IEEE: New York, NY, USA, 2017; pp. 1–5. [Google Scholar] [CrossRef]
  23. Fleet, D.; Jepson, A. Computation of component image velocity from local phase information. Int. J. Comput. Vis. 1990, 5, 77–104. [Google Scholar] [CrossRef]
  24. Gautama, T.; VanHulle, M. A phase-based approach to the estimation of the optical flow field using spatial filtering. IEEE Trans. Neural Netw. 2002, 13, 1127–1136. [Google Scholar] [CrossRef]
  25. Freeman, W.; Adelson, E. The design and use of steerable filters. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 891–906. [Google Scholar] [CrossRef]
  26. Chou, J.; Chang, C.; Spencer, J. Out-of-plane modal property extraction based on multi-level image pyramid reconstruction using stereophotogrammetry. Mech. Syst. Signal Process. 2022, 169, 108786. [Google Scholar] [CrossRef]
  27. Wadhwa, N.; Rubinstein, M.; Durand, F.; Freeman, W. Phase based video motion processing. ACM Trans. Graph. 2013, 32, 80. [Google Scholar] [CrossRef]
  28. Isogawa, K.; Ida, T.; Shiodera, T.; Takeguchi, T. Deep shrinkage convolutional neural network for adaptive noise reduction. IEEE Signal Process. Lett. 2018, 25, 224–228. [Google Scholar] [CrossRef]
  29. Zhao, M.; Zhong, S.; Fu, X.; Tang, B.; Pecht, M. Deep residual shrinkage networks for fault diagnosis. IEEE Trans. Ind. Inform. 2020, 16, 4681–4690. [Google Scholar] [CrossRef]
  30. Zhang, B.; Lv, H.; Guo, P.; Shao, Q.; Yang, C.; Xie, L. Wenetspeech: A 10000 hours multi-domain mandarin corpus for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6182–6186. [Google Scholar] [CrossRef]
  31. Joshi, M.; Chen, D.; Liu, Y.; Weld, D.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
  32. Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv 2020, arXiv:2006.11477. [Google Scholar]
  33. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035. [Google Scholar]
  34. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2015, arXiv:1412.6980. [Google Scholar]
  35. Chorowski, J.; Jaitly, N. Towards better decoding and language model integration in sequence to sequence models. arXiv 2017, arXiv:1612.02695. [Google Scholar]
  36. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  37. Yang, Y.; Hira, M.; Ni, Z.; Astafurov, A.; Chen, C.; Puhrsch, C. Torchaudio: Building blocks for audio and speech processing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6982–6986. [Google Scholar] [CrossRef]
  38. Li, K.; Huang, Z.; Xu, Y.; Lee, C. DNN-based speech bandwidth expansion and its application to adding high-frequency missing features for automatic speech recognition of narrowband speech. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015; International Speech Communication Association: Grenoble, France, 2015; pp. 2578–2582. [Google Scholar]
  39. Rakotonirina, N. Self-attention for audio super-resolution. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, Gold Coast, Australia, 25–28 October 2021; IEEE: New York, NY, USA, 2021; pp. 1–6. [Google Scholar] [CrossRef]
  40. Meyer, S.; Djelouah, A.; McWilliams, B.; Sorkine-Hornung, A.; Gross, M.; Schroers, C. Phasenet for video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 498–507. [Google Scholar]
Figure 1. Flowchart of the proposed VSG method: (1) Video capture; (2) PAS acquisition; (3) VSG-Transformer training and testing; (4) Text reconstruction. The semantic content of the text visible on the object surface is irrelevant to the experiment and does not affect the subsequent pipeline.
Figure 2. Decomposition of a video frame.
Figure 3. Comparison between the spectrogram of the original audio signal and those of the PASs extracted from different surface locations. (a) Experimental scene and data acquisition setup. The PASs were extracted from a 4.4-kHz video of a bag of potato chips placed on the table in a typical meeting room. (b) Spectrogram of the source audio recorded by a standard microphone placed next to a bag of potato chips. (c–e) Spectrograms of PASs extracted from three representative surface locations, denoted as p1, p2, and p3, respectively. The sentence “We are continuously improving our technology”, corresponding to the Mandarin pronunciation “Wǒmen yězài bùduàn tíshēng wǒmen de jìshù,” was spoken by a person near the bag at an approximate volume of 80 dB. The recovered sound is noisy but comprehensible, and the corresponding audio clips are available at [https://youtu.be/rMcX9ofYY68] (accessed on 9 February 2026).
Figure 4. The VSG-Transformer architecture.
Figure 5. The experimental setup.
Figure 6. (a–d) Frames from the videos.
Figure 7. The stage-wise training protocol. (a) Stage 1. (b) Stage 2.
Figure 8. Visualization of the proposed VSG method.
Figure 9. Representative frames extracted from videos captured under different acquisition and scene conditions. (a–c) Natural lighting, dim lighting, and sparse highlights. (d) Moderate defocus and (e) severe defocus. (f–h) Decreasing object occupancy ratios of approximately 75%, 50%, and 25%, respectively.
Table 1. Summary of related work.
| Method | Exploited Device | Sampling Rate | Technique Category |
|---|---|---|---|
| Lamphone [2,4] | Photodiode | 2–4 kHz | Recovery |
| LDVs [3] | Laser transceiver | 40 kHz | Recovery |
| Glowworm [5] | Photodiode | 4–8 kHz | Recovery |
| Hard Drive of Hearing [6] | Magnetic hard drive | 17 kHz | Recovery |
| Visual Microphone [1] | High-speed camera | 2–20 kHz | Recovery |
| SVD [7,8] | High-speed camera | 2.2 kHz | Recovery |
| SPEAKE(a)R [9] | Speakers | 48 kHz | Recovery |
| Gyrophone [13] | Gyroscope | 200 Hz | Classification |
| Side Eye [14] | Smartphone cameras | 60 Hz | Classification |
| Accelword [15] | Accelerometer | 200 Hz | Classification |
| PitchIn [16] | Fusion of several motion sensors | 2 kHz | Classification |
| WiHear [17] | Software-defined radio | 300 Hz | Classification |
| VSG (present paper) | High-speed camera | 2–16 kHz | Generation |
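A device's sampling rate caps the recoverable acoustic bandwidth at the Nyquist frequency, which is why the classification-only methods in the lower half of the table work with far narrower signals than the recovery methods. A quick illustration for a few rows (upper bounds of the listed ranges are used):

```python
# Nyquist: a vibration sensor sampled at fs captures acoustic content only up to fs / 2.
sampling_hz = {
    "Gyrophone (gyroscope)": 200,
    "Side Eye (smartphone camera)": 60,
    "Visual Microphone (high-speed camera)": 20_000,
    "VSG (high-speed camera)": 16_000,
}
bandwidth_hz = {name: fs / 2 for name, fs in sampling_hz.items()}
print(bandwidth_hz["VSG (high-speed camera)"])  # 8000.0 -> covers the main speech band
```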
Table 2. The parameters of HuBERT-Base and HuBERT-Large.
| Component | Parameter | Base | Large |
|---|---|---|---|
| CNN encoder | Strides | 5, 2, 2, 2, 2, 2, 2 | 5, 2, 2, 2, 2, 2, 2 |
| | Kernel widths | 10, 3, 3, 3, 3, 2, 2 | 10, 3, 3, 3, 3, 2, 2 |
| | Channels | 512 | 512 |
| Transformer | Blocks | 12 | 24 |
| | Embedding dimension | 768 | 1024 |
| | Inner FFN dimension | 3072 | 4096 |
| | Attention heads | 12 | 16 |
| Number of parameters | | 95 M | 317 M |
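The CNN encoder parameters shared by both scales determine HuBERT's feature frame rate: the product of the strides gives the hop between frames, and the strides and kernel widths together give the receptive field. A quick check at the 16 kHz input rate:

```python
from math import prod

# HuBERT CNN encoder (identical for Base and Large, per the table)
strides = [5, 2, 2, 2, 2, 2, 2]
kernels = [10, 3, 3, 3, 3, 2, 2]

hop = prod(strides)  # total downsampling factor, in input samples
receptive_field = 1 + sum((k - 1) * prod(strides[:i]) for i, k in enumerate(kernels))

sr = 16_000  # Hz
print(hop, receptive_field)                          # 320 400
print(1000 * hop / sr, 1000 * receptive_field / sr)  # 20.0 ms hop, 25.0 ms window
```

So each output feature summarizes a 25 ms window, advancing 20 ms per frame (50 frames per second).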
Table 3. Statistics: The AISHELL-1 and PAS datasets.
| | | AISHELL-1 Train | AISHELL-1 Dev. | AISHELL-1 Test | PAS Train | PAS Dev. | PAS Test |
|---|---|---|---|---|---|---|---|
| Utterances | | 120,098 | 14,326 | 7176 | 89,600 | 19,200 | 19,200 |
| Hours | | 150 | 18 | 10 | 107 | 28 | 29 |
| Durations (s) | Min. | 1.2 | 1.6 | 1.9 | 3.5 | 3.8 | 3.5 |
| | Max. | 14.5 | 12.5 | 14.7 | 12.4 | 10.2 | 10.4 |
| | Avg. | 4.5 | 4.5 | 5.0 | 4.3 | 5.3 | 5.5 |
| Tokens | Min. | 1.0 | 3.0 | 3.0 | 4.0 | 5.0 | 4.0 |
| | Max. | 44.0 | 35.0 | 37.0 | 35.0 | 22.0 | 26.0 |
| | Avg. | 14.4 | 14.3 | 14.6 | 13.6 | 12.3 | 11.2 |
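The reported hours are internally consistent with the utterance counts and average durations; a quick cross-check on a few splits:

```python
# Cross-check: total hours ~ utterances * average duration / 3600, per the table
splits = {
    # name: (utterances, avg duration in s, reported hours)
    "AISHELL-1 train": (120_098, 4.5, 150),
    "PAS train": (89_600, 4.3, 107),
    "PAS test": (19_200, 5.5, 29),
}
for name, (n, avg_s, hours) in splits.items():
    assert abs(n * avg_s / 3600 - hours) < 2, name  # agrees to within rounding
```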
Table 4. Model CERs on ASR and VSG tasks at different training stages.
| Training Stage | Model Scale | Dataset | Training Epochs | Frozen Layer(s) | Development (%) | Test (%) |
|---|---|---|---|---|---|---|
| Stage 1 | Base | AISHELL-1 | 130 | Shrinkage + HuBERT | 6.2 | 6.4 |
| Stage 1 | Large | AISHELL-1 | 130 | Shrinkage + HuBERT | 5.9 | 6.1 |
| Stage 2 | Base | PAS | 40 | HuBERT | 13.3 | 13.7 |
| Stage 2 | Large | PAS | 40 | HuBERT | 12.1 | 12.5 |
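The CER values in Tables 4 and 5 are character error rates: the character-level edit distance between hypothesis and reference, divided by the reference length. A minimal reference implementation:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate = Levenshtein distance(ref, hyp) / len(ref)."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))  # one-row dynamic-programming table
    for i in range(1, m + 1):
        prev, d[0] = d[0], i  # prev holds d[i-1][j-1]
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(
                d[j] + 1,                       # deletion
                d[j - 1] + 1,                   # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
            )
    return d[n] / m

print(cer("abcd", "abed"))  # 0.25 (one substitution over four reference characters)
```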
Table 5. Ablation test configurations and the CER results. ("–" indicates that the corresponding configuration failed to converge or did not produce a valid recognition result under the given setting.)
| | Configuration | AISHELL-1 Dev. (%) | AISHELL-1 Test (%) | PAS Dev. (%) | PAS Test (%) |
|---|---|---|---|---|---|
| | Baseline model | 5.9 | 6.1 | 12.1 | 12.5 |
| Test 1 | Shrinkage layer frozen during stage 2 | 5.9 | 6.1 | 18.5 | 19.1 |
| Test 2 | 10 decoder blocks | 5.8 | 6.1 | 12.0 | 12.1 |
| | 8 decoder blocks | 5.8 | 6.1 | 12.0 | 12.2 |
| | 6 decoder blocks (baseline model) | 5.9 | 6.1 | 12.1 | 12.5 |
| | 4 decoder blocks | 6.4 | 6.7 | 13.4 | 13.6 |
| Test 3 | Without majority voting | 5.9 | 6.1 | 21.4 | 24.5 |
| Test 4 | PAS selection: 8 × 8 region, t = 0.90 | | | – | – |
| | 8 × 8 region, t = 0.85 | | | 31.5 | 32.1 |
| | 8 × 8 region, t = 0.80 (baseline model) | | | 12.1 | 12.5 |
| | 8 × 8 region, t = 0.75 | | | 12.3 | 12.8 |
| | 4 × 4 region, t = 0.80 | | | 12.1 | 12.5 |
| | 16 × 16 region, t = 0.80 | | | 14.7 | 16.2 |
| | 32 × 32 region, t = 0.80 | | | – | – |
| | 8 × 8 region, weighted averaging | | | 14.6 | 15.2 |
| Test 5 | Visual microphone + ASR (baseline model without stage 2) | | | | 40.3 |
| Test 6 | PAS + ASR (baseline model without stage 2) | | | | 56.7 |
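Test 3 shows that fusing the per-PAS hypotheses by majority voting matters (CER roughly doubles on the PAS test set without it). The voting rule itself is not spelled out in this excerpt, so the sketch below uses a simple position-wise character vote as one plausible illustration; the `majority_vote` helper and its padding scheme are hypothetical:

```python
from collections import Counter

def majority_vote(hypotheses):
    """Fuse transcriptions by voting on the most common character per position (illustrative)."""
    width = max(len(h) for h in hypotheses)
    padded = [h.ljust(width, "\0") for h in hypotheses]  # pad shorter hypotheses
    fused = "".join(Counter(col).most_common(1)[0][0] for col in zip(*padded))
    return fused.rstrip("\0")

print(majority_vote(["hello", "hallo", "hellp"]))  # hello
```

Each of the 256 PASs yields its own hypothesis, so occasional per-signal errors are outvoted by the majority.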
Table 6. Representative runtimes of different stages in the VSG pipeline.
| Stage | Configuration | Input | Runtime (s) |
|---|---|---|---|
| PME (PAS generation) | Steerable pyramid (1 scale, 2 orientations) | Video (128 × 128 pixels, 5 s) | 129.79 |
| | Steerable pyramid (1 scale, 4 orientations) | Video (128 × 128 pixels, 5 s) | 159.07 |
| | Steerable pyramid (2 scales, 4 orientations) | Video (128 × 128 pixels, 5 s) | 196.27 |
| | Steerable pyramid (3 scales, 4 orientations) | Video (128 × 128 pixels, 5 s) | 212.38 |
| VSG-Transformer (Base) | HuBERT-Base (95 M parameters) | PAS sequence (5 s) | 0.16 |
| | HuBERT-Base (95 M parameters) | 256 PASs per video (5 s) | 27.12 |
| VSG-Transformer (Large) | HuBERT-Large (317 M parameters) | PAS sequence (5 s) | 0.43 |
| | HuBERT-Large (317 M parameters) | 256 PASs per video (5 s) | 75.51 |
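The per-video runtimes show the benefit of batching: processing all 256 PASs of one 5 s video together costs far less per PAS than 256 independent forward passes. From the Base rows of the table:

```python
# Amortized inference cost per PAS, VSG-Transformer (Base)
single_pas_s = 0.16       # one PAS sequence (5 s)
batch_total_s = 27.12     # 256 PASs per video (5 s)
per_pas_batched_s = batch_total_s / 256
print(round(per_pas_batched_s, 3))  # 0.106 s, well below the 0.16 s single-sequence cost
```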
Table 7. Performance variation under different acquisition and environmental conditions.
| Category | Factor | Levels/Settings | CER (%) |
|---|---|---|---|
| Acquisition geometry | Camera–object distance (m) | 2.5 | 12.5 |
| | | 5 | 12.5 |
| | | 7.5 | 17.9 |
| | | 10 | 28.6 |
| Acoustic excitation | Sound pressure level (SPL) (dB) | 80 | 12.5 |
| | | 75 | 22.3 |
| | | 70 | 45.4 |
| | | 65 | – |
| Illumination condition | Lighting condition | Natural lighting | 12.5 |
| | | Dim lighting | 15.7 |
| | | Sparse highlights | 24.3 |
| Focus quality | Defocus level | No defocus | 12.5 |
| | | Moderate defocus | – |
| | | Severe defocus | – |
| Scene composition | Object occupancy ratio (%) | ~100 | 12.5 |
| | | ~75 | 13.4 |
| | | ~50 | – |
| | | ~25 | – |
| Speech complexity | Overlapping speech (target speaker: ~80 dB) | Interference level: ~80 dB | – |
| | | Interference level: ~75 dB | – |
| | | Interference level: ~70 dB | 38.1 |
| | | Interference level: ~65 dB | 12.7 |
Table 8. The CER results for a VSG using different upsampling techniques and ratios.
| Upsampling Method | 2× Dev. (%) | 2× Test (%) | 4× Dev. (%) | 4× Test (%) | 8× Dev. (%) | 8× Test (%) |
|---|---|---|---|---|---|---|
| BSI | 22.3 | 27.1 | 34.8 | 41.5 | – | – |
| DNN [38] | 14.4 | 14.8 | 16.8 | 17.3 | 24.7 | 25.6 |
| AFiLM [39] | 14.3 | 14.8 | 16.2 | 16.8 | 22.5 | 24.2 |
| Phase-net [40] | 13.2 | 13.6 | 14.1 | 15.3 | 18.4 | 19.6 |

The 2×, 4×, and 8× ratios correspond to original sampling rates of 8 kHz, 4 kHz, and 2 kHz, respectively. For reference, the model operating on the original 16 kHz signals (no upsampling) achieves 12.1% (development) and 12.5% (test).
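As a non-learned point of comparison for the methods above, classical band-limited (polyphase) upsampling restores the target sampling rate but cannot recover spectral content above the original Nyquist limit; the tone and rates below are illustrative:

```python
import numpy as np
from scipy.signal import resample_poly

# 4x case from the table: a signal originally sampled at 4 kHz, upsampled to 16 kHz.
sr_low, sr_high = 4_000, 16_000
t = np.arange(0, 0.5, 1 / sr_low)
x_low = np.sin(2 * np.pi * 440 * t)           # 440 Hz tone sampled at 4 kHz
x_up = resample_poly(x_low, sr_high, sr_low)  # band-limited interpolation to 16 kHz
print(len(x_low), len(x_up))                  # 2000 8000
```

The learned methods (DNN, AFiLM, Phase-net) instead attempt to hallucinate the missing high-frequency band, which is why they outperform simple interpolation in the table.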

Share and Cite

MDPI and ACS Style

Wang, Y.; Wang, Y.; Zhang, X.; Ding, X. A Vision-Based Subtitle Generator: Text Reconstruction via Subtle Vibrations from Videos. Sensors 2026, 26, 1407. https://doi.org/10.3390/s26051407

