A Vision-Based Subtitle Generator: Text Reconstruction via Subtle Vibrations from Videos
Abstract
1. Introduction
- (1) Audio recovery techniques are oriented toward human auditory perception, which imposes strict requirements on the complex signal-processing pipelines that are often built with expert prior knowledge. Recognition performance can also vary significantly with the hearing capacity and training level of the listener.
- (2) The principal limitation of classification-based techniques is the restricted output space of existing models: classification is typically limited to isolated words or digits drawn from precompiled dictionaries, so applying such methods to a new task requires compiling a task-specific word list.
2. Methods
2.1. General Framework of the Method
- Video capture: The response of any object to sound is purely physical. As the acoustic excitations vary, the resulting object surface vibrations are captured by a camera, effectively transforming physical displacements into pixel-level signals within the video frames;
- PAS acquisition: PASs are obtained via phase-based processing of the pixel signals;
- VSG-Transformer training and testing: Large-scale PASs that encode rich acoustic features are used to construct the PAS dataset employed to train and evaluate VSG-Transformer. A multi-stage transfer learning strategy effectively links the pretrained acoustic representations of HuBERT to the PAS-driven VSG task;
- Text reconstruction: The trained VSG-Transformer reconstructs text from the PASs extracted from new, previously unseen videos (a toy end-to-end sketch of this flow follows the list).
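To make the data flow concrete, the sketch below traces a video through the four stages. Every name and implementation in it (`capture_video`, `extract_pas`, `ToyVSGTransformer`) is a hypothetical stand-in used only for illustration; it is not the authors' released code, and the real PAS extraction and model are described in Sections 2.2 and 2.3.

```python
import numpy as np

# Toy stand-ins for the four stages above; every name here is a hypothetical
# placeholder for illustration, not the authors' released code.

def capture_video(n_frames=2200, height=64, width=64, seed=0):
    """Stand-in for high-speed video capture: (T, H, W) grayscale frames."""
    rng = np.random.default_rng(seed)
    return rng.random((n_frames, height, width), dtype=np.float32)

def extract_pas(frames):
    """Stand-in for phase-based PAS acquisition (a fuller sketch appears in
    Section 2.2); here each frame is simply averaged to a scalar."""
    signal = frames.reshape(frames.shape[0], -1).mean(axis=1)
    return signal - signal.mean()

class ToyVSGTransformer:
    """Stand-in for the trained VSG-Transformer: maps a PAS to token strings."""
    def transcribe(self, pas):
        return ["<tok>"] * max(1, len(pas) // 400)

frames = capture_video()              # 1. video capture
pas = extract_pas(frames)             # 2. PAS acquisition
model = ToyVSGTransformer()           # 3. trained model (training omitted here)
tokens = model.transcribe(pas)        # 4. text reconstruction
print(pas.shape, len(tokens))
```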
2.2. Extraction of PAS
- (a) Computation of Local Motion Signals
- (b) PAS Synthesis
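The two steps above are commonly realized with phase-based motion estimation in the spirit of the visual microphone [1] and phase-based video motion processing [27]: each frame is filtered with a complex oriented band-pass filter, the local phase change relative to a reference frame is computed per pixel, and an amplitude-weighted spatial average yields a one-dimensional motion signal from which the PAS is synthesized. The sketch below follows that generic recipe only; the filter bank, weighting, and region-selection details of the authors' PAS synthesis may differ, and all function names are illustrative.

```python
import numpy as np

def gabor_kernel_freq(h, w, orientation, f0=0.125, sigma=0.08):
    """Frequency-domain mask of a complex oriented band-pass (Gabor-like) filter.
    Keeping essentially one half-plane of the spectrum gives a complex spatial
    response whose angle is the local phase."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    u = fx * np.cos(orientation) + fy * np.sin(orientation)
    v = -fx * np.sin(orientation) + fy * np.cos(orientation)
    return np.exp(-((u - f0) ** 2 + v ** 2) / (2 * sigma ** 2))

def local_motion_signal(frames, orientation):
    """Amplitude-weighted mean of the local phase change w.r.t. the first frame."""
    h, w = frames.shape[1:]
    mask = gabor_kernel_freq(h, w, orientation)
    resp = np.fft.ifft2(np.fft.fft2(frames, axes=(1, 2)) * mask, axes=(1, 2))
    phase, amp2 = np.angle(resp), np.abs(resp) ** 2
    dphi = np.angle(np.exp(1j * (phase - phase[0])))      # wrapped phase difference
    return (amp2 * dphi).sum(axis=(1, 2)) / (amp2.sum(axis=(1, 2)) + 1e-12)

def synthesize_pas(frames, orientations=(0.0, np.pi / 2)):
    """Average per-orientation motion signals into one PAS and normalize it."""
    sig = np.mean([local_motion_signal(frames, o) for o in orientations], axis=0)
    sig = sig - sig.mean()
    return sig / (np.abs(sig).max() + 1e-12)

# Toy demo: a pattern translating with a 20 Hz oscillation, sampled at 2 kHz.
fs = 2000
t = np.arange(400) / fs
x = np.linspace(0, 16 * np.pi, 64)                        # spatial carrier
shift = 1.5 * np.sin(2 * np.pi * 20 * t)                  # sub-radian phase modulation
frames = np.sin(x[None, None, :] + shift[:, None, None]) * np.ones((1, 64, 1))
pas = synthesize_pas(frames, orientations=(0.0,))
peak = np.fft.rfftfreq(len(pas), d=1 / fs)[np.argmax(np.abs(np.fft.rfft(pas)))]
print(round(peak, 1))                                     # ~20.0 Hz recovered
```

In the paper's pipeline, many such signals are extracted from selected regions and orientations and combined into PASs; the toy demo merely shows that a 20 Hz vibration is recoverable from a synthetic translating pattern.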
2.3. Proposed Model: The VSG-Transformer
- (a) The convolutional shrinkage module
- (b) The pretrained HuBERT encoder
- (c) The decoder
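As a rough illustration of component (a), the block below applies channel-wise soft thresholding with thresholds learned from the features themselves, in the spirit of the deep shrinkage/deep residual shrinkage networks [28,29] on which such modules are based. The exact layer sizes and wiring of the VSG-Transformer's shrinkage module are not reproduced here, so this PyTorch [33] sketch should be read as an assumption-laden example rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvShrinkageBlock(nn.Module):
    """Illustrative 1-D convolutional shrinkage block: a residual unit whose
    features are denoised by channel-wise soft thresholding, with thresholds
    predicted from the features (cf. deep residual shrinkage networks [29])."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
        )
        # Small sub-network predicting a per-channel threshold in [0, mean|y|].
        self.threshold_net = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(),
            nn.Linear(channels, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (B, C, T)
        y = self.conv(x)
        abs_mean = y.abs().mean(dim=2)                         # (B, C)
        tau = (abs_mean * self.threshold_net(abs_mean)).unsqueeze(2)   # (B, C, 1)
        y = torch.sign(y) * torch.clamp(y.abs() - tau, min=0.0)        # soft threshold
        return torch.relu(y + x)                               # residual connection

# Toy usage on a batch of PAS-like feature sequences: (batch, channels, time).
block = ConvShrinkageBlock(channels=32)
out = block(torch.randn(4, 32, 1600))
print(out.shape)   # torch.Size([4, 32, 1600])
```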
3. Experimental Validation
3.1. Dataset Generation
3.2. Training and Testing
3.3. Results and Ablation Studies
3.4. Computational Cost and Runtime Analysis
3.5. Failure Modes and Operational Limitations
3.6. VSG Operation on Videos with Limited Temporal Sampling
4. Conclusions and Future Work
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Davis, A.; Rubinstein, M.; Wadhwa, N.; Mysore, G.; Durand, F.; Freeman, W. The visual microphone: Passive recovery of sound from video. ACM Trans. Graph. 2014, 33, 1–10. [Google Scholar] [CrossRef]
- Nassi, B.; Pirutin, Y.; Swissa, R.; Shamir, A.; Elovici, Y.; Zadov, B. Lamphone: Real-time passive sound recovery from light bulb vibrations. Cryptol. ePrint Arch. 2020, 2020, 4401–4417. [Google Scholar]
- Rothberg, S.; Baker, J.; Halliwell, N. Laser vibrometry: Pseudo-vibrations. J. Sound Vib. 1989, 135, 516–522. [Google Scholar] [CrossRef]
- Nassi, B.; Swissa, R.; Shams, J.; Zadov, B.; Elovici, Y. The little seal bug: Optical sound recovery from lightweight reflective objects. In Proceedings of the IEEE Security and Privacy Workshops, San Francisco, CA, USA, 25 May 2023; IEEE: New York, NY, USA, 2023; pp. 298–310. [Google Scholar] [CrossRef]
- Nassi, B.; Pirutin, Y.; Galor, T.; Elovici, Y.; Zadov, B. Glowworm attack: Optical tempest sound recovery via a device’s power indicator led. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, 15–19 November 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1900–1914. [Google Scholar] [CrossRef]
- Kwong, A.; Xu, W.; Fu, K. Hard drive of hearing: Disks that eavesdrop with a synthesized microphone. In Proceedings of the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 19–23 May 2019; IEEE: New York, NY, USA, 2019; pp. 905–919. [Google Scholar] [CrossRef]
- Zhang, D.; Guo, J.; Jin, Y.; Zhu, C. Efficient subtle motion detection from high-speed video for sound recovery and vibration analysis using singular value decomposition-based approach. Opt. Eng. 2017, 56, 094105. [Google Scholar] [CrossRef]
- Zhang, D.; Guo, J.; Lei, X.; Zhu, C. Note: Sound recovery from video using svd-based information extraction. Rev. Sci. Instrum. 2016, 87, 086111. [Google Scholar] [CrossRef] [PubMed]
- Guri, M.; Solewicz, Y.; Daidakulov, A.; Elovici, Y. SPEAKE(a)R: Turn speakers to microphones for fun and profit. In Proceedings of the 11th USENIX Workshop on Offensive Technologies, Vancouver, BC, Canada, 14–15 August 2017. [Google Scholar] [CrossRef]
- Chen, J.; Davis, A.; Wadhwa, N.; Durand, F.; Freeman, W.; Büyüköztürk, O. Video camera–based vibration measurement for civil infrastructure applications. J. Infrastruct. Syst. 2017, 23, B4016013. [Google Scholar] [CrossRef]
- Davis, A.; Bouman, K.; Chen, J.; Rubinstein, M.; Durand, F.; Freeman, W. Visual vibrometry: Estimating material properties from small motion in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015; pp. 5335–5343. [Google Scholar] [CrossRef]
- Zona, A. Vision-based vibration monitoring of structures and infrastructures: An overview of recent applications. Infrastructures 2021, 6, 4. [Google Scholar] [CrossRef]
- Michalevsky, Y.; Boneh, D.; Nakibly, G. Gyrophone: Recognizing speech from gyroscope signals. In Proceedings of the 23rd USENIX Security Symposium, San Diego, CA, USA, 20–22 August 2014; USENIX Association: Berkeley, CA, USA, 2014; pp. 1053–1067. [Google Scholar]
- Long, Y.; Naghavi, P.; Kojusner, B.; Butler, K.; Rampazzi, S.; Fu, K. Side eye: Characterizing the limits of pov acoustic eavesdropping from smartphone cameras with rolling shutters and movable lenses. In Proceedings of the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 21–25 May 2023; IEEE: New York, NY, USA, 2023; pp. 1857–1874. [Google Scholar] [CrossRef]
- Zhang, L.; Pathak, P.; Wu, M.; Zhao, Y.; Mohapatra, P. AccelWord: Energy efficient hotword detection through accelerometer. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services, Florence, Italy, 18–22 May 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 301–315. [Google Scholar] [CrossRef]
- Han, J.; Chung, A.J.; Tague, P. PitchIn: Eavesdropping via intelligible speech reconstruction using non-acoustic sensor fusion. In Proceedings of the 16th ACM/IEEE International Conference on Information Processing in Sensor Networks, Pittsburgh, PA, USA, 18–20 April 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 181–192. [Google Scholar] [CrossRef]
- Wang, G.; Zou, Y.; Zhou, Z.; Wu, K.; Ni, L. We can hear you with Wi-Fi! IEEE Trans. Mob. Comput. 2016, 15, 2907–2920. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Hsu, W.; Bolte, B.; Tsai, Y.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 5754–5764. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In Proceedings of the 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, Seoul, Republic of Korea, 1–3 November 2017; IEEE: New York, NY, USA, 2017; pp. 1–5. [Google Scholar] [CrossRef]
- Fleet, D.; Jepson, A. Computation of component image velocity from local phase information. Int. J. Comput. Vis. 1990, 5, 77–104. [Google Scholar] [CrossRef]
- Gautama, T.; VanHulle, M. A phase-based approach to the estimation of the optical flow field using spatial filtering. IEEE Trans. Neural Netw. 2002, 13, 1127–1136. [Google Scholar] [CrossRef]
- Freeman, W.; Adelson, E. The design and use of steerable filters. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 891–906. [Google Scholar] [CrossRef]
- Chou, J.; Chang, C.; Spencer, J. Out-of-plane modal property extraction based on multi-level image pyramid reconstruction using stereophotogrammetry. Mech. Syst. Signal Process. 2022, 169, 108786. [Google Scholar] [CrossRef]
- Wadhwa, N.; Rubinstein, M.; Durand, F.; Freeman, W. Phase based video motion processing. ACM Trans. Graph. 2013, 32, 80. [Google Scholar] [CrossRef]
- Isogawa, K.; Ida, T.; Shiodera, T.; Takeguchi, T. Deep shrinkage convolutional neural network for adaptive noise reduction. IEEE Signal Process. Lett. 2018, 25, 224–228. [Google Scholar] [CrossRef]
- Zhao, M.; Zhong, S.; Fu, X.; Tang, B.; Pecht, M. Deep residual shrinkage networks for fault diagnosis. IEEE Trans. Ind. Inform. 2020, 16, 4681–4690. [Google Scholar] [CrossRef]
- Zhang, B.; Lv, H.; Guo, P.; Shao, Q.; Yang, C.; Xie, L. Wenetspeech: A 10000 hours multi-domain mandarin corpus for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6182–6186. [Google Scholar] [CrossRef]
- Joshi, M.; Chen, D.; Liu, Y.; Weld, D.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
- Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv 2020, arXiv:2006.11477. [Google Scholar]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035. [Google Scholar]
- Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2015, arXiv:1412.6980. [Google Scholar]
- Chorowski, J.; Jaitly, N. Towards better decoding and language model integration in sequence to sequence models. arXiv 2017, arXiv:1612.02695. [Google Scholar]
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
- Yang, Y.; Hira, M.; Ni, Z.; Astafurov, A.; Chen, C.; Puhrsch, C. Torchaudio: Building blocks for audio and speech processing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 6982–6986. [Google Scholar] [CrossRef]
- Li, K.; Huang, Z.; Xu, Y.; Lee, C. DNN-based speech bandwidth expansion and its application to adding high-frequency missing features for automatic speech recognition of narrowband speech. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015; International Speech Communication Association: Grenoble, France, 2015; pp. 2578–2582. [Google Scholar]
- Rakotonirina, N. Self-attention for audio super-resolution. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, Gold Coast, Australia, 25–28 October 2021; IEEE: New York, NY, USA, 2021; pp. 1–6. [Google Scholar] [CrossRef]
- Meyer, S.; Djelouah, A.; McWilliams, B.; Sorkine-Hornung, A.; Gross, M.; Schroers, C. Phasenet for video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 498–507. [Google Scholar]









| Method | Exploited Device | Sampling Rate | Technique Category |
|---|---|---|---|
| Lamphone [2,4] | Photodiode | 2–4 kHz | Recovery |
| LDVs [3] | Laser transceiver | 40 kHz | |
| Glowworm [5] | Photodiode | 4–8 kHz | |
| Hard Drive of Hearing [6] | Magnetic hard drive | 17 kHz | |
| Visual Microphone [1] | High-speed camera | 2–20 kHz | |
| SVD [7,8] | High-speed camera | 2.2 kHz | |
| SPEAKE(a)R [9] | Speakers | 48 kHz | |
| Gyrophone [13] | Gyroscope | 200 Hz | Classification |
| Side Eye [14] | Smartphone cameras | 60 Hz | |
| AccelWord [15] | Accelerometer | 200 Hz | |
| PitchIn [16] | Fusion of several motion sensors | 2 kHz | |
| WiHear [17] | Software-defined radio | 300 Hz | |
| VSG of the present paper | High-speed camera | 2–16 kHz | Generation |
| Module | Parameter | Base | Large |
|---|---|---|---|
| CNN encoder | Strides | 5, 2, 2, 2, 2, 2, 2 | 5, 2, 2, 2, 2, 2, 2 |
| | Kernel Width | 10, 3, 3, 3, 3, 2, 2 | 10, 3, 3, 3, 3, 2, 2 |
| | Channels | 512 | 512 |
| Transformer | Blocks | 12 | 24 |
| | Embedding Dimension | 768 | 1024 |
| | Inner FFN Dimension | 3072 | 4096 |
| | Attention Heads | 12 | 16 |
| Number of Parameters | | 95 M | 317 M |
| | | AISHELL-1 Train. | AISHELL-1 Dev. | AISHELL-1 Test | PAS Train. | PAS Dev. | PAS Test |
|---|---|---|---|---|---|---|---|
| Utterances | | 120,098 | 14,326 | 7176 | 89,600 | 19,200 | 19,200 |
| Hours | | 150 | 18 | 10 | 107 | 28 | 29 |
| Durations (s) | Min. | 1.2 | 1.6 | 1.9 | 3.5 | 3.8 | 3.5 |
| | Max. | 14.5 | 12.5 | 14.7 | 12.4 | 10.2 | 10.4 |
| | Avg. | 4.5 | 4.5 | 5.0 | 4.3 | 5.3 | 5.5 |
| Tokens | Min. | 1.0 | 3.0 | 3.0 | 4.0 | 5.0 | 4.0 |
| | Max. | 44.0 | 35.0 | 37.0 | 35.0 | 22.0 | 26.0 |
| | Avg. | 14.4 | 14.3 | 14.6 | 13.6 | 12.3 | 11.2 |
| Training Stage | Model Scale | AISHELL-1 | PAS | Training Epochs | Frozen Layer(s) | Development CER (%) | Test CER (%) |
|---|---|---|---|---|---|---|---|
| Stage 1 | Base | √ | - | 130 | Shrinkage + HuBERT | 6.2 | 6.4 |
| | Large | √ | - | 130 | Shrinkage + HuBERT | 5.9 | 6.1 |
| Stage 2 | Base | - | √ | 40 | HuBERT | 13.3 | 13.7 |
| | Large | - | √ | 40 | HuBERT | 12.1 | 12.5 |
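The freezing schedule in the table can be expressed compactly in PyTorch. The sketch below assumes the model exposes `shrinkage`, `hubert`, and `decoder` submodules (hypothetical attribute names, not the authors' code) and uses Adam [34] as in the paper; the learning rate and other optimizer settings are purely illustrative.

```python
import torch

def configure_stage(model: torch.nn.Module, stage: int) -> torch.optim.Optimizer:
    """Illustrative two-stage schedule matching the table above.
    Stage 1 (AISHELL-1): freeze shrinkage module and HuBERT, train the decoder.
    Stage 2 (PAS dataset): freeze HuBERT only, train shrinkage module and decoder."""
    for p in model.parameters():
        p.requires_grad = True
    frozen = [model.shrinkage, model.hubert] if stage == 1 else [model.hubert]
    for module in frozen:
        for p in module.parameters():
            p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4)   # Adam [34]; lr is illustrative

# Minimal stand-in model with the three assumed submodules.
model = torch.nn.Module()
model.shrinkage = torch.nn.Conv1d(1, 16, 3, padding=1)
model.hubert = torch.nn.Linear(16, 16)
model.decoder = torch.nn.Linear(16, 8)

opt1 = configure_stage(model, stage=1)   # then train on AISHELL-1 (130 epochs)
opt2 = configure_stage(model, stage=2)   # then fine-tune on the PAS dataset (40 epochs)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))
```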
| | Configuration | AISHELL-1 Dev. CER (%) | AISHELL-1 Test CER (%) | PAS Dev. CER (%) | PAS Test CER (%) |
|---|---|---|---|---|---|
| | Baseline model | 5.9 | 6.1 | 12.1 | 12.5 |
| Test 1 | Shrinkage layer frozen during stage 2 | 5.9 | 6.1 | 18.5 | 19.1 |
| Test 2 | 10 decoder blocks | 5.8 | 6.1 | 12.0 | 12.1 |
| | 8 decoder blocks | 5.8 | 6.1 | 12.0 | 12.2 |
| | 6 decoder blocks (baseline model) | 5.9 | 6.1 | 12.1 | 12.5 |
| | 4 decoder blocks | 6.4 | 6.7 | 13.4 | 13.6 |
| Test 3 | Without majority voting | 5.9 | 6.1 | 21.4 | 24.5 |
| Test 4 | PAS selection: 8 × 8 region, pooling = 0.90 | 5.9 | 6.1 | - | - |
| | PAS selection: 8 × 8 region, pooling = 0.85 | | | 31.5 | 32.1 |
| | PAS selection: 8 × 8 region, pooling = 0.80 | | | 12.1 | 12.5 |
| | PAS selection: 8 × 8 region, pooling = 0.75 | | | 12.3 | 12.8 |
| | PAS selection: 4 × 4 region, pooling = 0.80 | | | 12.1 | 12.5 |
| | PAS selection: 16 × 16 region, pooling = 0.80 | | | 14.7 | 16.2 |
| | PAS selection: 32 × 32 region, pooling = 0.80 | | | - | - |
| | PAS selection: 8 × 8 region, weighted averaging | | | 14.6 | 15.2 |
| Test 5 | Visual microphone + ASR (baseline model without stage 2): CER 40.3% | | | | |
| Test 6 | PAS + ASR (baseline model without stage 2): CER 56.7% | | | | |
| Stage | Configuration | Input | Runtime (s) |
|---|---|---|---|
| PME (PAS generation) | Steerable pyramid (1 scale, 2 orientations) | Video (128 × 128 pixels, 5 s) | 129.79 |
| | Steerable pyramid (1 scale, 4 orientations) | Video (128 × 128 pixels, 5 s) | 159.07 |
| | Steerable pyramid (2 scales, 4 orientations) | Video (128 × 128 pixels, 5 s) | 196.27 |
| | Steerable pyramid (3 scales, 4 orientations) | Video (128 × 128 pixels, 5 s) | 212.38 |
| VSG-Transformer (Base) | HuBERT-Base (95 M parameters) | PAS sequence (5 s) | 0.16 |
| | HuBERT-Base (95 M parameters) | 256 PASs per video (5 s) | 27.12 |
| VSG-Transformer (Large) | HuBERT-Large (317 M parameters) | PAS sequence (5 s) | 0.43 |
| | HuBERT-Large (317 M parameters) | 256 PASs per video (5 s) | 75.51 |
| Category | Factor | Levels/Settings | CER (%) |
|---|---|---|---|
| Acquisition geometry | Camera–object distance (m) | 2.5 | 12.5 |
| | | 5 | 12.5 |
| | | 7.5 | 17.9 |
| | | 10 | 28.6 |
| Acoustic excitation | Sound pressure level (SPL) (dB) | 80 | 12.5 |
| | | 75 | 22.3 |
| | | 70 | 45.4 |
| | | 65 | - |
| Illumination condition | Lighting condition | Natural lighting | 12.5 |
| | | Dim lighting | 15.7 |
| | | Sparse highlights | 24.3 |
| Focus quality | Defocus level | No defocus | 12.5 |
| | | Moderate defocus | - |
| | | Severe defocus | - |
| Scene composition | Object occupancy ratio (%) | ~100 | 12.5 |
| | | ~75 | 13.4 |
| | | ~50 | - |
| | | ~25 | - |
| Speech complexity | Overlapping speech (target speaker: ~80 dB) | Interference level: ~80 dB | - |
| | | Interference level: ~75 dB | - |
| | | Interference level: ~70 dB | 38.1 |
| | | Interference level: ~65 dB | 12.7 |
| Upsampling Method | 2× (original 8 kHz) Dev. CER (%) | 2× (original 8 kHz) Test CER (%) | 4× (original 4 kHz) Dev. CER (%) | 4× (original 4 kHz) Test CER (%) | 8× (original 2 kHz) Dev. CER (%) | 8× (original 2 kHz) Test CER (%) |
|---|---|---|---|---|---|---|
| BSI | 22.3 | 27.1 | 34.8 | 41.5 | - | - |
| DNN [38] | 14.4 | 14.8 | 16.8 | 17.3 | 24.7 | 25.6 |
| AFiLM [39] | 14.3 | 14.8 | 16.2 | 16.8 | 22.5 | 24.2 |
| Phase-net [40] | 13.2 | 13.6 | 14.1 | 15.3 | 18.4 | 19.6 |

For reference, the original 16 kHz PASs (no upsampling) yield 12.1% (development) / 12.5% (test) CER.