This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

Sign2Story: A Multimodal Framework for Near-Real-Time Hand Gestures via Smartphone Sensors to AI-Generated Audio-Comics

Graduate School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Japan
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(2), 596; https://doi.org/10.3390/s26020596 (registering DOI)
Submission received: 10 December 2025 / Revised: 11 January 2026 / Accepted: 13 January 2026 / Published: 15 January 2026
(This article belongs to the Special Issue Body Area Networks: Intelligence, Sensing and Communication)

Abstract

This study presents a multimodal framework that uses smartphone motion sensors and generative AI to create audio comics from live news headlines. The system operates without direct touch or voice input, responding instead to simple hand-wave gestures, and thus offers an alternative input method that may benefit users who find traditional touch or voice interaction challenging. In our experiments, we investigated generating comics from the latest technology-related news headlines, retrieved via Really Simple Syndication (RSS) and triggered by a simple hand-wave gesture. The proposed framework is extensible beyond comic generation, since other tasks built on large language models and multimodal AI could be integrated by mapping them to different hand gestures. Our experiments with open-source models such as LLaMA, LLaVA, Gemma, and Qwen showed that LLaVA outperforms Qwen3-VL in generating panel-aligned stories, both in inference speed and in output quality relative to the source image. These large language models (LLMs) collectively contribute imaginative and conversational narrative elements that enhance the diversity of storytelling within the comic format. Additionally, we implement an AI-in-the-loop mechanism that iteratively improves output quality without human intervention. Finally, AI-generated audio narration is incorporated into the comics to create an immersive, multimodal reading experience.
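
To make the described pipeline concrete, the minimal Python sketch below illustrates the gesture-triggered flow: a crude hand-wave check over a window of gyroscope readings that, when triggered, pulls the latest headlines over RSS and hands them to the downstream comic-generation stage. All identifiers, the threshold value, and the feed URL are illustrative assumptions for this sketch, not the authors' implementation.

import feedparser  # third-party RSS parser (assumed dependency)

WAVE_THRESHOLD = 3.0                       # rad/s; hypothetical angular-velocity threshold
FEED_URL = "https://example.com/tech.rss"  # placeholder feed URL, not from the paper

def is_hand_wave(gyro_window):
    """Crude wave check: count samples whose z-axis angular velocity
    exceeds the threshold within a short sliding window."""
    strong = [s for s in gyro_window if abs(s["z"]) > WAVE_THRESHOLD]
    return len(strong) >= 4  # several fast reversals ~ a deliberate wave

def fetch_headlines(limit=3):
    """Fetch the most recent headlines from the RSS feed."""
    feed = feedparser.parse(FEED_URL)
    return [entry.title for entry in feed.entries[:limit]]

def on_gyro_window(gyro_window):
    """Entry point called with each window of smartphone gyroscope data."""
    if is_hand_wave(gyro_window):
        headlines = fetch_headlines()
        # Downstream stages (not shown): prompt a vision-language model
        # (e.g., LLaVA) for panel-aligned dialogue, run the AI-in-the-loop
        # refinement pass, and synthesize the audio narration.
        print("Wave detected; generating comic for:", headlines)

In the full framework, other gestures could be mapped to other generation tasks by routing them from the same entry point to different downstream prompts.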
Keywords: Internet of Things; body area networks; gyroscope sensors; large language models (LLMs); generative AI; comics; voice; multimodal; accessibility

Share and Cite

MDPI and ACS Style

Faraz, G.; Jing, L.; Li, X. Sign2Story: A Multimodal Framework for Near-Real-Time Hand Gestures via Smartphone Sensors to AI-Generated Audio-Comics. Sensors 2026, 26, 596. https://doi.org/10.3390/s26020596

AMA Style

Faraz G, Jing L, Li X. Sign2Story: A Multimodal Framework for Near-Real-Time Hand Gestures via Smartphone Sensors to AI-Generated Audio-Comics. Sensors. 2026; 26(2):596. https://doi.org/10.3390/s26020596

Chicago/Turabian Style

Faraz, Gul, Lei Jing, and Xiang Li. 2026. "Sign2Story: A Multimodal Framework for Near-Real-Time Hand Gestures via Smartphone Sensors to AI-Generated Audio-Comics" Sensors 26, no. 2: 596. https://doi.org/10.3390/s26020596

APA Style

Faraz, G., Jing, L., & Li, X. (2026). Sign2Story: A Multimodal Framework for Near-Real-Time Hand Gestures via Smartphone Sensors to AI-Generated Audio-Comics. Sensors, 26(2), 596. https://doi.org/10.3390/s26020596

