Open Access Article
Sign2Story: A Multimodal Framework for Near-Real-Time Hand Gestures via Smartphone Sensors to AI-Generated Audio-Comics
by Gul Faraz, Lei Jing and Xiang Li *
Graduate School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Japan
* Author to whom correspondence should be addressed.
Sensors 2026, 26(2), 596; https://doi.org/10.3390/s26020596
Submission received: 10 December 2025 / Revised: 11 January 2026 / Accepted: 13 January 2026 / Published: 15 January 2026
Abstract
This study presents a multimodal framework that uses smartphone motion sensors and generative AI to create audio comics from live news headlines. The system operates without direct touch or voice input, responding instead to simple hand-wave gestures, and thus shows potential as an alternative input method for users who find traditional touch or voice interaction challenging. In the experiments, we investigated the generation of comics based on the latest technology-related news headlines, retrieved via Really Simple Syndication (RSS) and triggered by a simple hand-wave gesture. The proposed framework is extensible beyond comic generation: other tasks that rely on large language models and multimodal AI could be integrated by mapping them to different hand gestures. Our experiments with open-source models such as LLaMA, LLaVA, Gemma, and Qwen revealed that LLaVA delivers superior results in generating panel-aligned stories compared to Qwen3-VL, in both inference speed and the quality of the output relative to the source image. These large language models (LLMs) collectively contribute imaginative and conversational narrative elements that enhance storytelling diversity within the comic format. Additionally, we implement an AI-in-the-loop mechanism that iteratively improves output quality without human intervention. Finally, AI-generated audio narration is incorporated into the comics to create an immersive, multimodal reading experience.
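For readers who want to prototype the front half of the pipeline described in the abstract, the sketch below detects a hand wave from accelerometer samples and turns the latest RSS headline into a panel-aligned story prompt. This is a minimal sketch, not the authors' implementation: the thresholds, feed URL, panel count, and helper names are illustrative assumptions, and a sliding window over the phone's live sensor stream would replace the synthetic samples shown here.

```python
# Minimal sketch of the gesture -> headline -> story-prompt pipeline.
# Assumptions (not from the paper): accelerometer samples arrive at ~50 Hz;
# the thresholds and feed URL below are illustrative placeholders.
import feedparser

WAVE_MIN_PEAK = 4.0      # m/s^2, minimum lateral swing to count (hypothetical)
WAVE_MIN_REVERSALS = 4   # direction changes needed to call it a wave (hypothetical)

def is_hand_wave(ax_window):
    """Detect a wave as repeated strong sign reversals in lateral acceleration."""
    reversals, prev_sign = 0, 0
    for ax in ax_window:
        if abs(ax) < WAVE_MIN_PEAK:
            continue                      # ignore small jitters
        sign = 1 if ax > 0 else -1
        if prev_sign and sign != prev_sign:
            reversals += 1                # the hand changed direction
        prev_sign = sign
    return reversals >= WAVE_MIN_REVERSALS

def latest_tech_headline(feed_url="https://example.com/tech.rss"):  # placeholder URL
    """Pull the newest headline from an RSS feed."""
    feed = feedparser.parse(feed_url)
    return feed.entries[0].title if feed.entries else None

def panel_script_prompt(headline, n_panels=4):
    """Build a prompt asking an LLM for a panel-aligned comic script."""
    return (f"Write a {n_panels}-panel comic script about this news headline: "
            f"'{headline}'. For each panel, give a one-line scene description "
            f"and one line of dialogue.")

if __name__ == "__main__":
    # In practice ax_window would come from the phone's motion-sensor stream.
    ax_window = [5.1, -4.8, 5.3, -5.0, 4.9]  # synthetic wave-like samples
    if is_hand_wave(ax_window):
        headline = latest_tech_headline()
        if headline:
            print(panel_script_prompt(headline))
```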
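The abstract's AI-in-the-loop mechanism can likewise be sketched as a generate-critique-revise loop: one model drafts the panel story, a reviewer pass critiques it, and the draft is regenerated with the feedback until it is accepted or a retry budget runs out. The model tags, the "OK" acceptance convention, and the retry budget below are assumptions for illustration, not the authors' published configuration; the sketch assumes a locally hosted Ollama server for the open-source models named in the abstract.

```python
# Sketch of an AI-in-the-loop refinement cycle (assumed structure, not the
# paper's exact procedure). Requires a local Ollama server with the models
# pulled; model tags and the acceptance convention are illustrative.
import ollama

MAX_ROUNDS = 3  # retry budget (hypothetical)

def ask(model: str, prompt: str) -> str:
    """One chat turn against a locally hosted model."""
    resp = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"]

def refine_story(prompt: str, generator: str = "llava", reviewer: str = "llama3") -> str:
    """Draft, critique, and revise a panel-aligned story without human input."""
    draft = ask(generator, prompt)
    for _ in range(MAX_ROUNDS):
        verdict = ask(reviewer,
                      "Review this comic script for panel alignment and coherence. "
                      "Reply 'OK' if acceptable, otherwise list concrete fixes:\n\n"
                      + draft)
        if verdict.strip().upper().startswith("OK"):
            break  # reviewer accepted the draft
        # Fold the reviewer's feedback into the next generation pass.
        draft = ask(generator, f"{prompt}\n\nRevise the script per this feedback:\n{verdict}")
    return draft
```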