Keyframe Selection and Multimodal Fusion for Product Recognition in E-Commerce Live Streaming
Abstract
1. Introduction
2. Related Work
2.1. Overview of Related Work
2.2. Review of Visual Object Detection Models
2.3. Video-Level Product Recognition and Multimodal Understanding
2.4. Keyframe Selection and Multi-Frame Strategies
- The absence of a quantifiable frame information value assessment mechanism tailored to downstream product recognition;
- The inability to guarantee the preservation of task-relevant semantic details, such as textual information, brand identifiers, and product specifications;
- A focus of existing multi-frame strategies on semantic diversity rather than on targeted optimization of product recognition performance.
2.5. Image Quality Assessment (IQA)
2.6. Large Language Models and Multimodal Fusion
3. Method
3.1. Problem Definition
3.2. Overall Architecture
- Frame extraction & scoring: Frames are sampled from the input video at a fixed frame rate, and a quality score is computed for each frame (e.g., a combination of detection confidence and Laplacian sharpness). According to the scoring strategy, a single frame or multiple candidate frames are selected for downstream visual processing.
- OCR pipeline: Text recognition is performed on the selected frame to obtain raw textual content and internal OCR confidence scores. The raw text, confidence scores, and positional information are provided as input to an LLM, which is instructed to return normalized and structured outputs, such as brand name, product name, category, and specifications.
- ASR pipeline: The system performs speech recognition on the entire audio stream of the video to obtain a complete transcription of spoken content. This transcription does not depend on the selected frame and does not include temporal or frame-level alignment information. Instead of localizing specific video segments, it serves as clip-level semantic evidence that assists in understanding the products presented in the video. The LLM then parses and standardizes the raw ASR transcription to produce structured information, including potentially mentioned product names, brands, specifications, and their semantic confidence scores. At this stage, the LLM effectively conducts a semantic-level preliminary judgment based on spoken descriptions, forming textual evidence units parallel to the visual modality.
- Visual-semantic recognition with Qwen3-VL: For each product region detected in the selected frame, Qwen3-VL is invoked to perform image-based product recognition and description. In this study, Qwen3-VL specifically refers to Qwen3-VL-30B-A3B-Instruct. The model input consists of cropped product regions and task-specific prompts, and the output includes category cues, possible brand cues, specifications, and confidence scores, forming standardized structured evidence units.
- Evidence aggregation and fusion LLM: Structured outputs from the three pipelines are aggregated as contextual input and provided to the fusion LLM with explicit fusion instructions. The fusion LLM produces the final decision, including the product name, brand, and category.
3.3. Frame Scoring Strategy
3.3.1. Baseline: Fixed-Position Strategy
3.3.2. Strategy A: Traditional Visual Feature-Based Scoring
3.3.3. Strategy B: Deep Learning-Based Scoring
3.4. Frame Selection and Implementation Details
3.4.1. Frame Selection Pipeline
- A detector is applied to each frame to obtain product bounding boxes and their confidence scores.
- For each detected bounding box, either a sharpness score is computed (Strategy A) or the cropped region is fed into the regression model (Strategy B).
- A frame-level composite score is calculated for each frame.
- The frame with the highest composite score is selected as the final recognition frame:
- The selected frame is then passed to the downstream OCR and multimodal fusion modules.
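
The selection steps above can be sketched as follows. This is a minimal illustration of Strategy A only, not the authors' implementation. It assumes that the 0.7 weight applies to detection confidence and 0.3 to sharpness, that sharpness is the Laplacian variance divided by the normalization constant 500 and clipped to 1, and that a frame's score is the maximum over its detected boxes. The source confirms only the 0.7/0.3 weight pair and the constant 500; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Crop = Sequence[Sequence[float]]      # grayscale crop as rows of pixel values
Detection = Tuple[float, Crop]        # (detector confidence, cropped region)


def laplacian_variance(gray: Crop) -> float:
    """Variance of the 4-neighbour Laplacian response over interior pixels."""
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1]
                   + gray[y][x + 1] - 4 * gray[y][x])
            responses.append(lap)
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def box_score(conf: float, sharpness_var: float,
              w_conf: float = 0.7, w_sharp: float = 0.3,
              norm: float = 500.0) -> float:
    """Weighted sum of confidence and normalized sharpness (assumed form)."""
    sharp = min(sharpness_var / norm, 1.0)  # clipping to 1 is an assumption
    return w_conf * conf + w_sharp * sharp


def frame_score(detections: List[Detection]) -> float:
    """Frame-level composite score: best box in the frame (assumed aggregation)."""
    if not detections:
        return 0.0
    return max(box_score(c, laplacian_variance(crop)) for c, crop in detections)


def select_best_frame(frames: List[List[Detection]]) -> int:
    """Index of the frame with the highest composite score."""
    scores = [frame_score(d) for d in frames]
    return max(range(len(frames)), key=scores.__getitem__)
```

Strategy B would replace `laplacian_variance` with the output of the learned regression model while keeping the same selection loop.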
3.4.2. Top-K Multi-Frame Fusion Strategy
3.5. Multimodal Fusion and Final Matching
3.5.1. Multimodal Information Extraction
3.5.2. LLM-Based Fusion Mechanism
- Consistency First: When multiple modalities provide identical or highly consistent information, such results are prioritized to enhance the reliability of the final output.
- Reliability Ranking: Modalities are fused according to their confidence-based reliability, with the priority order defined as OCR (packaging text) > image recognition > ASR (speech), ensuring that highly reliable evidence dominates the final decision.
- Conflict Resolution: In cases where outputs from different modalities conflict, the result with higher confidence and stronger consistency with OCR evidence is preferred.
- Fault-Tolerant Completion: When information from certain modalities is missing, the model generates a complete judgment based on the available modalities, thereby maintaining the stability and robustness of the system output.
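
In the described system these principles are given to the fusion LLM as instructions, so no deterministic rule engine is implied. Purely to make the intended behavior concrete, the following sketch applies the same four principles with fixed rules. It simplifies conflict resolution to the stated priority order and ignores per-modality confidence scores; the modality keys `ocr`, `vlm`, `asr` and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional

PRIORITY = ("ocr", "vlm", "asr")  # OCR > image recognition > ASR
FIELDS = ("brand", "product_name", "category")


def fuse_field(candidates: Dict[str, Optional[str]]) -> str:
    """Fuse one field from per-modality candidates."""
    # Fault-tolerant completion: drop missing or "unknown" values.
    available = {
        m: v.strip() for m, v in candidates.items()
        if v and v.strip().lower() != "unknown"
    }
    if not available:
        return "unknown"
    # Consistency first: a value supported by at least two modalities wins.
    counts = Counter(v.lower() for v in available.values())
    value, support = counts.most_common(1)[0]
    if support >= 2:
        for m in PRIORITY:
            if m in available and available[m].lower() == value:
                return available[m]
    # Reliability ranking and conflict resolution: fall back to priority order.
    for m in PRIORITY:
        if m in available:
            return available[m]
    return "unknown"


def fuse(evidence: Dict[str, Dict[str, Optional[str]]]) -> Dict[str, str]:
    """Fuse each field independently across modalities."""
    return {
        f: fuse_field({m: ev.get(f) for m, ev in evidence.items()})
        for f in FIELDS
    }
```

Each field is fused independently, so a clear brand from one modality never fills in a missing product name.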
3.5.3. Temperature Parameter and Inference Stability
3.6. Data
3.6.1. Data Sources and Overall Statistics
3.6.2. Product Detection Annotation
3.6.3. Ground-Truth Construction for Product Recognition
3.6.4. Data Augmentation and Preprocessing
3.6.5. Pseudo-Labeling Pipeline for EfficientNetV2-M Training
3.6.6. Evaluation Subsets
3.7. Models and Training
3.7.1. D-FINE Detector
3.7.2. EfficientNetV2
4. Experiments
4.1. Model Training Results and Analysis
4.1.1. D-FINE Training Results and Performance Analysis
4.1.2. EfficientNetV2 Training Results and Performance Analysis
4.2. Experimental Metrics and Statistical Methods
4.2.1. Experiment 1: OCR Information Quantity Metrics
4.2.2. Experiment 2: Frame Quality and OCR-Stage Metrics
4.2.3. Experiments 3 and 4: Structured Product Recognition Metrics
- Brand Recognition Accuracy: proportion of samples whose brand similarity is no less than 0.7;
- Product Name Recognition Accuracy: proportion of samples whose product-name similarity is no less than 0.5;
- Category Recognition Accuracy: proportion of samples whose category similarity is no less than 0.5.
- Perfect Match Rate: proportion of videos for which brand, product name, and category are all correctly recognized according to the corresponding field-level thresholds;
- Semantic Similarity: average semantic similarity across the three evaluation fields.
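
The metric definitions above can be computed as in the following sketch. It assumes the per-field similarity scores have already been produced by some similarity function, which is not specified here, and that Semantic Similarity is averaged first over the three fields and then over samples. The function name `evaluate` and the dictionary keys are hypothetical.

```python
from typing import Dict, List

# Field-level thresholds as stated in the metric definitions.
THRESHOLDS = {"brand": 0.7, "product_name": 0.5, "category": 0.5}


def evaluate(similarities: List[Dict[str, float]]) -> Dict[str, float]:
    """Compute field accuracies, Perfect Match Rate, and Semantic Similarity.

    Each element of `similarities` maps a field name to the similarity
    between the predicted and ground-truth value for one video.
    """
    n = len(similarities)
    if n == 0:
        raise ValueError("at least one sample is required")
    report: Dict[str, float] = {}
    for field, thr in THRESHOLDS.items():
        report[f"{field}_accuracy"] = sum(s[field] >= thr for s in similarities) / n
    # A video is a perfect match only if every field clears its own threshold.
    report["perfect_match_rate"] = sum(
        all(s[f] >= t for f, t in THRESHOLDS.items()) for s in similarities
    ) / n
    # Mean over fields, then mean over samples (assumed order of averaging).
    report["semantic_similarity"] = sum(
        sum(s[f] for f in THRESHOLDS) / len(THRESHOLDS) for s in similarities
    ) / n
    return report
```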
4.3. Experiment 1: OCR-Stage Comparison Between Single-Frame and Top-K Multi-Frame Fusion
4.3.1. Experimental Setup
- Single-Frame Strategy: Directly select the single frame with the highest comprehensive quality score as the OCR input;
- Top-3 Fusion Strategy: Select the top 3 frames with the highest quality scores and fuse their OCR outputs;
- Top-5 Fusion Strategy: Select the top 5 frames with the highest quality scores and fuse their OCR outputs.
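
A minimal sketch of the three strategies follows. Selecting the K highest-scoring frames is taken from the setup above; the fusion rule is an assumption, since the way OCR outputs are merged is not specified here. The sketch merges text lines by exact string match and keeps the highest confidence seen for each line. All names are hypothetical.

```python
from typing import Dict, List, Tuple

OcrLine = Tuple[str, float]  # (recognized text, OCR confidence)


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Indices of the k frames with the highest quality scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def fuse_ocr(per_frame: List[List[OcrLine]]) -> List[OcrLine]:
    """Union of OCR lines across frames, keeping the best confidence per text."""
    best: Dict[str, float] = {}
    for lines in per_frame:
        for text, conf in lines:
            key = text.strip()
            if key and conf > best.get(key, -1.0):
                best[key] = conf
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

With `k = 1` this reduces to the single-frame strategy; OCR has to run once per selected frame, which is one reason processing cost grows with K.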
4.3.2. Experimental Results
4.3.3. Result Analysis
4.4. Experiment 2: Comparison of Frame Quality Between Strategy A and Strategy B
4.4.1. Experimental Setup
4.4.2. Experimental Results
- (1) OCR Confidence Performance
- (2) Text Recognition Quality
- (3) Correlation Between Strategy B Scores and OCR-Stage Indicators
- (4) Time Cost
- (5) Human Evaluation Results
4.4.3. Results Analysis
- (1) OCR-Stage Performance and Score Correlation
- (2) Sample-Level Variation and Processing Cost
- (3) Human Evaluation and Overall Interpretation
4.5. Experiment 3: End-to-End Recognition Accuracy Comparison
4.5.1. Experimental Setup
- Baseline: Directly use the last frame of the video for recognition while retaining the proposed OCR/ASR/VLM fusion pipeline;
- External VLM Baseline: Use Qwen3-VL-30B-A3B-Instruct as an external video-language baseline that directly takes the original video as input and predicts the structured product fields;
- Strategy A: Perform end-to-end recognition using the frame selection Strategy A analyzed in Experiment 2, with the 0.7/0.3 weight setting and the Laplacian normalization constant of 500 determined by the sensitivity analysis;
- Strategy B: Perform end-to-end recognition using the frame selection Strategy B analyzed in Experiment 2, with the 0.8/0.2 weight setting determined by the sensitivity analysis.
4.5.2. Sensitivity Analysis of Frame-Scoring Parameters
- (1) Weight Sensitivity
- (2) Sensitivity of the Laplacian Normalization Constant
4.5.3. Experimental Results
4.5.4. Result Analysis
- (1) Overall Performance Analysis
- (2) Comparison with the External VLM Baseline
- (3) Paired Statistical Comparison Between Strategy A and Strategy B
4.5.5. End-to-End Error Analysis
4.6. Experiment 4: Ablation Study on Multimodal Evidence Fusion
4.6.1. Experimental Setup
- OCR-only: uses only textual information extracted from the selected keyframe.
- ASR-only: uses only the speech transcript extracted from the full audio stream of the video.
- VLM-only: uses only the visual-semantic description generated by Qwen3-VL-30B-A3B-Instruct.
- OCR + ASR: combines visual text and speech information.
- OCR + VLM: combines visual text and visual-semantic information.
- ASR + VLM: combines speech information and visual-semantic information.
- Full Framework: integrates OCR, ASR, and VLM evidence for final product recognition.
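
The seven settings above are exactly the non-empty subsets of the three evidence sources. The following sketch enumerates them and masks the evidence passed to the fusion step; the function names and the dictionary layout are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

MODALITIES = ("OCR", "ASR", "VLM")


def ablation_configs() -> List[Tuple[str, ...]]:
    """All non-empty modality subsets: three single, three pairs, one full."""
    return [
        combo
        for size in range(1, len(MODALITIES) + 1)
        for combo in combinations(MODALITIES, size)
    ]


def select_evidence(evidence: Dict[str, dict], config: Tuple[str, ...]) -> Dict[str, dict]:
    """Keep only the evidence of the modalities enabled in this configuration."""
    return {m: evidence[m] for m in config if m in evidence}
```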
4.6.2. Experimental Results
4.6.3. Result Analysis
4.7. Efficiency and Cost Analysis
5. Conclusions
- The Top-K multi-frame fusion strategy improves OCR-stage evidence completeness, including the number of recognized OCR characters and high-confidence text, but its computational cost increases with larger K. Considering both efficiency and OCR evidence completeness, the single-frame strategy offers higher engineering feasibility in resource-constrained and practical deployment scenarios, while Top-K fusion can be regarded as an optional OCR-stage extension when richer textual evidence is required.
- Both Strategy A and Strategy B substantially improve end-to-end structured product recognition over the last-frame baseline. Strategy A provides a simple and interpretable scoring mechanism, while Strategy B further incorporates a learned quality component. In the final end-to-end evaluation, Strategy B achieves the best overall performance among the compared methods, with a Perfect Match Rate of 0.775 and a Semantic Similarity of 0.802. Compared with Strategy A, Strategy B improves the Perfect Match Rate by 0.050, and the paired statistical analysis supports its advantage in video-level perfect matching. Nevertheless, Strategy A remains a lightweight and interpretable alternative for scenarios with stricter computational constraints.
- The multimodal ablation study confirms the complementary value of OCR, ASR, and visual-semantic evidence. OCR serves as the strongest single modality because product identity is often explicitly encoded in packaging text, while ASR and Qwen3-VL provide complementary speech-based and visual-semantic cues. The full framework achieves the best overall performance under both Strategy A and Strategy B, indicating that the benefit of multimodal evidence fusion is not limited to a single frame selection strategy.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Prompt Templates Used in the LLM/VLM Pipeline
Appendix A.1. OCR Evidence Structuring Prompt
- Prompt template. You are a professional e-commerce product recognition assistant. The following text is extracted by OCR. Please extract product information strictly from the OCR text. Do not use external knowledge, do not fabricate missing information, and do not provide explanations.
- Input. The input is the raw OCR text extracted from the selected product frame.
- Output format. The model must output exactly three fields: product name, brand, and category. No additional explanation, reasoning process, analysis, or extra text is allowed.
- Field rules. The product name should be the shortest identifiable official product name supported by the OCR text. Brand, series, model, flavor, or functional terms are retained only when they are part of the product name. The brand must be the consumer-facing brand explicitly visible in the packaging text. Company names, distributors, platform names, or brands inferred only from the product name are not allowed. The category must be selected from the predefined closed set. If the OCR evidence is insufficient for a field, the model outputs “unknown”.
Appendix A.2. ASR Evidence Structuring Prompt
- Prompt template. You are a professional e-commerce product recognition expert. The following text is obtained from the product introduction video using ASR. The transcript may contain spoken expressions, repetitions, recognition errors, pauses, irrelevant content, or transcription noise. Please extract product information only from the speech transcript. Do not use external knowledge or infer missing information from common sense.
- Input. The input is the FunASR transcript extracted from the product introduction video.
- Output format. The model must output exactly three fields: product name, brand, and category. No additional explanation, reasoning process, source analysis, or extra text is allowed.
- Field rules. The product name should be the shortest product name explicitly mentioned in, or directly supported by, the transcript. Brand, series, model, flavor, or functional terms are retained only when they are part of the official product name. The brand must be explicitly mentioned in the speech and must refer to the consumer-facing brand. Company names, distributors, platform names, or brands inferred from the product name are not allowed. The category must be selected from the predefined closed set. If a field cannot be reliably determined from the transcript, the model outputs “unknown”. Minor interpretation of ASR errors is allowed only when it is directly supported by the surrounding context.
Appendix A.3. Visual-Semantic Recognition Prompt
- Prompt template. You are a professional e-commerce product recognition expert. Please observe the target product in the input image and extract key product information. If the image is a cropped product region, only judge the product in the cropped region. If a bounding box is provided, prioritize the single product inside the box and do not rely on background objects or other products.
- Input. The input is the selected product image, cropped product region, or product region indicated by a bounding box.
- Output format. The model must output exactly three fields: product name, brand, and category. No additional explanation, reasoning process, image-quality evaluation, or extra text is allowed.
- Field rules. The model is instructed to focus on the brand logo, product name, series name, flavor, model, and category cues visible on the package. The product name should be the shortest identifiable official product name and should not include pure specifications, promotional words, channel words, or uncertain descriptions. The brand must be the consumer-facing brand visible on the package. Company names, distributors, platform names, or brands inferred only from appearance or category are not allowed. The category must be selected from the predefined closed set. For blurred, occluded, reflective, or unreadable information, the model is instructed not to force completion and to output “unknown” when the visual evidence is insufficient.
Appendix A.4. Final Multimodal Fusion Prompt
- Prompt template. You are a professional e-commerce product recognition expert. You need to synthesize recognition results from multiple data sources and make the final judgment for a single product.
- Input. The input consists of the structured outputs from the available evidence sources, including OCR evidence, ASR evidence, and visual-semantic evidence. The prompt also specifies how many of the three evidence sources are available for the current sample.
- Output format. The model must output exactly three fields: product name, brand, and category. No additional explanation, reasoning process, source comparison, analysis, or extra text is allowed.
- Field rules. The product name should be the shortest identifiable official product name. Brand, series, model, flavor, or functional terms are retained only when they are part of the official product name. The brand must be the consumer-facing brand. Company names, distributors, platform names, or brands inferred from the product name are not allowed. If both Chinese and English names are supported as official brand evidence, both may be retained. The category must be selected from the predefined closed set. If all sources are insufficient for a field, the model outputs “unknown”.
- Fusion principles. The model is instructed to prefer information supported by multiple sources. When different sources conflict, the priority order is OCR packaging text, visual evidence, and then ASR speech evidence. The product name, brand, and category fields are judged independently; a clear value in one field should not be used to fabricate another field. If evidence is insufficient, the model outputs “unknown”.
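
As an illustration of the input described above, the sketch below assembles the structured evidence and the count of available sources into a text block for the fusion prompt. The exact layout used by the authors is not given, so the line format, the source labels, and the name `build_fusion_input` are assumptions.

```python
from typing import Dict, Optional

SOURCES = ("OCR", "Visual", "ASR")  # listed in the stated priority order
FIELDS = ("product_name", "brand", "category")


def build_fusion_input(evidence: Dict[str, Optional[Dict[str, str]]]) -> str:
    """Render available evidence sources as text for the fusion prompt."""
    available = [s for s in SOURCES if evidence.get(s)]
    lines = [f"Available evidence sources: {len(available)} of {len(SOURCES)}"]
    for source in available:
        fields = evidence[source]
        # Fields absent from a source are reported as "unknown".
        parts = ", ".join(f"{f}={fields.get(f, 'unknown')}" for f in FIELDS)
        lines.append(f"[{source}] {parts}")
    return "\n".join(lines)
```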
Appendix A.5. External Direct-Video VLM Baseline Prompt
- Prompt template. Given an e-commerce livestreaming video, identify the primary target product being displayed. Do not enumerate all objects or all products in the video. Output only a structured JSON-style result without Markdown, explanations, or reasoning.
- Input. The input is the original e-commerce livestreaming video.
- Output format. The output must contain three fields:
- brand
- product_name
- category
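
A JSON-style result with these three fields can be validated as in the following sketch. It assumes the model may wrap the object in extra text and that missing or empty fields fall back to "unknown", mirroring the convention of the other prompts; neither behavior is stated for this baseline, and the name `parse_vlm_output` is hypothetical.

```python
import json
from typing import Dict

FIELDS = ("brand", "product_name", "category")


def parse_vlm_output(raw: str) -> Dict[str, str]:
    """Extract the three structured fields from a JSON-style model reply."""
    result = {f: "unknown" for f in FIELDS}
    # Tolerate text around the JSON object by slicing from first "{" to last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return result
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return result
    if not isinstance(data, dict):
        return result
    for field in FIELDS:
        value = data.get(field)
        if isinstance(value, str) and value.strip():
            result[field] = value.strip()
    return result
```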
Appendix A.6. Local Post-Processing

D-FINE training configuration:

| Category | Parameter | Value |
|---|---|---|
| Model | Backbone | HGNetv2-B2 initialized from public stage-1 pretrained weights |
| | Hidden dimension | 256 |
| | Decoder layers | 4 |
| | Number of queries | 300 |
| | Number of classes | 2 (product, background) |
| Training | Epochs | 132 |
| | Batch size | 16 |
| | Learning rate | (backbone: ) |
| | Optimizer | AdamW (, weight decay ) |
| | LR schedule | MultiStepLR (milestone=500, ) + linear warmup |
| | Warmup steps | 500 |
| | AMP | Enabled |
| | EMA | Enabled (decay = 0.9999) |
| Data | Input size | |
| | Data augmentation | PhotometricDistort, RandomZoomOut, RandomIoUCrop, Flip |
| | Stop epoch | 120 |
| Loss | Loss weights | vfl: 1.0, bbox: 5.0, giou: 2.0, fgl: 0.15, ddf: 1.5 |
| Evaluation | Metrics | COCO mAP (IoU 0.5–0.95), AP50, AP75 |

EfficientNetV2-M training configuration:

| Category | Parameter | Value |
|---|---|---|
| Model | Backbone | EfficientNetV2-M (pretrained on ImageNet-1K) |
| | Input resolution | |
| | Parameters | ≈24 M |
| | Dropout rate | 0.3 |
| Training | Batch size | 16 |
| | Epochs | 20 |
| | Learning rate | |
| | Optimizer | Adam |
| | Weight decay | |
| Loss | Function | MSE (regression) |
| LR schedule | Scheduler | ReduceLROnPlateau (factor = 0.5, patience = 3, min lr ) |
| Data | Split | 80%/20% |
| Augmentation | Methods | Flip (50%), brightness/contrast (), Gaussian noise (20%) |
| Training strategy | Early stopping | 5 epochs |

D-FINE training results:

| Metric | Value | Description |
|---|---|---|
| mAP (IoU 0.5–0.95) | 27.76% | COCO standard mean Average Precision over IoU thresholds from 0.5 to 0.95 |
| AP50 | 40.36% | Average Precision at IoU = 0.5 |
| AP75 | 28.25% | Average Precision at IoU = 0.75 |
| AR100 | 44.14% | Average Recall with up to 100 detections per image |
| Final training loss | 20.71 | Loss value at the end of training (epoch 131) |

EfficientNetV2 training results:

| Metric | Value | Description |
|---|---|---|
| Validation MSE | 0.00220 | Mean Squared Error |
| Validation MAE | 0.03667 | Mean Absolute Error |
| Validation RMSE | 0.04687 | Root Mean Squared Error |
| Pearson correlation | 0.9323 | Linear correlation between predicted scores and ground-truth scores |
| Best epoch | 17 | Epoch with the lowest validation loss |
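
The validation metrics in this table follow standard definitions, sketched below for reference. As a consistency check, the reported RMSE of 0.04687 agrees with the square root of the reported MSE of 0.00220 up to rounding. The function name `regression_metrics` is hypothetical.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """MSE, MAE, RMSE, and Pearson correlation between targets and predictions."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    # Pearson correlation is undefined when either series is constant.
    pearson = cov / math.sqrt(var_t * var_p) if var_t > 0 and var_p > 0 else float("nan")
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse), "pearson": pearson}
```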

Experiment 1 OCR information quantity metrics:

| Metric | Description |
|---|---|
| Average number of recognized characters | Average number of recognized characters per video |
| Average OCR confidence | Mean confidence score of recognized OCR text instances |
| High-confidence character count | Average number of recognized characters per video from OCR text instances with confidence |
| Processing time | Average processing time per video from input to OCR completion |

Experiment 2 frame quality and OCR-stage metrics:

| Metric | Definition | Role |
|---|---|---|
| Average OCR confidence | Mean confidence score of OCR text instances recognized from the selected frame | Reflects the reliability of OCR recognition on the selected frame |
| High-confidence text ratio | Proportion of OCR text instances with confidence no less than 0.7 | Evaluates the proportion of reliable OCR text evidence |
| Recognized text count | Number of OCR text instances recognized from the selected frame | Reflects the amount of extracted textual evidence |
| Recognized character count | Number of recognized characters in the selected frame | Serves as a text-volume indicator in the correlation analysis of Strategy B scores |
| Processing time | Total time from video input to selected-frame output | Reflects the runtime efficiency of the frame selection process |

Sensitivity of the Perfect Match Rate and field accuracy to the field-level similarity thresholds:

| Variable | Threshold | Perfect Match Rate | Corresponding Field Accuracy |
|---|---|---|---|
| Brand | 0.600 | 0.795 | 0.805 |
| Brand | 0.650 | 0.790 | 0.800 |
| Brand | 0.700 | 0.775 | 0.795 |
| Brand | 0.750 | 0.755 | 0.770 |
| Brand | 0.800 | 0.695 | 0.710 |
| Product Name | 0.400 | 0.785 | 0.955 |
| Product Name | 0.450 | 0.775 | 0.915 |
| Product Name | 0.500 | 0.775 | 0.895 |
| Product Name | 0.550 | 0.745 | 0.840 |
| Product Name | 0.600 | 0.715 | 0.780 |
| Product Name | 0.650 | 0.675 | 0.705 |
| Product Name | 0.700 | 0.645 | 0.645 |
| Category | 0.400 | 0.775 | 0.990 |
| Category | 0.450 | 0.775 | 0.980 |
| Category | 0.500 | 0.775 | 0.960 |
| Category | 0.550 | 0.765 | 0.940 |
| Category | 0.600 | 0.710 | 0.850 |

Experiment 1 results for single-frame and Top-K multi-frame fusion:

| Metric | Single Frame () | Top-3 Frames | Top-5 Frames |
|---|---|---|---|
| Average Number of Characters | 17 | 20 | 25 |
| Average Confidence | 0.7593 | 0.8897 | 0.8850 |
| Average Number of High-Confidence Characters | 15 | 18 | 20 |
| Average Processing Time (s) | 32.51 | 93.11 | 155.52 |

Correlation between Strategy B scores and OCR-stage indicators:

| OCR-Stage Indicator | Learned Quality Component q (Pearson) | Learned Quality Component q (Spearman) | Final Score (Pearson) | Final Score (Spearman) |
|---|---|---|---|---|
| Average OCR confidence | 0.067 (0.5068) | 0.070 (0.4861) | 0.078 (0.4787) | 0.028 (0.7839) |
| High-confidence text ratio | 0.075 (0.4596) | 0.070 (0.4877) | 0.071 (0.4791) | 0.013 (0.8982) |
| Recognized text count | 0.128 (0.2054) | 0.125 (0.2157) | 0.230 (0.0214) | 0.258 (0.0097) |
| Recognized character count | 0.283 (0.0043) | 0.121 (0.2306) | 0.252 (0.0115) | 0.338 (0.0006) |

Human evaluation results:

| Metric | Strategy A | Strategy B | Score Improvement (B relative to A) |
|---|---|---|---|
| Average Quality Score | 3.31 | 3.52 | +0.21 (6.34%) |

Weight sensitivity of Strategy A:

| Weight | Perfect Match Rate | Brand Acc. | Product Name Acc. | Category Acc. | Semantic Sim. |
|---|---|---|---|---|---|
| 0.5/0.5 | 0.705 | 0.765 | 0.870 | 0.965 | 0.788 |
| 0.6/0.4 | 0.705 | 0.785 | 0.870 | 0.965 | 0.791 |
| 0.7/0.3 | 0.725 | 0.785 | 0.870 | 0.965 | 0.792 |
| 0.8/0.2 | 0.705 | 0.765 | 0.850 | 0.955 | 0.785 |
| 0.9/0.1 | 0.665 | 0.755 | 0.830 | 0.955 | 0.781 |

Weight sensitivity of Strategy B:

| Weight | Perfect Match Rate | Brand Acc. | Product Name Acc. | Category Acc. | Semantic Sim. |
|---|---|---|---|---|---|
| 0.5/0.5 | 0.755 | 0.775 | 0.875 | 0.940 | 0.783 |
| 0.6/0.4 | 0.735 | 0.755 | 0.875 | 0.940 | 0.782 |
| 0.7/0.3 | 0.755 | 0.775 | 0.895 | 0.960 | 0.800 |
| 0.8/0.2 | 0.775 | 0.795 | 0.895 | 0.960 | 0.802 |
| 0.9/0.1 | 0.765 | 0.795 | 0.895 | 0.940 | 0.798 |

Sensitivity of the Laplacian normalization constant (Strategy A):

| Normalizer | Perfect Match Rate | Brand Acc. | Product Name Acc. | Category Acc. | Semantic Sim. |
|---|---|---|---|---|---|
| 300 | 0.705 | 0.785 | 0.830 | 0.965 | 0.785 |
| 400 | 0.685 | 0.785 | 0.810 | 0.965 | 0.783 |
| 500 | 0.725 | 0.785 | 0.870 | 0.965 | 0.792 |
| 600 | 0.665 | 0.745 | 0.790 | 0.925 | 0.768 |
| 700 | 0.685 | 0.785 | 0.830 | 0.965 | 0.784 |
| Method | Input | D-FINE Keyframe | Multimodal Fusion |
|---|---|---|---|
| Baseline | Last frame + audio | No | Yes |
| External VLM Baseline | Original video | No | No |
| Strategy A | Selected keyframe + audio | Yes | Yes |
| Strategy B | Selected keyframe + audio | Yes | Yes |
| Method | Perfect Match Rate | Semantic Sim. | Brand Acc. | Product Name Acc. | Category Acc. |
|---|---|---|---|---|---|
| Baseline | 0.609 | 0.697 | 0.676 | 0.818 | 0.880 |
| External VLM Baseline | 0.705 | 0.799 | 0.725 | 0.920 | 0.980 |
| Strategy A | 0.725 | 0.792 | 0.785 | 0.870 | 0.965 |
| Strategy B | 0.775 | 0.802 | 0.795 | 0.895 | 0.960 |
| Comparison | A+/B− | A−/B+ | McNemar p | PM Diff. | PM 95% CI | Sim. Diff. | Sim. 95% CI |
|---|---|---|---|---|---|---|---|
| B vs. A | 5 | 15 | 0.0414 | +0.050 | [0.005, 0.095] | +0.0104 | [−0.0068, 0.0290] |
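
The paired comparison above can be checked from the two discordant counts alone. The following is a minimal sketch, assuming the reported p-value comes from the exact (binomial) form of McNemar's test and that the evaluation set contains 200 clips; the set size is inferred from the reported rates rather than stated in the table, and the function name `mcnemar_exact` is illustrative.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: samples where only the first method is correct (A+/B-)
    c: samples where only the second method is correct (A-/B+)
    Under the null hypothesis each discordant sample is equally likely
    to favour either method, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Discordant counts taken from the "B vs. A" row.
only_a_correct, only_b_correct = 5, 15
p_value = mcnemar_exact(only_a_correct, only_b_correct)

# Assumption: 200 evaluation clips (inferred, not stated in the table).
n_samples = 200
pm_diff = (only_b_correct - only_a_correct) / n_samples

print(f"McNemar p = {p_value:.4f}, PM diff = {pm_diff:+.3f}")
```

Under these assumptions the sketch gives p ≈ 0.0414 and a perfect-match difference of +0.050, matching the row above. Concordant samples (both correct or both wrong) do not enter the test, which is why only the 5 and 15 are needed.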
| Group | Error Source | Brand Only | Product Only | Category Only | Multiple-Field |
|---|---|---|---|---|---|
| A correct/B wrong | Strategy B errors | 4 | 0 | 0 | 1 |
| A wrong/B correct | Strategy A errors | 5 | 4 | 2 | 4 |
| A wrong/B wrong | Strategy A errors | 21 | 5 | 1 | 13 |
| A wrong/B wrong | Strategy B errors | 19 | 3 | 1 | 17 |
| Case Group | Sample ID | Main Error Type | Interpretation |
|---|---|---|---|
| A correct/B wrong | 6933996025563 | Brand error | Strategy B confuses the target brand with a visually similar brand name, while Strategy A preserves the correct brand evidence. |
| A correct/B wrong | 6973142375302 | Multiple-field error | Strategy B fails to recover all three structured fields, indicating insufficient selected-frame or fused evidence. |
| A wrong/B correct | 6975477310676 | Brand error | Strategy A predicts an incorrect electronics brand, whereas Strategy B recovers the correct brand and product name from the selected evidence. |
| A wrong/B correct | 6975178290086 | Multiple-field error | Strategy A only recognizes a generic product category, while Strategy B recovers both the brand and the specific product name. |
| A wrong/B wrong | 06941812761847 | Multiple-field error | Both strategies fail to recover the brand and complete product name from the selected keyframe, while the external video baseline succeeds by using the original video input. |
| A wrong/B wrong | 6947503700232 | Multiple-field error | Strategy A suffers from brand confusion, while Strategy B lacks sufficient structured evidence; the external video baseline recognizes the brand and model more accurately. |

| Strategy | Modality | Perfect Match Rate | Brand Acc. | Product Name Acc. | Category Acc. | Sem. Sim. |
|---|---|---|---|---|---|---|
| A | OCR-only | 0.650 | 0.705 | 0.785 | 0.910 | 0.770 |
| A | ASR-only | 0.270 | 0.320 | 0.515 | 0.675 | 0.638 |
| A | VLM-only | 0.145 | 0.155 | 0.185 | 0.505 | 0.482 |
| A | OCR + ASR | 0.680 | 0.745 | 0.815 | 0.925 | 0.784 |
| A | OCR + VLM | 0.670 | 0.735 | 0.785 | 0.935 | 0.782 |
| A | ASR + VLM | 0.385 | 0.425 | 0.595 | 0.745 | 0.682 |
| A | Full Framework | 0.725 | 0.785 | 0.870 | 0.965 | 0.792 |
| B | OCR-only | 0.670 | 0.715 | 0.775 | 0.855 | 0.763 |
| B | ASR-only | 0.270 | 0.320 | 0.520 | 0.665 | 0.635 |
| B | VLM-only | 0.450 | 0.485 | 0.525 | 0.575 | 0.656 |
| B | OCR + ASR | 0.735 | 0.755 | 0.855 | 0.915 | 0.796 |
| B | OCR + VLM | 0.705 | 0.745 | 0.785 | 0.835 | 0.766 |
| B | ASR + VLM | 0.540 | 0.590 | 0.695 | 0.825 | 0.742 |
| B | Full Framework | 0.775 | 0.795 | 0.895 | 0.960 | 0.802 |
| Model | Role | Input Price | Output Price |
|---|---|---|---|
| Qwen3.5-27B | Evidence structuring and final fusion | 0.6 | 4.8 |
| Qwen3-VL | Visual-semantic recognition | 0.75 | 3.0 |
| Pipeline | Input Tok. | Output Tok. | Cost/Run | Cost/1000 Runs |
|---|---|---|---|---|
| Full | 3069.4 | 9253.1 | 0.0465 CNY | 46.46 CNY |
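
The per-run cost can be approximated from the token counts and the listed prices. The sketch below assumes the prices are quoted in CNY per one million tokens, which the table does not state explicitly. Because the split of tokens between the two models is not reported, it prices every token at the Qwen3.5-27B rate. This is a simplification, so the result only approximates the reported figure. The names `PRICES` and `run_cost` are illustrative.

```python
# Assumed unit: CNY per 1,000,000 tokens, as (input price, output price).
PRICES = {
    "Qwen3.5-27B": (0.6, 4.8),
    "Qwen3-VL": (0.75, 3.0),
}


def run_cost(usage: dict[str, tuple[float, float]]) -> float:
    """Cost in CNY for one pipeline run.

    usage maps a model name to (input tokens, output tokens).
    """
    total = 0.0
    for model, (in_tok, out_tok) in usage.items():
        in_price, out_price = PRICES[model]
        total += (in_tok * in_price + out_tok * out_price) / 1_000_000
    return total


# Simplifying assumption: all tokens of the full pipeline are billed
# at the Qwen3.5-27B rate, since the per-model split is not reported.
usage = {"Qwen3.5-27B": (3069.4, 9253.1)}
cost_per_run = run_cost(usage)

print(f"approx. cost per run:   {cost_per_run:.4f} CNY")
print(f"approx. cost per 1000:  {cost_per_run * 1000:.2f} CNY")
```

Under these assumptions the estimate is roughly 0.046 CNY per run, close to the reported 0.0465 CNY. Output tokens account for most of the cost, so shortening the structured LLM outputs would likely reduce it more than trimming the prompts.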
| API Stage | Mean Time (s) | Median Time (s) | P95 Time (s) |
|---|---|---|---|
| OCR evidence structuring LLM | 27.62 | 10.53 | 87.38 |
| ASR evidence structuring LLM | 43.92 | 41.90 | 111.85 |
| Qwen3-VL visual-semantic recognition | 3.90 | 1.84 | 19.25 |
| Final fusion LLM | 39.89 | 35.63 | 95.02 |
| Serial API latency | 113.55 | 108.40 | 227.06 |
| Parallel API latency | 90.68 | 84.86 | 182.71 |
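
The gap between the serial and parallel rows can be illustrated with a per-run latency model. This is a minimal sketch, assuming that the OCR, ASR, and visual-semantic stages are mutually independent and that the final fusion call waits for all three; the timings in `runs` are hypothetical values chosen for illustration, not measurements from the experiments.

```python
from statistics import mean


def serial_latency(ocr: float, asr: float, vlm: float, fusion: float) -> float:
    """All four API calls issued one after another."""
    return ocr + asr + vlm + fusion


def parallel_latency(ocr: float, asr: float, vlm: float, fusion: float) -> float:
    """Evidence stages issued concurrently; fusion starts when the slowest ends."""
    return max(ocr, asr, vlm) + fusion


# Hypothetical per-run timings in seconds: (OCR LLM, ASR LLM, VLM, fusion LLM).
runs = [
    (10.5, 41.9, 1.8, 35.6),
    (30.0, 20.0, 4.0, 40.0),
]

serial = [serial_latency(*r) for r in runs]
parallel = [parallel_latency(*r) for r in runs]

print(f"mean serial latency:   {mean(serial):.2f} s")
print(f"mean parallel latency: {mean(parallel):.2f} s")
```

In this model the parallel saving per run equals the time of the two faster evidence stages, so it is bounded by the slowest stage and by the fusion call, which is never overlapped.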
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Zheng, Y.; Shi, J.; Shen, W. Keyframe Selection and Multimodal Fusion for Product Recognition in E-Commerce Live Streaming. Appl. Sci. 2026, 16, 6585. https://doi.org/10.3390/app16136585

