Keyframe Selection and Multimodal Fusion for Product Recognition in E-Commerce Live Streaming

Zheng, Yichuan; Shi, Jin; Shen, Wei

doi:10.3390/app16136585

by Yichuan Zheng ¹, Jin Shi ² and Wei Shen ¹,*

¹ School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
² College of Management Science and Engineering (College of Quality and Standardization), China Jiliang University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(13), 6585; https://doi.org/10.3390/app16136585

Submission received: 28 April 2026 / Revised: 11 June 2026 / Accepted: 27 June 2026 / Published: 1 July 2026

(This article belongs to the Special Issue Innovative Applications of Artificial Intelligence in Engineering, Second Edition)

Abstract

Product recognition in e-commerce live streaming is hindered by rapid viewpoint changes, occlusions, motion blur, and inconsistencies between visual and spoken information. Existing approaches typically focus on individual components such as detection, OCR, or speech recognition, which limits their effectiveness in end-to-end structured product understanding. To address this problem, we propose an integrated framework that combines task-oriented keyframe selection with multimodal semantic fusion. The framework first uses D-FINE to localize product regions and then selects informative frames through two complementary strategies. Strategy A considers both detection confidence and Laplacian-based sharpness, while Strategy B combines detection confidence with a learned quality component estimated by an EfficientNetV2-M regression model. OCR, visual-semantic recognition, and ASR are then applied to extract complementary evidence, and a Qwen3.5-27B large language model is used to structure and fuse multimodal evidence into standardized product outputs, including brand, product name, and category. Experiments on an in-house e-commerce livestreaming dataset demonstrate substantial gains over a last-frame baseline. Strategy B achieves the best overall result, improving the Perfect Match Rate from 0.609 to 0.775 and the Semantic Similarity from 0.697 to 0.802. Ablation studies further show that the full multimodal framework consistently outperforms unimodal and dual-modality variants under both frame selection strategies. In addition, Top-K analysis indicates that single-frame inference provides a practical balance between OCR evidence completeness and efficiency. Efficiency analysis shows that the per-video API monetary cost remains low under the pricing configuration used in this study, while API latency is mainly limited by Qwen3.5-27B LLM calls for evidence structuring and final fusion. 
Overall, the proposed framework offers an effective and extensible solution for structured product recognition in complex live-streaming scenarios.

Keywords:

e-commerce live streaming; product recognition; keyframe selection; multimodal fusion; large language models; object detection; image quality assessment; information extraction

1. Introduction

In recent years, e-commerce live streaming has emerged as a major driver of growth in the global retail industry, owing to its strong interactivity and high conversion rates. According to the 2024 China Live E-commerce Market Data Report released by Wangjing Society [1], the market size of China’s e-commerce live-streaming sector exceeded RMB 5 trillion in 2024, with a user base surpassing 600 million. Internationally, live commerce has exhibited a similarly rapid growth trajectory. A report by Transparency Market Research indicates that the global live e-commerce market reached a value of USD 940.3 billion in 2024 and is projected to reach USD 6079.8 billion by the end of 2035 [2]. This growth is particularly pronounced across major platforms and regions. For example, TikTok Shop achieved a global GMV of USD 32.6 billion in 2024, including USD 9.0 billion from the U.S. market [3]. In France, TikTok Shop’s launch in March 2025 triggered rapid growth within six months: transaction volumes surged sevenfold, while livestream- and short-video-driven sales increased by 3.5 times and 14 times, respectively [4]. This expansion is supported by 27.8 million monthly active users in France and a broader European base exceeding 200 million. Similarly, Southeast Asia demonstrated robust growth in 2024, with Shopee and TikTok Shop achieving GMVs of USD 83.4 billion and USD 22.6 billion, respectively [5]. These figures indicate a steadily maturing global live-commerce ecosystem characterized by high interaction and strong conversion performance across major markets.

The rapid expansion of this ecosystem has simultaneously increased the demand for scalable product understanding in live-streaming videos. As the primary carrier of product presentation and transaction-related evidence, livestream content requires systems that can reliably identify products at the semantic level rather than merely detect generic objects [6]. Such capability is important not only for merchandising and retrieval, but also for downstream applications including content review, compliance support, and consumer protection. In practice, however, the gap between coarse object localization and structured product recognition remains substantial, especially in unconstrained live-streaming environments.

The practical importance of this problem is amplified by the prevalence of counterfeit goods, misleading promotions, and visually ambiguous product presentations in live commerce. Recent enforcement reports and platform disclosures show that large-scale product screening remains a persistent challenge across major marketplaces [7,8,9,10]. This motivates recognition systems that are not only accurate at the image level, but also robust enough to support high-throughput video analysis in realistic operational settings.

Despite its importance, product recognition in live-streaming environments remains highly challenging. Frequent scene transitions, illumination changes, and occlusions make single-frame static methods unstable [11]. At the same time, streamers’ spoken commentary and background noise introduce semantic redundancy and interference, which limits the reliability of single-modality approaches, whether based on vision or speech alone [6]. In addition, exhaustive analysis of the entire video stream incurs substantial computational cost, making practical deployment at platform scale difficult.

In response to these challenges, existing studies still exhibit an important gap. Much of the prior literature improves individual components, such as generic object detection, OCR, ASR, or pre-trained vision–language modeling [12], but comparatively little attention has been paid to end-to-end frameworks that are explicitly optimized for product recognition in e-commerce livestreams. In particular, there is still limited understanding of how to select recognition-effective frames from dynamic videos and how to reconcile complementary yet sometimes conflicting evidence across OCR, visual semantics, and speech.

To address this gap, we propose a keyframe extraction and multimodal fusion framework for structured product recognition in e-commerce live streaming. The central idea is to rank frames according to their utility for downstream recognition rather than according to generic visual quality alone. The framework therefore prioritizes frames that preserve task-effective information, such as packaging text, brand identifiers, and discriminative product appearance, and then integrates OCR, visual-semantic cues, and speech evidence through LLM-based semantic fusion. This design enables the system to produce standardized outputs for key product fields, including product name, brand, and category.

The main contributions of this work are threefold. First, we formulate keyframe selection for livestream product recognition as a task-oriented ranking problem and compare two practically motivated scoring strategies based on local sharpness and learned image-quality regression, respectively. Second, we develop an end-to-end multimodal framework in which OCR, ASR, and vision–language recognition are unified through structured evidence representation and LLM-centered semantic adjudication. Third, we provide a systematic empirical study on an in-house benchmark, including end-to-end comparisons, multimodal ablations, threshold sensitivity analysis, and single-frame versus Top-K trade-off evaluation. Collectively, these results show that effective frame selection and multimodal evidence fusion are both critical for robust product recognition in live-streaming scenarios.

2. Related Work

2.1. Overview of Related Work

This section reviews prior studies related to the main components of the proposed framework, including object detection, video-level product recognition, keyframe selection, image quality assessment, and large language model (LLM)-based multimodal fusion. Rather than treating these topics as isolated techniques, we focus on their relevance to structured product recognition in e-commerce live-streaming scenarios, where product localization, recognition-oriented frame selection, multimodal evidence extraction, and stable structured output must be jointly considered. The following subsections therefore discuss how existing methods support or motivate the design of the proposed framework, as well as their limitations when applied to fine-grained product recognition under dynamic live-streaming conditions.

2.2. Review of Visual Object Detection Models

To justify the technical origin and selection rationale for the detection module, this subsection briefly reviews three representative detection paradigms and their typical implementations: the YOLO (You Only Look Once) family [13], DETR (Transformer-based end-to-end detection) [14], and the recently proposed D-FINE (a DETR-based fine-grained regression improvement) [15]. We then summarize their respective strengths and weaknesses and discuss their suitability for the task of product recognition in e-commerce video.

YOLO pioneered the treatment of object detection as a single-shot regression problem, simultaneously predicting bounding boxes and class labels in one forward pass of the network, which enables high-frame-rate real-time detection [13]. Subsequent engineering implementations such as YOLOv4 and YOLOv5 have continued to refine this paradigm through improvements in data augmentation, network architecture, and training techniques, making the YOLO family a preferred baseline for industrial deployment and real-time applications such as retail shelf detection and video-stream detection. YOLO’s advantages lie in its high throughput and deployment friendliness, but one-stage methods typically lag behind more fine-grained approaches in localization accuracy, both in dense small-object scenarios and in settings that demand very high bounding-box precision.

DETR reformulates object detection as a set prediction problem, employing a Transformer encoder–decoder [16] and the Hungarian matching loss [17] to produce end-to-end, non-redundant detection outputs, thereby obviating many hand-designed post-processing steps in traditional detectors (e.g., NMS [18], anchor design [19]) [14]. DETR’s strengths include strong global-context modeling and a conceptually simple architecture; however, the original DETR suffers from slow convergence and somewhat weaker performance on small objects, issues that subsequent work has addressed with various improvements to accelerate training and strengthen local modeling. For video scenarios that require enhanced semantic understanding and global reasoning, DETR-style frameworks offer greater representational capacity.

D-FINE is a recently proposed detector that reformulates the boundary regression task within the DETR paradigm. Its two core innovations are Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR represents boundary regression as discrete probability distributions over the distances to the four sides of the box and refines them iteratively, thereby expressing localization uncertainty in distributional form. GO-LSD leverages fine-level outputs to self-distill shallower layers, improving localization accuracy and training stability [15]. The experimental results reported in [15] show that D-FINE improves localization precision while retaining the advantages of DETR, achieving a favorable trade-off between efficiency and accuracy and making it a strong candidate for tasks that require precise localization (for example, recognizing product edges or packaging regions).

Based on the above comparison, the primary reason for selecting D-FINE as the product detection module in this study is its fine-grained localization design within the DETR-style detection paradigm. D-FINE augments the DETR framework with FDR and GO-LSD to improve boundary localization and training stability while preserving the Transformer’s global context modeling capability. In typical e-commerce live-streaming scenarios, products are often densely arranged, small in scale, subject to varying viewpoints, and affected by occlusion or motion blur. For example, in streamer demonstrations of cosmetics, snacks, or apparel, multiple small packaging boxes, lipsticks, or accessories may appear in the frame simultaneously. These characteristics require a detector that can provide reliable product-region proposals and detection-confidence cues for subsequent frame scoring.

Therefore, D-FINE is adopted in this work to support product-region localization and confidence-aware keyframe scoring. Its role in the proposed framework is not to perform semantic product recognition directly, but to provide visual-region candidates and detection-confidence signals for downstream OCR, visual-semantic recognition, and multimodal fusion. The task-specific performance of the trained detector is reported and analyzed in the experimental section.

2.3. Video-Level Product Recognition and Multimodal Understanding

Early research on product recognition was predominantly based on detection and classification methods applied to static images. However, in video scenarios, factors such as dynamic content variations, shot transitions, occlusions, and motion blur make it difficult to directly transfer static-image-based approaches. To address these challenges, a growing body of recent work has incorporated both visual and speech modalities to enhance video advertisement understanding. For instance, in the context of structured analysis of advertisement videos, Weng et al. [20] proposed a multimodal framework that jointly leverages frame-level visual features and textual information extracted via OCR and ASR for scene segmentation and multimodal annotation, thereby improving the structured understanding of advertisement videos. This approach was also validated as an effective solution in the TAAC advertisement understanding challenge.

Although the aforementioned studies have explored joint modeling strategies of visual and speech modalities for video advertisement understanding, their primary focus lies in scene segmentation and content annotation, with limited attention paid to fine-grained product-level semantic recognition, which is more directly related to compliance supervision. Moreover, existing methods typically rely on fixed sampling strategies or full-video processing, without explicitly addressing frame quality variation and information sparsity that are prevalent in e-commerce scenarios. In contrast, the present work places greater emphasis on adaptively selecting information-rich keyframes from dynamic videos and achieving high-accuracy product name recognition through multimodal fusion, thereby providing technical support for violation detection and authenticity verification in e-commerce livestreaming environments.

2.4. Keyframe Selection and Multi-Frame Strategies

In the preprocessing stage of video recognition, keyframe selection and frame quality assessment play a crucial role in reducing computational cost and improving the performance of downstream recognition tasks. Traditional keyframe extraction methods often integrate multiple low-level visual features, such as color histograms, inter-frame differences, or motion estimation, to improve selection quality. For example, Gu et al. [21] proposed a keyframe extraction method that combines MPEG-7 color layout descriptors with block-based motion information. This approach captures the spatial distribution of colors using color layout features, analyzes local variations through block motion estimation, and adaptively extracts keyframes via weighted fusion and cumulative distance measures. Other studies have adopted information-theoretic models to quantify frame importance. Cernekova et al. [22] used information entropy and mutual information to measure content variation between frames, enabling shot boundary detection and keyframe extraction by identifying information redundancy, thereby providing effective solutions for video summarization tasks.

However, the aforementioned methods primarily focus on global content changes and are often misaligned with the semantic requirements of downstream tasks such as product recognition. In the presence of occlusion, blur, or camera motion, selected frames may not contain the most discriminative visual information, such as textual elements on product packaging or brand logos. This limits the applicability of general-purpose keyframe extraction methods to fine-grained, task-oriented video understanding.

In recent years, with the development of no-reference image quality assessment (NR-IQA) [23] and deep learning techniques, researchers have increasingly employed learning-based models to predict frame quality and perform quality-aware frame selection, aiming to obtain more robust selected frames. Kharchevnikova and Savchenko [24] further proposed an efficient no-reference frame quality prediction approach using lightweight convolutional neural networks with teacher–student distillation, and demonstrated that selecting high-quality frames has a significant impact on downstream recognition performance. The frame scoring module in this work, which combines detection confidence with quality-related cues, is inspired by such approaches.

Regarding multi-frame strategies, prior studies have shown that a single frame may be insufficient to capture all useful information in a video. Consequently, selecting multiple frames and fusing their outputs can improve semantic coverage and robustness. For instance, some video summarization and keyframe extraction methods model frames as graph nodes and select representative frames based on semantic similarity or clustering mechanisms [25,26]. Nevertheless, these methods are primarily designed for summarization or visual abstraction, where the objective is to maximize semantic diversity and coverage across different scenes, rather than to ensure the reliable capture of critical product-recognition details such as brand names, packaging text, or product specifications.

In summary, existing keyframe selection and multi-frame strategies designed for video summarization or general video understanding exhibit the following limitations when applied to product recognition as a downstream task:

(1) The absence of a quantifiable mechanism, tailored to downstream product recognition, for assessing the information value of a frame;
(2) The inability to guarantee the preservation of task-relevant semantic details, such as textual information, brand identifiers, and product specifications;
(3) An emphasis, in multi-frame strategies, on semantic diversity rather than on targeted optimization of product recognition performance.

Given that e-commerce product recognition highly depends on the clarity of local details and the completeness of visual information, this work adopts a task-oriented keyframe scoring mechanism that integrates product detection confidence with local sharpness or quality-related cues, thereby prioritizing frames with high recognition-oriented evidence value. Meanwhile, the proposed multi-frame strategy serves as an auxiliary mechanism to examine whether additional frames can improve information coverage. The primary objective of the proposed framework remains improving downstream multimodal recognition performance through the selection of high-value frames.

2.5. Image Quality Assessment (IQA)

In the field of image quality assessment (IQA), a large number of no-reference image quality assessment (NR-IQA) models have been proposed to predict subjective image quality in the absence of pristine reference images. Early approaches, such as BRISQUE [23] and NIQE [27], rely on natural scene statistics (NSS) to construct handcrafted features, which perform well on traditional distortion types but struggle to capture high-level semantic characteristics in complex scenes. With the rapid development of deep learning, convolutional neural networks (CNNs) have gradually become the dominant paradigm for NR-IQA. For example, HyperIQA [28] achieves more accurate quality prediction through a content-adaptive mechanism, while RankIQA [29] adopts a ranking-based learning strategy that learns feature representations from image quality rankings, effectively alleviating the scarcity of annotated data and improving regression performance. Subsequently, LIQE [30] approaches NR-IQA from a multi-task learning perspective by jointly optimizing quality assessment, scene classification, and distortion type recognition, and leverages the alignment capabilities of vision–language pre-trained models (CLIP) [31] to enable automatic knowledge transfer and loss re-weighting, further enhancing prediction accuracy and cross-dataset generalization. More recently, Vision Transformer (ViT)-based models [32], such as TReS [33] and MANIQA [34], have significantly improved the modeling of texture and structural distortions through global and multi-dimensional attention mechanisms, pushing NR-IQA performance to new levels.

However, generic NR-IQA models primarily focus on evaluating the perceptual quality of images, such as sharpness, noise, contrast, and distortion, and do not explicitly reflect the semantic quality required by downstream tasks. In tasks such as text recognition or product recognition, semantically informative content is often concentrated in local regions, such as brand logos, packaging text, or specification labels. As a result, an image with high overall perceptual quality but poor visibility of critical product information may be less favorable for recognition than a slightly blurred image that preserves complete semantic evidence. Therefore, directly adopting generic NR-IQA models may lead to a mismatch between quality scores and task-specific recognition performance.

Motivated by these limitations, this work introduces NR-IQA into the keyframe selection pipeline as an exploratory validation, aiming to examine whether perceptual quality cues can facilitate product recognition performance. Since generic IQA scores do not directly measure product recognizability, the proposed Strategy B combines a learned quality component distilled from IQA pseudo-labels with object detection confidence rather than relying on perceptual quality alone. In this way, the framework considers both the perceptual quality of detected product regions and their visibility as task-relevant visual evidence. This design allows us to examine both the usefulness and the limitations of quality-aware frame selection under a downstream product recognition task.

2.6. Large Language Models and Multimodal Fusion

In recent years, the rapid advancement of large language models (LLMs) has provided more powerful semantic understanding and reasoning capabilities for multimodal fusion. Existing studies primarily focus on architectural innovations, such as designing global–local attention mechanisms to integrate OCR, ASR, and visual features (e.g., the multimodal Transformer proposed by Tsai et al. [12]), or encoding visual information into natural language descriptions that are subsequently processed by LLMs for open-ended reasoning (e.g., the BLIP-2 framework proposed by Li et al. [35]). While these approaches enhance the depth of multimodal understanding, they often suffer from unstable outputs and a lack of structured representations in practical applications, which limits their direct applicability to e-commerce video recognition scenarios that demand high reliability and reproducibility.

These limitations take two main forms. First, high-temperature sampling or open-ended generation strategies may lead to semantic drift or variations in expression for identical inputs, making it difficult to meet the stringent requirements for reproducibility and auditability in regulatory settings. Second, these methods typically produce free-form natural language outputs without standardized structured fields (e.g., brand, product name, and category), which necessitates additional parsing or manual intervention and substantially increases system deployment costs. In large-scale e-commerce livestreaming regulation, platforms must conduct high-throughput or timely compliance review over massive volumes of video content, where any output instability or missing field may result in missed detections or false judgments, thereby undermining the efficiency of screening for false advertising or counterfeit products.

Motivated by these engineering challenges, this work proposes a fusion paradigm that emphasizes practicality and stability. Built upon the Qwen series models, the proposed approach employs carefully designed prompt templates and deterministic greedy decoding to achieve stable and structured multimodal outputs on top of strong foundation models. The goal is to provide an efficient, reliable, and readily deployable semantic fusion solution for e-commerce video recognition systems.
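As an illustration of what a stable, structured output contract can look like in practice, the following minimal sketch parses a model response and checks it against the three target fields. It is an illustrative assumption rather than the implementation used in this work: the function name `parse_product_fields`, the JSON key names, and the policy of mapping missing values to `None` are all hypothetical, and the actual prompt templates and output schema are not reproduced here.

```python
import json
from typing import Dict, Optional

# Hypothetical key names mirroring the target fields (brand, product name, category).
REQUIRED_FIELDS = ("brand", "product_name", "category")


def parse_product_fields(raw_response: str) -> Dict[str, Optional[str]]:
    """Parse an LLM response into a fixed-schema record.

    Missing, empty, or non-string fields are mapped to None, so every
    record exposes the same keys regardless of what the model returned.
    """
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}

    record: Dict[str, Optional[str]] = {}
    for field in REQUIRED_FIELDS:
        value = payload.get(field)
        if isinstance(value, str) and value.strip():
            record[field] = value.strip()
        else:
            record[field] = None
    return record
```

A fixed schema of this kind means that downstream components never need to parse free-form text, and malformed responses degrade to explicit `None` values instead of raising errors.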

3. Method

3.1. Problem Definition

The objective of this study is to automatically extract and recognize product information from e-commerce livestreaming videos to support platform-level content regulation, such as false advertising detection and counterfeit product screening. The dataset constructed in this study is organized at the video-product level: each video clip is associated with one target product identity. This setting does not imply that the visual scene contains only a single physical product instance. In practice, a clip may contain multiple visible instances of the same target product, repeated appearances of the product packaging, platform overlays, hands, shelves, background items, or other distracting objects. Accordingly, the task is formulated as recognizing the structured information of the target product associated with the input clip, rather than enumerating or separately identifying every product or object instance appearing in the scene. Given a livestreaming video $V = \{f_{1}, f_{2}, \dots, f_{n}\}$, where $f_{i}$ denotes the $i$-th video frame, along with the corresponding audio stream $A$, the task is defined as follows:

Input: the video stream $V$ and audio stream $A$, together with a predefined set of target product fields $F = \{\text{product name}, \text{brand}, \text{category}\}$.

Output: a structured product information set $G = \{g_{1}, g_{2}, \dots, g_{m}\}$, where each $g_{j}$ represents the predicted value of the corresponding field in $F$, optionally accompanied by supporting evidence or confidence metadata.

Constraints: (1) resource-aware and practical deployment requirements, where processing latency should remain within an acceptable range for long livestreaming videos; (2) high reliability, where recognition results must be traceable and verifiable.

The core challenge arises from the highly dynamic nature of livestreaming videos, in which product appearances are transient and highly variable. This necessitates the intelligent selection of an information-rich frame subset $K \subseteq \{1, 2, \dots, n\}$ from $V$, followed by the integration of visual evidence from OCR and the visual-language model, together with auditory evidence from ASR. A large language model is then employed to perform semantic-level decision making and output accurate, structured product information $G$. The objective of this work is to minimize the end-to-end recognition error while ensuring that computational efficiency satisfies platform-scale deployment requirements, formulated as:

$$\min_{K,\,\Theta}\ \mathcal{L}\big(G_{\mathrm{pred}}(K, \Theta),\, G_{\mathrm{gt}}\big) \quad \text{s.t.} \quad \mathrm{Time}(V) \leq T_{\max} \tag{1}$$

where $\Theta$ denotes the system parameters and $\mathcal{L}$ represents the end-to-end loss function.

This study focuses on optimizing the visual modality in e-commerce livestreaming scenarios. Specifically, we investigate how to automatically select frames from the video stream that provide the most informative content for downstream recognition, namely frames that are most valuable in terms of visual clarity, the presence of key information (e.g., packaging text and brand logos), and detection confidence. The goal is to enable the OCR and visual-language recognition stages to obtain as much reliable evidence as possible. Rather than improving the performance of individual modules in isolation, the ultimate objective is to achieve sufficient and accurate product recognition at the end-to-end level through keyframe selection, unified structured outputs of multimodal evidence, and semantic-level fusion based on large language models. To this end, this work designs and compares frame scoring strategies based on traditional visual features and deep quality regression, as well as single-frame and Top-K multi-frame fusion strategies, with end-to-end recognition performance on real-world e-commerce video datasets serving as the final evaluation criterion.
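To make the selection step concrete, the sketch below scores each frame by combining a detection confidence with a Laplacian-variance sharpness value and returns the indices of the Top-K frames. It is a sketch under stated assumptions, not the scoring rule used in this work: the weighting `alpha = 0.5`, the min-max normalization of sharpness, and the helper names are all hypothetical, and frames are represented as plain grayscale lists of lists to keep the example free of external dependencies.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[float]]  # grayscale image as rows of pixel intensities


def laplacian_variance(img: Gray) -> float:
    """Variance of the 4-neighbour Laplacian response over interior pixels."""
    h, w = len(img), len(img[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x]
                   + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            responses.append(lap)
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def select_top_k(frames: Sequence[Gray],
                 confidences: Sequence[float],
                 k: int = 1,
                 alpha: float = 0.5) -> List[int]:
    """Return the indices of the k highest-scoring frames, best first.

    Assumed score form (hypothetical):
        score = alpha * confidence + (1 - alpha) * normalized_sharpness
    """
    if not frames:
        return []
    sharp = [laplacian_variance(f) for f in frames]
    lo, hi = min(sharp), max(sharp)
    span = hi - lo
    # Min-max normalize sharpness to [0, 1] so it is comparable to confidence.
    norm = [(s - lo) / span if span > 0 else 0.0 for s in sharp]
    scores = [alpha * c + (1 - alpha) * s for c, s in zip(confidences, norm)]
    order = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

Setting `k = 1` corresponds to single-frame selection, while larger values of `k` yield a Top-K candidate set; the returned list plays the role of the index subset $K$ in the problem definition.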

3.2. Overall Architecture

The overall workflow of the proposed system is illustrated in Figure 1. The proposed system takes videos as input and first decomposes each video into a visual stream and an audio stream. In the visual stream, representative frames are selected through a frame extraction and frame quality scoring module. The selected frame is then used for two visual recognition pipelines: OCR (PaddleOCR [36]) and visual-semantic product recognition with Qwen3-VL. In parallel, the ASR pipeline (FunASR [37]) operates on the entire audio stream of the video rather than on the selected frame. The raw recognition outputs from each pipeline are subsequently structured and standardized by a large language model (LLM), producing unified-format evidence units. All evidence units are then aggregated and fed into a final fusion LLM, which performs multimodal reasoning and outputs the final recognition results along with supporting evidence.

Frame extraction & scoring: Frames are sampled from the input video at a fixed frame rate, and a quality score is computed for each frame (e.g., a combination of detection confidence and Laplacian sharpness). According to the scoring strategy, a single frame or multiple candidate frames are selected for downstream visual processing.
OCR pipeline: Text recognition is performed on the selected frame to obtain raw textual content and internal OCR confidence scores. The raw text, confidence scores, and positional information are provided as input to an LLM, which is instructed to return normalized and structured outputs, such as brand name, product name, category, and specifications.
ASR pipeline: The system performs speech recognition on the entire audio stream of the video to obtain a complete transcription of spoken content. This transcription does not depend on the selected frame and does not include temporal or frame-level alignment information. Instead of localizing specific video segments, it serves as clip-level semantic evidence that assists in understanding the products presented in the video. The LLM then parses and standardizes the raw ASR transcription to produce structured information, including potentially mentioned product names, brands, specifications, and their semantic confidence scores. At this stage, the LLM effectively conducts a semantic-level preliminary judgment based on spoken descriptions, forming textual evidence units parallel to the visual modality.
Visual-semantic recognition with Qwen3-VL: For each product region detected in the selected frame, Qwen3-VL is invoked to perform image-based product recognition and description. In this study, Qwen3-VL specifically refers to Qwen3-VL-30B-A3B-Instruct. The model input consists of cropped product regions and task-specific prompts, and the output includes category cues, possible brand cues, specifications, and confidence scores, forming standardized structured evidence units.
Evidence aggregation and fusion LLM: Structured outputs from the three pipelines are aggregated as contextual input and provided to the fusion LLM with explicit fusion instructions. The fusion LLM produces the final decision, including the product name, brand, and category.

3.3. Frame Scoring Strategy

To select the most suitable frames from video streams for product recognition, this study designs and compares three types of frame scoring strategies: a Baseline method (fixed-position sampling), Strategy A (based on traditional visual features), and Strategy B (based on deep learning-based quality regression). The selected frames are subsequently used as inputs for OCR and multimodal recognition, thereby reducing the computational cost of downstream recognition modules while improving overall recognition accuracy. The following sections describe the definitions, implementation details, and boundary-case handling mechanisms of the three strategies in turn.

3.3.1. Baseline: Fixed-Position Strategy

The Baseline strategy selects the last frame of the video as the recognition frame without performing any quality assessment. This method has the lowest computational complexity; however, in live-streaming scenarios, the final frame is often a transition or scene-switching frame, which may significantly degrade recognition performance. It is therefore used primarily as a reference point for evaluating the effectiveness of the proposed frame scoring strategies.

3.3.2. Strategy A: Traditional Visual Feature-Based Scoring

Strategy A assigns a quality score to each frame by jointly considering object detection confidence and image sharpness. For each frame $f_{i}$ in the video, the overall quality score is defined as:

$S_{i}^{(A)} = w_{d} \cdot C_{d}(f_{i}) + w_{s} \cdot C_{s}(f_{i}),$ (2)

where $C_{d}(f_{i}) \in [0, 1]$ denotes the detection confidence score, $C_{s}(f_{i}) \in [0, 1]$ denotes the sharpness score, and the weights $w_{d}$ and $w_{s}$ satisfy $w_{d} + w_{s} = 1$. In the default setting used in this study, Strategy A adopts $w_{d} = 0.7$ and $w_{s} = 0.3$. The sensitivity of this setting is evaluated in the experimental section.

Detection confidence ($C_{d}$): When multiple detection boxes are present in a frame, $C_{d}(f_{i})$ is computed as the arithmetic mean of the confidence scores of all detected product bounding boxes. Alternatively, top-k averaging or maximum-value strategies may be employed to emphasize the most reliable candidate. In this work, mean aggregation is used to mitigate the impact of occasional false detections with abnormally high confidence. Since the detector outputs confidence scores already normalized to the range $[0, 1]$, no additional normalization is required.

Sharpness score ($C_{s}$): For each detected product region of interest (ROI), image sharpness is evaluated using the response of the Laplacian operator and normalized as:

$C_{s}(f_{i}) = \min\left(\frac{\mathrm{Var}(\nabla^{2} ROI_{i})}{500}, 1.0\right),$ (3)

where $\nabla^{2}$ denotes the Laplacian operator, $\mathrm{Var}(\cdot)$ represents the variance operation, $ROI_{i}$ corresponds to a detected product region, and 500 is the normalization constant used for the Laplacian-based clarity score. The value of 500 is selected according to the end-to-end parameter sensitivity analysis.

When multiple product regions are present in a frame, the frame-level sharpness score $C_{s}(f_{i})$ is obtained by averaging the sharpness values of all $ROI_{i}$. If the detector fails to return any valid ROI, the Laplacian variance computed over the entire frame is used as a degraded fallback estimate of $C_{s}(f_{i})$ to avoid missing values.

Overall, Strategy A provides a computationally efficient and interpretable frame scoring scheme based on detection confidence and local clarity.
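
As an illustration, Equations (2) and (3) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation used in this study: the function names are hypothetical, the Laplacian is approximated with the 4-neighbour kernel evaluated on interior pixels only, ROIs are given as small grayscale 2-D lists, and a frame without detections is assumed to receive $C_{d} = 0$ (the text specifies the fallback only for $C_{s}$).

```python
from statistics import mean, pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian response of a grayscale patch.

    `gray` is a 2-D list of intensities. Only interior pixels are used;
    this border handling is an assumption of the sketch.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def sharpness_score(rois, full_frame, norm=500.0):
    """Eq. (3): clipped, normalized Laplacian variance averaged over ROIs."""
    # Fall back to the whole frame when the detector returns no ROI.
    regions = rois if rois else [full_frame]
    return mean(min(laplacian_variance(r) / norm, 1.0) for r in regions)


def strategy_a_score(confidences, rois, full_frame, w_d=0.7, w_s=0.3):
    """Eq. (2): weighted sum of mean detection confidence and sharpness."""
    # Assumption: C_d = 0 when there are no boxes (not stated in the text).
    c_d = mean(confidences) if confidences else 0.0
    c_s = sharpness_score(rois, full_frame)
    return w_d * c_d + w_s * c_s
```

In practice the Laplacian response would typically come from an image-processing library rather than nested lists; the pure-Python version is used here only to keep the example self-contained.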

3.3.3. Strategy B: Deep Learning-Based Scoring

Strategy B replaces the traditional operator-based sharpness metric with a trained deep quality regression model $Q_{θ}$ to estimate a visual quality and readability-related score for a frame. The frame scoring function is defined as:

$S_{i}^{(B)} = w_{d} \cdot C_{d}(f_{i}) + w_{q} \cdot Q_{θ}(f_{i}),$ (4)

where $Q_{θ}(f_{i}) \in [0, 1]$ denotes the regression output of the EfficientNetV2-M model, and the weights $w_{d}$ and $w_{q}$ satisfy $w_{d} + w_{q} = 1$. In the default setting used in this study, Strategy B adopts $w_{d} = 0.8$ and $w_{q} = 0.2$. The sensitivity of this setting is evaluated in the experimental section.

Model output normalization. After training, the raw regression outputs of $Q_{θ}$ are linearly normalized to the range $[0, 1]$ to ensure numerical comparability with the clarity scores used in Strategy A.

Input preprocessing. For each detected product region of interest (ROI), the corresponding image patch is cropped and resized to $224 \times 224$ pixels before being fed into the regression model. If no detection bounding boxes are present in a frame, the entire frame is used as the model input to avoid missing scores.

Training procedure. The regression model is trained using pseudo-labels generated by a teacher model (HyperIQA). Supervised learning is performed by minimizing the mean squared error (MSE) loss between the predicted scores and the pseudo-labels. After training, the output of $Q_{θ}$ is used as the learned quality component in Strategy B. Since the training objective only encourages the student model to approximate the HyperIQA pseudo-labels, its relationship with OCR-stage indicators is analyzed separately in Experiment 2.

Overall, Strategy B introduces a deep quality regression signal into the frame scoring process, enabling the system to incorporate broader visual quality cues beyond the hand-crafted Laplacian clarity measure. The effectiveness of this strategy is further evaluated in the OCR-stage and end-to-end experiments.
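
The scoring function in Equation (4) can be illustrated with a short sketch. The text states only that the raw regression output is linearly normalized to $[0, 1]$; the bounds `lo` and `hi` below are therefore hypothetical calibration constants (for example, the extreme raw outputs observed on a validation set), and the function names are illustrative rather than taken from the authors' code.

```python
def normalize_quality(raw, lo, hi):
    """Linearly map a raw regression output to [0, 1].

    `lo` and `hi` are hypothetical calibration bounds; values outside the
    interval are clipped so the result always stays in [0, 1].
    """
    clipped = min(max(raw, lo), hi)
    return (clipped - lo) / (hi - lo)


def strategy_b_score(c_d, raw_quality, lo, hi, w_d=0.8, w_q=0.2):
    """Eq. (4): weighted sum of detection confidence and learned quality."""
    return w_d * c_d + w_q * normalize_quality(raw_quality, lo, hi)


# Example: detection confidence 0.9, raw quality 60 on an assumed 20..80 scale.
score = strategy_b_score(0.9, 60.0, lo=20.0, hi=80.0)
```

With the default weights, the detection term dominates, so the learned quality component mainly reorders frames whose detection confidences are close to each other.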

3.4. Frame Selection and Implementation Details

3.4.1. Frame Selection Pipeline

Given an input video $V = \{f_{1}, \dots, f_{n}\}$, the frame selection procedure is performed as follows:

A detector is applied to each frame to obtain product bounding boxes and their confidence scores.
For each detected bounding box, either a sharpness score is computed (Strategy A) or the cropped region is fed into the regression model $Q_{θ}$ (Strategy B).
A frame-level composite score $S_{i}$ is calculated for each frame.
The frame with the highest composite score is selected as the final recognition frame:

$f_{best} = \arg\max_{i} S_{i}.$ (5)
The selected frame is then passed to the downstream OCR and multimodal fusion modules.
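
The selection step in Equation (5) reduces to an argmax over per-frame composite scores, as the following minimal sketch shows. The function name is hypothetical, and the tie-breaking rule (earliest frame wins) is an assumption, since the text does not specify how ties are resolved.

```python
def select_best_frame(scores):
    """Return the index i that maximizes S_i (Eq. 5).

    Assumption: ties are broken in favour of the earliest frame, which is
    the behaviour of Python's built-in max().
    """
    if not scores:
        raise ValueError("video has no scored frames")
    return max(range(len(scores)), key=scores.__getitem__)


# Composite scores for a four-frame toy video.
best = select_best_frame([0.2, 0.9, 0.9, 0.1])
```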

3.4.2. Top-K Multi-Frame Fusion Strategy

To further improve robustness and information completeness, this work also explores an optional Top-K multi-frame fusion strategy. Instead of selecting a single frame, the top K frames ($K > 1$) with the highest quality scores are selected from the video, OCR is performed on each frame independently, and the OCR recognition results are fused at the text level.

Specifically, all frames are first sorted in descending order according to their quality scores. A greedy selection algorithm with a minimum inter-frame distance constraint is then applied: frames are examined sequentially, and a candidate frame is skipped if its temporal distance to any already selected frame is smaller than a predefined threshold. This constraint prevents selecting temporally adjacent and visually redundant frames. By leveraging complementary information across frames, such as variations in viewpoint, illumination, and occlusion, this strategy can improve the completeness and robustness of OCR evidence.
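
The greedy selection described above can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and temporal distance is measured in frame indices (`min_gap`), whereas the actual system may measure it in seconds or use a different threshold.

```python
def select_top_k(scores, k, min_gap):
    """Greedy Top-K selection with a minimum inter-frame distance.

    Frames are visited in descending score order; a candidate is skipped
    when it lies closer than `min_gap` frame indices (an assumed unit) to
    any frame that has already been selected.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in order:
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
            if len(chosen) == k:
                break
    return chosen


scores = [0.1, 0.9, 0.85, 0.3, 0.8, 0.2]
picked = select_top_k(scores, k=2, min_gap=2)
```

Note that the procedure may return fewer than K frames when the distance constraint rules out all remaining candidates, which is the expected behaviour for short or highly redundant clips.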

The fusion of multi-frame OCR results consists of four steps. First, all OCR texts are normalized, including lowercasing, removal of redundant punctuation, unification of numeric formats and units, and application of common OCR error correction rules, in order to reduce superficial textual discrepancies. Second, deduplication and fuzzy matching are performed: exact matching is prioritized, while approximately similar but non-identical strings are merged based on string similarity metrics. Third, for texts merged into the same candidate item, the maximum confidence score among all sources is retained as the final confidence. Finally, unmerged independent text items are preserved to capture complementary information across frames, yielding a more comprehensive and reliable OCR evidence set.
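
The normalization, merging, and confidence-retention steps above can be sketched with the Python standard library. The sketch makes several assumptions that are not specified in the text: the similarity metric is `difflib.SequenceMatcher.ratio()`, the merge threshold of 0.85 and the unit list are hypothetical, and the OCR error correction rules are omitted.

```python
import re
from difflib import SequenceMatcher


def normalize_text(text):
    """Lowercase, strip redundant punctuation, and unify simple unit spacing."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s.%]", "", text)                   # drop punctuation
    text = re.sub(r"(\d)\s+(ml|g|kg|l)\b", r"\1\2", text)   # "500 ml" -> "500ml"
    return re.sub(r"\s+", " ", text)


def fuse_ocr(items, threshold=0.85):
    """Merge (text, confidence) pairs pooled from all selected frames."""
    merged = {}  # normalized text -> best confidence, in insertion order
    for text, conf in items:
        norm = normalize_text(text)
        if not norm:
            continue
        # Exact matches take priority; otherwise look for a fuzzy match.
        key = norm if norm in merged else next(
            (k for k in merged
             if SequenceMatcher(None, k, norm).ratio() >= threshold),
            None,
        )
        if key is None:
            merged[norm] = conf                    # keep unmerged items
        else:
            merged[key] = max(merged[key], conf)   # retain max confidence
    return list(merged.items())


fused = fuse_ocr([
    ("Dove Shampoo 500 ml", 0.91),
    ("dove shampoo 500ml!", 0.97),
    ("Dove Shampo 500ml", 0.60),
    ("Net Wt", 0.80),
])
```

In this toy input the first three strings collapse into one candidate that keeps the highest confidence, while the unrelated fourth string is preserved as a separate evidence item.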

3.5. Multimodal Fusion and Final Matching

To achieve robust product recognition in videos, this work constructs a large language model (LLM)-based multimodal fusion framework. The framework integrates complementary information from visual modalities (OCR and image recognition) and the auditory modality (ASR), represents them uniformly in textual form, and feeds them into an LLM, which performs semantic-level fusion and reasoning to generate the final product recognition results.

3.5.1. Multimodal Information Extraction

After the visual frame selection stage, the system extracts complementary evidence from three modalities: visual text, speech, and visual semantics. It should be noted that the selected frame is used only for the visual branches, namely OCR and visual-semantic recognition. In contrast, the ASR branch operates on the complete audio stream of the video clip and provides video-level semantic evidence.

Visual Text Modality (OCR). PaddleOCR is applied to the selected product frame to recognize textual information appearing on the product packaging. The outputs include recognized text content, spatial coordinates, and confidence scores. This modality primarily captures salient product identifiers such as product names, brands, specifications, ingredients, and promotional slogans, and thus serves as a direct source of explicit identification information.

Auditory Modality (ASR). FunASR is used to perform speech recognition on the entire audio stream of the video, producing a transcription of spoken content. Unlike OCR and visual-semantic recognition, ASR is not applied to the selected frame and is not temporally constrained by the frame-selection result. Instead, it provides clip-level semantic evidence from the host’s spoken descriptions, including product features, usage scenarios, promotional expressions, and auxiliary product descriptions. Through these semantic cues, product names, brands, and specifications can be inferred, thereby complementing the limitations of purely visual information.

Visual-Semantic Modality (Image Recognition). The visual-semantic recognition module is applied to the selected product frame and its detected product regions. In this study, the open-weight Qwen3-VL-30B-A3B-Instruct model is used to generate product-related visual descriptions, candidate labels, and category cues. This modality provides appearance-based semantic features (such as packaging style, visible logos, and functional type) and offers supplementary judgments when textual information is missing or poorly recognized.

The recognition results from the three modalities are converted into textual representations denoted as $T_{OCR}$, $T_{ASR}$, and $T_{Image}$, and are stored in a structured format for subsequent multimodal fusion.

3.5.2. LLM-Based Fusion Mechanism

In this work, Qwen3.5-27B is adopted as the core LLM for evidence structuring and multimodal fusion. The model input is constructed by concatenating the textual outputs from the three modalities, while a system prompt explicitly defines the task objective, fusion logic, and output format.

The fusion process follows the principles below:

Consistency First: When multiple modalities provide identical or highly consistent information, such results are prioritized to enhance the reliability of the final output.
Reliability Ranking: Modalities are fused according to their confidence-based reliability, with the priority order defined as OCR (packaging text) > image recognition > ASR (speech), ensuring that highly reliable evidence dominates the final decision.
Conflict Resolution: In cases where outputs from different modalities conflict, the result with higher confidence and stronger consistency with OCR evidence is preferred.
Fault-Tolerant Completion: When information from certain modalities is missing, the model generates a complete judgment based on the available modalities, thereby maintaining the stability and robustness of the system output.

The final fusion result generated by the LLM is structured into the following standardized format:

$Q_{fusion} = \{\text{Product Name}, \text{Brand}, \text{Category}\}.$

The prompt templates used for evidence structuring and multimodal fusion are provided in Appendix A.
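
To make the aggregation step concrete, the sketch below shows one possible way to assemble structured evidence units into the contextual input of the fusion LLM. It is not the prompt format of Appendix A: the JSON layout, the key names (`evidence_units`, `modality`, `evidence`), and the function name are hypothetical. The sketch only mirrors two of the stated principles, namely the OCR > image recognition > ASR priority order and the omission of missing modalities.

```python
import json

# Priority order stated in the text: OCR > image recognition > ASR.
PRIORITY = ["ocr", "image", "asr"]


def build_fusion_context(evidence):
    """Serialize the available evidence units in priority order.

    `evidence` maps a modality name to its structured output (or None when
    that modality produced nothing). The JSON schema is hypothetical.
    """
    units = [
        {"modality": m, "evidence": evidence[m]}
        for m in PRIORITY
        if evidence.get(m)  # skip missing or empty modalities
    ]
    return json.dumps({"evidence_units": units}, ensure_ascii=False, indent=2)


context = build_fusion_context({
    "asr": {"brand": "ExampleBrand", "confidence": 0.6},
    "ocr": {"brand": "ExampleBrand", "confidence": 0.95},
    "image": None,  # visual-semantic branch returned nothing
})
```

The resulting string would then be placed in the user message, with the system prompt carrying the task objective, fusion logic, and output format.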

3.5.3. Temperature Parameter and Inference Stability

During inference with large language models, the temperature parameter $T$ controls the randomness of token generation and affects output stability and reproducibility. In this work, we adopt a greedy decoding strategy, selecting at each step the token with the highest logit (equivalently, the highest predicted probability). Formally, let $z_{t}$ denote the predicted logits at time step $t$; the selected token is then:

$w_{t} = \arg\max_{j} z_{t, j},$ (6)

which corresponds to the limit of the softmax distribution as $T \to 0^{+}$:

$P(w_{t} \mid w_{<t}) = \lim_{T \to 0^{+}} \mathrm{softmax}\left(\frac{z_{t}}{T}\right).$ (7)

This deterministic decoding approach is motivated by three considerations. First, it ensures reproducibility and comparability across experiments: given identical inputs, the model produces consistent outputs, reducing variance when evaluating different frame scoring strategies or multimodal fusion configurations. Second, it preserves the stability of structured output fields (e.g., product name, brand, and category), preventing missing or malformed fields that might arise under stochastic sampling. Finally, deterministic decoding mitigates semantic drift during multimodal fusion, ensuring that evidence from OCR, ASR, and visual modalities is faithfully integrated without generating irrelevant or redundant content.

By explicitly formulating the $T \to 0^{+}$ limit, we avoid the mathematical inconsistency of setting $T = 0$ directly in the softmax, which would require division by zero, while clearly conveying that greedy decoding is employed throughout our experiments.
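
The limit in Equation (7) can be checked numerically with a small sketch. The logits below are arbitrary illustrative values, not model outputs; the code only shows that the temperature-scaled softmax concentrates on the argmax token as $T$ decreases.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # arbitrary example logits
for t in (1.0, 0.1, 0.01):
    print(t, [round(p, 4) for p in softmax(logits, t)])

# Greedy decoding (Eq. 6) picks the index of the largest logit directly.
greedy = max(range(len(logits)), key=logits.__getitem__)
```

As the temperature shrinks, the probability mass moves onto the token selected by greedy decoding, which is why the two formulations agree whenever the largest logit is unique.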

3.6. Data

3.6.1. Data Sources and Overall Statistics

This study constructs a dedicated dataset for structured product recognition in e-commerce livestreaming scenarios. The videos are collected from publicly available e-commerce livestream content on major platforms, including Taobao, Douyin, and Pinduoduo. To capture diverse livestreaming conditions, the dataset covers different streamer styles, camera perspectives, product display patterns, lighting conditions, and platform UI layouts.

A total of 442 videos are collected for model development and intermediate analysis. From these videos, approximately 36,000 raw frames are extracted at a fixed frame rate. After manual filtering, product bounding-box annotation, and data augmentation, around 3000 images are obtained for training the product detection model, D-FINE. In addition, the raw frames are processed by the trained product detector to retain frames containing visible products, resulting in approximately 11,000 product-containing frames for quality-score modeling and frame selection analysis.

For end-to-end structured product recognition evaluation, a separate manually verified evaluation set is constructed, consisting of 200 product-related video samples. Each sample is associated with a normalized ground-truth record containing three evaluation fields: brand, product name, and category. These fields are used as the reference for the end-to-end recognition experiments.

3.6.2. Product Detection Annotation

For training the product detector, D-FINE adopts a single unified detection category, goods. This design focuses the detector on localizing visible product instances rather than distinguishing fine-grained product categories or brands at the detection stage.

Annotators are required to draw bounding boxes around the complete and clearly recognizable external contour of each product instance. Product regions that are severely occluded, extremely low-resolution, or visually unrecognizable are excluded from detection annotation. Each detection annotation record includes fields such as image_id, bbox, annotator_id, and annotator_confidence, which support traceability and annotation quality checking during detector training.

3.6.3. Ground-Truth Construction for Product Recognition

To ensure the consistency and reliability of the structured ground-truth records, a barcode-assisted manual annotation protocol is used. During dataset construction, product barcodes are collected together with the corresponding videos and used as video identifiers. For each video sample, the product barcode is first queried in the official commodity barcode database to obtain basic product information. When the barcode query result is incomplete, unavailable, or inconsistent with the visible product packaging, the annotation is further verified using the product appearance, packaging text, brand official websites, e-commerce platforms, search engines, and other publicly verifiable sources.

The ground-truth records are normalized into three evaluation fields: brand, product name, and category. The brand field records the consumer-facing brand shown on the product packaging, rather than the manufacturer, distributor, or platform name. The product name field records the shortest identifiable official product name that can uniquely locate the target product, excluding promotional, channel-related, or purely packaging-related expressions. The category field follows a predefined closed-set taxonomy and is manually normalized according to the primary product usage.

Two annotators independently annotate the samples according to the same written annotation protocol. After the independent annotation stage, a primary annotator reviews all entries to ensure consistency in naming conventions, brand normalization, and category mapping. Inconsistent or ambiguous cases are resolved according to predefined source-priority rules: visible information on the product packaging is prioritized when available, followed by official barcode query results, brand official sources, e-commerce platforms, and other verifiable public information. If the ambiguity cannot be fully resolved by these rules, the final decision is made by the primary annotator based on the consistency among the video appearance, packaging text, barcode information, and public product information. The outputs of Strategy A, Strategy B, or any recognition model are not used as ground-truth annotations.

The specifications field is retained as auxiliary metadata when reliable information is available, including capacity, quantity, size, or model information. However, this field is not used in the evaluation metrics of this study. All 200 samples in the end-to-end evaluation set are completed under this protocol, providing manually verified reference labels for evaluating structured product recognition performance.

Although public livestreaming product recognition benchmarks such as LPR4M are valuable for independent comparison, they are not directly compatible with the field-level structured evaluation protocol used in this study. The present evaluation requires standardized ground-truth labels for brand, product name, and category, which are constructed through barcode-assisted manual verification, packaging-based evidence checking, and field normalization. Applying the same protocol to LPR4M would require additional product identity rechecking, brand and product-name normalization, category–taxonomy mapping, and conflict resolution under the same field-level schema.

3.6.4. Data Augmentation and Preprocessing

To enhance the generalization performance of D-FINE under limited annotated data conditions, data augmentation is applied to the manually annotated detection samples. The augmentation operations include horizontal and vertical flipping; brightness, contrast, and saturation perturbations; random cropping; and mild Gaussian noise injection. The augmentation process expands the manually annotated image set by approximately $2\times$, resulting in around 3000 images for D-FINE training. All input images are normalized and standardized before training.

3.6.5. Pseudo-Labeling Pipeline for EfficientNetV2-M Training

The EfficientNetV2-M model is trained to predict visual quality and readability-related scores for product-containing frames. Since manually annotating frame-level visual quality scores is costly, a semi-supervised pseudo-labeling strategy is adopted. Specifically, approximately 36,000 frames are first extracted from the 442 videos, and about 11,000 product-containing frames are retained using the trained product detector. These frames are then scored by HyperIQA, a no-reference image quality assessment model, and the resulting scores are used as pseudo-labels for training the EfficientNetV2-M-based regression model.

In this study, the EfficientNetV2-M output is used as a learned quality component for Strategy B. This component is interpreted as a visual quality and readability-related cue rather than as a complete task-specific product recognizability score.

3.6.6. Evaluation Subsets

Different evaluation subsets are used for different experimental purposes. For the OCR-stage comparison between Strategy A and Strategy B in Experiment 2, 100 videos are sampled from the product-related video dataset. Both frame selection strategies are applied to the same videos, and the selected frames are evaluated using OCR-stage indicators, including average OCR confidence, high-confidence text ratio, recognized text count, processing time, and correlation with Strategy B scores.

For the end-to-end structured product recognition experiments, including the main comparison and ablation study, the 200-video manually verified evaluation set is used. System outputs are evaluated against the normalized ground-truth fields brand, product name, and category. Semantic similarity is computed using BGE-M3 embeddings, while field-level accuracy and Perfect Match Rate are computed according to the evaluation protocol described in the following section.
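
For orientation, the two headline metrics can be sketched as follows. The definitions here are assumptions made for illustration, since the evaluation protocol is specified in the following section and remains authoritative: a sample is assumed to count as a perfect match only when all three normalized fields equal the ground truth exactly, and semantic similarity is assumed to be the cosine similarity between embedding vectors. The toy vectors stand in for BGE-M3 embeddings, which are not computed in this sketch.

```python
import math

FIELDS = ("brand", "product_name", "category")  # hypothetical key names


def perfect_match_rate(predictions, references):
    """Share of samples whose three fields all match (assumed definition)."""
    hits = sum(
        all(p[f] == r[f] for f in FIELDS)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


refs = [
    {"brand": "a", "product_name": "x", "category": "c1"},
    {"brand": "b", "product_name": "y", "category": "c2"},
]
preds = [
    {"brand": "a", "product_name": "x", "category": "c1"},
    {"brand": "b", "product_name": "z", "category": "c2"},
]
pmr = perfect_match_rate(preds, refs)
```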

3.7. Models and Training

This study trains two core models: (1) D-FINE, which is used for detecting products in video frames and outputs product bounding boxes along with detection confidence scores, and (2) EfficientNetV2-M, which is employed to estimate visual quality and readability-related scores and provides the learned quality component for the Strategy B frame scoring function. Both models are trained on a self-constructed dataset, and all data splits are performed at the video level to ensure that the training, validation, and test sets do not overlap.

3.7.1. D-FINE Detector

For candidate product region detection, we adopt an implementation of the D-FINE framework (Detection with Fine-grained Localization and Global Self-Distillation) to improve bounding box localization accuracy and detection robustness in livestream video scenarios. The model consists of three main components: an HGNetv2-B2 backbone, a HybridEncoder, and a DFINETransformer decoder. The HGNetv2-B2 backbone is initialized using the public stage-1 pretrained weights PPHGNetV2_B2_stage1.pth, and the full D-FINE detector is subsequently trained on our manually annotated product detection dataset. During inference, the resulting custom full-model checkpoint is loaded for product detection. This description is used consistently throughout the paper to distinguish the backbone initialization from the final task-specific detector checkpoint.

The encoder aligns multi-scale features into a unified channel dimension (hidden_dim = 256) via multi-scale feature fusion. The decoder follows a Transformer architecture with deformable attention, comprising four decoding layers (num_layers = 4) and 300 query vectors (num_queries = 300). It directly outputs detection boxes and class predictions in an end-to-end manner, without relying on conventional FPN or NMS post-processing.

The detector predicts two classes (product/background) and is used solely for detection; it does not output embeddings for semantic retrieval. Subsequent semantic recognition is performed by the OCR and multimodal recognition modules.

The core innovations of D-FINE lie in Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD) [15]. FDR replaces direct numerical regression of bounding box coordinates with discrete distribution modeling, where the model predicts a probability distribution for each boundary position and progressively refines it in a residual manner across decoding layers, enabling fine-grained modeling of spatial uncertainty. GO-LSD treats the localization results from the final decoding layer as teacher signals and distills them into intermediate layers using a distribution consistency loss, thereby enforcing globally optimal inter-layer constraints. This mechanism is implemented through the Fine-grained Localization (FGL) loss and the Decoupled Distillation Focal (DDF) loss, where FGL enforces unimodal distribution constraints and boundary refinement, while DDF focuses on inter-layer distillation.

By jointly considering classification confidence, bounding box regression, distribution prediction, and distillation constraints, the overall training objective is defined as:

$L = 1.0 \times L_{vfl} + 5.0 \times L_{bbox} + 2.0 \times L_{giou} + 0.15 \times L_{fgl} + 1.5 \times L_{ddf}$ (8)

Here, $L_{vfl}$ denotes the Varifocal Loss for confidence-weighted classification optimization, $L_{bbox}$ is the L1 bounding box regression loss, $L_{giou}$ is the GIoU loss, and $L_{fgl}$ and $L_{ddf}$ correspond to the fine-grained localization and distillation constraint terms, respectively. This weighting strategy has been shown in the original D-FINE work to significantly improve localization accuracy for small and medium-sized objects as well as convergence speed.
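
As a worked example of Equation (8), the weighted sum can be written as a small helper. The weights are those given in the equation; the per-term loss values in the example are made-up numbers for one hypothetical training step, and the dictionary keys and function name are illustrative.

```python
# Loss weights taken from Eq. (8).
LOSS_WEIGHTS = {"vfl": 1.0, "bbox": 5.0, "giou": 2.0, "fgl": 0.15, "ddf": 1.5}


def total_loss(components):
    """Weighted sum of the per-term loss values for one training step."""
    missing = LOSS_WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(LOSS_WEIGHTS[name] * components[name] for name in LOSS_WEIGHTS)


# Made-up component values, for illustration only.
step_loss = total_loss(
    {"vfl": 0.5, "bbox": 0.1, "giou": 0.3, "fgl": 1.0, "ddf": 0.2}
)
```

The example makes the relative emphasis visible: the L1 box term is scaled up fivefold, whereas the fine-grained localization term contributes with a small weight.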

Training configuration and hyperparameters: The input resolution is fixed at $640 \times 640$ pixels, and training is conducted using COCO-format annotations. The AdamW optimizer is employed, together with a linear warm-up phase and a multi-stage learning rate decay schedule to balance convergence speed and training stability. Mixed-precision training (AMP) and Exponential Moving Average (EMA) are enabled to improve generalization performance and numerical stability. Data augmentation strategies include brightness and contrast perturbations, random scaling, IoU-based cropping, and horizontal flipping, enhancing robustness to illumination changes and viewpoint variations. The main hyperparameters and training strategies are summarized in Table 1.

Optimization Objective and Evaluation: The model’s evaluation metrics follow the COCO-style standards, including mAP@[0.5:0.95], AP50, and AP75. Training logs and validation results are recorded at each epoch, and early stopping is applied to select the best model when no improvement is observed in the validation set metrics. The final output confidence scores are used for subsequent frame scoring and candidate prioritization.

Advantages: Compared to traditional two-stage detection frameworks (e.g., Faster R-CNN + FPN), the DETR-style end-to-end detection in D-FINE offers three main advantages:

Elimination of the NMS post-processing step, achieving true end-to-end optimization.

The FDR + GO-LSD mechanism improves bounding box localization accuracy and offers stronger robustness, especially in live-streaming product scenes with large variations in object scale and motion blur.

The Transformer decoder structure is more compatible with downstream multimodal recognition modules, providing structural compatibility for unified vision–language reasoning.

3.7.2. EfficientNetV2

For frame scoring in Strategy B, we use EfficientNetV2-M as a lightweight quality regression model to estimate visual quality and readability-related scores for candidate frames. This score is then normalized and used as the quality term in the frame ranking. EfficientNetV2-M is chosen as the backbone for its training efficiency and parameter efficiency: compared to traditional ConvNets or some Transformer variants, EfficientNetV2 significantly reduces training time and inference latency while maintaining high accuracy, as illustrated in the performance comparison in Figure 2 [38].

This strategy is inspired by Kharchevnikova & Savchenko (2021) [24], who used lightweight CNNs combined with distillation/weak-labeling to build an efficient frame quality estimator for video frame selection/quality assessment. The approach generates “quality” pseudo-labels for a large number of unlabeled or weakly labeled frames using a teacher model, and then a student model learns this quality scale for cost-effective large-scale training. Based on this idea, in the live product recognition scenario, we use a no-reference image quality evaluator (HyperIQA) as the teacher model to generate continuous quality scores for the candidate frames extracted at a fixed frame rate and filtered by a detector. These pseudo-labels are then used to train EfficientNetV2-M. This design allows the lightweight student model to approximate the HyperIQA quality scores with lower inference overhead, thereby providing an efficient learned quality component for the Strategy B frame scoring function.

Let the output of EfficientNetV2-M be defined as a continuous regression value $\hat{y}$, with the training objective being to minimize the mean squared error (MSE) between the predicted value and the teacher’s pseudo-label $y$:

$L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_{i} - y_{i})^{2}$ (9)

The training data are drawn from the approximately 36,000 frames extracted from the 442 videos at a fixed frame rate; after filtering by the detector, about 11,000 product-containing frames remain and are used for training. The data are split by video into training and validation sets (80%/20%). During training, input images undergo uniform preprocessing and data augmentation (see Table 2). The ReduceLROnPlateau learning rate scheduler and an early stopping mechanism are employed to avoid overfitting, and mixed-precision training (FP16) is enabled to accelerate training and reduce memory usage.
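
A video-level split of the kind described above can be sketched as follows. The function and variable names are hypothetical, and the random seed and rounding rule are assumptions; the point of the sketch is only that videos, rather than individual frames, are assigned to subsets, so frames from one video never appear in both.

```python
import random


def split_by_video(frame_to_video, val_fraction=0.2, seed=0):
    """Split frame IDs so that all frames of a video share one subset.

    `frame_to_video` maps a frame ID to the ID of its source video.
    """
    videos = sorted(set(frame_to_video.values()))
    random.Random(seed).shuffle(videos)
    n_val = max(1, round(len(videos) * val_fraction))
    val_videos = set(videos[:n_val])
    train = [f for f, v in frame_to_video.items() if v not in val_videos]
    val = [f for f, v in frame_to_video.items() if v in val_videos]
    return train, val


# Toy data: 10 videos with 5 frames each.
frames = {f"v{v}_f{i}": f"v{v}" for v in range(10) for i in range(5)}
train_ids, val_ids = split_by_video(frames)
```

Because the split is made over videos, the frame-level ratio only approximates 80%/20% when videos contribute different numbers of frames.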

After training, the regression output is linearly normalized to the range $[0, 1]$ as the final quality score. The regression metrics (MSE, MAE, RMSE) are reported on the validation set, along with the Pearson correlation coefficient between the model’s predictions and the teacher’s scores.
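
The validation metrics listed above follow standard definitions, which the sketch below implements in plain Python. The function name and the example values are illustrative; they are not results from this study.

```python
import math


def regression_report(pred, target):
    """Compute MSE, MAE, RMSE, and Pearson correlation for two sequences."""
    n = len(pred)
    errors = [p - t for p, t in zip(pred, target)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    pearson = cov / math.sqrt(var_p * var_t)
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse), "pearson": pearson}


# Illustrative student predictions versus teacher pseudo-labels.
report = regression_report([0.2, 0.4, 0.6, 0.8], [0.1, 0.5, 0.5, 0.9])
```

The Pearson coefficient complements the error metrics: it indicates whether the student preserves the teacher's ranking trend even when absolute errors are non-zero, which is the property that matters most for frame ranking.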

4. Experiments

This section aims to validate the effectiveness of the proposed frame scoring strategy, Top-K multi-frame fusion mechanism, and multimodal recognition framework. The experiments are conducted from three perspectives: frame selection quality evaluation, OCR information extraction performance comparison, and end-to-end recognition accuracy analysis. All experiments are performed under a unified system environment, ensuring consistency in model parameters, inference settings, and hardware conditions to guarantee comparability and reproducibility of the results.

4.1. Model Training Results and Analysis

4.1.1. D-FINE Training Results and Performance Analysis

The model training process converges smoothly, reaching its final performance after 132 epochs. The specific performance metrics are shown in Table 3.

Given the relatively small scale of the training set and the difficulty of product detection in livestreaming scenarios, the final mAP score over IoU thresholds from 0.5 to 0.95 is 27.76%, and the AP₅₀ score is 40.36%, indicating that frame-level product detection remains challenging. Livestreaming product detection differs from generic object detection benchmarks because products are often affected by dense packaging, occlusion, motion blur, changing viewpoints, reflections, and fine-grained product boundaries. Therefore, these metrics should be interpreted together with the role of the detector in the proposed pipeline: the detector is used primarily to support video-level frame selection rather than to exhaustively detect all product instances in every frame.

The training results suggest three main observations:

Effective backbone initialization: The HGNetv2-B2 backbone is initialized from public stage-1 pretrained weights, which provide general visual representations and support stable convergence during training on the custom product detection dataset.

Successful domain adaptation: Despite the limited number of manually annotated training samples, the detector learns task-relevant product localization patterns and can provide detection confidence scores for subsequent frame scoring.

Practical support and error propagation in video-level frame selection: Although the frame-level detection metrics are modest, the detector remains useful in the proposed end-to-end pipeline because the downstream task does not require detecting every product instance in every frame. Instead, in most experiments, the system selects a single representative frame from a multi-minute video clip, while the Top-K experiment is used only as an exploratory comparison. In this setting, the detector mainly serves as a filtering and scoring component: it helps identify frames where the product is visible and provides detection confidence as one cue for ranking candidate frames. The temporal redundancy of livestreaming videos can partially mitigate frame-level misses, since the same target product often appears repeatedly across many consecutive frames. Therefore, successful detection in one or more informative frames can still be sufficient to support downstream OCR and visual-semantic recognition. Nevertheless, the AR₁₀₀ of 44.14% indicates that a substantial fraction of product appearances may still be missed at the frame level. These missed detections and low-confidence bounding boxes may propagate to the frame selection stage, reduce the chance of selecting frames with complete textual or visual evidence, and thereby limit the upper bound of the final recognition performance, especially when the product appears only briefly, is severely occluded, or contains small textual cues. Further improving the detector and analyzing detection-error propagation remain important directions for future work.

4.1.2. EfficientNetV2 Training Results and Performance Analysis

The model converges well on both the training and validation sets, triggering the early stopping mechanism after reaching the best validation performance at epoch 17. The final model performance is shown in Table 4.

The model’s Pearson correlation coefficient reaches 0.9323, indicating that the predicted quality scores are highly consistent with the scores from the HyperIQA teacher model and that the distillation is effective. The MAE of 0.0367 corresponds to an average absolute error of 3.67% of the [0, 1] scoring range, demonstrating high prediction accuracy. The MSE and MAE are relatively close, with RMSE slightly higher than MAE, suggesting the presence of a small number of outlier samples with larger errors, but overall predictions remain stable.

4.2. Experimental Metrics and Statistical Methods

To comprehensively evaluate the performance across different experiments, this paper defines corresponding metric systems for each experimental setting. For structured product recognition, the evaluation fields are limited to brand, product name, and category. The specifications field is retained as auxiliary metadata in the dataset but is not used in the evaluation metrics of this study.

4.2.1. Experiment 1: OCR Information Quantity Metrics

This experiment evaluates OCR performance from three aspects: recognition quantity, confidence, and efficiency, thereby assessing whether multi-frame fusion effectively improves the amount and reliability of recognized information. The specific metrics are shown in Table 5.

4.2.2. Experiment 2: Frame Quality and OCR-Stage Metrics

This experiment employs downstream OCR-stage performance as a unified objective evaluation benchmark for comparing Strategy A and Strategy B. Since the two strategies use different internal scoring mechanisms, their raw scores are not directly compared. Instead, the selected frames are evaluated in terms of OCR confidence, recognized text evidence, and processing efficiency. The specific metrics are shown in Table 6.

In addition, the experiment includes a human evaluation component to complement the objective OCR-stage metrics. Five raters independently scored the representative frames selected by different strategies from four aspects: text clarity, product completeness, brand visibility, and overall visual quality. A five-point scale was adopted, where 1 indicates very poor quality, 2 poor quality, 3 acceptable quality, 4 good quality, and 5 excellent quality. More specifically, a score of 1 corresponds to frames with severe blur, incomplete product presentation, unclear brand cues, or almost unreadable text; a score of 3 corresponds to frames that are basically usable but still contain noticeable defects; and a score of 5 corresponds to frames with clear text, complete product presentation, clearly visible brand information, and high overall readability. To reduce subjective bias, the candidate frames were presented in randomized order without revealing the corresponding strategy labels. For each sample, the scores given by all annotators were first averaged within each evaluation dimension, and the final human evaluation score was then obtained by averaging the dimension-level mean scores.
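The two-step aggregation of the human ratings (averaging across raters within each dimension, then averaging the dimension-level means) can be sketched as follows. The dictionary keys and the rating values are illustrative assumptions; only the aggregation order follows the description above.

```python
from statistics import mean

DIMENSIONS = ("text_clarity", "product_completeness",
              "brand_visibility", "overall_quality")

def human_score(ratings):
    """ratings: one dict per rater, mapping each dimension to a 1-5 score."""
    # Step 1: average across raters within each dimension.
    dimension_means = {d: mean([r[d] for r in ratings]) for d in DIMENSIONS}
    # Step 2: average the dimension-level means into one score per sample.
    return mean(list(dimension_means.values())), dimension_means

# Hypothetical ratings from five raters for one selected frame.
ratings = [
    {"text_clarity": 4, "product_completeness": 3, "brand_visibility": 5, "overall_quality": 4},
    {"text_clarity": 4, "product_completeness": 4, "brand_visibility": 4, "overall_quality": 4},
    {"text_clarity": 3, "product_completeness": 3, "brand_visibility": 5, "overall_quality": 4},
    {"text_clarity": 5, "product_completeness": 4, "brand_visibility": 4, "overall_quality": 3},
    {"text_clarity": 4, "product_completeness": 4, "brand_visibility": 5, "overall_quality": 4},
]
final_score, per_dimension = human_score(ratings)
```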

For paired comparisons between Strategy A and Strategy B, the mean and standard deviation, paired mean difference, 95% confidence interval, p-value, and Cohen’s d_{z} are reported where applicable. In addition, Experiment 2 reports the correlation between Strategy B scores and OCR-stage indicators. Specifically, Pearson and Spearman correlations are computed between each of the two scores (the learned quality component q and the final Strategy B score S_{B}) and the OCR-stage indicators, which include average OCR confidence, high-confidence text ratio, recognized text count, and recognized character count.
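A minimal sketch of the paired summary statistics is given below. Cohen's d_z is computed as the mean of the paired differences divided by their sample standard deviation. The confidence interval uses a percentile bootstrap, which is an assumption: the text does not state how the intervals were obtained. The function name, the number of resamples, and the toy confidence values are hypothetical.

```python
import random
from statistics import mean, stdev

def paired_summary(a, b, n_boot=2000, seed=0):
    """Paired mean difference (B minus A), Cohen's d_z, and a bootstrap 95% CI."""
    diffs = [y - x for x, y in zip(a, b)]
    mean_diff = mean(diffs)
    d_z = mean_diff / stdev(diffs)          # d_z = mean(diff) / sd(diff)

    # Percentile bootstrap over the paired differences (assumed CI method).
    rng = random.Random(seed)
    boot_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lo = boot_means[int(0.025 * n_boot)]
    hi = boot_means[int(0.975 * n_boot) - 1]
    return {"mean_diff": mean_diff, "d_z": d_z, "ci95": (lo, hi)}

# Toy per-video OCR confidences for Strategy A and Strategy B.
conf_a = [0.70, 0.65, 0.80, 0.75, 0.60, 0.72]
conf_b = [0.74, 0.63, 0.85, 0.75, 0.66, 0.75]
summary = paired_summary(conf_a, conf_b)
```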

4.2.3. Experiments 3 and 4: Structured Product Recognition Metrics

Experiments 3 and 4 evaluate end-to-end structured product recognition performance. Experiment 3 compares different frame selection strategies in the full multimodal recognition pipeline, while Experiment 4 evaluates the contribution of different modality combinations through ablation analysis. Both experiments use the same structured evaluation fields: brand, product name, and category.

For each field, semantic similarity between the predicted result and the ground-truth annotation is computed using BGE-M3 text embeddings and cosine similarity. Field-level correctness is then determined by applying predefined similarity thresholds to the corresponding similarity scores:

Brand Recognition Accuracy: proportion of samples whose brand similarity is no less than 0.7;
Product Name Recognition Accuracy: proportion of samples whose product-name similarity is no less than 0.5;
Category Recognition Accuracy: proportion of samples whose category similarity is no less than 0.5.

At the video level, two evaluation metrics are defined:

Perfect Match Rate: proportion of videos for which brand, product name, and category are all correctly recognized according to the corresponding field-level thresholds;
Semantic Similarity: average semantic similarity across the three evaluation fields.

BGE-M3 embeddings are used to compute semantic similarity scores, while field-level accuracies and Perfect Match Rate are obtained by applying the predefined thresholds to these scores. Compared with exact string matching, this evaluation design better accommodates minor expression variations in product names and category descriptions while preserving field-level evaluation consistency.
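The metric definitions above can be made concrete with a short sketch. It assumes the per-field similarity scores have already been computed from text embeddings (the embedding model call itself is omitted), and the function names and toy scores are hypothetical. The thresholds are those stated in the text.

```python
import math

THRESHOLDS = {"brand": 0.7, "product_name": 0.5, "category": 0.5}

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def evaluate_videos(similarities, thresholds=THRESHOLDS):
    """similarities: one dict per video, mapping field name to similarity score."""
    n = len(similarities)
    # A field is correct when its similarity is no less than its threshold.
    field_accuracy = {
        f: sum(s[f] >= t for s in similarities) / n for f, t in thresholds.items()
    }
    # Perfect Match: all three fields correct for the same video.
    perfect = sum(
        all(s[f] >= t for f, t in thresholds.items()) for s in similarities
    ) / n
    # Semantic Similarity: mean over fields, then mean over videos.
    semantic = sum(
        sum(s[f] for f in thresholds) / len(thresholds) for s in similarities
    ) / n
    return {"field_accuracy": field_accuracy,
            "perfect_match_rate": perfect,
            "semantic_similarity": semantic}

# Toy similarity scores for four videos.
scores = [
    {"brand": 0.92, "product_name": 0.81, "category": 0.77},
    {"brand": 0.65, "product_name": 0.70, "category": 0.90},
    {"brand": 0.88, "product_name": 0.45, "category": 0.60},
    {"brand": 0.70, "product_name": 0.50, "category": 0.50},
]
metrics = evaluate_videos(scores)
```

Because Perfect Match requires all three fields to pass at once, it can never exceed the lowest field-level accuracy.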

The default thresholds are determined according to the semantic characteristics of different fields. Specifically, the brand field usually contains short and semantically explicit expressions, so a stricter threshold of 0.7 is adopted. In contrast, product names in e-commerce livestreaming scenarios often include abbreviations, specification-related modifiers, and expression variations; therefore, a relatively tolerant threshold of 0.5 is used. The category field is comparatively coarse-grained, and a threshold of 0.5 is also adopted to preserve moderate semantic tolerance.

To examine the robustness of the adopted thresholds, a one-factor sensitivity analysis is conducted using the end-to-end recognition outputs. In this analysis, one threshold is varied at a time while the other two thresholds are fixed at their default values. Since threshold variation only affects the correctness criterion rather than the semantic similarity scores themselves, Table 7 presents the changes in Perfect Match Rate and the corresponding field accuracy under different threshold settings.
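The one-factor procedure can be sketched as follows: one threshold is swept over candidate values while the other two stay at their defaults, and Perfect Match Rate and the accuracy of the varied field are recomputed from the fixed similarity scores. The candidate values and toy scores are illustrative assumptions, not the settings of Table 7.

```python
DEFAULTS = {"brand": 0.7, "product_name": 0.5, "category": 0.5}

def perfect_match_rate(similarities, thresholds):
    """Share of videos whose three fields all reach their thresholds."""
    hits = sum(all(s[f] >= t for f, t in thresholds.items()) for s in similarities)
    return hits / len(similarities)

def one_factor_sweep(similarities, field, candidates, defaults=DEFAULTS):
    """Vary one threshold; return (threshold, perfect match rate, field accuracy) rows."""
    rows = []
    for value in candidates:
        thresholds = dict(defaults, **{field: value})   # other two stay at defaults
        field_acc = sum(s[field] >= value for s in similarities) / len(similarities)
        rows.append((value, perfect_match_rate(similarities, thresholds), field_acc))
    return rows

# Toy similarity scores for four videos; the scores themselves never change.
scores = [
    {"brand": 0.92, "product_name": 0.81, "category": 0.77},
    {"brand": 0.65, "product_name": 0.70, "category": 0.90},
    {"brand": 0.88, "product_name": 0.45, "category": 0.60},
    {"brand": 0.70, "product_name": 0.50, "category": 0.50},
]
brand_rows = one_factor_sweep(scores, "brand", [0.6, 0.7, 0.8])
```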

As shown in Table 7, the adopted thresholds provide a reasonable balance between semantic tolerance and evaluation strictness. For the brand field, lower thresholds lead to slightly higher Perfect Match Rate and Brand Recognition Accuracy, but they also make the matching criterion more permissive for a short and semantically explicit field. When the threshold is increased beyond 0.700, both metrics decline more noticeably, indicating that excessively strict brand matching may reject acceptable brand variants. Therefore, 0.700 is adopted as a balanced threshold for brand recognition.

For the product-name field, thresholds below 0.500 produce comparable or slightly higher Perfect Match Rate, but they also make the correctness criterion more tolerant for a field with substantial expression variation. When the threshold is increased above 0.500, both Perfect Match Rate and Product Name Recognition Accuracy decline steadily, suggesting that overly strict product-name matching may reject semantically valid variants. Therefore, 0.500 is adopted to balance tolerance to expression variation and evaluation strictness.

For the category field, Perfect Match Rate remains unchanged from 0.400 to 0.500, while Category Recognition Accuracy remains high. However, both metrics decline when the threshold is further increased. This result indicates that 0.500 provides a stable threshold for the relatively coarse-grained category field.

Overall, these results support the adopted threshold setting of 0.700 for brand, 0.500 for product name, and 0.500 for category as a stable and interpretable evaluation configuration. For end-to-end recognition experiments, Brand Recognition Accuracy, Product Name Recognition Accuracy, Category Recognition Accuracy, Perfect Match Rate, and Semantic Similarity are reported. The same metric definitions and threshold settings are used consistently in the main comparison and the ablation study.

4.3. Experiment 1: OCR-Stage Comparison Between Single-Frame and Top-K Multi-Frame Fusion

4.3.1. Experimental Setup

This experiment aims to examine the effect of the Top-K multi-frame fusion strategy on OCR-stage text evidence completeness and reliability. The compared strategies include the following three:

Single-Frame Strategy: Directly select the single frame with the highest comprehensive quality score as the OCR input;
Top-3 Fusion Strategy: Select the top 3 frames with the highest quality scores and fuse their OCR outputs;
Top-5 Fusion Strategy: Select the top 5 frames with the highest quality scores and fuse their OCR outputs.

This OCR-stage comparison is conducted on 30 randomly sampled videos. All strategies employ the same frame scoring method, namely Strategy A, computed through a weighted combination of detection confidence (weight 0.7) and Laplacian clarity (weight 0.3). To avoid redundant information from temporally adjacent frames, a minimum inter-frame distance constraint is introduced during frame selection:

d_{min} = \left\lfloor \frac{n}{2K} \right\rfloor,

(10)

which encourages the selected frames to be more temporally dispersed along the video timeline.
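One way to apply the minimum-distance constraint is a greedy pass over frames in descending score order, as in the sketch below. It assumes that n is the number of candidate frames, that distance is measured in candidate-frame indices, and that a frame is accepted when it lies at least d_min away from every frame already picked; these details and the function name are assumptions rather than the authors' stated procedure.

```python
def select_top_k(scores, k):
    """Greedily pick k high-scoring frames at least d_min indices apart."""
    n = len(scores)
    d_min = n // (2 * k)                       # floor(n / (2K)), as in Eq. (10)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)

    picked = []
    for i in order:
        if all(abs(i - j) >= d_min for j in picked):
            picked.append(i)
        if len(picked) == k:
            break
    return sorted(picked), d_min

# Toy quality scores for 12 candidate frames, Top-3 selection.
scores = [0.2, 0.9, 0.95, 0.85, 0.3, 0.4, 0.8, 0.82, 0.1, 0.5, 0.6, 0.7]
selected, d_min = select_top_k(scores, k=3)
```

In this toy case the three highest-scoring frames are adjacent (indices 1 to 3), so only the best of them is kept and the remaining picks come from later parts of the timeline.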

In the multi-frame fusion stage, a “text deduplication + confidence weighting” approach is used to integrate OCR outputs from each selected frame, thereby obtaining the final fused OCR text evidence.
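The fusion rule is described only at a high level, so the following sketch shows one possible reading of “text deduplication + confidence weighting”. Texts are deduplicated after case and whitespace normalization, and each unique text is weighted by its summed confidence divided by the number of fused frames, so that text seen confidently in several frames ranks higher. The normalization rule, the weighting formula, and the function name are all assumptions.

```python
from collections import defaultdict

def fuse_ocr(frames):
    """frames: per-frame OCR results, each a list of (text, confidence) pairs."""
    confidences = defaultdict(list)
    best_surface = {}
    for results in frames:
        for text, conf in results:
            key = " ".join(text.lower().split())     # deduplication key
            confidences[key].append(conf)
            # Keep the surface form from the most confident occurrence.
            if key not in best_surface or conf > best_surface[key][1]:
                best_surface[key] = (text.strip(), conf)

    fused = [
        (best_surface[key][0], sum(confs) / len(frames))
        for key, confs in confidences.items()
    ]
    return sorted(fused, key=lambda item: item[1], reverse=True)

# Toy OCR outputs from three selected frames.
frames = [
    [("Brand X", 0.95), ("500 ml", 0.60)],
    [("brand x ", 0.90), ("Green Tea", 0.85)],
    [("Green Tea", 0.80), ("500 ml", 0.90)],
]
fused_text = fuse_ocr(frames)
```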

4.3.2. Experimental Results

The performance of single-frame and multi-frame strategies is compared according to the OCR-stage metrics for Experiment 1 defined in Section 4.2. The experimental results are shown in Table 8.

4.3.3. Result Analysis

As shown in Table 8, multi-frame fusion improves OCR-stage evidence completeness compared with the single-frame strategy. The Top-3 fusion strategy achieves an approximately 17.2% improvement in average OCR confidence compared with the single-frame strategy and a 20% increase in the number of high-confidence characters. This indicates that multi-frame OCR evidence can partially compensate for blur, occlusion, or missing textual content in a single selected frame.

However, this gain is accompanied by a substantial increase in processing time. The processing time of the Top-3 strategy is approximately 2.9 times that of the single-frame strategy, while the Top-5 strategy reaches approximately 4.8 times. Furthermore, the improvement from Top-3 to Top-5 is limited: the number of high-confidence characters increases by 11.1%, while the average OCR confidence slightly decreases from 0.8897 to 0.8850. This suggests that after a certain number of selected frames, additional frames may introduce redundant or lower-quality textual evidence, thereby weakening the marginal benefit of multi-frame fusion.

Overall, the Top-K experiment demonstrates that multi-frame fusion can enrich OCR evidence, but its computational overhead increases substantially with the number of selected frames. Therefore, in the deployment-oriented setting of this study, the single-frame strategy remains the default visual input strategy, while Top-K fusion is treated as an optional OCR-stage extension for scenarios where richer textual evidence is required and efficiency constraints are less strict.

4.4. Experiment 2: Comparison of Frame Quality Between Strategy A and Strategy B

4.4.1. Experimental Setup

This experiment compares the OCR-stage frame selection performance of Strategy A and Strategy B. Strategy A is based on traditional computer vision cues and combines D-FINE detection confidence with Laplacian clarity, whereas Strategy B is based on deep learning quality assessment and combines detection confidence with the learned quality component obtained from the HyperIQA-distilled EfficientNetV2-M model.

Because the two strategies use different scoring mechanisms, their raw frame scores are not directly comparable. Therefore, this experiment uses downstream OCR-stage performance as a unified evaluation criterion. For each of the 100 randomly sampled test videos, Strategy A and Strategy B are applied separately to select one representative frame. The parameter settings are fixed as follows: Strategy A uses a detection-confidence weight of 0.7 and a Laplacian-clarity weight of 0.3, while Strategy B uses a detection-confidence weight of 0.8 and a learned-quality-component weight of 0.2.
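The two weighted scoring rules with the fixed parameter settings can be sketched as below. The weights follow the text. The clipped ratio used to map Laplacian variance to a [0, 1] clarity value (with a normalization constant of 500) is an assumed form, since the exact normalization is defined elsewhere in the paper; the function names and candidate values are hypothetical.

```python
def strategy_a_score(det_conf, laplacian_var, w_det=0.7, w_clarity=0.3,
                     norm_const=500.0):
    """Detection confidence combined with Laplacian-based clarity."""
    clarity = min(laplacian_var / norm_const, 1.0)   # assumed normalization form
    return w_det * det_conf + w_clarity * clarity

def strategy_b_score(det_conf, quality, w_det=0.8, w_quality=0.2):
    """Detection confidence combined with the learned quality component q."""
    return w_det * det_conf + w_quality * quality

# Toy candidate frames: detector confidence, Laplacian variance, learned quality q.
candidates = [
    {"name": "f1", "det_conf": 0.90, "laplacian_var": 100.0, "quality": 0.40},
    {"name": "f2", "det_conf": 0.80, "laplacian_var": 450.0, "quality": 0.90},
    {"name": "f3", "det_conf": 0.85, "laplacian_var": 600.0, "quality": 0.30},
]
pick_a = max(candidates,
             key=lambda c: strategy_a_score(c["det_conf"], c["laplacian_var"]))
pick_b = max(candidates,
             key=lambda c: strategy_b_score(c["det_conf"], c["quality"]))
```

The toy example shows how the two strategies can select different frames from the same candidates, which is why the experiment compares them through downstream OCR results rather than through their raw scores.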

The same OCR model, PaddleOCR, is then applied to the selected frames from both strategies. This experiment evaluates only the OCR-stage performance of the selected frames and does not incorporate ASR, vision–language recognition, or final LLM-based multimodal fusion. The reported metrics include average OCR confidence, high-confidence text ratio, recognized text count, processing time, human evaluation scores, and the correlation between Strategy B scores and OCR-stage indicators.

4.4.2. Experimental Results

This section reports the OCR-stage comparison between Strategy A and Strategy B, including OCR confidence, text recognition quality, the correlation between Strategy B scores and OCR-stage indicators, processing time, and human evaluation results.

(1): OCR Confidence Performance

As shown in Figure 3, the average OCR confidence is 0.733 ± 0.118 for Strategy A and 0.745 ± 0.122 for Strategy B. Strategy B obtains a slightly higher numerical average OCR confidence than Strategy A, with a paired mean difference of +0.012 and a relative change of 1.70%. However, the 95% confidence interval of the paired difference is [−0.006, +0.032], and the corresponding p-value is 0.2037, indicating that this difference is not statistically significant.

Among the 100 test samples, Strategy B achieves higher OCR confidence than Strategy A in 34 samples, while Strategy A achieves higher OCR confidence in 44 samples. The two strategies select the same frame or obtain equivalent OCR confidence in the remaining 22 samples. Because Strategy B wins in fewer samples yet has the higher mean, its gains, where they occur, are on average larger than its losses. This result suggests that the learned quality component in Strategy B can improve OCR confidence in some cases, but its advantage is not consistent across all videos.

(2): Text Recognition Quality

As shown in Figure 4 and Figure 5, Strategy B obtains higher numerical values than Strategy A in both high-confidence text ratio and recognized text count. The high-confidence text ratio increases from 67.02 ± 15.67% for Strategy A to 69.35 ± 15.93% for Strategy B, with a paired mean difference of +2.33 percentage points. The 95% confidence interval of the difference is [−0.10, +5.06], and the corresponding p-value is 0.0791. Although this difference does not reach the conventional 0.05 significance level, it suggests a positive trend toward a higher proportion of high-confidence OCR text under Strategy B.

For recognized text count, Strategy A obtains an average of 25.56 ± 14.87 recognized text items, while Strategy B obtains 26.71 ± 15.69. The paired mean difference is +1.15, with a 95% confidence interval of [−0.93, +3.59] and a p-value of 0.3477. This result indicates that Strategy B also produces a slightly higher numerical amount of OCR text evidence, although the difference is not statistically significant.

Overall, these results indicate that Strategy B tends to select frames with slightly richer and more reliable OCR text evidence, especially in terms of the high-confidence text ratio. However, the relatively small paired differences also suggest that the benefit of Strategy B at the OCR stage is moderate rather than decisive.

(3): Correlation Between Strategy B Scores and OCR-Stage Indicators

Since Strategy B includes a learned quality component q obtained from the HyperIQA-distilled model, this analysis examines the relationship between Strategy B scores and OCR-stage indicators. The purpose is not to treat q as a complete task-specific product recognizability score, but to evaluate whether it captures visual quality and readability-related information that may be useful for frame selection.

For each sample in the Experiment 2 evaluation set, we analyze the correlations between the learned quality component q, the final Strategy B score S_{B}, and four OCR-stage indicators: average OCR confidence, high-confidence text ratio, recognized text count, and recognized character count. The final Strategy B score S_{B} is the composite score actually used for frame selection. Pearson and Spearman correlations are reported with their corresponding p-values in Table 9.
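For reference, the two correlation coefficients can be computed as in the sketch below, where Spearman's coefficient is the Pearson correlation of average ranks. The sketch omits the p-values, and the toy data are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation: linear association between raw values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data: perfectly monotonic but non-linear, with one extreme value.
quality = [1, 2, 3, 4, 5]
char_count = [1, 4, 9, 16, 100]
r_pearson = pearson(quality, char_count)
r_spearman = spearman(quality, char_count)
```

Pearson is sensitive to the magnitude of extreme values, whereas Spearman depends only on rank order, so the two coefficients can disagree when a few samples dominate.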

The results show that the learned quality component q has only weak associations with confidence-related OCR indicators and recognized text count. Although q shows a significant Pearson correlation with recognized character count, the corresponding Spearman correlation is not significant, suggesting that this relationship is limited and not consistently monotonic across samples. Therefore, q should be interpreted as a visual quality and readability-related cue rather than a standalone task-specific product recognizability score.

The final Strategy B score S_{B} shows significant positive correlations with recognized text count and recognized character count, while its correlations with confidence-related OCR indicators remain weak. This pattern suggests that S_{B} is more closely related to the amount of OCR-extractable textual evidence than to OCR confidence itself. Therefore, the correlation results provide limited but useful support for using the composite Strategy B score as a practical frame selection signal. The final impact of this scoring strategy is evaluated in the end-to-end recognition experiment.

(4): Time Cost

As shown in Figure 6, Strategy B requires a longer processing time than Strategy A. The average processing time increases from 81.19 ± 43.31 s for Strategy A to 103.72 ± 52.03 s for Strategy B, with a paired mean difference of +22.53 s. The 95% confidence interval of the difference is [+19.65, +25.54] s, and the corresponding p-value is 0.0001, indicating a statistically significant increase in processing time. This result shows that the EfficientNetV2-M-based quality assessment branch introduces additional computational overhead compared with the Laplacian-clarity-based scoring strategy.

(5): Human Evaluation Results

To complement the automatic OCR-stage evaluation, this study further conducts human evaluation on the same 100 samples. The purpose of this evaluation is to assess the visual quality and text readability of the frames selected by Strategy A and Strategy B from a human perceptual perspective. The results are shown in Table 10.

In the human evaluation, Strategy B receives higher scores than Strategy A in 37.0% of the samples (37/100), while Strategy A receives higher scores in 15.0% of the samples (15/100). The remaining 48.0% of the samples (48/100) obtain comparable scores. These results indicate that Strategy B is more consistent with human perception of visual frame quality in a larger proportion of samples. Together with the automatic OCR-stage results, the human evaluation suggests that Strategy B tends to select frames with better perceived visual quality, while its OCR-stage advantage remains moderate.

4.4.3. Results Analysis

The experimental results show that Strategy B introduces a different frame quality assessment mechanism from Strategy A and provides useful complementary observations at the OCR stage.

(1): OCR-Stage Performance and Score Correlation

Strategy B achieves slightly higher numerical values in average OCR confidence, high-confidence text ratio, and recognized text count. Among these metrics, the high-confidence text ratio shows the most evident positive tendency, suggesting that Strategy B can help select frames containing more reliable OCR text evidence in some cases. However, the paired statistical tests indicate that these OCR-stage improvements are moderate and not uniformly significant. This suggests that global visual quality and OCR-oriented text readability are related but not equivalent: a frame with higher overall visual quality does not always contain the most complete or most readable product text.

The correlation analysis is consistent with this interpretation. The learned quality component q shows only limited associations with OCR-stage indicators, while the final Strategy B score S_{B} is more closely associated with text-volume indicators. This suggests that Strategy B should be interpreted as a quality-aware composite scoring strategy rather than as a direct OCR-success predictor. In other words, its value is not that the learned quality component alone fully predicts OCR success, but that the composite scoring function can help identify frames with richer textual evidence for subsequent multimodal product recognition.

(2): Sample-Level Variation and Processing Cost

The sample-level comparison shows that the effect of Strategy B is not uniform across all videos. For example, for product 6956346649349, the OCR confidence increases from 0.403 under Strategy A to 0.865 under Strategy B, indicating that the quality assessment model helps select a more text-readable frame in this case. In contrast, for product 6953689004187, the OCR confidence decreases from 0.666 under Strategy A to 0.365 under Strategy B. These cases illustrate that quality-based frame selection depends not only on global visual quality, but also on whether the selected frame preserves task-relevant textual evidence.

Strategy B also introduces additional computational cost. Compared with Strategy A, Strategy B increases the average processing time by 22.53 s, and this increase is statistically significant. This indicates that the deep quality assessment branch improves the richness of frame quality modeling, but also increases the computational overhead of the frame selection stage.

(3): Human Evaluation and Overall Interpretation

The human evaluation provides additional evidence from the perspective of subjective visual quality. Strategy B improves the average quality score from 3.31 to 3.52 and is rated higher than Strategy A in a larger proportion of samples. This indicates that Strategy B is more consistent with human perception of visual frame quality in many cases, even though the corresponding OCR-stage gains remain moderate.

Overall, the OCR-stage automatic evaluation, correlation analysis, and human evaluation suggest that Strategy B provides a quality-aware frame selection strategy with better perceived visual quality and moderate OCR text-evidence gains, while also introducing higher processing time. This experiment provides an intermediate OCR-stage analysis of frame quality and text readability. The impact of the two frame selection strategies on final structured product recognition is evaluated separately in the end-to-end recognition experiment.

4.5. Experiment 3: End-to-End Recognition Accuracy Comparison

4.5.1. Experimental Setup

This experiment provides the end-to-end validation of the proposed framework. It evaluates how different visual input strategies affect the final product recognition accuracy within the complete multimodal recognition pipeline. The evaluation is conducted on a 200-video test set. All methods are evaluated using the same ground-truth annotations and the same BGE-M3-based field-level semantic matching protocol.

The compared strategies are as follows:

Baseline: Directly use the last frame of the video for recognition while retaining the proposed OCR/ASR/VLM fusion pipeline;
External VLM Baseline: Use Qwen3-VL-30B-A3B-Instruct as an external video-language baseline that directly takes the original video as input and predicts the structured product fields;
Strategy A: Perform end-to-end recognition using the frame selection Strategy A analyzed in Experiment 2, with the 0.7/0.3 weight setting and the Laplacian normalization constant of 500 determined by the sensitivity analysis;
Strategy B: Perform end-to-end recognition using the frame selection Strategy B analyzed in Experiment 2, with the 0.8/0.2 weight setting determined by the sensitivity analysis.

For the Baseline, Strategy A, and Strategy B, the complete experimental pipeline includes: frame selection → OCR recognition on the selected frame using PaddleOCR → visual-language recognition on the selected frame using Qwen3-VL-30B-A3B-Instruct → ASR transcription on the entire audio stream using FunASR → multimodal fusion of the structured evidence from OCR, ASR, and VLM using Qwen3.5-27B with deterministic decoding → field-level semantic similarity matching between the predicted outputs and the ground-truth annotations using BGE-M3.

The ASR branch processes the full audio stream and is not affected by the visual frame selection strategy. Unlike the Baseline, Strategy A, and Strategy B, the external VLM baseline does not use the proposed intermediate evidence extraction and fusion modules.

4.5.2. Sensitivity Analysis of Frame-Scoring Parameters

To examine how the frame-scoring parameters influence the final structured product recognition performance, this study conducts an end-to-end sensitivity analysis using the same 200-video test set as the main recognition evaluation. Two types of parameters are analyzed: the weight ratio between detection confidence and the quality-related score, and the normalization constant used in the Laplacian-based clarity score. The evaluation is based on the final recognition metrics, including Perfect Match Rate, field-level accuracy, and Semantic Similarity. The purpose of this analysis is not to claim a globally optimal parameter configuration, but to select, on the basis of the end-to-end recognition results, the frame-scoring parameter settings used throughout the experiments.

(1): Weight Sensitivity

Table 11 and Table 12 report the end-to-end recognition performance of Strategy A and Strategy B under different weight configurations. In each weight setting, the first value denotes the weight of detection confidence, while the second value denotes the weight of the corresponding quality-related score. For Strategy A, the quality-related score is the Laplacian clarity score. For Strategy B, it is the quality score output by the EfficientNetV2-M model.

As shown in Table 11, Strategy A achieves its best end-to-end result under the 0.7/0.3 configuration, with a Perfect Match Rate of 0.725 and a Semantic Similarity of 0.792. When the detection-confidence weight is increased to 0.9, the Perfect Match Rate decreases to 0.665, suggesting that relying too heavily on detection confidence may reduce the contribution of local clarity to downstream recognition. Therefore, the 0.7/0.3 configuration is selected for Strategy A.

As shown in Table 12, Strategy B achieves its best overall end-to-end performance under the 0.8/0.2 configuration, with a Perfect Match Rate of 0.775 and a Semantic Similarity of 0.802. Compared with the 0.7/0.3 configuration, the 0.8/0.2 setting further improves the Perfect Match Rate from 0.755 to 0.775 while maintaining the same Product Name Recognition Accuracy and Category Recognition Accuracy. This result indicates that, for Strategy B, assigning a slightly larger weight to detection confidence while retaining the EfficientNetV2-M quality score as a quality-aware cue provides a better balance for the final structured recognition task. Therefore, the 0.8/0.2 configuration is selected for Strategy B.

(2): Sensitivity of the Laplacian Normalization Constant

Since Strategy A uses the Laplacian clarity score as a hand-crafted image-quality cue, this study further evaluates the influence of the Laplacian normalization constant on the final end-to-end recognition performance. The results are shown in Table 13. The same 200-video test set is used for all configurations.

Table 13 shows that the normalization constant of 500 provides the best overall end-to-end performance among the tested settings for Strategy A, achieving a Perfect Match Rate of 0.725 and a Semantic Similarity of 0.792. In contrast, both smaller and larger normalization constants lead to lower Perfect Match Rates and lower Product Name Recognition Accuracy. For example, increasing the constant to 600 reduces the Perfect Match Rate to 0.665 and the Product Name Recognition Accuracy to 0.790. These results indicate that the Laplacian normalization setting affects the contribution of local clarity to frame scoring, and the value of 500 provides a stable balance between local clarity and detection confidence in the end-to-end recognition task.

Overall, the final frame-scoring parameters are chosen according to the end-to-end recognition results of the sensitivity analysis. Strategy A uses the 0.7/0.3 weight configuration together with a Laplacian normalization constant of 500, while Strategy B uses the 0.8/0.2 weight configuration. These settings provide the strongest or most stable end-to-end performance among the tested configurations and are therefore used throughout the experiments.
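As an illustration of how these settings enter frame scoring, the following minimal sketch assumes that each frame score is a weighted sum of the detection confidence and a quality term in [0, 1], with the detection weight listed first (0.7/0.3 and 0.8/0.2). It further assumes that the Laplacian variance is normalized by dividing by the constant and clipping at 1. The clipping rule, the function names, and the example values are assumptions made for illustration, not the exact implementation.

```python
def normalized_laplacian(lap_var: float, norm_const: float = 500.0) -> float:
    """Map a Laplacian variance to [0, 1] (assumed rule: divide, then clip)."""
    return min(max(lap_var / norm_const, 0.0), 1.0)


def frame_score(det_conf: float, quality: float, w_det: float, w_quality: float) -> float:
    """Weighted sum of detection confidence and a quality term in [0, 1]."""
    return w_det * det_conf + w_quality * quality


def score_strategy_a(det_conf: float, lap_var: float) -> float:
    """Strategy A: 0.7/0.3 weights with a Laplacian normalization constant of 500."""
    return frame_score(det_conf, normalized_laplacian(lap_var, 500.0), 0.7, 0.3)


def score_strategy_b(det_conf: float, learned_quality: float) -> float:
    """Strategy B: 0.8/0.2 weights with a learned quality score assumed to lie in [0, 1]."""
    return frame_score(det_conf, learned_quality, 0.8, 0.2)


def select_keyframe(frames, scorer) -> int:
    """Return the index of the highest-scoring frame."""
    return max(range(len(frames)), key=lambda i: scorer(*frames[i]))


# Hypothetical (detection confidence, Laplacian variance) pairs for three frames.
frames = [(0.95, 100.0), (0.90, 450.0), (0.60, 900.0)]
best = select_keyframe(frames, score_strategy_a)
print(best, round(score_strategy_a(*frames[best]), 3))
```

In this toy example the second frame is selected: its detection confidence is slightly lower than that of the first frame, but its much higher clarity term outweighs the difference under the 0.7/0.3 weighting.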

4.5.3. Experimental Results

Table 14 summarizes the input settings and module usage of the compared methods. Table 15 reports the end-to-end recognition performance of different methods.

To further assess whether the difference between Strategy A and Strategy B is stable at the sample level, a paired statistical analysis is conducted on the Perfect Match results and Semantic Similarity scores. In Table 16, A+/B− denotes the samples where Strategy A is correct and Strategy B is incorrect, whereas A−/B+ denotes the samples where Strategy A is incorrect and Strategy B is correct.

4.5.4. Result Analysis

(1): Overall Performance Analysis

As shown in Table 15, the proposed keyframe selection and multimodal fusion framework substantially improves the end-to-end structured product recognition performance compared with the last-frame baseline. The Baseline obtains a Perfect Match Rate of 0.609 and a Semantic Similarity of 0.697. After applying the proposed D-FINE-based keyframe selection framework, Strategy A improves the Perfect Match Rate to 0.725 and the Semantic Similarity to 0.792, while Strategy B further improves the Perfect Match Rate to 0.775 and the Semantic Similarity to 0.802. These results indicate that recognition-oriented keyframe selection is effective for improving the quality of visual evidence used in the downstream multimodal recognition pipeline.

At the field level, Strategy B achieves the highest brand recognition accuracy among all compared methods, reaching 0.795, and improves Product Name Recognition Accuracy from 0.818 in the Baseline to 0.895. Although the external VLM baseline obtains the highest Product Name Recognition Accuracy and Category Recognition Accuracy, Strategy B achieves the highest Perfect Match Rate, indicating better overall consistency across the three structured fields.

Among the proposed frame selection strategies, Strategy B achieves the highest Perfect Match Rate and Semantic Similarity. Compared with Strategy A, Strategy B improves the Perfect Match Rate by 0.050, Product Name Recognition Accuracy by 0.025, and Semantic Similarity by 0.010. Compared with the last-frame baseline, Strategy B improves the Perfect Match Rate by 0.166, Product Name Recognition Accuracy by 0.077, and Semantic Similarity by 0.105. This suggests that the quality-aware keyframe selection strategy can provide more effective visual evidence for final structured product recognition when combined with OCR, ASR, VLM, and multimodal fusion.

(2): Comparison with the External VLM Baseline

The Qwen3-VL external video baseline provides a strong VLM comparison. It achieves the highest Product Name Recognition Accuracy and Category Recognition Accuracy among all methods, reaching 0.920 and 0.980, respectively. This result indicates that native video-language models have strong potential for direct product name and category understanding from video inputs. However, its brand recognition accuracy is 0.725 and its Perfect Match Rate is 0.705, both lower than those of Strategy B. This suggests that direct video-language recognition can perform well on semantically salient fields such as product name and category, but may still be less reliable when all structured fields, especially brand information, need to be jointly matched.

This comparison shows that direct video-language recognition and the proposed decomposed framework have different strengths. The Qwen3-VL external video baseline can exploit video-level visual-language priors and performs strongly on product name and category prediction. In contrast, the proposed framework explicitly preserves intermediate OCR, ASR, selected-frame, and visual-semantic evidence, which is beneficial for structured field-level recognition and overall field consistency, especially for brand-sensitive product identification. Therefore, the external baseline results support the value of the proposed decomposed and evidence-aware recognition pipeline, while also showing the potential of native video-language models as a strong future direction.

(3): Paired Statistical Comparison Between Strategy A and Strategy B

The paired statistical analysis further evaluates the difference between Strategy A and Strategy B in Perfect Match Rate. As shown in Table 16, there are 15 samples where Strategy A is incorrect but Strategy B is correct, whereas there are 5 samples where Strategy A is correct but Strategy B is incorrect. The McNemar test gives a p-value of 0.0414, indicating that the improvement of Strategy B over Strategy A in Perfect Match Rate is statistically significant at the 0.05 level.
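The reported p-value can be reproduced from the two discordant counts alone. The sketch below assumes the exact (binomial) two-sided form of the McNemar test; the text does not state which variant was used, so this choice is an assumption, although it matches the reported value.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant pair falls on either side
    with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# b = A correct / B incorrect, c = A incorrect / B correct (Table 16).
p_value = mcnemar_exact(5, 15)
print(f"{p_value:.4f}")  # prints 0.0414
```

Only the 20 discordant videos contribute to the test; the videos on which both strategies agree do not affect the p-value.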

The paired bootstrap confidence interval of the Perfect Match Rate difference is [0.005, 0.095], which lies entirely above zero and further supports the stability of the improvement. For Semantic Similarity, Strategy B obtains a higher numerical value than Strategy A, with a mean difference of +0.0104. However, the 95% confidence interval is [−0.0068, 0.0290], which crosses zero. Therefore, the Semantic Similarity improvement should be interpreted as a numerical advantage rather than a statistically conclusive difference.
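A paired bootstrap of this kind can be sketched as follows. The per-video outcome vectors are reconstructed from the reported counts on the 200-video set (145 and 155 perfect matches for Strategies A and B, with 5 and 15 discordant videos), which implies 140 videos where both are correct and 40 where both are wrong. The percentile method, the number of resamples, and the random seed are assumptions, so the resulting interval is expected to be close to, but not necessarily identical to, the reported one.

```python
import random


def paired_bootstrap_ci(a, b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(b) - mean(a), resampling videos in pairs."""
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample video indices
        diffs.append(sum(b[i] - a[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Outcomes reconstructed from counts: both correct, A only, B only, both wrong.
a = [1] * 140 + [1] * 5 + [0] * 15 + [0] * 40
b = [1] * 140 + [0] * 5 + [1] * 15 + [0] * 40

lo, hi = paired_bootstrap_ci(a, b)
print(f"difference = {(sum(b) - sum(a)) / len(a):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling the same video index for both strategies preserves the pairing, which is what makes the interval narrower than one obtained by resampling each strategy independently.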

Overall, the end-to-end results show that Strategy B achieves the strongest final structured recognition performance among the compared methods, particularly in terms of Perfect Match Rate. Compared with the external VLM baseline, Strategy B provides better overall field-level consistency, although Qwen3-VL performs better on product name and category recognition. Therefore, Strategy B is adopted as the final quality-aware keyframe selection strategy in the proposed framework.

4.5.5. End-to-End Error Analysis

To further interpret the performance difference between Strategy A and Strategy B, an end-to-end error analysis is conducted based on the sample-level prediction results. Since Perfect Match Rate requires all structured fields to be correctly recognized, this analysis focuses on the error fields in the disagreement cases between the two strategies, as well as the joint-failure cases where both strategies make incorrect predictions.

Table 17 summarizes the field-level error distribution. Here, a multiple-field error indicates that two or three structured fields are incorrectly predicted at the same time. For the cases where Strategy A is correct but Strategy B is wrong, most errors of Strategy B are brand-only errors. In contrast, for the cases where Strategy A is wrong but Strategy B is correct, the errors of Strategy A are distributed across brand, product name, category, and multiple-field errors. This indicates that Strategy B does not merely improve a single field, but can correct different types of errors made by Strategy A in a larger number of samples.
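The error categories used in this analysis can be expressed as a small classification rule. The sketch below assumes that each prediction and ground-truth record is a dictionary with the three fields and that field correctness is decided by exact equality; the field keys and the example records are hypothetical.

```python
from collections import Counter

FIELDS = ("brand", "product_name", "category")


def error_type(pred: dict, truth: dict) -> str:
    """Label a prediction as correct, a single-field error, or a multiple-field error."""
    wrong = [f for f in FIELDS if pred.get(f) != truth.get(f)]
    if not wrong:
        return "correct"
    if len(wrong) == 1:
        return f"{wrong[0]}-only"
    return "multiple-field"  # two or three fields are wrong at the same time


# Hypothetical records, not taken from the dataset.
truth = {"brand": "BrandX", "product_name": "Oat Milk", "category": "beverages"}
preds = [
    {"brand": "BrandY", "product_name": "Oat Milk", "category": "beverages"},
    {"brand": "unknown", "product_name": "Milk", "category": "beverages"},
    {"brand": "BrandX", "product_name": "Oat Milk", "category": "beverages"},
]
distribution = Counter(error_type(p, truth) for p in preds)
print(dict(distribution))
```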

The disagreement cases further explain Strategy B's improvement in Perfect Match Rate. In the A-correct/B-wrong group, Strategy B has only 5 failure cases, and most of them are caused by brand recognition errors or insufficient evidence in the selected frame. For example, in sample 6933996025563, Strategy B confuses the target brand with a visually similar brand name, while Strategy A preserves the correct brand information. In sample 6973142375302, Strategy B fails to recover all three structured fields, suggesting that the selected frame or fused evidence is insufficient for reliable recognition.

In contrast, in the A-wrong/B-correct group, Strategy B correctly recognizes 15 samples for which Strategy A does not achieve a perfect match. For example, in sample 6975477310676, Strategy A incorrectly predicts the brand as a different electronics brand, whereas Strategy B correctly recovers the target brand and product name. In sample 6975178290086, Strategy A only recognizes a generic product category, while Strategy B successfully recovers both the brand and the specific product name. These cases suggest that the quality-aware frame selection strategy can preserve more task-relevant visual evidence in cases where the traditional clarity-based strategy fails.

Table 18 lists representative cases from the main disagreement and failure groups. The cases show that Strategy B is especially helpful when Strategy A selects frames with insufficient brand or product-name evidence. However, the joint-failure cases also indicate that both strategies can still fail when the selected keyframe does not contain sufficient front-facing packaging information, when the brand text is visually ambiguous, or when the relevant product information is distributed across multiple moments in the video.

Overall, the error analysis is consistent with the paired statistical comparison. Strategy B corrects more samples than it degrades, especially in cases involving brand or product-name recovery, which contributes to its higher Perfect Match Rate. At the same time, the remaining joint-failure cases show that single-keyframe-based recognition still has limitations when key product information is not visible in the selected frame. This suggests that future work may further explore stronger temporal evidence aggregation and multi-frame reasoning to address cases where product information is distributed across the video.

4.6. Experiment 4: Ablation Study on Multimodal Evidence Fusion

4.6.1. Experimental Setup

To quantify the contribution of each modality to end-to-end structured product recognition, an ablation study is conducted on the same 200-video evaluation set used in Experiment 3. For each sample, the manually verified ground-truth record contains three core fields: brand, product name, and category. The evaluation follows the same protocol as in Experiment 3, including Perfect Match Rate, field-level accuracies, and semantic similarity.
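The exact-match part of this protocol can be sketched as follows. The sketch assumes that a field counts as correct when the predicted and ground-truth strings are equal after trimming whitespace and case folding; this normalization rule, the field keys, and the example records are assumptions. Semantic similarity is omitted because it depends on a text-embedding model not specified in this section.

```python
FIELDS = ("brand", "product_name", "category")


def _norm(value: str) -> str:
    """Assumed normalization: trim whitespace and ignore case."""
    return value.strip().casefold()


def evaluate(preds, truths):
    """Return field-level accuracies and the Perfect Match Rate."""
    n = len(truths)
    hits = [
        {f: _norm(p[f]) == _norm(t[f]) for f in FIELDS}
        for p, t in zip(preds, truths)
    ]
    report = {f"{f}_accuracy": sum(h[f] for h in hits) / n for f in FIELDS}
    # A perfect match requires all three fields to be correct for the same video.
    report["perfect_match_rate"] = sum(all(h.values()) for h in hits) / n
    return report


# Hypothetical records, not taken from the dataset.
truths = [
    {"brand": "Acme", "product_name": "Oat Milk", "category": "beverages"},
    {"brand": "Acme", "product_name": "Oat Milk", "category": "beverages"},
    {"brand": "Acme", "product_name": "Oat Milk", "category": "beverages"},
    {"brand": "Acme", "product_name": "Oat Milk", "category": "beverages"},
]
preds = [
    {"brand": "Acme", "product_name": "Oat Milk", "category": "beverages"},
    {"brand": "Apex", "product_name": "Oat Milk", "category": "beverages"},
    {"brand": "acme ", "product_name": "oat milk", "category": "beverages"},
    {"brand": "Acme", "product_name": "Milk", "category": "dairy products"},
]
report = evaluate(preds, truths)
print(report)
```

Because the Perfect Match Rate is a joint criterion, it can never exceed the lowest field-level accuracy, which is why a method with the best product name and category accuracy may still obtain a lower Perfect Match Rate.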

In this experiment, the VLM branch is implemented using the open-source Qwen3-VL-30B-A3B-Instruct model. To examine whether the contribution of each modality is robust to the frame selection strategy, the ablation study is performed under both Strategy A and Strategy B. Seven modality configurations are compared for each strategy:

OCR-only: uses only textual information extracted from the selected keyframe.
ASR-only: uses only the speech transcript extracted from the full audio stream of the video.
VLM-only: uses only the visual-semantic description generated by Qwen3-VL-30B-A3B-Instruct.
OCR + ASR: combines visual text and speech information.
OCR + VLM: combines visual text and visual-semantic information.
ASR + VLM: combines speech information and visual-semantic information.
Full Framework: integrates OCR, ASR, and VLM evidence for final product recognition.

All configurations use the same evidence structuring and final decision protocol with greedy decoding. Therefore, within each frame selection strategy, the observed differences mainly reflect the effect of changing the available modality evidence rather than changes in the downstream fusion procedure.
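The seven configurations correspond to all non-empty subsets of the three modalities, which can be enumerated as in the sketch below. The `build_evidence` helper and the evidence strings are hypothetical and only illustrate how evidence from excluded modalities would be withheld from the fusion step.

```python
from itertools import combinations

MODALITIES = ("OCR", "ASR", "VLM")

# All non-empty subsets, ordered from unimodal to the full framework.
CONFIGS = [c for r in range(1, len(MODALITIES) + 1) for c in combinations(MODALITIES, r)]


def build_evidence(evidence: dict, config: tuple) -> dict:
    """Keep only the evidence of the modalities enabled in this configuration."""
    return {m: evidence[m] for m in config}


# Hypothetical structured evidence for one video.
evidence = {"OCR": "packaging text", "ASR": "speech transcript", "VLM": "visual description"}
for config in CONFIGS:
    print(" + ".join(config), "->", sorted(build_evidence(evidence, config)))
```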

4.6.2. Experimental Results

The ablation results are summarized in Table 19. Overall, the Full Framework achieves the best Perfect Match Rate and semantic similarity under both Strategy A and Strategy B.

4.6.3. Result Analysis

The results first confirm the central role of OCR in the proposed framework. Under Strategy A, OCR-only achieves a Perfect Match Rate of 0.650, which is substantially higher than ASR-only and VLM-only. A similar pattern is observed under Strategy B, where OCR-only reaches 0.670. This indicates that visual text on product packages, labels, and promotional materials remains the most direct evidence source for fine-grained product identity recognition in live-streaming scenarios. The high category recognition accuracy of OCR-only under both strategies also suggests that textual cues are frequently sufficient for identifying broad product types.

The unimodal results also reveal the limitations of relying on speech or visual semantics alone. ASR-only obtains a Perfect Match Rate of 0.270 under both strategies. Although speech transcripts can provide useful product descriptions, not all videos contain recognizable or product-relevant speech content. In some samples, the audio stream contains only background music or non-speech sounds, while in others it includes background noise, promotional expressions, conversational context, or speech that does not explicitly mention the target product. In addition, streamers do not always mention the complete brand or exact product name. Therefore, ASR is more suitable as a complementary source rather than a standalone recognition channel.

VLM-only shows stronger dependence on the selected frame. Under Strategy A, its Perfect Match Rate is only 0.145, whereas under Strategy B it increases to 0.450. This difference suggests that Strategy B selects frames that are more favorable for visual-semantic interpretation by the VLM. Nevertheless, VLM-only remains below OCR-only in both strategies, indicating that visual-semantic descriptions alone are still insufficient for reliable structured product identification.

For dual-modality configurations, adding ASR to OCR consistently improves recognition performance. Under Strategy A, OCR + ASR improves the Perfect Match Rate from 0.650 to 0.680, while under Strategy B it improves the Perfect Match Rate from 0.670 to 0.735. The gain is especially visible in Product Name Recognition Accuracy, which increases from 0.785 to 0.815 under Strategy A and from 0.775 to 0.855 under Strategy B. This supports the complementary value of speech information, particularly when the product name is partially occluded, blurred, or not fully captured by OCR.

The contribution of the VLM branch is more dependent on the frame selection strategy and the accompanying modalities. Under Strategy A, OCR + VLM slightly improves the Perfect Match Rate over OCR-only and achieves the highest category recognition accuracy among the dual-modality settings. This indicates that visual-semantic cues such as product appearance, packaging layout, and category-specific visual features can help refine category-level understanding. Under Strategy B, VLM-only and ASR + VLM perform much better than their counterparts under Strategy A, suggesting that the frames selected by Strategy B are more compatible with visual-semantic reasoning. However, OCR + VLM is still weaker than OCR + ASR under Strategy B, implying that VLM evidence may introduce ambiguity when it is not sufficiently constrained by textual or speech evidence.

When all three modalities are integrated, the Full Framework achieves the best overall performance under both frame selection strategies. Under Strategy A, the Full Framework improves the Perfect Match Rate from 0.650 for OCR-only to 0.725, with corresponding improvements in brand recognition accuracy, product name recognition accuracy, category recognition accuracy, and semantic similarity. Under Strategy B, the Full Framework further increases the Perfect Match Rate to 0.775 and achieves the highest semantic similarity of 0.802. Compared with OCR-only under the same strategy, the Full Framework improves the Perfect Match Rate by 0.075 under Strategy A and by 0.105 under Strategy B. These results indicate that OCR, ASR, and VLM provide complementary evidence for structured product recognition.

Overall, the ablation study supports two conclusions. First, OCR serves as the strongest single modality because product identity in live-streaming videos is often explicitly encoded in visual text. Second, ASR and VLM provide complementary information from speech and visual-semantic perspectives, respectively, and their combination with OCR improves both exact-field matching and semantic consistency. The fact that the Full Framework achieves the best overall performance under both Strategy A and Strategy B further suggests that the benefit of multimodal evidence fusion is not limited to a single frame selection strategy.

4.7. Efficiency and Cost Analysis

To evaluate the practical deployment cost of the proposed framework, this study further analyzes the API usage, monetary cost, and empirical API latency of the multimodal recognition pipeline. All remote Qwen-series model calls in this analysis are made through Alibaba Cloud Model Studio (Bailian). Instead of reporting individual API call logs, this analysis uses aggregated telemetry statistics. The cost analysis is based on the remote model calls involved in the full pipeline, while the latency analysis focuses only on remote API calls. Local modules, including frame selection, OCR execution, ASR transcription, cache reading, and internal post-processing, are excluded from the API latency statistics.

A complete API-based run contains four remote model calls by design: OCR evidence structuring, ASR evidence structuring, Qwen3-VL visual-semantic recognition, and final multimodal fusion. The three pre-fusion branches are executed in parallel, and their outputs are then passed to the final fusion LLM. Therefore, the parallel API latency is estimated as:

T_{\text{API-parallel}} = \max(T_{\text{OCR-LLM}}, T_{\text{ASR-LLM}}, T_{\text{VLM}}) + T_{\text{fusion}}.

For comparison, the serial API latency is computed as:

T_{\text{API-serial}} = T_{\text{OCR-LLM}} + T_{\text{ASR-LLM}} + T_{\text{VLM}} + T_{\text{fusion}}.
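The two estimates can be written as per-run functions. The sketch below uses hypothetical latency values and assumes that the reported averages are obtained by applying the formulas to each run and then averaging. Under that assumption, the mean parallel latency is generally larger than the value obtained by plugging the mean latency of each call into the formula, because the slowest pre-fusion branch can differ from run to run.

```python
def parallel_latency(t_ocr_llm: float, t_asr_llm: float, t_vlm: float, t_fusion: float) -> float:
    """Pre-fusion branches run concurrently; fusion waits for the slowest one."""
    return max(t_ocr_llm, t_asr_llm, t_vlm) + t_fusion


def serial_latency(t_ocr_llm: float, t_asr_llm: float, t_vlm: float, t_fusion: float) -> float:
    """All four remote calls are executed one after another."""
    return t_ocr_llm + t_asr_llm + t_vlm + t_fusion


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical per-run latencies in seconds: (OCR-LLM, ASR-LLM, VLM, fusion).
runs = [(20.0, 40.0, 2.0, 35.0), (50.0, 30.0, 3.0, 40.0)]

mean_parallel = mean(parallel_latency(*r) for r in runs)
mean_serial = mean(serial_latency(*r) for r in runs)

# Plugging per-call means into the formula underestimates the mean parallel latency.
per_call_means = [mean(r[i] for r in runs) for i in range(4)]
parallel_of_means = parallel_latency(*per_call_means)

print(mean_parallel, mean_serial, parallel_of_means)
```

In this toy example the OCR branch is the slowest in one run and the ASR branch in the other, so the per-run average (82.5 s) exceeds the value computed from per-call means (72.5 s).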

The API pricing configuration used in this study is shown in Table 20. The Qwen3.5-27B model is used for evidence structuring and final fusion, while Qwen3-VL-30B-A3B-Instruct is used for visual-semantic recognition.

Table 21 summarizes the average API usage and estimated monetary cost. The reported token usage is computed from successful full-pipeline API runs after excluding cache-only records. Based on the recorded token usage, the full pipeline uses an average of 3069.4 input tokens and 9253.1 output tokens per run, resulting in an estimated cost of 0.0465 CNY per video and 46.46 CNY per 1000 videos. This indicates that, although the framework involves multiple LLM/VLM calls, the monetary cost remains low at the per-video level under the pricing configuration used in this study.
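The cost estimate follows from token counts and per-token prices. The sketch below assumes that prices are quoted per million tokens, separately for input and output and per model. All prices and token counts in the example are placeholders chosen only to make the arithmetic easy to follow; they are not the values in Table 20 or Table 21.

```python
def run_cost(calls, prices_per_million):
    """Sum the cost of all remote calls in one pipeline run.

    calls: iterable of (model, input_tokens, output_tokens)
    prices_per_million: {model: (input_price, output_price)} per 1,000,000 tokens
    """
    total = 0.0
    for model, input_tokens, output_tokens in calls:
        price_in, price_out = prices_per_million[model]
        total += input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
    return total


# Placeholder prices (currency units per million tokens), NOT the study's pricing.
PLACEHOLDER_PRICES = {"llm": (1.0, 4.0), "vlm": (2.0, 8.0)}

# Placeholder token usage for one run, NOT the measured usage.
calls = [("llm", 1000, 2000), ("vlm", 500, 100)]

cost_per_video = run_cost(calls, PLACEHOLDER_PRICES)
cost_per_1000_videos = 1000 * cost_per_video
print(round(cost_per_video, 4), round(cost_per_1000_videos, 2))
```

Because output tokens are typically priced higher than input tokens, shortening the structured outputs is a direct way to lower the per-video cost when output tokens dominate usage.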

The API latency statistics are shown in Table 22. Runtime is measured as wall-clock API latency. For remote calls through Alibaba Cloud Model Studio (Bailian), this latency includes network transmission, request scheduling, provider-side inference, and response return time; therefore, it should be interpreted as empirical API latency rather than pure model inference time. P95 denotes the 95th percentile latency, that is, the latency that 95% of API calls do not exceed.
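The summary statistics used here can be computed from a list of per-call latencies as in the sketch below. It assumes the nearest-rank definition of a percentile; the definition used for Table 22 is not stated, and the latency samples are hypothetical.

```python
import math
from statistics import mean, median


def percentile_nearest_rank(values, p: int) -> float:
    """Smallest observed value such that at least p% of observations are <= it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in seconds with a long right tail.
latencies = [1.5, 1.6, 1.7, 1.8, 1.8, 1.9, 2.0, 2.1, 2.4, 22.0]

summary = {
    "mean": mean(latencies),
    "median": median(latencies),
    "p95": percentile_nearest_rank(latencies, 95),
}
print(summary)
```

A mean well above the median, as in this toy sample, indicates that a small number of slow calls dominate the average, which is why the median and P95 are reported alongside the mean.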

The latency results show that Qwen3-VL visual-semantic recognition is not the main API bottleneck. Its mean latency is 3.90 s and its median latency is 1.84 s, indicating that the visual-semantic recognition call is relatively fast in most cases. In contrast, the Qwen3.5-27B LLM calls used for evidence structuring and final fusion dominate the API latency. In particular, ASR evidence structuring and final fusion have mean latencies of 43.92 s and 39.89 s, respectively.

The parallel execution strategy reduces the accumulated waiting time of the three pre-fusion branches. Compared with the serial API latency of 113.55 s, the parallel API latency is reduced to 90.68 s on average, corresponding to a reduction of 22.87 s. However, the final fusion LLM remains a second-stage bottleneck because it must wait for the structured evidence from all modalities. Therefore, the current API-based implementation is more suitable for offline product information extraction or semi-real-time analysis, while real-time deployment would require further optimization.

From a deployment perspective, the current experiments report only API-based monetary cost and API latency. Because the framework is modular, the recurring API dependency can be reduced by deploying components locally when sufficient hardware resources are available. The OCR, ASR, product detection, and frame-quality assessment modules can run locally. The Qwen3-VL visual-semantic branch can be migrated from API calls to a locally deployed Qwen3-VL-30B-A3B-Instruct model. Likewise, because the Qwen3.5-27B model used in this study is an open-weight model that supports local deployment, the LLM evidence-structuring and fusion stages can be migrated from Alibaba Cloud Model Studio (Bailian) API calls to local inference. The actual local deployment cost and latency depend on the quantization strategy, inference framework, and hardware configuration. Overall, the proposed framework supports both API-based and local-deployment-oriented implementations, with a trade-off among monetary cost, hardware resources, engineering complexity, and inference latency.

Overall, the efficiency analysis shows that the proposed framework has low per-video API monetary cost, while its API latency is mainly limited by Qwen3.5-27B LLM calls. Future optimization can reduce latency by replacing evidence-structuring LLM calls with lighter models or rule-based extraction, shortening structured outputs, caching reusable intermediate results, and further optimizing the final fusion prompt and decoding strategy.

5. Conclusions

This paper presents an end-to-end framework for structured product recognition in e-commerce livestreaming. The framework combines task-oriented keyframe selection with multimodal semantic fusion, enabling the system to preserve recognition-effective visual evidence while integrating OCR, ASR, and visual-semantic cues at the decision level. Relative to conventional pipelines that either rely on fixed-position frame sampling or process video streams exhaustively, the proposed design provides a more favorable balance between recognition quality and computational efficiency.

In the system evaluation, we systematically compared the single-frame and Top-K multi-frame strategies, as well as two frame scoring strategies: Strategy A, based on detection confidence and local clarity, and Strategy B, based on detection confidence and a learned quality component. The experimental results show that:

The Top-K multi-frame fusion strategy improves OCR-stage evidence completeness, including the number of recognized OCR characters and high-confidence text, but its computational cost increases with larger K. Considering both efficiency and OCR evidence completeness, the single-frame strategy offers higher engineering feasibility in resource-constrained and practical deployment scenarios, while Top-K fusion can be regarded as an optional OCR-stage extension when richer textual evidence is required.
Both Strategy A and Strategy B substantially improve end-to-end structured product recognition over the last-frame baseline. Strategy A provides a simple and interpretable scoring mechanism, while Strategy B further incorporates a learned quality component. In the final end-to-end evaluation, Strategy B achieves the best overall performance among the compared methods, with a Perfect Match Rate of 0.775 and a Semantic Similarity of 0.802. Compared with Strategy A, Strategy B improves the Perfect Match Rate by 0.050, and the paired statistical analysis supports its advantage in video-level perfect matching. Nevertheless, Strategy A remains a lightweight and interpretable alternative for scenarios with stricter computational constraints.
The multimodal ablation study confirms the complementary value of OCR, ASR, and visual-semantic evidence. OCR serves as the strongest single modality because product identity is often explicitly encoded in packaging text, while ASR and Qwen3-VL provide complementary speech-based and visual-semantic cues. The full framework achieves the best overall performance under both Strategy A and Strategy B, indicating that the benefit of multimodal evidence fusion is not limited to a single frame selection strategy.

Compared with prior work, the main distinction of this study lies in treating frame selection as a recognition-oriented problem rather than a generic video summarization or image-quality assessment problem. The results further show that multimodal gains are not obtained merely by adding more inputs; instead, they depend on selecting frames that preserve task-effective information and on reconciling heterogeneous evidence through semantic-level fusion. From this perspective, the proposed framework offers a practically grounded solution for product recognition in dynamic livestreaming environments while also clarifying the relationship between visual evidence quality, modality complementarity, and structured recognition performance.

The efficiency and cost analysis further shows that the proposed framework has a low per-video API monetary cost under the pricing configuration used in this study. However, the current API-based latency is mainly dominated by Qwen3.5-27B LLM calls for evidence structuring and final fusion, rather than by Qwen3-VL visual-semantic recognition. The parallel execution of the three pre-fusion branches reduces accumulated waiting time, but the final fusion LLM remains a second-stage bottleneck. Therefore, the current API-based implementation is more suitable for offline product information extraction or semi-real-time analysis, while real-time deployment would require further latency optimization.

Despite these encouraging results, several limitations remain. First, although Strategy B improves final recognition performance, its learned quality component is still supervised by generic IQA pseudo-labels and does not fully represent task-specific product recognizability. Future work may explore task-aligned quality supervision, such as OCR success rate, field-level recognition correctness, or multimodal evidence completeness. Second, the dataset is constructed from livestreaming videos collected on Chinese e-commerce platforms. Accordingly, the reported results may not directly generalize to multilingual, cross-regional, or cross-platform livestreaming environments without additional validation. An important direction for future work is therefore to expand the evaluation to larger and more diverse datasets spanning additional regions, platforms, product categories, and presentation conditions. Applying the framework in such settings may also require multilingual OCR, cross-lingual ASR, and adaptation or retraining of the product detection and recognition modules. Validation on public benchmarks should additionally be conducted under compatible field-level annotation and evaluation protocols. Third, the current implementation uses Qwen3.5-27B for evidence structuring and final fusion and Qwen3-VL-30B-A3B-Instruct for visual-semantic recognition. Because the observed performance may partly depend on these specific models, future work should examine the portability of the proposed framework to other open-weight or commercially available foundation models and determine whether comparable performance can be maintained. Fourth, the current task formulation is organized at the video-product level, where each video clip is associated with one target product identity. Extending the framework to multi-product or multi-target livestreaming scenarios would require temporal product segmentation, product-instance association, and more explicit cross-modal alignment between visual regions, OCR text, and ASR evidence. Finally, although the proposed system supports modular API-based and local-deployment-oriented implementations, further engineering optimization is needed to reduce LLM latency, including lighter evidence-structuring models, shorter structured outputs, reusable evidence caching, and more efficient final fusion strategies.

In summary, the proposed keyframe selection and multimodal fusion framework achieves consistent improvements in structured product recognition while preserving interpretable intermediate evidence. The study demonstrates that recognition-oriented frame selection and structured multimodal evidence fusion are both important for robust product information extraction in complex e-commerce livestreaming scenarios.

Author Contributions

Conceptualization, Y.Z., J.S. and W.S.; methodology, Y.Z., J.S. and W.S.; software, Y.Z.; validation, Y.Z., J.S. and W.S.; formal analysis, Y.Z. and W.S.; investigation, Y.Z., J.S. and W.S.; resources, J.S. and W.S.; data curation, Y.Z. and J.S.; writing—original draft preparation, Y.Z.; writing—review and editing, J.S. and W.S.; visualization, Y.Z.; supervision, J.S. and W.S.; project administration, W.S.; funding acquisition, J.S. and W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Plan Project of Zhejiang Provincial Administration for Market Regulation (“Research on the Intelligent Definition System of CCC Products Empowered by AI in the Context of Artificial Intelligence”), Grant No. LY2026018.

Institutional Review Board Statement

Ethical review and approval were waived for this study based on Articles 2 and 9 of the Measures for Ethical Review of Science and Technology Activities (Trial) of the People’s Republic of China. According to these provisions, science and technology activities involving human participants as the objects of testing, investigation, or observation, or involving the use of personal information data, are subject to ethical review; Article 9 requires submission to a science and technology ethics committee when an ethical risk assessment determines that an activity falls within the scope of Article 2. The present study used publicly accessible e-commerce livestreaming materials solely for product recognition and product-related multimodal information extraction. The livestream hosts appearing incidentally in the source videos were neither research participants nor research subjects, and no identifiable person-related information was analyzed or used as research data. Accordingly, the study did not fall within the scope requiring formal ethical review under the cited provisions.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request and subject to data-use restrictions. The raw videos and the full evaluation benchmark are not publicly available because they contain platform interfaces, seller-generated livestreaming content, commercial product information, user comments, and potentially privacy-sensitive contextual information. The dataset is also being continuously expanded and used in an ongoing research project.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. During the preparation of this manuscript, the authors used ChatGPT (GPT-5.4 Thinking, OpenAI) and Gemini (Gemini 3.1 Pro, Google) for language editing and grammatical corrections. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Prompt Templates Used in the LLM/VLM Pipeline

This appendix provides English versions of the main prompt templates used in the LLM/VLM-based evidence structuring and multimodal fusion pipeline. The implementation uses Chinese prompts because the experimental data are collected from Chinese e-commerce livestreaming platforms. For readability and reproducibility, the prompt templates below are presented in English while preserving the target fields, evidence constraints, output schema, and fusion rules used in the implementation.

The final pipeline predicts three structured target fields: product name, brand, and category. The category field is selected from a predefined closed set, including dairy products, beverages, snacks, toys, stationery, instant food, condiments, grain and cooking oil products, alcoholic products, detergents, cleaning products, paper products, personal care products, cosmetics, maternal and infant products, health food, daily necessities, home appliances, digital accessories, sports products, pet products, and others. If the evidence is insufficient for a field, the system outputs “unknown”.
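The closed-set constraint on the category field can be illustrated with a small validation helper. This is a minimal Python sketch, not the authors' code: the English labels are taken from the list above (the implementation itself uses Chinese prompts and labels, and the list is introduced with "including", so it may not be exhaustive), and the function name `validate_category` is hypothetical.

```python
# English category labels as listed in this appendix (assumption: the real
# pipeline uses the corresponding Chinese labels).
CATEGORIES = frozenset({
    "dairy products", "beverages", "snacks", "toys", "stationery",
    "instant food", "condiments", "grain and cooking oil products",
    "alcoholic products", "detergents", "cleaning products", "paper products",
    "personal care products", "cosmetics", "maternal and infant products",
    "health food", "daily necessities", "home appliances",
    "digital accessories", "sports products", "pet products", "others",
})

UNKNOWN = "unknown"


def validate_category(value):
    """Return the label if it belongs to the closed set, else "unknown"."""
    label = (value or "").strip().lower()
    return label if label in CATEGORIES else UNKNOWN
```

A value outside the set, such as "furniture", is mapped to "unknown" rather than being passed through.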

Appendix A.1. OCR Evidence Structuring Prompt

Prompt template. You are a professional e-commerce product recognition assistant. The following text is extracted by OCR. Please extract product information strictly from the OCR text. Do not use external knowledge, do not fabricate missing information, and do not provide explanations.
Input. The input is the raw OCR text extracted from the selected product frame.
Output format. The model must output exactly three fields: product name, brand, and category. No additional explanation, reasoning process, analysis, or extra text is allowed.
Field rules. The product name should be the shortest identifiable official product name supported by the OCR text. Brand, series, model, flavor, or functional terms are retained only when they are part of the product name. The brand must be the consumer-facing brand explicitly visible in the packaging text. Company names, distributors, platform names, or brands inferred only from the product name are not allowed. The category must be selected from the predefined closed set. If the OCR evidence is insufficient for a field, the model outputs “unknown”.

Appendix A.2. ASR Evidence Structuring Prompt

Prompt template. You are a professional e-commerce product recognition expert. The following text is obtained from the product introduction video using ASR. The transcript may contain spoken expressions, repetitions, recognition errors, pauses, irrelevant content, or transcription noise. Please extract product information only from the speech transcript. Do not use external knowledge or infer missing information from common sense.
Input. The input is the FunASR transcript extracted from the product introduction video.
Output format. The model must output exactly three fields: product name, brand, and category. No additional explanation, reasoning process, source analysis, or extra text is allowed.
Field rules. The product name should be the shortest product name explicitly mentioned in, or directly supported by, the transcript. Brand, series, model, flavor, or functional terms are retained only when they are part of the official product name. The brand must be explicitly mentioned in the speech and must refer to the consumer-facing brand. Company names, distributors, platform names, or brands inferred from the product name are not allowed. The category must be selected from the predefined closed set. If a field cannot be reliably determined from the transcript, the model outputs “unknown”. Minor interpretation of ASR errors is allowed only when it is directly supported by the surrounding context.

Appendix A.3. Visual-Semantic Recognition Prompt

Prompt template. You are a professional e-commerce product recognition expert. Please observe the target product in the input image and extract key product information. If the image is a cropped product region, only judge the product in the cropped region. If a bounding box is provided, prioritize the single product inside the box and do not rely on background objects or other products.
Input. The input is the selected product image, cropped product region, or product region indicated by a bounding box.
Output format. The model must output exactly three fields: product name, brand, and category. No additional explanation, reasoning process, image-quality evaluation, or extra text is allowed.
Field rules. The model is instructed to focus on the brand logo, product name, series name, flavor, model, and category cues visible on the package. The product name should be the shortest identifiable official product name and should not include pure specifications, promotional words, channel words, or uncertain descriptions. The brand must be the consumer-facing brand visible on the package. Company names, distributors, platform names, or brands inferred only from appearance or category are not allowed. The category must be selected from the predefined closed set. For blurred, occluded, reflective, or unreadable information, the model is instructed not to force completion and to output “unknown” when the visual evidence is insufficient.

Appendix A.4. Final Multimodal Fusion Prompt

Prompt template. You are a professional e-commerce product recognition expert. You need to synthesize recognition results from multiple data sources and make the final judgment for a single product.
Input. The input consists of the structured outputs from the available evidence sources, including OCR evidence, ASR evidence, and visual-semantic evidence. The prompt also specifies how many of the three evidence sources are available for the current sample.
Output format. The model must output exactly three fields: product name, brand, and category. No additional explanation, reasoning process, source comparison, analysis, or extra text is allowed.
Field rules. The product name should be the shortest identifiable official product name. Brand, series, model, flavor, or functional terms are retained only when they are part of the official product name. The brand must be the consumer-facing brand. Company names, distributors, platform names, or brands inferred from the product name are not allowed. If both Chinese and English names are supported as official brand evidence, both may be retained. The category must be selected from the predefined closed set. If all sources are insufficient for a field, the model outputs “unknown”.
Fusion principles. The model is instructed to prefer information supported by multiple sources. When different sources conflict, the priority order is OCR packaging text, visual evidence, and then ASR speech evidence. The product name, brand, and category fields are judged independently; a clear value in one field should not be used to fabricate another field. If evidence is insufficient, the model outputs “unknown”.
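In the pipeline these principles are applied by the LLM through the prompt. Purely as an illustration of the stated rules (multi-source agreement first, then the OCR > visual > ASR priority, per-field independence, and the "unknown" fallback), a deterministic Python sketch could look as follows. The function name `fuse_fields`, the source keys, and the exact-string agreement test are assumptions introduced here, not part of the original system.

```python
from collections import Counter

FIELDS = ("product_name", "brand", "category")
PRIORITY = ("ocr", "visual", "asr")  # conflict order stated in the fusion principles
UNKNOWN = "unknown"


def fuse_fields(evidence):
    """Fuse per-source structured outputs, one field at a time.

    evidence maps a source name ("ocr", "visual", "asr") to a dict of fields.
    Missing sources or fields are treated as "unknown".
    """
    fused = {}
    for field in FIELDS:
        # Collect usable values for this field only; fields are judged independently.
        values = {}
        for source in PRIORITY:
            value = evidence.get(source, {}).get(field, UNKNOWN)
            if value and value != UNKNOWN:
                values[source] = value
        if not values:
            fused[field] = UNKNOWN
            continue
        top_value, top_count = Counter(values.values()).most_common(1)[0]
        if top_count >= 2:
            # Prefer information supported by multiple sources.
            fused[field] = top_value
        else:
            # No agreement: fall back to the highest-priority available source.
            first = next(s for s in PRIORITY if s in values)
            fused[field] = values[first]
    return fused
```

An LLM can also reconcile near-duplicates (for example, a brand given in Chinese by one source and in English by another), which this exact-match sketch does not attempt.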

Appendix A.5. External Direct-Video VLM Baseline Prompt

Prompt template. Given an e-commerce livestreaming video, identify the primary target product being displayed. Do not enumerate all objects or all products in the video. Output only a structured JSON-style result without Markdown, explanations, or reasoning.
Input. The input is the original e-commerce livestreaming video.
Output format. The output must contain three fields:
- brand
- product_name
- category

The category must be selected from the predefined closed-set category list. If the brand, product name, or category cannot be determined, the corresponding field is set to “unknown”.

Appendix A.6. Local Post-Processing

No separate LLM/VLM prompt is used for JSON repair, confidence scoring, field normalization, or fallback completion in the final pipeline. After the model outputs are returned, deterministic local post-processing is applied to normalize the three-field structured output. Missing fields are filled with “unknown”. If no usable OCR, ASR, or visual-semantic result is available, the local fallback output sets all three fields, namely product name, brand, and category, to “unknown”.
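The normalization step described above can be sketched in a few lines of Python. This is a hedged illustration only: it assumes the model response arrives as a JSON object with the keys `product_name`, `brand`, and `category` (the serialization used inside the pipeline is not stated here), and `normalize_output` is a hypothetical name.

```python
import json

FIELDS = ("product_name", "brand", "category")
UNKNOWN = "unknown"


def normalize_output(raw):
    """Normalize a raw model response into the three-field structure.

    Unparseable responses and missing or empty fields become "unknown";
    no attempt is made to repair malformed JSON.
    """
    try:
        parsed = json.loads(raw) if isinstance(raw, str) else raw
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, dict):
        parsed = {}
    result = {}
    for field in FIELDS:
        value = parsed.get(field)
        text = str(value).strip() if value is not None else ""
        result[field] = text if text else UNKNOWN
    return result
```

Calling `normalize_output(None)` reproduces the local fallback case in which no usable OCR, ASR, or visual-semantic result is available: all three fields are set to "unknown".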

References

Wangjingshe E-commerce Research Center. The Release of “2024 China Live E-Commerce Market Data Report”. 2025. Available online: https://www.100ec.cn/detail--6649904.html (accessed on 9 June 2025).
Transparency Market Research. Livestream E-Commerce Market Size & Industry Share to 2034; Technical report; Transparency Market Research: Wilmington, DE, USA, 2025.
Pe, P. TikTok Shop GMV in 2024 Surpassed US$30 Billion. 2025. Available online: https://thelowdown.momentum.asia/tiktok-shop-gmv-in-2024-surpassed-us30-billion (accessed on 2 February 2026).
Wangjingshe. TikTok E-Commerce French Station Accelerates Expansion, Transaction Volume Soars Sevenfold in Half a Year. 2025. Available online: https://fgw.sz.gov.cn/ztzl/qtztzl/szscjmyjjfzzhfwpt/hwtz/sjal/content/post_12527556.html (accessed on 2 February 2026).
Sellercraft. TikTok Shop vs Shopee GMV Trends in Southeast Asia (2023–2025). 2025. Available online: https://sellercraft.co/tiktok-shop-vs-shopee-gmv-trends-in-southeast-asia-2023-2025-unpacking-the-e-commerce-showdown/ (accessed on 25 June 2026).
Yang, W.; Chen, Y.; Li, Y.; Cheng, Y.; Liu, X.; Chen, Q.; Li, H. Cross-view Semantic Alignment for Livestreaming Product Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2023; pp. 13404–13413.
State Administration for Market Regulation. The Fifth Batch of Typical Cases in the Field of Live E-Commerce. 2026. Available online: https://www.samr.gov.cn/xw/zj/art/2026/art_a7a1fb24ceac4ed9a4789161bfe49e0f.html (accessed on 25 June 2026).
TikTok Shop. TikTok Shop’s Latest Safety and IPR Reports: Focusing on Security While Growing Globally. 2025. Available online: https://seller.tiktokglobalshop.com/business/us/newsroom/detail/10023362 (accessed on 25 June 2026).
Amazon. 2024 Brand Protection Report: How Amazon Uses AI Innovations to Stop Fraud and Counterfeits; Technical report; Amazon: Seattle, WA, USA, 2025.
OECD; EUIPO. Mapping Global Trade in Fakes 2025: Global Trends and Enforcement Challenges; Technical report; OECD: Paris, France, 2025.
Zhu, H.; Wei, H.; Li, B.; Yuan, X.; Kehtarnavaz, N. A Review of Video Object Detection: Datasets, Metrics and Methods. Appl. Sci. 2020, 10, 7834.
Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL); Association for Computational Linguistics: Kerrville, TX, USA, 2019; p. 6558.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016; pp. 779–788.
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2020.
Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine Regression Task in DETRs as Fine-Grained Distribution Refinement. arXiv 2024, arXiv:2410.13842.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
Kuhn, H.W. The Hungarian Method for the Assignment Problem. Nav. Res. Logist. Q. 1955, 2, 83–97.
Neubeck, A.; Van Gool, L. Efficient Non-Maximum Suppression. In Proceedings of the International Conference on Pattern Recognition (ICPR); IEEE: New York, NY, USA, 2006; pp. 850–855.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
Weng, Z.; Meng, L.; Wang, R.; Wu, Z.; Jiang, Y.G. A Multimodal Framework for Video Ads Understanding. In MM ’21: Proceedings of the 29th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2021.
Gu, J.; Qin, T.; Chen, H. A Key Frame Extraction Method Based on MPEG-7 Color Features and Block Motion Information. J. Guangxi Univ. (Natural Sci. Ed.) 2010, 35, 310–314.
Cernekova, Z.; Pitas, I.; Nikou, C. Information Theory-Based Shot Cut/Fade Detection and Video Summarization. IEEE Trans. Circuits Syst. Video Technol. 2006, 16, 82–91.
Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-Reference Image Quality Assessment in the Spatial Domain. IEEE Trans. Image Process. 2012, 21, 4695–4708.
Kharchevnikova, A.; Savchenko, A.V. Efficient Video Face Recognition Based on Frame Selection and Quality Assessment. PeerJ Comput. Sci. 2021, 7, e391.
Park, J.; Lee, J.; Kim, I.J.; Sohn, K. SumGraph: Video Summarization via Recursive Graph Modeling. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XXV; Springer: Berlin/Heidelberg, Germany, 2020.
Zhuang, Y.; Rui, Y.; Huang, T.S.; Mehrotra, S. Adaptive Key Frame Extraction Using Unsupervised Clustering. In Proceedings of the International Conference on Image Processing (ICIP); IEEE: New York, NY, USA, 1998; pp. 866–870.
Mittal, A.; Soundararajan, R.; Bovik, A.C. Making a “Completely Blind” Image Quality Analyzer. IEEE Signal Process. Lett. 2013, 20, 209–212.
Su, S.; Yan, Q.; Zhu, Y.; Zhang, C.; Ge, X.; Sun, J.; Zhang, Y. Blindly Assess Image Quality in the Wild Guided by a Self-Adaptive Hyper Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2020; pp. 3664–3673.
Liu, X.; van de Weijer, J.; Bagdanov, A.D. RankIQA: Learning from Rankings for No-Reference Image Quality Assessment. arXiv 2017, arXiv:1707.08347.
Zhang, W.; Zhai, G.; Wei, Y.; Yang, X.; Ma, K. Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective. arXiv 2023, arXiv:2303.14968.
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
Golestaneh, S.A.; Dadsetan, S.; Kitani, K.M. No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency. arXiv 2022, arXiv:2108.06858.
Yang, S.; Wu, T.; Shi, S.; Lao, S.; Gong, Y.; Cao, M.; Wang, J.; Yang, Y. MANIQA: Multi-Dimension Attention Network for No-Reference Image Quality Assessment. arXiv 2022, arXiv:2204.08958.
Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models. In Proceedings of the International Conference on Machine Learning (ICML); PMLR: Cambridge, MA, USA, 2023; pp. 19730–19742.
Cui, C.; Sun, T.; Lin, M.; Gao, T.; Zhang, Y.; Liu, J.; Wang, X.; Zhang, Z.; Zhou, C.; Liu, H.; et al. PaddleOCR 3.0 Technical Report. arXiv 2025, arXiv:2507.05595.
An, K.; Chen, Q.; Deng, C.; Du, Z.; Gao, C.; Gao, Z.; Gu, Y.; He, T.; Hu, H.; Hu, K.; et al. FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs. arXiv 2024, arXiv:2407.04051.
Tan, M.; Le, Q.V. EfficientNetV2: Smaller Models and Faster Training. arXiv 2021, arXiv:2104.00298.

Figure 1. Overall architecture of the proposed model.

Figure 2. Training speed/parameter scale comparison of EfficientNetV2. In panel (a), the red and pink solid lines represent EfficientNetV2 variants, while the green and black dashed lines represent NFNet and EfficientNet variants, respectively. The blue triangle and black star denote ViT-L/16(21k) and the reproduced EfficientNet-B7, respectively. Models labeled “(21k)” were pretrained on ImageNet21k and then fine-tuned on ImageNet ILSVRC2012, whereas all other models were trained directly on ImageNet ILSVRC2012. (Source: Tan and Le, 2021 [38]).

Figure 3. Comparison of OCR Confidence Between Strategy A and Strategy B. The white diamonds indicate the mean values, while the black horizontal lines indicate the medians.

Figure 4. Comparison of High-Confidence Text Ratio Between Strategy A and Strategy B. The white diamonds indicate the mean values, the black horizontal lines indicate the medians, and the open circles indicate outliers.

Figure 5. Comparison of Recognized Text Count Between Strategy A and Strategy B. The white diamonds indicate the mean values, the black horizontal lines indicate the medians, and the open circles indicate outliers.

Figure 6. Comparison of Processing Time Between Strategy A and Strategy B. The white diamonds indicate the mean values, the black horizontal lines indicate the medians, and the open circles indicate outliers.

Table 1. Training Hyperparameters of the D-FINE Detector.

Category	Parameter	Value
Model	Backbone	HGNetv2-B2 initialized from public stage-1 pretrained weights
	Hidden dimension	256
	Decoder layers	4
	Number of queries	300
	Number of classes	2 (product, background)
Training	Epochs	132
	Batch size	16
	Learning rate	$2.5 \times 10^{-4}$ (backbone: $2.5 \times 10^{-5}$)
	Optimizer	AdamW ($\beta_{1} = 0.9$, $\beta_{2} = 0.999$, weight decay $1.25 \times 10^{-4}$)
	LR schedule	MultiStepLR (milestone = 500, $\gamma = 0.1$) + linear warmup
	Warmup steps	500
	AMP	Enabled
	EMA	Enabled (decay = 0.9999)
Data	Input size	$640 \times 640$
	Data augmentation	PhotometricDistort, RandomZoomOut, RandomIoUCrop, Flip
	Stop epoch	120
Loss	Loss weights	vfl: 1.0, bbox: 5.0, giou: 2.0, fgl: 0.15, ddf: 1.5
Evaluation	Metrics	COCO mAP (IoU 0.5–0.95), AP50, AP75

Table 2. EfficientNetV2-M Training Hyperparameters.

Category	Parameter	Value
Model	Backbone	EfficientNetV2-M (pretrained on ImageNet-1K)
	Input resolution	$224 \times 224$
	Parameters	≈24 M
	Dropout rate	0.3
Training	Batch size	16
	Epochs	20
	Learning rate	$5 \times 10^{- 4}$
	Optimizer	Adam
	Weight decay	$1 \times 10^{- 4}$
Loss	Function	MSE (regression)
LR schedule	Scheduler	ReduceLROnPlateau (factor = 0.5, patience = 3, min lr $1 \times 10^{- 5}$ )
Data	Split	80%/20%
Augmentation	Methods	Flip (50%), brightness/contrast ($\pm 30\%$), Gaussian noise (20%)
Training strategy	Early stopping	5 epochs

Table 3. Performance Metrics of the D-FINE Detector.

Metric	Value	Description
mAP (IoU 0.5–0.95)	27.76%	COCO standard mean Average Precision over IoU thresholds from 0.5 to 0.95
AP₅₀	40.36%	Average Precision at IoU = 0.5
AP₇₅	28.25%	Average Precision at IoU = 0.75
AR₁₀₀	44.14%	Average Recall with up to 100 detections per image
Final training loss	20.71	Loss value at the end of training (epoch 131)

Table 4. Performance Metrics of EfficientNetV2.

Metric	Value	Description
Validation MSE	0.00220	Mean Squared Error
Validation MAE	0.03667	Mean Absolute Error
Validation RMSE	0.04687	Root Mean Squared Error
Pearson correlation	0.9323	Linear correlation between predicted scores and ground-truth scores
Best epoch	17	Epoch with the lowest validation loss

Table 5. Evaluation metrics for Experiment 1.

Metric	Description
Average number of recognized characters	Average number of recognized characters per video
Average OCR confidence	Mean confidence score of recognized OCR text instances
High-confidence character count	Average number of recognized characters per video from OCR text instances with confidence $\geq 0.8$
Processing time	Average processing time per video from input to OCR completion

Table 6. Frame quality and OCR-stage metrics for Experiment 2.

Metric	Definition	Role
Average OCR confidence	Mean confidence score of OCR text instances recognized from the selected frame	Reflects the reliability of OCR recognition on the selected frame
High-confidence text ratio	Proportion of OCR text instances with confidence no less than 0.7	Evaluates the proportion of reliable OCR text evidence
Recognized text count	Number of OCR text instances recognized from the selected frame	Reflects the amount of extracted textual evidence
Recognized character count	Number of recognized characters in the selected frame	Serves as a text-volume indicator in the correlation analysis of Strategy B scores
Processing time	Total time from video input to selected-frame output	Reflects the runtime efficiency of the frame selection process

Table 7. Sensitivity analysis of evaluation thresholds.

Variable	Threshold	Perfect Match Rate	Corresponding Field Accuracy
Brand	0.600	0.795	0.805
Brand	0.650	0.790	0.800
Brand	0.700	0.775	0.795
Brand	0.750	0.755	0.770
Brand	0.800	0.695	0.710
Product Name	0.400	0.785	0.955
Product Name	0.450	0.775	0.915
Product Name	0.500	0.775	0.895
Product Name	0.550	0.745	0.840
Product Name	0.600	0.715	0.780
Product Name	0.650	0.675	0.705
Product Name	0.700	0.645	0.645
Category	0.400	0.775	0.990
Category	0.450	0.775	0.980
Category	0.500	0.775	0.960
Category	0.550	0.765	0.940
Category	0.600	0.710	0.850

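Note: Bold values (the rows for brand at 0.700, product name at 0.500, and category at 0.500, whose field accuracies match the Strategy B results in Table 15) indicate the adopted default thresholds and their corresponding evaluation results.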
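How per-field thresholds turn similarity scores into field-level matches and a perfect-match decision can be sketched as follows. This is an illustrative Python sketch under stated assumptions: `difflib.SequenceMatcher` stands in for the similarity measure, which is not specified in this table; a perfect match is assumed to require all three fields to pass; and the thresholds are those read from the rows of Table 7 that agree with Table 15.

```python
from difflib import SequenceMatcher

# Assumed default thresholds (rows of Table 7 consistent with Table 15).
THRESHOLDS = {"brand": 0.70, "product_name": 0.50, "category": 0.50}


def similarity(a, b):
    """Stand-in string similarity in [0, 1]; the paper's measure may differ."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def field_matches(pred, truth):
    """Return, for each field, whether its similarity reaches the threshold."""
    return {
        field: similarity(pred[field], truth[field]) >= threshold
        for field, threshold in THRESHOLDS.items()
    }


def perfect_match(pred, truth):
    """Assumed definition: every field passes its own threshold."""
    return all(field_matches(pred, truth).values())
```

Raising a single threshold can only turn matches into misses for that field, which is why both the field accuracy and the Perfect Match Rate in Table 7 are non-increasing as the threshold grows.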

Table 8. Comparison of OCR-Stage Performance Under Different Numbers of Selected Frames.

Metric	Single Frame ( $K = 1$ )	Top-3 Frames	Top-5 Frames
Average Number of Characters	17	20	25
Average Confidence	0.7593	0.8897	0.8850
Average Number of High-Confidence Characters	15	18	20
Average Processing Time (s)	32.51	93.11	155.52

Table 9. Correlation between Strategy B scores and OCR-stage indicators. All correlations are computed over 100 samples. Values are reported as coefficient (p-value), and bold values indicate p < 0.05.

OCR-Stage Indicator	Learned Quality Component $q$: Pearson $r$	Learned Quality Component $q$: Spearman $\rho$	Final Score $S_{B}$: Pearson $r$	Final Score $S_{B}$: Spearman $\rho$
Average OCR confidence	0.067 (0.5068)	0.070 (0.4861)	0.078 (0.4787)	0.028 (0.7839)
High-confidence text ratio	0.075 (0.4596)	0.070 (0.4877)	0.071 (0.4791)	0.013 (0.8982)
Recognized text count	0.128 (0.2054)	0.125 (0.2157)	0.230 (0.0214)	0.258 (0.0097)
Recognized character count	0.283 (0.0043)	0.121 (0.2306)	0.252 (0.0115)	0.338 (0.0006)

Table 10. Comparison of Human Evaluation Results for Strategy A and Strategy B.

	Strategy A	Strategy B	Score Improvement (B relative to A)
Average Quality Score	3.31	3.52	+0.21 (6.34%)

Table 11. End-to-End Weight Sensitivity Analysis of Strategy A.

Weight	Perfect Match Rate	Brand Acc.	Product Name Acc.	Category Acc.	Semantic Sim.
0.5/0.5	0.705	0.765	0.870	0.965	0.788
0.6/0.4	0.705	0.785	0.870	0.965	0.791
0.7/0.3	0.725	0.785	0.870	0.965	0.792
0.8/0.2	0.705	0.765	0.850	0.955	0.785
0.9/0.1	0.665	0.755	0.830	0.955	0.781

Note: Bold values (the 0.7/0.3 row, which matches the Strategy A results in Table 15) indicate the selected weight configuration and its corresponding evaluation results.

Table 12. End-to-End Weight Sensitivity Analysis of Strategy B.

Weight	Perfect Match Rate	Brand Acc.	Product Name Acc.	Category Acc.	Semantic Sim.
0.5/0.5	0.755	0.775	0.875	0.940	0.783
0.6/0.4	0.735	0.755	0.875	0.940	0.782
0.7/0.3	0.755	0.775	0.895	0.960	0.800
0.8/0.2	0.775	0.795	0.895	0.960	0.802
0.9/0.1	0.765	0.795	0.895	0.940	0.798

Note: Bold values (the 0.8/0.2 row, which matches the Strategy B results in Table 15) indicate the selected weight configuration and its corresponding evaluation results.

Table 13. End-to-End Sensitivity Analysis of the Laplacian Normalization Constant.

Normalizer	Perfect Match Rate	Brand Acc.	Product Name Acc.	Category Acc.	Semantic Sim.
300	0.705	0.785	0.830	0.965	0.785
400	0.685	0.785	0.810	0.965	0.783
500	0.725	0.785	0.870	0.965	0.792
600	0.665	0.745	0.790	0.925	0.768
700	0.685	0.785	0.830	0.965	0.784

Note: Bold values (the row for a normalizer of 500, which matches the Strategy A results in Table 15) indicate the selected normalization constant and its corresponding evaluation results.
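To make the roles of the weight pair and the Laplacian normalization constant concrete, the following Python sketch shows one plausible form of a Strategy A score. Everything about the form is an assumption made for illustration: that the first weight (0.7) applies to detection confidence and the second (0.3) to sharpness, that sharpness is the variance of a 4-neighbour Laplacian divided by the normalizer and clipped to 1, and the function names themselves. The exact definition is given in the Method section and may differ.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    gray is a 2D list of grayscale intensities (rows of equal length).
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1]
                   + gray[y][x + 1] - 4 * gray[y][x])
            responses.append(lap)
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def strategy_a_score(confidence, gray, w_conf=0.7, w_sharp=0.3, normalizer=500.0):
    """Assumed form: weighted sum of confidence and clipped, normalized sharpness."""
    sharpness = min(laplacian_variance(gray) / normalizer, 1.0)
    return w_conf * confidence + w_sharp * sharpness
```

Under this form, a smaller normalizer saturates the sharpness term sooner, so frames above a moderate sharpness level become indistinguishable, while a larger one keeps separating them.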

Table 14. Compared Methods in the End-to-End Evaluation.

Method	Input	D-FINE Keyframe	Multimodal Fusion
Baseline	Last frame + audio	No	Yes
External VLM Baseline	Original video	No	No
Strategy A	Selected keyframe + audio	Yes	Yes
Strategy B	Selected keyframe + audio	Yes	Yes

Table 15. End-to-End Recognition Performance of Different Methods.

Method	Perfect Match Rate	Semantic Sim.	Brand Acc.	Product Name Acc.	Category Acc.
Baseline	0.609	0.697	0.676	0.818	0.880
External VLM Baseline	0.705	0.799	0.725	0.920	0.980
Strategy A	0.725	0.792	0.785	0.870	0.965
Strategy B	0.775	0.802	0.795	0.895	0.960

Note: Bold values indicate the selected weight configuration and its corresponding evaluation results.

Table 16. Paired Statistical Comparison Between Strategy A and Strategy B.

Comparison	A+/B−	A−/B+	McNemar p	PM Diff.	PM 95% CI	Sim. Diff.	Sim. 95% CI
B vs. A	5	15	0.0414	+0.050	[0.005, 0.095]	+0.0104	[−0.0068, 0.0290]
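The reported McNemar p-value is consistent with the exact (binomial) form of the test applied to the two discordant counts, 5 and 15. A minimal Python check is sketched below; it assumes the exact two-sided test and 200 paired videos, a sample size inferred from the +0.050 difference rather than stated in this table.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least this uneven under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact(5, 15)
pm_diff = (15 - 5) / 200  # assumed 200 paired videos
print(round(p_value, 4), pm_diff)  # 0.0414 0.05
```

Only the 20 discordant pairs enter the test; videos on which both strategies succeed or both fail carry no information about which strategy is better.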

Table 17. Field-Level Error Distribution in End-to-End Disagreement and Failure Cases.

Group	Error Source	Brand Only	Product Only	Category Only	Multiple-Field
A correct/B wrong	Strategy B errors	4	0	0	1
A wrong/B correct	Strategy A errors	5	4	2	4
A wrong/B wrong	Strategy A errors	21	5	1	13
A wrong/B wrong	Strategy B errors	19	3	1	17

Table 18. Representative End-to-End Error Cases.

Case Group	Sample ID	Main Error Type	Interpretation
A correct/B wrong	6933996025563	Brand error	Strategy B confuses the target brand with a visually similar brand name, while Strategy A preserves the correct brand evidence.
A correct/B wrong	6973142375302	Multiple-field error	Strategy B fails to recover all three structured fields, indicating insufficient selected-frame or fused evidence.
A wrong/B correct	6975477310676	Brand error	Strategy A predicts an incorrect electronics brand, whereas Strategy B recovers the correct brand and product name from the selected evidence.
A wrong/B correct	6975178290086	Multiple-field error	Strategy A only recognizes a generic product category, while Strategy B recovers both the brand and the specific product name.
A wrong/B wrong	06941812761847	Multiple-field error	Both strategies fail to recover the brand and complete product name from the selected keyframe, while the external video baseline succeeds by using the original video input.
A wrong/B wrong	6947503700232	Multiple-field error	Strategy A suffers from brand confusion, while Strategy B lacks sufficient structured evidence; the external video baseline recognizes the brand and model more accurately.

Table 19. Ablation results of different modality combinations under Strategy A and Strategy B.

Strategy	Modality	Perfect Match Rate	Brand Acc.	Product Name Acc.	Category Acc.	Sem. Sim.
A	OCR-only	0.650	0.705	0.785	0.910	0.770
	ASR-only	0.270	0.320	0.515	0.675	0.638
	VLM-only	0.145	0.155	0.185	0.505	0.482
	OCR + ASR	0.680	0.745	0.815	0.925	0.784
	OCR + VLM	0.670	0.735	0.785	0.935	0.782
	ASR + VLM	0.385	0.425	0.595	0.745	0.682
	Full Framework	0.725	0.785	0.870	0.965	0.792
B	OCR-only	0.670	0.715	0.775	0.855	0.763
	ASR-only	0.270	0.320	0.520	0.665	0.635
	VLM-only	0.450	0.485	0.525	0.575	0.656
	OCR + ASR	0.735	0.755	0.855	0.915	0.796
	OCR + VLM	0.705	0.745	0.785	0.835	0.766
	ASR + VLM	0.540	0.590	0.695	0.825	0.742
	Full Framework	0.775	0.795	0.895	0.960	0.802

Table 20. API pricing configuration used in this study.

Model	Role	Input Price	Output Price
Qwen3.5-27B	Evidence structuring and final fusion	0.6	4.8
Qwen3-VL	Visual-semantic recognition	0.75	3.0

Note: Prices are reported in CNY per 1M tokens. Qwen3-VL refers to Qwen3-VL-30B-A3B-Instruct.

Table 21. API usage and estimated monetary cost of the full multimodal pipeline.

Pipeline	Input Tok.	Output Tok.	Cost/Run	Cost/1000 Runs
Full	3069.4	9253.1	0.0465 CNY	46.46 CNY
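The per-run cost follows from token counts and per-million-token prices. The Python sketch below uses the prices from Table 20; the helper name `api_cost` is hypothetical. Because Table 21 does not break tokens down by model, the example bills all tokens at the Qwen3.5-27B rate purely as a reference point, which is an assumption and not the authors' accounting.

```python
# Prices in CNY per 1M tokens, from Table 20.
PRICES_CNY_PER_M = {
    "Qwen3.5-27B": {"input": 0.6, "output": 4.8},
    "Qwen3-VL": {"input": 0.75, "output": 3.0},
}


def api_cost(usage):
    """Return the cost in CNY; usage maps model name -> (input_tokens, output_tokens)."""
    total = 0.0
    for model, (tokens_in, tokens_out) in usage.items():
        price = PRICES_CNY_PER_M[model]
        total += tokens_in * price["input"] / 1e6
        total += tokens_out * price["output"] / 1e6
    return total


# Reference point only: every token billed at the Qwen3.5-27B rate.
reference = api_cost({"Qwen3.5-27B": (3069.4, 9253.1)})
print(round(reference, 4))  # 0.0463
```

This reference value is close to the reported 0.0465 CNY per run; the small gap presumably reflects the share of tokens billed at the Qwen3-VL rate. Output tokens dominate the total because they are roughly three times as numerous as input tokens and priced several times higher.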

Table 22. API latency statistics of major remote model calls.

API Stage	Mean Time (s)	Median Time (s)	P95 Time (s)
OCR evidence structuring LLM	27.62	10.53	87.38
ASR evidence structuring LLM	43.92	41.90	111.85
Qwen3-VL visual-semantic recognition	3.90	1.84	19.25
Final fusion LLM	39.89	35.63	95.02
Serial API latency	113.55	108.40	227.06
Parallel API latency	90.68	84.86	182.71

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

