Article

DRAM: Dynamic Range Modulation for Multimodal Attribute Value Extraction on E-Commerce Product Data

School of Computer & Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(5), 969; https://doi.org/10.3390/electronics15050969
Submission received: 6 January 2026 / Revised: 15 February 2026 / Accepted: 24 February 2026 / Published: 26 February 2026
(This article belongs to the Special Issue Advances in Multimodal AI: Challenges and Opportunities)

Abstract

With the prosperity of e-commerce applications, product web data are presented in multiple modalities, e.g., vision and language. Multimodal attribute values, which are extracted from textual descriptions with the assistance of helpful image regions, are crucial for mining product characteristics. However, most previous works (1) fuse multimodal information within a newly learned range based on co-occurrence rather than language meanings and (2) predict outputs over the range of all attributes rather than the product-related ones. These issues yield unsatisfactory results; thus, we propose a novel approach via Dynamic Range Modulation (DRAM): (1) First, we propose an Information Range Calibration (IRC) method that dynamically fuses multimodal features of related meanings as Text-Related Embeddings (TEM) within a language range, calibrated from the range in which a pretrained language model fuses language features via its powerful attention mechanism. (2) Moreover, an Attribute Range Minimization (ARM) method is proposed to minimize the output attribute range based on the adaptive selection of product-related attribute prototypes. Experiments on popular multimodal e-commerce benchmarks show that our DRAM performs well compared with previous methods.

1. Introduction

Understanding the web data of e-commerce products, especially mining potentially valuable information, has become significant for online recommendation. Highlighting product characteristics, attribute value extraction [1,2] converts the unstructured text in Figure 1a into the attribute value pairs in Figure 1b.
In detail, as one of the information extraction tasks (e.g., Named Entity Recognition (NER) [3]), attribute value extraction predicts the values from a text sequence based on a symbol set. For example, “B-sleeve_type, I-sleeve_type, O” marks the beginning, inside, and outside parts (not a value) of a value “short sleeve” for an attribute “sleeve_type” from the input text “short sleeve long” in Figure 1.
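The B-I-O tagging scheme above can be illustrated with a small decoding sketch; the helper name and tokens here are only illustrative, not part of the paper's code:

```python
# Hypothetical sketch: decoding a B-I-O tag sequence into (attribute, value) pairs,
# following the "short sleeve long" example from Figure 1.

def decode_bio(tokens, tags):
    """Collect contiguous B-/I- spans into (attribute, value) pairs."""
    pairs, span, attr = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if span:                       # close any open span first
                pairs.append((attr, " ".join(span)))
            attr, span = tag[2:], [tok]
        elif tag.startswith("I-") and attr == tag[2:]:
            span.append(tok)               # continue the current value span
        else:                              # "O" or a mismatched "I-" ends the span
            if span:
                pairs.append((attr, " ".join(span)))
            span, attr = [], None
    if span:
        pairs.append((attr, " ".join(span)))
    return pairs

tokens = ["short", "sleeve", "long"]
tags = ["B-sleeve_type", "I-sleeve_type", "O"]
print(decode_bio(tokens, tags))  # [('sleeve_type', 'short sleeve')]
```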
Furthermore, multimodal attribute value extraction is naturally text-dominant, with only text-related image regions useful for fusion, which differs from typical vision-dominant multimodal fusion tasks [4,5] that fully employ the information of all regions of multiple vision inputs. However, most previous methods of multimodal attribute value extraction use text-guided or cross-modal fusion modules such as M-JAVE [6], EKE-MMRC [7], and SMARTAVE [8], whose parameters are newly learned from the co-occurrence of texts and images during training. Moreover, the generative manner of a recent multimodal model like DEFLATE [9] yields unstable predictions rather than fixed-length sequence tagging, and it performs unsatisfactorily in the following Section 4. For example, in the left of Figure 1a, the hemline of a dress co-occurs with but mismatches the word "Long", yielding a wrong output.
Inspired by the pretrained language models to fuse contextual language features, we propose to fuse multimodal features of related meanings within a language range. Firstly, based on the fine-tuning by multimodal tasks, a language model is “calibrated” to handle multimodal inputs, as shown in the bottom of Figure 2. Then, multimodal features of related language meanings are fused dynamically via its powerful attention mechanism within a multimodal language range, which is calibrated from a unimodal language range.
In addition, the diversity of products leads to different attribute ranges. For example, "battery_life" occurs far more rarely in clothes than "sleeve_length". However, most previous methods set the candidate attribute range to all attributes in the dataset, as shown in Figure 1b. Quantitatively, every textual position is classified into one of 2N + 1 classes for N attributes (a "B" and an "I" per attribute, plus an "O"), giving the numeric range of predicted attribute classes C_o:
C_o = (N × |{B, I}| + |{O}|) × S_1,
where S_1 is the text length, and |·| counts the predicted tagging symbols, "B", "I", and "O", for each kind of attribute class. Such a maximal attribute range produces non-essential classification outputs, which in turn cause wrong predictions.
To address the above issues, we design a novel approach to dynamically minimize the attribute range of the specific product. First, the attribute range is determined via only one multi-label classification. Second, it looks up learnable prototypes of these selected attributes. Finally, only values within the range are predicted via the guidance of the chosen prototypes. Given that N_m << N is the number of selected attributes, the range of predicted classes is smaller than in Equation (1):
C_m = N + (|{B, I, O}| × S_1) × N_m,   C_m << C_o.
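As a quick numeric check of Equations (1) and (2), consider illustrative numbers (N = 26 as in MEPAVE; the text length and selected-attribute count are made up):

```python
# Illustrative arithmetic for the two class-range formulas; the numbers are made up.
N = 26      # attributes in the dataset (e.g., MEPAVE has 26)
S1 = 40     # text length
N_m = 3     # attributes actually selected for this product

C_o = (N * 2 + 1) * S1        # every position classified over 2N + 1 tag classes
C_m = N + (3 * S1) * N_m      # one multi-label pass + B-I-O tagging per selected attribute
print(C_o, C_m)               # 2120 386
```

Even for this small N, the minimized range C_m is several times smaller than C_o; the gap widens for datasets with thousands of attributes such as MAE.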
In conclusion, we have observed the issues that hinder boosting the performance of multimodal attribute value extraction, including both the calibration of the information range and the minimization of the attribute range. As illustrated in Figure 2, we propose a novel attribute value extraction approach, named Dynamic Range Modulation (DRAM). The main contributions of this paper are as follows:
  • First, the Information Range Calibration (IRC) method is proposed to dynamically fuse the multimodal information of related meanings into a language range by fine-tuning an attention-based language model to handle each modality as Text-Related Embeddings (TEM).
  • Second, the Attribute Range Minimization (ARM) method is designed to decrease wrong predictions in two steps: (1) a multi-label classification first decides the attribute range, and then (2) learnable prototypes are selected to dynamically predict values for the chosen attributes.
  • Finally, by the integration of the proposed IRC and ARM, our new approach DRAM achieves superior performances compared to the previous state-of-the-art techniques on popular MEPAVE and MAE benchmarks.

2. Related Works

2.1. Unimodal Attribute Value Extraction

Due to the text-centric task characteristics, early works of attribute value extraction were designed for unimodal texts.
For instance, OpenTag [10] makes a multi-class prediction at each textual position as sequence tagging via BiLSTM+CRF. Similarly, slot-filling methods [11,12] use frozen word embeddings and a fine-tuned RNN. Slot-Gated [13] adopts a gating mechanism. However, each position must be classified over roughly twice the number of all attributes, leading to limitations on large-scale datasets with thousands of attributes.
Based on OpenTag, SUOpenTag [14] predicts for given attributes of input texts and thus succeeds on large-scale datasets. AVEQA [15] learns a matching task between text and attribute. AdaTag [16] encodes attributes as the dynamic parameters. However, if the known attributes are unavailable, false positives might occur on all the attributes. Joint-BERT [17] fine-tunes a pretrained model on down-stream tasks. Furthermore, K-PLUG [18] is fine-tuned on 25 M extra data.
Recently, in a generative manner of prediction, Joint-AVE [2] explores more modern pretrained models like T5 [19] and FLAN-T5 [20]. ExtractGPT [21] and New-LLM [1] are proposed based on a Large Language Model (LLM) for textual attribute value extraction.
Differently, our proposed ARM minimizes the attribute range for the current products via the conditional guidance of their prototypes. In addition, our IRC performs fine-tuning without any extra data, and a more dynamic language range of fusion is achieved by the task-calibrated multimodal language model. The generative manner is less stable than our DRAM, which is constrained by explicitly tagging the text with B-I-O symbols as sequence tagging.

2.2. Multimodal Attribute Value Extraction

Recently, attribute value extraction models have also benefited from the assistance of more modalities, especially visual images of e-commerce products with rich descriptive clues to distinguish the attribute values.
MAE-model [22] takes the image, text, and attribute as inputs and generates the attribute value. OCR allows PAM [23] to obtain more visual clues, but the stability of its generation is uncertain without the constraint of the texts. M-JAVE [6] is a sequence tagging approach, but it extracts task-agnostic features from frozen models. Its newly initialized decoders learn from the co-occurrence of texts and images and thus yield an inaccurate non-language range. These methods predict without any attribute range minimization and therefore struggle within a maximal range.
Differently, EKE-MMRC [7] adopts a frozen pretrained object detector to obtain objects which are more related to language via class labels. Its knowledge-guided generative prediction is less stable than sequence tagging. SMARTAVE [8] fine-tunes both visual and textual encoders and embraces extra OCR inputs. However, it adopts known attribute inputs like various unimodal methods [14,15,16]. With a strong language model, our proposed IRC calibrates a language dynamic range of multimodal fusion. Moreover, our ARM minimizes the attribute range via prototype guidance.
Recently, DEFLATE [9] proposed a novel multimodal T5 Encoder [19] with multimodal self-attention and exploited DALL-E [24] as a visual encoder, which is a popular text-to-image multimodal model with a QA-based generation model. Meanwhile, EIVEN [25] is proposed with a multimodal LLM fine-tuned for down-stream multimodal attribute value extraction, which reports its performance on non-mainstream datasets unlike widely used MEPAVE [6].
However, they implicitly generate outputs from such generative models, unlike our DRAM, whose output is constrained by explicitly tagging the text with B-I-O symbols in a sequence tagging manner.

2.3. Multimodal Encoding via Language Models

Pretrained language models are widely used to encode multimodal information because they do well in finding language-related clues. In Named Entity Recognition (NER), RpBERT [3] extracts textual named entities based on text and images. BERT is fine-tuned on such multimodal tasks. Similarly, visual-language pretraining methods adopt a language model to encode texts and image patches into latent semantics [26]. Because there are text-irrelevant image regions for product data, our proposed IRC selects the parts of related meanings, encodes them into Text-Related Embeddings (TEM), and then feeds them into a language model.

3. Proposed Method

In Figure 2, our proposed Dynamic Range Modulation (DRAM) involves two components: Information Range Calibration (IRC) and Attribute Range Minimization (ARM). IRC decides the ranges of multimodal information to be fused. Then ARM selects the range of attributes to predict values.

3.1. Information Range Calibration

As shown in Figure 3, Information Range Calibration (IRC) calibrates the pretrained language model to handle multimodal inputs as Text-Related Embeddings (TEM). Related multimodal features are fused dynamically via its attention mechanism within a multimodal language range calibrated from pretrained models.
A frozen pretrained language model alone is not trained to accept multimodal inputs. Hence, by fine-tuning on the multimodal attribute extraction task, the language model that fuses related language meanings is calibrated from a pretrained unimodal model into a multimodal one.
In detail, an attention-based language model [27] is composed of multiple stacked layers, where Multi-Head Self-Attention (MHSA) is one of the key elements. After large-scale language pretraining, MHSA is powerful enough to fuse features with related language meanings within an accurate range of fusion. However, the self-attention mechanism [28] is not trained with multimodal inputs to fuse only text-related visual parts with their corresponding textual parts; thus, we propose to calibrate the information range of the vision input, which modifies the attention map via unimodal text-queried attention with multimodal keys and values.

3.1.1. Task-Calibrated Multimodal Language Model

Given two input sequences, x_Q ∈ R^{S_q×D} and x_K = x_V ∈ R^{S_k×D}, they are projected by weight matrices W_Q, W_K, W_V ∈ R^{D×D′}, where S_q and S_k are the lengths of the two inputs, and D and D′ are the dimensions before and after projection. Thus, the projected sequences are
Q = x_Q W_Q,   K = x_K W_K,   V = x_V W_V.
Q ∈ R^{S_q×D′} and K, V ∈ R^{S_k×D′} represent three kinds of features, where Q is a "Query" that retrieves useful parts of the "Value" V. However, since Q and V might originate from different modalities, e.g., text and image, directly computing their matching degree is inappropriate. K is paired with each position in V and performs the "Key" role of the key-value pair to match Q. Thus, we have the attention formula:
Attention(K, Q, V) = (QK^T / √D′) V,
where the attention map M = QK^T / √D′ ∈ R^{S_q×S_k} contains the matching degrees across each position of Q and K, normalized by √D′. All items of V are multiplied by these degrees in M. The result of Equation (4) means that the information inside V matched to Q is extracted via the intermediate K, which can then be fused into Q.
As illustrated in Figure 4a, considering the inputs T and V, the attention map M = QK^T / √D′ of the 1st layer of the language model in Vanilla IRC can be divided into 4 parts: T→T, V→V, T→V, and V→T. Text-irrelevant parts (e.g., a background region with a red ✗ in Figure 4) of V contribute to the V→T and V→V parts and lead to a wrong range of multimodal fusion. We propose to solve this issue, which is evaluated by the experiments in the following sections.
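The attention of Equation (4) can be sketched in a few lines of NumPy; this is a minimal illustration with random inputs, adding the usual softmax normalization over the matching degrees (which is implicit in the formula above):

```python
import numpy as np

# Minimal sketch of scaled dot-product attention; shapes follow the text:
# Q is (S_q, D'), K and V are (S_k, D').

def attention(Q, K, V):
    D = Q.shape[-1]
    M = Q @ K.T / np.sqrt(D)                       # matching degrees, (S_q, S_k)
    M = np.exp(M - M.max(axis=-1, keepdims=True))  # numerically stable softmax
    M = M / M.sum(axis=-1, keepdims=True)
    return M @ V                                   # fuse matched parts of V into Q's positions

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # S_q = 3 query positions
K = rng.normal(size=(5, 8))   # S_k = 5 key/value positions
V = rng.normal(size=(5, 8))
print(attention(Q, K, V).shape)  # (3, 8)
```

The output keeps the query sequence length, so the fused information can be added back to Q's positions, which is exactly the property IRC relies on.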

3.1.2. Text-Related Embeddings for Task-Calibrated Multimodal Language Model

Since Vanilla IRC might produce a wrong, text-irrelevant range of multimodal fusion, we propose stronger Text-Related Embeddings (TEM) to obtain a text-related range. In detail, text embeddings x_t serve as x_Q for both self-attention inside the text and cross-attention between text and image. For self-attention, x_K = x_V = x_t. For cross-attention, x_K = x_V = x_v. The two attentions are formulated respectively as
SelfAtt(x_t) = ((x_t W_Q^t)(x_t W_K^t)^T / √D′) x_t W_V^t,
CrossAtt(x_t, x_v) = ((x_t W_Q^v)(x_v W_K^v)^T / √D′) x_v W_V^v.
In Figure 3c and Figure 4b, based on these two attentions, text embeddings play the role of queries to extract text-related information as T→V and T→T, without the visual V→T and V→V. Given text and image embeddings E = x_t and G = x_v, the extra embeddings E_m are calculated similarly to the positional or token-type embeddings in BERT or ViT and can be added to the input text embeddings E:
E_m = SelfAtt(E, E, E) + CrossAtt(E, G, G).
Finally, E_TEM = E + E_m(E, G) replaces φ(E, G) and is fed into the task-calibrated multimodal language model of our proposed IRC, which achieves a better range of text-related multimodal fusion compared with Vanilla IRC.
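The construction of E_TEM above can be assembled as a small sketch. Note the assumptions: the projection weights here are random stand-ins (the real model uses the calibrated language model's pretrained projections), and softmax normalization is added over the attention scores:

```python
import numpy as np

def attn(xq, xk, xv, Wq, Wk, Wv):
    """Generic (self- or cross-) attention with softmax normalization."""
    Q, K, V = xq @ Wq, xk @ Wk, xv @ Wv
    M = Q @ K.T / np.sqrt(Q.shape[-1])
    M = np.exp(M - M.max(axis=-1, keepdims=True))
    M = M / M.sum(axis=-1, keepdims=True)
    return M @ V

def text_related_embeddings(E, G, W_text, W_vis):
    """E: (S1, D) text embeddings, G: (Sv, D) image embeddings."""
    self_part = attn(E, E, E, *W_text)    # T->T: text queries text
    cross_part = attn(E, G, G, *W_vis)    # T->V: text queries visual keys/values
    Em = self_part + cross_part           # extra embeddings, like positional ones
    return E + Em                         # E_TEM, fed to the calibrated language model

rng = np.random.default_rng(0)
D = 8
E, G = rng.normal(size=(6, D)), rng.normal(size=(10, D))
W_text = [rng.normal(size=(D, D)) for _ in range(3)]  # random stand-in projections
W_vis = [rng.normal(size=(D, D)) for _ in range(3)]
print(text_related_embeddings(E, G, W_text, W_vis).shape)  # (6, 8)
```

Because the text always queries, the output keeps the text sequence length S1, so no visual V→T or V→V paths appear in the fusion.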

3.2. Attribute Range Minimization

In e-commerce scenarios, the attribute ranges of products diversify greatly, e.g., "battery_life" appears in electronic products more often than in clothes. However, most methods [6,10] predict the existence of all attributes, leading to unnecessary false positives, while others [14,16] rely on known attributes of the input, which might not be available when only image-text pairs of the products are given as inputs.
Having observed the above inconsistency between data distribution and model design, we propose a novel dynamic Attribute Range Minimization (ARM) method to tackle it. In detail, a "Guidance Mechanism" guides the model by a classification loss to dynamically select the attributes related to the current inputs and punishes mistakenly selected attributes. With these supervisions, the model is trained to predict only for the selected attributes, which is considered a kind of "Guidance".

3.2.1. Different Policies of ARM

First, as shown in Figure 5d, the model predicts the attribute range dynamically for the current inputs, as a multi-label classification task:
R = σ(F_{θ_IRC}(E, G)_[CLS] W_cls).
R denotes the existence scores of all attributes given the textual and visual inputs (E, G). Our proposed IRC θ_IRC outputs the multimodal feature F. The [CLS] feature is projected into logits for the N attributes by W_cls ∈ R^{D×N}. A sigmoid function σ non-linearly normalizes the logits into scores R ∈ [0, 1].
As shown in Figure 5, three different policies are designed: (1) The Prototype-Guided Policy selects the learnable vector “Prototypes” P corresponding to the attributes looked up by R and guides the model to predict values only for these attributes. (2) The DyNet-Guided Policy applies the parameters W of the selected attributes into a dynamic network. (3) BERT-Guided Policy uses a pretrained BERT model to encode the embeddings β of chosen attribute words.
Take policy (1) as an example. The score R is binarized with the threshold τ_R for "Proto Lookup" in Figure 5d. Then, a subset P_min is looked up from the prototypes P of all attributes, where the decreased count of prototypes |P_min| = N_m << |P| = N is a minimization of the attribute range:
P_min = P[R > τ_R].
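The threshold-and-lookup step above amounts to boolean indexing; a minimal sketch with made-up scores (the attribute count, threshold, and dimensions are illustrative):

```python
import numpy as np

# Hypothetical sketch: binarize the multi-label scores R with threshold tau,
# then look up the prototypes of the selected attributes.
def select_prototypes(R, P, tau=0.5):
    mask = R > tau                        # binarized existence scores
    return P[mask], np.nonzero(mask)[0]   # minimized prototype subset + indices

R = np.array([0.7, 0.4, 0.6, 0.3])                 # scores for N = 4 attributes
P = np.arange(4 * 8).reshape(4, 8).astype(float)   # one D = 8 prototype per attribute
P_min, idx = select_prototypes(R, P)
print(idx)          # [0 2] -> two attributes pass the threshold
print(P_min.shape)  # (2, 8): N_m = 2 << N = 4
```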
Denote the prototype vector of an attribute A_i selected within the range as P_min^{A_i} ∈ R^{1×D} and the feature at textual position j of the current input as F_j ∈ R^{1×D}; the feature F_j^{A_i} is conditioned for A_i via a "Guidance Mechanism":
F_j^{A_i} = Concat(F_j, ω_ij · F_j),
where the cosine similarity ω_ij = ((P_min^{A_i})(F_j)^T) / (‖P_min^{A_i}‖ ‖F_j‖) between every F_j ∈ F and P_min^{A_i} indicates the potential existence of A_i's values. The concatenation of F_j and ω_ij · F_j allows them to contribute equally across feature channels. Each F^{A_i} is fed into the feature decoder to predict a single-class "B-I-O" sequence Y^{A_i}.
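The Guidance Mechanism above is a cosine weighting followed by channel concatenation; a minimal sketch with dummy values (shapes follow the text, S1 and D are illustrative):

```python
import numpy as np

# Sketch: weight each textual feature F_j by its cosine similarity with the
# selected attribute's prototype, then concatenate with the original feature.
def guide(F, p):
    """F: (S1, D) textual features; p: (D,) prototype of one selected attribute."""
    w = (F @ p) / (np.linalg.norm(F, axis=-1) * np.linalg.norm(p) + 1e-8)
    return np.concatenate([F, w[:, None] * F], axis=-1)  # (S1, 2D)

F = np.ones((5, 8))   # dummy features for S1 = 5 positions, D = 8
p = np.ones(8)        # dummy prototype
print(guide(F, p).shape)  # (5, 16): both halves contribute equally in channels
```

Positions whose features align with the prototype keep a near-copy of themselves in the second half, while misaligned positions are attenuated there, which is what cues the decoder toward the attribute's values.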

3.2.2. The Prediction Pipeline of ARM

Furthermore, to explain the prediction pipeline of our proposed ARM, some detailed examples are provided below in a step-by-step manner:
(1) Given an input textual description and its corresponding image of the product item, the content of this text and its attribute value annotations as a “B-I-O” sequence are
Input text:
“13 inch bag matched with sliver laptop computers”
Annotation:
"B-bag_size I-bag_size O O O O B-cont_type I-cont_type"
(2) Assuming the total attributes of the whole dataset are
A = {bag_size, cont_type, bag_thick, bag_color}.
Multi-label classification of ARM outputs four scores R [ 0 , 1 ] which indicate the existence of these attributes. The ground truth of the classification for the input above is [ 1 , 1 , 0 , 0 ] .
Meanwhile, because it is difficult for the model to learn the conditioned prediction of the selected attributes with both a handful of scores and high-dimensional feature F, high-dimensional vectors are proposed to replace these scores. In this example, the model should have four vectors:
P = {P_bag_size, P_cont_type, P_bag_thick, P_bag_color}.
They are "Prototypes", since every P^{A_i} ∈ R^D represents the existence of an attribute A_i and is learnable like the other network parameters updated during training.
(3) If the scores are R = [0.7, 0.4, 0.6, 0.3], they are binarized by τ_R = 0.5 into the list [1, 0, 1, 0], which means that the subset {"bag_size", "bag_thick"} is selected as the minimized attribute range. Hence, {P_bag_size, P_bag_thick} are selected.
With the Guidance Mechanism in Equation (12), the features conditioned by the selected prototypes are {F_bag_size, F_bag_thick}, which are fed into the feature decoder for prediction.
In practice, for parallel inference, the conditioned features of each input in a mini-batch are accumulated along the 1st batch dimension. If the batch comprises three inputs whose numbers of selected prototypes are {2, 3, 1}, the feature F ∈ R^{3×S_1×D} is conditioned by the selected prototypes into
F_cond = {F^{a_1^1, a_1^2} ∈ R^{2×S_1×D}, F^{a_2^1, a_2^2, a_2^3} ∈ R^{3×S_1×D}, F^{a_3^1} ∈ R^{1×S_1×D}},
where F_cond ∈ R^{(2+3+1)×S_1×D} rather than three separate features that could only be fed into the feature decoder one by one. For simplicity, the batch size is set to one in the following sections.
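The packing step described above is a concatenation along the first axis; a minimal sketch with zero-valued dummies (S1 and the conditioned dimension are illustrative):

```python
import numpy as np

# Sketch: conditioned features of a 3-input mini-batch, with {2, 3, 1} selected
# prototypes per input, are packed so the decoder runs once instead of per input.
S1, D2 = 8, 16                      # text length and conditioned feature dimension
per_input = [2, 3, 1]               # selected prototype counts for each input
conditioned = [np.zeros((n, S1, D2)) for n in per_input]
F_cond = np.concatenate(conditioned, axis=0)
print(F_cond.shape)  # (6, 8, 16) -> the (2 + 3 + 1) x S1 x D' packed batch
```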
(4) Based on {F_bag_size, F_bag_thick}, the ground truths that correspond to them are generated from the original annotation. Suppose the feature decoder predicts single-class "B-I-O" results as
Prediction based on F bag _ size :
“B I O O O O O O”
Prediction based on F bag _ thick :
"B O O O O O O O"
Original annotation:
"B-bag_size I-bag_size O O O O B-cont_type I-cont_type"
Ground truth for bag_size:
"B I O O O O O O"
Ground truth for bag_thick:
"O O O O O O O O"
Since bag_thick is mistakenly selected, the ground truth of its prediction is a sequence filled with "O"s, with which its prediction is still inconsistent. Therefore, the model is punished by both the Binary Cross-Entropy loss and the CRF loss for attribute value extraction.

3.3. Loss Functions

Log-Likelihood loss in the CRF layer in Figure 3 supervises the attribute value extraction task. Hence, the overall loss function L has only L C R F , which is denoted as
L = L_CRF = −log(p_GT / (p_1 + p_2 + … + p_{C_o})),
where C_o is the number of classification outputs (denoted in Section 1). The CRF layer outputs the probability p of each potential "path" of the output sequence in the CRF. For the proposed ARM, a multi-label Binary Cross-Entropy loss L_BCE optimizes the classification between the predicted R and the label R̄:
L_BCE = −(1/N) Σ_{k=1}^{N} [R̄_k log R_k + (1 − R̄_k) log(1 − R_k)].
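The multi-label BCE above can be checked with a small numeric sketch; the scores and labels are made up, loosely following the four-attribute example of Section 3.2.2:

```python
import numpy as np

# Minimal sketch of the multi-label Binary Cross-Entropy over N attribute scores.
def bce(R, R_bar, eps=1e-8):
    R = np.clip(R, eps, 1 - eps)  # guard the logs against 0 and 1
    return -np.mean(R_bar * np.log(R) + (1 - R_bar) * np.log(1 - R))

R = np.array([0.9, 0.8, 0.1, 0.2])      # predicted existence scores
R_bar = np.array([1.0, 1.0, 0.0, 0.0])  # ground-truth existence labels
print(round(bce(R, R_bar), 4))  # 0.1643
```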
The prediction of the CRF layer in the feature decoder equipped with ARM is multiple single-class sequences Y A i , e.g., “O, B, I, O, O”, for attribute A i from the total selected N m attributes. Given ground truth GT i for each selected A i , L C R F A R M is denoted as
L_CRF^ARM = −(1/N_m) Σ_{i=1}^{N_m} log(p_{GT_i} / (p_1 + p_2 + … + p_{C_m})),
where C_m is denoted in Section 1. Finally, the overall loss function is
L = L_CRF^ARM + λ · L_BCE.

4. Experiments

4.1. Datasets

MEPAVE [6] is collected from JD.com, a mainstream Chinese e-commerce web platform, with 87,194 instances and 26 types of attributes. In total, 71,194/8000/8000 instances are split as the train/val/test sets. The F1 score, i.e., the harmonic mean of Precision and Recall, is the main metric of both attribute classification (CLS-F1) and attribute value extraction (TAG-F1). Most previous works [6,7] do not report Precision and Recall for comparison. The detailed statistics of MEPAVE are shown in Table 1.
MAE [22] is an English multimodal e-commerce attribute extraction dataset collected from various e-commerce websites via the Diffbot Product Web API. In total, 2.2 M records with 2 K types of attributes are split into train/val/test sets by 8:1:1, a larger scale than MEPAVE in the range of attribute classes and thus more challenging. To exclude attribute values that do not appear in the texts, the MAE-text subset was proposed by M-JAVE [6] and is denoted as MAE in this paper. Accuracy is the performance metric of attribute value extraction: num(correctly_predicted_attr_values)/num(total_attr_values). Only the overall statistics of the MAE dataset are shown in Table 2, because detailed categories were not tagged during collection in the original paper.
The MEPAVE dataset can be accessed by submitting the application form to the authors via https://github.com/jd-aig/JAVE (accessed on 15 February 2025). The MAE dataset is available online at https://rloganiv.github.io/mae/ (accessed on 15 February 2025).

4.2. Implementation Details

Our proposed method is implemented on the basis of the BiLSTM+CRF [10] feature decoder and the PyTorch 1.6 framework. RoBERTa-small-Chinese is used for the Chinese dataset MEPAVE, and BERT-base-uncased for the English dataset MAE. ResNet-152 pretrained on ImageNet encodes the images. For our proposed IRC, Adam optimizes the fine-tuning with a learning rate of 5 × 10⁻⁵. The loss weight λ = 1.0 × 10⁻³. For MEPAVE, our model is trained and tested on one Nvidia V100 GPU. For MAE, a total batch size of 560 on 8 GPUs is used.

4.3. Ablation Study

Table 3 shows the ablation study between the frozen or task-calibrated language models of our proposed IRC. For the multimodal inputs, our IRC surpasses the multimodal M-JAVE [6] with frozen pretrained models and gating mechanisms. To compare with newly learned text-irrelevant parameters for fusion, we also report the results of a fully fine-tuned M-JAVE [6] (marked with “‡”). Another fully fine-tuned method SMARTAVE [8] is also compared. Our IRC surpasses them under the same settings with the advantage of our task-calibrated multimodal language model.
In Table 4, the BERT version might over-fit with more parameters during training, while RoBERTa generalizes well as the base model. To evaluate the problem that the self-attention language model of Vanilla IRC suffers from text-irrelevant image regions, we compare our TEM with a naïve self-attention layer. Replacing it with our proposed TEM, which encodes the text-related parts of each modality into text-related embeddings, clearly increases TAG-F1 to 96.35. Moreover, Table 5 shows that the weight λ of the classification loss for our proposed ARM in Equation (17) affects the learning balance between attribute classification (CLS-F1) and attribute value extraction (TAG-F1). A too-large λ suppresses the learning of attribute value extraction and thus lowers TAG-F1. Hence, λ = 1.0 × 10⁻³ is the most suitable.

4.4. Analysis of Precision and Recall

Table 6 analyzes the Precision and Recall of ARM and TEM. Both Vanilla IRC+ARM and +ARM+TEM (=DRAM) obtain a higher Precision gain than Recall gain, which shows that ARM decreases false positives by Attribute Range Minimization. In detail, Precision = True Positives/(True Positives + False Positives) measures the false positives from wrong attributes predicted by the model, and our proposed ARM addresses this issue by minimizing the attribute range into a correct one. The Recall and TAG-F1 gains of both TEM and ARM are also prominent, which shows their advantage in eliminating false negatives from missing attributes, since TEM filters the text-relevant visual parts to emphasize the existence of attributes. In conclusion, our proposed DRAM achieves a higher F1 score and further boosts the task of attribute value extraction.

4.5. Comparison of Variants of ARM

We evaluate the effectiveness of the different policies in Figure 5 for our proposed Attribute Range Minimization (ARM), based on our Vanilla IRC and TEM. Table 7 shows that the DyNet-Guided Policy leads to slightly lower performance, probably because of the instability of training dynamic parameters. The BERT-Guided Policy ranks second with more parameters than the prototypes, which might over-fit during training. Finally, the Prototype-Guided Policy performs the best in both CLS-F1 and TAG-F1. Such performance shows that the model can learn both correct attribute classification (CLS-F1) and attribute value extraction (TAG-F1) via simple but effective learnable "Prototype" vectors, which are more stable than dynamic networks and lighter than BERT pretraining weights. Thus, the prototype-guided ARM is selected as a component of our DRAM.

4.6. Selection of τ R for ARM

Considering the diversified attribute distributions of the MEPAVE [6] and MAE [22] benchmarks, multiple values of the hyper-parameter τ_R, the threshold that filters the attribute range of our proposed ARM, are shown in Figure 6. Intuitively, a more balanced distribution of attributes gives each attribute a similarly high probability of existing in each sample; thus, a too-low threshold might always select the maximal attribute range, which is meaningless. In contrast, a long-tailed distribution gives some tail classes a low presence probability, and a too-high threshold might always ignore them.
For the 26 relatively compact and balanced attributes in MEPAVE, a higher τ_R value encourages the attribute range to converge into a smaller one. Therefore, in Figure 6a, τ_R = 0.5 achieves the best TAG-F1 score over the other values. For the scattered and long-tailed 2 K attributes in MAE, a lower τ_R value keeps all the potential attributes covered in the attribute range, especially the low-frequency ones. Hence, τ_R = 0.4 significantly improves over 0.5 in Figure 6b. Due to the constrained memory budget of our GPUs, other values of τ_R on the large-scale benchmark MAE are not experimented with. Note that selecting τ_R according to the data distribution relies on a priori knowledge; even the default 0.5 achieves state-of-the-art performance on both benchmarks.

4.7. State-of-the-Art Comparisons

For MEPAVE, our DRAM is compared with unimodal and multimodal methods (Table 8): slot-filling methods RNN-LSTM [12], Attn-BiRNN [11], Slot-Gated [13], and Joint-BERT [17]; unimodal attribute value extractors SUOpenTag [14], JAVE [6], AVEQA [15], AdaTag [16], MAVEQA [29], SMARTAVE-text [8], and K-PLUG [18] (pretrained with 25 M extra data, marked with "*"); multimodal attribute value extractors M-JAVE [6], PAM [23], SMARTAVE [8], EKE-MMRC [7], and DEFLATE [9] (pretrained with the multimodal foundation model DALL-E [24], marked with "*"). Joint-BERT, K-PLUG, and SMARTAVE fine-tune the pretrained models, while the others only use frozen ones. PAM and SMARTAVE have extra OCR input. The source code link of EIVEN [25] provided in its paper is currently broken; thus, its implementation on the mainstream dataset MEPAVE is unavailable. Our proposed DRAM outperforms previous methods as the new state of the art, especially on the TAG-F1 metric.
For MAE, state-of-the-art methods MAE-model [22] and M-JAVE [6] are compared. Note that the MAE-model is a generative model based on extra known attribute inputs, while our DRAM and M-JAVE need no extra inputs. Table 9 shows that our DRAM performs the best on such a large-scale dataset with 2 K attributes. Our DRAM performs well with both benchmarks.

5. Visualizations

As shown in Figure 7, the attention maps on the input image are visualized under the query of Chinese characters from the MEPAVE [6] dataset. Ignoring the key parts that indicate the laced pattern of the dress and distracted by the text-irrelevant background, Vanilla IRC misunderstands the second Chinese character "silk" as a value of the attribute "Material". Instead, DRAM captures more laced patterns on the top and bottom of the dress via our proposed TEM and thus successfully extracts two Chinese characters as the value "laced" of the attribute "Pattern" within a correct attribute range via our proposed ARM. Meanwhile, such visualization also shows the key role of images in multimodal attribute value extraction.
Moreover, for the examples correcting the missing prediction of attributes, Figure 8 shows model prediction and corresponding attention visualization of other e-commerce products like bags and jackets. From the baseline model Vanilla IRC, the distracted attention cannot focus on the key text-related parts in the input images, i.e., “black” on the surface of the bag for attribute “Color”, and “round collar” on the top of the jacket for the attribute “Collar”. With text-guided attention of our proposed TEM, our DRAM focuses on both correct visual parts for attributes and global contexts around these visual parts to enrich multimodal semantics.

6. Future Works

Our proposed DRAM is task-specific for multimodal attribute value extraction, which requires the task to be both text-dominant and multimodal, as well as highly relevant to fine-grained attributes of the input. The range of attributes should be predefined rather than produced unstably by generation, while the current attributes of the input images and texts remain unknown. Hence, our proposed DRAM can be applied to multimodal tasks like fine-grained VQA (Visual Question Answering) [30,31,32], with object attributes in the textual questions and images of the objects. Meanwhile, stronger baselines beyond sequence tagging methods [6] will be adopted, especially the generative prediction manner of multimodal language models like BLIP [33] and Flamingo [34].
For failure cases, Section 3.2.2 shows a typical case where the “Ground Truth of bag_thick” is all “O” since the attribute is absent, while the model selects “B” at the first token in the “Prediction based on Fbag_thick”. The model is sensitive to the numeric token “13” in the original textual input “13 inch bag matched with silver laptop computers”; such a quantitative thickness cannot be directly reflected by the input image of a bag like the one in Figure 8, so the assistance of the multimodal input is ineffective, which should be tackled in future works.

7. Conclusions

In this paper, we proposed a novel multimodal e-commerce attribute value extraction method via Dynamic Range Modulation (DRAM), which comprises two key components: IRC calibrates the pretrained language model to perform multimodal fusion within a language range that fuses features of related language meanings, fine-tuning it to handle each modality as Text-Related Embeddings (TEM); ARM dynamically determines the range of attributes and selects learnable prototypes to guide the prediction of the chosen attributes. Different policies (DyNet-Guided, BERT-Guided, and Prototype-Guided) are investigated for attribute range minimization. With the cooperation of IRC and ARM, the novel approach DRAM performs well on the challenging benchmarks MEPAVE and MAE.

Author Contributions

Conceptualization, M.L. and C.Z.; methodology, M.L.; software, M.L.; validation, M.L. and C.Z.; formal analysis, M.L.; investigation, M.L.; resources, M.L.; data curation, M.L.; writing—original draft preparation, C.Z.; writing—review and editing, M.L.; visualization, C.Z.; supervision, C.Z.; project administration, C.Z.; funding acquisition, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant Number 62072032.

Data Availability Statement

Data are contained within Section 4.1 of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Çiftlikçi, M.S.; Çakmak, Y.; Kalaycı, T.A.; Abut, F.; Akay, M.F.; Kızıldağ, M. A New Large Language Model for Attribute Extraction in E-Commerce Product Categorization. Electronics 2025, 14, 1930. [Google Scholar] [CrossRef]
  2. Roy, K.; Goyal, P.; Pandey, M. Exploring generative frameworks for product attribute value extraction. Expert Syst. Appl. 2024, 243, 122850. [Google Scholar] [CrossRef]
  3. Sun, L.; Wang, J.; Zhang, K.; Su, Y.; Weng, F. RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER. Proc. AAAI Conf. Artif. Intell. 2021, 35, 13860–13868. [Google Scholar] [CrossRef]
  4. Yuan, D.; Zhu, H.; Chen, R.; Zhou, S.; Tang, J.; Shu, X.; Liu, Q. CMMDL: Cross-Modal Multi-Domain Learning Method for Image Fusion. Neural Netw. 2025, 196, 108450. [Google Scholar] [CrossRef] [PubMed]
  5. Zhang, Q.; Liu, Q.; Yuan, D.; Li, X.; Liu, Y. PPIFuse: Physical priors injected infrared and visible image fusion. IEEE Trans. Circuits Syst. Video Technol. 2025. [Google Scholar] [CrossRef]
  6. Zhu, T.; Wang, Y.; Li, H.; Wu, Y.; He, X.; Zhou, B. Multimodal Joint Attribute Prediction and Value Extraction for E-commerce Product. In 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2129–2139. [Google Scholar]
  7. Bai, C. E-Commerce Knowledge Extraction via Multi-modal Machine Reading Comprehension. In International Conference on Database Systems for Advanced Applications; Springer: Cham, Switzerland, 2022; pp. 272–280. [Google Scholar]
  8. Wang, Q.; Yang, L.; Wang, J.; Krishnan, J.; Dai, B.; Wang, S.; Xu, Z.; Khabsa, M.; Ma, H. SMARTAVE: Structured Multimodal Transformer for Product Attribute Value Extraction. In Findings of the Association for Computational Linguistics: EMNLP 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 263–276. [Google Scholar]
  9. Zhang, Y.; Wang, S.; Li, P.; Dong, G.; Wang, S.; Xian, Y.; Li, Z.; Zhang, H. Pay attention to implicit attribute values: A multi-modal generative framework for AVE task. In Findings of the Association for Computational Linguistics: ACL 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 13139–13151. [Google Scholar]
  10. Zheng, G.; Mukherjee, S.; Dong, X.L.; Li, F. OpenTag: Open attribute value extraction from product profiles. In 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1049–1058. [Google Scholar]
  11. Liu, B.; Lane, I. Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 685–689. [Google Scholar]
  12. Hakkani-Tür, D.; Tür, G.; Celikyilmaz, A.; Chen, Y.N.; Gao, J.; Deng, L.; Wang, Y.Y. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Proceedings of the Interspeech, San Francisco, CA, USA, 8–12 September 2016; pp. 715–719. [Google Scholar]
  13. Goo, C.W.; Gao, G.; Hsu, Y.K.; Huo, C.L.; Chen, T.C.; Hsu, K.W.; Chen, Y.N. Slot-gated modeling for joint slot filling and intent prediction. In 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 753–757. [Google Scholar]
  14. Xu, H.; Wang, W.; Mao, X.; Jiang, X.; Lan, M. Scaling up open tagging from tens to thousands: Comprehension empowered attribute value extraction from product title. In 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 5214–5223. [Google Scholar]
  15. Wang, Q.; Yang, L.; Kanagal, B.; Sanghai, S.; Sivakumar, D.; Shu, B.; Yu, Z.; Elsas, J. Learning to extract attribute value from product via question answering: A multi-task approach. In 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2020; pp. 47–55. [Google Scholar]
  16. Yan, J.; Zalmout, N.; Liang, Y.; Grant, C.; Ren, X.; Dong, X.L. AdaTag: Multi-Attribute Value Extraction from Product Profiles with Adaptive Decoding. In 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4694–4705. [Google Scholar]
  17. Chen, Q.; Zhuo, Z.; Wang, W. Bert for joint intent classification and slot filling. arXiv 2019, arXiv:1902.10909. [Google Scholar] [CrossRef]
  18. Xu, S.; Li, H.; Yuan, P.; Wang, Y.; Wu, Y.; He, X.; Liu, Y.; Zhou, B. K-PLUG: Knowledge-injected Pre-trained Language Model for Natural Language Understanding and Generation in E-Commerce. In Findings of the Association for Computational Linguistics: EMNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1–17. [Google Scholar]
  19. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  20. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling instruction-finetuned language models. J. Mach. Learn. Res. 2024, 25, 3381–3433. [Google Scholar]
  21. Brinkmann, A.; Shraga, R.; Bizer, C. Extractgpt: Exploring the potential of large language models for product attribute value extraction. In International Conference on Information Integration and Web Intelligence; Springer: Berlin/Heidelberg, Germany, 2024; pp. 38–52. [Google Scholar]
  22. Logan, R.L., IV; Humeau, S.; Singh, S. Multimodal Attribute Extraction. In Proceedings of the 6th Workshop on Automated Knowledge Base Construction, AKBC@NIPS, Long Beach, CA, USA, 8 December 2017. [Google Scholar]
  23. Lin, R.; He, X.; Feng, J.; Zalmout, N.; Liang, Y.; Xiong, L.; Dong, X.L. PAM: Understanding Product Images in Cross Product Category Attribute Extraction. In 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2021; pp. 3262–3270. [Google Scholar]
  24. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8821–8831. [Google Scholar]
  25. Zou, H.; Yu, G.; Fan, Z.; Bu, D.; Liu, H.; Dai, P.; Jia, D.; Caragea, C. EIVEN: Efficient Implicit Attribute Value Extraction using Multimodal LLM. In 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 453–463. [Google Scholar]
  26. Gao, H.; Zhu, C.; Liu, M.; Gu, W.; Wang, H.; Liu, W.; Yin, X.C. CAliC: Accurate and Efficient Image-Text Retrieval via Contrastive Alignment and Visual Contexts Modeling. In 30th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2022; pp. 4957–4966. [Google Scholar]
  27. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  29. Yang, L.; Wang, Q.; Yu, Z.; Kulkarni, A.; Sanghai, S.; Shu, B.; Elsas, J.; Kanagal, B. Mave: A product dataset for multi-source attribute value extraction. In Fifteenth ACM International Conference on Web Search and Data Mining; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1256–1265. [Google Scholar]
  30. Li, Q.; Fu, J.; Yu, D.; Mei, T.; Luo, J. Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions. In 2018 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 1338–1346. [Google Scholar]
  31. Wu, Q.; Shen, C.; Wang, P.; Dick, A.; Van Den Hengel, A. Image captioning and visual question answering based on attributes and external knowledge. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1367–1381. [Google Scholar] [CrossRef] [PubMed]
  32. Berlot-Attwell, I.; Agrawal, K.K.; Carrell, A.M.; Sharma, Y.; Saphra, N. Attribute diversity determines the systematicity gap in vqa. In 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 9576–9611. [Google Scholar]
  33. Xue, L.; Shu, M.; Awadalla, A.; Wang, J.; Yan, A.; Purushwalkam, S.; Zhou, H.; Prabhu, V.; Dai, Y.; Ryoo, M.S.; et al. Blip-3: A family of open large multimodal models. In IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2025; pp. 6124–6135. [Google Scholar]
  34. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
Figure 1. Illustration of differences between our proposed methods (right column) and previous works (left column). Their limitations are as follows: (a) Information: The range of multimodal fusion is newly learned from co-occurrence and leads to mismatching. (b) Attributes: The popular maximum range causes wrong values in the attribute value predictions. These two processes, i.e., cross-modal fusion and prediction, are fundamental for multimodal attribute extraction.
Figure 2. Comparison between previous works (top) and our proposed Dynamic Range Modulation (bottom). (1) Multimodal fusion of previous works is newly learned merely from co-occurrence, while IRC calibrates the unimodal language fusion into a multimodal one. (2) Selecting product-related attribute prototypes by ARM makes the prediction more adaptive and accurate.
Figure 3. Our proposed Information Range Calibration (IRC). (a) First, IRC “calibrates” the unimodal pretrained language model to perform multimodal fusion via fine-tuning on the multimodal attribute value extraction task, marked as gradient flow with a red dashed line. (b) Second, Vanilla IRC adopts the task-calibrated multimodal language model to handle text and image embeddings. (c) Finally, Text-Related Embeddings (TEM) select the text-relevant parts of each modality for multimodal fusion. Such calibration maintains the textual semantic knowledge from the pretraining of the language model. By modifying the vanilla self-attention, the proposed fusion operation of TEM avoids more text-irrelevant visual components.
Figure 4. Attention M = QKᵀ/√D as the range of feature fusion in (a) the pretrained language model and our Vanilla IRC, and (b) Vanilla IRC+TEM. “T→V” means the Textual Query matches the Visual Key to fuse with the value corresponding to that key. Blue, green, yellow, and pink colors mark T→T, T→V, V→T, and V→V, respectively. The red ✗ and dashed line mark the text-irrelevant background of the image affecting the multimodal fusion.
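The fusion range in the Figure 4 caption is the standard scaled dot-product attention of Transformers [28]. Below is a minimal NumPy sketch; the optional boolean key mask stands in for the way TEM suppresses text-irrelevant visual components (this masking interface is our illustrative assumption, not the exact TEM operation).

```python
import numpy as np

def attention_scores(Q, K, mask=None):
    """Fusion range M = Q K^T / sqrt(D), followed by a row-wise softmax.
    An optional boolean mask (True = keep) drops key positions, e.g.
    text-irrelevant image regions, from the fusion range."""
    D = Q.shape[-1]
    M = Q @ K.T / np.sqrt(D)          # scaled dot-product scores
    if mask is not None:
        M = np.where(mask, M, -1e9)   # masked keys get ~zero weight
    M = M - M.max(axis=-1, keepdims=True)   # numerically stable softmax
    W = np.exp(M)
    return W / W.sum(axis=-1, keepdims=True)
```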
Figure 5. Different variants and the details of our proposed Attribute Range Minimization (ARM): (a) The Prototype-Guided Policy looks up “Prototype” vectors to guide the model to predict values for only the selected attributes. (b) The DyNet-Guided Policy learns parameters and applies them in a dynamic network. (c) The BERT-Guided Policy encodes the selected attribute words from a dictionary into textual embeddings with the pretrained BERT model for guidance. (d) Details of our proposed ARM, including the multi-label classification to minimize the range of attributes, “Proto Lookup”, and the “Guidance Mechanism”. In detail, the model is guided by a classification loss to dynamically select the attributes related to the current inputs by selecting the learnable “prototype” vectors and fusing them with the current multimodal features. Red and black arrows with their corresponding blocks mark our proposed modules for ARM and IRC, respectively.
Figure 6. Experiments on τR for different benchmarks. As a threshold that keeps the attributes with higher existence scores in our proposed Attribute Range Minimization (ARM), τR varies with the specific attribute distribution of each benchmark. Red numbers are the best performances, with the selected τR = 0.5 on MEPAVE and τR = 0.4 on MAE, respectively. Orange denotes the best choice of τR.
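The role of τR in Figure 6 amounts to a thresholding step: attributes whose predicted existence scores reach τR (0.5 on MEPAVE, 0.4 on MAE) are kept, and their prototype vectors are gathered to guide tagging. The function below is an illustrative simplification of ARM; the name and array shapes are our assumptions.

```python
import numpy as np

def minimize_attribute_range(scores, prototypes, tau_r=0.5):
    """Minimize the output attribute range by thresholding.
    scores: (A,) per-attribute existence scores in [0, 1]
    prototypes: (A, d) learnable prototype matrix
    Returns the kept attribute ids and their guiding prototypes."""
    keep = np.where(scores >= tau_r)[0]   # product-related attribute ids
    return keep, prototypes[keep]         # minimized range + prototypes
```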
Figure 7. Visualizations of our Vanilla IRC (top) and DRAM (bottom). Attention maps are queried by the words on top and generated by the TEM of DRAM or the 1st layer of Vanilla IRC. Green “B”/“I” tags are correct and red ones are wrong. DRAM focuses on more laced regions (red dashed boxes) surrounded by more visual contexts than Vanilla IRC. The red ✗ and green ✓ mark wrong and correct results, respectively. Each Chinese character of the input text corresponds to the underlined English phrase giving its meaning.
Figure 8. Visualizations of our Vanilla IRC (top) and DRAM (bottom). Attention maps are queried by the words on top and generated by the TEM of DRAM or the 1st layer of Vanilla IRC. Green “B”/“I” tags are correct and red ones are wrong. DRAM focuses on more key regions (red dashed boxes) surrounded by more visual contexts than Vanilla IRC. The red ✗ and green ✓ mark wrong and correct results, respectively. Each Chinese character of the input text corresponds to the underlined English phrase giving its meaning.
Table 1. Statistics of the MEPAVE dataset [6].

| Category | # Product | # Instance | # Attr | # Value |
|----------|-----------|------------|--------|---------|
| Clothes  | 12,240    | 34,154     | 14     | 1210    |
| Shoes    | 9022      | 20,525     | 10     | 1036    |
| Bags     | 3376      | 8307       | 8      | 631     |
| Luggage  | 1291      | 2227       | 7      | 275     |
| Dresses  | 4567      | 12,283     | 13     | 714     |
| Boots    | 713       | 2090       | 11     | 322     |
| Pants    | 2832      | 7608       | 13     | 595     |
| Total    | 34,041    | 87,194     | 26     | 2129    |

“#” means the number of data or annotations.
Table 2. Statistics of the MAE dataset [22], where the detailed categories were not tagged during collection, according to its original paper.

| # Product | # Instance | # Attr | # Value |
|-----------|------------|--------|---------|
| 2.2 M     | 7.6 M      | 2.1 K  | 23.6 K  |
Table 3. Ablation study between frozen and task-calibrated language models for our proposed IRC.

| Scheme          | Method     | TAG-F1 |
|-----------------|------------|--------|
| Frozen          | M-JAVE     | 87.17  |
| Task-calibrated | SMARTAVE   | 91.52  |
| Task-calibrated | M-JAVE ‡   | 95.40  |
| Task-calibrated | IRC (ours) | 96.35  |

“‡” denotes reproduced results of fine-tuned M-JAVE. Unless otherwise mentioned, bold marks the best results and best methods.
Table 4. Ablation study on our proposed IRC regarding different language models and TEM. Note that Vanilla IRC (V-IRC) is affected by the issue in Figure 4.

| Modality | Method                    | TAG-F1 |
|----------|---------------------------|--------|
| T        | Vanilla-IRC (BERT)        | 94.88  |
| T        | Vanilla-IRC (RoBERTa)     | 95.21  |
| T+V      | V-IRC (RoBERTa + ResNet)  | 95.05  |
| T+V      | +1 × self-attention layer | 94.89  |
| T+V      | +TEM (our final IRC)      | 96.35  |
Table 5. Sensitivity analysis of the weight λ of the classification loss for our proposed ARM in Equation (17).

| λ in Equation (17) | CLS-F1 | TAG-F1 |
|--------------------|--------|--------|
| λ = 1.0 × 10⁻¹     | 97.58  | 96.68  |
| λ = 1.0 × 10⁻³     | 97.65  | 96.86  |
| λ = 1.0 × 10⁻⁵     | 96.96  | 95.39  |
Table 6. Analysis of the Precision and Recall of ARM and TEM. “+A+T” is the final version of our DRAM.

| Method | Precision     | Recall        | TAG-F1        |
|--------|---------------|---------------|---------------|
| V-IRC  | 94.04         | 96.07         | 95.05         |
| +ARM   | 94.99 (+0.95) | 96.71 (+0.64) | 95.84 (+0.79) |
| +TEM   | 95.21 (+1.17) | 97.52 (+1.45) | 96.35 (+1.30) |
| +A+T   | 95.89 (+1.85) | 97.86 (+1.79) | 96.86 (+1.81) |
Table 7. Comparison of the variants of our proposed Attribute Range Minimization (ARM): Prototype (P), DyNet (D), and BERT (B).

| Method     | D | B | P | CLS-F1 | TAG-F1 |
|------------|---|---|---|--------|--------|
| V-IRC+TEM  |   |   |   | -      | 96.35  |
| +DARM      | ✓ |   |   | 97.52  | 96.07  |
| +BARM      |   | ✓ |   | 97.54  | 96.62  |
| +PARM      |   |   | ✓ | 97.65  | 96.86  |

“✓” denotes the usage of different variants of our proposed ARM.
Table 8. Comparison of state-of-the-art models on MEPAVE. “*” means extra unimodal or multimodal vision-language data pretraining.

| Method           | Modality                         | CLS-F1 | TAG-F1 |
|------------------|----------------------------------|--------|--------|
| RNN-LSTM [12]    | T (slot-filling)                 | 85.76  | 82.92  |
| Attn-BiRNN [11]  | T (slot-filling)                 | 86.10  | 83.28  |
| Slot-Gated [13]  | T (slot-filling)                 | 86.70  | 83.35  |
| Joint-BERT [17]  | T (slot-filling)                 | 86.93  | 83.73  |
| SUOpenTag [14]   | T (attribute value extraction)   | -      | 77.12  |
| JAVE [6]         | T (attribute value extraction)   | 87.98  | 84.78  |
| AVEQA [15]       | T (attribute value extraction)   | -      | 89.15  |
| AdaTag [16]      | T (attribute value extraction)   | -      | 81.36  |
| MAVEQA [29]      | T (attribute value extraction)   | -      | 88.71  |
| SMARTAVE [8]     | T (attribute value extraction)   | -      | 89.21  |
| K-PLUG * [18]    | T (attribute value extraction)   | -      | 95.97  |
| M-JAVE [6]       | T+V (attribute value extraction) | 90.69  | 87.17  |
| PAM [23]         | T+V (attribute value extraction) | -      | 89.68  |
| SMARTAVE [8]     | T+V (attribute value extraction) | -      | 91.52  |
| EKE-MMRC [7]     | T+V (attribute value extraction) | -      | 93.52  |
| DEFLATE * [9]    | T+V (attribute value extraction) | 96.09  | 87.12  |
| DRAM (ours)      | T+V (attribute value extraction) | 97.65  | 96.86  |
Table 9. Comparison with state-of-the-art models on MAE.

| Method         | Modality                         | Accuracy |
|----------------|----------------------------------|----------|
| MAE-model [22] | T+V (attribute value extraction) | 72.96    |
| M-JAVE [6]     | T+V (attribute value extraction) | 75.01    |
| DRAM (ours)    | T+V (attribute value extraction) | 79.20    |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
