DRAM: Dynamic Range Modulation for Multimodal Attribute Value Extraction on E-Commerce Product Data
Abstract
1. Introduction
- First, the Information Range Calibration (IRC) method is proposed to dynamically fuse semantically related multimodal information into a shared language range, by fine-tuning an attention-based language model that handles each modality as Text-Related Embeddings (TEM); a minimal sketch follows this list.
- Second, the Attribute Range Minimization (ARM) method is designed to reduce wrong predictions in two steps: (1) a multi-class classification first decides the attribute range, and then (2) learnable prototypes are selected to dynamically predict values for the chosen attributes only.
- Finally, by integrating the proposed IRC and ARM, our new approach DRAM achieves superior performance over previous state-of-the-art techniques on the popular MEPAVE and MAE benchmarks.
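To make the IRC idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: visual features are projected into the language model's embedding space as Text-Related Embeddings and encoded jointly with the text tokens by an attention-based model that is fine-tuned end to end. All names and dimensions (`IRCSketch`, `visual_proj`, `d_visual = 2048`, the layer counts) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IRCSketch(nn.Module):
    """Illustrative sketch of Information Range Calibration (IRC):
    visual region features are projected into the token-embedding space
    ("Text-Related Embeddings") and processed together with text tokens
    by an attention-based language model fine-tuned end to end.
    Names and dimensions are our own choices, not the authors' code.
    """

    def __init__(self, vocab_size=30522, d_model=768, d_visual=2048,
                 n_layers=4, n_heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Map image-region features into the language range of the model.
        self.visual_proj = nn.Linear(d_visual, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids, visual_feats):
        # token_ids: (B, T) word-piece ids; visual_feats: (B, R, d_visual).
        text = self.token_emb(token_ids)            # (B, T, d_model)
        tem = self.visual_proj(visual_feats)        # (B, R, d_model) "text-related" embeddings
        fused = torch.cat([text, tem], dim=1)       # one sequence spanning both modalities
        # Keep only the per-token states, which downstream tagging consumes.
        return self.encoder(fused)[:, : token_ids.size(1)]
```

The point the sketch captures is that the visual stream is mapped into the same range as the word embeddings, so a single fine-tuned attention stack can attend over both modalities at once.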
2. Related Work
2.1. Unimodal Attribute Value Extraction
2.2. Multimodal Attribute Value Extraction
2.3. Multimodal Encoding via Language Models
3. Proposed Method
3.1. Information Range Calibration
3.1.1. Task-Calibrated Multimodal Language Model
3.1.2. Text-Related Embeddings for Task-Calibrated Multimodal Language Model
3.2. Attribute Range Minimization
3.2.1. Different Policies of ARM
3.2.2. The Prediction Pipeline of ARM
Input text: “13 inch bag matched with sliver laptop computers”
Annotation: “B- I- O O O O B- I-”
Prediction based on : “B I O O O O O O”
Prediction based on : “B O O O O O O”
Original annotation: “B- I- O O O O B- I-”
Ground truth for : “B I O O O O O O”
Ground truth for : “O O O O O O O O”
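Read as a pipeline, the example shows the two ARM steps: classification first decides which attributes are present in the text, tagging is then decoded only for those attributes, and every unselected attribute defaults to an all-O sequence (as in the relabeled ground truths above). Below is a minimal sketch of this gating under our own illustrative names; the confidence threshold and the multi-label sigmoid are assumptions, not the authors' exact decision rule.

```python
import torch

def arm_predict(cls_logits, tag_logits, attr_names, threshold=0.5):
    """Illustrative ARM decoding (not the authors' code).

    cls_logits: (A,) attribute-presence scores from the classification step.
    tag_logits: (A, T, 3) per-attribute logits over {B, I, O} for T tokens.
    Returns {attribute: BIO sequence}; attributes outside the minimized
    range are fixed to all-O, mirroring the example above.
    """
    bio = ["B", "I", "O"]
    probs = torch.sigmoid(cls_logits)            # step (1): decide the attribute range
    selected = (probs > threshold).nonzero(as_tuple=True)[0].tolist()
    preds = {}
    for a, name in enumerate(attr_names):
        if a in selected:                        # step (2): tag only chosen attributes
            tags = tag_logits[a].argmax(dim=-1)  # (T,) indices into {B, I, O}
            preds[name] = [bio[t] for t in tags.tolist()]
        else:
            preds[name] = ["O"] * tag_logits.size(1)
    return preds
```

For the eight-token example above, an attribute kept by step (1) decodes to a sequence such as “B I O O O O O O”, while an attribute outside the minimized range is forced to “O O O O O O O O”.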
3.3. Loss Functions
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Ablation Study
4.4. Analysis of Precision and Recall
4.5. Comparison of Variants of ARM
4.6. Selection of for ARM
4.7. State-of-the-Art Comparisons
5. Visualizations
6. Future Work
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Çiftlikçi, M.S.; Çakmak, Y.; Kalaycı, T.A.; Abut, F.; Akay, M.F.; Kızıldağ, M. A New Large Language Model for Attribute Extraction in E-Commerce Product Categorization. Electronics 2025, 14, 1930.
- Roy, K.; Goyal, P.; Pandey, M. Exploring generative frameworks for product attribute value extraction. Expert Syst. Appl. 2024, 243, 122850.
- Sun, L.; Wang, J.; Zhang, K.; Su, Y.; Weng, F. RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER. Proc. AAAI Conf. Artif. Intell. 2021, 35, 13860–13868.
- Yuan, D.; Zhu, H.; Chen, R.; Zhou, S.; Tang, J.; Shu, X.; Liu, Q. CMMDL: Cross-Modal Multi-Domain Learning Method for Image Fusion. Neural Netw. 2025, 196, 108450.
- Zhang, Q.; Liu, Q.; Yuan, D.; Li, X.; Liu, Y. PPIFuse: Physical priors injected infrared and visible image fusion. IEEE Trans. Circuits Syst. Video Technol. 2025.
- Zhu, T.; Wang, Y.; Li, H.; Wu, Y.; He, X.; Zhou, B. Multimodal Joint Attribute Prediction and Value Extraction for E-commerce Product. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2129–2139.
- Bai, C. E-Commerce Knowledge Extraction via Multi-modal Machine Reading Comprehension. In Proceedings of the International Conference on Database Systems for Advanced Applications; Springer: Cham, Switzerland, 2022; pp. 272–280.
- Wang, Q.; Yang, L.; Wang, J.; Krishnan, J.; Dai, B.; Wang, S.; Xu, Z.; Khabsa, M.; Ma, H. SMARTAVE: Structured Multimodal Transformer for Product Attribute Value Extraction. In Findings of the Association for Computational Linguistics: EMNLP 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 263–276.
- Zhang, Y.; Wang, S.; Li, P.; Dong, G.; Wang, S.; Xian, Y.; Li, Z.; Zhang, H. Pay attention to implicit attribute values: A multi-modal generative framework for AVE task. In Findings of the Association for Computational Linguistics: ACL 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 13139–13151.
- Zheng, G.; Mukherjee, S.; Dong, X.L.; Li, F. OpenTag: Open attribute value extraction from product profiles. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1049–1058.
- Liu, B.; Lane, I. Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling. In Proceedings of Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 685–689.
- Hakkani-Tür, D.; Tür, G.; Celikyilmaz, A.; Chen, Y.N.; Gao, J.; Deng, L.; Wang, Y.Y. Multi-Domain Joint Semantic Frame Parsing Using Bi-Directional RNN-LSTM. In Proceedings of Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 715–719.
- Goo, C.W.; Gao, G.; Hsu, Y.K.; Huo, C.L.; Chen, T.C.; Hsu, K.W.; Chen, Y.N. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 753–757.
- Xu, H.; Wang, W.; Mao, X.; Jiang, X.; Lan, M. Scaling up open tagging from tens to thousands: Comprehension empowered attribute value extraction from product title. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 5214–5223.
- Wang, Q.; Yang, L.; Kanagal, B.; Sanghai, S.; Sivakumar, D.; Shu, B.; Yu, Z.; Elsas, J. Learning to extract attribute value from product via question answering: A multi-task approach. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2020; pp. 47–55.
- Yan, J.; Zalmout, N.; Liang, Y.; Grant, C.; Ren, X.; Dong, X.L. AdaTag: Multi-Attribute Value Extraction from Product Profiles with Adaptive Decoding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL/IJCNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4694–4705.
- Chen, Q.; Zhuo, Z.; Wang, W. BERT for Joint Intent Classification and Slot Filling. arXiv 2019, arXiv:1902.10909.
- Xu, S.; Li, H.; Yuan, P.; Wang, Y.; Wu, Y.; He, X.; Liu, Y.; Zhou, B. K-PLUG: Knowledge-injected Pre-trained Language Model for Natural Language Understanding and Generation in E-Commerce. In Findings of the Association for Computational Linguistics: EMNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1–17.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551.
- Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling instruction-finetuned language models. J. Mach. Learn. Res. 2024, 25, 3381–3433.
- Brinkmann, A.; Shraga, R.; Bizer, C. ExtractGPT: Exploring the potential of large language models for product attribute value extraction. In Proceedings of the International Conference on Information Integration and Web Intelligence; Springer: Berlin/Heidelberg, Germany, 2024; pp. 38–52.
- Logan, R.L., IV; Humeau, S.; Singh, S. Multimodal Attribute Extraction. In Proceedings of the 6th Workshop on Automated Knowledge Base Construction (AKBC@NIPS), Long Beach, CA, USA, 8 December 2017.
- Lin, R.; He, X.; Feng, J.; Zalmout, N.; Liang, Y.; Xiong, L.; Dong, X.L. PAM: Understanding Product Images in Cross Product Category Attribute Extraction. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2021; pp. 3262–3270.
- Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8821–8831.
- Zou, H.; Yu, G.; Fan, Z.; Bu, D.; Liu, H.; Dai, P.; Jia, D.; Caragea, C. EIVEN: Efficient Implicit Attribute Value Extraction using Multimodal LLM. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 453–463.
- Gao, H.; Zhu, C.; Liu, M.; Gu, W.; Wang, H.; Liu, W.; Yin, X.C. CAliC: Accurate and Efficient Image-Text Retrieval via Contrastive Alignment and Visual Contexts Modeling. In Proceedings of the 30th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2022; pp. 4957–4966.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Yang, L.; Wang, Q.; Yu, Z.; Kulkarni, A.; Sanghai, S.; Shu, B.; Elsas, J.; Kanagal, B. MAVE: A product dataset for multi-source attribute value extraction. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1256–1265.
- Li, Q.; Fu, J.; Yu, D.; Mei, T.; Luo, J. Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 1338–1346.
- Wu, Q.; Shen, C.; Wang, P.; Dick, A.; Van Den Hengel, A. Image captioning and visual question answering based on attributes and external knowledge. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1367–1381.
- Berlot-Attwell, I.; Agrawal, K.K.; Carrell, A.M.; Sharma, Y.; Saphra, N. Attribute diversity determines the systematicity gap in VQA. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 9576–9611.
- Xue, L.; Shu, M.; Awadalla, A.; Wang, J.; Yan, A.; Purushwalkam, S.; Zhou, H.; Prabhu, V.; Dai, Y.; Ryoo, M.S.; et al. BLIP-3: A family of open large multimodal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2025; pp. 6124–6135.
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736.
Statistics of the MEPAVE dataset by product category:

| Category | # Product | # Instance | # Attr | # Value |
|---|---|---|---|---|
| Clothes | 12,240 | 34,154 | 14 | 1210 |
| Shoes | 9022 | 20,525 | 10 | 1036 |
| Bags | 3376 | 8307 | 8 | 631 |
| Luggage | 1291 | 2227 | 7 | 275 |
| Dresses | 4567 | 12,283 | 13 | 714 |
| Boots | 713 | 2090 | 11 | 322 |
| Pants | 2832 | 7608 | 13 | 595 |
| Total | 34,041 | 87,194 | 26 | 2129 |
Statistics of the MAE dataset:

| # Product | # Instance | # Attr | # Value |
|---|---|---|---|
| 2.2 M | 7.6 M | 2.1 K | 23.6 K |
Frozen vs. task-calibrated multimodal encoding (TAG-F1):

| Scheme | Method | TAG-F1 |
|---|---|---|
| Frozen | M-JAVE | 87.17 |
| Task-calibrated | SMARTAVE | 91.52 |
| | M-JAVE ‡ | 95.40 |
| | IRC (ours) | 96.35 |
Ablation on input modalities and TEM (T: text, V: visual):

| Modality | Method | TAG-F1 |
|---|---|---|
| T | Vanilla-IRC (BERT) | 94.88 |
| | Vanilla-IRC (RoBERTa) | 95.21 |
| T+V | V-IRC (RoBERTa + ResNet) | 95.05 |
| | +1 × self-attention layer | 94.89 |
| | +TEM (our final IRC) | 96.35 |
| in Equation (17) | CLS-F1 | TAG-F1 |
|---|---|---|
| | 97.58 | 96.68 |
| | 97.65 | 96.86 |
| | 96.96 | 95.39 |
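The table above reflects a standard validation sweep: each candidate value of the Equation (17) parameter is trained or re-scored, evaluated on CLS-F1 and TAG-F1, and the best setting (here the middle row) is kept. A hedged sketch of such a selection loop, where `train_and_eval` is a hypothetical helper returning the two validation scores for a given value:

```python
# Hypothetical sweep over the parameter of Equation (17) for ARM.
# train_and_eval is a stand-in helper: it trains (or re-scores) the model
# with the given value and returns (cls_f1, tag_f1) on validation data.
def select_parameter(candidates, train_and_eval):
    results = {v: train_and_eval(v) for v in candidates}
    # Keep the value with the best TAG-F1, breaking ties by CLS-F1.
    best = max(results, key=lambda v: (results[v][1], results[v][0]))
    return best, results
```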
Precision and recall analysis (absolute gains over the V-IRC baseline in parentheses):

| Method | Precision | Recall | TAG-F1 |
|---|---|---|---|
| V-IRC | 94.04 | 96.07 | 95.05 |
| +ARM | 94.99 (+0.95) | 96.71 (+0.64) | 95.84 (+0.79) |
| +TEM | 95.21 (+1.17) | 97.52 (+1.45) | 96.35 (+1.30) |
| +ARM+TEM | 95.89 (+1.85) | 97.86 (+1.79) | 96.86 (+1.81) |
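For reference, TAG-F1 in these tables is a tagging F1 score; a common convention computes precision and recall over extracted value spans, as in the sketch below (the paper's exact evaluation protocol may differ):

```python
def span_f1(pred_tags, gold_tags):
    """Span-level precision/recall/F1 over BIO tag sequences.
    A common TAG-F1 convention; the paper's protocol may differ.
    """
    def spans(tags):
        out, start = set(), None
        for i, t in enumerate(tags + ["O"]):      # sentinel closes a trailing span
            if t == "B":
                if start is not None:             # a new "B" ends the previous span
                    out.add((start, i))
                start = i
            elif t != "I" and start is not None:  # "O" (or sentinel) ends the span
                out.add((start, i))
                start = None
        return out

    p, g = spans(pred_tags), spans(gold_tags)
    tp = len(p & g)                               # exact span matches
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

Under this convention, precision is the fraction of predicted spans that exactly match a gold span and recall is the fraction of gold spans recovered, which is the trade-off the parenthesized gains in the table describe.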
Comparison of ARM variants (✓ marks the policy used: D = DARM, B = BARM, P = PARM):

| Method | D | B | P | CLS-F1 | TAG-F1 |
|---|---|---|---|---|---|
| V-IRC+TEM | | | | - | 96.35 |
| +DARM | ✓ | | | 97.52 | 96.07 |
| +BARM | | ✓ | | 97.54 | 96.62 |
| +PARM | | | ✓ | 97.65 | 96.86 |
Comparison with the state of the art (T: text-only; T+V: text + visual):

| Method | Modality | CLS-F1 | TAG-F1 |
|---|---|---|---|
| RNN-LSTM [12] | T (slot-filling methods) | 85.76 | 82.92 |
| Attn-BiRNN [11] | | 86.10 | 83.28 |
| Slot-Gated [13] | | 86.70 | 83.35 |
| Joint-BERT [17] | | 86.93 | 83.73 |
| SUOpenTag [14] | T (attribute value extraction) | - | 77.12 |
| JAVE [6] | | 87.98 | 84.78 |
| AVEQA [15] | | - | 89.15 |
| AdaTag [16] | | - | 81.36 |
| MAVEQA [29] | | - | 88.71 |
| SMARTAVE [8] | | - | 89.21 |
| K-PLUG * [18] | | - | 95.97 |
| M-JAVE [6] | T+V (attribute value extraction) | 90.69 | 87.17 |
| PAM [23] | | - | 89.68 |
| SMARTAVE [8] | | - | 91.52 |
| EKE-MMRC [7] | | - | 93.52 |
| DEFLATE * [9] | | 96.09 | 87.12 |
| DRAM (ours) | | 97.65 | 96.86 |