We address multimodal product attribute prediction of fashion items based on product images and titles. The product attributes, such as type, sub-type, cut or fit, are in a chain format, with previous attribute values constraining the values of the next attributes. We propose to address this task with a sequential prediction model that can learn to capture the dependencies between the different attribute values in the chain. Our experiments on three product datasets show that the sequential model outperforms two non-sequential baselines on all experimental datasets. Compared to other models, the sequential model is also better able to generate sequences of attribute chains not seen during training. We also measure the contributions of both image and textual input and show that while text-only models always outperform image-only models, only the multimodal sequential model combining both image and text improves over the text-only model on all experimental datasets.
This is an open access article distributed under the Creative Commons Attribution License
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited