Disambiguity and Alignment: An Effective Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval

As a prominent topic in food computing, cross-modal recipe retrieval has garnered substantial attention. However, the semantic alignment across food images and recipes cannot be further enhanced due to the lack of intra-modal alignment in existing solutions. Additionally, a critical issue named food image ambiguity is overlooked, which disrupts the convergence of models. To these ends, we propose a novel Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval (MMACMR). To consider inter-modal and intra-modal alignment together, this method measures the ambiguous food image similarity under the guidance of their corresponding recipes. Additionally, we enhance recipe semantic representation learning by involving a cross-attention module between ingredients and instructions, which is effective in supporting food image similarity measurement. We conduct experiments on the challenging public dataset Recipe1M; as a result, our method outperforms several state-of-the-art methods in commonly used evaluation criteria.


Introduction
With rising awareness of health and sustainability, issues such as food safety [1,2] and nutrition [3] have gained unprecedented attention. Food computing [4-8] plays a crucial role in promoting healthier lifestyles, mitigating food waste, and enhancing both the quality and safety of food products. Cross-modal recipe retrieval [9,10], one of the hot topics in food computing, leverages artificial intelligence (AI) [11,12] to retrieve the corresponding recipes given food images as queries, or vice versa. In this task, food images depict finished dishes, while recipes comprise text encompassing three key components: a title, a list of ingredients, and detailed instructions outlining the cooking process.
The principal challenge in cross-modal recipe retrieval lies in mitigating the inherent heterogeneity between two distinct modalities: the recipes and the food images. To solve this challenging task, numerous studies have delved into additional interactions between the two modalities. For instance, refs. [13-15] tried to learn a consistent feature distribution of food images and recipe texts, refs. [16-19] boosted the interaction between the two modalities through cross-modal attention, and ref. [20] employed a joint transformer encoder to promote alignment. Due to the complexity of image-recipe pairs, many existing studies focused on exploiting the latent semantic information within a modality. Typically, refs. [21-25] focused on the crucial terms within recipes, while others [26-28] attempted to capture the salient objects or regions in food images to improve the cross-modal similarity measurement. Because of the complex textual structure of recipes, other researchers [29-33] investigated the interaction among the title, ingredients, and instructions to extract important semantics. Furthermore, some studies introduced diverse augmentation mechanisms to enhance cross-modal feature representations. For example, refs. [34-38] employed various Generative Adversarial Networks (GANs) to reconstruct information from food images and recipes to bridge the heterogeneity gap across modalities, while refs. [39,40] leveraged multilingual translation to enrich the recipe information. A crowdsourcing strategy has also been used to construct program representations of recipes [41]. Thanks to the recent flourishing of vision-language pre-training, some pioneers [42-45] have further embedded complex semantic relationship information into a common feature subspace by leveraging the pre-trained Contrastive Language-Image Pre-Training (CLIP) model.
Despite the significant progress made so far, there is still room for improvement in semantic distribution alignment across food images and recipes. Specifically, the prevailing efforts [18,29,30] concentrate on exploring inter-modal semantic alignment using conventional metric learning strategies, such as triplet loss. As shown in Figure 1a, the conventional metric learning strategy reduces the distance between positive image-recipe pairs (the circles and squares with the same color) and enlarges the distance between negative samples (the circles and the gray squares), and it is proficient in learning similarity relations within each image-recipe pair. However, semantic relations exist not only within each image-recipe pair but also extensively between different pairs. For example, the two image-recipe pairs in Figure 1a belong to the same food (chilli sauce), indicating strong semantic relations (highlighted by red lines) between the two food images as well as between the two recipes. The conventional metric learning method (e.g., triplet loss), however, fails to capture this relation information. In practice, many image pairs belong to the same food. This indicates that effectively enhancing intra-modal semantic alignment is important for improving recipe retrieval performance. For this purpose, a straightforward method applied in many cross-modal retrieval tasks [46-48] is to utilize a metric learning or contrastive learning strategy within each modality. However, there is a non-trivial issue in cross-modal recipe retrieval that has not been considered, namely food image ambiguity. Specifically, foods that look similar may be made from quite different materials and via different preparation methods. Thus, these similar food images correspond to significantly distinct recipes. For example, in Figure 1b, the top food image is a cup of raspberry smoothie, and the bottom one is a bowl of chilli sauce. These two foods are visually similar to each other, yet they are made from distinct ingredients by following quite different instructions. Unfortunately, existing methods embed the semantics of the two modalities into the common subspace independently. Due to the resemblance in appearance, these two images will be close in the common space, while their corresponding recipes will not. This leads to a dilemma: embeddings that have a large similarity (small distance) in the visual modality may have a small similarity (large distance) in the text modality. As a result, the two modalities are hard to align with each other, and the models are difficult to converge, which reduces the accuracy of retrieval. Motivated by this drawback, we observe that recipes are the more reliable modality. In other words, foods prepared using similar recipes will have similar visual appearances. Therefore, this study aims to answer the following two questions:
• Q1: How can we measure the similarity between ambiguous food images guided by their corresponding recipes?
• Q2: How can we further improve the fine-grained semantic alignment between ingredients and instructions within each recipe to support food image similarity measurement?
To this end, we propose a novel cross-modal recipe retrieval method called the Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval (MMACMR). To answer Q1, we design a novel strategy, the Multi-Modal Disambiguity and Alignment strategy (MDA for short), which calculates the intra-modal similarity of recipes and uses it to guide the distances between the corresponding images. As shown in Figure 1c, the green square is a recipe (chilli sauce) similar to the one shown by the orange square (sweet chilli sauce). Our MDA strategy attempts to pull them close and to guide the distance between the green and orange circles (their corresponding images). For Q2, considering that ingredients play a significant role within instructions, we introduce sentence-level cross-attention to focus on important ingredients in the instructions and further enhance the representations of recipes. In a nutshell, this work is a pioneering effort to further narrow the cross-modal heterogeneity between food images and recipes by considering multi-modal (both inter-modal and intra-modal) alignment while mitigating the impact of food image ambiguity.
To sum up, the main contributions of this article are fourfold:

The remainder of this article is organized as follows. The technical details and specific learning process are outlined in Section 2, the experimental particulars are discussed in Section 3, and we conclude the paper in Section 4.

Method
In this section, we first present the notations involved in this paper and provide the problem formulation for cross-modal recipe retrieval in Section 2.1. Then, we elaborate on the technical details of our method MMACMR, including the models in Section 2.2, the strategy in Section 2.3, and the algorithm in Section 2.4.

Notations
Without loss of generality, we denote sets as uppercase, handwritten, bold letters (e.g., $\mathcal{D}$) and matrices as uppercase letters (e.g., $W$). The $i$-th row of $W$ is denoted by $W_i$, and the element found in the $j$-th column of the $i$-th row of $W$ is denoted as $W_{ij}$. We represent the transpose of a matrix $W$ as $W^{\top}$. The notation $\|\cdot\|_2$ denotes the L2 norm of a matrix. We use $\mathrm{softmax}(\cdot)$ to represent the softmax function. To ease reading, we summarize the frequently used notations in Table 1.

Table 1. Frequently used notations.

| Notation | Description |
|---|---|
| $\mathcal{D}$ | A cross-modal recipe dataset |
| $X^{v}_i$ | The food image of the $i$-th pair |
| $X^{r}_i$ | The recipe of the $i$-th pair |
| $X^{tit}_i$ | The title of the recipe $X^{r}_i$ |
| $X^{ing}_i$ | The ingredients of the recipe $X^{r}_i$ |
| $X^{ins}_i$ | The instructions of the recipe $X^{r}_i$ |
| $E^{tit}_i$ | The embedding of the title in a recipe $X^{r}_i$ |
| $E^{ing}_i$ | The embedding of the ingredients in a recipe $X^{r}_i$ |
| $E^{ins}_i$ | The embedding of the instructions in a recipe $X^{r}_i$ |
| $R$ | The recipe embedding |
| $V$ | The food image embedding |
| $f_r$ | The recipe encoder |
| $f_v$ | The image encoder |
| $\theta_r$ | The parameters of the recipe encoder |
| $\theta_v$ | The parameters of the image encoder |
| $\mathcal{L}_{tri}$ | The N-pairs triplet loss function |
| $\mathcal{L}_{RGI}$ | The RGI loss function |

Let $\mathcal{D} = \{(X^{v}_i, X^{r}_i)\}_{i=1}^{n}$ be a cross-modal recipe dataset with $n$ image-recipe pairs, where $X^{v}_i$ and $X^{r}_i$ represent the food image and recipe of the $i$-th pair, respectively. $X^{tit}_i$, $X^{ing}_i$, and $X^{ins}_i$ denote the title, list of ingredients, and list of instructions of the recipe, respectively. Note that each title comprises a single sentence, while both ingredients and instructions consist of several sentences. Given a recipe $X^{r}_i$ as a query, cross-modal recipe retrieval aims to search for the most similar food image $X^{v}_i$ from the dataset $\mathcal{D}$, or vice versa.

To enhance consistent feature distribution alignment across food images and recipes, we attempt to optimize an improved recipe encoder $R = f_r(X^{r}; \theta_r)$ and an image encoder $V = f_v(X^{v}; \theta_v)$ under the guidance of a novel learning strategy dubbed MDA. This strategy integrates two losses: an N-pairs triplet loss $\mathcal{L}_{tri}$ to focus on inter-modal semantic alignment and an RGI loss $\mathcal{L}_{RGI}$ to focus on semantic consistency within the same modality. By considering both inter-modal and intra-modal alignment, this approach mitigates the harmful effects of food image ambiguity. Therefore, the objective is to minimize a combination of $\mathcal{L}_{tri}$ and $\mathcal{L}_{RGI}$ with respect to $\theta_v$ and $\theta_r$, where $\theta_v$ and $\theta_r$ are two learnable parameter vectors for the image and recipe encoders, and $\lambda$ is a pre-defined balance parameter between the two losses.

Framework Overview
An overview of our method MMACMR is depicted in Figure 2. Following prevailing solutions [29,49], the backbone of MMACMR comprises an image encoder $f_v(\cdot; \theta_v)$ and a recipe encoder $f_r(\cdot; \theta_r)$, which project food images and recipes into a common feature subspace. In this subspace, the cross-modal features can be aligned effectively so that the similarity between images and recipes can be measured accurately. Below, we provide details of the two encoders.

Image Encoder
To fully capture the global semantic relations between fine-grained features in the content of each food image, we adopt the base-size model of Vision Transformer (ViT-B) [50] as the image encoder $f_v(\cdot; \theta_v)$. It is initialized with the weights pre-trained on ImageNet [51] and fine-tuned on the cross-modal recipe dataset. Given a food image $X^{v}_i$, the embedding of $X^{v}_i$ is denoted as $V_i = f_v(X^{v}_i; \theta_v)$.

Improved Recipe Encoder
To focus on the consistent fine-grained semantics between ingredients and instructions, we improve the hierarchical transformer-based recipe encoder [29]. This encoder consists of two levels of transformers, denoted as $T_1$ and $T_2$, with identical architectures. The first level encodes the title $X^{tit}_i$, ingredients $X^{ing}_i$, and instructions $X^{ins}_i$ at the word level and then outputs their sentence-level embeddings, while the second-level encoder receives the sentence-level embeddings of ingredients and instructions and produces component-level embeddings. Such a widely adopted recipe embedding scheme unfortunately overlooks a fundamental yet crucial rule in a recipe: the instructions are steps tailored to the ingredients, with the ingredients playing a determining role in shaping the instructions to some extent. To obey this rule, we plug a cross-attention module for instructions between the two transformers for two purposes: (1) to focus on the salient ingredients and (2) to highlight semantic relationships between ingredients and instructions.
Specifically, given a recipe set $\{X^{r}_i\}_{i=1}^{n}$, as shown in Figure 2, the first-level module $T_1$ receives the word-level tokens of the three components separately and outputs the average embedding of every sentence of every component, denoted as $(E^{tit}_i)'$, $(E^{ing}_i)'$, and $(E^{ins}_i)'$. To highlight the effect of ingredients on instructions at the sentence level and enhance semantic relationship learning, cross-attention is carried out between $(E^{ing}_i)'$ and $(E^{ins}_i)'$. First, within a recipe, we construct an affinity matrix $W$ as an attention map between the ingredient and instruction sentence embeddings, where $d$ is the dimension of each ingredient and each instruction embedding, and $W_{ing} \in \mathbb{R}^{d \times d}$ and $W_{ins} \in \mathbb{R}^{d \times d}$ are learnable weight matrices. Each element $W_{jk}$ is the normalized correlation between the $j$-th ingredient and the $k$-th instruction. The embedding of instructions is then enhanced by focusing on the consistent semantics between instructions and ingredients. After the second-level processing by $T_2$, we obtain the component-level features of the title, ingredients, and instructions, $E^{tit}_i$, $E^{ing}_i$, and $E^{ins}_i$. Finally, these three component embeddings are concatenated and fed into a linear layer; thus, we obtain the final recipe feature, $R_i = FC([E^{tit}_i; E^{ing}_i; E^{ins}_i]; \theta_l)$, where $FC(\cdot; \theta_l)$ is a linear layer, $\theta_l$ denotes its parameters, and the symbol $[\cdot; \cdot; \cdot]$ denotes the concatenation operation.
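To make the sentence-level cross-attention more concrete, the following is a minimal pure-Python sketch. It assumes a scaled dot-product affinity, a softmax taken over the ingredients for each instruction, and a residual connection; none of these choices is confirmed by the description above. The function names (`cross_attend`, `matmul`, `transpose`, `softmax`) are hypothetical helpers introduced only for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(e_ing, e_ins, w_ing, w_ins):
    """Enhance instruction sentence embeddings with ingredient information.

    e_ing: p x d ingredient sentence embeddings
    e_ins: q x d instruction sentence embeddings
    w_ing, w_ins: d x d projection matrices (learnable in a real model)
    Returns (affinity, enhanced): affinity is p x q, enhanced is q x d.
    """
    d = len(e_ing[0])
    proj_ing = matmul(e_ing, w_ing)                  # p x d
    proj_ins = matmul(e_ins, w_ins)                  # q x d
    scores = matmul(proj_ing, transpose(proj_ins))   # p x q
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    # Assumption: normalize over ingredients for each instruction (column-wise).
    per_instruction = [softmax(list(col)) for col in zip(*scores)]  # q x p
    affinity = transpose(per_instruction)            # p x q
    # Each instruction receives a weighted sum of ingredient embeddings.
    attended = matmul(per_instruction, e_ing)        # q x d
    # Assumption: residual connection keeps the original instruction content.
    enhanced = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(attended, e_ins)]
    return affinity, enhanced
```

With two ingredient sentences and three instruction sentences, `affinity` has shape 2 x 3 and each of its columns sums to one, so entry (j, k) can be read as the normalized correlation between the j-th ingredient and the k-th instruction.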

Multi-Modal Disambiguity and Alignment
To enhance the consistent feature distribution alignment across food images and recipes, we extend the prevailing learning scheme (only inter-modal metric learning, e.g., triplet loss) by considering both inter- and intra-modal alignment. To do so, we employ the N-pairs triplet loss to realize inter-modal alignment within each batch, while we propose a novel RGI loss to steer the model towards capturing intra-modal consistent semantics effectively by preventing the misrecognition of ambiguous food images.

Inter-Modal Alignment: N-Pairs Triplet Loss
Given an anchor food image $V_i$, a positive recipe $R^{+}_i$, and a negative recipe $R^{-}_j$, where $i \neq j$, the N-pairs triplet loss for the visual modality is a margin-based loss computed over the $|V|$ image samples in the batch: it requires the similarity between $V_i$ and $R^{+}_i$ to exceed the similarity between $V_i$ and each $R^{-}_j$ by at least a pre-defined margin $m$. We use cosine similarity as the similarity function and set $m = 0.3$ in this work. The N-pairs triplet loss for the text modality is written in the same way, with recipes as anchors. The whole N-pairs triplet loss $\mathcal{L}_{tri}$ combines the losses of the two modalities.
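As an illustration of the margin-based N-pairs triplet loss, the sketch below treats every non-matching item in the batch as a negative and uses cosine similarity with margin 0.3. The hinge form, the averaging over anchors, and the plain sum of the two directions are assumptions; the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def directional_triplet_loss(anchors, candidates, margin=0.3):
    """Hinge loss where anchors[i] and candidates[i] form the positive pair
    and every other candidate in the batch serves as a negative."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        pos = cosine(anchors[i], candidates[i])
        for j in range(n):
            if j != i:
                neg = cosine(anchors[i], candidates[j])
                # Penalize negatives that come within `margin` of the positive.
                total += max(0.0, neg - pos + margin)
    return total / n  # assumption: average over the anchors in the batch


def n_pairs_triplet_loss(images, recipes, margin=0.3):
    """Sum of the image-anchored and recipe-anchored losses (assumed form)."""
    return (directional_triplet_loss(images, recipes, margin)
            + directional_triplet_loss(recipes, images, margin))
```

For a batch whose positive pairs are identical and whose negatives are orthogonal, the loss is zero because every negative is already more than the margin away from the positive.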

Intra-Modal Alignment with Disambiguity: RGI Loss
As discussed above, the N-pairs triplet loss is a satisfactory scheme for reducing heterogeneity between images and recipes. Using it within each modality, however, is far from a suitable intra-modal alignment solution due to the disturbance of food image ambiguity. Moreover, the prevailing recipe retrieval approaches [18,29] only consider cross-modal similarity measurement, which narrows the distance between the anchor and positive samples while enlarging the distances between the anchor and negative samples. On the one hand, this limitation leads to a discrepancy between the two modalities, making it difficult for the model to converge. On the other hand, a query is easily matched to one of the ambiguous images, resulting in low retrieval performance.
Fortunately, recipes, or, more rigorously, text, are the more reliable modality owing to their ability to abstract semantic expression word by word. Thus, inspired by [52], we design a novel learning strategy termed RGI loss, which uses the similarity relations between recipes as guidance to determine the relations between the corresponding food images. Specifically, if $\langle R_i, R_j \rangle$ is a recipe pair in a batch, we aim to preserve the similarity relation for it and project this relation onto the corresponding image pair $\langle V_i, V_j \rangle$. Given a recipe $R_i$, we first rank the other recipes in the batch by their similarity to $R_i$ using the K-nearest neighbors (KNN) algorithm [53]. From the ranked recipes, we select the nearest neighbor as the positive sample $R^{+}_j$ and a randomly selected recipe that is not among the top 10 neighbors as the negative recipe $R^{-}_k$, with $i \neq j \neq k$. Inspired by the angular loss [54], our RGI loss for the text modality is defined on the triplet $(R_i, R^{+}_j, R^{-}_k)$, where $\tan^2 \alpha = 1$ is a pre-defined upper bound. For the visual modality, we no longer compute the KNN for images; instead, we directly adopt the neighbor ranking of the corresponding recipes. The RGI loss for the visual modality is defined in the same way on $(V_i, V^{+}_j, V^{-}_k)$, again with $\tan^2 \alpha = 1$ as a pre-defined upper bound. Note that the indices of the visual modality are the same as those of the text modality. The entire RGI loss $\mathcal{L}_{RGI}$ combines the losses of the two modalities, where $\lambda_1$ and $\lambda_2$ are hyper-parameters for adjusting the relation projection.
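The recipe-guided selection and the angular-style penalty can be sketched as follows. This is an illustrative reading, not the authors' implementation: the penalty follows the standard angular loss form, and the averaging over triplets, the assignment of the two weights to the text and visual terms, and all function names are assumptions.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_neighbors(recipes, top_k=10, rng=random):
    """For each recipe i return (i, j, k): j is its nearest neighbor and
    k is a random recipe outside its top_k neighbors."""
    triplets = []
    n = len(recipes)
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine(recipes[i], recipes[j]),
                        reverse=True)
        far = ranked[top_k:]
        if not far:
            continue  # batch too small to draw a negative outside the top_k
        triplets.append((i, ranked[0], rng.choice(far)))
    return triplets


def angular_term(anchor, positive, negative, tan_sq_alpha=1.0):
    """Standard angular-loss hinge: the negative should lie far from the
    midpoint of anchor and positive relative to the anchor-positive distance."""
    center = [(a + p) / 2 for a, p in zip(anchor, positive)]
    return max(0.0, sq_dist(anchor, positive)
               - 4 * tan_sq_alpha * sq_dist(negative, center))


def rgi_loss(recipes, images, lam_text=0.09, lam_img=0.1, top_k=10, rng=random):
    """Recipe-guided loss: triplet indices come from recipes only and are
    reused unchanged for the images (weight assignment is an assumption)."""
    triplets = select_neighbors(recipes, top_k, rng)
    if not triplets:
        return 0.0
    text = sum(angular_term(recipes[i], recipes[j], recipes[k])
               for i, j, k in triplets) / len(triplets)
    visual = sum(angular_term(images[i], images[j], images[k])
                 for i, j, k in triplets) / len(triplets)
    return lam_text * text + lam_img * visual
```

The key design point is that `select_neighbors` never looks at the image embeddings, so two visually similar images with dissimilar recipes are never treated as a positive pair.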

Total Loss
Finally, the total loss combines the N-pairs triplet loss $\mathcal{L}_{tri}$ and the RGI loss $\mathcal{L}_{RGI}$, where $\lambda$ is a balance hyper-parameter for adjusting the contributions of the two loss functions.
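One natural way to write this combination is given below. It assumes that $\lambda$ weights the RGI term and that $\lambda_1$ and $\lambda_2$ weight the text and visual RGI terms, respectively; the symbols $\mathcal{L}^{r}_{RGI}$ and $\mathcal{L}^{v}_{RGI}$ are introduced here only for illustration, and the placement of the weights is an assumption.

```latex
\mathcal{L}_{RGI} = \lambda_1 \mathcal{L}^{r}_{RGI} + \lambda_2 \mathcal{L}^{v}_{RGI},
\qquad
\mathcal{L} = \mathcal{L}_{tri} + \lambda \, \mathcal{L}_{RGI}
```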

Optimization
Our method undergoes end-to-end optimization. The optimization procedure is outlined in Algorithm 1.

Algorithm 1 Optimization procedure of MMACMR
Input: …, number of epochs $T$.
Output: parameters $\theta_v$, $\theta_r$ of the modality encoders.
1: Initialize $\theta_v$, $\theta_r$;
2: for $t = 1$ to $T$ do
3:   repeat
4:     Compute embeddings $V$ and $R$;
5:     Calculate Equation (…);
7:     Rank the recipe neighbors via the KNN algorithm;
8:     Rank the image neighbors following the recipes;
11:    Update the parameters $\theta_v$, $\theta_r$ by Equation (11) via the gradient descent algorithm;
12:  until convergence
13: end for

Experiments and Discussion
This section presents extensive experiments conducted to assess our method's performance. We begin by introducing the experiment settings, followed by a detailed discussion of the experimental results.

Dataset
We conduct experiments on the Recipe1M [9] dataset, which is by far the largest public multi-modal recipe dataset available. Recipe1M comprises over 1 million cooking recipe texts and 800 K food images, which are collected from more than 24 popular cooking websites. We adhere to the official data splits, with 238,399 image-recipe pairs allocated for training, 51,119 pairs for validation, and 51,303 pairs for testing.

Baselines
We benchmark our approach against the state-of-the-art baselines below:
• CCA [9] stands for Canonical Correlation Analysis, a classical statistical method used to learn a joint embedding space;
• JE [9] was the first to conduct the cross-modal recipe retrieval task on the Recipe1M dataset. It uses a joint encoder and a classifier to learn the information from food images and recipes;
• AdaMin [10] combines the retrieval loss and the classification loss to improve the robustness of models and proposes a novel strategy to mine the significant triplets;
• R2GAN [35] promotes the modality alignment by employing a GAN mechanism equipped with two discriminators and one generator;
• MCEN [14] bridges the semantic gap between the two modalities using stochastic latent variable models;
• SN [16] employs three attention mechanisms on three components of recipes to capture the relationship between sentences;
• SCAN [13] introduces a semantic consistency loss to regularize the representations of images and recipes;
• HF-ICMA [20] exploits the global and local similarity between the two modalities by considering inter- and intra-modal fusion;
• SEJE [22] constructs a two-phase feature framework and separates the processes of data pre-processing and model training to extract additional semantic information;
• M-SIA [17] argues that multiple aspects in recipes are related to multiple regions in food images and leverages multi-head attention to bridge them;
• X-MRS [39] augments recipe representations by utilizing multilingual translation;
• LCWF-GI [31] employs latent weight factors to fuse the three components of recipes by considering their complex interaction;
• H-T [29] captures the latent semantic information in recipes by applying a self-supervised loss that pulls components sourced from the same recipe close together;
• LMF-CSF [30] introduces a low-rank fusion strategy to combine the components in recipes and generate superior representations.

Evaluation Criteria
Similar to the majority of previous studies [9,29,44], we sample 1 K and 10 K image-recipe pairs from the test partition and assess the retrieval performance for the image-to-recipe and recipe-to-image tasks using median rank (MedR) and recall rate at top k (R@k). MedR is the median rank of the ground-truth sample over all queries, measuring the ability of models to understand the semantic correlation between the two modalities and the accuracy of retrieval. A lower MedR value indicates better performance. R@k is the percentage of queries whose ground truth is among the first k retrieved samples; it is also known as sensitivity or the true positive rate and measures the ability of models to correctly identify all relevant instances. A higher R@k value indicates better performance. Here, we evaluate the top 1 (R@1), top 5 (R@5), and top 10 (R@10). Using these two metrics, we can evaluate the comprehensive performance of the models. Every evaluation is repeated 10 times, and the mean results are reported.
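The two metrics can be computed from a query-by-candidate similarity matrix as in the following sketch. It assumes that the ground truth for query i is candidate i and that ties are resolved in favor of the ground truth; the function names are hypothetical.

```python
import statistics


def ranks_from_similarity(sim):
    """sim[i][j] is the similarity between query i and candidate j.
    Returns the 1-based rank of the ground-truth candidate for each query."""
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scoring strictly higher than the truth.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return ranks


def median_rank(ranks):
    """MedR: median rank of the ground truth over all queries (lower is better)."""
    return statistics.median(ranks)


def recall_at_k(ranks, k):
    """R@k: percentage of queries whose ground truth is in the top k (higher is better)."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)
```

In the protocol described above, these functions would be applied to each sampled 1 K or 10 K subset, and the resulting values averaged over the 10 repetitions.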

Implementation Details
In line with prior research [49], we use food images with a depth of three channels in the RGB color space. All the images in our experiments are resized to 256 pixels in their shorter dimension and then cropped to 224 × 224 pixels. The image encoder utilizes a pre-trained ViT-based model, yielding an output size of 1024. Regarding recipes, sentences in the three components are truncated to a maximum length of 15, and every list of ingredients or instructions has a maximum of 20 sentences. Each transformer in the hierarchical transformer recipe encoder comprises two layers, and each layer has four attention heads. Every component in the recipes is embedded into 512 dimensions, and the final embedding of a recipe has 1024 dimensions. The model is trained with the Adam optimizer, the batch size is set to 128, and the learning rate is $\eta = 10^{-4}$. The balance parameters are $\lambda_1 = 0.09$, $\lambda_2 = 0.1$, and $\lambda = 0.01$.
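The input limits above can be illustrated with a small sketch. It assumes whitespace tokenization for the sentence length limit and a center crop for the 224 × 224 crop, neither of which is specified in the text; both function names are hypothetical.

```python
def truncate_recipe(title, ingredients, instructions, max_words=15, max_sentences=20):
    """Clip every sentence to max_words tokens and every list to max_sentences."""
    def clip(sentence):
        # Assumption: tokens are whitespace-separated words.
        return sentence.split()[:max_words]

    return {
        "title": clip(title),
        "ingredients": [clip(s) for s in ingredients[:max_sentences]],
        "instructions": [clip(s) for s in instructions[:max_sentences]],
    }


def resize_then_center_crop_box(width, height, short_side=256, crop=224):
    """Return (new_width, new_height, box) for resizing the shorter side to
    short_side and then taking a crop x crop window (center crop assumed).
    box is (left, top, right, bottom) in the resized image."""
    shorter = min(width, height)
    new_w = round(width * short_side / shorter)
    new_h = round(height * short_side / shorter)
    left = (new_w - crop) // 2
    top = (new_h - crop) // 2
    return new_w, new_h, (left, top, left + crop, top + crop)
```

For a 640 × 480 image, the resized size is 341 × 256 and the crop window is (58, 16, 282, 240).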

Experimental Environment
Our experiments are conducted using Python 3.7 with the PyTorch 1.13.1 framework. We utilize a deep learning workstation equipped with an Intel(R) Core i9-12900K 3.9 GHz processor, 128 GB of RAM, 1 TB SSD, and 2 TB HDD storage. The workstation runs on the Ubuntu-22.04.1 operating system and is powered by two NVIDIA GeForce RTX 3090Ti GPUs (NVIDIA, Palo Alto, CA, USA).

Comparison with State-of-the-Art Methods
We compare the performance of our method with the baselines mentioned above. The results are reported in Table 2. MMACMR outperforms the best existing results on all the metrics. Concretely, compared to the SOTA method LMF-CSF [30], our method achieves improvements of 3.3, 1.1, and 0.6 in R@{1, 5, 10} for image to recipe and 3.7, 1.2, and 0.7 in R@{1, 5, 10} for recipe to image in the 1 K setting, and improvements of 3.5, 3.1, and 2.7 in R@{1, 5, 10} for image to recipe and 4.0, 3.1, and 2.8 in R@{1, 5, 10} for recipe to image in the 10 K setting. In addition, the MedR of our method in the 10 K setting decreases to 2.1 for image to recipe and 2.2 for recipe to image, compared to 3.0 for LMF-CSF [30]. These results demonstrate the effectiveness of our MMACMR. In other words, our approach to addressing the questions mentioned above is effective for cross-modal recipe retrieval.

Table 2. Comparison with SOTA methods. MedR (↓) and R@k (↑) in the 1 K and 10 K settings. The best results are marked in bold font.

Scalability Analysis
In order to investigate the scalability of our method, we conduct experiments on test sets larger than 10 K. As shown in Figure 3, the MedR results of MMACMR are consistently lower than those of all other methods across all dataset sizes. In addition, as the test size increases, the performance gap between our method and the others widens. We argue that, on the one hand, the enhancement of recipe embedding promotes the alignment between the two modalities. On the other hand, as the dataset size increases, so does the number of ambiguous food images, leading to a higher probability of matching incorrect recipes. By effectively addressing this issue, our method demonstrates improved robustness and scalability as the dataset size grows.

Ablation Studies
In this subsection, we conduct ablation experiments to assess the contribution of each part of our model to the overall performance. Table 3 reports the retrieval results of different parts of MMACMR in the 1 K and 10 K test settings. In Table 3, Base is the baseline framework consisting of the food image encoder (ViT-B) and the original hierarchical transformer recipe encoder coupled with the N-pairs triplet loss. IR denotes the introduction of the improved recipe encoder, and $\mathcal{L}_{RGI}$ is our RGI loss. A √ symbol under the columns Base, IR, and $\mathcal{L}_{RGI}$ indicates the use of that part. On the right, we list the MedR, R@1, R@5, and R@10 results for the image-to-recipe and recipe-to-image tasks. We first evaluate the Base framework and then introduce the improved recipe encoder and the RGI loss separately. Finally, we combine all three parts. It can be observed that adding either IR or $\mathcal{L}_{RGI}$ improves the baseline model. This indicates that the solutions we propose to address the questions mentioned above are effective. When employing all components, we achieve the best performance, further validating the effectiveness of each element in our approach. Note that the method without IR obtains the same scores as the full method in R@5 and R@10 for image to recipe, and in R@5 for recipe to image, in the 10 K setting. Additionally, it achieves better performance in MedR for recipe to image in the 10 K setting. Therefore, we attribute the main contribution to the MDA strategy.

Table 3 columns: Image to Recipe (MedR, R@1, R@5, R@10); Recipe to Image (MedR, R@1, R@5, R@10).

To more intuitively analyze the representative results of MMACMR in image-to-recipe retrieval, we select four food images as queries to retrieve the recipes from the test set using our method and the SOTA method H-T (ViT) [29]. As shown in Figure 4, from left to right, the queries are "Chickpeas and Spinach with Smoky Paprika", "Blue Ribbon Apple Crumb Pie", "Apricot Nectar Cake", and "Sweet and Spicy Grilled Pork Tenderloin". In the first two samples, the categories of food are relatively easy to distinguish; both of these methods retrieve approximate recipes. However, in the first example, H-T (ViT) [29] does not retrieve the main ingredient, apricot nectar, while our method successfully retrieves it. The same situation occurs in the second example, where H-T (ViT) [29] retrieves a recipe whose corresponding image is similar to the query image but misrecognizes the pork tenderloin as chicken thighs. In contrast, MMACMR retrieves the ground truth recipe. We attribute this to our MDA strategy, which can better address the problem of ambiguous food images and recognize the ingredients correctly. In the third example, H-T (ViT) [29] identifies some beans and vegetable leaves in the image but misclassifies their types, and the retrieved recipe as a whole deviates significantly from the ground truth. In the last example, the food image is difficult to recognize by the human eye. H-T (ViT) [29] retrieves a recipe whose corresponding image has a similar color to the query (color is a shortcut that models use to classify objects they have not seen before). However, our method retrieves the correct recipe even though the query image is ambiguous. We believe this is because MMACMR can reduce the distances between images with similar recipes, allowing the correct sample to be retrieved even when the query is hard to distinguish.

Figure 4. Examples of image-to-recipe retrieval results for the 10 K test set. The first row contains the query images, the second row shows the recipes retrieved using our method (all of which are the ground truth recipes; therefore, the first row is their corresponding food images), the third row displays the recipes retrieved using H-T (ViT) [29], and the last row presents the corresponding food images of the recipes from the third row. The key ingredients not retrieved by H-T [29] but retrieved by our method are highlighted in red.

Qualitative Results on Recipe-to-Image Retrieval
We also conduct experiments to visualize the results of recipe-to-image retrieval for the 1 K test set, which are presented in Figure 5. From top to bottom, the query recipes are titled "Fruit Salad", "Italian Beef Roast", and "Pesto Salmon", followed by the top five retrieved images using our method and the SOTA method H-T (ViT) [29]. In the first example, both methods retrieve five food images of fruit salad, but our method retrieves the ground truth as the top one, while H-T (ViT) [29] retrieves it in the top three. In the second example, the two methods retrieve the correct image in the top two. However, all the food images MMACMR retrieves are roast beef, while the third and fifth retrieved images of H-T (ViT) [29] do not match the recipe query. In the last example, our method retrieves the ground truth image as the top one, while H-T (ViT) [29] fails to retrieve the correct food image. At the same time, the second image retrieved by MMACMR is similar to the correct one, while the first and third images retrieved by H-T (ViT) [29] deviate significantly from the ground truth. We attribute these achievements to the capability of our method to


Figure 1. The demonstration of multi-modal alignment schemes for cross-modal recipe retrieval. (a) The prevailing learning strategy that ignores intra-modal alignment. (b) The food image ambiguity issue. (c) Our solution (negative samples are omitted). Circles represent images, and squares represent recipes. Shapes of the same color indicate positive pairs, while gray shapes indicate negative samples.

Figure 2. The framework of MMACMR, which comprises two modality encoder branches, f_r for recipe texts and f_v for food images, along with the MDA strategy.

Figure 3. Scalability analysis. The abscissa represents the dataset size ranging from 10 K to 50 K, while the ordinate represents the MedR value.
[Figure 4 graphic. Recipes shown, each with title, ingredients, and instructions: "Chickpeas and Spinach With Smoky Paprika"; "Aarsis Ultimate Mattar Mushroom Curry"; "Blue Ribbon Apple Crumb Pie"; "Fruit Cocktail Cake VII"; "Apricot Nectar Cake"; "Briscoe's Irish Brown Bread (Bread Machine)"; "Sweet and Spicy Grilled Pork Tenderloin"; "Grilled Lime Chicken Thighs".]

Figure 4. Examples of image-to-recipe retrieval results for the 10 K test set. The first row contains the query images, the second row shows the recipes retrieved using our method (all of which are the ground truth recipes; therefore, the first row is their corresponding food images), the third row displays the recipes retrieved using H-T (ViT) [29], and the last row presents the corresponding food images of the recipes from the third row. The key ingredients not retrieved by H-T [29] but retrieved by our method are highlighted in red.

Table 1. Summary of notations.

Table 3. Ablation study. MedR (↓) and R@k (↑) for the 1 K and 10 K test set sizes. The best results are marked in bold font. A √ symbol indicates that the corresponding component in that column is used.
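
For readers unfamiliar with the two criteria reported above, the following is a minimal sketch of how MedR and R@k are commonly computed from paired embeddings. It assumes that query i and gallery item i form the ground-truth pair and that cosine similarity is used for ranking; the function name `retrieval_metrics` and its parameters are hypothetical and are not taken from the MMACMR implementation.

```python
import numpy as np


def retrieval_metrics(query_emb, gallery_emb, ks=(1, 5, 10)):
    """Return (MedR, {k: R@k in percent}) for paired embeddings.

    Assumption: row i of query_emb matches row i of gallery_emb.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T  # shape: (num_queries, num_gallery)

    # Rank of the ground truth = 1 + number of items scored strictly higher.
    true_sim = np.diag(sim)[:, None]
    ranks = 1 + (sim > true_sim).sum(axis=1)

    medr = float(np.median(ranks))  # lower is better
    recall = {k: float((ranks <= k).mean() * 100) for k in ks}  # higher is better
    return medr, recall


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    recipes = rng.normal(size=(1000, 64))
    # Hypothetical image embeddings: noisy copies of the recipe embeddings.
    images = recipes + 0.5 * rng.normal(size=recipes.shape)
    print(retrieval_metrics(images, recipes))
```

Under this definition, a lower MedR means the ground truth tends to appear nearer the top of the ranked list, and a higher R@k means more queries have their ground truth within the first k results, which matches the arrow directions in the caption.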