Social Media Mining for an Analysis of Nutrition and Dietary Health in Taiwan

Dining is an essential part of human life. In pursuit of a healthier lifestyle, more and more people enjoy homemade cuisine. Consequently, the number of recipe websites has increased significantly. These online recipes represent different cultures and cooking methods from various regions, and provide important indications of nutritional content. In recent years, the development of data science has made data mining a popular research area. However, only a few studies in Taiwan have applied data mining to recipes and nutrients. Therefore, this work aims to utilize machine learning models to discover health-related insights from recipes on social media. First, we collected over 15,000 Chinese recipes from the largest recipe website in Taiwan to build a recipe database. We then extracted information from this dataset through natural language processing methodologies so as to better understand the characteristics of various cuisines and ingredients, which allowed us to establish a classification model for the automatic categorization of recipes. We further performed cluster analysis on nutrients to recognize the nutritional differences between clusters and between cuisine types. The results showed that the support vector machine (SVM) model can successfully classify recipes with an average F-score of 82%. We also analyzed the nutritional value of different cuisine categories and the possible health effects they may have on consumers. Our methods and findings can assist future work on extracting essential nutritional information from recipes and promoting healthier diets.


Introduction
Under the economic prosperity and the rapid development of science and technology, the social pattern has changed dramatically. For example, the rise of social network has greatly changed the culinary culture. The choice of eating or cooking has become a hot topic, as well as the social influence of dietary choices [1]. The immediacy and convenience of the Internet have made sharing both food photos and cooking methods easier. The popularity of food search applications and food sharing websites is also increasing [2]. Researchers have also noticed the connection between recipes and diet habits under the effect of social networks [3].
The choice of recipes reflects one's preferences regarding ingredients and dietary habits, which in turn are strongly correlated with diseases, including the incidence rate of cancer [4], the death rate [5], cardiovascular disease [6], and metabolic diseases and obesity [7]. Moreover, a 2014 study in Brazil [8] examined the association between adult eating habits and metabolic syndrome in a total of 1112 cases. The results showed that a higher intake of fat-containing and sugary foods increased the risk of metabolic syndrome. Other studies [9][10][11] indicated that the current human diet consists mostly of refined cereals, excessive saturated fatty acids, red meat, processed meats, and refined sugars, with fewer fruits, vegetables, whole grains, dietary fiber, plant protein, and nuts, which causes insulin resistance and inflammatory reactions that lead to an increase in the prevalence of chronic diseases, such
They conducted cuisine classification using ingredients as eigenvalues. The results can be applied to recommended food category labelling and the automatic classification of recipes. Kusmierczyk and Nørvåg [24] analyzed the interaction data of recipes and scores uploaded to the German community platform Kochbar.de and found that changes in the nutrition (fat, protein, carbohydrate, and calories) of the diet show obvious temporal trends. Rokicki et al. [25] studied the difference in nutritional value between recipes uploaded by different user groups. In addition, the carbohydrate amount in recipes seems to decrease as user age increases.
In contrast, only a few studies have investigated the correlation between recipes and health conditions in Taiwan. Considering recent advances in machine learning technologies, we believe applying them to this topic can be fruitful. Therefore, this paper aims to discover the correlation between food and health by exploring cuisines and their unique ingredients, building cuisine classification models, and clustering nutrients. We start by collecting information from the largest Chinese recipe website in Taiwan, iCook.tw. It is a social platform for amateurs to share and discuss cooking recipes. However, the recipes lack standardized nutrient lists. Therefore, we use natural language processing (NLP) techniques to process the data in order to establish a food-centric vocabulary and database. Subsequently, we construct machine learning models to automatically classify the recipes. This research also extracts the unique ingredients of each cuisine type and the nutritional information of each recipe by linking the recipe ingredients to the nutrient database. We then establish a clustering model of the recipes based on their nutrient characteristics, and finally explore the relationship between diet and health. Our contributions include findings from online Chinese recipes, the ingredients and nutrients of various cuisines, an analysis of the characteristics of various cuisines and their nutritional value, and the correspondence between diet and diseases. This research can help improve awareness of how what we eat affects our body, and can support customized recipes or recommendation services for individual users.

Materials and Methods
Our study relies on recipe data retrieved from iCook.tw, the largest Chinese recipe-sharing website in Taiwan. After data collection and preprocessing, ingredients and nutrients are regarded as features, and machine learning as well as data exploration techniques are utilized to analyze cuisine types, dietary habits, and their influence on health. The entire process of our system is presented in Figure 1. The major components of our method are: (1) preprocessing the free-form recipes collected from the Internet, (2) training machine learning models to extract key features and to help categorize these recipes by their nutritional content, and (3) uncovering relationships between food, nutrients, and health. Detailed experimental methods are described in the following sections.

Data Collection
At the outset, we collect data from the recipe website, covering the most popular cuisines in Taiwan. Eight categories of cuisines are retained for further analysis, namely, Chinese (C), Japanese (J), Korean (K), Thai (T), American (A), Italian (I), French (F), and Spanish (S) cuisines. The numbers of recipes in each cuisine category are 1321 (C), 1231 (J), 1333 (K), 1021 (T), 670 (A), 1836 (I), 949 (F), and 821 (S), resulting in 9182 classified recipes in total. In addition, there are 6121 non-classified (N) recipes, which are to be classified by the trained model later.
For the nutritional content, because the website does not provide nutritional labels, we map the ingredients to the 2017 version of the Taiwan FDA Food Composition Databases (TFDA) to gain nutritional insights. This process is illustrated in Figure 2. The application of such a database can effectively convert free-form data into meaningful and structured information.

Data Preprocessing
The free-form data uploaded by users of the recipe website requires the following preprocessing steps. To start with, we need to normalize synonyms. It is common for the same ingredient to be written in different ways. Moreover, different physical forms of the same ingredient are commonly shown as different terms, for instance, "diced scallions" and "chopped scallions" both are scallions and should map to the same term in the nutrient database. Besides, we also remove parentheses, punctuations, and emojis, such as "「", "【", "】", "！", "❤", and all other non-ingredient-related words like kitchen appliances, cooking techniques such as "julienne." Finally, in order to precisely analyze ingredients, the recipes that use "few, appropriate," etc. quantity of the ingredients are replaced with an average amount. Decorative ingredients with a very small amount are removed. After preprocessing, we discover that there are some unrecognizable recipes or recipes with only one ingredient left. These exceptional cases are examined manually by the authors and excluded from the database.
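The normalization steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synonym table, the list of non-ingredient words, and the function name `normalize_ingredient` are hypothetical and are not the resources used in this study.

```python
import re

# Hypothetical synonym table: the actual mapping is not given in the text.
SYNONYMS = {
    "diced scallions": "scallion",
    "chopped scallions": "scallion",
    "scallions": "scallion",
}

# Hypothetical non-ingredient words (appliances, cutting techniques).
NON_INGREDIENT_WORDS = {"julienne", "oven"}

# Brackets, punctuation, and emoji of the kind quoted in the text.
NOISE = re.compile(r"[「」【】！!❤\ufe0f()（）\[\],.;:?]")


def normalize_ingredient(raw):
    """Map a free-form ingredient string to a canonical term."""
    # Replace noise characters by spaces, then lowercase.
    text = NOISE.sub(" ", raw).lower()
    # Drop words that do not describe an ingredient.
    words = [w for w in text.split() if w not in NON_INGREDIENT_WORDS]
    text = " ".join(words)
    # Collapse different physical forms onto one term when a synonym is known.
    return SYNONYMS.get(text, text)


print(normalize_ingredient("【Diced scallions】❤"))
print(normalize_ingredient("julienne carrot！"))
```

A lookup table is used here because the text describes synonym normalization as a mapping onto database terms; a real pipeline for Chinese recipes would also need word segmentation, which this sketch leaves out.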

Common Ingredients and Featured Ingredients
This step identifies common and featured ingredients, in other words, those that appear across multiple cuisine types and those that appear in only a few. To achieve this, each recipe is considered as a document, while each ingredient is considered as a term and each cuisine type is considered as a label, as illustrated by Figure 3. Term Frequency-Inverse Document Frequency (tf-idf) (https://en.wikipedia.org/wiki/Tf%E2%80%93idf (accessed on 22 May 2021)) is used to calculate a vector representation of the ingredients. Then, we experiment with four widely used machine learning models, namely naïve Bayes (NB) [26], decision tree (DT) [27], random forest (RF) [28], and support vector machine (SVM) [29], to construct our classifier.
More specifically, we calculate the importance and uniqueness of the ingredient terms by using the tf-idf model, and then combine these weights. The more often a certain ingredient appears in the recipes of a cuisine type, the more important it is. Hence, we take the frequency of ingredient i in cuisine category j as a measure of its importance, the "ingredient characteristic value" $p_{i,j}$:
$$p_{i,j} = \frac{r_{i,j}}{s_j}$$
The denominator $s_j$ is the total number of recipes in cuisine type j. The numerator $r_{i,j}$ is the number of recipes in category j that include ingredient i. At the experimental stage, a critical value $\tau_i$ is empirically chosen such that $p_{i,j} \geq \tau_i$ is the criterion for an ingredient to be considered as "featured" in category j.
On the other hand, we define $w_{i,j}$ as a measure of the uniqueness of ingredient i, based on the ratio $N / C_i$. The numerator $N$ is the total number of cooking categories, and the denominator $C_i$ is the number of cooking categories that include ingredient i.
Finally, by multiplying the importance score $p_{i,j}$ and the uniqueness weight $w_{i,j}$, we obtain the "Specialty score" of ingredient i in category j, denoted as $S_{i,j}$:
$$S_{i,j} = p_{i,j} \times w_{i,j}$$
A higher Specialty score indicates that this ingredient has a higher chance of being a featured item in a certain category.
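As a rough illustration of the Specialty score, the sketch below computes the importance, uniqueness, and their product for a toy set of recipes. The text only specifies the ratio N / C_i for the uniqueness weight; taking its logarithm, as in the standard idf definition, is an assumption made here, and the function name and toy data are hypothetical.

```python
import math
from collections import defaultdict


def specialty_scores(recipes):
    """recipes: list of (cuisine, ingredients) pairs.

    Returns {(ingredient, cuisine): S} with S = p * w.
    """
    s = defaultdict(int)  # s_j: number of recipes in cuisine j
    r = defaultdict(int)  # r_{i,j}: recipes in cuisine j containing ingredient i
    for cuisine, ingredients in recipes:
        s[cuisine] += 1
        for ingredient in set(ingredients):
            r[(ingredient, cuisine)] += 1

    n_categories = len(s)  # N: total number of cuisine categories
    c = defaultdict(int)   # C_i: categories that include ingredient i
    for ingredient, _cuisine in r:
        c[ingredient] += 1

    scores = {}
    for (ingredient, cuisine), count in r.items():
        p = count / s[cuisine]                      # importance p_{i,j}
        w = math.log(n_categories / c[ingredient])  # ASSUMPTION: log of N / C_i
        scores[(ingredient, cuisine)] = p * w
    return scores


toy = [
    ("T", {"fish sauce", "salt"}),
    ("T", {"fish sauce"}),
    ("C", {"soy sauce", "salt"}),
]
for key, value in sorted(specialty_scores(toy).items()):
    print(key, round(value, 3))
```

In this toy data, salt appears in every category, so its uniqueness weight and Specialty score are zero, while fish sauce appears in every Thai recipe and nowhere else, so it receives the highest score for that category.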

Nutrient Normalization
To understand the nutritional attributes of each cuisine, we extract nutritional information by referring to the TFDA database mentioned in the previous section. We calculate the nutritional facts per 100 g of each recipe from its ingredients, and select seven of the most important and common nutrients: carbohydrates, proteins, fat, saturated fat, dietary fiber, sugars, and sodium. For an ingredient that cannot be mapped to the TFDA database, if its usage in the recipe is low, it is excluded from the calculation. Otherwise, it is replaced by the most similar ingredient in the TFDA database. Additionally, the quantity of some ingredients is not specified in a precise unit, but is described as "properly," "a few," etc. Therefore, we define the unspecified quantity of an ingredient by the following criteria:
i. Seasonings are usually set as the mean value.
ii. When the unit is described as "a long piece of," "a piece of," "a carton of," etc., we use the food replacement table as a reference for the replacement of nutrients.
iii. Ingredients used for decoration, or spices used in small amounts, such as white sesame and mint leaf, are excluded from the calculation.
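The per-100 g calculation can be sketched as follows. The nutrient values in the table are made up for illustration and are not taken from the TFDA database; skipping unmapped ingredients entirely (including their weight) is also an assumption of this sketch.

```python
# Hypothetical per-100 g nutrient values; NOT taken from the TFDA database.
NUTRIENTS_PER_100G = {
    "rice": {"carbohydrate": 40.0, "protein": 3.0, "fat": 0.5},
    "chicken": {"carbohydrate": 0.0, "protein": 20.0, "fat": 5.0},
}


def recipe_nutrients_per_100g(ingredients, table=NUTRIENTS_PER_100G):
    """ingredients: list of (name, grams) pairs.

    Ingredients missing from the table are skipped, mirroring the rule that
    unmapped, low-usage items are excluded from the calculation.
    """
    totals = {}
    total_grams = 0.0
    for name, grams in ingredients:
        if name not in table:
            continue
        total_grams += grams
        for nutrient, per_100g in table[name].items():
            totals[nutrient] = totals.get(nutrient, 0.0) + per_100g * grams / 100.0
    if total_grams == 0:
        return {}
    # Rescale the recipe totals to a 100 g serving.
    return {n: v * 100.0 / total_grams for n, v in totals.items()}


recipe = [("rice", 200), ("chicken", 100), ("mint leaf", 1)]
print(recipe_nutrients_per_100g(recipe))
```

Here 200 g of rice and 100 g of chicken contribute 6 g of fat in total, which rescales to 2 g of fat per 100 g of the recipe.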

Cuisine Categorization
We utilize unsupervised learning (https://en.wikipedia.org/wiki/Unsupervised_learning (accessed on 22 May 2021)) methods to find groups of recipes that contain similar nutritional features. In other words, we do not use predefined category labels but rather the actual nutrients as the categorization criteria. In our experiments, we employ the k-means algorithm to distribute 13,323 recipes into 20 clusters, and use tf-idf to determine the most representative ingredients of each cluster. Through the steps mentioned in this section, we can quantitatively examine the correlation between recipes, their ingredients, and the well-being of our body.
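To make the clustering step concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on toy nutrient vectors. It is not the implementation used in this study: the initialization with the first k points is a simplification chosen for reproducibility, and the two-dimensional (carbohydrate, protein) vectors are invented.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm; the first k points serve as initial centers."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are final
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical (carbohydrate, protein) values per 100 g for six recipes.
recipes = [(60, 5), (62, 6), (61, 4), (5, 30), (6, 32), (4, 31)]
labels, centers = kmeans(recipes, k=2)
print(labels)
print(centers)
```

The recipes closest to each returned center correspond to what the text calls the recipes "closest to the center of mass" of a cluster.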

Feature Extraction
First, we perform the basic preprocessing steps mentioned above; the numbers of samples before and after preprocessing are listed in Table 1. Afterwards, 80% of the data are used for training and 20% for testing. The k-fold cross-validation scheme with k = 10 is adopted in our experiments. Multi-class classification models are used for classifying recipes into 8 categories: Chinese (C), Japanese (J), Korean (K), Thai (T), American (A), Italian (I), French (F), and Spanish (S) cuisines. Because using accuracy alone as the performance metric can be prone to bias, we include the macro-average F1-score as a comprehensive metric for multi-class models. The tf-idf algorithm is then used to calculate the "Specialty score" of ingredients. Recall that the higher the Specialty score, the more important the ingredient is for the cuisine type. The scores are listed in Tables 2 and 3.
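For reference, the macro-average F1-score mentioned above can be computed as the unweighted mean of the per-class F1-scores. The sketch below follows that common definition; the function name and the toy labels are illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


y_true = ["C", "C", "J", "J"]
y_pred = ["C", "J", "J", "J"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```

Because every class contributes equally, a small class such as American (670 recipes) weighs as much as a large one such as Italian (1836 recipes), which is why this metric is less prone to the bias of plain accuracy.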

Classification Results
We compare different classification models with regard to their ability to categorize recipes into eight cuisine types. As Figure 4 shows, each ingredient of a recipe is treated as a term. We then use tf-idf to calculate ingredient weights as a vector space feature representation. Here, cuisine types are treated as classification labels. We evaluate SVM, naïve Bayes, decision tree, and random forest for their classification performance. We adopt a 10-fold cross-validation scheme and calculate precision, recall, and F1-scores. The results of the SVM algorithm are shown in Table 4. The macro-average precision is 0.83, the recall is 0.82, and the F1-score is 0.82. The confusion matrix is shown in Figure 5. The performances of the compared methods are listed as follows: the naïve Bayes model in Table 5 and its confusion matrix in Figure 6, decision tree in Table 6 and Figure 7, and random forest in Table 7 and Figure 8.
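The vector space representation described above can be illustrated with a small tf-idf sketch in which each recipe is a document and each ingredient is a term. The exact tf-idf variant used in this study is not stated, so the plain form below (relative term frequency times log(N / df)) is an assumption, as are the function name and the toy recipes.

```python
import math
from collections import Counter


def tfidf_vectors(recipes):
    """recipes: list of ingredient lists.

    Returns (vocabulary, vectors), one dense vector per recipe.
    """
    vocabulary = sorted({ing for recipe in recipes for ing in recipe})
    n_docs = len(recipes)
    # Document frequency: number of recipes that contain each ingredient.
    df = Counter(ing for recipe in recipes for ing in set(recipe))
    idf = {ing: math.log(n_docs / df[ing]) for ing in vocabulary}
    vectors = []
    for recipe in recipes:
        counts = Counter(recipe)
        total = len(recipe)
        vectors.append(
            [counts[ing] / total * idf[ing] if total else 0.0 for ing in vocabulary]
        )
    return vocabulary, vectors


toy_recipes = [["soy sauce", "rice"], ["olive oil", "rice"]]
vocabulary, vectors = tfidf_vectors(toy_recipes)
print(vocabulary)
print(vectors)
```

The resulting vectors, paired with the cuisine labels, are the kind of input that any of the four classifiers named in the text could be trained on.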

The macro averages of the three metrics from all classifiers are compared in Figure 9. Overall, we find that the SVM model performs best on the categorization of recipes into cuisine types. Therefore, we adopt SVM to determine the category of the 5349 unclassified recipes collected from the web. The classification results are listed in Table 8.

Nutrient Grouping
The k-means algorithm is employed to distribute 13,323 recipes into 20 clusters. Afterwards, tf-idf is used to find representative ingredients of each cluster. We then manually label and merge them with similar characteristics. In the end, 20 clusters are reorganized into 5 groups, and results are as follows.
Group A: This group includes the 1st, 5th, and 13th clusters. The recipes in this group constitute a third of the total recipes. Carbohydrates, protein, and fat are relatively average in this group (see Table 10 for a comparison). It also shows that the 13th cluster mainly uses low-fat meat, such as salmon and chicken, as ingredients. Past research indicates that diets high in protein and low in fat and carbohydrates are good for weight loss [30]. Therefore, Group A is a healthier choice than Groups B, C, and E, described later.

Figure 9. Comparison of classification models in terms of the macro-average precision, recall, and F1-score.
Group B: It contains the 2nd, 8th, 11th, and 19th clusters. This group has higher carbohydrates. Past studies have found that, under the same calorie limit, a diet that maintains a high daily carbohydrate ratio is not helpful for weight loss [31]. The 2nd and 8th clusters also have a higher percentage of refined sugar. According to the World Health Organization, excessive amounts of sugar in food can cause obesity. In addition, many studies have indicated that excessive intake of sugar can increase triglycerides [32], total cholesterol [9], blood pressure [33], and the risk of cardiovascular disease [34], and therefore has a significant impact on our health. In the 2nd and 8th clusters, the recipes closest to the center of mass are mostly American recipes. The recipes mainly include cakes, muffins, crepes, and other desserts.
Group C: It consists of the 0th, 6th, and 12th clusters. This group has high sodium content, that is, high salt content. Many studies have shown that excessive sodium in the diet is associated with hypertension, cardiovascular disease, and chronic kidney disease [35][36][37]. Notably, the 6th and 12th clusters are mainly Japanese and Chinese dishes, respectively. Most of them require stewing or contain marinated ingredients. The representative ingredients in this group are soy sauce, rice wine, and chili.
Group D: It is made up of the 4th, 10th, 14th, and 16th clusters. The main characteristic of this group is high dietary fiber. There is considerable epidemiological evidence that a higher daily dietary fiber intake can reduce the risk of diseases including cardiovascular disease [38], Type 2 diabetes [17], and cancer [39]. Among the 17 recipes in this group, those closest to the center of mass are mostly Japanese, and those with high dietary fiber content are sushi. Some of the recipes with higher dietary fiber are listed in Table 9. Besides, most of them have a higher proportion of vegetables or soybean-related products such as tofu, which are healthier food choices. 17 Japanese recipes and 9 Chinese recipes are closest to the center of mass. There are no Spanish recipes and only one American recipe. Seaweed and onions are the representative ingredients of this group.
Group E: It includes the 3rd, 7th, 9th, 15th, 17th, and 18th clusters. This group has an overall higher proportion of fat. We observe that the recipes in this group are mostly salads, which contain ingredients such as sesame oil, olive oil, and coconut oil. Olive oil is rich in monounsaturated fatty acids, and several studies have shown that it can lower the chance of stroke for patients with cardiovascular diseases [40] and plays an important role in reducing these diseases [41]. For the 7th and 15th clusters, 10 Chinese recipes are closest to the center of mass. For the 9th, 17th, and 18th clusters, 12 American recipes are closest to the center of mass. In this group, in addition to common seasonings, the most representative ingredients of the 7th and 15th clusters are pork belly, olive oil, and sesame oil; those of the 9th, 17th, and 18th clusters are milk, cream, and unsalted cream. A large-scale longitudinal study pointed out that, using whole grain foods as the control, the risk of cardiovascular disease is lower when consuming unsaturated rather than saturated fat. The study also pointed out that frequently eating refined starch and saturated fat at the same time is associated with a higher prevalence of cardiovascular disease [42]. The recipes in this group are mostly cakes, pastries, etc., with a low amount of dietary fiber. Observing the ingredients of this group, we notice that most of them use refined carbohydrates such as flour, bread, etc.
To summarize, our model finds 20 clusters of recipes that may have various influences on human health. Table 10 lists the groups and clusters that are more in line with the principles of a healthy diet, accounting for 60% of all recipes. On the other hand, Table 11 shows the other groups, which do not conform to these principles. Note that the recipe database we have so far only covers a portion of the online recipes on one website. Overall, Japanese recipes and dishes are generally healthier. However, other factors, such as the smaller number of French and Spanish recipes and the cooking methods, are not considered. Among those listed in Table 11, most are American and French desserts with refined sugar, saturated fat, and carbohydrates. Another major trait of this group is high sodium content.

Common Ingredients and Featured Ingredients
To acquire a deeper insight, we analyze common ingredients and featured ingredients of various cuisine types. We observe that Chinese, Japanese, and Korean cuisines use similar ingredients, whereas American, Italian, French, and Spanish cuisines are comparable in the same manner. Thai food is more distinct, where common ingredients such as salt, sugar, and pepper are rare. The reason may be that Thailand is in Southeast Asia, where the preference of flavor is different from Northeast Asia. More precisely, it focuses on sour, spicy, and umami. Lemon, chili, and fish sauce are common ingredients in this region. It is also customary for people in this area to use fish sauce to replace salt and/or sugar, so there is a considerable difference in the use of ingredients from other cuisines.
When Sajadmanesh et al. [1] used tf-idf to extract distinctive ingredients, they found a strong relationship with culture, geographical location, and agriculture. It was also shown that Western European cuisine is more similar to North American dishes, both relying heavily on dairy products, eggs, and wheat-based products. Asian cuisines commonly use soy sauce, sesame oil, rice, and ginger [45]. Our analysis of the contents of online recipes is consistent with the results of previous work.

Cuisine Classification Model
In addition to the SVM classification model, the current study also uses naïve Bayes, decision tree, and random forest to compare the results. SVM was previously used by Su et al. [23] to understand the relationship between cuisines and ingredients based on the presence or absence of ingredients. Similar to our methods, their study used online recipe data for cuisine type prediction. We determine that, among all classification methods, the SVM model obtains the best result. Furthermore, this study employs tf-idf to calculate the weight of each ingredient and converts each recipe into a vector representation. Compared with sparse representation, the vector space model has the following advantages: (1) The weight of ingredients is not binary. (2) Recipes can be sorted according to their degree of correlation. (3) It supports local matching. As shown in Figure 10, the SVM model in this research has a higher precision and recall than that of Su et al. [23].
In addition to the SVM classification model, the current study also uses naïve Bayes, decision tree, and random forest to compare the results. SVM was previously used by Su et al. [23] to understand the relationship between cuisines and ingredients based on the presence or absence of ingredients. Similar to our methods, their study used online recipes data for cuisine type prediction. We determine that, among all classification methods, the SVM model can obtain the best result. Furthermore, this study employs tf-idf to calculate the weight of each ingredient, and convert each recipe into a vector representation. Compared with sparse representation, the vector space model has the following advantages: (1) The weight of ingredients is not binary. (2) It can be sorted according to the degree of correlation between recipes. (3) It supports local matching. As shown in Figure 10, the SVM model in this research has a higher precision and recall than that from Su et al. [23].  Figures 5-8 show the confusion matrices corresponding to SVM, naïve Bayes, decision tree, and random forest classifiers in this study. Compared to other cuisines, Chinese, Italian, Korean, Thai, and Japanese cuisines are easy to distinguish from one another, whereas French, American, and Spanish cuisines are more challenging. Taking the confusion matrix of the SVM model as an example, we can further find that American, French, and Spanish recipes tend to be classified as Italian cuisine. Table 12 lists some American, French, and Spanish recipes that are classified as Italian and their ingredients. We note that some ingredients have higher Specialty scores in Italian cuisine, but the misclassified recipes have low Specialty scores on those ingredients, which may be the reason for the misclassification of these dishes. 
For example, American recipes classified as Italian cuisine mostly contain olive oil, while French recipes include tomatoes, spaghetti and other ingredients, and Spanish recipes include basil, tomatoes, and cream. Similarly, Thai and Japanese dishes are sometimes classified as Chinese dishes. As indicated in Table 13, soy sauce, white pepper, rice wine, and shiitake mushrooms are also quite common ingredients in Chinese cuisine.
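The link between a confusion matrix and the reported precision, recall, and F-score can be sketched as follows. This is an illustrative example under stated assumptions, not the evaluation code of this study: the three-class matrix contains made-up counts, and the average F-score is assumed here to be an unweighted (macro) mean over classes, which the text does not specify. The function names per_class_scores and macro_f1 are hypothetical.

```python
def per_class_scores(labels, matrix):
    """Return {label: (precision, recall, f1)}.

    matrix[i][j] is the number of recipes whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    scores = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column sum
        actually_k = sum(matrix[k])                     # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values (an assumed averaging scheme)."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Made-up counts in which many French recipes are predicted as Italian.
labels = ["Italian", "French", "Chinese"]
matrix = [
    [90, 5, 5],    # true Italian
    [30, 60, 10],  # true French
    [0, 0, 100],   # true Chinese
]

scores = per_class_scores(labels, matrix)
for label, (precision, recall, f1) in scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1(scores):.2f}")
```

In this toy matrix, recipes from another class that are predicted as Italian lower the precision of the Italian class and the recall of their true class, which is the pattern described above for American, French, and Spanish recipes.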

Relationship between Nutrients
Hsiao and Chang [46] stated that recipe recommendations can greatly improve users' dietary habits and health. Therefore, we further use the clustering results to explore the nutrient characteristics of recipes and analyze whether each cluster of recipes is in line with the principles of healthy eating. For general ingredients, calories are provided by carbohydrates, proteins, and fats, which provide four, four, and nine calories per gram, respectively. Sodium and dietary fiber do not provide calories. Figure 11 shows the Pearson correlation coefficient matrix of seven nutrients, namely, carbohydrate, protein, fat, saturated fat, sugar, dietary fiber, and sodium, with calories. All nutrients are positively correlated with calories. Interestingly, carbohydrates are negatively correlated with protein, fat, and saturated fat, and sugar has low correlations with fat, saturated fat, and sodium, while being negatively correlated with protein. This result is consistent with past research [24]. It has been shown that dietary fiber is mostly found in whole wheat grains, vegetables, fruits, and soybeans. If more refined sugar, such as granulated sugar or powdered sugar, is added to a recipe, the recipe is most likely for an exquisite pastry, and thus its dietary fiber content is relatively low [3]. The findings of our experiments agree with previous research, in that dietary fiber is positively correlated with carbohydrates and protein, and negatively correlated with sugar.
For the purpose of suggesting a healthier diet, we refer to the Dietary Guidelines for Americans 2020-2025, Guideline 4 of Chapter 1 (https://www.dietaryguidelines.gov/sites/default/files/2020-12/Dietary_Guidelines_for_Americans_2020-2025.pdf (accessed on 22 May 2021)). It states that foods and beverages with added sugar, fat, sodium, or alcohol should be reduced. This corresponds to Groups B, C, and E in Section 3.3. These recipes are high in sugar, sodium, and fat content, which are related to obesity, high cholesterol, high blood pressure, heart disease, and chronic kidney disease. The Japanese Dietary Guidelines recommend moderate consumption of highly processed snacks, confectionery and sugar-sweetened beverages as well [47]. Furthermore, the Updated Mediterranean Diet Pyramid [48] also encourages healthy fats like olive oil to be the main source of fat, while sweets and ultra-processed high-sugar, high-fat foods and beverages should only be consumed in small amounts. These modernized views of healthy diets align well with Group E in Section 3.3. Moreover, the Italian Dietary Guidelines (http://www.fao.org/nutrition/education/food-dietary-guidelines/regions/countries/italy/en/ (accessed on 22 May 2021)) advocate for small amounts of fat and sugar in foods. To sum up, our method can successfully find groups of recipes and nutrients that are consistent with the most up-to-date dietary recommendations around the world.
Figure 11. Pearson correlation matrix between nutrients in recipes.
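The calorie arithmetic and the correlation matrix discussed in this section can be sketched as follows. This is a minimal illustration, not the analysis code of this study: the four recipes and their gram values are invented, only three of the seven nutrients are included for brevity, and the helper names calories and pearson are hypothetical. The sketch also assumes that every column has non-zero variance, since the Pearson coefficient is undefined otherwise.

```python
import math


def calories(carb_g, protein_g, fat_g):
    """Energy from macronutrients: 4, 4, and 9 calories per gram, respectively."""
    return 4 * carb_g + 4 * protein_g + 9 * fat_g


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented per-serving nutrient values in grams, for illustration only.
recipes = [
    {"carbohydrate": 60, "protein": 10, "fat": 5},
    {"carbohydrate": 20, "protein": 30, "fat": 15},
    {"carbohydrate": 45, "protein": 12, "fat": 25},
    {"carbohydrate": 10, "protein": 35, "fat": 20},
]
for recipe in recipes:
    recipe["calories"] = calories(
        recipe["carbohydrate"], recipe["protein"], recipe["fat"]
    )

columns = ["carbohydrate", "protein", "fat", "calories"]
matrix = {
    a: {b: pearson([r[a] for r in recipes], [r[b] for r in recipes]) for b in columns}
    for a in columns
}

for a in columns:
    print(a.ljust(13), " ".join(f"{matrix[a][b]:+.2f}" for b in columns))
```

Each printed row corresponds to one row of a matrix such as the one in Figure 11; with real data, the remaining nutrients (saturated fat, sugar, dietary fiber, and sodium) would be added as further columns.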

Strengths and Limitations
The strengths of our study include the use of state-of-the-art machine learning models to efficiently categorize recipes and discover relations among ingredients and nutrients, which enables an objective measure of healthy diets. Another advantage of our methods is their accuracy. A study of recipes provided by restaurants on university campuses [49] mentioned that the accuracy of menu labels has a significant impact on the nutritional information actually provided by the dishes. People who want to choose healthy meals may be affected by incorrect menu labels, which may in turn result in lower nutritional intake than expected, and even lead to problems with diet control. In our experiments, we use online recipes that provide detailed ingredients, and we also consult nutrition databases to achieve a complete nutritional analysis for each recipe. In this way, our analysis can provide users with accurate nutritional content. However, the distribution of recipes from the data source may be imbalanced, and the ingredient-to-nutrition database is not always comprehensive, i.e., some ingredients are not recognized or found in the database. These factors can limit the outcome of our system.

Conclusions
This work examines ingredients and nutrients on the recipe-sharing social network site iCook.tw to explore the health effects of various cuisines and ingredients. The online recipes are processed, and the nutrients within them are linked to a standard nutrient database. Multiple machine learning approaches are explored, and the SVM classifier is found to be superior to three other methods in the recipe classification experiment, with an F1-score of 0.82. We further analyze the healthiness of cuisines by clustering nutrients and organizing the possible health effects of different clusters of recipes. We observe that only a third of the online recipes are high in protein and low in fat and carbohydrates, which are indications of a healthier diet. As for the most notable relationships between nutrients, sugar is negatively correlated with protein and dietary fiber. In other words, sweeter dishes are usually low in protein and fiber content. On the other hand, dietary fiber is positively correlated with carbohydrates and protein, which are essential nutrients for human health. Our findings can help the public to better understand the impact of dietary habits. We expect that greater awareness of healthy cuisines and ingredients will encourage more nutritious and healthy cooking styles to emerge.