Clustering in Wineinformatics with Attribute Selection to Increase Uniqueness of Clusters

Abstract: Wineinformatics is a new data science research area that focuses on large amounts of wine-related data. Most current Wineinformatics research focuses on supervised learning to predict wine quality, price, region, and weather. In this research, unsupervised learning using K-means clustering with optimal K search and a filtration process is studied on a Bordeaux-region-specific dataset to form clusters and find representative wines in each cluster. The 14,349 wines of the 21st century Bordeaux dataset are clustered into 43 and 13 clusters, with detailed analysis of the number of wines, dominant wine characteristics, average wine grades, and representative wines in each cluster. Similar results are also generated and presented for 435 elite wines (wines that scored 95 points and above on a 100-point scale). The information generated from this research can help wine vendors make a selection given the limited number of wines they can realistically offer, help connoisseurs study wines in a target region/vintage/price through a short representative list, and help wine consumers get recommendations. Future research can adopt the same process to analyze and find representative wines in different winemaking regions/countries, vintages, or pivot points. This paper opens a new door for unsupervised learning research in Wineinformatics.


Introduction
Data science combines data engineering, scientific methods, math, visualization, and statistically based algorithms with a domain of application to make sense of large quantities of data. With the rise of the internet, data has become abundant; therefore, data science has become one of the most popular research areas in the 21st century. Within this field there are four major types of learning algorithms: Supervised Learning [1], Unsupervised Learning [2], Semi-supervised Learning [3], and Reinforcement Learning [4]. All of these methods provide useful and distinct information about a domain with large amounts of data.
Wine has been enjoyed by people across the world for several thousand years. It is both delicious and so widely varied that people often dedicate a great deal of their time and money to tasting, comparing, and discussing different wines with their friends and peers. According to the International Organization of Vine and Wine (OIV), which is the world's authority on wine statistics, 293 million hectoliters of wine were produced across 36 countries in 2018. This constitutes a 17% increase in wine production from 2017 to 2018 [5]. The world's total wine production in 2019 is estimated to be 263 million hectoliters, just slightly below the ten-year average global wine production of 270 Mhl [6]. Based on the OIV statistics, wine is a high-value product that heavily affects the economies of many wine-producing countries, such as France, Italy, and Spain.
Unsupervised machine learning algorithms infer patterns from a large dataset without reference to known or labeled outcomes [2]. What separates this from supervised machine learning algorithms is that when this type of learning is performed, there is no knowledge as to what we are going to observe in the results. To our knowledge, no literature has focused on how to use unsupervised learning to find beneficial information for wine distributors and consumers from the large amount of data, especially from region-specific datasets.
With the massive selection of Bordeaux wines on the market, wine vendors face many tough choices when selecting which wines to include in their offerings. No vendor can possibly supply all available wines, so each must choose a limited number to provide the best selection for its customers. Choosing these wines can be a difficult process, and this project aims to provide some insight by grouping similar wines through unsupervised learning so that a vendor can make more informed decisions. This study allows wine distributors to compile a comprehensive list of selections from any group of wines without missing out on a particular type. For the scope of this project, we focus solely on wines from the Bordeaux region of France. The approaches we used can be easily applied to any selection of wines, depending on the need.

Bordeaux Dataset
The fundamental element for data science research is the dataset within the application domain. The source, the pre-processing, and the creation of the data are all major factors to the quality of the data. In this research, the Wineinformatics dataset comes from wine reviews which are processed by the Computational Wine Wheel as the Natural Language Processing (NLP) tool [39].

Wine Spectator
The best way to decide whether a wine suits one's preferences, aside from tasting it, is to check reviews of the wine. While one can follow the reviews of the general populace, there exists a field of work related specifically to the tasting and rating of wines. Wine reviewers set trends and guide customers' preferences [40]. These reviewers go through specific "wine education" that trains them to better identify and understand the qualities of wines. The verdict on a wine is usually summarized with a score on the 100-point wine-scoring scale [41]. However, many research efforts indicate that wine judges may demonstrate intra- and inter-judge inconsistencies while tasting designated wines [42][43][44][45][46][47]. Therefore, the data needs to come from consistent and credible wine judges.
Wine Spectator is a wine magazine company that periodically provides wine reviews written by a group of wine-region-specific reviewers: "Wine Spectator started as a biweekly, California-based newsletter, but has since become the world's leading authority on wine" [48]. The magazine publishes 15 issues a year, and there are between 400 and 1000 wine reviews per issue. In previous Wineinformatics research [14], more than 100,000 wine reviews were gathered and analyzed across all wine regions in the world. This dataset was used to test wine reviewers' accuracy in predicting a wine's credit score. Wine Spectator reviewers received more than 87% accuracy when evaluated with the SVM method while predicting whether a wine received a credit score higher than 90/100 points [7]. The satisfactory results demonstrate that Wine Spectator provides consistent wine reviews. Moreover, in the same study, James Molesworth, who reviews all Bordeaux wines, was ranked number three among all reviewers. Therefore, the Bordeaux wine reviews retrieved from Wine Spectator are suitable for this research.

Bordeaux Dataset
Bordeaux ("Bore-doe") refers to a wine from Bordeaux, France. The vast majority of the wines produced in this region (over 90%) are red wines made with Merlot or Cabernet Sauvignon. Bordeaux is the largest AOC vineyard of France and has 54 appellations [50]. Because the region produces such a large share of red wines, many of the reviews utilized in this project, and a considerable part of the attribute distribution, favor red wines. However, Bordeaux does produce other varieties of wines, and even among the red wines produced in this region there exist differences based on the terroir, that is, the environment in which a wine is produced, and on the vintage.
In our previous research [34], we explored all 21st century Bordeaux wines by creating a publicly available dataset with 14,349 Bordeaux wines [35] that covers all available Bordeaux wine reviews from 2000 to 2016 from Wine Spectator. Since all of the reviews are in human language format, as shown in Figure 1, the reviews are processed by the Computational Wine Wheel [21,22], which works as a dictionary using one-hot encoding to convert words into vectors. For example, a wine review may contain words that name fruits such as apple, blueberry, plum, etc. If a word matches an attribute in the Computational Wine Wheel, that attribute is set to 1; otherwise, it is 0. The resulting data are binary: each attribute exists in one of two states and is numerically represented by a zero or a one. The Computational Wine Wheel is also equipped with a generalization function to map similar words into the same coding. For example, fresh apple, apple, and ripe apple are generalized into "Apple" since they represent the same flavor; however, green apple belongs to "Green Apple" since the flavor of green apple is different from apple. The score of the wine is also attached to the data as the last attribute, also known as the label. This pre-processing step is crucial for computers to "understand" the wine reviews. Figure 2 provides a visual example of the process.
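The encoding and generalization steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the tiny `WHEEL` dictionary and the function name `encode_review` are hypothetical stand-ins, and the real Computational Wine Wheel is far larger and may match phrases differently.

```python
import re

# Hypothetical miniature wheel: review phrase -> normalized attribute.
# These entries only mirror the apple / green apple example in the text.
WHEEL = {
    "green apple": "GREEN APPLE",
    "fresh apple": "APPLE",
    "ripe apple": "APPLE",
    "apple": "APPLE",
    "blueberry": "BLUEBERRY",
    "plum": "PLUM",
}
ATTRIBUTES = sorted(set(WHEEL.values()))


def encode_review(review):
    """Return a 0/1 vector over ATTRIBUTES for one review text."""
    text = review.lower()
    found = set()
    # Try longer phrases first so "green apple" is not also counted as "apple".
    for phrase in sorted(WHEEL, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text):
            found.add(WHEEL[phrase])
            text = re.sub(pattern, " ", text)  # consume the matched words
    return [1 if attr in found else 0 for attr in ATTRIBUTES]


print(ATTRIBUTES)
print(encode_review("Ripe apple and plum notes, with a hint of green apple."))
```

Matching longer phrases first is one simple way to keep "green apple" separate from "apple"; other tokenization strategies would work as well.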
While this information can be useful to consumers and profitable to vendors, in its current state the dataset is difficult to interpret. As stated above, it is not practical to expect any vendor to carry all the wines in the dataset, and it would be a sizable task to sift through this information and extract any specific wine or feature to focus on. The initial idea was to use an unsupervised approach to find common attributes of the listed wines and group the wines based on these similarities. A further issue became apparent as the data was analyzed. The data provides binary attributes for all possible values that can be provided by the wine wheel. Among these values are more general terms such as "finish", "fruit", and "great". These common attributes appear at a higher rate than other attributes, but they are not informative for finding similarities within clusters. Before grouping the wines, such values need to be removed so that the most common attributes within each group are meaningful. To accomplish this, any wine attribute that appears in 20% or more of the listed wines (around 2870 wines) was removed. This resulted in the removal of six attributes from the dataset.

Figure 2. The flowchart of converting reviews into a machine-understandable format through the Computational Wine Wheel. All keyword appearances for a review are initially recorded as 0. When a keyword is found within the review, it is recorded in the table as 1.

K-Means Clustering
Unsupervised machine learning algorithms infer patterns from a dataset without reference to known or labeled outcomes [2]. What separates this from the supervised machine learning algorithms is the fact that when this type of learning is performed there is no knowledge as to what we are going to observe in the results. Among the unsupervised learning algorithms are various methods of clustering, which is the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities. While there are different methods of clustering that can be performed, this project turned to the use of K-Means clustering [51] as it is a familiar method that is well known and tested.
K-Means clustering might be the simplest and the most popular unsupervised machine learning algorithm. In K-Means clustering, the K value refers to the number of centroids (clusters). Once K is defined, the distance from every point to each centroid is calculated, and the centroid with the minimum distance to a point determines which cluster the point belongs to. The choice of distance calculation is therefore a major factor in K-Means clustering. Most applications use the Euclidean or Manhattan distance, but these are not well suited to binary data. To account for this problem, the Jaccard distance formula shown in equation 1 is used in place of the standard Euclidean distance formula.
$$d_{AB} = \frac{q + r}{p + q + r} \qquad (1)$$

where, for two objects A and B, p = number of variables positive for both objects, q = number of variables positive for A but not B, and r = number of variables positive for B but not A. The shorter the Jaccard distance, the more similar the two objects are. A distance of 0 means the objects are identical, and a distance of 1 means they are completely dissimilar.
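As a minimal Python sketch of the Jaccard distance defined above, the function below counts p, q, and r for two binary vectors. The function name and the two example vectors are illustrative, and treating two all-zero vectors as distance 0 is an assumption, since the formula is undefined in that case.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 vectors."""
    p = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # positive in both
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # positive in a only
    r = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # positive in b only
    if p + q + r == 0:
        return 0.0  # assumption: two all-zero vectors count as identical
    return (q + r) / (p + q + r)


# Illustrative vectors with p = 1, q = 2, r = 1
wine_a = [1, 0, 1, 0, 1]
wine_b = [1, 1, 0, 0, 0]
print(jaccard_distance(wine_a, wine_b))
```

Positions where both vectors are 0 do not enter the calculation, which is why the measure suits sparse keyword data where most attributes are absent from any single review.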
Table 1 demonstrates this calculation for the distance between A and B, that is, dAB. The three values needed are p, q, and r. Item1 is positive for both A and B, so p = 1. Item3 and Item5 are positive (1) for A and negative (0) for B, so q = 2. Item2 is negative (0) for A and positive (1) for B, so r = 1. With these three values, the calculation for dAB is (2 + 1)/(1 + 2 + 1), which gives a Jaccard distance from A to B of 0.75.

Table 1. Binary data example. For each label (A and B), the distance is calculated based on the comparison between the appearance of each item within that label. Jaccard distance is utilized for this type of data (binary).

Wine   Item1  Item2  Item3  Item4  Item5
A      1      0      1      0      1
B      1      1      0      0      0

The original K-means clustering can be described as shown in Figure 3 [7]. With a user-defined K as the number of clusters, the program randomly chooses the initial centroid locations. After that, an iterative process assigns each existing point to its closest centroid by calculating the distances to all centroids with the given distance formula. After all points are assigned, the centroid locations are recalculated, and the process repeats until the centroid locations no longer change or some pre-defined convergence criterion is met.

Figure 3. Pseudo code for the K-means clustering algorithm. The initial centroids are chosen from the existing collection of data. The standard K-means algorithm utilizes Euclidean distance for assigning points to their closest centroid, but this type of distance calculation does not work well for binary data (1s and 0s).

Filtering Process
Before clustering the wine information, this research attempts to apply some filtering processes to extract more precise and meaningful information since both the number of wines and attributes are large. Two separate methods were used to filter this data before performing the K-Means clustering algorithm.

Filtering Method 1: Attributes Filtration
Method one filtered the characteristics of the wine based on overall appearance within the dataset. This ensured that overly common attributes ("FINISH", "TANNINS", etc.) were removed from the calculation to avoid wines being clustered based on these attributes. The attributes to remove were selected by calculating, for each attribute, the percentage of wines in the dataset in which the attribute is present. Multiple tests were performed at 10% increments, and the best results were observed when attributes appearing in 20% or more of the wines were removed. The pseudo code for filtering method 1 is given in Figure 4.

Figure 4. Pseudo code for filtering method 1: attribute filtration. We use A_i to denote an attribute (vanilla, cherry, finish, etc.), and we calculate the total appearance of this attribute in the entire dataset. When an attribute appears too often in the dataset, it can skew the results of the clustering algorithm.
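The attribute filtration step can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `filter_common_attributes`, the list-of-lists data layout, and the toy data are assumptions; only the 20% cutoff comes from the text.

```python
def filter_common_attributes(wines, attribute_names, threshold=0.20):
    """Drop every attribute present in `threshold` or more of the wines.

    wines: list of 0/1 vectors, one per wine (hypothetical layout).
    Returns the filtered vectors and the names of the attributes kept.
    """
    n = len(wines)
    keep = [
        j for j in range(len(attribute_names))
        if sum(w[j] for w in wines) / n < threshold
    ]
    filtered = [[w[j] for j in keep] for w in wines]
    kept_names = [attribute_names[j] for j in keep]
    return filtered, kept_names


# Toy data: FINISH appears in 5 of 6 wines, the other attributes in 1 of 6.
names = ["FINISH", "CHERRY", "VANILLA"]
wines = [[1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 0, 0], [0, 0, 0], [1, 0, 0]]
filtered, kept = filter_common_attributes(wines, names)
print(kept)
```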

Filtering Method 2: Wine Grade Filtration + Attributes Filtration
The second method involved choosing an attribute as a pivot and building distribution ratios for all wine wheel attributes based on that pivot. It makes the most sense for the pivot attribute to be one that was not generated by the Computational Wine Wheel. Instead, it should be chosen from the other available attributes such as price, score, time of harvest, etc. For this project, score was selected as the pivot. The selection of 432 wines with a score of 95 or greater was split into three sub-groups: wines with scores of 95, wines with scores of 96-97, and wines with scores of 98-100. This split allowed for a relatively even distribution, leading to sets of 165, 202, and 70 wines, respectively. A total was then taken for each wine wheel attribute, representing the number of wines within each sub-group that contained the given attribute. The totals were weighted to account for the variation in sub-group size and used to generate three ratios showing each sub-group's representation of each wine wheel attribute. For the purposes of this method, attributes whose distribution was too even were excluded from clustering. Three subsets of data were generated using distribution thresholds of 50%, 55%, and 60%, meaning that if one sub-group of wines carried a weighted representation of 50% or more for a given wine wheel attribute, the attribute was selected to remain in the 50% subset. The pseudo code for filtering method 2 is given in Figure 5.
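One plausible reading of the weighted distribution ratios can be sketched in Python as follows. The exact weighting is not spelled out in the text, so this sketch assumes that each sub-group's count is divided by the sub-group size and that the three resulting rates are then normalized to sum to 1; the function name `pivot_filter` and the toy sub-groups are hypothetical.

```python
def pivot_filter(groups, n_attributes, threshold=0.50):
    """Return indices of attributes concentrated in one pivot sub-group.

    groups: list of sub-groups, each a list of 0/1 wine vectors.
    Assumed weighting: per-group appearance rate, normalized across groups.
    """
    kept = []
    for j in range(n_attributes):
        # Fraction of wines in each sub-group that carry attribute j
        rates = [sum(w[j] for w in g) / len(g) for g in groups]
        total = sum(rates)
        if total == 0:
            continue  # attribute never appears in any sub-group
        shares = [r / total for r in rates]
        if max(shares) >= threshold:  # distribution is uneven enough to keep
            kept.append(j)
    return kept


# Toy sub-groups standing in for scores of 95, 96-97, and 98-100
g95 = [[1, 1, 1], [0, 1, 0]]
g96_97 = [[1, 0, 1], [0, 0, 0], [1, 0, 0], [0, 0, 0]]
g98_100 = [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0]]
print(pivot_filter([g95, g96_97, g98_100], 3, threshold=0.50))
print(pivot_filter([g95, g96_97, g98_100], 3, threshold=0.60))
```

In the toy data, the first attribute is spread evenly across the sub-groups and is dropped at every threshold, while raising the threshold removes attributes that are only moderately concentrated.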

Proposed K-Means Clustering with Optimal K Search and Filtration Process
In this research, we propose a modified K-means clustering algorithm to cluster the Bordeaux wine dataset, based on the original K-means clustering, the Jaccard distance, and the filtering methods. Figure 6 shows the clustering algorithm that is performed on the data after filtering. The first step reads in the data and sets the initial K value to 2, representing two clusters. After that, one of the filtering methods is applied to remove unwanted information. Steps 3 to 7 are the original K-means clustering algorithm, where the initial centroids are chosen using the random selection method: random points from the dataset are set as the starting centroids to which all points are compared. After all points are clustered using the initial centroids, the new centroids are calculated using a threshold value based on the number of times each feature is present in the wines in the cluster. A base value of 30% is used as the lowest threshold for a feature to become part of the new centroid: if the attribute in question is present (value of 1) in 30% or more of the wines in the cluster, then the value for that feature in the new centroid is 1; otherwise it is 0. This allows unique features to still be evaluated as part of the selection, since some features can appear in less than one percent of the wines.
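The assignment and thresholded centroid update described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: the function names are hypothetical, the loop stops when the cluster assignments no longer change, and an empty cluster simply keeps its previous centroid.

```python
import random


def jaccard_distance(a, b):
    """Jaccard distance between two 0/1 vectors (0.0 if both are all zero)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else (either - both) / either


def threshold_centroid(members, threshold=0.30):
    """Binary centroid: 1 where a feature is present in >= threshold of members."""
    n = len(members)
    return [1 if sum(col) / n >= threshold else 0 for col in zip(*members)]


def kmeans_binary(wines, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Random selection method: start from k wines drawn from the dataset
    centroids = [list(w) for w in rng.sample(wines, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = []
        for w in wines:
            distances = [jaccard_distance(w, c) for c in centroids]
            new_labels.append(distances.index(min(distances)))
        if new_labels == labels:
            break  # assignments are stable, so centroids will not move
        labels = new_labels
        for c in range(k):
            members = [w for w, lab in zip(wines, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = threshold_centroid(members)
    return labels, centroids


toy = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
labels, centroids = kmeans_binary(toy, k=2)
print(labels, centroids)
```

Keeping the centroid binary means the same Jaccard distance can be used for both wine-to-wine and wine-to-centroid comparisons.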
Step 8 evaluates the quality of the clusters. When performing K-Means clustering, a method must be used to assess the validity of the clusters formed. This is useful for knowing whether the clustering algorithm is performing correctly, as well as for showing which value of K results in the best clusters. The method utilized in this project relies on the SSE (Sum of the Squares due to Error) values:

$$\mathrm{SSE} = \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 \qquad (2)$$

The formula calculates the variation within the clusters, where n is the number of observations, $X_i$ is the value of the ith observation, and $\bar{X}$ is the mean of the observations. A cluster that consists of identical items would result in an SSE value of zero. When utilized in cluster evaluation, we can take the minimum SSE value and use it as a measure of when the wines in each cluster are most similar to each other.
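A minimal Python sketch of an SSE calculation for binary wine vectors is given below. How the formula is applied to multi-attribute data is not stated in the text, so this sketch assumes that squared deviations from the per-attribute cluster mean are summed over all attributes and all clusters; the function names are hypothetical.

```python
def cluster_sse(members):
    """Sum of squared deviations from the per-attribute mean of one cluster."""
    n = len(members)
    means = [sum(col) / n for col in zip(*members)]
    return sum((x - m) ** 2 for w in members for x, m in zip(w, means))


def total_sse(wines, labels):
    """Add up the SSE of every cluster defined by `labels`."""
    clusters = {}
    for w, lab in zip(wines, labels):
        clusters.setdefault(lab, []).append(w)
    return sum(cluster_sse(members) for members in clusters.values())


# A cluster of identical wines contributes zero, as noted in the text.
print(cluster_sse([[1, 0], [1, 0]]))
print(total_sse([[1, 0], [1, 0], [0, 1], [0, 0]], [0, 0, 1, 1]))
```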
After steps 3 to 8 are executed, step 9 increments the K value by one and repeats the whole process with the new K value. Therefore, after the first pass with K = 2, step 9 increases K to 3, repeats the whole K-means clustering, and evaluates the result. Once K = 3 is done, K changes to 4, and so on.
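The outer search over K can be sketched as a small Python driver. This is an illustration only: `cluster_fn` and `sse_fn` are hypothetical callables standing in for the clustering and evaluation steps, and the upper bound `k_max` is an assumption because the stopping value of K is not given in the text. The default of 100 runs per K follows the experimental setup reported in the results.

```python
def search_optimal_k(wines, cluster_fn, sse_fn, k_max, runs=100):
    """Try K = 2..k_max, keep the best (lowest) SSE over `runs` restarts per K.

    cluster_fn(wines, k, seed) -> list of cluster labels (hypothetical signature)
    sse_fn(wines, labels)      -> float
    """
    best_sse_by_k = {}
    for k in range(2, k_max + 1):
        best_sse_by_k[k] = min(
            sse_fn(wines, cluster_fn(wines, k, seed=run)) for run in range(runs)
        )
    # The K whose best run has the lowest SSE is reported as the optimum
    best_k = min(best_sse_by_k, key=best_sse_by_k.get)
    return best_k, best_sse_by_k
```

Returning the best SSE for every K, and not only the winner, makes it easy to inspect how the score changes as K grows before committing to a single value.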

Clustering with Attributes Filtration
For the first part of the results, the data is first filtered by the approach described in Section 3.2.1 and then clustered utilizing the full 14,349 wines in the dataset. One hundred runs are performed for each K value, with a maximum of one hundred iterations allowed per run for the program to separate the wines into their best clusters. The SSE value is determined for each run, and the program keeps track of which run produced the best SSE value and what that value was. Once the optimum K value is obtained, the clustering algorithm is performed multiple times using only the optimum K to determine whether this value produces consistently useful results.
Utilizing this method, the optimum number of clusters K was determined to be 43. Based on this information, the clusters were formed after 50 runs of K-means clustering to determine the best formed clusters utilizing 43 as the K value. These clusters were then used to extract the following information for each cluster: Number of Points in Cluster, Average Year, Year Standard Deviation, Average Score, Score Standard Deviation, Most Common Attribute, Second Most Common Attribute, and Third Most Common Attribute. From these clusters, we also determined the wine from each cluster that best represents the cluster as a whole, as shown in Table 2. This wine is determined as the wine with attributes that are most similar to the final centroid value for each cluster. A percentage value was utilized to determine which wine was the most similar to each centroid, and the resulting percentage values ranged from 97.38% to 99.51% in similarity.

Table 2. Best wine representations in each cluster at optimum cluster number (K = 43). The wines shown, with their respective production years and scores, are the wines most similar to the centroid values of each cluster number. Each wine listed contains the closest similarities to all other wines within their respective group (cluster).

Cluster Number   Wine Name   Wine Year   Wine Score

Figure 7 shows the distribution of wines within each cluster when K = 43. Most of the clusters contain 200 or more wines. However, the variation in cluster sizes shows that the wines have been grouped successfully based on their attributes, as further illustrated in Figures 8 and 9, where the most common and second most common attributes are shown for each cluster. This gives more insight into which characteristics make the wines listed in Table 2 the best representations of their clusters. For example, in cluster 10, Romulus Pomerol from the year 2008 is the best wine to represent the cluster. It has a score of 90, making it an outstanding wine by the scoring scale. This wine can most likely be described as "medium-bodied", and the "character" of this wine stands out well.

Figure 8. The most common attribute of each cluster, or group, of wines is shown as a percentage of appearance in each cluster overall. Labels above the bars correlate to the percentage of appearance of that attribute in the cluster. These characteristics are typically the primary characteristics for grouping the wines.

Figure 9. Second most common attributes per cluster at optimum cluster number (K = 43). The second most common attribute for each cluster, or group, of wines is shown as a percentage of appearance in each cluster overall. These values show a significant secondary characteristic grouping of each cluster, reflecting the necessity of utilizing all features of the wine reviews to group the wines accurately.

Figure 10 shows the average wine score in each cluster with standard deviation. Since the wine score was not included in the clustering process as an attribute, we can use Figure 10 to understand more about each cluster.
Cluster 38 has the lowest average score of all the clusters. Looking further into the common attributes of this cluster suggests that the combination of "modest" and "herbs" corresponds to a less desirable wine than any other noted combination. This would also imply that Château Pipeau St.-Emilion 2007, which likely contains this combination of features, would probably be less favorable on this list for vendor sales.
Figure 10. Average wine scores (±SD) per cluster. As the wine scores are not utilized when clustering the wines due to the focus on keywords within wine reviews, the wine scores given at the optimum cluster value (K = 43) are better utilized as a means of reflection as to which common attribute combination (Figures 8 and 9) has better results.

Table 3 illustrates the possible use of the algorithm when a different number of clusters is desired. If researchers want a smaller list of representative wines to select from, the proposed method can reduce the number of clusters accordingly. The same holds when a larger number of wines is desired. While the optimal K value was determined to be 43, that does not restrict the program from creating more or fewer clusters. The major differences when altering the number of clusters are the number of wines placed into each cluster, the average calculations derived from the clusters, and the number of common attributes that are possible.

Table 3. Best representing wines with 13 clusters. These wines are the best representation of their clusters, but they may not capture the overall similarity of the wines in their respective clusters as well as the representing wines with higher K values.

Cluster Number   Wine Name   Wine Year   Wine Score
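The selection of a representative wine per cluster, as used for Tables 2 and 3, can be sketched in Python as follows. The text only states that a percentage similarity to the final centroid was used, so this sketch assumes a simple matching percentage (the share of attributes on which wine and centroid agree); the function names and the wine names in the example are hypothetical.

```python
def percent_match(wine, centroid):
    """Percentage of attribute positions where wine and centroid agree."""
    same = sum(1 for x, c in zip(wine, centroid) if x == c)
    return 100.0 * same / len(centroid)


def representative_wines(names, wines, labels, centroids):
    """Return {cluster: (wine name, similarity %)} for the closest wine."""
    best = {}
    for name, wine, lab in zip(names, wines, labels):
        score = percent_match(wine, centroids[lab])
        if lab not in best or score > best[lab][1]:
            best[lab] = (name, score)
    return best


names = ["Wine A", "Wine B", "Wine C"]  # placeholder names
wines = [[1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
labels = [0, 0, 1]
centroids = [[1, 1, 0, 0], [0, 0, 1, 0]]
print(representative_wines(names, wines, labels, centroids))
```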

Clustering with Wine Grade Filtration + Attributes Filtration
For the second part of the results, the wines and attributes are filtered by the method described in Section 3.2.2. With the wine grade threshold set to 95 points, 435 wines remained after the filtering process. The overall goal of this method is to cluster high-end Bordeaux wines so that vendors and wine lovers alike might use the resulting clusters to develop a selection of wines that encompasses the wide range of characteristics that can describe Bordeaux. The same approach as in Section 4.1 was used to determine the optimal K value. Seven clusters appeared to be the best choice for the smaller but elite dataset. The resulting clusters are relatively evenly distributed compared to unfiltered attempts, and their highest common attributes are highly distinctive and interesting. It was determined that the 60% and 65% subsets were throwing out interesting attributes like BLACK-TEA and BLOOD ORANGE, so the focus remained on the 50% subset. When applying this method to different datasets, these comparisons would still prove useful but require human decision making as to which thresholds are best. The number of wines in each cluster, which ranges from 24 to 155, is illustrated in Figure 15. The most common and second most common attributes and their percentage of appearance are listed in Tables 4 and 5, respectively. The low percentage of appearance of DEEP within Cluster 1 suggests that the wines in this cluster may not be as well grouped as the wines in the others. If a user were to look into the specific wines of each cluster in order to make a selection, this cluster may need to be either left out or simply used to fill out any remaining space in their offerings. The best representation for each cluster is given in Table 6. For this dataset, the average score and the standard deviation did not provide any meaningful insight, which was not surprising given that so few distinct scores were included.

Table 4. Highest common attributes. The percentages shown reflect the idea that cluster 1 may not be considered as a collection of similar wines, but all other clusters present acceptable percentages when considering the number of existing attributes.

Cluster Number   Highest Common Attribute   Percent of Appearance

The highest common attributes hinted at another potential use of this process. Not only are the wines now clustered into unique subcategories within high-end Bordeaux, but potential names for these categories are given by the highest common attributes. A wine vendor could use the attribute names themselves as sub-categories for the Bordeaux wines offered to patrons. For example, a vendor that chose the wine Liber Pater Graves from cluster 1 could advertise or offer it to customers under the moniker Deep, even if the reviewer of that specific wine did not use that specific word. This would in turn simplify and clarify the purchasing choice for the customers themselves.
Each of the two methods described in this project has shown promise toward the objective of composing an all-encompassing list of wines that represents the full range of flavors and textural characteristics that can be used to describe Bordeaux wines.

Conclusions
Wineinformatics is a new data science research area that focuses on large amounts of wine-related data. In this research, unsupervised analysis was applied to 14,349 wines to select representative 21st century Bordeaux wines. A systematic process that incorporates K-means clustering with optimal K search and a filtration process was proposed and carried out in this work. Detailed clustering results constructed from two different filtering methods, where the first method looks at the overall presence of each attribute and the second method focuses on attribute distribution based on a user-defined pivot, were provided in the results section. Both have shown promise for generating unique clusters of wines, and both should be considered for any real-world use cases.
The intended use of these methods is for wine vendors to make a selection given the limited number of wines they can realistically offer. These wines will hopefully represent a broad range of flavor profiles within a given dataset and therefore please the widest market. Wine connoisseurs can also try the representative wines of the clusters to understand the variety of the wine region with as few wines as possible. Another use of the clusters could be a recommendation system. A cluster groups wines that are similar to one another; a consumer who enjoyed a representative wine from a cluster can be recommended other wines in that cluster at a higher (or lower) price.