1. Introduction
During the last several years, the explosion of Internet data has fueled the development of Big Data management systems and technologies by companies such as Google and Yahoo. The rapid evolution of technology and the Internet has created huge volumes of data at a very high rate [1], deriving from commercial transactions, social networks, scientific research, etc. The mining and analysis of this volume of data can benefit humans in crucial areas such as health, economy, national security and justice [2], leading to more qualitative results. At the same time, the computational and management costs are quite high due to the ever-increasing volume of data.
Because customer analysis is the cornerstone of all organizations, it is essential for more and more companies to store their data in large data centers, aiming first to analyze them and then to further understand how their consumers behave. Concretely, a large amount of information is accessed and processed by companies so as to gain deeper knowledge about their products' sales as well as their consumers' purchases. Owners, from small shopkeepers to large organizations, try to record information that may contain useful data about consumers.
In addition to the abundance of Internet data, the rapid development of technology provides ever higher quality network services. The Internet is utilized by a large number of users seeking information in every field, such as health, economy, etc. As a result, companies concentrate on users' desired information and personal transactions in order to offer their customers personalized promotions. Moreover, businesses provide their customers with loyalty cards so that they can record every purchase detail; this practice has led to a huge amount of data and, consequently, to search methods for data processing.
Historically, several analysts have been involved in the collection and processing of data. Nowadays, the sheer volume of data requires the use of specific methods to enable analysts to draw correct conclusions. In addition, the increased volume drives these methods to employ complex tools in order to perform automatic data analysis. Thus, data collection itself can now be regarded as the simple part of the process.
The analysis of a dataset can be considered a key aspect in understanding the way that customers think and behave during a specific period of the year. There are many classification and clustering methods that can greatly help analysts probe consumers' minds. More specifically, supervised machine learning techniques are utilized in the present manuscript in the process of mass marketing, and in particular in the specific field of supermarket ware.
On the other hand, despite all this investment and technology, organizations still struggle to personalize customer and employee interactions. Many analytic applications are simply impractical as well as unsustainable to operate because, in the majority of use cases, they take a long time to produce usable results. In particular, applications cannot generate and then automatically test the large number of hypotheses that are necessary to fully interpret the volume of data being captured. To address these issues, new and innovative approaches, which use Artificial Intelligence and Machine Learning methodologies, now enable accelerated personalization with fewer resources. The result is more practical and actionable use of customer insights, as will be shown in the present manuscript.
According to viral marketing [3], clients influence each other by commenting on specific fields of e-shops. In other words, one can state that e-shops work like virtual sensors producing a huge amount of data, i.e., Big Data. In practice, this phenomenon is evident nowadays as people communicate in real time and influence each other on the products they buy. The aim of our proposed model is the analysis of every purchase and the recommendation of new products to each customer.
This article introduces an extended version of [4], while some techniques were also utilized in [5]. Concretely, a work on modeling and predicting customer behavior using information concerning supermarket ware is discussed. More specifically, a new method for product recommendation by analyzing the purchases of each customer is presented; using the dataset's specific categories, we were able to classify the aforementioned data and subsequently create clusters.
More to the point, the following steps are performed: initially, the analysis of the sales rate as Big Data analytics using MapReduce and Spark implementations is carried out. Then, the distances of each customer from the corresponding supermarket are clustered and, accordingly, the prediction of the new products that are most likely to be purchased by each customer separately is implemented. Furthermore, with the use of three well-known techniques, namely the Vector Space Model, Term Frequency-Inverse Document Frequency (Tf-Idf) and Cosine Similarity, a novel framework is introduced. Concretely, opinions, reviews, and advertisements, as well as other texts that reflect customers' connection to supermarkets, are taken into account in order to measure customer behavior. The proposed framework is based on the measurement of text similarity on a cloud computing infrastructure.
In our approach, we handle the scalability bottleneck of the existing body of research by employing cloud programming techniques. Existing works that deal with customers' buying habits as well as purchase analysis do not address the scalability problem at hand, so we follow a slightly different approach. In our previous work [4], we followed similar algorithmic approaches, but in main memory, whereas here we extend and adapt our approach to the cloud environment, addressing the need for Big Data processing.
Furthermore, the proposed work can be successfully applied to other scenarios as well. Data mining undoubtedly plays a significant role in the process of mass marketing, where a product is promoted indiscriminately to all potential customers. This can be implemented by constructing models that predict a customer's response given their past buying behavior as well as any available demographic information [6]. In addition, one key aspect of our proposed work is that it treats each customer independently; that is, a customer can make a buying decision without knowing what all the other customers usually buy. In this case, we do not consider the influence that is usually taken into account when dealing with such situations, as friends, business partners and celebrities often affect customers' buying habit patterns [7].
The remainder of the paper is structured as follows: Section 2 discusses the related work. Section 3 presents in detail the techniques that have been chosen, while our proposed method is described in Section 4. Additionally, in Section 5, the distributed architecture along with the algorithm paradigms in pseudocode and further analysis of each step are presented. Section 6 describes our implementation, while Section 7 presents the evaluation experiments conducted and the results gathered. Ultimately, Section 8 draws conclusions and directions for future work.
2. Related Work
Driven by real-world applications, managing and mining Big Data have proven to be a challenging yet very compelling task. While the term “Big Data” literally concerns data volumes, the HACE (heterogeneous, autonomous, complex and evolving) theorem [8] suggests that Big Data exhibit the following three principal characteristics: huge volume with diverse and heterogeneous data sources, autonomy with decentralized and distributed control, and finally complex and evolving data and knowledge associations. Another definition of Big Data is given in [9], mainly concerning people and their interactions; more specifically, the authors regard the Big Data nature of digital social networks as the key characteristic. The abundance of information on users' message interchanges among peers is not taken lightly; it aids in extracting users' personality information for inferring social influence behaviour in networks.
The creation as well as the accumulation of Big Data is nowadays a fact in a plethora of scenarios. Sources like the increasing diversity of sensors, along with the content created by humans, have contributed to the enormous size and unique characteristics of Big Data. Making sense of this information has primarily rested upon Big Data analysis algorithms. Moreover, the capability of creating and storing information has become unparalleled. Thus, in [10], the gargantuan plethora of sources that leads to information of varying type, quality, consistency and volume, as well as the creation rate per time unit, has been identified. The management and analysis of such data (i.e., Big Data) are unsurprisingly a prominent research direction, as pointed out in [11].
In recent years, a large percentage of companies have maintained electronic sales transaction systems, aiming at creating a convenient and reliable environment for their customers. In this way, retailers are enabled to gather significant information about their customers. Since the amount of data is increasing significantly, more and more researchers have developed efficient methods and rule algorithms for smart market basket analysis [12]. The “Profset model” is an application that researchers have developed for optimal product selection on supermarket data. By using cross-selling potential, this model has the ability to select the most interesting products from a variety of ware. Additionally, in [13], the authors analyzed and subsequently designed an e-supermarket shopping recommender. Researchers have also invented a new recommendation system through which supermarket customers were able to discover new products [14]; in this system, product matching and clustering methods are used in order to provide less frequent customers with new products.
Moreover, new advertising methods based on alternative strategies have been utilized by companies in order to increase purchases. Such a model was introduced in [3], presenting an analysis of a person-to-person recommendation network with 4 million people and 16 million recommendations. The effectiveness of the recommendation network was illustrated by the resulting increase in purchases. A model regarding a grocery shop for analyzing how customers respond to price and other point-of-purchase information was created in [15]. Another interesting model is presented in [16], where the authors analyzed the product range effect in purchase data. Specifically, since the market is affected by two factors, namely diversity and rationality in the price system, consumers try to minimize their spending and in parallel maximize the number of products they purchase. Thus, the researchers devised an analytic framework based on customers' transaction data, where they found that customers did not always choose the closest supermarket.
Some supermarkets are too big for consumers to search and find the desired product. Hence, in [17], a recommendation system targeted at such supermarkets has been created; using RFID technology with mobile agents, the authors constructed a mobile-purchasing system. Furthermore, in [18], the authors presented another recommendation system based on the past actions of individuals, which they deployed in an Internet shopping mall in Korea. In [19], a new method for personalized recommendation is considered in order to achieve further effectiveness and quality, since collaborative methods presented limitations such as sparsity. Amazon.com, for instance, used many attributes for each customer, including item views and subject interests, in order to create an effective recommendation system. This view is echoed in [20], where the authors analyzed and compared traditional collaborative filtering, cluster models and search-based methods. In addition, Weng and Liu [21] analyzed customers' purchases according to product features and, as a result, managed to recommend products that are more likely to fit customers' preferences.
Besides the development of web technologies, the vibrancy of social networks has created a huge number of reviews on products and services, as well as opinions on events and individuals. This has led consumers to become accustomed to consulting other users' reviews before purchasing a product, service, etc. Another major consequence is that businesses are greatly interested in being aware of the opinions and reviews concerning all of their products or services, so as to appropriately adapt their promotion and further development. As previous work on clustering the opinions emerging in reviews, one can consider the setup presented in [22]. Furthermore, the emotional attachment of customers to a brand name has been a topic of interest in the marketing literature in recent years; it is defined as the degree of passion that a customer feels for the brand [23]. One of the main reasons for examining emotional brand attachment is that an emotionally attached person is highly likely to be loyal and to pay for a product or service [24]. In [25], the authors infer details on the love bond between users and a brand name through their tweets, and this bond is considered a dynamic, ever-evolving relationship. Thus, the aim is to find those users that are engaged and rank their emotional terms accordingly.
Efficient analysis methods in the era of Big Data are a research direction receiving great attention [26]. The perpetual interest in efficient knowledge discovery methods is mainly motivated by the nature of Big Data and the fact that, in most instances, Big Data cannot be handled and processed for knowledge extraction by current information systems. Current experience with Cloud Computing applied to Big Data usually revolves around the following sequence, as in [27]: preparation of a processing job, submission of the job, usually waiting an unknown amount of time for results, receiving feedback on the internal processing events and finally receiving the results.
Large-scale data such as graph datasets from social or biological networks are commonplace in applications and need different handling in terms of storing, indexing and mining. One well-known method to facilitate large-scale distributed applications is MapReduce [28], proposed by Dean and Ghemawat.
Many research works on measuring similarity among texts on cloud infrastructure have been proposed over the last several years. Initially, in [29], a MapReduce algorithm for computing pairwise document similarity in large document collections is introduced. Also, in [30], a method using the MapReduce model to improve the efficiency of the traditional Tf-Idf algorithm is created. Along the same line of reasoning, the authors in [31] propose computing the Jaccard similarity coefficient between users of Wikipedia, based on the co-occurrence of page edits, with the use of the MapReduce framework. Another piece of supplementary research with reference to large-scale datasets appears in [32], where a parallel K-Means algorithm based on the MapReduce framework is proposed.
4. Proposed Method
In our model, given a supermarket ware dataset, our ultimate goal is to predict whether or not a customer will purchase a product, using data analytics and machine learning algorithms. This problem can be considered a classification one, since the opinion class consists of specific options. Furthermore, we have gathered the reviews of Amazon (Seattle, Washington, USA), and in particular the reviews of each customer, in order to analyze the effect of person-to-person influence on each product's market.
The overall architecture of the proposed system is depicted in Figure 2, while the proposed modules and sub-modules of our model are described in the following steps.
As a next step, we have developed two systems: the first in MapReduce and the second in the Apache Spark framework for programming with Big Data. Precisely, an innovative method for measuring text similarity with the use of common techniques and metrics is proposed. In particular, the prospect of applying Tf-Idf [35] and Cosine Similarity [45] measurements to distributed text processing is further analyzed, including the component of pairwise document similarity calculation. This method performs pairwise text similarity with the use of a parallel and distributed algorithm that scales up regardless of the massive input size.
The proposed method consists of two main components, Tf-Idf and Cosine Similarity, both designed following the concept of the MapReduce programming model. Initially, the terms of each document are counted and the texts are then normalized with the use of Tf-Idf. Finally, the Cosine Similarity of each document pair is calculated and the results are obtained. One major characteristic of the proposed method is that it is faster and more efficient compared to traditional methods; this is due to the MapReduce model implementation in each algorithmic step, which tends to enhance the efficiency of the method, as well as to the aforementioned innovative blend of techniques.
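As a rough illustration of this blend of techniques, the following sketch computes Tf-Idf weights and pairwise Cosine Similarity for a toy corpus in plain Python. It is an in-memory analogue of the distributed pipeline, not the actual cluster implementation; all function names and the toy documents are ours.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one Tf-Idf weight vector (dict term -> weight) per document."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # df(t): number of documents in which term t appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return vectors

def cosine_similarity(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values())) *
            math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

docs = ["fresh milk and fresh bread",
        "fresh bread with cheese",
        "laundry detergent and softener"]
vecs = tf_idf_vectors(docs)
# All document pairs, as in the pairwise similarity step of the method
pairs = {(i, j): cosine_similarity(vecs[i], vecs[j])
         for i in range(len(vecs)) for j in range(i + 1, len(vecs))}
```

As expected, the two grocery-related documents score higher against each other than against the unrelated third one.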
4.1. Customer Metrics Calculation
From the supermarket dataset, four datasets with varying numbers of records regarding customers' purchases, containing information about sales over a period of four years, have been randomly sampled. More specifically, the implementation of our method can be divided into the following steps: initially, the customers along with the products are sampled, while, subsequently, the clustering of the products based on the sales rate takes place. Then, the customers are clustered according to the distance of their houses from the supermarket, and a recommendation model, in which new products are separately proposed to each customer based on their consumer behavior, is utilized. Finally, we sampled the customers of Amazon and then, using the ratings of the reviews, computed the fraction of satisfied customers.
The training set of the supermarket data consists of eight features, as presented in Table 1 (where we have added a brief description), including customer ID, category of the product, product ID, shop, number of purchased items, distance from each supermarket, price of the product, as well as the choice.
4.2. Decision Analysis
Here, we describe the choice analysis based on classification and clustering tools. This method uses eight basic features from the given supermarket database as well as eleven different classification methods in order to further analyze our dataset. In [14], researchers have considered clustering with the aim of identifying customers with similar spending history. Furthermore, as the authors in [46] indicate, the loyalty of customers to a certain supermarket is measured in different ways; that is, a person is considered loyal towards a specific supermarket if they purchase specific products and also visit the store on a regular basis. Despite the fact that the percentage of loyal customers seems to be less than , they purchase more than of the total amount of products.
Since the supermarket dataset included only numerical values for each category, we created our own clusters in terms of both customers and products. More concretely, we measured the sales of each product and the distances, and then created three clusters for products as well as two classes for distances.
More to the point, an organization, in order to measure the impact of various marketing channels, such as social media, outbound campaigns as well as digital advertising response, might have the following inputs:
Customer demographics, like gender, age, wealth, education and geolocation,
Customer satisfaction scores,
Recommendation likelihood,
Performance measures, such as email click-through, website visits including sales transaction metrics across categories.
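The inputs listed above can be gathered into a single per-customer record. The following sketch shows one hypothetical way to type such a record; the field names and example values are illustrative assumptions, not taken from the paper's dataset.

```python
from dataclasses import dataclass

@dataclass
class CustomerRecord:
    # Customer demographics (illustrative field names)
    gender: str
    age: int
    wealth: float
    education: str
    geolocation: str
    # Satisfaction and recommendation inputs
    satisfaction_score: float          # e.g., a 1-10 survey score
    recommendation_likelihood: float   # e.g., an NPS-style 0-10 rating
    # Performance measures
    email_click_through: float         # click-through rate of email campaigns
    website_visits: int                # visits including sales transactions

rec = CustomerRecord("F", 34, 42000.0, "BSc", "Athens", 8.5, 9.0, 0.12, 27)
```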
5. Distributed Architecture
The proposed method for measuring text similarity on a cloud computing infrastructure consists of four stages. Initially, the occurrences of each term in the given documents are counted (Word Count) and the frequency of every term in each document is measured (Term Frequency). Thereafter, the Tf-Idf metric of each term is computed (Tf-Idf) and, finally, the cosine similarities of the document pairs are calculated in order to estimate the similarity among them (Cosine Similarity). The MapReduce model has been used for designing each one of the above-mentioned steps. The algorithm paradigm in pseudocode and further analysis of each step are presented in detail in the following subsections.
MapReduce Stages
In the first implementation stage, the occurrences of each word in every document are counted. The algorithm applied is presented in Algorithm 1.
Algorithm 1 Word Count.
1: function Map(docid, doc)
2:   for each term t in doc do
3:     write((t, docid), 1)
4:   end for
5: end function
6:
7: function Reduce((t, docid), counts [c1, c2, ...])
8:   sum ← 0
9:   for each c in counts do
10:    sum ← sum + c
11:  end for
12:  return ((t, docid), sum)  ▹ number of occurrences
13: end function
Initially, each document is divided into key-value pairs, where the tuple (term, docid) is considered the key and the number one is considered the value; this is denoted as ((term, docid), 1). This phase is known as the Map Phase. In the Reduce Phase, each pair is taken and the sum of the list of ones for the term is computed. Finally, the key is set to the tuple (term, docid) and the value to the number of occurrences, respectively.
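The key/value flow just described can be simulated with a minimal in-memory sketch (the real job runs on Hadoop; the function and variable names here are illustrative):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit ((term, doc_id), 1) for every term occurrence
    return [((term, doc_id), 1) for term in text.lower().split()]

def reduce_phase(pairs):
    # Reduce: sum the list of ones for each (term, doc_id) key
    counts = defaultdict(int)
    for key, one in pairs:
        counts[key] += one
    return dict(counts)

emitted = map_phase("d1", "fresh milk fresh bread")
word_counts = reduce_phase(emitted)
```

Here `word_counts` maps each (term, docid) key to its number of occurrences, exactly the output shape Algorithm 1 produces.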
Furthermore, regarding the second implementation phase, the overall number of terms of each document is computed as described in Algorithm 2.
Regarding the Map Phase of this algorithm, the input is divided into key-value pairs, where the docid is considered the key and the tuple (term, count) the value. In the Reduce Phase of the algorithm, the total number of terms in each document is counted and the key-value pairs are returned; that is, (term, docid) as the key and the tuples (count, N) as the value, where N is the total number of terms in the document.
Algorithm 2 Term Frequency.
1: function Map((t, docid), count)
2:   write(docid, (t, count))
3: end function
4:
5: function Reduce(docid, [(t1, c1), (t2, c2), ...])
6:   N ← 0
7:   for each (t, c) in the list do
8:     N ← N + c
9:   end for
10:  for each (t, c) in the list do
11:    write((t, docid), (c, N))
12:  end for
13: end function
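The regrouping performed in this stage can be sketched in plain Python as an in-memory illustration (not the actual Hadoop job; the dictionary-based representation is ours):

```python
from collections import defaultdict

def term_frequency(word_counts):
    """word_counts: {(term, doc_id): count} from the Word Count stage."""
    per_doc = defaultdict(list)
    # Map: re-key each pair by doc_id
    for (term, doc_id), count in word_counts.items():
        per_doc[doc_id].append((term, count))
    result = {}
    # Reduce: compute N, the total number of terms in each document
    for doc_id, terms in per_doc.items():
        n = sum(count for _, count in terms)
        for term, count in terms:
            result[(term, doc_id)] = (count, n)
    return result

tf = term_frequency({("fresh", "d1"): 2, ("milk", "d1"): 1,
                     ("bread", "d1"): 1})
```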
In the third implementation stage, the Tf-Idf metric of each term in a document is computed, as presented in Algorithm 3, with the use of the following Equation (3):

Tf-Idf(t, d) = (c / N) × log(|D| / n),  (3)

where c is the number of occurrences of term t in document d, N is the total number of terms in d, |D| is the number of documents in the corpus and n is the number of documents where term t appears.
Algorithm 3 Tf-Idf.
1: function Map((t, docid), (c, N))
2:   write(t, (docid, c, N))
3: end function
4:
5: function Reduce(t, [(docid1, c1, N1), (docid2, c2, N2), ...])
6:   n ← size of the list  ▹ number of documents where t appears
7:   for each (docid, c, N) in the list do
8:     tfidf ← (c / N) × log(|D| / n)  ▹ |D|: number of documents in corpus
9:     write((t, docid), tfidf)
10:  end for
11: end function
By applying Algorithm 3, during the Map Phase the term t is set as the key and the tuple (docid, c, N) as the value. The number of documents in which the term appears is then calculated during the Reduce Phase, and the result is assigned to the variable n. The term frequency is subsequently multiplied by the inverse document frequency of each term. Finally, key-value pairs with (term, docid) as the key and the Tf-Idf weight as the value are obtained as a result.
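This stage can likewise be sketched in plain Python, consuming the (count, N) pairs of the previous stage (an in-memory illustration with names of our own choosing):

```python
import math
from collections import defaultdict

def tf_idf(tf_pairs, num_docs):
    """tf_pairs: {(term, doc_id): (count, n_terms)}; num_docs is |D|."""
    by_term = defaultdict(list)
    # Map: re-key each pair by the term
    for (term, doc_id), (count, n_terms) in tf_pairs.items():
        by_term[term].append((doc_id, count, n_terms))
    weights = {}
    # Reduce: n = number of documents containing the term
    for term, postings in by_term.items():
        n = len(postings)
        for doc_id, count, n_terms in postings:
            weights[(term, doc_id)] = (count / n_terms) * math.log(num_docs / n)
    return weights

w = tf_idf({("fresh", "d1"): (2, 4), ("fresh", "d2"): (1, 3),
            ("milk", "d1"): (1, 4)}, num_docs=2)
```

A term that appears in every document (here "fresh") correctly receives weight zero, since log(|D|/n) = log(1) = 0.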
In the fourth and final implementation phase, all of the possible combinations of document pairs are generated and the cosine similarity of each pair is computed, as presented in Algorithm 4. Assuming that there are n documents in the corpus, a similarity matrix is generated as in the following Equation (4):

S = [s_ij],  s_ij = CosineSimilarity(d_i, d_j) = (v_i · v_j) / (|v_i| |v_j|),  i, j = 1, ..., n,  (4)

where v_i is the Tf-Idf vector of document d_i.
Algorithm 4 Cosine Similarity.
1: function Map(corpus [d1, ..., dn])
2:   for i ← 1 to n − 1 do
3:     for j ← i + 1 to n do
4:       write((di, dj), (vi, vj))
5:     end for
6:   end for
7: end function
8:
9: function Reduce((di, dj), (vi, vj))
10:  sim ← (vi · vj) / (|vi| × |vj|)
11:  return ((di, dj), sim)
12: end function
In the Map Phase of Algorithm 4, every potential combination of the input documents is generated, the document pair (d_i, d_j) is set as the key and the corresponding Tf-Idf vectors (v_i, v_j) are set as the value. Within the Reduce Phase, the Cosine Similarity for each document pair is calculated and the similarity matrix is provided. This algorithm is visualized in Figure 3.
6. Implementation
The first stage of the implementation process was data cleaning. In the dataset utilized, there was a small number of missing values. In general, there are several methods for data imputation, depending on the features of the dataset. In our case, the missing values were imputed with the mean value of each feature. The main reason this method was chosen is that the number of missing values was too small (less than ) and other methods, like linear regression, would be time-consuming for the whole process.
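Mean-value imputation of the kind described above can be sketched as follows (the feature name is illustrative, not an actual column of the dataset):

```python
def impute_mean(rows, feature):
    """Replace missing (None) values of `feature` with the feature's mean."""
    present = [r[feature] for r in rows if r[feature] is not None]
    mean = sum(present) / len(present)
    for r in rows:
        if r[feature] is None:
            r[feature] = mean
    return rows

rows = [{"distance": 2.0}, {"distance": None}, {"distance": 4.0}]
impute_mean(rows, "distance")
```

After the call, the missing entry holds the mean of the observed values.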
After finishing the data cleaning process, consumers were categorized according to the amount of money they spend at the supermarket. More specifically, we created three consumer categories, A, B and C, which correspond to the average money they spend on a regular basis. In addition, the same process with three categories was implemented for the distance of each consumer's house from the supermarket. The overall implementation is presented in terms of the MapReduce model in Algorithm 5.
Algorithm 5 Distributed Implementation.
1: function Map
2:   for each consumer do
3:     cluster consumers into three categories based on budget spent
4:   end for
5:   for each consumer do
6:     emit(consumer_id, details)  ▹ each consumer's details are processed by the same reducer
7:   end for
8: end function
9:
10: function Reduce  ▹ recommend the products with the highest similarity score, for every consumer, that have not been purchased
11:  for each consumer do
12:    build the consumer's feature vector from the clustered attributes
13:  end for
14:  find the consumers that belong to the same cluster
15:  for each consumer do
16:    compute cosine_similarity(consumer, consumers_list)
17:  end for
18:  for each consumer do
19:    recommend the consumer's products with the highest similarity
20:  end for
21: end function
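The three-way consumer categorization used above can be sketched as a simple threshold rule. The cut-off values below are illustrative assumptions, as the paper does not list the exact thresholds:

```python
def spend_category(avg_spend, low=50.0, high=150.0):
    """Assign a consumer to category A, B or C by average spend.

    `low` and `high` are hypothetical thresholds, not the paper's values.
    """
    if avg_spend >= high:
        return "A"   # highest-spending consumers
    if avg_spend >= low:
        return "B"   # mid-range consumers
    return "C"       # lowest-spending consumers

categories = [spend_category(s) for s in (200.0, 80.0, 20.0)]
```

The same thresholding idea applies to the distance-based categories mentioned above.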
6.1. Datasets
The present manuscript utilizes two datasets, namely a supermarket database [16] as well as a database from Amazon [3], which contain information about the purchases of customers.
Initially, we started our experiments with the supermarket database [16] and we extracted the data using the C# language in order to calculate customer metrics. We implemented an application with which we measured all the purchases of the customers, and then several samples of the customers were collected so as to further analyze the corresponding dataset. The five final datasets consist of randomly selected purchases with all the information from the supermarket dataset, as previously mentioned.
The prediction of any new purchase is based on the assumption that every customer can be affected by any other one, due to the fact that consumers communicate every day and exchange reviews of products. On the other hand, being budget conscious, they are constrained to select the products that best correspond to their needs. Therefore, a model that recommends new products to every customer from their most preferred supermarket is proposed.
By analyzing the prediction model, information about consumers' behavior can be extracted. We measured the total amount of products that customers purchased and then categorized them accordingly. Several classifiers were trained using the dataset of vectors. We separated the dataset and used 10-Fold Cross-Validation to evaluate the training and test sets. The chosen classifiers are evaluated using the TP (True Positive) rate, FP (False Positive) rate, Precision, Recall, as well as F-Measure metrics. We chose classifiers from five major categories of the Weka library [47], including “bayes”, “functions”, “lazy”, “rules” and “trees”.
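The evaluation metrics listed above can be computed from the raw confusion counts as follows (a sketch of the standard definitions; the paper itself relies on Weka's built-in evaluation):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute TP rate, FP rate, Precision, Recall and F-Measure."""
    tp_rate = tp / (tp + fn)        # also known as recall / sensitivity
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp_rate
    f_measure = 2 * precision * recall / (precision + recall)
    return {"TP rate": tp_rate, "FP rate": fp_rate,
            "Precision": precision, "Recall": recall,
            "F-Measure": f_measure}

# Illustrative counts, not results from the paper's experiments
m = classification_metrics(tp=80, fp=10, fn=20, tn=90)
```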
Additionally, we evaluated a model using the results of our experiments on Amazon [3], since we wanted to measure the number of customers in terms of five product categories, namely book, dvd, music, toy and video. In Table 2, we present the number of satisfied and, on the other hand, the number of unsatisfied customers.
Figure 4 illustrates the number of customers who are satisfied with products of each of the aforementioned categories. We can observe that the number of satisfied customers is much larger than that of unsatisfied ones in four out of five categories (for the category entitled toy, the number is equal to 1 for both groups of customers). From these results, one can easily conclude that Amazon customers are loyal to the corresponding company and prefer purchasing products from the abovementioned categories.
6.2. Cloud Computing Infrastructure
A series of experiments for evaluating the performance of our proposed method from many different perspectives was conducted. More precisely, we implemented the algorithmic framework by employing a cloud infrastructure. We used the MapReduce programming environment as well as Apache Spark, which is a newer framework built on the same principles as Hadoop.
Our cluster includes four computing nodes (VMs), each one of which has four GHz CPU processors, GB of memory, 45 GB hard disk and the nodes are connected by 1 gigabit Ethernet. On each node, Ubuntu as operating system, Java with a 64-bit Server VM, as well as Hadoop and Spark were installed.
One of the VMs serves as the master node and the other three VMs as slave nodes. Furthermore, the following changes to the default Hadoop and Spark configurations are applied: 12 total executor cores (four for each slave machine) are used, the executor memory is set to 8 GB and the driver memory is set to 4 GB.
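A configuration of this kind would typically be passed to Spark at submission time, along the lines of the following indicative invocation (the script name and master URL are placeholders, not taken from the paper):

```shell
spark-submit \
  --master spark://master:7077 \
  --total-executor-cores 12 \
  --executor-cores 4 \
  --executor-memory 8G \
  --driver-memory 4G \
  similarity_job.py
```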
8. Conclusions
In the proposed work, we have presented a methodology for modelling and predicting the purchases of a supermarket using machine learning techniques. More specifically, two datasets are utilized: a supermarket database as well as a database from Amazon that contains information about the purchases of customers. Given the analysis of the Amazon dataset, a model is created that predicts new products for every customer based on the category and the supermarket the customers prefer. We have also examined the influence of person-to-person communication, where we found that customers are greatly influenced by other customers' reviews.
More to the point, we handle the scalability bottleneck of existing works by employing cloud programming techniques. Concretely, the analysis of the sales rate as Big Data analytics using MapReduce and Spark implementations is carried out. Furthermore, with the use of three well-known techniques, namely the Vector Space Model, Tf-Idf and Cosine Similarity, a novel framework is introduced. Opinions, reviews, and advertisements, as well as other texts that reflect customers' connection to supermarkets, are taken into account in order to measure customer behavior. The proposed framework is based on the measurement of text similarity on a cloud computing infrastructure.
As future work, we plan to create a platform using the recommendation network, through which customers will have the opportunity to choose among many options for new products at lower prices. We also plan to extend and improve our framework by taking into consideration more features of the supermarket datasets, which may be added to the feature vector and improve the classification accuracy. Another future consideration is experimentation with different clusters so as to better evaluate Hadoop's and Spark's performance in terms of time and scalability. Finally, we could use a survey to gain further insights and obtain an alternative verification of users' engagement.