Automatic Eligibility of Sellers in an Online Marketplace: A Case Study of Amazon Algorithm

Abstract: Purchase processes on Amazon Marketplace begin at the Buy Box, which represents the buy click process for which numerous sellers compete. This study aimed to estimate empirically the relevant seller characteristics that Amazon could consider when featuring sellers in the Buy Box. To that end, 22 product categories from Italy's Amazon web page were studied over a ten-month period, and the sellers were analyzed through their products featured in the Buy Box. Two different experiments were proposed, and the results were analyzed using four classification algorithms (a neural network, random forest, support vector machine, and C5.0 decision trees) and a rule-based classification. The first experiment aimed to characterize sellers unspecifically by predicting their change at the Buy Box. The second one aimed to predict which seller would be featured in it. Both experiments revealed that the customer experience and the dynamics of the sellers' prices were important features of the Buy Box. Additionally, we proposed a set of default features that Amazon could consider when no information about sellers was available. We also proposed the possible existence of a relationship or composition among important features that could be used for sellers to be featured in the Buy Box.


Introduction
The number of algorithms that automate services once requiring manual operations is expected to grow in the coming years. Knowledge about the behaviour of such algorithms is gaining interest in the scientific community despite the current lack of tools and methodologies to measure the effects of these algorithms on people. Algorithms implemented for personalization on Google Search, the review of gig economy workers by customers (e.g., a job for a specified period of time), or gender discrimination in hiring are some of the subjects of this research [1][2][3]. In e-commerce and online marketplaces, the impact of algorithms is also of interest due to the current popular demand of web-based shopping platforms.
Amazon Marketplace is one of the leaders in online retail [4], controlling 45% of the e-commerce market share in the United States and surpassing Walmart in this regard in 2020 [5]. Amazon competes with some of the largest corporations around the world for market share, accounting for 13% of the global e-commerce sector's gross merchandise volume in 2020, while the Alibaba group (Taobao, Alibaba, and Tmall), Jingdong (JD.com), Pinduoduo, and eBay had 25%, 9%, 6%, and 2% of the market, respectively. The combined share of Suning.com, Rakuten, Apple, Walmart, Vip.com, and Shopee was 6% [6].
Amazon Marketplace represents a structured and managed e-commerce website that accommodates two groups of participants: sellers presenting new, refurbished, or used products to a large group of potential buyers and customers who benefit from a coordinated system of purchasing that includes the search for products, payment, shipment, and order tracking. The number of products each seller can offer on Amazon is unlimited, as is the

Related Work
To the best of the authors' knowledge, the only study that has investigated the algorithm by which Amazon selects sellers to occupy the Buy Box is that of Chen et al. [39], who analyzed algorithmic pricing strategies on the Amazon Marketplace. These authors also simplified the modeling into a predictive problem and, after analyzing the importance of the selected features from the sellers' offers, concluded that the most important ones for a seller earning the position in the Buy Box were the price difference and the ratio of the price to the lowest price for the product. Other analyzed features included positive seller feedback, whether or not the product was fulfilled by Amazon, and the average product rating. The main difference between the predictive problem used in that study and the one used here is the type of sellers considered. While those authors collected information from the seller offers displayed on the offer listing page and from the seller who won the Buy Box, the present study considered only the characteristics of the sellers occupying the Buy Box. This is because we wanted to focus on sellers who were truly eligible and probably more professional. Additionally, according to Amazon documentation, the features that condition a seller's eligibility to occupy the Buy Box are sales volume, response time to customer enquiries, rate of returns and refunds, and shipping times [39]. Therefore, one plausible way to circumvent the lack of this latter information from sellers was to consider only those who actually earned the Buy Box.
The remainder of this article is organized as follows: In Section 2, the experimental design of this work is described, while the results are presented in Section 3 together with a short discussion. Finally, the main conclusions are presented in Section 4.

Materials and Methods
This section describes the predictive problems for an empirical analysis of Amazon's criteria for a seller being selected to occupy the Buy Box and then once selected, continuing in it. Thus, two classification experiments using selected features and the same four classifiers on each were performed on 22 different product category datasets. Then, for each experiment, once the more accurate classifier had been selected, the importance of features in the predictive problems was estimated. Complementary to this estimation, a rule-based classification was also performed. The datasets used in this study and the analyzed features are explained in the next section.

Datasets and Features
Product page information from 530 best-seller products belonging to 22 different product categories was obtained from Italy's Amazon Marketplace web page from 5 April to 14 December 2018. Preliminary crawling exercises were carried out for the Amazon Marketplaces in Germany, the United Kingdom, Spain, and France; the best performance in terms of server stability and response time was found for Italy's Marketplace. A typical product page is shown in Figure 1. For each category, the best-seller products were analyzed over time, and a longitudinal dataset describing the dynamics of the features shown in Table 1 was created. These features were obtained directly from the product pages, except the last three, which were derived from the prices of the products at each analyzed point in time. A crawling experiment was carried out beforehand to detect the categories with the most frequent changes in the sellers occupying the Buy Box. Categories with a low rate of change among such sellers or those for which Amazon was the only seller (e.g., Amazon device accessories, Kindle, Alexa) were excluded. Analogously, products sold by fewer than five sellers in a category were not considered due to low sales relevance. The crawling process was carried out sequentially from the first best-seller product listed in each category to the last. Due to Amazon's strategic commercial reasons, the number of best-seller products displayed in each category was altered over time (e.g., 20, 50 or 100 products); therefore, the frequency of visits to each product page to collect product information changed during the experiment, ranging from ∼1 h to ∼4 h. The numbers of instances, products, and sellers in each analyzed dataset are indicated in Table 2. The datasets built for each product category were studied independently and used as input data in the supervised and rule-based experiments. Next, the proposed classification problems are explained.

Proposed Classification Problems
The estimation of the importance of the predictors shown in Table 1 was addressed through two supervised experiments. They were performed independently on each of the 22 longitudinal datasets from the product categories. The experiments differed in the treatment received by the response feature. In the first one (predicting the change of a seller at the Buy Box), the levels of the response feature were represented by a binary output in which the positive class ("1") indicated that a change of seller had been observed at the Buy Box, and the negative class ("0") indicated otherwise (two-class classification). In the second one (predicting the seller to occupy the Buy Box), the levels of the response feature indicated the sellers displayed in the Buy Box (multi-class classification). Thus, the first experiment can be considered unspecific regarding the target seller occupying the Buy Box, since it focused simply on detecting changes. On the other hand, the second experiment aimed to predict the specific seller occupying the Buy Box. Figure 2 illustrates the labeling process for a given product category.
Table 2. Characteristics of the datasets built from the selected product categories. The number of instances in each dataset is indicated as well as the relations of sellers to products (Sellers/Products). Abbr.: abbreviation.
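The labeling of the response feature in the two experiments can be sketched as follows. This is only an illustration in Python, not the authors' code: the function names, the seller identifiers, and the choice to label the first observation of a product as "no change" are assumptions made here, since the details of Figure 2 are not reproduced in the text.

```python
def label_seller_change(sellers):
    """Binary response: 1 if the Buy Box seller differs from the one seen
    at the previous observation of the same product, 0 otherwise."""
    if not sellers:
        return []
    # Assumption: the first observation has no predecessor, so it is "no change".
    labels = [0]
    for previous, current in zip(sellers, sellers[1:]):
        labels.append(1 if current != previous else 0)
    return labels


def label_seller(sellers):
    """Multi-class response: the seller identifier itself is the class."""
    return list(sellers)


# Hypothetical chronological Buy Box observations for one product.
observed = ["S1", "S1", "S2", "S2", "S1"]
change_labels = label_seller_change(observed)
seller_labels = label_seller(observed)
print(change_labels)
print(seller_labels)
```

The same sequence of observations therefore yields two different response features: a rare positive class in the first experiment and one class per seller in the second.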
Four classifiers were used in each supervised experiment, namely, neural networks (nnet), random forests (rf), support vector machines (svm), and C5.0 decision trees (C5.0). Descriptions of these classifiers are provided in Section 2.2.1. The idea behind this approach was to identify the most accurate classifier in each experiment and then use it to estimate the importance of the features and perform rule-based classification. A general overview of the complete experimental design is illustrated in Figure 3. To evaluate the accuracy of the classifiers, the dataset of each category was divided into training and test sets consisting of 70% and 30% of the data, respectively. The caret package [40] for the R language was used to build classification models on the training data, using a 10-fold cross-validation scheme repeated three times to tune the hyperparameters of each classifier. Because the values of the predictors had different magnitudes, all predictors were centred and scaled.
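The resampling scheme described above can be sketched in plain Python as follows. This is an illustrative sketch, not the caret-based R code used in the study: the function names and seeds are hypothetical, a plain random 70/30 split is assumed (the text does not say whether the split was stratified), and centring and scaling are shown for a single numeric predictor.

```python
import random
import statistics


def train_test_split(rows, train_frac=0.7, seed=0):
    """Randomly split rows into training and test parts (70/30 by default)."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(round(train_frac * len(rows)))
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]


def repeated_kfold(n, k=10, repeats=3, seed=0):
    """Yield (train_idx, valid_idx) pairs: k folds, reshuffled `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            yield [i for i in idx if i not in held], held_out


def centre_and_scale(column):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    if sd == 0:
        return [0.0 for _ in column]
    return [(x - mean) / sd for x in column]


train, test = train_test_split(list(range(100)))
resamples = list(repeated_kfold(len(train)))
print(len(train), len(test), len(resamples))
```

Each hyperparameter candidate would be scored on the 30 held-out folds (10 folds × 3 repeats) of the training part, while the 30% test part is kept aside for the final accuracy estimate.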

Classification Algorithms
A decision tree gives a set of rules that can be used to divide data into different groups to make a decision about them [41]. An rf classifier is an ensemble of decision trees that uses randomly selected subsets of training samples and features to yield reliable classifications. The trees are created by drawing a subset of training samples with replacement, meaning that the same sample can be selected several times, while others may not be selected at all. About two-thirds of the samples are used to train each tree, with the remaining one-third (the out-of-bag samples) being used for an internal validation that estimates how well the resulting rf model performs [42]. C5.0 decision trees are a more advanced version of Quinlan's C4.5 classification model [43]. C4.5 builds decision trees from a set of training data using the concept of information entropy. C5.0 has additional features, such as boosting and unequal costs for different types of errors, and is also likely to generate smaller trees. The algorithm combines non-occurring conditions for splits with several categories and conducts a final global pruning procedure that attempts to remove sub-trees with a cost-complexity approach [44].
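The "about two-thirds" figure follows from sampling with replacement: when n samples are drawn from n cases, the expected fraction of distinct cases is 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 0.632 for large n. A minimal sketch, with a hypothetical function name and an arbitrary seed, illustrates this:

```python
import random


def bootstrap_sample(n, rng):
    """Draw n indices with replacement; return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(42)
n = 10_000
in_bag, out_of_bag = bootstrap_sample(n, rng)
distinct_fraction = len(set(in_bag)) / n
print(f"distinct in-bag fraction: {distinct_fraction:.3f}")
print(f"out-of-bag fraction:      {len(out_of_bag) / n:.3f}")
```

The out-of-bag cases of each tree play the role of a validation set for that tree, which is why no separate hold-out is needed to obtain an internal error estimate.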
nnet are computational models inspired by biological neural networks that are capable of approximating nonlinear functional relationships between input and output features. A collection of neurons is referred to as a layer, and a collection of interconnected layers forms the neural network [45]. In a neuron, the output is calculated by a nonlinear function of the weighted sum of its inputs. The connections between neurons from adjacent layers are represented by the weights of the model. The weights are adjusted as learning proceeds, and they represent the strength of the signal at a connection. The nonlinear function is also called the activation function [46].
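The computation performed by a single neuron and by a layer can be sketched as follows. The function names, the bias term, and the logistic activation are assumptions chosen for illustration; the text does not specify the activation function or architecture used in the study.

```python
import math


def logistic(z):
    """Logistic (sigmoid) activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias=0.0, activation=logistic):
    """Output of one neuron: activation applied to the weighted sum of inputs."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def layer(inputs, weight_rows, biases, activation=logistic):
    """A layer is a collection of neurons that share the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]


# Hypothetical inputs and weights: the weighted sum is 0.5 - 0.5 = 0.
single = neuron([1.0, 2.0], [0.5, -0.25])
hidden = layer([1.0, 2.0], [[0.5, -0.25], [1.0, 1.0]], [0.0, -3.0])
print(single, hidden)
```

Training adjusts the entries of the weight rows and the biases so that the outputs of the final layer match the observed classes as closely as possible.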
svm are based on the statistical learning theory concept of decision planes that define decision boundaries. A decision plane ideally separates objects with different class memberships. The most commonly known svm is the linear classifier, which assigns each input to one of two possible classes. A more accurate definition is that an svm builds a hyperplane or set of hyperplanes to classify all inputs in a high-dimensional or even infinite-dimensional space. The values closest to the classification margin are known as support vectors. The svm's goal is to maximize the margin between the hyperplane and the support vectors [47,48]. The metrics used to evaluate the accuracy of the classifiers are explained next.

Performance Evaluation
The classes used in the experiments described above are not equally distributed because the occurrence of sellers in the Buy Box is not homogeneous and the rate of change of sellers in the Buy Box is low (the positive class is under-represented). Since it is not appropriate to use only a single metric to evaluate the performance of a classifier [49], three metrics suited to dealing with class imbalance were selected: balanced accuracy (BAcc), the Kappa statistic (K) [50], and the F1-score (F1). The first experiment was analyzed as a binary classification problem, whereas the multi-class problem (second experiment) was evaluated as a set of binary problems ('one-versus-all' transformation). The metrics are explained next using a binary confusion matrix in the context of both experiments (Table 3). For a given product, the positive class (+) in the first experiment was represented by a change in seller, and the negative class (−) was represented by the continuity of the seller at the Buy Box. In the second experiment, the positive class (A) was some seller of interest selling a given product, and the negative class (not A) was represented by any other seller.
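For reference, the three metrics are conventionally defined from the entries of a binary confusion matrix as shown next. The symbols TP, TN, FP, and FN (true positives, true negatives, false positives, and false negatives) are assumed here and may differ from the notation used in Table 3; the expressions are the standard textbook definitions rather than a reproduction of the original equations.

```latex
\begin{align}
\mathrm{BAcc} &= \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right), \\
F1 &= \frac{2\,TP}{2\,TP + FP + FN}, \\
K &= \frac{O - E}{1 - E},
\end{align}
```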
where O is the observed accuracy, and E is the expected accuracy based on the marginal totals of the confusion matrix. BAcc and F1 metrics range from 0 to 1, and high values indicate high classification performances. The K statistic takes values between −1 and 1; a value of 0 means there is no agreement between the observed and predicted classes, while a value of 1 indicates perfect concordance between the model prediction and the observed classes [44].
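The metrics can also be computed directly from confusion-matrix counts, as in the following sketch. The function names and the example counts are hypothetical, and returning 0.0 for undefined ratios is a convention chosen here; the study itself relied on R tooling.

```python
def one_vs_all_counts(y_true, y_pred, positive):
    """Reduce a multi-class problem to binary counts for one class of interest."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binary_metrics(tp, fp, fn, tn):
    """Return (balanced accuracy, F1, Kappa) from binary confusion counts."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    bacc = (sensitivity + specificity) / 2
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    observed = (tp + tn) / n
    # Expected accuracy from the marginal totals of the confusion matrix.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0
    return bacc, f1, kappa


# Hypothetical counts for a reasonably informative classifier.
informative = binary_metrics(tp=40, fp=10, fn=20, tn=30)
# Hypothetical classifier that never predicts the rare positive class.
always_negative = binary_metrics(tp=0, fp=0, fn=10, tn=90)
print(informative)
print(always_negative)
```

In the second example, raw accuracy is 0.9, yet Kappa and F1 are both 0 and balanced accuracy is 0.5, which illustrates why plain accuracy is uninformative when the positive class is under-represented.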
Once both experiments had been carried out and the best classifiers identified in each according to the three metrics used (Tables 4 and 5), the importance of the predictors was estimated, as indicated in Table 1, and a rule-based classification analysis was conducted.

Predictor Importance and Rule-Based Classification
The importance of the predictors was estimated to identify the most relevant features involved in both predictive problems. For that purpose, for each product category, the most accurate classifier was used to train a model with the full dataset (no train-test split), and the importance of the predictors was estimated using the varImp function from the caret package. This function measures the aggregate effect of the predictors on the model and returns a score for each of the features in the model.
As a complementary exercise, rule-based classification was performed to discern relations within the set of features (predictors and response) analyzed in this study. A rule-based classifier uses a set of IF-THEN rules for class prediction. An IF-THEN rule is an expression of the form IF condition THEN conclusion. The "IF" part (or left-hand side) of a rule is known as the rule antecedent or precondition. The "THEN" part (or right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more feature tests that are logically combined by AND clauses. These features are the predictors defined in Table 1. The classes predicted by the rules in this study were represented by the seller's identification or by its change labeled as a binary feature, depending on the experiment, as explained above (Section 2.2). The function C5.0 (C50 package [51]) from R was used, and all rules obtained from both experiments were analyzed. Totals of 488 and 4009 rules were obtained in the experiments to predict the change in seller and to predict the seller, respectively. These rules included 1616 and 16,368 conditions, respectively. Since the same conditions may appear in different rules, an analysis of the frequency of conditions appearing in rules was performed.
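The structure of an IF-THEN rule and the frequency analysis of conditions can be sketched as follows. The rule representation, the thresholds, and the two example rules are hypothetical and serve only to illustrate the idea; they are not rules produced by C5.0 in the study.

```python
import operator
from collections import Counter

OPS = {"<=": operator.le, ">": operator.gt, "==": operator.eq}

# Each rule: a list of (feature, operator, value) conditions joined by AND,
# plus the predicted class. Feature names follow Table 1; values are made up.
rules = [
    {"conditions": [("rPrice", ">", 0.05), ("opinions", "<=", 120)], "conclusion": "1"},
    {"conditions": [("opinions", ">", 500)], "conclusion": "0"},
]


def rule_fires(rule, case):
    """A rule covers a case only if every condition in its antecedent holds."""
    return all(OPS[op](case[feature], value) for feature, op, value in rule["conditions"])


def condition_frequencies(rule_set):
    """Absolute and relative (%) frequency of each feature across all conditions."""
    counts = Counter(feature for rule in rule_set for feature, _, _ in rule["conditions"])
    total = sum(counts.values())
    return {feature: (count, 100 * count / total) for feature, count in counts.items()}


case = {"rPrice": 0.10, "opinions": 80}
fired = [rule["conclusion"] for rule in rules if rule_fires(rule, case)]
frequencies = condition_frequencies(rules)
print(fired)
print(frequencies)
```

Counting how often each feature appears in rule antecedents gives the kind of absolute and relative frequencies that the analysis of conditions relies on.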
The accuracy of each rule was estimated using the Laplace ratio (n − m + 1)/(n + 2), where n is the number of cases covered by the rule (support of the rule) and m is the number of covered cases that do not belong to the class predicted by the rule. Additionally, the lift estimate was calculated by dividing the rule's estimated accuracy by the relative frequency of the predicted class in the dataset. This estimate is a measure of the interest of the rule (predictive ability) and lies in the range [0, ∞). Values well above one indicate that the conditions defining the rule and the predicted class co-occur more often than would be expected if they were independent.
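A short worked example makes both estimates concrete. The function names and the numbers (a rule covering 48 cases with 2 misclassified, predicting a class that makes up 10% of the dataset) are hypothetical.

```python
def laplace_accuracy(n, m):
    """Laplace ratio (n - m + 1) / (n + 2) for a rule covering n cases,
    m of which do not belong to the predicted class."""
    return (n - m + 1) / (n + 2)


def lift(n, m, class_frequency):
    """Estimated rule accuracy divided by the relative frequency of the class."""
    return laplace_accuracy(n, m) / class_frequency


accuracy = laplace_accuracy(n=48, m=2)          # (48 - 2 + 1) / 50
rule_lift = lift(n=48, m=2, class_frequency=0.10)
print(round(accuracy, 2), round(rule_lift, 1))
```

The Laplace correction keeps the estimate away from 0 and 1 for rules with very small support: a rule covering a single correctly classified case obtains 2/3 rather than 1.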

Results and Discussion
The accuracy of the classifiers in each experiment was estimated for each of the 22 product category datasets. Results are shown in Tables 4 and 5, together with the average values across categories. It is remarkable that, in the seller change prediction experiment, the Kappa value for some categories (e.g., Bby and Lgh) was 0 when evaluated with two accuracy metrics, and that the svm classifier showed a value of zero or close to zero for this statistic in 13 categories. Greater accuracy for all categories was obtained by the rf and C5.0 decision tree classifiers. In the seller prediction experiment, no accuracy value was close to zero. The C5.0 classifier was selected as the most accurate, as it obtained the highest values for the three quality metrics when both experiments were considered.

Predictor Importance
The importance of the predictors in both predictive experiments, estimated with the C5.0 decision trees, is shown in Figure 4. A summary of the most and least relevant features based on occurrences is shown in Table 6.
Table 6 (fragment). Most and least relevant features based on occurrences; each row lists two entries separated by "|". stock (2) | prodRating, rank (2). 10th: 1. amChoice, best-seller (7) | amChoice, best-seller (8); 2. fulfilled, rank, rPriceCumMax (2) | rank (5); 3. fulfilled, stock (1) | fulfilled (1).
The prodRating and opinions features were found to be the most important in both experiments (Table 6), although they had different representations across categories. In particular, the user experience (opinions) represented the most relevant feature in the Elc, Hpc, Tls, and Wtc categories. The product fulfilment by Amazon (fulfilled) feature appeared to be decisive only for the seller prediction experiment, whereas the variation of prices (rPrice) was relevant to the prediction of a change in seller. This latter predictor was estimated to be the most relevant in four categories and the second-most important in 11 categories.
Another aspect of interest is the least important features. The importance of features indicates that whether a product is an Amazon's choice (amChoice) or a best-seller (best-seller) is not specifically considered for selection into the Buy Box. Given that the fulfilled feature was included in the groups of both important and dispensable features in both experiments, the decisive role of this feature is likely to be category-dependent. This was also observed for prodRating, although only for the seller prediction experiment.
The experimental design used in the Buy Box study conducted by Chen et al. [39] to detect algorithmic pricing in Amazon Marketplace also included the prediction of the seller occupying the Buy Box, which coincides with the second experiment presented in this study (specific experiment). For that purpose, those authors used the random forest algorithm and a set of features related to prices, average rating, positive feedback and feedback count, whether or not sellers used FBA, and whether or not the seller was Amazon. They observed that Amazon used non-trivial strategies to evaluate sellers (i.e., additional features beyond price to select the seller to occupy the Buy Box). They detected that the seller's positive feedback and feedback count (prodRating and opinions in our study, respectively) were also important features related to "winning" or occupying the Buy Box, which coincided with the results obtained here (Table 6). Interestingly, these authors considered the fulfilled feature (FBA program) to have low relevance, which also coincided with our results.

Rule-Based Classification
The main characteristics of the rule-based classification analysis are shown in Table 7. A detailed list of the most relevant rules for both experiments can be found in the Supplementary material. The average number of conditions present in the rules was greater for the seller prediction experiment (4.1 conditions/rule) than for the seller change prediction experiment (3.3 conditions/rule), and the average accuracy of rules was similar in both experiments (0.81 and 0.85, respectively). These latter results were in line with the average accuracy level obtained for the C5.0 algorithm when evaluated using the train-test split (Tables 4 and 5). Remarkably, the lift estimates of the rules were one order of magnitude higher for the seller prediction experiment, suggesting a higher efficiency of rules for predicting sellers than for predicting their change.
Table 7. Numbers of rules and conditions for each product category, accuracy, average lift, and ranking according to lift for both experiments. n_r and n_c are the numbers of rules and conditions, respectively. n_r/n_c is the rounded average number of conditions per rule.
Complementing the previous analysis, Figure 5 shows the frequency of appearance of conditions in rules as a heat map, and Table 8 shows the absolute and relative (%) frequencies of the features. In this study, we interpreted such frequencies as playing the role of the weights (in %) with which Amazon combines sellers' features to select sellers to occupy the Buy Box. In Figure 5, conditions involving the product were the most frequent in both experiments, since Amazon's algorithm is primarily oriented toward fulfilling customers' product demand among the huge catalog of available products.
Additionally, the specificity of rules for products could be based on product differentiation, as reflected in the high values for categories like Pc, Elc, and Vdg and, conversely, in the low values for categories with little product differentiation, such as Spr, Grc, and Lgh. This can also be seen in the lift rankings for the seller prediction experiment (Table 7). As shown, the number and types of conditions seem to be highly category-dependent. However, a more interesting outcome can be found in Table 8. Different features are used by Amazon's algorithm to select sellers to occupy the Buy Box, although their use (%) was found to be quantitatively different depending on the experiment (the unspecific prediction of a seller change or the specific prediction of the seller). Apart from product, which was not found to add any qualitative distinction to the analyses beyond its availability from a given seller, opinions and the attributes related to the price dynamic (rPrice, rPriceCumMax and rPriceCumMin) were identified in both experiments as the features most frequently applied by the Amazon algorithm. This could indicate that they are primary attributes considered by Amazon to select a seller to occupy the Buy Box.

As discussed previously in this section, the different use (%) of features between experiments could suggest a sort of weighted relationship among them, and that Amazon could select such relationships from one or both experiments according to the information it holds about the seller. This interpretation coincides with that given by Chen et al. (2016) [39]. However, those authors associated the importance of features obtained from the random forest classifier with the features' weights. Weights for features associated with prices were the highest, followed by positive feedback and whether Amazon was the seller. In our work, these weights coincided for the same features, since opinions and rPrice obtained the highest weights among the studied features (in bold, Table 8). For recent sellers, with low selling activity or few available products, the relationship of attributes in the unspecific experiment is considered, and rPrice was shown to have the highest weight. On the contrary, when Amazon has enough information about the sellers, this attribute is replaced by opinions. This hypothesis could also be extended to the classification problem in Section 3.1. However, it should be noted that the percentage values shown in Table 8 refer to all categories of products, and these results could vary according to the types of products analyzed. Attributes such as best-seller, amChoice, rank or prodRating seem to be irrelevant in selecting a seller to occupy the Buy Box. The relevance of these attributes is in accordance with the predictor importance results shown in Figure 4 (Section 3.1) for the unspecific experiment (opinions and rPrice predictors) and, to a lesser extent, with those obtained in the specific experiment (opinions predictor).

Conclusions
This study aimed to analyze empirically the most important features in determining how Amazon chooses sellers to occupy the Buy Box. To that end, Italy's Amazon web page was analyzed over a period of 10 months, and best-seller products from most of the categories of products were analyzed. From each category, sellers' characteristics were analyzed by studying the behavior of products featured in the Buy Box. Such behavior was analyzed according to the price dynamic of the products, their availability and ranking, customer experience, and whether the product was fulfilled by Amazon.
This study considered two different but complementary experiments. The first, which had an unspecific nature, was designed to predict seller change in the Buy Box. The second, a more specific experiment, focused on predicting which seller would occupy the Buy Box. Both experiments were analyzed using supervised and rule-based classification.
The classification results for the first (unspecific) experiment showed that customer experience (opinions and prodRating features) and product price dynamics (rPrice) were the most important features in determining whether a seller would be selected to occupy the Buy Box. With regard to the second (specific) experiment, Amazon's fulfillment of products (fulfilled) was identified as the most important feature along with customer experience. However, the analysis also revealed that the results were category-dependent.
The rule-based classification indicated that the opinions on products received from customers (opinions) and the attributes related to the price dynamic (rPrice, rPriceCumMax and rPriceCumMin) were the most relevant to Amazon's algorithm for selecting sellers to occupy the Buy Box. This was found in both experiments (unspecific and specific). These results were mostly coincident with the classification results. It is hypothesized that there was a composition or relationship among such features that was used to decide which seller should occupy the Buy Box, given their different frequencies of use in the experiments. A composition could be selected by Amazon according to the sellers' available history. In the case of new or low-activity sellers, rPrice could be the leading feature, while opinions could be used for active sellers with a record of activity on Amazon Marketplace. Results obtained through this empirical study revealed a dependency on the categories of the analyzed products.
The main limitation of this study was that it analyzed only one Amazon Marketplace (Italy), which means that the conclusions cannot necessarily be extended to other European Amazon Marketplaces. Additionally, the lack of product price information from the sellers replaced at the Buy Box represented a considerable setback to better understanding Amazon's algorithm for selecting the seller. Future work will include broadening the analysis presented in this study to Amazon's other marketplaces in Europe as well as analyzing sellers' strategies to sell the same specific best-selling products on such Amazon platforms. Additionally, the extension of this analysis to the mobile shopping market (m-commerce) is of interest.

Disclaimer
The views expressed are purely those of the authors and may not be regarded, in any circumstances, as stating the official position of the European Commission.