Sorting Center Value Identification of “Internet + Recycling” Based on Transfer Clustering

As the core link of the “Internet + Recycling” process, the value identification of the sorting center is a great challenge because its data set is small and imbalanced. This paper utilizes transfer fuzzy c-means to improve the value assessment accuracy of the sorting center by transferring knowledge from customer clustering. To ensure the transfer effect, an inter-class balanced data selection method is proposed to select a balanced and better-qualified subset of the source domain. Furthermore, an improved RFM (Recency, Frequency, and Monetary) model, named GFMR (Gap, Frequency, Monetary, and Repeat), is presented to attain a more reasonable attribute description for sorting centers and consumers. An application in the field of electronic waste recycling shows the effectiveness and advantages of the proposed method.


Introduction
In 2015, China's recycling pattern shifted from "manual recycling" to "Internet + Recycling" [1]. Identifying the value of sorting centers is a very important part of the "Internet + Recycling" process [2]. Take electronic waste (e-waste) recycling as an example: the company offers information to sorting centers and receives a commission [3]. Therefore, "Internet + Recycling" companies design specific strategies based on sorting center costs [4], recycling channels [5], and value dimensions [6]. In this way, a company can improve its business competitiveness and reduce sorting center churn. However, the data set of sorting centers is small and imbalanced, a situation called "Absolute Rarity" [7], which makes value assessment difficult. Traditional oversampling methods [8] may not evaluate the value of sorting centers accurately.
Transfer learning is a branch of machine learning that has been shown to help solve problems with small data sets. It has been widely used in image classification [9], signal processing [10], and text classification [11]. An effective model for the target domain can be obtained by leveraging useful related information from the source domain. However, research on the transfer clustering problem is still limited. Jiang et al. proposed transfer spectral clustering (TSC), which transfers knowledge from a related clustering task [12]. Wang et al. extended three traditional Gaussian mixture models (GMMs) to transfer clustering versions [13]. These methods are more suitable for clustering problems with definite boundaries, but the number of sorting centers is too limited to obtain such boundaries. Fuzzy c-means (FCM) [14] is a clustering algorithm that can handle this problem, and it can be adapted to many fields by changing its objective function [15]. Transfer fuzzy c-means (TFCM) [16] is a transfer clustering version of FCM that performs well on small data sets by transferring knowledge from the cluster centers of a related source domain. "Internet + Recycling" has plentiful customers, whose data contain useful and related information. Thus, in this paper, we adopt TFCM to transfer the cluster centers of customers as knowledge to assist the clustering of sorting centers.
In order to obtain accurate cluster centers, a comprehensive model that can describe the characteristics of customers is necessary. The RFM model combined with a clustering algorithm has been widely used in customer value identification. Pondel et al. compared the results of three different clustering algorithms on 56,237 customers who made at least 2 purchases in an online store [17]. Kumar et al. applied the RFM model to 127,037 business customers [18]. Many other scholars have adjusted the RFM model according to the features of their customers. For example, Li et al. used an improved RFM model with added indicators to classify 4000 customers on an e-commerce platform [19]. Fahed et al. used an enhanced RFM model to classify 42,172 retail customers [20]. However, in the "Internet + Recycling" process, some high-value customers are individual economies, which are rare among customers. The original RFM model cannot comprehensively characterize these high-value customers.
In summary, to identify the value of sorting centers accurately, TFCM, which transfers knowledge from customers, is adopted to address the small size of the data set, and inter-class balanced data selection (IBDS) is proposed to address its imbalance. To obtain accurate customer cluster centers for transfer, an improved RFM model, GFMR, is proposed. An application at an "Internet + Recycling" company shows that our approach effectively improves the accuracy of classifying sorting centers. With accurate value identification of sorting centers, "Internet + Recycling" companies can apply their marketing strategies more precisely, which can improve business competitiveness and reduce sorting center churn.

Acquisition of Customer Cluster Centers
Accurate cluster centers of customers are the prerequisite for TFCM. The RFM model is a popular customer value analysis tool that is widely used to measure customer lifetime value and to support customer value identification and behavioral analysis. The original RFM definition is as follows:
R: recency of the last trade
F: frequency of trades
M: monetary value of the trades
However, the original RFM model cannot identify active customers: the R of RFM is almost the same for a new customer and an active loyal customer. In the "Internet + Recycling" company, the top 20% of customers, who trade actively, contribute over 60% of the trading volume. Therefore, it is important to separate high-value customers from the others.
The high-value customers of "Internet + Recycling" companies are sometimes individual economies. They recycle specific goods and store them; when the price of the goods is relatively high, they place several orders online. Their characteristics are therefore a short trade gap, high frequency, large monetary value, and a focus on particular goods. To strengthen the ability to identify high-value customers, this paper proposes the GFMR model as follows.
where $T_s$ denotes the statistical interval, $T_f$ denotes the first date on which the consumer traded during the statistical interval, and $T_l$ denotes the last date on which the consumer traded during the statistical interval. If a customer has traded only once during $T_s$, this paper assumes that the consumer's average trade gap is larger than the statistical interval and sets $G = T_s$. Otherwise, the true average trade gap is calculated.
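As an illustration of the indicators described above, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical record layout `(customer_id, day, category, amount)`, takes the average gap for customers with more than one trade to be $(T_l - T_f)/(F - 1)$, takes M to be the sum of trade amounts, and takes R to be the largest number of trades in a single goods category. All of these choices are assumptions made only for this example.

```python
from collections import Counter, defaultdict

def gfmr(orders, t_s):
    """Compute (G, F, M, R) per customer from order records.

    orders: iterable of (customer_id, day, category, amount); `day` is a day
            index inside the statistical interval (hypothetical layout).
    t_s:    length of the statistical interval in days.
    """
    by_customer = defaultdict(list)
    for cid, day, category, amount in orders:
        by_customer[cid].append((day, category, amount))

    features = {}
    for cid, recs in by_customer.items():
        days = [d for d, _, _ in recs]
        f = len(recs)                       # F: number of trades
        m = sum(a for _, _, a in recs)      # M: total amount (assumption)
        if f == 1:
            g = float(t_s)                  # single trade: G is set to T_s
        else:
            # assumed form of the average trade gap
            g = (max(days) - min(days)) / (f - 1)
        # R: most frequent goods category count (assumption)
        r = max(Counter(c for _, c, _ in recs).values())
        features[cid] = (g, f, m, r)
    return features

orders = [
    ("a", 10, "phone", 120.0),
    ("a", 20, "phone", 80.0),
    ("a", 40, "laptop", 300.0),
    ("b", 200, "tv", 50.0),
]
result = gfmr(orders, t_s=365)
print(result["a"])  # (15.0, 3, 500.0, 2)
print(result["b"])  # (365.0, 1, 50.0, 1)
```

Customer "b" trades only once, so its gap falls back to the full interval, which is the behavior the text describes for one-time customers.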
The GFMR model reduces the effect of randomness because all four indicators depend only weakly on the sampling date. The average gap separates consumers who trade intensively from the others. The frequency and monetary value identify loyal consumers. The number of repeat recycling trades can identify individual economies. Consequently, GFMR is more suitable for identifying consumer value in "Internet + Recycling". Based on the GFMR model, the k-means algorithm [21] is used to obtain the cluster centers.
Definition of variables:
$D_S$: domain consisting of the standardized GFMR data of m customers
K: the number of clusters
$\tilde{v}_k$: the kth cluster center of customers
The steps for acquiring customer cluster centers with the k-means algorithm are as follows:
Step 1: Randomly generate $\tilde{v}_k$ (k = 1, 2, ..., K) as the initial cluster centers.
Step 2: Calculate the distance from each $x_j^S$ to $\tilde{v}_k$ as $\|x_j^S - \tilde{v}_k\|^2$ and assign the sample to the cluster with the minimal distance.
Step 3: Calculate the mean of all samples within each cluster and update $\tilde{v}_k$.
Step 4: Repeat Step 2 and Step 3 until the cluster centers no longer change.
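The k-means steps listed above can be sketched in plain Python as follows. This is an illustrative sketch only: initializing the centers from randomly chosen samples, the `max_iter` cap, and the toy data are assumptions that do not come from the paper.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(samples, centers):
    """Index of the nearest center for every sample."""
    return [min(range(len(centers)), key=lambda c: sq_dist(s, centers[c]))
            for s in samples]

def kmeans(samples, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initial centers: k randomly chosen samples (assumption).
    centers = [list(s) for s in rng.sample(samples, k)]
    for _ in range(max_iter):
        labels = assign(samples, centers)          # nearest-center assignment
        new_centers = []
        for c in range(k):                         # mean of each cluster
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[c])     # keep an empty cluster's center
        if new_centers == centers:                 # centers stopped moving
            break
        centers = new_centers
    return centers, assign(samples, centers)

# Toy standardized data with two well-separated groups (illustrative only).
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
centers, labels = kmeans(samples, k=2)
print(centers)
print(labels)
```

The returned `centers` play the role of the customer cluster centers that are later transferred to the target domain.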

Transfer Clustering for Sorting Centers
The customer clustering studies introduced above used data sets ranging from 4000 to 127,037 samples, whereas there are only 223 sorting centers, far below the order of magnitude needed for modeling. In transfer learning, the domain containing a large amount of useful information is defined as the source domain, and the domain to be learned is defined as the target domain. In this paper, the customer data set is the source domain $D_S = \{x_j^S\}$ (j = 1, 2, ..., M) and the labeled sorting center data set is the target domain $D_T = \{x_i^T\}$ (i = 1, 2, ..., N). Due to the small size of the data set, there are no clear boundaries between classes. FCM can help solve this problem. The objective function of the original FCM is as follows.
$$J_{FCM} = \sum_{k=1}^{K}\sum_{i=1}^{N} u_{ik}^{\alpha}\,\|x_i^T - v_k\|^2, \quad \text{s.t. } \sum_{k=1}^{K} u_{ik} = 1,\; u_{ik} \in [0, 1] \tag{2}$$
where K denotes the number of clusters, $U = [u_{ik}]_{K \times N}$ is the fuzzy/possibilistic partition matrix whose element $u_{ik}$ denotes the membership of the ith sample in the kth class, α denotes the fuzzy index, and T is the matrix of K cluster centers whose element $v_k$ denotes the kth cluster center of sorting centers. Knowledge must be transferred from $D_S$ because $D_T$ alone cannot be trained into a satisfactory model. A transfer learning version of FCM exists, whose objective function is defined as follows [16].
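where $\lambda_1$ and $\lambda_2$ are non-negative balance parameters. The learning rules based on Equation (3) are given by Equation (4) for $v_k$ and Equation (5) for $u_{ik}$. The TFCM algorithm (Algorithm 1) is described below.
1. Set t = 0 and initialize the memberships $u_{ik}(0)$, the threshold ε, and the maximum number of iterations $t_{max}$.
2. Update $v_k(t)$ using Equation (4).
3. Set t = t + 1.
4. Update $u_{ik}(t)$ using Equation (5).
5. If all $|u_{ik}(t) - u_{ik}(t - 1)| < \varepsilon$ or $t = t_{max}$, then terminate; else go to 2.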
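The iteration above can be sketched in code. This is a sketch under a stated assumption, not the reference implementation of [16]: it assumes the TFCM objective adds two penalty terms to FCM, $\lambda_1 \sum u_{ik}^{\alpha}\|\tilde{v}_k - v_k\|^2$ and $\lambda_2 \sum u_{ik}^{\alpha}\|x_i - \tilde{v}_k\|^2$, where $\tilde{v}_k$ are the source cluster centers. The update rules in the code are the stationary points of that assumed objective, and the function name `tfcm` and its defaults are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def tfcm(x, src_centers, lam1=1.0, lam2=1.0, alpha=2.0,
         eps=1e-6, t_max=100, seed=0):
    """Transfer fuzzy c-means under the assumed objective (requires alpha > 1).

    x:           target samples (list of vectors)
    src_centers: source-domain cluster centers, one per cluster
    Returns (v, u) where u[i][k] is the membership of sample i in cluster k
    (stored N x K here, i.e. transposed relative to the K x N matrix U).
    """
    k, n, dim = len(src_centers), len(x), len(x[0])
    rng = random.Random(seed)
    # Random initial memberships, each sample's memberships summing to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(k)]
        total = sum(row)
        u.append([r / total for r in row])
    for _ in range(t_max):
        # Center update: weighted mean pulled toward the source center.
        v = []
        for c in range(k):
            w = [u[i][c] ** alpha for i in range(n)]
            denom = (1.0 + lam1) * sum(w)
            v.append([sum(w[i] * (x[i][d] + lam1 * src_centers[c][d])
                          for i in range(n)) / denom for d in range(dim)])
        # Membership update from the three distance terms.
        new_u = []
        for i in range(n):
            cost = [sq_dist(x[i], v[c])
                    + lam1 * sq_dist(src_centers[c], v[c])
                    + lam2 * sq_dist(x[i], src_centers[c]) + 1e-12
                    for c in range(k)]
            inv = [cst ** (-1.0 / (alpha - 1.0)) for cst in cost]
            total = sum(inv)
            new_u.append([val / total for val in inv])
        change = max(abs(new_u[i][c] - u[i][c])
                     for i in range(n) for c in range(k))
        u = new_u
        if change < eps:          # all membership changes below the threshold
            break
    return v, u

# Toy target data and toy source centers (illustrative values only).
x = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
src_centers = [(0.1, 0.1), (0.9, 0.9)]
v, u = tfcm(x, src_centers, lam1=10.0, lam2=1.0)
print(v)
```

Raising `lam1` pulls the target centers toward the source centers, while raising `lam2` makes the memberships depend more on distances to the source centers. This is the sense in which the two parameters set the level of learning from the source domain.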
The computational complexity of TFCM is O(tNK + tC), the same as that of FCM. This method transfers knowledge from the source domain to the target domain through the source domain cluster centers. Changing $\lambda_1$ and $\lambda_2$ adjusts the level of learning from the source domain. As proved in [16], if the source domain contains poor knowledge, it negatively affects clustering performance in the target domain, which is called negative transfer. The original article tried but failed to find parameter values that reduce the effect of a poor source domain. Therefore, instead of tuning parameter values, this paper adopts data selection, which has been proven to be effective.
In data selection, the key is to find a measure of the difference between the source and target domains. Kullback-Leibler (KL) divergence is often used to measure the difference between two distributions. The KL divergence is defined as follows.
$$D_{KL}(p\|q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)} = H(p, q) - H(p) \tag{6}$$
where H(p, q) denotes the cross-entropy and H(p) denotes the information entropy.
In $D_{KL}(p\|q)$, H(p) is constant for a fixed p. By Gibbs' inequality, H(p, q) is no smaller than H(p), so $D_{KL}(p\|q)$ increases monotonically with H(p, q), and H(p, q) can stand in for $D_{KL}(p\|q)$. A smaller H(p, q) means that q is closer to p.
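The relation between KL divergence, cross-entropy, and entropy used in this argument can be checked numerically. The snippet below is a small sketch for discrete distributions; the three probability vectors are made-up illustrative values.

```python
import math

def entropy(p):
    """Information entropy H(p) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) of q relative to p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) computed directly from its definition."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]          # reference distribution (illustrative)
q_near = [0.45, 0.35, 0.2]   # close to p
q_far = [0.1, 0.1, 0.8]      # far from p

for q in (q_near, q_far):
    # The two printed columns agree: D_KL(p||q) = H(p, q) - H(p).
    print(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
```

Because H(p) is the same in both rows, ranking candidate distributions by cross-entropy gives the same order as ranking them by KL divergence.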
In this paper, let p be the distribution of the source domain, which has m samples, and let q be the distribution of the target domain, which has n samples. We attempt to find source domain samples that are more similar to the target domain; in mathematical terms, we want a smaller H(q, p). We therefore order the source domain samples by $H(q, p, i) = -q(x_i)\log p(x_i)$ and select the s samples with the smallest values to compose the subset of the source domain. It is easy to obtain
$$H(q, p) = \sum_{i} H(q, p, i) \tag{7}$$
which means that source domain samples can be measured by H(q, p, i): the smaller the value, the closer the sample is to the target domain. However, because the numbers of samples in different categories are imbalanced, if the distributions are not calculated separately for each class of the target domain or the source domain, a class with more samples will overwhelm the features of a class with fewer samples. As shown in Figure 1, when the distribution of G of consumers is fitted, the five best-fitting distributions all neglect small samples.
The computational complexity of the proposed selection method is O(… + bN), where b is the number of features of $D_T$. The proposed method increases the complexity of the algorithm to some extent, but it can significantly improve the overall accuracy of the algorithm.
This paper orders the source domain samples by calculating H(q, p, i) for each sample within its own category. Then, the best s samples of each category are used to form a balanced and more similar subset that provides better cluster centers for TFCM. The method has proved to be effective.
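The inter-class balanced selection step can be sketched as follows. This sketch covers only the selection logic. It assumes that a score H(q, p, i) has already been computed for every source sample from the class-wise distributions, and the function name `balanced_select` and the toy scores are hypothetical.

```python
from collections import defaultdict

def balanced_select(samples, labels, scores, s):
    """Keep the s lowest-scoring source samples of every class.

    samples: source-domain samples
    labels:  class (value level) of each sample
    scores:  H(q, p, i) of each sample, computed per class beforehand
    """
    by_class = defaultdict(list)
    for x, c, h in zip(samples, labels, scores):
        by_class[c].append((h, x))
    subset = {}
    for c, items in by_class.items():
        items.sort(key=lambda t: t[0])            # smallest score first
        subset[c] = [x for _, x in items[:s]]     # same quota for each class
    return subset

# Toy example: four "low" samples and two "high" samples (made-up scores).
samples = ["a1", "a2", "a3", "a4", "b1", "b2"]
labels = ["low", "low", "low", "low", "high", "high"]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

subset = balanced_select(samples, labels, scores, s=2)
print(subset)  # {'low': ['a1', 'a2'], 'high': ['b1', 'b2']}

# A global top-4 selection would keep only the majority class.
global_top = [x for _, x in sorted(zip(scores, samples))[:4]]
print(global_top)  # ['a1', 'a2', 'a3', 'a4']
```

The contrast between the two printed results shows why the selection is done per class: a single global ranking can drop a minority class entirely, whereas the per-class quota keeps the subset balanced.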

Experimental Results
In this section, the proposed algorithm is evaluated on a real-world data set. This paper collected 754,904 e-waste recycling order records of consumers and 19,703 e-waste transporting records of sorting centers, covering January 2021 to December 2021, from an "Internet + Recycling" company in China. In total, 308,059 consumers and 223 sorting centers were identified. The company has developed four marketing programs targeting high-value, potential-value, stable-value, and low-value sorting centers. The goal of this paper is to accurately identify these four types of sorting center value.

Data Processing
Consumers are modeled by GFMR. Because the trades of sorting centers are always counted per vehicle without a specific trade category, sorting centers are modeled by GFM. The data are normalized as follows. A smaller G is better, so G is normalized by Equation (8). For F, M, and R, bigger is better, so they are normalized by Equation (9). $G'$, $F'$, $M'$, and $R'$ are the standardized variables.
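The two normalization directions can be illustrated with a short sketch. It assumes min-max scaling to [0, 1], reversed for indicators where smaller is better. This form is an assumption for illustration and is not confirmed as the exact form of Equations (8) and (9); the function names are hypothetical.

```python
def normalize_smaller_better(values):
    """Min-max scaling where the smallest raw value maps to 1 (e.g. G)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(hi - v) / span if span else 0.0 for v in values]

def normalize_larger_better(values):
    """Min-max scaling where the largest raw value maps to 1 (e.g. F, M, R)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

g = [15.0, 365.0, 190.0]   # average gaps in days (toy values)
f = [3, 1, 2]              # trade counts (toy values)

print(normalize_smaller_better(g))  # [1.0, 0.0, 0.5]
print(normalize_larger_better(f))   # [1.0, 0.0, 0.5]
```

After this step, a larger standardized value indicates a more valuable customer or sorting center for every indicator, so the cluster centers can be compared on a common scale.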

Customer Value Identification
Based on the RFM and GFMR models, the results of using the k-means clustering algorithm to identify customer value are as follows.
As shown in Table 1, the difference in R between the four clusters is not obvious. Thus, the algorithm may incorrectly classify some high-value users into the potential-value group, which leads to the high M of the potential-value cluster. The difference between the stable-value and low-value clusters lies mainly in R, which is driven by the random time at which a new customer begins to use the service. As shown in Table 2, the difference in G between the four clusters is very obvious, and the added R clearly separates high-value customers from the others. Low-value customers who trade only once are clearly separated from stable-value users who trade more actively. Thus, the result of the GFMR model is better than that of the RFM model, and its cluster centers are more informative for clustering sorting centers.

Sorting Center Value Identification
However, there is still a large gap between the cluster centers of the source domain and those of the target domain, as shown in Tables 3 and 4. This means that the source domain differs greatly from the target domain. Therefore, IBDS is applied to find a more appropriate source domain. The cluster centers of the selected subset of $D_S$ are shown in Table 5: all four cluster centers are closer to those of the target domain. IBDS orders samples by H(q, p, i); the smaller the value, the more similar the sample is to the target domain. Figure 2 shows how the distance between the four cluster centers and the true target domain centers varies with the top ratio of $D_S$ that is retained. The distance is minimal when the top 50% of $D_S$ is taken. If the ratio is too large, the subset contains some samples that are not very similar to the target domain. If the ratio is too small, the randomness of the samples also affects the similarity between the subset and the target domain.
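The ratio comparison described above can be sketched as a small helper. It assumes that the distance is the sum of Euclidean distances between cluster centers matched by value level. That choice of metric, the helper names, and all numbers in the example are illustrative assumptions and are not results from the paper.

```python
import math

def center_gap(subset_centers, target_centers):
    """Sum of Euclidean distances between matched cluster centers."""
    return sum(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        for u, v in zip(subset_centers, target_centers)
    )

def best_ratio(centers_by_ratio, target_centers):
    """Ratio whose subset centers lie closest to the target centers."""
    return min(centers_by_ratio,
               key=lambda r: center_gap(centers_by_ratio[r], target_centers))

# Made-up centers for two clusters and three candidate ratios.
target = [(0.9, 0.8), (0.2, 0.1)]
centers_by_ratio = {
    0.25: [(0.7, 0.8), (0.2, 0.3)],
    0.50: [(0.85, 0.8), (0.2, 0.15)],
    1.00: [(0.5, 0.5), (0.4, 0.4)],
}

for ratio, centers in centers_by_ratio.items():
    print(ratio, round(center_gap(centers, target), 3))
print(best_ratio(centers_by_ratio, target))
```

In practice, the centers for each ratio would come from clustering the top-ranked fraction of each class of the source domain, and the ratio with the smallest gap would be used for the transfer.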
To demonstrate the effect of IBDS, this paper also presents the accuracy obtained when transferring the knowledge of the full $D_S$ and of the top 50% of $D_S$, with $t_{max} = 100$, in Tables 6 and 7. A greener background color in Tables 6 and 7 means higher accuracy, and a redder background color means lower accuracy. As shown in Table 6, with the same $\lambda_1 = 10$ and $\lambda_2 = 1$, the accuracy of the top 50% of $D_S$ is 95.07%, which is higher than the 91.03% obtained with 100% of $D_S$. Compared to the original $D_S$, the selected subset achieves higher accuracy for most parameter settings, and across all parameter settings it exceeds the original $D_S$ by 7.13%. When $\lambda_1$ is larger than $\lambda_2$, the accuracy is usually good because $v_k$ is affected by the randomness of the small data set, and this impact can be reduced by strengthening the learning from high-quality data in the source domain. Thus, IBDS combined with TFCM effectively clusters small and imbalanced data sets.
To highlight the advantages of our approach, we also compared it with FCM and CSS (clustering with stratified sampling technique) [8]. The detailed results are shown in Figure 3. TFCM combined with IBDS produces the result closest to the real situation. Because the data set of sorting centers is small and imbalanced, the accuracy of FCM is only 60.09%. FCM classifies some high-value sorting centers as potential-value and some potential-value sorting centers as stable-value, because placing the cluster centers among the dense samples achieves a lower value of Equation (2). CSS is an imbalanced data classification algorithm, and its accuracy is 83.41%. CSS is more accurate in identifying low-value sorting centers, but TFCM combined with IBDS outperforms CSS in identifying high-value sorting centers, because a small data set is easily overfitted when samples are generated through oversampling. Transfer learning can effectively improve the accuracy of sorting center value identification while avoiding overfitting.

Conclusions
Because the data set of sorting centers is small and imbalanced, TFCM combined with IBDS has been proposed to solve the value identification problem. TFCM transfers knowledge from the cluster centers of customers. IBDS finds a subset of the source domain that is more balanced and more similar to the target domain. Further investigation shows that different ratios of the subset exhibit different disparities from the target domain, which are caused by randomness and redundant samples. A suitable ratio that balances sample diversity and validity gives better performance. For most parameter settings, IBDS raises the accuracy, which demonstrates the validity of the method. Compared with FCM, the value assessment accuracy of the sorting center rose from 60.09% to 91.03%. The method in this paper is also less prone to overfitting than the oversampling method.
Further research will focus on automatic adjustment of the ratio to balance sample diversity and validity. In this paper, customers and sorting centers share similar characteristics. However, transferring knowledge from data sets without similar features is still a challenge.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.