# Reinforcement Learning Recommendation Algorithm Based on Label Value Distribution


## Abstract


## 1. Introduction

## 2. Research Design and Model Construction

#### 2.1. Fuzzy Mahalanobis Metric Clustering Enhancement

#### 2.2. Sublinear Coding Enhancement

#### 2.3. Enhancement of Contextual–Quantile Regression Reinforcement Learning Model

**Algorithm 1:** Contextual–Quantile Regression Reinforcement Learning Network

$\mathrm{Hyperparameters}: N, K$

$\mathrm{Input}: c, x, a, r, x^{\prime}, \gamma \in [0, 1)$

// Compute the distributional Bellman operator

$Q(c, x^{\prime}, a^{\prime}) := \sum_{j} q_{j}\, \theta_{j}(c, x^{\prime}, a^{\prime})$

$a^{*} \leftarrow \arg\max_{a^{\prime}} Q(c, x^{\prime}, a^{\prime})$

$\mathcal{T}\theta^{\prime}_{j} = r + \gamma\, \theta^{\prime}_{j}(c, x^{\prime}, a^{*}), \quad j = 0, 1, \cdots, N-1$

// Optimized quantile regression loss

$\mathrm{Output}: \sum_{i=1}^{N} \mathbb{E}_{j}\left[\rho^{\kappa}_{\hat{\tau}_{i}}\left(\mathcal{T}\theta^{\prime}_{j} - \theta_{i}(c, x, a)\right)\right]$
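The loss at the end of Algorithm 1 is the standard quantile-regression (Huber) loss between each predicted quantile $\theta_i$ and each Bellman target $\mathcal{T}\theta^{\prime}_{j} = r + \gamma\theta^{\prime}_{j}$. A minimal NumPy sketch of that computation follows; the function names and the default $\kappa = 1$ are illustrative, not from the paper.

```python
import numpy as np

def bellman_target(r, gamma, theta_next):
    """Distributional Bellman targets T theta'_j = r + gamma * theta'_j."""
    return r + gamma * np.asarray(theta_next, dtype=float)

def quantile_huber_loss(theta_pred, theta_target, kappa=1.0):
    """Quantile regression Huber loss over N predicted quantiles.

    theta_pred:   (N,) predicted quantile values theta_i(c, x, a)
    theta_target: (N,) Bellman targets T theta'_j
    """
    N = theta_pred.shape[0]
    # Quantile midpoints tau_hat_i = (2i + 1) / (2N)
    tau_hat = (2 * np.arange(N) + 1) / (2 * N)
    # Pairwise TD errors u_{ij} = T theta'_j - theta_i
    u = theta_target[None, :] - theta_pred[:, None]            # (N, N)
    # Huber penalty L_kappa(u)
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # rho^kappa_tau(u) = |tau - 1{u < 0}| * L_kappa(u)
    rho = np.abs(tau_hat[:, None] - (u < 0)) * huber
    # Expectation over targets j, sum over quantiles i
    return rho.mean(axis=1).sum()
```

In practice the same computation is written against PyTorch tensors so the loss can be backpropagated through the quantile network.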

## 3. Experimental Design and Environmental Configuration

#### 3.1. Encryption Algorithm Description

#### 3.1.1. Platform Selection

#### 3.1.2. Parameter Setting

#### 3.1.3. Data Description

1. If a user has many interaction behaviors, their click-through-rate indicator is relatively stable and remains at the same level.
2. The more interaction behaviors a user has, the higher the click-through-rate indicator is under random recommendation.
3. If a user has too few interaction behaviors, the variance is abnormally large; such cases should be treated as invalid interactions and eliminated during actual processing. With insufficient information, the quality of the output training samples is low, which degrades model fitting.
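The cleaning rule in item 3 can be sketched as a simple filter over per-user click logs. The thresholds below (`min_events`, `max_ctr_var`) are illustrative assumptions; the paper does not state the cutoffs it used.

```python
import numpy as np

def filter_users(interactions, min_events=5, max_ctr_var=0.05):
    """Drop users whose interaction count is too small or whose
    click-through-rate variance is abnormally large.

    interactions: dict mapping user_id -> list of 0/1 click outcomes
    Returns the subset of users kept for training.
    """
    kept = {}
    for user, clicks in interactions.items():
        clicks = np.asarray(clicks, dtype=float)
        if clicks.size < min_events:
            continue                 # too few events: invalid interaction
        if clicks.var() > max_ctr_var:
            continue                 # abnormally large variance: discard
        kept[user] = clicks
    return kept
```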

#### 3.2. Experimental Environment Configuration

#### 3.2.1. Hardware Configuration

#### 3.2.2. Software Configuration

#### 3.2.3. Resource Demand

## 4. Experimental Verification and Result Analysis

#### 4.1. Establishment of Recommendation Algorithm Evaluation Indicators

#### 4.2. Validation Experiment of Static Features

1. A counting-based feature engineering method, which simply counted the data to obtain the number of views and interactions for each product; this served as the comparison baseline.
2. A feature engineering method based on SVD decomposition: the latent code length was set to 5, 20, and 50, coded svd-5, svd-20, and svd-50, respectively.
3. A feature engineering method based on simple sublinear coding: the number-of-users levels (M) were 50, 150, 500, and 5000, coded SubLinEmb-50, -150, -500, and -5k, respectively.
4. A systematic design based on the users' static characteristic data: during processing, the information contained in the data was cleaned, converted, and processed, then cached in the memory system for the near-line and online applications. Its code name was Entirety-Plan.
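The first three feature schemes above can be sketched over a user-item view matrix. The counting and SVD variants follow directly from the text; for sublinear coding the paper does not give a formula here, so the `log1p` compression below is an assumed stand-in for "sublinear" growth of counts.

```python
import numpy as np

def count_features(views):
    """Baseline of scheme 1: raw per-item view/interaction counts."""
    return views.sum(axis=0)

def svd_features(views, k=20):
    """Scheme 2: item factors from a rank-k truncated SVD (e.g. svd-20)."""
    U, s, Vt = np.linalg.svd(views, full_matrices=False)
    return Vt[:k].T * s[:k]          # (num_items, k) item embedding

def sublinear_features(views):
    """Scheme 3 (assumed form): log-compress counts so popular items
    grow sublinearly instead of dominating the feature scale."""
    return np.log1p(views.sum(axis=0))
```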

1. Increasing the number of latent variables did not enhance the index; on the contrary, it decreased the index's performance.
2. As a baseline, the popularity recommendation was stronger than SVD, though its overall effect was modest. Random recommendation was the worst, as expected.
3. The feature engineering method using sublinear coding was considerably more effective than the linear SVD-based method, since it accounts for this situation in theory.
4. The scheme combining the linear, sublinear, and sequential patterns as a whole was optimal, which was unsurprising because it incorporated the most comprehensive information.

#### 4.3. Model Comparison Experiment of Interactive Data

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References


Name | Definition
---|---
P | Quantity of items
u | User
t | Time elapsed from the beginning of the session to the current moment
Z_{u,t} | Current transaction type (enumeration: organic or bandit)
V_{u,t} | ID of the item the user is browsing; None if the event is bandit
a_{u,t} | Item recommendation action; None if the event is organic
C_{u,t} | Whether a click event occurs; None if the event is organic
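The session-log schema above maps naturally onto a small record type; a sketch follows, with `Optional` fields set to `None` exactly where the table says the value does not apply. The class name is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SessionEvent:
    """One row of the session log described in the table above."""
    u: int                     # user ID
    t: int                     # step since session start
    z: str                     # 'organic' (browse) or 'bandit' (recommend)
    v: Optional[int] = None    # browsed item ID; None for bandit events
    a: Optional[int] = None    # recommended item; None for organic events
    c: Optional[bool] = None   # click outcome; None for organic events
```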

Name | Symbol Interpretation | Model Definition | Value
---|---|---|---
K | Latent factor dimension | K | 50
P | Item quantity | num_products | 1000
F | Difference between organic and bandit | number_of_flips | 650
µ_t | Mean visit duration | normal_time_mu | 1
σ_t | Variance of visit duration | normal_time_sigma | 1
σ(µ) | Variance of latent user-interest features | sigma_mu_organic | 2
σ(ω) | Initialization of latent user-interest features | sigma_omega_init | 1
— | Noise | sigma_omega | 0.2
— | Whether ω changes with interaction | change_omega_for_bandits | True
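The "Model Definition" column above corresponds to simulator configuration keys in the RecoGym style; assembled as a plain dictionary, the table reads as follows (key names are taken from the table; how they are passed to the environment constructor is not shown here).

```python
# Simulator configuration assembled from the parameter table above.
sim_config = {
    "K": 50,                           # latent factor dimension
    "num_products": 1000,              # item quantity P
    "number_of_flips": 650,            # organic/bandit difference F
    "normal_time_mu": 1,               # mean visit duration
    "normal_time_sigma": 1,            # variance of visit duration
    "sigma_mu_organic": 2,             # variance of latent interest features
    "sigma_omega_init": 1,             # initialization of latent features
    "sigma_omega": 0.2,                # noise
    "change_omega_for_bandits": True,  # omega drifts with interaction
}
```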

Name | Function Description | Related Matters | Value Range
---|---|---|---
t | Serial value or time | None | Serial value: int; time: float
u | User ID | None | int
z | Interaction identifier | Enumeration type | Browse: organic; interact: bandit
v | Browsed item identifier | None in bandit | int
a | Action | Product displayed to the user | int
c | User feedback | Whether the user provides positive feedback | Click: true; no action: false
ps | User click probability | None in organic | 0–1
ps-a | Probabilities of all actions | None in organic | 0–1

Hardware | Model | Quantity
---|---|---
CPU | Intel E5-2650 V3 | 2
Memory | 32 GB | 4
Hard disk | RAID5, 3 TB | 3
Graphics card | NVIDIA Tesla M40 24 GB | 1

Environment | Edition
---|---
Operating system | Ubuntu 16.04 Server
Python | 3.6.15
PyTorch | 1.2.0
RecoGym | 0.1.3.0
Hadoop | 2.7.2
HBase | 1.2.1
Hive | 2.3.X
Spark | 3.1.2
Pig | 0.13.0

Experimental Nature | Number of Test Items | Number of Test Users | Memory Usage | Offline Data Volume
---|---|---|---|---
Functional verification | 1000 | 10,000 | Less than 2 GB | About 1,004,594 records
Indicator verification | 1000 | 50,000 | Less than 2 GB | About 5,118,051 records
Performance testing | 5000 | 100,000 | No more than 4 GB | About 10,104,289 records


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Guo, Z.; Fu, J.; Sun, P.
Reinforcement Learning Recommendation Algorithm Based on Label Value Distribution. *Mathematics* **2023**, *11*, 2895.
https://doi.org/10.3390/math11132895

**AMA Style**

Guo Z, Fu J, Sun P.
Reinforcement Learning Recommendation Algorithm Based on Label Value Distribution. *Mathematics*. 2023; 11(13):2895.
https://doi.org/10.3390/math11132895

**Chicago/Turabian Style**

Guo, Zhida, Jingyuan Fu, and Peng Sun.
2023. "Reinforcement Learning Recommendation Algorithm Based on Label Value Distribution" *Mathematics* 11, no. 13: 2895.
https://doi.org/10.3390/math11132895