# Hyperparameter Black-Box Optimization to Improve the Automatic Classification of Support Tickets


## Abstract


## 1. Introduction

## 2. Feature Extraction from Textual Tickets

**Algorithm 1:** Categorization of Tickets.

1. Extraction of Relevant Terms from the corpus of 15,000 tickets
   - 1.a Tokenization
   - 1.b Stop-word elimination
   - 1.c Lemmatization and part-of-speech recognition
2. Preparation of the Records (= standardized descriptions) of the tickets
   - 2.a Word embedding to associate Relevant Terms with vectors
   - 2.b Selection, within the 31,000 Relevant Terms, of the 424 Most Relevant Terms
   - 2.c Projection of the tickets into the space of the Most Relevant Terms to obtain the Records
3. Labeling of 8076 Records by human experts
4. Training of the classifier with the 8076 labeled Records
   - 4.a Splitting into training and test sets
   - 4.b Optimization of the hyperparameters: for every combination of hyperparameters V’ do the following:
     - 4.b.i Training with hyperparameters V’
     - 4.b.ii Testing and evaluation of the performance with hyperparameters V’
   - 4.c Final training with the optimal hyperparameters
5. Categorization of all the tickets (193,419 Records) using the optimal hyperparameters
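Steps 1 and 2 of Algorithm 1 can be sketched as follows. This is a toy illustration in Python: the stop-word list, the helper names, and the frequency-based selection criterion are hypothetical, and the paper's actual pipeline (with word embeddings and POS-aware lemmatization) is richer.

```python
# Illustrative sketch of the text-mining phase (steps 1-2 of Algorithm 1).
# All names and the tiny stop-word list are hypothetical examples.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "to", "i", "my", "is", "of"}  # toy stop-word list

def tokenize(text):
    """Step 1.a: lowercase and split a ticket into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Step 1.b: drop uninformative function words."""
    return [t for t in tokens if t not in STOP_WORDS]

def extract_relevant_terms(corpus, top_k):
    """Steps 1-2.b: count terms over the corpus and keep the top_k most
    frequent ones (the paper keeps 424 out of 31,000 Relevant Terms,
    using a more sophisticated selection than raw frequency)."""
    counts = Counter()
    for ticket in corpus:
        counts.update(remove_stop_words(tokenize(ticket)))
    return [term for term, _ in counts.most_common(top_k)]

def project(ticket, terms):
    """Step 2.c: represent a ticket as term counts over the selected terms."""
    bag = Counter(remove_stop_words(tokenize(ticket)))
    return [bag[t] for t in terms]

corpus = [
    "I cannot send the questionnaire",
    "The questionnaire field is not editable",
    "Cannot access the site to send my data",
]
terms = extract_relevant_terms(corpus, top_k=4)
records = [project(t, terms) for t in corpus]
```

Each resulting record is a fixed-length numeric vector, which is what the classifiers in Section 3 consume.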

## 3. Optimizing the Hyperparameters of the Classifiers

#### 3.1. Convolutional Neural Networks

- Embedding_dim: width of the kernel matrix for the 2D convolution window. The considered values are (200; 300; 400; 500), both for the optimization approach and for the grid search approach. For the extended grid search, to keep the computational time within 250 h (>10 days) and still explore the other hyperparameters in detail, we are forced to consider only (200; 300), which are, however, the most promising values.
- Filter_sizes: height of the 2D convolution window; in other words, the number of words we want our convolutional filters to cover. We specify 3 different values for the three layers. The values tested with the optimization approach are ([3,4,5]; [5,4,3]; [3,5,7]; [7,5,3]). For the grid search approach, for the computational reasons explained above, we use only ([3,4,5]). For the extended grid search we use ([3,4,5]; [5,4,3]).
- Num_filters: the number of output filters in the convolution. We consider for all the approaches only the value 512, since preliminary experiments showed little sensitivity to this parameter.
- Optimizer: the optimization technique used in the gradient descent when training the network. For all the approaches we consider (adam; adamax; RMSprop; sgd).
- Loss_function: the function used to calculate the error when training the network. For our multi-class problem, we selected only categorical cross-entropy as the loss function in all the approaches, since it was deemed the most suitable.
- Activation_Conv: activation function for the convolutional layers. The activation functions of a neuron determine whether it should be activated (“fired”) or not, based on the inputs received. Many activation functions exist, see also [27]. They must also be computationally efficient because they are calculated across many neurons for each data sample. For the optimization approach, we use (linear; relu; elu; selu; softsign; softplus; sigmoid; hard_sigmoid; exponential; tanh). For the grid search approach, we slightly limit the choice to the most promising and use (relu; softsign; sigmoid; exponential; tanh). For the extended grid search, we use (relu; elu; softsign; sigmoid; exponential; tanh).
- Activation_Dense: activation function for the final dense layer. For the optimization approach, we use (relu; softmax; sigmoid). For the grid search approach, for the computational reasons explained above, we only use (softmax). For the extended grid search, we use (softmax; sigmoid).
- Epochs: an epoch is one complete pass through the training data. Generally, a network is trained for multiple epochs; as a compromise between speed and performance, we select 5 for all the approaches.
- Class_weights: specifies how the errors in the different classes are weighed in the loss function. In Keras the possible options are as follows: (1) user-specified individual weights, in particular weights proportional to the class frequencies; (2) balanced class weights, that is, weights inversely proportional to the class frequencies; (3) uniform class weights, so that an error has the same importance in every class. For both the optimization approach and the grid search approaches we use all possibilities (1; 2; 3).
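The hyperparameters above can be wired into a Kim-style CNN text classifier. The following is a minimal sketch assuming TensorFlow/Keras; `build_cnn` and its defaults are illustrative and not the paper's exact architecture.

```python
# Sketch of a CNN text classifier parameterized by the hyperparameters
# of Section 3.1 (illustrative; assumes TensorFlow/Keras is available).
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_cnn(vocab_size, seq_len, n_classes,
              embedding_dim=300, filter_sizes=(3, 4, 5), num_filters=512,
              activation_conv="relu", activation_dense="softmax",
              optimizer="adam"):
    inp = layers.Input(shape=(seq_len,))
    # Embedding_dim sets the width of the word-vector matrix.
    emb = layers.Embedding(vocab_size, embedding_dim)(inp)
    # One parallel convolution per filter size: each filter covers that
    # many consecutive words (Filter_sizes), with Num_filters channels.
    pooled = []
    for fs in filter_sizes:
        c = layers.Conv1D(num_filters, fs, activation=activation_conv)(emb)
        pooled.append(layers.GlobalMaxPooling1D()(c))
    x = layers.Concatenate()(pooled)
    # Activation_Dense acts on the final dense layer (softmax/sigmoid/relu).
    out = layers.Dense(n_classes, activation=activation_dense)(x)
    model = Model(inp, out)
    # Loss_function is fixed to categorical cross-entropy for multi-class.
    model.compile(optimizer=optimizer, loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Class_weights would then be passed to `model.fit(..., class_weight=...)` rather than to the builder; Epochs is likewise a `fit` argument.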

#### 3.2. Support Vector Machines

- Kernel: the type of kernel used. For both the optimization approach and the grid search approaches we use (Linear; Radial Basis Function (RBF)). In the case of RBF, it is important to define the penalty parameter c of the error term and the kernel coefficient $\gamma $.
- Coefficient $\gamma$: the inverse of the standard deviation of the radial basis kernel; it can be seen as the inverse of the radius of influence of the samples selected by the model as support vectors. For the optimization approach and for the extended grid search we use $\gamma =({2}^{-19},{2}^{-18.7},\dots ,{2}^{20})$, with exponents increasing in steps of 0.3, for a total of 131 values. For the grid search we use $\gamma =({2}^{-19},{2}^{-18},\dots ,{2}^{20})$, with exponents increasing in steps of 1, for a total of 40 values.
- Penalty c: the regularization parameter of the error term. This value allows one to trade off training error against model complexity. For the optimization approach and for the extended grid search we use $c=({2}^{-8},{2}^{-7.7},\dots ,{2}^{23})$, with exponents increasing in steps of 0.3, for a total of 95 values. For the grid search we use $c=({2}^{-8},{2}^{-7},\dots ,{2}^{23})$, with exponents increasing in steps of 1, for a total of 32 values.
- Class_weights: specifies how the errors in the different classes are weighed during the training. The possible options are: (1) weights proportional to the class frequencies; (2) weights inversely proportional to the class frequencies; (3) uniform class weights, so that an error has the same importance in every class. For both the optimization approach and the grid search approaches we use all possibilities (1, 2, and 3).
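A sketch of how the grids above can be generated and plugged into scikit-learn's `SVC`. This only illustrates the coarse grid-search variant; the black-box optimization approach, which samples the fine grid, is not reproduced here.

```python
# Illustrative construction of the SVM hyperparameter grids of Section 3.2,
# assuming scikit-learn. The search itself is only instantiated, not run.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Fine grid (optimization / extended grid search): exponents from -19 to 20
# in steps of 0.3, i.e., 131 values of gamma.
gamma_fine = 2.0 ** np.linspace(-19, 20, 131)
# Coarse grid (plain grid search): integer exponents, i.e., 40 values.
gamma_coarse = 2.0 ** np.arange(-19, 21)

param_grid = {
    "kernel": ["rbf"],                            # the linear kernel ignores gamma
    "gamma": gamma_coarse.tolist(),
    "C": (2.0 ** np.arange(-8, 24)).tolist(),     # 32 values of the penalty c
    # Class weights: None = uniform, "balanced" = inverse class frequencies;
    # frequency-proportional weights would need an explicit dict.
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(SVC(), param_grid, cv=3)    # not fitted in this sketch
```

The geometric spacing of both grids reflects the fact that C and $\gamma$ act multiplicatively, so uniform steps in the exponent explore the range far more evenly than uniform steps in the value.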

## 4. Experimental Results

- General information: tickets that require generic information on the survey, like topic or use of the data, etc.
- Usability: tickets concerning difficulties in accessing the site, problems with the usability of the electronic questionnaire (e.g., non-editable fields, methods of sending the questionnaire, etc.), or requests to reopen a questionnaire already sent in order to correct data entered incorrectly.
- Information on questions: tickets requesting assistance in completing a specific question in the questionnaire (e.g., the question is not applicable to the case of the user, or the case of the user is not present in the questionnaire).
- Interaction with Istat: tickets that highlight a criticality in the communication process that took place between Istat and end users (e.g., requests for information on the reference year of the survey, the obligation to reply, the deadline for completion, confirmation of the sending of the questionnaire, extensions for the completion of the questionnaire).
- Eligibility: tickets that show difficulties in understanding whether the unit has the characteristics to be part of the sample or not and therefore is actually required to participate in the survey.
- Indeterminable/Unclassifiable: tickets with ambiguous content, or tickets highlighting several problems at the same time, which therefore cannot be classified.
- Rest: tickets describing well-defined problems that do not belong to the previous classes but occur with small frequency. Instead of creating several additional small classes, these tickets have all been put into one class: “the rest of the tickets”.

- If we allow a reasonably comparable amount of time for the two approaches, then the optimization approach can explore a much larger search space, hence it can likely find a better hyperparameter configuration that will lead to better predictive performance.
- If, on the other hand, we allow the same size of the search space for the two approaches, then we have two possible subcases:
- either we need to use a search space small enough to allow the termination of the grid search in a reasonable time; in this case, the optimization approach can probably find the same best configuration as the grid search (or a slightly suboptimal one, given the nature of the technique) but using considerably less computational effort,
- or we choose a search space large enough to consider all interesting hyperparameter configurations; in this case, the grid search may simply become computationally infeasible.
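The size gap behind this trade-off can be checked directly: the Cartesian product of the CNN value lists of Section 3.1 gives the configuration counts used in the experiments, 5760 for the full space against 240 for the pruned grid.

```python
# Counting the CNN hyperparameter configurations of Section 3.1 as the
# product of the sizes of the per-hyperparameter value lists.
from math import prod

full_space = {   # optimization approach
    "embedding_dim": 4, "filter_sizes": 4, "num_filters": 1,
    "optimizer": 4, "loss_function": 1, "activation_conv": 10,
    "activation_dense": 3, "epochs": 1, "class_weights": 3,
}
grid_space = {   # pruned grid search
    "embedding_dim": 4, "filter_sizes": 1, "num_filters": 1,
    "optimizer": 4, "loss_function": 1, "activation_conv": 5,
    "activation_dense": 1, "epochs": 1, "class_weights": 3,
}
print(prod(full_space.values()), prod(grid_space.values()))  # 5760 240
```

Since grid search must enumerate every point while the black-box optimizer evaluates only a few hundred, the full space is affordable only for the latter.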

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Aggarwal, C.C. Machine Learning for Text; Springer: Berlin/Heidelberg, Germany, 2018.
2. Zeng, C.; Zhou, W.; Li, T.; Shwartz, L.; Grabarnik, G.Y. Knowledge Guided Hierarchical Multi-Label Classification Over Ticket Data. IEEE Trans. Netw. Serv. Manag. 2017, 14, 246–260.
3. Tellez, E.S.; Moctezuma, D.; Miranda-Jímenez, S.; Graff, M. An automated text categorization framework based on hyperparameter optimization. Knowl.-Based Syst. 2018, 149, 110–123.
4. Han, J.; Akbari, M. Vertical Domain Text Classification: Towards Understanding IT Tickets Using Deep Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence 2018, New Orleans, LA, USA, 2–7 February 2018.
5. Mateo, R.M.A. A Knowledge Extraction Framework for Call Center Analytics. In Proceedings of the 18th Online World Conference on Soft Computing in Industrial Applications (WSC18), Advances in Intelligent Systems and Computing; Ane, B., Cakravastia, A., Diawati, L., Eds.; Springer: Berlin/Heidelberg, Germany, 2019; Volume 864.
6. Zhang, H.; Dong, B.; Feng, B.; Yang, F.; Xu, B. Classification of Financial Tickets Using Weakly Supervised Fine-Grained Networks. IEEE Access 2020, 8, 129469–129477.
7. Revina, A.; Buza, K.; Meister, V.G. IT Ticket Classification: The Simpler, the Better. IEEE Access 2020, 8, 193380–193395.
8. Putong, M.W.; Suharjito, S. Classification Model of Contact Center Customers Emails Using Machine Learning. Adv. Sci. Technol. Eng. Syst. J. 2020, 5, 174–182.
9. Yayah, F.C.; Ghauth, K.I.; Ting, C.-Y. The automated machine learning classification approach on telco trouble ticket dataset. J. Eng. Sci. Technol. 2021, 16, 4263–4282.
10. Tolciu, D.-T.; Săcărea, C.; Matei, C. Analysis of patterns and similarities in service tickets using natural language processing. J. Commun. Softw. Syst. 2021, 17, 29–35.
11. He, X.; Zhao, K.; Chu, X. AutoML: A survey of the state-of-the-art. Knowl.-Based Syst. 2021, 212, 106622.
12. Kotthoff, L.; Thornton, C.; Hoos, H.H.; Hutter, F.; Leyton-Brown, K. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. J. Mach. Learn. Res. 2017, 18, 1–5.
13. Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems (NIPS) 2011; NeurIPS: La Jolla, CA, USA, 2011; ISBN 978-161839599-3.
14. Mantovani, R.G.; Rossi, A.L.D.; Alcobaça, E.; Vanschoren, J.; de Carvalho, A.C.P.L.F. A meta-learning recommender system for hyperparameter tuning: Predicting when tuning improves SVM classifiers. Inf. Sci. 2019, 501, 193–221.
15. Yoo, Y. Hyperparameter optimization of deep neural network using univariate dynamic encoding algorithm for searches. Knowl.-Based Syst. 2019, 178, 74–83.
16. Joy, T.T.; Rana, S.; Gupta, S.; Venkatesh, S. Fast hyperparameter tuning using Bayesian optimization with directional derivatives. Knowl.-Based Syst. 2020, 205, 106247.
17. Du, H.; Han, P.; Xiang, Q.; Huang, S. MonkeyKing: Adaptive Parameter Tuning on Big Data Platforms with Deep Reinforcement Learning. Big Data 2020, 8, 270–290.
18. Li, L.; Jamieson, K.; DeSalvo, G.; Rostamizadeh, A.; Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 2018, 18, 1–52.
19. Mu, T.; Wang, H.; Wang, C.; Liang, Z.; Shao, X. Auto-CASH: A meta-learning embedding approach for autonomous classification algorithm selection. Inf. Sci. 2022, 591, 344–364.
20. Mikolov, T.; Le, Q.V.; Sutskever, I. Exploiting Similarities among Languages for Machine Translation. arXiv 2013, arXiv:1309.4168.
21. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Proc. Adv. Neural Inf. Process. Syst. 2012, 25, 1090–1098.
22. Chang, C.-C.; Lin, C.J. Training v-support vector classifiers: Theory and algorithms. Neural Comput. 2001, 13, 2119–2147.
23. Boukouvala, F.; Misener, R.; Floudas, C.A. Global optimization advances in Mixed-Integer Nonlinear Programming, MINLP, and Constrained Derivative-Free Optimization, CDFO. Eur. J. Oper. Res. 2016, 252, 701–727.
24. Liuzzi, G.; Lucidi, S.; Rinaldi, F. An algorithmic framework based on primitive directions and nonmonotone line searches for black-box optimization problems with integer variables. Math. Program. Comput. 2020, 12, 673–702.
25. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
26. Yoon, K. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014.
27. Nwankpa, C.; Ijomah, W.; Gachagan, A.; Marshall, S. Activation functions: Comparison of trends in practice and research for deep learning. arXiv 2018, arXiv:1811.03378.
28. Vapnik, V. The Nature of Statistical Learning Theory, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 1995.
29. Oswal, B.V. CNN-Text-Classification-Keras, GitHub Repository. 2016. Available online: https://github.com/bhaveshoswal/CNN-text-classification-keras (accessed on 1 December 2022).
30. Bruni, R.; Bianchi, G. Effective Classification using Binarization and Statistical Analysis. IEEE Trans. Knowl. Data Eng. 2015, 27, 2349–2361.
31. Bruni, R.; Bianchi, G.; Dolente, C.; Leporelli, C. Logical Analysis of Data as a Tool for the Analysis of Probabilistic Discrete Choice Behavior. Comput. Oper. Res. 2019, 106, 191–201.
32. Bruni, R.; Bianchi, G. Website categorization: A formal approach and robustness analysis in the case of e-commerce detection. Expert Syst. Appl. 2019, 142, 113001.

**Figure 1.** Overall scheme of the proposed approach, including both the text mining phase and the classification phase with hyperparameter optimization.

|  | Black Box | Grid Search | Ext. Grid Search |
|---|---|---|---|
| Hyperparameter configurations | 5760 | 240 | 576 |
| Evaluated points | 212 | 240 | 576 |
| Enumeration percentage | 2.76% | 100% | 100% |
| Time in sec. | 318,500 | 375,000 | 901,000 (>10 days) |
| Solution accuracy | 89.72% | 86.15% | 89.75% |

|  | Black Box | Grid Search | Ext. Grid Search |
|---|---|---|---|
| Hyperparameter configurations | 49,780 | 1280 | 49,780 |
| Evaluated points | 291 | 1280 | 49,780 |
| Enumeration percentage | 0.58% | 100% | 100% |
| Time in sec. | 3490 | 15,360 | 597,360 (∼7 days) |
| Solution accuracy | 74.53% | 71.87% | 74.53% |

|  | Black Box | Grid Search | Ext. Grid Search |
|---|---|---|---|
| Embedding_dim | 300 | 300 | 300 |
| Filter_sizes | [5,4,3] | [3,4,5] | [5,4,3] |
| Num_filters | 512 | 512 | 512 |
| Optimizer | Adam | Adamax | Adamax |
| Loss_function | Cat. Cross entropy | Cat. Cross entropy | Cat. Cross entropy |
| Activation_Conv | Elu | Tanh | Elu |
| Activation_Dense | Sigmoid | Softmax | Sigmoid |
| Epochs | 5 | 5 | 5 |
| Class_weights | 2 | 1 | 2 |

|  | Black Box | Grid Search | Ext. Grid Search |
|---|---|---|---|
| Kernel | RBF | RBF | RBF |
| $\gamma$ | 42 | 32 | 42 |
| c | 10 | 8 | 10 |
| Class_weights | 2 | 2 | 2 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Bruni, R.; Bianchi, G.; Papa, P.
Hyperparameter Black-Box Optimization to Improve the Automatic Classification of Support Tickets. *Algorithms* **2023**, *16*, 46.
https://doi.org/10.3390/a16010046
