Best Practices of Convolutional Neural Networks for Question Classification

Abstract: Question Classification (QC) is of primary importance in question answering systems, since it enables extraction of the correct answer type. State-of-the-art solutions for short text classification have obtained remarkable results with Convolutional Neural Networks (CNNs). However, implementing such models requires choices that are usually based on subjective experience, or on the rare works comparing different settings for general text classification, whereas specific solutions should be identified for the QC task, depending on language and on dataset size. Therefore, this work aims at suggesting best practices for QC using CNNs. Different datasets were employed: (i) a multilingual set of labelled questions, to evaluate the dependence of optimal settings on language; (ii) a large, widely used dataset, for validation and comparison. Numerous experiments were executed to perform a multivariate analysis, evaluating the statistical significance and the influence on QC performance of all the factors (regarding text representation, architectural characteristics, and learning hyperparameters) and of some of their interactions, and finding the most appropriate strategies for QC. Results show the influence of CNN settings on performance. Optimal settings were found depending on language. Tests on different data validated the optimization performed and confirmed the transferability of the best settings. Comparisons with configurations suggested by previous works show that the configurations optimized here achieve the best classification accuracy. These findings can suggest the best choices to configure a CNN for QC.


Introduction
Nowadays, intelligent systems able to interact with users in natural language are being developed. However, due to the difficulties associated with natural language understanding by computer systems, this is still a field of research of increasing interest [1][2][3].
In particular, question answering systems should be able to automatically answer questions presented in natural language. To accomplish this task, a number of operations are required: possibly translating spoken into written text, processing natural language (tokenization, part-of-speech tagging, dependency parsing), analyzing the question (entity extraction, question classification, query formulation), and consulting the information corpora (information retrieval and answer extraction).

1. Regarding the text representation, the following approaches are compared here: the inclusion or deletion of punctuation, the use of a well-established pre-trained WE model or of random vectors for word representation, the use of null vectors or of random vectors for representing Out-Of-Vocabulary (OOV) words, the embedding dimension, and the possibility of fine-tuning WE vectors during learning or keeping them constant;
2. regarding the CNN architecture, which uses filters to extract features relative to sets of consecutive words, the following characteristics are tuned here: filter region size, number of filters, and the activation function, while only the pooling strategy is fixed; and
3. regarding the learning hyperparameters, for training network weights and possibly the WE vectors, the following are analyzed: batch size, learning method, learning rate, and regularization terms, while the number of epochs is chosen for each run to ensure convergence.
Moreover, the proposed procedure is performed and settings are tuned for two languages, English and Italian, in order to evaluate differences of the contribution of each factor between languages having different morphological richness, and to demonstrate that a system optimized by the proposed approach can be employed successfully in a multilingual context.
The analysis and the subsequent application of the optimized QC model is performed for two datasets of labelled questions made available in both English and Italian languages by a task presented at Text Retrieval Conferences (TREC) 2002 and 2003 (https://trec.nist.gov/, accessed 1 July 2020). In addition, a widely used dataset of English labelled questions (http://cogcomp.cs.illinois.edu/Data/QA/QC/, accessed 1 July 2020) is employed to check transferability.
Finally, the optimal CNN configurations found here are compared with those found in the most relevant previous similar works [19,21].
The paper is structured as follows. The remainder of this section summarizes related works, while Section 2 describes the data, formalizes the general QC approach comprising the CNN, and plans the model optimization. The experimental plan, with results and their discussion, is presented in Section 3. Finally, Section 4 draws the conclusions of the work.

Related Works
QC, and more generally sentence classification, is a crucial task for NLP [1,2,16]. Natural language sentences, in both affirmative and interrogative forms, have complicated structures, both sequential and hierarchical, that must be handled to allow their comprehension. Thanks to their ability to capture local relations of temporal or hierarchical structures, CNNs have emerged as a relatively simple yet powerful class of models for sentence modelling and classification, characterized by remarkably strong performance, with different shallow or deep architectures proposed in recent years.
The first CNN for sentence classification with end-to-end training was proposed in [23,24]. In this seminal work, one convolutional layer is used together with a new global max-pooling operation, which proved to be very effective for text. Moreover, multiple deep models are co-trained on many tasks to transfer task-specific information. Starting from the results of this work, a simpler architecture with slight modifications was presented in [21], achieving state-of-the-art performance even on many small datasets. In particular, one convolutional layer with multichannel representation and variable-size filters is employed, where fine-tuned or pre-trained word embeddings are combined in multiple channels, convolutions allow determining high-level abstract features, and multiple linear filters are used to effectively extract different n-gram features. Both the CNN architectures proposed in [23,24] and in [21] make use of max-pooling to keep the most important information to represent the sentence. Moreover, the pooling operation helps the network deal with variable sentence lengths. In [25], a further variant of multi-layer CNN architecture was proposed, with a dynamic k-max-pooling, where k depends on the length of the sentence and can be dynamically set as a part of the network. This allows detecting the k most relevant features occurring in a sentence, independently of their specific position and preserving their relative order. In [26], a multichannel variable-size CNN architecture for sentence classification was described, further exploring the capabilities of multichannel and variable-size feature detectors. In particular, it combines diverse versions of pre-trained word embeddings and extracts features of multi-granular phrases with variable-size convolution filters.
All the CNNs presented in these works are based on word input tokens, encoded as distributed representations in the form of WE vectors [27]. Moreover, they are rather shallow (two layers in most of them) compared to those successfully proposed for computer vision problems, due to the reduced length, in terms of number of words, of typical sentences and paragraphs.
Later, a first attempt of CNN jointly using character-level, word-level, and sentence-level representations to perform sentence classification is described in [28], with a shallow architecture made of two convolutional layers to extract relevant features from words and sentences of any size. More recently, a deep CNN architecture, with up to 6 convolutional layers, was proposed in [29], able to automatically learn the notions of words and sentences on texts operating directly at a character level, without any pre-processing, not even tokenization. Convolutional kernels of size 3 and 7 were used, as well as simple max-pooling layers. Another interesting aspect of this work is the usage of several large-scale data sets for text classification. In [30], a new deep CNN architecture, with up to 29 convolutional layers, was proposed for text classification, operating directly at character level and using only small convolutions and pooling operations.
More specifically with respect to QC, in [20], a CNN was used to classify Italian questions. In particular, different solutions regarding the CNN architecture were tested and, following suggestions from the literature, the best settings were searched for in the proper ranges, in order to maximize the classification power for the particular case of an Italian questions dataset. In [31], an extended CNN architecture is proposed, able to first classify a question into a broader category and then, based on this prior knowledge, assign it a more specific category. This solution was tested on an English questions dataset with pre-trained word embeddings, showing results on par with or better than other classical methods. In [32], a simple and effective method for QC is presented, which increases generalization, by replacing entities with placeholders, and diversity of sentence features, by reading sentence vectors in both forward and reverse directions. This approach has shown better performance than many other complex CNN models, also proving its effectiveness when applied to question answering systems. Finally, in [33], a QC approach based on word embeddings using subword information and a CNN is outlined, in order to improve classification accuracy. In particular, a comparison between the English and Italian languages is reported, highlighting possible improvements obtained by initializing word embeddings with advanced vectors learned in an unsupervised manner and comprising character-based information.
Summarizing, all the presented approaches based on CNNs for sentence classification, and specifically for QC, are characterized by models, whose structure is designed by hand by experts, thus requiring considerable skill and experience to select suitable hyperparameters such as the learning rate, the size of convolutional filters, the number of layers and so on. Moreover, these hyperparameters have internal dependencies, which make them particularly expensive for tuning and can depend on the specific classification task considered. Even though some recent works have shown that there exists much room to improve current optimization techniques for learning deep CNN architectures [34], fundamental working principles and behaviors of CNN models when specifically applied to QC have not been extensively investigated.
The most relevant works addressing these issues are generally tested on text classification. In [21], different strategies for word representation are compared, by employing in turn, individually or combined in a multi-channel way, differently initialized and possibly fine-tuned WE vectors. On the basis of the model of [21], a sensitivity analysis of CNNs is proposed in [19], summarizing the influences of various hyperparameters, i.e., WE vectors, filter size, number of filters, activation function, pooling strategy, and regularization. Both these works found, for the QC task considered among the others, best settings different from those of the other tasks. However, to the best of our knowledge, related research considered only a few settings, and without reference to possible interactions among them. Moreover, (i) QC was only considered as an instance of text classification; (ii) the possible relation with the language was not taken into account.
Thus, this work constitutes the first attempt of considering hyperparameters in a comprehensive way, examining different possibilities with respect to morphologically different languages, to study the problem of configuring the appropriate CNN architecture for QC.

Data
QC aims at associating each question with a class belonging to a given set. This is done according to a number of examples of labelled questions, used to train and test the model. Different datasets are available, particularly for the English language. The main example is the TREC dataset provided by [35], used in various previous works, including [21], which is particularly large. However, multilingual data, needed to study questions in different languages, are rare and less extensive.
In order to compare English and Italian languages, the chosen data are made of the union of two datasets presented at TREC conferences 2002 and 2003, each comprising 500 training questions and labelled according to the same taxonomy. The same 1000 questions are available in English and Italian, among the other languages. For example, a row of the joint dataset is made of four attributes (coarse class, fine class, question in English, question in Italian), as follows: "FACTOID-LOCATION-What is Africa's largest country?-Qual è il paese più vasto dell'Africa?".
In Table 1, the two-level taxonomy is reported. Since the aim of the approach is the single (not hierarchical) classification task, coarse classes were not considered in this work. On the other hand, all the questions were included, and the union of the fine classes of the three coarse classes is considered, which results in the following 9 labels: "Acronym", "How", "Location", "Material", "Measure", "Person", "Time", "Title", "Other".
Each experiment is performed by 10-fold cross-validation. Therefore, the runs are performed with a number of examples for training Ntrain = 900, and a number of examples for testing Ntest = 100.
Moreover, the dataset provided by Li and Roth (2002), available online, is also used, to compare results with those of other state-of-the-art convolutional architectures. It is already divided into 5452 questions for training and 500 for testing, and is based on a two-level taxonomy, whose coarse level, used here, is made of the following 6 classes: "ABBREVIATION", "ENTITY", "DESCRIPTION", "HUMAN", "LOCATION", "NUMERIC".

Question Classification Model
This section describes the structure of the model employed for classifying questions, and the learning procedure. This model, first developed in [23,24], was implemented with variable settings in the open source Python framework TensorFlow (https://www.tensorflow.org/, accessed 1 July 2020). The testing platform consisted of a folder containing the data, a main program with subroutines for pre-processing, building the model architecture, and producing the results, and three JSON configuration files, where the user can manually change all the settings before each run. The variable settings were defined within sets chosen coherently with findings of previous literature and with preliminary experiments. The general form of the model is schematized in Figure 1. It comprised a pre-processing phase, which allowed translation of the question into a sparse matrix constituting the input layer, the embedding phase, which allowed representation of the question by a matrix with smaller dimension constituting the embedding layer, and a CNN made of convolutional layer, pooling layer, fully connected layer, and output layer, which finally associated each question to a class.
Appl. Sci. 2020, 10, x FOR PEER REVIEW
The following subsections describe in detail the pre-processing and WE phases (Section 2.2.1), the CNN architecture (Section 2.2.2), and the procedure for learning network weights (Section 2.2.3). Finally, since this approach is implemented here with freely adjustable settings, Section 2.2.4 summarizes the degrees of freedom considered with respective possible values, and explains the optimization approach.

Question Pre-Processing and Word Embedding
Each question had to be pre-processed, to be divided into a sequence of tokens, and represented as a sparse matrix.
Firstly, special characters, not comprised in the set {A-Za-z0-9(),;.:!?'`"} were substituted with spaces. Then, apostrophes and some substrings comprising them were substituted, depending on the language, as reported in Table 2.
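The first cleaning step can be sketched as a single regular-expression substitution. This is a minimal illustration of the character rule stated above only: the function name is hypothetical, and the language-dependent apostrophe substitutions of Table 2 are not reproduced.

```python
import re

# Every character outside the allowed set {A-Za-z0-9(),;.:!?'`"} is matched.
_NOT_ALLOWED = re.compile(r"[^A-Za-z0-9(),;.:!?'`\"]")

def replace_special_characters(question: str) -> str:
    """Substitute each character outside the allowed set with a space."""
    return _NOT_ALLOWED.sub(" ", question)

cleaned = replace_special_characters("Who wrote #Hamlet#?")
print(cleaned)
```

Existing spaces are also matched by the pattern, but replacing a space with a space leaves the text unchanged.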

Each question was made of L tokens, and the maximum length Lmax was calculated over the whole dataset. Moreover, a vocabulary was assembled by gathering all V different tokens plus an entry <UNK> in the first position corresponding to unknown token. Original tokens are used, instead of lemmatizing them, to be coherent with pre-trained WE.
Once the vocabulary was fixed, each token was represented as a vector with V elements, which were all equal to 0, except the element corresponding to the position of the token in the vocabulary, equal to 1. Therefore, each question was represented as a matrix X with V columns and $L_{max}$ rows, composed of vectors $x_j$, with $j = 1, \ldots, L_{max}$, where, if $L < L_{max}$, the last rows were filled with zeros. This matrix was the input layer of the deep neural network.
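The vocabulary and the one-hot input matrix described above can be sketched as follows. The function names and the toy questions are hypothetical, tokenization is assumed to have been done already, and the vocabulary size here counts the `<UNK>` entry, which is an assumption about how V is defined.

```python
def build_vocabulary(tokenized_questions):
    """Collect distinct tokens, reserving position 0 for the unknown token."""
    vocab = {"<UNK>": 0}
    for tokens in tokenized_questions:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def one_hot_matrix(tokens, vocab, l_max):
    """Return an l_max x len(vocab) matrix; rows past the question length stay all-zero."""
    X = [[0] * len(vocab) for _ in range(l_max)]
    for j, tok in enumerate(tokens[:l_max]):
        X[j][vocab.get(tok, vocab["<UNK>"])] = 1
    return X

questions = [["what", "is", "africa", "?"], ["who", "is", "he", "?"]]
vocab = build_vocabulary(questions)
l_max = max(len(q) for q in questions)
X = one_hot_matrix(["who", "is", "africa"], vocab, l_max)
```

The three-token question fills the first three rows, and the fourth row remains null as padding.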
The next embedding phase consisted in the linear transformation of X into a matrix with smaller dimension. Each one-hot V-dimensional vector $x_j$ was transformed into a $D_e$-dimensional vector corresponding to the representation of the word suggested by the pre-trained WE model or to a random or null vector. In practice, X was multiplied by the embedding matrix $W_{emb}$ with $D_e$ columns and V rows, to obtain a matrix $X_{emb}$ made of $D_e$ columns and $L_{max}$ rows:

$$X_{emb} = X \, W_{emb}$$

The embedding matrix was initialized depending on the choice of the WEinit factor. If pre-trained WE vectors were used (WEinit = pre-trained), then the row of $W_{emb}$ corresponding to each known word was initialized as the pre-trained WE vector, while the other rows, corresponding to OOV words, were initialized with null vectors (OOVinit = null). The pre-trained WE representation chosen for this work was based on the fastText model, with 300 dimensions ($D_e$ = 300), trained on the Wikipedia corpora (https://fb-public.app.box.com/s/htfdbrvycvroebv9ecaezaztocbcnsdn, accessed 1 July 2020), both in English and Italian languages. This model was chosen for its outstanding characteristics. In fact, it is an evolution of the skip-gram model, which trains the representation of each word by unsupervised learning to predict words that appear in its context, but fastText also measures similarity between words based on the character n-grams included in them. Therefore, these vectors encode information regarding the syntactic structure of the text and semantic features, like the skip-gram model, as well as information regarding the morphology of the words.
On the other hand, if pre-trained vectors were not used (WEinit = random), then all the rows of $W_{emb}$ were initialized with random vectors, for both known and unknown words (OOVinit = random). In this case, the number of dimensions of the representation was a further degree of freedom, studied in the following interval: $D_e \in [10, 500]$.
Since the values assumed by OOVinit are coupled with those assumed by WEinit, OOVinit is omitted in the following. In both cases, the embedding matrix $W_{emb}$ could be kept constant or fine-tuned during the network training; this choice was a further degree of freedom (WEtuning), with WEtuning = Dynamic denoting fine-tuning.
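The two initialization strategies and the embedding product can be illustrated with a short sketch. The `pretrained` dictionary is a hypothetical stand-in for a pre-trained WE model, and the uniform range used for random vectors is an assumption, since the distribution is not specified in the text.

```python
import random

def init_embedding_matrix(vocab, d_e, pretrained=None, seed=0):
    """Build the embedding matrix with len(vocab) rows and d_e columns.

    pretrained: optional dict token -> list of d_e floats (hypothetical stand-in
    for a pre-trained model); tokens missing from it get null vectors.
    With pretrained=None every row is random (assumed uniform in [-1, 1]).
    """
    rng = random.Random(seed)
    W = [None] * len(vocab)
    for tok, idx in vocab.items():
        if pretrained is None:
            W[idx] = [rng.uniform(-1.0, 1.0) for _ in range(d_e)]
        else:
            W[idx] = list(pretrained.get(tok, [0.0] * d_e))
    return W

def embed(X, W_emb):
    """Multiply the one-hot matrix X by W_emb; each row selects one embedding vector."""
    d_e = len(W_emb[0])
    return [[sum(x_v * W_emb[v][d] for v, x_v in enumerate(row)) for d in range(d_e)]
            for row in X]

vocab = {"<UNK>": 0, "who": 1, "is": 2}
pretrained = {"who": [0.1, 0.2], "is": [0.3, 0.4]}
W_emb = init_embedding_matrix(vocab, 2, pretrained)
X = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # "who is" plus one padding row
X_emb = embed(X, W_emb)
```

Because every row of X is one-hot or null, the product reduces to a row lookup, which is how embedding layers are usually implemented in practice.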

Convolutional Neural Network Architecture
A classical CNN architecture was used here for associating questions with labels. However, here the architecture was not fixed, but was implemented with freely adjustable settings.
A convolution was first applied to $X_{emb}$, by using a single channel, with no padding and stride 1, as recommended in the text classification context. Filters of different sizes may be employed; therefore, if there were $N_s$ different sizes and, for each size, a number $N_f$ of filters, the total number $N_{tot}$ of filters was:

$$N_{tot} = N_s \cdot N_f$$

The sizes and the total number of filters were degrees of freedom; the total number of filters was considered in the following range: $N_{tot} \in [50, 500]$.
Each filter of a certain size consists of a matrix $W^{i}_{conv}$, with $i = 1, \ldots, N_{tot}$, made of $D_e$ columns and $F_s$ rows. The result of the convolution was a vector $x^{i}_{conv}$ with dimension $L_{max} - F_s + 1$, whose components $x^{i,j}_{conv}$, with $j = 1, \ldots, L_{max} - F_s + 1$, can be written as:

$$x^{i,j}_{conv} = \sum_{r=1}^{F_s} \sum_{d=1}^{D_e} W^{i}_{conv}[r,d] \; X_{emb}[j+r-1,d]$$

Then a bias term $b^{i}_{conv}$ was added to each component, and an activation function $f$ was applied, to get each component $x^{i,j}_{act}$, with $j = 1, \ldots, L_{max} - F_s + 1$, of the vector $x^{i}_{act}$, which was the final result of the convolution by the given filter $W^{i}_{conv}$:

$$x^{i,j}_{act} = f\left(x^{i,j}_{conv} + b^{i}_{conv}\right)$$

Of course, vectors $x^{i}_{act}$ with the same size were obtained by using filters with the same $F_s$, while vectors of different sizes were obtained by differently sized filters. In any case, $N_{tot}$ vectors were obtained, and they constitute the convolutional layer.
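The convolution by a single filter, followed by bias and activation, can be sketched in plain Python as below. The function names and the numbers are illustrative, and the rectifier is used only as an example of activation function, not as the choice made in this work.

```python
def convolve(X_emb, W_conv, b_conv, activation):
    """Slide one F_s x D_e filter over X_emb (L_max x D_e) with stride 1 and no padding."""
    f_s, l_max = len(W_conv), len(X_emb)
    out = []
    for j in range(l_max - f_s + 1):
        # Sum of element-wise products between the filter and the window of f_s rows.
        s = sum(W_conv[r][d] * X_emb[j + r][d]
                for r in range(f_s) for d in range(len(W_conv[r])))
        out.append(activation(s + b_conv))
    return out

def relu(z):
    """Illustrative activation function."""
    return max(0.0, z)

X_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # L_max = 4, D_e = 2
W_conv = [[1.0, -1.0], [0.5, 0.5]]                          # filter of size F_s = 2
x_act = convolve(X_emb, W_conv, b_conv=0.1, activation=relu)
```

The output has L_max - F_s + 1 = 3 components, one per position of the filter along the question.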
The activation function to use for convolution was a degree of freedom of the proposed implementation. The functions considered are illustrated in Figure 2.
The following operation was the pooling, which was implemented coherently with the common choice, i.e., the 1-max pooling strategy. In fact, using the max function was required by the zero padding of the input representation of questions shorter than $L_{max}$, and the choice of only one maximum element was certainly enough, due to the large number of filters employed. Therefore, the pooling layer was constituted by a row vector $p$ with dimension $N_{tot}$, whose elements $p_i$, with $i = 1, \ldots, N_{tot}$, were:

$$p_i = \max_{j} \, x^{i,j}_{act}$$

The following fully connected layer was constituted by $K$ neurons, where $K$ was also the number of classes. In the considered case, $K = 9$. The vector of class activations $y$ was computed by multiplying $p$ by a matrix of weights $W_{fc}$ with $N_{tot}$ rows and $K$ columns, and adding a bias vector $b_{fc}$:

$$y = p \, W_{fc} + b_{fc}$$

The final output layer was made of only one node, which contains the position of the class with the highest activation:

$$output = \arg\max_{k} \, y_k \tag{14}$$
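The pooling, fully connected, and output layers can be sketched as three small functions. The names and the toy weights are hypothetical; K = 3 is used here only to keep the example short.

```python
def one_max_pool(activations):
    """Keep the largest component of each filter's output vector."""
    return [max(x_act) for x_act in activations]

def fully_connected(p, W_fc, b_fc):
    """Class activations: p times W_fc (N_tot x K) plus the bias vector b_fc."""
    return [sum(p[i] * W_fc[i][c] for i in range(len(p))) + b_fc[c]
            for c in range(len(b_fc))]

def predict(y):
    """Position of the class with the highest activation."""
    return max(range(len(y)), key=lambda c: y[c])

# Two filters of different sizes give output vectors of different lengths.
activations = [[1.6, 0.1, 0.1], [0.0, 2.0]]
p = one_max_pool(activations)
W_fc = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
b_fc = [0.0, 0.0, 0.1]
y = fully_connected(p, W_fc, b_fc)
output = predict(y)
```

Pooling makes the size of `p` depend only on the number of filters, so filters of different sizes can feed the same fully connected layer.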

Learning Procedure
The described model includes many parameters that were initialized randomly and have to be trained, i.e., $W^{i}_{conv}$ with $i = 1, \ldots, N_{tot}$, $b^{i}_{conv}$ with $i = 1, \ldots, N_{tot}$, $W_{fc}$, and $b_{fc}$, for a total of $D_e \cdot F_s \cdot N_{tot} + N_{tot} \cdot K + K$. Moreover, $W_{emb}$ can be initialized by pre-trained WE or randomly, but in both cases it was fine-tuned if WEtuning = Dynamic, bringing other $V \cdot D_e$ parameters.
These parameters were adapted to the data by a learning procedure summarized as follows. Firstly, the training dataset was divided into batches composed of a certain number of examples. In this work, the batch size was a degree of freedom, studied in its whole range.
All examples of a batch were used as input of the model, but during the training, in order to learn separately different parts of the network, the pooling layer was modified by the dropout function, which randomly transforms each component $p_i$, multiplying it by zero with probability $(1 - P_{keep})$, and by $1/P_{keep}$ with probability $P_{keep}$, so that the expected sum remains unchanged. Here, the dropout was a degree of freedom.
For each input $b$, the loss was calculated by the cross entropy function (19), where $p_k$ (17) was the softmax transform of the $k$th component of the computed vector $y$, and $c_k$ (18) was 1 for the position of the true label $k_{True}$, 0 otherwise:

$$p_k = \frac{e^{y_k}}{\sum_{h=1}^{K} e^{y_h}} \tag{17}$$

$$c_k = \begin{cases} 1 & \text{if } k = k_{True} \\ 0 & \text{otherwise} \end{cases} \tag{18}$$

$$loss_b = -\sum_{k=1}^{K} c_k \log p_k \tag{19}$$

After a batch, the whole associated loss was calculated, including a regularization term weighted by the parameter l2, which was used to prevent big values of the fully connected layer weights. The parameter l2 was a degree of freedom here.
The loss gradient was used for updating network weights by a backpropagation approach based on the Stochastic Gradient Descent (SGD) algorithm, which implies a stochastic approximation [36] of the basic gradient descent algorithm. Since it reduces the computational complexity, achieving faster iterations in exchange for a lower convergence rate [37], it is recognized as a very effective learning algorithm in machine learning [38].
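The dropout transformation and the per-example cross entropy can be sketched as follows. The function names are hypothetical, the random generator is passed explicitly for reproducibility, and the batch-level loss with its regularization term is deliberately left out.

```python
import math
import random

def dropout(p, p_keep, rng):
    """Zero each component with probability 1 - p_keep, otherwise scale it by 1 / p_keep."""
    return [p_i / p_keep if rng.random() < p_keep else 0.0 for p_i in p]

def softmax(y):
    """Softmax transform of the class activations."""
    m = max(y)  # subtracting the maximum avoids overflow without changing the result
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, k_true):
    """Loss of one example: minus the log of the softmax probability of the true class."""
    return -math.log(softmax(y)[k_true])

rng = random.Random(0)
dropped = dropout([1.0, 2.0, 3.0, 4.0], p_keep=0.5, rng=rng)
loss = cross_entropy([0.0] * 9, k_true=3)  # uniform activations over K = 9 classes
```

With uniform activations every class has probability 1/K, so the loss equals log K, which is a convenient sanity check at the start of training.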
A variant of the updating rule was freely chosen through the optimizer setting. In case optimizer = Momentum, the momentum parameter was fixed to 0.1, according to previous findings. The learning rate, which was a further parameter common to all the algorithms, was the last considered degree of freedom, studied in the following wide range, enlarged with respect to previous works [19,21]: η ∈ [0.01, 10].
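As an illustration of how the learning rate and the momentum parameter enter the weight update, the sketch below shows plain SGD and one common formulation of SGD with momentum. The function names are hypothetical, and the exact update rules used in this work may differ from this formulation.

```python
def sgd_step(weights, grads, eta):
    """Plain SGD: move each weight against its gradient, scaled by the learning rate."""
    return [w - eta * g for w, g in zip(weights, grads)]

def momentum_step(weights, grads, velocity, eta, mu=0.1):
    """SGD with momentum (one common formulation; mu is the momentum parameter)."""
    velocity = [mu * v + g for v, g in zip(velocity, grads)]
    weights = [w - eta * v for w, v in zip(weights, velocity)]
    return weights, velocity

w = sgd_step([1.0], [0.5], eta=0.1)
w_m, v_m = momentum_step([1.0], [0.5], [0.0], eta=0.1)
w_m2, v_m2 = momentum_step(w_m, [0.5], v_m, eta=0.1)
```

With zero initial velocity the first momentum step coincides with plain SGD; the accumulated velocity only affects later steps.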
The intent in this work was to get the best possible model, therefore the number of epochs was not taken as an adjustable setting.

Threats to Validity
Different threats could affect the learning procedure. First, underfitting could affect results if the number of epochs chosen for learning was too low. In order to avoid it, a sufficient number of epochs was chosen for the different runs. For most of the runs, 2000 epochs were enough to reach convergence, while in some cases (properly compared with the others) 20000 epochs were needed.
Second, a large number of epochs could cause overfitting on training data. In order to avoid it, every 2 epochs, the model was tested on a randomly sampled dev set, and at the end of the epochs, the model presenting the best accuracy on the dev set was chosen.
Finally, the choice of training and testing questions within the dataset could (positively or negatively) influence and distort the results. In order to avoid it, each experiment was performed by stratified 10-fold cross-validation. Therefore, the dataset was randomly divided into 10 subsets of 100 questions with approximately the same rate of labels. Each run was performed with the union of 9 question subsets for training, and the remaining subset for testing. This was repeated 10 times, so that all the examples were used for testing, and the results of the 10 runs were averaged to obtain the result of the experiment.
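A stratified split of this kind can be sketched with the standard library alone. The function names are hypothetical and the label counts in the usage example are invented for illustration; they are not the actual class distribution of the dataset.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Split example indices into n_folds subsets with similar label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for idx in indices:
            folds[position % n_folds].append(idx)
            position += 1
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once for testing."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Invented label counts, for illustration only.
labels = ["Person"] * 400 + ["Location"] * 350 + ["Time"] * 250
folds = stratified_folds(labels)
splits = list(train_test_splits(folds))
```

Each of the 10 splits uses 900 indices for training and 100 for testing, and every example appears in exactly one test fold.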

Model Optimization
In order to optimize the QC model, its classification accuracy was studied by analyzing different experiments, corresponding to respective configurations of settings (factors).
Since each experiment was the set of 10 training and testing phases constituting a 10-fold cross-validation, the accuracy of an experiment was the average of the accuracies gained by the 10 trained models on the respective test sets. The accuracy on a test set was calculated as a percentage, by averaging $c^{b}_{output}$, which was 1 if the network output (14) was equal to the position of the true label associated with the $b$th input of the test set ($output = k_{True}$), 0 otherwise:

$$accuracy = \frac{100}{N_{test}} \sum_{b=1}^{N_{test}} c^{b}_{output}$$

The factors here considered to analyze their influence on the model accuracy are summarized in Table 3. For categorical factors, all the possible values were considered, while for quantitative ones the considered admitted ranges were based on previous literature findings.
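The two levels of averaging described above can be written compactly. This is a minimal sketch with hypothetical function names and toy values.

```python
def accuracy(outputs, true_labels):
    """Percentage of test examples whose predicted class equals the true one."""
    correct = sum(1 for out, k_true in zip(outputs, true_labels) if out == k_true)
    return 100.0 * correct / len(true_labels)

def experiment_accuracy(fold_accuracies):
    """Average of the accuracies obtained on the test folds of one experiment."""
    return sum(fold_accuracies) / len(fold_accuracies)

acc = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
exp_acc = experiment_accuracy([80.0, 90.0])
```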
Since considering all the possible interactions among factors would involve an unfeasible experimentation, some factors were analyzed in the following one by one, since they were hypothesized to have negligible interactions with the others, while some sets of factors were studied together to verify potential interactions.
For each factor or set of factors, their individual influences and interactions (effects) were evaluated in a chosen range by performing a set of experiments. Most sets of experiments were planned according to full factorial designs, which comprise all the combinations of factor levels. This approach needs more experiments, but minimizes the risk of confounding different effects. The range of each factor, and the fixed values of the other settings (those relative to factors not being evaluated in a set of experiments, since they were hypothesized not to interact), were chosen according to findings of previous works [19-21,33], or to preliminary experiments. Due to the random initialization of weights and to some other sources of randomness in the learning procedure (splitting training data in batches, dropout function, and SGD algorithm), each run, and thus each whole experiment, gave different results if repeated. Therefore, some repetitions were performed, to estimate the experimental variance $\sigma^2$, which was used to evaluate the reproducibility of the experiments.
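A full factorial design simply enumerates the Cartesian product of the factor levels. In the sketch below, the factor names and levels are hypothetical and chosen only for illustration (in particular, "Static" is an assumed name for the non-fine-tuned setting).

```python
from itertools import product

def full_factorial(levels):
    """Enumerate every combination of factor levels as a list of dicts."""
    names = list(levels)
    return [dict(zip(names, combo))
            for combo in product(*(levels[n] for n in names))]

# Hypothetical factors and levels, for illustration only.
design = full_factorial({
    "WEtuning": ["Static", "Dynamic"],
    "batch_size": [10, 50, 100],
})
```

Two levels times three levels give six experiments, each of which would then be run as a complete cross-validation.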
The intrinsic variance in the measurement of the experiment performance implies that a deterministic functional dependence between factors and model accuracy does not exist. Therefore, in order to analyze the effects on the QC accuracy, an approximate function was extrapolated from each set of experimental results:

$$accuracy \approx c_0 + c_1 x_1 + c_2 x_2 + \ldots + c_{12} x_1 x_2 + \ldots \tag{24}$$

where $x_1, x_2, \ldots$ represent the individual factors, $x_1 x_2, \ldots$ represent their interactions, and the coefficients $c_0, c_1, c_2, c_{12}, \ldots$ were used to linearly combine these (also nonlinear) effects to predict the experimental accuracy. The significance of effects was evaluated in terms of the respective coefficients [39]. Indeed, each estimated coefficient belongs to a respective confidence interval, corresponding to the interval comprising the true coefficient value with 95% probability, which was calculated as follows. Given the estimated experimental variance $\sigma^2$, calculated with a certain number of degrees of freedom, the variance of each coefficient can be estimated as $\sigma^2 / N$, where $N$ is the number of experiments of the full factorial design. Therefore, the half-width of the coefficient confidence interval can be calculated as $\sigma / \sqrt{N} \cdot t_{0.975}$, where $t_{0.975}$ is the value of a Student's t distribution with the same degrees of freedom corresponding to 0.975 cumulative probability (two-tailed t-test). As a consequence, if the absolute value of the estimated coefficient was lower than the half-width of the confidence interval, then the confidence interval comprises the null value, the hypothesis that the true coefficient value was zero cannot be rejected, and the corresponding effect was not significant.
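The significance test on the coefficients can be illustrated on the simplest case, a two-factor design with two levels per factor. The sketch assumes that factor levels are coded as -1 and +1, in which case the least-squares coefficients reduce to simple averages; the accuracies, the standard deviation, and the number of degrees of freedom are invented, and the t quantile is supplied by the caller (2.776 is the 0.975 quantile for 4 degrees of freedom).

```python
import math

def fit_two_level_factorial(responses):
    """Estimate c0, c1, c2, c12 from a 2 x 2 full factorial design.

    responses maps coded levels (x1, x2), each -1 or +1, to the measured accuracy.
    """
    n = len(responses)
    terms = {
        "c0": lambda x1, x2: 1,
        "c1": lambda x1, x2: x1,
        "c2": lambda x1, x2: x2,
        "c12": lambda x1, x2: x1 * x2,
    }
    return {name: sum(term(*x) * y for x, y in responses.items()) / n
            for name, term in terms.items()}

def half_width(sigma, n_experiments, t_quantile):
    """Half-width of the 95% confidence interval of a coefficient."""
    return sigma / math.sqrt(n_experiments) * t_quantile

def significant(coefficients, hw):
    """An effect is significant when the absolute coefficient exceeds the half-width."""
    return {name: abs(c) > hw for name, c in coefficients.items()}

# Invented accuracies, for illustration only.
responses = {(-1, -1): 78.0, (1, -1): 82.0, (-1, 1): 77.0, (1, 1): 83.0}
coeffs = fit_two_level_factorial(responses)
hw = half_width(sigma=1.0, n_experiments=4, t_quantile=2.776)
flags = significant(coeffs, hw)
```

In this toy case only the first factor has a coefficient larger than the half-width, so only its effect would be retained in the estimated function.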
Moreover, the estimated function (24), comprising the significant effects, can be used to predict the accuracy on the basis of the considered factors and their possible interactions. This allows finding optimal values of factors, corresponding to higher calculated accuracy.
After all the factors were individually optimized, some repetitions with the optimal settings were performed to evaluate the performance of the QC model in optimal conditions. Moreover, the optimal conditions were validated on a larger set of data.

Results and Discussion
In this section, the results obtained by the QC model are reported and discussed. As described before, the accuracy reported for each configuration of settings is the average of 10 cross-validation runs.
The results were obtained for each configuration by considering questions in both English and Italian languages.
First (Section 3.1), repeated experiments using the same configuration are described and discussed. Then, the influence of settings regarding text representation (Section 3.2), network architecture (Section 3.3), and learning procedure (Section 3.4) is evaluated. In addition, Section 3.5 takes into account all the previous findings to identify the most influential parameters. Finally, Section 3.6 presents the optimal settings obtained for different cases and evaluates the performance of the associated proposed models, also showing a comparison, on a widely used dataset, with baseline models found as optimal in previous literature.
In total, 2404 runs for training and testing the described QC model were performed.

Repetitions
As explained before, each experiment reported here consisted of 10 runs, since cross-validation was performed. Therefore, for each experiment, an "internal" variance of the testing accuracy was calculated. Averaging over all the experiments, the "internal" standard deviations found were about 4.0% for English and 3.5% for Italian. These quite low values are due to the robustness of the random stratified splitting of the dataset into folds.
On the other hand, some whole experiments were performed 5 times to evaluate their reproducibility. Under the hypothesis that the system is homoscedastic, the accuracy variance could be estimated from a single configuration. Here, this hypothesis is relaxed, due to the structural differences between runs performed by using fixed pre-trained WE vectors or random vectors, and between runs performed with fixed WE vectors or by fine-tuning them. Therefore, the accuracy variance was estimated for each combination of these settings. For each configuration, 5 repetitions of the same experiment were performed. In Table 4, the experimental variance $\sigma^2$ calculated over repetitions is reported. Since the estimated variance is itself a random variable, the results of Table 4 allow homoscedasticity to be hypothesized, also with respect to the language; therefore, the experimental variance was estimated by averaging, over the different configurations, the mean pure squared errors of the repetitions. It corresponds to a low standard deviation σ = 0.6% (calculated with 32 degrees of freedom, therefore $t_{0.975}$ = 2.038) and, compared to the variance of different experiments, to a good reproducibility (= 0.90).
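The pooling of variances over configurations can be illustrated as follows. This is a sketch under stated assumptions: the accuracies are hypothetical placeholders (the values of Table 4 are not reproduced here), and the function name is introduced only for illustration. It assumes 8 configurations (4 per language) with 5 repetitions each, which gives the 32 degrees of freedom mentioned in the text.

```python
from statistics import variance


def pooled_variance(groups):
    """Average the per-configuration sample variances, weighted by degrees of freedom.

    Returns (pooled variance, total degrees of freedom).
    """
    dof = sum(len(g) - 1 for g in groups)
    weighted = sum((len(g) - 1) * variance(g) for g in groups)
    return weighted / dof, dof


# HYPOTHETICAL accuracies (%): 8 configurations x 5 repetitions, not data from Table 4.
groups = [[80.0 + 0.3 * k + 0.5 * r for r in range(5)] for k in range(8)]
s2, dof = pooled_variance(groups)
print(s2, dof)
```

With equal numbers of repetitions per configuration, the weighted average reduces to the plain average of the per-configuration variances.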

Text Representation
The first setting analyzed here regards the text representation, and in particular, the possibility of eliminating all punctuation symbols from the question during pre-processing. This qualitative factor can assume 2 levels, and it was hypothesized not to interact with the others. Therefore, only the 2 experiments for each language reported in Table 5 were performed. Even if for both languages the case AvoidPunctuation = True gives slightly higher accuracy, the differences with respect to the case AvoidPunctuation = False were not statistically significant, since they are comparable to (and even smaller than) the standard deviation of repeated experiments. In other words, if a model describing the accuracy as a function of this variable is constructed, the linear coefficient is 0.1% for both languages, which is smaller than the semi-width of its confidence interval (0.9%); therefore, the possibility that the true value of the coefficient is zero cannot be discarded.
This finding suggests that, in order to simplify the QC model, punctuation can be eliminated without significant loss of information.
Two other factors regarding the text representation were analyzed together, to evaluate at the same time their effects and possible interactions: the choice between initializing the WE vectors of known words by fastText pre-trained vectors (and OOV words by null vectors) or initializing all WE vectors randomly, and the choice of fine-tuning these vectors during training or not. Each of these qualitative factors can assume 2 levels; therefore, 4 configurations for each language were tested. For each configuration, 5 repeated experiments were performed, which were used to estimate the variances reported in Table 4, and whose mean accuracies are reported in Table 6. In this case, the effects of the evaluated factors WEinit and WEtuning, and of their interaction, were significant. In particular, random or pre-trained WE vectors give equivalent results if they are static, and fine-tuning of random vectors does not improve accuracy, but the combination of WEinit = pre-trained and WEtuning = dynamic gives a contribution to the mean accuracy of about 2.6% for English and 4.3% for Italian. These contributions are greater than the confidence interval semi-width (about 0.6%).
This behavior can be explained by observing that the generally valid information embodied by pre-trained WE vectors is not necessarily the same as that required by the specific classification task; therefore, if kept static, they can be equivalent to random ones. However, they embody semantic information that allows them, if properly fine-tuned, to get closer to the optimum than random ones.
These findings suggest the following considerations:
• it is convenient to employ fastText pre-trained vectors to initialize WE vectors, since they embody semantic and morphological information in the word representation;
• it is convenient to fine-tune WE vectors, since optimizing the representation of the single words most influential for QC allows stressing their importance;
• a significant improvement in QC accuracy is gained if these two settings are used at the same time, since the WE vectors of words semantically associated with question classes, which already embody semantic information, can be coherently optimized; and
• all these effects are more relevant for the Italian language than for English, since the improvements regarding word representation are more useful for a morphologically rich language.
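The way main effects and the interaction are separated in a 2 x 2 full factorial design can be sketched as below. The accuracies are hypothetical values chosen only to mimic the pattern described above (a gain only when pre-trained vectors are also fine-tuned); they are not the values of Table 6. The coded levels -1/+1 and the function name are assumptions made for this illustration.

```python
def factorial_2x2_coefficients(y):
    """Fit y = c0 + c1*x1 + c2*x2 + c12*x1*x2 on a 2 x 2 full factorial design.

    `y` maps coded levels (x1, x2), each -1 or +1, to the mean accuracy.
    With orthogonal coded levels, each coefficient is a signed average.
    """
    n = len(y)
    c0 = sum(y.values()) / n
    c1 = sum(x1 * v for (x1, _), v in y.items()) / n
    c2 = sum(x2 * v for (_, x2), v in y.items()) / n
    c12 = sum(x1 * x2 * v for (x1, x2), v in y.items()) / n
    return c0, c1, c2, c12


# HYPOTHETICAL accuracies (%), chosen only to mimic the described pattern;
# x1: WEinit (-1 random, +1 pre-trained), x2: WEtuning (-1 static, +1 dynamic).
y = {(-1, -1): 80.0, (+1, -1): 80.0, (-1, +1): 80.0, (+1, +1): 84.0}
c0, c1, c2, c12 = factorial_2x2_coefficients(y)
print(c0, c1, c2, c12)
```

In this coded form, a gain that appears only in one corner of the design shows up as equal main-effect and interaction coefficients, each of which would then be compared with the confidence interval semi-width.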
The last factor taken into account for text representation was the embedding dimension De, hypothesized to have no interactions with the others. This quantitative factor was analyzed in the range [10,500], and in particular, at the representative levels {10,100,300,500}, to analyze also its nonlinear effects. Therefore, four configurations for each language were tested, and results are reported in Table 7 and shown in Figure 3. From the results reported in Table 7, a function was fitted for each language to predict accuracy as a quadratic function of De in logarithmic scale; all coefficients are much greater than their confidence interval semi-widths, therefore De has significant effects. The fitted function, in accordance with the results shown in Figure 3, indicates that increasing the embedding dimension gives a great improvement to QC, since more semantic, syntactic, and morphological aspects of words are represented. However, beyond a few hundred dimensions, a plateau is reached, and adding further dimensions does not give a significant improvement. Therefore, also in accordance with most previous works and with the majority of the available pre-trained WE vectors, the value De = 300 was chosen as optimal here.

CNN Architecture
The CNN architecture was analyzed first in terms of both the filter size (7) and the total number of filters (8), and then with regard to the activation function involved in (10), while their interactions were neglected. Different filter sizes were tested in the range [1,10], taking into account all possible sizes. The functions fitted over this whole range to predict accuracy reveal no significant linear or quadratic effect; however, within sub-ranges, the experimental results showed significant trends, associated with significant improvements for individual filter sizes with respect to the others, as discussed in the following.
Firstly, all filters with the same size were employed. Results for both languages are reported in Table 8 and shown in Figure 4. Within this first set of experiments, the best single filter size is Fs = 2, which corresponds to significant improvements with respect to both Fs = 1 and Fs > 2. The trend was similar for both languages, while a misalignment appears for Fs = 8, which may be due to experimental variance.
Then, in order to evaluate the possibility of using filters of different sizes at the same time, as suggested by previous works [19,20], 150 filters of size 2 were fixed, while the size of the other 150 was varied in the same interval. Results for both languages are reported in Table 9 and shown in Figure 5. These results again identify Fs = 2 as the best filter size, also in association with other filters of size 2. This corresponds to significant improvements with respect to Fs = 1 and slight improvements with respect to Fs > 2.
These results mean that substituting 150 filters with others having different sizes does not improve the accuracy. Therefore, further experiments were performed by fixing 200 filters of size 2, while the size of only 100 was varied. Results for both languages are reported in Table 10 and shown in Figure 6. Also in this case, the filter size 2 is the best. However, it corresponds to significant improvements with respect to Fs = 1, while the variations for Fs ≥ 2 were not significant, since they were comparable with the confidence interval of the linear coefficient of the function approximating this trend (about 0.9%).
The finding that Fs = 2 is the best filter size can be explained by observing that, while other literature results were obtained for classifying generic sentences, most questions, as considered in this work, can be classified by considering a sequence of at most 2 words contained in them. This is consistent with previous findings, e.g., [4], which identify single words such as "head words" or "WH-words" (why, when, where, ...), or pairs of words (how much, how long, ...), as the most informative for QC.
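To make the role of the filter size concrete, the following sketch lists the word windows that a filter of a given size slides over. It is only an illustration of the windowing idea: the helper name and the example question are assumptions, and padding is not considered.

```python
def windows(tokens, filter_size):
    """Return the consecutive word windows that a filter of this size slides over."""
    return [tuple(tokens[i:i + filter_size]) for i in range(len(tokens) - filter_size + 1)]


# Example question (illustrative only).
question = "how much does it cost".split()
print(windows(question, 2))
```

With a filter of size 2, the first window already contains the pair "how much", which is one of the word pairs cited as informative for QC.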
The total number of filters was analyzed as well, within the range [50,500], by considering the following values: {50,100,200,300,400,500}. Results are reported in Table 11 and shown in Figure 7. Also in this case, over the whole considered range, the functions fitted to predict accuracy reveal no significant linear or quadratic effect, but a significant improvement can be detected for Ntot = 100 with respect to Ntot = 50. Moreover, starting from 100 filters, i.e., given that enough filters are used, the positive influence of adding filters disappears as this number increases. Even if for English the result for 300 filters was slightly better than that with 100, similar trends can be recognized for both languages; therefore, this difference can be ascribed to the experimental variance. On the contrary, a decreasing trend of the accuracy can be detected as Ntot increases.
These results mean that a minimum of 100 filters should be used, since at least 100 filters are useful to extract different features from text. Moreover, the decreasing trend can be explained by observing that each filter adds 609 weights to the model; therefore, adding a great number of filters causes overfitting on training data, and thus a worse accuracy on testing. Therefore, the value Ntot = 100 appears to be the best choice.
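The text does not break down the figure of 609 weights per filter. One accounting that reproduces it, assumed here rather than stated by the source, is a filter of size 2 over 300-dimensional embeddings (600 convolution weights) plus one output weight per question class (9 classes), with no bias term counted.

```python
def weights_per_filter(filter_size, embedding_dim, n_classes):
    """Weights added by one filter, under the ASSUMED accounting (no bias term)."""
    convolution = filter_size * embedding_dim  # one weight per word position and dimension
    output = n_classes                         # pooled feature feeding each class score
    return convolution + output


print(weights_per_filter(2, 300, 9))
```

Under this assumption, going from 100 to 500 filters would multiply the number of these weights by five, which is in line with the overfitting explanation given above.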
As far as the activation function is concerned, the functions reported in Table 12, together with the respective results, were analyzed. These results, similar for both languages, show that a very low accuracy was obtained by using the sigmoid function. Also with f = softsign the accuracy was significantly lower than the others, while with f = tanh the accuracy was better, but a t-test still reveals that the difference with respect to the others (eLU, Identity, ReLU, softplus) was significant. On the other hand, the activation functions f = eLU, f = Identity, f = ReLU, and f = softplus give higher accuracies, with variations among them comparable with the experimental variance. From Figure 2, it can be noticed that the functions giving better results share an infinitely increasing trend, whereas the worse ones have an asymptotic behavior. Since they offer comparable results, any one of them can be chosen. For example, f = softplus could be chosen by considering the small resulting differences in accuracy, while f = Identity could be preferred in order to design the simplest network architecture.
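The distinction between infinitely increasing and asymptotic activation functions can be checked numerically. The sketch below assumes the standard textbook definitions of the seven functions (with eLU using a unit scale for negative inputs), which may differ in detail from those plotted in Figure 2; the boundedness check is a simple heuristic at one large input, not a proof.

```python
import math

# Standard definitions, ASSUMED here; see Figure 2 for those used in the paper.
ACTIVATIONS = {
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "softsign": lambda x: x / (1.0 + abs(x)),
    "tanh": math.tanh,
    "eLU": lambda x: x if x > 0 else math.exp(x) - 1.0,
    "Identity": lambda x: x,
    "ReLU": lambda x: max(0.0, x),
    "softplus": lambda x: math.log1p(math.exp(x)),
}


def is_bounded_above(f, probe=50.0, limit=1.0):
    """Heuristic check: does f stay within `limit` at a large positive input?"""
    return f(probe) <= limit


bounded = sorted(name for name, f in ACTIVATIONS.items() if is_bounded_above(f))
unbounded = sorted(name for name, f in ACTIVATIONS.items() if not is_bounded_above(f))
print(bounded, unbounded)
```

The bounded group (sigmoid, softsign, tanh) coincides with the functions reported above as giving lower accuracy, while the unbounded group coincides with those giving comparable, higher accuracy.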

Learning Procedure
The first hyperparameter defining the learning procedure considered here was the batch size, i.e., the number of examples considered together to calculate the loss value. Given the training dataset, the variability range was batch ∈ [1,900], and all the orders of magnitude were considered, i.e., batch = {1,10,100,900}. Results are reported in Table 13 and graphically represented in logarithmic scale in Figure 8. Figure 8 clearly shows that, while the influence of batch size was not significant for sizes between 1 and 10, the accuracy clearly decreases as the batch size increases further, with a very strong effect of this hyperparameter. This finding confirms the usefulness of employing batches instead of summing up the loss function over all the examples. In particular, the smaller the batch size, the higher the accuracy obtained. However, one should also take into account that smaller batch sizes cause much longer training time. For this reason, between the batch sizes 1 and 10, which have comparable performance, batch = 10 was chosen here.
The dropout was also varied, in the range Pkeep = (0,1], and in particular, at the following representative levels: Pkeep = {0.1,0.5,0.9,1}. Results are reported in Table 14. From Table 14, it can be seen that, except for the case Pkeep = 0.1, which causes a significant worsening of accuracy, the other cases were very similar. This means that for this kind of system, and for the considered size of the dataset, dropout was not strictly necessary. Therefore, for the considered dataset, dropout can be avoided by choosing Pkeep = 1.0, or equivalently Pkeep = 0.9 can be chosen.
The regularization term l2 was also considered, at the following levels: l2 = {1.0,3.0,5.0}. Results are reported in Table 15. From Table 15, a slight but significant decreasing trend of the accuracy can be detected as l2 increases. Therefore, l2 = 1.0 was chosen.
Finally, the updating rule optimizer used to perform the weights update by the SGD backpropagation algorithm, and the associated learning rate η, were studied together, in order to evaluate also their probable interactions. According to a full factorial design, all the combinations of factor levels were tested, i.e., all the available updating rules optimizer = {Adadelta,Adagrad,Adam,Ftrl,GradientDescent,Momentum,ProximalAdagrad,ProximalGradientDescent,RMSProp} combined with all the orders of magnitude in the considered range of the learning rate η = {0.01,0.1,1,10}. Results are reported in Table 16 and shown in Figure 9. In Table 16, some results are not reported ("-"), since the corresponding experiments were not performed, because they were not meaningful in light of the other experiments. Moreover, the results of some experiments are reported as "<20.0", since in those cases the learning procedure did not reach acceptable accuracy. Some others are reported as "DIV", since the learning procedure raised exceptions due to overflow. The results indicated by an asterisk were obtained with 20,000 epochs instead of 2000; however, results with different numbers of epochs can be compared, because in all cases the training was stopped after convergence was reached. For each set of experiments with different learning rates, the best result is reported in bold.
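The full factorial design over updating rules and learning rates can be enumerated as in the sketch below. The optimizer names and learning rates are those listed above; the dictionary keys and variable names are assumptions introduced for illustration, and no training is performed.

```python
from itertools import product

OPTIMIZERS = [
    "Adadelta", "Adagrad", "Adam", "Ftrl", "GradientDescent",
    "Momentum", "ProximalAdagrad", "ProximalGradientDescent", "RMSProp",
]
LEARNING_RATES = [0.01, 0.1, 1, 10]

# Every combination of factor levels: 9 updating rules x 4 learning rates.
grid = [
    {"optimizer": opt, "learning_rate": lr}
    for opt, lr in product(OPTIMIZERS, LEARNING_RATES)
]
print(len(grid))
```

In practice, as noted above, some cells of this grid were skipped once other results made them uninformative.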
From Table 16, it can be seen that some updating rules do not work with a too high learning rate, giving overflow problems. In particular, when optimizer = {GradientDescent,Momentum,ProximalGradientDescent}, the learning rate η = 1 was already too high. Moreover, when optimizer = Ftrl and η = 0.01 (too low η), and when optimizer = RMSProp and η = 1 (too high η in this case), the learning procedure does not improve the testing accuracy of the initial random model. Another point to take into account is that experiments run for 20,000 epochs, necessary to reach convergence for some low values of η, need much more computation time. These observations limit the range of the usable values of η, specifically for each updating rule.
Most updating rules are equivalent for low learning rates, as can be seen in Figure 9 for η = 0.01. In particular, the cases optimizer = {GradientDescent,Momentum,ProximalGradientDescent} are equivalent for this dataset in the whole range of η. Moreover, as can be seen in Figure 9, within the acceptable range of η for each updating rule, most of them present similar trends, with significantly increasing accuracy as η decreases. This can be explained by the tendency of the network to adapt quickly to training data at high learning rates, which increases predictivity only during the first few epochs. Therefore, while testing lower learning rates was not feasible due to excessive computation time, higher learning rates were not promising in most cases. On the other hand, when optimizer = Adadelta, accuracy surprisingly increases with η, even if the differences in the range η = [0.1,10] are comparable with the experimental variance; this allows using high learning rates (e.g., η = 1) and not too many epochs.
Moreover, for optimizer = Adadelta and η ≥ 0.1, a significant accuracy improvement was gained with respect to the maximal values of the other algorithms, obtained with η = 0.01. Therefore, the following pair of values was chosen as optimal: optimizer = Adadelta and η = 1.

Most Influencing Hyperparameters
The previous findings can be compared and summarized as follows. For classifying questions (in 9 classes, using 900 training instances, with a CNN), the influence associated with variations of different settings, relative to text representation, CNN architecture, and learning procedure, is qualitatively described in Table 17. From Table 17, the possible causes of very bad results can be identified, i.e., too few embedding dimensions, the sigmoid activation function, and a wrong choice of learning rate for a given weights updating rule.
On the other hand, the strongest positive effect on accuracy was associated with a small batch size. Other settings give significant positive effects: use of pre-trained and fine-tuned WE vectors, minimum 300 embedding dimensions, filter size equal to 2, minimum 100 total number of filters, choice of the activation function among {eLU,Identity,ReLU,softplus}, low loss regularization constant, and low learning rate. On the contrary, the influence of eliminating punctuation and of the dropout function (given Pkeep ≥ 0.5) was not significant.

Experiments with Optimal Settings
The results reported above allow identifying the best settings, which are expected to give the highest accuracy values with respect to those reported so far. Therefore, some experiments were performed with the best settings, in CV and with some repetitions, in order to validate the optimization procedure described before. In particular, two different settings were chosen, one (OPT1) comprising one of the best activation functions (f = softplus), the other (OPT2) without the activation function (f = Identity).
Moreover, the results obtained here were compared with those obtained with settings individuated as optimal in previous works [19,21]. For configurations found in previous works, fastText WE pre-trained vectors are used here.
In Table 18, the settings relative to the different final experiments are reported, together with the respective accuracy on testing. The results presented in Table 18 firstly validate the optimization performed on the whole model. Indeed, the accuracy values were the highest obtained so far on this dataset.
Moreover, the results obtained by taking into account optimal configurations individuated by [19,21] were significantly worse than those obtained here. The most noticeable differences in the configurations revealed that the model chosen here was much simpler, since it avoids considering punctuation, uses only 100 filters of size 2, and does not use dropout (nor any activation function, for OPT2).
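For convenience, the choices stated in the preceding sections can be collected in a configuration fragment such as the one below. This is a sketch: the key names are hypothetical, the values are those chosen in the text (Table 18 remains the authoritative source), and OPT1 and OPT2 differ only in the activation function.

```python
# Settings collected from the choices stated in the text; key names are hypothetical.
COMMON = {
    "avoid_punctuation": True,
    "we_init": "pre-trained (fastText)",
    "we_tuning": "dynamic",
    "embedding_dim": 300,
    "filter_size": 2,
    "n_filters": 100,
    "batch_size": 10,
    "p_keep": 1.0,          # dropout disabled
    "l2": 1.0,
    "optimizer": "Adadelta",
    "learning_rate": 1.0,
}

OPT1 = {**COMMON, "activation": "softplus"}
OPT2 = {**COMMON, "activation": "Identity"}

differing = sorted(k for k in OPT1 if OPT1[k] != OPT2[k])
print(differing)
```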
Since [19,21] found their best configurations on a different dataset, the same comparison was performed on the most used dataset, provided by [35], for the English language. In Table 19, the settings and the respective accuracy on testing are reported for the different final experiments, also on this larger dataset. Among the results of Table 19, those obtained with the proposed optimal settings were better than those obtained with settings optimized in previous works for this particular dataset. This confirms the validity and transferability of the optimal text representation, CNN architecture, and learning procedure obtained here for the QC task.

Limitations
The optimal settings found here were based on a multilingual dataset regarding QC, using the taxonomy explained in Section 2.1. Moreover, they were validated on a further dataset, also regarding QC, presenting a different taxonomy.
However, the optimality of those settings cannot be demonstrated for any taxonomy of question classes. Moreover, it cannot be extended to other sentence classification tasks. For example, while a filter of size 2 is enough to classify some questions by just identifying the word sequence "How much", the same small filter could be undersized to distinguish more fine-grained question classes or to classify the sentiment of affirmative sentences.

Conclusions
This paper presented a study performed to analyze the settings of Convolutional Neural Networks for Question Classification, in terms of word representation, network architecture, and learning procedure.
Both English and Italian languages were considered, since they have different morphological richness, and training sets made of different numbers of questions were tested. All experiments were based on questions properly extracted from the same multilingual dataset, in order to check possible dependencies of the optimal settings on language.
All the hyperparameters and the most plausible interactions among them were tested over wide ranges of variability. For each of them, the statistical significance of its influence was evaluated by means of a comparison with the intrinsic variability, measured through repetitions of the same experiments.
The results of the large number of experiments led to the identification of optimal settings, which are similar for both languages. They can be summarized as follows. Regarding the text representation, it is better to avoid punctuation, to use pre-trained word embedding vectors with dimension 300, and to fine-tune them according to the available data; regarding the architecture, 100 filters of size 2 are enough for coarse-grained classification, and an infinitely increasing activation function should be preferred (eLU, ReLU, softplus), or equivalently no activation function (Identity); regarding the learning procedure, using a small batch of 10 gives strong improvements, while choosing it smaller only increases computation time, dropout and loss regularization should be avoided, and the best and fastest optimizer was Adadelta, associated with learning rate 1.0.
The best configuration identified was tested on the same data and on a different set of questions widely used for QC, and compared to the configurations suggested by the most relevant previous works. These further results validated the optimization performed and confirmed the transferability of the best settings to different data, since in all cases the models optimized here showed significantly better classification accuracy than those suggested before.