Investigating the Relationship between Classification Quality and SMT Performance in Discriminative Reordering Models

Arefeh Kazemi 1,*, Antonio Toral 2, Andy Way 3, Amirhassan Monadjemi 1,* and Mohammadali Nematbakhsh 1
1 Department of Computer Engineering, University of Isfahan, Isfahan 81746-73441, Iran; nematbakhsh@eng.ui.ac.ir
2 Center for Language and Cognition, University of Groningen, Groningen 9712 EK, The Netherlands; a.toral.ruiz@rug.nl
3 ADAPT Centre, School of Computing, Dublin City University, Dublin 9, Ireland; andy.way@adaptcentre.ie
* Correspondence: kazemi@eng.ui.ac.ir (A.K.); monadjemi@eng.ui.ac.ir (A.M.)


Introduction
Statistical Machine Translation (SMT) systems automatically translate from one natural language into another. Natural languages vary in their vocabularies and also in their grammatical structure, i.e., the manner in which they arrange words to make up sentences. Accordingly, in order to translate a sentence from the source language into the target language, SMT has to handle two problems: (i) finding the appropriate translation of the words in the source sentence ("lexical choice"), and (ii) predicting their correct order in the target sentence ("reordering"). Reordering is one of the most important factors affecting the quality of the final translation [1]. A large amount of research has been conducted to address the reordering problem, much of which follows the discriminative reordering model (DRM) approach, i.e., it treats word reordering as a structured prediction problem and applies a discriminatively trained model to predict the appropriate word order. In order to predict the word order, most DRMs use a classifier which employs features extracted from the words and computes the word order of the target sentence by predicting either the orientation between word pairs in the source sentence or the most probable jump length after each source word. The performance of the classification algorithm can therefore be expected to affect the quality of the translation. To the best of our knowledge, the relationship between classification quality and SMT performance has not been studied to date. It might be assumed that improvements in classifier quality will be monotonically reflected in overall SMT performance. This is the assumption that justifies previous work which tries to find the best classifier for an SMT system based solely on classifier quality metrics [2][3][4]. In this paper, we study the relationship between the performance of the reordering classifier and SMT quality in three parallel corpora from different language pairs, and experimentally show that this assumption does not always hold.
The remainder of this paper is organized as follows. Section 2 reviews the related work and places our work in its proper context. Section 3 presents in detail the DRMs implemented for our experiment, including their conceptualization, the classifiers and the features used, and their integration into hierarchical phrase-based SMT (HPB-SMT). Sections 4 and 5 contain the experiments carried out to investigate the relationship between classification performance and SMT quality. This is followed by in-depth analysis in Section 6. Finally, we outline conclusions in Section 7, together with some avenues for further research.

Discriminative Reordering Models
Many different approaches have been proposed to address the problem of reordering by incorporating a DRM into SMT. The core component of these DRMs is a classifier that tries to predict the appropriate word order for two words in the source sentence. Zens and Ney [2], Xiong et al. [5] and He et al. [6] used a maximum-entropy (henceforth maxEnt) classifier, while Li et al. [7] used a neural classifier to predict the orientation between neighbouring phrases. Bisazza and Federico [3] and Green et al. [8] employed a maxEnt classifier to predict the orientation of a source word in a given position with respect to another. Gao et al. [9] used a maxEnt classifier to predict the orientation between head and dependent words in the dependency tree of the source sentence. Kazemi et al. [10,11] used Naive-Bayes classifiers to predict the orientation between the dependents in the dependency tree of the source sentence. Xiong et al. [12] proposed a DRM that uses a maxEnt classifier to predict the order of the predicates and their associated arguments. Wang et al. [13] proposed a topic-based reordering model that uses a maxEnt classifier to predict the order of neighbouring phrases. Alrajeh et al. [14] used a multiclass SVM classifier to model phrase movements.
Despite the large amount of work on DRMs that use a classifier to predict reordering, to the best of our knowledge the relationship between classification quality and SMT performance has not been studied to date. In order to find the best classification algorithm or the best features to be used in the classifier in DRMs, it is important to study this relationship and to evaluate the classifier in a way that ensures that the best classifier according to this evaluation, when used in the DRM of an SMT system, leads to the best SMT performance.

Intrinsic vs. Extrinsic Evaluation
In general, there are two different ways to assess the quality of a component in a system: (i) intrinsic evaluation and (ii) extrinsic evaluation [15]. Intrinsic evaluation considers the isolated component and measures its performance on its particular sub-task. Extrinsic evaluation employs the component in the final system and measures the performance of the component in terms of its contribution to the overall performance of the system. For example, for the classifier in the DRM of an SMT system, intrinsic evaluation takes the classifier independently and evaluates it based on classification quality metrics such as accuracy and F-measure. Extrinsic evaluation employs the classifier in the DRM of the final SMT system and evaluates it by measuring the translation quality achieved by the SMT system. Since extrinsic evaluation is difficult and time-consuming, researchers generally tend to pursue intrinsic evaluation. For intrinsic evaluation to be meaningful, it is essential to investigate the relationship between intrinsic and extrinsic metrics and to find an intrinsic metric which has a good correlation with the extrinsic one. In this paper, we investigate the relationship between classification performance and SMT quality, and provide some guidelines for intrinsic evaluation of the classification performance in SMT.
It is worth noting that in the SMT area, most research conducted to date on intrinsic evaluation of SMT components has focused on word alignment. Fraser and Marcu [16,17] study the correlation between metrics used to measure word alignment quality and the BLEU [18] score. They show that previously used intrinsic metrics such as alignment error rate (AER) have a low correlation with the BLEU score, and hence are not suitable for predicting translation quality. For intrinsic evaluation of word alignment, they propose to use a variation of the F-measure which uses the coefficient α to modify the balance between precision and recall (with the optimal value for α depending on the corpus and the SMT task at hand). Ayan and Dorr [19] and Davis et al. [20] show that AER is a poor indicator of SMT performance and propose the "Consistent Phrase Error Rate" [19] and "Word Alignment Agreement F1" [20] metrics for intrinsic evaluation of the word alignment. Vilar et al. [21] argue against the assumption that better alignment increases translation quality, and show that improvement in alignment quality does not always imply an improvement in translation quality. They show that neither AER nor the F-measure proposed in [16,17] is a suitable metric for intrinsic evaluation of word alignment in SMT, with the main flaw in both of these metrics being that they do not take the structure of the translation into account. Guzman et al. [22] study the relationship between word alignment and phrase extraction, and Tian et al. [23] study the relationship between word alignment, phrase table, and translation quality in SMT systems.

Method
In order to investigate the relationship between classification quality and SMT performance in DRMs, we implement the two DRMs described in [9,10]. Both of these DRMs have been designed for hierarchical phrase-based SMT (HPB-SMT) [24] and are based on the dependency tree of the source sentence, which shows the grammatical relations between the words in that sentence. As an example, Figure 1 shows the dependency tree of an English sentence. In this figure, the arrow with label "adj" from "brown" to "fox" indicates that the dependent word "brown" is the adjective related to the head word "fox". Two constituents in the dependency tree of the source sentence can be translated with monotone or swap orientation [25]. If the order of two constituents in the source sentence is the same as the order of their translations in the target sentence, the orientation is monotone; otherwise it is swap. We try to find the optimal word order of a sentence by predicting the orientation of its constituent pairs. To be more precise, we try to find the orientation of each dependent word with respect to its head (head-dep) [9] or with respect to its siblings (dep-dep) [10]. For example, for the sentence in Figure 1, our DRMs try to predict the orientations (ori) between the (head-dep) or (dep-dep) pairs that are shown in Table 1.
Table 1. head-dep and dep-dep pairs for the sentence in Figure 1 and their corresponding orientations when translating into Farsi.
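The monotone/swap distinction described above can be sketched in a few lines. This is only an illustration under stated assumptions: source words are identified by their positions, the alignment maps each source position to the target positions it is aligned to, a multiply-aligned word is represented by the mean of its target positions, and ties count as monotone. The function name `orientation` and these conventions are hypothetical and are not taken from [9,10].

```python
from statistics import mean
from typing import Dict, List, Optional


def orientation(src_i: int, src_j: int,
                alignment: Dict[int, List[int]]) -> Optional[str]:
    """Return 'monotone' or 'swap' for two source positions, or None if
    either word has no aligned target word."""
    tgt_i = alignment.get(src_i)
    tgt_j = alignment.get(src_j)
    if not tgt_i or not tgt_j:
        return None  # orientation is undefined for unaligned words
    # Put the pair into source order so that only target order matters.
    if src_i > src_j:
        tgt_i, tgt_j = tgt_j, tgt_i
    # Assumption: a multiply-aligned word sits at its mean target position.
    pos_first, pos_second = mean(tgt_i), mean(tgt_j)
    return "monotone" if pos_first <= pos_second else "swap"


# Toy alignment (made up): source words 1 and 2 appear in reverse order
# on the target side, source words 0 and 3 keep their relative order.
toy_alignment = {0: [0], 1: [2], 2: [1], 3: [3]}
pair_orientations = {
    (1, 2): orientation(1, 2, toy_alignment),
    (0, 3): orientation(0, 3, toy_alignment),
}
```

Pairs for which the function returns None correspond to the unaligned case that the model has to handle separately.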

Classifiers
The core component of a DRM is a classifier, whose goal is to predict the correct orientation class (monotone or swap) for each (head-dep) and (dep-dep) pair. We use the maxEnt classifier for this task. Instead of using maxEnt to perform a hard classification, we use it to estimate the probability distribution over the two orientation classes. The maxEnt classifier estimates the probability of the orientation type ori given the constituent pair pair as shown in Equation (1), where h_n are binary features extracted from the constituent pair pair and λ_n are the weights of these features:
$$p(ori \mid pair) = \frac{\exp\left(\sum_n \lambda_n h_n(pair, ori)\right)}{\sum_{ori'} \exp\left(\sum_n \lambda_n h_n(pair, ori')\right)} \tag{1}$$
Table 2 shows the features that we used to characterize the constituent pairs in the maxEnt model. syn(w) denotes the synonym set (synset) of the word w as found in WordNet [26]. As an example, Table 3 shows the features that we use for the (dep-dep) pair "fox" and "dog" in our example in Figure 1. Synsets are represented by their unique identifiers in WordNet.
Table 3. Features for the (dep-dep) pair ("fox", "dog") in Figure 1.

Features                           Values
lex(head), lex(dep1), lex(dep2)    jumped, fox, dog
syn(head), syn(dep1), syn(dep2)    SID-01945853-V, SID-02097711-N, SID-02064081-N
depRel(dep1), depRel(dep2)         nsubj, prep-over

In order to generate training instances for the maxEnt model, we use the dependency parse tree of the source sentence and the word alignments between the source and target words in the parallel corpus used for training the MT system. We extract all possible (head-dep) and (dep-dep) pairs for each sentence and determine the orientation type for each pair. Once we have obtained the orientation type for each constituent pair in the training part of our parallel corpus, we train the maxEnt classifier to estimate the probability of a source dependent word having monotone or swap order with respect to its head and its siblings.
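How a two-class maxEnt model turns binary features and their weights into an orientation distribution can be sketched as follows. This is a minimal illustration of the standard log-linear form, not the Stanford classifier used in the experiments: the string encoding of features, the function name `orientation_probs` and all weight values are made up for the example.

```python
import math
from typing import Dict, List, Tuple

CLASSES = ("monotone", "swap")


def orientation_probs(active_features: List[str],
                      weights: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Softmax over the two orientation classes.

    A binary feature h_n fires for a (feature string, class) combination;
    its weight lambda_n is looked up in `weights` (0.0 if unseen).
    """
    scores = {c: sum(weights.get((f, c), 0.0) for f in active_features)
              for c in CLASSES}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}


# Feature strings follow the dep-dep example for ("fox", "dog");
# the weights are hypothetical and serve only to show the computation.
features = ["lex(head)=jumped", "lex(dep1)=fox", "lex(dep2)=dog",
            "depRel(dep1)=nsubj", "depRel(dep2)=prep-over"]
weights = {("depRel(dep1)=nsubj", "swap"): 0.7,
           ("lex(head)=jumped", "monotone"): 0.2}
probs = orientation_probs(features, weights)
```

The resulting dictionary is a soft decision: both class probabilities are kept so that the decoder can use them as scores rather than committing to one label.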

Integration into HPB-SMT
During translation, the HPB-SMT decoder [24] estimates the probability of translating the source sentence S into the translation hypothesis H through a log-linear combination of several feature functions, as shown in Equation (2), where a is the latent word alignment between H and S, F_i is the i-th feature function (out of N total features) and w_i is the weight of this feature. The translation hypothesis with the highest probability is then selected as the final translation.
The DRMs are implemented as four feature functions [10,27]: F_dependencyCoherence, F_monotone, F_swap and F_unalignedPairs. The feature functions are computed for a hypothesis H for source sentence S with constituent pairs Pairs(S). F_dependencyCoherence encourages concurrent translation of constituents, based on the assumption that constituents move together in the translation process [28]. It computes the number of constituent pairs covered by hypothesis H, as in Equation (3). In Equation (3), Covered(H, Pairs(S)) denotes the constituent pairs of the source sentence S that have been covered by hypothesis H. A constituent pair is covered by H if H covers both words in the pair.
F_monotone and F_swap compute the sum of the orientation probabilities of those constituent pairs which are translated in monotone or swap order, respectively. We determine the probability of the orientation type for the constituent pairs based on Equation (1). Depending on the orientation class of a pair, its score contributes to either the monotone or the swap feature function, and F_monotone and F_swap are computed as shown in Equation (4), where a is the word alignment between S and H, and Aligned(H, Pairs(S), a) denotes the aligned covered pairs based on the word alignment a. A constituent pair is aligned if both words in the pair are aligned to at least one target word.
It might happen that a word in a constituent pair is not aligned to any target word, in which case we cannot determine the orientation or compute the swap or monotone features. As this may lead to a search error [9], a penalty is applied by means of an unaligned-pairs feature F_unalignedPairs, computed as the number of covered constituent pairs with at least one unaligned word, as shown in Equation (5). After computing the four feature functions for the translation hypothesis at hand, we combine them with the other feature functions in the HPB-SMT model, as shown in Equation (2).
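The four feature functions can be sketched from the prose definitions above. This is an illustrative reading rather than the decoder integration itself: the function name `reordering_features`, the representation of pairs as tuples of source positions, and the callbacks `orientation_of` and `prob` are all assumptions, and the sketch sums raw probabilities exactly as the text states (whether the original implementation uses log-probabilities is not specified here).

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Pair = Tuple[int, int]  # source positions of the two words in a constituent pair


def reordering_features(pairs: Iterable[Pair],
                        covered_words: Set[int],
                        aligned_words: Set[int],
                        orientation_of: Callable[[Pair], str],
                        prob: Callable[[Pair, str], float]) -> Dict[str, float]:
    """Compute the four DRM feature values for one hypothesis."""
    # Covered: the hypothesis covers both words of the pair.
    covered = [p for p in pairs
               if p[0] in covered_words and p[1] in covered_words]
    # Aligned: both words of a covered pair have at least one target word.
    aligned = [p for p in covered
               if p[0] in aligned_words and p[1] in aligned_words]
    f_monotone = sum((prob(p, "monotone") for p in aligned
                      if orientation_of(p) == "monotone"), 0.0)
    f_swap = sum((prob(p, "swap") for p in aligned
                  if orientation_of(p) == "swap"), 0.0)
    return {
        "dependencyCoherence": float(len(covered)),
        "monotone": f_monotone,
        "swap": f_swap,
        # Covered pairs with at least one unaligned word.
        "unalignedPairs": float(len(covered) - len(aligned)),
    }


# Toy hypothesis (all values made up): it covers words 0-2, of which 0 and 1
# are aligned; the pair (0, 1) is translated in swap order.
feats = reordering_features(
    pairs=[(0, 1), (1, 2), (2, 3)],
    covered_words={0, 1, 2},
    aligned_words={0, 1},
    orientation_of=lambda p: "swap" if p == (0, 1) else "monotone",
    prob=lambda p, ori: 0.8,
)
```

In a log-linear decoder each of these values would then be multiplied by its tuned weight and added to the scores of the built-in features.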

Generating Classifiers with Varying Quality
In order to build an SMT system with a DRM, we require a parallel corpus, a word alignment of that corpus, a language model (LM) built from target-language sentences, as well as a DRM (with an embedded classifier). In this paper, we intend to study the impact of reordering classification quality on SMT performance. Accordingly, in all of our experiments, we keep the parallel corpus, the word alignment, and the LM constant and only vary the classifier. We create classifiers of varying quality by using different feature sets in the classifiers and training them on different amounts of data. We select the features for (head-dep) and (dep-dep) pairs from Table 2 [11], and then use them in the maxEnt classifier of our DRMs. We split the training part of the corpus into separate pieces corresponding to 1/2 and 1/4 of the original data. Then, we train each of the classifiers on three data sets: the original data set and the two generated subsets. In this way, we obtain 18 reordering classifiers of different quality. The feature sets and training data used in each classifier are shown in Table 4.
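The grid of classifier settings can be enumerated as in the sketch below. Several details are assumptions made for illustration: six feature sets are assumed because 18 classifiers over three data sizes implies six, the feature-set names are placeholders, the subsets are taken as prefixes of the training instances, and the numbering does not necessarily match the order used in Table 4.

```python
from itertools import product
from typing import Dict, List, Sequence

FRACTIONS = (1.0, 0.5, 0.25)  # full training data, 1/2 and 1/4

# Placeholder names; the actual feature combinations are listed in Table 4.
FEATURE_SETS = ["set1", "set2", "set3", "set4", "set5", "set6"]


def training_subsets(instances: Sequence,
                     fractions: Sequence[float] = FRACTIONS) -> Dict[float, list]:
    """Return one training subset per fraction (assumption: a prefix split)."""
    return {f: list(instances[: int(len(instances) * f)]) for f in fractions}


def classifier_configs(feature_sets: Sequence[str],
                       fractions: Sequence[float] = FRACTIONS) -> List[dict]:
    """Cross every feature set with every training-data size."""
    return [{"no": i + 1, "features": fs, "fraction": fr}
            for i, (fs, fr) in enumerate(product(feature_sets, fractions))]


configs = classifier_configs(FEATURE_SETS)
subsets = training_subsets(list(range(8)))  # toy stand-in for training instances
```

Each configuration corresponds to one trained classifier and therefore to one SMT system, while the corpus, alignment and LM stay fixed.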

Experiments
We experiment with three parallel corpora for different language pairs: English-Farsi, English-Arabic and English-Turkish. The English-Farsi corpus (Tep++) [29] is extracted from film subtitles. The English-Turkish corpus is extracted from documents in the international relations and legal sphere [30]. Finally, the English-Arabic corpus is the News Commentary corpus (v11) [31]. For all the experiments, tuning and test sets were selected randomly from the main corpus, with the remaining part of the corpus used for training. Table 5 shows the statistics of the training, tuning and test sets for the three parallel corpora.
In order to obtain the dependency trees of the source sentences, we used the Stanford dependency parser [32]. To generate word alignments, we used GIZA++ [33]. Having obtained both the dependency structure and the word alignment, we extracted (head-dep) and (dep-dep) pairs from the training sets and determined the orientation for each pair. Table 6 shows the reordering type distribution over the training sets of each language pair. To perform the classification task in the DRM, we used the Stanford maxEnt classifier [34] with default settings. Our baseline SMT system is the Moses implementation [35] of the HPB-SMT model, with standard settings. We integrated our DRMs as four additional features as described in Section 3.3. In all experiments, the weights of our reordering feature functions and the other built-in feature functions were tuned by MIRA [36]. We used a 5-gram LM trained on the target side of our training corpora. In order to evaluate the performance of the classifiers, we trained them on the training parts of the parallel corpora and evaluated them on the test parts.
We built 18 SMT systems, each using a DRM with a classifier built using a setting from Table 4. The machine-translated text is evaluated in the target language against its reference translation using two popular automatic metrics: BLEU [18] and TER [37]. BLEU is the de facto standard automatic evaluation metric in the MT field, with a higher score indicating better translation quality. We also use TER as it is an error-rate metric whose score is based on the number of edit operations (insertions, deletions, substitutions and shifts) that are required to make the MT output match the reference, and thus provides an indication of the effort required to post-edit the MT output (the lower the TER value, the better the MT performance). In order to account for the BLEU and TER variations created by the random processes in the tuning step, we tune each system three times and report the average scores obtained with multeval [38] on the MT outputs.

Relationship between Classification Performance and Translation Quality
Reordering classifiers are generally evaluated intrinsically by measuring their accuracy. The accuracy of the classifier is the proportion of correctly classified examples. It might be assumed that there is a strong monotonic relationship between the accuracy of the classifier in the DRM and SMT performance. To be more precise, it might be assumed that there is a strong positive correlation between the accuracy of the classifier and the BLEU score, and correspondingly that there is a strong negative correlation between the accuracy of the classifier and the TER score. In order to examine the validity of this assumption, we calculate Spearman's rank correlation coefficient (ρ_s) between the classifier's accuracy and the BLEU score and also between the classifier's accuracy and TER.
ρ_s shows how well the relationship between two variables can be described by a monotonic function. If ρ_s(Accuracy, BLEU) = 1, there is a perfect positive relationship between the accuracy and the BLEU score, i.e., the BLEU score increases when the accuracy of the classifier increases, and vice versa. Similarly, if ρ_s(Accuracy, TER) = −1, there is a perfect negative relationship between the accuracy and the TER score, i.e., the TER score decreases when the accuracy of the classifier increases, and vice versa.
The Spearman correlation between two variables (ρ_s) is equal to the Pearson correlation coefficient (r) between their rank values, as shown in Equation (6). In Equation (6), cov(Rank(X), Rank(Y)) is the covariance of the rank variables, and σ(Rank(X)) and σ(Rank(Y)) are the standard deviations of the rank variables:
$$\rho_s(X, Y) = r(\mathrm{Rank}(X), \mathrm{Rank}(Y)) = \frac{\mathrm{cov}(\mathrm{Rank}(X), \mathrm{Rank}(Y))}{\sigma(\mathrm{Rank}(X))\,\sigma(\mathrm{Rank}(Y))} \tag{6}$$
Figures 2-7 are scatter plots. Figures 2-4 show Spearman's correlation between the classifier's accuracy and the BLEU score, while Figures 5-7 show Spearman's correlation between the accuracy and TER, for each of our three parallel corpora. Data labels show the corresponding number of each classifier (No.) as shown in Table 4. The figures include the correlation coefficient ρ_s and its p-value as well as a regression line and its 95% confidence region. The p-value indicates the statistical significance of the correlation (ρ_s). We use a significance level of α = 0.05. These figures have been generated with R's library ggplot.
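The definition of Spearman's coefficient as the Pearson correlation of rank values can be checked with a short, dependency-free sketch. The helper names are hypothetical, tied values receive their average rank (a common convention, assumed here), and the accuracy and BLEU numbers are made up for illustration; the sketch does not compute the p-value reported in the figures.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for four classifiers: the two weakest classifiers swap
# places in the BLEU ranking, so the relationship is not perfectly monotonic.
accuracy = [0.70, 0.72, 0.75, 0.80]
bleu = [20.1, 20.0, 20.6, 21.0]
rho = spearman(accuracy, bleu)
```

A single inversion in the ranking is already enough to pull the coefficient below 1, which is the kind of mismatch examined in this section.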
An ideal metric for measuring the performance of the built-in classifier in the SMT system should have a perfect positive Spearman's correlation with the BLEU score and a perfect negative Spearman's correlation with the TER score. With such a metric, an increase in measured classifier quality would always correspond to an increase in SMT performance. As Figures 2-7 show, the correlation coefficient is statistically significant in all cases (p < 0.05). For the En-Fa and En-Tr corpora, there is a strong positive correlation between the accuracy and the BLEU score. Furthermore, there is a strong negative correlation between the accuracy and the TER. However, the absolute values of the correlation coefficients are not equal to one. This means that there is a mismatch between classification accuracy and SMT performance, such that higher classification accuracy does not always lead to better MT performance. Accordingly, one cannot rely solely on classification accuracy in order to select the best classifier for the DRM in the SMT system.
Surprisingly, for the En-Ar corpus, there is a strong negative correlation between the accuracy and the BLEU score and a strong positive correlation between the accuracy and TER. This shows that, for the En-Ar corpus, the classifier with the best accuracy will probably lead to the worst SMT performance. We hypothesize that this is because, as Table 6 shows, the percentage of monotone instances is much larger than that of swap instances in the En-Ar corpus (71% versus 21% for head-dep and 88% versus 12% for dep-dep), i.e., the En-Ar corpus is imbalanced. In imbalanced data, micro-averaged scores such as accuracy may become biased in favour of the majority class (here, the monotone class). Our experiments confirm this trend. We observed that for the classifiers on the En-Ar corpus, the precision of the classifier on the monotone class is about 93%, while its precision on the swap class is only around 35%. This means that the classifier chooses the majority class (here, the monotone class) for most of the pairs and only a limited number of reorderings can be performed by the DRM, which is why the performance of the SMT system decreases when the classifier accuracy increases.
Accuracy is a micro-averaged score and hence it is a measure of effectiveness of the classifier on the larger class. In order to measure the effectiveness of the classifier on the smaller class in imbalanced data, macro-averaged results should be computed [39]. We investigate the relationship between macro-averaged F_1 and the translation performance for the imbalanced En-Ar corpus.
[Figures 2-7: scatter plots of classifier accuracy rank against BLEU and TER rank for each corpus; one English-Turkish panel is labelled "Accuracy Rank for English-Turkish, ρ = 0.8, p-value < 0.01".]
We measure macro-averaged F_1 as shown in Equation (7), where F_m and F_s are the F_1 scores on the monotone and swap classes, respectively, which are computed based on Equation (8). While micro-averaged metrics such as accuracy give equal weight to each per-instance classification decision, the macro-averaged F-measure in Equation (7) gives equal weight to each class [39]. Hence, macro-averaged metrics are more suitable for imbalanced data:
$$F_1^{macro} = \frac{F_m + F_s}{2} \tag{7}$$
$$F_c = \frac{2\,P_c\,R_c}{P_c + R_c}, \quad c \in \{m, s\} \tag{8}$$
where P_c and R_c are the precision and recall of the classifier on class c.
Figures 8 and 9 show the correlation between the macro-averaged F-score and the BLEU and TER scores for the En-Ar corpus. The correlation coefficient is statistically significant in all cases (p < 0.05). As expected, for the imbalanced En-Ar corpus, there is a strong positive correlation between the macro-averaged F-score and BLEU, and a strong negative correlation between the macro-averaged F-score and TER.
[Figures 8 and 9: scatter plots of macro-averaged F-score rank against BLEU and TER rank for English-Arabic.]
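The difference between accuracy and macro-averaged F_1 on imbalanced data can be illustrated with a toy example. The sketch below implements the usual per-class precision, recall and F_1 definitions; the function names and the ten-instance data set are made up, and a classifier that always predicts the majority class is used as the extreme case.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Proportion of correctly classified instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true: Sequence[str], y_pred: Sequence[str], cls: str) -> float:
    """F1 score of one class; 0.0 when precision and recall are both zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             classes: Sequence[str] = ("monotone", "swap")) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1(y_true, y_pred, c) for c in classes) / len(classes)


# Toy imbalanced data: 8 monotone and 2 swap instances, and a classifier
# that always predicts the majority class.
gold = ["monotone"] * 8 + ["swap"] * 2
pred = ["monotone"] * 10
acc = accuracy(gold, pred)
macro = macro_f1(gold, pred)
```

Here the accuracy looks respectable although no swap is ever predicted, whereas the macro-averaged score is pulled down by the F_1 of zero on the swap class.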

The Impact of Classification Improvement on Translation Quality
In Section 6.1, we showed that improving the performance of the classifier in the DRM does not automatically improve SMT quality. However, we observed that when the relative improvement in classification performance is high enough, the quality of the SMT system improves too. In order to confirm this observation, we investigate the impact of classification improvement on translation quality. To this end, for each pair of the 18 SMT systems described in Section 5, we calculate the amount of relative improvement in classification performance and SMT quality as shown in Algorithm 1. In Algorithm 1, ClPerformance(System_A) denotes the performance of the classifier in the DRM of SMT system A in terms of accuracy or macro-averaged F-score. MtQuality(System_A) denotes the quality of SMT system A in terms of BLEU. ClImp and MtImp denote, respectively, the relative improvement in the classification performance and SMT quality for system A compared to system B.
For each parallel corpus, we calculate the improvement in the classification performance and SMT quality for each pair of SMT systems based on Algorithm 1. As discussed in Section 6.1, for the imbalanced En-Ar corpus, the macro-averaged F-score shows higher correlation with BLEU in comparison to accuracy. Accordingly, for the En-Ar corpus we calculate ClPerformance(System_A) in terms of macro-averaged F-score, while for the En-Fa and En-Tr corpora we calculate it in terms of accuracy. For all SMT systems, we calculate MtQuality(System_A) in terms of BLEU. Figures 10-12 show the relationship between the improvement in classification performance (ClImp) and the improvement in SMT quality (MtImp) for the En-Fa, En-Ar and En-Tr corpora, respectively.
We derive the following observations from the results:
• When the improvement in classification performance exceeds a certain threshold, SMT quality will improve too. For the En-Fa, En-Ar and En-Tr corpora, the threshold values are 6.4%, 3% and 6.2%, respectively. This shows that, for each parallel corpus, if the amount of improvement in classification performance exceeds the corresponding threshold value, we can expect the SMT quality to improve as well.
• The magnitude of the improvement in classification performance is not necessarily proportional to the magnitude of the improvement in SMT quality. That is, a higher improvement in classification performance does not always lead to a higher improvement in SMT quality.
• An improvement of about 0-20% in classification performance leads to an improvement of about 0-3.5% in the BLEU score. It is worth noting that although the improvement in BLEU score is much smaller than the improvement in classification performance, it is still comparable with the BLEU improvement gained by some recent reordering models (cf. Table 7).
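One way to read the threshold in the first observation is as the largest classification improvement that still failed to improve translation quality; every larger improvement then came with a BLEU gain. The sketch below implements that reading only. The function name `improvement_threshold` and the data points are made up, and the exact procedure used to obtain the reported thresholds is not specified in this section.

```python
from typing import Iterable, Tuple


def improvement_threshold(points: Iterable[Tuple[float, float]]) -> float:
    """Largest positive ClImp (in %) whose MtImp is not positive.

    `points` holds (ClImp, MtImp) pairs for system pairs in which the
    classifier improved. Returns 0.0 if every improvement helped.
    """
    non_improving = [cl for cl, mt in points if cl > 0 and mt <= 0]
    return max(non_improving, default=0.0)


# Made-up (ClImp, MtImp) pairs in percent.
toy_points = [(1.0, -0.2), (2.5, 0.3), (4.0, -0.1), (7.0, 0.6), (15.0, 1.2)]
threshold = improvement_threshold(toy_points)
above = [mt for cl, mt in toy_points if cl > threshold]
```

In the toy data, every pair whose classification improvement lies above the threshold also shows a positive BLEU change, while pairs below it are mixed.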
Algorithm 1 Calculating the amount of improvement in classification performance and SMT quality.
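The pairwise computation that Algorithm 1 refers to can be sketched as follows. The body of the algorithm is not reproduced here, so this is a reconstruction under assumptions: relative improvement is taken as the percentage change with respect to system B, only ordered pairs in which system A has the better classifier are kept, and the names `relative_improvement` and `pairwise_improvements` as well as the example scores are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def relative_improvement(new: float, old: float) -> float:
    """Percentage change of `new` with respect to `old`."""
    return (new - old) / old * 100.0


def pairwise_improvements(
    systems: Dict[str, Tuple[float, float]]
) -> List[Tuple[str, str, float, float]]:
    """Return (A, B, ClImp, MtImp) for every pair where A's classifier beats B's.

    `systems` maps a system name to (ClPerformance, MtQuality), e.g.
    (accuracy or macro-averaged F-score, BLEU).
    """
    results = []
    for a, b in permutations(systems, 2):
        cl_a, mt_a = systems[a]
        cl_b, mt_b = systems[b]
        if cl_a > cl_b:  # assumption: compare the better classifier to the worse
            results.append((a, b,
                            relative_improvement(cl_a, cl_b),
                            relative_improvement(mt_a, mt_b)))
    return results


# Made-up scores: (classifier accuracy, BLEU).
toy_systems = {"sys1": (0.70, 20.0), "sys2": (0.77, 20.5), "sys3": (0.72, 19.8)}
improvements = pairwise_improvements(toy_systems)
```

With 18 systems this yields one (ClImp, MtImp) point per ordered pair with a better classifier, which is the kind of data plotted in Figures 10-12.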

Conclusions
In this paper, we conducted an empirical study of the relationship between the quality of the classifier used in the reordering model and the ultimate performance of an SMT system. We measured Spearman's rank correlation coefficient between the classification evaluation metric (accuracy) and MT automatic evaluation metrics (BLEU and TER). For one of the examined corpora, Spearman's correlation between accuracy and BLEU is negative. That is, for this corpus, the classifier with the highest accuracy leads to the worst SMT performance. We hypothesized that this is because this corpus is imbalanced, so accuracy is not a suitable metric with which to evaluate the classifier. For this corpus, we obtained a good positive correlation by using the macro-averaged F-score. Hence, we provided evidence that for imbalanced corpora, macro-averaged F-score is a better metric than accuracy for evaluating the classifier in the reordering model of an SMT system. Further investigation on more imbalanced corpora is necessary to confirm this hypothesis.
In addition, we showed that the absolute value of Spearman's correlation coefficient is lower than 1 for all three corpora examined. This means that better classification performance does not always lead to better SMT quality. We therefore investigated the impact of classification improvement on translation quality. We showed that, although better classification performance does not always lead to better SMT quality, when the improvement in classification performance exceeds a certain threshold value, we can expect the SMT quality to improve as well. For the En-Fa, En-Ar and En-Tr corpora that we used in this paper, these threshold values were found to be 6.4%, 3% and 6.2%, respectively.
In this paper, we have investigated the relationship between the performance of the classifier in the DRM and SMT quality for HPB-SMT systems. Similar work should be done to investigate this relationship for other types of SMT systems (e.g., phrase-based SMT). Researchers who work on the same HPB-SMT model and use corpora with similar distributions of monotone and swap reordering as those reported here could use the threshold values we obtained in this paper.

Figure 1. An example dependency tree for an English source sentence, its translation in Farsi and the word alignments.

Figure 10. The improvement in classification performance vs. the improvement in SMT quality for English-Farsi.

Figure 11. The improvement in classification performance vs. the improvement in SMT quality for English-Arabic.

Figure 12. The improvement in classification performance vs. the improvement in SMT quality for English-Turkish.

Table 2. Features used in the maxEnt model.
Feature      Description
lex(w)       surface form of word w
depRel(d)    dependency relation between dependent word d and its head
syn(w)       synset of word w

Table 4. Training data and feature sets used in the classifiers.

Table 6. Reordering type distribution over the training data for the parallel corpora.