Surfing the modeling of POS taggers in low-resource scenarios

The recent trend towards the application of deep structured techniques has revealed the limits of huge models in natural language processing. This has reawakened interest in traditional machine learning algorithms, which have proved to still be competitive in certain contexts, in particular low-resource settings. In parallel, model selection has become an essential task to boost performance at reasonable cost, even more so when we talk about processes involving domains where the training and/or computational resources are scarce. Against this backdrop, we evaluate the early estimation of learning curves as a practical mechanism for selecting the most appropriate model in scenarios characterized by the use of non-deep learners in resource-lean settings. On the basis of a formal approximation model previously evaluated under conditions of wide availability of training and validation resources, we study the reliability of such an approach in a different and much more demanding operational environment. Using as case study the generation of POS taggers for Galician, a language belonging to the Western Ibero-Romance group, the experimental results are consistent with our expectations.


Introduction
The application of machine learning (ML) techniques has significantly changed the landscape of natural language processing (NLP) over the last decade, allowing some of the gaps derived from the use of rule-based approaches to be filled, mainly their lack of flexibility and high development cost. Thus, although the state-of-the-art makes clear that hand-crafted tools are not only easier to interpret and manipulate but also often provide better results [1-5], the high level of dependency on expert knowledge makes their implementation and subsequent maintenance costly in human terms, in addition to hindering their applicability to different languages [2,5-8]. Combined with the surge in computational power, the possibility of accessing massive amounts of data and the decline in the cost of disk storage, this has decisively contributed to the growing popularity of ML algorithms as the basis for classification [9] and clustering [10] models in a variety of tasks. These include entity detection [11], information retrieval [12], language identification [13], machine translation [14], question answering [15], semantic role labeling [16], sentiment analysis [17] and text classification [18], among others.
At this juncture, although recent proposals based on deep learning (DL) have outperformed traditional ML methods on a variety of operating fronts, the approach has also shown its limitations. Specifically, there are two main reasons for the popularity and the arguable superiority of DL solutions: end-to-end training and feature learning. However, the latter translates into an inability to handle symbols directly [19]. This implies that data must first be converted to vector representations for input into the model, and the reverse must then be done with its output, which hampers the interpretability of the models. Other well-known challenges are the lack of theoretical foundation [20], the difficulty in dealing with the long tail [21,22], the ineffectiveness at inference and decision making [23], and the requirement of large amounts of data and powerful computing resources that may not be available. This complex picture looks even worse in the NLP domain [24], particularly when it comes to dealing with low-resource scenarios [25]. On the one hand, feature-based techniques lead to an imperfect use of the linguistic knowledge. On the other, the scarcity of training data is not only problematic per se, but also because of its impact on the rest of the process. Thus, generating high-quality vector representations remains a challenge [26], and the imbalance in the training samples that triggers the long-tail and bias phenomena becomes more likely. This comes together with the proneness of DL models to overfitting [27], which can result in poor predictive power, thereby compromising both inference and decision making.
This has led to renewed interest in reviewing the role of DL techniques vs. traditional ML ones in the development of NLP applications [3,24,28-30], particularly in low-resource scenarios [25]. Special attention has been given here to sequence labeling tasks [30]. These encompass a class of NLP problems that involve the assignment of a categorical label to each member of a sequence of observed values, and whose output facilitates downstream applications such as parsing or semantic analysis, so errors at this stage can lower their performance [31]. Among the most important, we can highlight named entity recognition [7,32], multi-word expression identification [29], and morphological [26] and POS tagging [2,33-36]. It is precisely in this framework, the generation of POS taggers for low-resource scenarios by means of non-deep ML, that we propose the study of model selection based on the early estimation of learning curves. With that in mind, we first overview the state-of-the-art and our contribution in Section 2. Next, Section 3 briefly reports on the theoretical basis supporting our research. In Section 4, we introduce the testing frame for the experiments described in Section 5 and later discussed in Section 6. Finally, Section 7 presents our conclusions and thoughts for future work.

Related Work and Contribution
Model selection based on the estimation of learning curves has been the subject of ongoing research over recent decades, inspired by the idea that the loss in predictive power and the amount of training are correlated [37]. In the scope of NLP, these techniques have been applied to the most commonly researched areas such as, for example, machine translation. Specifically, they have been used there for assessing the quality of systems [38,39], optimizing parameter settings [40], estimating how much training data is required to achieve a certain degree of accuracy [41], or evaluating the impact of a concrete set of distortion factors on performance [42]. Their popularity is growing especially in the field of active learning (AL) [43], where we can refer to applications for information extraction [44,45], parsing [46,47] and text classification [48-51]. The same is true for POS tagging [52-54] and closely related tasks, such as named entity recognition [55-57] or word sense disambiguation [58-60], always with the aim of reducing the annotation effort. Since AL prioritizes the data to be labelled in order to maximize the impact on the training of a supervised model, it performs well with substantially fewer resources than other ML strategies. This justifies the interest in it as an underlying learning guideline for dealing with low-resource scenarios [61-64], and specifically in the area of POS tagging [65-70].
Focusing on the early estimation of learning curves in AL, we can distinguish between functional [55,71] and probabilistic [72-74] proposals, depending on the nature of the halting condition used to determine the end of the training process from the information generated in each cycle. As a basic difference, functional strategies not only permit the calculation of relative and absolute error (resp. convergence) thresholds [75], but they are also simpler and more robust than probabilistic techniques. In particular, by replacing single observations with distributions, the latter introduce elements of randomness, and thus uncertainty. Establishing how much data is necessary to reliably build such distributions is then no easy matter, and the same applies to the handling of rare events. Being related to the definition of a sampling strategy on a sufficiently wide range of observations, this question should be preventable, or at least better dealt with, in a functional frame [76], even more so when the scarcity of training resources makes it difficult to apply probabilistic criteria.
In such a context, we face the evaluation of POS tagging models by early estimation of learning curves when working in resource-scarce settings, to the best of our knowledge a hitherto unexplored terrain. Leading on from this, and looking for an operational solution, we focus on AL scenarios, turning our attention to a functional view of the issue. With a view to exploring the practicality and potential of the approach, we address it in a setting previously used to demonstrate its effectiveness when the availability of resources for learning is not a problem. That way, we take up both the formal prediction concept and the testing frame introduced in [71], which also allows us to contrast the level of efficiency to be expected when the conditions for training and validation of the generated models are much more restrictive.

The Formal Framework
Below is a brief review of the theoretical basis underlying our work, taken from [71]. From now on, we denote the real numbers by $\mathbb{R}$ and the natural ones by $\mathbb{N}$, assuming that $0 \notin \mathbb{N}$. The order in $\mathbb{N}$ is also extended pointwise to $\mathbb{N}^{\mathbb{N}}$. Assuming that a learning curve is a plot of model learning performance over experience, we focus on accuracy as a measure of that performance.

The notational support
We start with a sequence of observations calculated from cases incrementally taken from a training data base, and organized around the concept of learning scheme [71].

Definition 1. (Learning scheme) Let $D$ be a training data base, $K \subsetneq D$ a set of initial items from $D$, and $\sigma: \mathbb{N} \to \mathbb{N}$ a function. We define a learning scheme for $D$ with kernel $K$ and step $\sigma$ as a triple $\mathcal{D}^\sigma_K = [K, \sigma, \{D_i\}_{i \in \mathbb{N}}]$, such that $\{D_i\}_{i \in \mathbb{N}}$ is a cover of $D$ verifying

$$D_1 = K, \quad D_i \subsetneq D_{i+1}, \quad \|D_{i+1}\| = \|D_i\| + \sigma(i)$$

with $\|D_i\|$ the cardinality of $D_i$. We refer to $D_i$ as the individual of level $i$ for $\mathcal{D}^\sigma_K$.
That relates a level $i$ with the position $\|D_i\|$ in the training data base, determining the sequence of observations $\{[x_i, \mathcal{A}(x_i)]\}_{i \in \mathbb{N}}$, where $x_i := \|D_i\|$ and $\mathcal{A}(x_i)$ is the accuracy achieved on such an instance by the learner. Thus, a level determines an iteration in the adaptive sampling whose learning curve is $\mathcal{A}$, whilst $K$ delimits a portion of $D$ we believe to be enough to initiate consistent evaluations of the training. For its part, $\sigma$ identifies the sampling scheduling.
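As a quick illustration of Definition 1 and of the observation sequence it induces, the following minimal Python sketch enumerates the cardinalities $\|D_i\|$ of the individuals generated by a kernel and a step function; the function names and the toy figures are ours and purely illustrative.

```python
from typing import Callable, Iterator

def individual_sizes(kernel_size: int, step: Callable[[int], int],
                     corpus_size: int) -> Iterator[int]:
    """Yield the cardinalities ||D_i|| of the individuals of a learning
    scheme with the given kernel and step, within the corpus limits."""
    size, level = kernel_size, 1      # ||D_1|| = ||K||
    while size <= corpus_size:
        yield size
        size += step(level)           # ||D_{i+1}|| = ||D_i|| + sigma(i)
        level += 1

# Example: kernel of 5,000 words and uniform step of 5,000 words.
print(list(individual_sizes(5_000, lambda i: 5_000, 30_000)))
# -> [5000, 10000, 15000, 20000, 25000, 30000]
```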
In order to get a reliable assessment, the weak predictor generated at each learning cycle is extrapolated according to an accuracy pattern [71], which allows us to formally compile a set of properties that give stability to the estimates and are widely accepted as working hypotheses by the state-of-the-art in model evaluation [73,77-81].
As running accuracy pattern we select the power law family $\pi(a, b, c)(x) := -a \, x^{-b} + c$. Its use is illustrated in the right-most diagram of Fig. 1, fitting the learning curve represented on the left-hand side for the SVMTool tagger [82] on the Xiada corpus of Galician [83], with the values $a = 204.570017$, $b = 0.307277$ and $c = 99.226727$ provided by the trust region method [84]. Returning to the review of our notational support, we now adapt these calculation elements to an iterative dynamics through the concept of learning trend [71], namely the accuracy pattern $\mathcal{A}^\pi_\ell[\mathcal{D}^\sigma_K]$ fitted to the first $\ell$ observations of the learning curve. The minimum level $\ell$ for a learning trend $\mathcal{A}^\pi_\ell[\mathcal{D}^\sigma_K]$ is 3, because we need at least three points to generate a curve. Its value $\mathcal{A}^\pi_\ell[\mathcal{D}^\sigma_K](x_i)$ is our prediction of the accuracy for a case $x_i$, using a model generated from the first $\ell$ cycles of the learner. Accordingly, the asymptotic term $\alpha_\ell$ is interpretable as the estimate for the highest accuracy attainable.
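The fitting just described can be reproduced with standard tooling. The sketch below, a minimal example rather than the actual experimental setup, adjusts the power law pattern to a few observations using SciPy's trust-region reflective solver; the observation values are illustrative and not taken from the SVMTool/Xiada trace.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    """Accuracy pattern pi(a, b, c)(x) = -a * x**(-b) + c."""
    return -a * np.power(x, -b) + c

# Observed (position, accuracy) pairs from the first learning cycles.
x = np.array([5_000, 10_000, 15_000, 20_000, 25_000], dtype=float)
y = np.array([91.2, 93.8, 95.1, 95.9, 96.4])

# method='trf' selects SciPy's trust-region reflective algorithm.
(a, b, c), _ = curve_fit(power_law, x, y, p0=(100.0, 0.3, 99.0),
                         method='trf')
print(f"asymptotic estimate: {c:.4f}")   # highest attainable accuracy
```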
Continuing with the tagger SVMTool and the corpus Xiada, Fig. 2 shows a portion of the learning trace, with a kernel of $5 \times 10^3$ words and the uniform step function $5 \times 10^3$, including a zoom view.

Correctness
Assuming our working hypotheses, the correctness of the proposal, i.e. the existence and effective approximation of a learning curve $\mathcal{A}[\mathcal{D}^\sigma_K]$, is proved. Specifically, that function exists and is positive definite, increasing, continuous and upper bounded by 100 in $(0, \infty)$.
In order to estimate the quality of our approximation, a relative proximity criterion is introduced. Labelled layered convergence, it evaluates the contribution of each learning trend to the convergence process through a sequence which is proved to be decreasing and convergent to zero. These layers of convergence can then be interpreted as a reliable reference for fixing error (resp. convergence) thresholds.

Robustness
Robustness is studied from a set of testing hypotheses, which assume that learning curves are positive definite and upper bounded, albeit only quasi-strictly increasing and concave. An observation is then categorized according to its position with respect to the working level (WLevel), i.e. the cycle after which irregularities no longer impact the correctness. Considering that the learner stabilizes as the training advances, and that the monotonicity of the asymptotic backbone is at the basis of any halting condition, the WLevel is identified as the level providing the first slope fluctuation below a given ceiling in that backbone. Once it is passed, the prediction level (PLevel) marks the beginning of the learning trends which could feasibly predict the learning curve, namely those not exceeding its maximum (100) [71]. On this basis, the WLevel is calculated as the lowest level for which the normalized absolute value of the slope of the line joining two consecutive points on the asymptotic backbone falls below a verticality threshold $\nu$, corrected by a coefficient $1/\varsigma$ to avoid dealing with both infinitely large slopes and extremely small decimal fractions.
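As a rough operational reading of this rule, the sketch below scans an asymptotic backbone for the first slope fluctuation under the corrected threshold. The unit-step normalization of the slope and the 1-based indexing of levels are our assumptions; the exact normalization is fixed in [71].

```python
def working_level(backbone, nu, varsigma=1.0):
    """Sketch of the WLevel rule: first level whose backbone slope,
    corrected by 1/varsigma, falls below the verticality threshold."""
    for ell in range(1, len(backbone)):
        slope = abs(backbone[ell] - backbone[ell - 1])  # unit-step slope
        if slope / varsigma < nu:
            return ell + 1        # levels are 1-indexed in the paper
    return None                   # WLevel not reached within the trace

# Illustrative backbone stabilizing around 97.3.
alphas = [99.9, 98.4, 97.6, 97.3, 97.299, 97.29899]
print(working_level(alphas, nu=2e-5))   # -> 6
```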
Following our example, Fig. 3 shows the scale of the deviations in the asymptotic backbone before and after the WLevel, for $\nu = 2 \times 10^{-5}$, $\varsigma = 1$ and $\lambda = 5$. The PLevel is also displayed, showing that these two levels need not be the same.

The Testing Frame
Given a training corpus $D$, we want to study how far in advance and how well a learning curve $\mathcal{A}[\mathcal{D}^\sigma_K]$, built from a kernel $K$ and using a step function $\sigma$, can be approximated in a low-resource scenario. To ensure the relevance of the results obtained, we standardize the conditions under which the experiments take place, following the same criteria previously considered in the study of resource-rich languages [71].

The monitoring structure
As evaluation basis we consider the run [71] $\mathcal{E} = [\mathcal{A}^\pi[\mathcal{D}^\sigma_K], \wp(\nu, \varsigma, \lambda), \tau]$, composed of a learning trace $\mathcal{A}^\pi[\mathcal{D}^\sigma_K]$, a prediction level $\wp(\nu, \varsigma, \lambda)$ and a convergence threshold $\tau$. We then apply our study to a collection of runs $\mathcal{C} = \{\mathcal{E}_i\}_{i \in I}$, defined for a set of different learners. In order to avoid misconceptions due to a lack of uniformity in the testing frame, a common corpus $D$, kernel size, accuracy pattern $\pi$, step function $\sigma$, verticality threshold $\nu$, slowdown $\varsigma$, look-ahead $\lambda$ and convergence threshold $\tau$ are used.
In practice, we are interested in studying each run $\mathcal{E} = [\mathcal{A}^\pi[\mathcal{D}^\sigma_K], \wp(\nu, \varsigma, \lambda), \tau]$ from the level at which predictions fall below $\tau$, which we baptize the convergence level (CLevel). So, once the PLevel is found during the computation of the trace $\mathcal{A}^\pi[\mathcal{D}^\sigma_K]$, we begin to check the layer of convergence. When it reaches the threshold $\tau$, the trend $\mathcal{A}^\pi_{\mathrm{CLevel}}[\mathcal{D}^\sigma_K]$ is taken as the definitive estimate of the learning curve, and the process of approximation is stopped. For the runs $\mathcal{C} = \{\mathcal{E}_i\}_{i \in I}$, monitoring is applied to the learning trends $\{\mathcal{A}^\pi_{\mathrm{CLevel}_i}[\mathcal{D}^\sigma_K]\}_{i \in I}$ on a finite common control sequence of levels for the training data base, which are extracted from an interval of the prediction windows $\{[\mathrm{CLevel}_i, \infty)\}_{i \in I}$ [71]. At each control level, the accuracy (Ac) and the corresponding estimated accuracy (EAc) are computed for each run using six decimal digits, though only two are shown for reasons of space and visibility.
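To make this monitoring dynamics tangible, here is a hedged sketch of the halting loop: once past the PLevel, a trend is refitted at each level and the process stops when the contribution to convergence falls below $\tau$. Approximating the layer of convergence by the shift between successive asymptotic estimates is our simplification; the formal definition is given in [71].

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    return -a * np.power(x, -b) + c

def convergence_level(xs, ys, plevel, tau):
    """Sketch of the halting loop: after PLevel, fit a trend on the
    first ell observations and stop once the (approximated) layer of
    convergence falls below the threshold tau."""
    prev_alpha = None
    for ell in range(max(plevel, 3), len(xs) + 1):
        (_, _, alpha), _ = curve_fit(power_law, xs[:ell], ys[:ell],
                                     p0=(100.0, 0.3, 99.0), method='trf')
        if prev_alpha is not None and abs(alpha - prev_alpha) < tau:
            return ell      # CLevel: the trend is taken as definitive
        prev_alpha = alpha
    return None             # convergence not reached within the corpus
```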

The performance metrics
Our aim is to assess both the reliability of our estimates and their robustness against variations in the working hypotheses. To do so, we employ two specific kinds of metrics [71].

Measuring the reliability
Here we differentiate two complementary viewpoints: quantitative and qualitative. In the first case, it is simply a matter of studying the closeness between the estimates and the actual learning curves, while in the second the objective is to determine the impact of those estimates on decision making about the performance of some models relative to others.

The quantitative perspective
A simple way of measuring the reliability from this viewpoint is through the mean absolute percentage error (MAPE) [85]. For every run $\mathcal{E}$ and level $i$ of a control sequence $S$, we first compute the percentage error (PE) as the relative difference between the estimated accuracy and the one actually observed:

$$\mathrm{PE}_i[\mathcal{E}] := \frac{\mathrm{EAc}_i[\mathcal{E}] - \mathrm{Ac}_i[\mathcal{E}]}{\mathrm{Ac}_i[\mathcal{E}]} \times 100$$

We can then express the MAPE as the arithmetic mean of the unsigned PEs [71]:

$$\mathrm{MAPE}_S[\mathcal{E}] := \frac{\sum_{i \in S} |\mathrm{PE}_i[\mathcal{E}]|}{\|S\|}$$

Intuitively, the error in the estimates made over a control sequence is, on average, proportional to the MAPE, which fulfils our requirements at this point.
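For concreteness, the two formulas above translate directly into Python; the toy accuracies below are illustrative only.

```python
def mape(ac, eac):
    """Mean absolute percentage error between observed accuracies (Ac)
    and estimated ones (EAc) over a common control sequence."""
    pes = [(e - a) / a * 100 for a, e in zip(ac, eac)]   # signed PEs
    return sum(abs(pe) for pe in pes) / len(pes)

# Illustrative control sequence, accuracies in percentage points.
print(round(mape([96.10, 96.45, 96.70], [96.05, 96.50, 96.72]), 4))
# -> 0.0415
```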

The qualitative perspective
Having fixed a collection of runs $\mathcal{H}$ working on a common corpus and a control sequence $S$, the reliability of one such run depends on the percentage of cases in which its estimates do not alter the relative position of its learning curve with respect to the rest throughout $S$. In this sense, our primary reference is the reliability estimation (RE) of two runs $\mathcal{E}, \tilde{\mathcal{E}} \in \mathcal{H}$ on $i \in S$, defined [71] as the Boolean function

$$\mathrm{RE}_i[\mathcal{E}, \tilde{\mathcal{E}}] := \big( \mathrm{sign}(\mathrm{Ac}_i[\mathcal{E}] - \mathrm{Ac}_i[\tilde{\mathcal{E}}]) = \mathrm{sign}(\mathrm{EAc}_i[\mathcal{E}] - \mathrm{EAc}_i[\tilde{\mathcal{E}}]) \big)$$

Having fixed a control level, this function verifies whether the estimates for $\mathcal{E}$ and $\tilde{\mathcal{E}}$ preserve the relative positions of the corresponding observations, and it can be extended to the control sequence through the concept of reliability estimation ratio [71].

Definition 5. (Reliability estimation ratio)
Let $\mathcal{E}$ and $\tilde{\mathcal{E}}$ be runs on a control sequence $S$. We define the reliability estimation ratio (RER) of $\mathcal{E}$ and $\tilde{\mathcal{E}}$ for $S$ as

$$\mathrm{RER}_S[\mathcal{E}, \tilde{\mathcal{E}}] := \frac{\|\{i \in S : \mathrm{RE}_i[\mathcal{E}, \tilde{\mathcal{E}}]\}\|}{\|S\|} \times 100$$

From this, we can calculate the percentage of runs in a set $\mathcal{H}$ with regard to which the estimates for a given run $\mathcal{E}$ are reliable on the whole of the control sequence $S$ considered. We denote the resulting metric as decision-making reliability [71].

Definition 6. (Decision-making reliability)
Let $\mathcal{H} = \{\mathcal{E}_k\}_{k \in K}$ and $\mathcal{E} \notin \mathcal{H}$ be a set of runs and a run, respectively, on a control sequence $S$. We define the decision-making reliability (DMR) of $\mathcal{E}$ on $\mathcal{H}$ for $S$ as

$$\mathrm{DMR}_S[\mathcal{E}, \mathcal{H}] := \frac{\|\{\mathcal{E}_k \in \mathcal{H} : \mathrm{RER}_S[\mathcal{E}, \mathcal{E}_k] = 100\}\|}{\|\mathcal{H}\|} \times 100$$
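These three metrics translate almost verbatim into code. In the sketch below, the sign-agreement test implementing RE, and the convention that ties preserve relative order, are our reading of the definitions in [71].

```python
def re(ac1, eac1, ac2, eac2):
    """Reliability estimation at one control level: do the estimates
    preserve the relative order of the observed accuracies?"""
    return (ac1 - ac2) * (eac1 - eac2) >= 0   # sign agreement, ties ok

def rer(run1, run2):
    """Reliability estimation ratio over a control sequence; each run
    is a pair of parallel lists (ac, eac)."""
    checks = [re(a1, e1, a2, e2)
              for (a1, e1), (a2, e2) in zip(zip(*run1), zip(*run2))]
    return 100.0 * sum(checks) / len(checks)

def dmr(run, others):
    """Decision-making reliability: percentage of runs in `others`
    against which `run` is reliable on the whole control sequence."""
    return 100.0 * sum(rer(run, o) == 100.0 for o in others) / len(others)

# Two illustrative runs on a three-level control sequence.
tnt      = ([95.9, 96.1, 96.2], [95.8, 96.0, 96.3])
stanford = ([96.0, 96.2, 96.4], [96.1, 96.3, 96.5])
print(rer(tnt, stanford))   # -> 100.0
```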

Measuring the robustness
Since the stability of a run $\mathcal{E}$ correlates with the degree of monotonicity in its asymptotic backbone, we measure it as the percentage of monotonic elements in the latter throughout the interval $[\mathrm{WLevel}_\mathcal{E}, \mathrm{CLevel}_\mathcal{E}]$ where the approximation performs effectively. We baptize it the robustness rate [71].

Definition 7. (Robustness rate)
Let $\mathcal{E}$ be a run with asymptotic backbone $\{\alpha_\ell\}_{\ell \in \mathbb{N}}$, and $\mathrm{CLevel}_\mathcal{E}$ and $\mathrm{WLevel}_\mathcal{E}$ its convergence and working levels, respectively. We define the robustness rate (RR) of $\mathcal{E}$ as

$$\mathrm{RR}[\mathcal{E}] := \frac{\|\mu\|}{\mathrm{CLevel}_\mathcal{E} - \mathrm{WLevel}_\mathcal{E} + 1} \times 100$$

with $\mu$ the longest monotonic subsequence of $\{\alpha_i, \; \mathrm{WLevel}_\mathcal{E} \leq i \leq \mathrm{CLevel}_\mathcal{E}\}$.
The tolerance of a run to variations in the working hypotheses is therefore greater the higher its RR, thus providing a simple criterion for checking the degree of robustness on which we can count.
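A minimal sketch of the RR computation follows, assuming $\mu$ is a longest non-decreasing (not necessarily contiguous) subsequence of the backbone segment; the direction convention is our choice, since the definition only requires monotonicity.

```python
def robustness_rate(backbone, wlevel, clevel):
    """Robustness rate: share of the backbone segment [WLevel, CLevel]
    covered by its longest non-decreasing subsequence (LIS-style DP)."""
    seg = backbone[wlevel - 1:clevel]      # levels are 1-indexed
    best = [1] * len(seg)
    for i in range(len(seg)):
        for j in range(i):
            if seg[j] <= seg[i]:           # monotonic (non-decreasing)
                best[i] = max(best[i], best[j] + 1)
    return 100.0 * max(best) / len(seg)

alphas = [98.1, 97.9, 97.4, 97.5, 97.6, 97.55, 97.58]
print(robustness_rate(alphas, wlevel=3, clevel=7))   # -> 80.0
```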

The Experiments
Within a model selection context, and focusing on the generation of POS taggers in low-resource scenarios, our goal is to provide evidence of the suitability of evaluation mechanisms based on the early estimation of learning curves. This is a non-trivial challenge because, to provide that evidence, we need to study our estimates over a significant range of observations, which is in clear contradiction with the scarcity of training data.

The linguistic resources
In order to address the issue posed, we chose to work with a case study that meets four conditions. The first is that the language considered is genuinely a resource-poor one, which should guarantee that it has remained outside the tuning phase of the development of the taggers later used in the experiments, thereby precluding any potential biases associated with the learning architecture. Second, it should have a rich morphology, thus making the training process non-trivial and therefore relevant to the tests performed. Thirdly, we should have at least one training corpus of sufficient size to study the reliability of the results obtained. Finally, that corpus should provide sufficiently low levels of convergence to allow the identification of the learning processes with the generation of viable models from a small set of training data.
We then take as case study Galician, a member of the West Iberian group of Romance languages that also includes the better-known Portuguese. It is an inflectional language with a great variety of morphological processes, particularly non-concatenative ones, derived from its Latin origin. Some of its most distinctive characteristics are [86]:

• Verbal forms with enclitic pronouns at the end, which can produce changes in the stem due to the presence of accents: deu (gave), déullelo (he/she gave it to them). The unstressed pronouns are usually suffixed and, moreover, they can easily be drawn together and are often contracted (lle + o = llo), as in váitemello buscar (go and fetch it for him (do it for me)). It is also frequent to use a so-called solidarity pronoun, such as che and vos, in order to make the listeners participants in the action. That way, forms with up to four enclitic pronouns, like perdéuchellevolo (he had lost it to him), are rather common.

• A highly complex gender inflection, including words with only one gender, such as home (man) and muller (woman), and words with the same form for both genders, such as azul (blue). Regarding words with separate forms for masculine and feminine, more than 30 variation groups are identified.

• A highly complex number inflection, with words that only occur in the singular, such as luns (Monday), and others for which only the plural form is correct, such as matemáticas (mathematics). More than a dozen variation groups are identified.
This choice limits the availability of curated corpora of sufficient size to a single candidate, Xiada [83], whose latest version (2.8) includes over 747,000 entries gathered from three different sources: general and economic news articles, and short stories. With the aim of accommodating the elaborate linguistic structure previously described, the tagset includes 460 tags, a short description of which can be found at http://corpus.cirp.gal/xiada/etiquetario/taboa.

The POS tagging systems
As already argued, we focus on models built from AL, selecting a broad range of proposals covering the most representative non-deep learning architectures, the same tested in [71] on resource-rich languages. This matching will allow us to establish, together with the subsequent identification of the parameters in the testing space, a valid reference based on the results obtained in that work:

• In the category of stochastic methods and representing the hidden Markov models (HMMs), we choose TnT [87]. We also include TreeTagger [88], a proposal that uses decision trees to generate the HMM, and Morfette [89], an averaged perceptron approach [90]. To illustrate the maximum entropy models (MEMs), we select MXPOST [91] and OpenNLP MaxEnt [92]. Finally, the Stanford POS tagger [92] combines features of HMMs and MEMs using a conditional Markov model.

• Under the heading of other approaches, we consider fnTBL [93], an update of the classic Brill tagger [94], as an example of transformation-based learning. As memory-based method we take the memory-based tagger (MBT) [95], while SVMTool [82] illustrates the behaviour of support vector machines (SVMs).
In addition, this ensures an adequate coverage of the range of learners available in the computational domain under consideration.

The testing space
Following the path drawn by the choice of ML architectures discussed above, the design of the testing space is the same as the one considered in the state-of-the-art [71] for the study of resource-rich languages, thus ensuring the reference value of the latter. Thus, in order to avoid dysfunctions resulting from sentence truncation during training, we retake the class of learning scheme then proposed, which permits us to reap the maximum benefit from the training process. Given a corpus $D$, a kernel $K \subsetneq D$ and a step function $\sigma$, we build the set of individuals $\{D_i\}_{i \in \mathbb{N}}$ as $D_i := \overline{C_i}$, where $C_i$ collects the words of $D$ up to the position associated with level $i$, and $\overline{C_i}$ denotes the minimal set of complete sentences including $C_i$.
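A hedged sketch of how such sentence-aligned individuals can be materialized is given below; the helper name and the toy corpus are illustrative.

```python
def individuals(sentences, kernel_words, step_words):
    """Sketch of the learning scheme used in the experiments: each
    individual D_i is the minimal set of whole sentences covering the
    word position of level i, so no sentence is ever truncated."""
    covers, target, taken, words = [], kernel_words, [], 0
    for sent in sentences:
        taken.append(sent)
        words += len(sent)
        while words >= target:           # level position reached
            covers.append(list(taken))   # D_i: complete sentences only
            target += step_words
    return covers

# Toy corpus of tokenized sentences with 7, 6, 5 and 9 words.
corpus = [["a"] * 7, ["b"] * 6, ["c"] * 5, ["d"] * 9]
print([sum(map(len, d)) for d in individuals(corpus, 10, 10)])
# -> [13, 27]
```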
Along the same lines, and with respect to the setting of runs, the size of the kernels is $5 \times 10^3$ words and the constant step function $5 \times 10^3$ locates the instances, which can be considered conservative values since smaller and larger ones are possible. Regarding the parameters used for estimating the prediction levels, the choice again goes to $\nu = 4 \times 10^{-5}$, $\varsigma = 1$ and $\lambda = 5$. This also holds true for the regression technique used for approximating the partial learning curves and for $\pi$, which fall on the trust region method [84] and the power law family [96], respectively.
Taking into account that real corpora are finite, we study the prediction model within their boundaries, which implies limiting the scope in measuring the layers of convergence. We then adapt the sampling window and the control levels to the size of the corpus now considered. So, if $\bar{\ell}$ denotes the position of the first sentence ending beyond the $\ell$-th word, the former comprises the interval $[\overline{5 \times 10^3}, \overline{7 \times 10^5}]$, whilst the latter are taken from control sequences in $[\overline{3 \times 10^5}, \overline{7 \times 10^5}]$. In order to confer additional stability on our measures, we apply a $k$-fold cross validation [97] to compute the samples, with $k = 10$.

Discussion
As mentioned, the experiments are studied from two complementary points of view, quantitative and qualitative, according to the performance metrics previously introduced.

The sets of runs
To illustrate the predictability of the learning curves for the Xiada corpus, we start with a collection of runs $\mathcal{C} = \{\mathcal{E}_i\}_{i \in I}$ generated from the data compiled in Table 1. The latter includes an entry for each one of the learners previously enumerated, together with its PLevel and CLevel, as well as the values for Ac and EAc along the control sequence, from which to calculate MAPEs, DMRs and RRs. In order to improve understanding, all the levels managed are indicated by their associated word positions in the corpus, which is denoted by a superscript wp in their identification labels. One detail that attracts our attention is that the run associated with TreeTagger does not reach the PLevel within the limits of the training corpus. This behavior is certainly singular among all the taggers considered, which highlights the variety of factors that impact the evaluation of learners and, in this case, leads us to discard it from our study. In other words, in a real model selection process on the Xiada corpus, TreeTagger would not even be placed among the hypotheses that allow the application of the prediction technique considered.

The quantitative study
Our reference metric is now the MAPE, whose values are shown graphically in Fig. 4 from the data compiled in Table 1. Taking into account that we are interested in numbers as small as possible, the scores range from 0.02 for Stanford to 0.35 for MXPOST in the interval $[\overline{3 \times 10^5}, \overline{7 \times 10^5}]$. Those results are illustrated in Fig. 5, showing the learning curves and the learning trends used for prediction on the runs with the best and worst MAPE on the control sequence. As before, the observations are generated considering the portion of the corpus taken from its beginning up to the word position indicated on the horizontal axis. Finally, 50% of the MAPE values in this set of runs are in the interval [0, 0.12], a proportion that reaches 75% in [0, 0.28]. Although these results are slightly worse than the ones reported in [71] for resource-rich languages, they are still very promising, which leads us to argue for the goodness of the proposal on the quantitative plane.

The qualitative study
Our reference metric is here the DMR, whose values are shown graphically in Fig. 4 from the data compiled in Table 1. Taking into account that we are now interested in scores close to 100, these range from 71.43 to 100, with 85.71% of the values in the interval [85.71, 100]. Moreover, the DMRs lower than 100 are the result of the intersection of the TnT learning curve with those of Stanford and fnTBL. Under these conditions, the maximum value would only be possible if the error in the estimate of the intersection points were lower than the distance between its neighbouring control levels, an unrealistic proposition given how short that distance is (5,000 words). In any case, the results are comparable to those reported in [71] for resource-rich languages, also meeting our expectations from a qualitative point of view.

The study of robustness
The reference metric is now the RR, and we are interested in values as close as possible to 100, the maximum. The results are shown in Fig. 4 from the data compiled in Table 1. While RR values range from 85.71 to 100, the latter is only reached in 37.50% of the runs. This percentage rises to 62.50% for RRs in the interval [90, 100]. Overall, these results even exceed those reported in [71] for resource-rich languages, illustrating once again the good performance of the prediction model, this time against variations in its working hypotheses.

Conclusions and Future Work
Our proposal arises as a response to the challenge of evaluating POS tagging models in low-resource scenarios, for which non-deep learning approaches have often proven to be better suited. For this purpose, we reuse a formally correct proposal based on the early estimation of learning curves. Technically described as the uniform convergence of a sequence of partial predictors which iteratively approximate the solution, the method acts as a proximity condition that halts the training process once a convergence/error threshold fixed by the user is reached, and it has already demonstrated its validity when the availability of large enough learning datasets is not a problem. In order to ensure the reliability of the results obtained, we have once again used the testing frame considered there, involving both quantitative and qualitative aspects, as well as the survey of robustness against possible irregularities in the learning process.
Special attention was paid to the selection of a case study combining representativeness and access to validation resources, something somewhat contradictory in the context under consideration. We thus focus on Galician, a minority language of complex morphology, for which the collection of available training resources is reduced to a single corpus of sufficient size and quality to ensure both the validation phase and a rapid convergence process. This set of unique features allows us to simulate and evaluate short training sessions in a non-trivial learning environment, associating them with a language with important deficiencies in terms of computational resources. The results corroborate the expectations raised by the theoretical basis, placing the performance at a level similar to that observed in the state-of-the-art for resource-rich languages on the same learners. This supports the effectiveness of the model selection approach considered and its suitability for low-resource scenarios, as initially argued.
To the best of our knowledge, not only is this the first time that a proposal for estimating performance based on the prediction of learning curves has demonstrated its feasibility in frameworks of this nature, but it has done so without any kind of prior specific adaptation. In other words, no operational limitations to the original conceptual design have been observed. All this justifies the interest in highlighting the independence, both in terms of language and usage, of the technology deployed. A natural way of doing this is to extend our analysis, first to a broader set of languages from a variety of language families, and then to other fundamental and applicative NLP tasks, which establishes a clear line of future work.

Figure 1 .
Figure 1. Learning curve for SVMTool on Xiada, and an accuracy pattern fitting it.

Figure 2 .
Figure 2. Learning trace for SVMTool on Xiada, with details in zoom.

Figure 3 .
Figure 3. Working and prediction levels for SVMTool on Xiada, with details in zoom.

Figure 5 .
Figure 5. Learning trends for the best and worst MAPEs.

Table 1 .
Monitoring of runs along the control sequences.