Optimizing small BERTs trained for German NER

Currently, the most widespread neural network architecture for training language models is the so-called BERT which led to improvements in various Natural Language Processing (NLP) tasks. In general, the larger the number of parameters in a BERT model, the better the results obtained in these NLP tasks. Unfortunately, the memory consumption and the training duration drastically increase with the size of these models. In this article, we investigate various training techniques of smaller BERT models: We combine different methods from other BERT variants like ALBERT, RoBERTa, and relative positional encoding. In addition, we propose two new fine-tuning modifications leading to better performance: Class-Start-End tagging and a modified form of Linear Chain Conditional Random Fields. Furthermore, we introduce Whole-Word Attention which reduces BERT's memory usage and leads to a small increase in performance compared to classical Multi-Head-Attention. We evaluate these techniques on five public German Named Entity Recognition (NER) tasks of which two are introduced by this article.


Introduction
tasks on smaller BERT models as best as possible. Since our focus is on NER, we test whether the optimizations work consistently on five different German NER tasks. Due to our project aims, we evaluated our new methods on the German language; we expect consistent behavior on similar European languages such as English, French, and Spanish. Two of the considered tasks rely on new NER datasets which we generated from existing digital text editions. Therefore, in this article, we examine which techniques are optimal to pre-train and fine-tune a BERT to solve NER tasks in German with limited resources. We investigate this on smaller BERT models with six layers that can be pre-trained on a single GPU (RTX 2080 Ti, 11 GB) within 9 days, while fine-tuning can be performed on a notebook CPU in a few hours.
We first compared different well-established pre-training techniques such as Mask Language Model (MLM), Sentence Order Prediction (SOP), and Next Sentence Prediction (NSP) on the final result of the downstream NER task. Furthermore, we investigated the influence of absolute and relative positional encoding, as well as Whole-Word Masking (WWM).
As a second step, we compared various approaches for carrying out fine-tuning, since the tagging rules cannot be learned consistently by classical fine-tuning approaches. In addition to existing approaches such as the use of Linear Chain Conditional Random Fields (LCRFs), we propose the so-called Class-Start-End (CSE) tagging and a specially modified form of LCRFs for NER which led to increased performance. Furthermore, for decoding, we introduce a simple rule-based approach, which we call the Entity-Fix rule, to further improve the results.
As already mentioned, the training of a BERT requires many resources. One of the reasons is that BERT's memory consumption grows quadratically with the sequence length when calculating the energy values (attention scores) in its attention layers, which leads to memory problems for long sequences. In this article, we propose Whole-Word Attention, a new modification of the Transformer architecture that not only reduces the number of energy values to be calculated by about a factor of two, but also yields slightly improved results.
In summary, the main goal of this article is to enable the training of efficient BERT models for German NER with limited resources. For this, the article provides different methodologies and claims the following contributions:
• We introduce and share two datasets for German NER formed from existing digital editions.
• We investigate the influence of different BERT pre-training methods, such as pre-training tasks, varying positional encoding, and adding Whole-Word Masking, on a total of five different NER datasets.
• On the same NER tasks, we investigate different approaches to perform fine-tuning. Hereby, we propose two new methods which led to performance improvements: Class-Start-End tagging and a modified form of Linear Chain Conditional Random Fields.
• We introduce a novel rule-based decoding strategy achieving further improvements.
• We propose Whole-Word Attention, a modification of the BERT architecture that reduces the memory requirements of BERT models, especially for processing long sequences, and also leads to further performance improvements.
• We share the datasets (see Section 2) and our source code with the community.
The remainder of this article is structured as follows: In Section 2 we present our datasets including the two new German NER datasets. In Section 3 we introduce the different pre-training techniques, while Section 4 describes fine-tuning. Subsequently, in Section 5, we introduce Whole-Word Attention (WWA). In all these sections we provide an overview of the existing techniques with the corresponding related work which we adopted and also introduce our novel methods. After that, Section 6 shows the conducted experiments and their results. We conclude this article with a discussion of our results and an outlook on future work.

Datasets
In this section, we list the different datasets. First, we describe the dataset used for pre-training throughout our experiments. Then, we mention the key attributes of five NER datasets for the downstream tasks.

Pre-training Data
To pre-train a BERT, a large amount of unlabeled text is necessary as input data. We collected the German Wikipedia and a web crawl of various German newspaper portals to pre-train our BERT. The dump of the German Wikipedia was preprocessed by the Wiki-Extractor [Attardi, 2015] resulting in about 6 GB of text data. In addition, we took another 2 GB of German text data from different newspaper portals crawled with the news-please framework [Hamborg et al., 2017].

NER Downstream-Datasets
We evaluated our methods on five different NER tasks. In addition to three already existing German NER datasets, namely the frequently used GermEval 2014 dataset and two NER datasets on German legal texts, we introduce two NER tasks based on existing digital editions. In the following, we describe each of the five tasks.
GermEval 2014 One of the most widespread German NER datasets is GermEval 2014 [Benikova et al., 2014] which comprises several news corpora and Wikipedia. In total, it contains about 590,000 tokens with about 41,000 entities which are tagged into four main entity classes: "person", "organisation", "location", and "other". Each main class can appear in a default, a partial, or a derived variant, resulting in 12 overall classes. In the GermEval task, entities can be tagged in two levels: outer and inner (nested entities). Since there are few inner annotations in the dataset, we restrict ourselves to evaluating the outer entities in our experiments, as is often the approach in other papers [e.g. Labusch et al., 2019, Chan et al., 2020, Riedl and Padó, 2018]. This is called the outer chunk evaluation scheme which is described in more detail by Riedl and Padó [2018].

Legal Entity Recognition
The Legal Entity Recognition (LER) dataset [Leitner et al., 2020] contains 2.15 million tokens with 54,000 manually annotated entities from German court decision documents of 2017 and 2018. The entities are divided into seven main classes and 19 subclasses which we label by Coarse-Grained (CG) and Fine-Grained (FG), respectively. The FG task (LER FG) is more difficult than the CG task (LER CG) due to its larger number of possible classes.
Digital Edition: Essays from H. Arendt We created an NER dataset based on the digital edition "Sechs Essays" by H. Arendt. It consists of 23 documents from the period 1932-1976 which are published online in [Hahn et al., 2020] as TEI files [TEI-Consortium, 2017]. In these documents, certain entities were manually tagged. Since some of the original NER tags comprised too few examples and some ambiguities (e.g., place and country), we joined several tags as shown in Table 1. Note that we removed any annotation of the class "ship" since only four instances were available.

Pre-training Techniques
In this section, we provide an overview of several common pre-training techniques for a BERT which we examined in our experiments.

Pre-training Tasks
In the original BERT [Devlin et al., 2019], pre-training is performed by simultaneously minimizing the loss of the so-called Mask Language Model (MLM) and the Next Sentence Prediction (NSP) task. The MLM task first tokenizes the text input with a subword tokenizer, then 15% of the tokens are chosen randomly. Hereby, 80% of these chosen tokens are replaced by a special mask token, 10% are replaced by a randomly chosen other token, and the remaining 10% keep the original correct token. Therefore, the goal of the MLM task is to find the original token for the 15% randomly chosen tokens, which is only possible by understanding the language and thus learning a robust language model. Since BERT should also be able to learn the semantics of different sentences within a text, NSP was additionally included. When combining NSP with MLM, the input for pre-training consists of two masked sentences which are concatenated and separated by a special separator token. In 50% of the cases, two consecutive sentences from the same text document are used, whereas in the other 50% two random sentences from different documents are selected. The goal of the NSP task is to identify which of the two variants it is.
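The MLM corruption step described above can be sketched as follows. This is a minimal illustration only: the token ids, the mask id `MASK_ID`, and the vocabulary size are invented placeholders, not values from our implementation.

```python
# Sketch of the 15% / 80-10-10 MLM corruption rule (placeholder ids).
import random

MASK_ID = 4          # id of the special mask token (assumed)
VOCAB_SIZE = 30000   # size of the subword vocabulary (assumed)

def mask_tokens(token_ids, rng, select_prob=0.15):
    """Return (corrupted sequence, MLM targets); targets are None where no loss applies."""
    corrupted, targets = [], []
    for tok in token_ids:
        if rng.random() < select_prob:          # choose 15% of the tokens
            targets.append(tok)                 # the model must recover the original token
            r = rng.random()
            if r < 0.8:                         # 80%: replace by the mask token
                corrupted.append(MASK_ID)
            elif r < 0.9:                       # 10%: replace by a random other token
                corrupted.append(rng.randrange(VOCAB_SIZE))
            else:                               # 10%: keep the original token
                corrupted.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)                # no loss on unselected tokens
    return corrupted, targets

rng = random.Random(0)
seq = list(range(100, 120))
corrupted, targets = mask_tokens(seq, rng)
```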
In the follow-up papers RoBERTa and XLNet, experiments showed that the NSP task often had no positive effect on the performance of the downstream tasks. Therefore, both papers recommended that pre-training should solely be performed by the MLM task. In the ALBERT paper [Lan et al., 2020] this was investigated in more detail. They assumed that the ineffectiveness of the NSP task was only due to its simplicity, which is why they introduced Sentence Order Prediction (SOP) as a more challenging task that aims to learn relationships between sentences similar to the NSP task: BERT always receives two consecutive sentences, but in 50% of the cases their order is flipped. The SOP task is to learn the correct order of the sentences.
In this article, we examine the influences of the different pre-training tasks (MLM, NSP, SOP) with the focus on improving the training of BERT for German NER tasks.

Absolute and Relative Positional Encoding
The original Transformer architecture [Vaswani et al., 2017] was based exclusively on attention mechanisms to process input sequences. Attention mechanisms allow every sequence element to learn relations to all other elements. By default, attention does not take into account information about the order of the elements in the sequence. But since information about the order of the input sequence elements is mandatory in almost every NLP task, the original Transformer architecture introduced the so-called absolute positional encoding: a fixed position vector $p_j \in \mathbb{R}^{d_{\text{model}}}$ was added to each embedded input sequence element $x_j$ at position $j \in \{1, \ldots, n\}$ for an input sequence of length $n$, thus $\tilde{x}_j = x_j + p_j$.
In the original approach the position vector $p_j$ is built by computing sinusoids of different wavelengths in the following way:

$$p_{j,2k} := \sin\!\left(j/10000^{2k/d_{\text{model}}}\right), \qquad p_{j,2k+1} := \cos\!\left(j/10000^{2k/d_{\text{model}}}\right).$$

While the experiments in [Vaswani et al., 2017] showed great results, the disadvantage of absolute positional encoding is that the performance is significantly reduced when the models are applied to sequences longer than those on which they were trained, because the respective position vectors were not seen during training. Therefore, in [Rosendahl et al., 2019] other variants for positional encoding were investigated and compared on translation tasks. The most promising approach was relative positional encoding [Shaw et al., 2019]: a trainable distance information $d^K_{j-i}$ is added in the attention layer when computing the energy $e_{i,j}$ of the $i$th sequence element to the $j$th one. Thus, if $x_i$ and $x_j$ are the $i$th and $j$th input elements of a sequence in an attention layer, instead of multiplying just the query vector $W^Q x_i$ with the key vector $W^K x_j$, one adds the trainable distance information $d^K_{j-i}$ to the key vector, resulting in

$$e_{i,j} = \frac{(W^Q x_i)^\top (W^K x_j + d^K_{j-i})}{\sqrt{d_k}}, \qquad (1)$$

where $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$. In addition, when multiplying the energy (after applying softmax) with the values, another trainable distance information $d^V_{j-i}$ is added. Finally, the output $y_i$ for the $i$th sequence element of a sequence of length $n$ with relative positional encoding is computed by

$$y_i = \sum_{j=1}^{n} \operatorname{softmax}(e_i)_j \left(W^V x_j + d^V_{j-i}\right), \qquad (2)$$

where $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$. To train $d^K_{j-i}$ and $d^V_{j-i}$, a hyperparameter $\tau$ (called the clipping distance), the trainable embeddings $r^K_{-\tau}, \ldots, r^K_{\tau} \in \mathbb{R}^{d_k}$, and $r^V_{-\tau}, \ldots, r^V_{\tau} \in \mathbb{R}^{d_v}$ are introduced. These embeddings are used to define the distance terms $d^K_{j-i}$ and $d^V_{j-i}$, where distances longer than the clipping distance $\tau$ are represented by $r_\tau$ or $r_{-\tau}$, thus:

$$d^K_{j-i} := r^K_{\max(-\tau,\, \min(\tau,\, j-i))}, \qquad (3)$$
$$d^V_{j-i} := r^V_{\max(-\tau,\, \min(\tau,\, j-i))}. \qquad (4)$$

Rosendahl et al. [2019] already showed that relative positional encoding suffers less from the disadvantage of absolute positional encoding on unseen sequence lengths. In this article, we examine the influence of these two variants of positional encoding during the training of German BERT models.
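For illustration, the following minimal single-head sketch computes relative-positional-encoding energies with a clipping distance tau. All weight matrices, dimensions, and inputs are random placeholders (assumptions for the example), not trained parameters.

```python
# Sketch: energies e_{i,j} with a clipped, trainable distance embedding added
# to the key vector (single head, placeholder weights).
import math
import random

rng = random.Random(0)
d_model, d_k, n, tau = 8, 4, 6, 2

def rand_mat(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W_Q, W_K = rand_mat(d_k, d_model), rand_mat(d_k, d_model)
# one trainable embedding r_delta per clipped distance in [-tau, tau]
r_K = {delta: [rng.uniform(-1, 1) for _ in range(d_k)] for delta in range(-tau, tau + 1)}
X = rand_mat(n, d_model)                       # placeholder input sequence

def clip(delta, tau):
    return max(-tau, min(tau, delta))

def energy(i, j):
    q = matvec(W_Q, X[i])
    k = matvec(W_K, X[j])
    d = r_K[clip(j - i, tau)]                  # distances beyond tau share r_{+-tau}
    return sum(qh * (kh + dh) for qh, kh, dh in zip(q, k, d)) / math.sqrt(d_k)

E = [[energy(i, j) for j in range(n)] for i in range(n)]
```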

Whole-Word Masking (WWM)
Whole-Word Masking (WWM) is a small modification of the Mask Language Model (MLM) task described in section 3.1. In contrast to the classic MLM task, WWM does not mask token-wisely but instead word-wisely. This means that in all cases either all tokens belonging to a word are masked or none of them. Recent work of Chan et al. [2020], Cui et al. [2019] already showed the positive effect of WWM in pre-training on the performance of the downstream task. In this article, we also examine the differences between the original MLM task and the MLM task with WWM.
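The word-wise masking rule can be sketched as follows; the tokens, the word grouping, and the leading "_" marker for continuation tokens are purely illustrative assumptions.

```python
# Sketch of Whole-Word Masking: the masking decision is drawn per word and
# then applied to every subword token of that word.
import random

def whole_word_mask(tokens, word_ids, rng, select_prob=0.15):
    """word_ids[i] gives the index of the word that token i belongs to."""
    words = sorted(set(word_ids))
    masked_words = {w for w in words if rng.random() < select_prob}
    # either all tokens of a word are masked, or none of them
    return ["[MASK]" if word_ids[i] in masked_words else tok
            for i, tok in enumerate(tokens)]

tokens   = ["Peter", "lebt", "in", "Frank", "_furt", "am", "Main"]
word_ids = [0,       1,      2,    3,       3,       4,    5]
out = whole_word_mask(tokens, word_ids, random.Random(3))
```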

Fine-tuning Techniques for NER
The task of NER is to detect entities, such as persons or places, which possibly consist of several words within a text. As proposed in [Devlin et al., 2019], the traditional approach for fine-tuning a BERT to a classification task like NER is to attach an additional feed-forward layer to a pre-trained BERT which predicts token-wise labels. In order to preserve information about the grouping of tokens into entities, Inside-Outside-Beginning (IOB) tagging [Ramshaw and Marcus, 1999] is usually applied. IOB tagging introduces two versions of each entity class, one marking the beginning of the entity and one representing the interior of an entity, and an "other" class, which altogether results in a total of γ = 2e + 1 tag classes where e is the number of entity classes. Table 3 shows an example in which the beginning token of an entity is prefixed with a "B-" and all other tokens with an "I-". Table 3: IOB tagging example with unlabeled words (O) and the two entities "location" (Loc) and "person" (Per). The first tag of each entity is prefixed with "B-", while all following tokens of that entity are marked with an "I-". The first row contains the words of the sentence which are split into one or more tokens (second row). The third row shows the tagged tokens based on the given entities (last row). The example sentence can be translated as "Peter lives in Frankfurt am Main".

In compliance with the standard evaluation scheme of NER tasks in [Sang and De Meulder, 2003], we compute an entity-wise F1 score denoted by E-F1. Instead of computing a token- or word-wise F1 score, E-F1 evaluates a complete entity as true positive only if all tokens belonging to the entity are correct. Our implementation of E-F1 relies on the widely used Python library seqeval [Nakayama, 2018].
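The entity-wise evaluation can be sketched as follows. This is a simplified, self-contained re-implementation for illustration only (our experiments use seqeval itself); the function names and the example tags are invented.

```python
# Sketch of the entity-wise F1 scheme: an entity counts as true positive only
# if its class and its complete token span match the ground truth.
def extract_entities(tags):
    """Collect (class, start, end) spans from an IOB tag sequence."""
    entities, start, cls = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last entity
        if start is not None and tag != "I-" + cls:
            entities.append((cls, start, i - 1))
            start, cls = None, None
        if tag.startswith("B-"):
            start, cls = i, tag[2:]
    return set(entities)

def entity_f1(true_tags, pred_tags):
    t, p = extract_entities(true_tags), extract_entities(pred_tags)
    tp = len(t & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(t)
    return 2 * precision * recall / (precision + recall)

truth = ["B-Per", "O", "O", "B-Loc", "I-Loc", "I-Loc", "O"]
pred  = ["B-Per", "O", "O", "B-Loc", "I-Loc", "O",     "O"]
score = entity_f1(truth, pred)   # only the "Per" entity matches exactly
```

Note that the truncated "Loc" span counts as a full error even though most of its tokens are correct, which is exactly why inconsistent or shortened entities hurt the E-F1 score so strongly.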
Usually, IOB tagging is trained by a token-wise softmax cross-entropy loss. However, this setup of one feed-forward layer and a cross-entropy loss does not take into account the context of the tokens forming an entity. In the following, we will call this default approach of fine-tuning the BERT Default-Fine-Tuning. It can lead to inconsistent tagging (for example, an inner tag may only be preceded by an inner or beginning tag of the same entity), which has a devastating impact on the E-F1 score. Therefore, we propose and compare three modified strategies that include context to prevent inconsistent NER tagging during training or decoding. The first approach is a modification of the IOB tagging, the second proposal uses Linear Chain Conditional Random Fields (LCRFs), and the last applies rules to fix a predicted tagging.
Most papers on BERT models dealing with German NER, for example [Chan et al., 2020] or [Labusch et al., 2019], do not focus on an investigation of different variants for fine-tuning. However, there are already studies for NER tasks in other languages [e.g. Luoma and Pyysalo, 2020, Souza et al., 2020] which show that the application of LCRFs can be beneficial for fine-tuning. Souza et al. [2020] also investigated whether it is advantageous for the fine-tuning of BERT models on NER tasks to combine the pre-trained BERT models with LSTM layers. However, these experiments did not prove successful.

Fine-tuning with CSE tagging
In this section, we propose an alternative to the IOB tagging which we call Class-Start-End (CSE) tagging. The main idea is to split the task into three objectives as shown in Table 4: finding start and end tokens, and learning the correct class.
CSE appends two additional dense layers with scalar outputs and logistic-sigmoid activation to the last BERT layer, one for the start probability $p^{\text{start}}$ and one for the end probability $p^{\text{end}}$ of a token. In summary, the complete output for an input sample consisting of $n$ tokens is $(p^{\text{start}}_1, p^{\text{end}}_1, y_1), (p^{\text{start}}_2, p^{\text{end}}_2, y_2), \ldots, (p^{\text{start}}_n, p^{\text{end}}_n, y_n) \in \mathbb{R}^{n \times (2+e+1)}$ where $e + 1$ is the number of possible entities plus the "other" class.

The objective for $y_i$ is trained with softmax cross-entropy as before but without the distinction between B- and I-, while the start and end outputs contribute extra losses $J^{\text{start}}$ and $J^{\text{end}}$:

$$J^{\text{start}} = -\frac{1}{n}\sum_{i=1}^{n}\left(t^{\text{start}}_i \log p^{\text{start}}_i + \left(1 - t^{\text{start}}_i\right)\log\left(1 - p^{\text{start}}_i\right)\right), \qquad (5)$$

where $t^{\text{start}}$ and $p^{\text{start}}$ are the target and prediction vectors for start as shown in Table 4. $J^{\text{end}}$ is defined analogously.
Converting the CSE into IOB tagging is realized by accepting tokens which exceed the threshold of 0.5 as start or end markers. If an end marker is missing between two start markers, the position of the highest end probability between the two locations is used as an additional end marker. This approach is applied analogously in reverse for missing start markers. Finally, all class probabilities between each start and end marker pair (including start and end) are averaged to obtain the entity class. In conclusion, an inconsistent tagging is impossible.

Fine-tuning with Linear Chain Conditional Random Field with NER-Rule (LCRF NER )
Another approach to tackle inconsistent IOB tagging during fine-tuning of a BERT is based on Linear Chain Conditional Random Fields (LCRFs), which are a modification of Conditional Random Fields, both proposed in [Lafferty et al., 2001]. LCRFs are a common approach to train neural networks that model a sequential task and are therefore well suited for fine-tuning on NER. The basic idea is to take into account the classification of the neighboring sequence members when classifying an element of a sequence.
The output $Y = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^{n \times \gamma}$ of our neural network for the NER task consists of a sequence of $n$ vectors whose dimension corresponds to the number of classes $\gamma \in \mathbb{N}$. LCRFs introduce so-called transition values $T$ which are a matrix $W^T$ of trainable weights, in the basic approach: $T := W^T \in \mathbb{R}^{\gamma \times \gamma}$. An entry $T_{i,j}$ of this matrix $T$ can be seen as the potential that a tag of class $i$ is followed by a tag of class $j$. In one of the easiest forms of LCRFs, which we choose, decoding aims to find the sequence $C^p := (c^p_1, c^p_2, \ldots, c^p_n) \in \{1, 2, \ldots, \gamma\}^n$ with the highest sum of corresponding transition values and elements of the corresponding output vectors as shown in eq. (6):

$$C^p = \operatorname*{argmax}_{c_1, \ldots, c_n} \; \sum_{i=1}^{n} y_{i,c_i} + \sum_{i=1}^{n-1} T_{c_i, c_{i+1}}. \qquad (6)$$
Eq. (6) is efficiently solved by the Viterbi algorithm [see e.g. Sutton and McCallum, 2010]. During training, a log-likelihood loss is calculated that takes into account the transition values $T$ and the network output $Y$. Sutton and McCallum [2010] provide a detailed description of its implementation.
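A minimal Viterbi decoder for eq. (6) can be sketched as follows; the scores and transition values below are invented for illustration.

```python
# Sketch: find the class sequence maximizing the sum of network outputs
# y[i][c] and transition values T[b][c] (eq. (6)) by dynamic programming.
def viterbi(y, T):
    n, gamma = len(y), len(y[0])
    score = list(y[0])                     # best score of a path ending in class c at step 0
    back = []                              # back-pointers for path recovery
    for i in range(1, n):
        prev, row_score, row_back = score, [], []
        for c in range(gamma):
            b = max(range(gamma), key=lambda k: prev[k] + T[k][c])
            row_score.append(prev[b] + T[b][c] + y[i][c])
            row_back.append(b)
        score, back = row_score, back + [row_back]
    c = max(range(gamma), key=lambda k: score[k])
    path = [c]
    for ptrs in reversed(back):            # follow the back-pointers from the end
        c = ptrs[c]
        path.append(c)
    return path[::-1]

# two classes; strongly negative transitions discourage class changes
y = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
T = [[0.0, -3.0], [-3.0, 0.0]]
path = viterbi(y, T)
```

Here the transition penalties outweigh the per-step scores, so the decoded path stays in class 0 although the raw network output would prefer class 1 at the middle step.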
Since the IOB tagging does not allow all possible transitions, Lester et al. [2020] tried to ban these forbidden transitions completely by assigning fixed non-trainable high negative values to the associated entries in $T$. However, this did not lead to any improvement in performance, but they were able to show that it allows fine-tuning to converge faster when switching from the classic IOB tagging to the more detailed IOBES tagging scheme [Lester et al., 2020]. In contrast to them, we extend the original LCRF approach by explicitly modeling these forbidden transitions with additional trainable weights when computing the transition values $T$. In the following, we call our adapted algorithm LCRF NER.
Assume an NER task comprises the set of entities $X_1, X_2, \ldots, X_e$ which results in $\gamma = 2e + 1$ classes following the IOB tagging scheme. Thus, besides a label O for unlabeled elements, for each entity $X_i$ there is a begin label B-$X_i$ and an inner label I-$X_i$. For simplicity, we order these classes as B-$X_1$, ..., B-$X_e$, I-$X_1$, ..., I-$X_e$, O. With respect to this ordering, we introduce the matrix $F \in \{0, 1\}^{\gamma \times \gamma}$ of all forbidden transitions as

$$F_{i,j} := \begin{cases} 1 & \text{if } e < j \le 2e \text{ and } i \notin \{j - e,\; j\}, \\ 0 & \text{otherwise.} \end{cases}$$

Thus, an element $F_{i,j}$ is 1 if and only if a tag of class $j$ cannot follow a tag of class $i$ in the given NER task. This maps the constraint that the predecessor of an inner tag I-$X$ can only be the same inner tag I-$X$ or the corresponding begin tag B-$X$.
In Figure 1 we illustrate the definition of F.

Figure 1: Example for the definition of the matrix F of all forbidden transitions for two entities X 1, X 2. Following the IOB tagging scheme, red arrows mark forbidden transitions between two sequence elements that lead to an entry 1 in F.
Likewise, we define the matrix $A \in \{0, 1\}^{\gamma \times \gamma}$ by $A_{i,j} = 1 - F_{i,j}$ as the matrix of all allowed tag transitions.
LCRF NER introduces two additional trainable weights $\omega^F_{\text{factor}}, \omega^F_{\text{absolute}} \in \mathbb{R}$ besides the weights $W^T$ and constructs $T$ by

$$T := W^T \odot \left(A + \omega^F_{\text{factor}} \cdot F\right) + \omega^F_{\text{absolute}} \cdot F, \qquad (7)$$

where $\odot$ is the point-wise product. Setting $\omega^F_{\text{factor}} = 1$ and $\omega^F_{\text{absolute}} = 0$ yields the original LCRF approach. In this way, the model can learn an absolute penalty by $\omega^F_{\text{absolute}}$ and a relative penalty by $\omega^F_{\text{factor}}$ for forbidden transitions. Note that LCRF NER is mathematically equivalent to LCRF; its only purpose is to simplify and stabilize the training.
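The construction of F, A, and the LCRF NER transition values can be sketched as follows (0-based class indices with the ordering B-X_1..B-X_e, I-X_1..I-X_e, O; the weights are placeholders):

```python
# Sketch: forbidden-transition mask F, allowed mask A = 1 - F, and the
# LCRF_NER transition matrix T = W_T * (A + w_factor*F) + w_absolute*F.
def forbidden_matrix(e):
    gamma = 2 * e + 1
    F = [[0] * gamma for _ in range(gamma)]
    for j in range(e, 2 * e):                 # column j is the inner label I-X_{j-e+1}
        for i in range(gamma):
            if i not in (j - e, j):           # only B-X or I-X of the same entity may precede
                F[i][j] = 1
    return F

def transitions(W_T, F, w_factor, w_absolute):
    gamma = len(F)
    return [[W_T[i][j] * ((1 - F[i][j]) + w_factor * F[i][j]) + w_absolute * F[i][j]
             for j in range(gamma)] for i in range(gamma)]

e = 2
F = forbidden_matrix(e)
W_T = [[0.5] * 5 for _ in range(5)]           # placeholder trainable weights
T_default = transitions(W_T, F, 1.0, 0.0)     # w_factor=1, w_absolute=0 -> plain LCRF
T_penalized = transitions(W_T, F, 0.5, -2.0)  # forbidden entries are scaled and shifted down
```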

Decoding with Entity-Fix Rule
Finally, we propose a rule-based approach to resolve inconsistent IOB tagging, which can occur, for example, if an I-X tag follows a token that is tagged neither I-X nor B-X (for any possible entity class X). Our so-called Entity-Fix rule replaces forbidden I-X tags with the tag of the previous token. If the previous token has a B-X tag, the inserted token is converted to the corresponding I-X tag. In the special case where an I-X tag is predicted at the start of the sequence, it is converted to B-X of the same class. See Table 5 for an example. The advantage of this approach is that it can be applied as a post-processing step independent of training. Furthermore, since only tokens which already form an incorrect entity are affected by this rule, the E-F1 score can never decrease by applying it. Note that this does not necessarily hold for the token-wise F1 score, though. Table 5: Example for the Entity-Fix rule. Rows refer to the tokens, their respective target, the prediction, and the prediction resulting from decoding with the Entity-Fix rule. Changes are emphasized in bold.
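The Entity-Fix rule can be sketched directly; the example prediction below is invented for illustration.

```python
# Sketch of the Entity-Fix rule: a forbidden I-X tag is replaced by its
# predecessor's tag (a B- prefix is converted to I-), and an I-X at the
# start of the sequence becomes B-X of the same class.
def entity_fix(tags):
    fixed = []
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            cls = tag[2:]
            if i == 0:
                tag = "B-" + cls                       # I-X at sequence start -> B-X
            elif fixed[-1] not in ("I-" + cls, "B-" + cls):
                prev = fixed[-1]                       # forbidden: copy the previous tag
                tag = "I-" + prev[2:] if prev.startswith("B-") else prev
        fixed.append(tag)
    return fixed

pred = ["I-Per", "O", "I-Loc", "B-Loc", "I-Per", "I-Loc"]
out = entity_fix(pred)
```

Since the rule only rewrites tokens that already violate the IOB constraints, every entity that was predicted consistently is left untouched, which is why the E-F1 score cannot decrease.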

Whole-Word Attention

In this section, we describe our proposed word-wise attention layers used by some of our BERT models during pre-training and fine-tuning. This Whole-Word Attention (WWA) was inspired by the benefits of Whole-Word Masking (WWM). It comprises two components: the first one, called mha wwa, applies traditional multi-head attention on words instead of tokens, while the second component is a windowed attention module called mha wind.
Traditional Approach In contrast to current NLP network architectures, earlier approaches for text representation [e.g. Mikolov et al., 2013] did not apply a tokenizer to break down each word of a sentence into possibly more than one token. Instead, they trained representations for a fixed vocabulary of words. The major drawback was that this required a large vocabulary and out-of-vocabulary words could not be represented. Modern approaches tokenize words by a vocabulary of subwords which allows composing unknown words from known tokens. However, when combined with Transformers, attention is computed between pairs of tokens. As a consequence, the number of energy values (see eq. (1)) to be calculated increases quadratically with the sequence length, resulting in a large increase of memory and computation time for long sequences.
There exist different approaches to tackle this problem. The most prominent ones are BigBird [Zaheer et al., 2020] and Longformer [Beltagy et al., 2020]. In their work, the focus is on pure sparse attention strategies: instead of full attention, they try to omit as many calculations of energy values as possible while losing as little performance as possible. Instead, we propose to rejoin tokens into word-based units, which also has a quadratic dependence on the sequence length, but with a lower slope.
Our Methodology The purpose of the first module, mha wwa , is to map tokens back to words and then to compute a word-wise attention. However, since mha wwa loses information about the order of tokens within a word, we introduce mha wind as additional component which acts on the original tokens. mha wind scales linearly with the sequence length since only a window of tokens is taken into account when computing the energy vectors. In summary, mha wwa learns the global coarser dependence of words whereas mha wind allows to resolve and learn relations of tokens but only in a limited range. In the following, we first describe mha wwa and then mha wind .
Let $T$ denote the input of our BERT model, which is a part of text and can thus be seen as a sequence of words $T = (w_1, w_2, \ldots, w_m)$ with $m \in \mathbb{N}$. Similar to a classical BERT, a tokenizer $\mathcal{T}$ transforms $T$ into a sequence of tokens $\mathcal{T}(T) =: t = (t_1, t_2, \ldots, t_n) \in \mathbb{N}^n$ with $m \le n$, because we only consider traditional tokenizers that encode the text word-wisely by decomposing a word into one or more tokens. Such a tokenizer provides a mapping function $F_{T,\mathcal{T}} : \{1, 2, \ldots, n\} \to \{1, 2, \ldots, m\}$ which uniquely maps an index $i$ of the token sequence $t$ to the index $j$ of its respective word $w_j$.
Each encoder layer of the classical BERT architecture contains a multi-head attention layer mha which maps its input sequence $X_{T,\mathcal{T}} = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{n \times d}$ to an output $Y_{T,\mathcal{T}}$ of equal length $n$ and dimension $d$: $\text{mha}(X_{T,\mathcal{T}}) = Y_{T,\mathcal{T}} = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^{n \times d}$, where the $i$th output vector $y_i$ is defined as the concatenation of the resulting vectors for every attention head computed by eq. (2). Our mha wwa layer modifies this by applying attention only on the sequence $\hat{X}_{T,\mathcal{T}} = (\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_m) \in \mathbb{R}^{m \times d}$, where

$$\hat{x}_j := \frac{1}{|\{i : F_{T,\mathcal{T}}(i) = j\}|} \sum_{i : F_{T,\mathcal{T}}(i) = j} x_i \qquad (8)$$

and $\{i : F_{T,\mathcal{T}}(i) = j\}$ is the set of all tokens $i$ belonging to the word $j$. In other words, we average the corresponding token input vectors for each word. Next, we apply mha on $\hat{X}_{T,\mathcal{T}}$, yielding the output $\text{mha}(\hat{X}_{T,\mathcal{T}}) =: \hat{Y}_{T,\mathcal{T}} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m) \in \mathbb{R}^{m \times d}$, which is a sequence of length $m$ only. Finally, to again obtain a sequence of length $n$, we transform the output sequence back to length $n$ by repeating the output vector for each word according to the number of associated tokens. Thus, the final output of a layer mha wwa is defined as $\text{mha}_{\text{wwa}}(X_{T,\mathcal{T}}) := Z_{T,\mathcal{T}} = (z_1, z_2, \ldots, z_n) \in \mathbb{R}^{n \times d}$ where $z_i := \hat{y}_{F_{T,\mathcal{T}}(i)}$. See Figure 2 for an illustration of the concept described above.
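The token-to-word pooling of eq. (8) and the re-expansion to token length can be sketched as follows; the attention call on the pooled sequence is stubbed out, and the vectors are invented for illustration.

```python
# Sketch of mha_wwa's reshaping: average token vectors per word (eq. (8)),
# run attention on the shorter word sequence (omitted here), then repeat
# each word's output vector for all of its tokens.
def pool_to_words(X, word_of_token):
    m = max(word_of_token) + 1
    d = len(X[0])
    sums = [[0.0] * d for _ in range(m)]
    counts = [0] * m
    for x, j in zip(X, word_of_token):
        counts[j] += 1
        for k in range(d):
            sums[j][k] += x[k]
    return [[s / c for s in row] for row, c in zip(sums, counts)]

def expand_to_tokens(Y_words, word_of_token):
    return [Y_words[j] for j in word_of_token]   # z_i = y_hat_{F(i)}

# "Frank" + "_furt" form one word -> their vectors are averaged
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
word_of_token = [0, 1, 1]
X_hat = pool_to_words(X, word_of_token)          # length m = 2
Z = expand_to_tokens(X_hat, word_of_token)       # back to length n = 3
```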
We perform positional encoding by utilizing relative positional encoding, because absolute positional encoding adds the positional vectors to the token sequence directly after the embedding, which is not compatible with WWA of eq. (8). Instead, relative positional encoding adds a word-wise relative positional encoding to the vectors of the word-wise sequence within $\text{mha}(\hat{X}_{T,\mathcal{T}})$.

Figure 2: Attention is applied on words instead of tokens. Orange: members of sequences whose length is the number of tokens $n$. Red: members of sequences whose length is the number of words $m$. For a better overview we define $F_j := |\{i : F_{T,\mathcal{T}}(i) = j\}|$ as the number of tokens the $j$th word consists of.
As an experiment, we also pre-trained a BERT which solely uses mha wwa layers for attention. However, it was already apparent during pre-training that it could only achieve a very low Mask Language Model (MLM) accuracy. The main reason for this is that $Z_{T,\mathcal{T}}$ does not take into account any information about the position of the tokens within a word: since the output vectors of all tokens of a word are equal, these tokens are no longer related to each other via attention.
To tackle this problem, we introduce for each encoder layer a second multi-head attention layer mha wind based on windowed attention as used in [Beltagy et al., 2020, Zaheer et al., 2020]. Because the sole purpose of mha wind is to map the relationships and positions of the tokens within a word in the model, we use a very small sliding window size of $\omega = 5$ tokens in each direction. Hence, in contrast to Beltagy et al. [2020], Zaheer et al. [2020], we also do not arrange our input sequence into blocks or chunks.
Formally, we define mha wind as $\text{mha}_{\text{wind}}(X_{T,\mathcal{T}}) := Y_{T,\mathcal{T}} = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^{n \times d}$, where the $i$th output vector $y_i$ is the concatenation of the resulting head vectors $y^h_i$ for every attention head $h$, each computed as in eqs. (1) and (2) with the distance terms of eqs. (3) and (4), but with $j$ restricted to the window $|j - i| \le \omega$. In summary, in our BERT model using WWA, the total output of the attention layer of an encoder layer is, after adding the input sequence $X_{T,\mathcal{T}}$ as residual, $X_{T,\mathcal{T}} + \text{mha}_{\text{wwa}}(X_{T,\mathcal{T}}) + \text{mha}_{\text{wind}}(X_{T,\mathcal{T}})$, compared to the original approach $X_{T,\mathcal{T}} + \text{mha}(X_{T,\mathcal{T}})$. mha wind introduces additional trainable variables compared to the traditional Transformer architecture. However, its number of energy values increases only linearly with the sequence length, by a factor $2\omega + 1$ with respect to the window size $\omega$. Hence, the impact on the memory can be neglected for long sequences and small $\omega$. In order to quantify the overall reduction of memory consumption using WWA, we provide an example using our tokenizer built on the German Wikipedia, which transforms on average a sequence of $m$ words into $n \approx 1.5m = \frac{3}{2}m$ tokens. Therefore, while the traditional multi-head attention layer calculates $n^2$ energy values per head, our WWA approach only requires

$$\underbrace{m^2}_{\text{mha}_{\text{wwa}}} + \underbrace{(2\omega + 1)\,n}_{\text{mha}_{\text{wind}}} \approx \frac{4}{9}n^2 + (2\omega + 1)\,n \qquad (9)$$

energy values. Therefore, for large $n$ (and small $\omega$), the number of energy values to be calculated is more than halved.

Table 6: Columns refer to the average E-F1 score [cf. Sang and De Meulder, 2003] of three fine-tuning runs with Default-Fine-Tuning (see Section 4) for five datasets and its standard deviation σ multiplied by 100. Rows refer to the respective pre-training task, absolute or relative positional encoding (PE), and use of Whole-Word Masking (WWM); best results within 2·σ of the maximum are emphasized.
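The memory estimate of eq. (9) can be checked with a small back-of-the-envelope calculation; the sequence length and the 1.5 tokens-per-word ratio are taken from the example above.

```python
# Sketch: energy values per head for full token attention (n^2) versus WWA
# (m^2 + (2*omega + 1)*n) with n ~ 1.5 m, following eq. (9).
def energies_full(n):
    return n * n

def energies_wwa(n, omega, tokens_per_word=1.5):
    m = n / tokens_per_word                   # number of words for n tokens
    return m * m + (2 * omega + 1) * n

n, omega = 3000, 5
ratio = energies_wwa(n, omega) / energies_full(n)   # well below 0.5 for large n
```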

Experiments
To evaluate our proposed methods for German NER tasks, we conducted several experiments. First, we compared the pre-training variants presented in Section 3 and then fine-tuning techniques presented in Section 4. Afterwards, we applied Whole-Word Attention (WWA) on the overall training of the NER task. Finally, we discuss our results in relation to the state-of-the-art models of the LER and GermEval tasks.

Comparing Pre-training Techniques
In our first experiments, we investigated which of the known pre-training techniques (see Section 3) yields the best models for a subsequent fine-tuning on German NER tasks. For this purpose, we combined the three presented pre-training tasks (Mask Language Model (MLM), MLM with Sentence Order Prediction (SOP), and MLM with Next Sentence Prediction (NSP)) with relative and absolute positional encoding and optionally enabled Whole-Word Masking (WWM). For each resulting combination, we pre-trained a (small) BERT with a hidden size of 512, 8 attention heads, and 6 layers for 500 epochs with 100,000 samples per epoch. Pre-training was performed with a batch size of 48 and a maximal sequence length of 320 tokens, which is limited by the 11 GB memory of one GPU (RTX 2080 Ti). Each of the 12 resulting BERTs was then fine-tuned using the Default-Fine-Tuning approach (described in Section 4) on the five German NER tasks described in Section 2.2. For fine-tuning, we chose a batch size of 16 and trained by default for 30 epochs with 5,000 samples per epoch. The number of epochs was increased to 50 for the larger LER tasks. Each fine-tuning run was performed three times; the average result is reported in Table 6 together with its standard deviation σ.
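The grid of pre-training configurations described above can be written out explicitly. A sketch (the dictionary keys and labels are illustrative, not identifiers from the paper):

```python
from itertools import product

pretrain_tasks = ["MLM", "MLM+SOP", "MLM+NSP"]
positional_encodings = ["absolute", "relative"]
whole_word_masking = [False, True]

# 3 tasks x 2 encodings x 2 masking modes = 12 BERTs to pre-train
configs = [
    {"task": t, "pe": pe, "wwm": wwm}
    for t, pe, wwm in product(pretrain_tasks, positional_encodings, whole_word_masking)
]
print(len(configs))  # 12
```

Each of these twelve configurations corresponds to one row of Table 6.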
The first observation from Table 6 is that, as expected, the standard deviation for NER tasks with a smaller ground truth (mainly Sturm, but also H. Arendt) is higher than for NER tasks with a larger ground truth (GermEval, both LER variants). The fluctuations between the different fine-tuning runs are nevertheless within a reasonable range. The results in Table 6 reveal that the best performing BERTs for NER tasks used relative positional encoding and WWM, and were pre-trained solely with MLM. This is plausible since the samples that make up the academic NER datasets are only single sentences; hence, using additional sentence-spanning losses (SOP or NSP) during pre-training is even harmful for these downstream tasks.
Furthermore, our experiments show that relative positional encoding yields significantly better results than absolute positional encoding. This is an interesting observation since the experiments of Rosendahl et al. [2019] found that relative positional encoding only performs better when applied to sequences longer than those on which the network was trained. To investigate this in detail, we analyzed the E-F1 score as a function of the sequence length. We sorted the samples in the test set of each dataset by token length in increasing order and then split them into seven parts with an equal number of samples. The left chart of Figure 3 shows the averaged E-F1 score of the five datasets for each part. We observe that relative encoding (rel) outperforms absolute encoding (abs) for almost any sequence length. For long sequences (last part), the gap between the two approaches increases, which is expected. The right chart of Figure 3 shows the E-F1 score for each part of only the H. Arendt dataset. For this dataset, the discrepancy between relative and absolute positional encoding is very small, which shows that the benefit of relative positional encoding highly depends on the dataset. Nevertheless, on average, our experiments suggest relative positional encoding as the method of choice for German NER tasks.
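The bucketing used for this length analysis can be sketched in a few lines (a toy illustration with hypothetical data, not the paper's evaluation code):

```python
def score_by_length(token_lengths, scores, parts=7):
    # sort per-sample scores by token length, then split into `parts`
    # consecutive buckets of (nearly) equal size and average each bucket
    ordered = [s for _, s in sorted(zip(token_lengths, scores))]
    n = len(ordered)
    bounds = [round(i * n / parts) for i in range(parts + 1)]
    return [sum(ordered[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]

# toy data: 14 samples with increasing length and slowly degrading scores
lengths = list(range(10, 150, 10))
scores = [0.9, 0.88, 0.91, 0.87, 0.9, 0.86, 0.85,
          0.84, 0.86, 0.83, 0.82, 0.8, 0.81, 0.78]
print(score_by_length(lengths, scores))  # 7 bucket means, 2 samples each
```

Plotting such bucket means per encoding variant yields curves like those in Figure 3.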
Furthermore, we observe that WWM together with relative positional encoding led to significant improvements when combined with MLM with or without SOP as pre-training task. In summary, our best setup combined solely MLM, relative positional encoding, and WWM.

Comparing Fine-tuning Techniques
In this section, we examine our four different variants of fine-tuning: Default-Fine-Tuning (see Section 4), Class-Start-End (CSE) (see Section 4.1), and Linear Chain Conditional Random Field (LCRF) or LCRF NER (see Section 4.2). The new fine-tuning techniques CSE and LCRF NER were specifically designed to help the network learn the structure of IOB tagging.
First, we evaluated the impact of the fine-tuning method as a function of the pre-training techniques examined in Section 6.1. For this purpose, we fine-tuned all BERT models, in analogy to Section 6.1, on all five NER tasks using the four fine-tuning methods mentioned. The individual results shown in Table 7 are averaged across the five tasks, whereby the outcome of each task is the mean of three runs.
Regardless of the choice of the fine-tuning method, the results confirm that BERTs that were pre-trained solely with MLM, use relative positional encoding, and use WWM are the most suitable for German NER tasks. Therefore, we will only consider this pre-training setup in the following.
Furthermore, the results show that LCRF NER consistently outperforms LCRF. We think that the introduction of the new weights $\omega^F_{\mathrm{factor}}$ and $\omega^F_{\mathrm{absolute}}$ (see eq. (7)) enables fine-tuning with LCRF NER to outperform LCRF because the network can more easily learn to avoid inconsistent tagging. However, CSE performs best on average. We suspect that this is because the decoding in CSE completely prevents inconsistent tagging.

Table 7: Average E-F1 score of all five NER tasks and three fine-tuning runs. Rows refer to the respective pre-training task, absolute or relative positional encoding (PE), and use of Whole-Word Masking (WWM); columns refer to the fine-tuning methods Default-Fine-Tuning (DFT), Class-Start-End (CSE), Linear Chain Conditional Random Field (LCRF), and LCRF NER; best results per column are emphasized.

Table 8 provides more details for the best setup by listing separate results per NER task. Furthermore, we examined the influence of applying the rule-based Entity-Fix (see Section 4.3) as a post-processing step. The Entity-Fix rule was specifically designed to fix inconsistencies that would otherwise lead to an error in the metric. Since the CSE decoding already incorporates the rules of the metric, applying the Entity-Fix rule causes no changes there. On all other fine-tuning methods, this post-processing step led to clear improvements. Surprisingly, although LCRF and LCRF NER were specifically designed to learn which consecutive tags are not allowed in a sequence, they still show significant improvements with the Entity-Fix rule. Thus, they were not able to learn the structure of the IOB tagging scheme sufficiently.
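To make the idea of such a rule-based repair concrete, here is one plausible instance of an IOB consistency fix (a hedged sketch; the paper's actual Entity-Fix rule in Section 4.3 may differ in its details):

```python
def entity_fix(tags):
    """Rewrite an I-tag that does not continue an entity of the same
    class as a B-tag, so that every entity has a valid start."""
    fixed, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-"):
            cls = tag[2:]
            if prev not in (f"B-{cls}", f"I-{cls}"):
                tag = f"B-{cls}"  # start a new entity instead
        fixed.append(tag)
        prev = tag
    return fixed

print(entity_fix(["I-PER", "I-PER", "O", "B-LOC", "I-PER"]))
# → ['B-PER', 'I-PER', 'O', 'B-LOC', 'B-PER']
```

Such a repair guarantees that the decoded sequence is scorable by an entity-wise metric, regardless of what the tagger emitted.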
The general result is that LCRF NER represents the best fine-tuning method if combined with the Entity-Fix rule.
Additionally, we examined why, in contrast to all other tasks, the H. Arendt task yielded slightly better results with CSE than with LCRF NER and the Entity-Fix rule, by comparing the errors made on the test set. Unfortunately, no explanation could be found in the data. Nevertheless, LCRF NER emerges from our experiments as the best fine-tuning method in general.

Next, we investigated the influence of replacing the original multi-head attention layers with our proposed Whole-Word Attention (WWA) approach. For this, we pre-trained BERTs with only Mask Language Model (MLM), relative positional encoding, and Whole-Word Masking (WWM), using the same hyper-parameter setup as in Section 6.1. Since, as shown in eq. (9), WWA allows increasing the maximal token sequence length, we first pre-trained a BERT with a token sequence length of 320 (as in the previous experiments), but also one with a maximum sequence length of 300 words, that is, roughly 450 tokens on average (see eq. (9)). This value was chosen so that approximately the same number of energy values is calculated in this BERT model as in the comparable model without WWA.
After pre-training of the two BERTs was finished, we fine-tuned all NER tasks with our best fine-tuning method LCRF NER . Table 9 compares the results of using WWA to those without (see Table 8).
The results show that, on average, WWA slightly improved the original approach, even though training was done with reduced memory consumption due to the smaller number of energy values to be calculated and stored. The drawback is that the pre-training of a BERT with WWA takes about 1.5 times longer than pre-training without WWA due to the additional window attention layer and the transformation of the sequence from token to word and vice versa. The runtime could probably be accelerated by improving the implementation for the transformation from token sequence to word sequence without looping over the samples of the batch.
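The token-to-word transformation mentioned above can indeed be vectorized over the whole batch. A sketch using a scatter-add over word indices (illustrative only, not the paper's implementation; `word_ids` gives the word each token belongs to, as produced by a whole-word tokenizer):

```python
import numpy as np

def tokens_to_words(token_emb, word_ids, num_words):
    """Mean-pool token embeddings into word embeddings without a
    Python loop over the samples of the batch.
    token_emb: (batch, n_tokens, d); word_ids: (batch, n_tokens)."""
    b, n, d = token_emb.shape
    # offset word ids per sample so one flat scatter-add suffices
    flat_ids = (np.arange(b)[:, None] * num_words + word_ids).ravel()
    sums = np.zeros((b * num_words, d))
    np.add.at(sums, flat_ids, token_emb.reshape(-1, d))
    counts = np.zeros(b * num_words)
    np.add.at(counts, flat_ids, 1.0)
    return (sums / np.maximum(counts, 1.0)[:, None]).reshape(b, num_words, d)

# tokens [1, 3] form word 0, token [5] forms word 1
emb = tokens_to_words(np.array([[[1.0], [3.0], [5.0]]]), np.array([[0, 0, 1]]), 2)
print(emb[0, :, 0])  # word 0 = mean(1, 3) = 2.0, word 1 = 5.0
```

`np.add.at` performs an unbuffered scatter-add, so repeated word indices accumulate correctly; the same pattern maps directly to `scatter_add`/`segment_sum` primitives in deep learning frameworks.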
Increasing the maximum sequence length to 300 words did not yield an additional performance advantage. We suspect the reason is that the maximum sequence length is not a primary concern for NER tasks, because the samples in all five NER datasets tested almost never exhausted the maximum sequence length used in pre-training. However, we expect a benefit for other downstream tasks such as document classification or question answering, where longer sequences and long-range relations are more important.

Comparing Results with the State of the Art
In this section, we compare our results with the current state of the art. This is only possible for the two LER tasks and the GermEval task since the other two NER datasets were newly introduced by this paper.
To the best of our knowledge, the current state of the art for both LER tasks was achieved by Leitner et al. [2019]. They applied a non-Transformer model based on bidirectional LSTM layers and an LCRF with classical pre-trained word embeddings. For evaluation, in contrast to our E-F1 score, they used a token-wise F1 score, i.e., precision and recall are calculated per token. Table 10 compares the token-wise F1 scores with our best results. Note, however, that this comparison is limited because in [Leitner et al., 2019] presumably exactly one token is used per word. Furthermore, since the splits of train, validation, and test data were not published, we were not able to use identical splits.
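The difference between the two metrics matters: a token-wise F1 gives partial credit for partially recognized entities, whereas the entity-wise E-F1 counts a prediction only if the full span and class match. A minimal sketch of the entity-wise score (simplified; libraries such as seqeval implement the full CoNLL evaluation):

```python
def extract_entities(tags):
    # collect (start, end, class) spans from an IOB sequence
    entities, start = [], None
    for i, tag in enumerate(tags + ["O"]):
        if start is not None and not tag.startswith("I-"):
            entities.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return entities

def entity_f1(gold, pred):
    # a predicted entity counts only on an exact span-and-class match
    g, p = set(extract_entities(gold)), set(extract_entities(pred))
    if not g or not p:
        return 0.0
    tp = len(g & p)
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec) if tp else 0.0

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O",     "O", "B-LOC"]
print(entity_f1(gold, pred))  # 0.5: only the LOC span matches exactly
```

On the same example, a token-wise score would still reward the correctly tagged first token of the PER entity, which is why token-wise F1 values are typically higher than entity-wise ones.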
Next, in Table 11, we compare our results for the GermEval task with the current state of the art which, to the best of our knowledge, was achieved by Chan et al. [2020]. In addition, we list the best results achieved without a Transformer architecture [Riedl and Padó, 2018].

Table 11: Comparison of our best results on the GermEval task with other state-of-the-art models, where E-F1 is the entity-wise F1 score. Our result is again the average of three fine-tuning runs. The score of DistilBERT was taken from https://huggingface.co/dbmdz/flair-distilbert-ner-germeval14

Model | Params | E-F1
BiLSTM-WikiEmb [Riedl and Padó, 2018] | - | 0.8293
DistilBERT Base [Sanh et al., 2019] | 66 mio | 0.8563
GBERT Base [Chan et al., 2020] | 110 mio | 0.8798
GELECTRA Large [Chan et al., 2020] | 335 mio | 0.8895
our best on small BERTs | 34 mio | 0.8448

While our approach outperforms the BiLSTM results of Riedl and Padó [2018], it does not reach the results of Chan et al. [2020]. One of the reasons is that our BERTs are smaller and also pre-trained on a much smaller text corpus. DistilBERT [Sanh et al., 2019], with almost twice the number of parameters, also reaches a higher score; this comes with the drawback that a larger BERT is needed for pre-training. There are scenarios in which training a BERT from scratch is desirable, for example to enable research such as Whole-Word Attention or to train on a different domain or language. In Table 12 we illustrate differences in some technical attributes between our BERT models and GBERT Base of Chan et al. [2020]; it shows that our models can be trained with much lower hardware requirements.
In addition, we compared the time needed for fine-tuning. For a fair comparison, we downloaded GBERT Base from Hugging Face and fine-tuned it with the same hyperparameters as our models. This resulted in an average time of 50 minutes per fine-tuning run for our models and 70 minutes for GBERT Base.

Conclusion and Future Work
In this article, we conducted our research on comparatively small BERT models to address real-world applications with limited hardware resources, making such models accessible to a wider audience. We worked out how to achieve the best results on German NER tasks with smaller BERT models. This simplifies the work of Germanists in the creation of digital editions.
Therefore, we investigated which pre-training method is the most suitable to solve German NER tasks on three standard and two newly introduced (H. Arendt and Sturm) NER datasets. We examined different pre-training tasks, absolute and relative positional encoding, and masking methods. We observed that a BERT pre-trained only on the Mask Language Model (MLM) task combined with relative positional encoding and Whole-Word Masking (WWM) yielded the overall best results on these downstream tasks.
We also introduced two new fine-tuning variants, LCRF NER and Class-Start-End (CSE), designed for NER tasks. Their investigation in combination with Default-Fine-Tuning and common Linear Chain Conditional Random Fields (LCRFs) showed that the best pre-training technique of the BERT is independent of the fine-tuning variant. Furthermore, we introduced the Entity-Fix rule for decoding. Our results showed that for most German NER tasks, LCRF NER together with the Entity-Fix rule delivers the best results, although there are also tasks for which the CSE tagging has a minor advantage.
In addition, our novel Whole-Word Attention (WWA), which modifies the Transformer architecture, resulted in small improvements while roughly halving the number of energy values to be calculated. For future work, it would be particularly interesting to investigate WWA in connection with other downstream tasks such as document classification or question answering, where the processing of longer sequences is more important than in NER tasks. Another approach would be to combine WWA with a sparse-attention mechanism like BigBird [Zaheer et al., 2020].
To further simplify training and application of BERTs for users with only a low technical background, we are currently developing an open source implementation of these optimized models in a user friendly software with a graphical user interface. The goal of this software is to greatly simplify the creation of digital editions by enriching text stored as TEI files with custom NER taggings.