Article

Cross-Encoder-Based Semantic Evaluation of Extractive and Generative Question Answering in Low-Resourced African Languages

by Funebi Francis Ijebu 1,2,*, Yuanchao Liu 1, Chengjie Sun 1, Nobert Jere 3, Ibomoiye Domor Mienye 4 and Udoinyang Godwin Inyang 2

1 School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
2 Department of Computer Science, University of Uyo, Uyo 520271, Nigeria
3 Department of Computer Science, University of Fort Hare, Alice Campus, Alice 5700, South Africa
4 Institute for Intelligent Systems, University of Johannesburg, Johannesburg 2006, South Africa
* Author to whom correspondence should be addressed.
Technologies 2025, 13(3), 119; https://doi.org/10.3390/technologies13030119
Submission received: 7 February 2025 / Revised: 10 March 2025 / Accepted: 14 March 2025 / Published: 16 March 2025

Abstract: Efficient language analysis techniques and models are crucial in the artificial intelligence age for enhancing cross-lingual question answering. Transfer learning with state-of-the-art models has been beneficial in this regard, but performance on low-resource African languages with morphologically rich grammatical structures and unique typologies has shown deficiencies linkable to evaluation techniques and scarce training data. To enhance the former, this paper proposes an evaluation pipeline leveraging the semantic answer similarity method enhanced with automatic answer annotation. The pipeline uses the Language-agnostic BERT Sentence Embedding model integrated with an adapted vector measure to perform cross-lingual text analysis after answer prediction. Experimental results from the multilingual-T5 and AfroXLMR models on nine languages of the AfriQA dataset surpassed existing benchmarks deploying string-based methods for question answer evaluation. The results are also superior to the F1-score-based GPT4 and Llama-2 performances on the same downstream task. The automatic answer annotation technique effectively reduced the labelling time while maintaining high performance. Thus, the proposed pipeline is more efficient than the prevailing string-based F1 and Exact Match metrics in mixed answer type question–answer evaluations, and it is a more natural performance estimator for models targeting real-world deployment.

1. Introduction

The recent gains in artificial intelligence (AI) have increased the impact of computational solutions on daily living. Yet, the need for more efficient language processing techniques to advance AI cannot be overemphasized. The multilingually trained large language models (LLMs) underlying AI systems have steadily improved our capacity to understand natural human languages, thus attracting more interaction from users. By interacting with these language models, information seekers often query diverse topics expecting accurate responses [1]. Sometimes, the most accurate response to a query exists in another language not properly understood by the retrieval system [2]; in such cases, the information seeker misses knowledge. To close this cross-lingual gap, researchers have continued to explore the capabilities of LLMs in performing cross-lingual information retrieval [3,4].
A concern with research advancing cross-lingual information retrieval is its focus on languages with more resources [5]. Low-resource languages consequently suffer from reduced technological support, which limits the effectiveness of cross-lingual information retrieval systems on languages lacking extensive bilingual dictionaries and machine translation support [6,7]. To bridge this gap, the work in [8] proposed the AfriQA dataset for cross-lingual open-retrieval question answering. The study evaluates the performance of models that are state-of-the-art on English tasks on cross-lingual question answering in African languages and reports results with the Exact Match (EM) and F1 metrics. However, previous work had identified flaws with the EM metric in question answering (QA) tasks when it is used to evaluate non-short factual answers [9]. The F1 metric, which evaluates performance by checking the number of common tokens between a model’s predicted answer (precision) and the ground truth (recall), is also string-based. The lexical dependence of the EM and F1 metrics makes them misjudge predicted answers without lexical overlap.
Since ideal QA model performance evaluation involves comparing the ground truth with model-predicted answers, it is possible that the two answer spans have no lexical overlap yet are semantically similar [10]. This should be explored when evaluating QA models, especially on datasets and languages with rich morphology and grammatical structure. Datasets with mixed answer types better reflect the responses information seekers expect when querying retrieval systems. Inspecting the AfriQA dataset’s ground truths shows that the annotated answers consist of factual, short, long, and slightly elaborate text chunks; that is, the dataset has mixed answer types. While the answers are consistent with natural linguistic responses in African languages, this means that the flaws of the EM and F1 metrics can manifest in the evaluation process and distort the true performance of a QA model, thereby affecting its real-world performance estimation. The underperformance of the mT5 [11] and AfroXLMR [12] models in [8], after task-specific finetuning, may be due to these evaluation flaws of the EM and F1 metrics, as the models have demonstrated remarkable proficiency in analogous downstream tasks in English, as well as in other natural language processing (NLP) downstream tasks in African languages [7,13].
To investigate the impact of string-based QA evaluation metrics on model performance for morphologically and grammatically dynamic low-resource languages and QA tasks, this work focuses on African languages, which are largely underserved in the AI and QA literature. Toward a more efficient mixed answer type QA evaluation technique that better reflects natural human language expression, the following contributions are made in this work:
  • An automatic text labelling technique for custom and generalized semantic textual analysis tasks is proposed.
  • The automatic text labelling technique is deployed to enhance the efficiency of the semantic answer similarity (SAS) method in QA evaluation proposed in [9]. The resulting enhanced technique is dubbed SAS+.
  • This study demonstrates the efficiency and robustness of the SAS+ pipeline in evaluating underserved low-resourced languages, compared to conventional methods used by encoder–decoder and decoder-only models.
  • This study shows that the proposed SAS+ evaluation pipeline is a more natural and befitting estimator for QA model performance relative to the prevailing F1 and EM metrics.
The remainder of this work is organized as follows: Section 2 highlights the related literature, while Section 3 describes the proposed SAS+ evaluation pipeline. Our experimental design is presented in Section 4. Section 5 presents and discusses the results obtained, while Section 6 concludes the study.

2. Related Works

2.1. Question Answering in African Languages

The inability of conventional information retrieval systems to semantically process natural language queries motivated researchers to investigate and propose information retrieval systems with semantic processing abilities. Today, these efforts have led to the emergence of innovations that leverage information retrieval and machine learning techniques to produce diverse natural language processing models [7,14,15]. These efforts are foundational to the field of QA research, which has been gainfully addressing real-world problems and showing impressive results [16,17]. Two revolutionary works in the QA domain are the SQuAD dataset [18] and the work in [19], where Wikipedia is used as a knowledge source for QA [20]. Subsequently, several authors have curated datasets and proposed models for efficient QA [21,22]; however, these are mostly for high-resource languages [5].
As high-resourced languages continue to gain traction, efforts towards monolingual QA in low-resourced languages are emerging [23,24]. For low-resourced African languages, the work by [25] presents a monolingual QA dataset in Tigrinya, while the effort in [26] focuses on the Swahili language. The works in [5,27] similarly present QA benchmarks in select African languages, providing data to encourage further research in those directions. In [28], the authors evaluate the performance of five LLMs on QA in African languages to understand the proficiency of the selected models in African languages. Aside from these studies, there is a scarcity of monolingual QA resources in African languages.

2.2. Cross-Lingual Language Resources

As monolingual models for QA attained state-of-the-art performance, the need to extend their capabilities to understanding and answering questions in two or more languages ensued. Drawing insights from cross-lingual machine translation, foundational cross-lingual QA systems translated questions into the target language and searched for answers accordingly [13]. In recent years, multilingual resources like the MLQA [29] and MKQA [30] datasets have been proposed for cross-lingual model engineering using pretrained LLMs. These datasets have been diversely explored to extend the frontiers of cross-lingual QA research.
For instance, the authors in [31] proposed the CORA model, a pretrained language model for cross-lingual open retrieval (XOR). The CORA model achieves outstanding results on the MKQA and XOR-TyDiQA datasets, thereby advancing QA research in their incorporated languages. These datasets are versatile cross-lingual QA resources that cover a total of 26 languages including some low-resourced languages, but none of the languages covered is an African language. There are several similarly versatile datasets for QA without African languages. Considering the scarcity of language resources for African NLP advancements, previous works [6,13] encouraged the curation of multilingual and cross-lingual datasets in African languages to catalyze research efforts to further develop African NLP.
As a contribution to African NLP advancements, the authors in [8] proposed the AfriQA dataset for cross-lingual QA in nine low-resourced African languages. That study leveraged NLLB [32], Google Translate, and finetuned M2M-100 [33] models for cross-lingual translation in its QA investigation. The results showed that the African languages incorporated in the dataset are challenging for the adopted models to process, suggesting that the language understanding of prevalent QA solutions is greatly deficient in low-resourced African languages. However, whether the evaluation metrics used in the study impacted the results obtained was not considered.

2.3. Cross-Lingual QA Performance Evaluation Metrics

The performance of QA models in the recent literature is mostly reported with string-based metrics. The extractive QA submissions in [2,8,29,30,31,34] all assess and report the performance of their proposed systems with the F1 and/or EM metrics. This is a recurrent trend in extractive and generative QA system evaluations. In such studies, neither the different morphological and grammatical structures of the languages evaluated nor the differences in dataset answer types are considered when the performance evaluation metric is selected. However, these are crucial for model performance estimation [9,29].
The generative QA framework in [35] also reports model effectiveness with the F1 metric. The work in [36] similarly reports Llama-2’s performance across multiple QA benchmarks in F1 and EM scores. In [1], the accuracies of GPT models on the QA task are evaluated using string-based comparison of generated answers with the ground truth. To demonstrate the proficiency of a data augmentation technique for generative LLMs, the Rouge-L, EM, and F1 metrics were used to report the results in [37]. The k@N metric (where k corresponds to accuracy, precision, or recall and N equals the number of documents to retrieve from the knowledge base) [38] and the mean average precision are other string-based metrics used in the evaluation of extractive and generative QA systems [4].
Although these metrics show the efficiency of their evaluated systems, they are biased toward tasks in which the ground truth and prediction share lexical overlap. The EM metric invalidates a predicted answer that contains even one character beyond the ground truth tokens [9]. The F1 metric is more lenient than EM in penalizing answer and prediction strings that only partially overlap, yet it lacks semantic awareness: it does not maximally score cases where the ground truth and prediction strings partially overlap. Humans can easily identify such situations and rate them appropriately. This limitation of the F1 metric means that its evaluation of models would not agree closely with human judgements in QA settings with limited lexical overlap between predictions and ground truth answers.
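To make this limitation concrete, the short sketch below computes SQuAD-style token-level EM and F1 for a prediction that refers to the same entity as the gold answer but shares no tokens with it; both metrics return zero. This is an illustrative sketch, not the exact scoring script used in the cited studies.

from collections import Counter

def exact_match(prediction: str, gold: str) -> int:
    # EM accepts the prediction only if it matches the gold span exactly.
    return int(prediction.strip().lower() == gold.strip().lower())

def f1_score(prediction: str, gold: str) -> float:
    # Token-level F1 over whitespace tokens, counting common tokens with multiplicity.
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Lexically disjoint but semantically equivalent answers are judged as completely wrong.
print(exact_match("Abuja", "the Nigerian capital city"))   # 0
print(f1_score("Abuja", "the Nigerian capital city"))      # 0.0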
To overcome the string-dependent limitation of the F1 metric, some generative QA model evaluations use the BLEU and ROUGE metrics. However, these metrics have also been found to misjudge model performance. Given a prediction and ground truth that are only semantically or syntactically similar, the BLEU and ROUGE metrics unfairly penalize the comparison because the pair lacks lexical similarity [39,40]. The BERTScore [40] and BLEURT [39] metrics have also been proposed for QA performance evaluation. With the BLEURT and BERTScore metrics, the semantic capability of an answer prediction model is considered beyond the confines of conventional lexical evaluation.
This kind of performance evaluation enables semantic and syntactic correlations to be identified between ground truths and predictions in downstream QA tasks. Furthermore, efforts have been made to also extend the abilities of the BLEURT and BERTScore metrics. In this regard, the authors in [9] present the semantic answer similarity (SAS) metric for QA evaluation. The SAS metric is a man-in-the-middle evaluator that uses a cross-encoder for ground truth and model-predicted answer encoding; then, cosine similarity is used for vector comparison. The availability of a cross-encoder within the SAS metric enables it to consider the semantic and lexical similarity of ground truth and model-predicted answers in performance estimation. The creation of semantic embeddings for the answer pairs helps the SAS metric overcome the lexical-dependence limitation of the string-based metrics.
However, the efficiency of the SAS metric has not been tested in languages other than the German and English evaluated in [9]. Therefore, its generalizability to other languages, especially low-resourced African languages, is unknown. This work, therefore, seeks to explore the efficiency of the SAS evaluation metric in low-resourced African languages. In addition, the manual hand annotation of predicted answers adopted by SAS is time consuming and expensive; therefore, a less cost-intensive answer labelling technique that improves the annotation time and metric efficiency is proposed in this work.

3. Proposed QA Evaluation Pipeline

3.1. Dataset

The AfriQA dataset [8] is adopted in this study for the evaluation of the proposed pipeline. The dataset is publicly available and consists of nine African languages: Bemba (bem), Fon (fon), Hausa (hau), Igbo (ibo), Kinyarwanda (kin), Swahili (swa), Twi (twi), Yoruba (yor), and Zulu (zul). The languages are spoken across countries in Southern, East, West, and Central Africa, with a combined speaker population of about 292 million, and all use the Latin script for writing. They span two main language families: Afro–Asiatic (hau) and Niger–Congo (bem, fon, ibo, zul, kin, swa, twi, and yor). Further nuanced linguistic characteristics of each language are discussed in [8]. All the languages are low-resourced because of the low volume of publicly available data to support model training for QA and other NLP studies.
The questions in the AfriQA dataset are originally annotated in the respective African languages and translated by human experts into their respective pivot languages. Except for the Fon language which has its pivot as French, all other languages have their pivot as English. The pivot language is a high-resourced language in which the content of each context is written. The contexts are passages retrieved from Wikipedia. The dataset is released with the train, dev, and test splits, but our experiments utilize only the test sets of each language, because we undertake a zero-shot investigation.
Four dimensions of cross-lingual extractive QA consistent with [8] are evaluated with the proposed SAS+ pipeline. In the HT (human-translated) dimension, the questions were translated from the respective African languages into the pivot language by expert annotators and used to query the QA model. For the GMT (Google Machine Translation) and NLLB (No Language Left Behind) dimensions, the Google Translate and NLLB models, respectively, are used to automatically translate each original query from the respective African language into the pivot language before it is provided to the prediction model for answer extraction. The fourth dimension is termed CL (cross-lingual), where each question is supplied to the answer prediction model in the African language in which it was annotated.
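As an illustration of how the NLLB question translation step can be realized, the sketch below routes a Hausa question through a publicly released NLLB checkpoint into the English pivot; the checkpoint, language codes, and decoding settings are illustrative assumptions and may differ from the exact setup used in [8].

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative public checkpoint; [8] used NLLB and finetuned M2M-100 variants.
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="hau_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question_hau = "Wane ne shugaban Najeriya na farko?"  # Hausa source question
inputs = tokenizer(question_hau, return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),  # decode into the pivot
    max_length=64,
)
question_pivot = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# question_pivot is then paired with the pivot-language context for answer extraction.
print(question_pivot)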

3.2. Cross-Encoder for SAS+

The Language-agnostic BERT Sentence Embedding (LaBSE) [41] model, enhanced by additive margin softmax (AMS) described in Equation (1), is the integrated bidirectional encoder in our SAS+ pipeline. Through custom pre-training, the LaBSE model has been shown to achieve state-of-the-art performance in cross-lingual embedding production and comparison in bi-text retrieval and mining downstream tasks [41]. Inspired by its performance in cross-lingual semantic analysis, this work integrates the inference-tuned LaBSE model into the evaluation pipeline to semantically analyze predicted answers relative to the ground truth.
L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\phi(x_i, y_i) - m}}{e^{\phi(x_i, y_i) - m} + \sum_{n=1, n \neq i}^{N} e^{\phi(x_i, y_n)}}    (1)
From Equation (1), φ(x_i, y_i) represents the embedding-space similarity between variables x and y, which are semantic vectors representing the ground truth and model-predicted answers, and m is the margin that improves the separation between the bi-text over all N − 1 alternatives processed in the vector space. The AMS-trained dual encoder generates separate semantic representations for the ground truths and predicted answers in the task. The resultant embeddings are then compared using Sim_ECS, an adaptation of the extended vector similarity measure. Its adoption in this study is motivated by its demonstrated efficiency in embedding similarity analysis compared to the traditional cosine measure embedded in LaBSE [10].
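To illustrate this encoding step, the sketch below loads the publicly released LaBSE checkpoint through the sentence-transformers library and embeds a gold/predicted answer pair; the model ID and example strings are illustrative, and the cosine value shown is only a default comparison that the pipeline later replaces with the adapted Sim_ECS measure described in Section 3.3.

import numpy as np
from sentence_transformers import SentenceTransformer

# Public LaBSE release on the Hugging Face hub (illustrative model ID).
encoder = SentenceTransformer("sentence-transformers/LaBSE")

gold_answer = "Nnamdi Azikiwe"
predicted_answer = "Dr. Nnamdi Azikiwe, the first president of Nigeria"

# LaBSE produces fixed-size sentence embeddings; normalization yields unit vectors.
v1, v2 = encoder.encode([gold_answer, predicted_answer], normalize_embeddings=True)
print(v1.shape)                 # (768,) for LaBSE
print(float(np.dot(v1, v2)))    # default cosine similarity of the answer pair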

3.3. Automatic Answer Labelling

The default SAS protocol involves traditional semantic textual similarity labelling, where human annotators are required to manually judge the semantic similarity of the ground truth and model-predicted answers. The annotators’ task entails assigning a numeric score between 0 and 2 to each answer pair, guided by the rating scheme described in Figure 1. Gathering human annotators to manually compare hundreds or thousands of predicted answers (depending on the size of the dataset) with their ground truth pairs for similarity is cost intensive and time consuming.
To avert this time-consuming and cost-intensive similarity labelling protocol, the automatic answer labelling technique is introduced in this work. The automated technique eliminates the man-in-the-middle that performs answer labelling, thereby removing the cost and time required to complete the evaluation lifecycle. The automatic answer annotation rule is described using Equation (2), where ψ and η represent the potential answer label and the predicted answer string, respectively. As such, φ is the actual label assigned to the predicted answer string.
\varphi = \begin{cases} \psi = 2, & \text{if } \eta \neq \text{empty\_string} \\ \psi = 0, & \text{otherwise} \end{cases}    (2)
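In code, the rule in Equation (2) reduces to a single check on the predicted string, as in the minimal sketch below (the function name, whitespace stripping, and example values are ours).

def auto_label(predicted_answer: str) -> int:
    # Equation (2): the highest similarity label (2) for any non-empty prediction, else 0.
    return 2 if predicted_answer.strip() else 0

print([auto_label(p) for p in ["Abuja", "", "  "]])  # [2, 0, 0]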
The implementation of the automatic annotation technique in the evaluation pipeline is described in Algorithm 1. The algorithm takes the question (Q) and context (C) as inputs, and it produces a performance score (ρ) corresponding to the correlation between the ground truth and model prediction. Each language being evaluated has hundreds of question–context pairs; hence, Lines 4–10 are repeated until the question–context pairs for that language are exhausted. The variable Â_i represents a List object that stores the outcome of Lines 5–8, creating a corresponding index for each value received. Line 10 updates Â_i after each question–context pair is processed. When Line 3 is true, the initialized bi-encoder (β) in Line 1 is deployed to produce semantic embeddings as in Lines 11 and 12. This is followed by the computation of the similarity between the generated embeddings in Line 13. The similarity scores are held in a List object H̆ that is updated after each instance of similarity calculation. When the instances in the focus language are exhausted, Line 14 is executed, resulting in a correlation score ρ, obtained as in Line 16. This value represents the performance of the QA system in terms of the SAS+ metric, which evaluates the semantic correctness of the prediction model, rather than the lexical correctness examined by the F1 and EM metrics.
Algorithm 1: Pseudocode for the proposed pipeline with automatic answer labelling for SAS enhancement
Input: Question (Q), Context (C)
Output: Performance score (ρ)
1. Initialize bi-encoder (β), answer prediction model (Ṁ)
2. for each language data do:
3.   while input ≠ 0 do:
4.     A_i ← Ṁ(Q, C)
5.     if answer ≠ empty_string then
6.       A_l ← 2            ⧏ answer label
7.     else
8.       A_l ← 0            ⧏ answer label
9.     end if
10.    Â_i ← (A_i, A_l)      ⧏ update answer
11.    Vec_i ← β(Â_i.A_i)    ⧏ encode prediction
12.    Vec_j ← β(Gt_j)       ⧏ encode ground truth
13.    H̆ ← compute Sim_ECS with Equation (3)
14.    ρ ← corr(H̆, Â_i.A_l)  ⧏ compute correlation
15.  end
16. obtain ρ
17. end
Notice that all predicted answers are automatically assigned a similarity value of two in Line 6. This value corresponds to the highest similarity score on the default SAS answer labelling scale. Thus, it is assumed that every predicted answer span is equivalent to the gold answer span. The equivalence of the ground truth and predicted answer spans is then semantically analyzed in the vector space using Equation (3), where V_1 and V_2 are the cross-encoder-generated semantic embeddings for the ground truth and model-predicted answer spans, while k is a normalization constant with the value of one [10]. The length of the respective semantic embeddings depends on the cross-encoder embedding dimension, and the generated semantic vectors are unit vectors. As such, the value of Sim_ECS ranges between 0 and 1, where values closer to 1 indicate a stronger similarity between the text chunks being compared.
Sim_{ECS} = \frac{V_1 \cdot V_2}{\lVert V_1 - V_2 \rVert^{2} \, \lVert V_1 \rVert \, \lVert V_2 \rVert + k}    (3)
The adapted vector measure, Sim_ECS, deployed to compute the semantic equivalence of the generated embeddings derives from a document classification vector measure. The original measure’s reliance on computing the centroids of each training data category during training meant that it could not be used to analyze semantic vectors from pretrained models. Thus, by leveraging the idea that the absolute difference between two vectors alone does not govern similarity, and the finding that correlation is related to the Euclidean distance between any two standardized vectors, the work in [10] presented an adaptation of the extended cosine vector measure, resulting in Equation (3).
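A minimal NumPy sketch of the adapted measure, implementing Equation (3) as reconstructed above with k = 1, is given below; the exact formulation should be verified against [10].

import numpy as np

def sim_ecs(v1: np.ndarray, v2: np.ndarray, k: float = 1.0) -> float:
    # Equation (3) as reconstructed here: dot product scaled by the squared Euclidean
    # distance, the vector norms, and the normalization constant k (set to 1 in [10]).
    dot = float(np.dot(v1, v2))
    dist_sq = float(np.sum((v1 - v2) ** 2))
    norms = float(np.linalg.norm(v1) * np.linalg.norm(v2))
    return dot / (dist_sq * norms + k)

# Identical unit vectors score 1; orthogonal unit vectors score 0.
u = np.array([1.0, 0.0])
print(sim_ecs(u, u), sim_ecs(u, np.array([0.0, 1.0])))  # 1.0 0.0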

4. Experimental Design

Our experiments cover the HT, GMT, NLLB, and CL test dimensions. The AfriQA dataset provides ready questions for the HT and CL dimensions; hence, they are used as-is in our experiments. For the translation-based GMT and NLLB dimensions, source questions in the respective African languages are passed through the neural translation models (i.e., Google Translate and NLLB), where the target language is the corresponding pivot for that language. The output of the translation model is then used as the input question to the prediction model, alongside the context in the pivot language. All context passages are in either English or French, the pivot languages of the dataset. The AfroXLMR and mT5-base LLMs are adopted for the answer extraction phase. The finetuned AfroXLMR model from [8] is utilized, while the mT5 model checkpoint from Hugging Face (https://huggingface.co/google/mt5-base, accessed on 13 March 2025) is finetuned on the SQuAD-v2 dataset [18] for 5 epochs with the Adam optimizer and a learning rate of 3 × 10⁻⁵ on a single NVIDIA A800 80 GB GPU.
The extractive question answering task requires that prediction models extract a chunk of text from the given context as a response to the input question. Models are not allowed to generate any text from their pretrained knowledge. As shown in Figure 2, the ‘Predicted Answer Span’ is the output from the AfroXLMR and mT5 models. The answer span is a consecutive set of tokens from the context passage that the prediction models extract and return as an answer to the input question. Note from the figure that the QA model takes the question and context pair as inputs; it then returns an answer span, which could be a single token or a set of consecutive tokens, per question provided. In the implementation being considered, the number of tokens to be returned as the answer span is not constrained. This allows the models to deploy their maximum understanding of the language and question to extract answer spans of varying lengths, consistent with the human-annotated ground truths in the AfriQA dataset, where we observe mixed answer types. Thus, answer spans vary from single to multiple tokens, as well as sentences in some cases. By making the token length flexible, the models can extract answer spans that are semantically equivalent to the ground truth even though they differ in the number of constituent tokens.
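For concreteness, the answer-span extraction step can be sketched with the Hugging Face question-answering pipeline as below; the reader model ID is an illustrative stand-in, whereas the experiments use the finetuned AfroXLMR and mT5 checkpoints described above.

from transformers import pipeline

# Illustrative multilingual extractive reader (not the finetuned models used here).
qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

context = ("Abuja became the capital of Nigeria in 1991, "
           "replacing Lagos as the seat of government.")
question = "Which city replaced Lagos as Nigeria's capital?"

# The pipeline returns a span copied from the context, plus its position and score.
result = qa(question=question, context=context)
print(result["answer"], result["start"], result["end"], result["score"])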
When an answer span is extracted, it is added to a List object that is updated with new entries after each iteration. Using the predicted answer spans, the F1-score of the model is computed using Equation (4), where recall is computed over the raw tokens of the gold answer span and precision over the raw tokens of the predicted answer span. The F1 metric therefore performs a string-based calculation, checking the number of common tokens between the predicted answer span and the gold answer span. Its operation does not consider word order or context; thus, any match between tokens from both answer groups adds to the accuracy value. The F1 metric is therefore equal to zero only when no two tokens from the predicted and gold answer spans match. The higher the number of common tokens, the better the F1 valuation. The Exact Match metric, on the other hand, checks whether the raw tokens in the predicted answer span exactly correspond to the raw tokens in the gold answer span, without additional characters, spaces, or word order disparities. Thus, any additional space or character in the predicted span that is absent from the gold answer renders the predicted answer invalid.
F_1 = \frac{2 \times Recall \times Precision}{Recall + Precision}    (4)
With the predicted answer spans, we undertake manual answer annotation to enable a fair comparison with the default SAS evaluation method. For all nine AfriQA languages, each answer pair was manually annotated by three annotators independently, using the three-point labelling scheme described in Figure 1. The human-annotated scores from this process are then held in storage, like the gold answer spans, as in Figure 2. The automatic annotation of predicted answers is conducted using Equation (2), and the results are held in a List object with the corresponding index indicator to ensure accurate correlation calculations. The semantic embeddings for the predicted answer span (PA) and gold answer span (GA) are next generated with the LaBSE model, which produces independent embeddings for PA and GA. To determine the overall performance of the evaluation process in each case, correlation is computed using the Spearman measure, ρ. The value ρ × 100 obtained from the correlation computation constitutes the performance of the model on that language’s QA task.
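The final scoring step then reduces to a rank correlation between the per-instance Sim_ECS values and the answer labels, as in the sketch below; the numbers are illustrative and scipy’s spearmanr stands in for the correlation routine.

from scipy.stats import spearmanr

# Per-instance Sim_ECS similarities between predicted and gold answer spans (illustrative).
sim_scores = [0.91, 0.12, 0.78, 0.05, 0.88]
# Matching answer labels: Equation (2) outputs for SAS+, or averaged human ratings for SAS.
labels = [2, 0, 2, 0, 2]

rho, _ = spearmanr(sim_scores, labels)
print(round(rho * 100, 2))  # the value reported as the model's SAS+ score for a language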

5. Results and Discussion

5.1. SAS+ Performance Compared to F1- and EM-Based Baselines

Unbiased human evaluation is often the best indicator of a system’s quality [39], but owing to cost and time constraints, automated methods have become useful. LLMs in semantic textual similarity analysis have shown that two text samples with no lexical overlap can still be semantically similar. This has been diversely exploited by information retrieval systems to provide more meaningful responses to information seekers when they issue a query. In semantic similarity evaluation tasks, the efficiency of an automated method is determined by comparing its performance with human judgement. The traditional methods for this comparison are the Pearson and Spearman correlations [10]. The higher the correlation between the automated technique and the human evaluation, the more efficient the proposed method [40]. Such a metric is a more natural method for mixed answer type QA evaluation than the F1 and EM metrics, which focus strictly on lexical similarity. Moreover, real-world non-factoid answer correctness evaluation by humans does not mostly focus on word or phrase exactness [42], because words have synonyms, and different words or phrases can correctly be used to answer a question or describe an item or situation.
In reality, when a response semantically connotes the meaning of an expected ground truth in a non-comprehension reading or cloze QA task, such responses would be considered correct to large extents. Unlike in [8], where the cross-lingual QA performances on African languages are measured with lexical metrics, this work proposes the utilization of a semantic metric and deploys the SAS+ evaluation method to demonstrate the practicality of the idea. As shown in Table 1 and Table 2, the SAS+ metric outperforms F1 and EM metrics in all languages and task dimensions considered. Recall that the SAS+ score is the degree of agreement between expert human judgement and the model on the semantic equivalence of model predictions to the ground truth. The results, therefore, indicate that the cross-encoder integrated in the SAS+ evaluation pipeline enables the method to efficiently identify nuanced semantic features, which are missed by the F1 and EM lexical analysis of the predicted and gold answer spans. As such, the mT5 and AfroXLMR performances are substantially improved in the HT, GMT, NLLB, and CL question answer dimensions compared to the results obtained in [8] with the F1 and EM metrics.
When AfroXLMR is the answer extraction model, prediction accuracy in most languages under the HT, GMT, and NLLB dimensions exceeds 80, with the exception of the Fon language under NLLB, as well as Igbo under the HT and NLLB dimensions. The CL dimension shows relatively lower agreement with human judgement, especially for the Bemba and Twi languages. Although the lower scores are consistent with the F1 and EM results originally reported in [8], it can be inferred that the model’s understanding of these languages is poor; hence, it is challenging to answer questions in them, as only 30.45 and 39.18 percent, respectively, of QA pairs are interpreted correctly. For the Igbo CL task, however, slightly over 80 percent of predictions correctly agree with human assessments, second only to Swahili with 82.55; Swahili is the most widely spoken African language and possibly the African language best supported by prevailing LLMs.
Corroborating the complexity of African languages for current LLMs, mT5’s prediction accuracy in the CL task is consistently lower in most languages (see Table 2) than that of AfroXLMR, which is trained on more African languages [12]. While the accuracy on Bemba remains the lowest, mT5’s performance on Twi suggests that it understands the semantics of Twi questions better than AfroXLMR, thereby extracting more correct answer spans. Unlike AfroXLMR, where most SAS+ scores exceed 80, only the Fon and Swahili results in the HT task reach this accuracy for mT5. While mT5’s SAS+ accuracy is above 80 in four GMT and six NLLB tasks, no prediction accuracy in the CL experiments reaches this magnitude. This improves on the F1 and EM evaluations but underscores the deficiency of prevailing encoder–decoder LLMs in understanding low-resourced African languages.
In [8], the HT dimension consistently produced the best F1 and EM scores with the mT5 and AfroXLMR models. In addition, the difference between their HT results and the other dimensions is significant in most AfriQA languages. When the average HT and CL dimension scores are compared, mT5’s performance across all languages drops by 36.9 and 30.8 F1 and EM points, while AfroXLMR loses 37.1 and 30.6 F1 and EM points, respectively. Although this difference is smaller when the HT average is compared to the GMT and NLLB averages, the performance drops are still significant.
In contrast, in our evaluation with the SAS+ metric, the performance difference between the dimensions is marginal. For instance, in the AfroXLMR processing of the Hausa language, the best SAS+ performance is observed in the GMT dimension. A similarly outstanding performance is obtained in the NLLB dimension when mT5 is the answer predictor. This shows that SAS+’s consideration of semantic equivalence rather than monotonous lexical matching is beneficial to mixed answer type evaluations, where answer predictions and ground truths may have limited lexical overlap.
For all neural machine translator-supported languages, the SAS+ results indicate that the Google Translate and NLLB models produce meaningful translations that are almost semantically equivalent to the benchmark human translations. This is deducible from mT5’s and AfroXLMR’s abilities to understand and extract largely correct answers from the given contexts. It is contrary to the lower F1 and EM results, which suggest that the translators do not produce semantically equivalent question translations relative to the human translations.

5.2. SAS+’s Performance Compared to F1-Based Generative QA Models

Recent state-of-the-art decoder-only LLMs like GPT and Llama have demonstrated general intelligence in a wide range of complex natural language understanding tasks, including extractive cross-lingual QA using prompt-based learning procedures [43,44]. However, much of their tremendous performance is on high-resource languages, which means there is a scarcity of evidence showing their effectiveness in low-resourced languages. One of the few applications of these models to low-resourced language NLP tasks is the work in [28], where GPT4-0 and Llama-2 13B are prompted to perform extractive QA with the AfriQA dataset. The study probed the HT and CL dimensions, reporting performance with the F1 metric.
Intrigued by the results they obtained, we compare the QA performance of mT5 and AfroXLMR evaluated with the SAS+ metric against GPT4-0 and Llama-2 13B evaluated with the F1 metric on the same downstream task. The basis for this comparison is that the performance of GPT4 and Llama-2 on the AfriQA dataset was assessed with the traditional F1 metric, the same metric reported to be biased and flawed due to its strict lexical dependence [9].
As shown in Figure 3 (HT task), except for the Igbo language, where GPT4 obtains the overall best performance with a magnitude of 78.1, GPT4 is outperformed in all other languages by the encoder–decoder models evaluated with SAS+. Llama-2 obtains its overall best performance on the Igbo language, with an F1-score at par with AfroXLMR. Also, mT5 predictions have the least agreement with human judgement on the Igbo language. Across all languages, the AfroXLMR results are better than the GPT4 and Llama-2 results, with the mT5 results following in eight out of nine instances.
In the CL evaluation, both GPT4 and Llama-2 estimations are lower than their HT task results. This is because the input questions are in the respective AfriQA languages, signifying that the models struggle to understand the questions. Although AfroXLMR performs optimally in other languages, it struggles to answer the Twi questions effectively. The mT5 model shows a relatively stable understanding of the different languages, but its proficiency is far below what it showed on the HT task. Taken together, these results echo the limitations of state-of-the-art models in African languages and support the claim that string-based evaluators negatively impact measured performance in QA tasks [9].
We hypothesize that the F1 performance of GPT4 and Llama-2 on the HT task is influenced by the prompt design, which did not explicitly prohibit the models from generating extra tokens as part of the returned answer span. We believe that the decoder-only models, in their attempt to answer some of the prompt questions, ended up generating answer spans that are not a consistent sequence of tokens in the given context.
Furthermore, the default behaviour of decoder-only LLMs, due to their training protocol, is to generate a response to questions from their pretrained knowledge. Our experiments reveal that when a generative model is prompted to act as an extractive QA agent and given a context and question, it analyzes the input and generates a response consisting of one or more tokens. In cases where multiple tokens are generated, if not explicitly prompted, the response may not be a consecutive set of tokens from the context. When multiple non-consecutive tokens are generated as a response, even though they correctly answer the provided question, an F1 evaluation would reduce the overall model performance if not all string pairs match, while the EM metric would assess the response as a wrong answer. This limitation of the F1 metric is overcome by SAS+ through its focus on the semantics of the response string and the ground truth. Hence, SAS+ improves the results of the GPT4 and Llama-2 models across all AfriQA languages.

5.3. SAS+ Analysis of Dissimilar Answer Pairs from the F1 Assessment

The question–answer evaluation with the F1 metric checks whether any token in the extracted answer span matches a token in the ground truth. Without consideration of word order and context, a match adds to the accuracy of the F1 metric. The F1 metric is therefore equal to zero only when no two tokens from the prediction and gold answer spans match. Using the output of the F1 metric computation, predicted answers are classified into two groups, consistent with [9]. The first group holds instances where the F1 valuation is zero (i.e., F1 = 0), while the second group holds instances where the F1 score is not equal to zero (i.e., F1 ≠ 0).
The justification for using the F1 score as the classification criterion, instead of the EM metric, is that F1 is more lenient in penalizing lexical overlap between gold and predicted answers; thus, its performance ratings consistently surpass EM values. With the classification of answer spans into the zero and non-zero F1 groups, the presence of semantic equivalence in the answer pairs under the F1 = 0 group is investigated. Instances in this group are, according to the F1 metric, without any semantic relationship; hence, the ability of SAS+ to discover semantic relationships between them, as shown in Table 3, adds to the overall model performance. The outcome of this task therefore substantiates the effectiveness of the SAS+ metric in evaluating QA models with mixed answer types.
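Because F1 is zero exactly when the predicted and gold spans share no tokens, the grouping can be sketched with a simple overlap test, as below (the data values are illustrative).

def shares_tokens(prediction: str, gold: str) -> bool:
    # F1 = 0 exactly when no token of the prediction appears in the gold span.
    return bool(set(prediction.lower().split()) & set(gold.lower().split()))

pairs = [
    ("Abuja", "the Nigerian capital"),         # no overlap -> F1 = 0 group
    ("Nnamdi Azikiwe", "Dr. Nnamdi Azikiwe"),  # overlap    -> F1 != 0 group
]
f1_zero = [p for p in pairs if not shares_tokens(*p)]
f1_nonzero = [p for p in pairs if shares_tokens(*p)]
# Semantic equivalence inside the F1 = 0 group is then measured with SAS+ (Table 3).
print(len(f1_zero), len(f1_nonzero))  # 1 1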
The results in Table 3 show the outcome of analyzing the predicted and gold answer pairs with the SAS+ metric. Except for the F1 = 0 case of AfroXLMR predictions for the Twi language, as well as the F1 = 0 case of mT5 predictions for the Zulu language, the SAS+ pipeline identified some semantic equivalence between predicted answers and their corresponding ground truths. A manual inspection of the Twi and Zulu answer pairs showed that most predictions returned empty strings, while others were arbitrary texts from the context.
Furthermore, the poor performance on Twi may stem from poor transfer ability, as neither the answer prediction models (mT5 and AfroXLMR) nor the SAS+ cross-encoder contains Twi in its original pretraining language list [11,12]. A similar result is observed with mT5 on Zulu; however, mT5 is pretrained with Zulu language data, so hallucination is suspected. Hallucination in LLMs remains a problem researchers seek to curtail, because such model behaviours can be highly misleading. It has been found that even state-of-the-art decoder-only models like GPT4 hallucinate on language understanding tasks [44]. Hence, it is not strange to observe similar behaviour in our task.
The intra-model assessment of translated queries with AfroXLMR shows that human-translated queries consistently produce better performance than neural machine translations when F1 = 0, except in Kinyarwanda (see Table 3). With mT5, the trend differs: queries translated by NLLB yield more accurate predictions in more languages than their human-translated equivalents. The CL dimension results of both predictors show language-dependent performance, as neither model shows outright superior predictive power across all languages. This indicates that model selection for QA tasks in low-resourced African languages is important; model capabilities ought to be thoroughly researched prior to adoption in QA tasks if outstanding results are expected.

5.4. Comparing SAS+ and SAS Performances in the Downstream Task

To assess the efficiency of SAS+ against the default SAS metric, we first assign every ground truth span a label of two, corresponding to the highest similarity value on the answer labelling scale, because all ground truths are correct answers. The resultant label series becomes the baseline against which the efficiency of both the SAS+ and SAS metrics is estimated. Whichever of SAS and SAS+ correlates better with this baseline series is the more accurate and suitable metric for the evaluation task. This is consistent with the semantic textual similarity analysis protocol, where a proposed method’s efficiency is determined by comparing its computed similarity scores to a predetermined baseline [10]. Recall that SAS+ entails automatically labelling predicted answer pairs with our formulation in Equation (2), while SAS entails manual hand labelling of model-predicted answers.
To replicate the default SAS procedure in this study, expert human annotators proficient in the focus languages assessed and annotated the semantic equivalence between model predictions and corresponding ground truths, following the annotation scheme in Figure 1. Three independent annotators were assigned to each language, and their ratings per answer pair were averaged to obtain the similarity coefficient for the respective instance. Completing the manual annotation process makes two versions of the answer pair data available, differentiated by the method used to label the degree of semantic equivalence. In Figure 4 and Figure 5, the results from correlating the semantic similarity coefficients from SAS and SAS+ with the ground truth coefficients using Spearman correlation are presented.
It is worthy of note that the current analysis proceeds from the previous F1-based classification subtask, where semantic equivalence is investigated for F1 = 0 and F1 ≠ 0. Although the manual and automatic answer annotations covered all instances in the test split of the respective language corpora, the instances where F1 = 0 are filtered out for this subtask. As such, only answer pairs with F1 ≠ 0 are included in the current analysis, as reported in Figure 4 and Figure 5. From the figures, SAS+ consistently shows better correlation with the ground truth coefficients in all four experimental dimensions.
Comparing the improvements of SAS+ over SAS with mT5 and AfroXLMR, the smallest margin observed is 7.35 points in the CL dimension of the Twi language, processed by AfroXLMR. Similarly, SAS+ with AfroXLMR makes the highest gain of both models on the Hausa language, adding 35.86 points to the SAS value in the GMT dimension. In the AfroXLMR HT and NLLB experiments, the minimum performance gains relative to the default SAS are 11.58 and 9.88 points, and the maximum gains are 29.19 and 31.75 points, respectively (see Figure 4). Assessing the mT5 predictions, SAS+ improves on the SAS evaluation in all languages across the four dimensions. It adds 11.58 points to Fon, 10.03 to Swahili, 9.88 to Fon, and 9.11 to Yoruba in the HT, GMT, NLLB, and CL dimensions, respectively; interestingly, these are the minimum points gained by SAS+ relative to SAS. In terms of maximum points, the magnitude is 31.75 for the Igbo language in the NLLB dimension (see Figure 5). Similarly, the SAS+ valuation of the HT dimension is 27.72 points higher than SAS for the Igbo language. The maximum performance gains in the GMT and CL dimensions are 19.27 and 23.70 points in the Hausa and Swahili languages, respectively.
The performances of AfroXLMR and mT5 on languages like Hausa, Swahili, and Igbo suggest that the models understand these languages better than the others in the test dataset. Closely observing SAS+ in the comparisons of Figure 4 and Figure 5 reveals a pattern consistent with the rise and fall of the SAS results as we move from one language to another. This pattern suggests agreement between the two metrics but indicates that the SAS+ method is more accurate, because it correlates better with the ground truth coefficients in all languages evaluated across the four experimental dimensions.
With the default SAS protocol, especially for large datasets, inconsistencies could arise from human error. Inaccurate labels may be assigned to answer pairs because of fatigue, individual bias, extended hours of work, pre-knowledge, or other human factors. Such factors can significantly impair the performance estimates of a model, and we suspect they contributed to the lower correlation observed from the human annotations when the default SAS protocol was executed. Nevertheless, the remarkable performance of SAS+ in all the low-resourced African languages indicates that the technique guarantees annotation integrity and quality labelling for SAS-based QA evaluation tasks. The SAS+ pipeline also saves the time and resources that would otherwise be required to manually label all answer pairs when F1 ≠ 0. A limitation is that it cannot show the corresponding correlation when F1 = 0 instances are assessed in isolation; yet, SAS+ performs effectively when all F1 ≠ 0 and F1 = 0 instances are combined. Compared to the string-based F1 and EM metrics, the SAS+ pipeline offers the advantage of higher performance estimates derived from the semantic textual analysis of answer pairs.

6. Conclusions

This work presents an SAS-based pipeline enhanced with a novel automatic answer labelling technique for evaluating extractive and generative QA models. The pipeline incorporates the cross-lingual bi-encoder LaBSE for semantic embedding generation and the Sim_ECS vector measure for effective semantic embedding analysis. Cross-lingual semantic analysis of answer spans with LaBSE is conducted after answer prediction with mT5 and AfroXLMR. The key advantage of the SAS+ pipeline is the introduction of automatic answer labelling, which effectively reduces the time, cost, and resources previously needed to accomplish an SAS-based QA evaluation task. With this new pipeline, the annotation time for SAS-based evaluations is reduced by more than half. The technique also ensures data integrity and helps eliminate errors that may emerge from manual annotation by humans.
The potential use of the SAS+ pipeline in real-world applications is supported by its outstanding validation results relative to the traditional string-based EM and F1 metrics on the nine low-resourced African languages considered. The SAS+ results from mT5 and AfroXLMR show the efficiency of the evaluation pipeline on the morphologically and grammatically diverse low-resourced languages evaluated, and they underscore the feasibility of generalizing SAS+ to other low-resourced languages. Compared to state-of-the-art GPT4 and Llama-2 13B performances on the same downstream task, SAS+ estimations are more consistent with human judgements. In summary, we believe that the proposed SAS+ pipeline gives a more natural performance estimate for QA systems targeting real-world deployment.
In the future, alternative multilingual models will be explored as answer prediction and cross-encoder models in the SAS+ pipeline. The performance of the proposed method will also be tested on other low- and high-resourced languages.

Author Contributions

Conceptualization, F.F.I., Y.L. and C.S.; methodology, F.F.I., Y.L., C.S., N.J., I.D.M. and U.G.I.; software, F.F.I., Y.L., C.S., N.J. and I.D.M.; validation, Y.L., C.S., N.J., I.D.M. and U.G.I.; formal analysis, F.F.I., N.J. and I.D.M.; investigation, F.F.I., N.J. and I.D.M.; resources, F.F.I., Y.L., C.S., N.J., I.D.M. and U.G.I.; data curation, F.F.I., N.J., I.D.M. and U.G.I.; writing—original draft preparation, F.F.I., Y.L., C.S., N.J. and I.D.M.; writing—review and editing, F.F.I., Y.L., C.S., N.J., I.D.M. and U.G.I.; visualization, F.F.I., N.J. and I.D.M.; supervision, Y.L. and C.S.; project administration, N.J. and I.D.M.; funding acquisition, I.D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data used in this work are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rasool, Z.; Kurniawan, S.; Balugo, S.; Barnett, S.; Vasa, R.; Chesser, C.; Hampstead, B.M.; Belleville, S.; Mouzakis, K.; Bahar-Fuchs, A. Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction Using CogTale Dataset. Nat. Lang. Process. J. 2024, 8, 100083. [Google Scholar] [CrossRef]
  2. Asai, A.; Kasai, J.; Clark, J.; Lee, K.; Choi, E.; Hajishirzi, H. XOR QA: Cross-Lingual Open-Retrieval Question Answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Association for Computational Linguistics: Bangkok, Thailand, 2021; pp. 547–564. [Google Scholar]
  3. Do, J.; Lee, J.; Hwang, S. ContrastiveMix: Overcoming Code-Mixing Dilemma in Cross-Lingual Transfer for Information Retrieval. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 197–204. [Google Scholar]
  4. Guo, P.; Hu, Y.; Cao, Y.; Ren, Y.; Li, Y.; Huang, H. Query in Your Tongue: Reinforce Large Language Models with Retrievers for Cross-Lingual Search Generative Experience. In Proceedings of the ACM on Web Conference 2024, Singapore, 13–17 May 2024; ACM: Singapore, 2024; pp. 1529–1538. [Google Scholar]
  5. Adelani, D.I.; Ojo, J.; Azime, I.A.; Zhuang, J.Y.; Alabi, J.O.; He, X.; Ochieng, M.; Hooker, S.; Bukula, A.; Lee, E.-S.A.; et al. IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models. arXiv 2025, arXiv:2406.03368. [Google Scholar]
  6. Adebara, I.; Abdul-Mageed, M. Towards Afrocentric NLP for African Languages: Where We Are and Where We Can Go. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 3814–3841. [Google Scholar]
  7. Ogundepo, O.; Zhang, X.; Sun, S.; Duh, K.; Lin, J. AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 8721–8728. [Google Scholar]
  8. Ogundepo, O.; Gwadabe, T.; Rivera, C.; Clark, J.; Ruder, S.; Adelani, D.; Dossou, B.; Diop, A.; Sikasote, C.; Hacheme, G.; et al. Cross-Lingual Open-Retrieval Question Answering for African Languages. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Association for Computational Linguistics: Singapore, 2023; pp. 14957–14972. [Google Scholar]
  9. Risch, J.; Möller, T.; Gutsch, J.; Pietsch, M. Semantic Answer Similarity for Evaluating Question Answering Models. In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, Punta Cana, Dominican Republic, 3 June 2021; Association for Computational Linguistics: Punta Cana, Dominican Republic, 2021; pp. 149–157. [Google Scholar]
  10. Ijebu, F.F.; Liu, Y.; Sun, C.; Usip, P.U. Soft Cosine and Extended Cosine Adaptation for Pre-Trained Language Model Semantic Vector Analysis. Appl. Soft Comput. 2024, 169, 112551. [Google Scholar] [CrossRef]
  11. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-Trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Association for Computational Linguistics: Bangkok, Thailand, 2021; pp. 483–498. [Google Scholar]
  12. Alabi, J.O.; Adelani, D.I.; Mosbach, M.; Klakow, D. Adapting Pre-Trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; Calzolari, N., Huang, C.-R., Kim, H., Pustejovsky, J., Wanner, L., Choi, K.-S., Ryu, P.-M., Chen, H.-H., Donatelli, L., Ji, H., et al., Eds.; International Committee on Computational Linguistics: Gyeongju, Republic of Korea, 2022; pp. 4336–4349. [Google Scholar]
  13. Usip, P.U.; Ijebu, F.F.; Udo, I.J.; Ollawa, I.K. Text-Based Emergency Alert Framework for Under-Resourced Languages in Southern Nigeria. In Semantic AI in Knowledge Graphs; CRC Press: Boca Raton, FL, USA, 2023; pp. 111–126. ISBN 978-1-00-331326-7. [Google Scholar]
  14. Inyang, U.G.; Ijebu, F.F.; Osang, F.B.; Afoluronsho, A.A.; Udoh, S.S.; Eyoh, I.J. A Dataset-Driven Parameter Tuning Approach for Enhanced K-Nearest Neighbour Algorithm Performance. Int. J. Adv. Sci. Eng. Inf. Technol. 2023, 13, 380–391. [Google Scholar] [CrossRef]
  15. Vázquez-Enríquez, M.; Alba-Castro, J.L.; Docío-Fernández, L.; Rodríguez-Banga, E. SWL-LSE: A Dataset of Health-Related Signs in Spanish Sign Language with an ISLR Baseline Method. Technologies 2024, 12, 205. [Google Scholar] [CrossRef]
  16. Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking Large Language Models in Retrieval-Augmented Generation. AAAI 2024, 38, 17754–17762. [Google Scholar] [CrossRef]
  17. Li, S.; Sun, C.; Liu, B.; Liu, Y.; Ji, Z. Modeling Extractive Question Answering Using Encoder-Decoder Models with Constrained Decoding and Evaluation-Based Reinforcement Learning. Mathematics 2023, 11, 1624. [Google Scholar] [CrossRef]
  18. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; Association for Computational Linguistics: Austin, TX, USA, 2016; pp. 2383–2392. [Google Scholar]
  19. Chen, D.; Fisch, A.; Weston, J.; Bordes, A. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 1870–1879. [Google Scholar]
  20. Harris, S.; Hadi, H.J.; Ahmad, N.; Alshara, M.A. Fake News Detection Revisited: An Extensive Review of Theoretical Frameworks, Dataset Assessments, Model Constraints, and Forward-Looking Research Agendas. Technologies 2024, 12, 222. [Google Scholar] [CrossRef]
  21. Wu, C.-S.; Madotto, A.; Liu, W.; Fung, P.; Xiong, C. QAConv: Question Answering on Informative Conversations. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 5389–5411. [Google Scholar]
  22. Huang, J.; Wang, M.; Cui, Y.; Liu, J.; Chen, L.; Wang, T.; Li, H.; Wu, J. Layered Query Retrieval: An Adaptive Framework for Retrieval-Augmented Generation in Complex Question Answering for Large Language Models. Appl. Sci. 2024, 14, 11014. [Google Scholar] [CrossRef]
  23. Chen, J.; Wang, H.; Shang, J.; Chaomurilige. Interpretable Embeddings for Next Point-of-Interest Recommendation via Large Language Model Question–Answering. Mathematics 2024, 12, 3592. [Google Scholar] [CrossRef]
  24. Hernández, A.; Ortega-Mendoza, R.M.; Villatoro-Tello, E.; Camacho-Bello, C.J.; Pérez-Cortés, O. Natural Language Understanding for Navigation of Service Robots in Low-Resource Domains and Languages: Scenarios in Spanish and Nahuatl. Mathematics 2024, 12, 1136. [Google Scholar] [CrossRef]
  25. Gaim, F.; Yang, W.; Park, H.; Park, J. Question-Answering in a Low-Resourced Language: Benchmark Dataset and Models for Tigrinya. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 11857–11870. [Google Scholar]
  26. Wanjawa, B.W.; Wanzare, L.D.A.; Indede, F.; Mconyango, O.; Muchemi, L.; Ombui, E. KenSwQuAD—A Question Answering Dataset for Swahili Low-Resource Language. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–20. [Google Scholar] [CrossRef]
  27. Bayes, E.; Azime, I.A.; Alabi, J.O.; Kgomo, J.; Eloundou, T.; Proehl, E.; Chen, K.; Khadir, I.; Etori, N.A.; Muhammad, S.H.; et al. Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages. arXiv 2024, arXiv:2412.00948. [Google Scholar]
  28. Ojo, J.; Ogueji, K.; Stenetorp, P.; Adelani, D.I. How Good Are Large Language Models on African Languages? arXiv 2024, arXiv:2311.07978v2. [Google Scholar]
  29. Lewis, P.; Oguz, B.; Rinott, R.; Riedel, S.; Schwenk, H. MLQA: Evaluating Cross-Lingual Extractive Question Answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Bangkok, Thailand, 2020; pp. 7315–7330. [Google Scholar]
  30. Longpre, S.; Lu, Y.; Daiber, J. MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering. Trans. Assoc. Comput. Linguist. 2021, 9, 1389–1406. [Google Scholar] [CrossRef]
  31. Asai, A.; Yu, X.; Kasai, J.; Hajishirzi, H. One Question Answering Model for Many Languages with Cross-Lingual Dense Passage Retrieval. arXiv 2021, arXiv:2107.11976. [Google Scholar]
  32. NLLB Team; Costa-jussà, M.R.; Cross, J.; Çelebi, O.; Elbayad, M.; Heafield, K.; Heffernan, K.; Kalbassi, E.; Lam, J.; Licht, D.; et al. Scaling Neural Machine Translation to 200 Languages. Nature 2024, 630, 841–846. [Google Scholar] [CrossRef]
  33. Adelani, D.I.; Alabi, J.O.; Fan, A.; Kreutzer, J.; Shen, X.; Reid, M.; Ruiter, D.; Klakow, D.; Nabende, P.; Chang, E.; et al. A Few Thousand Translations Go a Long Way! Leveraging Pre-Trained Models for African News Translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; Carpuat, M., de Marneffe, M.-C., Meza Ruiz, I.V., Eds.; Association for Computational Linguistics: Seattle, WA, USA, 2022; pp. 3053–3070. [Google Scholar]
  34. Clark, J.H.; Choi, E.; Collins, M.; Garrette, D.; Kwiatkowski, T.; Nikolaev, V.; Palomaki, J. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. Trans. Assoc. Comput. Linguist. 2020, 8, 454–470. [Google Scholar] [CrossRef]
  35. Wang, L.; Yu, K.; Wumaier, A.; Zhang, P.; Yibulayin, T.; Wu, X.; Gong, J.; Maimaiti, M. Genre: Generative Multi-Turn Question Answering with Contrastive Learning for Entity–Relation Extraction. Complex Intell. Syst. 2024, 10, 3429–3443. [Google Scholar] [CrossRef]
  36. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  37. Jin, J.; Wang, H. Select High-Quality Synthetic QA Pairs to Augment Training Data in MRC under the Reward Guidance of Generative Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia, 20–25 May 2024; Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., Xue, N., Eds.; ELRA and ICCL: Torino, Italia, 2024; pp. 14543–14554. [Google Scholar]
  38. Inyang, U.G.; Robinson, S.A.; Ijebu, F.F.; Udo, I.J.; Nwokoro, C.O. Optimality Assessments of Classifiers on Single and Multi-Labelled Obstetrics Outcome Classification Problems. IJACSA 2021, 12. [Google Scholar] [CrossRef]
  39. Sellam, T.; Das, D.; Parikh, A. BLEURT: Learning Robust Metrics for Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Bangkok, Thailand, 2020; pp. 7881–7892. [Google Scholar]
  40. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the Eighth International Conference on Learning Representations, ICLR, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  41. Feng, F.; Yang, Y.; Cer, D.; Arivazhagan, N.; Wang, W. Language-Agnostic BERT Sentence Embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 878–891. [Google Scholar]
  42. Usip, P.U.; Ijebu, F.F.; Dan, E.A. A Spatiotemporal Knowledge Bank from Rape News Articles for Decision Support. In Knowledge Graphs and Semantic Web: Second Iberoamerican Conference and First Indo-American Conference, KGSWC 2020, Mérida, Mexico, 26–27 November 2020; Villazón-Terrazas, B., Ortiz-Rodríguez, F., Tiwari, S.M., Shandilya, S.K., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 147–157. [Google Scholar]
  43. Xiong, K.; Ding, X.; Cao, Y.; Liu, T.; Qin, B. Examining Inter-Consistency of Large Language Models Collaboration: An In-Depth Analysis via Debate. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 7572–7590. [Google Scholar]
  44. Wang, Z.; Mao, S.; Wu, W.; Ge, T.; Wei, F.; Ji, H. Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 257–279. [Google Scholar]
Figure 1. Answer similarity labelling scheme for the SAS metric [9].
Figure 2. Proposed SAS+ pipeline for cross-lingual QA evaluation.
Figure 3. SAS+ evaluation results of mT5 and AfroXLMR compared to F1-based GPT4 and Llama-2 13B performances on the AfriQA dataset. The F1 results of GPT4 and Llama-2 are taken from [28].
Figure 4. SAS and SAS+ performances in evaluating cross-lingual QA systems. Each subplot depicts the annotation outcome from SAS and SAS+ compared to the ground truth using Spearman’s correlation. The answer prediction model in this figure is AfroXLMR.
Figure 5. SAS and SAS+ performances in evaluating cross-lingual QA systems. Each subplot depicts the annotation outcome from SAS and SAS+ compared to the ground truth using Spearman’s correlation. The answer prediction model in this figure is mT5.
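Figures 4 and 5 compare the labels assigned by SAS and SAS+ against the ground-truth annotations using Spearman’s rank correlation. The snippet below is a minimal illustrative sketch of that comparison for one language split; the label values and the three-class labelling convention shown here are hypothetical placeholders, not data from the study.

```python
# Illustrative sketch: Spearman's rank correlation between metric-assigned
# answer-similarity labels and human ground-truth labels for one language.
# The label arrays below are hypothetical placeholders.
from scipy.stats import spearmanr

ground_truth = [2, 0, 1, 2, 2, 0, 1, 2]   # human labels (e.g., 0/1/2 similarity classes)
sas_labels   = [2, 1, 1, 2, 0, 0, 1, 2]   # labels derived from SAS scores
sas_plus     = [2, 0, 1, 2, 2, 0, 1, 1]   # labels derived from SAS+ scores

rho_sas, p_sas = spearmanr(ground_truth, sas_labels)
rho_plus, p_plus = spearmanr(ground_truth, sas_plus)
print(f"SAS  vs. ground truth: rho={rho_sas:.3f} (p={p_sas:.3f})")
print(f"SAS+ vs. ground truth: rho={rho_plus:.3f} (p={p_plus:.3f})")
```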
Table 1. SAS+ evaluation of AfroXLMR QA performance in African languages relative to the F1 and EM evaluations. The best score among the three metrics in each language’s experimental dimension is indicated in bold text. All F1 and EM results are taken from [8].
Lang. | HT (SAS+ / F1 / EM) | GMT (SAS+ / F1 / EM) | NLLB (SAS+ / F1 / EM) | CL (SAS+ / F1 / EM)
bem | 81.64 / 38.20 / 29.50 | - / - / - | 82.29 / 30.00 / 21.90 | 30.45 / 0.40 / 0.40
fon | 85.77 / 53.80 / 40.40 | - / - / - | 77.95 / 37.50 / 26.70 | 59.86 / 13.40 / 6.00
hau | 81.85 / 60.90 / 52.70 | 82.67 / 54.40 / 47.70 | 82.00 / 50.90 / 43.70 | 80.22 / 27.70 / 23.70
ibo | 75.54 / 68.20 / 60.60 | 80.66 / 62.10 / 55.00 | 71.71 / 62.80 / 56.20 | 80.23 / 29.20 / 24.70
kin | 85.31 / 56.80 / 38.90 | 84.55 / 50.80 / 36.00 | 82.78 / 51.30 / 36.60 | 75.42 / 22.70 / 17.90
swa | 84.80 / 45.20 / 37.90 | 83.85 / 44.60 / 37.90 | 84.62 / 45.20 / 38.10 | 82.55 / 31.60 / 24.60
twi | 86.67 / 51.20 / 41.80 | 86.10 / 39.20 / 31.10 | 86.12 / 34.30 / 30.00 | 31.98 / 3.40 / 2.50
yor | 81.38 / 45.10 / 38.60 | 84.42 / 36.00 / 31.70 | 81.97 / 32.30 / 28.00 | 50.00 / 6.00 / 3.80
zul | 84.90 / 59.10 / 49.20 | 83.50 / 56.00 / 48.60 | 82.39 / 53.60 / 45.80 | 68.60 / 17.00 / 13.50
Table 2. SAS+ evaluation of mT5 QA performance on low-resourced African languages relative to the F1 and EM evaluations. The best score among the three metrics in each language’s experimental dimension is indicated in bold text. All F1 and EM results are taken from [8].
Lang. | HT (SAS+ / F1 / EM) | GMT (SAS+ / F1 / EM) | NLLB (SAS+ / F1 / EM) | CL (SAS+ / F1 / EM)
bem | 75.77 / 48.80 / 41.70 | - / - / - | 84.68 / 38.50 / 32.00 | 46.52 / 2.90 / 1.10
fon | 82.12 / 41.40 / 28.50 | - / - / - | 83.67 / 23.40 / 15.30 | 52.44 / 5.10 / 2.30
hau | 74.92 / 58.50 / 49.00 | 80.87 / 53.50 / 45.70 | 82.99 / 50.90 / 42.70 | 63.29 / 25.80 / 22.30
ibo | 64.48 / 66.60 / 59.20 | 68.82 / 59.80 / 53.30 | 73.94 / 60.20 / 53.30 | 70.00 / 41.70 / 34.70
kin | 70.07 / 60.80 / 43.80 | 75.59 / 57.30 / 40.90 | 79.34 / 58.80 / 42.90 | 59.49 / 25.50 / 20.20
swa | 81.01 / 52.30 / 42.60 | 82.56 / 48.90 / 40.80 | 83.97 / 49.20 / 41.20 | 66.78 / 29.40 / 23.50
twi | 79.72 / 55.40 / 45.30 | 86.30 / 42.00 / 33.70 | 86.82 / 40.10 / 33.10 | 67.04 / 5.30 / 3.50
yor | 77.57 / 54.90 / 49.80 | 82.02 / 48.90 / 45.10 | 83.61 / 47.90 / 43.00 | 62.19 / 11.90 / 7.80
zul | 74.80 / 60.20 / 50.80 | 76.97 / 57.40 / 48.90 | 79.45 / 55.60 / 46.50 | 59.75 / 24.70 / 20.90
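Tables 1 and 2 contrast a semantic score (SAS+) with the string-based F1 and EM metrics for each translation setting. The sketch below illustrates, under stated assumptions, how the three quantities could be computed for a single prediction–gold pair: EM and token-level F1 follow the usual SQuAD-style definitions, while the semantic score is approximated here by the cosine similarity of LaBSE embeddings obtained through the sentence-transformers library. This is a simplified stand-in for the paper’s SAS+ pipeline (which integrates an adapted vector measure), not its exact implementation, and the example answer pair is hypothetical.

```python
# Simplified illustration of the three metric families compared in Tables 1 and 2.
# The LaBSE cosine similarity is a stand-in for the full SAS+ pipeline.
from collections import Counter
from sentence_transformers import SentenceTransformer, util

def exact_match(pred: str, gold: str) -> float:
    # 1.0 only when the normalized strings are identical
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    # SQuAD-style token-overlap F1
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

model = SentenceTransformer("sentence-transformers/LaBSE")  # language-agnostic encoder

def semantic_score(pred: str, gold: str) -> float:
    # Cosine similarity of normalized LaBSE embeddings
    emb = model.encode([pred, gold], convert_to_tensor=True, normalize_embeddings=True)
    return float(util.cos_sim(emb[0], emb[1]))

pred, gold = "Nnamdi Azikiwe", "Dr. Nnamdi Azikiwe"  # hypothetical answer pair
print(exact_match(pred, gold), round(token_f1(pred, gold), 2), round(semantic_score(pred, gold), 2))
```

Because EM penalizes any surface mismatch and token F1 rewards only lexical overlap, a semantically correct answer in a different surface form scores 0 on EM and low on F1 while still receiving a high embedding-based similarity, which is the gap the tables quantify.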
Table 3. SAS+ evaluation of model performance relative to human judgement based on the F1 score. Predicted answers are classified into two groups based on their F1 score and analysed with the SAS+ metric. The scores indicate the degree of SAS+ agreement with human judgement on the semantic equivalence between model predictions and the ground truth.
Lang | F1 value | AfroXLMR (HT / GMT / NLLB / CL) | mT5 (HT / GMT / NLLB / CL)
bem | F1 = 0 | 32.16 / - / 30.84 / 9.38 | 36.77 / - / 44.56 / 6.91
bem | F1 ≠ 0 | 56.00 / - / 66.60 / 54.77 | 47.66 / - / 55.64 / 81.60
fon | F1 = 0 | 26.01 / - / 7.37 / 5.90 | 53.94 / - / 27.05 / 9.46
fon | F1 ≠ 0 | 68.30 / - / 77.61 / 77.69 | 64.90 / - / 76.61 / 83.70
hau | F1 = 0 | 18.20 / 13.17 / 17.89 / 13.87 | 35.86 / 25.58 / 38.70 / 4.48
hau | F1 ≠ 0 | 42.01 / 43.35 / 50.13 / 50.84 | 49.26 / 55.39 / 58.35 / 78.34
ibo | F1 = 0 | 18.20 / 13.17 / 17.89 / 13.87 | 41.89 / 49.63 / 43.73 / 27.41
ibo | F1 ≠ 0 | 37.06 / 45.12 / 34.53 / 59.12 | 51.01 / 41.74 / 53.57 / 70.44
kin | F1 = 0 | 24.01 / 27.84 / 30.54 / 18.56 | 33.76 / 36.04 / 45.14 / 20.96
kin | F1 ≠ 0 | 64.60 / 59.67 / 64.89 / 69.91 | 55.33 / 56.16 / 63.57 / 67.05
swa | F1 = 0 | 36.92 / 27.16 / 31.89 / 18.86 | 27.94 / 20.89 / 39.83 / 15.38
swa | F1 ≠ 0 | 50.41 / 53.47 / 60.87 / 63.50 | 60.14 / 66.00 / 64.24 / 62.92
twi | F1 = 0 | 34.70 / 27.75 / 28.97 / 0.00 | 17.85 / 29.02 / 29.07 / 24.67
twi | F1 ≠ 0 | 58.62 / 54.04 / 58.75 / 82.89 | 57.75 / 60.96 / 59.04 / 75.82
yor | F1 = 0 | 25.75 / 24.52 / 18.23 / 2.02 | 22.62 / 19.23 / 24.89 / 19.54
yor | F1 ≠ 0 | 54.67 / 48.73 / 52.67 / 83.20 | 52.84 / 55.98 / 63.64 / 82.13
zul | F1 = 0 | 31.97 / 25.04 / 25.79 / 6.43 | 48.03 / 36.40 / 33.81 / 0.00
zul | F1 ≠ 0 | 57.60 / 57.32 / 60.06 / 75.38 | 60.06 / 58.51 / 60.74 / 76.27
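Table 3 measures how often SAS+ agrees with human judgement after predictions are split by their F1 score. The following is a minimal sketch of that grouping logic only; the record fields, the 0.5 agreement threshold, and the example data are assumptions made for illustration and do not reproduce the paper’s analysis script.

```python
# Illustrative grouping of predictions by F1 score (zero vs. non-zero) and
# computation of SAS+ agreement with human judgement within each group.
# Records and the 0.5 agreement threshold are hypothetical.
records = [
    # (token-level F1, SAS+ score, human judged the answers semantically equivalent?)
    (0.0, 0.81, True), (0.0, 0.12, False), (0.67, 0.90, True), (1.0, 0.95, True),
]

def agreement(group, threshold=0.5):
    # Percentage of items where the thresholded SAS+ decision matches the human label
    if not group:
        return 0.0
    hits = sum(1 for _, sas, human in group if (sas >= threshold) == human)
    return 100.0 * hits / len(group)

f1_zero = [r for r in records if r[0] == 0.0]
f1_nonzero = [r for r in records if r[0] != 0.0]
print("F1 = 0 group agreement:", agreement(f1_zero))
print("F1 ≠ 0 group agreement:", agreement(f1_nonzero))
```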