Text Classification Based on Convolutional Neural Networks and Word Embedding for Low-Resource Languages: Tigrinya

This article studies convolutional neural networks for Tigrinya (also referred to as Tigrigna), a Semitic language spoken in Eritrea and northern Ethiopia. Tigrinya is a “low-resource” language, notable for the absence of comprehensive and freely available data. Furthermore, like other Semitic languages, it is characterized as one of the most semantically and syntactically complex languages in the world. To the best of our knowledge, no previous research has been conducted on the state-of-the-art embedding technique that is shown here. We investigate which word representation methods perform better in terms of learning for single-label text classification problems, which are common when dealing with morphologically rich and complex languages. Two datasets are used here: a manually annotated dataset that contains 30,000 Tigrinya news texts from various sources in six categories (“sport”, “agriculture”, “politics”, “religion”, “education”, and “health”), and an unannotated corpus that contains more than six million words. In this paper, we explore pretrained word embedding architectures using various convolutional neural networks (CNNs) to predict class labels. We construct a CNN with a continuous bag-of-words (CBOW) method, a CNN with a skip-gram method, and CNNs with and without word2vec and FastText to evaluate Tigrinya news articles. We also compare the CNN results with traditional machine learning models and evaluate the results in terms of accuracy, precision, recall, and F1 score. The CBOW CNN with word2vec achieves the best accuracy, 93.41%, significantly improving the accuracy for Tigrinya news classification.


Introduction
The rise of Internet usage has led to the production of diverse text data, provided by various social media platforms and websites in different languages. On the one hand, English and many other languages are regarded as high-resource languages because of the availability of tools and data for numerous natural language processing tasks. On the other hand, many languages are deemed to be low-resource languages [1]. Similarly, Negaish [2] and Osman et al. [2,3] mentioned that Tigrinya is a "low-resource" language because of its underdeveloped data resources, few linguistic materials, and even fewer linguistic tools. The lack of data resources is manifested in the absence of a standard Tigrinya text dataset, which is a significant barrier for Tigrinya text classification. Consequently, Tigrinya remains understudied from a natural language processing (NLP) perspective, and this imposes challenges for the advancement of Tigrinya text classification research [3,4]. Tedla et al. [5] mentioned that, unlike many other languages, Tigrinya is rarely used in wiki pages, which commonly serve as raw sources for constructing unlabeled corpora (i.e., corpora used for word embedding). Moreover, there are almost no

• We develop a dataset that contains 30,000 text documents labeled in six categories.
• We develop an unsupervised corpus that contains more than six million words to support CNN embedding.
• This work allows an immediate comparison of current state-of-the-art text classification techniques in the context of the Tigrinya language.
• Finally, we evaluate the CNN classification accuracy with word2vec and FastText models and compare classifier performance with various machine learning techniques.
It is expected that the results of this study will reveal how the choice of word embedding model affects Tigrinya news article classification with CNNs. CNNs are used in many approaches to natural language processing problems because of their ability to learn complex feature representations as compared with traditional machine learning approaches. We apply a CNN-based approach for categorization at the sentence level based on the semantics extracted from the corpus. We compare the FastText and word2vec pretrained vectors in terms of their impact on text classification. Our results indicate that the word2vec CNN approach outperforms the other approaches, reaching a classification accuracy of 93.41%. The structure of this paper is as follows: in Section 2, we present the research background and related works; in Section 3, we present the research methodology; in Section 4, dataset construction and CNN architecture are described; in Section 5, we detail the evaluation techniques; in Section 6, we conclude the paper with a summary and discuss possible future work.
Information 2021, 12, 52

Previous Attempts for Tigrinya Natural Language Processing
We review existing works related to the proposed scheme, mainly considering previous attempts with the Tigrinya language. The majority of Tigrinya speakers live in Eritrea and the northern part of Ethiopia (Tigray Province) in the Horn of Africa, with an estimated population of more than 10 million [16]. Tigrinya is ranked third among the most widely spoken Semitic languages in the world, after Arabic and Amharic [17]. Although Tigrinya is similar to most Semitic languages in several ways, it has distinctive compound prepositions such as "ኣብልዕሉዓራት or ab leliarat" (on (top of) the bed), which consists of the preposition ኣብ/ab, the preposition ልዕሉ/leli, and the noun ዓራት/arat. Furthermore, Abate et al. [18] mentioned that Tigrinya is a highly inflected language and features complex morphological characteristics, because basic word formation is based on sequences of consonants expressed by "roots" and "template patterns." Littell et al. [19] stated that Tigrinya shows both inflectional and derivational morphologies: the former pertains to tense, mood, gender, person, number, etc., while the latter produces different case patterns that include voice, causative, and frequentative forms. Together, the two morphologies generate an enormous number of variants for a single word through prefix, infix, and suffix affixation, which contributes to the lack of data. In addition, Tigrinya belongs to the set of low-resource languages, for which only minimal data, linguistic materials, and tools are available [4,6].
Recently, a few researchers have attempted to address the lack of common corpora, as shown in Table 1, by compiling collections of words, tokens, or sentences. Furthermore, a few Tigrinya researchers have considered preprocessing with word stemming techniques and its impact on result accuracy [3]. In a first attempt, Fisseha [20] developed a rule-based stemming algorithm combined with a dictionary-based stemming method, which gave better results for feature selection in Tigrinya information retrieval tasks, although most stemming methods give good results. Overall, in our review of the Tigrinya NLP literature, we observed that none of the researchers had implemented neural network approaches for Tigrinya text classification.

Author | Main Application | Sentences/Tokens | Year
Fisseha [20] | Stemming algorithm | 690,000 | 2011
Reda et al. [21] | Unsupervised ML word sense disambiguation | 190,000 | 2018
Osman et al. [3] | Stemming Tigrinya words | 164,634 | 2012
Yemane et al. [6] | POS tagging for Tigrinya | 72,000 | 2016

Table 1. Literature review of previous Tigrinya natural language processing (NLP) research papers.

Text Classification
A conventional text classification framework consists of preprocessing, feature extraction, feature selection, and classification stages. Such frameworks have to deal with several problems related to both the nature and structure of the underlying textual information, converting word variations into concise representations while preserving most of the linguistic features. Uysal et al. [22] studied the impact of preprocessing in both the text and language domains and found that preprocessing affected accuracy. They further concluded that the preprocessing step in text classification is as essential as the feature extraction, feature selection, and classification steps. Specifically, conventional approaches for text analysis use typical features, such as bag-of-words [23], n-gram [24], and term frequency-inverse document frequency (TF-IDF) methods [25], as inputs to machine learning algorithms such as Naïve Bayes (NB) classifiers [26], K-nearest neighbor (KNN) algorithms [27], and support vector machines (SVMs) [28] for classification. Text classification is based on the statistical frequency of sentiment-related words extracted from tools such as lexicons [29]. Zhang et al. [29] provided an improved TF-IDF approach that used confidence, support, and characteristic words to enhance the recall and accuracy for text classification.
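To make the TF-IDF feature mentioned above concrete, the following is a minimal sketch in plain Python. It assumes raw term frequency normalized by document length and idf = log(N / df); other weighting variants exist, and this is not the improved scheme of [29]. The function name `tf_idf` and the toy English documents are illustrative assumptions only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Toy tokenized documents (hypothetical, for illustration only).
docs = [
    ["team", "wins", "match"],
    ["farmers", "harvest", "crops"],
    ["team", "visits", "farmers"],
]
vectors = tf_idf(docs)
```

A term that occurs in fewer documents (here "wins") receives a larger weight than one shared across documents (here "team"), which is the property that frequency-based classifiers such as NB, KNN, and SVMs rely on.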
Machine learning has become a field of interest for text classification tasks, where machine-learning methods show great potential for obtaining linguistic knowledge. Although statistical machine learning-based representation models have achieved comparable performance, their shortcomings are apparent. First, these techniques concentrate only on word frequency features and completely neglect the contextual structure information in text, making it a challenge to capture text semantics. Second, the success of these statistical approaches typically depends heavily on laborious feature engineering and the use of enormous linguistic resources.
In recent years, there has been a complete shift from statistical machine learning to state-of-the-art deep learning for text categorization models [30,31]. Zhang et al. [32] mentioned that natural language-based text classification has a wide range of applications, ranging from emotion classification to text classification. With their first design, Collobert and Weston [33] found that image preprocessing methods could also be used for natural language preprocessing. Moreover, many researchers have applied neural networks to text classification problems by using end-to-end deep neural networks to extract contextual features from raw text data. Kim [12] adopted a method to capture local features from different positions of words in sentences using various convolutional neural network architectures. Similarly, Zhang et al. [34] designed a powerful method using character-level information for text classification. Furthermore, Lai [35] recommended recurrent neural network models for contextual information, together with a convolutional neural network. The most useful information in text is captured through pooling, with CNNs using unsupervised word vectors beneath a single convolutional layer and relatively simple convolutional kernels as fixed windows. Pennington et al. [36] devised a corpus-based approach that considered linear substructures in word embedding space models such as word2vec via thorough training with global word co-occurrence data.
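The convolution-plus-pooling idea described above can be sketched for a single filter in plain Python. This is an illustrative sketch only, assuming a ReLU activation and max-over-time pooling; the function name `conv1d_max_pool` and the toy vectors are hypothetical, and a real model would use many filters of several window sizes.

```python
def conv1d_max_pool(sentence, kernel, bias=0.0):
    """Slide one filter over word windows, apply ReLU, then max-pool over time.

    sentence: list of word vectors (lists of floats, all of dimension d)
    kernel:   list of h weight vectors of dimension d; h is the window size
    """
    h = len(kernel)
    if len(sentence) < h:
        raise ValueError("sentence is shorter than the filter window; pad it first")
    feature_map = []
    for i in range(len(sentence) - h + 1):
        window = sentence[i:i + h]
        # Dot product between the filter and the h stacked word vectors.
        score = sum(
            w * x
            for k_row, x_row in zip(kernel, window)
            for w, x in zip(k_row, x_row)
        )
        feature_map.append(max(0.0, score + bias))  # ReLU
    # Max-over-time pooling keeps the strongest response, whatever its position.
    return max(feature_map)


# Four 2-dimensional "word vectors" and one filter spanning two words.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]
pooled = conv1d_max_pool(sentence, kernel)
```

Because pooling keeps only the maximum over all window positions, each filter contributes one feature per sentence regardless of where in the sentence the matching pattern occurs.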
For multi-label classification of short texts, Parwez [37] mentioned that CNN architectures yield promising results by using domain-specific word embedding. Tang et al. [38] stated that sentiment-based word embedding models should be designed by encoding textual information along with word contexts, enabling the discernment of opposite word polarities in related contexts. On the basis of the improved word embedding methods, where training is based on word-to-word co-occurrence in a corpus, a CNN is used here to extract features in order to obtain high-level sentence representations. Pennington et al. [36] introduced the GloVe model, an unsupervised global log-bilinear regression model that is used for learning the representations of relatively uncommon words. Additionally, Joulin et al. [39] learned vector representations by mixing unsupervised and supervised techniques so that the word vectors capture semantic information. In [40], it was shown that combining a CNN with a recurrent neural network (RNN) for the sentiment analysis of short texts provides good results. In [25], a CNN was used, and character-level information was considered, to support word-level embedding. One of the contributions of this work is an end-to-end network that comprises four main steps, namely, word vectorization, sentence vectorization, document vectorization, and classification. We compare the results of the proposed method with other machine learning approaches.

Word Embeddings
Word embedding is foundational to natural language processing and represents the words in a text in an R-dimensional vector space, thereby enabling the capture of semantics, semantic similarity between words, and syntactic information for words. Word embedding approaches via word2vec have been proposed by Mikolov et al. [15]. Pennington et al. [36] and Arora et al. [36,41] introduced word2vec's semantic similarity as a standard sequence embedding method that translates natural language into distributed representations of vectors; however, in order to overcome the inability of a predefined dictionary to learn rare word representations, FastText [42] is also used for character embedding. The word2vec and FastText models, which include two separate components (CBOW and skip-gram), can capture contextual word-to-word relationships in a multidimensional space as a preliminary step for predictive models used for semantics and data retrieval tasks [14,40]. Figure 1 shows that when the context words are given, the CBOW component infers the target word, while the skip-gram component infers the context words when the input word is provided [43]. In addition, the input, projection, and output layers are available for both learning algorithms, although their processes of output formulation differ. Here, W_n denotes words. The projection layer corresponds to an array of multidimensional vectors and stores the sum of several vectors. The output layer corresponds to the layer that outputs the results of the vectorization.
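The difference between the two components can be illustrated by the training examples each one builds from the same token sequence. The sketch below is plain Python and covers only example generation, not the network itself; the helper names (`context_window`, `cbow_examples`, `skipgram_examples`) and the window size of 2 are assumptions chosen for readability.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding it."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: the context words are the input, the center word is the target."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the center word is the input, each context word is a target."""
    return [
        (tokens[i], context)
        for i in range(len(tokens))
        for context in context_window(tokens, i, window)
    ]


tokens = ["w1", "w2", "w3", "w4", "w5"]
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
```

For the center word `w3`, CBOW produces a single example with four input words, whereas skip-gram produces four separate (input, target) pairs, which is one reason skip-gram generates more training signal for each occurrence of a rare word.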

Research Methods and Dataset Construction
Research Methods
Figure 2 shows the research methodology. The first process is data collection, where text data are collected from various sources. We created a single-label dataset (Dataset 1), of which 90% was used for training and 10% for testing. We also considered an unsupervised corpus (Dataset 2). Some text preprocessing steps were carried out before the data were passed to the model. These steps included removing extra white spaces, removing meaningless words, removing duplicate words, tokenization, cleaning, and the removal of stop words. These steps provided unique and meaningful sequences of words with unique identifications. With Dataset 2, we performed word embedding using FastText and word2vec with both the CBOW and skip-gram algorithms. Both word embedding techniques were trained with 100 dimensions and a window size of 5 to capture meaningful vectors that were able to learn from the nature of our data type, as well as from the morphological richness of the language. Using the preprocessed words, the embedding layer learned distributed representations for the input tokens, capturing the latent relationships among them. We applied a CNN-based approach to automatically learn and classify sentences into one of the six categories in evaluation Dataset 1. CNNs require inputs to have a static size, whereas sentence lengths can vary greatly. Consequently, we used a fixed maximum input length of 235 words. Finally, considering the categorical news articles, based on the pretrained word vectors, we evaluated the accuracy, precision, recall, and F1 scores of the methods. The methodology that we used is very close to that which was proposed in [12].
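The fixed-size input requirement and the 90/10 split described above can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline used in this work: the helper names, the choice of id 0 for padding, right-side padding, and the fixed random seed are all assumptions.

```python
import random

MAX_LEN = 235  # fixed input length reported in the text
PAD_ID = 0     # assumption: id 0 is reserved for padding


def build_vocab(tokenized_docs):
    """Assign each distinct token a unique integer id, starting at 1."""
    vocab = {}
    for doc in tokenized_docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def encode(doc, vocab, max_len=MAX_LEN):
    """Map tokens to ids, drop unknown tokens, then truncate or pad to max_len."""
    ids = [vocab[t] for t in doc if t in vocab][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def train_test_split(items, test_fraction=0.1, seed=0):
    """Shuffle a copy of the items and hold out `test_fraction` for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy tokens stand in for preprocessed Tigrinya words.
vocab = build_vocab([["a", "b"], ["b", "c"]])
short = encode(["a", "c", "unseen"], vocab)
long_doc = encode(["a"] * 300, vocab)
train, test = train_test_split(list(range(20)))
```

Every encoded document has exactly 235 positions, so a batch of documents forms a rectangular array that an embedding layer followed by convolutional filters can consume.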

Single-Label Tigrinya News Articles Dataset
We obtained news articles from popular news sources via web scraping and manual data collection techniques. We employed web-scraping tools (Selenium Python, requests, Beautiful Soup, and PowerShell) for news sources accessible on the Internet (Figure 3a); the number of articles in each category is shown in Figure 3b. Furthermore, we also collected news articles in the form of Word documents from the Tigray Mass Media Agency (a broadcast TV news agency) that were transmitted on air from 2012 to 2018. The newly built corpus has six categories, i.e., “sport”, “agriculture”, “politics”, “religion”, “education”, and “health”.

Research Methods and Dataset Construction
Research Methods Figure 2 shows the research methodology. Generally, the first process is data collection, where text data are collected from various sources. We created a single-label dataset (Dataset 1) that used 90% of data for training and 10% for testing. We also considered an unsupervised corpus (Dataset 2). Some text preprocessing steps were carried out before the data were passed to the model. These steps included removing extra white spaces, removing meaningless words, removing duplicate words, tokenization, cleaning, and the removal of stop words. These steps provided unique and meaningful sequences of words with unique identifications. With Dataset 2, we performed word embedding using FastText and word2vec with both the CBOW and skip-gram algorithms. These algorithms were trained with 100 dimensions and window sizes of 5 for both word embedding techniques to capture meaningful vectors that were able to learn from the nature of our data type, as well as from the morphological richness of the language. Using the preprocessed words, the embedding layer learned distributed representations for input tokens and these tokens had the same latent relationships. We applied a CNN-based approach to automatically learn and classify sentences into one of the six categories in evaluation Dataset 1. CNNs require inputs to have a static size and sentence lengths can vary greatly. Consequently, we used a maximum average word length of 235. Finally, considering the categorical news articles, based on the pretrained word vectors, we evaluated the accuracy, precision, recall, and F1 scores between methods. The methodology that we used is very close to that which was proposed in [12].

Our dataset consisted of 30,000 articles, and we split the dataset into subsets, i.e., 24,000 articles for the training set and 6000 for the test set. We used 90% of articles in the training set (3600 for each category) and 10% for testing (1000 articles for each class). The training dataset was used to train the classifier and optimize the parameters, while the test dataset (unseen to the model) was reserved for testing the built model and determining the quality of the trained model. Statistics for the aforementioned dataset are given in Table 2.
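A split that keeps every category equally represented in both subsets can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function name, the seed, and the toy articles are assumptions. A test fraction of 0.2 corresponds to the division of 30,000 articles into 24,000 and 6000.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (text, category)


def stratified_split(examples: List[Example], test_fraction: float,
                     seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Split so that every category contributes the same fraction to the test set."""
    by_label: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    rng = random.Random(seed)
    train: List[Example] = []
    test: List[Example] = []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)                      # shuffle within the category
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


categories = ["sport", "agriculture", "politics", "religion", "education", "health"]
# Ten placeholder articles per category.
toy = [(f"article {i} about {c}", c) for c in categories for i in range(10)]
train_set, test_set = stratified_split(toy, test_fraction=0.2)
```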

Tigrinya Corpus Collection and Preparation
In NLP applications, conventional features such as term frequency-inverse document frequency (TF-IDF) have proven to be less efficient than word embedding [44]. Consequently, as a significant part of our contribution, we developed our own "Tigrinya multi-domain" corpus from data collected from various sources. This second dataset was used with the word2vec model in Gensim [45] to test how well word embedding supports solving Tigrinya text classification problems. Figure 4 shows the steps for constructing our word embedding model, and Figure 5 shows a similar model that applies to our methodology. Table 3 shows a detailed summary of the word embedding corpus in terms of quantity. Further information can be found in Appendix A.

Data Preprocessing
In order to prepare text data well before input into the classification training models, text preprocessing is an essential step. We applied the following steps to remove unrelated characters and symbols:

• Removing stop words using Python (Figure 6 shows an example of common stop words);
• Tokenizing the texts using Python;
• Removing URLs and links to websites that started with "www.*" or "http://*";
• Removing typing errors;
• Removing non-Tigrinya characters;
• We avoided normalization since it can affect word meanings; for example, the verb "eat" in English is equivalently translated to the Tigrinya form "በልዐ/bele", and adding the prefix "ke/ክእ" changes the word to "kebele/ክእበልዐ", meaning "let me eat";
• Furthermore, we also prepared a list for mapping known abbreviations into their counterpart meanings, e.g., the abbreviation "bet t/t (ቤት ት/ቲ)" means "school/ትምህርቲ ቤት" and the abbreviation "betf/di (ቤት ፍ/ዲ)" means "justice office/ቤት ፍርዲ".
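Several of the preprocessing steps for this section (URL removal, abbreviation mapping, removal of non-Tigrinya characters, tokenization, and stop-word removal) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' script: the stop words are placeholder entries rather than the actual stop-word list, non-Tigrinya characters are approximated as anything outside the Unicode Ethiopic block (U+1200 to U+137F), the sample sentence is made up, and the removal of typing errors is not covered.

```python
import re
from typing import Dict, List, Set

# Links that start with "http://", "https://" or "www."
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
# Assumption: "non-Tigrinya" = outside the Unicode Ethiopic block, except whitespace.
NON_ETHIOPIC = re.compile(r"[^\u1200-\u137F\s]")

# One abbreviation from the text; a real table would hold many more entries.
ABBREVIATIONS: Dict[str, str] = {"ቤት ት/ቲ": "ትምህርቲ ቤት"}
# Placeholder stop words for illustration only.
STOP_WORDS: Set[str] = {"ናይ", "እዩ"}


def preprocess(text: str) -> List[str]:
    text = URL_PATTERN.sub(" ", text)
    # Expand abbreviations before stripping "/" and other non-Ethiopic characters.
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    text = NON_ETHIOPIC.sub(" ", text)
    tokens = text.split()                      # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


sample = "ናይ ቤት ት/ቲ ዜና http://example.com 2018!"
tokens = preprocess(sample)
```

Abbreviations are expanded before the character filter because the slash inside an abbreviation would otherwise be removed and the mapping would no longer match.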

Basic CNN Architecture
Sequence Embedding Layer
Similar to many other CNN models [30,46], and as shown in Figure 6, which is based on [12], a sequential text vector is obtained by concatenating the embedded vectors of the component words. Equation (1) details our method of word embedding:

$$X_{1:k} = X_1 \oplus X_2 \oplus \cdots \oplus X_k \qquad (1)$$

where $\oplus$ is the concatenation operator, $X_i$ is the word vector corresponding to the $i$th word, and $k$ is the number of words/tokens present within the text. We made the lengths of sentences equal by padding with zero values to form a text matrix of $k \times n$ dimensions, with $k$ tokens and embedding vectors of length $n$. We consider $k$ with a fixed length here ($k = 250$). To capture the discriminative features from low-level word embedding, the CNN model applies a series of transformations to the input sentence $X_{1:k}$ using convolution, nonlinear activation, and pooling operations in the following layers.
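The construction of the fixed-size text matrix can be illustrated with NumPy. The sketch below is an assumption-laden illustration rather than the implementation used here: the function name, the random embedding table, and the token ids are hypothetical, and texts longer than k are simply truncated.

```python
import numpy as np


def build_text_matrix(token_ids, embeddings, k):
    """Stack the embedding rows of a token-id sequence into a (k, n) matrix,
    truncating long texts and zero-padding short ones."""
    n = embeddings.shape[1]
    matrix = np.zeros((k, n), dtype=embeddings.dtype)
    ids = np.asarray(token_ids[:k], dtype=int)
    matrix[:len(ids)] = embeddings[ids]        # rows beyond len(ids) stay zero
    return matrix


rng = np.random.default_rng(0)
vocab_size, n = 50, 100                        # n = 100 matches the embedding size
embeddings = rng.normal(size=(vocab_size, n))  # stand-in for pretrained vectors
X = build_text_matrix([3, 17, 8], embeddings, k=250)
```

Every text is thereby mapped to a matrix of the same shape, which is what the convolutional layer requires.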

Convolutional Layer
In the convolutional layer, features are extracted from the word vectors by sliding filters of different widths over the embedding matrix, with each filter producing its own feature map. Discriminative word sequences are found during the training process. The extracted features are low-level semantic features compared with the original text, which reduces the number of dimensions. The convolution filter is applied independently of word position, and filters at higher layers capture syntactic or semantic associations between phrases that are far apart in a text.
A filter $W \in \mathbb{R}^{m \times n}$ was applied to windows of $m$ words to obtain high-level representations. The filter shifts with stride $s$ through the embedding matrix to produce a feature map $c$, where each element $c_i$ is calculated using Equation (2):

$$c_i = f(W * X_{i:i+m-1} + b_i) \qquad (2)$$

In this equation, "$*$" represents the convolution operation, $X_{i:i+m-1}$ represents the word vectors from $X_i$ to $X_{i+m-1}$ (i.e., $m$ rows at a time) of $X$ covered by filter $W$, and $b_i$ represents the bias term. The activation function $f$ is usually of a nonlinear form, such as a sigmoid or hyperbolic tangent. In our case, $f$ is the rectified linear unit (ReLU) activation function. The ReLU is applied to a layer to inject nonlinearity into the system by setting all negative values to zero for any input $x$, as shown in Equation (3):

$$f(x) = \max(0, x) \qquad (3)$$

This helps increase the model training speed without any significant difference in accuracy. Once the filter has iterated over the entire embedding matrix, we obtain the corresponding feature map, as shown in Equation (4) for a stride of 1:

$$c = [c_1, c_2, \ldots, c_{k-m+1}] \qquad (4)$$

Pooling Layer
The convolutional layer outputs are then passed to the pooling layer, which aggregates the information and reduces the representation through common statistical methods, such as finding the mean, maximum, or L2-norm. The pooling layer can alleviate overfitting and produces sentence vectors with fixed lengths. Supposing that there are $K$ different filters, we aggregate the original information in the feature maps by pooling as per Equation (5):

$$z_{\text{pooled}} = \big[\operatorname{pool}(f(F_1 * X + b)), \ldots, \operatorname{pool}(f(F_K * X + b))\big] \qquad (5)$$

where $F_i$ is the $i$th convolutional filter, $b$ is the bias vector, and $z_{\text{pooled}} = [z_1^{\max}, z_2^{\max}, \ldots, z_K^{\max}]$ is a learned new distributed representation of the input sentence. In this paper, we utilize a max-over-time pooling operation, which selects global semantic features and attempts to capture the most important feature, i.e., the one with the highest value, for each feature map. Given a feature map $z_i$, the pooling operation returns its maximum value, $z_i^{\max} = \max(z_i)$, which is the resulting feature corresponding to filter $F_i$. Therefore, we obtain a $K$-dimensional vector if the model employs $K$ parallel filters with different window sizes.
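The chain of convolution, ReLU, and max-over-time pooling for a single filter can be worked through on a toy input. The NumPy sketch below is an illustration only: the function name and all numerical values are assumptions, and, as is usual in deep learning libraries, the "convolution" is computed as a sliding element-wise product and sum without flipping the filter.

```python
import numpy as np


def conv_relu_maxpool(X, W, b, stride=1):
    """Slide filter W (m x n) over the rows of X (k x n), apply ReLU,
    and return (feature_map, max-over-time value)."""
    k, m = X.shape[0], W.shape[0]
    positions = range(0, k - m + 1, stride)
    # Each c_i: sum of the element-wise product of W with m consecutive rows, plus bias.
    feature_map = np.array([np.sum(W * X[i:i + m]) + b for i in positions])
    feature_map = np.maximum(feature_map, 0.0)   # ReLU sets negative values to zero
    return feature_map, feature_map.max()


# Toy input: k = 4 tokens, n = 2 embedding dimensions.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [-2.0, -2.0]])
W = np.ones((2, 2))                              # one filter spanning m = 2 words
c, z_max = conv_relu_maxpool(X, W, b=0.0)
```

With a stride of 1, the filter visits k − m + 1 = 3 positions, giving pre-activation values 2, 3, and −2; the ReLU maps the last one to 0, and max-over-time pooling keeps the single value 3 for this filter.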

Fully Connected Layer
In this layer, each neuron has full connections with all of the neurons in the previous layer. The connection structure is the same as that of the layers in classic neural network models. Dropout regularization is applied to the fully connected layer to avoid overfitting and, therefore, to improve the generalization performance. When training the model, each neuron is temporarily removed from the network with a given probability. Dropped neurons are ignored when calculating the input and output for both forward propagation and backward propagation. Therefore, the dropout technique prevents neurons from co-adapting too much by making the presence of any neuron unreliable.
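The effect of dropout on a vector of activations can be sketched as follows. This is a minimal NumPy illustration, not the layer used in our models: the function name is hypothetical, the drop probability of 0.5 is a toy value, and the rescaling of surviving activations by 1/(1 − p) ("inverted" dropout, common in deep learning libraries) is an assumption that is not stated in the text.

```python
import numpy as np


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p during training.
    Survivors are scaled by 1 / (1 - p) so the expected activation is unchanged."""
    if not training or p == 0.0:
        return activations                      # no neurons are dropped at test time
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)


rng = np.random.default_rng(0)
h = np.ones(10_000)
h_train = dropout(h, p=0.5, rng=rng)            # roughly half the entries become 0
h_eval = dropout(h, p=0.5, rng=rng, training=False)
```

Because a different random mask is drawn at every training step, no neuron can rely on the presence of any other particular neuron.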

Softmax Function
Finally, the vector representations in the fully connected layer are passed to the softmax function, whose output is the probability distribution over the labels. For example, the probability of label $y_i$ is calculated as per Equation (6):

$$P(y_i \mid x) = \frac{\exp(o_i)}{\sum_{j=1}^{C} \exp(o_j)} \qquad (6)$$

where $o_j$ denotes the output of the fully connected layer for label $j$ and $C$ is the number of labels. A categorical cross-entropy loss function was used to train the classifier to assign news articles to categories, with the gradients calculated using backward propagation. The loss was calculated using Equation (7):

$$L(\theta) = -\frac{1}{t} \sum_{i=1}^{t} \log P(y_i \mid x_i; \theta) \qquad (7)$$

where $x_i$ is the $i$th element of the dataset, $y_i$ represents the label of the element $x_i$, $t$ represents the number of training samples, and $\theta$ describes the parameters.
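The softmax output and the categorical cross-entropy loss can be computed directly with NumPy. The following sketch is an illustration under assumptions: the function names and the score values are made up, and the loss is averaged over the samples.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def categorical_cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class of each sample."""
    t = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(t), labels]))


# Two samples, six classes (one score per news category); values are made up.
scores = np.array([[2.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
labels = np.array([0, 3])                       # index of the true category
probs = softmax(scores)
loss = categorical_cross_entropy(probs, labels)
```

The second sample has identical scores for all six classes, so each class receives probability 1/6; the loss decreases as the probability assigned to the true class increases.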

Experiment and Discussion
In this section, we explore the performances of CNN models trained with word2vec (CBOW and skip-gram) and CNN models trained with FastText (CBOW and skip-gram) for news articles. We use the accuracy, recall, precision, and F1 score as performance metrics. The expressions of these metrics are given as follows:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (8)$$

$$\text{recall} = \frac{TP}{TP + FN} \qquad (9)$$

$$\text{precision} = \frac{TP}{TP + FP} \qquad (10)$$

$$\text{F1 score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (11)$$

where true positive (TP) denotes the number of real positives among the predicted positives and true negative (TN) denotes the number of real negatives among the predicted negatives. Similarly, false negative (FN) denotes the number of real positives among the predicted negatives, and false positive (FP) denotes the number of real negatives among the predicted positives. Therefore, accuracy denotes the proportion of documents classified correctly by the CNN among all documents, and recall denotes the proportion of documents that are classified as positive by the CNN among all real positive documents. Precision denotes the percentage of documents that are real positives among the documents classified as positive by the CNN, and the F1 score is the harmonic mean of the precision and recall.
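As a worked example of these metrics, the sketch below computes them from confusion counts for one category treated as the positive class. The function name and the counts are hypothetical and are not results from this study.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a single category out of 1000 test documents.
m = binary_metrics(tp=90, fp=10, fn=30, tn=870)
```

For these counts, the accuracy is 960/1000 = 0.96, the precision is 90/100 = 0.90, the recall is 90/120 = 0.75, and the F1 score is 2 × 0.90 × 0.75 / 1.65 ≈ 0.818, which lies between the precision and recall but closer to the lower of the two.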

CNN Parameter Values
The CNN parameters that were applied for the pretrained CBOW and skip-gram models were adapted from the literature [12,47]. A grid search was applied for hyperparameter optimization to find the optimum values, as shown in Tables 4 and 5 for both the word embedding and CNN parameters. The word vector dimension d was equal to 100. Four different window sizes (filter widths) were used. The use of 100 filters for each size resulted in 100 feature maps per size, and the convolution filter weights and softmax weights were initialized by sampling uniformly from the interval [−0.1, 0.1]. The maximum pooling size was 2, which was used to pool high-level features from the feature maps, and the pooled values were concatenated to produce a single vector at the fully connected dense layer, which was used to calculate the class probabilities. The model was trained using the "Adam" optimizer for 20 epochs, with a callback function used to avoid overfitting. We applied dropout at the embedding layer with probability p = 0.15 and used an L2 regularization parameter value of 0.03 for the convolutional layer.

Table 4. Word embedding parameters.

| Parameter | Value |
| --- | --- |
| Word embedding size | 100 |
| Window size (filter size) | 2, 3, 4, and 5 |
| Number of filters for each size | 100 |
| Dropout probability at the embedding layer | 0.15 |

Comparison with Traditional Models
We compared our CNN models with four of the most common traditional machine learning models, i.e., SVM, Naïve Bayes, decision tree, and random forest models, using bag-of-words (BOW) unigram features represented as TF-IDF vectors. All four models were taken from the scikit-learn ("sklearn") library; for Naïve Bayes, we employed "MultinomialNB". Furthermore, to deal with the sparsity of the feature matrix, all CNN algorithms were run five times with the same parameter settings. Table 6 presents the training and validation accuracy values. The experimental analysis showed that the CNN achieved better training accuracy, while, among the traditional machine learning models (SVM, Naïve Bayes, decision tree, and random forest), the random forest model showed the best validation accuracy.
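The unigram TF-IDF weighting that underlies the traditional baselines can be illustrated in plain Python. This is a minimal sketch of the textbook formula (term frequency multiplied by ln(N/df)), not the exact weighting used in the experiments: scikit-learn applies a smoothed and normalized variant by default, so its values differ, and the function name and toy documents are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Unigram TF-IDF with tf = count / document length and idf = ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                        for term, count in counts.items()})
    return vectors


# Toy tokenized documents (hypothetical).
docs = [["goal", "match", "team"],
        ["vote", "party", "team"],
        ["goal", "goal", "coach"]]
vectors = tfidf(docs)
```

Terms that occur in few documents, such as "match", receive a higher weight than terms shared across documents, such as "team"; the resulting vectors are sparse because each document contains only a small part of the vocabulary.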

Comparison of Pretrained Word2vec with CNN-Based Models
In this section, we compare the overall performance of the CNNs with word2vec word embedding, considering both the CBOW and skip-gram models, and of the CNN without pretrained vectors. Figure 7 shows the accuracy and F1 score comparisons for the vectors generated with word2vec embedding, when the training volume was fixed at 22k news articles, for the CNN with skip-gram, the CNN with CBOW, and the CNN without pretrained vectors. The CBOW CNN showed the highest performance for all training volumes, with 0.9341 and 0.9274 as the values for the accuracy and F1 score, respectively. The CBOW CNN also delivered better performance in terms of accuracy at volume 16. The skip-gram CNN showed better performance at volume 20, while the CNN trained without pretrained vectors fluctuated at various volumes and ultimately decreased in accuracy.


Comparison of FastText Pretrained on CNN-Based Models
This section considers FastText word embedding vectors and compares the performances among the CNNs. Figure 8 shows that the CBOW-FastText CNN scored better results than the skip-gram CNN, although the differences were not significant for almost all training volumes. The CNN with CBOW obtained values of 0.9013 and 0.9012 for the accuracy and F1 score, respectively. The CBOW and skip-gram CNNs showed higher performance at volume 14, whereas the performance of the CNN with random vectors subsequently decreased. We observed that the CBOW CNN trained with word2vec showed better performance than the CNN trained with the skip-gram model. The accuracy of the CNN without pretrained vectors decreased as the number of volumes and epochs increased, whereas all CBOW CNN models showed better performance as the volumes and epochs increased. The performance of the random CNN decreased in the absence of pretrained data, which implies that there is a direct relationship between word embedding representations and text classification performance with CNNs.

Table 7 shows the accuracy and F1 scores of the CNN with both word2vec and FastText. As compared with the CBOW CNN for the vector in news articles, this model was less stable and required more epochs to reach maximum performance. For example, when the training volume was 24, the CNN with the vector in the news articles exhibited an accuracy of 0.93 or more at epoch 17, whereas the CNN with the vector exhibited an accuracy of 0.91. However, the skip-gram CNN with the vector did not exhibit any significant difference in terms of the maximum values for the accuracy and F1 scores. FastText also showed similar results to the CBOW CNN when we compared the vectors in news articles: when the training volume was 24, it exhibited an accuracy of 0.90 or more when the number of epochs was 20.

Comparison of CBOW Results Word2vec and FastText
Similarly, in Table 8, we compared the pretrained word embedding results for the CBOW models with word2vec and FastText. The best results among the news categories were found for the sport category with the word2vec CBOW model, which obtained values of 0.9301, 0.9312, and 0.9306 for the precision, recall, and F1 score, respectively. Meanwhile, the CBOW FastText model achieved values of 0.9202, 0.9214, and 0.9206 for the precision, recall, and F1 score, respectively, for the sport category, which indicated that the sport category provided good results.

Table 4 shows that the CNNs achieved better results than the traditional machine learning models due to the use of pretrained word vectors for enhancing the learning of word representations. Additionally, CNN performance was meaningfully reduced in the absence of pretrained word vectors (PWVs). This indicates that word relationships are significant factors for learning in this context. Moreover, in Figure 7, we compared two word embedding models in terms of the accuracy and F1 scores and found that the CBOW CNN trained with word2vec and FastText showed higher performance than the skip-gram CNN model, even though the output results were close between some models. Moreover, as in Tables 6 and 7, this situation can be interpreted as the CBOW CNN model being better at representing common words in the corpus considered here, whereas the skip-gram CNN algorithm was better for learning rare words. In this paper, FastText and word2vec provided better results for Tigrinya news article classification than the CBOW skip-gram and CBOW word2vec models.

Conclusions
We have used the word2vec and FastText techniques, and our results suggest that they are among the best word embedding techniques in the field of NLP research. In this study, we have evaluated word2vec and FastText in classification models by applying CNNs to Tigrinya news articles. We observed that word2vec improved the classification model performance by learning the relationships among words. We further compared and analyzed both of the word2vec models with traditional machine learning models. The current study shows that the most successful word embedding method was the CBOW algorithm for the news articles considered here. This study is expected to aid future studies on deep learning in the field of Tigrinya text processing and natural language processing. The word vectors and datasets created in this study contribute to the current literature on Tigrinya text processing. In the future, we plan to address other embedding techniques, such as GloVe, BERT, XLNet, and others. We also intend to find optimal solution methods for large-scale word embedding, which is currently a time-consuming problem.