Emotion Estimation Method Based on Emoticon Image Features and Distributed Representations of Sentences

Abstract: This paper proposes an emotion recognition method for tweets containing emoticons that uses both emoticon image features and language features. Some existing methods register emoticons and their facial expression categories in a dictionary, while others recognize emoticon facial expressions from the individual elements of the emoticons. However, highly accurate emotion recognition cannot be performed unless it is based on a combination of the features of sentences and emoticons. Therefore, we propose a model that extracts the shape features of emoticons from their image data and recognizes emotions from an input feature vector that combines these image features with features extracted from the text of the tweets. In evaluation experiments, the proposed method is confirmed to achieve high accuracy and is shown to be more effective than methods that use text features only.


Introduction
Emoticons are nonverbal expressions that are composed by combining characters. They are typically embedded in text and are widely used in multibyte character languages that have many character types, such as Japanese. By adding emoticons to text, it is possible to flexibly express emotions and intentions that are otherwise difficult to convey using only character information. Presently, in addition to emoticons, various other forms of nonverbal expressions, such as emojis and stamps, are used in text-based communications.
Because very few other expressions can be used with the same ease and versatility as emoticons, they are expected to remain in extensive use. Emoticons, however, have a higher degree of polysemy than emojis, and their meanings often change when combined with sentences. Hence, it is difficult to ascertain emotions and intentions using only emoticons. In this study, to ascertain the emotions expressed by entire sentences from tweets including emoticons, we focus on the shape features of the emoticons and propose a method to combine these extracted features with textual features.
The remainder of this paper is organized as follows. Section 2 discusses previous studies related to emoticons and explains how they differ from this study. Section 3 describes the proposed method for estimating emotions from text with emoticons. Section 4 outlines the experiments conducted to evaluate the validity of the proposed method. Section 5 analyzes the experimental results obtained. Section 6 presents concluding remarks.

Related Work
This section introduces previous studies on emotion estimation from emoticons, emotion estimation from sentences containing emoticons, and emotion estimation from sentences containing pictograms, and describes how they differ from the present study.

Emotion Estimation from Emoticons
To estimate emotions from emoticons, Fujisawa et al. [1] focused on the appearance of the emoticons as cues. Emoticons are generally composed of character strings. Therefore, when considering emoticons as research objects, we often focus on each character individually. Fujisawa et al. proposed a method to obtain shape information from the entire emoticon by converting its text data to image data and classifying the corresponding emotions using shape information as features. This method allows the capturing of the entire image of the emoticon as a feature, rather than focusing on the individual characters. In addition, this method was shown to be effective for unknown emoticons that were not included in the training data.
Yu et al. [2] proposed a new system, called AZEmo, which extracts emoticons from Chinese social media and other sources and classifies them into seven emotional categories. The system is based on a kinesics model that divides emoticons into semantic regions (eyes, mouth, etc.). This model has been modified to adapt to the Chinese context.
Takishita et al. [3] focused on emoticons and onomatopoeia in documents to extract the emotions associated with the contents of the entire documents. They investigated the effects of onomatopoeia on the emotions expressed using emoticons and, accordingly, estimated the emotions implied by emoticons. When an emoticon and an onomatopoeia that express different emotions are combined, Takishita et al. considered both the possibility that one emotion takes precedence over the other and the possibility that a completely new emotion is expressed.
Ptaszynski et al. [4] identified the correspondences between emoticons and their corresponding linguistic expressions, such as onomatopoeia, and estimated the degree of ambiguity in the general meaning of emoticons. They surveyed users regarding the meanings of emoticons, and applied the results of the survey to quantify the understandability (meaning ambiguity) of emoticons. Based on their results, they applied emotions that can be expressed in everyday vocabulary to identify the ambiguity of emoticons' meanings.
Matsumoto et al. [5] proposed a method for emotion estimation that learns the character features of emoticons with a convolutional neural network. In their method, the characters of emoticons are converted into character-level distributed representations by word2vec, and their weights are used as initial parameters that are input to a one-dimensional convolutional neural network. They targeted the five emotions Joy, Surprise, Anger, Sorrow, and Neutral. However, because it is more difficult to capture the meaning of character-level representations than that of word-level representations, and because the result depends on the quality and quantity of the unsupervised text data, it is necessary to capture the prerequisite knowledge of how each character is used in emoticons.
Several studies have been conducted with the objective of estimating the emotions of individual emoticons [6]. On the one hand, emoticons are used to express a variety of emotions. On the other hand, when emoticons appear as part of a sentence, it is very difficult to estimate the emotion of the sentence and the emoticon.

Emotion Estimation from Sentences Containing Emoticons
Jiang et al. [7] investigated the effects of emoticons on the emotions implied by tweets on Twitter; they compared tweets with and without emoticons and formulated relationships between them and the expected emotions. Furthermore, they quantitatively assessed the emotions of not only sentences but also each of the emoticons individually. They concluded that using the emotions implied by the emoticons is effective for improving the accuracy of emotion estimation from short sentences such as tweets.
Wegrzyn-Wolska et al. [8] investigated the effects of emoticons on the entire sentence by estimating emotions from tweets under various scenarios, such as when emoticons were removed from the sentences and when binary labels (positive/negative) were assigned in the absence of emotion labels. However, the emoticons used by Wegrzyn-Wolska et al. were of the Western style, which differs slightly from the Japanese style considered in the present study. Table 1 shows examples of each type of emoticon. Because the types of emoticons used differ between countries, it is important to use an analysis method that is appropriate for each type.

Table 1. Examples of each type of emoticon.
Western Emoticons     Japanese Emoticons
:-), : D, : P         (ˆˆ), (;_;)

In addition to the studies cited above, many others have been conducted with the objective of estimating emotions in sentences containing emoticons [9,10]. In existing research, dictionary-based matching has been the main method for estimating the emotions of emoticons. First, the emotions expressed by the emoticons are estimated based on dictionary data, and then the emotions of the entire sentence are estimated based on the estimated emotions of the emoticons. However, we believe that there is a shortcoming in the creation of the dictionary data. Emoticons are expressions created by combining symbols, and the variety of emoticons is enormous. In addition, depending on the language in which the emoticons are used, the applicable characters and symbols may differ even if they look the same (e.g., ellipsis: "-"; ruled: "--"; dash: "-", etc.). In the case of character-based emoticons, if the parts of the emoticon are different, we need to register each part as a different emoticon. Consequently, in our approach, in addition to character-based recognition, we use a method that treats emoticons as images. In this way, we can estimate the emotions of unknown emoticons by referring to the existing dictionary data if they are composed of visually similar parts. Table 2 shows the differences and advantages of our method in relation to existing emotion estimation methods that include emoticons.

Table 2. Advantages and drawbacks of the existing methods and the proposed method.

Existing methods
Advantage
• By creating dictionary data in advance, sentiment estimation can be performed quickly for known emoticons.
• The cost of machine learning is low.
Drawback
• For unknown emoticons, the accuracy of emotion estimation becomes worse.

Proposed method
Advantage
• For visually similar emoticons, emotion estimation can be performed even for unknown emoticons.
• By converting emoticons into images, it is easy to treat them as vector features.
Drawback
• The cost of machine learning is high.

Emotion Estimation from Sentences Containing Emojis
Similar to emoticons, emojis are nonverbal expressions that are also used inline in text. Emojis are generally created in image format or using Unicode characters and used to express emotions and responses.
Ahanin et al. [11] used fuzzy clustering to group emojis into one or more emotion classes; then, they developed pretrained embeddings for these emojis. By comparing the embeddings for emojis with existing word embeddings excluding emojis, Ahanin et al. determined the effectiveness of emojis for tasks involving sentiment and emotion analyses. More studies are available on the subject of emojis than on emoticons. The main difference between the two is that emojis are image data, whereas emoticons are text data. In addition, emojis can be used only if they are predesigned and supported by the system, and there are only 20 to 30 types of facial expressions. Emoticons, however, are generated by combining characters and symbols, so expressions can be created freely to suit the situation. Thus, emoticons are available in a wider variety and allow more flexibility.
Cherbonnier et al. [12] compared the recognition of human emotions based on their new emojis with that based on other forms of expression (facial expressions and Facebook and iOS emojis). Their new emojis are designed to convey six basic emotions (Anger, Disgust, Happiness, Surprise, Fear, Sadness). Their experimental results revealed that disgust and sadness are recognized more readily from the new emojis than from the other forms of expression. Although their research focuses solely on the recognition of emotions in pictograms and does not include a method for automatically recognizing emotions from pictogram information, it shows that, depending on the type of pictogram, the sender and receiver may not be able to accurately recognize emotions.
Fujino et al. [13] constructed a neural network-based emotion estimation model for each gender of the speaker by constructing a pseudo-emotion-labeled speech corpus using tweets containing emojis and automatically expanding the corpus. Their work is based on Emoji2vec [14], which vectorizes pictograms and assigns labels for four types of emotions (Joy, Anger, Sorrow, and Surprise) based on their similarity. Although their proposed method can compensate for the lack of a corpus, it is difficult to improve the accuracy because the expanded corpus contains noise.
In addition to Japanese texts, research has been conducted on texts containing pictograms in Chinese, Turkish, Arabic, and other languages. There are many other studies that use emojis as a cue for emotion estimation, and many have also evaluated the effectiveness of emojis experimentally [15][16][17][18][19][20][21][22]. The number of studies on emoticons is small compared to those on pictograms. However, emoticons and pictograms are similarly used for emotional expression. Therefore, we believe that both emoticons and pictograms can provide effective clues for sentiment analysis.

Proposed Method
In this study, we primarily focus on the shape features of emoticons. Fujisawa et al. [1] proposed a method to estimate emotions expressed by emoticons by converting them into images and extracting some shape features. Their method achieved highly accurate emotion classification by dividing the emoticons into character units and synthesizing features from each character unit. However, the positions of the characters are important for emoticons. In many cases, a certain shape is formed by a sequence of letters. Therefore, to improve the estimation accuracy, it is necessary to take the character arrangement into account, either with methods such as the edit distance or with machine-learning methods such as recurrent neural networks, which have relatively high calculation costs.
Methods that use character n-grams as features and character-level convolutional neural networks (CNNs) have also been proposed. CNNs are popular for image recognition tasks, and AlexNet [23] and ResNet [24] are typical models. In terms of using CNN for text data, Character-Level CNN (CLCNN) has been proposed in the field of natural language processing. CLCNN is a one-dimensional CNN that is used in text classification or detecting words [25,26]. In our proposed method, we extract emoticon features from emoticon images using CNNs. This allows us to consider the positional relationships between the characters to handle the shape features of emoticons well. Figure 1 shows the flow of our proposed method.

Extraction of Emoticon Features (ev) Based on Deep Learning
Our proposed method extracts the image features of emoticons and converts them into 64-dimensional feature vectors using a pretrained model. The pretrained model is an emotion classifier trained using a deep CNN based on an emoticon dictionary. In addition, to compensate for the shortage of emoticon data during pretraining, image data were augmented using multiple character fonts as one of the data-augmentation methods. Most of the existing pretrained networks are applicable to color images; however, monochrome images are generally out of scope of such models. Because emoticons do not have any color information, we designed and trained a new model from scratch. Figure 2 shows the network configuration used in this study for pretraining of the CNN. The emotion labels are predicted using the convolution and max-pooling layers with SoftMax as the activation function in the output layer. In this model, the emotions are recognized solely via the image features of the emoticons, so as learning progresses the model tends to overfit to the emotions associated with the individual emoticons, and robustness to unknown emoticons is lost. To counter this, a dropout function is placed after the fully connected layer.
In addition, because emoticons are created by connecting multiple characters horizontally, their widths are greater than their heights. If the emoticon image is converted to a square without preserving its aspect ratio, the characters are crushed and the original shape information is lost. As this would affect the feature learning process, we added appropriate margins at the tops and bottoms of the images and converted them into square-shaped images without changing the aspect ratios of the emoticons themselves. This enables easier handling during convolution in the CNN. Specifically, the input emoticon image is grayscale, with vertical and horizontal dimensions of 46 pixels each. Figure 3 shows the process of converting the emoticon to image data. In the proposed method, we treat emoticons as image data as well. When converting text to images, the same character can look different if the character font is different. Therefore, in this study, we prepared emoticons in text format in various fonts and converted them to image data.
The CNN then learns from the abovementioned data-augmented training images, and the image features of the emoticons are input to the trained CNN to obtain the output of the hidden layer (fully connected layer). This output 64-dimensional vector is used as the shape feature of the emoticon. In subsequent explanations, we will refer to this feature as ev.
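The padding step described above can be illustrated with a minimal sketch. It assumes the emoticon has already been rendered to a grayscale bitmap (a list of pixel rows); the function names, the white background value of 255, and the nearest-neighbor resampling are illustrative assumptions, not details given in the paper.

```python
def pad_to_square(pixels, background=255):
    """Center a grayscale image on a square canvas without distorting it."""
    height, width = len(pixels), len(pixels[0])
    side = max(height, width)
    top = (side - height) // 2    # margin added above (and roughly below)
    left = (side - width) // 2    # zero for the usual wide emoticon
    square = [[background] * side for _ in range(side)]
    for y, row in enumerate(pixels):
        for x, value in enumerate(row):
            square[top + y][left + x] = value
    return square


def resize_nearest(pixels, size=46):
    """Resample a square image to size x size by nearest-neighbor lookup."""
    side = len(pixels)
    return [
        [pixels[y * side // size][x * side // size] for x in range(size)]
        for y in range(size)
    ]


def emoticon_to_input(pixels, size=46):
    """Pad first, then resize, so the aspect ratio of the emoticon is kept."""
    return resize_nearest(pad_to_square(pixels), size)


# Hypothetical rendered emoticon: 92 pixels wide, 10 pixels high, all dark.
wide_image = [[0] * 92 for _ in range(10)]
cnn_input = emoticon_to_input(wide_image)
```

Padding before resizing is the design point: the 92 x 10 strip ends up as a 46 x 5 band in the middle of the 46 x 46 input rather than being stretched to fill it.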

Distributed Features of Text; dh1, dh2, and dh3
In this study, features are also extracted from the sentences that accompany the emoticons. We obtained features from tweets containing emoticons so that we could extract semantic features that take into account the positions of the emoticons in the text. Various methods have been proposed in the literature for extracting linguistic features from sentences, such as methods using word n-grams, semantic features of words, sentence syntax patterns, and distributed semantic representations of words. In these methods, the sentences are decomposed into certain units and features are extracted from those units. Therefore, in many cases, the positional relationships between the elements in the sentence cannot be used.
In recent years, methods that consider the order of appearance of words in a sentence have often been used. Examples include methods using recurrent neural networks, which can handle time-series information, and methods that efficiently handle word position information based on transformer and self-attention mechanisms. The natural language model BERT [27], developed by Google in 2018, is a general-purpose model that trains a large-scale network model from large-scale training data by combining transformer and attention. Many methods that improve upon BERT are still being proposed. However, BERT is a very large model, so retraining for dedicated tasks can be costly. When applied to an actual task, it is often fine-tuned by relearning only the layers close to the output layer using a small amount of task-specific training data.
In the present study, we extract features from tweet sentences using DistilBERT [28], which reduces the number of training parameters and can thus infer at high speeds without significantly compromising accuracy. In the proposed method, the emotion label classifier is trained by applying distributed expressions obtained from the trained DistilBERT model as the inputs and using the tweet-sentence corpus with emotions labeled by hand as the learning data. Tweet sentences are input to the constructed trained model, and features specialized for emotion estimation are extracted from the three hidden layers. The configuration of the network used is shown in Figure 4, where DistilBERT obtains features with the same number of dimensions (768) as regular BERT. Using this 768-dimensional feature vector as the input, we learn a model that predicts four emotions: "joy", "anger", "surprise", and "sorrow". Using this pretrained model, we extract the outputs (256, 128, and 64 dimensions) of the three hidden, fully connected layers, respectively. In the following description, these feature vectors output from the hidden layers are denoted as dh1, dh2, and dh3, respectively.
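As a rough illustration of how dh1, dh2, and dh3 are read off the hidden layers, the sketch below runs one 768-dimensional vector through a 768-256-128-64-4 stack and keeps the three intermediate outputs. The randomly initialized weights stand in for the trained classifier, the random input stands in for a DistilBERT sentence vector, and the ReLU activation is an assumption; the paper does not state these details here.

```python
import random

# DistilBERT vector -> dh1 -> dh2 -> dh3 -> four emotion scores
LAYER_SIZES = [768, 256, 128, 64, 4]


def init_layer(n_in, n_out, rng):
    """Random weights as a placeholder for a trained fully connected layer."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(vector, layer):
    """Affine transform: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, biases)]


def relu(vector):
    return [max(0.0, v) for v in vector]


def extract_hidden_features(sentence_vector, layers):
    """Return (dh1, dh2, dh3): the outputs of the three hidden layers."""
    hidden = []
    activation = sentence_vector
    for layer in layers[:-1]:  # the last layer is the emotion output, not a feature
        activation = relu(dense(activation, layer))
        hidden.append(activation)
    return tuple(hidden)


rng = random.Random(0)
layers = [init_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]
sentence_vector = [rng.gauss(0.0, 1.0) for _ in range(768)]  # placeholder input
dh1, dh2, dh3 = extract_hidden_features(sentence_vector, layers)
```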

Emotion Estimation Using Emoticons and Sentences
We combine five types of vectors: the feature vector extracted by the CNN feature extractor from the emoticon image (ev), the three intermediate vectors obtained through the pretrained model that takes the distributed representation of DistilBERT extracted from the text as the input (dh1, dh2, and dh3), and the distributed representation obtained from the pretrained DistilBERT itself. These vectors are then fused to build a model that estimates four types of emotions.
In addition, features based on CLCNN and the average vector of the word distributed representations using fastText [29] are used for comparison as baseline features. To combine these features, we use the method of horizontally concatenating each feature vector. The labeled corpus considering both emoticons and sentences is used for the training and evaluation data. The network configuration used for learning is shown in Figure 5.
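Horizontal concatenation can be sketched as follows. The vector values are dummies, and the optional per-feature scaling is only one possible reading of how features might be weighted before fusion; it is an assumption rather than the procedure used in the paper. The name dv for the 768-dimensional DistilBERT vector follows the notation used in the discussion of the confusion matrices.

```python
def concat_features(features, weights=None):
    """Concatenate named feature vectors end to end, optionally scaling each."""
    weights = weights or {}
    fused = []
    for name, vector in features.items():
        scale = weights.get(name, 1.0)  # uniform weight unless specified
        fused.extend(scale * value for value in vector)
    return fused


# Dummy vectors with the dimensions given in the text.
features = {
    "dv": [0.1] * 768,
    "dh1": [0.2] * 256,
    "dh2": [0.3] * 128,
    "dh3": [0.4] * 64,
    "ev": [0.5] * 64,
}

fused = concat_features(features)                      # 1280 dimensions
fused_weighted = concat_features(features, {"ev": 0.5})  # emoticon features halved
```

With all five vectors the fused input has 768 + 256 + 128 + 64 + 64 = 1280 dimensions; dropping a feature from the dictionary gives the smaller combinations compared in the experiments.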

Experiment Evaluation
In this section, we evaluate the effectiveness of the proposed method by comparing the proposed method based on the constructed learning model with the baseline method using existing features.

Dataset
In this experiment, we used a corpus of emoticon images and another corpus of sentences with sentiment labels for pretraining, and a corpus of sentences with emoticon labels for the evaluations. The details regarding the data are summarized in Table 3. Because the quantity of evaluation data is small, we used five-fold cross-validation to predict and evaluate all examples. Furthermore, as the number of examples is uneven across the emotion labels, the synthetic minority oversampling technique (SMOTE) [30] was adopted for oversampling.
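The core idea of SMOTE is to create synthetic minority-class examples by interpolating between an existing example and one of its nearest neighbors from the same class. The sketch below is a simplified illustration of that idea, not the implementation used in the experiments; the function names, the default of k = 5 neighbors, and the toy data are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from a list of minority-class vectors.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of the base sample within the minority class
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: squared_distance(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with three two-dimensional feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote(minority, n_new=4, k=2)
```

Every synthetic point lies on a line segment between two real minority examples, so the class is enlarged without simply duplicating samples.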

Baseline Method
Three methods were primarily used as the baseline in this study: a method based on a character-based deep CNN for emoticons, a text classification method based on a distributed representation of words using fastText, and a method that outputs results by combining both of the above methods (mixed method). The character-based CNN comprised two convolutional layers and a max-pooling layer and took a one-hot vector of up to 15 characters as the input. A model of this network is shown in Figure 6.
For the fastText approach, we used the Python API, which can also be used for supervised learning of the distributed representations; therefore, when we pretrained on a corpus of sentences with emotion labels, we used the emotion labels as the teacher labels. In the following description, we denote the feature extracted using fastText as ft, and the 64-dimensional features extracted from the seventh fully connected layer of the network in Figure 6 as ecv.
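The one-hot input of up to 15 characters for the character-based CNN can be sketched as below. The vocabulary construction, the zero padding of short emoticons, and the all-zero row for unknown characters are illustrative assumptions; the paper specifies only the one-hot encoding and the 15-character limit.

```python
MAX_LEN = 15  # maximum number of characters per emoticon, as stated in the text


def build_vocab(emoticons):
    """Map every character seen in the emoticon list to a column index."""
    chars = sorted({ch for emoticon in emoticons for ch in emoticon})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot_encode(emoticon, vocab, max_len=MAX_LEN):
    """Return a max_len x len(vocab) matrix with one 1 per known character."""
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for position, ch in enumerate(emoticon[:max_len]):  # truncate long emoticons
        index = vocab.get(ch)
        if index is not None:  # unknown characters stay as all-zero rows
            matrix[position][index] = 1
    return matrix


vocab = build_vocab(["(^_^)", "(;_;)"])
encoded = one_hot_encode("(^_^)", vocab)
```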

Parameters and Conditions
In this section, we describe the pretraining and training parameters of the experiments. In addition, we compare the case of concatenating the input features by assigning different weights to each feature with the case of assigning uniform weights to each feature. Table 4 shows the parameters of the training models and algorithms. The features used in the evaluation experiments are summarized in Table 5.
Table 5. Features used in the evaluation experiments.
Feature   Description                                     Dimensions
dh1       DistilBERT-based hidden layer output vector     256
dh2       DistilBERT-based hidden layer output vector     128
dh3       DistilBERT-based hidden layer output vector     64
ev        Emoticon image hidden layer output vector       64
ecv       Emoticon char-CNN hidden layer output vector    64
ft        fastText averaged word embedding vector         100

Results
In this section, we describe the results of our evaluation experiments. In Section 4.4.1, we show the results of the three baseline methods in graph form. In Section 4.4.2, we show the results of the method combining the image features of emoticons with the distributed representation features of the sentences. Figures 7-9 show the experimental results of the baseline methods using ft, ecv, and combined ft with ecv; these features are presented in Table 5.

These results show that using the features of the emoticons in the character units together with the mean vector of word embeddings improves the accuracy of "surprise" estimation compared with that using only one of the features.

Results-2: Proposed Method
The results for each of the proposed methods are shown in Table 6. The results of the two methods with the highest accuracies and the method with the lowest accuracy are also shown graphically (Figures 10-12).

Discussion
The experimental results of the proposed method show that the highest accuracy was obtained when three types of features were concatenated: features extracted from the middle layer pretrained with DistilBERT (dh3), distributed representation of DistilBERT, and features extracted from the model pretrained by CNN based on the image features of emoticons. In short sentences such as tweets, we expect that the emotions expressed using only emoticons would be important, but the emotion estimation model trained with the character-level CNNs showed low accuracy for estimating the emotion "anger".
Focusing only on the number of examples, the number of cases labeled "surprise" was the lowest, and the next lowest was the number of cases labeled "anger", suggesting that SMOTE, which is an oversampling method, was able to control the effects of sample size bias to some extent. Although there were few examples in which the emoticons had a strong effect on the sentence emotion in the evaluation data used in this study, the addition of features based on the image features of the emoticons resulted in higher accuracies, suggesting that the addition of emoticons' features had an improvement effect. Table 7 shows an example of a sentence that was classified correctly using only the proposed method as compared with that using all features with the baseline method. When reading these sentences without considering the emoticons, we felt that these sentences did not include words to express emotions. Therefore, it was difficult to estimate the emotion of the sentence. In such cases, the sentence was assigned an emotion using emoticons. We also considered the reasons why emotions could be correctly estimated with only the proposed method; this was attributed to the fact that using image features could help classify individual emoticons with high accuracies. For sentences that are heavily influenced by the emoticons, we considered that accurately recognizing and classifying the emoticons would result in correct estimation of the emotion of the sentence.

Table 7. Example of sentence classified correctly using only the proposed method.
Sentence                                                                                   Emotion
"Mendoukusai kara shitakunai" ('·ω·') ("I don't want to do it because it's a hassle.")     anger
"Kyou wa neochi shitemasen!" (*-ω-)b ("I haven't fallen asleep today.")                    joy

In addition, we considered that the proposed method had a certain improvement effect for accurately recognizing and classifying emoticons. Figures 13 and 14 show the confusion matrices of (dv, hv1, hv2, hv3, ev) and (dv, hv1, hv2, hv3, ecv) generated from the respective results. The horizontal axis is the emotion label estimated by the classifier, and the vertical axis is the correct emotion label. The lighter the color of each cell, the higher is the accuracy. As seen from these figures, the results are better for (dv, hv3, ev) using the image features of emoticons, although their accuracies are almost equal. When training CNNs on images of emoticons, we augmented the training data of the pretraining model and used large-scale data; hence, we were able to create a feature extractor with high versatility. On the other hand, the results show that even when using only a small emoticon dictionary, we can effectively obtain emoticon features.
Figure 14. Confusion matrix (dv, hv1, hv2, hv3, ecv).
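For reference, a confusion matrix with the layout described above (rows for the correct label, columns for the estimated label) can be computed as in the following sketch. The label order and the toy predictions are hypothetical and are not results from the experiments.

```python
LABELS = ["joy", "anger", "surprise", "sorrow"]


def confusion_matrix(true_labels, predicted_labels, labels=LABELS):
    """Rows index the correct label, columns index the estimated label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[predicted]] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each correct label that was estimated correctly."""
    return [
        row[i] / sum(row) if sum(row) else 0.0  # 0.0 when a label has no examples
        for i, row in enumerate(matrix)
    ]


# Hypothetical labels for four tweets.
true_labels = ["joy", "joy", "anger", "sorrow"]
predicted_labels = ["joy", "anger", "anger", "sorrow"]

matrix = confusion_matrix(true_labels, predicted_labels)
recall = per_class_recall(matrix)
```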

Conclusions
In this paper, we proposed a more accurate method than existing approaches for estimating emotions from tweets containing emoticons by extracting features from the images of the emoticons, in addition to distributed expressions extracted from the text of the tweets. The effectiveness of the proposed method was confirmed by comparisons with a method using a combination of CLCNN and distributed representation with fastText as the baseline methods.
The results of the evaluation experiments show that the method using a combination of DistilBERT, hidden layer output vector from the pretrained models, and feature expressions extracted from the image features of the emoticons produced the highest accuracy for emotion estimation. We believe that this is because using the images of emoticons enabled a more detailed analysis and capturing of local shape features than existing methods that decompose text at the character level.
On the other hand, if the initial weights for the emoticons are not reduced, the benefit of combining information from them is not available. This suggests that in the case of the evaluation dataset prepared herein, the influence of the emoticons on the overall sentence may have been small for many cases.
To clarify the effects of the emoticons on emotion estimation, we plan to conduct further experiments with sentences wherein the emotions expressed by the emoticons have a higher level of importance (e.g., longer or shorter sentences).