1. Introduction
Recently, with the emergence of ChatGPT [1], artificial intelligence has once again become a hot topic for the public. ChatGPT possesses powerful general language capabilities that can assist with many aspects of daily life and work. In the Chinese Natural Language Processing (NLP) domain, products such as ChatGLM [2,3], ERNIE Bot [4], and MOSS (https://github.com/OpenLMLab/MOSS) also play important roles. These dialogue systems, fine-tuned from large language models, demonstrate strong capabilities in the general domain; when applied to vertical domains such as healthcare and finance, however, they require further fine-tuning and refinement. Fine-tuning a large language model is costly, so retrieval-based dialogue systems are a viable choice for vertical domains. Question matching is a critical technique for retrieval-based dialogue systems [5,6]. It calculates the similarity between a user query and a set of predefined questions and returns the top-k questions with the highest similarity scores; the system then selects its response from among these candidates. In the NLP domain, question matching is regarded as a text-matching task, specifically a binary classification task: as shown in Figure 1, given two input texts, the goal is to output a label indicating whether they are similar or dissimilar. It can also be viewed as a text similarity calculation task, in which the similarity between the two input texts is computed; if the computed similarity exceeds a predefined threshold, the texts are considered similar, otherwise dissimilar.
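As a minimal illustration of this retrieve-and-threshold formulation (not the method proposed in this paper), the following Python sketch ranks predefined questions by cosine similarity to a user query and applies a threshold to decide similar versus dissimilar; the encode() function is a hypothetical placeholder for any sentence encoder.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two sentence vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_questions(query, questions, encode, threshold=0.8, k=5):
    # Rank predefined questions by similarity to the query and mark each
    # of the top-k candidates as similar (1) or dissimilar (0) via a threshold.
    q_vec = encode(query)
    scored = [(q, cosine_similarity(q_vec, encode(q))) for q in questions]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(text, score, int(score >= threshold)) for text, score in scored[:k]]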
Previous methods often treat question matching as a text similarity task. Huang et al. [6] introduced the Deep Structured Semantic Model (DSSM), which models queries and documents in a shared Euclidean space to calculate the similarity between documents and user queries. Shen et al. [5] improved the DSSM by using convolutional neural networks (CNNs) to model the low-dimensional semantic vector space of search queries and documents. Pang et al. [7] transformed the text-matching problem into an image classification problem by stacking CNN modules to extract features. These methods achieve high accuracy on text similarity tasks in general domains, but they generalize poorly to vertical domains.
In this paper, we focus on financial question matching. Several studies have investigated how to apply text-matching techniques in the financial domain. For instance, Tan et al. [8] employed a hybrid CNN-RNN model to encode long sequential text for insurance-domain question answering. Li et al. [9] introduced a text-matching technique specifically for risk guarantees, addressing gaps in the risk advisory community. Although these methods show some applicability in the financial domain, the specialized nature of the field calls for further customization. Firstly, context is highly important in financial NLP; in the investment market, for instance, stock prices can fluctuate depending on context. Secondly, the financial domain abounds with specialized terminology that is unfamiliar in the general NLP domain, so adaptation training is required to familiarize the model with it.
Table 1 displays common financial phrases in everyday Chinese, all sourced from a Chinese financial question-matching dataset (Section 4.1). The modeling of financial text context and the ability to recognize financial phrases both significantly affect the inferential capability of a model. Therefore, designing a model that can effectively adapt to financial question matching has become a crucial task. In particular, the design of a financial question-matching model faces two key challenges: (1) How can a contextual utterance be modeled more accurately in the financial domain? (2) How can financial phrases be represented more accurately?
To address these challenges, we propose a novel Financial Knowledge Enhanced Network (FinKENet) that incorporates financial knowledge into text representations. Considering the uniqueness of text in the financial domain, the proposed model is designed to lean toward representing financial text. Firstly, we design a multi-level encoder layer that includes sentence-level and phrase-level representations. Specifically, the sentence-level representation aims to encode financial text representations that are biased toward the financial context; to this end, the proposed model utilizes FinBERT [10] to encode text vectors. The phrase-level representation then enhances the adaptability of utterances to financial contexts by directly encoding the financial phrases within sentences. Additionally, to fuse sentence vectors with financial keyword phrase vectors, the proposed model utilizes a financial co-attention adapter that attends both from sentence to phrase and from phrase to sentence. Finally, we design a multi-level similarity decoder layer that predicts the similarity between a query and a question from three perspectives (cosine similarity, Manhattan distance, and Euclidean distance), enhancing generalization. In contrast to additional knowledge injection [11] in language models such as GPT and BERT, our proposed model focuses on the fusion process of knowledge.
We present a cross-entropy-based objective function for training all model parameters. Experimental results on the Ant Financial Question Matching Corpus (AFQMC) show that the proposed FinKENet surpasses all previous baseline models and becomes the new state-of-the-art (SOTA) model. By effectively modeling the financial context, FinKENet fills a gap in this domain. The main contributions of this work are as follows:
We introduce a novel financial knowledge-enhanced network that explicitly incorporates financial knowledge into text representations through a multi-level encoder layer consisting of sentence-level and phrase-level representations.
Specifically, we propose a financial co-attention adapter, which extracts attention vectors both from sentence to phrase and from phrase to sentence, thereby enhancing the text representation capability of the method.
We introduce a multi-level similarity decoder layer that enhances the discriminative power of the model from three perspectives.
Experimental results demonstrate that the proposed model performs significantly better than the previous state-of-the-art (SOTA) model.
The remaining sections of this paper are organized as follows: Section 2 reviews related work on dialogue systems and question matching. Section 3 presents the implementation principles and technical details of the proposed model. Section 4 describes the experimental design and analyzes the experimental results. Section 5 presents ablation experiments, an analysis of the multi-level similarity decoder, and a case study. Section 6 summarizes the paper and discusses future research directions.
3. Proposed Method
In this section, we introduce the proposed method, as shown in Figure 2. We design the model based on a dual-similarity text-matching architecture [5,6], consisting of a multi-level encoder layer (Section 3.2), a fin co-attention adapter (Section 3.3), and a multi-level similarity decoder layer (Section 3.4). The multi-level encoder layer ensures the completeness of the financial text representation from the views of both the sentence and the financial phrases. The sentence-level representation is responsible for representing the utterance, as shown in Figure 2a, and the phrase-level representation is responsible for representing financial phrases, as illustrated in Figure 2b. The fin co-attention adapter integrates sentence vectors and financial phrase vectors to generate a comprehensive text representation, as depicted in Figure 2c. The multi-level similarity decoder layer computes the text similarity between the query and question representations and outputs the label, as shown in Figure 2d.
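To make the overall data flow concrete, the following PyTorch-style skeleton is a minimal sketch of how the three components could be composed in a forward pass; the module and variable names are our own illustrative choices rather than the released implementation, and details such as pooling and exactly which vectors are concatenated before scoring are deliberately omitted.

import torch.nn as nn

class FinKENetSketch(nn.Module):
    # Illustrative composition of the three components described in Figure 2.
    def __init__(self, sentence_encoder, phrase_encoder, co_attention, decoder):
        super().__init__()
        self.sentence_encoder = sentence_encoder  # Section 3.2.1 (FinBERT-based)
        self.phrase_encoder = phrase_encoder      # Section 3.2.2 (keyword embedding + self-attention)
        self.co_attention = co_attention          # Section 3.3 (fin co-attention adapter)
        self.decoder = decoder                    # Section 3.4 (multi-level similarity decoder)

    def forward(self, query_tokens, query_keywords, question_tokens, question_keywords):
        # Encode sentences and financial phrases for both query and question.
        h_query = self.sentence_encoder(query_tokens)
        h_question = self.sentence_encoder(question_tokens)
        k_query = self.phrase_encoder(query_keywords)
        k_question = self.phrase_encoder(question_keywords)
        # Fuse each sentence representation with its financial phrase representation.
        c_query = self.co_attention(h_query, k_query)
        c_question = self.co_attention(h_question, k_question)
        # Score the similarity between the fused query and question representations.
        return self.decoder(c_query, c_question)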
3.1. Problem Definition
In this paper, we define the model inputs as a user query Q and a pre-defined question P, the corresponding FinBERT inputs as X_Q and X_P, the fin-keyword sequences as K_Q and K_P, and the model output label as y. During training, y = 1 signifies that Q and P are similar, and y = 0 indicates that Q and P are dissimilar, as shown in Table 2. The objective of question matching is to accurately distinguish whether a query Q and a question P are similar or dissimilar.
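As a minimal sketch of the data layout implied by this definition (the field names are our own; the dataset itself stores Chinese query-question pairs with 0/1 labels), one training instance can be organized as follows.

from dataclasses import dataclass
from typing import List

@dataclass
class MatchingExample:
    # One question-matching instance, following the problem definition above.
    query: str                     # user query Q
    question: str                  # pre-defined question P
    query_keywords: List[str]      # fin-keyword sequence extracted from Q
    question_keywords: List[str]   # fin-keyword sequence extracted from P
    label: int                     # y = 1 (similar) or y = 0 (dissimilar)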
3.2. Multi-Level Encoder Layer
Due to the specificity of financial domain text, we extract sentence features and financial phrase features separately. Therefore, we design a multi-level encoder layer that includes sentence-level and phrase-level representations. The sentence-level representation aims to encode financial text representations that are inclined toward the financial context, while the phrase-level representation enhances the adaptability of utterances to financial contexts by directly encoding the financial phrases within sentences. We first extract sentence features from the text, and then incorporate the financial keyword phrase features to enhance the text features, thereby improving the representational capability of the proposed model. Some examples of Chinese financial phrases are listed in Table 1.
3.2.1. Sentence-Level Representation
We employ the FinBERT [10] model to encode text vectors. FinBERT is a financial pre-trained language model built on BERT [35]; the two share an identical model architecture, and the FinBERT workflow is depicted in Figure 3. Its pre-training corpus and tasks are specifically tailored to the financial domain, so FinBERT is better suited to representing financial text. The encoding is as follows:
H_Q = FinBERT(X_Q),   H_P = FinBERT(X_P),
where X_Q is the user query input, X_P is the question text, H_Q ∈ ℝ^{n×m} and H_P ∈ ℝ^{n×m} are the outputs of FinBERT, n is the length of the text, and m is the hidden layer dimension.
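A minimal sketch of this encoding step with the Hugging Face transformers library is shown below; the checkpoint name is a placeholder, since the specific FinBERT release is not restated in this section.

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "path/to/finbert-checkpoint"  # placeholder for a Chinese FinBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def encode_sentence(text, max_length=64):
    # Returns the token-level hidden states H in R^(n x m), where n is the
    # (padded) text length and m is the hidden layer dimension.
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       padding="max_length", max_length=max_length)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # shape: (n, m)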
3.2.2. Phrase-Level Representation
Our model enhances the sentence representation by incorporating financial phrases, bolstering its ability to discern financial knowledge. As shown in Figure 4, the financial phrases are first encoded with a word embedding layer, as shown in the following formula:
E_Q = W_e(K_Q),   E_P = W_e(K_P),
where W_e is the trainable financial keyword embedding matrix, K_Q denotes the financial phrases of the query, K_P denotes the financial phrases of the question, E_Q, E_P ∈ ℝ^{l×m}, l is the length of the financial keyword sequence, and m is the hidden layer dimension.
Then, a self-attention layer is utilized to capture the relationships between financial phrases, as follows:
A_Q = softmax((E_Q W^Q)(E_Q W^K)^T / √m)(E_Q W^V),
where softmax(·) is the activation function, W^Q ∈ ℝ^{m×m}, W^K ∈ ℝ^{m×m}, and W^V ∈ ℝ^{m×m} are trainable matrices, A_Q ∈ ℝ^{l×m}, l is the length of the financial keyword sequence, and m is the hidden layer dimension.
Additionally, our method employs a max pooling operation along the row dimension, which ensures that varying numbers of financial keywords within a sentence are represented in a uniform dimension, as illustrated in the following formula:
k_Q = MaxPooling_row(A_Q),
where MaxPooling_row(·) is the row-wise max pooling function, k_Q ∈ ℝ^{m}, and m is the hidden layer dimension.
Our model performs the same processing in parallel for the question, as shown in the following formulas:
A_P = softmax((E_P W^Q)(E_P W^K)^T / √m)(E_P W^V),
k_P = MaxPooling_row(A_P),
where softmax(·) is the activation function, W^Q, W^K, and W^V are the trainable matrices, MaxPooling_row(·) is the row-wise max pooling function, l is the length of the financial keyword sequence, and m is the hidden layer dimension.
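The following PyTorch sketch summarizes the phrase-level pipeline for one sentence (keyword embedding, a single self-attention layer, and row-wise max pooling); the scaled dot-product form of the attention is our own assumption, chosen to match the trainable matrices and activation function described above.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseLevelEncoder(nn.Module):
    # Minimal sketch of the phrase-level representation: keyword embedding,
    # one self-attention layer, and max pooling over the keyword dimension.
    def __init__(self, vocab_size, hidden):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)  # trainable keyword embedding matrix
        self.w_q = nn.Linear(hidden, hidden, bias=False)   # trainable attention matrices
        self.w_k = nn.Linear(hidden, hidden, bias=False)
        self.w_v = nn.Linear(hidden, hidden, bias=False)

    def forward(self, keyword_ids: torch.Tensor) -> torch.Tensor:
        # keyword_ids: (l,) indices of the financial keywords in one sentence.
        e = self.embedding(keyword_ids)                     # (l, m)
        q, k, v = self.w_q(e), self.w_k(e), self.w_v(e)
        scores = q @ k.transpose(0, 1) / math.sqrt(e.size(-1))
        a = F.softmax(scores, dim=-1) @ v                   # (l, m) attended keyword vectors
        return a.max(dim=0).values                          # (m,) fixed-size phrase vector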
3.3. Fin Co-Attention Adapter
The proposed model employs a co-attention adapter [22] to combine sentence representations with financial phrase representations and obtain contextual representations. The workflow of this module is illustrated in Figure 5. Firstly, the co-attention scores are calculated in both directions (phrase-to-sentence and sentence-to-phrase), as shown in the following formula:
S_Q = H_Q W_c A_Q^T,
where W_c ∈ ℝ^{m×m} is a trainable matrix and m is the hidden layer dimension.
Subsequently, the sentence vector representation enriched with financial words and the financial word vector representation enriched with the sentence are derived, as represented in the following formulas:
U_Q = MaxPooling_row(softmax(S_Q) A_Q),   V_Q = MaxPooling_row(softmax(S_Q^T) H_Q),
where MaxPooling_row(·) is the row-wise max pooling function, U_Q, V_Q ∈ ℝ^{m}, and m is the hidden layer dimension.
Finally, U_Q and V_Q are concatenated to obtain the contextual representation, as expressed in the following formula:
C_Q = U_Q ⊕ V_Q,
where ⊕ is the concatenation operation and m is the hidden layer dimension.
For the question, our method performs the same processing in parallel, as described in the following formulas:
S_P = H_P W_c A_P^T,
U_P = MaxPooling_row(softmax(S_P) A_P),   V_P = MaxPooling_row(softmax(S_P^T) H_P),
C_P = U_P ⊕ V_P,
where W_c is the trainable matrix, MaxPooling_row(·) is the row-wise max pooling function, ⊕ is the concatenation operation, and m is the hidden layer dimension.
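A minimal sketch of one plausible instantiation of the fin co-attention adapter is given below; the bilinear affinity matrix, the softmax normalization, and the max pooling over rows are our own assumptions consistent with the quantities described above, not a verbatim reproduction of the adapter.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FinCoAttentionAdapter(nn.Module):
    # Minimal sketch of bidirectional co-attention fusion between the
    # sentence representation h (n, m) and the phrase representation a (l, m).
    def __init__(self, hidden):
        super().__init__()
        self.w_c = nn.Parameter(torch.empty(hidden, hidden))  # trainable bilinear matrix
        nn.init.xavier_uniform_(self.w_c)

    def forward(self, h: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        s = h @ self.w_c @ a.transpose(0, 1)                              # (n, l) co-attention scores
        u = (F.softmax(s, dim=-1) @ a).max(dim=0).values                  # sentence enriched with phrases -> (m,)
        v = (F.softmax(s.transpose(0, 1), dim=-1) @ h).max(dim=0).values  # phrases enriched with sentence -> (m,)
        return torch.cat([u, v], dim=-1)                                  # contextual representation (2m,)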
3.4. Multi-Level Similarity Decoder Layer
The proposed model uses a multi-level similarity decoder to calculate the similarity between the query representation and the question representation, mitigating the randomness associated with a single calculation formula. Our model concatenates the sentence-level, phrase-level, and co-attention representations to form the final discriminative representation, as expressed in the following formula:
Z_Q = h_Q ⊕ k_Q ⊕ C_Q,   Z_P = h_P ⊕ k_P ⊕ C_P,
where h_Q and h_P are the pooled sentence-level hidden states produced by FinBERT, and m is the hidden layer dimension.
Subsequently, similarity scores between Z_Q and Z_P are calculated using cosine similarity, Manhattan distance, and Euclidean distance, as illustrated in the following formulas:
Cosine similarity: s_cos = (Z_Q ⊙ Z_P) / (‖Z_Q‖ ‖Z_P‖),
Manhattan distance: s_man = Σ_i |Z_{Q,i} − Z_{P,i}|,
Euclidean distance: s_euc = √(Σ_i (Z_{Q,i} − Z_{P,i})²),
where ⊙ is the dot product operation.
Finally, our model computes the average of the three discriminative scores as the final model output, as represented in the following formula:
Y = (s_cos + s_man + s_euc) / 3,
where a value of Y at or above the decision threshold means the predicted label is 1, and a value below it means the predicted label is 0.
Cross-Entropy Loss. The training objective of our method is to minimize the loss L, which is the output of the cross-entropy loss function, as follows:
L = −[y log Y + (1 − y) log(1 − Y)],
where y represents the true label.
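The decoder and training objective can be sketched as follows; mapping the Manhattan and Euclidean distances into (0, 1) with exp(−d) before averaging is our own assumption, made so that the averaged score can be fed directly to a binary cross-entropy loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelSimilarityDecoder(nn.Module):
    # Minimal sketch: three similarity views (cosine, Manhattan-based,
    # Euclidean-based) averaged into a single matching score.
    def forward(self, z_q, z_p):
        s_cos = F.cosine_similarity(z_q, z_p, dim=-1)
        s_man = torch.exp(-torch.sum(torch.abs(z_q - z_p), dim=-1))
        s_euc = torch.exp(-torch.norm(z_q - z_p, p=2, dim=-1))
        return (s_cos + s_man + s_euc) / 3.0

# Toy usage: a batch of 8 fused representations and binary labels.
decoder = MultiLevelSimilarityDecoder()
z_q, z_p = torch.randn(8, 256), torch.randn(8, 256)
labels = torch.randint(0, 2, (8,)).float()
score = decoder(z_q, z_p).clamp(1e-6, 1 - 1e-6)  # keep the averaged score inside (0, 1)
loss = F.binary_cross_entropy(score, labels)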