Bidirectional Representations for Low Resource Spoken Language Understanding

Most spoken language understanding systems use a pipeline approach composed of an automatic speech recognition interface and a natural language understanding module. This approach forces hard decisions when converting continuous inputs into discrete language symbols. Instead, we propose a representation model to encode speech in rich bidirectional encodings that can be used for downstream tasks such as intent prediction. The approach uses a masked language modelling objective to learn the representations, and thus benefits from both the left and right contexts. We show that the performance of the resulting encodings before fine-tuning is better than comparable models on multiple datasets, and that fine-tuning the top layers of the representation model improves the current state of the art on the Fluent Speech Commands dataset, also in a low-data regime, when a limited amount of labelled data is used for training. Furthermore, we propose class attention as a spoken language understanding module, efficient both in terms of speed and number of parameters. Class attention can be used to visually explain the predictions of our model, offering valuable insight into how it reaches its decisions. We perform experiments in English and in Dutch.


Introduction
Commercial spoken language understanding (SLU) systems rely on cascading automatic speech recognition (ASR) with natural language understanding (NLU). We identify three main problems with such interfaces: (1) ASR errors are cascaded to the NLU system; (2) discrepancies arise between training and inference, as the NLU model is often not trained on spoken language, which differs in terms of phrasing but also includes acoustic elements such as stress and pitch that are lost after transcription; and (3) the resulting systems can be large, as both recent ASR and NLU models contain an increasing number of parameters. However, it is undeniable that there is a strong need for lightweight, robust personal assistants able to understand the nuances and intricacies of spoken language. Those systems should also work out of the box, without requiring much effort from the user.
Since their introduction in 2017, transformers [1] have taken over the field of natural language processing (NLP). The key to their success is their ability to process long sequences, attend to relevant pieces of information present in the input and learn qualitative representations that can be used for other purposes. This last aspect is especially true for masked language models (MLMs), where a model is trained to reconstruct a partial input in a "fill in the gap" fashion. In this setting, words are masked from the input sentence, and the model must predict what was removed. While doing so, transformers learn syntactic and semantic information, such as parts of speech and language composition [2]. Unlike conditional language models (CLMs), which only have access to past information to generate a prediction, MLMs can benefit from left and right contexts [3,4]. Furthermore, whereas CLMs are limited to the generation of one token at a time conditioned on the previously generated tokens, MLMs can fill multiple masked tokens in parallel. Nonetheless, CLMs are still widely popular for language generation tasks.
In speech processing, transformers have also been widely adopted by the research community [5]. A major difference from text-based transformers is the addition of a convolutional front end to process continuous sequences [6,7]. Its role is to learn local relationships and encode positional information. Both encoder-only and encoder-decoder transformers have been experimented with. Large self-supervised encoder-only models have shown impressive performance in many speech processing tasks [8,9]. However, the amount of semantic information contained in these encoders is limited or so deeply entangled that extracting it is challenging, and fine tuning the encoder is necessary to achieve satisfactory performance [10]. In contrast, encoder-decoder models split the representation task over two distinct modules: the encoder models the acoustic information contained in speech, while the decoder learns more language-related features such as syntax and semantics [11]. The encoder-decoder relies only on attention to align speech and text. [12] proposed the integration of connectionist temporal classification (CTC) [13], a technique that has been well researched in speech recognition. The CTC objective in the hybrid transformer enforces a monotonic alignment between speech and text, resulting in improved robustness and faster convergence [12]. Another notable contribution is that of [4], who concatenated the outputs of a speech encoder and BERT, which were then passed to a prediction network. They also noted the benefit of using left and right context, as opposed to models learning only from past values. Finally, attempts have been made to augment large language models with speech capabilities, although more research is needed to achieve competitive performance [14]. Despite considerable improvement in speech modeling in recent years, speech representation models do not show the same ability as text-based language models to efficiently store semantic information, and a considerable amount of fine tuning is necessary to achieve decent performance [10].
In this research, we explore representation models of speech and demonstrate their usefulness for SLU, with a particular focus on low-resource solutions, where only a handful of examples is available to train and validate the models. We propose an architecture for pretraining bidirectional speech transformers on a surrogate ASR task with the goal of learning qualitative representations that prove useful for solving language understanding tasks. We evaluate the resulting embeddings in an SLU task, intent recognition, where the objective is, given a spoken command, to identify the intent and any necessary arguments. For example, the sentence "switch the lights in the kitchen to blue" would result in an intent of "lights" and arguments of "room = kitchen" and "color = blue", while "play Beyonce" has an intent of "music" and an argument of "artist = Beyonce". For completeness and comparison purposes, we also include results obtained after partially fine tuning the model.
We are particularly interested in low-resource scenarios, where we control the number of examples used for training and validation to study the limits of what can be learned from the representations. When information is stored efficiently, few examples are necessary to learn how to extract it. In contrast, noisy embeddings require many examples to extract relevant information. For the SLU module, we propose class attention as a drop-in replacement for LSTMs. Class attention assigns a score to each position that reflects the importance of the corresponding token for predicting the intent and arguments. Aside from its small footprint, it provides us with a method to understand why the model makes certain predictions. In the remainder of this article, we first detail the architecture of the different components of the model. Then, we lay out the experimental setup and methodology. Finally, we evaluate the models and analyze the representations in detail.

Methods
Building on the work of [15], we propose an encoder-decoder architecture with a multiobjective training strategy to learn bidirectional representations of speech (Figure 1). We evaluate the features learned in a downstream SLU task: intent prediction. The encoder, as presented in Section 2.1, learns acoustic representations of speech. The sequences of acoustic unit representations are used in two modules in parallel: they are mapped to output symbols with a classification layer and optimized with CTC (Section 2.2), and they are processed by a bidirectional transformer decoder to learn linguistic features (Section 2.3). The features learned by the decoder are processed by an intent recognizer to obtain the final output (Section 2.4).

Encoder
The encoder processes Mel-scaled filter banks and transforms them into a sequence of acoustic embeddings (h_enc). The module is composed of a VGG-like convolutional front end followed by a transformer encoder. The role of the CNN is twofold: it decreases the frequency and time dimensions and learns the local relationships in the sequence of speech features. This replaces the positional encoding traditionally required by transformers to keep track of the sequence order [7]. The CNN front end compresses a sequence of speech features of length T to a smaller sequence of length T′ ≈ T/4. The transformer encoder is composed of multiple blocks of multihead self-attention and position-wise fully connected feedforward networks. Each block has a residual connection around both operations, followed by layer normalization. In this setting, the encoder cannot be trained as a standalone module.
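To make the front end's downsampling concrete, the following sketch computes the output shape of a front end with two strided 3×3 convolutions, the configuration listed in the Hyperparameters section; the padding value is an assumption, chosen so that each layer roughly halves its input along both axes.

```python
def conv_out_len(n, kernel=3, stride=2, padding=1):
    """Output length of one strided convolution along a single axis."""
    return (n + 2 * padding - kernel) // stride + 1

def frontend_out_shape(T, F, num_layers=2):
    """Shape after the VGG-like front end: two 3x3 convolutions with
    stride 2 reduce both the time (T) and frequency (F) axes by about
    4x, so a sequence of length T comes out with length T' ~ T / 4."""
    for _ in range(num_layers):
        T, F = conv_out_len(T), conv_out_len(F)
    return T, F

# A 1000-frame utterance of 80-dimensional Mel filter banks:
print(frontend_out_shape(1000, 80))  # -> (250, 20)
```

Because the front end quarters the time axis, the transformer encoder attends over sequences a quarter of the original length, which also reduces the quadratic cost of self-attention.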
Figure 1: The encoder processes speech features. The CTC module makes a rough prediction of the text, where low-probability tokens are masked. The decoder attends to the masked transcription and to the encoder's output and predicts the missing symbols. The generated representations are used as input to the downstream model to predict the intent.

CTC
Connectionist temporal classification [13] is a method for optimizing sequence-to-sequence models. For each unit in the input sequence, CTC predicts a probability distribution over the vocabulary, consisting of the set of output symbols and a blank token. The predictions are assumed to be conditionally independent. A simple algorithm reduces a sequence of CTC tokens to a sentence by first merging consecutive repeated symbols, then removing blank tokens (Figure 1). The CTC loss is computed by summing the negative log likelihood of all the alignments that result in the correct sentence. For decoding, we opt for a greedy algorithm, which takes the token predicted with the highest probability at each time step. Although more advanced techniques such as beam search or weighted finite-state transducers [16] can improve the CTC prediction considerably, [16] found that the largest gains were observed in substitution errors, with substantially lower gains for insertions and deletions. For this reason, we prefer a fast decoding technique and let the decoder fix the substitution errors in the rough prediction.
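The greedy CTC reduction described above (merge repeats, then drop blanks) can be sketched as follows; the frame-level token strings are hypothetical.

```python
BLANK = "_"  # the CTC blank token

def ctc_greedy_collapse(frames):
    """Collapse a frame-level greedy CTC prediction to an output sequence:
    first merge consecutive repeated symbols, then remove blank tokens."""
    merged = [t for i, t in enumerate(frames) if i == 0 or t != frames[i - 1]]
    return [t for t in merged if t != BLANK]

# "hello" needs a blank between the two l's to survive the merge step:
frames = ["h", "h", "e", "_", "l", "l", "_", "l", "o", "o"]
print("".join(ctc_greedy_collapse(frames)))  # -> "hello"
```

Note that the order of the two steps matters: a genuine double letter is only kept because a blank separates its two occurrences in the frame-level prediction.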

Decoder
Although an encoder equipped with CTC is perfectly able to transcribe speech, it does so by assigning a token to each position independently. [12] reported that the addition of a decoder integrates language modeling into the learning process. Indeed, the decoder combines the acoustic cues of the encoder with the previous tokens to predict the next one. Here, we opt for a bidirectional transformer decoder instead of a left-to-right model. Left-to-right decoders predict one token at a time and stop after producing a special end-of-sequence token. This method is slower because each prediction depends on the previously predicted tokens. Additionally, the network only has access to previous tokens, thus limiting the amount of information available to make a prediction. In modern NLP, bidirectional architectures have become popular for learning powerful representations that make use of both left and right contexts [3,17]. For instance, BERT [3] is optimized with a masked language modeling objective by randomly masking some tokens in the input sequence. In ASR, a similar approach was adopted by [15], who obtained a rough text prediction with CTC, which was refined by masking low-probability tokens and letting the decoder predict the missing items in the sequence, similarly to [18]. One issue with these models is the definition of the decoder's initial state. As the output sequence is unknown at inference time, it is necessary to provide the decoder with a template consisting of a sequence of mask tokens of the same length as the target sequence. The decoder is not able to add or remove symbols to adjust the sequence length, and the correct length is unknown at inference time. [18] presented a solution that involves predicting the length as part of the prediction task of the encoder. Then, during inference, the decoder predicts multiple candidates of different lengths corresponding to the predicted lengths with the highest probability. [15] assumed the greedy CTC prediction to be of correct length, although they showed that this leads to weaker transcriptions. In this work, we focus on learning rich representations for spoken language understanding. We thus work around this issue, as we assume that the exact length is not essential for our purpose. We follow [15] for the architecture of the decoder. We use the output of the penultimate layer as the linguistic representations of speech.
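A minimal sketch of the template construction at inference time, following the description above: the greedy CTC hypothesis fixes the sequence length, and tokens below a confidence threshold are replaced by a mask symbol for the decoder to fill in. The token and probability values in the example are hypothetical.

```python
MASK = "<mask>"

def make_decoder_template(tokens, probs, threshold=0.9):
    """Build the decoder input from a greedy CTC prediction: tokens
    predicted with low confidence are replaced by a mask symbol, which
    the bidirectional decoder then fills in using both contexts. The
    sequence length is fixed by the CTC prediction."""
    return [t if p >= threshold else MASK for t, p in zip(tokens, probs)]

tokens = ["turn", "the", "lights", "of"]  # hypothetical CTC hypothesis
probs = [0.99, 0.95, 0.60, 0.45]          # per-token confidences
print(make_decoder_template(tokens, probs))
# -> ['turn', 'the', '<mask>', '<mask>']
```

Because masked positions keep their slot in the sequence, the decoder can correct substitution errors but, as noted above, cannot insert or delete symbols.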

Intent Recognizer
In the SLU datasets, an intent is represented as a certain action (e.g., move, grab, turn, etc.) to which a number of arguments are attached (forward, fast, left, etc.). We encode the full intent string as a multihot vector, where each bit corresponds to either one of the possible actions or an argument value. The representations obtained from the decoder are summarized with a stack of class attention layers [19]. Class attention combines a sequence of representations into a class embedding (x_CLS), that is, a learned vector of the same dimensions as the representations. During training, the module learns to identify patterns in the input features that are predictive of the class labels. Similarly to other transformers, a class attention layer is composed of two sublayers: an attention layer and a point-wise feedforward layer. The input to the layer is normalized before processing to stabilize training, as prescribed by [20]. Following the notation of [1], we define the queries as the class embedding: Q = x_CLS. The keys and values are linear projections of the elements in the input sequence of features (x): K = W_k x and V = W_v x. In contrast to [19], we do not include the CLS token in the keys and values. We compute the attention weights by applying the softmax function to the inner product of the keys and queries, scaled by the square root of the dimension of each attention head: A = softmax(Q K^T / sqrt(d/h)), where d is the total dimension of x_CLS, and h is the number of attention heads. The output corresponds to W_o(A V) + b_o, where W_o and b_o are learned parameters. We train the model by minimizing a binary cross-entropy loss. Since not all combinations are possible, we enforce the output structure by selecting the valid combination that minimizes the cross-entropy with the predicted vector.
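A single-head, pure-Python sketch of the class attention computation described above (with h = 1 head, the scaling reduces to sqrt(d)); the output projection is omitted, and the toy dimensions and identity projection matrices in the example are assumptions for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def class_attention(x_cls, seq, w_k, w_v):
    """Single-head class attention: the query is the learned class
    embedding x_cls; keys and values are projections of the input
    sequence only (unlike [19], the CLS token is excluded from them)."""
    d = len(x_cls)
    keys = [matvec(w_k, x) for x in seq]
    values = [matvec(w_v, x) for x in seq]
    # A = softmax(Q K^T / sqrt(d)) with a single head
    scores = [sum(q * k for q, k in zip(x_cls, key)) / math.sqrt(d) for key in keys]
    attn = softmax(scores)  # one weight per input position
    out = [sum(a * v[i] for a, v in zip(attn, values)) for i in range(d)]
    return out, attn

# Toy example: 2-dimensional embeddings, identity projections.
x_cls = [1.0, 0.0]
seq = [[1.0, 0.0], [0.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
out, attn = class_attention(x_cls, seq, eye, eye)
print(attn)  # the first position aligns with x_cls and gets the larger weight
```

The attention weights `attn` are the per-position relevance scores that are later used to interpret the model's predictions (Section on interpreting model predictions).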

Datasets
Corpus Gesproken Nederlands [21] is a collection of recordings in Dutch and Flemish collected from various sources, such as readings, lectures, news reports, conferences, telephone conversations, etc., totaling more than 900 h. They are divided into 15 components (a to o) based on their nature. After removing short and overlapping utterances, we divide each component into three subsets to serve as training, validation and test sets, respectively. We leave out three components (a, c and d) because the quality of their transcriptions differs considerably from that of the other components. The subsets from the remaining components, totaling 415 h of speech, are concatenated to form the training, validation and test sets.
Librispeech [22] contains about a thousand hours of read English speech derived from audiobooks. The authors provide official splits for training, validating and testing the models, and this dataset serves as a reference in ASR. We use all 960 h for training.
Grabo [23] is composed of spoken commands, mostly in Dutch (one speaker speaks English), intended to control a robot. There are 36 commands that were repeated 15 times by each of the 11 speakers. More precisely, the robot can perform eight actions (i.e., approach, move (relative), move (absolute), turn (relative), turn (absolute), grab, point and lift), each of which is further defined by some attributes (e.g., the robot can move forward or backward and rapidly or slowly). We follow the methodology applied in [23]: for each speaker, we create nine datasets of different sizes and divide each dataset into five folds for cross validation. The target is represented as a binary vector of 31 dimensions.
Patience [24] is derived from a card game in which a player provides vocal instructions to move cards between different tiles on the table. The dataset contains recordings from eight different Flemish speakers. The intent of the player is represented as a binary vector. Again, we use the same splitting methodology as [23], keeping only the 31 most represented classes. It should be noted that without visual input, this task is quite difficult, even for a human operator.

Fluent Speech Commands [25] contains 30,043 utterances from 97 English speakers. Each utterance is a command to control smart home appliances or a virtual assistant. The challenge splits that we use in this work were proposed by [26]. Two test sets are provided, in which specific speakers or utterances are separated from the training and validation sets. We also explore data shortage scenarios, where only a portion of the training set is available.
SmartLights [27] is composed of 1660 commands to control smart lights uttered by speakers with various accents and of various origins. Again, we use the challenge split proposed by [26].

Pretraining
We pretrain our encoder-decoder model on an ASR task with a hybrid objective that is the weighted sum of the CTC loss (L_ctc) and the label-smoothing cross-entropy loss of the decoder (L_dec): L = ρ L_ctc + (1 − ρ) L_dec, where ρ and 1 − ρ are the weights of the CTC loss and the decoder loss in the total loss, respectively. We use CGN (220 h) to pretrain the Dutch model and Librispeech (960 h) for the English model. Our objective is to train a representation model of speech that captures elements of language and semantics while remaining robust to irrelevant aspects such as speaker identity or background noise. Additionally, the information stored in the encodings should be readily accessible for processing by a downstream model. We choose ASR as a proxy task because of the availability of large datasets and because it is easier to generate rich representations containing elements of language than with self-supervised approaches, which typically require considerably more resources to achieve the same goal. However, the supervised objective is likely to discard information not directly useful for solving the task. Addressing this neglected information will be the focus of future work. During pretraining, we use the real tokens as input to be masked for the decoder. During inference, we use the CTC module to generate a rough prediction that is used as a template, and we mask out tokens predicted with a probability of less than 90%. Similarly to MaskPredict [18], we iteratively refine the input for a maximum of 10 steps. We pretrain the models on a GPU for 200 epochs, with batches of 32 examples, and accumulate the gradients over 8 iterations, which provides an effective batch size of 256. We use the Noam learning rate scheduler [1]. We experimentally set ρ to 0.3 by measuring the ASR accuracy on a held-out validation set.
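The iterative refinement at inference time can be sketched as follows; the `decode_fn` interface (returning tokens and per-token probabilities given the current template) is an assumed stand-in for the transformer decoder.

```python
MASK = "<mask>"

def mask_predict(template, decode_fn, threshold=0.9, max_iter=10):
    """Iterative refinement in the spirit of MaskPredict [18]: at each
    step the decoder re-predicts the sequence, tokens whose probability
    stays below the threshold are re-masked, and the loop stops early
    once no mask remains (or after max_iter passes)."""
    seq = list(template)
    for _ in range(max_iter):
        tokens, probs = decode_fn(seq)  # assumed decoder interface
        seq = [t if p >= threshold else MASK for t, p in zip(tokens, probs)]
        if MASK not in seq:
            break
    return seq

# Hypothetical decoder that is confident about every position:
confident = lambda seq: (["turn", "the", "lights", "off"], [0.99, 0.98, 0.95, 0.97])
print(mask_predict(["turn", "the", MASK, MASK], confident))
# -> ['turn', 'the', 'lights', 'off']
```

In our setting the initial template comes from the greedy CTC prediction, so the loop only corrects low-confidence positions and the sequence length never changes.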

Training
After pretraining the representation model, we freeze all the layers to train the spoken language understanding module. The training objective is to minimize the binary cross entropy between the predictions and the multihot-encoded targets. Freezing the representation model gives us the opportunity to evaluate the predictive power of the representations. At this point, the pretrained encoder-decoder described in Section 3.2 has not been exposed to any training example from the SLU datasets, nor does it know about the output structure of the underlying task (number of classes, etc.). We train the models with the Adam optimizer [28] for a maximum of 200 epochs, with early stopping. The batch size is 512. We set the learning rate to 0.005, except for the Smartlights dataset, for which we found that a slower learning rate of 0.001 achieved better results.
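A sketch of the training target and of the structure-enforcement step described here and in Section 2.4: the output is a multihot vector trained with binary cross-entropy, and at prediction time the valid action/argument combination with minimal cross-entropy to the predicted probabilities is selected. The 4-bit label space in the example is hypothetical.

```python
import math

def bce(pred, target):
    """Binary cross-entropy between a predicted probability vector and
    a multihot target vector."""
    eps = 1e-9
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, target))

def enforce_structure(pred, valid_combinations):
    """Since not all action/argument combinations are possible, pick
    the valid multihot vector with minimal cross-entropy to the
    predicted probabilities."""
    return min(valid_combinations, key=lambda combo: bce(pred, combo))

# Hypothetical 4-bit output space: [lights, music, kitchen, blue]
valid = [[1, 0, 1, 1], [1, 0, 0, 1], [0, 1, 0, 0]]
pred = [0.9, 0.1, 0.8, 0.7]
print(enforce_structure(pred, valid))  # -> [1, 0, 1, 1]
```

This projection onto the set of valid combinations guarantees that the final prediction is always a well-formed intent, even when individual bits of the raw output disagree.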

Fine Tuning
Our main objective with this research is to learn representations that perform well without fine tuning. However, to provide a good basis for comparison with previous research and to quantify how much can be gained by updating the main model, we perform a few fine-tuning steps at a low learning rate while unfreezing some or all layers of the representation model's decoder. This operation can be seen as specializing the representation model for the specific aspects of the downstream task, often at the expense of generality. In practice, we found that unfreezing the whole decoder was more tedious and sometimes led to overtraining. Unfreezing the last four layers generally yielded the best results. As discussed in Section 2.3, we use the CTC module to produce the initial sequence in which low-probability tokens are masked. We perform only one pass with the decoder to generate the representations used by the SLU module.

Evaluation
We choose specific metrics to assess the performance of our models, primarily focusing on accuracy and the F1 score. Accuracy measures the proportion of correctly predicted instances, encompassing both intents and slots. The F1 score, computed as the harmonic mean of precision and recall, balances both types of errors. While we consider the F1 score to offer a more comprehensive assessment, we also examine accuracy to facilitate comparisons with established models.
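For multihot intent vectors, the micro-averaged F1 score used in the low-resource experiments pools counts across all labels and examples before computing precision and recall; a minimal sketch:

```python
def micro_f1(preds, targets):
    """Micro-averaged F1 over binary multihot vectors: true positives,
    false positives and false negatives are pooled across all labels
    and all examples before precision and recall are computed."""
    tp = fp = fn = 0
    for pred, tgt in zip(preds, targets):
        for p, t in zip(pred, tgt):
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One correct label, one false positive, one miss:
print(micro_f1([[1, 0, 1]], [[1, 1, 0]]))  # -> 0.5
```

Micro-averaging weights frequent labels more heavily than macro-averaging, which is appropriate here since the intent bits are highly imbalanced.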
In our investigation into the model's performance in a low-resource setting, we deliberately limit the number of available training examples. This allows us to assess its adaptability and effectiveness under conditions of data scarcity. Furthermore, to gain deeper insights into the representations generated by our model, we conduct a qualitative analysis. We achieve this by visualizing the representations using t-SNE [29], a technique that helps identify clusters of examples, shedding light on the structure and organization of the learned features. We also visualize the attention weights to understand how the model makes certain predictions.

Hyperparameters
The transformer has 18 layers (12 encoder layers and 6 decoder layers), each with 4 attention heads, a hidden layer size of 256 and 2048 hidden units in the linear layer. We use dropout with a probability of 0.1 after each transformer layer. The front end has two 2D convolutional layers with kernel sizes of 3 × 3 and a stride of 2 × 2. Thus, the input dimension is divided by a factor of four along the time and frequency dimensions. The hyperparameters related to the architecture of the encoder-decoder were chosen according to [15]. The encoder-decoder model has 30.9 million parameters. The weights for scaling the different losses in the final loss function are chosen through experimentation using the pretraining data. We train our models for 200 epochs with a batch size of 256. The models are trained with the Adam optimizer [28]. The learning rate linearly increases until reaching a maximum value of 0.4 after 25,000 steps, then decreases according to the Noam schedule [1]. The downstream datasets were not used in any way to determine the value of the abovementioned hyperparameters.
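The learning rate schedule described above (linear warm-up to a peak of 0.4 over 25,000 steps, followed by Noam decay [1]) can be written compactly as:

```python
def noam_lr(step, warmup=25000, peak=0.4):
    """Noam schedule: the learning rate grows linearly until `warmup`
    steps, reaches `peak`, then decays proportionally to 1/sqrt(step)."""
    step = max(step, 1)
    scale = peak * warmup ** 0.5
    return scale * min(step ** -0.5, step * warmup ** -1.5)

print(noam_lr(12500), noam_lr(25000))  # 0.2 at mid warm-up, 0.4 at the peak
```

The two branches of the `min` cross exactly at `warmup` steps, so the schedule is continuous at the peak.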
The downstream models are composed of two class attention layers with four attention heads of 32 dimensions each. The fully connected layer has 1024 units. The models are trained with batches of 512 examples for 100 epochs, with a learning rate equal to 0.005. The intent classification module has 890 thousand parameters.

Results
We compare our models on the basis of accuracy to remain consistent with previous research. We select four baselines because they propose approaches similar to ours and present results on at least one of our selected SLU datasets. End-to-end SLU [25] proposes a model composed of a phoneme module, a word module and an intent module that can all be trained or fine-tuned independently. The authors experimented with four settings: with or without pretraining, and fine tuning the word module only or all modules together. The two best models (pretrained on ASR and with the word module fine-tuned) are displayed in Tables 1a-1c. ST-BERT [30] also uses an MLM objective. However, it is pretrained with cross-modal language modeling on speech and text data in two different ways: using MLM or with conditional language modeling (CLM), where the goal is to predict one modality given the other. Additionally, the model uses large text corpora pretraining and domain adaptation pretraining with the SLU transcripts to further improve the results. However, the use of text transcripts from the downstream task is not compatible with our use case, as we assume that only the speech is available for the SLU task. Since domain adaptation provides an unfair advantage, we do not report the results related to this experiment. The pretrained model corresponds to the model that was pretrained on speech-only data, and the fine-tuned model corresponds to the model pretrained on text with CLM and fine-tuned on SLU. The performance of these two baselines was reported after training on 1% and 10% of the data, which is not the case for the following. Wav2Vec2-Classifier [31] is an encoder-only model with a fully connected layer as the output. The model is entirely fine-tuned, and the performance when freezing Wav2Vec2 is not reported, although we know from [10] that Wav2Vec2 does not store semantics in its internal representations and that fine tuning is necessary to achieve decent performance. ESPnet-SLU [11] uses a pretrained HuBERT as a feature extractor and a transformer decoder for the SLU module. The complete model is fine-tuned on SLU.
Table 1a shows the results on Fluent Speech Commands when the models are trained on varying amounts of training data. For completeness, we also report the results of the models fine-tuned on the entire training set (Table 1b). In Table 1c, the challenge splits [26] correspond to improved splits, where specific speakers or specific utterances are partitioned. For Smartlights, no previous splits are available, so we follow the same approach as [26] and use a random splitting strategy.
In Table 1a, we observe significant improvements compared to [25], especially when few training examples are used, both before and after fine tuning. Both [25] and [31] noted that fine tuning the entire model does not always improve performance, because exposure to new data leads to catastrophic forgetting and, in some cases, overfitting. We make the same observation and find that unfreezing the encoder leads to its degradation. Indeed, the encoder was pretrained on large amounts of data and is robust to variations in the input features. When it is fine-tuned with a different objective on a handful of examples, the parameter updates following the new setting often lead to overfitting and a loss of generality. This is also observed for ESPnet-SLU [11] in Table 1c, where we observe a much larger difference between the performance on unknown speakers and unknown utterances than with the other models. Updating the encoder also deteriorates its ability to generate a good template for the decoder, which leads to weaker performance. Consequently, we decided to keep the encoder frozen but update the decoder during the fine-tuning stage. With this setting, the performance improves slightly, although consistently, compared to the model before fine tuning. Our model shows a similar pretraining accuracy as [30], but small improvements are observed after fine tuning. The good performance of the frozen models is encouraging, as it suggests that the pretrained model stores semantic information adequately enough for the SLU module to make use of it. Both approaches using an MLM objective are particularly efficient when a limited amount of training data is available.

Table 1a: Test accuracies on Fluent Speech Commands (original splits). In the first stage, the pretrained model is frozen and the SLU layers are trained. In the second stage, the pretrained model is (partially) fine-tuned.

Table 1b: Test accuracies on Fluent Speech Commands when fine-tuned on the entire training set.

Model                       Accuracy
MLM (finetune)              99.8
E2E SLU [25]                99.1
ST-BERT [30]                99.5
Wav2Vec2-Classifier [31]    99.7
ESPnet-SLU [11]             99

Table 1c: Test accuracies on the Fluent Speech Commands (challenge splits) and Smartlights (random and challenge splits) datasets. In the first stage, the pretrained model is frozen and the SLU layers are trained. In the second stage, the pretrained model is partially (or, in [11], totally) fine-tuned.

Our model also fares well against [31] and [11], although our methodology does not include data augmentation (Table 1b). Unfortunately, we cannot compare the quality of the representations produced by the pretrained models, owing to a lack of reported results.
In Table 1c, we observe a 12% relative improvement compared to [25] on FSC (challenge) but a deterioration on Smartlights. We speculate that the MLM objective produces similar embeddings for word pieces that belong to the same semantic scope (such as blue and green, or bedroom and bathroom), which hurts the performance on this particular dataset. The few examples in the training set do not suffice to generalize to unknown phrasings for a specific target (i.e., when the same target is expressed differently).

Low-Resource Scenario
To explore low-resource scenarios, we split Grabo and Patience per speaker. For each speaker, we randomly select a fixed number of examples per class for the training set of the SLU model. By gradually increasing the size of the training set, we are able to measure the learning curve. This operation is performed three times per speaker, with different splits each time. We compute the micro-averaged F1 score for each experiment and report the average F1 score with its standard deviation in Figure 2.
We compare our representations (MLM in Figure 2) with three types of features: NLP features generated by encoding the gold transcriptions with bert-base-dutch-cased [33], Pipeline features resulting from the encoding of ASR transcripts predicted by ESPnet's hybrid ASR model [12] trained on CGN with bert-base-dutch-cased, and CLM features corresponding to the output of the penultimate layer of the ASR model. The colored areas represent the 68% confidence interval. We use the F1 score for comparison with the NMF baseline [32]. The Pipeline features suffer from the cascading of ASR errors to the language model, with the end-to-end features avoiding this problem. Nonetheless, we see that the CLM features perform worse than our model. The masked language modeling objective allows the MLM model to look at both right and left contexts, whereas the CLM model only has access to previous predictions. Because each predicted unit is conditioned on the whole sentence, the model is able to derive better representations than by only looking at the past.
We also compare our results with the state of the art on both datasets [32]. This model combines ASR pretraining and an NMF decoder for intent recognition [32,34]. The NMF decoder uses a bag-of-words approach with a multihot intent representation, thus ignoring the order of the words in the sentence. Although this model was presented for dysarthric speech, results are available for both the Grabo and Patience datasets [32]. The main advantage of the NMF decoder is its low computational requirements, which makes it particularly well-suited when very little training data is available, as can be observed in Figure 2a. This advantage disappears as the size of the training set increases. However, for the Patience corpus, in which the order of words is important for prediction, our approach achieves better performance than the model proposed by [34], in which the sequence order is ignored (Figure 2b).

Representation Content
To better understand the characteristics of the representations, we average the sequences to a unique representation per utterance and visualize them in two dimensions using the t-SNE algorithm [29]. For this experiment, we focus on Grabo and, in particular, on the eight actions that the robot can perform. We compare the representations produced by different layers, namely the last encoder layer, the third decoder layer and the last decoder layer (encoder.11, decoder.2 and decoder.5 in Figure 3, respectively). Both encoder.11 and decoder.5 form meaningful clusters, while decoder.2 is ill-defined. This is also observed in the classification accuracy of those representations on SLU, where we observe a clear difference between decoder.2 and the other features. We expected that decoder.5 would generate better features than encoder.11, although they seem to lead to similar performance, with a small advantage for the encoder's output. We conjecture that the CTC component helps to define the acoustic units at the encoder's output, which translates into well-defined representation sequences, albeit much longer ones.
We also observe that approach and move_abs end up in the same region, which means that a classifier might often confuse them. This makes sense from a language perspective, and the fact that this is more the case in decoder.5 than in encoder.11 leads us to conclude that the implicit language model from the decoder generates similar embeddings for these two commands, although they do not sound similar.
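The per-utterance pooling step can be sketched as follows. For a dependency-free example we project with PCA as a stand-in for t-SNE (the paper uses t-SNE via, e.g., scikit-learn); all shapes and data are synthetic placeholders:

```python
import numpy as np

# Sketch of the visualization pipeline: mean-pool each variable-length
# representation sequence into a single vector per utterance, then project
# the pooled vectors to 2-D. (PCA stands in here for t-SNE to keep the
# example dependency-free.) Shapes and data are illustrative.
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(T, 64)) for T in (50, 80, 65)]  # (time, dim)

pooled = np.stack([u.mean(axis=0) for u in utterances])        # (3, 64)

# Project onto the top-2 principal components.
centered = pooled - pooled.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T                                    # (3, 2)
print(coords.shape)  # (3, 2)
```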

Interpreting Model Predictions
Finally, we want to explore an aspect that is often overlooked in deep learning, namely elucidating which factors in the input features drive the model's predictions. A key aspect of our class attention layer is its ability to establish a connection between the final output prediction and the input sequence through the attention weights. These weights can be interpreted as the relevance of each input unit in determining the correct label for the utterance. As shown in Figure 4, a limited subset of tokens (e.g., "turn", "the", "light", "on" and "kitchen") substantially influences the prediction. When the model's predictions are inaccurate, these attention weights serve as valuable starting points for in-depth investigations into the underlying causes.
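A minimal single-head class-attention sketch, assuming a learned class query that attends over the input sequence (this is an illustrative simplification, not the exact layer used in the paper), shows where the interpretable weights come from:

```python
import numpy as np

# Single-head class attention, sketched: a learned class query attends over
# the sequence; the softmax weights form a distribution over positions and
# can be read as per-token relevance for the prediction.
def class_attention(seq, q, Wk, Wv):
    """seq: (T, d) inputs; q: (d,) class query; Wk, Wv: (d, d) projections."""
    keys = seq @ Wk                       # (T, d)
    vals = seq @ Wv                       # (T, d)
    scores = keys @ q / np.sqrt(len(q))   # (T,) scaled dot-product scores
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # attention weights, one per position
    return w @ vals, w                    # pooled class embedding, weights

rng = np.random.default_rng(1)
d, T = 16, 6
pooled, weights = class_attention(rng.normal(size=(T, d)), rng.normal(size=d),
                                  rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(weights)  # inspecting these weights is what Figure 4 visualizes
```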

Conclusions
Despite the significant advancements in speech processing, there remains a crucial gap in the field concerning the development of speech representation models that capture semantics akin to the capabilities exhibited by language models. In this paper, we introduce a novel bidirectional representation model pretrained on an ASR task, which demonstrates remarkable effectiveness in transferring learned features to the domain of intent recognition, all without necessitating additional fine-tuning. Our pretraining strategy incorporates CTC, together with an MLM objective, yielding speech features endowed with bidirectional contextual awareness.
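The MLM-style part of such a pretraining setup hinges on masking spans of input frames and reconstructing targets only at the masked positions. The sketch below illustrates that masking step; the span length and masking rate are illustrative choices, not the paper's hyperparameters:

```python
import numpy as np

# Hedged sketch of span masking for an MLM-style objective on speech
# features: random spans of frames are replaced by a mask embedding, and
# the reconstruction loss is computed only at the masked positions.
# mask_prob and span are illustrative, not the paper's settings.
def mask_spans(features, rng, mask_prob=0.15, span=5):
    feats = features.copy()
    T = len(feats)
    masked = np.zeros(T, dtype=bool)
    for start in range(0, T - span):
        if rng.random() < mask_prob / span:
            masked[start:start + span] = True
    feats[masked] = 0.0          # stand-in for a learned mask embedding
    return feats, masked         # loss would be restricted to masked frames

rng = np.random.default_rng(0)
x, m = mask_spans(np.ones((100, 40)), rng)
print(m.sum(), x.shape)
```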
Throughout our research, we examined the representations in various layers of the model, particularly with regard to their performance when applied to previously unseen datasets. Our findings show the versatility of these representations in training an effective SLU model. Notably, our representations either match or surpass the performance of state-of-the-art models on the SLU task, especially prior to the fine-tuning phase. We have also introduced a novel approach leveraging class attention mechanisms to summarize sequences of representations. This approach not only proves highly efficient, owing to its transformer-like architecture and minimal parameter requirements, but also offers insights into the model's decision-making process, shedding light on the input patterns that significantly influence predictions. Furthermore, our combination of class attention and ASR pretraining exhibits substantial gains in terms of data efficiency, as evidenced in a low-resource scenario in which we intentionally constrained the number of training examples available for the SLU task. In an era of increasingly massive and data-hungry models, our research underscores the importance of pursuing resource-friendly and efficient solutions.
However, it is important to acknowledge a limitation of our model: it struggles with SLU datasets that differentiate between terms seldom encountered or entirely absent in the pretraining data. When two words are largely interchangeable except for a few instances in the pretraining dataset, our model's representations may become overly similar, hindering the ability to discriminate between these distinct classes without fine-tuning. Addressing this limitation will be a focal point of our future research.

Figure 2 :
Figure 2: Comparison of different models trained with increasing train set sizes on (a) Grabo and (b) Patience. Each curve represents the F1 score on the test set as a function of the size of the training set (utterance count). The colored areas represent the 68% confidence interval. We use the F1 score for comparison with the NMF baseline [32].

Figure 4 :
Figure 4: Class attention weight visualization of the two layers of the downstream class attention model. Although the model receives sequences of embeddings, we label the graph with the corresponding tokens for demonstration purposes. The two layers focus on different positions to predict the intent and arguments.

Table 1 :
Table 1: Results on Fluent Speech Commands (top) and Smartlights and MASE splits [26] (bottom) trained on 1% of the training data, which leads us to conclude that bidirectional context embeddings are particularly useful in a low-resource setting.
To provide a fair comparison, neither the NLP, ASR nor MLM model encountered training examples from the SLU dataset. In other words, we do not fine-tune any of the representation models on the downstream datasets at this stage. Both the ASR and MLM models are trained on CGN, and the NLP model is trained on a collection of five text corpora for a total of about 2.4B tokens. Considering the amount of training data, it is no surprise to see that the gold transcript encoded with the NLP model performs very well in all data regimes. As expected, the ASR transcripts show the weakest performance. Converting features into discrete symbols forces the model to make decisions, resulting in potential errors from which the model cannot recover. In contrast, the CLM and MLM models use the hidden representations produced by the model, thus