Improving Misfire Fault Diagnosis with Cascading Architectures via Acoustic Vehicle Characterization

In a world dependent on road-based transportation, it is essential to understand automobiles. We propose an acoustic road vehicle characterization system as an integrated approach for using sound captured by mobile devices to enhance transparency and understanding of vehicles and their condition for non-expert users. We develop and implement novel deep learning cascading architectures, which we define as conditional, multi-level networks that process raw audio to extract highly granular insights for vehicle understanding. To showcase the viability of cascading architectures, we build a multi-task convolutional neural network that predicts and cascades vehicle attributes to enhance misfire fault detection. We train and test these models on a synthesized dataset reflecting more than 40 hours of augmented audio. Through cascading fuel type, engine configuration, cylinder count, and aspiration type attributes, our cascading CNN achieves 87.0% test set accuracy on misfire fault detection, demonstrating margins of 8.0% and 1.7% over naïve and parallel CNN baselines. We present experimental studies focused on acoustic features, data augmentation, and data reliability. Finally, we conclude with a discussion of broader implications, future directions, and application areas for this work.


Introduction
Transportation is essential to daily life, with over one billion automobiles in use globally [1]. These vehicles produce approximately three billion metric tons of carbon dioxide annually [2] and consume scarce natural resources. As miles travelled increase, the need to reduce emissions and conserve energy grows in importance.
Advances in autonomy and shared mobility will also soon demand that costly highly automated vehicles be amortized over increased miles travelled [3][4][5]. A growing market for personal automobiles, alongside decreased supply and rising demand for long-life vehicles such as those to be used in new mobility services, has made vehicle longevity more important than ever before. In turn, there is a need to maintain peak performance over longer vehicle lifetimes, as persistent inefficiencies unnecessarily drive up the total cost of ownership and increase environmental damage.
Core to the problem of persistent inefficiency is a lack of consumer knowledge. Though there have been efforts to make waste and emissions management exciting and accessible [6] to average individuals, the same is less true for fields such as automotive maintenance. Without detailed knowledge of vehicles or their maintenance needs, automobiles are often left to languish, driven with check engine lights on and unburnt fuel exiting their exhaust pipes. There is an opportunity to build smarter vehicle diagnostics that help untrained individuals assess, evaluate, and plan a response to vehicle operating conditions. While vehicle technology is rapidly advancing, fault diagnostics are lesser-explored, whether for trained individuals, such as mechanics, or unskilled individuals, such as vehicle owners and operators. There is a largely unmet need for a better way to understand vehicles and their condition, which we frame as four questions:
1. General acoustic classification: Does the audio sample contain a vehicle?
2. Attribute recognition: What kind of vehicle is it?
3. Status prediction: Is the vehicle performing normally?
4. Fault identification: If abnormal, what fault is occurring?
In this manuscript, we identify three main areas of novelty with our work:
• Misfire fault diagnosis: While most misfire fault diagnostic approaches are limited to specific types of engines, we built a more robust system that can detect misfires across unique fuel types, engine configurations, cylinder counts, and engine aspirations.
• Acoustic Vehicle Characterization: We trained a two-stage convolutional neural network (CNN) that takes acoustic features as input and predicts attributes of the engine. To the best of our knowledge, we built the first acoustic recognition system that can simultaneously characterize engines and diagnose faults.
• Cascading Architectures: We proposed a broad framework for cascading architectures and then showcased a proof-of-concept approach which achieved 87.0% test set accuracy on misfire fault detection, improving by 8.0% and 1.7% over naïve and non-cascading baselines.

Figure 1. Flowchart of our proposed cascading architecture for vehicle understanding using audio data. There are four sequential levels with a conditional dependence between levels 1 → 2 and 3 → 4. Of particular note is the transition between levels 2 and 3, as level 2 cascades its attribute predictions (fuel type, engine configuration, cylinder count, aspiration type) to level 3. This work focuses on levels 2 and 3 as a proof-of-concept.
To demonstrate the impact of our approach, we conducted ablation studies on data augmentation and model generalizability. In the conclusion, we consider the broader implications, future directions, and potential applications of our work.

Prior Art
The presented work is novel in three main areas. To establish how our work relates to and builds upon these areas, we now explore relevant prior art for each in Sections 2.1-2.3. First, we consider fault detection in automotive combustion engines. Second, we delve into vehicle classification tasks using audio features. Finally, we compare our proposed cascading architecture to similar sequential and conditional architectures.
Our work focuses on utilizing only an unrefined acoustic signal to detect misfires. This approach has been narrowly studied; for example, traditional fully connected artificial neural networks (ANNs) combined with chaos analysis and wavelet features have been applied to audio collected from smartphones [34,35]. The authors of [28] build ANN models that compare vibration and sound modes using Fast Fourier Transform (FFT) features. Using sound quality metrics with a support vector machine (SVM) [36,37] also improves upon vibration analysis. Parallel tasks in misfire fault analysis consider finding the location of the misfire [38] or extracting the fault component from the acoustic signal [39].
One research study looking to identify specific faults [9] collects nearly 1k audio samples of vehicles using smartphone microphones, of which approximately one third are abnormal "vehicle misfire" events. Using traditional machine learning techniques and various audio features combined with feature selection, [9] achieves a 1.0% misclassification rate and a 1.6% false positive rate. However, this previous work was trained on three vehicle types, and it is unclear if and how the results would generalize to a broader sample including more diverse vehicles.
Most approaches for acoustic misfire fault diagnosis only consider a single type of engine, such as gasoline [30], Diesel [24,27], turbocharged [32], or four cylinder [28,36]. The strength and novelty of our approach is that it is trained on over 200 unique engine samples that vary across fuel type, engine configuration, cylinder count, and aspiration type. Our approach is thus not specific to, for example, Diesel or turbocharged engines, but rather is agnostic to engine type when detecting misfires. This makes our approach more dependable when deployed in a real-world scenario for a consumer checking the status of their vehicle.

Acoustic Vehicle Characterization
We define general acoustic classification [40] as classifying a signal across many labels within broad categories. AudioSet [41], for example, is a popular dataset for general acoustic classification, which contains 527 diverse, hand-annotated classes for 2.1 million YouTube videos. AudioSet [41] provides an ontology for general acoustic classification with seven categories including: human sounds (speech/body), animals, music, sounds of things (vehicle, engine, tools), natural (weather), source-ambiguous, and background noise.
First, we focus on low-level, fine-grained, multi-level label prediction under the assumption that the highest-level class is known, rather than simply classifying a single level at a time. Given our approach focuses on these fine-grained tasks, we do not utilize pre-trained acoustic embeddings [50,52,53] but rather train lightweight models from scratch.
Second, we utilize data collected at a higher sampling rate (48 kHz) compared to data collected from YouTube, which, as we show in Section 5.2, caps its sampling rate and in the process may discard informative features. This is important because models trained on such crowd-sourced audio may not perform as well as models trained and operated using raw, full-frequency audio directly collected from a mobile device. YouTube and other such sources also conduct feature-destructive compression on some audio samples, which can limit model performance and generalizability.
Third, our work is novel in its application of sound-based deep learning, as we focus on acoustic vehicle characterization. Prior art [12] has conducted an extensive survey of off-board vehicle vibro-acoustic diagnostics which explores a multitude of acceleration- and sound-based vehicle characterization approaches. Our work focuses on the lesser-explored tasks of vehicle attribute recognition, engine misfire fault detection, and specifically the interrelation between vehicle variant and fault-specific diagnostic algorithms.
Previous work has focused on vehicle attribute recognition [12,68] utilizing traditional ML techniques such as SVMs, random forests, and feature selection. These primarily consider the FFT and Mel-Frequency Cepstral Coefficient (MFCC) feature types, whereas our work considers five distinct feature types and conducts an in-depth feature comparison. These works also do not consider the engine configuration attribute task. Our work explores the three previously studied attribute prediction tasks (fuel type, cylinder count, and aspiration type) as well as the engine configuration task. Refs. [12,68] also build separate models for each prediction task, whereas our deep learning approach demonstrates that predicting attributes jointly in a single model can yield strong results, with variant-identifying parameters helping to "shape" the final diagnostic model.
Our work is unique in its merging of vehicle acoustic characterization tasks. First, classifying attributes for an abnormally performing vehicle may be a more challenging task than has been explored [12,68] for normally performing vehicles. Second, we show the value of multi-task learning, with attribute prediction improving misfire performance and vice versa. Finally, we demonstrate a proof-of-concept with our two-stage CNN that both characterizes engines and diagnoses faults.

Cascading Architectures
We envision our cascading architecture for vehicle understanding as a sequential, conditional model. Specifically, we construct an architecture with four sequential levels (Figure 1), where levels 1 → 2 and 3 → 4 are conditionally dependent. Additionally, level 3 is dependent upon the cascaded result from level 2. There is valuable related work in sequential and conditional deep learning that has been utilized in select applications. Examples include human activity recognition [69], image segmentation [70], facial recognition [71], frame prediction in games [72], and music rhythm generation [73]. Three works are of particular relevance to our cascading architecture for acoustic vehicle characterization: [74][75][76].
The approach from [74] has elements that are similar to our proposed cascading architecture. Specifically, [74] leverages the relationship between fine and coarse labels in their model training. The first and second level of our cascading architecture can be seen as examples of coarse and fine labels, which have a subset relationship. However, [74] does not consider conditional logic and does not have multiple levels like our proposed cascading architecture.
Both [75] and our work utilize neural networks for acoustic classification. However, the conditional component of [75] does not correspond to the multi-label dependence proposed in our cascading architecture, but rather to conditional dependence in the data itself. Specifically, [75] leverages the temporal relationship of nearby frames in spectrograms, whereas our work utilizes the entire spectrogram, agnostic to its temporal nature, as a feature set in the CNN.
CascadeML [76] is closely related to our proposed architecture. Our architecture might almost be considered an example of the theoretical structure proposed by [76]. However, ref. [76] focuses on the AutoML component, where the network can dynamically grow itself based upon any number of multi-label samples. We instead embed expert knowledge into the static structure of the cascading architecture. In other words, we know fault type is conditional upon status prediction, so we embed that in the structure of our architecture rather than building a network from scratch with the AutoML approach of [76]. Not only does this approach allow us to demonstrate stronger and more consistent performance, but it should also be more efficient given its structure is already defined.
In summary, refs. [74][75][76] showcase sequential, conditional, and cascading architectures that are closely related to our work. However, these works focus on either theoretical or high-level acoustic classification. We show there is value in building a model for the specific application of acoustic vehicle characterization and misfire fault diagnosis.

Approach
As noted in the problem motivation, proper maintenance is increasingly important to vehicles as their number and miles travelled grow. However, as vehicles become more like appliances and less familiar to their users, driver and passenger awareness of their capabilities diminishes. We suggest that automated vehicle identification is an important first step in diagnosing and rectifying problems plaguing efficiency and emissions: when a part fails, it is not enough to know that a vehicle is big or small, or red or silver, but rather that a vehicle is a 2016 version of some specific make and model possessing a particular engine and transmission type and tire size. Creating tools that effectively identify vehicles and use these attributes as a means of improving state characterization allows non-expert users to expertly diagnose their own vehicles, clearing the knowledge hurdle to effective maintenance and thereby reducing fuel consumption, emissions, and likelihood of breakdown.
Our approach is a proof-of-concept for how vehicle characteristics can better inform fault diagnosis and we explain our approach from five main angles: Sections 3.1-3.5. First, we share the details of our novel dataset, specifying the dataset splits and augmentation procedure. Second, we compare the different audio feature types used in our models. Third, we develop a high-level understanding of the two multi-task models proposed in this work. Then, we clarify the specifics of the CNNs for each main feature type. Finally, we provide implementation details to aid in reproducibility.

Dataset
Our dataset is a collection of audio samples recorded using mobile phones. We utilize mobile phone microphones as these tend to have repeatable characteristics within the range of frequencies similar to those of human speech and hearing (roughly 20 Hz-20 kHz). In particular, we merge the datasets from [12,68], which focus on attributes, with [9,77], which focus on misfire detection. We outline the makeup of our dataset and its splits in Table 1. Together, these datasets totaled 286 (228 + 58) audio samples, which contributed to about 320 min of non-augmented audio. We decided to split each audio sample into three second clips ("chunks") to provide our features and models with a standard input size for training and to expand our dataset size. This duration was selected as striking an appropriate balance between capturing low-frequency sounds and ease and speed of capture. Additionally, we believe three seconds is long enough for a clip to be representative for our model, but short enough that clips within the same sample remain distinct and useful for training. After chunking the samples, this resulted in 6406 (5455 + 951) unique three second clips. Though fixed-length clips were used, our models at test time are agnostic to the input feature size, rather than needing the sample to be exactly three seconds.

Table 1. Summary of the train/validation/test split for our datasets. We create 8 augmented training samples per validation clip. The test set is mutually exclusive from the train and validation sets, which allows it to provide insight into outsample performance and model generalizability.

Dataset Split        Raw Samples    Clips (3 s)    Total Length (min)
Train (augmented)    —              43,640         2182.0
Validation           228            5455           272.8
Test                 58             951            47.6
Total                286            50,046         2502.3
Our approach for the train/validation/test split involved building a train set using only augmented clips, with the goal of reducing overfitting and improving generalizability. In turn, we split the real full-length samples into a validation and test set using a traditional 80/20 split. As seen in Table 1, this resulted in 5455 validation clips and 951 test clips. We split the validation and test sets using the raw samples rather than the clips because we created the train set using only the validation set. This allowed the test set to be mutually exclusive from both the train and validation sets, showcasing the model's outsample generalizability as accurately as possible.
Due to the relatively small size of the datasets, we manually created the validation and test splits, attempting to balance the label distribution to the best of our ability across the four attribute recognition tasks (fuel type, engine configuration, cylinder count, and aspiration type) and the status prediction task focused on misfire detection. Label distributions are visualized in Figures 2 and 3. For the test set (Figure 3), we attempted to match the training and validation label distribution as closely as possible, while also including at least one raw sample from each label.
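To make the chunking step concrete, the following is a minimal sketch; the function name and the remainder-dropping behavior are our assumptions, as the paper does not specify how trailing partial clips are handled:

```python
import numpy as np

def chunk_clip(y, sr=48_000, clip_s=3.0):
    """Split a mono waveform into non-overlapping fixed-length clips.
    Any trailing remainder shorter than clip_s seconds is dropped here,
    though the paper does not state this detail."""
    n = int(sr * clip_s)
    return [y[i:i + n] for i in range(0, len(y) - n + 1, n)]
```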

Data Augmentation
In deep learning for computer vision, data augmentation is often utilized via rotations, cropping, scaling, etc. to gain invariance among unique inputs. Audio may similarly be augmented. For instance, ref. [78] utilized time stretching, dynamic range compression, pitch shifting, and background noise for environmental sound classification. In speech recognition, ref. [79] applied multiple speed changes to expand their training set. Focused on animal audio classification, ref. [80] uniquely applies computer vision augmentation techniques to the spectrogram such as reflection and rotation, as well as waveform augmentation with pitch, speed, volume, random noise, and time shift. For music classification, ref. [81] found pitch-shifting to be most informative but also utilizes time stretching and random frequency filtering.
Researchers [82] classify accelerating vehicles using audio and apply random noise and pitch shift to expand their dataset. Other work [83] focuses on vehicle type identification using MFCCs and other audio features; these authors apply city noise to augment captured signals. There have been additional applications of augmentation to acoustic vehicle DL-related tasks [84][85][86]. Our work is similar to [78,80,81] in that we utilize pitch, speed, volume, and random noise augmentation types. However, we differentiate ourselves in our application of augmentation to multi-task learning for vehicle attribute recognition and fault detection. Additionally, to our knowledge, our approach is unique in that our train set is not a combination of real and augmented samples but is made up entirely of augmented samples. We hypothesize this may allow for more robust training and less overfitting to, or memorization of, specific samples.
Our process of creating the training set consisted of applying a data augmentation process eight times to each of the 5455 validation clips, which resulted in a train set of size 43,640 as seen in Table 1. The augmentation process applied four types of data augmentation to each clip, with each augmentation type having a 50% probability of being activated. The types of augmentation chosen were volume change, pitch shift, speed shift, and random background noise. More specifically, we drew each augmentation's parameter from a uniform distribution over a predefined range. These ranges were set by applying each augmentation individually at its limit, listening, and determining that the sample remained recognizable. We also added an epsilon between 10⁻⁶ and 10⁻⁵ to each parameter, so as not to randomly choose a hyperparameter of zero volume change/pitch shift, or of one for speed change, which would result in the clip not being augmented at all. Additionally, if the clip was unchanged after all four augmentation types rolled their probability (i.e., none was activated), we applied one augmentation type at random. This guaranteed that no training clip was exactly the same as its respective validation clip. All value ranges were determined experimentally, with expert mechanic review to coarsely validate the "closeness" of the augmented sample to the original sample. Augmentation parameters were set to line up with typical variation in engine configurations, e.g., through manufacturing diversity and wear. For example, the frequency shift was bounded based on typical allowable tolerances for idle speed. Amplitude limits were set to minimize the effect of signal clipping. The features were determined to be acoustically discernible after augmentation.
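Below is a minimal sketch of this augmentation pipeline. The parameter ranges (GAIN_DB, PITCH_STEPS, SPEED_DEV, NOISE_STD) are illustrative placeholders, since the expert-validated ranges used in the paper are not reproduced here:

```python
import numpy as np
import librosa

RNG = np.random.default_rng(0)
EPS_LO, EPS_HI = 1e-6, 1e-5   # epsilon so no parameter degenerates to "no change"

# Illustrative placeholder ranges; the paper's expert-validated limits differ.
GAIN_DB = 3.0      # max |volume change| in dB (assumed)
PITCH_STEPS = 0.5  # max |pitch shift| in semitones (assumed)
SPEED_DEV = 0.05   # max |speed deviation| from 1.0 (assumed)
NOISE_STD = 0.005  # max std of added background noise (assumed)

def _nudge(x):
    """Shift a sampled parameter slightly away from its identity value."""
    return x + np.sign(x if x != 0 else 1.0) * RNG.uniform(EPS_LO, EPS_HI)

def augment(clip, sr=48_000, p=0.5):
    """Apply each of the four augmentations independently with probability p,
    guaranteeing at least one fires so no train clip equals its source clip."""
    ops = {
        "volume": lambda y: y * 10 ** (_nudge(RNG.uniform(-GAIN_DB, GAIN_DB)) / 20),
        "pitch":  lambda y: librosa.effects.pitch_shift(
            y, sr=sr, n_steps=_nudge(RNG.uniform(-PITCH_STEPS, PITCH_STEPS))),
        "speed":  lambda y: librosa.effects.time_stretch(
            y, rate=1.0 + _nudge(RNG.uniform(-SPEED_DEV, SPEED_DEV))),
        "noise":  lambda y: y + RNG.normal(0.0, RNG.uniform(EPS_LO, NOISE_STD), y.shape),
    }
    fired = [name for name in ops if RNG.random() < p]
    if not fired:                         # all four rolls failed:
        fired = [RNG.choice(list(ops))]   # force one augmentation at random
    y = clip.astype(np.float32)
    for name in fired:
        y = ops[name](y)
    # speed shift changes the length; trim/pad back to the 3 s window
    n = int(sr * 3.0)
    return np.pad(y, (0, max(0, n - len(y))))[:n]
```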
In summary, we created a new dataset with a non-traditional train/validation/test split: the train set consists only of augmented variants of validation clips and thus shares no exact clips with either the validation or test set. Our dataset, per Table 1, consisted in total of 50,046 three second clips, for 2502.3 min or about 42 h of audio. We further motivate our decision to use data augmentation in Section 5.1.

Features
The raw audio signal, known as a waveform, can provide standalone insights, but feature extraction and transformation is often a necessary step in building successful models for acoustic recognition tasks [87,88]. Transforming the raw waveform from the time domain to the frequency domain, e.g., with the FFT, provides particularly informative features. Other informative feature types utilize hybrid time and frequency information. Figures 4 and 5 show a representative waveform and Mel-Frequency Cepstral Coefficients (MFCCs) for two automotive engine types; note how the MFCCs are more visually discernible. This characteristic better informs our CNN models.
In this work, we explore five audio feature types: the raw waveform, the FFT, MFCCs, spectrograms, and wavelets. These features were chosen to give our models a diverse set of inputs. The raw waveform provides the model with time information, while the FFT provides frequency information. MFCCs, spectrograms, and wavelets provide varying degrees of hybrid time and frequency information at different dimensionality: MFCCs and spectrograms are 2D features, while the FFT, waveform, and wavelets are 1D. Distinct models are trained for each feature type.

Model
We design and implement a variant of the earlier-proposed Cascade model: a two-stage network in which the first stage specializes in attribute classification and the second stage specializes in misfire detection. The first stage receives one of the previously described feature sets as input. The second stage receives both the feature set and the attribute predictions from the first stage. Both stages are trained jointly using multi-task learning.
To compare against the Cascade model, we also consider two baselines: Naïve and Parallel models. The Naïve model is a straightforward baseline that predicts the most represented class for each task. For example, if we look at Figure 3, the Naïve model would always predict gasoline as the fuel type. This model would achieve 81.7% test accuracy.
The Parallel model, as seen in Figure 6, represents a more complex baseline to consider than the Naïve model. The Parallel model is represented as a sequential, multi-level, conditional network in the same archetype as the Cascade model. However, per Figure 6, it does not explicitly cascade, or provide as input, the classified vehicle type to the status prediction level. Rather, these two tasks are predicted in parallel from the raw waveform and/or features.

(Figure 6. Flowchart of the Parallel model: the same multi-level architecture as Figure 1, but vehicle attributes and status are predicted in parallel, without cascading the attribute predictions to the status prediction level.)
Both the Cascade and Parallel models are viable classification approaches: both serve as a proof-of-concept for a full-scale AI mechanic architecture and provide an all-in-one solution for classifying attributes and misfire. The Cascade and Parallel models also both utilize multi-task learning to allow the attribute predictions to better inform the misfire prediction and vice versa. However, the models differ in their internal sharing of intermediate classification representations. Therefore, there is value in comparing whether a joint, shared representation (the Parallel model) or separate, specialized representations (the Cascade model) yield better performance. It is also worth noting that because the Cascade model has a second sequential stage, there is a non-negligible increase in model runtime and memory. In turn, these factors are worth weighing against any margin the Cascade model provides on task predictions. These two models are directly compared in Figure 7: the Parallel model uses a shared representation for both the attributes and misfire predictions, whereas the Cascade model has two distinct CNNs that allow for specialization in attributes and misfire. Additionally, both networks utilize multi-task learning so the attributes loss can inform the misfire loss and vice versa.
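To make the structural difference concrete, below is a minimal PyTorch sketch of the two dataflows. The backbone modules, the feature dimension feat_dim, and the attribute class counts attr_dims are placeholders rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelHeads(nn.Module):
    """Parallel model: one shared backbone; attribute and misfire heads
    branch off the same representation."""
    def __init__(self, backbone, feat_dim, attr_dims, n_misfire=2):
        super().__init__()
        self.backbone = backbone
        self.attr_heads = nn.ModuleList(nn.Linear(feat_dim, d) for d in attr_dims)
        self.misfire_head = nn.Linear(feat_dim, n_misfire)

    def forward(self, x):
        h = self.backbone(x)  # shared representation for all tasks
        attrs = [F.log_softmax(head(h), dim=-1) for head in self.attr_heads]
        return attrs, F.log_softmax(self.misfire_head(h), dim=-1)

class CascadeHeads(nn.Module):
    """Cascade model: two specialized backbones; the misfire stage also
    receives the first stage's attribute predictions as input."""
    def __init__(self, backbone1, backbone2, feat_dim, attr_dims, n_misfire=2):
        super().__init__()
        self.backbone1, self.backbone2 = backbone1, backbone2
        self.attr_heads = nn.ModuleList(nn.Linear(feat_dim, d) for d in attr_dims)
        self.misfire_head = nn.Linear(feat_dim + sum(attr_dims), n_misfire)

    def forward(self, x):
        h1 = self.backbone1(x)
        attrs = [F.log_softmax(head(h1), dim=-1) for head in self.attr_heads]
        # cascade: concatenate stage-1 attribute predictions with stage-2 features
        h2 = torch.cat([self.backbone2(x), *attrs], dim=-1)
        return attrs, F.log_softmax(self.misfire_head(h2), dim=-1)
```

In the Cascade sketch, the misfire head's input width grows by the total number of attribute classes, reflecting the cascaded predictions.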

Network
We propose two distinct multi-layer convolutional neural networks (CNNs), Parallel and Cascade, to predict both vehicle attributes and engine misfire, as seen in Figure 7.
For each, we group audio features into 1D (FFT, waveform, wavelets) and 2D (spectrogram, MFCCs) sets, with two distinct model architectures that utilize 1D and 2D convolution, respectively, motivated by [89]. We adapt the M5 architecture [90], which was built for general audio classification using the raw waveform.
Our samples utilize a 48 kHz sampling rate, for which we consider frequencies ≤ 24 kHz according to the Nyquist-Shannon sampling theorem. These samples were collected using a stereo microphone, for which we average the dual-channel input into a single mono channel. We split each sample into 3 s chunks, which results in a 1 × 144,000 input vector for the raw waveform.
The unique aspect of the M5 model is its large convolutional kernel size of 80 in the first layer, which propagates a large receptive field through the network. We adopt a kernel size of 3 from [90] for the remaining three convolutional layers.
For the MFCC model, we utilize a kernel size of 2 × 2, since we chose 13 MFCC coefficients as a hyperparameter, which results in an input dimensionality of 130 × 13. For the spectrogram, we choose hyperparameters of 512 for hop length and 2048 for window size. This results in an input dimensionality of 1025 × 282. Since this is a larger input dimension than the MFCCs, we utilize traditional 3 × 3 kernels for the spectrogram.
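As a reference point, a minimal sketch of the five feature extractions follows. The hop length (512), window size (2048), and 13 MFCCs follow the values above; the wavelet family, MFCC frame parameters, and other settings are library defaults or assumptions:

```python
import numpy as np
import librosa
import pywt

SR = 48_000  # recording sample rate; Nyquist limit of 24 kHz

def extract_features(y):
    """Compute the five feature types for one 3 s mono clip (144,000 samples)."""
    spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))  # (1025, 282)
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13)          # (13, n_frames)
    fft = np.abs(np.fft.rfft(y))                                # 1D magnitude spectrum
    # Single-level discrete wavelet transform; the paper does not name the
    # wavelet family, so Daubechies-4 is an assumption here.
    approx, detail = pywt.dwt(y, "db4")
    return {"waveform": y, "fft": fft, "spectrogram": spec,
            "mfcc": mfcc, "wavelets": np.concatenate([approx, detail])}
```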
For the 2D models, our networks consist of three convolution layers, rather than the four used in the 1D models, because of the smaller input width. We follow the precedent set by [90] of including pooling, batchnorm, and ReLU after each conv layer in both the 1D and 2D models. We also add dropout with probability 0.5 after each layer to improve generalizability and minimize the likelihood of overfitting. We treat each prediction task as classification and therefore have final fully connected output layers with dimensionality corresponding to the number of classes for each task. These predictions are then fed into a log softmax and trained using negative log-likelihood (NLL) loss. Specifically, we consider five loss terms, one per task, which are all equally weighted:

L_total = L_fuel + L_config + L_cylinder + L_aspiration + L_misfire,

where each term is the NLL loss of the corresponding prediction head.
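The following is a minimal PyTorch sketch of the 1D multi-task network. The kernel sizes (80, then 3), per-layer pooling/batchnorm/ReLU, dropout of 0.5, and equally weighted NLL losses follow the description above, while channel widths, strides, and per-task class counts are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class M5MultiTask(nn.Module):
    """M5-style 1D CNN adapted for multi-task output. The five heads cover
    fuel type, engine configuration, cylinder count, aspiration, and misfire;
    the class counts below are placeholders, not the paper's label sets."""
    def __init__(self, n_classes_per_task=(2, 3, 5, 2, 2), n_ch=64):
        super().__init__()
        def block(c_in, c_out, k, stride=1):
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, k, stride=stride),
                nn.MaxPool1d(4), nn.BatchNorm1d(c_out), nn.ReLU(),
                nn.Dropout(0.5))
        self.features = nn.Sequential(
            block(1, n_ch, 80, stride=4),   # large first kernel: wide receptive field
            block(n_ch, n_ch, 3),
            block(n_ch, 2 * n_ch, 3),
            block(2 * n_ch, 2 * n_ch, 3))
        self.heads = nn.ModuleList(
            nn.Linear(2 * n_ch, n) for n in n_classes_per_task)

    def forward(self, x):                   # x: (batch, 1, 144_000)
        h = self.features(x).mean(dim=-1)   # global average pool over time
        return [F.log_softmax(head(h), dim=-1) for head in self.heads]

def total_loss(outputs, targets):
    """Five equally weighted NLL terms, one per prediction task."""
    return sum(F.nll_loss(o, t) for o, t in zip(outputs, targets))
```

For example, M5MultiTask()(torch.randn(8, 1, 144_000)) returns five log-probability tensors, one per task.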

Implementation
We built our models using PyTorch within an Anaconda environment. We trained our models using 1080 Ti and Titan XP GPUs. Unless otherwise stated, our models were trained for 100 epochs with the Adam optimizer, utilizing a batch size of 128, a learning rate of 0.001, no weight decay, and AMSGrad, following the standard set by [90] for training CNNs on audio. The spectrogram model did not fit into GPU memory with a batch size of 128, so batch sizes of 64 and 32 were used for the Parallel and Cascade models, respectively.
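For reference, a minimal sketch of this training configuration; the model and train_set objects, and the total_loss helper from the previous sketch, are assumed to exist:

```python
import torch

# Configuration as stated above: Adam, lr = 0.001, no weight decay, AMSGrad,
# batch size 128, 100 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=0.0, amsgrad=True)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

for epoch in range(100):
    for features, targets in loader:
        optimizer.zero_grad()
        loss = total_loss(model(features), targets)  # five equally weighted NLL terms
        loss.backward()
        optimizer.step()
```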
We conducted a handful of hyperparameter searches to find the best set for each model type, across the number of epochs, learning rate, and dropout rate. Specifically, we looked at epochs ranging from 5 to 100, learning rates ranging from 0.0001 to 0.1, and dropout rates ranging from 0.0 to 0.75. These experiments did not yield any notable trend, and in turn we have not included these results in our manuscript. Results exploring validation performance, confusion matrices, and hyperparameters can be found in [91]. However, model weights and log files are available upon request for reproducibility. For the following Sections 4 and 5, we utilize the hyperparameters which result in the top performing model for each feature type.
Our dataset and code are undergoing an IP review and are expected to be released to the public.

Results
We evaluate classifier performance along two dimensions in Sections 4.1 and 4.2. For model comparison, we first compare our Parallel and Cascade architectures as seen in Figure 7. Our goal is to provide unbiased data and to learn the scenarios under which each model performs better or worse than the other. We hypothesize that the Cascade model will demonstrate better performance on the misfire fault detection task than the Parallel approach. Next, in the feature comparison subsection, we illustrate the impact the choice of audio feature has on model accuracy. Finally, we delve into task comparison to better understand how our top performing models classify each of the four main attributes (fuel type, engine configuration, cylinder count, and aspiration type), as well as these labels' contribution to misfire fault detection performance. We include evaluation metrics (precision/recall) and confusion matrices to provide additional insight into the true value of our approaches. Table 2 compares the test set accuracy for each feature type between the Naïve, Parallel, and Cascade models. With respect to the Naïve baseline, the Parallel and Cascade models appear to plateau on the attribute tasks. However, for misfire prediction, the Cascade model outperforms both the Naïve and Parallel baselines across every feature type. Specifically, when comparing Parallel and Cascade test set accuracy on the misfire task, we observe improvements of 4.0%, 1.5%, 1.0%, 8.8%, and 1.7% for the FFT, MFCC, spectrogram, waveform, and wavelet models, respectively.

Model Comparison
The major takeaway: the Cascade model outperforms all baselines on misfire prediction. These findings show we have achieved our goal of improving fault prediction through cascading of vehicle attributes.

Feature Comparison
We explore model performance for five feature types (FFT, MFCC, spectrogram, wavelets, waveform). FFT, wavelets, and waveform share the exact same model architecture, whereas MFCC and spectrogram each use a unique model architecture with differing numbers of layers and kernel sizes to reflect the feature dimensionality.
When exploring the test set performance from Table 3, we find MFCC is the top performing feature for the Parallel and Cascade models. We also note that the models with spectrogram features achieve second place for both Parallel and Cascade on the test set. This shows that the models which utilize 2D convolution are able to find more generalizable patterns from the train set to the test set. The final pattern we can observe is the poor performance of the raw waveform relative to the other feature types. This demonstrates the value of our experiments which analyze many different types of audio features.

Ablation Studies
To prove model robustness for previously unseen vehicles and vehicle variants, we conducted two ablation studies in Sections 5.1 and 5.2. As discussed in Section 3.1.1, data augmentation is an essential component of modern deep learning systems. We first explore how our models perform on the test set when trained with and without data augmentation. Second, we built an additional test set to assess our models' outsample generalizability using input data crowd-sourced from YouTube. Through these test set experiments, we demonstrate why crowd-sourced audio is not an effective basis for evaluating real-world model performance. In particular, training on crowdsourced data from YouTube is not well-suited to creating models that generalize broadly for acoustic vehicle characterization tasks.

Data Augmentation
Our first ablation study explores the effect on model performance of training on augmented vs. real samples. Table 4 compares test set accuracy across each of the five feature types. We observe a strong improvement in misfire prediction accuracy when trained on augmented samples for all models. More specifically, we find 6.4%, 4.6%, 3.7%, 4.2%, and 5.7% margins of improvement when using data augmentation for misfire detection with the FFT, MFCC, spectrogram, waveform, and wavelet models, respectively.
When exploring the attribute recognition tasks, some feature type models utilize data augmentation better than others. For example, the waveform model performs better across all four attribute tasks when trained on augmented samples. This makes sense, since there would otherwise be more room for the network to memorize and overfit to the raw data. Additionally, the spectrogram model improves on all four tasks with augmentation, and the MFCC and wavelet models improve on three of the four tasks. The FFT model had a small degradation on the attribute tasks with augmentation, which is reasonable given it had the biggest margin of improvement for misfire prediction.
In summary, data augmentation is a crucial component of training effective acoustic classification networks. We showed with our Cascade model that training on augmented samples has a positive impact across all feature types for misfire fault detection.

YouTube Outsample
We collected 51 samples from YouTube (YT) to provide our models with representative data for evaluating outsample generalizability. After chunking these samples every three seconds, we built a new test set of 649 never-before-seen clips to evaluate the outsample performance of our model.
We compare the test set performance of our Cascade models to the YT outsample performance in Table 5. We observe a significant performance degradation of our models on these YouTube clips. Specifically, we find that across all feature types, there is minimal to extreme degradation in all five prediction tasks. With the top-performing MFCC model, we see 5.2%, 36.5%, 54.4%, 1.2%, and 22.6% decreases in accuracy across the fuel type, engine configuration, cylinder count, aspiration type, and misfire status tasks, respectively. There are a handful of reasons why this discrepancy between our test set and the YouTube test set may have occurred. It could have resulted from a lack of sufficient data and a label distribution mismatch compared to the training set. However, we have evidence that YouTube audio is frequency-attenuated on certain videos, likely to save space. We show an example of a regular sample from our set and a sample from the YouTube set in Figures 8 and 9. In Figure 8, note that there is no representation of frequencies greater than 16 kHz in the YT sample, whereas in raw device-captured samples we see frequencies represented well beyond 20 kHz. Given our sampling rate is 48 kHz, according to the Nyquist-Shannon theorem, our features should have informative frequency representation up to 24 kHz. From a physics standpoint, some features, such as turbocharger bearing rotation, would produce acoustic emissions around these higher frequencies. The comparison of spectrograms shown in Figure 9 further elucidates the frequency attenuation: for the YouTube sample, we can see a sharp knee at a much lower frequency than appears in the regular sample's spectrogram.
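A simple way to surface this attenuation programmatically is to estimate each clip's effective bandwidth. The following sketch and its -60 dB threshold are illustrative assumptions, not the analysis pipeline used in the paper:

```python
import numpy as np
import librosa

def effective_bandwidth(y, sr=48_000, rel_db=-60.0):
    """Estimate the highest frequency whose average magnitude stays within
    rel_db of the spectral peak. A knee well below Nyquist (24 kHz here)
    suggests compression-induced attenuation."""
    spec = np.abs(librosa.stft(y, n_fft=2048)).mean(axis=1)  # avg magnitude per bin
    freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)
    mask = 20 * np.log10(spec / spec.max() + 1e-12) > rel_db
    return freqs[mask].max() if mask.any() else 0.0
```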
In summary, we showed that caution should be applied when working with crowd-sourced acoustic data, such as datasets originating on YouTube. This further strengthens the need for more comprehensive data collection and generation to create a balanced and diverse set for testing the outsample performance of models.

Discussion
We explore three main areas of discussion in Sections 6.1-6.3. Within broader implications, we consider the larger impact our proposed architecture may have on both consumers and industry. For future directions, we note routes by which this work could be continued and built upon with respect to data quality, model improvement, and architectural modifications. Finally, we discuss how our work could be utilized in diverse application areas within and beyond vehicle characterization and diagnostics.

Broader Implications
There are two main groups to consider for the broader impacts of our work developing an AI mechanic: consumers and industry.
Given the end-to-end nature of our proposed system, it has practical usability for consumers to obtain vehicle status in real-time, e.g., using a mobile phone. Our high-level cascading architecture takes as input a raw audio sample, requiring no direct input or knowledge from the consumer. This has the impact of potentially saving even non-expert users stress, time, and money. Whenever their vehicle may be exhibiting a fault, they can receive an initial opinion from the AI mechanic which could provide the consumer with insight into the seriousness of their vehicle state.
There are multiple extensions for the AI mechanic when consumer data are integrated. A feedback loop would be created that could improve the system for detecting faults. The collection of fault data over time from long-standing consumers would create the opportunity for preventive maintenance. One day, the AI mechanic may help the average consumer prevent their vehicle from exhibiting a fault in the first place through proper warnings.
Prognostics will not come about immediately. For there to be value in a feedback loop from consumers, vehicles must be allowed to fail such that these data might be captured. Additionally, we showed the impact underrepresented labels have on our model, including poor outsample performance on most underrepresented labels. Without a more diverse, comprehensive, and well-balanced dataset, this behavior will persist. When the AI mechanic is working with a previously unseen vehicle, it will take the system time to adjust, and may necessitate the use of techniques such as synthetic data generation or federated learning.
Within the automotive industry, off-board diagnostics that can utilize inexpensive microphones and mobile or embedded devices offer an alternative to costly onboard diagnostics that constantly collect and send data from the vehicle. The AI mechanic could provide automakers with an enormous swath of data on the status, condition, and fault identification for each kind of vehicle.
Our cascading models are flexible in that despite being trained using the vehicle attribute predictions as input to the second stage network, this could easily be adapted to take the ground truth labels as input instead. The value of this change could be significant for automakers as it would allow them to build unique fault identification models conditional upon specific vehicle types. If automakers could use inexpensive audio data to create a chart of anticipated vehicle wear, this may be used to design better vehicles and supporting infrastructure. This will require the capture and labeling of samples for desired fault identification types. If other fault types were to be integrated into the AI mechanic, this would require hundreds or thousands of distinct examples across vehicle classes to achieve similar performance to our method's demonstrated misfire detection performance.

Future Directions
Throughout our results and ablation sections, we brought up issues with our current approach and raised potential opportunities for improvement. One of the main issues uncovered (and not unique to the AI mechanic) is the models' dependency on quality data that is not only diverse but also balanced across classes. We showed that obtaining crowdsourced audio data from YouTube may address data quantity, but not data quality. There is value in developing a real-world application that would make it easy for a consumer to collect data on their mobile device and store the waveform at its original 48 kHz sampling rate to prevent lossy compression or frequency attenuation.
Another avenue for future work to address the lack of data, especially in underrepresented classes, would be generative modeling such as with Generative Adversarial Networks (GANs) [92] and Diffusion Models [93]. Generative models could be extended not only to acoustic samples, but also utilized in a feedback loop for developing better synthetic samples for training, including for classes not well represented. One direction for improving test set accuracy would be exploring feature and model fusion. Considering an ensemble approach where 1D and 2D convolution can be utilized on their respective feature types should yield better overall performance. Additionally, we may consider data balancing [94] and mixup [95,96] data augmentation as ways to improve results.
There is a significant opportunity in the space of acoustic vehicle characterization. Our dataset had labels for make, OEM, idling status, recording environment/location, and more. Perhaps there could be value derived from more expansive attribute classification. Additionally, our set contains labels for horsepower and engine displacement, which would provide an interesting exploration in attribute regression.
Finally, building upon our initial proof-of-concept for a cascading architecture would be a challenging but worthwhile endeavor. It could be tackled in two main ways. The first would be to connect more stages to the architecture at both the high level (general acoustic classification) and the low level (fault type recognition). These additional stages may benefit from utilizing pre-trained audio embeddings and transfer learning [50][51][52][53]. The other approach, motivated by work discussed in Section 2.3, would be to implement conditional logic into the network itself. Since fault type recognition is conditional upon there being a fault in the first place, a network which is adaptive would be a major contribution. There have already been great strides made in deep sequential networks [97] and adaptive networks focused on efficiency [98][99][100]. As such, building this work upon prior art would be a natural extension going forward.

Potential Applications
While our main focus of this work was on vehicle understanding, our approach and proposed cascading architecture can be adapted in broader application areas. These application areas include sequential, conditional, or multi-label prediction tasks and fault identification.
Our proposed cascading architecture is a multi-level network where each level is conditional upon the prior. Any application area where there is potential for inter-label dependency could leverage the cascading architecture. For example, music recognition has hierarchies and label dependency: a first stage might ask whether the sample contains music, and the next stage, conditional upon the first, could predict genre, artist, song, etc. Another area for a cascading architecture could be animal recognition: does the sample contain an animal, what kind of animal is it, what state and location is it in, is it behaving normally, etc.?
Not only can the cascading architecture be extended to other audio applications, but also to applications with other modes of data, such as in computer vision. For example, a biometrics system could first ascertain whether a sample contains a valid fingerprint, iris, or facial scan. Then, conditional upon the first level, it could ask whether the scan represents a valid user and what condition the user is in, perhaps using multi-modal data such as heart rate or blood pressure.
The other significant application area for these techniques is broader fault identification, particularly using audio data. This includes other diagnostic areas, such as for industrial processes or energy sector equipment. One such example is home or industrial heating and ventilation systems. Other applications include home appliances (washer/dryer) with belt slipping or drum imbalance, electric cars/bicycles with suspension issues, manufacturing equipment (CNC mills/lathes) with tool run-out or spindle issues, drills with brush wear or belt slip, the energy sector with turbine and pump health, elevator/escalators condition, and even carnival/fair equipment.
In summary, our proposed method is broadly applicable to any task which can leverage a conditional dependency and/or has a potential need for fault identification. We plan to explore the identified areas and more in continuations of this work. We believe this work can one day be extended as a real-world mobile application that can help people better understand and diagnose issues with their vehicles.

Conclusions
In this manuscript, we identified three main areas of novelty with our work:
• Misfire fault diagnosis: An engine misfire is a common fault in automotive vehicles that can lower fuel efficiency and increase maintenance costs. We proposed an automated deep learning system that leverages cost-effective and prevalent audio data to diagnose misfires. Most acoustic misfire detection approaches consider only a single vehicle or a small number of vehicles, which limits the types of engines the resulting algorithms can reliably diagnose. Our approach is built on over 200 unique samples varied across gasoline and Diesel, I4, V6, V8, turbocharged, and more. In turn, we built a more robust fault diagnostic system.
• Acoustic Vehicle Characterization: Often the first step in diagnosing an issue with an automotive vehicle is to identify its characteristics, such as make, model, year, and mileage. In our work, we identify four distinct attributes of the engine: fuel type, engine configuration, cylinder count, and aspiration type. Prior work has characterized these attributes using traditional machine learning techniques on normally performing samples. We merged two datasets to enable attribute recognition for the less studied abnormal, misfiring samples. Additionally, our approach improved upon prior work with multi-task learning to predict the four attributes and misfire in a single model.
• Cascading Architectures: We defined cascading architectures as sequential, multi-level, conditional networks. As a proof-of-concept, we built a two-stage convolutional neural network where the first stage specialized in vehicle attributes and cascaded its attribute predictions to the second stage, which specialized in misfire detection, with the goal of better informing both prediction tasks. Our cascading model achieved 87.0% test set accuracy on misfire fault detection, demonstrating margins of 8.0% and 1.7% over naïve and non-cascading baselines.
We further explored the impact of our approach through relevant ablation studies on data augmentation and model generalizability. We concluded our manuscript with an important discussion of the broader implications, future directions, and potential applications, highlighting the potential for future research and adoption of the developed architectures and concepts.

Data Availability Statement:
Non-augmented source data for training the models in this study will be made available on the Harvard Dataverse upon clearing necessary intellectual property reviews. Please contact the authors to access a subset of these data prior to their publication.

Conflicts of Interest:
The authors declare no conflict of interest.