1. Introduction
The sensory quality of black tea is shaped by a complex mixture of volatile and non-volatile compounds. Cultivar, terroir, and the processing stages of withering, rolling, fermentation (oxidation), and drying jointly determine the composition [1,2]. Volatile organic compounds (VOCs) are the primary carriers of tea aroma and a key quality indicator across all tea types [1,3]. Aroma dominates grade assignment. Premium teas develop rich floral and fruity notes from linalool, geraniol, and nerolidol released during controlled fermentation. Lower-grade teas yield flatter profiles with more hexanal and short-chain aldehydes [4,5]. Odor activity value (OAV) analysis has identified linalool as the key compound driving grade differentiation in oolong teas [4]. Perceptual interactions among key aroma compounds also give rise to synergistic or masking effects that modulate the overall sensory outcome [6]. The dynamic evolution of aroma during fermentation has been studied in ripened Pu-Erh [7] and in various black teas. These studies confirm that processing conditions strongly shape the final VOC profile. These VOC differences arise from the same enzymatic oxidation cascade that converts catechins into theaflavins and thearubigins. Those pigments govern color, astringency, and body. The headspace VOC fingerprint therefore carries information about polyphenol status and overall chemical quality [2,8]. Furthermore, the polyphenol-rich leaf matrix controls how fast individual VOCs are released when the leaf is heated. The shape of this release over time, rather than the time-averaged concentration, carries grade information: it reflects both which compounds are present and how strongly they are bound in the leaf. Broader flavor chemistry across tea types, including the interaction between volatile and non-volatile constituents, has been reviewed for specialty products such as milk tea [9]. Standardized pretreatment protocols for faithful aroma representation continue to be refined [10]. Process optimization studies on white tea further link manufacturing parameters to key flavor substances [11].
Current grading practice has two strands, and neither is fully suited to high-throughput production. Sensory panels score infusion color, aroma, taste, and mouthfeel. Their outcomes are vulnerable to taster fatigue, variation in training level, and ambient conditions [2,12]. Building robust, validated quality scoring systems remains hard even for well-characterized herbal teas [13]. Instrumental methods give accurate compound-level or spectral data but often need elaborate sample preparation, costly instrumentation, and long turnaround times. Common examples are GC-MS and GC-IMS for VOC profiling, HPLC for catechins and theaflavins, and near-infrared (NIR) spectroscopy for rapid chemical screening [2,3]. Hyperspectral imaging has shown promise for real-time tea quality prediction in cultivation settings [14]. NIR-based models can predict quality substance content in green teas [15], and shoot-trait-based models support quality evaluation of machine-picked fresh leaves [16]. Image-based approaches have been used to monitor the degree of withering [17], and three-dimensional fluorescence spectroscopy combined with UMAP enables classification of dark teas [18]. Despite this range of methods, rapid on-line aroma-based screening of finished tea products is still missing.
Electronic nose (e-nose) technology has emerged as a promising route to close this gap. Early multi-sensor MOS arrays already discriminated tea grades from headspace VOC patterns, first with neural networks [12] and LDA/PCA [19], and later with chemometric classifiers such as PCA, LDA, and SVM for in situ black-tea grading [20] and with extreme learning machines for tea-gas identification [21]. To push accuracy higher, several groups fused the e-nose with a second modality, such as Vis–NIR spectroscopy [22], hyperspectral imaging through global–local feature-fusion networks [23], molecularly imprinted electronic tongues [24], or combined computer-vision, e-nose, and e-tongue platforms for storage-life prediction [25]. Even low-cost hardware can give competitive discrimination when paired with suitable machine learning (ML) pipelines such as PLS-DA, LDA, and PCA [26], while portable aroma sensors [27] and adaptive gas-feature networks for origin traceability [28] point to a clear trend toward compact, field-deployable instruments. Deep learning has since been adopted widely in the tea domain, including cross-time–frequency networks for agricultural-product recognition [29], two-dimensional correlation spectroscopy for high-precision black-tea classification [30], multi-view multi-task networks for joint classification and flavor-factor estimation [31], transfer learning on NIRS data [32], convolutional networks for rapid green tea quality prediction [33], ML-based taste-profile classification [34], and lightweight networks for leaf-appearance inspection [35]. Methods from neighboring fields transfer directly to sensor signals as well, such as augmentation strategies from hyperspectral remote sensing [36] and deep architectures for time series classification [37].
Despite these advances, four main limitations persist: (i) most e-nose platforms use arrays of 6–16 sensors, which adds cost and system complexity, leaving open whether a single programmable MOS sensor with multiple heater channels can give useful grade discrimination; (ii) evaluation has mostly looked at instance-level accuracy and has overlooked whether the instrument gives consistent verdicts when the same tea product is retested across multiple sessions, which is a basic requirement for on-line quality control; (iii) feature-based approaches reduce sensor waveforms to summary statistics such as means and standard deviations, discarding the temporal shape of the VOC-release curves, and because gas-resistance readings often have heavy tails and pronounced skewness, losing this kinetic information can be decisive for classification performance; and (iv) no direct side-by-side comparison of classical ML and deep learning has been reported for single-sensor tea grading on a small, non-normally distributed dataset.
These limitations are addressed here with a portable platform built around a single Bosch BME688 sensor (Bosch Sensortec GmbH, Reutlingen, Germany). The framing is explicit: the aim is to recover grade-defining aroma chemistry from a low-cost sensor trace. The principal contributions are: (i) a portable single-sensor platform that performs controlled thermal VOC desorption from dry tea, designed to record chemically interpretable release-kinetic traces rather than time-averaged headspace concentrations; (ii) a six-model classical ML benchmark that quantifies how much aroma-chemistry information is preserved when sensor waveforms are compressed into PCA-reduced statistical features; (iii) a Multi-Scale 1D-CNN with Squeeze–Excitation and temporal self-attention (MS-CNN-Attention) that operates directly on the raw VOC-release waveforms and, by learning from the full trace, retains the release-kinetic and signal-shape information that per-channel summary statistics discard; and (iv) a product-level decision consistency metric that quantifies prediction stability across repeated measurements of the same tea product, directly relevant to on-line aroma-based quality screening in tea processing facilities.
The novelty of this work therefore does not lie in proposing a new sensor or a new generic classifier. It lies in the specific combination, not previously reported for tea grading, of (i) a single programmable MOS sensor used as a thermal-desorption profiler instead of a multi-sensor array, (ii) a head-to-head comparison of feature-based classical ML against raw-waveform deep learning on the same small, non-normally distributed dataset, and (iii) a product-level decision-consistency analysis that evaluates the instrument as a screening tool rather than only at the level of individual measurements.
The remainder of the manuscript is organized as follows.
Section 2 describes the tea samples, the single-sensor platform, and the measurement protocol.
Section 3 details the two analytical pipelines: a feature-based classical ML pipeline and a raw-waveform deep learning model.
Section 4 reports the classification and product-level consistency results for both paradigms.
Section 5 interprets these findings in terms of VOC-release kinetics and aroma chemistry, examines sensor drift and calibration, and contrasts the design choices with representative prior tea e-nose studies.
Section 6 draws the main conclusions.
5. Discussion
5.1. VOC-Release Kinetics and the Advantage of Raw-Waveform Modeling
When dry tea is heated to 30–50 °C inside a sealed chamber, VOCs desorb from the leaf matrix in a burst-then-decay pattern. The shape of that pattern depends on each compound’s vapor pressure, the available leaf-surface area, and the strength of the polyphenol–VOC binding matrix [1,4]. Premium teas carry high linalool and geraniol loads bound within a polyphenol-rich scaffold. They release these terpene alcohols more slowly and at higher peak concentrations than lower-grade products, whose headspace is dominated by lighter, faster-desorbing short-chain aldehydes. The gas sensor resistance waveform recorded by the BME688 folds these desorption kinetics into a single temporal trace. The grade-relevant signal is in the morphology of that trace. The mean or maximum amplitude alone is not enough.
Compressing this waveform to five per-channel summary statistics collapses the shape into scalar descriptors. For roughly symmetric, unimodal signals that representation can be enough. The gas sensor resistance traces in this dataset are different. They are markedly right-skewed, since the initial desorption burst produces a pronounced long tail. Consequently, mean and standard deviation cannot separate teas whose average VOC headspace concentrations are similar but whose temporal release profiles diverge. The convolutional neural network operates on the full 4400-point waveform. It keeps this kinetic information and learns shape-sensitive filters end-to-end.
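The loss of shape information described above can be illustrated with a minimal sketch. It assumes a hypothetical synthetic burst-then-decay trace and an assumed set of five statistics (mean, standard deviation, RMS, minimum, maximum); the exact feature set of the study is not restated in this section, and the trace parameters are illustrative only. Because all five statistics ignore sample order, a trace and its time-reversed copy receive identical features even though their release kinetics are opposite.

```python
import numpy as np

def summary_stats(x):
    """Five order-invariant statistics (assumed feature set, for illustration)."""
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    return np.array([x.mean(), x.std(), rms, x.min(), x.max()])

# Hypothetical 4400-point trace: fast rise followed by a slow decaying tail.
t = np.linspace(0.0, 1.0, 4400)
early_burst = (1.0 - np.exp(-t / 0.02)) * np.exp(-t / 0.3)

# Time-reversed copy: identical sample values, opposite temporal shape
# (slow build-up that ends in a sharp peak).
late_burst = early_burst[::-1]

stats_early = summary_stats(early_burst)
stats_late = summary_stats(late_burst)

# The scalar features agree, although the peak sits at opposite ends.
same_stats = np.allclose(stats_early, stats_late)
peak_shift = abs(int(np.argmax(early_burst)) - int(np.argmax(late_burst)))
print("identical summary statistics:", same_stats)
print("peak positions differ by", peak_shift, "samples")
```

Any classifier fed only these scalars cannot tell the two traces apart, whereas a model that sees the ordered samples can.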
The multi-scale branch architecture mirrors the multi-timescale nature of VOC release from the leaf matrix. Small kernels (3–5 samples) resolve fast transient spikes from highly volatile aldehydes. Medium kernels (15–31) track the gradual accumulation of terpene alcohols. Large kernels (63–127) encode the slow emission tail dominated by heavier sesquiterpenes [37]. Temporal self-attention further sharpens grade discrimination. It assigns higher weights to the mid-window interval where VOC release peaks and suppresses contributions from the uninformative early warm-up and late steady-state plateaus.
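The two ideas in this paragraph, parallel branches with different kernel widths and attention-weighted pooling over time, can be sketched in plain NumPy. This is not the trained MS-CNN-Attention model: the fixed averaging kernels stand in for learned filters, the Gaussian-shaped scores stand in for learned attention scores, and the function names, kernel widths (5, 31, 127), and synthetic trace are assumptions chosen only for illustration.

```python
import numpy as np

def multi_scale_features(trace, kernel_sizes=(5, 31, 127)):
    """Filter one trace at several scales; returns shape (n_scales, n_samples)."""
    trace = np.asarray(trace, dtype=float)
    branches = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k  # fixed averaging kernel, stand-in for a learned filter
        branches.append(np.convolve(trace, kernel, mode="same"))
    return np.stack(branches)

def attention_pool(features, scores):
    """Softmax the per-time-step scores and return the weighted average over time."""
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return features @ weights, weights

# Hypothetical 4400-point burst-then-decay trace.
t = np.linspace(0.0, 1.0, 4400)
trace = (1.0 - np.exp(-t / 0.02)) * np.exp(-t / 0.3)

features = multi_scale_features(trace)

# Assumed scores that favor the middle of the window; in a trained network
# these would be produced by the attention layer from the data itself.
scores = -((t - 0.5) / 0.1) ** 2
pooled, weights = attention_pool(features, scores)

print(features.shape, pooled.shape)
```

In this sketch the pooled vector holds one value per scale, dominated by the time steps that receive high attention weight, which is the mechanism by which warm-up and plateau segments are down-weighted.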
5.2. Aroma Chemistry and Residual Classification Errors
High-grade bud teas (ORAC 1050–1250 μmol TE/100 mL) carry high catechin and theaflavin concentrations. They give complex, high-intensity headspace fingerprints dominated by floral terpene alcohols [2,4]. The sensor platform consistently separates these from low-grade CTC and bag-format teas, whose headspace is both weaker and less chemically differentiated. The medium-grade class sits in a transitional zone in polyphenol content and VOC diversity. It is the hardest class to resolve. The deep learning model still narrows this gap substantially (F1: 0.52 → 0.79) by exploiting release-shape differences that summary-statistic representations cannot see.
The two persistent misclassification cases involve products carrying aromatic compounds unrelated to the native leaf chemistry. Sample A12 is a bergamot-flavored bag product. It contains bergamot oil, whose dominant monoterpenes (limonene and linalyl acetate) produce a headspace fingerprint closer to that of high-grade floral teas than to that of the unflavored medium-grade bag substrate it is built on. Sample A13 also appears to carry processing-derived aroma compounds that shift its sensor fingerprint away from the unflavored members of its assigned quality class. Fixing such cases would require either a preprocessing stage that normalizes for exogenous flavoring agents or an auxiliary classifier that detects non-tea aromatic additives before the grade-assignment model is applied.
5.3. Bridging Instrumental and Sensory Evaluation
Trained sensory panels assess aroma, color, astringency, and body. These attributes are rooted in the same polyphenol and VOC chemistry that the BME688 captures. A single-sensor electronic nose cannot replace a trained sensory panel. It can, however, work as a fast screening instrument. It can flag batches whose headspace fingerprint deviates from the expected grade profile and so reduce the number of samples that need full panel-based evaluation. The product-level consistency metric introduced here speaks directly to this screening role. The system grades 14 of 16 products correctly by majority vote. In a production-line context, each batch is usually evaluated across multiple measurement sessions. A system that occasionally gets an individual run wrong but consistently assigns the correct grade at the product level is still useful and deployable.
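A minimal sketch of product-level majority voting follows. The function name, product identifiers, and toy predictions are hypothetical and are not the study's data; the exact definition of the consistency metric is given elsewhere in the manuscript and may differ in detail (for example, in how ties are handled).

```python
from collections import Counter, defaultdict

def product_level_accuracy(product_ids, run_predictions, true_grade):
    """Majority-vote each product's runs and score the product-level verdicts."""
    votes = defaultdict(list)
    for pid, pred in zip(product_ids, run_predictions):
        votes[pid].append(pred)
    # most_common(1) returns the top-voted label; ties resolve to the label
    # encountered first, which is an arbitrary choice made for this sketch.
    verdicts = {pid: Counter(preds).most_common(1)[0][0]
                for pid, preds in votes.items()}
    n_correct = sum(verdicts[pid] == true_grade[pid] for pid in verdicts)
    return verdicts, n_correct / len(verdicts)

# Toy example: three products, three runs each (illustrative values only).
product_ids = ["P1", "P1", "P1", "P2", "P2", "P2", "P3", "P3", "P3"]
run_predictions = ["high", "high", "medium",
                   "low", "low", "low",
                   "medium", "high", "high"]
true_grade = {"P1": "high", "P2": "low", "P3": "medium"}

verdicts, accuracy = product_level_accuracy(product_ids, run_predictions, true_grade)
print(verdicts)
print(f"product-level accuracy: {accuracy:.2f}")
```

In the toy data, product P1 has one misclassified run yet still receives the correct product-level verdict, which is the situation the paragraph above describes as tolerable for batch screening.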
5.4. Sensor Drift and Calibration
Like all metal–oxide gas sensors, the BME688 is subject to drift. Drift is a slow change in the baseline gas sensor resistance over days to months, caused by aging of the SnO2-based sensing layer, gradual poisoning by ambient contaminants, and residual humidity effects. Its practical consequence for the algorithms used here is distributional: drift shifts the absolute resistance level and therefore the statistical features (mean, RMS, maximum) on which the classical pipeline relies, so a model trained on data recorded at one time can degrade when applied to data recorded weeks later, even for the same tea. Two aspects of the design adopted here limit this effect. First, every exposure run is referenced to a freshly recorded 30 min empty-chamber baseline, so the discriminative information lies in the relative change and in the temporal shape of the response rather than in its absolute level; relative, baseline-referenced features are inherently less sensitive to slow absolute drift. Second, the deep model is trained on the full waveform morphology, which is more robust to a constant baseline offset than scalar amplitude features.
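The effect of baseline referencing can be sketched as follows. The two referencing formulas (difference and ratio against the median of the empty-chamber baseline), the function names, and the synthetic resistance values in arbitrary units are assumptions for illustration; this section does not state which formula the acquisition software applies.

```python
import numpy as np

def baseline_difference(exposure, baseline):
    """Exposure minus the per-run baseline level; cancels an additive offset."""
    return np.asarray(exposure, dtype=float) - np.median(baseline)

def baseline_ratio(exposure, baseline):
    """Exposure divided by the per-run baseline level; cancels a gain change."""
    return np.asarray(exposure, dtype=float) / np.median(baseline)

# Hypothetical run: noisy empty-chamber baseline, then a burst-then-decay dip.
rng = np.random.default_rng(0)
baseline = 100.0 + rng.normal(0.0, 0.1, size=500)
t = np.linspace(0.0, 1.0, 4400)
exposure = 100.0 - 20.0 * (1.0 - np.exp(-t / 0.02)) * np.exp(-t / 0.3)

# Simulated drift acting on the whole run: additive offset or multiplicative gain.
offset, gain = 7.5, 1.2

diff_ok = np.allclose(
    baseline_difference(exposure, baseline),
    baseline_difference(exposure + offset, baseline + offset),
)
ratio_ok = np.allclose(
    baseline_ratio(exposure, baseline),
    baseline_ratio(exposure * gain, baseline * gain),
)

# An absolute feature such as the raw mean moves by the full offset instead.
mean_shift = (exposure + offset).mean() - exposure.mean()
print(diff_ok, ratio_ok, round(mean_shift, 3))
```

The sketch treats drift as constant within a run. Drift that changes between the baseline acquisition and the exposure would not cancel in this way.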
Calibration in this context consists of re-establishing the clean-air baseline of the sensor. In each run, this is done automatically: the on-chip BSEC library continuously tracks the clean-air reference, and the 30 min empty-chamber acquisition provides a per-run zero against which the subsequent exposure is measured. A full recalibration simply repeats the clean-reference-air measurement to reset the baseline; it requires only clean air and a few minutes, with no specialized reagents, but it must be repeated periodically for long-term field use. Learning-based drift handling is also attractive: domain adaptation and drift compensation schemes, combined with the baseline-referenced features already used here, can absorb part of the drift and so reduce how often manual recalibration is needed. A systematic drift compensation study over extended deployment windows is left for future work.
5.5. Comparison with Prior Tea e-Nose Studies
Table 8 positions this work against representative tea e-nose studies. Three points stand out. First, the great majority of prior systems use multi-sensor arrays or fuse the e-nose with a second modality such as NIR or hyperspectral imaging [22,23,26]; here, a single programmable MOS sensor, operated as a thermal-desorption profiler, provides the discriminative signal. Second, most studies classify hand-crafted features with classical chemometrics (PCA, LDA, SVM, ELM) [12,19,20,21,26], whereas recent deep models are typically reported on their own rather than benchmarked against a feature-based baseline on the same data [29,30]. In this work, by contrast, the two paradigms are compared directly on one small, non-normally distributed dataset. Third, evaluation in prior work is almost exclusively at the level of individual measurements; to the authors’ knowledge, no earlier tea e-nose study reports a product-level decision-consistency analysis, which is the metric most relevant to deployment as a batch-screening instrument.
5.6. Limitations
Dataset size. A dataset of 90 runs from 16 products is small for deep learning applications. The elevated fold-to-fold F1-macro variance reflects this. Expanding the dataset to include more products and more replicate measurements would improve model stability and generalization.
Flavored tea products. Bergamot and other exogenous aromatic additives confound the natural VOC fingerprint. A dedicated detection or normalization procedure will be needed to handle this class of samples.
Single-sensor configuration. A multi-sensor array with complementary chemical selectivity could improve class separability, but it would sacrifice the simplicity and cost-effectiveness of the current platform.
Absence of direct chemical validation. The quality class labels were derived from published literature and manufacturer data rather than from HPLC or GC-MS analyses of the specific samples used. Pairing future sensor runs with GC-MS headspace profiling would strengthen the mechanistic link between the sensor response and individual VOC species.
Sensor drift. Long-term sensor stability over extended deployment windows was not evaluated. As discussed in Section 5.4, the baseline-referenced acquisition limits the impact of slow drift, but periodic recalibration and, ideally, learning-based drift compensation will be needed for sustained field deployment.