In this section, we first describe the evaluation metrics and then assess the performance of both individual base models and ensemble models on the test set.
4.1. Evaluation Metrics
To comprehensively assess model performance, we consider two types of metrics: threshold-dependent metrics and threshold-independent metrics.
4.1.1. Threshold-Dependent Metrics
A binary classifier predicts a label by comparing the model’s output probability to a decision threshold. For each input sample (benign or malicious), the prediction is either correct or incorrect, leading to four possible outcomes in malware detection:
True Negatives (TNs): benign samples correctly classified as benign.
False Positives (FPs): benign samples incorrectly classified as malicious.
False Negatives (FNs): malicious samples incorrectly classified as benign.
True Positives (TPs): malicious samples correctly classified as malicious.
These quantities form the confusion matrix. In our implementation, we encode labels as 0 for benign (negative class) and 1 for malicious (positive class). This encoding is used consistently throughout the codebase and follows the standard convention in malware detection, where the malicious class is treated as the positive class (i.e., the class of interest).
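As an illustrative sketch (not our exact implementation), the four counts can be obtained under this encoding with scikit-learn as follows; the toy labels and variable names are assumptions for illustration only:

```python
# Illustrative sketch: confusion-matrix counts under the
# 0 = benign, 1 = malicious encoding.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1])  # toy ground-truth labels
y_pred = np.array([0, 1, 0, 1, 0, 1])  # toy thresholded predictions

# labels=[0, 1] fixes the row/column order: row 0 = benign, row 1 = malicious.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # -> 2 1 1 2
```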
Using these terms, the F1 score is defined as follows:
\[
\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},
\]
where
\[
\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad
\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.
\]
The F1 score ranges from 0 to 1, with higher values indicating better detection performance; an F1 score of 1 corresponds to perfect classification.
We also report the false positive rate (FPR), defined as follows:
\[
\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}.
\]
In this study, we focus on performance at low operating points (e.g., FPR = 0.01 and FPR = 0.001). In production settings, the FPR must be kept low to avoid alert fatigue; therefore, an effective malware detector should maintain a high true-positive rate (TPR) while operating at a low FPR.
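As a minimal sketch, both metrics follow directly from the four confusion-matrix counts (toy values carried over from the example above):

```python
# Sketch: F1 and FPR from the confusion-matrix counts (toy values from above).
tn, fp, fn, tp = 2, 1, 1, 2

precision = tp / (tp + fp)
recall = tp / (tp + fn)   # recall equals the TPR
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)
print(f"F1 = {f1:.3f}, FPR = {fpr:.3f}")
```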
4.1.2. Threshold-Independent Metrics
To provide a threshold-independent view of a model’s ability to distinguish benign from malicious samples, we also report the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC). The ROC curve plots the TPR against the FPR across all possible thresholds, while AUC summarizes this curve as a single scalar. AUC ranges from 0 to 1, with higher values indicating better discrimination. A random classifier achieves an AUC of 0.5, whereas a perfect classifier achieves an AUC of 1.
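Both quantities can be computed with scikit-learn, as in the following minimal sketch (the toy labels and scores are illustrative, not from our experiments):

```python
# Sketch: threshold-independent metrics via scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])            # toy labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # toy malicious probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one ROC point per threshold
auc = roc_auc_score(y_true, y_score)               # scalar in [0, 1]; 0.5 = random
```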
4.1.3. Threshold-Selection Procedure
In our study, we selected thresholds corresponding to fixed operating points (FPR = 0.1, 0.01, and 0.001) directly from the ROC curve during test-set evaluation. Specifically, for each model, we computed the ROC curve on the test set and identified the decision threshold that achieves the target empirical FPR. We then reported the corresponding TPR, precision, and F1 score at that operating point.
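The following sketch illustrates the procedure (function and variable names are illustrative, not from our codebase): for each target, we take the ROC point with the largest empirical FPR not exceeding the target and report the threshold and TPR there.

```python
# Sketch: select the decision threshold at a target empirical FPR.
import numpy as np
from sklearn.metrics import roc_curve

def operating_point(y_true, y_score, target_fpr):
    """Return (threshold, tpr) at the largest empirical FPR <= target_fpr."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    i = np.where(fpr <= target_fpr)[0][-1]  # fpr is sorted in ascending order
    return thresholds[i], tpr[i]

# Assumed inputs: y_test (0/1 labels), scores (malicious probabilities).
for target in (0.1, 0.01, 0.001):
    thr, tpr = operating_point(y_test, scores, target)
    print(f"FPR <= {target}: threshold = {thr:.4f}, TPR = {tpr:.4f}")
```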
4.2. Individual Models
We evaluate the performance of the individual models on the test set, including LightGBM and TabNet trained on the 696-dimensional EMBER Base features; Google ViT and DinoV3 trained on byte-plot images; LightGBM and TabNet trained on the 1131-dimensional EMBER Extended features; and the SHERLOCK baseline.
Table 2 reports ROC AUC and TPR at fixed FPR levels of 0.1, 0.01, and 0.001, while Figure 6 shows the ROC curves for the individual models studied.
The results indicate that feature representation has a substantial impact on detection performance. In particular, the vision models trained on byte-plot images outperform the tabular models trained on EMBER Base features. However, tabular models trained on EMBER Extended features achieve the best overall performance. Among the six models trained on our dataset, LightGBM with EMBER Extended features attains the highest ROC AUC (0.9633) and the highest TPR at fixed FPR levels of 0.1, 0.01, and 0.001. These findings suggest that feature engineering on APKs can significantly improve Android malware detection.
To assess the impact of the 435 Android-specific features introduced in the EMBER Extended representation,
Table 3 reports detailed performance at target FPR levels of 0.1, 0.01, and 0.001, including F1 score, TPR, precision, TN, and TP, for LightGBM and TabNet trained on EMBER Base and EMBER Extended features. For LightGBM, the TPR at FPR = 0.1 increases from 55.12% to 89.46%, a gain of 34.34 percentage points. At FPR = 0.01, the TPR rises from 15.75% to 51.65%, more than tripling. Under the most stringent setting, FPR = 0.001, the TPR improves from 3.64% to 21.08%, nearly a sixfold increase. TabNet exhibits similarly large gains, with TPR improving from 55.11% to 82.63% at FPR = 0.1 and from 4.17% to 30.42% at FPR = 0.01, representing more than a sevenfold increase. No TabNet results are reported at FPR = 0.001 in Table 2 and Table 3 because, on our test set, TabNet cannot achieve an operating threshold that yields FPR = 0.001.
In addition, Table 4 summarizes the absolute improvements of EMBER Extended features over EMBER Base features. The consistent gains for both LightGBM and TabNet confirm that domain-specific feature engineering provides substantial value for Android malware detection, particularly for practical deployment regimes that require low FPR.
Using LightGBM's built-in feature-importance metrics, we report the top 10 most important features and their importance scores for LightGBM trained with EMBER Base features and with EMBER Extended features in Table 5. Notably, all of the top 10 features under the EMBER Extended setting come from the 435 Android-specific features. In addition, we quantify the representation of Android-specific features among the most influential features for LightGBM trained on EMBER Extended features in Table 6: Android-specific features account for 100% of the top-10 features and 64% of the top-100 features. These results indicate that the introduced Android-specific features provide dominant discriminative signals and materially contribute to improved Android malware detection performance.
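A sketch of how such a ranking can be produced from a trained booster follows; the index test assumes the 435 Android-specific features are appended after the 696 EMBER Base features (consistent with the most important feature, Feature 947, discussed later in this section), and all variable names are illustrative.

```python
# Sketch: rank features by LightGBM gain importance and flag Android-specific ones.
import numpy as np

gain = model.booster_.feature_importance(importance_type="gain")
top10 = np.argsort(gain)[::-1][:10]
for rank, idx in enumerate(top10, start=1):
    # Assumption: indices 696-1130 hold the 435 Android-specific features.
    tag = "android" if idx >= 696 else "base"
    print(f"{rank:2d}. feature {idx} ({tag}): gain = {gain[idx]:.2f}")
```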
We further compare our LightGBM model trained on EMBER Extended features with SHERLOCK, a state-of-the-art Vision Transformer approach for Android malware detection [2]. Notably, SHERLOCK was trained on approximately 1.2 million Android application images, whereas our LightGBM model was trained on 35,000 samples. As shown in Table 2, LightGBM achieves a ROC AUC of 0.9633, slightly higher than SHERLOCK's 0.9613, indicating competitive performance relative to the state of the art. At FPR = 0.1, LightGBM attains an 89.46% TPR compared to SHERLOCK's 88.23%. At FPR = 0.01, SHERLOCK shows a 3.10 percentage-point advantage (54.75% vs. 51.65%). However, at the most stringent operating point, FPR = 0.001, LightGBM achieves a 21.08% TPR while SHERLOCK reaches 15.81%, giving LightGBM a 5.27 percentage-point advantage.
We also measured the inference time for LightGBM and SHERLOCK on the test set, as reported in Table 7. Both models were evaluated on a system with a single NVIDIA A100 80 GB GPU, a 32-core CPU, and 64 GB of RAM. LightGBM achieves approximately 64× faster inference than SHERLOCK (0.10 ms vs. 6.61 ms per sample on average). Overall, these results indicate that domain-specific feature engineering combined with gradient boosting can match or exceed state-of-the-art Vision Transformer performance while requiring substantially less training data and a significantly lower computational cost. The 435 Android-specific features appear to capture discriminative patterns that would otherwise require learning from large-scale image corpora, making this approach more practical for deployment in resource-constrained settings.
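As a sketch of the measurement methodology (assuming a trained model and X_test as the extracted feature matrix), per-sample latency can be estimated from a single batched pass:

```python
# Sketch: average per-sample inference latency over the test set.
import time

start = time.perf_counter()
_ = model.predict_proba(X_test)
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"{elapsed_ms / len(X_test):.3f} ms per sample")
```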
To provide a more complete comparison of computational cost between tabular and vision-based models, we analyze training time, feature-extraction overhead, and hardware requirements for both approaches. The results are summarized in Table 8. Specifically, all tabular-model experiments were conducted on a machine with a 4-core CPU and 32 GB RAM. Under this setting, training LightGBM with the EMBER Extended feature set took approximately 1 min and 7 s. Feature extraction for the EMBER Extended features averaged 10.36 s per APK, with the majority of the cost attributable to DEX parsing and static code analysis. Importantly, feature extraction was performed once per APK, and the resulting feature vectors were stored and reused across training and evaluation runs. In contrast, the vision-based models were trained on a substantially more powerful system equipped with a 32-core CPU, 64 GB RAM, and an NVIDIA A100 GPU with 80 GB memory. Despite this hardware advantage, training DinoV3 took more than 30 min. The byte-plot image extraction pipeline averaged 11.12 s per image, which is comparable to the tabular feature-extraction time. However, vision models incur significantly higher computational and memory costs during training due to their larger parameter counts and reliance on GPU acceleration. Overall, while feature-extraction overhead is similar across the two representations, tabular models require substantially less training time and less demanding hardware.
Does the high dimensionality of the EMBER Extended feature set increase the risk of overfitting? To address this question, we analyzed LightGBM’s built-in feature-importance metrics. The resulting distribution is highly sparse, with most predictive signals concentrated in a small subset of the 1131 features. Specifically, the mean importance score is 81.32, while the standard deviation is 573.40, indicating substantial dispersion. The most important feature (Feature 947) has an importance score of 10,399.59, which is approximately 5252× larger than the median importance of 1.98. These results suggest that LightGBM effectively ignores the majority of features, and the effective dimensionality is therefore substantially lower than the raw feature count due to extreme importance sparsity. In addition, we use early stopping and standard LightGBM regularization (L1/L2 penalties and minimum samples per leaf), which further mitigates overfitting.
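A sketch of this analysis over the 1131 gain-importance scores (variable names are illustrative):

```python
# Sketch: summarize the sparsity of the importance distribution.
import numpy as np

gain = model.booster_.feature_importance(importance_type="gain")
print(f"mean = {gain.mean():.2f}, std = {gain.std():.2f}, "
      f"median = {np.median(gain):.2f}")
print(f"max/median ratio = {gain.max() / np.median(gain):.0f}x")
```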
How does the choice of the scale_pos_weight parameter in LightGBM affect performance, particularly in the low-FPR regime? The scale_pos_weight parameter controls how LightGBM addresses class imbalance by reweighting the loss during training. In our training set, the class ratio is approximately 15:1 (malicious-to-benign). Setting scale_pos_weight to 0.067 (≈ 1/15) applies inverse-ratio weighting, which increases the loss contribution of the minority (benign) class relative to the majority (malicious) class. This encourages the model to be more conservative when predicting the majority class and helps reduce false positives, which is particularly important when operating at stringent false-positive rates (e.g., FPR = 0.001). To quantify the effect at the FPR = 0.001 operating point, we compared the TPR on the validation set under two settings: scale_pos_weight = 1 (unweighted) and scale_pos_weight = 0.067 (inverse-ratio weighted). The resulting TPR values were 19.23% and 21.08%, respectively. Thus, inverse-ratio weighting improves TPR at FPR = 0.001 by 1.85 percentage points, indicating better performance in the low-FPR regime that is most relevant for real-world deployment.
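A minimal configuration sketch of the weighted setting follows; all hyperparameter values other than scale_pos_weight are illustrative rather than our exact settings, and X_train/X_val denote the extracted feature matrices.

```python
# Sketch: inverse-ratio class weighting with early stopping and regularization.
import lightgbm as lgb

model = lgb.LGBMClassifier(
    scale_pos_weight=1 / 15,   # ~0.067: down-weights the majority (malicious) class
    reg_alpha=0.1,             # L1 penalty (illustrative value)
    reg_lambda=0.1,            # L2 penalty (illustrative value)
    min_child_samples=20,      # minimum samples per leaf (illustrative value)
    n_estimators=1000,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # illustrative patience
)
```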
4.3. Ensemble Models
We evaluate ensemble models that combine predictions from four base classifiers (LightGBM Extended, TabNet Extended, Google ViT, and DinoV3) using a logistic regression meta-classifier. We consider all two-model pairings, resulting in six ensemble configurations.
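A sketch of the stacking step for one pairing (LightGBM Extended + DinoV3); the probability arrays from the two base models are assumed to be precomputed on the validation and test splits, and names are illustrative.

```python
# Sketch: logistic-regression meta-classifier over two base models' scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

Z_val = np.column_stack([p_lgbm_val, p_dino_val])    # base-model P(malicious)
Z_test = np.column_stack([p_lgbm_test, p_dino_test])

meta = LogisticRegression()        # low-capacity; L2-regularized by default
meta.fit(Z_val, y_val)
ensemble_scores = meta.predict_proba(Z_test)[:, 1]
```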
Figure 7 presents the ROC curves for all ensembles. The best-performing ensemble combines LightGBM Extended with DinoV3, achieving a ROC AUC of 0.9653. Moreover, because LightGBM Extended is consistently strong on its own, every ensemble that includes LightGBM Extended attains a ROC AUC above 0.96.
Table 9 reports performance at fixed FPR levels of 0.1, 0.01, and 0.001. At FPR = 0.1, the LightGBM–Google ViT ensemble performs best, achieving an F1 score of 0.9016 and a TPR of 90.28%. The LightGBM–DinoV3 ensemble performs similarly at FPR = 0.1 (F1 = 0.9005, TPR = 90.09%) but becomes the strongest detector at more stringent operating points. Specifically, at FPR = 0.01 and FPR = 0.001, it detects 54.58% and 27.13% of malware, respectively.
We further compare the best-performing ensemble (LightGBM–DinoV3) with the best individual model (LightGBM Extended). As shown in Figure 6 and Figure 7, LightGBM–DinoV3 attains a ROC AUC of 0.9653, a slight increase of 0.002 over LightGBM Extended. Table 3 and Table 9 provide a more detailed comparison between these two models at fixed operating points. At FPR = 0.1, the ensemble achieves a TPR of 90.09% versus 89.46% for LightGBM Extended, a gain of 0.63 percentage points. At FPR = 0.01, the ensemble detects 54.58% of malware compared to 51.65% for the single model, a 2.93 percentage-point improvement. At the most stringent threshold, FPR = 0.001, the ensemble reaches a TPR of 27.13% versus 21.08%, yielding a 6.05 percentage-point improvement. Table 10 summarizes these incremental gains in both TPR and F1 score, showing that the benefits of ensembling become more pronounced as the false-positive constraint tightens.
These gains, however, must be balanced against deployment cost. The ensemble requires maintaining two separate feature-extraction and inference pipelines, increasing computational overhead and operational complexity. Overall, the results suggest that LightGBM Extended offers the best trade-off between performance and simplicity in many deployment settings, where the marginal benefit of ensembling is limited at moderate FPRs. In contrast, for security-critical environments operating under very strict false-positive constraints (e.g., FPR = 0.001), the 6.05 percentage-point TPR improvement—equivalent to detecting an additional 567 malware samples while maintaining 99.65% precision—may justify the added complexity. Taken together, these findings indicate that while cross-modal ensembling can provide measurable improvements, investments in domain-specific feature engineering deliver larger and more broadly useful gains for Android malware detection.
We note that the same validation split is used both to tune the base models and to train the meta-classifier in the ensemble, which can introduce a risk of second-order overfitting to the validation data. To mitigate this risk, we apply several safeguards. First, the base models use early stopping with conservative patience settings to reduce overfitting to the validation split. Second, the meta-classifier is intentionally low-capacity (logistic regression with L2 regularization), which limits its ability to memorize validation-specific artifacts. Third, we report ensemble gains only on a fully held-out test set, ensuring that improvements are assessed strictly out-of-sample. As future work, we plan to evaluate strictly out-of-sample stacking protocols (e.g., out-of-fold stacking or nested cross-validation) to further quantify robustness.
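As a sketch of the out-of-fold protocol (treating both base models, for illustration only, as sklearn-compatible estimators; the fold count and names are assumptions, not our experimental setup):

```python
# Sketch: out-of-fold stacking, so the meta-classifier never sees
# base-model predictions made on data those models were fitted on.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Each sample's prediction comes from a fold in which it was held out.
oof_a = cross_val_predict(base_model_a, X_a, y, cv=5, method="predict_proba")[:, 1]
oof_b = cross_val_predict(base_model_b, X_b, y, cv=5, method="predict_proba")[:, 1]

meta = LogisticRegression()
meta.fit(np.column_stack([oof_a, oof_b]), y)
```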