In this section, we first describe the evaluation metrics and then assess the performance of both individual base models and ensemble models on the test set.
4.1. Evaluation Metrics
To comprehensively assess model performance, we consider two types of metrics: threshold-dependent metrics and threshold-independent metrics.
4.1.1. Threshold-Dependent Metrics
A binary classifier predicts a label by comparing the model’s output probability to a decision threshold. For each input sample (benign or malicious), the prediction is either correct or incorrect, leading to four possible outcomes in malware detection:
True Negatives (TNs): benign samples correctly classified as benign.
False Positives (FPs): benign samples incorrectly classified as malicious.
False Negatives (FNs): malicious samples incorrectly classified as benign.
True Positives (TPs): malicious samples correctly classified as malicious.
These quantities form the confusion matrix. In our implementation, we encode labels as 0 for benign (negative class) and 1 for malicious (positive class). This encoding is used consistently throughout the codebase and follows the standard convention in malware detection, where the malicious class is treated as the positive class (i.e., the class of interest).
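As an illustrative sketch (not our exact implementation), the four counts can be obtained under this encoding with scikit-learn as follows; the toy labels and variable names are assumptions for illustration only:

```python
# Illustrative sketch: confusion-matrix counts under the
# 0 = benign, 1 = malicious encoding.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1])  # toy ground-truth labels
y_pred = np.array([0, 1, 0, 1, 0, 1])  # toy thresholded predictions

# labels=[0, 1] fixes the row/column order: row 0 = benign, row 1 = malicious.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # -> 2 1 1 2
```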
Using these terms, the F1 score is defined as follows:
\[
\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},
\]
where
\[
\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad
\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.
\]
The F1 score ranges from 0 to 1, with higher values indicating better detection performance; an F1 score of 1 corresponds to perfect classification.
We also report the false positive rate (FPR), defined as follows:
\[
\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}.
\]
In this study, we focus on performance at low operating points (e.g., FPR = 0.01 and FPR = 0.001). In production settings, the FPR must be kept low to avoid alert fatigue; therefore, an effective malware detector should maintain a high true-positive rate (TPR) while operating at a low FPR.
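As a minimal sketch, both metrics follow directly from the four confusion-matrix counts (toy values carried over from the example above):

```python
# Sketch: F1 and FPR from the confusion-matrix counts (toy values from above).
tn, fp, fn, tp = 2, 1, 1, 2

precision = tp / (tp + fp)
recall = tp / (tp + fn)   # recall equals the TPR
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)
print(f"F1 = {f1:.3f}, FPR = {fpr:.3f}")
```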
4.1.2. Threshold-Independent Metrics
To provide a threshold-independent view of a model’s ability to distinguish benign from malicious samples, we also report the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC). The ROC curve plots the TPR against the FPR across all possible thresholds, while AUC summarizes this curve as a single scalar. AUC ranges from 0 to 1, with higher values indicating better discrimination. A random classifier achieves an AUC of 0.5, whereas a perfect classifier achieves an AUC of 1.
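Both quantities can be computed with scikit-learn, as in the following minimal sketch (the toy labels and scores are illustrative, not from our experiments):

```python
# Sketch: threshold-independent metrics via scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])            # toy labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # toy malicious probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one ROC point per threshold
auc = roc_auc_score(y_true, y_score)               # scalar in [0, 1]; 0.5 = random
```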
4.1.3. Threshold-Selection Procedure
In our study, we selected thresholds corresponding to fixed operating points (FPR = 0.1, 0.01, and 0.001) directly from the ROC curve during test-set evaluation. Specifically, for each model, we computed the ROC curve on the test set and identified the decision threshold that achieves the target empirical FPR. We then reported the corresponding TPR, precision, and F1 score at that operating point.
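The following sketch illustrates the procedure (function and variable names are illustrative, not from our codebase): for each target, we take the ROC point with the largest empirical FPR not exceeding the target and report the threshold and TPR there.

```python
# Sketch: select the decision threshold at a target empirical FPR.
import numpy as np
from sklearn.metrics import roc_curve

def operating_point(y_true, y_score, target_fpr):
    """Return (threshold, tpr) at the largest empirical FPR <= target_fpr."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    i = np.where(fpr <= target_fpr)[0][-1]  # fpr is sorted in ascending order
    return thresholds[i], tpr[i]

# Assumed inputs: y_test (0/1 labels), scores (malicious probabilities).
for target in (0.1, 0.01, 0.001):
    thr, tpr = operating_point(y_test, scores, target)
    print(f"FPR <= {target}: threshold = {thr:.4f}, TPR = {tpr:.4f}")
```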
4.2. Individual Models
We evaluate the performance of the individual models on the test set, including LightGBM and TabNet trained on the 696-dimensional EMBER Base features; Google ViT and DinoV3 trained on byte-plot images; LightGBM and TabNet trained on the 1131-dimensional EMBER Extended features; and the SHERLOCK baseline.
Table 2 reports ROC AUC and TPR at fixed FPR levels of 0.1, 0.01, and 0.001, while Figure 6 shows the ROC curves for the individual models studied.
The results indicate that feature representation has a substantial impact on detection performance. In particular, the vision models trained on byte-plot images outperform the tabular models trained on EMBER Base features. However, tabular models trained on EMBER Extended features achieve the best overall performance. Among the six models trained on our dataset, LightGBM with EMBER Extended features attains the highest ROC AUC (0.9633) and the highest TPR at fixed FPR levels of 0.1, 0.01, and 0.001. These findings suggest that feature engineering on APKs can significantly improve Android malware detection.
To assess the impact of the 435 Android-specific features introduced in the EMBER Extended representation,
Table 3 reports detailed performance at target FPR levels of 0.1, 0.01, and 0.001, including F1 score, TPR, precision, TN, and TP, for LightGBM and TabNet trained on EMBER Base and EMBER Extended features. For LightGBM, the TPR at FPR = 0.1 increases from 55.12% to 89.46%, a gain of 34.34 percentage points. At FPR = 0.01, the TPR rises from 15.75% to 51.65%, more than tripling. Under the most stringent setting, FPR = 0.001, the TPR improves from 3.64% to 21.08%, nearly a sixfold increase. TabNet exhibits similarly large gains, with TPR improving from 55.11% to 82.63% at FPR = 0.1 and from 4.17% to 30.42% at FPR = 0.01, representing more than a sevenfold increase. No TabNet results are reported at FPR = 0.001 in Table 2 and Table 3 because, on our test set, TabNet cannot achieve an operating threshold that yields FPR = 0.001.
In addition, Table 4 summarizes the absolute improvements of EMBER Extended features over EMBER Base features. The consistent gains for both LightGBM and TabNet confirm that domain-specific feature engineering provides substantial value for Android malware detection, particularly for practical deployment regimes that require low FPR.
Using LightGBM's built-in feature-importance metrics, we report the top 10 most important features and their importance scores for LightGBM trained with EMBER Base features and with EMBER Extended features in Table 5. Notably, all of the top 10 features under the EMBER Extended setting come from the 435 Android-specific features. In addition, we quantify the representation of Android-specific features among the most influential features for LightGBM trained on EMBER Extended features in Table 6: Android-specific features account for 100% of the top-10 features and 64% of the top-100 features. These results indicate that the introduced Android-specific features provide dominant discriminative signals and materially contribute to improved Android malware detection performance.
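A sketch of how such a ranking can be produced from a trained booster follows; the index test assumes the 435 Android-specific features are appended after the 696 EMBER Base features (consistent with the most important feature, Feature 947, discussed later in this section), and all variable names are illustrative.

```python
# Sketch: rank features by LightGBM gain importance and flag Android-specific ones.
import numpy as np

gain = model.booster_.feature_importance(importance_type="gain")
top10 = np.argsort(gain)[::-1][:10]
for rank, idx in enumerate(top10, start=1):
    # Assumption: indices 696-1130 hold the 435 Android-specific features.
    tag = "android" if idx >= 696 else "base"
    print(f"{rank:2d}. feature {idx} ({tag}): gain = {gain[idx]:.2f}")
```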
We further compare our LightGBM model trained on EMBER Extended features with SHERLOCK, a state-of-the-art Vision Transformer approach for Android malware detection [2]. Notably, SHERLOCK was trained on approximately 1.2 million Android application images, whereas our LightGBM model was trained on 35,000 samples. As shown in Table 2, LightGBM achieves a ROC AUC of 0.9633, slightly higher than SHERLOCK's 0.9613, indicating competitive performance relative to the state of the art. At FPR = 0.1, LightGBM attains an 89.46% TPR compared to SHERLOCK's 88.23%. At FPR = 0.01, SHERLOCK shows a 3.10 percentage-point advantage (54.75% vs. 51.65%). However, at the most stringent operating point, FPR = 0.001, LightGBM achieves a 21.08% TPR while SHERLOCK reaches 15.81%, giving LightGBM a 5.27 percentage-point advantage.
We also measured the inference time for LightGBM and SHERLOCK on the test set, as reported in Table 7. Both models were evaluated on a system with a single NVIDIA A100 80 GB GPU, a 32-core CPU, and 64 GB of RAM. LightGBM achieves approximately 64× faster inference than SHERLOCK (0.10 ms vs. 6.61 ms per sample on average). Overall, these results indicate that domain-specific feature engineering combined with gradient boosting can match or exceed state-of-the-art Vision Transformer performance while requiring substantially less training data and a significantly lower computational cost. The 435 Android-specific features appear to capture discriminative patterns that would otherwise require learning from large-scale image corpora, making this approach more practical for deployment in resource-constrained settings.
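As a sketch of the measurement methodology (assuming a trained model and X_test as the extracted feature matrix), per-sample latency can be estimated from a single batched pass:

```python
# Sketch: average per-sample inference latency over the test set.
import time

start = time.perf_counter()
_ = model.predict_proba(X_test)
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"{elapsed_ms / len(X_test):.3f} ms per sample")
```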
To provide a more complete comparison of computational cost between tabular and vision-based models, we analyze training time, feature-extraction overhead, and hardware requirements for both approaches. The results are summarized in Table 8. Specifically, all tabular-model experiments were conducted on a machine with a 4-core CPU and 32 GB RAM. Under this setting, training LightGBM with the EMBER Extended feature set took approximately 1 min and 7 s. Feature extraction for the EMBER Extended features averaged 10.36 s per APK, with the majority of the cost attributable to DEX parsing and static code analysis. Importantly, feature extraction was performed once per APK, and the resulting feature vectors were stored and reused across training and evaluation runs. In contrast, the vision-based models were trained on a substantially more powerful system equipped with a 32-core CPU, 64 GB RAM, and an NVIDIA A100 GPU with 80 GB memory. Despite this hardware advantage, training DinoV3 took more than 30 min. The byte-plot image extraction pipeline averaged 11.12 s per image, which is comparable to the tabular feature-extraction time. However, vision models incur significantly higher computational and memory costs during training due to their larger parameter counts and reliance on GPU acceleration. Overall, while feature-extraction overhead is similar across the two representations, tabular models require substantially less training time and less demanding hardware.
Does the high dimensionality of the EMBER Extended feature set increase the risk of overfitting? To address this question, we analyzed LightGBM’s built-in feature-importance metrics. The resulting distribution is highly sparse, with most predictive signals concentrated in a small subset of the 1131 features. Specifically, the mean importance score is 81.32, while the standard deviation is 573.40, indicating substantial dispersion. The most important feature (Feature 947) has an importance score of 10,399.59, which is approximately 5252× larger than the median importance of 1.98. These results suggest that LightGBM effectively ignores the majority of features, and the effective dimensionality is therefore substantially lower than the raw feature count due to extreme importance sparsity. In addition, we use early stopping and standard LightGBM regularization (L1/L2 penalties and minimum samples per leaf), which further mitigates overfitting.
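A sketch of this analysis over the 1131 gain-importance scores (variable names are illustrative):

```python
# Sketch: summarize the sparsity of the importance distribution.
import numpy as np

gain = model.booster_.feature_importance(importance_type="gain")
print(f"mean = {gain.mean():.2f}, std = {gain.std():.2f}, "
      f"median = {np.median(gain):.2f}")
print(f"max/median ratio = {gain.max() / np.median(gain):.0f}x")
```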
How does the choice of the scale_pos_weight parameter in LightGBM affect performance, particularly in the low-FPR regime? The scale_pos_weight parameter controls how LightGBM addresses class imbalance by reweighting the loss during training. In our training set, the class ratio is approximately 15:1 (malicious-to-benign). Setting scale_pos_weight to 0.067 (≈ 1/15) applies inverse-ratio weighting, which increases the loss contribution of the minority (benign) class relative to the majority (malicious) class. This encourages the model to be more conservative when predicting the majority class and helps reduce false positives, which is particularly important when operating at stringent false-positive rates (e.g., FPR = 0.001). To quantify the effect at the FPR = 0.001 operating point, we compared the TPR on the validation set under two settings: scale_pos_weight = 1 (unweighted) and scale_pos_weight = 0.067 (inverse-ratio weighted). The resulting TPR values were 19.23% and 21.08%, respectively. Thus, inverse-ratio weighting improves TPR at FPR = 0.001 by 1.85 percentage points, indicating better performance in the low-FPR regime that is most relevant for real-world deployment.
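A minimal configuration sketch of the weighted setting follows; all hyperparameter values other than scale_pos_weight are illustrative rather than our exact settings, and X_train/X_val denote the extracted feature matrices.

```python
# Sketch: inverse-ratio class weighting with early stopping and regularization.
import lightgbm as lgb

model = lgb.LGBMClassifier(
    scale_pos_weight=1 / 15,   # ~0.067: down-weights the majority (malicious) class
    reg_alpha=0.1,             # L1 penalty (illustrative value)
    reg_lambda=0.1,            # L2 penalty (illustrative value)
    min_child_samples=20,      # minimum samples per leaf (illustrative value)
    n_estimators=1000,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # illustrative patience
)
```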
4.3. Ensemble Models
We evaluate ensemble models that combine predictions from four base classifiers (LightGBM Extended, TabNet Extended, Google ViT, and DinoV3) using a logistic regression meta-classifier. We consider all two-model pairings, resulting in six ensemble configurations.
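A sketch of the stacking step for one pairing (LightGBM Extended + DinoV3); the probability arrays from the two base models are assumed to be precomputed on the validation and test splits, and names are illustrative.

```python
# Sketch: logistic-regression meta-classifier over two base models' scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

Z_val = np.column_stack([p_lgbm_val, p_dino_val])    # base-model P(malicious)
Z_test = np.column_stack([p_lgbm_test, p_dino_test])

meta = LogisticRegression()        # low-capacity; L2-regularized by default
meta.fit(Z_val, y_val)
ensemble_scores = meta.predict_proba(Z_test)[:, 1]
```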
Figure 7 presents the ROC curves for all ensembles. The best-performing ensemble combines LightGBM Extended with DinoV3, achieving a ROC AUC of 0.9653. Moreover, because LightGBM Extended is consistently strong on its own, every ensemble that includes LightGBM Extended attains a ROC AUC above 0.96.
Table 9 reports performance at fixed FPR levels of 0.1, 0.01, and 0.001. At FPR = 0.1, the LightGBM–Google ViT ensemble performs best, achieving an F1 score of 0.9016 and a TPR of 90.28%. The LightGBM–DinoV3 ensemble performs similarly at FPR = 0.1 (F1 = 0.9005, TPR = 90.09%) but becomes the strongest detector at more stringent operating points. Specifically, at FPR = 0.01 and FPR = 0.001, it detects 54.58% and 27.13% of malware, respectively.
We further compare the best-performing ensemble (LightGBM–DinoV3) with the best individual model (LightGBM Extended). As shown in Figure 6 and Figure 7, LightGBM–DinoV3 attains a ROC AUC of 0.9653, a slight increase of 0.002 over LightGBM Extended. Table 3 and Table 9 provide a more detailed comparison between these two models at fixed operating points. At FPR = 0.1, the ensemble achieves a TPR of 90.09% versus 89.46% for LightGBM Extended, a gain of 0.63 percentage points. At FPR = 0.01, the ensemble detects 54.58% of malware compared to 51.65% for the single model, a 2.93 percentage-point improvement. At the most stringent threshold, FPR = 0.001, the ensemble reaches a TPR of 27.13% versus 21.08%, yielding a 6.05 percentage-point improvement. Table 10 summarizes these incremental gains in both TPR and F1 score, showing that the benefits of ensembling become more pronounced as the false-positive constraint tightens.
These gains, however, must be balanced against deployment cost. The ensemble requires maintaining two separate feature-extraction and inference pipelines, increasing computational overhead and operational complexity. Overall, the results suggest that LightGBM Extended offers the best trade-off between performance and simplicity in many deployment settings, where the marginal benefit of ensembling is limited at moderate FPRs. In contrast, for security-critical environments operating under very strict false-positive constraints (e.g., FPR = 0.001), the 6.05 percentage-point TPR improvement—equivalent to detecting an additional 567 malware samples while maintaining 99.65% precision—may justify the added complexity. Taken together, these findings indicate that while cross-modal ensembling can provide measurable improvements, investments in domain-specific feature engineering deliver larger and more broadly useful gains for Android malware detection.
We note that the same validation split is used both to tune the base models and to train the meta-classifier in the ensemble, which can introduce a risk of second-order overfitting to the validation data. To mitigate this risk, we apply several safeguards. First, the base models use early stopping with conservative patience settings to reduce overfitting to the validation split. Second, the meta-classifier is intentionally low-capacity (logistic regression with L2 regularization), which limits its ability to memorize validation-specific artifacts. Third, we report ensemble gains only on a fully held-out test set, ensuring that improvements are assessed strictly out-of-sample. As future work, we plan to evaluate strictly out-of-sample stacking protocols (e.g., out-of-fold stacking or nested cross-validation) to further quantify robustness.
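As a sketch of the out-of-fold protocol (treating both base models, for illustration only, as sklearn-compatible estimators; the fold count and names are assumptions, not our experimental setup):

```python
# Sketch: out-of-fold stacking, so the meta-classifier never sees
# base-model predictions made on data those models were fitted on.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Each sample's prediction comes from a fold in which it was held out.
oof_a = cross_val_predict(base_model_a, X_a, y, cv=5, method="predict_proba")[:, 1]
oof_b = cross_val_predict(base_model_b, X_b, y, cv=5, method="predict_proba")[:, 1]

meta = LogisticRegression()
meta.fit(np.column_stack([oof_a, oof_b]), y)
```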