Abstract
Repackaged Android applications pose a significant threat to mobile ecosystems, acting as common vectors for malware distribution and intellectual property infringement. Addressing the challenges of existing repackaging detection methods—such as scalability, reliance on app pairs, and high computational costs—this paper presents a novel hybrid approach that combines supervised learning and symptom discovery. We develop a lightweight feature extraction and analysis framework that leverages only 20 discriminative features, including inter-component communication (ICC) patterns, sensitive API usage, permission profiles, and a structural anomaly metric derived from string offset order. Our experiments, conducted on 8441 Android applications sourced from the RePack dataset, demonstrate the effectiveness of our approach, achieving a maximum F1 score of 85.9% and recall of 98.8% using Support Vector Machines—outperforming prior state-of-the-art models that utilized over 500 features. We also evaluate the standalone predictive power of AndroidSOO’s string offset order feature and highlight its value as a low-cost repackaging indicator. This work offers an accurate, efficient, and scalable alternative for automated detection of repackaged mobile applications in large-scale Android marketplaces.
1. Introduction
The rapid growth of the Android operating system has revolutionized the mobile landscape, powering more than three billion active devices worldwide [1]. Its open-source nature and flexible distribution model have accelerated innovation, enabling developers to create diverse applications that reach global audiences. However, this openness also introduces significant security risks. One of the most prevalent and damaging threats is application repackaging—a process in which attackers clone legitimate apps, inject malicious payloads or unauthorized modifications, and redistribute them on third-party marketplaces.
Repackaged apps present a dual threat to the mobile ecosystem. On the one hand, they act as effective vectors for malware distribution, facilitating data theft, spyware, ransomware, and other malicious activities. On the other hand, they lead to intellectual property infringement and loss of revenue for legitimate developers. Users who unknowingly install these cloned applications often suffer degraded device performance, intrusive advertisements, and serious privacy violations, which collectively erode trust in the Android ecosystem. Alarmingly, prior studies estimate that 86% of Android malware consists of repackaged applications [2], and industry reports show that 77% of the top 50 apps in Google Play once had repackaged counterparts [3].
The 2016 release of Pokémon GO provides a prominent example of how attackers exploit repackaging. Because the game was initially released in only a limited number of countries, third-party repackaged versions quickly emerged in excluded regions. Many of these clones contained hidden malware, resulting in devices overheating, slowing down, and displaying intrusive foreign-language advertisements. This case highlights a recurring cycle: adversaries exploit user demand for unavailable, older, or modified applications to bypass official distribution channels. Such incidents underscore the need for robust detection mechanisms that can identify malicious repackaged apps before they reach end users.
Despite a decade of research, repackaging detection remains an open challenge due to the following obstacles:
- Scalability: Most similarity-based approaches [4,5,6] rely on pairwise comparisons between candidate and original apps. While accurate in small-scale settings, these techniques suffer from prohibitive computational costs when applied to millions of applications in large marketplaces.
- Dependence on Original–Repackaged Pairs: Many methods, such as [4,6], require both the benign app and its repackaged counterpart for comparison, which is impractical in real-world settings where original apps may not be available.
- Feature Explosion: Some state-of-the-art approaches extract hundreds of features (e.g., APIs, permissions, control flow graphs), which leads to high-dimensional data, longer training times, and limited portability [7].
- Obfuscation and Code Transformation: Attackers frequently employ obfuscation, reordering, and library injection to disguise malicious payloads, which can confuse both similarity-based and machine learning-based detectors.
- Dataset Limitations: The lack of large, high-quality benchmark datasets hinders fair evaluation and reproducibility of proposed methods.
This paper introduces RepackDroid, a lightweight and effective detection framework that addresses the limitations of prior approaches by combining supervised learning with symptom discovery. Unlike prior works that rely on large feature sets or explicit app pairs [4,5,6,7], RepackDroid extracts a compact set of only 20 discriminative features, spanning inter-component communication (ICC) patterns, permission structures, sensitive API usage, and structural anomalies such as string offset order. In summary, this work makes the following contributions.
- Novel Feature Extraction Framework: We design and implement a tool that automatically processes APKs into a compact set of features tailored for repackaging detection.
- Efficient Learning with Fewer Features: We demonstrate that a reduced feature space (20 features) can outperform prior models trained on more than 500 features, achieving improved recall and F1 scores while reducing computational overhead.
- Comprehensive Empirical Evaluation: Using 8441 Android applications from the RePack dataset [8], we show that Support Vector Machines achieve up to 98.8% recall and 85.9% F1 score, outperforming state-of-the-art baselines. We also evaluate the predictive power of the string offset order anomaly, validating its role as a cost-effective indicator of repackaging.
The remainder of this paper is structured as follows. Section 2 reviews existing approaches for repackaged app detection, highlighting their strengths and limitations. Section 3 describes our methodology. Section 4 covers dataset characterization and preprocessing. Section 5 presents experimental results. Section 6 discusses the implications of our findings. Section 7 examines threats to the validity of our approach. Finally, Section 8 concludes the paper and outlines directions for future work.
2. Related Work
In this section, we review prior work that focuses specifically on the detection of repackaged Android applications. While the broader Android security literature includes extensive research on general malware detection, such as systems that analyze permissions, API usage, control-flow patterns, or machine learning models designed to classify malicious applications [9,10,11,12,13,14,15], these approaches do not target the unique characteristics of repackaging or rely on datasets that contain original–repackaged app pairs. Our discussion therefore concentrates on techniques explicitly developed for identifying repackaged apps, including similarity-based comparison, runtime instrumentation, supervised learning, and symptom discovery.
Similarity Comparison Approaches. Similarity-based detection remains one of the earliest and most intuitive strategies. These methods compare suspected applications with their original counterparts to identify code or resource similarities. Tools such as DroidMOSS [5], CodeMatch [4], and RomaDroid [6] employ opcode sequences, fuzzy hashing, or manifest-based features to capture resemblance between app pairs. Other systems like CloneSpot [16] analyze metadata available in app marketplaces, such as titles, developer names, and descriptions. While often effective in small-scale scenarios, similarity-based approaches face three critical challenges: (1) they require access to both the original and repackaged apps, which is unrealistic at marketplace scale; (2) they incur high computational overhead due to pairwise comparisons; and (3) they are vulnerable to obfuscation, library injection, and class reordering. These limitations hinder their scalability in real-world app markets.
Runtime Monitoring Approaches. Runtime-based solutions attempt to detect repackaging by executing apps or embedding markers. For example, AppInk [17] embeds watermarking code into applications, while Nguyen et al. [18] apply perceptual hashing to dynamic UI screenshots. Such approaches reduce dependence on static code features and can bypass some obfuscation. However, runtime-based solutions depend heavily on developer cooperation (e.g., embedding watermarks) or require controlled execution environments, making them less practical for large-scale, automated marketplace scanning.
Machine Learning Approaches. Recent studies leverage supervised and unsupervised learning to address the shortcomings of similarity-based detection. For example, SCSdroid [19] uses system call sequences, DR-Droid [20] employs code heterogeneity analysis, and DroidMat [21] clusters applications using features such as API calls and manifest declarations. These models alleviate the dependency on explicit app pairs and offer greater adaptability. Nevertheless, many of these methods rely on large and complex feature sets—sometimes exceeding several hundred dimensions [7]. High-dimensional feature spaces increase computational cost, reduce interpretability, and make the models more difficult to scale in production environments.
Symptom Discovery Approaches. Symptom discovery methods focus on identifying artifacts consistently introduced during the repackaging process. AndroidSOO [22], for example, exploits anomalies in the string offset order of Dalvik executable (.dex) files, achieving high detection accuracy with minimal computational requirements. This approach avoids dependence on original–repackaged pairs and resists certain obfuscation techniques. However, symptom-based methods often hinge on specific characteristics of repackaging tools. Skilled adversaries who manually recompile or use alternative tools may circumvent these detection mechanisms, limiting their generalizability.
Our Approach. Building on these insights, our approach, RepackDroid, introduces a hybrid detection model that combines the strengths of supervised learning and symptom discovery. Unlike prior machine learning efforts that rely on hundreds of features, we reduce the feature space to only 20 carefully selected attributes spanning ICC patterns, permission usage, sensitive API calls, and structural anomalies. This design balances accuracy with computational efficiency, making the approach suitable for large-scale deployment. Furthermore, our work is among the first to validate the predictive power of string offset order in combination with machine learning, using the large-scale RePack dataset [8].
3. The RepackDroid Approach
This section presents the methodology underlying RepackDroid, our proposed framework for detecting repackaged Android applications. We first outline the overall approach, followed by detailed descriptions of the dataset construction, feature extraction process, machine learning model implementation, and evaluation methodology.
3.1. Overview of Approach
RepackDroid combines static analysis of Android applications with machine learning and symptom discovery. Figure 1 shows the workflow of the proposed supervised learning-based detection framework. The workflow consists of four key stages:
Figure 1.
Workflow of the proposed supervised learning-based detection framework. The pipeline includes APK decompilation, feature extraction, and classification using machine learning models.
- Dataset Preparation: We curate and preprocess applications from the RePack dataset [8], ensuring that each sample is properly labeled as original or repackaged.
- APK Decompilation and Parsing: Applications are decompiled using established tools to expose manifest files, Dalvik executable (DEX) bytecode, and intermediate smali code representations.
- Feature Extraction: We derive 20 discriminative features from four categories—inter-component communication (ICC), permissions, sensitive API usage, and structural anomalies.
- Classification: Supervised learning models are trained and tested to predict whether an application is repackaged. The feature set includes AndroidSOO’s [22] string offset order anomaly as a low-cost, structural indicator of repackaging.
3.2. Dataset Construction
We utilized the RePack dataset [8], a benchmark repository containing over 15,000 pairs of original and repackaged Android applications collected from AndroZoo [23]. Each app pair is associated with SHA-256 hashes and metadata, which we used to automatically label samples as “original” or “repackaged.” Applications were downloaded via the AndroZoo API using an authenticated key. The dataset was organized into two directories: originals and repacks. This structure facilitated efficient labeling and batch processing. Each APK was decompiled using Apktool [24] and baksmali [25], producing the following artifacts:
- AndroidManifest.xml: Containing package metadata, permissions, and component definitions.
- classes.dex: Dalvik bytecode compiled from Java sources.
- .smali files: Human-readable intermediate code representation from DEX disassembly.
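The directory-based labeling step can be sketched as follows; the layout and label encoding (0 for originals, 1 for repacks) are illustrative assumptions rather than the tool's exact implementation:

```python
import os

def label_apks(root):
    """Walk the dataset root and label each APK by its parent directory.

    Assumes the RePack-style layout described above: APKs under
    root/originals/ are labeled 0, those under root/repacks/ are labeled 1.
    """
    labels = {}
    for subdir, label in (("originals", 0), ("repacks", 1)):
        base = os.path.join(root, subdir)
        for dirpath, _, files in os.walk(base):
            for name in files:
                if name.endswith(".apk"):
                    labels[os.path.join(dirpath, name)] = label
    return labels
```

Because the label is derived from the directory structure rather than metadata lookups, this step scales linearly with the number of files and requires no network access.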
During preprocessing, we identified and removed redundant samples. Specifically, one original app was found to appear in 1676 pairs (11% of the dataset). Since many of the associated repackaged versions were nearly identical, we excluded these to avoid data skew. After filtering, the final dataset consisted of 8441 unique applications (6081 repackaged, 2360 original).
3.3. Feature Extraction
To enable automated and scalable processing of applications, we developed a custom feature extraction tool implemented in a combination of Java SE 12 and Python 3.4. This tool performs APK decompilation, parsing, and feature preprocessing in preparation for machine learning analysis.
The workflow of the feature extraction tool is summarized in Algorithm 1. It begins by downloading APKs from the RePack dataset [8], decompiling them to access manifest and bytecode files, and parsing inter-component communication (ICC) tuples following the approach in Elish et al. [26]. Additional features such as permissions, sensitive API counts [15,27], and string offset order anomalies [22] are then integrated into a consolidated dataset for supervised learning.
3.3.1. APK Decompilation and Preservation of DEX Files
The workflow starts by decompiling each APK file using apktool. After decompilation, the output directory contains the following artifacts: a .smali directory, a manifest file (AndroidManifest.xml), and one or more classes.dex files. A custom batch script was created to manage the decompilation process. While apktool is effective for converting APKs into smali code, its default process often discards the classes.dex file during smali folder generation. Since our methodology required access to both the smali code and the original DEX file—particularly for AndroidSOO’s string offset order analysis [22]—our batch script made an explicit copy of the classes.dex file before disassembly. This ensured consistent availability of both representations.
Algorithm 1.
Feature Extraction Workflow.
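As a rough illustration of the preservation step, the following Python sketch extracts the DEX files from the APK archive (a ZIP file) before handing the APK to apktool; the function names and the apktool command line are assumptions, and apktool must be available on the PATH:

```python
import subprocess
import zipfile
from pathlib import Path

def preserve_dex(apk_path, out_dir):
    """Copy every classes*.dex out of the APK (a ZIP archive) before
    apktool's disassembly step discards them."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(apk_path) as apk:
        for name in apk.namelist():
            if name.startswith("classes") and name.endswith(".dex"):
                apk.extract(name, out_dir)
                saved.append(out_dir / name)
    return saved

def decompile(apk_path, out_dir):
    """Preserve the DEX files, then invoke apktool (illustrative command)."""
    preserve_dex(apk_path, Path(out_dir) / "dex")
    subprocess.run(["apktool", "d", "-f", str(apk_path), "-o", str(out_dir)],
                   check=True)
```

Reading the DEX directly from the ZIP sidesteps apktool's intermediate handling entirely, so the bytes analyzed by the SOO check are byte-identical to those shipped in the APK.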
3.3.2. Parsing Inter-Component Communication (ICC)
A primary function of our tool is to identify an application’s .smali directory and parse its inter-component communication (ICC) events. We adopted the four-tuple ICC representation introduced by Elish et al. [26], which consists of:
- ICC Name: the method central to initializing the communication (e.g., startActivity(), sendBroadcast()).
- Source Component: the component from which the intent originates.
- Target Component: the intended recipient of the communication, which may not always be identifiable due to limitations of smali code analysis. In cases where the target cannot be resolved, it is assigned a null value.
- Type of Communication: whether the intent is internal (within the same app) or external (directed outside the app), typically inferred through manifest file analysis.
An example ICC tuple is shown in Table 1. Each time an ICC name was discovered in the code, a tuple was generated and appended to an ArrayList, which was later exported into the application’s CSV file.
Table 1.
Snippet of ICC Representation from Application.csv.
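A minimal sketch of the tuple construction follows, under simplifying assumptions that depart from the full Elish et al. [26] analysis: the target is taken from the nearest preceding const-class line when one exists, and only a small subset of ICC-initiating methods is matched.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Illustrative subset of ICC-initiating methods.
ICC_METHODS = ("startActivity", "startService", "sendBroadcast", "bindService")

@dataclass
class IccTuple:
    icc_name: str
    source: str            # smali class the call appears in
    target: Optional[str]  # None when the target cannot be resolved
    comm_type: str         # "internal" or "external"

def parse_icc(smali_text, source_class, internal_components):
    """Scan one smali file for ICC call sites and emit four-tuples."""
    tuples = []
    last_class = None
    for line in smali_text.splitlines():
        # Remember the most recent const-class, a common way intents
        # name their explicit target in smali code.
        m = re.search(r"const-class\s+\S+,\s+L([\w/$]+);", line)
        if m:
            last_class = m.group(1).replace("/", ".")
        for method in ICC_METHODS:
            if re.search(r"invoke-\w+.*->" + method + r"\(", line):
                target = last_class
                comm = "internal" if target in internal_components else "external"
                tuples.append(IccTuple(method, source_class, target, comm))
    return tuples
```

When no const-class precedes the call (e.g., implicit intents), the target stays null, matching the fallback described above.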
3.3.3. Metadata and Preprocessing Output
Another key function of our tool is to prepare a CSV file for each application. First, the tool attempted to extract the app’s name from its AndroidManifest.xml file. The CSV schema included multiple categories of information:
- Single-row metadata—such as the permissions list, AndroidSOO output, and binary classification label (repackaged vs. original).
- Iteratively updated fields—such as counts of Android APIs, Java APIs, and user actions, which were incremented with each discovered ICC tuple.
This two-layer design prevented fragmentation of app data across multiple files while ensuring that both global and local features were captured.
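The two-layer CSV output described above might look like the following sketch; the column names are illustrative, not the tool's exact schema:

```python
import csv

def write_app_csv(path, metadata, icc_rows):
    """Write one CSV per app: a metadata header row first (permissions
    list, SOO flag, label), followed by one row per discovered ICC tuple."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["permissions", "soo", "label"])
        w.writerow([";".join(metadata["permissions"]),
                    metadata["soo"], metadata["label"]])
        w.writerow(["icc_name", "source", "target", "comm_type"])
        for row in icc_rows:
            w.writerow(row)
```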
3.3.4. Consolidation and Feature Engineering
Once all applications in the “originals” and “repacks” directories were processed, their individual CSV files were funneled into a single directory. In the final step, the tool iterated through these CSVs to construct the consolidated dataset. During this step, it:
- Computed new numerical features derived from ICC tuples and their relationships.
- Counted sensitive API occurrences, following established lists from Elish et al. [15] and Tian et al. [27].
- Transferred metadata fields from the first row of each CSV into the consolidated dataset.
The resulting dataset provided a structured and compact feature set, ready for building and training the machine learning classification models.
3.4. Feature Set Design
The effectiveness of any machine learning approach for repackaged app detection depends critically on the quality of its feature representation. We designed a compact feature set of 20 attributes, grouped into four categories: inter-component communication (ICC) features, permissions features, sensitive API features, and a structural anomaly feature. While these categories have been individually explored in prior work, to the best of our knowledge, their collective integration into a single lightweight model for repackaged app detection has not been reported in the literature. The complete set of features extracted for our model is summarized in Table 2, along with their descriptions and categorizations.
Table 2.
Description of the extracted features employed for repackaged app classification.
3.4.1. ICC Features
Inter-component communication (ICC) reflects how activities, services, and broadcast receivers interact within an application. Following the four-tuple ICC representation of Elish et al. [26], we extract computable features that capture both the volume and structure of communications. These include the frequency of each ICC type, the ratio of internal versus external communications, the most common source and target components, and the aggregate component frequency.
As illustrated in Table 3, ICC method calls such as startActivity() or sendBroadcast() serve as anchors for constructing these features. Because ICC governs the way Android components collaborate, deviations in these patterns often reveal artifacts of repackaging. Thus, ICC metrics provide both application uniqueness and cross-sample comparability for classification.
Table 3.
Categorizations of ICC Methods.
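The derivation of these metrics from parsed tuples can be sketched as follows; the feature names are illustrative stand-ins for those in Table 2:

```python
from collections import Counter

def icc_features(tuples):
    """Derive the ICC feature group from (name, source, target, comm_type)
    tuples: per-method frequencies, internal/external ratios, the most
    common source component, and the component frequency."""
    names = Counter(t[0] for t in tuples)
    comm = Counter(t[3] for t in tuples)
    sources = Counter(t[1] for t in tuples)
    total = len(tuples) or 1  # guard against division by zero
    return {
        "startActivityFreq": names["startActivity"] / total,
        "sendBroadcastFreq": names["sendBroadcast"] / total,
        "internalRatio": comm["internal"] / total,
        "externalRatio": comm["external"] / total,
        "mostCommonSource": sources.most_common(1)[0][0] if sources else None,
        "componentFrequency": len(sources),
    }
```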
3.4.2. Permissions Features
Application permissions, declared in AndroidManifest.xml, represent another critical dimension of security analysis. Permissions are both easy to extract and highly informative, since many malicious repackaged apps request excessive or unnecessary access. Leveraging the Android documentation [28], we categorize permissions into normal (low risk), dangerous (high risk), and signature (granted only to apps signed with the same certificate).
From these categories, we compute the total number of permissions, the count of each type, and a riskiness score defined as the ratio of dangerous to total permissions. Prior work has shown that malware families often exhibit consistent permission usage patterns [29], making these features highly predictive for repackaging detection.
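A minimal sketch of the riskiness computation, assuming a small illustrative protection-level map in place of the full Android mapping [28]:

```python
# Illustrative protection-level map; the full mapping comes from the
# Android permissions documentation.
PROTECTION_LEVEL = {
    "android.permission.INTERNET": "normal",
    "android.permission.READ_SMS": "dangerous",
    "android.permission.READ_CONTACTS": "dangerous",
    "android.permission.BIND_ACCESSIBILITY_SERVICE": "signature",
}

def permission_features(declared):
    """Count permissions by protection level and compute the riskiness
    score: the ratio of dangerous to total declared permissions."""
    counts = {"normal": 0, "dangerous": 0, "signature": 0}
    for p in declared:
        level = PROTECTION_LEVEL.get(p)
        if level:
            counts[level] += 1
    total = len(declared)
    return {
        "totalPerms": total,
        **counts,
        "RiskRatePerPerms": counts["dangerous"] / total if total else 0.0,
    }
```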
3.4.3. Sensitive API Features
Whereas permissions describe declared intent, sensitive APIs capture operational behavior. Using the curated API lists from Elish et al. [15] and Tian et al. [27], we measure the frequency of Android-specific and Java-specific API calls across each application. Additional derived metrics include:
- Total API count (sum of Android and Java APIs).
- API-per-component ratio, estimating the average intensity of API usage.
- User action counts, measuring the frequency of user-triggered interactions.
- User action–per-component ratio, normalizing these interactions by application size.
For sensitive APIs, we include illustrative Android and Java API calls that are frequently associated with high-risk behavior. Examples include telephony and messaging interfaces (e.g., SmsManager.sendTextMessage, TelephonyManager.getDeviceId), system-level operations (e.g., Runtime.exec), and network or data-exfiltration APIs (e.g., HttpURLConnection.connect, Socket.write). These examples clarify how the selected API categories capture behavioral traits that are commonly modified or introduced during repackaging.
For user-triggered interactions, we highlight representative callbacks such as onClick, onTouchEvent, and startActivityForResult, which indicate execution paths initiated through user input. Because injected or altered malicious payloads frequently leverage these interaction points, these features provide additional discriminative value for distinguishing repackaged from original applications.
Together, these features characterize the “muscular system” of an application’s behavior, complementing the ICC and permissions features. Repackaging often alters these patterns due to injected functionality or code restructuring.
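The API counting and normalization steps can be sketched as follows; the API and callback lists shown are small illustrative subsets of the curated lists [15,27], and naive substring counting stands in for the tool's actual smali parsing:

```python
# Illustrative subsets only; the deployed tool uses the full curated lists.
ANDROID_APIS = ("sendTextMessage", "getDeviceId")
JAVA_APIS = ("Runtime;->exec",)
USER_ACTIONS = ("onClick", "onTouchEvent", "startActivityForResult")

def api_features(smali_text, component_count):
    """Count sensitive API and user-action occurrences, then normalize
    by the number of components to get per-component intensities."""
    android = sum(smali_text.count(a) for a in ANDROID_APIS)
    java = sum(smali_text.count(a) for a in JAVA_APIS)
    actions = sum(smali_text.count(a) for a in USER_ACTIONS)
    comps = component_count or 1
    return {
        "androidApiCount": android,
        "javaApiCount": java,
        "totalApiCount": android + java,
        "apiPerComponent": (android + java) / comps,
        "userActionCount": actions,
        "actionsPerComponent": actions / comps,
    }
```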
3.4.4. Structural Anomaly Feature: String Offset Order
The final feature category introduces a symptom discovery dimension based on structural anomalies. Specifically, we include the String Offset Order (SOO) feature, derived from AndroidSOO [22]. According to the Dalvik executable specification, string identifiers in DEX files should appear in alphabetical order; however, when an application is repackaged using tools such as apktool, this ordering is often disrupted.
We encode SOO as a binary feature (“in order” vs. “out of order”). Although simple, prior work has shown it to be a powerful low-cost indicator of tampering [22].
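A simplified check of this symptom can be run directly against the raw DEX bytes. This sketch tests whether the string_data offsets recorded in the string_ids table are monotonically non-decreasing, a cheap proxy for the full AndroidSOO analysis [22] rather than a reimplementation of it:

```python
import struct

def string_offsets_in_order(dex_bytes):
    """Check the String Offset Order symptom on raw DEX bytes.

    Per the DEX format, string_ids_size and string_ids_off are stored at
    header offsets 0x38 and 0x3C. Repackaging tools that append strings
    typically break the monotonic ordering of string_data offsets.
    """
    size, off = struct.unpack_from("<II", dex_bytes, 0x38)
    offsets = struct.unpack_from("<{}I".format(size), dex_bytes, off)
    return all(a <= b for a, b in zip(offsets, offsets[1:]))
```

Because the check reads only the header and the string_ids table, it runs in milliseconds even on large DEX files, which is what makes SOO such a low-cost indicator.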
4. Dataset Characterization and Preprocessing
Before training machine learning models, it is critical to understand the statistical properties of the dataset. Such analysis not only provides insight into the relationships among features but also guides preprocessing decisions to reduce redundancy, balance feature contributions, and filter noisy or duplicated samples. We conducted three complementary analyses: feature correlation analysis, distributional assessment, and outlier detection.
4.1. Correlation Analysis
A correlation matrix was computed across all extracted features to evaluate their interrelationships. As visualized in Figure 2, the resulting heatmap provides a global view of how features relate to one another across the 9524 Android applications initially analyzed. This visualization confirms that certain attributes are naturally correlated. For example, userActionCount exhibits strong positive correlation with both component frequencies and API counts, reflecting the fact that larger applications tend to have more code, more components, and more user-triggered events.
Figure 2.
Correlation heatmap of the extracted dataset features. The visualization highlights relationships among ICC, permissions, sensitive APIs, and structural anomaly features across studied Android applications, with color intensity indicating the strength and direction of pairwise correlations.
The correlation map also validates the logical groupings introduced in Section 3.4 (ICC, permissions, APIs, and structural anomalies). Features within the same category cluster together, appearing as distinct quadrilateral regions in the heatmap (e.g., the block spanning BroadcastReceiverOccurrenceFrequency to ExternalOccurrence).
To avoid feature redundancy, we applied dimensionality reduction based on these correlations. Within the permission-related features, only RiskRatePerPerms was retained, as it provided the strongest correlation with the classification label. Similarly, within the API-related features, totalApiCount was chosen to represent the group of highly correlated metrics (javaApiCount, androidApiCount, and totalApiCount). This pruning preserved informativeness while lowering model complexity.
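A generic sketch of this correlation-based pruning follows; the threshold and the preference rule are illustrative assumptions, not the exact procedure we applied:

```python
import pandas as pd

def prune_correlated(df, keep, threshold=0.9):
    """Drop one feature of each highly correlated pair, preferring the
    representatives named in `keep` (e.g. RiskRatePerPerms, totalApiCount)."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    drop = set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold:
                # Keep the preferred representative; drop the other.
                victim = a if b in keep else b
                if victim not in keep:
                    drop.add(victim)
    return df.drop(columns=sorted(drop))
```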
4.2. Distributional Analysis
Feature distributions were then examined through histograms, shown in Figure 3. These visualizations highlight the range and skew of each attribute across the dataset. Results indicate that the majority of applications are relatively small in scale, characterized by lower component counts, fewer permissions, and limited API usage.
Figure 3.
Histograms of the extracted dataset features. The plots illustrate the distribution of ICC, permissions, sensitive APIs, and structural anomaly attributes across studied Android applications, highlighting the predominance of smaller apps with relatively low feature counts and the presence of a minority of larger, more complex outliers.
Conversely, a smaller subset of applications exhibited disproportionately higher values, with totalApiCount reaching 800–1000 and userActionCount exceeding 100 events. In contrast, smaller apps typically contained fewer than 200 API calls and 25 user actions. This observation suggests a long-tailed distribution: while most apps are lightweight, a minority are significantly more complex, with feature values four to five times greater.
4.3. Outlier Detection and Filtering
Finally, we examined dataset composition for potential outliers and redundancies. Analysis of the RePack dataset revealed that one original application appeared in 1676 repackaged pairs, accounting for roughly 11% of the entire dataset. While in principle each repackaged variant could be distinct, feature extraction showed that many of these samples were nearly identical, providing little additional training value.
To prevent the model from being biased by such duplicates, these redundant samples were excluded wherever identified. After filtering, the dataset size was reduced from 9524 to 8441 applications, yielding a cleaner and more representative corpus for experimentation.
5. Evaluation Results
This section presents the results of our experimental evaluation. We begin with a summary of the dataset used for training and testing, followed by a description of evaluation metrics. We then report classification results across several machine learning algorithms and compare our approach with state-of-the-art techniques from the literature.
5.1. Dataset Summary and Preprocessing
After filtering redundant entries as described in Section 4.3, the final dataset contained 8441 Android applications, of which 6081 were repackaged and 2360 were original. This distribution indicates a significant class imbalance, with approximately 72% of samples belonging to the repackaged class.
To address the imbalance, we applied the Synthetic Minority Oversampling Technique (SMOTE) [30], which generates synthetic minority-class samples based on feature-space interpolation of nearest neighbors. Unlike simple duplication, SMOTE increases diversity within the minority class and reduces overfitting, thereby improving generalization. For all experiments, the dataset was split into 80% training and 20% testing partitions.
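The interpolation idea behind SMOTE can be sketched as follows. This is a minimal illustration of nearest-neighbor interpolation under stated assumptions (uniform random neighbor choice, Euclidean neighbors), not the imbalanced-learn implementation used in our experiments:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating each
    chosen point toward one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)  # idx[:, 0] is the point itself
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # a true neighbor, not self
        lam = rng.random()
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)
```

Each synthetic point lies on a segment between two real minority samples, which is why SMOTE adds diversity without leaving the minority-class region of feature space.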
To assess the scalability of the proposed framework, we measured the execution time of the feature extraction pipeline across the full set of 8441 apps. Our Java-based feature extraction tool processed decompiled APK directories at an average rate of approximately 0.9 min per app on a standard workstation (Intel i7-12700K CPU, 32 GB RAM), completing extraction of ICC tuples, sensitive API counts, permission metrics, and the SOO feature. These results empirically support the scalability claims of our approach and demonstrate that the proposed lightweight feature set enables efficient analysis even across large collections of apps. We acknowledge, however, that decompilation remains the dominant cost, and future work will explore parallelized APK processing and incremental parsing techniques to further reduce execution time.
5.2. Evaluation Metrics
We employed four widely adopted evaluation metrics for classification: precision, recall, F1-score, and precision–recall area under the curve (PR-AUC).
Precision measures the proportion of predicted repackaged apps that were correct, whereas recall measures the proportion of actual repackaged apps that were identified. F1-score balances these two metrics by computing their harmonic mean.
Given the imbalanced dataset, we also emphasized PR-AUC, which summarizes classifier performance across multiple thresholds. Branco et al. [31] recommend PR-AUC as a more reliable measure than ROC-AUC in imbalanced domains.
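These metrics map directly onto scikit-learn, where average_precision_score provides the standard single-number summary of the precision–recall curve:

```python
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred, y_score):
    """Compute the four reported metrics. y_pred holds hard labels;
    y_score holds the classifier's scores for the positive class,
    which PR-AUC needs to sweep decision thresholds."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "pr_auc": average_precision_score(y_true, y_score),
    }
```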
5.3. Model Performance
We trained and evaluated multiple supervised learning algorithms using the scikit-learn library, including Support Vector Machine (SVM), k-Nearest Neighbors (kNN), Gaussian Naive Bayes, Logistic Regression, Random Forest, Decision Tree, and a simple Feedforward Neural Network. Performance results are reported in Table 4. The findings indicate that:
Table 4.
Performance metrics of supervised learning algorithms on our dataset. Precision, recall, F1-score, and precision–recall area under the curve (PR-AUC) are reported for each classifier.
- Neural Network achieved the best recall (100%), successfully identifying all repackaged samples in the test set.
- SVM achieved the highest F1-score (85.9%), offering the best balance between precision and recall.
- Logistic Regression achieved the highest PR-AUC (87.8%), demonstrating stable performance across thresholds.
These outcomes are further visualized in Figure 4, which shows the precision–recall curves for the tested algorithms. Both kNN and Random Forest exhibited particularly strong PR curves, indicating robust trade-offs between false positives and false negatives.
Figure 4.
Precision–recall curves for the evaluated algorithms on our dataset. The curves illustrate the trade-off between precision and recall across thresholds, with area under the curve (PR-AUC) values highlighting overall detection performance.
Cross-Validation and Statistical Robustness. To strengthen the reliability of the comparative evaluation, we conducted a 5-fold cross-validation experiment on the final feature set. For each classifier, performance was averaged across folds, and standard deviations were computed to quantify variability. The cross-validation results were consistent with those obtained from the original 80/20 train–test split, confirming that the performance differences observed among classifiers are stable rather than artifacts of a single partition. As expected, Support Vector Machines and k-Nearest Neighbors continued to yield the strongest recall and F1-score profiles, with standard deviations typically below 0.02 across folds.
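The cross-validation protocol can be sketched with scikit-learn as follows; the classifier shown and the synthetic data in any usage are placeholders for our actual feature matrix:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def cv_f1(clf, X, y, folds=5, seed=0):
    """Stratified k-fold cross-validation reporting the mean and
    standard deviation of F1 across folds."""
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    return scores.mean(), scores.std()
```

Stratification preserves the repackaged/original ratio in every fold, which matters given the 72/28 class imbalance noted in Section 5.1.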
6. Discussion
6.1. Comparative Analysis with State-of-the-Art
We summarize the most prominent state-of-the-art techniques for repackaged app detection in Table 5. Several of these techniques are not open-sourced, making it difficult to re-implement or test them, or to reproduce their results. For comparison, we highlight the approach type, number of features used, evaluation scale, and detection performance of each method.
Table 5.
Summary of representative state-of-the-art techniques for repackaged app detection.
Our approach, RepackDroid, achieved recall = 98.8% and F1 = 85.9% while using only 20 features, though precision was somewhat lower (72.0%). Li et al.’s model [7] prioritizes precision through a large, feature-rich representation, whereas RepackDroid emphasizes recall and efficiency. This trade-off underscores the complementary strengths of the two approaches and makes RepackDroid more appropriate for marketplace-scale deployment, where missing a repackaged app (false negative) is more damaging than mistakenly flagging a legitimate one (false positive).
When compared with similarity-based detection methods, our system demonstrates a different performance profile. Similarity-based tools such as CodeMatch [4] and RomaDroid [6] often achieve precision scores above 90%, substantially higher than our average of 73.4% across eight algorithms. However, as discussed in Section 2, these methods depend on pairwise comparisons between original and candidate apps. While effective on small datasets, this approach does not scale to millions of applications due to the combinatorial explosion of comparisons. Indeed, most similarity-based studies evaluate their tools on relatively small test sets, limiting their applicability to real-world markets.
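The scaling gap can be stated concretely: pairwise similarity requires on the order of n(n−1)/2 comparisons among n apps, whereas per-app classification needs only n predictions. The market sizes below are illustrative round numbers, not figures from any specific store.

```python
# Quadratic growth of pairwise comparisons versus linear per-app
# classification, for illustrative market sizes.

def pairwise_comparisons(n_apps):
    """Number of unordered candidate pairs among n apps: n*(n-1)/2."""
    return n_apps * (n_apps - 1) // 2

for n in (1_000, 1_000_000):
    print(f"{n:,} apps: {pairwise_comparisons(n):,} pairs "
          f"vs {n:,} classifications")
```

At one million apps the pairwise approach already implies roughly 5 × 10^11 comparisons, which is why similarity-based studies typically restrict themselves to small test sets.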
In contrast, our method trades some precision for scalability, recall, and efficiency, which are essential characteristics for practical deployment in large-scale app stores. By reducing the feature space and eliminating the need for app pairs, RepackDroid provides a lightweight yet robust alternative to both similarity-based and high-dimensional supervised learning approaches.
6.2. Role of String Offset Order in Detection
A key distinguishing aspect of our framework is the ability to achieve competitive performance with only 20 features, compared to the 521 features employed by Li et al. [7]. This reduction is not merely a matter of computational efficiency—it also demonstrates that carefully selected, discriminative features can capture the essence of repackaging behavior without requiring high-dimensional representations.
Among the features incorporated in our model, the String Offset Order (SOO) anomaly proposed by Gonzalez et al. [22] via the AndroidSOO tool emerged as particularly impactful. SOO exploits the structural property of Dalvik executable files, where string identifiers should appear in alphabetical order according to the DEX specification. When an app is repackaged using tools such as apktool, this ordering is often disrupted, leaving a reliable “symptom” of tampering.
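The SOO symptom described above reduces to a simple order check. The sketch below is a minimal illustration under the assumption that the strings of the DEX string table have already been decoded (real extraction, as performed by AndroidSOO, parses the string_ids section of classes.dex); the function name and example strings are hypothetical.

```python
# Minimal sketch of the SOO symptom: the DEX specification requires
# string identifiers to be sorted, so an out-of-order string table is
# treated as a binary repackaging indicator.

def soo_anomaly(dex_strings):
    """Return 1 (symptom present) if strings are out of order, else 0."""
    return int(any(a > b for a, b in zip(dex_strings, dex_strings[1:])))

assert soo_anomaly(["Lcom/a;", "Lcom/b;", "onCreate"]) == 0  # compliant DEX
assert soo_anomaly(["onCreate", "Lcom/a;"]) == 1             # tampering symptom
```

Because the check is a single linear pass over the string table, SOO is far cheaper to compute than code-similarity or graph-based features, which is what makes it attractive as a lightweight symptom.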
Our experiments confirmed the utility of SOO when integrated into a broader supervised learning model. As shown in Table 6, the SOO feature alone demonstrates meaningful predictive power in distinguishing repackaged from original applications. More importantly, its combination with ICC, permissions, and sensitive API features improved overall accuracy and recall compared to models that excluded it. This aligns with the prediction made in Gonzalez et al.’s original work [22], but our contribution provides a stronger empirical validation: unlike their initial study, we evaluated SOO using the RePack dataset, which is specifically curated for repackaged application detection.
Table 6.
Performance metrics of the String Offset Order (SOO) feature. The results demonstrate the predictive capability of SOO as a standalone symptom discovery indicator for repackaged application detection.
To the best of our knowledge, this represents the first systematic assessment of SOO within a dataset explicitly designed for repackaging research. Previous work on SOO could not confirm its robustness in this context, as suitable benchmark datasets were not yet available. By demonstrating its effectiveness at scale, our study extends the relevance of SOO beyond a theoretical proof-of-concept to a practical, lightweight feature in modern repackaging detection pipelines.
This finding has two important implications. First, it highlights the value of symptom discovery approaches, which exploit structural artifacts left behind during repackaging. Second, it suggests that future work could explore additional low-cost structural features that may complement SOO, thereby reducing reliance on more computationally intensive feature sets. In particular, fine-tuning SOO thresholds, combining it with advanced ICC metrics, or applying it to more recent repackaging datasets could further improve both robustness and generalizability.
Our post hoc examination of misclassified original applications indicates that false positives typically arise from a small subset of features whose distributions overlap between large, complex original apps and repackaged variants. In particular, high API usage counts and elevated external ICC frequencies occasionally mirror patterns introduced during repackaging, especially in multifunctional or advertisement-heavy legitimate applications. We confirmed that these cases are not attributable to noise in the SOO feature; rather, they stem from behavioral and structural characteristics naturally present in certain original applications. Although removing these features reduces false positives, the resulting models exhibit a substantial decline in recall and overall F1-score. Because large-scale marketplace analysis prioritizes minimizing false negatives over marginal reductions in false positives, we retain these features and explicitly discuss this trade-off here. Future work will explore adaptive feature weighting, calibration techniques, and ensemble-based strategies to further mitigate false positives without compromising recall.
6.3. Summary of Results and Main Findings
The experimental evaluation demonstrates that our proposed approach, RepackDroid, offers a lightweight yet effective solution for detecting repackaged Android applications. Several key findings emerge from the results:
6.3.1. Competitive Performance with Fewer Features
Despite relying on only 20 features, our framework achieved an F1-score of 85.9% and a recall of 98.8%, both obtained with Support Vector Machines. This performance surpasses or rivals models that use hundreds of features [7]. The implication is that feature quality can outweigh feature quantity when features are carefully chosen to reflect structural and behavioral properties of Android applications.
6.3.2. High Recall Prioritized over Precision
A notable outcome is that our system emphasizes recall over precision. For large-scale app marketplaces, this trade-off is strategically valuable: missing a repackaged application (false negative) poses a greater risk to users and developers than mistakenly flagging an original app (false positive). By maximizing recall, our approach ensures that potentially harmful repackaged apps are rarely overlooked, leaving false positives to be handled by downstream verification processes.
6.3.3. Algorithm-Specific Strengths
Our evaluation across multiple classifiers revealed algorithm-specific trade-offs:
- Decision Tree achieved the highest precision (77.6%).
- Neural Network reached perfect recall (100%) but at the expense of lower precision.
- Logistic Regression yielded the strongest PR-AUC (87.8%), reflecting stability across thresholds.
- Support Vector Machine provided the best overall balance with the highest F1-score (85.9%).

This diversity suggests that ensemble methods, combining the strengths of different algorithms, could further enhance detection performance in future work.
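One low-cost realization of the ensemble direction suggested above is plain majority voting over per-classifier decisions. The sketch below is illustrative only: the prediction lists are placeholders, not outputs of our trained models.

```python
# Sketch of majority voting across classifiers' per-app predictions
# (1 = repackaged, 0 = original).
from collections import Counter

def majority_vote(*prediction_lists):
    """Combine per-app predictions from several classifiers by majority."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*prediction_lists)]

# Hypothetical predictions for four apps from three classifiers.
svm = [1, 0, 1, 1]
knn = [1, 0, 0, 1]
nnet = [1, 1, 1, 1]
combined = majority_vote(svm, knn, nnet)  # -> [1, 0, 1, 1]
```

Using an odd number of voters avoids ties; weighted voting (e.g., weighting the high-recall Neural Network differently from the high-precision Decision Tree) would be a natural refinement.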
6.3.4. Validation of Symptom Discovery
In addition to behavioral and structural features (ICC, permissions, APIs), our study validates the role of symptom discovery, specifically the String Offset Order anomaly. Its inclusion significantly enhanced classification performance, confirming that structural artifacts of repackaging can serve as powerful, low-cost indicators when combined with supervised learning.
6.3.5. Broader Implications
Overall, the findings suggest that lightweight, symptom-enhanced machine learning models can provide scalable, accurate, and efficient alternatives to existing repackaging detection techniques. This contributes directly to the pressing need for marketplace-ready solutions capable of handling high volumes of Android applications without sacrificing detection accuracy.
7. Threats to Validity
Although the proposed framework demonstrates strong empirical performance, several factors may affect the interpretation and generalizability of the results.
Internal validity. The effectiveness of the proposed framework depends on the completeness and accuracy of static feature extraction. Because ICC patterns, API counts, and manifest attributes are obtained from decompiled smali code, the resulting features may be affected by code transformations, multi-dex partitioning, identifier renaming, encrypted strings, and reflection-based intent construction. Such factors may lead to partial or incomplete extraction of ICC tuples and API usage counts. These challenges stem from the inherent limitations of static analysis and similarly affect prior repackaged-app detection approaches that rely on smali or DEX-level artifacts.
Construct validity. Class imbalance in the RePack dataset was handled using SMOTE during training. Because oversampling modifies the minority-class distribution, there is a risk of generating synthetic feature vectors that fall between regions of the feature space that do not occur naturally. This is particularly relevant for aggregate count-based metrics. However, the SOO feature is binary and preserved exactly, and ICC/API features represent frequency-based abstractions that remain meaningful under moderate interpolation. The test set retains the original class distribution, ensuring that evaluation metrics reflect performance on real applications.
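The interpolation at the heart of SMOTE, and the behavior noted above for the binary SOO feature, can be sketched as follows. This is a simplified illustration of the core SMOTE step, not the exact configuration of our pipeline; the feature values are hypothetical.

```python
# Core SMOTE step: a synthetic minority sample is an interpolation on
# the segment between a minority sample and one of its nearest minority
# neighbors.
import random

def smote_sample(x, neighbor, rng=random.Random(0)):
    """Interpolate a new feature vector between x and a neighbor."""
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]

x = [3.0, 10.0, 1.0]   # e.g. ICC count, API count, SOO flag
nb = [5.0, 14.0, 1.0]  # nearest minority neighbor
synthetic = smote_sample(x, nb)
# Each synthetic feature lies between the two real values; a binary
# feature shared by both samples (SOO = 1 here) is reproduced exactly.
```

This also makes the risk stated above concrete: interpolated count features may fall in regions of the feature space that no real app occupies, whereas a binary feature shared by the sample and its neighbor is preserved exactly.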
External validity. The evaluation is performed entirely on the RePack dataset, which is the only publicly available corpus containing verified original–repackaged application pairs. While this dataset is appropriate for assessing repackaging-specific structural deviations, it does not include systematically obfuscated or reflection-heavy variants representing more advanced repackaging techniques. Datasets commonly used for general malware detection, such as DREBIN [33] or AMD [34], lack original–repackaged labels and therefore cannot support meaningful assessment of repackaging detection. As a result, generalizability to heavily obfuscated or dynamically loaded applications remains an open direction.
Methodological validity. The comparison of supervised learning algorithms is based on an 80/20 split supplemented with 5-fold cross-validation to assess performance stability. Cross-validation reduces sensitivity to a single data partition, but differences among heterogeneous classifiers are not subjected to formal statistical hypothesis tests, which may limit interpretability of small performance variations. The study focuses on evaluating a lightweight feature set tailored to repackaging detection rather than benchmarking against deep-learning-based malware detectors, which operate under different learning objectives and datasets. This methodological choice aligns the evaluation with the specific task of detecting repackaged applications rather than general malware classification.
8. Conclusions and Future Work
In this paper, we presented RepackDroid, a supervised learning-based framework for detecting repackaged Android applications. Our approach demonstrates that accurate and scalable detection can be achieved with a compact feature set, overcoming the limitations of prior work that relied on extensive feature engineering or computationally expensive similarity comparisons. By leveraging only 20 carefully selected features, including inter-component communication (ICC) metrics, permission and API usage, and the structural anomaly of String Offset Order (SOO), we achieved competitive results compared to state-of-the-art methods that use more than 500 features.
Our evaluation on the RePack dataset confirmed that RepackDroid provides strong detection performance, with a maximum F1-score of 0.859 and recall of 98.8%. These findings highlight the effectiveness of lightweight features, especially SOO, as practical indicators of repackaging. To the best of our knowledge, this is the first study to systematically assess SOO within a dataset specifically curated for repackaging research, thereby extending its relevance from a proof-of-concept indicator to a validated component of an integrated detection framework. Beyond raw performance, the results demonstrate important trade-offs. While our average precision was lower than that of similarity-based approaches, our system emphasizes recall and efficiency, which are crucial for real-world deployment in large-scale marketplaces where false negatives—missed repackaged apps—pose the greatest risk. The lightweight nature of our framework makes it suitable for integration into continuous app vetting pipelines.
There are opportunities for improvement. Future work should explore additional low-cost structural symptoms of repackaging, refine ICC and API-based features, and investigate ensemble models that combine the strengths of different classifiers. Expanding evaluations to larger and more diverse datasets, including recently released applications and obfuscated variants, will also be essential to confirm the generalizability of our findings.
Author Contributions
Conceptualization, T.L. and K.E.; methodology, T.L. and K.E.; software, T.L.; validation, T.L. and K.E.; formal analysis, T.L.; investigation, T.L.; resources, T.L.; data curation, T.L.; writing—original draft preparation, T.L.; writing—review and editing, K.E.; visualization, T.L.; supervision, K.E.; project administration, K.E. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The raw data supporting the conclusions of this article will be made available by the authors on request.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Living in a Multi-Device World with Android. Available online: https://blog.google/products/android/io22-multideviceworld/ (accessed on 1 October 2025).
- Zhou, Y.; Jiang, X. Dissecting Android Malware: Characterization and Evolution. In Proceedings of the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 20–23 May 2012; pp. 95–109.
- Trend Micro. A Look at Repackaged Apps and their Effect on the Mobile Threat Landscape. Trend Micro Blog. Available online: http://blog.trendmicro.com/trendlabs-security-intelligence/a-look-into-repackaged-apps-and-its-rolein-the-mobile-threat-landscape/ (accessed on 17 December 2021).
- Glanz, L.; Amann, S.; Eichberg, M.; Reif, M.; Hermann, B.; Lerch, J.; Mezini, M. CodeMatch: Obfuscation won’t conceal your repackaged app. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, Paderborn, Germany, 4–8 September 2017; pp. 638–648.
- Zhou, W.; Zhou, Y.; Jiang, X.; Ning, P. Detecting repackaged smartphone applications in third-party android marketplaces. In Proceedings of the Second ACM Conference on Data and Application Security and Privacy, San Antonio, TX, USA, 7–9 February 2012; pp. 317–326.
- Kim, B.; Lim, K.; Cho, S.-J.; Park, M. RomaDroid: A Robust and Efficient Technique for Detecting Android App Clones. IEEE Access 2019, 7, 72182–72196.
- Li, L.; Bissyandé, T.F.; Klein, J. Rebooting Research on Detecting Repackaged Android Apps: Literature Review and Benchmark. IEEE Trans. Softw. Eng. 2019, 47, 676–693.
- Li, L.; Bissyandé, T.; Klein, J. RePack. Github. Available online: https://github.com/serval-snt-uni-lu/RePack (accessed on 17 December 2024).
- Wolfe, B.; Elish, K.; Yao, D. Comprehensive Behavior Profiling for Proactive Android Malware Detection. In Proceedings of the Information Security Conference (ISC), Hong Kong, China, 12–14 October 2014.
- Zhou, Y.; Wang, Z.; Zhou, W.; Jiang, X. Hey, You, Get off of My Market: Detecting Malicious Apps in Official and Alternative Android Markets. In Proceedings of the 19th Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 5–8 February 2012.
- Dave, D.D.; Rathod, D. Systematic review on various techniques of android malware detection. In Communications in Computer and Information Science; Springer International Publishing: Cham, Switzerland, 2022; pp. 82–99.
- Wolfe, B.; Elish, K.; Yao, D. High Precision Screening for Android Malware with Dimensionality Reduction. In Proceedings of the International Conference on Machine Learning and Applications, Detroit, MI, USA, 3–6 December 2014.
- Alzaylaee, M.K.; Yerima, S.Y.; Sezer, S. DL-Droid: Deep learning based Android malware detection using real devices. Comput. Secur. 2020, 89, 101663.
- Elish, K.; Yao, D.; Ryder, B. User-Centric Dependence Analysis for Identifying Malicious Mobile Apps. In Proceedings of the IEEE Mobile Security Technologies (MoST) Workshop, in Conjunction with the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 24 May 2012.
- Elish, K.O.; Shu, X.; Yao, D.D.; Ryder, B.G.; Jiang, X. Profiling user-trigger dependence for Android malware detection. Comput. Secur. 2014, 49, 255–273.
- Martín, I.; Hernández, J.A. CloneSpot: Fast detection of Android repackages. Future Gener. Comput. Syst. 2019, 94, 740–748.
- Zhou, W.; Zhang, X.; Jiang, X. AppInk: Watermarking android apps for repackaging deterrence. In Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, Hangzhou, China, 8–10 May 2013.
- Nguyen, T.; Mcdonald, J.T.; Glisson, W.; Andel, T. Detecting Repackaged Android Applications Using Perceptual Hashing. In Proceedings of the 53rd Hawaii International Conference on System Sciences, Maui, HI, USA, 7–10 January 2020.
- Lin, Y.-D.; Lai, Y.-C.; Chen, C.-H.; Tsai, H.-C. Identifying Android Malicious Repackaged Applications by Thread-Grained System Call Sequences. Comput. Secur. 2013, 39, 340–350.
- Tian, K.; Yao, D.; Ryder, B.G.; Tan, G. Analysis of Code Heterogeneity for High-Precision Classification of Repackaged Malware. In Proceedings of the 2016 IEEE Security and Privacy Workshops (SPW), San Jose, CA, USA, 22–26 May 2016; pp. 262–271.
- Wu, D.J.; Mao, C.H.; Wei, T.E.; Lee, H.M.; Wu, K.P. Droidmat: Android Malware Detection Through Manifest and API Calls Tracing. In Proceedings of the 2012 Seventh Asia Joint Conference on Information Security, Tokyo, Japan, 9–10 August 2012.
- Gonzalez, H.; Kadir, A.; Stakhanova, N.; Alzahrani, A.; Ghorbani, A. Exploring reverse engineering symptoms in Android apps. In Proceedings of the Eighth European Workshop on System Security, Bordeaux, France, 21 April 2015.
- Androzoo. Available online: https://androzoo.uni.lu/ (accessed on 10 September 2023).
- Apktool: A Tool for Reverse Engineering Android Apk Files. Available online: https://apktool.org/ (accessed on 15 September 2023).
- Baksmali. Available online: https://github.com/JesusFreke/smali (accessed on 15 September 2023).
- Elish, K.; Cai, H.; Barton, D.; Yao, D.; Ryder, B. Identifying Mobile Inter-App Communication Risks. IEEE Trans. Mob. Comput. 2020, 19, 90–102.
- Tian, K.; Yao, D.; Ryder, B.G.; Tan, G.; Peng, G. Detection of Repackaged Android Malware with Code-Heterogeneity Features. IEEE Trans. Dependable Secur. Comput. 2017, 17, 64–77.
- Permissions on Android: Android Developers. Available online: https://developer.android.com/guide/topics/permissions/overview (accessed on 18 March 2025).
- Sun, B.; Li, Q.; Guo, Y.; Wen, Q.; Lin, X.; Liu, W. Malware family classification method based on static feature extraction. In Proceedings of the 2017 3rd IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 13–16 December 2017; pp. 507–513.
- SMOTE—Azure Machine Learning. Microsoft Docs. Available online: https://docs.microsoft.com/en-us/azure/machine-learning/component-reference/smote (accessed on 21 July 2025).
- Branco, P.; Torgo, L.; Ribeiro, R. A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput. Surv. 2017, 49, 31:1–31:50.
- Li, L.; Bissyandé, T.F.; Klein, J. SimiDroid: Identifying and explaining similarities in Android apps. In Proceedings of the 2017 IEEE Trustcom/BigDataSE/ICESS, Sydney, NSW, Australia, 1–4 August 2017.
- Arp, D.; Spreitzenbarth, M.; Hübner, M.; Gascon, H.; Rieck, K. Drebin: Efficient and Explainable Detection of Android Malware in Your Pocket. In Proceedings of the 21st Annual Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 23–26 February 2014.
- Wei, F.; Li, Y.; Roy, S.; Ou, X.; Zhou, W. Deep Ground Truth Analysis of Current Android Malware. In Proceedings of the 14th International Conference on the Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), Bonn, Germany, 6–7 July 2017.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).