The Malware Detection Approach in the Design of Mobile Applications

Background: security has become a major concern for smartphone users in line with the increasing use of mobile applications, which can be downloaded from unofficial sources. These applications make users vulnerable to penetration and viruses. Malicious software (malware) is unwanted software that is frequently used by cybercriminals to launch cyber-attacks. Therefore, the motive of the research was to detect malware early before infection by discovering it at the application-design level and not at the code level, where the virus will have already damaged the system. Methods: in this article, we proposed a malware detection method at the design level based on reverse engineering, the unified modeling language (UML) environment, and the web ontology language (OWL). The proposed method detected "Data_Send_Trojan" malware by designing a UML model that simulated the structure of the malware. Then, by generating the ontology of the model, and using RDF query language (SPARQL) to create certain queries, the malware was correctly detected. In addition, we proposed a new classification of malware that was suitable for design detection. Results: the proposed method detected Trojan malware that appeared 552 times in a sample of 600 infected Android application packages (APK). The experimental results showed a good performance in detecting malware at the design level with precision and recall of 92% and 91%, respectively. As the dataset increased, the accuracy of detection increased significantly, which made this methodology promising.


Introduction
Producing secure and high-quality software remains an ongoing research challenge. Software systems must fulfill quality characteristics such as reliability, usability, and maintainability. Over the last few years, design patterns have been shown to play a vital role in software engineering, IoT, security, mobile applications, and many other fields of computer science [1][2][3][4]. A good design pattern supports a sound software design [1,5]. Software engineers frequently reuse existing design patterns when developing software systems and when addressing recurring issues such as errors, high costs, and long development times [2,6]. Many design patterns are available on the internet for reuse, but reusing patterns that contain undesirable elements results in poor quality and poor safety. Anti-patterns and malware share harmful factors that degrade software quality [5,7]. Many studies have assessed the negative impact of anti-patterns on change-proneness, comprehension, reliability, fault-proneness, security, performance, and energy efficiency [8][9][10]. At the same time, many empirical studies have assessed the negative impact of malware on performance [11,12], reliability [13], energy consumption [14], and other quality elements [7]. We noted that most of the research detected malware at the source code level, which is considered late [15][16][17]. The proposed approach detected malware at the design level, which is a new detection method for malware. In addition, the design level is a phase before the coding phase, in which the model of the application is shown using any modeling tool; hence, detection at this level avoids the time lost in reusing application patterns from the Internet whose new versions contain malware. The proposed approach was a detection method for malware before coding and installation.
The contribution of the proposed approach is a new detection method at the design level for malware. According to the joint effects of anti-patterns and malware, the proposed approach studied the probability of malware detection at the design level as anti-pattern detection.
Motivated by the research mentioned above, the major contributions of this paper were fivefold:
• A method for reversing the UML class model of malware using the Modelio modeling tool was presented.
• A suitable classification method for malware for design detection was presented.
• Existing solutions for anti-pattern detection were investigated to see if they could detect malware or not.
• A semantic method for detecting malware at the design level was presented.
• The evaluation of the proposed method in a dataset of 600 mobile applications was described.
The paper is structured as follows. To begin we present the related works. Next, we present the basic definition of both anti-patterns and malware. Then, the reverse engineering concept that was used in the proposed research is illustrated. After that, we describe the details of the proposed methodology, and the results and discussion are presented. Finally, the conclusions and future work are presented.

Related Work
Many previous studies have demonstrated the negative impact of both malware and anti-patterns on software and its quality. Malware and anti-pattern detection has been an active area of research in recent years. Several techniques and methods have been suggested to counter the growing amount and sophistication of malware and anti-patterns. Machine and deep learning techniques have contributed significantly to the literature on malware detection, as well as anti-pattern detection. We refer to the latest of these approaches and learning techniques related to this research.
Malware detection techniques range from early signature-based detection to machine and deep learning techniques [11,18,19]. There are distinct basic malware analysis techniques. In [20][21][22][23][24], the authors classified malware analysis into static analysis, dynamic analysis, and a hybrid of both. This analysis was an essential step in the malware detection process as it was a way of knowing how malware performs its function, how to identify it, and how to defeat it. The analysis in [25] was based on static PE file features of malware samples and observed that linear SVM models could be useful in detecting the evolution of malware. After that, the authors of [26] expanded on and improved upon the work in [25] in several ways. Recently, machine learning and deep learning methods (e.g., support vector machines (SVM), decision trees (DT)) have been used to detect and classify unknown samples for malware families due to their scalability, rapidity, and flexibility. In addition, machine learning and data mining techniques are combined with present detection techniques in order to facilitate the process of detection [27]. The research in [28] proposed a malware detection method called MalNet that learned features automatically from raw data. It learned from grayscale images and opcode sequences extracted from malware files by using two deep neural networks: CNN and LSTM. The results showed that MalNet achieved 99.88% validation accuracy for malware detection. In [29], the authors proposed a hybrid model based on a deep autoencoder (DAE) and convolutional neural network (CNN) to improve the accuracy of malware detection compared with traditional machine learning methods. In addition, [30] provided a malware detection model using a deep convolutional neural network (CNN) in a metamorphic malware environment.
On the other hand, several studies proposed the detection of design anti-patterns. The previous work, [31], introduced 40 types of design anti-patterns that formed the basis of design anti-pattern detection approaches [32][33][34][35]. The approach in [36] presented the ONTOPYTHO technique to detect smells and anti-patterns in the design of OWL Ontologies based on a metric method via the semantic web query language, SPARQL, and the Python programming language. In [37], the authors focused on detecting mobile applications' anti-patterns and proposed a method using reverse engineering and a UML modeling environment. This research presented a comparative study on nine UML tools and concluded that the Modelio UML modeler was a suitable tool for the detection process.
In addition, [1] introduced a general method that supported semantic and structural anti-pattern detection at the design level to automatically detect anti-patterns by using Modelio, OLED, and Protégé in a specific order to obtain positive results.
According to the symmetry between malware and anti-patterns, and the ability to detect design anti-patterns, the proposed research aimed to detect malware at the design level.

Design Patterns and Malware
According to [38], design patterns are defined as solutions that developers reuse to solve recurring problems in software systems and to improve reusability and quality. Every pattern has its design and the features of the anti-pattern that it resolves. Anti-patterns exist at various levels of software development, such as design, coding, architecture, community, organization, environment, collaboration, etc. Examples of anti-patterns include a bad practice, a wrong reaction to a combination of events, and a failure to predict, understand, or control a project factor. Malware is defined as the class of software that includes viruses, worms, and Trojans [39]. Malware is specifically designed to destroy or steal data, to damage hosts or networks, or generally to perform other "bad" or illegal operations on data, hosts, or networks. We contemplated the question of treating malware as an anti-pattern and creating a base for detecting it at the design level, while, at the same time, treating every anti-virus as a design pattern for detecting a certain malware and describing them using any modeling tool. This was a starting point for collecting all malware, its design features, and the anti-virus used against it. As a result, we could create a semantic catalog of malware that allowed developers to detect the existence of malware at the design level rather than later at the code level. First, we needed to answer the research questions.

Reverse Engineering
The idea of this research was based on reverse engineering. Given the structure of Android applications, we needed to reverse the source code in order to generate its design. First, we extracted the zip files of the APK that were in the JAR format and directly dealt with the Dex file, which was decompiled to generate Java files. For the decompilation, we used (Android Decompiler-master) to obtain the Dex2jar files. Then, we used (JD-GUI-1.6.6) to obtain the Java sources. Using Modelio 3.6, we generated the class diagram model of the application. According to the results in [1], Modelio was a suitable UML modeler for reversing the class diagram and for detecting the anti-patterns. This was performed for uninfected mobile applications, but the question here was, could we do this for infected applications?
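The first step of this pipeline, pulling the Dex file out of the APK archive, can be illustrated with a minimal sketch. It relies only on the fact that an APK is a ZIP archive; the function names and the stand-in archive built by build_demo_apk are assumptions made for illustration, and the later decompilation and Modelio steps are not covered.

```python
import io
import zipfile


def list_dex_entries(apk_bytes: bytes) -> list[str]:
    """Return the names of the DEX files stored inside an APK archive."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return sorted(name for name in apk.namelist() if name.endswith(".dex"))


def extract_dex(apk_bytes: bytes) -> dict[str, bytes]:
    """Read every DEX entry into memory, keyed by its archive name."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return {name: apk.read(name) for name in apk.namelist() if name.endswith(".dex")}


def build_demo_apk() -> bytes:
    """Build a tiny stand-in archive (hypothetical content, not a real APK)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        apk.writestr("AndroidManifest.xml", "<manifest/>")
        apk.writestr("classes.dex", b"dex\n035\x00placeholder")
        apk.writestr("res/layout/main.xml", "<LinearLayout/>")
    return buffer.getvalue()


if __name__ == "__main__":
    # Each extracted payload would then be handed to the decompiler.
    for name, payload in extract_dex(build_demo_apk()).items():
        print(name, len(payload), "bytes")
```

For a real application, the bytes would be read from the downloaded APK file rather than generated in memory.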


Reverse the Infected Applications
According to [1,37] the class diagram model of mobile applications was generated after reversing the java code. It was proven that the applications had many anti-patterns. According to this, the research reversed a sample of 42 APK in the dataset. Using the mentioned tools, we reversed the class diagram model of the applications, as shown in Figure 1.
This research also studied the reversal of thirteen infected programs, as in Table 1. The research study included the identification of malware across different types, different sizes, and different languages (C++, Java, Golang). According to this, we could reverse the infected software.

Malware Identification
From the sample in Table 1, we noted that there were many different types of malware. Therefore, as a start, we reversed all the types and were ready for detection. However, we should note that only infected applications could be checked at the design level. This was because we controlled the software and we could check it before installing it. We knew that malware, such as worms, infected the system through spam emails, which were not controlled by the user. Therefore, we could not detect all malware types at the design level.

Proposed Methodology
The main purpose of this research was to study the detection of malware at the design level. To do so, we needed to classify malware according to the way it infected systems. This helped us to determine the type of malware that the proposed methodology could detect at the design level.

Malware Classification
The initial step in the proposed methodology was malware classification. In general, malware can spread as a virus, a worm, or a Trojan. Every type has its own infection method. Malware may infect a device through a visit to a hacked website, opening spam emails, downloading infected files or applications, using infected discs, or clicking on untrusted advertisements; these are the malware sources. We tried to detect malware in the downloaded applications before installing them. The proposed research examined the code reversed from the APK by generating a class diagram of it.

Proposed Detection Method
In this section, we propose the method for detecting malware at the design level. The pseudocode of the proposed method is presented in Algorithm 1.
Algorithm 1
[...]
4 if the result is ≥ 1;
5 print "malware detected"
6 else print "not detected";
7 convert the model to OWL Ontology;
8 run the reasoner;
9 if the result ≥ 1;
10 print "malware detected in ontology";
11 else run the detection method algorithm;
12 print "Detected Malwares";
13 else print (" ");
14 End.
In addition to reversing the UML class diagram model, we generated the OWL Ontology of the model using the Ontology editor. The model could then be checked for anti-patterns and malware for improving quality and security, as proposed in Figure 2.
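One possible reading of the staged control flow in Algorithm 1 is sketched below. It is an interpretation under stated assumptions, not the authors' implementation: the three callables are hypothetical stand-ins for the UML model check, the ontology reasoner, and the query-based detection method, each returning a count of findings.

```python
from typing import Callable

# Each stage is modeled as a callable returning the number of findings.
Check = Callable[[], int]


def run_detection(model_check: Check, reasoner_check: Check, query_check: Check) -> str:
    """Run the staged checks in order and report the first stage that fires."""
    if model_check() >= 1:
        return "malware detected"
    # Nothing flagged in the UML model: move on to the OWL ontology.
    if reasoner_check() >= 1:
        return "malware detected in ontology"
    # Reasoner silent: fall back to the dedicated detection queries.
    if query_check() >= 1:
        return "Detected Malwares"
    return "not detected"


if __name__ == "__main__":
    # Hypothetical outcome: only the query stage reports findings.
    print(run_detection(lambda: 0, lambda: 0, lambda: 2))
```

Passing the stages as callables keeps the sketch independent of any particular modeling tool or reasoner.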

Case Study in "Data_Send_Trojan"
To explain the proposed method, we present a snapshot of it in the case study of "Data_Send_Trojan". Data_Send_Trojan is a kind of Trojan virus that retrieves the important information of the users. This information typically includes credit card information, email addresses, passwords, instant messaging contact lists, log files, and so on. It is a part of the Trojan banker. To implement the proposed method, we used the Modelio UML modeler to present the class diagram model of the application. In addition, we examined the ability of Modelio to detect malware at the design level. Next, by converting the model to the OWL Ontology model and running SPARQL queries, the detection was performed. Figure 3 presents the UML class diagram model that simulates Data_Send_Trojan. The model contained three classes: the first class presented the user information, which was the attributes of the class; the second class presented the class of the virus containing the operations that would obtain the user information; finally, the third class presented the hacker that would receive the user information from the virus class.
The main components of an OWL Ontology are classes, datatype properties, which present the attributes in UML models, and object properties, which present the operations. Every object property has a range and a domain. The range presents the class holding the operation, while the domain presents the class of the values. By converting the UML model of the virus to OWL Ontology, we obtained the ontology model in Figure 4. Bank:Bank is the root class of the ontology. It had three subclasses, which were Bank:Data_send_Trojan, Bank:Hacker, and Bank:Users. There were two object properties, the first was (Bank:getUserInformation) and the second was (Bank:receiveUserInformation).
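The structure of this ontology can be written down as plain data, which may help when reading the queries that follow. This is only an illustrative sketch: the Python classes are not an OWL serialization, the attribute names under Bank:Users are hypothetical placeholders, and the range and domain roles follow the way this section uses those terms.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectProperty:
    """An operation from the UML model, with the classes it connects."""
    name: str
    range_class: str   # class holding the operation, as this section uses "range"
    domain_class: str  # class supplying the values, as this section uses "domain"


@dataclass
class Ontology:
    root: str
    subclasses: list[str] = field(default_factory=list)
    datatype_properties: dict[str, list[str]] = field(default_factory=dict)
    object_properties: list[ObjectProperty] = field(default_factory=list)

    def holder_of(self, property_name: str) -> str:
        """Return the class that holds the named object property."""
        for prop in self.object_properties:
            if prop.name == property_name:
                return prop.range_class
        raise KeyError(property_name)


def build_bank_ontology() -> Ontology:
    return Ontology(
        root="Bank:Bank",
        subclasses=["Bank:Data_send_Trojan", "Bank:Hacker", "Bank:Users"],
        # Attribute names below are hypothetical placeholders.
        datatype_properties={"Bank:Users": ["creditCard", "email", "password"]},
        object_properties=[
            ObjectProperty("Bank:getUserInformation", "Bank:Data_send_Trojan", "Bank:Users"),
            ObjectProperty("Bank:receiveUserInformation", "Bank:Hacker", "Bank:Data_send_Trojan"),
        ],
    )


if __name__ == "__main__":
    onto = build_bank_ontology()
    for prop in onto.object_properties:
        print(prop.name, "held by", prop.range_class, "values from", prop.domain_class)
```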

The range of the (getUserInformation) object property was the (Data_Send_Trojan) class, which held the object property, while its domain was the (Users) class, which had user information as datatype properties. The range of the (receiveUserInformation) object property was the (Hacker) class, which held the object property. The domain was the class (Data_send_Trojan), which, in this case, presented the class containing user information.
The challenge was getting all user information and sending it to the hacker class. In other words, both the object properties had to contain the same values and the same values must be the user information.
Next, SPARQL queries were used to retrieve the user information and the values of both object properties.
Q1. [...] } The result of Q1 is shown in Figure 5.
Q2. SPARQL query for retrieving the (getUserInformation) object property of the (Data_send_Trojan) class. [...] ?y ?x ?z. } The result of Q2 is shown in Figure 6.
Q3. [...] ?y ?x ?z. } The result of Q3 is shown in Figure 7.
We noted from the results in Figures 5-7 that the values of the object properties of Data_Send_Trojan and the hacker classes were the same user information.
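The detection condition described here, namely that both object properties carry the same values and that those values are user information, can be illustrated without a SPARQL engine. The sketch below is an assumption-laden stand-in: triples are plain Python tuples, None plays the role of a variable in a pattern such as ?y ?x ?z, and the individuals, the hasPassword and hasEmail predicates, and the values are all hypothetical.

```python
from typing import Optional

Triple = tuple[str, str, str]

# Hypothetical individuals and values, for illustration only.
TRIPLES: list[Triple] = [
    ("user1", "hasPassword", "secret123"),
    ("user1", "hasEmail", "user@example.com"),
    ("trojan1", "getUserInformation", "secret123"),
    ("trojan1", "getUserInformation", "user@example.com"),
    ("hacker1", "receiveUserInformation", "secret123"),
    ("hacker1", "receiveUserInformation", "user@example.com"),
]


def match(triples: list[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> list[Triple]:
    """Return the triples matching a pattern; None acts like a query variable."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def values_of(triples: list[Triple], predicate: str) -> set[str]:
    """Collect the object values attached to one predicate."""
    return {o for _, _, o in match(triples, p=predicate)}


def looks_like_data_send_trojan(triples: list[Triple], user_predicates: list[str]) -> bool:
    """Check that both operations carry the same values and that these are user data."""
    user_info: set[str] = set()
    for predicate in user_predicates:
        user_info |= values_of(triples, predicate)
    obtained = values_of(triples, "getUserInformation")
    received = values_of(triples, "receiveUserInformation")
    return bool(obtained) and obtained == received and obtained <= user_info


if __name__ == "__main__":
    print(looks_like_data_send_trojan(TRIPLES, ["hasPassword", "hasEmail"]))
```

Requiring the obtained values to be a subset of the user information, rather than equal to it, is a design choice of this sketch; a stricter check could demand equality.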

Evaluation Measures
To evaluate the detection performance, it was necessary to identify appropriate performance metrics. The following five measures were employed to evaluate the performance of the proposed model. Precision is the proportion of returned results that are relevant. Recall (sensitivity) is the proportion of relevant results that are returned. The F-Measure estimates the overall system performance by combining precision and recall into a single number. The maximum value of 1.000 indicates the best result.
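As a concrete illustration, the measures can be computed from confusion-matrix counts. The sketch assumes the standard definitions, under which recall and the true-positive rate are the same quantity; the function names and the counts in the example are hypothetical and are not taken from the experiments.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluation_measures(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute TPR, FPR, precision, recall, and F-measure from raw counts."""
    tpr = tp / (tp + fn) if tp + fn else 0.0        # true-positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0        # false-positive rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tpr                                    # standard definition: recall equals TPR
    return {
        "tpr": tpr,
        "fpr": fpr,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure(precision, recall),
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the formulas.
    for name, value in evaluation_measures(tp=90, fp=10, tn=80, fn=20).items():
        print(f"{name}: {value:.3f}")
```

With a precision of 0.92 and a recall of 0.91, f_measure evaluates to roughly 0.915.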

Result and Discussion
In this section, we assessed how well the proposed approach could predict the security of android applications. For evaluating the proposed approach, we applied it to a sample of 600 malware of the CICMalDroid 2020 dataset presented by the Canadian Institute for Cybersecurity. The CICMalDroid 2020 dataset has 11,598 android samples with five distinct categories. This dataset can be found in [40].
The (Data_Send_Trojan) malware was detected 552 times by using the general SPARQL query Q4.
Q4. SPARQL query for detecting the "Data_Send_Trojan" generally.
We applied the proposed approach to a sample of 600 Trojan malware using certain SPARQL queries. We compared the proposed approach to other approaches in the section of related works [18,19,23,27,29,41] as in Table 2. The related references detected malware using different methods. References [19,27,29] used a deep convolutional neural network and two deep learning models, DexCNN and DexCRNN, while reference [18] used ensemble learning and big data. Reference [23] used social network properties and community detection while reference [41] used meta-learning. The proposed approach used a semantic environment for detecting malware. The comparison criteria included the method of detection, the used dataset, the accuracy, and the detection level.
It is worth noting that some limitations were encountered when applying the proposed method at the preprocessing and data converting stages, because these stages were performed semi-manually, one sample at a time, and took a long time. As a result, the proposed method was tested on a relatively small number of samples compared to other research that detected malware at the code level.
As shown in Table 3, the proposed model yielded 82.8% TPR, 17.2% FPR, 92% precision, 91% recall, and 91.4% F-measure. We considered this result as a good starting point for detecting malware at the design level. Additionally, Figure 8 illustrates our experimental receiver operating characteristic (ROC) curve plot of the proposed model. The ROC curve plots the true-positive rate against the false-positive rate. The X-axis represents the false-positive rate (FPR) and the Y-axis represents the true-positive rate (TPR). The area below the ROC curve, known as AUC, is widely utilized to evaluate the performance of malware detection models.
A data point in the upper left corner and a higher AUC value correspond to better performance. As can be seen from Figure 8, the proposed method yielded an AUC score of 91%.
According to this result, we could detect malware at the design level. However, the ability of the used tools in anti-pattern detection to detect malware needs to be verified. This research used a Modelio checker to check the reversed model against the malware. However, the result of the check was zero detection, as in Figure 9. We could see the model and the tool check against any quality problems and the result was zero for errors and warnings.
In addition, the researchers used the reasoner of ontology and the ONTOPYTHO approach that was presented in [37]. ONTOPYTHO was used to detect anti-patterns at the design of OWL Ontologies and the same result was zero detected, as in Figure 10. According to these results, the tools that were used to detect anti-patterns at the design level could not detect malware at the same level.


Conclusions
This study presented a malware detection method based on OWL Ontology, reverse engineering, and the semantic web query language SPARQL. This method detected malware in the design of mobile applications. By reversing the source code, the UML class diagram model was generated. This UML model was converted to OWL Ontology to detect the malware. To evaluate the method's performance, a sample of 600 APK mobile applications was selected from the CICMalDroid 2020 dataset. This sample was infected by Trojan malware, which was detected 552 times by running special SPARQL queries on the design ontologies. The experimental results showed that the proposed method detected Data_Send_Trojan malware at the design level with 92% precision, 91% recall, 91.4% F-measure, and an AUC score of 91%. Furthermore, the results showed that anti-pattern detection tools were not suitable for detecting malware. Compared with state-of-the-art methods, the proposed method was considered the first method for detecting malware at the design level; however, we performed most of the work manually, which resulted in a small number of processed samples.
In future research, we will continue to explore other methods for detecting other malware at the design level. In addition, we will try to build a cloud system that performs the preprocessing conversion steps automatically instead of manually; this would address the problems related to the size of the dataset, improve the results, and allow a greater number of malware to be discovered.