Deep Learning-Based Defect Prediction for Mobile Applications

Smartphones have enabled the widespread use of mobile applications. However, mobile applications can contain unrecognized defects that harm businesses through a negative user experience. To avoid this, defects should be detected and removed before release. This study aims to develop a defect prediction model for mobile applications. We performed cross-project and within-project experiments and used deep learning algorithms, such as convolutional neural networks (CNN) and long short-term memory (LSTM), to develop a defect prediction model for Android-based applications. Based on our within-project experimental results, the CNN-based model provides the best performance for mobile application defect prediction, with an average area under the ROC curve (AUC) of 0.933. For cross-project mobile application defect prediction, there is still room for improvement when deep learning algorithms are preferred.


Introduction
Mobile applications are rapidly evolving. Most previous applications were developed for the management of emails, contacts, calculations, and schedules, but nowadays, we see very diverse mobile applications that change our lives dramatically. Technology has also rapidly changed during the last decade, which eventually affected the mobile application domain.
Technologies such as cloud computing, the Internet of Things (IoT), and Blockchain have helped mobile application developers build productivity applications that we did not anticipate in the past. For instance, with the help of cryptocurrencies and/or Blockchain infrastructure, people can easily transfer their savings (e.g., USD) from one country to another, because banks have already started to set up Blockchain networks (e.g., RippleNet) that allow international money transfers between banks without the specific protocols that used to affect transfer duration. Nowadays, international money transfers are very fast, like many other activities we perform these days.
Different mobile applications have been developed recently. For instance, people have discovered that smartphones offer more functional features, such as a mobile sales assistant with near-field communication (NFC), which enables mobile devices to identify products by code and display them on screen; this application lets people reach the products they want faster [1]. Another benefit of mobile applications is the ability to process credit card information for payment transactions or to use them for faster payments. The last example is a SoftPOS solution that converts NFC-enabled Android devices into regular POS terminals without the need for any additional device.
This enables customers to have an in-store shopping experience while shopping from anywhere. Since Android devices can be bought by anyone, small businesses can benefit from this solution at low cost.
In parallel to this rapid progress in software development technologies, user expectations are also increasing. A small bug in an application detrimentally affects the reputation of the software company; users have no tolerance for faults and failures. When mobile application software is defective, it dramatically affects the user experience and the software development company's reputation [2]. Insufficient testing of mobile applications can cause serious problems after release [3]. A defective application causes various difficulties and wastes time [4]. To minimize the number of faults and failures, analyzing previous mobile application defects provides the possibility of fixing them before deployment [5].
So far, software fault analysis and prevention have been performed using different techniques, such as static code inspection, early prototyping, fault-tolerant design, defensive programming, glass-box and black-box testing, and test automation. Alternatively, predictive models use past software metrics and data to predict faulty components in upcoming software releases. When fault data from previous versions are available, they are used together with software metrics; most models so far use machine learning algorithms. While many models have been developed for desktop and web applications, the number of models for mobile applications is rather limited, and their performance is not very satisfactory.
This study aims to develop a fault prediction model for mobile applications. We focused on Android mobile applications because they are open source and the experiments are repeatable. In this research, we developed our models using deep learning and shallow learning techniques and applied them to 14 Android application datasets to show the performance of the models. We designed different deep learning-based models for software fault prediction in mobile applications. We evaluated the following research questions in this research:
• RQ1. To what extent can we predict software faults for mobile apps in the cross-project setting using deep learning approaches? The goal here is to understand how deep learning algorithms perform on mobile applications.
• RQ2. Can we improve the performance of deep learning models using data balancing approaches? The aim of this research question is to evaluate the effect of data balancing techniques.
• RQ3. Can we improve the performance of deep learning-based fault prediction models using feature engineering techniques?
• RQ4. To what extent can an LSTM-based deep learning architecture predict software faults for mobile apps?
The paper is organized as follows: Section 2 presents the background and related work; Section 3 discusses the research methodology; Section 4 shows the model development; Section 5 explains the experimental results; Section 6 presents the discussion; and Section 7 presents the conclusions.

Background and Related Work
An early study on the identification of defects in software was presented by Akiyama [6]. However, the initial defect prediction studies using machine learning techniques started in the early 2000s. Our methodology differs from the study of Akiyama [6] because neither machine learning nor deep learning algorithms were available at that time. Recently, the mobile phone market has undergone a huge evolution, and very diverse mobile applications have been developed. However, there are still unrecognized defects in mobile applications that can affect businesses; to avoid this, defects should be detected before release. The benefit of prediction models is that more testing resources can be allocated effectively to fault-prone modules.
Software engineers first test the functionality of code on local systems to compare actual and expected results; a difference between actual and expected behavior is a defect, and software defects degrade software quality. When a test engineer finds such a difference while testing the code, it is called a bug. An error is a mistake in the code itself, such as one that prevents developers from compiling or running it. A failure occurs when the program is in production and customer-facing issues arise.
There are a few types of software defects: errors of commission, errors of omission, errors of clarity, and errors of speed and capacity. An error of commission is an incorrect command or statement. An error of omission means something was forgotten, left out, or excluded, such as a closing bracket. An error of clarity results from a misunderstanding between developer and customer. An error of speed or capacity means the program works, but not within the allowed time.
Machine learning uses various algorithms to analyze and learn from data, and then makes predictions and decisions for the future. Nowadays, machine learning is a well-known technique used in different areas, such as healthcare [7][8][9][10], industrial applications [11], banking [12], telecommunications [13,14], and software development [15]. Machine learning is used both for regression tasks and for classifying defective classes (i.e., binary classification). Machine learning has the following categories. Supervised learning uses labeled data to learn discriminative traits. Unsupervised learning evaluates unlabeled data to identify hidden structures by determining correlations among features, typically for clustering and dimensionality reduction. Semi-supervised learning is used when most data points are unlabeled and only a tiny percentage of the data is labeled; the algorithm can benefit from the limited number of labeled examples. The last category is reinforcement learning, which solves problems through trial and error.
Even though software architecture has shifted from monolithic applications to microservices [16], software fault prediction remains a problem to be solved. Articles published over the last three decades on software defect prediction and mobile application defect prediction have been studied extensively; nevertheless, articles on this topic published after the first decade of this century remain rare. Alsolai et al. presented a systematic literature review of publications on defect prediction for object-oriented software systems using machine learning techniques. The reviewed studies used both private and publicly available datasets, applied k-fold cross-validation to their models, detailed regression tasks, and, in various cases, implemented unique models [17]. Perez and Tah developed a mobile application to detect defects in buildings using deep learning techniques; the application used a smartphone camera to recognize various problems in buildings [18]. Zhao et al. (2022) proposed a model for imbalanced datasets of 15 Android applications. They incorporated feature learning with a loss function into deep learning to address the imbalance issue, and their model performed better than 25 defect prediction models [19]. Dong et al. (2017) studied defect prediction for Android apk files. They used a deep neural network on 50 Android apk files and obtained 85.98% accuracy; deep neural networks achieved better results than traditional machine learning techniques [20]. Cheng et al. (2022) developed a cross-project model to predict defects in code committed before new releases of Android applications. They used adversarial learning and experimented on 14 Android applications; their model gave better results than 20 others [21].
Kaya et al. presented a defect prediction model based on machine learning and software metrics using data sampling methods. They described the value of data sampling methods for imbalanced datasets in improving defect prediction performance and identified the AdaBoost algorithm as the best for defect prediction [22]. Kaur et al. performed a study on mobile app defect prediction using publicly available datasets; process metrics-based machine learning models were identified as the best predictors [23]. Zhao et al. created a model for Android application defect prediction. On 12 application datasets, they applied a loss function for the class imbalance problem and deep learning models for defect prediction, and pointed out that the IDL model performs better than other models [24]. Sewak et al. researched different variations of LSTM and presented a study on malware detection using LSTM networks [25]. Most of these models used traditional machine learning algorithms; only a few studies used deep learning algorithms for model development. In our model, we use artificial neural networks, convolutional neural networks, and long short-term memory to address defect prediction for Android applications.

Research Methodology
In order to make design decisions for the model developed in this study, we conducted a systematic literature review (SLR) on this topic [26]. We identified nine research questions, and applicable papers were retrieved from digital platforms. We applied selection criteria to the selected studies and organized them for quality assessment; each article was carefully evaluated against the quality assessment questions. First, we reviewed previous studies on software defect prediction and categorized them into three groups: web application, Android mobile application, and Windows phone application defect prediction. There were many studies on web application defect prediction; however, for Windows phone applications there were only limited resources, and for Android mobile application defect prediction only a few studies had been published. We carefully studied each publication and performed a study on mobile application defect prediction using machine learning methods. From the results of the literature review, we decided to develop a model for Android application defect prediction using deep learning techniques, because studies on this topic are very limited. For the model development process, we first identified the research questions. We reviewed previous studies to identify models developed for Android application prediction and recognized that only a few deep learning algorithms had been applied. Therefore, we collected Android application datasets, completed data processing, and developed a model using deep learning algorithms. We attempted to answer the identified research questions after the model development process was completed. Figure 1 shows the methodology of this study.

Model Development
The goal of this study was to develop a model for mobile application defect prediction using deep learning algorithms. We used Python data science libraries: NumPy for numerical processing, pandas for data analysis, Matplotlib for visualization, and sklearn for processing machine learning datasets. Machine learning is the practice of using algorithms to analyze data, learn from those data, and then decide on or predict new data. In machine learning, we can use different algorithms to classify and predict the desired output. However, classical machine learning algorithms do not consider the context of a word as a sequence; they work on how many times a word has been repeated, derive a probability, and then perform a classification task. Deep learning learns the pattern of a word or a sequence and then tries to predict the desired outcome of the task. For deep learning, we preferred the Keras API classes. To use the TensorFlow Estimator API, one defines a list of feature columns, creates the Estimator model, creates a data input function, and calls the train, evaluate, and predict methods on the Estimator object.

Datasets
We used 14 available open-source Android application datasets from previous studies [27]. The datasets are available on the COMMIT GURU platform [28]. The datasets are numerical and of different sizes, comprising six columns and 34,042 rows in total. Table 1 shows the names of the datasets, their web addresses, and the total number of downloads.

Data Processing
We examined the number of examples belonging to the true and false classes to identify the dataset types. The datasets are imbalanced (the false class is larger than the true class), as shown in Figure 2. Then we searched for null values in the datasets, since null values would create a problem. We replaced null values in the feature columns with zero (zero means the feature has no effect on prediction) and with false in the Contains Bug column. Then we split the dataset into X (the feature columns, i.e., the input) and Y (the class column, i.e., the output).
We used a labeling approach to encode specific numbers (0, 1, 2) and the sklearn preprocessing technique to transform 0 to false and 1 to true. After this, we divided our datasets into train and test sets in each loop: X_train, X_test, Y_train, and Y_test, with X features and Y classes. We used a random state as a parameter of the train/test split; each time the cells were run, if we had 1000 examples in the dataset, the random state would randomly choose 80 percent for training and the rest for testing. We checked the shapes of X_train and X_test to see how many examples we obtained after the split. The correlation between the features is shown in Figure 3.
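The processing steps above can be sketched as follows. This is a minimal sketch, not the authors' exact code: the column names (e.g., contains_bug) and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with six feature columns and a boolean "contains_bug"
# label, mirroring the structure described in the text.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 6)),
                  columns=[f"metric_{i}" for i in range(6)])
df.loc[::10, "metric_0"] = np.nan                      # simulate missing values
df["contains_bug"] = rng.integers(0, 2, size=100).astype(bool)

# Null handling: zeros for features (no effect on prediction),
# False for the label column.
feature_cols = [c for c in df.columns if c != "contains_bug"]
df[feature_cols] = df[feature_cols].fillna(0)

# Encode the boolean label as 0/1 and split 80/20 with a fixed random state.
X = df[feature_cols].values
y = df["contains_bug"].astype(int).values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (80, 6) (20, 6)
```

A fixed `random_state` makes the split reproducible across runs, which matters when comparing models on the same folds.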

Validation
We applied 10-fold cross-validation to the dataset to evaluate the performance of our model. The accuracy obtained from a single random split might not reflect the actual performance; by applying cross-validation, we can resolve this problem. We split the dataset into 10 folds; cross-validation uses nine folds to train the method and the remaining fold to test it, keeping track of how well the method did on the held-out data. From the training set, we obtained the best learning result from the best validation result. We believe that the assessment of our model is more accurate when using 10-fold cross-validation.
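The 10-fold procedure can be sketched with scikit-learn. The logistic regression stand-in model and the synthetic data are assumptions for illustration (the study's own models are deep networks):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for one of the Android datasets (six numeric features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 10-fold CV: each fold is held out once for testing while the other
# nine folds train the model; the scores are then averaged.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="roc_auc")
print(len(scores), scores.mean())
```

`StratifiedKFold` keeps the class ratio similar in every fold, which is particularly relevant for the imbalanced defect datasets discussed later.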

Feature Engineering
We used supervised deep learning to predict defects from the features in the dataset. Features have two critical properties: their units and the magnitudes computed with them. Different features in a dataset are measured in various units and magnitudes. To feed the data into a neural network, we had to scale them; two techniques are used for data scaling: normalization and standardization. Normalization scales data down to between 0 and 1, while standardization scales data based on a standard normal distribution. In deep learning, we needed to scale values down to between 0 and 1; in convolutional neural networks, for example, image pixels are scaled to between 0 and 1. An artificial neural network built with the TensorFlow and Keras libraries expects inputs between 0 and 1, which helps it learn the weights quickly. To achieve that, we used a MinMaxScaler: we created an instance of MinMaxScaler and then fit our data. Fitting the data lets the scaler learn each column's minimum and maximum values.
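The MinMaxScaler step can be sketched as follows; the two example feature columns are assumptions for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features with very different units/magnitudes
# (e.g., lines of code vs. a churn ratio).
X_train = np.array([[120.0, 0.3],
                    [4500.0, 0.9],
                    [800.0, 0.1]])
X_test = np.array([[1000.0, 0.5]])

# fit learns each column's min and max from the training data only;
# transform then maps values into the [0, 1] range.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training min/max
print(X_train_scaled.min(), X_train_scaled.max())  # 0.0 1.0
```

Fitting only on the training data and reusing the learned minimum and maximum for the test data avoids leaking test-set information into the model.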

Data Balancing
As detailed in previous sections, we are searching for solutions for the detection of mobile application defects, working with the numerical datasets of Android applications. In binary classification, one class is essential for this study; however, we do not have enough data for it. In our datasets, the majority class is much larger than the minority class, so we have a class imbalance problem, and the minority class is our class of focus. There are a few techniques to create synthetic data and balance datasets. Random oversampling, from the imblearn library, makes new samples for the minority class by sampling with replacement; since studies in the literature reported that RandomOverSampler did not provide the best performance, we dropped this method. Random undersampling picks random samples of the majority class so that, after sampling, the majority and minority classes have the same number of data points. The near-miss undersampling method uses KNN to perform undersampling; among its variants, NearMiss selects majority samples based on their average distance to the N closest minority samples, from nearest to farthest. We focused on SMOTE because of its performance and used the Python imblearn library's SMOTE oversampling method to balance the training dataset and increase the minority class, as shown in Figure 4. SMOTE takes each minority sample and introduces synthetic data points along the lines connecting the minority sample and its nearest neighbors; the neighbors among the k-nearest neighbors are chosen randomly, as in random oversampling.

Deep Learning Algorithms
This section describes the algorithms used to develop the defect prediction model for Android applications. We analyzed previous studies and identified that most used traditional machine learning methods and focused on web and desktop applications. Deep learning algorithms were used in a few studies, but for mobile applications only limited research was observed. We developed our model with deep learning algorithms and believe that its performance can help solve these problems.
Artificial neural networks are computing systems inspired by the brain's neural networks. These networks are based on a collection of connected units called artificial neurons. Each connection between neurons can transmit a signal from one neuron to another; the receiving neuron processes the signal and signals the downstream neurons connected to it. The structure of artificial neural networks has the following features: neurons organized in layers; an input layer (the raw values from the data); hidden layers (layers between input and output; three or more hidden layers make a "deep" network); and an output layer (the final estimate of the output). Different types of layers include dense (fully connected) layers, convolutional layers, pooling layers, and recurrent layers; additional layers may perform various transformations on their inputs. The convolutional layer is typically used with image data, the recurrent layer with time-series data, and the dense layer connects each input to each output within its layer. We used the Keras library to build our neural network. Our model is sequential, with an input layer, three hidden layers, and an output layer; our artificial neural network model is shown in Figure 5. We used a kernel initializer for the weights and activation functions such as ReLU, softmax, and sigmoid; we also used the Adam optimizer and binary cross-entropy loss to train and evaluate our model. We started building our artificial neural network with an input layer of six nodes, because we have six columns in our dataset. Then we added hidden layers with 12, 10, and 3 nodes, and an output layer with one node. Our model trained for 100 epochs; an epoch is one complete pass of the entire dataset through the neural network, after which training continues with the next pass, letting the model learn from the dataset and update its weights. The batch size is how much data we process in each step of the training loop.
First, ten samples are processed, then the artificial neural network takes another ten, and so on until the end, with samples drawn in random order. The architecture of our artificial neural network is shown in Figure 5. We performed tests on the dataset to measure accuracy. Following that, we evaluated the model with 10-fold cross-validation and calculated the ROC AUC, confusion matrix, and accuracy metrics.
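The architecture described above (six inputs; hidden layers of 12, 10, and 3 nodes; one sigmoid output; Adam optimizer; binary cross-entropy; batch size 10) can be sketched in Keras as follows. This is a minimal sketch, not the authors' exact code: the uniform kernel initializer, the synthetic data, and the shortened training run are assumptions for illustration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Sequential ANN: 6 input features, hidden layers of 12, 10, and 3 nodes,
# and a single sigmoid output for the binary "contains bug" label.
model = keras.Sequential([
    layers.Input(shape=(6,)),
    layers.Dense(12, activation="relu", kernel_initializer="uniform"),
    layers.Dense(10, activation="relu", kernel_initializer="uniform"),
    layers.Dense(3, activation="relu", kernel_initializer="uniform"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Tiny synthetic stand-in for a min-max-scaled dataset; batch size 10 follows
# the text, but we train far fewer than 100 epochs here for brevity.
rng = np.random.default_rng(0)
X = rng.random((100, 6))
y = (X.sum(axis=1) > 3).astype(int)
model.fit(X, y, epochs=5, batch_size=10, verbose=0)
print(model.count_params())  # 251 trainable parameters
```

The parameter count follows directly from the layer sizes: (6·12+12) + (12·10+10) + (10·3+3) + (3·1+1) = 251.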

Deep Learning Algorithms
This section identifies algorithms used for the defect prediction model development of Android applications. We analyzed previous studies and identified most studies used traditional machine learning methods and focused on web and desktop applications. Deep learning algorithms were used in a few studies but for mobile applications, limited research was observed. We developed our model with deep learning algorithms and believe that the performance of our model can be used to solve the problems.
Artificial Neural Networks are computing systems inspired by the brain's neural networks. These networks are based on a collection of connected units called artificial neurons. Each connection between neurons can transmit a signal from one neuron to another. The receiving neuron processes the signal and signals the downstream neurons connected to it. The structure of artificial neural networks has the following features: neurons organized in layers; input layer (absolute values from the data); hidden layers (layers between input and output, three or more hidden layers is "deep network"); and output layer (final estimate of the output). Artificial neural networks are typically organized in layers. Different types of layers include dense (or fully connected) layers, convolutional layers, pooling layers, and recurrent layers. Additional layers may perform various transformations on their inputs. The convolutional layer is most likely used with image data, the recurrent layer is used for time-series data, and the dense layer is a layer that connects each input to each output within its layer. We used the Keras library to build our neural network. Our model is sequential; we have an input layer, three hidden layers, and an output layer-our artificial neural network model is shown in Figure 5. We used a kernel initializer for weights and activation functions, such as Relu, SoftMax, and Sigmoid; we also used Adam optimizer and binary_crossentropy to evaluate our model. We started building our artificial neural network model with an input layer with six nodes because we have six columns in our dataset. Then we added hidden layers 12, 10, and 3 nodes and the output layers are one node. Our model trained for 100 epochs. An epoch is when all datasets pass a neural network, then continues. We needed to learn from the dataset and update the model. The batch size is how much data we want to process in a loop step by step. 
First, ten samples are obtained; then the artificial neural network is given another ten, and so on until the end, with samples drawn randomly. The architecture of our artificial neural network is shown in Figure 5. We performed tests on the dataset to predict accuracy. Following that, we evaluated the model with 10-fold cross-validation and calculated the ROC AUC, confusion matrix, and accuracy metrics.

CNN is an artificial neural network that is used for analyzing images. It is mostly used for image and vision-based data analysis or classification problems. Most generally, we think of a CNN as an artificial neural network specializing in detecting patterns and making sense of them; this pattern detection is what makes CNNs useful for image analysis. Convolutional neural networks are a form of artificial neural network with a different shape from a standard MLP. A CNN has hidden layers called convolutional layers, and it is precisely this layer type that makes a network a CNN. CNNs have other, non-convolutional layers as well, but the basis of a CNN is the convolutional layer. The convolutional layer receives input, transforms it, and outputs the transformed input to the next layer; in a convolutional layer, this transform is a convolution operation.

Recall that tensors are N-dimensional arrays built up by rank: a scalar (e.g., 3), a vector (e.g., [3, 4, 5]), a matrix (e.g., [[3, 4], [5, 6], [7, 8]]), and a rank-3 tensor (e.g., [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]). Tensors make it very convenient to feed sets of images into our model as (I, H, W, C), where I is the number of images, H is the image height in pixels, W is the image width in pixels, and C is the number of color channels (1 for grayscale, 3 for RGB). To understand the difference between a densely connected neural network and a convolutional neural network, recall that we were already able to create a deep neural network with the tf.estimator API. In a densely connected layer, every neuron in one layer is directly connected to every neuron in the next layer. The convolutional layer takes a different approach.
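The tensor ranks and the (I, H, W, C) image layout can be illustrated with NumPy arrays; the concrete values are placeholders:

```python
# Tensor ranks illustrated with NumPy (placeholder values).
import numpy as np

scalar = np.array(3)                                     # rank 0
vector = np.array([3, 4, 5])                             # rank 1
matrix = np.array([[3, 4], [5, 6], [7, 8]])              # rank 2
tensor = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])  # rank 3

# A batch of images as (I, H, W, C): 10 grayscale 28x28 images.
images = np.zeros((10, 28, 28, 1))

print(scalar.ndim, vector.ndim, matrix.ndim, tensor.ndim, images.shape)
```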
Each unit is connected to a smaller number of nearby units in the next layer. Convolutions also have a significant advantage for image processing, where nearby pixels are much more correlated with each other, which helps image detection. Each convolutional layer looks at an increasingly larger part of the image. Having units connected only to nearby units also helps with regularization, limiting the search for weights to the size of the convolution.

Convolutional neural networks had not previously been applied to mobile application defect prediction. We decided to build a sequential model using a convolutional neural network as well. We used the Keras, NumPy, and sklearn libraries; Conv2D, MaxPooling2D, dropout, flatten, and dense layers; and activation functions such as ReLU, SoftMax, and Sigmoid. We reshaped our training and test data so that the model could process it efficiently. We compiled the model with the Adam optimizer to measure performance and used binary cross-entropy as the loss function. The convolutional neural network model parameters are presented in Figure 6.

We started building our convolutional neural network with the sequential method and a convolution layer; convolution is the core operation between the blocks of a CNN. We applied a 2D convolutional layer with 64 filters and a kernel size of one, using the ReLU (rectified linear unit) activation function. The input data is multiplied by the weights, and the point of training is to adjust the weights and biases until they represent the training data well. A neuron receives a variety of signals and outputs a signal; the activation function decides whether to pass the signal on or not. There are several types of activation functions: Sigmoid is used to predict the probability of an output, while ReLU is the most common choice in convolutional neural networks and other classifiers. As the second layer, we applied a 2D max pooling layer with a (1, 6) pool size.
The 2D max pooling operation takes the highest value from each window of the convolutional layer's output and passes it on. There are also other types of pooling layers, such as the mean (average) pooling layer, which replaces each window of the convolutional layer's output with its mean value. The third layer of our model is a 2D convolutional layer with 32 filters, a kernel size of one, and ReLU activation. We then added a 2D max pooling layer with a (1, 1) pool size. We used a dropout layer to reduce the overfitting problem. We then applied a flatten layer, because the last layer is dense and requires its input as a 1D feature vector: flattening collapses the output of the convolutional or max pooling layer into a single feature vector that the dense layer can use for the final classification. We added a dense layer with 250 nodes and the SoftMax activation function, because it gave the best performance, followed by another dropout layer. The final output layer has one node with the Sigmoid activation function. We compiled the model with the Adam optimizer and the binary cross-entropy loss function. We trained the convolutional neural network for 100 epochs with a batch size of 10. Figure 6 shows the architecture of the convolutional neural network.
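A minimal sketch of the CNN layer stack described above, assuming the Keras Sequential API. The filter counts, kernel sizes, pool sizes, and layer order follow the text; the (1, 6, 1) input shape (the six tabular features reshaped into a one-row "image"), the dropout rates, and the placeholder data are assumptions:

```python
# Sketch of the CNN described above (assumption: the six tabular features
# are reshaped to (1, 6, 1) so 2D convolution/pooling layers can be applied).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(1, 6, 1)),
    layers.Conv2D(64, kernel_size=1, activation="relu"),
    layers.MaxPooling2D(pool_size=(1, 6)),
    layers.Conv2D(32, kernel_size=1, activation="relu"),
    layers.MaxPooling2D(pool_size=(1, 1)),
    layers.Dropout(0.2),                       # dropout rate is an assumption
    layers.Flatten(),
    layers.Dense(250, activation="softmax"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.rand(50, 1, 6, 1)                # placeholder reshaped features
y = np.random.randint(0, 2, size=(50,))
model.fit(X, y, epochs=2, batch_size=10, verbose=0)  # the paper trains 100 epochs
```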
LSTM is a type of artificial recurrent neural network (RNN) architecture used in deep learning. It allows neural networks to remember relevant data and forget non-applicable data. Recurrent neural networks are good for short-context classification: in a recurrent neural network, the processed output of one step is fed back into the input at the next step together with the new input data, which allows the network to remember previous steps in the sequence. A typical LSTM network is shown in Figure 7.

However, for a recurrent neural network, longer sequences are a problem: the network becomes ineffective at learning new data. Long short term memory is designed to address these issues of recurrent neural networks. It provides a solution through a component called the state. The state is a cell consisting of four parts, and each part is a gate: the forget gate, memory gate, input gate, and output gate. The forget gate decides which data stored in the internal state is no longer relevant and can be forgotten. The input gate decides which new data should be added to or updated in the working storage. The output gate decides which part of the data stored in the state should be output immediately. Numbers between 0 and 1 can be assigned to these gates: 0 means a gate is effectively closed, and 1 means it is wide open; intermediate values open a gate partially. LSTM cells are shown in Figure 8.

We used the long short term memory model for Android application defect prediction. We imported the Keras and TensorFlow libraries and, because LSTM expects data in a specific shape, reshaped our data. We defined num_of_samples, timesteps, nb_features, and nb_out. We started building our model with the sequential method and added a first LSTM layer with 100 units, with return sequences set to true because we wanted each output to be fed back as an input. The second layer was a dropout layer to reduce overfitting. The third layer was an LSTM layer with 50 units and return sequences set to false, followed by another dropout layer.

The last layer was a dense layer with the Sigmoid activation function. We compiled the model with binary cross-entropy loss, the Adam optimizer, and accuracy metrics; Figure 8 shows the LSTM architecture. We trained the model and evaluated the results with 10-fold cross-validation. Model loss and accuracy curves are presented in Figure 9. Since there was a large gap between the training and testing curves, we updated the model settings for the final experiments and were able to minimize this gap after the parameter adjustments. Figure 9 belongs to the earlier experimental settings, which were later updated during fine-tuning of the model.
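A minimal sketch of the LSTM model described above. The unit counts, return-sequence flags, and compile settings follow the text; the timestep count, dropout rates, and placeholder data are assumptions:

```python
# Sketch of the LSTM described above (assumptions: one timestep per sample,
# six features, dropout rate 0.2, random placeholder data).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, nb_features = 1, 6

model = keras.Sequential([
    keras.Input(shape=(timesteps, nb_features)),
    layers.LSTM(100, return_sequences=True),   # outputs fed back per timestep
    layers.Dropout(0.2),
    layers.LSTM(50, return_sequences=False),   # only the final output is kept
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),     # faulty / non-faulty probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.rand(50, timesteps, nb_features)  # placeholder reshaped data
y = np.random.randint(0, 2, size=(50,))
model.fit(X, y, epochs=2, batch_size=10, verbose=0)
```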

Accuracy Metrics
In this section, we explain the accuracy metrics used in this study. We trained our model on the training data and tested it with the testing data; now we need to summarize the predicted results. One way to achieve this is the confusion matrix method. We have two categories, faulty and non-faulty. True positives (TP) are faulty classes correctly identified by the algorithm; true negatives (TN) are non-faulty classes correctly identified by the algorithm; false negatives (FN) are faulty classes that the algorithm predicts as non-faulty; false positives (FP) are non-faulty classes that the algorithm predicts as faulty. Figure 10 shows the confusion matrix example.
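These counts can be obtained directly with scikit-learn's `confusion_matrix`; the labels below are illustrative:

```python
# Confusion matrix for a faulty (1) / non-faulty (0) toy example.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # actual classes (illustrative)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # predicted classes (illustrative)

# sklearn lays the matrix out as [[TN, FP], [FN, TP]] for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # prints: 3 1 1 3
```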

We use binary classification because of the dataset type. In classification, there are two kinds of outputs: class labels and probabilities. We applied the confusion matrix to see how many classes were correctly predicted and how many were not; the X-axis is the prediction and the Y-axis is the class label. For a balanced dataset, we can calculate the accuracy from the confusion matrix and obtain meaningful results, but in an imbalanced dataset the class counts are very different: one category can be at its maximum while another is at its minimum. Our confusion matrix results are presented in Figures 11-13.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
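With the illustrative counts TP = 3, TN = 3, FP = 1, FN = 1, the formula works out as:

```python
# Accuracy from illustrative confusion matrix counts.
tp, tn, fp, fn = 3, 3, 1, 1
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)  # 6 / 8 = 0.75
```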
Accuracy alone is not enough to interpret the success of the model. We focused on recall and precision because these metrics are suited to imbalanced datasets. In terms of recall: out of the total actual positive values, how many did we correctly predict as positive? Recall is also called the true positive rate or sensitivity. In terms of precision: out of the total predicted positives, how many are actually positive? Equation (2) for Precision and Recall is shown as follows:

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$
We applied the $F_\beta$ score because all values from the actual and predicted classes are important. If we set the beta value to 1, $F_\beta$ becomes the F1-score. The formula for $F_\beta$ is shown as follows:

$$F_\beta = (1 + \beta^2)\,\frac{\text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}$$

We combined Precision and Recall in selecting the right metrics. Binary classification also uses the ROC (receiver operating characteristic) curve and AUC (area under the curve) as performance metrics. The higher the area under the curve, the better the model is at separating the classes. These metrics are mostly used in binary classification problems such as fraud detection, spam detection, and software defect prediction. When the AUC is 1, the model performs well; when the AUC is 0 or close to 0, the model shows its worst separability. The ROC curve of the artificial neural network is shown in Figure 14.
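All of these metrics are available in scikit-learn; the labels and probability scores below are illustrative:

```python
# Precision, recall, F1, and ROC AUC on an illustrative toy example.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # actual classes
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # predicted class labels
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]   # predicted probabilities

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall    = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1        = f1_score(y_true, y_pred)          # F_beta with beta = 1
auc       = roc_auc_score(y_true, y_score)    # area under the ROC curve
print(precision, recall, f1, auc)
```

Note that the AUC is computed from the probability scores, not the thresholded class labels, which is why it needs `y_score` rather than `y_pred`.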

Experimental Results
In this section, we summarize the obtained results for an Android application software defect prediction model. We used ANN, CNN, and LSTM algorithms in this research. We also applied cross-project analysis by combining all the datasets for training, except the test data, and for the test we chose a different dataset each time. We preferred deep learning algorithms because previous studies applied mostly machine learning classifiers. The experimental results of ANN are shown in Table 2, CNN results are presented in Table 3, LSTM results are shown in Table 4, and algorithm performance in datasets is shown in Table 5. We found these results by tuning the hyperparameters empirically. However, the use of metaheuristic algorithms could increase performance of the classifiers.

Discussion
In this section, we discuss the performance of our software defect prediction models for Android applications. We evaluated three deep learning defect prediction models. We used 14 Android application projects from the COMMIT GURU platform and 10-fold cross-validation to evaluate the performance of each algorithm. We implemented the SMOTE oversampling technique to balance our datasets and evaluated the results with accuracy metrics such as Recall, Precision, F1-score, ROC, and AUC. We addressed the following research questions: RQ1. To what extent can we predict software faults for mobile apps in a cross-project setting using deep learning approaches?
For this research question, we analyzed deep learning performance.
RQ2. Can we improve the performance of deep learning models using data balancing approaches?
For the imbalanced mobile application datasets, we used the SMOTE oversampling technique to balance the datasets and improve their effectiveness.
RQ3. Can we improve the performance of deep learning-based fault prediction models using feature engineering techniques?
To improve the performance of the deep learning models, we applied data scaling to the input variables. We used MinMaxScaler because our datasets are numerical and we performed binary classification: MinMaxScaler scales each feature to the [0, 1] range, which is the range deep learning models expect for their input variables.
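The scaling step can be sketched as follows; the data values are placeholders:

```python
# Scale each feature column to [0, 1] with MinMaxScaler (placeholder data).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Each column is transformed independently: (x - min) / (max - min).
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)  # [[0, 0], [0.5, 0.5], [1, 1]]
```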
RQ4. To what extent can deep learning architecture predict software faults for mobile apps?
Specifically, we focused on the CNN and LSTM deep learning algorithms because, according to recent review articles, they are the most preferred algorithms for software engineering problems such as malware detection [29] and phishing detection [30]. However, other deep learning algorithms [31] might also be investigated for this problem, and performance could be improved this way. Additionally, different handcrafted and non-handcrafted features can be analyzed [32]. We focused on Android applications in this research because the available datasets use Android projects, and there were no publicly available datasets for other mobile operating systems. We also observed this issue in our recent review study on mobile malware detection [29]: nearly all of the analyzed papers studied Android projects. However, the performance of the algorithms on different mobile operating systems, such as Windows, might differ; this requires additional research, which was not performed in this study.
While SLR papers on mobile application defect prediction have been recently published [9], tertiary studies, such as [33], can be performed as well. Tertiary studies can synthesize secondary studies (i.e., SLR and systematic mapping (SM) studies) and present the state-of-the-art in this field. While finding relevant papers, we followed the SLR protocol carefully; however, due to some manual steps, some papers might have been neglected. A similar analysis using automated tools [34] could help to find more relevant papers in this research and include more deep learning algorithms. Previously, we performed different SLR studies [35,36] and observed the potential gaps in the relevant fields. In this research, we also used the knowledge of our recently published SLR paper [37]. Researchers can also examine this paper to see the potential research problems in this field.

Conclusions
Mobile applications have become a part of our lives. The goal of businesses is to offer users applications with zero defects and make things easier. Mobile applications are tested manually, and there may still be unseen defects. Additionally, there are now increasingly many types of mobile phones on the market, and mobile applications must be adapted to them. To recognize defects before the testing stage, the literature previously proposed defect prediction models for software using machine learning techniques. For mobile applications there were only limited studies, so we proposed software defect prediction models for mobile applications using deep learning methods. We used fourteen open-source Android application datasets of different sizes and platforms. We applied artificial neural networks, convolutional neural networks, and long short term memory networks. In an AUC-based performance benchmark, the CNN was the highest-performing model with 0.93. We compared our results with open-source machine learning algorithms, such as support vector machines and decision trees, and our models achieved better results. In future work, we plan to use our defect prediction models in real-time SoftPOS applications, because security and a robust, stable application are the most important requirements for SoftPOS.