Anomaly and Fraud Detection in Credit Card Transactions Using the ARIMA Model

This paper addresses the problem of unsupervised credit card fraud detection on unbalanced datasets using the ARIMA model. The ARIMA model is fitted on the regular spending behaviour of the customer and is used to detect fraud when deviations or discrepancies appear. Our model is applied to credit card datasets and is compared with four anomaly detection approaches: K-Means, Box-Plot, Local Outlier Factor and Isolation Forest. The results show that the ARIMA model has better detection power than the benchmark models.

the card is blocked. This issue has sparked the interest of both academia and industry, which are working to find solutions to this problem and to keep up with the ever-changing approaches adopted by malicious players [4]. Credit card fraud detection is now an active field of research, and it hinges particularly on the concept of automation: it is not always feasible to manually review each transaction in order to establish its nature [5]. It is also important to consider another significant human component that can make or break a fraudster's attempt to exploit a card: the promptness of cardholders in reporting a stolen, lost or suspiciously used card [5]. This calls for automated tools for smarter and faster fraud detection, which has resulted in machine learning techniques being increasingly tested and implemented [6].
Various popular algorithms have been tested in this context, such as Random Forest, Logistic Regression, Decision Trees, Support Vector Machines (SVM), and Neural Networks [7], [8], [9]. Khare and Sait in [7] compare Logistic Regression, SVM, Decision Tree and Random Forest using the Kaggle credit card dataset, which contains 284,807 transactions, 492 of which are fraudulent. The features of the dataset are obtained by applying Principal Component Analysis (PCA) to the original data for confidentiality reasons. The authors also state that they use the behavioural characteristics of the card owner, captured by a variable representing the spending habits of the customer as well as the month, hour of the day, geographical location and type of merchant. Experimental results show that Random Forest is the best-performing algorithm, with an accuracy score of 98.6%, compared to 97.7% for Logistic Regression, 97.5% for SVM and 95.5% for Decision Tree. Varmedja et al. in [8] compare the performance of Logistic Regression, Naive Bayes, Random Forest and Multi-Layer Perceptron on the Kaggle dataset. The number of features is reduced through feature selection, and the class imbalance is addressed by oversampling with SMOTE. Their results show that Random Forest is again the best algorithm, with Accuracy, Precision and Recall equal to 99.06%, 96.38% and 81.63% respectively. Roy et al. in [9] use a deep learning approach to detect frauds in credit card transactions. The dataset used in the study was provided by a financial institution and contains almost 80 million anonymised transactions performed over a period of 8 months. The authors perform feature engineering to apply domain knowledge to the problem and add extra features to the original ones. Due to the unbalanced nature of the dataset, the authors also perform undersampling at the account level for each unique account ID. Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are compared in this study; the results highlight that GRU presents the best performance with an accuracy score equal to 91.6%, followed by 91.2% (LSTM), 90.4% (RNN) and 88.9% (ANN).
As can be noted, there is a common fundamental issue in these approaches: the unbalanced nature of the datasets. In the context of credit card fraud detection, it is in fact expected that the dataset will be very unbalanced, which greatly hinders the performance of supervised learning techniques [6]. Another issue is the lack of properly labelled data, which again represents a substantial obstacle. Finally, many models lack the adaptability required to account for the fact that the spending behaviour of customers is likely to change over time [6]. In order to tackle these problems, we propose a model that does not require knowledge of ground truths and that is designed to use the spending behaviour of the customer as the main source of information when categorising transactions as either legitimate or fraudulent. More specifically, we frame the problem as an anomaly detection task in time series, where the variable represented by the time series is the daily count of transactions for a given customer. We propose a method that uses the ARIMA model and a rolling windows approach to flag suspicious numbers of transactions as anomalies, which will be discussed in depth in the following sections.

Two widely used models for time series are the Autoregressive (AR) and the Moving Average (MA) models, which can be used together as an Autoregressive Moving Average (ARMA) model. ARMA(p, q) is the combination of the AR(p) and MA(q) models, and can be used with univariate time series.

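• Autoregressive Model
The AR(p) model is defined by the equation below; it assumes a linear dependence between the current observation and a specified number p of lagged (previous) observations, plus an error term.

$$X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t$$

Here $X_t$ is the observation at time t, c is a constant, $\phi_1, \dots, \phi_p$ are the autoregressive coefficients and $\varepsilon_t$ is the error term.

• Moving Average Model
The MA(q) model is defined by the equation below; it makes use of the dependency between an observation and the residual errors at the q previous time steps.

$$X_t = c + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$$

Here $\theta_1, \dots, \theta_q$ are the moving average coefficients and $\varepsilon_t, \varepsilon_{t-1}, \dots$ are the error terms.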
The ARMA model, resulting from the combination of these two models, is defined as follows:

$$X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$$

where p refers to the order of the AR model and q refers to the order of the MA model; $X_t$ is the observation at time t, c is a constant, $\phi_i$ and $\theta_j$ are the AR and MA coefficients and $\varepsilon_t$ is the error term. The main assumption in time series analysis is that the time series is stationary, meaning that its mean and variance are constant over time; however, this is not the case in many practical situations [10]. The solution to this can be found in the generalisation of the ARMA model: the Autoregressive Integrated Moving Average (ARIMA) model. ARIMA introduces the possibility of applying differencing to the data points of a time series in order to make it stationary [10]. ARIMA is now one of the most popular, flexible and simple models for fitting a time series [10]; it is defined as ARIMA(p, d, q), where p and q represent the orders of the AR and MA models and d indicates the degree of differencing.

In the context of fraud detection, time series can be used as a tool when working with aggregated features. Aggregation is often used to derive new features from the original ones in order to feed the model information that is expected to be more relevant than the raw features. The number of daily transactions and the total amount spent in a week are examples of aggregated features [5].
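As an illustration of the differencing step, the following minimal sketch applies first-order differencing (d = 1) to a simulated random walk. The data are synthetic and the variable names are chosen for this example only; nothing here comes from the dataset used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random walk: each value is the previous value plus white noise,
# so its level drifts over time and the series is not stationary.
noise = rng.normal(loc=0.0, scale=1.0, size=500)
random_walk = np.cumsum(noise)

# First-order differencing (d = 1): y_t = x_t - x_{t-1}.
differenced = np.diff(random_walk, n=1)

# Differencing the random walk recovers its white-noise increments,
# which have a constant mean and variance.
print(np.allclose(differenced, noise[1:]))
```

In this toy case one round of differencing is enough; in practice the degree d is chosen with a stationarity test.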

Estimation Process of ARIMA
When using ARIMA, care should be taken to identify the combination of parameters that best represents the data; Box-Jenkins is a method proposed by George Box and Gwilym Jenkins in [11] that is frequently used when tuning an ARIMA model. The method is composed of three steps:
1. Identification, which refers to the use of all available data and related information to select the model that best represents the time series. This phase can be split into two sub-steps:
(a) Differencing. The first step is to establish whether the time series is stationary, and hence whether it requires differencing. The Augmented Dickey-Fuller (ADF) test is a technique that can be used to verify whether the time series at hand is stationary. The null hypothesis of the ADF test states that the time series can be represented by a unit root, meaning that it presents a time-dependent structure and is thus not stationary; consequently, rejecting the null hypothesis implies that the time series is stationary.
(b) Configuration of p and q. During this phase, it is helpful to use the correlogram to visualise the autocorrelation function (ACF) and the partial autocorrelation function (PACF), which can help to determine a suitable choice for the orders p and q. The fundamental difference between the two functions is that the PACF removes the linear dependence on the intermediate lags in order to return only the correlation between the present and the lagged value. Briefly, whereas the autocorrelation function of an AR(p) process tails off, its partial autocorrelation function cuts off after lag p. Conversely, the autocorrelation function of an MA(q) process cuts off after lag q, while its partial autocorrelation function tails off.
2. Estimation, which refers to the training phase. Once the values of p, d, q have been established, the φ and θ coefficients can be estimated. This method uses the maximum likelihood estimation process, which is solved by non-linear function maximisation; for more details about this phase the reader is referred to [11], [12].
3. Diagnostics, which refers to the evaluation of the model and the identification of improvements. This step looks for issues in the model to verify whether it is able to effectively summarise the underlying data. The forecast residuals provide an important source of information for diagnostics. In an ideal model, the errors resemble white noise and are normally distributed with a mean of 0 and a constant variance. In addition, an ideal model leaves no temporal structure in the residuals, as any such structure should have been captured by the model.

Fraud Detection with ARIMA Model on Daily Counts of Transactions
Our idea is to apply ARIMA to time series representing the daily count of transactions for a given customer in order to detect frauds. This rests on an important assumption: the number of daily transactions for a given customer follows a certain pattern [13]. At a high level, fraud detection in this context assumes that it is possible to recognise, and hence model, the regular spending behaviour of the customer; once this has been learned, any discrepancies and deviations from it are likely to be frauds. We also refer to such deviations as anomalies. An anomaly is a point in a dataset whose characteristics are significantly different from those of the other points; building on this, anomaly detection is the process of isolating such points by determining when they deviate from the expected behaviour [14]. ARIMA is used to model the legitimate spending behaviour of the customer and to produce a forecast.

The intuition behind this setting can be easily explained graphically. Figure 1 shows the daily transactions of a credit card for a customer chosen from our dataset; more details about this dataset are given in the next section. The number of legitimate transactions happening each day for this customer is shown as blue dots, whereas the number of frauds is shown as red dots. A significant peak is observed on the day of the fraudulent transactions. Based on this information, it can be argued that an anomaly detection approach based on the identification of anomalous counts of daily transactions may lead to the detection of frauds. In order to detect frauds, the following steps are proposed:
1. The time series is split into a training and a testing set; it is important that the training set contains only legitimate transactions so that the model learns the legitimate behaviour of the customer. This should then allow for the identification of anomalies.
2. In the training set, which contains only legitimate transactions, the order of the ARIMA model is identified using the Box-Jenkins method and the parameters of the model are then estimated. During this phase, care is taken to ensure that the estimated coefficients are significant and that there is no temporal structure left in the residuals. Finally, in the testing set, one-step-ahead predictions are performed using rolling windows.
3. In order to detect fraud in the testing set, the errors are calculated as the difference between the predicted and the actual daily count of transactions. Then, the Z-Scores are computed and used to flag the anomalies (i.e. the frauds). The Z-Score is calculated as

$$Z = \frac{x - \mu}{\sigma}$$

where x is the prediction error on the daily count of transactions in the testing set, and µ and σ are the mean and the standard deviation of the in-sample prediction errors obtained with our model on the training set. If the Z-Score is greater than a threshold, the day is flagged as anomalous (i.e. as fraud).

Application to Dataset

Dataset Description
The dataset used for this study was provided by NetGuardians SA and contains information about credit card transactions for 24 customers of a financial institution; it covers the period from June 2017 to February 2019. For reasons of confidentiality, the name of the financial institution will not be mentioned. Each row is related to a customer ID and represents a transaction with its various features (e.g. timestamp, amount) including the class label (1 for fraud and 0 for legitimate transaction). An important aspect is that each of the 24 customers presents at least one fraud over the whole period. Figure 2 and Table 1 show the number of daily transactions for all customers and the frequency of fraudulent and legitimate transactions in the whole dataset. The dataset is highly imbalanced, with a proportion of fraud of 0.76%. However, it is important to note that the customers are not necessarily active during the whole period. In fact, as illustrated in Figure 3 and Figure 4, some of them perform transactions only in the first part of the considered time frame, others only at the end, and others in the middle. Our approach based on the ARIMA model requires a sufficient number of legitimate transactions in the training set in order to learn the legitimate behaviour of the customers. In addition, it requires at least one fraud in the testing set to evaluate the performance of the model. We therefore initially split each time series into a training and a testing set with a 70-30 ratio, and retain only the time series for which there is at least one fraud in the testing set and no fraudulent transactions in the training set. Unfortunately, this reduces the number of customer time series from 24 to 9. Table 2 summarises the composition of the final 9 time series that will be used in the next section.
The last column indicates the number of frauds over the total number of transactions happening on the same day; as can be seen, only in one of the time series (number 10) do frauds happen on two different days.
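The preparation described in this section (aggregating transactions into daily counts, splitting 70-30, and checking where the frauds fall) can be sketched with pandas as follows. The transaction log and the column names `timestamp` and `is_fraud` are hypothetical and do not reflect the real schema of the dataset.

```python
import pandas as pd

# Hypothetical transaction log for a single customer.
transactions = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-01-01 09:00", "2018-01-01 18:30", "2018-01-02 12:00",
        "2018-01-04 08:15", "2018-01-05 20:00", "2018-01-06 10:00",
        "2018-01-07 11:00", "2018-01-08 13:00", "2018-01-09 01:00",
        "2018-01-09 01:05", "2018-01-09 01:10", "2018-01-10 15:00",
    ]),
    "is_fraud": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
})

# Aggregate to one row per calendar day; days without activity get a count of 0.
by_day = transactions.set_index("timestamp")["is_fraud"].resample("D")
daily = pd.DataFrame({"count": by_day.size(), "frauds": by_day.sum()})

# Chronological 70-30 split.
split = len(daily) * 7 // 10
train, test = daily.iloc[:split], daily.iloc[split:]

# A series is retained only if the training part is fraud-free
# and the testing part contains at least one fraud.
usable = bool(train["frauds"].sum() == 0 and test["frauds"].sum() >= 1)
print(daily)
print("usable:", usable)
```

In this toy example the only fraudulent day falls in the last 30% of the period, so the series would be retained.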

Application of ARIMA Model for Daily Counts of Transactions
The previously outlined steps are performed for each of the 9 time series separately. For clarity and brevity, they are described in detail for just one of the time series. As already discussed, the first step involves establishing whether the time series is stationary. To do this, we perform the ADF test, whose results are shown in Table 3.
The test shows, with a statistically significant result, that the time series is stationary. Next, Figures 5(a) and 5(b) show the PACF and ACF that are used to determine the best values for the orders p and q of the ARIMA model. For this time series, there may be a drop-off in the PACF at lag 1 and in the ACF at either lag 1 or 2, suggesting an ARIMA(1,0,1) or an ARIMA(1,0,2) model. After parameter estimation and residual analysis on the training set, ARIMA(1,0,2) is selected among the two candidates as a good model for this time series and is used for forecasting. Figure 5(c) shows the correlogram of the residuals for the selected model, which confirms that they have a white noise pattern. These steps are performed for all the other 8 time series; in some cases, the configuration may require multiple attempts to find the best parameters. All parameters passed on to the next stage of the study are found to be significant. For the forecasting in the testing set, we set the threshold to 3: when the Z-Score is greater than 3, the day is flagged as fraud.

Benchmark Models
Our model is compared to four anomaly detection models: the Box-Plot, the Local Outlier Factor (LOF), Isolation Forest and K-Means. Each benchmark model is briefly explained below.

Box-Plot
Box-Plots are used in the context of exploratory data analysis; they can be used to graphically represent data using their descriptive statistics. Box-Plots do not make any assumptions about the statistical distribution followed by the sample, meaning that potential outliers are identified solely based on the degree of dispersion of the data points in the sample. Box-Plots are very useful because they can effectively reveal patterns in groups of numbers that might otherwise be invisible to the human eye [15]. Being a visual tool, Box-Plots are often used to increase our understanding of data, allowing for a better interpretation of quantitative data [15]. We apply the Box-Plot to the entire dataset (for each time series); however, only the testing portion of the dataset is considered when calculating the results. This is done for consistency, in order to ensure a fair comparison of the performances.
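A Box-Plot based detector can be sketched as follows. The text does not state which whisker rule is used, so this example assumes the common Tukey convention (points beyond 1.5 times the interquartile range from the quartiles are outliers); the daily counts are made up for illustration.

```python
import numpy as np


def boxplot_outliers(values, whisker=1.5):
    """Return a boolean mask of points outside the Box-Plot whiskers (Tukey rule, assumed)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return (values < lower) | (values > upper)


# Hypothetical daily counts for one customer; the value 14 mimics a fraud burst.
daily_counts = [2, 3, 1, 2, 4, 2, 3, 2, 1, 3, 2, 14, 3, 2]

# The rule is applied to the whole series ...
is_outlier = boxplot_outliers(daily_counts)

# ... but only the testing portion (last 30%) is used for evaluation.
split = len(daily_counts) * 7 // 10
test_flags = is_outlier[split:]
print(test_flags)
```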

Local Outlier Factor (LOF)
Local Outlier Factor (LOF) is an algorithm introduced by Breunig, Kriegel, Ng and Sander in 2000 that is aimed at the identification of anomalous data points based on their local deviation from their neighbours. LOF is a density-based algorithm, and it is centred on the concept of a degree of being an outlier [16], as opposed to a binary classification of outliers. The model is local because the anomaly score assigned to each point derives from the degree of isolation of that point with respect to its k neighbours, where k can be specified. More precisely, the locality of a point is given by its k nearest neighbours. A point is considered to be an outlier when its local density is significantly lower than the densities of its neighbours [17]. For more details about LOF, see [16]. As for the Box-Plot, LOF is applied to the entire dataset and only the testing set is considered when calculating the results. As already explained, this is done in order to retain consistency across the tests.
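A minimal LOF sketch using scikit-learn is given below. The value of k (`n_neighbors=5`), the default decision threshold of the library and the daily counts are assumptions made for this example; the text does not report the configuration used in the study.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical daily counts for one customer; the value 14 mimics a fraud burst.
daily_counts = np.array([2, 3, 1, 2, 4, 2, 3, 2, 1, 3, 2, 14, 3, 2], dtype=float)
X = daily_counts.reshape(-1, 1)  # one feature: the daily count

# k nearest neighbours define the locality of each point (k = 5 assumed here).
lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X)  # -1 for outliers, 1 for inliers

# scikit-learn stores the negated LOF score; larger values mean more isolated points.
scores = -lof.negative_outlier_factor_

# Only the testing portion (last 30%) is used for evaluation.
split = len(daily_counts) * 7 // 10
test_flags = labels[split:] == -1
print(np.round(scores, 2))
```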

Isolation Forest
Isolation Forest is an anomaly detection algorithm that takes a different approach from other models used for this purpose: rather than focussing on profiling normal points and identifying deviations from them (i.e. anomalies), Isolation Forest focuses directly on the detection of anomalies without profiling. This is possible thanks to two fundamental properties of outliers: they are few and they are different, which isolates them from the other, regular points [18]. In order to isolate anomalies, the algorithm makes use of a tree structure, in which outliers end up closer to the root of the tree than the other points [19]. In Isolation Forest, each isolation tree isolates the anomalies by randomly selecting a feature and a split value between the minimum and the maximum values of that feature; the random partitioning should result in anomalies having a shorter path, due both to the low number of such instances and to their inherently different characteristics, which lead to early partitioning. More details about the Isolation Forest algorithm can be found in [19]. Isolation Forest does not require labels to work; however, it is trained on the training set, which consists only of legitimate transactions, and is then used to classify the data points in the testing set.
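The train-then-classify use of Isolation Forest described above can be sketched with scikit-learn. The simulated training counts, the number of trees and the default decision threshold of the library are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Simulated legitimate daily counts used for training (no labels are needed).
train_counts = rng.poisson(lam=3.0, size=140).reshape(-1, 1)

# Hypothetical testing days; the value 14 mimics a fraud burst.
test_counts = np.array([2, 3, 14, 3, 2]).reshape(-1, 1)

forest = IsolationForest(n_estimators=100, random_state=0).fit(train_counts)
labels = forest.predict(test_counts)  # -1 for anomalies, 1 for regular points

# Lower scores correspond to shorter average path lengths, i.e. easier isolation.
scores = forest.score_samples(test_counts)
print(labels, np.round(scores, 3))
```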

K-Means
K-Means is an unsupervised learning model used for clustering. Clustering is the process by which clusters, or groupings, are identified in a given input [20]. K-Means operates as follows: given an input consisting of a set of instances $x_1, x_2, \dots, x_n$ and a number of clusters K, the algorithm places the centroids $c_1, c_2, \dots, c_K$, one for each cluster, at random locations, then:
1. For each point $x_i$:
(a) Find the nearest centroid $c_j$. K-Means computes the Euclidean distance between each point $x_i$ and each centroid $c_j$. This approach is often called minimising the inertia of the clusters [21], which can be defined as follows:

$$\sum_{i=1}^{n} \min_{j \in \{1, \dots, K\}} \lVert x_i - c_j \rVert^2$$

where n is the number of points x and K is the number of centroids c.
(b) Assign instance $x_i$ to cluster j.
2. For each cluster j = 1, 2, ..., K:
(a) Compute the new centroid $c_j$ as the mean of all the points x assigned to cluster j.
3. Stop when convergence is reached, that is, when the assignments no longer change between iterations.
For more details about K-Means, see also [21], [22]. We fit K-Means on the entire dataset, specifying two clusters (for legitimate and fraudulent daily counts). The cluster containing the smaller number of instances is taken to represent the positive class. As with the Box-Plot and LOF, only the outliers that fall in the testing set are taken into account.
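The two-cluster K-Means benchmark can be sketched with scikit-learn as follows. The daily counts are made up, and `n_init` and `random_state` are arbitrary choices for reproducibility rather than settings taken from the study.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical daily counts for one customer; the value 14 mimics a fraud burst.
daily_counts = np.array([2, 3, 1, 2, 4, 2, 3, 2, 1, 3, 2, 14, 3, 2], dtype=float)
X = daily_counts.reshape(-1, 1)

# Two clusters: one for legitimate and one for fraudulent daily counts.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The smaller cluster is treated as the positive (fraud) class.
sizes = np.bincount(kmeans.labels_, minlength=2)
fraud_cluster = int(np.argmin(sizes))
is_fraud = kmeans.labels_ == fraud_cluster

# Only the testing portion (last 30%) is used for evaluation.
split = len(daily_counts) * 7 // 10
test_flags = is_fraud[split:]
print(test_flags)
```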

Results
The results are presented based on three metrics: Precision, Recall and F-Measure. Precision measures how trustworthy the positive classifications of the model are; that is, Precision tells us how many of the predicted frauds are actually frauds. A high Precision means that when the model classifies a point as positive, the classification is highly likely to be correct. This metric is defined by the following equation:

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \qquad (5)$$

Recall indicates the ability of the model to detect the positive class. When a model presents a high Recall, the majority of positive data points are correctly identified. The equation for Recall is shown below.

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \qquad (6)$$

Precision and Recall capture two competing properties of a model, meaning that optimising one often implies worsening the other. In order to gain a more comprehensive overview of the performance of the model, we can use the F-Measure metric, defined as shown in the following equation.

$$\text{F-Measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (7)$$
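The three metrics can be computed from binary labels as in the short sketch below. The function name and the convention of returning 0 when a denominator is zero are assumptions of this example, and the F-Measure is taken to be the harmonic mean of Precision and Recall.

```python
def precision_recall_f(y_true, y_pred):
    """Compute Precision, Recall and F-Measure for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Assumed convention: a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Six days: two truly fraudulent, two flagged, one of them correctly.
y_true = [0, 0, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 0]
precision, recall, f_measure = precision_recall_f(y_true, y_pred)
print(precision, recall, f_measure)
```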
These metrics are calculated for each of the 9 time series analysed and used to obtain the average as described in the previous section. The results are presented in Table 4. As can be noted, ARIMA presents the best result in terms of Precision and F-Measure, whereas K-Means provides the best performance in terms of Recall. The worst performing model in this setting is Local Outlier Factor, which has Precision and F-Measure scores of 8.4% and 14.04% respectively. It should be pointed out that LOF was designed to be effective with multidimensional datasets [16], which might explain its poor performance in this particular setting. The Box-Plot model produces the best performance amongst the benchmarks, with an F-Measure of 72.22%, and is the only one comparable to our model. The advantage of our model is that it is based on modelling the normal behaviour of the customer. In addition, forecasting by rolling windows takes into account the dynamic changes in the spending behaviour of the customer. While it can be argued that our model is the best one overall, it underperforms the Box-Plot, the Isolation Forest and K-Means in terms of Recall.
As previously discussed, only 9 out of the 24 possible time series are retained for analysis due to the lack of frauds in the testing set. Consequently, the results presented so far are highly dependent on that particular set of data. In order to assess the robustness of the model, the time series that were originally discarded are reintegrated through the injection of fake fraudulent transactions into the testing set. The occurrence of frauds is simulated by adding a number of counts, ranging from 1 to 8, to a random date in the testing set of each time series. The range was set from 1 to 8 as it reflects the one observed in the 9 time series already discussed. It should be noted that the performance of the models varies considerably depending on how many counts are added and on which day. In order to account for this randomness, the process is repeated 100 times and the average of the metrics is computed. To give an overview of the performances over the 24 time series, a global average is computed and is shown in Table 5. Despite the fact that all models perform worse after the injection of fake frauds, ARIMA presents the best performance in terms of Precision and F-Measure, whereas the best Recall score is achieved by Local Outlier Factor. The Precision of the latter is, however, again the worst, which leaves the Box-Plot as the only model comparable to ARIMA in this case as well.
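The injection experiment can be sketched as follows. The helper names `inject_fraud` and `detection_rate`, the toy testing series and the threshold detector are placeholders for this example; in the study the detectors are the ARIMA model and the four benchmarks, and Precision, Recall and F-Measure are averaged rather than the simple detection rate used here.

```python
import numpy as np


def inject_fraud(test_counts, rng, low=1, high=8):
    """Add between `low` and `high` extra counts to one random day of the testing set."""
    injected = np.array(test_counts, dtype=float)
    day = int(rng.integers(len(injected)))
    extra = int(rng.integers(low, high + 1))  # upper bound is exclusive, hence + 1
    injected[day] += extra
    labels = np.zeros(len(injected), dtype=int)
    labels[day] = 1
    return injected, labels


def detection_rate(test_counts, detector, n_runs=100, seed=0):
    """Average, over repeated random injections, how often the injected day is flagged."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_runs):
        injected, labels = inject_fraud(test_counts, rng)
        flags = np.asarray(detector(injected), dtype=bool)
        hits += int(flags[labels == 1].all())
    return hits / n_runs


# Hypothetical fraud-free testing series and a placeholder threshold detector.
test_counts = [2, 3, 1, 2, 4, 2, 3, 2]
rate = detection_rate(test_counts, detector=lambda counts: counts > 6)
print(f"injected day flagged in {rate:.0%} of runs")
```

Small injections (for example one extra transaction) are naturally harder to flag than large ones, which is why the average over many runs is reported.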

Conclusion
This paper addresses the problem of unsupervised credit card fraud detection using the ARIMA model. The main reason for focussing on a time series model is the lack of fraud data due to confidentiality issues, which can represent a substantial obstacle in the development of machine learning algorithms. In this context, the goal of our approach is to model the regular spending behaviour of the customer, so that any discrepancies and deviations from it are likely to be anomalies. The intuition behind this approach is centred on the assumption that the occurrence of frauds on a given day alters the daily number of transactions in a way that can be detected as suspicious. In the first step, the ARIMA model is calibrated in the training set on the daily number of legitimate transactions in order to learn the regular spending behaviour of the customer. In the second step, the fitted model is used to predict fraud in the testing set using rolling windows. The criterion for flagging fraud is based on the Z-Score calculated on the prediction errors in the testing set. Our methodology is applied to the dataset provided by NetGuardians and is compared with four anomaly detection algorithms: K-Means, Box-Plot, Local Outlier Factor and Isolation Forest. In terms of prediction power, the ARIMA model outperforms the other models, followed by the Box-Plot method. Among the four benchmark models, the Local Outlier Factor is the worst performing.
Our model is successful compared to the benchmark models for three reasons:
1. It works better when there is a significant number of frauds happening on the same day. This is often the case, as fraudsters are known to take advantage of the time they have before the card is blocked to make several fraudulent transactions in a short time span [13].
2. It presents the best Precision, i.e. it reduces the number of false positives compared to the benchmark models.
3. It takes into account the dynamic spending behaviour of the customer by using rolling windows.
One main limitation of our approach is that the ARIMA model assumes that the data come from observations that are equally spaced in time. However, this assumption does not hold in our study, since the transaction times are unequally spaced. This issue will be addressed in future research by using more advanced approaches such as continuous-time autoregressive moving average (CARMA) processes.