Effective Ransomware Detection Using Entropy Estimation of Files for Cloud Services

A variety of data-based services, such as cloud services and big data-based services, have emerged in recent times. These services store data and derive value from it, so the reliability and integrity of the data must be ensured. Unfortunately, attackers have taken valuable data hostage for money in attacks known as ransomware. It is difficult to recover the original data from files in systems infected by ransomware because the files are encrypted and cannot be accessed without the keys. Cloud services can back up data; however, encrypted files are also synchronized with the cloud service, so the original files cannot be restored even from the cloud once the victim system is infected. Therefore, in this paper, we propose a method to effectively detect ransomware for cloud services. The proposed method detects infected files by estimating the entropy of the files to be synchronized, based on uniformity, one of the characteristics of encrypted files. For the experiment, files containing sensitive user information and system files for system operation were selected. In this study, we detected 100% of the infected files in all file formats, with no false positives or false negatives. We demonstrate that the proposed ransomware detection method was very effective compared to existing methods. Based on the results of this paper, we expect that, even if a victim system is infected with ransomware, this detection method will detect the infected files and prevent them from being synchronized with the cloud server. In addition, we expect that the original files can be restored from the backups stored on the cloud server.


Introduction
Due to the fourth industrial revolution, data-based services such as cloud computing and big data have emerged. Cloud computing is a technology that processes and stores information in computing environments that provide high-level resources at a low cost [1]. Big data refers to extremely large amounts of unstructured data that have value based on their characteristics and require analysis techniques that can process a large amount of data [2]. The basis of cloud services is data, and these services cannot operate successfully if the reliability and integrity of the data are not guaranteed. Meanwhile, valuable data stored in the cloud are attractive targets; attackers can cause serious damage through ransomware, which threatens to manipulate data unless money is paid [3].
Ransomware restricts access by holding computer systems hostage and demanding a ransom in exchange for releasing the restrictions. Specifically, ransomware encrypts sensitive information stored on the victim system and then demands money in exchange for the decryption keys [4][5][6]; most of the systems targeted by ransomware are PCs. Investigators have tested various methods to detect ransomware and prevent system infections [7][8][9][10]. These ransomware detection methods are classified into five broad categories: file-based detection, system-based behavior detection, resource-based behavior detection, connection-based behavior detection, and entropy-based detection.

Section 3 describes the experimental conditions and goals, and the experimental results of detecting ransomware-infected files. Finally, Section 4 concludes our paper.

The Proposed Detection Methodology
In this section, we describe our proposed ransomware detection methodology in more detail. As mentioned in the Introduction, the existing methods for detecting ransomware, such as file-based detection, system-based behavior detection, resource-based behavior detection, and connection-based behavior detection, have distinct disadvantages; in particular, they do not consider detection in cloud services. Therefore, this paper focuses on entropy-based ransomware detection for effective ransomware detection and file recovery in cloud services.
The proposed ransomware detection methodology is shown in Figure 1. For detection, a detection module obtains a list of files to be transferred to the cloud server. Then, the module detects ransomware by measuring the entropy, which is characteristic of encrypted files infected with ransomware. To decide whether a file is infected, the entropy trends of infected and non-infected files are compared to determine an entropy threshold for each file format. Files infected with ransomware are identified using the threshold derived in this way, and, as a result, infected files are not synchronized with the cloud server. The overall steps of the proposed detection methodology are summarized as follows:
Step 1. Check the files stored inside a disk
Step 2. Acquire the path and list of files to be uploaded to a cloud service application
Step 3. Run the ransomware detection module
Step 4.1. Extract the path of the file to upload
Step 4.2. Extract the extension of the file to upload
Step 4.3. Extract the entropy of the file to upload
Step 4.4. Detect the ransomware-infected file
Step 5. Terminate the ransomware detection module
Step 6. Synchronize the file to the cloud service
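As an illustration only, Steps 4.1 to 4.4 and Step 6 can be sketched as a small filter over the upload list. The function name `select_files_to_sync`, its parameters, and the idea of passing the entropy estimator and thresholds in as arguments are assumptions made for this sketch; they are not taken from the implementation described in this paper.

```python
import os
from typing import Callable, Dict, Iterable, List


def select_files_to_sync(
    paths: Iterable[str],
    estimate_entropy: Callable[[bytes], float],
    thresholds: Dict[str, float],
    default_threshold: float,
) -> List[str]:
    """Return the paths whose entropy does not exceed the per-format threshold.

    All names and parameters here are hypothetical; the thresholds must be
    supplied by the caller (the paper derives them experimentally).
    """
    allowed = []
    for path in paths:                                   # Step 4.1: file path
        extension = os.path.splitext(path)[1].lower()    # Step 4.2: extension
        with open(path, "rb") as handle:
            entropy = estimate_entropy(handle.read())    # Step 4.3: entropy
        limit = thresholds.get(extension, default_threshold)
        if entropy <= limit:                             # Step 4.4: detection
            allowed.append(path)                         # Step 6: synchronize
    return allowed
```

Passing the estimator as a parameter keeps the sketch independent of any particular entropy measurement method, which matters because the reference values differ by method and by file format.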

Characteristics of Ransomware-Infected Files
The proposed detection method utilizes a feature that appears in ransomware-infected files but not in clean files. Ransomware generally encrypts files containing sensitive information to prevent access to them. In terms of cryptographic characteristics, the ciphertext generated by encryption is statistically uniform; for example, if the values of the cipher output range from 0x00 to 0xFF, the probability of each generated value should be the same [16]. If, however, the ciphertext is biased toward a specific value or a specific range of values, it becomes possible to decrypt it based on the probabilities of values that are generated more or less frequently [17]. To prevent this, cryptographic techniques are designed so that the probability of occurrence of each value in the ciphertext is substantially the same. Therefore, the data in files infected with ransomware are statistically uniform because ransomware encrypts the file and the ciphertext itself is uniform. Consequently, we can detect ransomware-infected files by measuring this uniformity numerically.
Uniformity can be measured by estimating entropy. According to the National Institute of Standards and Technology (NIST), entropy measures disorder or randomness. For example, the uncertainty entropy for the probabilities (p_1, ..., p_n) of the random variable X is defined as Equation (1) [18]:

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (1)$$

Based on the equation, if the encrypted data are uniform, the entropy is high. In other words, the data in a ransomware-encrypted file are uniform, and the entropy of the encrypted file is higher than that of the original clean file. In this paper, we detect ransomware based on the uniformity of infected files.
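To make the relationship between uniformity and entropy concrete, the following minimal sketch computes the Shannon entropy of a byte string in bits per byte. It is an illustration, not the estimator used in this paper; the function name `shannon_entropy` is hypothetical, and `os.urandom` is used only as a stand-in for ciphertext.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value 0x00..0xFF
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


plain = b"The quick brown fox jumps over the lazy dog. " * 200
random_like = os.urandom(len(plain))  # stand-in for encrypted content

print(f"plain text : {shannon_entropy(plain):.3f} bits/byte")
print(f"random data: {shannon_entropy(random_like):.3f} bits/byte")
```

Repetitive plain text yields a value well below the 8 bits per byte maximum, while uniformly distributed bytes approach it, which is the gap the detection method relies on.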

Entropy Estimation Methods
There are various ways to measure entropy, including methods based on the Poisson distribution [19], Hamming distance [20], and spontaneous emission [21]. In addition, NIST provides methods and tools for measuring entropy, published as NIST SP 800-90B [22]. SP 800-90B measures randomness according to the properties of random numbers, and the measurement methods are divided into independent and identically distributed (IID) and non-IID methods. The IID methods are used when the generated random numbers are independent, whereas the non-IID methods are used when the generated random numbers are not independent [23]. In addition, the methods are classified into statistics-based and predictor-based measurement methods.
This paper uses statistics-based measurement methods to speed up the entropy estimate; these methods are the most common value, collision test, Markov test, and compression test estimates [24]. The most common value estimate obtains entropy from the probability of the value that appears most frequently in the input data set, as shown in Equation (2). The collision test estimate defines arbitrary repetitive patterns as collisions, as in Equation (3), and estimates the probability of frequently appearing output values based on when collisions occur. The Markov test estimate measures the dependence between successive values in a set of input data, as in Equation (4), and the compression test estimate measures the entropy rate based on how much the data set can be compressed, as in Equation (5) [22].
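As a hedged illustration of the simplest of these estimators, the sketch below follows the most common value estimate as we understand it from NIST SP 800-90B: the observed frequency of the most common value is raised to the upper bound of a 99% confidence interval, and the min-entropy is the negative base-2 logarithm of that bound. The function name is hypothetical, and this is not the NIST reference tool.

```python
import math
from collections import Counter


def most_common_value_estimate(samples: bytes) -> float:
    """Min-entropy per sample (in bits) based on the most common value."""
    length = len(samples)
    if length < 2:
        raise ValueError("need at least two samples")
    # Observed proportion of the most frequent byte value.
    p_hat = Counter(samples).most_common(1)[0][1] / length
    # Upper bound of a 99% confidence interval on the true probability.
    p_upper = min(1.0, p_hat + 2.576 * math.sqrt(p_hat * (1.0 - p_hat) / (length - 1)))
    return -math.log2(p_upper)
```

For a file consisting of a single repeated byte the estimate is 0, and for data in which every byte value is equally frequent it approaches 8 bits per byte as the sample grows.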

Detection Module Configuration
The ransomware detection module proposed in this paper detects ransomware-infected files by measuring the entropy of files transferred to the cloud server. The module is located between the client software that provides the cloud service and the cloud server. The module needs the file path information so that it can read the file data and measure the entropy of each file. In addition, because the entropy changes according to the file format, the module has to obtain the file format; the formats considered are those of files that contain sensitive user information and files needed for system operation, that is, system, document, image, source code, and executable files. The list of files to be delivered to the cloud server is obtained, the paths of the files included in the list are extracted, and finally the file formats are obtained. Afterwards, the entropy of the file is measured, and the infected file is detected by comparing the entropy with the threshold for the file format. Through this process, it is possible to detect whether the victim system is infected with ransomware and to identify the infected files. Furthermore, if the module detects that a file to be transferred to the cloud server is infected by ransomware, that file is not synchronized, so the previously uploaded file is preserved for recovery. For this reason, the original file can be restored by downloading the uploaded file from the server. To establish the effectiveness of the detection method proposed in this paper, we experimented with Dropbox, a commercial cloud service, and verified the concept. First, information about the file to be transferred must be extracted to obtain the file path. We assumed that Dropbox would need this file information for synchronization and therefore expected that it would use the CreateFile function to obtain it.
To verify this assumption, we reverse-engineered the Dropbox software and extracted the path of the synchronizing file, as shown in Figure 2. As shown in the figure, the file name to be synchronized was 'Confidential data.txt', which was passed as an argument to the CreateFileW function. That is, the Dropbox software calls the CreateFileW function to extract the file path. Therefore, the detection module used the hooking technique for the CreateFileW function to obtain the list of files synchronized to the Dropbox server.
The detection module is attached to the Dropbox software rather than being built into it. That is, the detection module is implemented separately to verify the proposed concept and is connected to the Dropbox code to detect ransomware. Here, the hooking technique refers to manipulating the call flow of code or functions.
For example, because the module cannot otherwise obtain the list of files synchronized by Dropbox, the file path is obtained by forcibly redirecting the call flow of the CreateFileW function to the detection module. As a result, the detection module, which is embedded in the Dropbox software through the hooking technique, obtains the path of each synchronized file; the experimental result is shown in Figure 3.
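The idea of redirecting a call flow can be illustrated with a language-level analogy. The sketch below does not hook the Win32 CreateFileW function and is not the module used in this paper; it only wraps a hypothetical stand-in function, `open_for_sync`, so that each call passes through a detour that records the path before handing control back to the original.

```python
import functools
from typing import Callable, List

observed_paths: List[str] = []  # paths seen by the hypothetical detection module


def hook(original: Callable[..., object]) -> Callable[..., object]:
    """Wrap `original` so each call records its first argument, then proceeds."""
    @functools.wraps(original)
    def detour(path: str, *args, **kwargs):
        observed_paths.append(path)             # the detour sees the path first
        return original(path, *args, **kwargs)  # then the original call runs
    return detour


def open_for_sync(path: str) -> str:
    """Hypothetical stand-in for the file-open call made by the sync client."""
    return f"handle:{path}"


open_for_sync = hook(open_for_sync)  # redirect the call flow through the detour
open_for_sync("Confidential data.txt")
print(observed_paths)
```

The caller is unchanged and still receives the original return value, which mirrors why hooking lets the detection module observe file paths without modifying the client software itself.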
In this process, some of the extracted paths are duplicates or belong to Dropbox's own files or folders. For example, there are duplicated "config.dbx-journal" files and "Dropbox" folders. Such unnecessary and duplicate entries are removed to obtain an optimized list of synchronized files and their paths.
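A minimal sketch of this clean-up step follows. The set of internal names contains only the two examples mentioned above, and the function name `clean_sync_list` is hypothetical; a real client may create other internal entries that would need to be added.

```python
import ntpath  # Windows-style path handling, usable on any platform
from typing import Iterable, List

# Names treated as client-internal; taken from the two examples in the text.
INTERNAL_NAMES = {"config.dbx-journal", "dropbox"}


def clean_sync_list(paths: Iterable[str]) -> List[str]:
    """Drop duplicates and client-internal entries, keeping first-seen order."""
    seen = set()
    result = []
    for path in paths:
        name = ntpath.basename(path.rstrip("/\\")).lower()
        if name in INTERNAL_NAMES or path in seen:
            continue
        seen.add(path)
        result.append(path)
    return result
```

Only the final path component is compared, so files stored inside the "Dropbox" folder are kept while the folder entry itself is dropped.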
By extracting the file list and file paths, the detection module can read the data of the files being synchronized with the Dropbox server and then detect infected files by measuring the entropy based on the read data. For measuring entropy, NIST provides both statistics-based and predictor-based methods. However, predictor-based measurement has the disadvantage of taking a comparatively long time, and thus it is difficult to use for measuring entropy in real time. Therefore, we use the faster statistics-based measurement in this paper. Figure 4 shows the result of measuring the entropy of a file uploaded to the Dropbox server. On the right side is the detection module, which outputs the list of files to be synchronized and the entropy estimates of the files included in the list. In the experiment, the 'confidential file' is synchronized with the Dropbox server. The most common value estimate, collision test estimate, Markov test estimate, and compression test estimate were 0.654851, 0.553649, 0.025006, and 0.426463. Hence, the min-entropy was 0.426463.

Experimental Preparation and Implementation
This section describes our experimental conditions and goals, along with the experimental results of measuring the entropy of synchronized files.

Experimental Conditions
Based on the methodology proposed in Section 2, we verify ransomware detection by measuring the entropy of the files uploaded to the Dropbox cloud server. The target files are system, document, image, source code, and executable files, which are related to sensitive user information and system operation. Ransomware does not encrypt all files stored on a disk at once; rather, it infects files one at a time, sequentially, depending on the conditions. Based on this assumption, a sample ransomware similar to real-world ransomware was produced and used to infect the target files. For the experiment, we assumed that 100 files in each of 10 file formats were infected. We analyzed the entropy measurement results and entropy change trends to effectively detect ransomware when 10, 20, ..., 100 files were infected. Through this process, the module determined a threshold for effectively detecting infection by comparing and analyzing the entropy measurement results during the ransomware infection and encryption processes.
This experiment had three goals: first, to compare and analyze the entropy measurement results for each file format in order to determine the reference values for detecting ransomware infections according to the file format; second, to compare and analyze the entropy changes; and third, based on the two measurement results, to derive the optimal baseline for detection by analyzing the entropy change and the detection, false-positive, and false-negative rates.
First, because measuring entropy requires reference values for each clean file format, our module measured the entropy of 100 files in each format; the result is shown in Figure 5. Second, Table 1 shows the averages of entropy for the 100 files in each format, which were used to estimate the threshold of each format for clean files.
As the figure shows, the entropy results vary for some but not all clean files; specifically, 4, 7, 10, 3, and 4 high peaks were observed for the system, document, image, source code, and executable files, respectively. That is, some but not all clean files showed high entropy. In magnitude, the entropy of the system, document, image, source code, and executable files was below 6, 8, 8, 4, and 6, respectively. On average, the entropy of the system, document, image, source code, and executable files was above 3, 5, 4, 3, and 3, respectively.
As shown in Table 1, the entropy averages for the 100 clean files in each file format differed according to the measurement method; for example, the most common value estimate of entropy for the system files was 2.32, but the Markov test estimate was 0.81. In general, the most common value estimates were high and the Markov test estimates were low. Therefore, we concluded that the optimal reference value changes according to the measurement method and derived reference values for detecting ransomware infection from each of the four methods. We found that the system, source code, and executable files had similar entropy results; in particular, entropy was lowest among the source code files and highest among the document files. Source code files are statistically biased, whereas the data distribution of image files shows little statistical bias. Therefore, we derived the threshold for detecting ransomware-infected files based on the average entropy and the peak entropy, that is, the high peaks that occur in clean files.

Comparison and Analysis of Entropy by File Format
Entropy differed by file format, as discussed in Section 3.1. For this reason, detecting ransomware-infected files requires obtaining the file formats and the reference values for each format. We measured the entropy by file format while assuming that the clean files were infected by ransomware in increments of 10 files. We present the average entropy measurements for 100 files in Figure 6.
As shown in the figure, the entropy increased with the number of infected files; in particular, when more than 90 files were infected, the entropy of the infected files was higher than that of the clean files with all measurement methods. This means that all methods could detect infected files. In contrast, when 50 or fewer files were infected, it was hard to distinguish them because the entropy of the infected files was similar to that of the clean files. Moreover, when more than 60 files were infected, the files in all formats had a constant average entropy without significant differences. This means that the entropy measured when more than 60 files are infected can serve as a reference value.
To detect ransomware-infected files, the reference value is set to the smallest value that distinguishes infected files from clean files. In terms of the results for each file format and measurement method, the number of infected system, document, image, source code, and executable files required was 30 files with the most common value estimate, 40 files with the collision test estimate, 30 files with the Markov test estimate, and 30 files with the compression test estimate. Accordingly, the average entropy values measured when at least 40 files are infected can be used as the reference.
To distinguish the numerical entropy values more clearly, we present in Table 2 all the entropy values measured by file format and measurement method, together with the highest entropy value for clean files. In the table, Most, Collision, Markov, Compression, C, I, and P denote the most common value estimate, collision test estimate, Markov test estimate, compression test estimate, clean files, infected files, and peak entropy of clean files, respectively. Based on these results, infection could be detected with the fewest infected files for source code files, whereas image files could be detected only when most of them were infected. Therefore, we consider that effectively detecting ransomware-infected files requires selecting the optimal entropy measurement method for each file format. The numbers in red are entropy values that are not between C and P. These values are used as thresholds for detecting ransomware.
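The rule that a value becomes usable as a threshold once it falls outside the range between C and P can be sketched as follows. The function name and the sample numbers in the usage note are hypothetical and are not taken from Table 2; the sketch only shows how the first distinguishable infection count could be located in one row of such a table.

```python
from typing import Dict, Optional


def first_detectable_count(
    avg_entropy_by_count: Dict[int, float],
    clean_average: float,
    clean_peak: float,
) -> Optional[int]:
    """Smallest infected-file count whose entropy lies outside [C, P].

    Returns None when every measured value stays within the clean range.
    """
    low, high = sorted((clean_average, clean_peak))
    for count in sorted(avg_entropy_by_count):
        value = avg_entropy_by_count[count]
        if not (low <= value <= high):
            return count
    return None
```

For instance, with illustrative values C = 2.3 and P = 3.5 and measured averages of 2.6, 3.1, 3.9, and 4.4 for 10, 20, 30, and 40 infected files, the function returns 30, and the entropy measured at that count would be taken as the threshold.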

Comparison and Analysis of Entropy Changes by Number of Infected Files
The results shown in Figure 6 and Table 2 indicate that the entropy increases as the number of infected files increases. Therefore, reference values are required to detect a specific minimum number of ransomware-infected files. To determine these reference values, we compared and analyzed the changes in entropy according to the number of infections. For this paper, we assumed that 100 files were infected with ransomware per 10 files and derived the changes in entropy by the number of infections. The result is shown in Figure  7. The results showed that the file formats with the highest entropy according to the most common value estimation, collision test estimation, Markov test estimation, and compression test estimation were document, document, image, and document files. Source code files had the lowest entropy. According to the trend of change, the most common value estimate was the measurement method with the sharpest change by the number of infected files. The method with the slightest change was the collision test estimation. The file formats with sharp changes were system, source code, and executable files, while the document files showed the slightest changes. To summarize these results, our proposed detection module was likely to detect ransomware according to the number of infections when the entropy changes abruptly. That is, when the entropy was measured using the most common value estimates for system files, source code files, and executable files, we could effectively detect ransomware in a cloud environment.
An analysis of the points at which the entropy begins to change gradually for each file format showed that the optimal number of infected files for the most common value, collision test, Markov test, and compression test estimates was 30, consistent with the results shown in Figure 6 and Table 2. For this reason, we consider that ransomware can be detected effectively with a threshold of 30 files. In this paper, we derived candidate reference values by analyzing the changes in entropy for each measurement method and then determined the reference value that detects ransomware effectively based on the detection rate.
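As a rough illustration of the measurement that changed most sharply above, the following sketch computes a most common value estimate of min-entropy per byte. It assumes the estimator follows the NIST SP 800-90B definition (the proportion of the most frequent value, raised to the upper bound of a 99% confidence interval); the function name and the sample data are hypothetical and are not taken from the experiments in this paper.

```python
import math
import os
from collections import Counter


def most_common_value_estimate(data: bytes) -> float:
    """Min-entropy per byte (0 to 8 bits) from the most common value estimate.

    Assumption: this mirrors the NIST SP 800-90B most common value estimate.
    """
    n = len(data)
    if n < 2:
        raise ValueError("need at least two bytes")
    # Proportion of the most frequent byte value in the file.
    p_hat = Counter(data).most_common(1)[0][1] / n
    # Upper bound of a 99% confidence interval on that proportion.
    p_upper = min(1.0, p_hat + 2.576 * math.sqrt(p_hat * (1.0 - p_hat) / (n - 1)))
    return -math.log2(p_upper)


# Hypothetical inputs: repetitive source-like text and random bytes that
# stand in for an encrypted (ransomware-infected) file.
plain = b"int main(void) { return 0; }\n" * 2000
random_like = os.urandom(len(plain))

print("source-like  :", round(most_common_value_estimate(plain), 2))
print("encrypted-like:", round(most_common_value_estimate(random_like), 2))
```

Because a uniform byte distribution has no dominant value, the estimate for the random data lies close to the 8-bit maximum, while the repetitive text scores much lower. This gap is the kind of abrupt change the detection module relies on.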
We analyzed the detection rate according to the number of infections by file format, again assuming that, out of 100 files, ransomware infects the files in increments of 10. The result is shown in Figure 8. All measurement methods detected 100% of the ransomware-infected files using the average entropy for 70 infected files. However, using the average entropy for 80 infected files, the detection rate for the collision test estimate was lower than that for the other measurement methods. In addition, the detection rate decreased sharply for all measurement methods using the average entropy of 100 infected files. In particular, the compression test estimate had the lowest detection rate, whereas the most common value estimate had the highest. By file format, with the average entropy of 80 infected document and image files, several ransomware-infected files could not be detected. However, our proposed method detected all ransomware-infected source code and executable files using the average entropy of 90 infected files.
The results in the previous two sections show that the detection rate was high using the average entropy of 30 or 40 infected files. Moreover, the detection rate reached 100% with each of the average entropies from 10 to 70 infected files; this suggests that the average entropy can serve as an efficient reference value. However, some reference values showed high false-positive and false-negative rates even with high detection rates. As shown in Table 2, some files in each format had larger-than-average entropy. These files caused false positives, so in this paper we derived the optimal reference values that minimize false positives and false negatives.

Determination of Optimal Baseline Values by Detection Rates, False-Positive Rates, and False-Negative Rates
This paper derives optimal baseline values for detecting ransomware-infected files based on the above entropy comparison and analysis results, detection rates, false positives, and false negatives. Because these rates differed by measurement method, we determined the optimal baseline values for each method. First, Figure 9 shows the false-positive rates by measurement method. The results show that all measurement methods had low false-positive rates when the reference value was the average entropy for a large number of infected files. With the average entropy of 70 infected files, most methods had very low false-positive rates. By measurement method, the Markov test estimate had a low false-positive rate over the longest interval, and the most common value estimate had a low false-positive rate over the shortest interval. By file format, the source code files had a low false-positive rate over the longest interval in all measurement methods, and the image files over the shortest interval. By measurement method and file format, the most common value, collision test, and compression test estimates had low false-positive rates over the longest interval with the source code files. The Markov test estimate had a low false-positive rate over the longest interval with the executable and source code files. Moreover, the most common value estimate had a low false-positive rate over the shortest interval with the document files. The Markov test, collision test, and compression test estimates had low false-positive rates over the shortest interval with the image files.
By the number of files in which a false positive occurred, the image files had the most false positives and the source code files had the fewest with all measurement methods. By measurement method, the Markov test estimate had the fewest false positives, and the most common value estimate had the most.
As with the false-positive rate, to detect ransomware efficiently, we derived the baseline entropy with the optimal false-negative rate; the false-negative rates by measurement method are shown in Figure 10. Here, all measurement methods had low false-negative rates when the reference value was the average entropy for a small number of infected files. With the average entropy of 70 infected files, most methods had very low false-negative rates. By file format, the source code files had a low false-negative rate over the longest interval, and the document and image files had low false-negative rates over the shortest interval. By combined measurement method and file format, the most common value estimate and the Markov test estimate had low false-negative rates over the longest intervals in all file formats.
By the number of files in which a false negative occurred, source code files had the most false negatives, and system files had the fewest. By measurement method, the most common value estimate had the fewest false negatives, and the collision test estimate had the most.
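The relationship between the reference value and the three rates discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in this paper: the reference value for k infected files is read as the mean entropy of a 100-file set in which k files are infected, a file is flagged when its entropy is at or above the reference value, and all entropy values and names (`reference_value`, `rates`, `clean`, `infected`) are hypothetical.

```python
from statistics import mean


def reference_value(clean: list[float], infected: list[float], k: int) -> float:
    """Mean entropy of the file set when the first k files are infected.

    Assumption: this is one reading of "average entropy for k infected files".
    """
    return mean(infected[:k] + clean[k:])


def rates(clean: list[float], infected: list[float], reference: float) -> dict[str, float]:
    """Detection, false-positive, and false-negative rates for one reference value."""
    detected = sum(e >= reference for e in infected)   # infected files flagged
    false_pos = sum(e >= reference for e in clean)     # clean files flagged
    return {
        "detection": detected / len(infected),
        "false_positive": false_pos / len(clean),
        "false_negative": (len(infected) - detected) / len(infected),
    }


# Hypothetical per-file entropies (bits per byte) for 100 files of one format,
# measured before infection (clean) and after infection (infected).
clean = [2.0 + 0.01 * i for i in range(100)]
infected = [7.0 + 0.005 * i for i in range(100)]

for k in range(10, 101, 10):
    ref = reference_value(clean, infected, k)
    print(k, round(ref, 3), rates(clean, infected, ref))
```

Under this reading, a reference value taken from few infected files is low, which keeps false negatives rare but risks false positives, whereas a reference value taken from all 100 infected files equals the mean infected entropy and therefore misses roughly half of the infected files.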
Based on the above results for false positives and false negatives, we derived baseline values to detect ransomware-infected files. We determined the baseline values as the entropy intervals with the fewest false positives and false negatives and the highest detection rate, as shown in Table 3. We found baseline entropy values with zero false positives, zero false negatives, and 100% detection in all measurement methods. By file format, image files had the shortest interval of baseline entropy values, and source code files had the longest.
By measurement method, in the most common value estimate, document files had the shortest interval at 0.25, and source code files had the longest at 3.35. In the collision test estimate, image files had the shortest interval at 0, and source code files had the longest at 3.36. In the Markov test estimate, image files had the shortest interval at 0.8, and source code files had the longest at 3.52. Finally, in the compression test estimate, image files had the shortest interval at 0.38, and source code files had the longest at 4.66. Therefore, the entropy-based ransomware detection method proposed in this paper is effective and, for the files tested, completely accurate.
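One way to read the baseline intervals above is as the gap between the highest entropy of any clean file and the lowest entropy of any infected file of the same format: any reference value inside that gap yields no false positives and no false negatives. The sketch below illustrates this reading only; it is an assumption about how such an interval could be obtained, and the function name and entropy values are hypothetical rather than taken from Table 3.

```python
from typing import Optional


def baseline_interval(clean: list[float], infected: list[float]) -> Optional[tuple[float, float]]:
    """Return (low, high) such that any reference r with low < r <= high
    separates the two groups perfectly under the rule "flag if entropy >= r".

    Returns None when the groups overlap and no error-free reference exists.
    """
    low, high = max(clean), min(infected)
    if high <= low:
        return None
    return (low, high)


# Hypothetical entropies (bits per byte) for one file format.
source_clean = [2.1, 3.0, 2.7]
source_infected = [6.4, 7.2, 6.9]

interval = baseline_interval(source_clean, source_infected)
if interval is not None:
    low, high = interval
    print("interval:", interval, "length:", round(high - low, 2))
```

Under this reading, a format whose clean files already have high entropy, such as compressed images, leaves only a narrow gap, while low-entropy source code leaves a wide one, which agrees with the ordering of interval lengths reported above.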

Conclusions
This paper proposed a method to effectively detect ransomware by measuring the entropy of files stored on a cloud server. The proposed detection method is based on uniformity, one of the characteristics of ransomware-infected encrypted files; that is, each value in the data range occurs with almost the same probability in an encrypted file. Therefore, we measured the uniformity of the files synchronized to the cloud server based on entropy. For the experiment, we collected system, document, image, source code, and executable files, which are the files necessary for system operation or include sensitive user information. We estimated the entropy of these files and compared the entropy of ransomware-infected encrypted files with that of clean files. To derive the baseline entropy for infected files, we compared and analyzed the entropy by file format and number of ransomware infections; based on these results, we derived the optimal baseline values for detecting ransomware-infected files. The baseline values we derived detected 100% of the ransomware-infected files with 0% false positives and 0% false negatives. Therefore, the proposed method detects ransomware-infected files very effectively. Furthermore, because detected files are not synchronized, the original files can be restored from the backups stored on the cloud server.
The limitations of the method proposed in this paper are as follows. First, we assumed that ransomware infects the files of the victim system slowly and sequentially, and we tested only system, document, image, source code, and executable files. Second, the entropy characteristics of encrypted files were derived from sample ransomware that is very similar to actual ransomware but is not actual ransomware. Third, encryption and compression have essentially similar characteristics: compression also increases the entropy of a file. For this reason, the proposed method has difficulty distinguishing compressed files from ransomware-infected files. Fourth, the proposed method detects ransomware through entropy measurements. We therefore focused on general ransomware and did not consider variants, such as those using partial or intermittent encryption, that do not change the entropy of files significantly. Nevertheless, we consider that even ransomware that uses partial encryption can be detected if the entropy is measured over parts of a file. Moreover, "The inadequacy of entropy-based ransomware detection" [25] asserted that entropy-based ransomware detection is inadequate because the entropy can be normalized with Base64 encoding. However, if the defender identifies a Base64-encoded file, entropy-based detection can still detect ransomware-infected files by measuring the entropy of the file after decoding. Finally, the criteria for ransomware-infected files were derived assuming that the format of the target file was known, and we believe that, of the entropy measurement methods, only the Markov test estimation could be applied.
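The two mitigations mentioned above, measuring entropy over parts of a file and decoding Base64 before measuring, can be sketched as follows. This is a hedged illustration only: it uses plain Shannon entropy over byte frequencies rather than the estimators evaluated in this paper, and the function names, the 4096-byte chunk size, and the sample data are all assumptions made for the example.

```python
import base64
import binascii
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


def max_chunk_entropy(data: bytes, chunk_size: int = 4096) -> float:
    """Highest entropy among fixed-size chunks, to expose partial encryption."""
    chunks = (data[i:i + chunk_size] for i in range(0, len(data), chunk_size))
    return max((shannon_entropy(c) for c in chunks), default=0.0)


def decode_if_base64(data: bytes) -> bytes:
    """Return the decoded payload if data is strictly valid Base64, else data.

    Note: short plain text can also happen to be valid Base64, and wrapped
    (multi-line) Base64 is rejected by this strict check.
    """
    try:
        return base64.b64decode(data, validate=True)
    except binascii.Error:
        return data


# Hypothetical file: 60 KiB of source-like text with one encrypted 4 KiB chunk.
text = (b"int main(void) { return 0; }\n" * 2200)[:15 * 4096]
partially_encrypted = text[:8192] + os.urandom(4096) + text[8192:]
print("whole file:", round(shannon_entropy(partially_encrypted), 2))
print("max chunk :", round(max_chunk_entropy(partially_encrypted), 2))

# Hypothetical Base64-wrapped ciphertext: entropy is capped at 6 bits per byte
# until the payload is decoded.
encoded = base64.b64encode(os.urandom(3000))
print("encoded   :", round(shannon_entropy(encoded), 2))
print("decoded   :", round(shannon_entropy(decode_if_base64(encoded)), 2))
```

In this example the whole-file entropy stays moderate while the encrypted chunk alone scores close to 8 bits per byte, and the Base64-wrapped data regains its high entropy once decoded. A practical detector would still need a reliable way to recognize encoded files, which this sketch does not address.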
In the future, we will address these limitations and study ways to detect ransomware on external disks, USB storage devices, secure disks, and secure USB devices used to back up or store files, as well as on cloud services. Moreover, to verify the applicability of the proposed methodology in the real world, we will expand the set of file formats and use publicly available ransomware samples for verification.