Measurement and Analysis of SSD Reliability Data Based on Accelerated Endurance Test

Abstract: In recent years, NAND Flash-based solid-state drives (SSDs) have become widely used in data centers and consumer markets. Data centers generally provide high-quality storage services by deploying a large number of SSDs, but there are currently no effective preventive measures to reduce the impact of SSD failures. Some existing studies have analyzed the factors related to SSD failures from different angles, but the reliability changes that an SSD exhibits throughout its life cycle have not been explored in depth. In addition, although the 3D manufacturing process has increased the storage density of SSDs, the mutual influence between flash cells has also increased, resulting in severe degradation of SSD performance and lifetime. Therefore, to fully understand how SSD reliability varies throughout the life cycle, we first designed an SSD lifetime endurance test method, then conducted the endurance test and collected reliability data for the entire life cycle of 3D TLC SSDs in a laboratory environment with reference to the JEDEC standard. Through the analysis of the experimental data and its statistical correlation, we find that an SSD produces a large number of uncorrectable errors before reaching its endurance limit, exhibits a sustained high operating temperature, and shows some intrinsic relationships among its reliability data. The findings in this paper are valuable for identifying whether an SSD is going to fail.


Introduction
Currently, the demand for solid-state drives (SSDs) based on NAND Flash technology is growing in the consumer market, enterprise market, embedded product market, and so forth, and SSDs have been widely used in various computer systems. From a technical point of view, all major flash memory manufacturers focus on 3D stacking technology with 64-layer or 96-layer solutions. Compared to 2D planar technology, 3D NAND steps back to a coarser process node, at the 20 nm-30 nm level or even higher. Because of the continuous improvements in manufacturing technology, the cost performance of SSDs has improved rapidly in the past 10 years, and a large number of data centers have begun to deploy SSDs to further optimize their storage services.
SSDs show multidimensional advantages compared to the hard disk drives that used to be dominant in the storage industry. From the perspective of performance and power consumption, SSDs provide not only faster read/write speeds but also better random I/O access performance, and they are smaller and consume less power. From the perspective of reliability, the lack of moving parts inside SSDs eliminates reliability problems such as head collision, media scratches or spindle electromechanical failure, and protects the SSD against physical impact. On the other hand, NAND Flash-based SSDs have only a limited number of program/erase (P/E) cycles, which means the aging problem of SSDs is unavoidable. Previously reported reliability indexes suggest that the generation of uncorrectable errors is not closely related to the program and erase operations of SSDs, so that UBER (uncorrectable bit error rate) is not a proper failure metric. Many studies analyzed SSD failures from multiple perspectives, but there are also some works that focused on one specific SSD problem and provided the relevant optimization techniques, such as Cai et al. [8-11,14,18,19,23], who focused on the MLC Flash chip error model and proposed some improved technologies to reduce the impact of flash errors and improve flash reliability.
As 3D NAND technology matures gradually, many manufacturers have started to develop SSDs based on 3D NAND Flash [24-27]. Some studies have discussed the architecture and working principles of 3D NAND Flash cells [28-34]. Parat [35] introduced the Intel-Micron first generation 3D NAND Flash with a vertical channel surround gate structure which has better cell characteristics than 2D NAND and presented some technical challenges in endurance and reliability that 3D NAND will face. Venkatesan [36] discussed the fundamentals and electron properties of 3D NAND Flash from the view of fabrication process integration and equipment engineering. References [11,37,38] compared 2D NAND Flash and 3D NAND Flash in terms of physical structure and working principle, and analyzed the advantages and problems brought by 3D technology. Seo [39] studied the interference between flash cells in terms of the composition of 3D NAND Flash cells.
Although many studies have shown that 3D NAND Flash has advantages such as high storage density and low price, its shortcomings in endurance and data error rate are also obvious. Ma [40] tested and analyzed the RBER (raw bit error rate) of 3D TLC NAND Flash and proposed a lifetime prediction scheme for 3D TLC NAND Flash based on RBER and an SVM (support vector machine); the experimental results showed that the prediction scheme can significantly extend the lifetime of flash blocks. Q. Xiong et al. [37] studied the latency and raw bit error rate of floating-gate 3D NAND and obtained results similar to those of 2D NAND. Toru [41] analyzed the problems that deserve attention when developing the next generation of 3D NAND Flash from the perspective of power consumption, performance and reliability. Luo [42] described the effect of temperature, the program interval and the program accuracy on 3D NAND Flash. According to the characteristics of 3D NAND Flash, recent studies [16,17] have proposed remapping read-hot pages to SLC blocks, which effectively alleviates the reliability impact caused by read disturb. The classification of the above studies is organized in Table 1.

Table 1. Classification of the related studies.
Topic | References
Flash error model and reliability improvement | [3,8-11,14,18,19,22,23]
Basic idea of 3D NAND Flash | [24-27,35,36]
Architecture of 3D NAND cells | [28-34]
Comparison of 2D and 3D NAND | [11,37,38]
Shortcomings and mitigation of 3D NAND | [37,39-42]

However, these studies lack macro analyses of the variations in SSD reliability indexes throughout the entire lifetime of the SSD. As far as we know, most of the previous related research has focused on MLC, SLC, and other Flash types. This work is the first one that focuses on an endurance test of 3D TLC SSDs throughout their entire lifetime. The work shows and analyzes changes in SSD reliability data. Through the analysis of SSD reliability data, our opinions are similar to those in Reference [3].
We believe that uncorrectable errors (UE) are related to flash program operations and that the UE count is a good metric to judge SSD failures. We also believe that the distinctive change in temperature is valuable for determining whether the SSD is about to fail.

Basic Technology
To better describe the measurement process implemented in this article, we provide a brief overview of several basic techniques in this section.

PostMark
PostMark [43] is a single-threaded synthetic benchmark program developed by NetApp in 1997. It is designed to measure the performance of file systems under workloads dominated by small file operations and short file lifetimes. This type of workload is typical for mail services, online news, web business transactions and other application scenarios. PostMark does not perform any application processing and only approximates the activity of the file system. PostMark starts by creating a pool of random files whose contents are composed of characters, numbers, and so on. The file sizes are uniformly distributed within the specified range. After the files are created, a series of "transactions" (a PostMark term referring to something similar to an operation, not a database concept) is executed. The number of files, the number of subdirectories, the file size range, and the number of transactions are all set by the user. Each PostMark transaction has two parts: file creation/deletion and file reading/appending. The type of each transaction and the files affected by it are selected randomly to minimize the impact of file system caching, file read-ahead, disk-level caching and track buffering. Additionally, PostMark can adjust the ratio between these operation types through its read bias and create bias parameters to produce the desired workload. The file creation operation creates random text content and writes it to the file. The file deletion operation randomly selects files from the active set for deletion. The file read operation randomly selects a file and reads the entire file (using the configured block size). The file write operation randomly selects a file and appends a random amount of data to it. The user can also choose whether or not to use buffered I/O.

SMART Technology
SMART (Self-Monitoring, Analysis and Reporting Technology) is a disk self-analysis and detection technology [44]. It monitors the status of the disk hardware (heads, platters, motor, etc.) through test commands in the disk firmware and compares the monitored values with the thresholds set by the manufacturer. If a monitored value exceeds its threshold, the hardware/software monitors in the host send a warning to the user, and minor automatic repairs may be performed to ensure the reliability of data. Except for some old models, most hard disk drives support this technology. SMART is also found in most SSDs, where it exposes parameters such as model number, capacity, working temperature, data volume, error count, and so on.

Measurement Methodology
This measurement-driven study aims to better understand the reliability characteristics of 3D TLC Flash. To make the measurement process as realistic as possible, the measurement needs a workload that matches the real scenario. The rest of this section details our measurement methodology.

Overview of an SSD Sample
We selected 3D TLC SSDs with a capacity of 120 GB from Intel (Figure 1) as the samples for the endurance test. There are several major components on the board, such as the power and data interfaces, controller, DRAM, NAND, and so forth. An SSD may contain multiple NAND ICs (integrated circuits). The controller is a microprocessor that uses an external DRAM as its working memory and runs the logic in firmware. The controller communicates with both the NAND and the host, and it is responsible for converting the read and write requests from the host into I/O operations on the NAND. The NAND flash chips are connected to the controller through channels; each chip consists of one or more dies, and each die consists of multiple planes. Normally a plane is composed of a number of blocks, which are the units of the erase operation, and a block is composed of multiple pages, which are the smallest units of read or write. The parallelism of data transfer has four main levels: channel-level, chip-level, die-level and plane-level. The DRAM is typically used to temporarily buffer the write requests or accessed data and to hold the mapping table, which maps the logical addresses from the file system to the physical addresses on the flash. The other basic parameters of the SSD are shown in Table 2. These parameters show the type of SSD and the corresponding NAND process.

Measurement Setup
Each SSD was connected to a single machine in order to avoid interference from other devices or programs. There was only one HDD (for data storage) and one SSD mounted on each machine, and no other tasks occupied the CPU or disk I/O resources. The whole process was an accelerated SSD aging process, so we wanted it to simulate a real scenario and also to be as quick as possible. While setting the workload for the benchmark, we consulted the characteristics of the Oracle Archive system's workload [45], in which write operations outnumber read operations and therefore accelerate the P/E cycling of the SSD. Oracle Archive is the archiving mode of an Oracle database. In this mode, the database first backs up the previous online redo logs, then erases the backed-up logs and starts writing new online redo logs (redo files). The workload of this application is typically write-dominant. Read operations account for approximately 0%-20% of the workload, most of which are random reads, and write operations account for approximately 80%-100%, most of which are random writes. The processed files were mainly small files with sizes distributed in the ranges of 0-2 KB, 8-16 KB and 32-64 KB.
According to the characteristics of Oracle Archive's workload, we set the related parameters for the PostMark benchmark as shown in Table 3. The file sizes were randomly generated in the range of 32-64 KB, the read/write ratio was 2/8, the block size of the read and write operations was 8 KB, the number of concurrent file operations was 100,000, the number of transactions was 400,000 and the number of working directories was 50. The default PostMark parameters were not large enough; therefore, we scaled the parameters up.
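The parameters listed above can be expressed as a PostMark command script. The following is a minimal sketch, assuming the interactive command names of PostMark 1.5 (`set size`, `set number`, `set transactions`, `set subdirectories`, `set read`, `set write`, `set bias read`, `run`, `quit`). The helper name `postmark_script`, the mount point `/mnt/ssd`, the mapping of the 100,000 concurrent file operations onto `set number`, and the mapping of the 2/8 read/write ratio onto `set bias read 2` (a 0-10 scale) are assumptions made for illustration; they are not taken from Table 3.

```python
def postmark_script(location="/mnt/ssd"):
    """Build a PostMark command script from the parameters given in the text.

    `location` is a hypothetical mount point of the SSD under test.
    """
    kib = 1024
    commands = [
        f"set location {location}",
        f"set size {32 * kib} {64 * kib}",  # file size range: 32-64 KB
        "set number 100000",                # assumed: simultaneous files
        "set transactions 400000",          # transactions per run
        "set subdirectories 50",            # working directories
        f"set read {8 * kib}",              # read block size: 8 KB
        f"set write {8 * kib}",             # write block size: 8 KB
        "set bias read 2",                  # assumed mapping of the 2/8 ratio
        "run",
        "quit",
    ]
    return "\n".join(commands) + "\n"


if __name__ == "__main__":
    print(postmark_script(), end="")
```

The generated text could be saved to a file and passed to the benchmark; the exact invocation depends on the PostMark build in use and should be checked against its built-in help.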

Measurement Flow
The JESD218 standard defines an endurance test method for SSDs. This standard covers the complete endurance test and data retention test for SSDs (Figure 2), but it does not cover all aspects of SSD reliability, such as circuit board failures, controller failures or soft errors caused by radiation. The purpose of our measurement was to obtain a series of reliability data while the SSDs were worn to their endurance limit; therefore, we only conducted the accelerated endurance test at room temperature and excluded the data retention test.

[Figure 2 flow: SSD samples; endurance stressing at room temperature (50%) and endurance stressing at high temperature (50%); room temperature retention evaluation; high temperature retention bake.]
Figure 2. Simplified JESD218 endurance test. The "Endurance stressing at room temperature" is the one that we conducted.
We designed a control process for the measurement flow, including the procedures that generate the workload, perform the test and collect the data. The benchmark was executed continuously and the SSD reliability data was collected until data could no longer be written to the SSDs, which took a few months. The overview of the measurement flow and data collection is shown in Figure 3.
During the endurance test, the workloads were generated continuously by PostMark to keep the SSDs in a working state, and the execution of the program was an automated process, which is shown in Figure 4. Firstly, the control scripts that set the parameters of PostMark and other related programs were initialized. Secondly, the execution of PostMark was started and the results were saved when the execution finished. Then the program checked whether the SSD had entered the "write protection" mode. If the answer was "Yes," the SSD had reached its endurance limit and the test process of this SSD sample ended. If the answer was "No," the execution of PostMark was triggered again to continue testing.
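The automated process described above can be sketched as a simple loop. This is an illustrative outline only, not the control scripts used in the measurement: the function name `run_endurance_test` and its three callables are hypothetical hooks, and the optional `max_runs` cap is an addition for safe experimentation that the text does not mention.

```python
def run_endurance_test(run_benchmark, is_write_protected, save_result,
                       max_runs=None):
    """Repeat the benchmark until the drive reports write protection.

    All three callables are hypothetical hooks supplied by the caller:
      run_benchmark()      -> result of one PostMark execution
      is_write_protected() -> True once the SSD refuses writes
      save_result(n, r)    -> persist result r of run number n
    Returns the number of completed benchmark runs.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        result = run_benchmark()
        runs += 1
        save_result(runs, result)
        if is_write_protected():
            break  # endurance limit reached: stop testing this sample
    return runs


if __name__ == "__main__":
    # Simulated drive that becomes write-protected after three runs.
    state = {"runs": 0}
    log = []

    def fake_benchmark():
        state["runs"] += 1
        return f"run {state['runs']} ok"

    completed = run_endurance_test(
        fake_benchmark,
        lambda: state["runs"] >= 3,
        lambda n, r: log.append((n, r)),
    )
    print(completed, log[-1])
```

Passing the benchmark and the write-protection check as callables keeps the loop testable without a real drive.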

Data Collection
During the measurement procedure, the SMART data and the device statistics data from the SSD were acquired using Smartmontools [44]. Because the SSDs from different manufacturers have different SMART attributes, some SMART attributes could not be obtained. Some SMART attributes only had a name and there were no corresponding values.
The problem is similar for the device statistics data: some attributes in the statistics list do not contain any values. SMART attributes and device statistics attributes of this kind are outside the scope of this work. Table 4 lists some of the SMART attributes that were collected and used in this research. In the table, "type" denotes the kind of information collected, "cumulative" denotes a value that accumulates as the drive ages, and "normalized" denotes a value in the range of 1-100, in which lower values are worse and higher values are better.
A portion of the device statistics data collected and used in the research is listed in Table 5, which covers statistical information from the devices such as temperature statistics, error statistics, transmission statistics and summary statistics. We implemented a Python program to process the data and save it in a MySQL database. We also developed some Linux shell scripts to store the PostMark execution results and the corresponding data as files. The collection interval was one hour.
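As an illustration of the processing step, the sketch below parses an attribute table in the column layout printed by `smartctl -A` (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). It is not the program used in this work. The sample text and all values in it are made up for the example, the function name `parse_smart_attributes` is hypothetical, and the database step is omitted.

```python
# Made-up sample in the layout of `smartctl -A`; the values are illustrative.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1234
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       43 (Min/Max 21/55)
"""


def parse_smart_attributes(text):
    """Return {attribute_name: {"id", "value", "raw"}} for each table row."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split(None, 9)  # the 10th field keeps the whole raw value
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip the header and any non-attribute lines
        raw_text = fields[9].strip()
        try:
            raw = int(raw_text.split()[0])  # leading number of the raw value
        except (ValueError, IndexError):
            raw = None  # raw value is not a plain number
        attributes[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),  # normalized value
            "raw": raw,
        }
    return attributes


if __name__ == "__main__":
    for name, info in parse_smart_attributes(SAMPLE).items():
        print(name, info)
```

Attribute names and the meaning of raw values differ between manufacturers, so a real collector would need a per-model mapping.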

Graphical Display and Analyses of SMART Data
NAND Flash cells can undergo a limited number of P/E cycles that vary with the process, which is also referred to as its endurance rating. The flash wears out permanently when its P/E cycles are all consumed. Generally, an SSD adopts wear-leveling to distribute the wear evenly in each flash cell to average the overall wear. However, as time goes by, the overall wear will eventually lead to SSD failures.
In this section, we present some SMART attributes collected from the measurements and analyze the changes in the attributes and related phenomena. There were 10 machines used for the test, with three different hardware configurations: Type-A × 4, Type-B × 4 and Type-C × 2. All the machines used the same version of the operating system and other related software; the details of the configuration are shown in Table 6. The SSDs used for testing have been described in the "Overview of an SSD Sample" section. The measurement in this work covers 10 SSDs, but 3 of them failed during the measurement due to a sudden power cut. Therefore, we only show and analyze the changing trends of SSD SMART attributes from 7 samples. The attributes are displayed in Figure 5, where the x-axis is time and the y-axis is the normalized value of each attribute.

Host Writes and NAND Writes
As explained earlier, the endurance rating of flash cells is related to the number of P/E cycles they can consume. The cumulative P/E cycles of an SSD are directly determined by the volume of data written to it and can therefore be estimated from that volume. In this sense, the amount of data written to an SSD can be treated as equivalent to its P/E cycles.
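A rough way to make this estimate concrete is to divide the volume of data written to the flash by the drive capacity. The sketch below does this under simplifying assumptions that are not stated in the text: wear leveling is ideal, and over-provisioned space is ignored. The function name `estimated_pe_cycles` and the 180 TB figure are hypothetical and serve only as an example.

```python
GB = 10**9
TB = 10**12


def estimated_pe_cycles(nand_bytes_written, capacity_bytes):
    """Average P/E cycles per block, assuming perfectly even wear."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    return nand_bytes_written / capacity_bytes


if __name__ == "__main__":
    # Hypothetical example: 180 TB written to the flash of a 120 GB drive.
    print(estimated_pe_cycles(180 * TB, 120 * GB))  # 1500.0
```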
There are two types of data written to an SSD. One is the host writes, which represent the amount of data that the operating system transmits through the interface to be written to the SSD. The other is the NAND writes, which represent the amount of data actually written to the NAND Flash. Figure 5 shows the variations in the host writes and NAND writes for an SSD sample. Under normal program conditions, since the controller of the 3D TLC SSD applies a compression algorithm, the write amplification factor (WAF) can be less than 1, that is, the amount of NAND writes is less than the amount of host writes.
For a long period, the SSDs remain in a stable programming state, and the host writes keep increasing as the benchmark executes. At about the 80%-90% stage of the measurement, the growth rate of host writes slows down because, due to wear, the SSDs can no longer accept the previously requested volume of data. At the same time, the growth rate of NAND writes increases dramatically. The reason for this increase is that when the SSD approaches permanent wear-out, most blocks are already worn out and only a few blocks can still be programmed. In this phase, the available space in the SSD cannot meet the program requests generated by the benchmark program.

Write Amplification
To write the same volume of data as before, the SSD needs to perform more garbage collection to provide empty space for programming within the blocks that have not been thoroughly worn out. This process also leads to a rapid increase in the SSD's WAF. When the NAND writes grow towards the end in Figure 5, the SSD is extremely close to wearing out and any further program operations may lead to SSD failure. To avoid losing data, the SSD enters the "write protection" mode and cannot perform any program operations (NAND writes stop growing). The variation of the WAF is shown in Figure 6. It can be seen from the figure that the WAF of each sample stays relatively stable for a long time but increases significantly, at similar rates across samples, when the SSDs are close to their endurance limit. Equation (1) is used to calculate the WAF corresponding to the daily data volume:

$$\mathrm{WAF}_i = \frac{NW_i}{HW_i} \tag{1}$$

where i indexes days, and $NW_i$ and $HW_i$ represent the NAND writes and host writes of day i, respectively.
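The daily WAF can be sketched as follows. The sketch assumes that the SMART counters for host writes and NAND writes are cumulative and sampled once per day, so the daily volumes are obtained as differences between consecutive samples; this reading, the function name `daily_waf` and the sample numbers are assumptions made for illustration.

```python
def daily_waf(nand_writes, host_writes):
    """Daily WAF from cumulative counters sampled once per day.

    Returns one value per day starting with the second sample; a day
    without host writes yields None because the ratio is undefined.
    """
    if len(nand_writes) != len(host_writes):
        raise ValueError("both series must have the same length")
    result = []
    for i in range(1, len(host_writes)):
        nand_delta = nand_writes[i] - nand_writes[i - 1]
        host_delta = host_writes[i] - host_writes[i - 1]
        result.append(nand_delta / host_delta if host_delta > 0 else None)
    return result


if __name__ == "__main__":
    # Made-up counters (arbitrary units): stable phase, then heavy
    # garbage collection on the last day.
    host = [0, 100, 200, 250]
    nand = [0, 90, 180, 380]
    print(daily_waf(nand, host))  # [0.9, 0.9, 4.0]
```

In the made-up series, the ratio stays below 1 while the drive is healthy and jumps once NAND writes grow faster than host writes, which mirrors the trend described for Figure 6.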

Media Wear-Out Indicator
Each SSD manufacturer has multiple types of products for different markets and sets the basic parameters based on the rating. Due to the limited lifetime of SSDs, manufacturers often define the terabytes written (TBW) of an SSD according to flash type, capacity, warranty period and other indicators and use it as the endurance rating for an SSD.
The media wear-out indicator is a normalized value that indicates the degree of SSD wear. The value of a new SSD starts from 100 and decreases to 1 as the P/E cycles increase. Figure 5 illustrates the changes in this attribute. It can be seen that the value has decreased to 1 after approximately 40% of the measurement, at which point the volume of data written to the SSD has reached the threshold declared by the manufacturer. Afterwards, the SSDs still maintain a stable programming state; thus, we believe the threshold declared by manufacturers is too conservative. A media wear-out indicator value of 1 is insufficient to declare that an SSD has reached the end of its lifetime.

Uncorrectable Errors
The Facebook study [3] focuses on MLC NAND Flash-based SSDs with relatively low age and usage: the SSD age is between 0.5 and 2.4 years on average across different hardware platforms, and the SSDs have experienced fewer than 100 P/E cycles. Their report shows that the "old" SSDs have more uncorrectable errors than the "young" SSDs and that, for each platform, most of the errors are produced by a few SSDs, while the uncorrectable bit error rates (UBERs) are between $10^{-11}$ and $10^{-9}$.
Google counted the proportion of SSDs affected by uncorrectable errors within four years in their study [4] and showed that it is common for SSDs to have uncorrectable errors. According to different types of SSDs, 26% to 90% of SSDs experience at least one uncorrectable error.
We observe that uncorrectable errors are inevitable as SSDs wear, and all samples in the measurement have uncorrectable errors to some degree. The uncorrectable errors do not occur immediately when SSDs are put into use; as shown in Figure 5, the SSDs have no uncorrectable errors for a long period of time, and the errors then suddenly increase to a large number when an SSD is close to its endurance limit. The uncorrectable errors appear at approximately the 80% stage of the measurement and, in the following stage, the cumulative number of uncorrectable errors increases rapidly and finally stops at a fixed value. The UBERs of our samples are about $3 \times 10^{-14}$ according to the observation, which, like the results from Microsoft and Facebook, is more than an order of magnitude above the $10^{-15}$ and $10^{-16}$ that are required by the JEDEC standard [6] for consumer and enterprise class drives, respectively. The reasons for these wide ranges of UBERs might be different from our conjecture.

Temperature
There is a common view that high temperatures may have negative effects on SSD performance and accelerate the aging of flash cells. The influence of external temperature is particularly important to SSDs, and data centers adopt appropriate cooling methods according to the characteristics of the flash. In addition to the external temperature, it is also necessary to understand how the SSD internal temperature varies from the time the drives are deployed until final wear-out. We can obtain the real-time working temperature of the SSD controller through the sensor placed inside the SSD by the manufacturer, which better indicates the changes in SSD internal temperature. Figure 5 shows the variation of the SSD internal temperature from the start of the measurement until the SSDs wear to their endurance limit. The overall trend is similar to that of uncorrectable errors, NAND writes and other attributes. For a long period of time, the temperature fluctuates steadily in a range of 40 °C-50 °C, and at approximately 90% of the measurement, the temperature begins to increase significantly and rises to a range of 50 °C-55 °C. After the SSD controllers have stayed in this temperature range for about half a week, the SSDs enter "write protection" mode and can no longer be written; the temperature then returns to the previous range of 40 °C-50 °C. The specific reasons will be discussed in the following paragraphs.

SATA Downshift Error
The SATA interface may downgrade to a lower signaling rate (e.g., from 6 Gbps to 3 Gbps) when too many errors are encountered. Such a low signaling rate results in SSD performance degradation. The cause of this phenomenon can be temporary or permanent errors. According to our observations, some SSDs select a lower signaling rate when they are reaching their endurance limit. Furthermore, as listed in Table 7, more than half of the SSDs downgraded once and a few of the SSDs never downgraded. This phenomenon usually appears after the SSDs enter the "write protection" mode, and Figure 5 displays how the SATA downshift error count of an SSD changes over its lifetime.

Joint Analysis of SMART Attributes
As shown in Figure 5, attributes such as temperature, NAND writes, wear-out, SATA downshift error count, uncorrectable errors and power-on hours are displayed together to better compare the changes and phenomena of different attributes. Some obvious changes can be seen from the figure. The change of the wear-out has been explained earlier, so we do not repeat it here.
There are strong connections among NAND writes, temperature and uncorrectable errors, and NAND writes experience a rapid increase at approximately 80%-90% of the measurement process. As mentioned before, once most of their P/E cycles have been consumed, the SSDs need to perform more garbage collection to provide enough empty space for programming. The program process for NAND Flash requires applying a high voltage at the control gate of the floating gate transistor to allow charge to pass through the oxide layer from the channel into the floating gate layer. Due to the wear of an SSD, the oxide layer of the flash cells can no longer isolate the charge effectively. After a program operation, the controller finds that the flash cells cannot effectively distinguish the voltage levels that represent the data, which results in a program failure. The program process therefore takes longer, even though the volume of data is the same as before.
The frequent voltage adjustment for the program operation keeps the SSD controller very busy, so the overall temperature of the chips increases significantly. At the last stage, the SSDs have completely reached their endurance limit and enter "write protection" mode; thus, no more data can be programmed. The value of NAND writes stops growing and only read operations can be performed; thus, the overall temperature of the SSD drops back to its normal state.
Figure 7 shows the relationship between uncorrectable errors and P/E cycles for all of the samples. It can be seen that the P/E cycles experienced by the samples are around 1500 to 2000, and the sample with the most P/E cycles exceeds 2500; they are all in line with the characteristics of TLC NAND Flash, whose average endurance is 1000 to 3000 P/E cycles. The uncorrectable errors of each SSD appear at a late stage of its lifetime, which is also the point at which temperature and NAND writes show sharp increases. The sudden increase in uncorrectable errors in the following short time is also due to severe SSD wear. Most of the cells are still usable before they reach their endurance limits, but they are also very vulnerable. Data errors may occur more frequently than before when reading or programming the flash cells.
To verify the influence of environment temperature fluctuation on the changing trends of SSD reliability, two SSDs were kept at normal (uncontrolled) room temperature and the others were kept at a constant room temperature of 25 °C. As shown in Figure 5, the temperatures of the first two SSDs (a and b) fluctuate over a wider range than the others but the overall trends are similar, so we believe that environment temperature fluctuation has little impact on the changing trends of SSD reliability.
The analysis of the diagrams in the above paragraphs clearly shows that many SMART attributes change significantly as SSDs approach their endurance limits. A number of phenomena, such as the rapid growth of uncorrectable errors and the sustained high operating temperature of the controller, all indicate that SSDs are going to fail.
Some studies compared and analyzed the features of different types of flash chips. Cai et al. [11] studied the characteristics of TLC NAND Flash and MLC NAND Flash in terms of threshold voltage distribution trends, program errors, data retention errors, read disturb errors and others; they found that TLC NAND Flash and MLC NAND Flash show similar behaviors. Mielke et al. [13] studied two series of SSDs (S3500 and S3610) in terms of data retention, bit errors and failure mechanisms, and the two series showed similar characteristics. Schroeder et al. [4] and Narayanan et al. [2] studied various SSD models with different types of flash chips in Google and Microsoft data centers, respectively, and discussed multiple SSD reliability characteristics. Their conclusions indicated that a number of different types of SSDs partially showed similar reliability characteristics or trends. Therefore, we believe that the changing trends of the reliability characteristics of 3D TLC NAND Flash presented in this paper are representative to a certain degree and can reflect some reliability characteristics of other types of flash chips to a certain extent.
In addition, NAND Flash-based SSDs have a high probability of failure under sudden power faults. Although manufacturers can deploy protective capacitors on the SSD board to cope with this problem, it is still necessary to enhance the protection mechanism.

Correlation Analyses of SMART Attributes
In this section, we aim to explore the internal relations among SMART attributes and whether some SMART attributes are dominant in SSD failures. We also provide support for the parameter selection of SSD failure prediction, which we leave to future work. We analyze the relationships among the SMART attributes selected by our analysis, both through visual inspection and through the Pearson, Spearman and Kendall correlation coefficients.

Pearson Correlation Coefficient
In statistics, the Pearson correlation coefficient is widely used in the sciences as a measure of the linear correlation between two variables X and Y as follows:

$$P(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} \tag{2}$$

where cov(X, Y) is the covariance of X and Y, var(X) is the variance of X and var(Y) is the variance of Y. The Pearson correlation coefficient is symmetric: P(X, Y) = P(Y, X), and according to the Cauchy-Schwarz inequality, it has a value between −1 and 1, where 1 is total positive linear correlation, 0 is no linear correlation and −1 is total negative linear correlation.
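The definition can be turned into a few lines of code. The following is a minimal sketch in plain Python, not the Matlab program used in this work; the function name `pearson` and the sample data are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient: cov(X, Y) / sqrt(var(X) * var(Y))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The common 1/n factors of covariance and variances cancel out.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear, close to 1
    print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly inverse, close to -1
```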

Spearman's Rank Correlation Coefficient
In statistics, the Spearman's rank correlation coefficient is a nonparametric index to measure the dependence of two variables, and it uses monotonic functions to evaluate the relationship between two variables. If there are no repeated data values, a perfect Spearman correlation of 1 or −1 occurs when each of the variables is a perfect monotone function of the other.
For two variables X and Y (or two sets), each with N elements, the i-th ($1 \le i \le N$) values of the two variables are denoted by $X_i$ and $Y_i$. Sorting X and Y (both ascending or both descending) yields two ranked sets x and y, where $x_i$ and $y_i$ are the ranks of $X_i$ in X and $Y_i$ in Y, respectively. A rank difference set d is obtained by subtracting the corresponding elements of x and y, where $d_i = x_i - y_i$ ($1 \le i \le N$). The Spearman's rank correlation coefficient is calculated as follows:

$$\rho = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)} \tag{3}$$
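The ranking and rank-difference steps can be sketched as follows. The sketch uses the rank-difference formula, which is exact only when neither series contains repeated values, so it rejects ties; the function names `ranks` and `spearman` and the sample data are illustrative.

```python
def ranks(values):
    """1-based ascending ranks; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman coefficient via 1 - 6 * sum(d_i^2) / (N * (N^2 - 1))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("this formula assumes no repeated values")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Monotone but non-linear relation: the rank correlation is perfect.
    print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))  # 1.0
    print(spearman([10, 20, 30, 40, 50], [1, 3, 2, 5, 4]))
```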

Kendall Rank Correlation Coefficient
In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's tau coefficient, is a statistic used to measure the ordinal association between two measured quantities. A tau test is a nonparametric hypothesis test for statistical dependence based on the tau coefficient. The Kendall correlation coefficient ranges from −1 to 1: the two variables have a consistent or an opposite rank correlation when τ is 1 or −1, respectively, and show no rank correlation when τ is 0.
For two variables X and Y (or two sets), each containing N elements, the i-th (1 ≤ i ≤ N) values of the two variables are denoted by X_i and Y_i. The corresponding elements of X and Y form a set of pairs XY, whose elements are (X_i, Y_i) (1 ≤ i ≤ N). When any two elements (X_i, Y_i) and (X_j, Y_j) from the set XY have the same rank order, that is, case 1 or case 2 (case 1: X_i > X_j and Y_i > Y_j; case 2: X_i < X_j and Y_i < Y_j), the two elements are consistent. When case 3 or case 4 occurs (case 3: X_i > X_j and Y_i < Y_j; case 4: X_i < X_j and Y_i > Y_j), the two elements are inconsistent. When case 5 or case 6 occurs (case 5: X_i = X_j; case 6: Y_i = Y_j), the two elements are neither consistent nor inconsistent. The Kendall's tau coefficient is calculated as follows:

$$\tau = \frac{C - D}{\frac{1}{2} N (N - 1)} \tag{4}$$

where C is the number of consistent element pairs in XY and D is the number of inconsistent element pairs in XY.
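The pair-counting rule can be sketched in Python as shown below. The sketch assumes the normalization by the total number of pairs, N(N − 1)/2 (often called tau-a), which is the form that uses only C, D, and N; the function name `kendall_tau` is hypothetical and this is not the authors' implementation.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau as (C - D) / (N * (N - 1) / 2), assuming tau-a."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    consistent = 0
    inconsistent = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        product = (xi - xj) * (yi - yj)
        if product > 0:
            consistent += 1    # cases 1 and 2: same ordering in X and Y
        elif product < 0:
            inconsistent += 1  # cases 3 and 4: opposite ordering
        # product == 0 covers cases 5 and 6 (a tie in X or in Y):
        # the pair is counted in neither C nor D.
    return (consistent - inconsistent) / (n * (n - 1) / 2)


# Of the 10 pairs below, 8 are consistent and 2 are inconsistent.
print(kendall_tau([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]))  # (8 - 2) / 10 = 0.6
```

The double loop over all pairs is quadratic in N, which is adequate for a sketch; library routines typically use faster algorithms and tie-corrected variants.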

Analysis of Three Correlation Coefficients
We select SATA downshift counts, uncorrectable errors, temperature, NAND writes, wear-out, and host writes for the correlation analysis. We implement the program in MATLAB [46] with reference to Equations (2)-(4) to calculate the correlation coefficients. The values of the Pearson, Spearman, and Kendall correlation coefficients between the selected SMART attributes are shown in Table 8, Table 9, and Table 10, respectively. As shown in the tables, the three correlation coefficients of NAND writes and host writes are all close to or equal to 1, a strong positive correlation that is consistent with intuition. The three correlation coefficients of NAND writes and wear-out are all close to −1, a strong negative correlation that is also consistent with intuition. The SATA downshift errors appear relatively late in the visual inspection, and the three correlation coefficients between SATA downshift counts and uncorrectable errors are approximately 0.2, which is weak; the relationships between SATA downshift counts and the other attributes are weaker still or absent. The three correlation coefficients between uncorrectable errors and NAND writes or host writes range from 0.4 to 0.6, which indicates a moderate correlation, so the overall volume of data written to the SSD has a significant impact on the uncorrectable errors. The Pearson correlation coefficient between uncorrectable errors and temperature is 0.77, which indicates a strong linear correlation, whereas the Spearman and Kendall correlation coefficients are in the range of 0.2 to 0.4, which indicates only a weak rank correlation. In the visual inspection, temperature shows a strong relationship with both NAND writes and uncorrectable errors.
The Pearson correlation coefficient of temperature and NAND writes is approximately 0.3, but the Spearman and Kendall correlation coefficients are very low. The likely reason is that the Spearman and Kendall coefficients are rank correlation coefficients, whereas the temperature fluctuates within a small range, so the corresponding change in rank is not obvious.
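To illustrate how the Pearson coefficient and a rank coefficient can disagree on the same data, the following Python sketch uses synthetic readings (not measurements from this paper) in which temperature moves within a narrow band except for one late sample that is both hot and error-heavy. The helper names and the data are hypothetical, and the sketch shows only one mechanism by which the coefficients can diverge; it is not offered as the explanation of the values in Tables 8-10.

```python
import math


def pearson(x, y):
    """Pearson correlation from centered sums (1/n factors cancel)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Ascending ranks (1 = smallest); the data below contain no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman's rho via 1 - 6 * sum(d_i^2) / (N * (N^2 - 1)), no ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Synthetic data: eight samples in a narrow temperature band with slowly
# falling error counts, then one extreme sample (70 degrees, 500 errors).
temperature = [41, 42, 43, 44, 45, 46, 47, 48, 70]
errors = [8, 7, 6, 5, 4, 3, 2, 1, 500]

print(pearson(temperature, errors))   # high: dominated by the extreme sample
print(spearman(temperature, errors))  # negative: ranks of the other eight oppose
```

The single extreme sample dominates the covariance and pushes the Pearson coefficient above 0.9, while it contributes only one rank out of nine to the Spearman coefficient, which stays at −0.4.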
In summary, because the correlation between the SATA downshift counts and the other attributes is weak, we believe that changes in the SATA downshift counts are not related to NAND Flash reliability and that this attribute is not suitable as the main metric for identifying SSD failures. Wear-out correlates strongly only with the attributes related to data writes, and the manufacturer's estimate of its threshold value is conservative, so it is also not a proper metric for determining SSD reliability. Attributes related to data volume are normally affected by the workload, but they can still reflect the overall SSD reliability to some extent. The uncorrectable errors show a certain degree of correlation with the other attributes: the correlation coefficients between uncorrectable errors and other attributes are around 0.5-0.8, which is strong. Specifically, the correlation coefficients among uncorrectable errors, NAND writes, and temperature are clearly higher than the others, and these attributes are closely related to changes in SSD reliability. Therefore, we believe that uncorrectable errors, NAND writes, and temperature are the dominant metrics for identifying SSD failures and are of great value to monitor when estimating SSD reliability.

Conclusions
This paper designs an SSD lifetime endurance test method, conducts an endurance test of 3D TLC SSDs throughout their lifetime, and analyzes the phenomena revealed by the changes in SSD reliability data. We first present the endurance test flow, the data collection method, and an introduction to SSD reliability data. Next, we analyze the data collected from the measurements; the results reveal some valuable phenomena about how the reliability data of SSDs change throughout their lifetime, some of which have not been reported in existing research. We also conduct a correlation analysis of selected SMART attributes. By analyzing the correlation coefficients between different attributes, we show some internal relationships between the SMART attributes of SSDs, which are helpful for understanding the characteristics of SSD failures.
The findings in this paper are helpful for model analysis and parameter selection when building an SSD failure prediction model, which can improve the reliability of storage services in a data center by reducing the risk of data loss. Furthermore, the analysis of SSD reliability trends and the corresponding correlation analysis can provide directions for optimizing the design of the SSD flash translation layer. Although our work focuses on TLC NAND Flash, the data are collected from real flash chips, and we believe that the findings will also be applicable to emerging 3D NAND technology.