A Mobile-Oriented System for Integrity Preserving in Audio Forensics

Abstract: This paper addresses a problem in the field of audio forensics. To support Chain of Custody (CoC) processes, we propose an integrity verification system that comprises mobile-based capture, hash code calculation, and cloud storage. When the audio is recorded, a hash code is generated in situ by the capture module (an application) and sent immediately to the cloud. Later, the integrity of an audio recording submitted as evidence can be verified against the information stored in the cloud. To validate the properties of the proposed scheme, we conducted several tests to evaluate whether two different inputs could generate the same hash code (collision resistance) and how much the hash code changes when small changes occur in the input (sensitivity analysis). According to the results, all selected audio signals produce different hash codes, and these codes are very sensitive to small changes in the recorded audio. In terms of computational cost, less than 2 s per minute of recording is required to calculate the hash code. These results indicate that our system is useful for verifying the integrity of audio recordings that may be relied on as digital evidence.


Introduction
Audio forensics consists of "acquiring, analysing and evaluating audio recordings that may ultimately be presented as admissible evidence in a court of law or some other official venue" [1]. In this scenario, such recordings may become what is known as digital evidence, i.e., suitable proofs in the form of binary data [2,3]. However, the validity of this type of evidence depends on the reliability of its integrity. The integrity of digital content means that the information obtained from the original source is complete and has not undergone any type of manipulation from the moment of its acquisition until its final disposition [4]. Determining the integrity of an audio signal is not a simple task, given the simplicity and wide availability of audio editing tools on the market. Consequently, the admissibility of digital audio evidence in legal processes is one of the key issues in audio forensics, and the integrity verification process is of great interest today.
Several methods are currently available to demonstrate the integrity of digital evidence, such as hash functions, digital signatures, encryption, cyclic redundancy checks (CRC), and watermarks, among others [4]. Digital signatures guarantee the authorship and integrity of a document by applying a hash function to the document to be signed and then encrypting the result with a specific private key [4]; encryption is a process in which the information is modified in order to protect it [4]; the cyclic redundancy check (CRC) is commonly used to check that the data have not been altered during transmission [5]; and watermarking is an information concealment technique in which a representative mark of the author is inserted into the file to be shared, in order to protect the copyright [6,7]. A hash function (also known as a digest function), in turn, produces a distinctive summary of an input based on its content, generating a fixed-length text string (hash code). Since hash functions are highly sensitive to changes in content, they are well suited to demonstrating the integrity of audio signals. Recent proposals include cloud-based solutions for crime reporting [8], cloud-based integrity verification by means of hash functions [9], integrity verification using blockchain [10], and evidence retrieval based on semantic methodologies [11].
Even though hash functions applicable to any type of content have been designed, in some cases it may be advisable to design specific functions according to the content type; for multimedia files, hash functions can be designed to be applied to audio, image or video signals [12]. In addition, cryptographic hash functions designed for forensic files must comply with the following fundamental properties. The first property states that it should not be possible to recover the original signal from the hash code (preimage resistance). The second and third properties define collision resistance: it must be computationally infeasible, for a given input, to find a different input with the same hash code, and it must likewise be difficult to find any two different files that generate the same hash code [13].
Currently, there are several mechanisms that seek to attack hash functions, such as brute force attacks and cryptanalytical attacks [14-16]; therefore, the number of output bits in the hash code should be considered when designing the system. It is also necessary to clarify that collisions will always exist, since the mapping process is not one to one [17]. Moreover, it has been shown that certain hash functions widely used in the forensic context, such as MD5 and SHA-1, may be susceptible to collisions [18,19].
For the audio case, some hash functions are designed to be perceptual, i.e., they tolerate changes in the signal that do not significantly alter its content, so that the generated hash code remains unchanged. Some perceptual hash functions apply non-negative matrix factorization (NNMF) to the Mel-frequency Cepstral coefficients (MFCC) and then obtain a hash code of the audio signal through Principal Component Analysis (PCA) [20]. Other solutions use balanced multiwavelets (BMW) with five levels of decomposition and division of the coefficients into sub-bands [21], or rely on the MDCT (Modified Discrete Cosine Transform) to extract characteristics of the signal [22]. Another variant is based on the ordering of the spectral coefficients obtained from the FFT (Fast Fourier Transform) of the voice signal and of a Gaussian noise signal generated from the statistics of the original signal [23].
Regarding the methods to calculate the hash code of an audio signal, the existing solutions generate the summary of the signal some time after the recording (for example, once the evidence has been collected). An alternative to this approach is to generate the hash code of the audio signal immediately after its recording and to store the code in a secure manner. Accordingly, this paper presents the design and development of a solution capable of calculating a hash code; it is designed for uncompressed speech signals, and it can be used for forensic purposes. The solution operates in real time (at the time of the events): it guarantees the integrity of audio signals recorded on a mobile device running the Android operating system and stores the hash code in Microsoft Azure. After the recording and the generation of the code, the integrity of the file can be verified by querying the metadata previously stored in the cloud.

Proposed Scheme
The proposed function to obtain and save the hash code of an audio signal in real time is composed of three modules: (i) recording of the input signal; (ii) hash code calculation; and (iii) storage of the code in the database. The general outline of the system is shown in Figure 1. The specific operation of the system is explained below.

Input Signal
Since the proposed solution is intended to operate in real time (generating and storing the hash code immediately after recording the file), the first block of the proposed method records the audio file in the same Android application that performs the hash code calculation. For the recording process, the following parameters were defined in Java: each audio signal is acquired with a sampling frequency of 8000 Hz and 16 bits per sample, and it is stored in WAV format. The number of samples and the duration depend on the needs of the user. The MediaRecorder library available in Android Studio was used to make the recordings.
The result of the recording process is the signal S(n) (Equation (1)), where N is the number of samples of the signal.
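As an illustration of the input stage, the following minimal sketch reads the samples of a 16-bit PCM WAV file into a list of integers, leaving the file header out. It is written in Python rather than in the Java of the original application; the function names are hypothetical, and a single (mono) channel is assumed, which the text does not state.

```python
import io
import struct
import wave


def read_pcm16_samples(wav_file):
    """Return (sample_rate, samples) for a 16-bit mono PCM WAV file.

    `wav_file` may be a path or a binary file-like object. Only the
    sample data are returned; the WAV header is parsed and discarded.
    """
    with wave.open(wav_file, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16 bits per sample")
        if wav.getnchannels() != 1:  # assumption: mono recordings
            raise ValueError("expected a single channel")
        rate = wav.getframerate()
        n_frames = wav.getnframes()
        raw = wav.readframes(n_frames)
    # Little-endian signed 16-bit integers, one per sample.
    samples = list(struct.unpack("<%dh" % n_frames, raw))
    return rate, samples


def make_test_wav(samples, rate=8000):
    """Build an in-memory WAV file (helper used only for illustration)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    buf.seek(0)
    return buf
```

With a recording made at 8000 Hz, `read_pcm16_samples` would return a rate of 8000 and a list of N integers playing the role of S(n).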

Hash Code Calculation (Hash Function)
After recording the audio signal S, the first step in obtaining the hash code consists of reshaping the vector S (input signal) so that the data are structured in an array of 64 columns. The number of rows of this matrix depends on the number of samples of the original signal and is equal to N/64, where N is the number of samples of the original signal. If N is not a multiple of 64, the original signal is padded with zeros until this criterion is met. Note that the vector of the original signal contains only the samples of the signal, that is, the header of the file (44 bytes) is discarded. The choice of 64 columns is a trade-off: a smaller number of columns can produce more repeated characters in the hash code, whereas a larger number of columns can require an unnecessary number of zeros to be added (so that the number of samples is a multiple of the number of columns). Thus, the output of the reshaping block is an (N/64) × 64 matrix (s_r), as shown in Equation (2). The next step is to calculate the mean of each column of the matrix s_r, obtaining a 1 × 64 vector (s_m), as shown in Equation (3).
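The reshaping and averaging steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, and the matrix is assumed to be filled row by row, since the text does not specify the filling order.

```python
import math

COLS = 64  # number of columns stated in the text


def reshape_and_column_means(samples, cols=COLS):
    """Zero-pad `samples`, reshape to rows x cols, and average each column.

    Returns (s_r, s_m): the reshaped matrix as a list of rows, and the
    list of `cols` column means.
    """
    if not samples:
        raise ValueError("the signal must contain at least one sample")
    n_rows = math.ceil(len(samples) / cols)
    # Pad with zeros until the length is a multiple of `cols`.
    padded = list(samples) + [0] * (n_rows * cols - len(samples))
    # Assumption: row-major filling (consecutive samples share a row).
    s_r = [padded[r * cols:(r + 1) * cols] for r in range(n_rows)]
    s_m = [sum(row[c] for row in s_r) / n_rows for c in range(cols)]
    return s_r, s_m
```

For example, a signal of 100 samples is padded to 128 samples and reshaped into a 2 × 64 matrix, so each element of the mean vector averages two values.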
Then, the Discrete Fourier Transform (DFT) of the s_m vector is computed by means of the Fast Fourier Transform (FFT) algorithm. Since the FFT runs fastest for lengths that are powers of two, and considering voice signals with a sampling frequency of 8 kHz, 2^13 was selected as the number of FFT points, i.e., an 8192-point DFT is used. In addition, since the phase presents a greater variability than the magnitude over the whole spectrum, the argument (phase) of the DFT is taken, as shown in Equations (4) and (5).
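The phase computation can be illustrated with the sketch below, which assumes that the transform is applied to the 64-element mean vector, zero-padded to 8192 points. To stay within the standard library, it evaluates the DFT sum directly instead of using an FFT; in practice, a call such as `numpy.fft.fft(s_m, n=8192)` followed by `numpy.angle` would be the usual choice. The function name is hypothetical.

```python
import cmath
import math

N_FFT = 8192  # 2**13 points, as stated in the text


def dft_phase(s_m, n_points=N_FFT):
    """Return the phase (in radians) of the n_points-point DFT of `s_m`.

    `s_m` is implicitly zero-padded to `n_points`, so only its own
    elements contribute to each sum.
    """
    phases = []
    for k in range(n_points):
        w = -2.0 * math.pi * k / n_points
        x_k = sum(v * cmath.exp(1j * w * n) for n, v in enumerate(s_m))
        phases.append(cmath.phase(x_k))  # value in [-pi, pi]
    return phases
```

The direct sum costs 64 × 8192 complex products, which is acceptable for a sketch; an FFT returns the same values far more efficiently.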
To reduce the size of the data, the vector ϕ is reshaped into an array of 128 rows and 64 columns, and the elements of each column are then summed, so that a row vector ϕ_s of 64 elements is obtained, as shown in Equations (6) and (7).
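A minimal sketch of this reduction step is given below. The function name is hypothetical, and the 8192-element phase vector is assumed to be arranged row by row, which the text does not specify.

```python
def column_sums(phi, rows=128, cols=64):
    """Reshape `phi` to rows x cols and return the sum of each column."""
    if len(phi) != rows * cols:
        raise ValueError("phi must contain rows * cols elements")
    # Assumption: row-major layout, so element (r, c) is phi[r * cols + c].
    return [sum(phi[r * cols + c] for r in range(rows)) for c in range(cols)]
```

Each of the 64 outputs therefore aggregates 128 phase values.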
Finally, to obtain the hash code, the integer and the fractional parts (considering only four significant figures) are taken (Equation (8)). These five digits interpreted as an integer value are converted to hexadecimal, and the least significant digit of each element is concatenated to obtain the 64-hex-digit hash code.
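Because Equation (8) is not reproduced here, the following sketch rests on explicit assumptions: the sign of each element is ignored, the integer part and the first four fractional digits are obtained by truncating the value scaled by 10^4, and the last hexadecimal digit of that integer is kept. The function names are hypothetical, and the authors' exact rule may differ.

```python
import math


def element_to_hex_digit(value):
    """Map one element of phi_s to a single hexadecimal digit."""
    # Assumption: drop the sign and keep four fractional digits,
    # e.g. 3.14159 -> 31415.
    scaled = math.floor(abs(value) * 10**4)
    # The least significant hexadecimal digit, e.g. 31415 = 0x7ab7 -> '7'.
    return format(scaled, "x")[-1]


def hash_code(phi_s):
    """Concatenate one hexadecimal digit per element of phi_s."""
    return "".join(element_to_hex_digit(v) for v in phi_s)
```

With a 64-element input vector, `hash_code` returns a string of 64 hexadecimal digits, i.e., 256 bits.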

Cloud Storage
After generating the hash code of the signal, the metadata of the signal are stored in a database hosted in the cloud. These metadata include the name, the creation date and the duration of the recording, the hash code, and the time taken to calculate it. The storage created in Microsoft Azure is an SQL database with 32 MB of storage capacity. When the user finishes recording the audio signal, the metadata are stored in the database and cannot be deleted by the mobile user, i.e., after storage, the data are read-only. Figure 2 shows the block diagram of the proposed module.
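The storage and verification logic can be sketched locally as follows. This is only an illustration: SQLite (through Python's `sqlite3` module) stands in for the Azure SQL database, the table, column and trigger names are hypothetical, and the triggers are just one possible way to make stored rows read-only, not necessarily the mechanism used by the authors.

```python
import sqlite3

SCHEMA = """
CREATE TABLE recordings (
    name        TEXT PRIMARY KEY,
    created_at  TEXT NOT NULL,
    duration_s  REAL NOT NULL,
    hash_code   TEXT NOT NULL,
    hash_time_s REAL NOT NULL
);
-- Reject any later change so that stored metadata are read-only.
CREATE TRIGGER recordings_no_update BEFORE UPDATE ON recordings
BEGIN SELECT RAISE(ABORT, 'metadata are read-only'); END;
CREATE TRIGGER recordings_no_delete BEFORE DELETE ON recordings
BEGIN SELECT RAISE(ABORT, 'metadata are read-only'); END;
"""


def open_store():
    """Create an in-memory database with the metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def store_metadata(conn, name, created_at, duration_s, hash_code, hash_time_s):
    """Insert the metadata of one recording."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO recordings VALUES (?, ?, ?, ?, ?)",
            (name, created_at, duration_s, hash_code, hash_time_s),
        )


def verify(conn, name, recomputed_hash):
    """Return True if the recomputed hash matches the stored one."""
    row = conn.execute(
        "SELECT hash_code FROM recordings WHERE name = ?", (name,)
    ).fetchone()
    return row is not None and row[0] == recomputed_hash
```

Verification then amounts to recomputing the hash code of the recording presented as evidence and comparing it with the stored value.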

Implementation and Evaluation of the Method
To validate the proposed method, several tests were carried out to evaluate the performance of the hash function in terms of: (i) resistance to collisions, which implies determining whether two different inputs produce the same output; (ii) runtime used by the mobile device to calculate the hash code with respect to the duration of the signal; and (iii) sensitivity of the code with respect to changes in the input signal.
For the evaluation of the aforementioned parameters, two mobile devices with the characteristics mentioned in Table 1 were used. In each mobile device, 70 audio signals of different duration were recorded. To structure the results, seven ranges of duration were defined, with ten recordings in each range per device. The first range includes recordings of up to 10 s, while the second includes recordings between 10 and 20 s, doubling the duration limit in each new range. In this sense, there are 140 total recordings, as shown in Table 2.
Immediately after recording the audio signals, collision resistance, computational cost and sensitivity were evaluated. The tests used the metadata of each recording stored in Microsoft Azure: recording name, creation date, hash code, recording duration, and computation time. In addition, seven audio recordings (one for each time range) were randomly selected and subjected to minor modifications, so that the hash code of each modified signal could be generated and compared with that of the original one.
Table 2. Time ranges of the recordings acquired by each device.

Collision Evaluation
A collision refers to the event in which two different inputs (voice signals) produce the same hash code. Since it is mathematically impossible to design a hash function without collisions, it is important to test the proposed function to determine its variability. The theoretical value of the collision probability can be calculated through Equation (9) [24], where L is the number of digits of the hash code and P is the base of the numeral system used to represent the digits in the hash code.
$$P_c = \frac{1}{P^{L}} \tag{9}$$
For the proposed scheme, L = 64 and P = 16. Thus, the theoretical probability of collision is given by
$$P_c = \frac{1}{16^{64}} \approx 8.64 \times 10^{-78} \tag{10}$$
This means that, theoretically, a collision is expected once every ≈1.16 × 10^77 tests.
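The figures above can be checked with a few lines of exact integer arithmetic; the variable names are illustrative only.

```python
from fractions import Fraction

L = 64  # number of digits in the hash code
P = 16  # base of the numeral system (hexadecimal)

n_codes = P ** L                      # number of distinct hash codes
p_collision = Fraction(1, n_codes)    # exact theoretical probability

# 16**64 equals 2**256, i.e., the code carries 256 bits.
print(n_codes == 2 ** 256)
print(f"{float(n_codes):.2e}")        # approximately 1.16e+77, as in the text
print(f"{float(p_collision):.2e}")
```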
To verify the presence of collisions, the Hamming distance (HD) is calculated between each pair of codes of the test signals. The fraction of coordinates in which two codes differ (HD) is given by Equation (11),
$$d_{st} = \frac{\#(x_{sj} \neq x_{tj})}{n} \tag{11}$$
where x_s and x_t are the two codes being compared, j indexes their coordinates, and n is the number of coordinates. If the HD is 0, the two hash codes are identical, whereas an HD of 1 means that the hash codes are completely different.
In a complementary way, each pair of evaluated codes is compared using the Pearson Correlation Coefficient (PCC), given by Equation (12), to determine the degree to which the two codes are linearly related and, therefore, how similar they are.
$$\mathrm{PCC} = \frac{\sum_{i}(A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i}(A_i - \bar{A})^2 \, \sum_{i}(B_i - \bar{B})^2}} \tag{12}$$
where A and B are matrices or vectors of the same size, and \bar{A} and \bar{B} are the means of their elements. The result of the correlation can vary between −1 and 1; an absolute PCC value of 1 indicates a perfect linear relationship, whereas a PCC close to 0 means that there is no linear relationship between the data. The sign indicates the direction of the relationship, i.e., if both variables tend to increase or decrease at the same time, the coefficient is positive, whereas if one variable tends to increase while the other decreases, the coefficient is negative. In practice, when evaluating the collision resistance of binary hash codes by means of the correlation coefficient, PCC ≠ 1 is expected (ideally, a value close to 0).
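Both measures can be sketched for a pair of hexadecimal hash codes as follows. The codes are compared bit by bit, in line with the reference to binary hash codes; the function names are hypothetical.

```python
import math


def hex_to_bits(code):
    """Expand a hexadecimal string into a list of bits (4 per digit)."""
    return [int(b) for b in bin(int(code, 16))[2:].zfill(4 * len(code))]


def hamming_distance(code_a, code_b):
    """Fraction of bit positions in which the two codes differ (0 to 1)."""
    a, b = hex_to_bits(code_a), hex_to_bits(code_b)
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pearson(code_a, code_b):
    """Pearson correlation coefficient between the bits of two codes."""
    a, b = hex_to_bits(code_a), hex_to_bits(code_b)
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("PCC is undefined for a constant bit sequence")
    return cov / math.sqrt(var_a * var_b)
```

Two identical codes give HD = 0 and PCC = 1, whereas two unrelated codes are expected to give an HD near 0.5 and a PCC near 0.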

Computational Cost Evaluation
Since all the processing is done directly on the mobile device, it is important to consider the time it takes to execute the proposed function according to the duration of the input signal. Accordingly, it was decided to register the time it takes for the application to calculate the hash code for each recording. Then, an analysis of runtime versus signal duration was done.

Sensitivity Evaluation
To determine how the proposed hash function behaves against small modifications made to the input signal, six types of modifications were made in selected recordings. After this, the codes of the original signal were compared with those of the modified signals by using HD and PCC. The modifications made to the test signals are listed below:

1. Modifying the amplitude of a sample
2. Reversing the signal in time
3. Silencing a word
4. Cutting out an audio fragment
5. Changing from 16 to 8 quantization bits
6. Changing from 16 to 24 quantization bits
For the first modification, only the tenth sample of the signal was changed to 0; if the amplitude of this sample was already 0, the next sample was changed. For the second modification, the signal was reversed without modifying any of the samples. In the third modification, a word lasting less than 1 s was silenced. In the fourth modification (trimming), a fragment of the signal lasting less than 1 s was removed. Finally, in the fifth and sixth modifications, the 16-bit signal was changed to 8 and 24 bits, respectively.
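The six modifications can be sketched on a list of integer samples as shown below. The function names are hypothetical, the segment boundaries are left to the caller, and the requantization is assumed to be a plain bit shift, which may differ from the audio editing tool actually used.

```python
def zero_one_sample(samples, index=9):
    """Set the tenth sample to 0, moving on while the sample is already 0."""
    out = list(samples)
    while index < len(out) and out[index] == 0:
        index += 1
    if index < len(out):
        out[index] = 0
    return out


def reverse(samples):
    """Reverse the signal in time without altering any sample value."""
    return list(reversed(samples))


def silence(samples, start, stop):
    """Replace samples[start:stop] with zeros (same length as the input)."""
    out = list(samples)
    for i in range(start, min(stop, len(out))):
        out[i] = 0
    return out


def cut(samples, start, stop):
    """Remove samples[start:stop], which shortens the signal."""
    return list(samples[:start]) + list(samples[stop:])


def to_8_bits(samples):
    """Assumption: 16 -> 8 bits by dropping the 8 least significant bits."""
    return [s >> 8 for s in samples]


def to_24_bits(samples):
    """Assumption: 16 -> 24 bits by appending 8 zero bits."""
    return [s << 8 for s in samples]
```

At 8000 Hz, a fragment lasting less than 1 s corresponds to fewer than 8000 consecutive samples.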

Collision Resistance
As discussed above, to verify the presence of collisions, HD was calculated between each pair of codes. To determine the number of possible combinations without repetition of each pair of codes, the binomial coefficient was calculated for a set of n elements taken in groups of size r, where n and r are positive integers with n greater than or equal to r. For the tests performed, n corresponds to the number of audio recordings, i.e., 140. Likewise, since the evaluation was done between each pair of recordings, r corresponds to two elements. Consequently, the number of possible combinations among all the codes is 9730, as shown in Equations (13)-(15).
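The number of comparisons follows from the binomial coefficient: 140 × 139 / 2 = 9730. The sketch below reproduces this count and shows one way to enumerate every pair of codes in search of exact collisions; the function names are hypothetical.

```python
import math
from itertools import combinations


def count_pairs(n_codes):
    """Number of unordered pairs among n_codes items."""
    return math.comb(n_codes, 2)


def find_collisions(codes):
    """Return the index pairs (i, j), i < j, whose codes are identical."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(codes), 2)
        if a == b
    ]
```

An empty list returned by `find_collisions` for the 140 stored codes would correspond to the absence of collisions over the 9730 comparisons.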
To validate the collision resistance in practice, the histograms of the comparison data between each pair of codes for the 140 analyzed signals (9730 cases) are shown in Figures 3 and 4. Regarding Hamming distances, the maximum HD was 0.61 and the minimum was 0.39, with a standard deviation of 0.0309. This means that, in the worst case, the two compared codes differed in 39% of their coordinates. The results were grouped around 0.4999. The histogram of the PCC values shows that most of the data are concentrated around 0, which implies that there is no linear relationship between the compared codes (Figure 4). Accordingly, no collisions were detected in the tests performed.

Computational Cost
With the metadata stored in the database corresponding to the runtime of the hash function, it was possible to determine the computing time according to the mobile device and the duration of the voice recording. This relationship is shown in Figure 5, where the X-axis has a logarithmic scale (base 2). Based on the obtained data, the execution time depends on the characteristics of the device, i.e., the fewer the resources, the longer the execution time, although in general terms the execution time is low. From Figure 5, the runtime of the proposed hash function on the Moto g device can be approximated by Equation (16), where x is the duration of the input recording.
$$y \approx -1 \times 10^{-5} x^2 + 0.0303x - 0.0365 \tag{16}$$
Similarly, the execution time of the proposed hash function on the Nexus 5x can be approximated by Equation (17).
$$y \approx 2 \times 10^{-6} x^2 + 0.0033x + 0.0308 \tag{17}$$
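As a quick check, the two fitted curves can be evaluated for a one-minute recording. The sketch assumes that both the duration x and the runtime y are expressed in seconds, which is consistent with the reported figure of less than 2 s per minute of recording; the function names are hypothetical.

```python
def runtime_moto_g(x):
    """Equation (16): estimated runtime for a recording of x seconds."""
    return -1e-5 * x**2 + 0.0303 * x - 0.0365


def runtime_nexus_5x(x):
    """Equation (17): estimated runtime for a recording of x seconds."""
    return 2e-6 * x**2 + 0.0033 * x + 0.0308


for device, model in (("Moto g", runtime_moto_g), ("Nexus 5x", runtime_nexus_5x)):
    print(f"{device}: {model(60):.3f} s for a 60 s recording")
```

Under this assumption, the models give about 1.75 s on the Moto g and about 0.24 s on the Nexus 5x for 60 s of audio.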

Sensitivity
After making the six changes described above for each of the selected signals, the codes of the original signals and the modified signals were compared, obtaining the results in Tables 3 and 4.
Table 3. Results obtained in the Hamming distance comparison between the hash codes of the original recording and the tampered recording.
For the first modification, the only recording whose hash code did not change significantly was number six (Recording 6, Modification 1 in Table 3, HD = 0.10). Even so, although the change consisted of modifying the value of a single sample of the signal, the resulting hash code differs from the original by 10%. The remaining five modifications ensure that at least 40% of the bits of the hash code change. This implies that the proposed function is quite sensitive to the input data (i.e., a change of one sample in the input signal generates a significant change in the output).

Conclusions
A system for audio integrity based on mobile applications and cloud storage is proposed. In the mobile device, a hash function specifically designed for audio recordings is executed. Once the hash code is obtained, it is transmitted to a database in the cloud.
The performance evaluation of the proposed hash function focused on collision resistance and sensitivity. After 9730 tests, no collisions were found, not even between pairs of very similar recordings (for example, an audio recording and its altered version with only one modified sample). Among all the recordings, the lowest HD (Hamming distance) was 0.39; in the case of a very similar signal, the lowest HD was 0.1. This means that our proposed hash function is very sensitive to the input signal.
The computational cost of calculating the hash code was also measured. With two different mobile devices, a nonlinear relationship was obtained between the duration of the recording and the execution time. In both cases, however, the calculation takes less than 2 s for each minute of audio recording.
As a main conclusion, our proposed system can feasibly be used as a tool for audio integrity verification in the forensic field. A user can record a conversation with the app and use it as evidence in a legal process, and the legal authority can verify the integrity of the evidence with the hash code stored in the cloud.