Non-Linear Dynamics Analysis of Protein Sequences. Application to CYP450

The nature of the changes that occur between sequences (crossed-sequence scale) and within a sequence (inner-sequence scale) is a challenging question in protein biology. This study is a new attempt to assess, with a phenomenological approach, the non-stationary and nonlinear fluctuations of the changes encountered in protein sequences. We computed fluctuations from an encoded amino acid index dataset using the cumulative sum technique and extracted the departure from the linear trend found in each protein sequence. For the inner-sequence analysis, we found that the fluctuations of changes statistically follow a −5/3 Kolmogorov power law and behave like an incremental Brownian process. The pattern of the changes in the inner sequence seems to be monofractal in essence, with a Hurst exponent bounded within the [1/3, 1/2] range, whose limits correspond to the Kolmogorov and Brownian monofractal processes, respectively. In addition, the changes in the inner sequence exhibit moderate complexity and chaos, which is coherent with the monofractal and stochastic process highlighted in the study. The crossed-sequence analysis was carried out using an external parameter, the activity available for each protein sequence, together with two results obtained for the inner sequence: the drift and the Kolmogorov complexity spectrum. We found a significant linear relationship between activity changes and drift changes, and also between activity and Kolmogorov complexity. An analysis of the mean square displacement of trajectories in the bivariate spaces (drift, activity) and (Kolmogorov complexity spectrum, activity) suggests a superdiffusive law with a power law exponent of 1.6.


Introduction
From the information viewpoint, a protein sequence can be considered as a succession of symbols drawn from a dictionary according to a rule. Conceptually, this means that the protein sequence is simply encoded as a set of symbol combinations. Moreover, the number of symbols used is usually very small in comparison to the length of the protein sequence. Consequently, there is a huge variety of symbol combinations available to encode a protein sequence in the real world. It is well known that the molecular mechanism (stability, structure function, disorder) is often triggered by complex interactions [1][2][3]. Like the emerged part of an iceberg, the intricate symbol set of an encoded protein sequence can be seen as a footprint of a wide range of covert biochemical interactions within the protein. Accordingly, numerous encoder models try to reflect this reality accurately using a conversion rule related to physicochemical and biochemical properties [4][5][6]. Beyond the combination and arrangement of the symbols in the protein sequence, understanding the nature and the organization of the symbols is very challenging in protein biology. Therefore, analyzing the encoded protein sequence by means of nonlinear analysis can provide some insights into the dynamics of the changes within the dataset. Searching for similarities between encoded protein sequences in a dataset is one of the important advantages of the morphological analysis of protein sequences. There are many approaches to extract groups, which are conceptually based on clustering global or local information about the protein sequence [7][8][9][10][11][12][13]. The prediction of disorder in a protein sequence is often related to the ability to track the degree of randomness, the stochasticity, and the complexity embedded in the whole encoded dataset. Some studies focus on randomness, chaos, and long-range interactions between sequences for classification and predictability.
For example, Yu et al. [14] made a comparative study of structure and intrinsic disorder between 10,000 natural and random protein sequences and found that natural sequences have more long disordered regions than random sequences. In addition, Gök et al. [5] used the Lyapunov exponent and tested four classifier algorithms (Bayesian network, Naïve Bayes, k-means, and SVM) to identify the disordered protein regions. Long short-term memory (LSTM) recurrent neural networks are a deep learning approach that has gained some interest for tracking the long-range interactions between sequences [1,15]. These studies reveal that protein sequences carry information about their degree of randomness, disorder, and stochasticity and, beyond that, some degree of predictability. This means that the protein sequence exhibits some order within disorder and that the changes in this set of symbols are not purely a matter of chance. To find out what kind of information and which properties of disorder or complexity we are able to extract from protein sequences, we propose to scan the changes inside the protein sequences and between sequences using a multidisciplinary approach. We intend to use, at the same time, tools from information theory (entropy of information, Kolmogorov complexity), physics (chaos, fractional Brownian processes, drift-diffusion processes), and signal processing (multifractality, Fourier analysis). To our knowledge, the use of multidisciplinary tools to analyze the dynamics of the changes within a protein sequence and between sequences is new. As mentioned previously, the encoded protein sequence contains successive numerical values and can therefore be treated as a time series. The aim of this paper is to capture the variability of the inner changes hidden behind the encoded protein sequence using nonlinear tools, and to assess the predictability of the underlying non-stationary protein sequence activity.
The study is organized as follows. Section 2 presents the experimental dataset and the encoded protein sequence. Section 3 describes the algorithms used to analyze the time series: (i) entropy and chaos, (ii) Kolmogorov complexity and Turing machine, (iii) scaling laws and stochastic processes, and (iv) surrogated and shuffled data. Section 4 presents and discusses the results obtained. Finally, the concluding remarks are given in Section 5.

Experimental Dataset
To facilitate the understanding of readers outside the realm of life sciences, we will provide a brief definition of a polypeptide/protein sequence. A protein sequence is a chain made of residues of amino acids. Twenty amino acids are the basic building blocks for proteins. We will provide an application example as well.

Alphabetical Dictionary
Each amino acid is represented by a letter corresponding to the one-letter code for an amino acid. The global sequence has a biological meaning. A single variation in the sequence could have a huge impact on the activity of the protein. An example of a protein sequence (Cytochrome P450) is given below:

An Application Example: Cytochrome P450
Cytochrome P450 is a protein, i.e., a polypeptide sequence of 464 or 466 amino acids. It is used to generate products of significant medical and industrial importance. Three parental cytochromes P450, i.e., CYP102A1 (P1), CYP102A2 (P2), and CYP102A3 (P3), were used to generate 242 chimeric sequences of cytochrome P450 [16]. Further, 242 thermostable protein sequences were created by recombination of stabilizing fragments. For each variant, the thermostability (referred to hereafter as Activity) was analyzed by measuring the T50, i.e., the temperature at which 50% of the protein was irreversibly denatured after incubation for 10 min. Denaturation results in a decrease in activity. Activity ranges from 39.2 °C to 64.48 °C. Chimeras are written according to fragment composition: 23121321 represents a protein that inherits the first fragment from parent P2, the second from P3, the third from P1, and so on.

Methodology
In this study, the questions are: "Can statistical, nonlinear, and complexity analysis give us some information about the pattern in a protein sequence and its changes along the sequence and also across the next, or other, sequences? Can we group sequences according to their activity but also their morphological pattern?". To assess the ability of the statistical, chaos, and complexity tools to answer these questions, we have transformed each protein sequence into a numerical or binary time series, depending on what each tool requires.
First of all, there exist different conversion tables to transform protein residues (letters) into numerical sequences. We have used a freely available one, namely the AA index database [17,18]. This database contains a large number of numerical values ascribed to each protein residue. It provides 566 indices, each of which assigns to every residue a numerical value that corresponds univocally to a physicochemical or biochemical property of the residue. In this case, we have selected index 532 in the database, which allows us to rank and encode the 20 standard amino acids.

Entropy and Chaos
Entropy is a concept that was first introduced in physics. Nevertheless, this concept is also encountered in other fields, especially in information theory. In 1948, Shannon [19] formalized the concept of the entropy of information $H$ of a string of length $N$ that contains $Q$ repeated symbols $S = \{s_1, s_2, \ldots, s_Q\}$. $H$ is given by the well-known formula:

$$H = -\sum_{i=1}^{Q} p_i \log p_i,$$

where $p_i = N_{s_i}/N$ and $N_{s_i}$ is the number of appearances of the symbol $s_i$ in the string of length $N$. Thus, $p_i$ is the probability of occurrence, which lies in the range $]0, 1]$. As we suppose that all $Q$ symbols exist in the string, the probability 0 is excluded. The minus sign ensures a non-negative value of the entropy $H$, as the logarithm of a probability is never positive. $H$ is a global measure of the total amount of information in the entire probability distribution contained in a sequence.
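As an illustration of the formula above, the following minimal Python sketch computes the Shannon entropy of a symbol string. The function name `shannon_entropy` and the choice of a base-2 logarithm (entropy in bits) are assumptions made for this example; the text does not state which base was used in the study.

```python
import math
from collections import Counter


def shannon_entropy(sequence, base=2.0):
    """Shannon entropy H = -sum(p_i * log(p_i)) of a symbol string.

    The logarithm base is an assumption of this sketch (base 2 gives bits).
    """
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)  # N_{s_i}: number of appearances of each symbol
    entropy = 0.0
    for count in counts.values():
        p = count / n  # p_i = N_{s_i} / N, always in ]0, 1]
        entropy -= p * math.log(p, base)
    return entropy


if __name__ == "__main__":
    # A single repeated symbol carries no information;
    # four equiprobable symbols carry two bits per symbol.
    print(shannon_entropy("AAAA"))
    print(shannon_entropy("ACDE"))
```

Only symbols that actually occur are counted, which matches the exclusion of the probability 0 in the definition.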
Another measure of entropy is the sample entropy [20]. Let us consider a set of $N$ symbols $s_{i,k}$ in a sequence $S_i$ chosen among the $M$ sequences in the dataset. From the sequence $S_i$ we extract two subsets of $m$ symbols, $S^m_{i,p} = \{s_{i,p}, s_{i,p+1}, \ldots, s_{i,p+m-1}\}$ and $S^m_{i,q} = \{s_{i,q}, s_{i,q+1}, \ldots, s_{i,q+m-1}\}$, where $p \neq q$. The parameters $p$ and $q$ are the index positions of the first symbol of the subsets $S^m_{i,p}$ and $S^m_{i,q}$, respectively, within the sequence $S_i$. The sample entropy (SampEn) of the sequence $S_i$ is defined as

$$\mathrm{SampEn}(S_i) = -\ln\frac{A}{B},$$

where $A$ is the number of pairs of subsets of length $m+1$ and $B$ is the number of pairs of subsets of length $m$ with a distance $d(S^m_{i,p}, S^m_{i,q}) < r$. Here $r$ is a similarity threshold between the pairwise subsets $S^m_{i,p}$ and $S^m_{i,q}$. In our study, the sequence is a set of numbers. The distance $d(S^m_{i,p}, S^m_{i,q})$ is then a Euclidean distance, and the tolerance threshold $r$ is chosen between 0.1 and 0.2 times the standard deviation of the sequence $S_i$ [20]. Moreover, the embedding dimension $m$ is usually taken to be 2. Finally, the sample entropy is a non-negative value, which can be 0 for a regular sequence and roughly 2.2 or 2.3 for a strongly irregular sequence. The sample entropy is a measure of the regularity within a sequence.
In addition, an irregular pattern in a time series can sometimes be related to a chaotic process within the sequence. The largest Lyapunov exponent is the most common parameter used to characterize chaos in a dynamical system. The sign and the value of this parameter indicate whether the system amplifies, damps, or sustains as an oscillation a small perturbation. In our case, it means that if the largest Lyapunov exponent is (i) positive, the process is chaotic; (ii) close to zero, the process is periodic or quasi-periodic; and (iii) negative, the process is damped and has an attractor. In our study, to search for a chaotic pattern in a sequence $S_i$, we have used Wolf's algorithm [21] to compute the Lyapunov exponent spectrum and the largest Lyapunov exponent (LLE).
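The sample entropy definition can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `sample_entropy`, the default `r_factor=0.2`, and the convention of using the same number of templates for both lengths are choices made for this example. The Euclidean distance follows the text, whereas many reference implementations use the maximum (Chebyshev) distance instead.

```python
import numpy as np


def sample_entropy(x, m=2, r_factor=0.2):
    """Sample entropy -ln(A / B) of a numerical sequence.

    B counts pairs of length-m subsets closer than r, A counts pairs of
    length-(m + 1) subsets closer than r, with r = r_factor * std(x).
    The Euclidean distance is used, as stated in the text.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if n <= m + 1:
        raise ValueError("sequence too short for the chosen m")
    r = r_factor * x.std()

    def count_pairs(length):
        # Use n - m templates for both lengths so that A and B are comparable.
        templates = np.array([x[i:i + length] for i in range(n - m)])
        count = 0
        for i in range(len(templates) - 1):
            diff = templates[i + 1:] - templates[i]
            dist = np.sqrt((diff ** 2).sum(axis=1))
            count += int((dist < r).sum())  # pairs with p != q only
        return count

    b = count_pairs(m)
    a = count_pairs(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # no matching pairs: entropy is undefined
    return float(-np.log(a / b))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(sample_entropy([0, 1] * 50))           # regular sequence
    print(sample_entropy(rng.normal(size=300)))  # irregular sequence
```

A perfectly periodic sequence yields a value of zero, while an uncorrelated random sequence yields a markedly larger value, in line with the interpretation of SampEn as a measure of regularity.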

Kolmogorov Complexity and Turing Machine
Let us assume we have a set of $M$ sequences $S = \{S_1, S_2, \ldots, S_M\}$. For each sequence $S_i$, we have a set of $N$ values defined as $S_i = \{p^i_1, p^i_2, \ldots, p^i_N\}$. To assess disorder within a sequence, we use the Kolmogorov complexity method [22]. This method is based on the concept of the Turing machine, and the mathematical expression of the algorithmic complexity can be written $K_T(s) = \min\{|p| : T(p) = s\}$. This states that the algorithmic complexity of a string $s$ is the length of the shortest program $p$ that, run on a Turing machine $T$, produces the output $s$ [23,24]. To compute the Kolmogorov complexity (KC), there are three steps: (i) convert the sequence $S_i$ to a binary sequence $B_i$ using a threshold method, (ii) compress the sequence $B_i$ with the Lempel-Ziv compressor to a compressed sequence $C_i$, and (iii) compute and normalize the Kolmogorov complexity number associated with the original sequence $S_i$. Binarizing the sequence $S_i$ relies on a particular threshold value $p^i_T$: each number $p^i_k$ in the sequence $S_i$ is assigned the value 0 if $p^i_k$ is less than the threshold value $p^i_T$, or conversely the value 1 if $p^i_k$ exceeds the threshold value $p^i_T$. The mathematical expression of the binary value $b^i_k$ of the number $p^i_k$ in the sequence $S_i$ is:

$$b^i_k = \begin{cases} 0 & \text{if } p^i_k < p^i_T, \\ 1 & \text{otherwise,} \end{cases}$$

where $p^i_T$ is a threshold value of sequence $S_i$.
Usually, the mean of the set p i 1 , p i 2 , . . . , p i N is used as a threshold value of the sequence S i .
Nevertheless, we will take into account the amplitude of the numbers $p^i_k$ to compute the optimum threshold value $p^i_{T_{opt}}$ associated with the sequence $S_i$. Thus, we introduce the Kolmogorov complexity spectrum (KCS), which is an iterative procedure to compute the Kolmogorov complexity for various threshold values within the range of values $p^i_k$ of the sequence $S_i$ [25]. The encoding of a number to a binary value is:

$$b^i_{k,j} = \begin{cases} 0 & \text{if } p^i_k < p^i_{T_j}, \\ 1 & \text{otherwise,} \end{cases}$$

where $p^i_{T_j}$, $j = 1, \ldots, K$, is the $j$-th threshold value. Thus, for each sequence $S_i$, the Kolmogorov complexity spectrum is a set of $K$ Kolmogorov complexity values $\{KC^i_1, KC^i_2, \ldots, KC^i_K\}$. The optimum threshold $p^i_{T_{opt}}$ is chosen among the set of threshold values $\{p^i_{T_1}, p^i_{T_2}, \ldots, p^i_{T_K}\}$ using the condition that it yields the highest complexity of the spectrum, $KC^i(p^i_{T_{opt}}) = \max_j KC^i_j$.
The compression method used in this study is the Lempel-Ziv compressor [26]. This is an iterative search in the binary series $B_i$ for all the possible subsequences that are different from each other. The result is a compressed sequence $C_i$. If $|C_i|$ represents the length of the compressed binary sequence $C_i$, then the Kolmogorov complexity $KC_i$ associated with the sequence $S_i$ is:

$$KC_i = |C_i| \, \frac{\log_2 N}{N}.$$

The term $\log_2 N / N$ in the expression of $KC_i$ ensures the normalization of the Kolmogorov complexity.
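The three steps above (binarization, Lempel-Ziv parsing, normalization) and the spectrum over thresholds can be sketched in Python as follows. This is a hedged illustration, not the code used in the study: all function names are hypothetical, the thresholds are assumed to be evenly spaced between the minimum and maximum of the sequence, and the optimum threshold is assumed to be the one that maximizes the complexity. The number of distinct phrases found by the Lempel-Ziv (1976) parsing plays the role of $|C_i|$.

```python
import math


def lempel_ziv_complexity(bits):
    """Number of distinct phrases in the Lempel-Ziv (1976) parsing."""
    s = "".join(str(int(b)) for b in bits)
    n = len(s)
    i, phrases = 0, 0
    while i < n:
        length = 1
        # Extend the phrase while it can be copied from earlier in the string.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def binarize(values, threshold):
    """Assign 0 below the threshold and 1 otherwise."""
    return [0 if v < threshold else 1 for v in values]


def kolmogorov_complexity(values, threshold=None):
    """Normalized complexity |C| * log2(N) / N; default threshold is the mean."""
    n = len(values)
    if threshold is None:
        threshold = sum(values) / n
    c = lempel_ziv_complexity(binarize(values, threshold))
    return c * math.log2(n) / n


def kolmogorov_complexity_spectrum(values, n_thresholds=20):
    """KC for evenly spaced thresholds (assumption) and the maximizing threshold."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_thresholds - 1)
    thresholds = [lo + j * step for j in range(n_thresholds)]
    spectrum = [kolmogorov_complexity(values, t) for t in thresholds]
    best = max(range(n_thresholds), key=spectrum.__getitem__)
    return thresholds, spectrum, thresholds[best]


if __name__ == "__main__":
    print(lempel_ziv_complexity("0001101001000101"))  # parsed as 0.001.10.100.1000.101
```

The string in the usage example is the classical one used to illustrate the parsing: it decomposes into six phrases.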

Law-Scaling and Stochastic Process
As previously mentioned, a sequence is defined as a set of alphabetic letters, which can be converted to other symbols (numerical, binary, etc.). Nevertheless, the changes of symbols along the chain are usually related to the real world of biochemical activities along the protein sequence. The question is "Do those changes present a regular or irregular pattern within a sequence which can provide some information about an underlying dynamic in a sequence?" First, we have to define the changes in a sequence $i$ between pairwise symbols separated by a distance, namely an increment of position. Let us assume $d$ is the increment between pairwise symbols; the quantity $\Delta p^i_d = |p^i_j - p^i_k|_{d=|k-j|}$ is the magnitude of the change between pairwise symbols separated by an increment $d$. We define the structure function $S_{qi}(d)$ of order $q$ for a sequence $i$ by the expression

$$S_{qi}(d) = \frac{1}{N_{di}} \sum \left(\Delta p^i_d\right)^q,$$

where the sum runs over all pairs of symbols separated by a distance $d$ and $N_{di}$ is the number of such pairs. By extension, this function can also be used to track the existence of a scaling law in the data, $S_{qi}(d) \propto d^{\xi(q)}$. $\xi(q)$ is the generalized Hurst exponent, which is indicative of the nature of the pairwise symbol changes and of the stochasticity of processes such as long-term memory, Brownian motion, and self-similar patterns [27]. The probability density function (PDF) of the normalized changes of pairwise symbols $\Delta p^i_d / \sigma_{\Delta p^i_d}$ within a sequence $i$ can be computed to analyze the normality of the changes in a sequence. Additionally, kurtosis, or flatness, is another measure of the normality of the changes of the pairwise symbols. For sequence $i$, the kurtosis is $F_i = S_{4i}(d) / (S_{2i}(d))^2$. The terms $S_{4i}(d)$ and $S_{2i}(d)$ are, respectively, the fourth- and second-order moments of the pairwise distribution.
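The structure function and the flatness defined above translate directly into a few lines of Python. The sketch below is illustrative only; the function names `structure_function` and `flatness` are hypothetical, and the input is assumed to be a one-dimensional numerical sequence.

```python
import numpy as np


def structure_function(p, d, q):
    """q-th order structure function: mean of |p[j + d] - p[j]|**q over all pairs."""
    p = np.asarray(p, dtype=float)
    if not 1 <= d < p.size:
        raise ValueError("d must satisfy 1 <= d < len(p)")
    increments = np.abs(p[d:] - p[:-d])  # all pairs separated by distance d
    return float(np.mean(increments ** q))


def flatness(p, d):
    """Kurtosis (flatness) F = S_4(d) / S_2(d)**2 of the increments at distance d."""
    return structure_function(p, d, 4) / structure_function(p, d, 2) ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    walk = np.cumsum(rng.normal(size=20000))  # synthetic Brownian-like path
    print(structure_function(walk, 10, 2), flatness(walk, 10))
```

For increments that are normally distributed, the flatness is close to 3, which is the reference value used when discussing departures from normality.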

Surrogated and Shuffled Data
Surrogating and shuffling the data are very popular tools to assess the existence of nonlinearities and the scaling properties of a process. Both algorithms generate randomized synthetic data under a specific constraint rule. The surrogated data used in this study are generated with the iterative amplitude-adjusted Fourier transform (IAAFT). This method preserves the statistical properties of the original data but randomizes the phase spectrum of the Fourier transform of the original data. The synthetic data generated with this method no longer contain the nonlinearities of the original data. Shuffled data are obtained by a random permutation of the values of the original data. This method is a bootstrapping algorithm without repetition of the permuted indices. Variants of the protein (synthetic sequences) are obtained by variation of any position in the sequence and not by variation of the fragments that constitute the protein (described in Section 2.2, "An Application Example: Cytochrome P450"). The shuffled data are a set of values that do not exhibit any linear correlation and that preserve the amplitude distribution. For more information about these two algorithms, the reader can refer to the review of Schreiber and Schmitz [28].
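The two randomization schemes can be sketched in Python as follows. This is a minimal illustration of the general technique, assuming a fixed number of iterations instead of a convergence test; the function names and the parameter `n_iter` are hypothetical and the code is not taken from the study.

```python
import numpy as np


def shuffled_surrogate(x, rng):
    """Random permutation without repetition: keeps the amplitude distribution only."""
    return rng.permutation(np.asarray(x, dtype=float))


def iaaft_surrogate(x, rng, n_iter=100):
    """Iterative amplitude-adjusted Fourier transform (IAAFT) surrogate.

    Alternately imposes the Fourier amplitudes and the amplitude distribution
    of the original data, starting from a random shuffle.
    """
    x = np.asarray(x, dtype=float)
    sorted_x = np.sort(x)
    target_amplitudes = np.abs(np.fft.rfft(x))
    s = rng.permutation(x)
    for _ in range(n_iter):
        # Step 1: keep the current phases, impose the original amplitudes.
        phases = np.angle(np.fft.rfft(s))
        s = np.fft.irfft(target_amplitudes * np.exp(1j * phases), n=x.size)
        # Step 2: impose the original values by rank ordering.
        ranks = np.argsort(np.argsort(s))
        s = sorted_x[ranks]
    return s


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.cumsum(rng.normal(size=466))  # synthetic stand-in for one sequence
    surrogate = iaaft_surrogate(data, rng)
    print(np.array_equal(np.sort(surrogate), np.sort(data)))
```

Because the last step of each iteration is the rank ordering, the surrogate contains exactly the original values, while its power spectrum matches the original only approximately.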

Normalized Detrended Cumulative Sum (NDCS) Method
Fluctuations or changes along the protein sequence are of interest in this study, so we need to show how we extract this information from the original data. The cumulative sum is a sequential method that is widely used to detect changes in a time series and to track self-similarity in a dataset [29]. In this study, we have applied this algorithm to each sequence and generated a new sequence of fluctuations defined as the departure from the linear trend. In each of the 242 protein sequences, of length 466, each position is originally labelled with an alphabetical letter. There are 20 letters used (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y), corresponding to the one-letter code for amino acids. In this study, the D PRIFT index is chosen from the AA index catalog to convert the alphabetical symbols to numerical values [30]. It allows us to distinguish each of the 20 amino acid residues by a unique value related to its hydrophobicity. The encoding process, which converts the original alphabetical letters to numerical values within the [−5.68, 6.81] range, is shown in Table 1.
Table 1. Conversion rule of protein sequence of AA index 532-D PRIFT index [30].
We are aware that describing amino acids by their hydrophobicity values alone is an oversimplification. It does not account (i) for many other properties of amino acids that are well known to strongly affect pattern changes in protein sequences along families, such as volume, aromaticity, and different charge states for the same amino acid in distinct positions, or (ii) for the fact that the exposure of contiguous amino acid sequences to solvent, or their occlusion in protein cores, is a fundamental requirement for proteins to fold in functional arrangements, which gives importance to hydrophobic and polar amino acids and their distribution.
However, whichever index is chosen among all the amino acid indexes able to distinguish between the 20 amino acid residues, a single index will be insufficient.

AA Index 532 D PRIFT Index (Cornette et al. 1987)
As shown in Figure 1a, the distribution of the values is non-normal, which is indicative of a non-Gaussian process along the protein sequence. Roughly, the distribution is U-shaped: the highest probability of occurrence is obtained for the extreme values and the lowest for the mean value of the available D PRIFT index. The pattern of the encoded protein sequence then appears like complex bouncing stairs, with randomness showing up as sharp jumps (Figure 1b). To target the analysis of this jump-stair pattern within the protein sequence, we have used the normalized detrended cumulative sum (NDCS) method. The cumulative sum is a well-known and widely used algorithm to detect changes and shifts in time series [31]. In this study, we have extracted the linear long-term trend and normalized the cumulative sum of each sequence to (i) focus on the local changes and (ii) have the same scale to compare transformed data. Figure 2 presents an example of transforming the original data (Sequence 1) into detrended cumulative sum data. For clarity, we only present here the cumulative sum and the linear detrending of the data. The normalization step is shown in the next figure. The trend of the cumulative sum is considered to be linear for all the 242 protein sequences. The negative drift of the cumulative sum is related to the mean of a sequence. In our dataset, the average of the D PRIFT index is negative for each sequence, which explains the downward drift of the cumulative sum.
Figure 4a shows that the fluctuations of the NDCS of the D PRIFT index changes are normally distributed, with skewness close to 0 and kurtosis close to 3, which are the expected values for a normal distribution. In addition, the QQ-plot displayed in Figure 4b reveals that the observed distribution is close to a normal distribution, and the two-sample Kolmogorov-Smirnov test (dataset values versus generated normal data values) applied to this distribution does not reject the null hypothesis at the 5% significance level.
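The NDCS transformation described in this section can be sketched in Python as follows. The text does not specify how the detrended cumulative sum is normalized, so division by its standard deviation is an assumption of this example, as are the function name `ndcs` and the use of the fitted slope as the drift of a sequence.

```python
import numpy as np


def ndcs(values):
    """Normalized detrended cumulative sum of an encoded sequence.

    Returns the normalized fluctuations and the slope of the linear trend
    (taken here as the drift). Normalizing by the standard deviation is an
    assumption of this sketch.
    """
    values = np.asarray(values, dtype=float)
    positions = np.arange(1, values.size + 1)
    cumulative = np.cumsum(values)
    # Least-squares linear trend of the cumulative sum.
    slope, intercept = np.polyfit(positions, cumulative, 1)
    detrended = cumulative - (slope * positions + intercept)
    scale = detrended.std()
    # Guard against a purely linear cumulative sum (residual is round-off only).
    if scale > 1e-12 * max(1.0, np.abs(cumulative).max()):
        detrended = detrended / scale
    return detrended, float(slope)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    encoded = rng.normal(-0.5, 1.0, size=466)  # synthetic stand-in, negative mean
    fluctuations, drift = ndcs(encoded)
    print(drift, fluctuations.std())
```

With a negative mean of the encoded values, the fitted slope is negative, which reproduces the downward drift of the cumulative sum discussed above.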

Normality and Intermittency
The changes along the protein sequence for four different pairwise distances show a platykurtic nature (Figure 5a). The average distribution, computed over the 242 protein sequences, exhibits large amplitudes for fluctuations greater than 2.5 times the standard deviation of the NDCS of the D PRIFT index changes. Below this threshold value, the distribution is close to the Gaussian distribution. This kind of departure from the Gaussian distribution in the fluctuations is indicative of intermittency. Moreover, Figure 5b highlights that the platykurtic nature of the fluctuations covers a wide range of pairwise distances, but it is more pronounced for pairwise distances in the [30-60] range and for pairwise distances less than 10. To summarize, this flat distribution indicates more diversity in the large-amplitude changes between pairwise positions within the protein sequence.



Kolmogorov's Law and Brownian Process
We have conducted a Fourier analysis to focus on the fluctuations of the NDCS of the D PRIFT index changes. Surprisingly, scale invariance can be detected in the log-log presentation of the Fourier spectra (Figure 6a). An average power law slope of −1.68 is obtained, which is very close to the Kolmogorov power law value of −5/3. This highlights that the fluctuations of the NDCS of the D PRIFT index changes along a sequence are similar to a non-stationary process and obey the famous Kolmogorov law of the energy cascade for turbulence in the inertial scale range [22]. In addition, as shown in Figure 6b, the slope for each sequence is rather close to −5/3, with observed slope values ranging from −1.84 to −1.56. This means that the spectrum of the changes within the protein sequence can be written as $E(f) \propto f^{-\beta}$, where $-\beta$ is the slope of the law in the log-log plot and $\beta$ is close to the Kolmogorov value of 5/3. In addition, we can use a criterion to check whether the changes of the protein are stationary or not [32]. It is summarized by the following test:
• $\beta < 1$, the changes are stationary,
• $\beta > 1$, the changes are non-stationary,
• $1 < \beta < 3$, the changes are non-stationary with stationary increments.
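The spectral slope used in this test can be estimated with a least-squares fit of the periodogram in log-log coordinates. The Python sketch below is a minimal illustration under stated assumptions: the function name `spectral_slope` and the optional band limits `f_min` and `f_max` are hypothetical, and the text does not say which frequency range or spectral estimator was used in the study.

```python
import numpy as np


def spectral_slope(x, f_min=None, f_max=None):
    """Slope of log10(power) versus log10(frequency) for a sequence x.

    The returned slope equals -beta in E(f) ~ f**(-beta). The fitting band
    [f_min, f_max] is an assumption left to the user.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    freqs = np.fft.rfftfreq(x.size)       # cycles per position along the sequence
    power = np.abs(np.fft.rfft(x)) ** 2   # raw periodogram
    keep = (freqs > 0) & (power > 0)
    if f_min is not None:
        keep &= freqs >= f_min
    if f_max is not None:
        keep &= freqs <= f_max
    slope, _intercept = np.polyfit(np.log10(freqs[keep]), np.log10(power[keep]), 1)
    return float(slope)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.normal(size=4096)
    walk = np.cumsum(noise)
    print(spectral_slope(noise))            # flat spectrum, beta near 0
    print(spectral_slope(walk, f_max=0.1))  # Brownian-like, beta near 2
```

On synthetic data, white noise gives a slope near 0 (stationary case) and a random walk gives a slope near −2 at low frequencies (non-stationary with stationary increments), which brackets the value of about −5/3 reported for the protein sequences.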
Thus, the changes in the protein sequence follow a non-stationary process. Moreover, the coefficient of variation of the fluctuations of the NDCS of the D PRIFT index changes computed for all 242 sequences is less than 3%, which indicates that this similarity with the Kolmogorov spectrum is reproducible for each protein sequence, as confirmed by the distribution of the spectrum slope obtained randomly with surrogated and shuffled data. As shown previously in Figure 3b, the fluctuations of the NDCS of the D PRIFT index changes appear to be organized. The question is "Is there some dynamic pattern of these change fluctuations along a sequence $S_i$, and is there some randomness of changes within the protein sequence?". A first approach is to analyze the behavior of the fluctuations of the pairwise protein index. Figure 7a shows that, on average, the second-order moment $S_{2i}(d)$ of the pairwise protein sequence index follows a scaling law (a straight line in log-log coordinates) for pairwise distances $d$ roughly below 50. We found a power law exponent of 0.87, which is close to the exponent of a Brownian process. The behavior of the change fluctuations along each protein sequence $S_i$ therefore seems to be close to a Brownian process. Furthermore, we found for each protein sequence a power law exponent within the range [0.69, 0.99] and a coefficient of variation of less than 7%, which reveals that the fluctuations of the NDCS of the D PRIFT index changes along a sequence $S_i$ statistically behave like a Brownian process with regard to the results obtained with the surrogated and shuffled data (Figure 7b).
In addition, we have also computed the q-order moment for each protein sequence $S_i$. The result is shown in Figure 8a. As observed with the second-order moment $S_{2i}(d)$ analysis, we again have a scaling law for the pairwise protein sequence index of $S_i$ below $d = 50$ for higher-order moments. This result reveals the existence of a monofractal feature along the protein sequence $S_i$. Figure 8b shows that the fluctuations of the NDCS of the D PRIFT index changes of each protein sequence $S_i$ follow $\xi(q) = 0.43\,q$, which is a linear law of $q$ and reveals monofractal behavior. The slope of the linear law is called the Hurst exponent $H$. As a reminder, if $H = 1/2$, the changes in a sequence contain no memory, as for Brownian motion. If the changes of the sequence are anti-persistent, $0 < H < 1/2$, then the main pattern of the changes shows that a decrease is followed by an increase and vice versa. Finally, if the Hurst exponent satisfies $1/2 < H < 1$, then there is a persistent behavior in the changes, and an increase or decrease will be maintained in a sequence. In our case, the changes are anti-persistent, and they are statistically embedded between the Kolmogorov process $\xi(q) = q/3$ [22] and the Brownian process $\xi(q) = q/2$. Thus, a stochastic model such as the fractional Brownian model is a potential candidate to predict the changes along the protein sequence.
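The estimation of $\xi(q)$ and of the Hurst exponent described above can be sketched in Python as follows. The moment orders, the maximum distance of 50 taken from the scaling range quoted in the text, and the fit of a line through the origin are choices made for this illustration; the function names are hypothetical and the code is not the one used in the study.

```python
import numpy as np


def scaling_exponents(x, orders=(1, 2, 3, 4), max_lag=50):
    """Exponents xi(q) from log-log fits of the q-th order moments versus d."""
    x = np.asarray(x, dtype=float)
    lags = np.arange(1, max_lag + 1)
    exponents = []
    for q in orders:
        # q-th order moment of the pairwise changes at each distance d.
        moments = [np.mean(np.abs(x[d:] - x[:-d]) ** q) for d in lags]
        slope, _intercept = np.polyfit(np.log(lags), np.log(moments), 1)
        exponents.append(slope)
    return np.array(exponents)


def hurst_from_exponents(orders, exponents):
    """Least-squares slope of the line through the origin xi(q) = H * q."""
    orders = np.asarray(orders, dtype=float)
    exponents = np.asarray(exponents, dtype=float)
    return float(np.dot(orders, exponents) / np.dot(orders, orders))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    walk = np.cumsum(rng.normal(size=20000))  # synthetic Brownian path, H = 1/2
    orders = (1, 2, 3, 4)
    xi = scaling_exponents(walk, orders)
    print(xi, hurst_from_exponents(orders, xi))
```

For a monofractal process the estimated $\xi(q)$ values fall on a straight line through the origin, so that, for instance, the second-order exponent of 0.87 reported above corresponds to $H \approx 0.87/2 \approx 0.43$.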


Entropy, Chaos, and Complexity
As previously mentioned, a sequence is defined as a set of alphabetic letters, which can be converted to other symbols (numerical, binary, etc.). The changes of symbols or numerical values along the sequence are usually related to the real biochemical activities inside the whole protein sequence. The question is: "Do those changes present a regular, irregular, chaotic, or complex pattern within a sequence?" Nonlinear analysis is one approach to characterize such changes along a sequence. In this study, we have used five algorithms to assess the degree of randomness, disorder, and complexity in protein sequences: (i) the Shannon entropy (ShEn); (ii) the sample entropy (SampEn); (iii) the largest Lyapunov exponent (LLE); (iv) the Kolmogorov complexity (KC); and (v) the Kolmogorov complexity spectrum (KCS). Table 2 presents the descriptive statistics of the NDCS of the D-PRIFT index changes for the 242 protein sequences. On average, the probability distribution of a sequence carries a significant amount of information. We observe that the SampEn and LLE values are close to one. Moreover, the KC method underestimates the complexity in comparison to the KCS method, which takes into account the amplitude of the changes. After comparison with surrogate and shuffled data generated from the original data, we found that the NDCS of the D-PRIFT index changes for the 242 protein sequences used in this study includes stochastic and moderately chaotic processes and appears to be embedded between the Kolmogorov ($H = 1/3$) and Brownian ($H = 1/2$) monofractal processes.
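As an illustration of two of these measures and of the shuffle comparison, here is a small sketch under stated assumptions. It is not the implementation used in this study. The histogram bin count, the embedding dimension `m = 2`, the tolerance `r = 0.2` times the standard deviation, and the sine test signal are all conventional or hypothetical choices.

```python
import math
import random

def shannon_entropy(x, bins=16):
    """Shannon entropy (in bits) of an equal-width histogram of x."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0  # guard against a constant signal
    counts = [0] * bins
    for v in x:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    n = len(x)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def sample_entropy(x, m=2, r_factor=0.2):
    """Sample entropy -ln(A/B) with Chebyshev distance and r = r_factor * std."""
    n = len(x)
    mean = sum(x) / n
    r = r_factor * math.sqrt(sum((v - mean) ** 2 for v in x) / n)

    def count_matches(length):
        # Count template pairs (i < j) whose first `length` points agree within r.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                if all(abs(x[i + k] - x[j + k]) <= r for k in range(length)):
                    count += 1
        return count

    b = count_matches(m)      # matches of length m
    a = count_matches(m + 1)  # matches of length m + 1
    if a == 0 or b == 0:
        return float("inf")
    return -math.log(a / b)

# Shuffle test on a hypothetical regular signal: shuffling keeps the value
# distribution but destroys the ordering.
random.seed(1)
regular = [math.sin(2 * math.pi * k / 25) for k in range(300)]
shuffled = regular[:]
random.shuffle(shuffled)

se_regular = sample_entropy(regular)
se_shuffled = sample_entropy(shuffled)
```

The Shannon entropy of the histogram is identical for the original and shuffled signals because it ignores ordering. The sample entropy rises after shuffling, which is why a comparison with shuffled data helps separate structure in the ordering from structure in the value distribution.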

Drift (DRF), Kolmogorov Complexity Spectrum (KCS), and Activity (ACT): Linear Correlation and Superdiffusive Process between Sequences
The activity as defined in Section 2.2 (Thermostability) is also freely available for each protein sequence. Figure 9a shows the cumulative sum of the activity, entropy, chaos, complexity, fractal, and drift parameters for the 242 protein sequences. In order to track the biochemical activity changes through an invariant sequence arrangement, we sorted the sequences in ascending order of activity. We then sorted the remaining parameters with respect to increasing activity and applied the cumulative sum. For clarity, we plot one tenth of the entropy, chaos, complexity, fractal, and drift parameters, and one thousandth of the activity. Most of the curves show a roughly linear shape, which is the average mode through increasing sequence activity. Nevertheless, the dynamics of the changes through this increasing activity highlight that the NDCS of the activity changes is well correlated with the NDCS of the Kolmogorov complexity spectrum and of the drift (Figure 9b). The activity (ACT) changes follow a pronounced parabola that opens upwards, whereas the Kolmogorov complexity spectrum (KCS) and drift (DRF) changes follow a parabola that opens downwards. The correlation coefficient is very high between ACT, KCS, and DRF, as shown in Figure 9c. We found a relationship between the inner-sequence drift, the complexity, and the activity across the 242 protein sequences rearranged by increasing activity. As shown in Figure 9c, the trajectories of the bivariate parameter (drift, activity) or (complexity, activity) exhibit jumps between sequences, which leads to the question: "Are these successive jumps related to variable changes ruled by a power law?"

We then analyzed these trajectories by calculating the mean square displacement of changes $\langle \Delta(d_S)^2 \rangle$ in the bivariate parameter (drift, activity) or (complexity, activity) space, where $d_S$ is the distance between two sequences. Moreover, we defined the mean square displacement as

$$\langle \Delta(d_S)^2 \rangle = \frac{1}{N_{d_S}} \sum_{|i-j| = d_S} \left[ (X_i - X_j)^2 + (ACT_i - ACT_j)^2 \right]$$

where $N_{d_S}$ is the number of pairwise sequences separated by a distance $d_S$ and $X$ is the drift (DRF) or the Kolmogorov complexity spectrum (KCS). Figure 10 shows $\langle \Delta(d_S)^2 \rangle \sim d_S^{\alpha}$ with $\alpha \sim 1.7$ for the drift and $\alpha \sim 1.6$ for the complexity. We found that there is a scaling law of the bivariate (DRF, ACT) or (KCS, ACT) parameter that is similar to a superdiffusive process, with an exponent $\alpha > 1$ [33]. Here, we have plotted $\langle \Delta(d_S)^2 \rangle / \langle \Delta(d_{Sc})^2 \rangle$, where $d_{Sc}$ is the characteristic distance between two sequences computed with the correlation function $\delta(d_S) = 1$…
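The mean square displacement and its power-law exponent can be sketched as follows. This is an illustrative reading rather than the procedure actually used: it assumes that the distance $d_S$ is the rank separation between sequences once they are ordered by increasing activity. The function names and the two synthetic trajectories are hypothetical.

```python
import math
import random

def mean_square_displacement(x, act, d):
    """Average squared displacement in the (X, ACT) plane over pairs at separation d."""
    n = len(x)
    terms = [
        (x[i + d] - x[i]) ** 2 + (act[i + d] - act[i]) ** 2
        for i in range(n - d)
    ]
    return sum(terms) / len(terms)

def power_law_exponent(x, act, lags):
    """Least-squares slope alpha of log(MSD) against log(d)."""
    lx = [math.log(d) for d in lags]
    ly = [math.log(mean_square_displacement(x, act, d)) for d in lags]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

lags = [1, 2, 4, 8, 16, 32]

# Reference case 1 (synthetic): ballistic motion, MSD ~ d**2, so alpha = 2.
n = 242
ballistic_x = [0.5 * i for i in range(n)]
ballistic_act = [2.0 * i for i in range(n)]
alpha_ballistic = power_law_exponent(ballistic_x, ballistic_act, lags)

# Reference case 2 (synthetic): 2D random walk, normal diffusion, alpha close to 1.
random.seed(0)
rw_x, rw_act = [0.0], [0.0]
for _ in range(20000):
    rw_x.append(rw_x[-1] + random.gauss(0.0, 1.0))
    rw_act.append(rw_act[-1] + random.gauss(0.0, 1.0))
alpha_diffusive = power_law_exponent(rw_x, rw_act, lags)
```

The two reference cases bracket the superdiffusive regime: normal diffusion gives $\alpha \approx 1$ and ballistic motion gives $\alpha = 2$, so exponents such as 1.6 or 1.7 lie between them.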

Conclusions
In this work, we analyzed the nonlinear behavior of the D-PRIFT index changes around the overall linear trend of the protein sequence. To carry out the nonlinear analysis, we used protein residue values that are freely available from the AAindex database. The protein dataset contains 242 sequences, and each sequence has 466 numerical values, one per amino acid residue. A protein sequence corresponds to a combination of encoding symbols from a dictionary of the 20 standard amino acid symbols.
We have applied to each sequence a normalized detrended cumulative sum (NDCS) algorithm to extract the fluctuations of the numerical signal in the protein sequence. We analyzed these fluctuations with different tools, which are related to (i) entropy (information and regularity); (ii) chaos (largest Lyapunov exponent); (iii) complexity (Kolmogorov complexity and Kolmogorov complexity spectrum); and (iv) fractality (Hurst exponent). First, we showed that the change fluctuations of all 242 protein sequences in the dataset seem to be non-stationary and follow on average a −5/3 Kolmogorov power law. This result seems to be statistically significant, given a coefficient of variation below 2% and a test against synthetic data generated with surrogate and shuffle techniques. To understand the nature of the inner changes within the protein sequence, we analyzed the variance of the changes through the spatial correlation, that is, the second-order moment $S_{2i}(d)$ of the changes between index positions separated by a pairwise scale $d$ within the protein sequence. We found a scale invariance in $d$, ruled by $S_{2i}(d) \propto d^{\alpha}$ with $\alpha = 0.87$, a coefficient close to the value of one that characterizes the well-known stochastic Brownian process. The dispersion of the slopes obtained for all 242 protein sequences is statistically coherent with the results obtained with synthetic data. Following the local analysis of the changes along the protein sequence, we computed the q-order moments of the fluctuations systematically in order to track whether there is a self-similar repeating pattern in the inner sequence. We showed that the change fluctuations within the protein sequence have a monofractal behavior, which on average over the 242 sequences is embedded between the Kolmogorov and Brownian monofractal processes, with a Hurst exponent between 1/3 and 1/2.
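For readers who want to reproduce the first step, a normalized detrended cumulative sum can be sketched as below. This is a minimal illustration under stated assumptions, not the exact algorithm of this work. The linear trend is fitted by ordinary least squares and its slope is taken as the drift. The division of the residual by its standard deviation is an assumed normalization. The function name `ndcs` and the synthetic inputs are hypothetical.

```python
import math
import random

def ndcs(values):
    """Return (drift, fluctuations) for one encoded sequence.

    drift: least-squares slope of the cumulative sum against position.
    fluctuations: detrended cumulative sum, scaled to unit standard deviation
    (assumed normalization; left unscaled if the residual is numerically zero).
    """
    n = len(values)
    csum, total = [], 0.0
    for v in values:
        total += v
        csum.append(total)

    ks = list(range(1, n + 1))
    mk, mc = sum(ks) / n, sum(csum) / n
    slope = (sum((k - mk) * (c - mc) for k, c in zip(ks, csum))
             / sum((k - mk) ** 2 for k in ks))
    intercept = mc - slope * mk

    resid = [c - (slope * k + intercept) for k, c in zip(ks, csum)]
    sd = math.sqrt(sum(r * r for r in resid) / n)
    if sd <= 1e-12 * (abs(mc) + 1.0):
        return slope, resid
    return slope, [r / sd for r in resid]

# A constant signal has a purely linear cumulative sum: drift 2, no fluctuation.
drift_const, fluct_const = ndcs([2.0] * 100)

# Hypothetical encoded sequence of 466 values with a small positive mean.
random.seed(0)
encoded = [0.3 + random.gauss(0.0, 1.0) for _ in range(466)]
drift, fluct = ndcs(encoded)
```

The returned fluctuation series is the kind of input that the entropy, complexity, and scaling analyses summarized in this section would operate on, while the drift is the per-sequence parameter compared with the activity.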
To complement the local analysis and to give an overview of the nonlinearity, we have computed statistical parameters related to entropy, chaos, complexity, and fractality. We demonstrated that the NDCS of the D-PRIFT index changes for the 242 protein sequences used in this study exhibits statistically moderate complexity and low chaotic fluctuations.
Moreover, to integrate these results in the analysis of the protein activity changes for each sequence, we have studied the relationship between the linear trend (drift) computed with the cumulative sum algorithm, the Kolmogorov complexity spectrum, which is indicative of computational complexity, and the activity of each protein sequence. As this analysis focused on the dynamics of the changes, we also applied the normalized detrended cumulative sum to these three parameters, as done for the inner-sequence analysis. The results show a strong linear relationship between the bivariate (drift, activity) and (complexity, activity) parameters, which provides insight into the potential use of drift and complexity as predictors in a linear model. Moreover, the analysis of the trajectories in the bivariate space highlights a superdiffusive behavior of the change fluctuations, with a power-law exponent of around 1.6 for the mean square displacement of each chosen bivariate parameter. This study demonstrates that the changes in the inner sequence and across sequences are nonstationary, stochastic, irregular, complex, weakly chaotic, and monofractal. To conclude, there is some predictability of protein sequence changes, which can be modelled using a stochastic model. The linear law and scale invariance features found in this study should be explored in future work on classification and regression predictive models, and could be useful in the field of protein engineering.