Hiding the Source Code of Stored Database Programs

Abstract: The objective of the article is to present an approach to hiding the code of stored programs held in a database. The essence of this approach is the combined use of two methods. The first is random permutation of the code characters of a specific stored program, which are located in several rows of some attribute of a database system table. The second is substitution, in which each character obtained after the permutation may be replaced with another one randomly selected from the Unicode standard. A legitimate user with the appropriate privileges gets access to the source code of the stored program because the transformation inverse to masking can be performed quickly and the program code can be written back into the database. All other users, and attackers who lack certain information, can read only the masked, format-preserving code of stored programs. The proposed solution is more efficient than the existing methods of hiding the code of stored programs provided by the developers of some modern database management systems (DBMS), since an attacker will need much greater computational and time resources to disclose the source code of stored programs.


Introduction
Today, various commercial management systems exist for traditional relational databases and for databases (DBs) of the NewSQL class, which seek to combine the advantages of NoSQL and the transactional requirements of classical databases in order to increase productivity, reduce network traffic, simplify access to databases for applications, hide many specific features of a database management system (DBMS) and database from the user, and ensure business rules and a higher level of data security.
These systems offer various methods of writing and saving stored programs in the database schema [1][2][3][4]. Such code fragments are usually created using SQL and its specific implementation in the selected DBMS. Different languages are used for this purpose in various DBMSs (for example, PL/SQL and Java in Oracle; Transact-SQL and various .NET Framework languages in Microsoft SQL Server; PSQL in Firebird; SQL/PL and Java in DB2; PL/pgSQL, PL/Tcl, PL/TclU, PL/Perl, PL/PerlU, and PL/Python in PostgreSQL; SQLScript and R in SAP HANA; MySQL's stored procedures follow the SQL/PSM standard). It is advisable to hide the code of these stored programs (SPs) for the following reasons: observance of intellectual property rights; commercial value; the role of the code in solving the tasks of protection and distribution of access rights to data; and the inadmissibility of code modification by other users or processes (cybercriminals are resorting to increasingly sophisticated attack methods, carrying them out, for example, even through devices that are used to control and ensure security, or by adapting malware based on open-source code to turn it into a threat).
In various sources, there are several terms related to hiding information such as data anonymization, data de-identification, data scrambling, data scrubbing, data obfuscation, and data masking. In this paper, we use terms such as masking (mask), obfuscation (obfuscate), and hiding (hide).
To hide (obfuscate, mask) the code, some DBMS developers offer various means. For example, in the Microsoft SQL Server DBMS you can use the mechanisms for encrypting the code of stored procedures via the WITH ENCRYPTION option in the CREATE PROCEDURE and ALTER PROCEDURE constructs [5]. In Oracle DBMS, the Wrap utility, as well as the built-in DBMS_DDL and DBMS_WRAP packages, can be used to make PL/SQL code unreadable [6,7].
However, as experts in database security note and as practice shows, the built-in tools supplied with the DBMS for hiding stored programs are not effective enough and can be easily circumvented by attackers, especially those with privileged user rights [8][9][10][11].
For example, stored procedures encrypted using the WITH ENCRYPTION option in Microsoft SQL Server DBMS can be quite simply decrypted using the "dSQLSRVD" utility (for Microsoft SQL Server 7 or 2000), a stored procedure called "Decryptsp2K", or newer free standalone tools such as ApexSQL Decrypt and dbForge SQL Decryptor.
Today, an algorithm and programs are known that perform the inverse transformation of the so-called "wrapped" code of stored procedures, functions, and packages in Oracle DBMS, which is unreadable by conventional means. Examples are the online program "Unwrap It!" and "PL/SQL Unwrapper" for SQL Developer [12].
Thus, we can conclude that the available capabilities of built-in tools in some DBMSs cannot fully ensure effective code hiding of stored programs, not to mention other DBMSs where there are none at all. Therefore, a certain revision of the approach to solving this problem is required, the result of which would be certain methods, techniques, and means that are relevant both in theoretical and in applied aspects.
The objective of our paper is to present an approach to hiding the code of stored programs held in some system table of the database. The main contribution of the authors is a technique for hiding the code of stored programs based on the combined use of random permutation of code characters and substitution of each character obtained after the permutation with another one randomly selected from the Unicode standard.
The rest of this paper is organized as follows: Section 2 presents related works from the literature; Section 3 discusses the algorithms we use to mask the source code of stored programs; Section 4 presents a technique for recovering the code of masked stored programs; Section 5 concludes this work. The main abbreviations and symbols used in the paper are shown in Abbreviations.

Related Works
Known data masking techniques include the following:
- Substitution. This technique consists of randomly replacing the contents of a data column with information that looks similar but is completely unrelated to the real data (for example, real customer last names in the database can be replaced with last names taken from a large random list). Substitution is very effective in terms of preserving the appearance of existing data. The disadvantage is that, for each column to be replaced, a large amount of replacement information must be available.
- Shuffling. This is a technique of randomly shuffling the existing field values in a table column (for example, the data of a table column containing medical records about patients' health status are randomly shuffled). A more complex version is also possible, for example, the so-called method of statistical obfuscation (DataSifter) [20], which combines introducing artificial random missingness with partial alterations using data swapping within subjects' neighborhoods.
- Random data deviation (random data perturbation, random decimal numbers, random dates, random digits, random strings). It is sometimes useful to perturb the values of the database by a small error [21]. The existing value is replaced with a random one in a certain range. This technique can prevent attempts to discover true records using known date data or the exposure of sensitive numeric or date data.
- Encryption, including format-preserving encryption (FPE) [22,23]. FPE matters because ordinary encryption, as a rule, changes the format of the original data and may increase the data dimension, which is not always desirable (for example, due to the need to control the integrity of the code of the masked stored program by its length).
- Nulling out or deletion. This is the simple deletion of column data by replacing it with NULL.
- Masking out. This technique is a special case of the substitution technique in which all masked characters are replaced with the same symbol, for example, "X" (in this case, the credit card number would be 3435 XXXX XXXX 3775).
- Masking of numerical data using modulo operations (MOBAT, modulus-based technique).
- Compound masking. This is the technique of masking related columns as a group, ensuring that masked data across the related columns retain the same relationship, for example, masking address fields such as city, state (region), and postal code. These values must be consistent after masking.
- Tokenization. In this technique, data elements are replaced with random tokens, that is, values that should not be associated with the replaced sensitive data either mathematically or in any other way.
A more detailed overview of different data masking techniques is given in [13]. For the practical implementation of some of these techniques in Oracle, for example, there is a data masking pack [14], which allows you to choose different masking techniques.
However, most of the masking techniques described above, except for encryption methods including FPE [22,23], tokenization [24], and MOBAT techniques [16][17][18][19], are used for static masking of non-production databases and, once applied, cannot be reversed to return to the original data. This is not always acceptable for production databases.
At the same time, the encryption method, including format-preserving encryption [22,23], is quite resource-intensive [14,15]. MOBAT is specifically designed to mask only numerical values [16][17][18]. However, quite often there is a need to mask more than numerical values. For example, in databases built on a schema with the universal basis of relations [25], which can be used, among other things, for data warehouses of various subject domains, the attributes of relations containing sensitive data are defined on the domain of character strings. Attempts to apply the masking procedure of the modified MOBAT technique described in [19] to non-numerical data would not be successful.
Therefore, the need arises to find some new solution. One such solution [26] was taken as a basis. In [26], an approach to hiding data stored in a database was proposed, which is based on the principles of random permutation of elements (bytes, characters) of a specific field of the corresponding column (attribute) of a row (tuple) of data and dynamic data masking. Moreover, since the source code of SP is also stored in a certain way in the corresponding DBMS tables, this approach can be fully applied to hide it.

Masking the Source Code of Stored Programs
The proposed solution uses a universal schema for hiding data, including the source codes of stored programs stored in some database which is constant for all values in the columns that are masked in this row.

Algorithm for Masking the Code of Stored Programs
We consider the implementation features of this approach using the example of hiding the source codes of stored programs for Oracle DBMS. Although the general provisions of the material below are true for any DBMS of the class in question that supports the ability to work with stored programs, the features in this paper relate to specific system tables, views, software implementations of cryptographic primitives, and some other objects of a specific DBMS.
The general scheme of the somewhat refined masking algorithm described in [26] is presented below (Algorithm 1).

Algorithm 1. Masking algorithm 1 (MA-1)
Input: $name$ is the name of the $j$-th masked stored program in the table $R$; $type$ is the type of database object (table, procedure, function, package); $A$ represents the rows of source code ($X$) or masked code ($Y$) of the stored program; $hash()$ is one of the cryptographic hash functions (such as MD4 (Message Digest), MD5, SHA-1 (Secure Hash Algorithm), SHA-256, SHA-384, SHA-512, or SHA-3); $PRNG$ is the pseudo-random number generator (PRNG) used (from the list available); $z_{per}$ is the number of repetitions of permutations (how many times in a row the permutation operation is performed); $k$ is the reference checksum for the source code of the stored program; and $l$ is the code length of the stored program in characters.
As is known, when stored PL/SQL programs are created, the byte-code of the programs and their source code are stored in the data dictionary of the Oracle database. The source code is available for line-by-line viewing, for example, in system views such as DBA_SOURCE, ALL_SOURCE, USER_SOURCE, INT$DBA_SOURCE, and CDB_SOURCE, and in the system table SYS.SOURCE$ (in the context of the considered approach, this is the table $R$).
The hash function mixes the private and public keys by a noninjective transform, so that the keys cannot be recovered from the final result and the resulting initial value (seed) $X_0^{R_j}$ for the PRNG differs significantly from other seeds, even if just one of these keys changes by one character (unit).
It is appropriate to use the object number identifier (ID) from the system view ALL_OBJECTS as the public key $K_i^3$. The OBJ# attribute of the SYS.SOURCE$ table is such an identifier.
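The role of the hash function in forming the seed can be illustrated with a minimal Python sketch. It is not the authors' implementation: the way the keys are joined (a `|` separator), the sample key values, and the reduction of the digest modulo an upper bound are assumptions made only for illustration.

```python
import hashlib

# Example upper bound for the seed; the value 4,294,967,295 appears in the text,
# but using it as a modulus is an assumption of this sketch.
N_MAX = 4_294_967_295

def derive_seed(private_keys, public_key, hash_name="sha256", n_max=N_MAX):
    """Mix private and public keys with a hash and reduce the digest to a PRNG seed."""
    # Hypothetical encoding: keys are joined with a separator before hashing.
    material = "|".join([*private_keys, str(public_key)]).encode("utf-8")
    digest = hashlib.new(hash_name, material).digest()
    return int.from_bytes(digest, "big") % n_max

# Hypothetical keys; the public key stands in for an object number such as OBJ#.
seed_a = derive_seed(["k1-secret", "k2-secret"], 74125)
seed_b = derive_seed(["k1-secret", "k2-secret"], 74126)  # public key changed by one unit
print(seed_a, seed_b)
```

Because the digest depends on every input bit, changing the public key by one unit is expected to give an unrelated seed, which is the property the text relies on.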
Before starting the masking procedure in accordance with the above algorithm, it is necessary to perform the following preparatory operations:
1. Determine which SPs should be transformed;
2. Generate the corresponding private keys;
3. Create a relation (table) $R_{secret}$ containing the information necessary to hide and restore sensitive user data;
4. Encrypt all values of its rows and columns with one of the cryptographically strong algorithms, for example, the AES-256 algorithm;
5. Create for legitimate users stego files (stego containers) that store the decryption key for the data of table $R_{secret}$.
When the source code of an SP is masked in accordance with the MA-1 algorithm (Algorithm 1), one can count on its rather effective hiding from attackers. With a large code length $l$ (large dimensionality of $A$) and without knowledge of $X_0^{R_j}$, it is very difficult to determine the sequence of generated random numbers for the permutation. With a brute-force search, the number of possible variants of the generated sequences of random numbers for the permutation is $l!$. For example, consider finding the sequence of generated random numbers for the code permutation of some stored procedure of length $l = 230$ characters (the number of possible permutations is $230! \approx 7.76 \times 10^{444}$) by a brute-force attack in one specific implementation. In this implementation, the initial value of the sequence used in the PRNG is derived from the hashed keys; to prevent overflow of the bit grid in subsequent calculations, $N_{max}$ is chosen from a range of integers not exceeding, for example, 4,294,967,295; the PRNG used is known (in this case, a linear congruential generator, LCG), as is the hash function; and $z_{per} = 1$. Under these conditions, the attack may take more than 2 months on a computer with an Intel(R) Core(TM) i3-7100 central processing unit (CPU) at 3.90 GHz, 4 GB of random-access memory (RAM), and a 500 GB hard disk drive (HDD), running Windows 10 (x64) and Oracle DBMS 12.2c.
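The order of magnitude of the number of permutations quoted above can be checked with a few lines of Python. This is only an arithmetic aid; the helper names are hypothetical.

```python
import math

def log10_factorial(n):
    """Return log10(n!) via the log-gamma function, avoiding huge integers."""
    return math.lgamma(n + 1) / math.log(10)

def sci_notation(log10_value, digits=2):
    """Split a base-10 logarithm into (mantissa, exponent)."""
    exponent = math.floor(log10_value)
    mantissa = round(10 ** (log10_value - exponent), digits)
    return mantissa, exponent

mantissa, exponent = sci_notation(log10_factorial(230))
print(f"230! ~ {mantissa} x 10^{exponent}")
```

The same helper can be applied to other code lengths $l$ to see how quickly the search space grows.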
Naturally, if the hash function and the PRNG used are unknown (especially if the generator has a larger number of internal states) and $z_{per} > 1$ (in which case the permutation/transposition cipher becomes significantly more secure [27]), then the time needed to determine the sequence of generated random numbers for the code permutation of some SP by a brute-force attack increases.
In the general case, with a larger $N_{max}$, when a longer source SP is masked, this time grows very quickly, since the factorial grows faster than any exponential function or any power function, as well as faster than any sum of products of these functions. The modern version of the Fisher-Yates shuffle algorithm [28][29][30], which is the basis of the MA-1 algorithm, cannot create more permutations than the number of internal states of the generator $m$. Even with this limitation, the number of permutations is quite large. For example, for the G. Marsaglia Xorshift pseudo-random number generator [31] with $m = 2^{128} - 1$, the number of permutations can reach $3.4 \times 10^{38}$, and, for the PRNG recommended in [32], the number grows to $3.138 \times 10^{57}$. Moreover, in most cases, there is no need to obtain all permutations [29]. However, in fairness, it should be noted that the permutation (transposition) cipher is vulnerable to frequency analysis techniques [33]. Therefore, most likely, the brute-force technique will not be effective for large values of $l$, $z_{per} > 1$, and an unknown hash function and PRNG. In this case, statistical analysis can become a more effective method for disclosing masked stored programs. However, its use also may not achieve the desired effect for an attacker, because each character and value of a stored program plays a significant role for the executable instructions, and inaccurate recovery of the code is unacceptable: even minor changes to characters and values (constants) in the code of a stored program can lead to completely opposite actions. It is not always clear which of the candidate options is true, even after going through all possible permutation options. For example, comparisons of a variable $x$ with 0 that differ only in the relational operator would be equally plausible options.
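The limit imposed by the generator's state count can be made concrete. The sketch below (helper name hypothetical) finds the smallest code length $l$ for which $l!$ already exceeds the $2^{128} - 1$ internal states mentioned above, so that a single seeded shuffle cannot reach every permutation.

```python
import math

def smallest_length_exceeding_states(state_count):
    """Smallest l for which l! exceeds the number of PRNG internal states."""
    l, factorial = 1, 1
    while factorial <= state_count:
        l += 1
        factorial *= l
    return l

xorshift128_states = 2**128 - 1
threshold = smallest_length_exceeding_states(xorshift128_states)
print(threshold, f"{xorshift128_states:.3e}")
```

For any code longer than this threshold, the number of reachable permutations is bounded by the number of states rather than by $l!$, which is the limitation the paragraph describes.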
As such, it is possible to mask not the entire code of the stored program but only part of it (without the header of the stored program and its end), so that the attacker does not have a priori information that would simplify cryptanalysis. Therefore, without knowledge of $X_0^{R_j}$ for the PRNG, it is very difficult to correctly restore the original code of the source procedure. At the very least, this is much more difficult than with the existing built-in means for hiding stored programs. To further complicate statistical analysis, Section 3.2 of this paper proposes an approach based on the combined use of permutation and substitution (polyalphabetic cipher) methods.
In further research, the authors plan to study in more detail the strength of the proposed solution for statistical analysis. It should be noted that the operations performed in accordance with the MA-1 algorithm (Algorithm 1) do not lead to a change in the format of the source data (SP code lines) and the code length. This is very important for realizing the possibility of performing the necessary subsequent actions related to saving the program code on the database server and the primary control of the code integrity of the masked program, namely, the ability to evaluate whether the code length was changed (regardless of whether this was done by mistake or intentionally by an attacker/malware).
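The format-preserving property can be illustrated with a minimal Python sketch of a seeded Fisher-Yates permutation repeated $z_{per}$ times. It is not the MA-1 algorithm itself: Python's `random.Random` stands in for the PRNG, the sample code string and seed are hypothetical, and treating the whole program text (including line feeds) as one character sequence is an assumption.

```python
import random

def permute_once(chars, rng):
    """One Fisher-Yates pass (modern in-place variant) driven by the supplied PRNG."""
    for i in range(len(chars) - 1, 0, -1):
        j = rng.randrange(i + 1)  # 0 <= j <= i
        chars[i], chars[j] = chars[j], chars[i]

def mask_by_permutation(source, seed, z_per=1):
    """Permute the characters of `source` z_per times in a row."""
    rng = random.Random(seed)  # stand-in PRNG, not the generator used in the paper
    chars = list(source)
    for _ in range(z_per):
        permute_once(chars, rng)
    return "".join(chars)

source = "begin\n  if salary > 0 then null; end if;\nend;"
masked = mask_by_permutation(source, seed=123456789, z_per=2)
print(len(source), len(masked))
```

The masked text has the same length and the same multiset of characters as the source, so a length comparison remains a valid first integrity check.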
Later, to control the integrity of the SP code, the checksum of the restored program code is compared with the reference checksum stored in encrypted form in the table $R_{secret}$. This makes it possible to verify that the SP code obtained as a result of the inverse masking is unchanged, so that it can be used without the risk of performing any undocumented (malicious) actions. The transformation in accordance with the MA-1 algorithm (Algorithm 1) is only part of the process of masking the stored program code, in contrast to the process of masking the data of the corresponding table attributes, which was considered in detail in [26]. The stored program code obtained as a result of the transformation needs to be saved on the server, which can only be done by executing the corresponding SQL statements. Specifically, in order for the transformed code of the masked procedure, function, or package to be stored on the server (even with errors), it is necessary to use the statement that creates the corresponding object (CREATE…).
Therefore, the result of the permutation needs to be concatenated with the CREATE construct, with the two parts separated by $chr(10)$, the line feed (LF) control character.
For example, for some DEMO stored procedure, for which a transformation was performed to mask its code, the following line is generated with the CREATE construct:

CREATE OR REPLACE PROCEDURE DEMO (salary IN NUMBER) as,
with which the transformed procedure code is concatenated. This is why it is important not to change the source data format during the masking operation.
After compilation, this code will be stored on the server, although the object itself (stored procedure) will have an invalid status (INVALID). However, an authorized user can, at any time, using the inverse transformation procedure, restore it to its original form, thus leading, after corresponding compilation, to the status of a VALID object. At this point, the restored program code after the performed actions by an authorized user or process can be masked again.
The concrete form of the line with the CREATE construct (with exact parameters), which is combined by concatenation with the transformed code of the masked procedure, does not matter much. The main role of the line with the CREATE construct is to enable saving the resulting transformed code on the server after corresponding compilation procedure.
An example of transformation (masking) of some stored procedure is given below. In order for the created object (stored procedure, function) to remain VALID, albeit unable to perform its real functions, you can slightly modify the above method of concatenating the CREATE construct with the masked code of the stored program. For example, as an initial construct that is concatenated with the masked SP code, you can use, under certain conditions (considering that nesting multiline comments within each other is not allowed), PL/SQL lines that open a multiline comment around the masked code. You can also use conditional compilation directives. To do this, define the initial construct as follows:

create or replace procedure DEMO (salary IN NUMBER) as
begin
null;
$IF false $THEN /*

The final construct, concatenated after the masked code, closes the constructs opened above. The expression for $Y$ is then the concatenation of the initial construct, the masked code, and the final construct.
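The concatenation step can be sketched in Python as follows. The header lines follow the initial construct quoted above; the footer lines are an assumption of this sketch (the closing construct is not reproduced in the text), and the function names `wrap` and `unwrap` are hypothetical.

```python
LF = chr(10)  # line feed, the separator used for concatenation

HEADER_LINES = [
    "create or replace procedure DEMO (salary IN NUMBER) as",
    "begin",
    "null;",
    "$IF false $THEN /*",
]
# Assumed closing construct: ends the comment, the $IF block, and the procedure body.
FOOTER_LINES = ["*/ $END", "end;"]

def wrap(masked_code):
    """Concatenate the initial construct, the masked code, and the final construct."""
    if "*/" in masked_code:
        # A comment terminator inside the masked code would end the comment early.
        raise ValueError("masked code would break the enclosing comment")
    return LF.join(HEADER_LINES + [masked_code] + FOOTER_LINES)

def unwrap(stored_code):
    """Drop the added first and last lines to get the masked code back."""
    lines = stored_code.split(LF)
    return LF.join(lines[len(HEADER_LINES):len(lines) - len(FOOTER_LINES)])

stored = wrap("dn;e l0>yralas\nfi  nigeb")
print(unwrap(stored) == "dn;e l0>yralas\nfi  nigeb")
```

The check for `*/` reflects the condition mentioned above: multiline comments cannot be nested, so masked code containing a comment terminator cannot be wrapped this way.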

An Improved Algorithm for Masking the Code of Stored Programs
To increase resistance to brute-force attacks and frequency analysis, we propose slightly modifying the MA-1 algorithm (Algorithm 1). Specifically, we propose a variant of the algorithm that uses not only the permutation method but also the substitution method to mask the source code of stored programs. The substitution character alphabet consists of Unicode characters (for example, in the ranges U+0000-U+007F (ASCII characters) and U+0400-U+052F). In the improved algorithm, after the random permutation of the stored program code characters, each resulting character is substituted with a Unicode character selected at random. For this purpose, a PRNG ($PRNG_{sub}$) is used, which generates sequences of random numbers starting from the initial value $X_0^{R_j}$. The substitution is performed by a transformation driven by these random numbers. Taking the above into account, the general scheme of the algorithm for masking the code of stored programs can be represented as shown below (Algorithm 2).
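One possible form of such a substitution is sketched below in Python. The exact transformation used by the authors is not reproduced in the text, so the additive shift modulo the alphabet size is an assumption (a common polyalphabetic construction), and `random.Random` again stands in for $PRNG_{sub}$.

```python
import random

# Substitution alphabet built from the two Unicode ranges named in the text.
ALPHABET = [chr(c) for c in range(0x0000, 0x0080)] + [chr(c) for c in range(0x0400, 0x0530)]
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def substitute(permuted, seed):
    """Replace each character by one shifted a random number of alphabet positions."""
    rng = random.Random(seed)  # stand-in for PRNG_sub
    out = []
    for ch in permuted:
        r = rng.randrange(len(ALPHABET))
        out.append(ALPHABET[(INDEX[ch] + r) % len(ALPHABET)])
    return "".join(out)

def inverse_substitute(masked, seed):
    """Regenerate the same random shifts and subtract them."""
    rng = random.Random(seed)
    out = []
    for ch in masked:
        r = rng.randrange(len(ALPHABET))
        out.append(ALPHABET[(INDEX[ch] - r) % len(ALPHABET)])
    return "".join(out)

sample = "begin null; end;"
masked_sample = substitute(sample, seed=42)
print(inverse_substitute(masked_sample, seed=42) == sample)
```

Because each position gets its own random shift, identical source characters generally map to different masked characters, which is what blunts simple frequency analysis. The code length in characters is unchanged.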

Algorithm 2. Masking algorithm P (MA-P)
Below are examples of the resulting transformed code of the DEMO stored procedure in the case of using the Xorshift random number generator with a period of $2^{128} - 1$ (option 3, Example 1): masking without substituting the characters of the source code after their permutation (Example 2) and with the substitution of characters (Example 3). Such code will not cause a compilation error; however, of course, it does not reflect the essence of this procedure!

Technique for Recovering the Code of Masked Stored Programs
In order to restore the source code of a masked program, a legitimate user (or process) must first do some preliminary work. Specifically, depending on the actions performed during masking, either one added line (it is the line: "create or replace procedure DEMO (salary IN NUMBER) as" for Example 1) or the first four lines and the last two (Examples 2, 3) must be preliminarily excluded.
Then, in the event that character substitution was performed, the procedure inverse to it is required. To find the inverse permutation, it is necessary to determine the initial permutation used in masking.
As a result, the sequence of performed actions will be as follows:
1. Unnecessary lines of the transformed code of the stored program, added during the concatenation process, are deleted.
2. The private keys are obtained from the table $R_{secret}$. The values of these keys are never shown. They are not known either to the database administrator (unless the administrator also performs the functions of a security administrator) or to any other user. An authorized user with the appropriate privileges who provides, in an open session, the correct key for decrypting the data of table $R_{secret}$ has mediated access (through special software) to the private keys (as well as to other data of table $R_{secret}$). Furthermore, the value of this key is not shown anywhere in the clear, and it cannot be traced even through the available means of documenting executed queries (a historical command log). It is extracted by the special DBMS server software from the stego container, which is presented by an authenticated user with the appropriate privileges when the session is opened.
3. The code length of the stored program $Y$, $length(Y)$, is calculated. If the length is not equal to $l$, then further actions in accordance with this algorithm are terminated.
4. The initial value ($X_0^{R_j}$) is formed, which will be used by exactly the same PRNG that was used during the initial permutation of the selected $j$-th SP.
5. The array $num$ is prepared.
6. If the substitution procedure was applied to $Y$, the transformation inverse to it is performed to restore the characters contained in the masked stored program. It should be noted that the sequence of random values $r_i$ is formed by exactly the PRNG that was used in the substitution procedure.
7. The initial permutation is determined. Because the PRNG reproduces the same sequence of numbers from the same initial value, performing the actions of the Fisher-Yates algorithm yields the initial permutation.
8. In accordance with the inverse of this permutation, the initial (unmasked) value of the SP ($X$) is determined.
9. The checksum of the recovered SP code, $hash(X)$, is calculated and compared with the reference value $k$ from table $R_{secret}$. If $hash(X) = k$, then the restored SP is compiled and can be used further. Otherwise, a message is issued indicating that the restored SP cannot be used.
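The core of steps 3 to 9 (without the substitution step and without key handling) can be sketched in Python as follows. This is an illustration under assumptions, not the IMA-P algorithm: `random.Random` stands in for the PRNG, SHA-256 for the checksum, and the seed, sample code, and function names are hypothetical. The index array `num` is shuffled with the same seed so that the permutation applied during masking can be replayed and inverted.

```python
import hashlib
import random

def shuffle_in_place(items, rng, z_per):
    """Apply z_per Fisher-Yates passes using the supplied PRNG."""
    for _ in range(z_per):
        for i in range(len(items) - 1, 0, -1):
            j = rng.randrange(i + 1)
            items[i], items[j] = items[j], items[i]

def mask(source, seed, z_per=1):
    chars = list(source)
    shuffle_in_place(chars, random.Random(seed), z_per)
    return "".join(chars)

def recover(masked, seed, z_per, expected_length, reference_checksum):
    if len(masked) != expected_length:                 # length check (step 3)
        raise ValueError("length mismatch: masked code was altered")
    num = list(range(len(masked)))                     # index array (step 5)
    shuffle_in_place(num, random.Random(seed), z_per)  # replay the permutation (step 7)
    restored = [""] * len(masked)
    for position, original_index in enumerate(num):    # invert it (step 8)
        restored[original_index] = masked[position]
    text = "".join(restored)
    checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if checksum != reference_checksum:                 # checksum comparison (step 9)
        raise ValueError("checksum mismatch: restored code must not be used")
    return text

original = "begin\n  if salary > 0 then null; end if;\nend;"
reference = hashlib.sha256(original.encode("utf-8")).hexdigest()
masked_code = mask(original, seed=2024, z_per=3)
print(recover(masked_code, 2024, 3, len(original), reference) == original)
```

Shuffling the index array instead of undoing swaps one by one keeps the recovery independent of how many passes were applied, as long as the same seed and $z_{per}$ are used.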

Algorithm for Restoring the Code of Masked Stored Programs
The general scheme of the inverse masking algorithm is represented below (Algorithm 3).

Algorithm 3. Inverse masking algorithm P (IMA-P)
Acting in accordance with this algorithm, you can restore the source code of any masked stored program. For example, as a result of the implementation of this algorithm, the code for the masked DEMO procedure (Example 3) will be restored to its original form (Example 1).
As practice shows, it is very difficult to restore the source code of a stored program without knowing the keys and other information stored in the table $R_{secret}$. An authorized user with the appropriate privileges, on the contrary, can recover the code of the masked SP very quickly. Depending on the length of the SP and server performance, this time is usually fractions or units of milliseconds (in the worst cases, it does not exceed units of seconds).

Conclusions
1. We analyzed the role of stored programs in various commercial traditional relational database management systems and NewSQL class databases, including their role in comprehensively ensuring a higher level of data security. This covers protection of the code itself from unauthorized study, use, copying, and modification, since it is a special copyright object that solves the tasks of protection and distribution of access rights to data. We also analyzed the capabilities of the tools built into some DBMSs for hiding the code of stored programs. We concluded that it is advisable to search for new solutions for effectively hiding the code of these programs, resulting in methods, techniques, and means that are relevant in both theoretical and applied aspects. On this basis, a new approach was developed for hiding the code of stored programs held in several corresponding tuples of a certain attribute of some system database table. The basis of this approach is the principle of random permutation of the elements (characters) of the fields of all these data tuples, with the possible replacement of each such character with another character randomly selected from the Unicode standard.
2. The proposed solution is more efficient than the existing methods for hiding the code of stored programs provided by the developers of some modern DBMSs, since an attacker will need greater computational and time resources to disclose the source code of stored programs. This is the case despite its individual weaknesses, associated mainly with a specific implementation, namely, with the limitedness of $N_{max}$, the selected hash function and PRNG, the availability and dimension of the stego container, and the key length for table $R_{secret}$.
3. To ensure the integrity of the restored code of stored programs (to detect possible unauthorized changes to it, regardless of whether they were made by mistake or intentionally by an attacker or a malicious program), the proposed solution compares its length and checksum with reference values stored in encrypted form in the table $R_{secret}$. This makes it possible to verify that the code of a specific stored procedure obtained as a result of the transformation inverse to masking is unchanged, so that it can be used without the risk of performing any undocumented (malicious) actions.
In the future, to monitor the changes (distinguishing current and unauthorized) in the procedures, functions, packages, and triggers that are critical for the user and the system, it is proposed to use the capabilities of the blockchain paradigm, facilitating the work for database administrators and security officers.