A New Way to Store Simple Text Files

Abstract: In the era of ubiquitous digitization and the Internet of Things (IoT), information plays a vital role. All types of data are collected, and some of these data are stored as text files. An important aspect, regardless of the type of data, is file storage, especially the amount of disk space that is required. The less space is used for storing data sets, the lower the cost of this service. Another important aspect of storing data warehouses in the form of files is the cost of the data transmission needed for file transfer and processing. Moreover, the stored data should be at least minimally protected against access and reading by other entities. These aspects are particularly important for large data sets such as Big Data. Considering the above criteria, i.e., minimizing storage space and data transfer and ensuring minimum security, the main goal of the article is to show a new way of storing text files. This article presents a method that converts data from text files such as txt, json, html, and py to image files in the png format. Taking into account the output file size, the results obtained for the test files confirm that the presented method reduces the need for disk space and hides data in an image file. The described method can be used for texts saved in extended ASCII and UTF-8 coding.


Introduction
The amount of information we deal with in the present world is enormous. Evidence of this is the growing interest in so-called Big Data, i.e., large data sets for which traditional methods of analysis and processing are ineffective [1]. Often, those data are of different types (text, graphic, etc.) and come from various sources, e.g., social media [2] or machine data [3]. Regardless of the nature of the data and whether they belong to Big Data, the storage method plays a critical role. Some of these data are stored on private disks or servers, and some use cloud services (e.g., Google Drive, Dropbox, or Microsoft OneDrive). In the case of cloud solutions or those where we do not have direct supervision over our data, there is a risk of loss or modification of stored data [4]. Of course, the service provider is responsible for these issues; however, relying solely on this may prove deceptive, as confirmed by occasional reports of data leaks or defacement of internet portals [5]. For this reason, it is worth hiding or encrypting data that are confidential. Confidentiality can be provided by the use of steganography as well as cryptography. Steganography makes it possible to hide data in such a way that their presence cannot be detected by humans [6]. Unlike steganography, cryptography does not hide the presence of the data; it transforms them so that their content cannot be read directly [7,8].
Big Data, due to its features, i.e., volume, variety, and velocity, requires different storage than classical databases. These data require a set of servers working as a parallel system to store and to access the data for further processing. NoSQL is the type of database that is characterized as non-relational, distributed, and scalable, thus making them suitable for Big Data applications. There are more features of NoSQL databases, including easy replication, schema-free, and BASE, which are precisely described in [9].
Big Data storage can be placed within the Big Data value chain [10] on the fourth level, after data acquisition, data analysis, and data curation, and naturally before data usage. A review of the current state of the art of data storage technologies yields the following types of storage systems:
• Distributed File Systems, e.g., [11],
• NoSQL Databases, e.g., [12],
• Big Data querying platforms, e.g., [13],
• NewSQL Databases, e.g., [14,15].
Big Data storage systems can be characterized by specifying both strengths and weaknesses, which were presented in [16]. Among the advantages, there is the support of heterogeneous structured data, simultaneous accessibility, and high fault tolerance. On the other hand, there are some weaknesses, for example, the lack of compliance with ACID set of properties that guarantee database transactions reliability (ACID stands for Atomicity, Consistency, Isolation, and Durability).
In case of large distributed file systems, it is essential to consider the size of the stored file, as well as its security attributes. One of the compression methods can be applied to reduce the size of the file. At the same time, one of the critical security attributes, namely confidentiality, can be set by the use of cryptography or steganography.
The main contribution of this article is to present a method of storing text data that ensures their compression and, at the same time, hides their content directly. It involves changing the data format from text to graphic data with the extension png. By an appropriate transformation, the content of the text file is converted into a png file whose content looks like a random string. To the best of our knowledge, no one has proposed such a solution before.
The article is divided into four parts. The first part consists of an introduction and an overview of related works. In the next section, the method of converting text files to the png graphic format is presented. Later, an example of the application of the specified method is discussed. The last part of the article consists of a summary and a list of references.

Compression
Another problem related to storing data on external platforms is the volume that they occupy. It is especially important in the case of huge data volumes, which we deal with within the context of Big Data [16]. The smaller the data set is, the faster it will be processed, and the cost of the storage service provided is lower. This cost includes not only the cost of disk space but also the cost associated with the data transfer process. For this reason, stored data should be compressed, which will optimize costs. The compression used should be lossless, i.e., the data after decompression should have the same content as the data before compression [17]. Examples of file types that accomplish this task include png [18] (for image files) or zip [19], which is a solution for all file types.
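The lossless requirement can be checked directly. The snippet below is a minimal sketch, not part of the original article: it uses Python's standard zlib module (an implementation of DEFLATE, the algorithm family also used by png and zip), and the sample data and the helper name `is_lossless_roundtrip` are hypothetical.

```python
import zlib

def is_lossless_roundtrip(data: bytes) -> bool:
    """Compress with DEFLATE and check that decompression restores the input exactly."""
    compressed = zlib.compress(data, level=9)
    return zlib.decompress(compressed) == data

# Hypothetical sample: a highly repetitive text, typical for machine-generated data.
sample = ("temperature;21.5\n" * 1000).encode("ascii")
compressed_size = len(zlib.compress(sample, level=9))
print(len(sample), compressed_size, is_lossless_roundtrip(sample))
```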

Steganography of the Text
The discussed method concerns (among other issues) steganography of text. In the literature, one can find many works on this subject, e.g., [20][21][22][23][24][25][26][27]. Most of them use large amounts of non-confidential information (media) in order to hide a small piece of a confidential message.
In [20], a technique that uses reflection symmetry of the English alphabets was used for this purpose. Thanks to horizontal or vertical symmetry of the alphabet characters, the corresponding bits of the confidential message are hidden, creating appropriately modified text. In turn, Reference [21] developed a method of hiding text using the omega network. This method consists of generating a word from a dictionary that contains two letters coupled using an omega network with one letter of messages to hide. Yet another method of hiding text is in [22]. It uses Huffman coding in such a way that a secret message is attached to the received codes. In [23], a multi-keywords carrier-free text steganography method based on the part of speech tagging is discussed. The work [24], in turn, presents coverless plain text steganography based on the parity of Chinese characters' stroke number. On the other hand, in [25], the authors showed how to hide a secret message in the formatting (e.g., color) of workbook files xls. In [26] a method of hiding text in a properly constructed number system is presented. Besides, email-based text steganography is presented in [27]. The method relies on hiding the secret data within several email addresses.
Another approach is to hide text in a format other than text, e.g., in image files. The basic way that uses images to hide a confidential message is called LSB (Least Significant Bit) method [28]. It uses the least significant bits of the pixel RGB color components to enter the bits of a confidential message.
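The LSB idea can be illustrated with a short sketch. This is a simplified illustration under stated assumptions, not the exact scheme from [28]: pixels are plain Python tuples rather than a real image, the message bits are written most significant bit first, and the function names `embed_lsb` and `extract_lsb` are hypothetical.

```python
def embed_lsb(pixels, message: bytes):
    """Hide the message bits in the least significant bit of each RGB component."""
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    flat = [component for pixel in pixels for component in pixel]
    if len(bits) > len(flat):
        raise ValueError("cover image too small for the message")
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & 0xFE) | bit  # clear the lowest bit, then set it to the message bit
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

def extract_lsb(pixels, n_bytes: int) -> bytes:
    """Read n_bytes back from the least significant bits, in the same order."""
    flat = [component for pixel in pixels for component in pixel]
    out = bytearray()
    for i in range(n_bytes):
        byte = 0
        for bit_index in range(8):
            byte = (byte << 1) | (flat[8 * i + bit_index] & 1)
        out.append(byte)
    return bytes(out)

cover = [(120, 64, 200)] * 16          # 48 components, room for 6 bytes
stego = embed_lsb(cover, b"secret")
print(extract_lsb(stego, 6))
```

Each color component changes by at most one, which is why the modification is not visible to the eye; the price is the large cover image needed for a short message.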

Graphic Format png
The PNG (Portable Network Graphics) file format is a raster image file format that also provides lossless compression of image data. It allows saving data in two variants: RGB (24-bit palette) and RGBA (32-bit palette). Compression is provided by the LZ77 algorithm [29] and Huffman coding [30]. Besides, just before compression, there is an additional stage called filtering. It consists of preparing the data for the best compression. This is done by applying a combination of filters that simplifies the data so that they compress better. An example of the pixel structure of a png file is presented in Figure 1. The png format itself is not patented and is free. More about it can be found, among others, in the specification [18].
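To make the filtering stage more concrete, the sketch below applies the Sub filter (one of the filter types defined in the PNG specification) to a single row of RGB bytes and compares how well the raw and filtered rows compress with zlib. It is an illustration only: a real encoder chooses a filter per row and prepends a filter-type byte, and the helper names and the gradient test row are hypothetical.

```python
import zlib

def sub_filter(row: bytes, bpp: int = 3) -> bytes:
    """PNG Sub filter: each byte minus the corresponding byte of the pixel to the left (mod 256)."""
    out = bytearray(len(row))
    for i, value in enumerate(row):
        left = row[i - bpp] if i >= bpp else 0
        out[i] = (value - left) % 256
    return bytes(out)

def sub_unfilter(filtered: bytes, bpp: int = 3) -> bytes:
    """Inverse of sub_filter, so the filtering step stays lossless."""
    out = bytearray(len(filtered))
    for i, value in enumerate(filtered):
        left = out[i - bpp] if i >= bpp else 0
        out[i] = (value + left) % 256
    return bytes(out)

# Hypothetical row: a smooth gradient, pixel x has the color (x, x + 1, x + 2).
row = bytes(v for x in range(200) for v in (x, x + 1, x + 2))
filtered = sub_filter(row)
raw_size = len(zlib.compress(row))
filtered_size = len(zlib.compress(filtered))
print(raw_size, filtered_size)
```

After filtering, the gradient row consists almost entirely of the value 1, which the LZ77 and Huffman stages compress far better than the original bytes.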

Variant 1: Text Encoded in Extended ASCII
The method of replacing a text file with a png file requires that each character in the text file be saved using 8-bit encoding. This means that some content (with an encoding other than 8-bit) cannot be directly converted to the png format. This is because the RGB color components of the png file are values in the 0-255 range, i.e., saved in 8 bits. However, the occurrence of an inappropriate character (outside the range 0-255) should not affect the further processing of the file. This character will, of course, be incorrectly encoded, and one will not be able to recover its correct value.
The method of converting a text file saved with extended ASCII encoding into a png image file can be described by the following steps:
1. The text file is loaded into the buffer.
2. The following are added to its content: ETX, filename with the extension, ETX, where ETX means end of text and is encoded with an ASCII code of 3.
3. The image size (height and width) is calculated using the formula
$$size = \left\lceil \sqrt{\frac{len}{3}} \right\rceil \qquad (1)$$
where $len$ is the length of the text from step 2 and $\lceil \cdot \rceil$ means the ceiling function.
4. If the length of the text is less than
$$3 \cdot size^2 \qquad (2)$$
the content is appended with random values, so that its length is exactly (2).
5. A two-dimensional array T of size×size elements is created, each element being a 3-element tuple.
6. Text characters are converted to 8-bit numbers (according to extended ASCII encoding) and grouped into 3-element tuples. Consecutive tuples are saved as consecutive elements of the two-dimensional array T.
7. The two-dimensional array of tuples T is treated as a pixel array and saved in the png format. Writing to png is possible using ready-made programming tools, such as the Python library Pillow [31].
8. (optional) The output file is named with the digest of the original filename, computed using the selected hash algorithm, e.g., SHA3 [32].
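The encoding steps for the extended ASCII variant can be sketched as follows. This is a minimal sketch, not the authors' published implementation. It assumes that the image side is $\lceil \sqrt{len/3} \rceil$, which is what a size×size array of 3-element tuples implies, that the text itself contains no byte equal to 3, and that Latin-1 stands in for "extended ASCII". The function names are hypothetical; `save_png` uses Pillow's `Image.frombytes` and is not needed to run the rest.

```python
import math
import random

ETX = 3  # ASCII "end of text", used as the separator

def text_to_pixels(content: bytes, filename: str):
    """Return (size, rows), where rows is a size x size array of RGB tuples."""
    payload = content + bytes([ETX]) + filename.encode("latin-1") + bytes([ETX])
    size = math.ceil(math.sqrt(len(payload) / 3))
    padding = 3 * size * size - len(payload)
    payload += bytes(random.randrange(256) for _ in range(padding))  # random fill
    pixels = [tuple(payload[i:i + 3]) for i in range(0, len(payload), 3)]
    rows = [pixels[r * size:(r + 1) * size] for r in range(size)]
    return size, rows

def save_png(rows, path):
    """Write the pixel array as a png file (requires the Pillow package)."""
    from PIL import Image
    size = len(rows)
    raw = bytes(c for row in rows for pixel in row for c in pixel)
    Image.frombytes("RGB", (size, size), raw).save(path, format="PNG")

size, rows = text_to_pixels(b"Hello, world!", "data1.txt")
print(size, rows[0])
```

Here the payload has 24 bytes (13 of text, 9 of filename, and two ETX markers), so the image is 3×3 pixels and the last pixel is random padding.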
The above method of converting a text file into a png graphic file is reversible. Converting the png file back into a text file can be described by the following steps:
1. Consecutive pixels of the graphic file are loaded to get the values representing the data of the text file, until a pixel with an RGB color component equal to 3 is found.
2. The RGB components of the loaded pixels are interpreted as consecutive ASCII characters and saved in a text file.
3. The next pixels are read to get the values of the ASCII characters representing the name and extension of the file, until another RGB component with a value of 3 is found.
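The reverse direction can be sketched in a few lines. As before, this is an illustration under assumptions rather than the authors' code: pixels are given as RGB tuples in row-major order, Latin-1 stands in for extended ASCII, and the function name `pixels_to_text` is hypothetical.

```python
ETX = 3  # ASCII "end of text", the separator written by the encoder

def pixels_to_text(pixels):
    """pixels: iterable of (R, G, B) tuples in row-major order."""
    values = bytes(c for pixel in pixels for c in pixel)
    first = values.index(ETX)               # end of the file content
    second = values.index(ETX, first + 1)   # end of the file name
    content = values[:first]
    filename = values[first + 1:second].decode("latin-1")
    return content, filename                # anything after `second` is random padding

# Hand-made example: "Hi", ETX, "a.txt", ETX, then one padding pixel.
example = [(72, 105, 3), (97, 46, 116), (120, 116, 3), (200, 17, 3)]
print(pixels_to_text(example))
```

With Pillow, the flat byte sequence of an existing file could be obtained with `Image.open(path).convert("RGB").tobytes()` and passed through the same logic.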

Variant 2: UTF-8 Coded Text
If the characters in the file fall outside the extended ASCII encoding range, the UTF-8 encoding can be used. This means that significant changes must be made to the algorithm of writing and reading from Section 2.1. Besides, as in the variant with ASCII coding, if a character occurs outside of the UTF-8 range, one will get the same situation, i.e., the character will be incorrectly processed, and its original value cannot be restored. In this variant, the compression of the resulting file, which we had in the case of extended ASCII encoding, is not always preserved.
The method of converting a text file saved with UTF-8 encoding into a png image file can be described by the following steps:
1. The text file is loaded into the buffer.
2. The following are added to its content: ETX, filename with the extension, ETX, where ETX means end of text and is encoded with an ASCII code of 3.
3. The image size (height and width) is calculated using the formula
$$size = \left\lceil \sqrt{\frac{2 \cdot len}{3}} \right\rceil \qquad (3)$$
where $len$ is the length of the text from step 2 and $\lceil \cdot \rceil$ means the ceiling function. The factor 2 reflects the two values stored per character in step 7.
4. If the number of values to be stored, $2 \cdot len$, is less than
$$3 \cdot size^2 \qquad (4)$$
the content is appended with random values so that it is exactly (4).
5. A two-dimensional array T of size×size elements is created, each element being a 3-element tuple.
6. Text characters $t_i$ are converted to numbers from 0-65,535 (according to UTF-8 encoding).
7. Each of the values $t_i$ is stored in a positional system with base 256 according to the equation
$$t_i = a_i \cdot 256 + b_i \qquad (5)$$
where $a_i$ and $b_i$ are the coefficients for writing the number $t_i$ in the base-256 system.
8. The values $a_i$ and $b_i$ are written in tuples of length 3. Consecutive tuples are saved as consecutive elements of the two-dimensional array T.
9. The two-dimensional array of tuples T is treated as a pixel array and saved in the png format. Writing to png is possible using ready-made programming tools, such as the Python library Pillow [31].
10. (optional) The output file is named with the digest of the original filename, computed using the selected hash algorithm, e.g., SHA3 [32].
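The base-256 step of the UTF-8 variant can be sketched as below. This is a sketch under several assumptions that are not confirmed by the text: $t_i$ is taken to be the character's code point (via `ord`), the high coefficient $a_i$ is written before the low coefficient $b_i$, ETX is stored like any other character as the pair (0, 3), and the image side is chosen so that $3 \cdot size^2$ values can hold two values per character. All function names are hypothetical.

```python
import math
import random

ETX = 3  # separator, stored as the pair (0, 3)

def text_to_values(text: str, filename: str):
    """Split every character into base-256 coefficients: t = a * 256 + b."""
    values = []
    for ch in text + chr(ETX) + filename + chr(ETX):
        t = ord(ch)
        if t > 0xFFFF:
            raise ValueError("character outside the 0-65,535 range")
        a, b = divmod(t, 256)
        values += [a, b]
    return values

def values_to_pixels(values):
    """Pad with random values and arrange as a size x size array of RGB tuples."""
    size = math.ceil(math.sqrt(len(values) / 3))
    missing = 3 * size * size - len(values)
    padded = values + [random.randrange(256) for _ in range(missing)]
    pixels = [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]
    return size, [pixels[r * size:(r + 1) * size] for r in range(size)]

values = text_to_values("żółw", "d.txt")   # Polish sample text
size, rows = values_to_pixels(values)
print(values[:8], size)
```

Because every character costs two values, a text consisting mostly of ASCII characters carries a zero in every second position, which is consistent with the remark that compression is not always preserved in this variant.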
The above method of converting a text file into a png graphic file is reversible. The following steps describe converting the png file back into a text file:
1. Consecutive pixels of the graphic file are loaded in order to get the values representing the data of the text file, until a pixel with an RGB color component equal to 3 is found.
2. From the RGB components of the pixels, consecutive pairs of values, denoted $a_i$ and $b_i$, are taken. These values are the coefficients of the number $t_i$ written in the base-256 numerical system, i.e.,
$$t_i = a_i \cdot 256 + b_i \qquad (6)$$
3. The values $t_i$ are interpreted as consecutive UTF-8 characters and saved in a text file.
4. The next pixels are read until another RGB component with a value of 3 is found, and, as in step 2, consecutive pairs of values $a_i$ and $b_i$ are used to compute the $t_i$ values with (6). The resulting string is the name and extension of the text file.
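A matching decoder for the UTF-8 variant might look as follows. It is a sketch under the same assumptions as the encoding direction: values come in pairs with the high coefficient first, and ETX is recognized on the reconstructed value $t_i = 3$ rather than on a single component, so that a character such as π (0x03C0, whose high coefficient is 3) is not mistaken for the separator. The function name is hypothetical.

```python
ETX = 3  # separator, expected as the pair (0, 3)

def values_to_text(values):
    """values: flat list of RGB components in row-major order."""
    # Rebuild t_i = a_i * 256 + b_i from consecutive pairs; a trailing odd value is ignored.
    numbers = [values[i] * 256 + values[i + 1] for i in range(0, len(values) - 1, 2)]
    first = numbers.index(ETX)
    second = numbers.index(ETX, first + 1)
    text = "".join(map(chr, numbers[:first]))
    filename = "".join(map(chr, numbers[first + 1:second]))
    return text, filename

# Hand-made example: "żó", ETX, "a", ETX, then three padding values.
example = [1, 124, 0, 243, 0, 3, 0, 97, 0, 3, 77, 9, 3]
print(values_to_text(example))
```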

Limitations
The method of replacing text with a png graphic file consists of saving the text characters to the RGB values of individual pixels. Therefore, some file formats, such as those with the doc extension, cannot be directly processed with it. This is because, in addition to the text content itself, such files also include the formatting of individual elements. Among others, the following formats can be directly used with this method: .txt, .tex, .js, .html, .json (also .ipynb), .py, .css. Note that reading files in the doc format is problematic in itself: it requires specialized libraries, e.g., textract [33].
Besides, the conversion of a text file into a graphic file is possible for text files encoded with extended ASCII and UTF-8 characters. In these cases, there are some differences in the way of saving and reading the pixels of the output image file.

Case Study
The proposed method converts any text file encoded using extended ASCII or UTF-8 to the png file format. Thus, different formats are reduced to one. In fact, this effect can be seen as desirable when viewed from the point of view of large-volume data sets.
The following files were selected for processing using the proposed method, as well as for further analysis:
1. data1.txt: a text file containing some English text encoded in extended ASCII
2. data2.txt: a text file containing some Polish text encoded in UTF-8
3. data3.txt: a text file containing $10^6$ random digits
4. data4.json: a text file in the json format containing the code of the .ipynb version of the python.py file
5. python.py: a Python file whose content is the source code published on the GitHub platform
6. latex.tex: the file containing the LaTeX source code of this article
All of the presented algorithms have been implemented in the Python programming language. The source code responsible for the conversion (of a file with text content into a graphic format) has been published on the GitHub platform. The URL to the source code repository can be found in Appendix A. The analysis of the time of changing the text format into an image in the png format was made on a personal computer with:

Compression
The basic measure determining the degree of compression is the Compression Ratio (CR), which is described by the following equation [17]:
$$\text{CR} = \frac{\text{size of the output stream}}{\text{size of the input stream}}.$$
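As a small worked example of this measure, the sketch below computes CR from two sizes and, for convenience, from two files on disk. The function names and the numbers are hypothetical and are not taken from the result tables.

```python
import os

def compression_ratio(input_size: int, output_size: int) -> float:
    """CR = size of the output stream / size of the input stream."""
    if input_size <= 0:
        raise ValueError("input size must be positive")
    return output_size / input_size

def compression_ratio_for_files(input_path: str, output_path: str) -> float:
    """Same measure, computed from file sizes on disk."""
    return compression_ratio(os.path.getsize(input_path), os.path.getsize(output_path))

# Hypothetical sizes: a 1000-byte text file stored as a 250-byte image.
print(compression_ratio(1000, 250))
```

A value below 1 means that the output is smaller than the input; a value above 1 means that the conversion increased the size.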
The smaller the CR value, the less disk space the compressed file takes. The results for selected files with different extensions are provided in Tables 1 and 2. The resulting values show that the corresponding png images have significantly smaller sizes (in the case of extended ASCII files and most test files in UTF-8 coding), which means savings in the necessary disk space. This is due to the way the png format works, i.e., the content is losslessly compressed. As can be noticed, the results obtained are worse than for other compression methods (see Table 3). However, in the case of the data3.txt test file, which consists of digits only, the CR value does not differ as much. This test file is similar in construction to Big Data. Nevertheless, those compression methods do not allow hiding the content of the file, which is discussed in Section 3.5.
Table 3. Results for selected test files using different file compressing methods. The list of columns includes filename, size of the text file, compression method, size of the compressed file and CR.

Application in Data Transfer
This method can be used to reduce the amount of data transfer needed to perform operations. Using the png file format instead of a text file format limits the amount of data that needs to be sent to and downloaded from the server. When storing files on the Internet, one can use various pricing plans that offer a specific amount of storage space. Thanks to the resulting compression, using this method saves the space needed to store a file, which in turn saves money or frees disk space that can be used for subsequent files.

Application in Steganography
The presented method, apart from compression, also allows hiding the contents of a text file. This is because the content of processed files in the png format looks random at first glance, as confirmed by the sample Figures 2 and 3. An outside observer is not able to directly extract the content contained in the image. Of course, if the observer knows that the graphic file was created as described in the article, they may try to read it using the reverse method. It should be noted that the reverse method, i.e., the transition from image to text, is fully reversible. Thus, its effectiveness is 100%. However, correctly determining which method of image hiding or encryption was used (in the case when the file was found to be an encrypted image) does not have to be so simple, as suggested by the large number of algorithms addressing this issue, e.g., [34][35][36][37][38][39]. Table 4 shows the transition times from the graphic file to the text file. Each time, the script was executed 100 times, and the results show the average time and standard deviation. The results show that the transition from image to text is not a time-consuming process.
Compared to files compressed and stored in the zip format, the advantage of hiding the file's content is clear. A computer system can easily open a zip file, which means that one has direct access to the content of the text file. However, to make a change to a compressed file, it must first be decompressed. The proposed method, in contrast, allows reading a specific pixel of the png file and modifying only that particular pixel. As described above, during the last step of the algorithm, the file can be renamed using a hash function. The original filename and its extension are hidden inside the image. Identifying the right file then requires reading and processing many png images until the right one is found. For the owner of the text files (or for the server on which they are stored), it is possible to create an encrypted list with information about the name under which each file is stored (after converting it to the png format). However, if one does not need to protect the data in this way, the step of naming the image with its hash can be skipped.
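The optional renaming step can be sketched with Python's standard hashlib module. This is an illustration under assumptions: SHA3-256 is picked as one member of the SHA3 family, the hexadecimal digest is used as the new name, and the function name and the in-memory index are hypothetical (the text only says that such a list can be kept, in encrypted form, by the owner).

```python
import hashlib

def hashed_png_name(original_name: str) -> str:
    """Name of the output image: SHA3-256 digest of the original filename plus '.png'."""
    digest = hashlib.sha3_256(original_name.encode("utf-8")).hexdigest()
    return digest + ".png"

# The owner's lookup table from stored name to original name (kept unencrypted here).
index = {hashed_png_name(name): name for name in ["data1.txt", "python.py"]}
for stored_name, original_name in index.items():
    print(stored_name, "->", original_name)
```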
Another problem related to data, in particular large data sets, is their processing [40], including sorting [41][42][43], or processing to get random values from the collected data [44]. The proposed method of data storage allows for further processing, even though they are hidden to the human eye.

Application in Cryptography
Images obtained as a result of the method can also be encrypted at the stage of creating the png file. Before the text character is saved to the appropriate pixel, it can be transformed using any encryption algorithm. Without additional encryption of the pixel components, the histograms for the colors red, green, and blue are not flat, but resemble the structure of histograms for the natural language in which the original text file was written. Even the simplest encryption methods that flatten the histograms can give entirely satisfactory protection against attempts to read the content.
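The effect on the histogram can be illustrated with a deliberately simple transformation. The sketch below XORs the values with a keystream derived from a key using SHAKE-256 from the standard hashlib module. This choice is an assumption made for illustration only; it is neither the authors' scheme nor a vetted cipher construction, and the key and function names are hypothetical.

```python
import hashlib
from collections import Counter

def xor_with_keystream(values: bytes, key: bytes) -> bytes:
    """XOR every value with a key-derived keystream; applying it twice restores the input."""
    keystream = hashlib.shake_256(key).digest(len(values))
    return bytes(v ^ k for v, k in zip(values, keystream))

def histogram(values: bytes) -> Counter:
    """Number of occurrences of each value in the 0-255 range."""
    return Counter(values)

plain = ("the quick brown fox jumps over the lazy dog " * 200).encode("ascii")
masked = xor_with_keystream(plain, b"hypothetical key")
print(len(histogram(plain)), len(histogram(masked)))
```

The plain text uses only 27 distinct values (26 letters and the space), while the masked values spread over nearly the whole 0-255 range, which is the flattening described above.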

Conclusions
The article describes a method for storing text data in a file kept on a server or in the cloud, which allows compression and steganography of its content to be performed simultaneously. The described method copes well with most text file formats (including txt, json, tex, html, css). Files stored with the doc extension can be processed without formatting. Its limitation is the coding of the original text: extended ASCII or UTF-8 coding is required. In the first case, the output file is compressed, and the image content looks as if it were randomly generated. In contrast, UTF-8 encoding allows the processing of a much broader range of text files, and in most cases (as demonstrated by the results obtained), it provides compression of the output file, whose content looks randomly generated. The method has been implemented in the Python programming language, and the source code has been placed in a publicly accessible GitHub repository. The method can be used in the area of Big Data, where we deal with a large number of files (usually text files) whose content can be compressed (in the case of extended ASCII encoding) as well as protected against a simple attempt to read it. It should be emphasized that the proposed steganographic method extends the classic approach based on a graphic file as a medium in which bits of plain text are introduced at the bit level of individual RGB components. The article fills a gap in the area of theory and has documented application potential, which will be the subject of further research.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.