# Comparison of Entropy and Dictionary Based Text Compression in English, German, French, Italian, Czech, Hungarian, Finnish, and Croatian

## Abstract

## 1. Introduction

## 2. Methods

#### 2.1. Arithmetic Coding

Algorithm 1: Arithmetic Coding Algorithm

Algorithm 2: Arithmetic Decoding Algorithm
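As a rough illustration of the coding and decoding procedures named in Algorithms 1 and 2, the following is a minimal sketch of static arithmetic coding. It is not the authors' implementation: it assumes a fixed model built from the symbol frequencies of the input, uses exact rational arithmetic (`fractions.Fraction`) instead of the finite-precision integers a practical coder would use, and passes the message length to the decoder instead of an end-of-message symbol. The function names `build_model`, `arithmetic_encode`, and `arithmetic_decode` are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def build_model(text):
    """Map each symbol to its cumulative probability interval [start, end)."""
    counts = Counter(text)
    total = len(text)
    model, start = {}, Fraction(0)
    for symbol in sorted(counts):
        width = Fraction(counts[symbol], total)
        model[symbol] = (start, start + width)
        start += width
    return model


def arithmetic_encode(text, model):
    """Narrow [low, high) once per symbol and return a number inside the final interval."""
    low, high = Fraction(0), Fraction(1)
    for symbol in text:
        span = high - low
        sym_low, sym_high = model[symbol]
        low, high = low + span * sym_low, low + span * sym_high
    # Any value in [low, high) identifies the message; the midpoint is one choice.
    return (low + high) / 2


def arithmetic_decode(code, model, length):
    """Recover `length` symbols by locating `code` in the model and rescaling it."""
    out = []
    for _ in range(length):
        for symbol, (sym_low, sym_high) in model.items():
            if sym_low <= code < sym_high:
                out.append(symbol)
                # Undo the narrowing step so the next symbol can be located.
                code = (code - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)


if __name__ == "__main__":
    message = "abracadabra"
    model = build_model(message)
    code = arithmetic_encode(message, model)
    print(arithmetic_decode(code, model, len(message)) == message)
```

Exact fractions keep the sketch short and free of rounding errors, at the cost of numerators and denominators that grow with the message length; production coders avoid this by renormalizing a fixed-width integer interval and emitting bits incrementally.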

#### 2.2. Lempel–Ziv–Welch Algorithm

Algorithm 3: LZW Coding Algorithm

Algorithm 4: LZW Decoding Algorithm
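To make the dictionary-based approach named in Algorithms 3 and 4 concrete, here is a minimal sketch of textbook LZW coding and decoding. It is an illustration under stated assumptions, not the authors' implementation: the dictionary is initialized with all 256 single-byte values, it grows without bound, codes are returned as a plain list of integers with no bit packing, and text is assumed to be UTF-8 encoded before compression so that accented characters are handled as byte sequences. The names `lzw_encode` and `lzw_decode` are hypothetical.

```python
def lzw_encode(data):
    """Encode a bytes object into a list of integer LZW codes."""
    dictionary = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            # Keep extending the phrase while it is still known.
            current = candidate
        else:
            # Emit the longest known phrase and register the new one.
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decode(codes):
    """Rebuild the original bytes from a list of LZW codes."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # The code refers to the phrase that is being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out.append(entry)
        dictionary[len(dictionary)] = previous + entry[:1]
        previous = entry
    return b"".join(out)


if __name__ == "__main__":
    sample = "Čeština, magyar, suomi, hrvatski".encode("utf-8")
    print(lzw_decode(lzw_encode(sample)) == sample)
```

The decoder rebuilds the same dictionary as the encoder from the code stream alone, so no dictionary needs to be stored with the compressed data. Working on bytes rather than characters is a design choice of this sketch; a character-level dictionary would behave differently for languages with many non-ASCII letters.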

## 3. Results

#### 3.1. Literary Text Compression

#### 3.2. Legal Text Compression

#### 3.3. User Manual Compression

## 4. Discussion

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

**Figure 11.** Compression ratio change compared to English for arithmetic compression (positive percentages signify a larger compressed file size than for English).

**Figure 13.** Compression ratio change compared to English for LZW compression (positive percentages signify a larger compressed file size than for English).

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ignatoski, M.; Lerga, J.; Stanković, L.; Daković, M.
Comparison of Entropy and Dictionary Based Text Compression in English, German, French, Italian, Czech, Hungarian, Finnish, and Croatian. *Mathematics* **2020**, *8*, 1059.
https://doi.org/10.3390/math8071059
