Open Access Article
Information 2018, 9(12), 294

A Compression-Based Toolkit for Modelling and Processing Natural Language Text

School of Computer Science and Electronic Engineering, Bangor University, Dean Street, Bangor, Gwynedd LL57 1UT, UK
Received: 30 July 2018 / Revised: 23 September 2018 / Accepted: 25 September 2018 / Published: 22 November 2018


A novel compression-based toolkit for modelling and processing natural language text is described. The design of the toolkit adopts an encoding perspective: applications are treated as problems of searching for the best encoding among different transformations of the source text into the target text. This paper describes the two-phase ‘noiseless channel model’ architecture that underpins the toolkit, which models text processing as lossless communication down a noise-free channel. The transformation and encoding performed in the first phase must be both lossless and reversible. The role of the second phase, verification and decoding, is to verify that the target text produced by the application has been communicated correctly. This paper argues that this encoding approach has several advantages over the decoding approach of the standard noisy channel model. The concepts abstracted by the toolkit’s design are explained together with details of the library calls. Pseudo-code is also given for a number of algorithms for the applications that the toolkit implements, including encoding, decoding, classification, training (model building), parallel sentence alignment, word segmentation and language segmentation. Some experimental results, implementation details, memory usage and execution speeds are also discussed for these applications.
Keywords: text compression; text processing; encoding; decoding
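
The encoding perspective summarised in the abstract can be illustrated with a minimal sketch. It is not the toolkit's API: the function names `code_length`, `classify` and `encode_and_verify` are hypothetical, and Python's general-purpose `zlib` compressor stands in for the toolkit's own language models. The sketch shows two ideas from the abstract under those assumptions: classification as a search for the model that encodes the text most cheaply, and a second phase that decodes the output to verify that the encoding was lossless.

```python
import zlib
from typing import Mapping


def code_length(data: bytes, model_text: bytes = b"") -> int:
    """Approximate cost in bytes of encoding `data` given `model_text`.

    Assumption: zlib primed with the training text is a crude stand-in
    for a trained compression model; the toolkit uses its own models.
    """
    # Cost of the data given the model = size(model + data) - size(model).
    base = len(zlib.compress(model_text, 9))
    return len(zlib.compress(model_text + data, 9)) - base


def classify(text: str, training: Mapping[str, str]) -> str:
    """Return the label whose training text encodes `text` most cheaply."""
    data = text.encode("utf-8")
    return min(
        training,
        key=lambda label: code_length(data, training[label].encode("utf-8")),
    )


def encode_and_verify(source: str) -> bytes:
    """Phase 1: encode losslessly. Phase 2: decode and verify the result."""
    encoded = zlib.compress(source.encode("utf-8"), 9)
    decoded = zlib.decompress(encoded).decode("utf-8")
    if decoded != source:
        # With a lossless coder this should never happen; the check mirrors
        # the verification role of the second phase.
        raise ValueError("decoded text does not match the source text")
    return encoded


if __name__ == "__main__":
    # Hypothetical toy training data, for illustration only.
    training = {
        "english": "the quick brown fox jumps over the lazy dog " * 3,
        "french": "le renard brun rapide saute par-dessus le chien paresseux " * 3,
    }
    print(classify("the quick brown fox jumps over the lazy dog", training))
    print(len(encode_and_verify("a lossless and reversible encoding")))
```

A general-purpose compressor is only a rough proxy for a trained text model, so the code lengths here are coarse; the point of the sketch is the selection-by-shortest-encoding structure, not the quality of the estimates.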

Figure 1

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

Share & Cite This Article

MDPI and ACS Style

Teahan, W.J. A Compression-Based Toolkit for Modelling and Processing Natural Language Text. Information 2018, 9, 294.


Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Information, EISSN 2078-2489, published by MDPI AG, Basel, Switzerland