1. Introduction
In recent years, artificial intelligence (AI) has rapidly advanced in tandem with the accelerated pace of socioeconomic, scientific, and technological evolution on a global scale. As a synergistic integration of “big data, extensive computational capacity, and advanced algorithms”, large models mark a pivotal phase in the maturation of AI. This development presents unprecedented opportunities and challenges across a broad spectrum of industries. Large models are machine learning models characterized by their extensive parameters and intricate computational configurations. These models are typically constructed using deep neural networks containing billions, or in some cases, hundreds of billions of parameters [1]. Supported by the triad of big data, substantial computational power, and robust algorithms, AI large models undergo extensive pre-training on vast datasets and are subsequently refined through methods such as prompting, instruction fine-tuning, and human feedback to accommodate a variety of downstream tasks [2]. These models exhibit interdisciplinary, multi-contextual, and multipurpose processing capabilities, allowing a single model to be deployed across diverse sectors, including smart cities, biotechnology, smart offices, film production, smart military operations, and intelligent education systems. The advent of transformer models and self-attention mechanisms has significantly advanced machines’ proficiency in processing and comprehending complex texts, culminating in the development of large language models [3]. With the maturation of large models, vertical industrial applications have emerged as a focal point in the international arena of artificial intelligence. The white paper on the global digital economy reports that the number of AI large language models worldwide has reached 1328, of which 36 percent originate from China, the second-largest share after the United States [4]. The number of large models in China with over 1 billion parameters has surpassed 100. These large models are significantly enhancing fields such as finance, entertainment, office automation, electronic information, medical care, manufacturing, education, and transportation, leading to the development of hundreds of application modes [5].
Recently, the field of AI has witnessed the rise of large models [6]. Driven by computing power, large models have emerged as the center of competition in the field of artificial intelligence. Nonetheless, the number of model parameters has grown [7] faster than the memory capacity of accelerators, necessitating large-scale GPU supercomputing clusters for training. As a result, the time and financial costs of training large models have escalated, becoming almost prohibitive [8]. Large models typically require extensive computational resources and memory, presenting both novel prospects and challenges for traditional edge intelligence systems. Traditional cloud computing is inadequate to meet the diverse data processing demands of our modern intelligent society, giving rise to edge computing technology. This computing paradigm positions compute and storage resources closer to the data source, thereby enhancing data processing efficiency and reducing response times. However, because edge devices have limited computing and storage resources, designing and optimizing lightweight models has become an essential research topic. The use of lightweight models in edge computing and particular vertical sectors has attracted increasing interest. Consequently, large model quantization presents new opportunities for integrating AI with edge computing.
Deep learning and computer vision researchers have long focused on model computational intensity, parameter size, and memory usage. Resource limitations can hinder the deployment of large models on edge devices. Small models, by contrast, are faster to train and run inference with, easier to integrate into mobile devices, embedded systems, or low-power environments, and can run effectively on constrained hardware, making them a more practical solution [9]. With the continuous advancement of technologies such as IoT and edge computing, the potential uses of small models will continue to grow [10]. To achieve faster, better, and more economical edge intelligence, there is an immediate need to examine more deeply the distinctions and trade-offs between large and small models.
To understand the research focus and potential of AI-empowered edge computing, we provide a comprehensive overview of this area and explore the potential of model quantization to link AI and edge computing. The contributions of this survey paper can be summarized as follows:
We summarize the fundamental features of model quantization, providing support for AI-empowered edge computing. This investigation underscores the effectiveness of model quantization in improving AI performance at the edge, stressing its importance in boosting computational efficiency and resource utilization.
Our analysis reveals the transformative potential of integrating artificial intelligence with edge computing, showcasing how model quantization bridges the gap between complex AI models and edge devices.
We conduct a comparative and analytical evaluation of model performance before and after quantization to contribute to relevant research studies.
The remainder of this paper is organized as follows: Section 2 briefly introduces the development history of edge computing systems and artificial intelligence; Section 3 shows how model quantization bridges the gap between large AI models and edge computing; Section 4 sets out the main conclusions and suggestions.
2. Development History of Edge Computing Systems and Artificial Intelligence
2.1. Development History of Edge Computing Systems
In the realm of technology, edge computing is a concept that brings together network, computing, storage, and application core functions near the physical location or data source. This arrangement enables the delivery of services closer to the end user [11]. This advancement in edge computing is revolutionizing the network infrastructure. Edge computing implementations start at the edge, enabling organizations to make substantial enhancements in network service response times, crucial for ensuring smooth real-time business operations [12]. The potential of edge computing in the years ahead is immense, and it is frequently hailed as the “final frontier of artificial intelligence”. Industries such as manufacturing, utilities, energy, and transportation have spearheaded the adoption of edge computing, paving the way for its widespread integration.
In the last twenty years, global attention has surged towards exploring edge computing, leading to a notable uptick in published research. Edge computing has become a hot topic in academia. Prior to 2015, the worldwide volume of publications remained notably scant, as edge computing lingered in a stage of rudimentary technological aggregation, constrained by the limits of basic research, conceptual frameworks, and data samples. From 2015 to 2020, the volume of publications surged roughly fivefold. This escalation coincided with the widespread acceptance and rapid advancement of edge computing technology, as well as improvements in algorithms, computing capacity, and data availability. After 2020, the field has witnessed an unprecedented surge in publications, indicative of intensive research and development efforts; at the same time, edge computing has entered a phase of steady and robust development. In terms of research perspective and content, publications on edge computing mainly fall into three major areas: electrical and electronic engineering, computer science and information systems, and telecommunications.
Edge computing has transitioned from conceptualization to technological maturity and stable development, as indicated by the analysis of publication volume and the technology development process. Over this time, edge computing has grown significantly and can be roughly divided into three major stages: the period from 1997 to 2015 marked its technological beginnings; between 2015 and 2020, edge computing grew at a rapid pace; and it has been in a stage of stabilization since 2021.
2.1.1. Technology Incubation Phase (1997–2015)
Although edge computing is currently a hot topic, it was introduced to the market quite early. The idea of the Content Delivery Network (CDN) was initially put forth by Akamai in 1998, marking the beginning of practical attempts at edge computing [13]. Amazon introduced the concept of the “Elastic Compute Cloud” in 2006, followed by Professor Satyanarayanan from CMU, who presented cloudlets in 2009 as an early example of edge computing, showcasing a two-tier architecture that further bolstered the advancement of edge computing [14]. In 2012, Cisco introduced the notion of fog computing, breathing new life into the development of edge computing. After conducting thorough research on edge computing models, LaMothe and others from the Pacific Northwest National Laboratory in the United States officially proposed the term “Edge Computing” in 2013, marking its formal entry into the academic community [15].
With the advent of the internet of things (IoT), there has been a surge in the innovation of edge computing technology, which poses new challenges to the traditional cloud computing approach [16]. With the emergence of cloud computing, big data, the mobile internet, the internet of things, and artificial intelligence, the ever-evolving field of computing is progressing in two distinct directions: one involves centralizing resources, while the other involves pushing them to the edge. The centralization of computing resources through cloud computing and supercomputing brings up several issues, such as idle resources, high power usage effectiveness (PUE), and security and privacy concerns. The edge computing model provides a new approach to address the issue of excessive centralization of resources. On the one hand, the advancement of mobile technology has led to a significant demand for computing, storage, and network resources to be made available to end users; as a result, the development of the mobile internet has played a vital role in the promotion of edge computing. On the other hand, the advancement of big data cannot depend solely on centralized processing methods. Consequently, there is a growing need for edge computing, which can reduce the unnecessary consumption of resources and bandwidth for tasks such as data migration, network transmission, and data backup, achieved by deploying data centers on the edge side or by preprocessing data at edge nodes. In the field of artificial intelligence, edge computing, as a computing model for artificial intelligence, continues to develop with the emergence of a large number of intelligent front-end devices.
2.1.2. Rapid Development Phase (2015–2020)
Since 2015, edge computing technology has experienced a significant evolution, paralleled by substantial advancements within the industry. This period has witnessed a surge in the development and deployment of edge computing solutions across various sectors. In 2015, the European Telecommunications Standards Institute (ETSI) issued a white paper focusing on mobile edge computing (MEC), advocating for the development and enhancement of technical standards in this field to facilitate wider industry adoption. In 2016, ETSI expanded the concept of MEC to multi-access edge computing, which extended edge computing to other wireless access networks (such as WiFi). In the same year, Shi et al. [17] formally proposed the edge computing model. China’s industry and academia began to acknowledge edge computing technology in 2017 with the formation of the China Edge Computing Industry Alliance and the Edge Computing Specialized Committee of the Chinese Society of Automation. According to Technavio, the use of edge computing technology was projected to increase at a rate of almost 20% annually from 2018 to 2022. Grand View Research projected that the Asia Pacific edge computing market would expand at a compound annual growth rate (CAGR) of 46.7% from 2016 to 2025. Moreover, in 2020, Deep Vision set new standards with its low-latency AI processor, which enables real-time, performance-critical edge computing applications. Edge computing has emerged as a critical component in addressing the growing demand for low-latency data processing and real-time analytics. The ever-evolving landscape of edge computing offers exciting opportunities for businesses and individuals alike, driven by technological innovation and the increasing need for decentralized computing capabilities.
2.1.3. Stabilization Phase (2021–Present)
The rapid growth of edge computing can be attributed to advancements in technologies such as the internet of things (IoT), artificial intelligence (AI), and machine learning (ML). In the meantime, there is a growing focus on edge computing in smart city initiatives across the globe. In 2020, despite the unprecedented challenges brought on by the COVID-19 pandemic, the global edge computing market still managed to reach an impressive USD 4 billion. Academic research in the field of edge computing has been gaining momentum, with a focus on exploring key edge computing technologies, intelligent algorithms, operating systems and basic architecture, trust, security, and privacy [18,19,20,21]. In 2021, several organizations, including China Telecom (Beijing, China), China Unicom (Beijing, China), Lenovo (Beijing, China), Baidu Cloud (Beijing, China), Tencent Cloud (Shenzhen, China), ByteDance (Beijing, China), and Intel (Santa Clara, CA, USA), collaborated to publish the “Edge Computing” Report. This not only marks the maturity of edge computing technology but also demonstrates the prospects for its application in multiple industries. At this stage of steady development, its application scenarios continue to expand. As discussed in [22,23,24], edge computing has been studied in a range of fields, including smart construction, 5G networks, traffic management, ocean monitoring, and local content distribution. In particular, the combination of edge computing and artificial intelligence has become a widely discussed topic. Many artificial intelligence technologies, including voice and image recognition, can be supported by edge computing devices. As artificial intelligence technology advances, edge computing devices will play a significant role in supporting various industries, including logistics and supply chain [25]. The mature application of edge computing has also emerged as a pivotal force in propelling green agriculture across various sectors, ushering in an era of interconnected intelligence [26].
The history of edge computing has been shaped by the dynamics of technological advancement, industry cooperation, and consumer demand. By decentralizing computational power and processing data closer to its source, edge computing enables swift decision making and enhances the efficiency of operations in diverse environments. This technology not only reduces latency but also optimizes bandwidth consumption, thereby fostering seamless connectivity and enhancing user experiences. As organizations strive to embrace the opportunities presented by the digital age, edge computing stands as a cornerstone technology, facilitating the transition towards smarter, interconnected systems that redefine the way we interact with technology and the world around us.
2.2. Development History of Artificial Intelligence
Artificial intelligence has been developing for over sixty years since its inception. Its history can be divided into four distinct phases: the first spans from the late 1950s to the early 1980s, the second begins in the early 1980s and ends at the close of the 20th century, the third encompasses the period from the beginning of the twenty-first century to 2020, and the years from 2020 to the present make up the fourth phase [27]. Due to the lack of notable technological advancement, artificial intelligence experienced two waves of decline, with applications unable to meet expectations or support widespread commercialization. After these two peaks and valleys, and marked by the proposal of the deep learning model in 2006, artificial intelligence entered its third phase of high-speed growth. In 2020, the breakthrough of GPT-3 ushered large models into a stage of explosive growth [28]. The evolution of artificial intelligence is marked by changing characteristics over time, making it essential to provide a comprehensive overview and summary. This is crucial for establishing a strong theoretical foundation for future analyses of AI-powered edge computing.
2.2.1. Stage I: Artificial Intelligence Is Emerging and Growing Quickly, but It Is Challenging to Overcome Technological Barriers
AI developed rapidly during this period, with symbolism as the prevailing paradigm. Symbolic methods were integrated with statistical techniques, leading to the emergence of knowledge-based methods, semantic processing, and human–computer interaction, marking the first golden period of AI development from 1956 to 1974. New advancements were achieved in both algorithms and methodologies. Scientists developed a range of influential algorithms, including the Bellman equation, which served as a prototype for reinforcement learning, and built machines with rudimentary intelligence, such as STUDENT (1964), which could solve algebra word problems, and ELIZA (1966), which could engage in simple human–machine dialogue. Given this swift progression, many researchers in the field widely anticipated that artificial intelligence would eventually rival or supplant human intelligence.
However, while the rapid growth of artificial intelligence led researchers to speculate on its potential, progress slowed as the models revealed certain limitations. From 1974 to 1980, the limitations of artificial intelligence became increasingly apparent. Logic provers, perceptrons, and reinforcement learning could only complete specific tasks and were unable to handle more complex ones; the level of intelligence remained relatively low. Two main factors were responsible. Firstly, the mathematical models and tools underpinning artificial intelligence were found to have serious flaws. Secondly, the complexity of many computations grew exponentially, far exceeding the capability of existing algorithms. Owing to these inherent limitations encountered in its initial phase of development, enthusiasm and funding from R&D organizations decreased, and artificial intelligence experienced its first period of stagnation.
2.2.2. Stage II: Breakthroughs in Modeling Pave the Way for Early-Stage Industrialization, Yet Numerous Hurdles Hinder Their Broad Implementation
The 1980s marked significant advancements in mathematical models and the use of expert systems, as well as a renewed public interest in artificial intelligence. Mathematical models for artificial intelligence resulted in a number of significant inventions, including multilayer neural networks and the backpropagation (BP) algorithm. Thanks to model optimization, highly intelligent machines were created that could play chess with humans. In 1980, Carnegie Mellon University developed an expert system for DEC, which helped DEC save about USD 40 million a year, in particular by providing valuable support for decision making. In 1982, several countries, including Japan and the United States, invested heavily in the development of fifth-generation computers, also known as artificial intelligence computers. In 1998, the modern convolutional neural network LeNet-5 was created, marking a shift from shallow machine learning to deep learning. It not only established a foundation for advanced research in natural language generation and computer vision but also played a crucial role in the development of deep learning frameworks and large models.
However, advancements in artificial intelligence were hindered by steep costs and maintenance complexities, causing a downturn in development. Researchers developed the LISP language and dedicated Lisp machines to execute AI programs more efficiently than general-purpose computers, but their high cost and maintenance difficulties hampered widespread adoption. Meanwhile, between 1987 and 1993, Apple and IBM introduced desktop computers with good performance at affordable prices, and personal computers increasingly dominated the consumer market, entering more and more households. Expensive Lisp machines, rendered obsolete and difficult to maintain, steadily vanished from the market. With the emergence of these disruptive alternatives, expert systems lost their popularity, resulting in a considerable contraction of the associated hardware market. Consequently, government funding for artificial intelligence began to decline, leading to a new downturn in the industry.
2.2.3. Stage III: A New Generation of Artificial Intelligence Has Emerged with the Information Age, but There Are Worries about Its Future Advancement
The rapid advancement of technology has propelled artificial intelligence into a new phase of development. Owing to widespread internet usage, the proliferation of sensors, the advent of big data, the growth of e-commerce, and the rise of information communities, data and knowledge now interact and influence each other across physical and information spaces. These factors have led to significant changes in the information environment and data foundations of artificial intelligence. The goals and principles of artificial intelligence have undergone significant transformations, marked by fresh advancements in scientific fundamentals and implementation strategies. Innovations such as brain-inspired computing, deep learning, and reinforcement learning underscore this burgeoning momentum.
The level of artificial intelligence is rapidly increasing, while also posing numerous challenges. Thanks to the rapid growth in data volume, the significant increase in computing power, and the continuous optimization of machine learning algorithms, the new generation of AI has shown the ability to perform as well as or even better than human beings on specific tasks [29]. It has also transitioned from specialized intelligence towards general-purpose intelligence, with the potential to develop into abstract intelligence. As artificial intelligence applications continue to expand, they become increasingly intertwined with human production and daily life. This brings great convenience but also poses potential problems [30]. Firstly, it may accelerate the replacement of human workers by machines, leading to more serious structural unemployment. Secondly, it raises challenges in protecting privacy and in defining data ownership, privacy, and licensing rights.
Large language models are currently experiencing rapid growth at this stage, garnering significant attention from both academia and industry. Leading companies have ventured into this domain, recognizing its potential to revolutionize various sectors. In June 2018, OpenAI (San Francisco, CA, USA) introduced GPT-1, built on the transformer architecture and trained on extensive unlabeled text data. This release popularized pre-trained large models in the field of natural language processing. The new neural network architecture represented by the transformer has established the algorithmic foundation of large models, leading to a significant improvement in the performance of large model technology. In October 2018, Google (Mountain View, CA, USA) unveiled BERT (Bidirectional Encoder Representations from Transformers), which demonstrated significant advancements in natural language processing tasks, further pushing the boundaries of large model development. In February 2019, OpenAI released GPT-2, showcasing remarkable text generation capabilities; however, this also prompted concerns about the potential misuse of the technology to create false or misleading content, initiating a significant discussion about the ethical and safety implications of large language models. These concerns necessitate a closer examination of the infrastructural ambitions of major tech corporations within the field of natural language processing (NLP) [31]. This examination should focus on the role played by these language models in the political economy of artificial intelligence (AI) [32].
2.2.4. Stage IV: Pre-Trained Large Models Represented by GPT Lead the New Trend of Development
The combination of big data, powerful computing, and advanced algorithms has significantly improved the pre-training and generation abilities of these large models, enabling multi-modal and multi-scenario applications. For instance, the success of ChatGPT is attributed to the support of Microsoft Azure’s computational power and extensive data sources such as Wikipedia, along with fine-tuning based on the transformer architecture and reinforcement learning from human feedback (RLHF). In 2020, OpenAI launched GPT-3, which had 175 billion parameters, making it the largest language model at the time; it showed significant performance improvements on zero-shot learning tasks. Subsequently, techniques such as RLHF, code pre-training, and instruction fine-tuning were introduced to further enhance inference and task generalization. When ChatGPT, based on GPT-3.5, was released in November 2022, it garnered attention for its realistic natural language interactions and its capacity to generate content for multiple scenarios. In March 2023, the newly released ultra-large-scale multi-modal pre-trained model GPT-4 added multi-modal understanding and multi-type content generation capabilities.
As the capabilities of large language models expand, they are increasingly being integrated into diverse applications across fields such as natural language processing, content generation, and information retrieval. This surge in interest underscores the profound impact that large language models are poised to have on the way we interact with and leverage language in the digital age, signifying a pivotal moment in the evolution of artificial intelligence.
3. Bridging Edge Computing and Artificial Intelligence: Quantization
The advancement of technology has led to increasingly complex and large AI models, with a significant increase in parameter orders of magnitude. However, these large models require substantial computing power, which poses a challenge. They are often too resource intensive for edge computing devices, which typically have limited computing capabilities. As a result, there is a mismatch between the resources and capabilities of large AI models and edge computing devices. This disparity underscores the urgent need for developing lightweight AI models tailored for edge environments. Model quantization plays a crucial role in bridging the gap between large AI models and edge computing, marking a groundbreaking shift in the evolution of edge AI technology.
Quantization [33] is an approximate algorithmic method. The computation of AI models is usually performed in 32-bit floating point (FP32), whereas quantization replaces floating-point computation with lower-bit computation, such as INT4 and INT8, which improves inference performance and reduces the storage size of the model and its graphics memory usage.
In addition to quantization, other compression methods exist, such as distillation and pruning. However, in comparison, quantization has the following advantages:
Efficient compression capability. Quantization significantly lowers the storage space needed for a model by reducing the precision of its parameters. For instance, quantizing parameters from 32-bit floating-point numbers (FP32) to 8-bit integers (INT8) can reduce the model size by a factor of four (see the sketch following this list).
Low resource consumption. On the one hand, quantization can significantly reduce the computational complexity of a model by reducing the bit width of its parameters. On the other hand, many quantization methods do not require re-training the entire model and can therefore significantly reduce the computational resources required for model compression.
High compatibility. Quantization can be implemented without altering the model architecture, making it compatible with most other compression methods and different hardware platforms, and opening up more possibilities for edge devices to be equipped with AI models.
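As a rough illustration of the compression factors mentioned above, the following minimal Python sketch estimates the storage footprint of a hypothetical 7-billion-parameter model at different bit widths; the parameter count and data types are illustrative assumptions, not a specific model discussed in this survey.

```python
# Back-of-the-envelope storage estimate for a hypothetical 7-billion-parameter model.
# The 7B figure is an illustrative assumption, not a specific model from this survey.

PARAMS = 7_000_000_000
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for dtype, nbytes in BYTES_PER_PARAM.items():
    size_gb = PARAMS * nbytes / 1e9  # decimal gigabytes
    print(f"{dtype}: {size_gb:.1f} GB")
# FP32 -> 28.0 GB, FP16 -> 14.0 GB, INT8 -> 7.0 GB, INT4 -> 3.5 GB
# i.e., 4x smaller than FP32 at INT8 and 8x smaller at INT4.
```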
In order to better understand quantization methods, we explore and present them in detail, covering basic concepts and the main methods used for language models, and we analyze in depth the specific impact of these methods on model performance. We first introduce some basic concepts of quantization in Section 3.1. Then, in Section 3.2, we briefly introduce some strong baseline quantization methods for large language models (LLMs). Section 3.3 introduces the quantized models, evaluation metrics, method parameters, and experimental results of the aforementioned strong baseline quantization methods, and provides an analysis of these methods.
3.1. Basic Concepts
Quantization establishes an efficient data mapping relationship between fixed-point and floating-point data, resulting in significant benefits with minimal precision loss.
Uniform quantization. In uniform quantization, the mapping relationship of quantization can be expressed as follows:

$$x_q = \mathrm{round}\left(\frac{x}{S}\right) + Z,$$

where $x$ is the input floating-point data, $x_q$ is the fixed-point data after quantization, $Z$ is the value of the zero point, and $S$ is the value of the scale. This method is called uniform quantization since the length of each interval is the same (equal to the scale $S$) and the quantized values $x_q$ are uniformly spaced (e.g., the integers 0, 1, 2, …).

The operation to recover the real value $x$ from the quantized value $x_q$ is called dequantization:

$$\hat{x} = S\,(x_q - Z).$$

It is worth noting that, due to the loss of information in the quantization process, the dequantized values have some error compared to the real values.
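To make the mapping above concrete, here is a minimal NumPy sketch of asymmetric 8-bit uniform quantization, with the scale and zero point derived from the min/max range of the input; it is an illustrative implementation of the formulas above, not code from any surveyed work.

```python
import numpy as np

def quantize_uniform(x, num_bits=8):
    """Asymmetric uniform quantization of a float tensor to unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)       # S: length of each interval
    zero_point = int(round(qmin - x.min() / scale))   # Z: integer that maps to real 0
    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return x_q, scale, zero_point

def dequantize_uniform(x_q, scale, zero_point):
    """Recover an approximation of the original values (with quantization error)."""
    return scale * (x_q.astype(np.float32) - zero_point)

x = np.random.randn(5).astype(np.float32)
x_q, S, Z = quantize_uniform(x)
x_hat = dequantize_uniform(x_q, S, Z)
print(np.abs(x - x_hat).max())   # small but non-zero reconstruction error
```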
Non-uniform quantization. In contrast to uniform quantization, non-uniform quantization provides greater flexibility in the spacing and intervals between quantized values. This can be particularly useful in situations where a more precise representation of data is needed.
The basic idea is to allocate quantization levels according to the probability distribution or importance of the signal, allowing finer quantization intervals for important or frequently occurring values and coarser intervals for less important or less frequent values.
Non-uniform quantization can achieve higher accuracy and lower quantization error in AI models with non-uniformly distributed weights, but may not be computationally efficient due to time-consuming lookup operations. Optimizing methods and hardware such as GPUs can help mitigate this issue.
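As a generic illustration (not a method from any specific paper surveyed here), the sketch below builds a small codebook with a k-means-style procedure, so that densely populated regions of the weight distribution receive more quantization levels; dequantization then becomes a table lookup, which is the source of the runtime cost mentioned above. The level count and toy weights are assumptions for illustration.

```python
import numpy as np

def nonuniform_quantize(w, num_levels=16, iters=20):
    """Codebook (k-means style) quantization: levels follow the weight distribution,
    so frequent value ranges get finer spacing than rare ones."""
    # Initialize the codebook from quantiles of the empirical distribution.
    codebook = np.quantile(w, np.linspace(0, 1, num_levels))
    for _ in range(iters):
        idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)  # nearest level
        for k in range(num_levels):
            if np.any(idx == k):
                codebook[k] = w[idx == k].mean()                     # recenter level
    return idx.astype(np.uint8), codebook  # store small indices + a lookup table

w = np.random.randn(10_000)              # toy weights with a non-uniform distribution
idx, codebook = nonuniform_quantize(w)
w_hat = codebook[idx]                    # dequantization is a table lookup
```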
Weight-only quantization. It involves quantizing only the weights in a neural network, while the activation outputs remain at their original precision. Since this quantization method only applies to the weights, the degree of compression achieved is not very substantial.
Weight + activation quantization. It involves quantizing both the weights and activation values in a neural network. Since the range of activation layers is usually not easy to obtain in advance, it needs to be calculated during network inference or approximately predicted based on the model.
Post-training quantization (PTQ). This method is relatively simple and efficient: it only requires a trained model and a small amount of calibration data, and there is no need to retrain the model. Depending on whether activations are quantized, it is divided into the following:
Dynamic quantization. Only the weights are quantized in advance (hence it is also called weight-only quantization), while activations are quantized on the fly during inference. This is suitable for LSTMs, MLPs, transformers, and other models with a large number of weight parameters.
Static quantization. This method quantizes both weights and activations in advance. In order to quantize activations, it is necessary to run inference on representative data and record the value range of each activation; this step is also called “calibration” (illustrated by the sketch below).
Quantization aware training (QAT). This method first adds fake quants to the model, simulates the quantization process, and then retrains the model, which can usually achieve higher accuracy.
Comparing PTQ and QAT, the main difference is whether to retrain. PTQ only requires a small amount of calibration data and the process is simple, while QAT requires inserting fake quants for retraining/fine-tuning. The process is more complex, but the accuracy is usually higher.
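The following Python sketch illustrates the “calibration” step of static PTQ described above: an observer records the min/max range of an activation over a few representative batches and then derives a fixed scale and zero point for inference. The class name and the random stand-in data are our own illustrative constructs, not part of any framework discussed here.

```python
import numpy as np

class ActivationObserver:
    """Collects the running min/max of an activation over calibration batches,
    then derives a static (scale, zero_point) pair for INT8 inference."""
    def __init__(self):
        self.min_val, self.max_val = np.inf, -np.inf

    def observe(self, x):
        self.min_val = min(self.min_val, float(x.min()))
        self.max_val = max(self.max_val, float(x.max()))

    def quant_params(self, num_bits=8):
        qmin, qmax = 0, 2 ** num_bits - 1
        scale = (self.max_val - self.min_val) / (qmax - qmin)
        zero_point = int(round(qmin - self.min_val / scale))
        return scale, zero_point

# "Calibration": run a few representative batches through the float model and
# record the range of each activation; here random data stands in for real batches.
observer = ActivationObserver()
for _ in range(8):
    activations = np.random.randn(32, 128).astype(np.float32)  # stand-in activations
    observer.observe(activations)
scale, zero_point = observer.quant_params()
```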
3.2. Quantization Methods for Large Language Models
(1)
AWQ [34] shows that it is not necessary to preserve all the weights of a neural network in a high-precision format. Instead, preserving only the top 0.1% of channels that correspond to significant activations in FP16 and quantizing the rest of the weight matrix can lead to better model performance. It means that not all weights are equally important. Retaining 0.1% of the weights as FP16 can improve quantization performance without significantly increasing the model size in terms of the total number of bits. However, using mixed-precision data types like FP16 can make system implementation more complex.
The authors instead propose scaling the salient weight channels before quantization. For a weight element $w$ with a scaling factor $s > 1$, the computation becomes

$$Q(w \cdot s)\cdot \frac{x}{s}, \quad \text{where } Q(w) = \Delta \cdot \mathrm{round}\!\left(\frac{w}{\Delta}\right),$$

which means that finding a suitable scaling factor $s$ and multiplying it by the weights reduces their relative quantization error.

To find the optimal scaling factors, the authors devised a method to automatically search for them so as to minimize the quantization error over the full weight matrix. This translates to the following objective:

$$\mathbf{s}^{*} = \arg\min_{\mathbf{s}} \mathcal{L}(\mathbf{s}), \qquad \mathcal{L}(\mathbf{s}) = \left\lVert Q\big(\mathbf{W}\cdot \mathrm{diag}(\mathbf{s})\big)\big(\mathrm{diag}(\mathbf{s})^{-1}\cdot \mathbf{X}\big) - \mathbf{W}\mathbf{X} \right\rVert.$$
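The sketch below illustrates the kind of search the objective above implies, in a heavily simplified form: per-input-channel scales are derived from the average activation magnitude raised to a power alpha, and alpha is grid-searched to minimize the output reconstruction error. The per-tensor round-to-nearest quantizer, grid size, and tensor shapes are illustrative assumptions, not the exact group-wise procedure used by AWQ.

```python
import numpy as np

def quantize_dequantize(w, num_bits=4):
    """Symmetric per-tensor round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (num_bits - 1) - 1
    delta = np.abs(w).max() / qmax
    return np.clip(np.round(w / delta), -qmax - 1, qmax) * delta

def search_awq_scale(W, X, grid=20, num_bits=4):
    """Grid-search a per-channel scaling exponent alpha, in the spirit of AWQ:
    s = mean(|X|, axis=0) ** alpha; the product (W*s)(X/s) is unchanged in full
    precision, but the error introduced by quantizing W is reduced."""
    act_scale = np.abs(X).mean(axis=0) + 1e-8        # per-input-channel activation magnitude
    best_err, best_s = np.inf, np.ones(W.shape[1])
    for alpha in np.linspace(0, 1, grid):
        s = act_scale ** alpha
        W_q = quantize_dequantize(W * s, num_bits) / s   # scale, quantize, undo the scale
        err = np.linalg.norm(X @ W_q.T - X @ W.T)        # output-level quantization error
        if err < best_err:
            best_err, best_s = err, s
    return best_s, best_err

W = np.random.randn(64, 128)       # toy weight matrix (out_features x in_features)
X = np.random.randn(256, 128)      # toy calibration activations
s, err = search_awq_scale(W, X)
```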
(2)
OPTQ/GPTQ [35]. GPTQ, also known as OPTQ, is a highly efficient method for weight-only LLM quantization that can be completed in a single pass. It draws from optimal brain quantization (OBQ), which utilizes an adaptive rounding strategy that quantizes individual weights sequentially within each row of the matrix, adjusting the remaining weights to minimize quantization errors. Although effective, OBQ may not be ideally suited for LLMs and can sometimes be slow and imprecise. To overcome these drawbacks, GPTQ modifies the quantization technique by simultaneously quantizing all rows of weights, enhancing both speed and efficiency. It also employs lazy batch updates to optimize the compute-to-memory ratio and ensure a smoother transition during quantization. Additionally, GPTQ integrates a Cholesky reformulation to increase process stability. These improvements enable GPTQ to quantize the weights of large models like OPT-175B or BLOOM-176B in approximately four hours using a single NVIDIA device.
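To make the “quantize a column, then compensate the remaining weights” idea concrete, here is a heavily simplified sketch of that inner loop. It omits GPTQ’s lazy batch updates, Cholesky reformulation, and group-wise scales; the single shared scale, damping value, and toy shapes are our own illustrative assumptions.

```python
import numpy as np

def gptq_like_quantize(W, X, num_bits=4, damp=0.01):
    """Greatly simplified sketch of GPTQ-style column-by-column quantization:
    each column is rounded, and the resulting error is propagated to the
    not-yet-quantized columns using the inverse Hessian of the layer
    reconstruction loss."""
    W = W.copy().astype(np.float64)
    d = W.shape[1]
    H = X.T @ X                                   # (proportional to) the reconstruction Hessian
    H += damp * np.mean(np.diag(H)) * np.eye(d)   # damping for numerical stability
    Hinv = np.linalg.inv(H)
    qmax = 2 ** (num_bits - 1) - 1
    delta = np.abs(W).max() / qmax                # one shared scale, for simplicity
    Q = np.zeros_like(W)
    for j in range(d):
        q = np.clip(np.round(W[:, j] / delta), -qmax - 1, qmax) * delta
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        # Compensate: adjust remaining columns to absorb the quantization error.
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q

W = np.random.randn(16, 32)    # toy weights (all rows are handled in parallel)
X = np.random.randn(128, 32)   # calibration inputs
Q = gptq_like_quantize(W, X)
```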
(3)
LLM.int8() [36]. It was found that there are outliers in the activations whose absolute values are significantly larger, and that these outliers are concentrated in a small number of feature dimensions, called emergent features. In the matrix multiplication of activations $\mathbf{X}$ and weights $\mathbf{W}$, the feature dimension is the hidden dimension $h$. Both per-token quantization (for the activation $\mathbf{X}$: one quantization factor per row) and per-channel quantization (for the weight $\mathbf{W}$: one quantization factor per column) are heavily influenced by these outliers. Since only a small number of features contain outliers, the idea of LLM.int8() is to perform a matrix decomposition: the vast majority of the weights and activations are quantized to 8-bit integers (vector-wise), while the few outlier feature dimensions are kept in 16-bit floating point (FP16), and high-precision matrix multiplication is carried out on them.
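A minimal NumPy sketch of this decomposition is given below: feature columns whose magnitude exceeds a threshold are routed through an FP16 matrix multiplication, while the remaining dimensions go through vector-wise INT8 quantization (per-row scales for the activations, per-column scales for the weights). The threshold value and tensor shapes are illustrative assumptions.

```python
import numpy as np

def int8_matmul_with_outliers(X, W, threshold=6.0):
    """Sketch of an LLM.int8()-style decomposition: columns of X whose magnitude
    exceeds a threshold ("emergent" outlier features) are multiplied in FP16;
    the remaining dimensions use 8-bit vector-wise quantization."""
    outlier_cols = np.where(np.abs(X).max(axis=0) > threshold)[0]
    regular_cols = np.setdiff1d(np.arange(X.shape[1]), outlier_cols)

    # High-precision path for the few outlier feature dimensions.
    out_fp16 = X[:, outlier_cols].astype(np.float16) @ W[outlier_cols, :].astype(np.float16)

    # INT8 path: per-row scale for X (per token), per-column scale for W (per channel).
    Xr, Wr = X[:, regular_cols], W[regular_cols, :]
    sx = np.abs(Xr).max(axis=1, keepdims=True) / 127 + 1e-8
    sw = np.abs(Wr).max(axis=0, keepdims=True) / 127 + 1e-8
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)
    out_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)  # dequantize result

    return out_fp16.astype(np.float32) + out_int8.astype(np.float32)

X = np.random.randn(4, 64).astype(np.float32)   # activations: (tokens, hidden dim h)
W = np.random.randn(64, 32).astype(np.float32)  # weights:     (hidden dim h, output dim)
Y = int8_matmul_with_outliers(X, W)
```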
(4)
LLM-QAT [37]. LLM-QAT directly applies the QAT framework to LLMs and addresses the issue of limited data availability by employing a data-free distillation method. This technique generates synthetic data from the original LLM and uses it to train the quantized LLM to replicate the original’s output distribution. Additionally, LLM-QAT facilitates the quantization and QAT of key-value caches, notably reducing memory usage during the generation of extended texts.
(5)
INT2.1 [38]. INT2.1 employed GPTQ for quantizing LLMs into INT2 and observed significant deviations in behavior compared to the original full-precision model. To address this, INT2.1 integrates trainable LoRA matrices into the model, which make up only 5% of the total parameters. The training approach merges a scaled version of the Kullback–Leibler divergence, comparing the full-precision and quantized models, with cross-entropy loss to maintain accuracy in next-token predictions. Their findings indicate that an INT2 large language model fine-tuned with LoRA is capable of generating coherent English text and following specific instructions effectively.
(6)
QLoRA [39]. QLoRA implements a weight quantization process that converts LLMs into the 4-bit NormalFloat (NF4) format, and subsequently utilizes LoRA adapters kept in 16-bit BrainFloat (BF16) for refining the quantized model on specific downstream tasks using a cross-entropy loss. Additionally, it introduces a technique termed double quantization, which further compresses the model by quantizing its quantization parameters, although at a cost to computation speed. Through these methods, QLoRA effectively facilitates the fine-tuning of a 65B LLM on GPUs that have only 30 GB of memory.
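The sketch below shows the general pattern shared by INT2.1 and QLoRA: a frozen, low-bit base weight combined with small, higher-precision trainable LoRA factors. It uses a plain 4-bit uniform quantizer instead of NF4 and omits all training logic; the class name, rank, and shapes are illustrative assumptions rather than the papers' actual implementations.

```python
import numpy as np

class QuantizedLinearWithLoRA:
    """Sketch of the QLoRA idea: a frozen low-bit base weight plus small trainable
    LoRA factors A and B kept in higher precision. Uses a plain 4-bit uniform
    quantizer instead of the NF4 data type from the paper, purely for illustration."""
    def __init__(self, W, rank=8, num_bits=4):
        qmax = 2 ** (num_bits - 1) - 1
        self.scale = np.abs(W).max() / qmax
        self.W_q = np.clip(np.round(W / self.scale), -qmax - 1, qmax).astype(np.int8)  # frozen
        out_dim, in_dim = W.shape
        self.A = np.random.randn(rank, in_dim).astype(np.float32) * 0.01  # trainable
        self.B = np.zeros((out_dim, rank), dtype=np.float32)              # trainable, init 0

    def forward(self, x):
        W_deq = self.W_q.astype(np.float32) * self.scale      # dequantize on the fly
        return x @ W_deq.T + x @ self.A.T @ self.B.T          # base output + low-rank update

layer = QuantizedLinearWithLoRA(np.random.randn(32, 64).astype(np.float32))
y = layer.forward(np.random.randn(4, 64).astype(np.float32))
```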
3.3. Experiment
3.3.1. Datasets
WikiText-2 [40]. The WikiText language modeling dataset contains more than 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. It is licensed under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of the Penn Treebank, WikiText-2 is more than twice as large, and WikiText-103 is over 110 times larger. Notably, the WikiText dataset retains the original case, punctuation, and numbers, and has a much larger vocabulary. This dataset is especially beneficial for models capable of leveraging long-term dependencies, given its inclusion of complete articles.
C4 [41]. The C4 (Colossal Clean Crawled Corpus) is a massive, cleaned web text dataset designed for training natural language processing (NLP) models. It comprises roughly 750 GB of text data extracted from millions of web pages. C4 is meticulously curated by removing duplicate content, filtering out low-quality and irrelevant text, and performing a series of cleaning steps to enhance data quality, making it an ideal resource for training large-scale language models such as T5. Developed by researchers at Google, the dataset aims to advance research and applications in natural language understanding.
3.3.2. Models
OPT [42]. The OPT (open pre-trained transformer) is an open-source, pre-trained large language model released by Meta (Menlo Park, CA, USA) AI (formerly Facebook AI). OPT aims to provide a model with performance and functionalities similar to OpenAI’s GPT-3, while facilitating broader access for the academic and research communities. Based on the transformer architecture, OPT is designed for understanding and generating natural language. It has been trained on a massive dataset covering a wide range of topics and knowledge domains, enabling it to perform various language tasks such as text generation, summarization, translation, and question answering. Notable for its openness and scalability, OPT offers multiple versions with parameters ranging from hundreds of millions to hundreds of billions, catering to researchers with different computational capabilities and application needs.
LLaMA [43]. LLaMA (Large Language Model Meta AI) is a large-scale language model family developed by Meta AI, designed to provide efficient and scalable natural language processing capabilities. Based on the transformer architecture, it spans from several billion to tens of billions of parameters, enabling it to understand and generate complex textual data. LLaMA is trained on a diverse corpus, including books, articles, and web content, ensuring a broad knowledge base and linguistic comprehension. Aimed at advancing research and development in the natural language processing field, it supports multiple languages and emphasizes reducing computational resource needs while maintaining model performance, thereby gaining widespread attention and application in both academic and industrial spheres.
3.3.3. Results
Table 1 presents the perplexity results of several robust baseline quantization techniques for LLMs on the WikiText-2 and C4 datasets. These results demonstrate the effectiveness of these baseline quantization methods in reducing model size and improving inference speed, while preserving the model’s capabilities, as evidenced by the consistent perplexity levels.
4. Discussion
With the development of artificial intelligence, we have seen many impressive applications such as autonomous driving, voice assistants, and image recognition. What these applications and their AI models have in common is that they all rely on large amounts of computational resources and data.
To address these challenges, we require a new computing architecture that can perform computation on edge devices while remaining low-power and efficient. This is where edge computing comes in. The core idea of edge computing is to push a large number of computational tasks to edge devices such as smartphones and smart home devices, thus reducing the burden on central servers and increasing computational efficiency.
In this paper, we discuss the possibilities of model quantization in connecting AI and edge computing to enable distributed and low-power AI systems. Specifically, by introducing the working principles of quantization methods for large language models and presenting experimental results, we show that quantization can effectively reduce model size and accelerate inference while maintaining model functionality, thus opening up the possibility of deploying such models on edge devices. This may also be an important research direction for future edge AI. At the same time, the convergence of large models with edge devices increases the workloads on the edge side, presenting additional challenges for edge computing chips that already face limitations in power consumption and package size. Further exploration is needed to better integrate the development of large models and edge computing.