Q3'23: Technology update – low precision and model optimization



Alexander Kozlov, Nikita Savelyev, Nikolay Lyalyushkin, Vui Seng Chua, Pablo Munoz, Alexander Suslov, Andrey Anufriev, Liubov Talamanova, Yury Gorbachev, Nilesh Jain, Maxim Proshin


This quarter we continue to observe a growing trend in Large Language Model (LLM) optimization, which is mostly about compressing the model weights while preserving accuracy. Interestingly, 4-bit integer and floating-point weight compression methods have been quickly adopted by the industry and by the Hugging Face Transformers library via the AutoGPTQ (INT4, INT3, and INT2 types) and bitsandbytes (FP4/NF4 types) integrations. We now see some confusion on the customers’ side about which method to use and when; as usual, this will be resolved by the industry: the most widely adopted methods will survive.

Papers with notable results


  • ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats by Microsoft (https://arxiv.org/pdf/2307.09782.pdf). The paper explores the potential of FP8 activation and FP4 weight quantization, and the impact of Low Rank Compensation (LoRC). Authors show that LoRC significantly reduces quantization errors in the W4A8 scheme for FP quantization, especially in smaller models, thereby enhancing performance. To make the conversion from FP4 to FP8 more efficient for W4A8 models, they propose several ways of restricting all scaling factors to be a power of 2 and show that these restrictions have a negligible effect on the model’s performance.
  • QuIP: 2-Bit Quantization of Large Language Models With Guarantees by Cornell University (https://arxiv.org/pdf/2307.13304.pdf). Authors propose a method based on the hypothesis that quantization benefits from incoherent weight and Hessian matrices, i.e., from the weights and the directions in which it is important to round them accurately being unaligned with the coordinate axes. The method consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. Authors apply the method on top of OPTQ and show that it improves the baseline. The code is available at https://github.com/jerry-chee/QuIP.
  • NUPES: Non-Uniform Post-Training Quantization via Power Exponent Search by Datakalab (https://arxiv.org/pdf/2308.05600.pdf). Authors propose using non-uniform quantization instead of the commonly adopted approaches to DNN quantization, e.g. GPTQ. The method builds on the PowerQuant approach, where the quantization function is defined via a power function with an exponent in the interval (0, 1). This allows a better fit to the weight distribution of LLMs and reduces the quantization error. Authors also enable the optimization of the power exponent, i.e. the optimization of the quantization operator itself during training, by alleviating the numerical instabilities involved. The resulting predictive function is compatible with integer-only low-bit inference. The method achieves good results in W4/A16 quantization of LLMs.
  • Gradient-Based Post-Training Quantization: Challenging the Status Quo by Sorbonne University and Datakalab (https://arxiv.org/pdf/2308.07662.pdf).In this work, authors analyze common choices in GPTQ methods. They show that the process is robust to weight selection, feature augmentation, and choice of calibration set. They also derive a number of best practices for designing more efficient and scalable GPTQ methods, regarding the problem formulation (loss, degrees of freedom, use of non-uniform quantization schemes) or optimization process (choice of variable and optimizer). Finally, they propose an importance-based mixed-precision technique. Those guidelines lead to performance improvements on all the tested state-of-the-art GPTQ methods and models.
  • Pruning vs Quantization: Which is Better? by Qualcomm AI Research (https://arxiv.org/pdf/2307.02973.pdf). The authors provide a comparison between the two techniques for compressing deep neural networks. They give an analytical comparison of expected quantization and pruning error for general data distributions. Then, they provide lower bounds for the per-layer pruning and quantization error in trained networks and compare these to empirical error after optimization. Finally, they provide an experimental comparison for training 8 large-scale models on 3 tasks. The results show that in most cases quantization outperforms pruning.
  • FPTQ: Fine-grained Post-Training Quantization for Large Language Models by Meituan and Nanjing University (https://arxiv.org/pdf/2308.15987.pdf). The paper proposes a W4A8 post-training quantization method for LLMs. To recover the accuracy lost after quantization, the authors use layer-wise activation quantization strategies, which feature a logarithmic equalization for the most intractable layers, combined with fine-grained weight quantization. They eliminate the need for further fine-tuning and obtain state-of-the-art W4A8 quantized performance for BLOOM, LLaMA, and LLaMA-2 on the MMLU and Common Sense benchmarks.
  • Low-bit Quantization for Deep Graph Neural Networks with Smoothness-aware Message Propagation by University of Warwick and TOBB University of Economics and Technology (https://arxiv.org/pdf/2308.14949.pdf). The paper presents a solution that aims to quantize GNNs while avoiding the oversmoothing problem in deep GNNs. The authors introduce an approach covering all stages of GNNs, from message passing in training to node classification, compressing the model and enabling efficient processing. The proposed GNN quantizer learns quantization ranges and reduces the model size under low-bit quantization. To scale with the number of layers, authors devise a message propagation mechanism in training that controls layer-wise changes of similarities between neighboring nodes. This objective is incorporated into a Lagrangian function with constraints, and a differential multiplier method is used to iteratively find optimal embeddings. The proposed quantizer demonstrates superior performance in INT2 configurations across all stages of GNN, achieving a notable level of accuracy. Finally, inference with INT2 and INT4 representations exhibits a speedup of 5.11× and 4.70×, respectively, compared to full-precision counterparts.
  • OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models by OpenGVLab, The University of Hong Kong, and The Chinese University of Hong Kong (https://arxiv.org/pdf/2308.13137.pdf). The paper introduces a method that freezes the original full-precision weights while incorporating a restrained set of learnable parameters. The method brings gradient updates to quantization while preserving the time and data efficiency of PTQ methods. It consists of Learnable Weight Clipping and Learnable Equivalent Transformation, the latter being a more generic version of the popular SmoothQuant method. These strategies make full-precision weights and activations more amenable to quantization. Experiments demonstrate that the method outperforms previous methods across a spectrum of quantization settings at an affordable optimization time. The code is available at https://github.com/OpenGVLab/OmniQuant.
  • Softmax Bias Correction for Quantized Generative Models by Qualcomm AI Research (https://arxiv.org/pdf/2309.01729.pdf). The softmax activation at the output of the attention function is often kept in floating-point precision, especially in post-training quantization, because quantizing it degrades accuracy. This study shows that quantized softmax is biased: the quantized probabilities do not sum up to 1, a consequence of rounding tiny probabilities. The authors formulate a softmax bias correction that can be estimated empirically and offline, and that adds zero overhead because the correction term is fused into the zero-point offset of the asymmetric quantization function. Ablation experiments demonstrate improved QSNR for Stable Diffusion (SD) and improved perplexity for OPT-125M. Images generated by SD quantized with softmax bias correction retain visual structures similar to the original generation.
  • Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers by The University of Texas at Austin and ARM (https://arxiv.org/pdf/2308.10814.pdf). The work is based on the finding that small perturbations in quantization scale can lead to significant improvement in the quantization accuracy of Vision Transformer (ViT) models. Authors claim that quantized ViTs have an extremely non-smooth loss landscape, making stochastic gradient descent a poor choice for optimization. That is why they propose an evolutionary search to favor nearby local minima. They also propose to use contrastive losses (instead of MSE, KLD, etc.) that smooth the loss landscape. The experiments show that the method works well in various quantization setups for Transformer and CNN models. The code is available at: https://github.com/enyac-group/evol-q.
  • Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs by Intel (https://arxiv.org/pdf/2309.05516.pdf). The authors propose a weight compression method that involves lightweight block-wise tuning using signed gradient descent. Essentially, an additive term is introduced for the quantized weights to control the rounding direction, and the MSE loss between the quantized and original layer outputs is optimized with respect to this term. The method achieves superior results over the GPTQ and RTN (round-to-nearest) baselines in many setups for 4-bit weight compression. One possible drawback is the small group size, which can lead to non-optimal performance improvement and footprint reduction.
  • Understanding the Impact of Post-Training Quantization on Large Language Models by Freshworks Inc (https://arxiv.org/pdf/2309.05210.pdf). The paper analyzes the feasibility of the FP4 and NF4 precisions for LLM compression and how they compare with other compression modes, e.g. INT8 and double quantization.
  • Norm Tweaking: High-performance Low-bit Quantization of Large Language Models by Meituan (https://arxiv.org/pdf/2309.02784.pdf). Authors show that LLMs are robust against weight distortion: a slight partial weight adjustment can recover accuracy even in the extreme low-bit regime. They propose an LLM tweaking strategy composed of (1) adjusting only the parameters of LayerNorm layers while freezing other weights; (2) constrained data generation inspired by LLM-QAT to obtain the required calibration data. Experiments show significant accuracy improvements when applying this method on top of other well-known methods such as GPTQ.
  • Gradient-Based Post-Training Quantization: Challenging the Status Quo by Sorbonne University and Datakalab (https://arxiv.org/pdf/2308.07662.pdf). The paper provides a thorough analysis of GPTQ and shows why it works in various settings, such as weight selection, feature augmentation, and choice of calibration set. The paper also reveals best practices for designing more efficient and scalable GPTQ methods, regarding the problem formulation (loss, degrees of freedom, use of non-uniform quantization schemes) or the optimization process (choice of variable and optimizer).
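
The power-of-2 restriction on scaling factors mentioned in the ZeroQuant-FP entry above can be illustrated with a minimal sketch. It is not the paper's procedure: for simplicity it assumes a symmetric INT4 grid instead of FP4, a single weight group, and one particular way of restricting the scale (rounding it up to the next power of 2 so that no weight is clipped). The function names and the sample weights are hypothetical.

```python
import math

def pow2_scale(weights, qmax=7):
    """Smallest power-of-two scale that maps every weight into [-qmax, qmax]."""
    free_scale = max(abs(w) for w in weights) / qmax
    if free_scale == 0.0:
        return 1.0  # all-zero group: any scale works
    # Rounding the exponent up keeps all weights inside the grid (no clipping)
    return 2.0 ** math.ceil(math.log2(free_scale))

def quantize(weights, scale, qmax=7):
    """Symmetric round-to-nearest onto the integer grid [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(w / scale))) for w in weights]

def dequantize(codes, scale):
    """Map integer codes back to real values."""
    return [q * scale for q in codes]

# Hypothetical weight group, for illustration only
weights = [0.31, -0.12, 0.05, -0.44, 0.27, 0.02, -0.19, 0.38]
scale = pow2_scale(weights)
codes = quantize(weights, scale)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(scale, codes, max_error)
```

With a power-of-2 scale, multiplying by the scale only changes the floating-point exponent, which is what makes the format conversion cheap. The price in this sketch is a quantization step up to twice as coarse as with an unrestricted scale.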


  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time by Rice University et al. (https://proceedings.mlr.press/v202/liu23am/liu23am.pdf). This insightful ICML’23 oral paper shows that contextual sparsity exists in LLMs: only a small subset of attention heads and MLP parameters is needed to maintain language modeling and in-context learning ability. The contextual sparsity is verified to vary dynamically with the input context and can reach as high as 85% on average in OPT-175B. The authors offer an understanding of contextual sparsity by linking successive self-attention to mean-shift clustering. Empirical evidence shows that token embeddings exhibit high similarity between adjacent layers and shift gradually across layers, with the formulation of residual connections being a significant contributor to sparsity. Exploiting these insights, the authors propose DEJAVU, an accelerated LLM inference solution that employs NN predictors to dynamically prune attention heads and MLP parameters. To avoid sequential execution and the resulting overhead, the sparsity predictors are designed to look ahead and are branched out to execute in parallel with the main network. An optimized implementation of DEJAVU matches the quality of OPT-175B while accelerating inference by 2× over the SOTA FasterTransformer and 6× over the Hugging Face serving solution on 8×A100s. The code is available at https://github.com/FMInference/DejaVu.
  • A Simple and Effective Pruning Approach for LLMs by CMU, Meta AI and Bosch AI (https://arxiv.org/pdf/2306.11695.pdf). Because large-magnitude features emerge in LLMs beyond a certain scale, this paper proposes to include input activations in the weight importance evaluation while keeping pruning almost as simple as magnitude pruning. The authors introduce the metric S_ij = |W_ij| · ‖X_j‖_2, where each weight is scored by the product of its magnitude and the norm of the corresponding input activations. The weights are then ranked and pruned on a per-output basis. Experimental results demonstrate task performance on par with SparseGPT on the LLaMA family of models (marginally outperforming it on certain benchmarks). This approach is arguably the simplest pruning technique for LLMs: it is fast because the process does not involve weight updates or reconstruction, and it does not need the specialized kernel for second-order computation devised by SparseGPT. The code is available at https://github.com/locuslab/wanda.
  • Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time by Rice University (https://arxiv.org/pdf/2305.17118.pdf). This paper addresses the KV cache memory requirement during LLM deployment, which overflows the device memory when scaling up batch size and context length. E.g., on top of parameter memory, GPT3/OPT-175B requires 9 GB of KV cache per unit of batch size to support the maximum context of 2048 tokens, limiting the batch size to at most 35 on an 8×A100 (80 GB) setup. The study observes that a small subset of tokens is persistently influential throughout the entire sequence generation and suggests that this property can be exploited to reduce the number of token representations stored in the KV cache. The authors propose Scissorhands, inspired by two textbook algorithms, reservoir sampling and Least Recently Used (LRU) cache replacement, which uses historical attention scores to prune non-influential tokens from the cache when the KV buffer is full. Without the need for fine-tuning, Scissorhands can reduce the KV cache requirement by up to 5× with negligible degradation on various task benchmarks, and it is even compatible with a 4-bit compressed KV cache.
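
The pruning metric from the "A Simple and Effective Pruning Approach for LLMs" entry above, the product of a weight's magnitude and the L2 norm of its input activations, ranked per output, can be sketched in a few lines. This is a minimal pure-Python illustration and not the code from the linked repository; the function name, the toy weight matrix, and the calibration activations are hypothetical.

```python
import math

def wanda_prune(weights, activations, sparsity):
    """Zero the lowest-scoring weights in each output row.

    weights:     list of rows, one row per output neuron (out x in)
    activations: list of calibration samples (samples x in)
    sparsity:    fraction of weights removed in every row
    """
    n_in = len(weights[0])
    # L2 norm of each input feature over the calibration samples
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    pruned = []
    for row in weights:
        # Score of weight j in this row: |W_ij| * ||X_j||_2
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

# Toy data: input feature 2 is very active, input feature 1 is almost silent
W = [[0.5, -2.0, 0.1, 1.0],
     [1.5, 0.2, -0.3, -0.4]]
X = [[1.0, 0.1, 10.0, 1.0],
     [1.0, 0.1, 10.0, 1.0]]
pruned = wanda_prune(W, X, sparsity=0.5)
print(pruned)  # [[0.0, 0.0, 0.1, 1.0], [1.5, 0.0, -0.3, 0.0]]
```

In the first row, plain magnitude pruning would keep the weight -2.0, while the activation-aware score removes it because its input feature is nearly silent and keeps the small weight 0.1 that multiplies a large input.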

Neural Architecture Search

  • Differentiable Quantum Architecture Search for Quantum Reinforcement Learning by Siemens AG and Ludwig Maximilians University (https://arxiv.org/pdf/2309.10392.pdf). Researchers explore the automation of architecture engineering for quantum circuits. This work exploits the learnings on Differentiable Neural Architecture Search, e.g., DARTS, and expands on previous work on Differentiable Quantum Architecture Search (DQAS). The paper explores DQAS capabilities to solve quantum deep Q-learning problems, using two different environments: cart pole and frozen lake. The proposed approach, RL-DQAS, builds a super-circuit with a search space made of a circuit with placeholders, architecture parameters, and a set of operations, O. The results confirm that DQAS is an efficient method for automatically designing quantum circuits.
  • SANA: Sensitivity-Aware Neural Architecture Search Adaptation for Uniform Quantization by Stanford University and University of California at Berkeley (https://www.mdpi.com/2076-3417/13/18/10329). Researchers tackle the challenges in uniform quantization by proposing sensitivity-aware network adaptation (SANA), which performs sensitivity analysis and automatically modifies the model architecture accordingly. To accelerate SANA’s quantization-aware fine-tuning, the authors propose four channel initialization strategies (Halving, Zero Padding, Averaging, and Small Int). Experimental results on ResNet-50 and EfficientNet-B2 show the benefits of neural architecture adaptation.
  • SLIM-TASNET: A Slimmable Neural Network for Speech Separation by International Audio Laboratories Erlangen (https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10248143). Researchers demonstrate the use of Neural Architecture Search (NAS) to obtain neural networks for speech separation, allowing for on-the-fly adaptation to resource-constrained environments. Their approach, Slim-TasNet, achieves dynamic inference through elastic width. Although the super-network generation and training rely on existing weight-sharing techniques, the adaptive performance-efficiency trade-off at runtime is a good example of how trained super-networks can be used in applications with varying resource constraints.
  • FINCH: Enhancing Federated Learning with Hierarchical Neural Architecture Search by the University of Science and Technology of China (https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10251628). Researchers propose the FINCH framework to address some of the challenges of using Neural Architecture Search (NAS) in Federated Learning, e.g., non-IID data and resource-constrained environments. In particular, the authors focus on the application of a hierarchical NAS approach to reduce the completion time when searching for high-performing subnetworks. Subnetworks are allocated to clusters of clients based on their data distribution, and training and search are done in parallel. Results show that FINCH can discover smaller high-performing subnetworks compared to other FL + NAS frameworks, e.g., FedNAS and DecNAS.
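
The elastic width idea mentioned in the SLIM-TASNET entry above can be illustrated with a generic weight-sharing sketch: a narrower sub-network reuses the first rows of the full layer's weights, so the width can be chosen at inference time without storing separate models. This is not Slim-TasNet's architecture; the dense layer, the function name, and the toy values are assumptions made for illustration.

```python
def linear(weights, bias, x, width=None):
    """Dense layer that can run with only its first `width` output units."""
    width = len(weights) if width is None else width
    return [
        sum(w * xi for w, xi in zip(weights[i], x)) + bias[i]
        for i in range(width)
    ]

# One set of shared weights (4 output units, 3 inputs), hypothetical values
weights = [[0.2, -0.5, 0.1],
           [0.7, 0.3, -0.2],
           [-0.4, 0.6, 0.9],
           [0.05, -0.1, 0.8]]
bias = [0.0, 0.1, -0.1, 0.2]
x = [1.0, 2.0, 3.0]

full = linear(weights, bias, x)           # full-width network
half = linear(weights, bias, x, width=2)  # slimmer network, same weights
print(len(full), len(half))
```

The slim layer performs half of the multiply-accumulate operations of the full one, which is the kind of runtime trade-off a slimmable network exposes.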



  • Dataset Quantization by Bytedance and National University of Singapore (https://arxiv.org/pdf/2308.10524v1.pdf). Modern computer vision and large language models train on huge datasets with millions or even billions of samples. Authors of this paper propose a dataset quantization method that aims to shrink datasets without loss of accuracy. For example, with 60% of ImageNet and 20% of Alpaca they are able to train ResNet18 and LLaMA-7B with almost no accuracy drop. The method follows these steps: (1) split the dataset into multiple disjoint sets, (2) uniformly sample a certain ratio of samples from each set, and (3) split each image into patches and discard non-informative patches; pixel quantization is also applied.
  • eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models by Apple (https://arxiv.org/pdf/2309.00964.pdf). An alternative approach to weight compression through weight clustering. Authors claim that it is infeasible to use the standard clustering approaches due to HW resource constraints. They propose improvements that reduce the memory footprint of Differentiable K-Means Clustering. Results demonstrate that the method can fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3 bits/weight) with the Alpaca dataset by reducing the train-time memory footprint of a decoder layer by 130×, at the cost of a modest accuracy degradation.
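
To make the weight clustering idea from the eDKM entry above concrete, here is a minimal sketch of compressing weights to 3 bits per weight with 8 shared centroids. It uses plain Lloyd's k-means on scalars, not the differentiable k-means that eDKM optimizes during training, and it assumes at least two clusters; the function names and the random weights are hypothetical.

```python
import random

def nearest(value, centroids):
    """Index of the centroid closest to `value`."""
    return min(range(len(centroids)), key=lambda c: abs(value - centroids[c]))

def kmeans_1d(values, n_clusters, n_iters=20):
    """Plain Lloyd's k-means on scalars; returns (centroids, assignments)."""
    lo, hi = min(values), max(values)
    # Deterministic init: centroids evenly spaced over the value range
    centroids = [lo + (hi - lo) * i / (n_clusters - 1) for i in range(n_clusters)]
    for _ in range(n_iters):
        assignments = [nearest(v, centroids) for v in values]
        for c in range(n_clusters):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = sum(members) / len(members)
    assignments = [nearest(v, centroids) for v in values]
    return centroids, assignments

# Hypothetical weights drawn from a narrow Gaussian
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]

centroids, indices = kmeans_1d(weights, n_clusters=8)  # 8 clusters = 3-bit index
clustered = [centroids[i] for i in indices]
mse = sum((w - c) ** 2 for w, c in zip(weights, clustered)) / len(weights)
print(len(set(clustered)), mse)
```

After clustering, each weight is stored as a 3-bit index into a shared table of 8 centroids instead of a 16-bit value.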


Deep Learning Software

  • OpenLLM-Perf Leaderboard by Hugging Face (https://huggingface.co/spaces/optimum/llm-perf-leaderboard). The project aims to benchmark the performance (latency & throughput) of Large Language Models (LLMs) with different hardware, backends and optimizations using Optimum-Benchmark and Optimum flavors.
  • QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models by ETH Zurich, IST Austria and Neural Magic (https://arxiv.org/pdf/2307.03738.pdf). An automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on CPUs. The approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. An implementation is available at https://github.com/IST-DASLab/QIGen.
  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning by Princeton and Stanford Universities (https://arxiv.org/pdf/2307.08691.pdf). Authors observe that the inefficiency of the first version of FlashAttention is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low occupancy or unnecessary shared memory reads/writes. They propose FlashAttention-2, with better work partitioning to address these issues. In particular, they (1) tweak the algorithm to reduce the number of non-matmul FLOPs, (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These changes yield around a 2× speedup compared to FlashAttention, reaching 50-73% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. Code is available at https://github.com/Dao-AILab/flash-attention.
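
The tiled attention computation behind the FlashAttention-2 entry above can be sketched for a single query in pure Python. The sketch processes keys and values block by block with a running maximum and a running sum, and, as we understand the non-matmul FLOP reduction, keeps the output accumulator unnormalized and divides only once at the end. It illustrates the arithmetic only, not the GPU work partitioning; the function names and the random inputs are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, keys, values):
    """Standard softmax attention for one query, computed in one pass."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [dot(exps, [v[d] for v in values]) / total for d in range(len(values[0]))]

def attention_tiled(q, keys, values, block_size):
    """Same result, processing keys/values one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])  # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block)
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + dot(exps, [v[d] for v in v_blk])
               for d, a in enumerate(acc)]
        running_max = new_max
    # Single normalization at the very end
    return [a / running_sum for a in acc]

rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(8)]
keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
values = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]

ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=3)
print(max(abs(a - b) for a, b in zip(ref, tiled)))
```

Because only one block of scores is alive at a time, the full attention row never has to be materialized, and the result agrees with the reference up to floating-point rounding.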