Q4'22: Technology update – low precision and model optimization

Authors

Alexander Kozlov, Pablo Munoz, Vui Seng Chua, Nikolay Lyalyushkin, Nikita Savelyev, Yury Gorbachev, Nilesh Jain

Summary

We still observe a lot of attention to quantization and the problem of recovering accuracy after it. We highly recommend reading the SmoothQuant paper from the Highlights, published by Song Han's lab, about improving the accuracy of quantized Transformer models.

Highlights

  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models by MIT and Nvidia (https://arxiv.org/pdf/2211.10438.pdf). A training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs that can be implemented efficiently. SmoothQuant smooths the activation outliers by migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. Concretely, the authors compute activation statistics on a few calibration sequences and then use them to scale down the activations and scale up the weights so that the worst-case outliers are minimized (see the sketch after this list). They demonstrate up to 1.56× speedup and 2× memory reduction for LLMs with negligible loss in accuracy.
  • CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers by University of Texas at Austin and ARM Inc. (https://arxiv.org/pdf/2211.09643.pdf). The method finds the optimal set of quantization scales that globally minimizes a contrastive loss without changing weights. CPT-V proposes a block-wise evolutionary search to minimize a global contrastive loss objective, allowing for accuracy improvement of existing vision transformer (ViT) quantization schemes. CPT-V improves the top-1 accuracy of a fully quantized ViT-Base by 10.30%, 0.78%, and 0.15% for 3-bit, 4-bit, and 8-bit weight quantization levels.
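To make the SmoothQuant idea concrete, here is a minimal sketch (our own, not the authors' code) of the smoothing transformation applied to a single Linear layer; `act_max` and `alpha` are illustrative names for the calibrated per-channel activation maxima and the migration strength.

```python
# Minimal sketch of SmoothQuant-style per-channel smoothing (not the official code).
# `act_max` is assumed to hold per-input-channel absolute maxima of the activations,
# collected offline on a few calibration sequences.
import torch

def smooth_linear(linear: torch.nn.Linear, act_max: torch.Tensor, alpha: float = 0.5):
    """Migrate quantization difficulty from activations to weights.

    For each input channel j the scale is
        s_j = act_max_j**alpha / weight_max_j**(1 - alpha);
    the activations are divided by s_j (folded into the preceding op in practice)
    and the matching weight columns are multiplied by s_j, which keeps the layer
    output mathematically unchanged.
    """
    weight_max = linear.weight.abs().amax(dim=0).clamp(min=1e-5)   # per input channel
    scales = (act_max.clamp(min=1e-5) ** alpha) / (weight_max ** (1 - alpha))
    linear.weight.data *= scales          # scale up the weight columns
    return scales                         # divide the incoming activations by these scales
```

In deployment, the division by `scales` is folded into the preceding LayerNorm or Linear layer, so both the smoothed activations and the rescaled weights are easier to quantize to INT8 at no extra runtime cost.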

Papers with notable results

Quantization

  • SUB-8-BIT QUANTIZATION FOR ON-DEVICE SPEECH RECOGNITION: A REGULARIZATION-FREE APPROACH by Amazon Alexa AI (https://arxiv.org/pdf/2210.09188.pdf). The paper introduces a method for ASR model compression that enables on-centroid weight aggregation without augmented regularizers. Instead, it leverages Softmax annealing to impose soft-to-hard quantization on centroids from the µ-Law constrained space (a sketch is given after this list). The method supports different quantization modes with a wide range of granularity: different bit depths can be specified for different kernels/layers/modules. It allows compressing a Conformer into sub-5-bit with more than 6x model size reduction, and a Bifocal RNN-T into 5-bit, which reduces the memory footprint by 30.73% and P90 user-perceived latency by 31.30% compared to INT8.
  • Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models by SenseTime Research and Chinese universities (https://arxiv.org/pdf/2209.13325.pdf). The authors propose an outlier suppression framework with two components: Gamma Migration and Token-Wise Clipping. Gamma Migration produces a more quantization-friendly model by migrating the outlier amplifier γ of LayerNorm into the subsequent modules in an equivalent transformation, giving more robust activations for quantization without extra computation (a sketch is given after this list). Token-Wise Clipping then efficiently finds a suitable clipping range with minimal final quantization loss in a coarse-to-fine procedure. The coarse-grained stage, which leverages the fact that the less important outliers belong to only a few tokens, quickly obtains a preliminary clipping range in a token-wise manner; the fine-grained stage then optimizes it. However, the method only succeeds on small language models such as BERT, RoBERTa, and BART, and fails to maintain the accuracy of LLMs. The PyTorch implementation is available at https://github.com/wimh966/outlier_suppression.
  • GPTQ: ACCURATE POST-TRAINING QUANTIZATION FOR GENERATIVE PRE-TRAINED TRANSFORMERS by ETH Zurich and IST Austria (https://arxiv.org/pdf/2210.17323.pdf). The paper introduces a method for low-bit quantization of Transformer models. Essentially, the method quantizes weights layer by layer, compensating the quantization error with approximate second-order information in the spirit of the Optimal Brain Quantization method. The code is available at https://github.com/IST-DASLab/gptq.
  • NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers by Nanjing, Peking, and Berkeley universities (https://arxiv.org/pdf/2211.16056.pdf). The authors provide a theoretical justification for reducing the quantization error of heavy-tailed activation distributions with a fixed additive noisy bias, and propose a quantizer-agnostic enhancement of post-training quantization (PTQ) for activations (a sketch is given after this list). The method is applied on top of existing PTQ quantizers and shows superior results for Vision Transformer models trained on the ImageNet dataset: up to 1.7% improvement for a linear quantization scheme and up to 0.7% for a nonlinear one.
  • Exploiting the Partly Scratch-off Lottery Ticket for Quantization-Aware Training by Tencent and Chinese universities (https://arxiv.org/pdf/2211.08544.pdf). The authors observe that a large portion of quantized weights reaches the optimal quantization level after a few training epochs. Based on this observation, they zero out the gradients of these weights for the remaining training period to avoid meaningless updates. To find the "ticket", they use a simple heuristic: a weight is frozen once the distance between its full-precision value and its quantization level falls below a controllable threshold (a sketch is given after this list). The method eliminates 30%-60% of weight updates and 15%-30% of backward-pass FLOPs while keeping the baseline performance. For example, it improves 2-bit ResNet-18 by 1.41%, eliminating 56% of weight updates and 28% of backward-pass FLOPs. Code is at https://github.com/zysxmu/LTS.
  • QFT: Post-training quantization via fast joint finetuning of all degrees of freedom by Hailo AI (https://arxiv.org/pdf/2212.02634.pdf). The paper proposes a modification of layer-wise/channel-wise post-training quantization in which all the parameters are trained jointly, including the layer weights, quantization scales, and cross-layer factorization parameters, to reduce the overall quantization error. The training setup is standard and uses the original model as a teacher for layer-wise knowledge distillation. The method achieves competitive results in 4-bit and 8-bit quantization setups.
  • Make RepVGG Greater Again: A Quantization-aware Approach by Meituan (https://arxiv.org/pdf/2212.01593.pdf). An interesting read about the challenges of quantizing the RepVGG model. The authors analyze what exactly leads to the significant accuracy degradation when quantizing this model to 8 bits and find a high variance in the activations of some layers that is induced by the model architecture. They propose several tricks (essentially normalization and regularization changes) that can be applied along with QAT; with these changes, accuracy degradation stays below 2%. As a side note, the OpenVINO team found that with FP8 data types it is possible to stay within ~1% accuracy drop of the FP32 baseline without bells and whistles; only scaling and Bias Correction are required.
  • A Closer Look at Hardware-Friendly Weight Quantization by Google Research (https://arxiv.org/pdf/2210.03671.pdf). The authors study two classes of quantization scale estimation methods, MSQE-based and gradient-based, to identify the sources of their performance differences, namely sensitivity to outliers and convergence instability of the quantizer scaling factor. The study is done under strict hardware constraints: uniform, per-tensor, symmetric quantization. They propose techniques that improve both classes: they fix the optimization instability of the MSQE-based methods during the quantization of MobileNet models and improve the validation performance of the gradient-based methods. The proposed approach achieves superior results under those constraints.
  • CSMPQ: CLASS SEPARABILITY BASED MIXED-PRECISION QUANTIZATION by universities of China (https://arxiv.org/pdf/2212.10220.pdf). The paper uses the class separability of layer-wise feature maps to search for the optimal quantization bit-width. Essentially, the authors leverage the TF-IDF metric from NLP to measure the class separability of layer-wise feature maps averaged across spatial dimensions. The method can be applied on top of existing quantization algorithms, such as BRECQ, and delivers good results, e.g. 71.30% top-1 accuracy with only 1.5 MB on MobileNetV2.
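For the Amazon ASR paper above, the following is a rough sketch, under our own assumptions about the details, of soft-to-hard quantization onto µ-law-spaced centroids with softmax annealing; `mu_law_centroids` and `soft_quantize` are illustrative helpers, not the paper's code.

```python
# Minimal sketch of soft-to-hard weight quantization with softmax annealing
# (illustrative only; names and details are our assumptions).
import torch

def mu_law_centroids(num_levels: int, mu: float = 255.0) -> torch.Tensor:
    """Centroids placed uniformly in the µ-law companded space of [-1, 1]."""
    u = torch.linspace(-1.0, 1.0, num_levels)
    return torch.sign(u) * ((1.0 + mu) ** u.abs() - 1.0) / mu

def soft_quantize(w: torch.Tensor, centroids: torch.Tensor, temperature: float) -> torch.Tensor:
    """Softly assign each weight to the centroids; a larger temperature gives a harder assignment."""
    dist = (w.unsqueeze(-1) - centroids) ** 2            # [..., num_levels]
    probs = torch.softmax(-temperature * dist, dim=-1)   # soft one-hot over centroids
    return (probs * centroids).sum(dim=-1)               # on-centroid aggregation

# During training the temperature is annealed upward so that the soft assignment
# gradually collapses to the nearest centroid (hard quantization).
```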
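For the Outlier Suppression paper, here is a minimal sketch of Gamma Migration derived from the equivalence W·(γ·x̂ + β) = (W·diag(γ))·(x̂ + β/γ); this is our own re-derivation for a single LayerNorm-Linear pair, not the official implementation.

```python
# Minimal sketch of "Gamma Migration" (our re-derivation, not the authors' code):
# the tensor that gets quantized becomes x̂ + β/γ, which has milder channel-wise outliers.
import torch

@torch.no_grad()
def migrate_gamma(ln: torch.nn.LayerNorm, next_linear: torch.nn.Linear):
    gamma = ln.weight.data.clone()          # outlier amplifier γ, shape [hidden]
    ln.bias.data.div_(gamma)                # β -> β/γ
    ln.weight.data.fill_(1.0)               # "non-scaling" LayerNorm
    next_linear.weight.data.mul_(gamma)     # scale the weight columns by γ
    # Note: every consumer of this LayerNorm's output (e.g. Q/K/V projections,
    # residual branches) must absorb γ the same way for exact equivalence.
```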
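For NoisyQuant, here is a minimal sketch of the fixed additive noisy bias as we understand it; the module name and the uniform noise amplitude are our assumptions.

```python
# Minimal sketch of the NoisyQuant idea (our interpretation, not the authors' code):
# a fixed noisy bias, sampled once and reused for all inputs, is added to the
# activation before quantization and removed after dequantization, reducing the
# expected quantization error for heavy-tailed activations.
import torch

def fake_quant(x: torch.Tensor, scale: float, num_bits: int = 8) -> torch.Tensor:
    qmax = 2 ** (num_bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

class NoisyQuantAct(torch.nn.Module):
    def __init__(self, num_features: int, scale: float, noise_amplitude: float):
        super().__init__()
        # Fixed noisy bias: sampled once, then frozen (it is not data-dependent).
        noise = (torch.rand(num_features) - 0.5) * 2 * noise_amplitude
        self.register_buffer("noise", noise)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # In deployment, subtracting `noise` is folded into the bias of the
        # following linear layer, so it costs nothing extra at inference time.
        return fake_quant(x + self.noise, self.scale) - self.noise
```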
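For the partly scratch-off lottery ticket paper, a small sketch of the freezing heuristic follows; the threshold convention and function names are illustrative.

```python
# Minimal sketch of freezing nearly-converged weights during QAT (our interpretation):
# once a weight is close enough to its quantization level, its gradient is masked out
# for the rest of training.
import torch

def quantize_levels(w: torch.Tensor, scale: float) -> torch.Tensor:
    return torch.round(w / scale) * scale

def update_frozen_mask(w: torch.Tensor, frozen: torch.Tensor, scale: float, threshold: float):
    """Mark weights whose distance to the nearest quantization level is small."""
    close = (w - quantize_levels(w, scale)).abs() < threshold * scale
    frozen |= close                          # once frozen, a weight stays frozen
    return frozen

def mask_gradients(w: torch.Tensor, frozen: torch.Tensor):
    if w.grad is not None:
        w.grad[frozen] = 0.0                 # skip meaningless updates

# Training loop sketch: call update_frozen_mask(...) periodically, and call
# mask_gradients(...) between loss.backward() and optimizer.step().
```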

Pruning

  • Structured Pruning Adapters by Aarhus University and Cactus Communications (https://arxiv.org/pdf/2211.10155.pdf). The paper introduces task-switching network adapters that accelerate and specialize networks for downstream tasks. The authors propose channel- and block-based adapters and evaluate them with a suite of pruning methods on both computer vision and natural language processing benchmarks. The method achieves comparable results when transferring ResNet-50 from ImageNet to CIFAR, Flowers, and Cats and Dogs, and BERT to SQuAD v1.1. The code is available at https://github.com/lukashedegaard/structured-pruning-adapters.
  • HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers by Northeastern University, Simon Fraser University, and CoCoPIE LLC (https://arxiv.org/pdf/2211.08110.pdf). The paper introduces an algorithm-FPGA co-design of a token selector that enables image-adaptive token pruning in Vision Transformers. The authors also propose a latency-aware multi-stage training strategy to learn where to insert token selectors in ViTs, replace the non-linearities inside ViT models with polynomial approximations, and stack token pruning with 8-bit quantization. The method achieves 28.4%∼65.3% computation reduction for various widely used ViTs on ImageNet, and 3.46×∼4.89× speedup with a trivial resource utilization overhead on FPGA.
  • Soft Masking for Cost-Constrained Channel Pruning by Stanford University and Nvidia (https://arxiv.org/pdf/2211.02206.pdf). The authors propose a filter pruning method with a soft mask re-parameterization of the network weights so that channel sparsity can be adaptively rewired (a sketch is given after this list). They also apply a scaling technique for the batch normalization weights to mitigate gradient instability at high channel pruning ratios. To perform channel pruning subject to a cost constraint, they reformulate it as a multiple-choice knapsack problem. The method achieves SOTA results on ImageNet and VOC tasks. The code is available at https://github.com/NVlabs/SMCP.
  • Pruning’s Effect on Generalization Through the Lens of Training and Regularization by MIT, University of Toronto, MosaicML, and Google Brain (https://arxiv.org/pdf/2210.13738.pdf). The authors study the impact of model pruning on generalization. Even though the study is conducted on toy examples, it is quite extensive and confirms the known observation that pruning acts as an additional regularizer and can lead to better training results.
  • oViT: An Accurate Second-Order Pruning Framework for Vision Transformers by Yandex, Neural Magic, and IST Austria (https://arxiv.org/pdf/2210.09223.pdf). The authors introduce an approximate second-order pruner for Vision Transformer models that estimates sparsity ratios for different parts of the model. They also provide a set of general sparse fine-tuning recipes that enable accuracy recovery at reasonable computational budgets. In addition, they propose a pruning framework that produces sparse, accurate models for a sequence of sparsity targets in a single run, accommodating various deployments under a fixed compute budget. The method is evaluated on various ViT models, including the classical ViT, DeiT, XCiT, EfficientFormer, and Swin, and shows SOTA results (e.g. 75% sparsity at <1% accuracy drop).
  • A Fast Post-Training Pruning Framework for Transformers by UC Berkeley and Samsung (https://arxiv.org/pdf/2204.09656.pdf). The proposed method prunes Transformer models without any fine-tuning. When applied to BERT and DistilBERT, it achieves a 2.0x reduction in FLOPs and a 1.56x speedup in inference latency while maintaining <1% accuracy loss. Notably, the whole pruning process finishes in less than 3 minutes on a single GPU. The method consists of three main stages: (1) a lightweight mask search algorithm finds which Transformer heads and filters to prune based on the Fisher information (a sketch of Fisher-based head scoring is given after this list); (2) mask rearrangement improves the binary masks produced by the previous stage; and (3) mask tuning tweaks some of the 1's in the mask by making them real-valued.
  • Fast DistilBERT on CPUs by Intel (https://arxiv.org/pdf/2211.07715.pdf). The work proposes a new pipeline that applies Prune-OFA with block-wise structured pruning jointly with quantization-aware training and distillation. It also provides an advanced INT8 sparse GEMM inference engine, friendly to Intel VNNI instructions, as a companion runtime to accelerate the resulting models. DistilBERT/SQuADv1.1 optimized by the pipeline and deployed with the new engine outperforms Neural Magic’s proprietary sparse inference engine in throughput (under a production latency constraint) by 50%, and the low-precision performance of ONNX Runtime by 4.1x. Source code can be found at https://github.com/intel/intel-extension-for-transformers.
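For the Soft Masking paper, here is a minimal sketch of soft mask re-parameterization for a convolution's output channels; it illustrates the general idea only, not Nvidia's SMCP implementation.

```python
# Minimal sketch of soft channel masking (illustrative, not the SMCP code):
# output channels are gated by a learnable soft mask, so pruned channels can be
# adaptively "rewired" during training instead of being removed irreversibly.
import torch

class SoftMaskedConv2d(torch.nn.Module):
    def __init__(self, conv: torch.nn.Conv2d):
        super().__init__()
        self.conv = conv
        # One logit per output channel; sigmoid(logit) acts as a keep probability.
        self.mask_logits = torch.nn.Parameter(torch.zeros(conv.out_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.mask_logits).view(1, -1, 1, 1)
        return self.conv(x) * mask

# At the end of training, channels whose mask falls below a threshold (chosen to
# meet the cost constraint, e.g. via a knapsack formulation over per-channel costs
# and importance scores) are removed and the remaining weights are kept as-is.
```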
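For the fast post-training pruning framework, a rough sketch of Fisher-based head importance scoring follows; the `head_masks` model argument is a hypothetical interface assumed for illustration, not the paper's actual API.

```python
# Rough sketch of diagonal-Fisher importance scoring for attention heads
# (a common approximation; details differ from the paper's full framework).
# It is assumed that the model multiplies each attention head's output by the
# corresponding entry of `head_masks` (hypothetical keyword argument).
import torch

def head_fisher_scores(model, num_heads: int, data_loader, loss_fn) -> torch.Tensor:
    head_masks = torch.ones(num_heads, requires_grad=True)
    fisher = torch.zeros(num_heads)
    for inputs, labels in data_loader:
        loss = loss_fn(model(inputs, head_masks=head_masks), labels)
        (grad,) = torch.autograd.grad(loss, head_masks)
        fisher += grad.detach().pow(2)      # diagonal Fisher: expected squared gradient
    return fisher                           # heads with the smallest scores are pruned first
```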

 

Neural Architecture Search

  • EZNAS: Evolving Zero-Cost Proxies for Neural Architecture Scoring by Intel Labs (https://openreview.net/forum?id=lSqaDG4dvdt). The authors propose a genetic programming approach to automate the discovery of zero-cost neural architecture scoring metrics. The discovered metrics outperform existing hand-crafted metrics and generalize well across neural architecture design spaces. Two search spaces are explored with EZNAS: NAS-Bench-201 and Network Design Spaces (NDS), demonstrating the strong generalization capabilities of the discovered zero-cost metrics.
  • Resource-Aware Heterogeneous Federated Learning using Neural Architecture Search by Iowa State University and Intel Labs (https://arxiv.org/pdf/2211.05716.pdf). This paper proposes a framework for Resource-aware Federated Learning (RaFL). The framework uses Neural Architecture Search (NAS) to enable on-demand specialized model deployment for resource-diverse edge devices. Furthermore, it uses a novel model architecture fusion scheme to aggregate the distributed learning results. RaFL demonstrates superior resource efficiency and a reduction in communication overhead compared to state-of-the-art solutions.
  • NAS-LID: Efficient Neural Architecture Search with Local Intrinsic Dimension by Nvidia and universities of China, India, and the UK (https://arxiv.org/pdf/2211.12759.pdf). The authors characterize sampled architectures by their local intrinsic dimension (LID): low-cost LID features are computed layer by layer, and the similarity between architectures is measured in terms of these LID profiles (a sketch of the LID estimator is given after this list). The method can be embedded into existing NAS frameworks, e.g. OFA or ProxylessNAS. It significantly accelerates architecture search and shows comparable performance on public benchmarks. The code is available at https://github.com/marsggbo/NAS-LID.
  • Automatic Subspace Evoking for Efficient Neural Architecture Search by Hisense and universities of China (https://arxiv.org/pdf/2210.17180.pdf). The method decouples architecture search into global and local steps to enhance NAS performance: it first performs a global search to find promising subspaces and then a local search within them to obtain the resulting architectures. The search algorithm exploits jointly trained GNN and RNN models. The method shows superior results compared to several well-known NAS frameworks.
  • AUTOMOE: NEURAL ARCHITECTURE SEARCH FOR EFFICIENT SPARSELY ACTIVATED TRANSFORMERS by Microsoft Research and the University of British Columbia (https://arxiv.org/pdf/2210.07535.pdf). The authors introduce a heterogeneous search space for Transformers consisting of a variable number, FFN size, and placement of experts in both encoders and decoders, as well as a variable number of layers, attention heads, and intermediate FFN dimension of standard Transformer modules. They extend Supernet training to this search space, which combines all possible sparse architectures into a single graph and jointly trains them via weight sharing, and then use an evolutionary algorithm to search the Supernet for the sparse architecture with the best performance on a downstream task. The method shows better results than dense NAS methods for Transformers.
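For NAS-LID, below is a sketch of the standard maximum-likelihood LID estimator that can be computed from one layer's activations; this is our own implementation of the generic estimator, not the authors' code.

```python
# Minimal sketch of the maximum-likelihood local intrinsic dimension (LID) estimator
# that can serve as a layer-wise feature (illustrative, not the NAS-LID repository code).
import torch

def lid_mle(features: torch.Tensor, k: int = 20) -> torch.Tensor:
    """features: [n_samples, dim] activations of one layer for a batch of inputs.
    Returns a per-sample LID estimate based on the k nearest neighbours."""
    dists = torch.cdist(features, features)       # pairwise Euclidean distances
    knn, _ = dists.topk(k + 1, largest=False)     # k+1 smallest, including self-distance
    knn = knn[:, 1:]                              # drop the zero self-distance -> [n, k]
    # MLE estimator: LID = -1 / mean_j(log(r_j / r_k)), j = 1..k-1
    ratios = knn[:, :-1] / knn[:, -1:].clamp(min=1e-12)
    lid = -1.0 / torch.log(ratios.clamp(min=1e-12)).mean(dim=1)
    return lid                                    # one LID value per sample
```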

Other

  • GhostNetV2: Enhance Cheap Operation with Long-Range Attention by Huawei, Peking University, and the University of Sydney (https://arxiv.org/pdf/2211.12905.pdf). The authors propose a hardware-friendly attention mechanism (dubbed DFC attention) and the GhostNetV2 architecture for mobile applications. The DFC attention is aimed at capturing the dependence between long-range pixels (a sketch is given below). They also revisit the expressiveness bottleneck in the previous GhostNet and propose to enhance the expanded features so that a GhostNetV2 block can aggregate local and long-range information simultaneously. The approach shows good results: 75.3% top-1 accuracy on ImageNet with 167M FLOPs. The code is available at https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/ghostnetv2_pytorch.
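Below is a sketch of a DFC-style attention branch as we read the paper: two decoupled depthwise convolutions capture horizontal and vertical long-range dependence on a downsampled map and gate the block output. The kernel size and downsampling factor are illustrative choices, and the module is not the official implementation.

```python
# Sketch of a DFC-style attention branch (our reading of the paper, not the official code).
import torch
import torch.nn.functional as F

class DFCAttention(torch.nn.Module):
    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        self.reduce = torch.nn.Conv2d(channels, channels, 1)
        # Decoupled fully connected attention: 1xK and Kx1 depthwise convolutions
        # approximate horizontal and vertical long-range aggregation cheaply.
        self.horizontal = torch.nn.Conv2d(
            channels, channels, (1, kernel_size), padding=(0, kernel_size // 2), groups=channels)
        self.vertical = torch.nn.Conv2d(
            channels, channels, (kernel_size, 1), padding=(kernel_size // 2, 0), groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute the attention map on a 2x downsampled copy to keep it cheap.
        attn = F.avg_pool2d(x, kernel_size=2, stride=2)
        attn = self.vertical(self.horizontal(self.reduce(attn)))
        attn = torch.sigmoid(F.interpolate(attn, size=x.shape[-2:], mode="nearest"))
        return x * attn   # gate the features produced by the (Ghost) block
```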

Deep Learning Software

Deep Learning Hardware