Q1'22: Technology update – low precision and model optimization

Authors

Alexander Kozlov, Pablo Munoz, Vui Seng Chua, Nikolay Lyalyushkin, Yury Gorbachev, Nilesh Jain

Summary

This quarter we observe a certain saturation in popular optimization methods such as pruning and NAS. We reviewed many papers on pruning (structured and unstructured) that provide no improvement over the existing state of the art, or at best perform on par with it; such works mostly piggyback on well-known methods. As for NAS, a significant number of works offer theoretical analysis of existing NAS techniques and their convergence without proposing a way to improve them. We did not include such results in this update.

Papers with notable results

Quantization

  • F8NET: FIXED-POINT 8-BIT ONLY MULTIPLICATION FOR NETWORK QUANTIZATION by Snap Inc. and US universities (https://arxiv.org/pdf/2202.05239v1.pdf). A comprehensive study on applying fixed-point quantization to DNN inference acceleration. Authors analyze how the fractional length affects the quantization error for various types of weight and activation distributions. They also modify the well-known PACT method to make it compatible with fixed-point arithmetic. The approach is validated on several models, including MobileNet V1/V2 and ResNet18/50 (see the fixed-point sketch below).

  • Quantune: Post-training Quantization of Convolutional Neural Networks using Extreme Gradient Boosting for Fast Deployment by Artificial Intelligence Research Laboratory, ETRI (https://arxiv.org/pdf/2202.05048v1.pdf). Authors propose Quantune, a method that combines XGBoost and transfer learning to find the optimal quantization configuration. They implemented Quantune on top of the Glow compiler stack. The extended Glow provides layer-wise mixed precision and integer-only quantization, so it can generate binary code of the quantized models for various hardware targets, from CPUs (x86 and ARM) to an integer-only accelerator (VTA). The method outperforms grid, random, and genetic algorithms by approximately 36.5× with a 0.07-0.65 accuracy loss across the six CNN models (see the predictor-guided search sketch below). The code is available at: https://github.com/leejaymin/qaunt_xgboost.

  • Logarithmic Unbiased Quantization: Simple 4-bit Training in Deep Learning by Habana Labs and the Department of Electrical Engineering, Technion (https://arxiv.org/pdf/2112.10769v2.pdf). The paper examines the importance of unbiased quantization in quantized neural network training. It proposes a logarithmic unbiased quantization method that quantizes both the forward and backward phases to 4 bits. The method achieves SOTA results in 4-bit training for ResNet-50 on ImageNet and shows that just one epoch of fine-tuning in full precision, combined with a variance reduction method, significantly improves the results (see the unbiased rounding sketch below).

  • Automatic Mixed-Precision Quantization Search of BERT by Samsung Research (https://arxiv.org/pdf/2112.14938v1.pdf). In this paper, authors propose an automatic mixed-precision quantization approach for BERT compression that conducts quantization and pruning simultaneously at the subgroup level. The method leverages Differentiable Neural Architecture Search to automatically assign the scale and precision for the parameters in each subgroup while pruning redundant groups of parameters at the same time. The method is evaluated on four NLP tasks and shows comparable results.

  • LG-LSQ: Learned Gradient Linear Symmetric Quantization by Tsing Hua University and the Industrial Technology Research Institute (https://arxiv.org/ftp/arxiv/papers/2202/2202.09009.pdf). The paper proposes a method for accurate low-bit quantization with fine-tuning. It modifies the way quantization scaling factors are learned by introducing three novelties: 1) the scaling simulated gradient (SSG) to determine an appropriate gradient for the scaling factor of the linear quantizer; 2) the arctangent soft round (ASR) to prevent the gradient from becoming zero, thereby solving the discreteness problem caused by the rounding process; 3) the minimize discretization error (MDE) method to determine an accurate gradient in backpropagation. Together, these help achieve state-of-the-art results for several models, e.g. a fully 4-bit quantized MobileNet V2 on ImageNet within a 1% accuracy drop.

  • Standard Deviation-Based Quantization for Deep Neural Networks by McGill University (https://arxiv.org/pdf/2202.12422v1.pdf). A revival of the idea of base-2 logarithmic quantization, combined with standard deviation-based quantization in which the floating-point range of the quantizer is encoded by the estimated σ value and a learnable multiplier coefficient. Authors also suggest a two-phase training scheme to increase overall accuracy. The method shows quite good results for low-bit quantization, such as INT4 and INT2.
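
To make the fixed-point arithmetic behind F8NET more concrete, below is a minimal NumPy sketch of power-of-two (fixed-point) quantization with a brute-force search over the fractional length. The function names and the error criterion are our own illustration, not code from the paper.

```python
import numpy as np

def quantize_fixed_point(x, frac_len, num_bits=8):
    """Quantize a float array to signed fixed-point with `frac_len` fractional bits.
    The scale is the power of two 2**frac_len, so rescaling on integer hardware
    is a bit-shift rather than a floating-point multiplication."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x * (1 << frac_len)), qmin, qmax).astype(np.int8)

def dequantize_fixed_point(q, frac_len):
    return q.astype(np.float32) / (1 << frac_len)

# Pick the fractional length that minimizes the quantization error for a given
# weight distribution (the paper studies this trade-off analytically).
w = 0.2 * np.random.randn(1000).astype(np.float32)
mse = {fl: np.mean((dequantize_fixed_point(quantize_fixed_point(w, fl), fl) - w) ** 2)
       for fl in range(1, 8)}
best_frac_len = min(mse, key=mse.get)
```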
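
The search loop in Quantune can be pictured as a generic predictor-guided search: measure a small seed set of quantization configurations, fit a gradient-boosted model, and use it to rank a much larger candidate pool. The configuration encoding, the placeholder objective, and all helper names below are assumptions for the sketch; the actual implementation in the paper is built into Glow and also uses transfer learning across models.

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
num_layers = 20

def random_config(num_layers):
    """Hypothetical encoding: per-layer bit-width plus per-layer calibration scheme."""
    bits = rng.choice([4, 8, 16], size=num_layers)
    scheme = rng.integers(0, 3, size=num_layers)
    return np.concatenate([bits, scheme])

def measure_accuracy(cfg):
    """Placeholder for running the real quantized model on a validation set."""
    return float(-np.sum((cfg[:num_layers] - 8) ** 2))

# 1) Evaluate a small seed set of configurations for real.
seed_cfgs = [random_config(num_layers) for _ in range(64)]
seed_acc = [measure_accuracy(c) for c in seed_cfgs]

# 2) Fit a cheap XGBoost surrogate that predicts accuracy from a configuration.
surrogate = XGBRegressor(n_estimators=200, max_depth=4)
surrogate.fit(np.stack(seed_cfgs), np.array(seed_acc))

# 3) Rank a large pool with the surrogate and only measure the top candidates.
pool = np.stack([random_config(num_layers) for _ in range(5000)])
top_candidates = pool[np.argsort(surrogate.predict(pool))[-10:]]
```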
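
The key property of the logarithmic unbiased quantizer can be illustrated with a toy power-of-two rounding rule: each magnitude is stochastically rounded to one of its two nearest power-of-two levels with probabilities chosen so that the expectation equals the input. This sketch is ours and omits the low-bit dynamic-range clipping, gradient quantization, and the variance-reduction step described in the paper.

```python
import numpy as np

def log_quantize_unbiased(x):
    """Round magnitudes to powers of two, choosing between the two nearest levels
    with probabilities that make the quantizer unbiased in value: E[Q(x)] == x."""
    sign = np.sign(x)
    mag = np.abs(x)
    lower = 2.0 ** np.floor(np.log2(np.maximum(mag, 1e-12)))
    upper = 2.0 * lower
    p_up = (mag - lower) / (upper - lower)            # probability of rounding up
    q = np.where(np.random.rand(*x.shape) < p_up, upper, lower)
    q[mag == 0] = 0.0
    return sign * q

# Unbiasedness check: the mean of the quantized tensor tracks the original mean.
x = np.random.randn(100_000)
print(x.mean(), log_quantize_unbiased(x).mean())
```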

 

Pruning

  • Pruning-aware Sparse Regularization for Network Pruning by Chinese universities (https://arxiv.org/pdf/2201.06776v1.pdf). Authors analyze sparsity-training-based methods and find that regularizing unpruned channels is unnecessary and can lead to under-fitting. They propose a pruning method with pruning-aware sparse regularization, which imposes fine-grained sparse regularization only on the filters selected by a pruning mask (see the mask-restricted regularization sketch below). The method removes more than 51.07% of FLOPs on ResNet-50 with a 0.76% loss in top-1 accuracy on ImageNet. The code is released at https://github.com/CASIA-IVA-Lab/MaskSparsity.

  • HRel: Filter Pruning based on High Relevance between Activation Maps and Class Labels by universities of India (https://arxiv.org/pdf/2202.10716.pdf). The paper proposes another criterion for selecting prunable filters in CNNs. It is grounded in information theory and uses mutual information to compute the so-called “Relevance” between the activation maps generated by each filter for a mini-batch and the class labels of the samples in that mini-batch. This “Relevance” estimates the importance of the corresponding filters, and the less important ones are pruned. The method achieves comparable results on image classification tasks, e.g. a 0.68% drop in top-1 accuracy after pruning 48.66% of FLOPs of ResNet-50 on ImageNet.

  • SPViT: Enabling Faster Vision Transformers via Soft Token Pruning by US and Switzerland universities (https://arxiv.org/pdf/2112.13890v1.pdf). The paper argues that for Vision Transformer architectures token pruning offers a greater computation reduction than compressing other dimensions. It proposes a method that introduces an attention-based multi-head token selector and a token packaging technique to achieve per-image adaptive pruning (see the token-selection sketch below). For lightweight models, the method allows DeiT-S and DeiT-T to reduce inference latency by 40-60% within 0.5% accuracy loss.

  • EXPLORING STRUCTURAL SPARSITY IN NEURAL IMAGE COMPRESSION by Harbin Institute of Technology and Peng Cheng Laboratory (https://arxiv.org/pdf/2202.04595v4.pdf). A practical study on applying filter pruning to accelerate the inference of image compression models. Authors use a simple pruning method based on learnable per-channel masks. They apply the method to different image compression architectures and achieve up to 7× computation reduction and 3× acceleration.
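
The central idea of the pruning-aware regularizer, penalizing only the channels that the pruning mask has already selected for removal, can be sketched as follows. Using BatchNorm scaling factors as the per-channel saliency is a common convention we assume here; the function name and loss coefficient are ours, not the paper's code.

```python
import torch
import torch.nn as nn

def masked_sparsity_loss(model, prune_masks, coeff=1e-4):
    """L1 penalty applied only to the BatchNorm scaling factors of channels that
    are marked for pruning. Unpruned channels receive no regularization, which is
    exactly how the paper avoids the under-fitting it points out.

    prune_masks: dict mapping a BatchNorm module name to a boolean tensor where
    True means "this channel will be pruned".
    """
    penalty = 0.0
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d) and name in prune_masks:
            penalty = penalty + module.weight[prune_masks[name]].abs().sum()
    return coeff * penalty

# Inside a training step (model, criterion, inputs, targets, prune_masks assumed):
#   loss = criterion(model(inputs), targets) + masked_sparsity_loss(model, prune_masks)
#   loss.backward()
```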
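
To illustrate the token pruning in SPViT, here is a simplified single-step sketch: patch tokens are scored by class-token attention, the top ones are kept, and the remainder is fused into a single package token rather than discarded. The real method uses a learned attention-based multi-head token selector trained end to end; the scoring rule and tensor names here are our simplification.

```python
import torch

def select_and_package_tokens(tokens, cls_attn, keep_ratio=0.5):
    """tokens:   (B, N, C) patch tokens (class token excluded)
    cls_attn: (B, N) attention from the class token to each patch token,
              used here as the importance score.
    Keeps the top-k tokens and fuses the rest into one "package" token."""
    B, N, C = tokens.shape
    k = max(1, int(N * keep_ratio))
    order = cls_attn.argsort(dim=1, descending=True)
    keep_idx, drop_idx = order[:, :k], order[:, k:]

    keep = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    drop = torch.gather(tokens, 1, drop_idx.unsqueeze(-1).expand(-1, -1, C))
    drop_w = torch.gather(cls_attn, 1, drop_idx).softmax(dim=1).unsqueeze(-1)
    package = (drop * drop_w).sum(dim=1, keepdim=True)      # (B, 1, C)
    return torch.cat([keep, package], dim=1)                 # (B, k + 1, C)

# Example: prune half of 196 patch tokens for a batch of 2 images.
out = select_and_package_tokens(torch.randn(2, 196, 384), torch.rand(2, 196))
print(out.shape)   # torch.Size([2, 99, 384])
```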

 

Neural Architecture Search

  • AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models by Microsoft Research and Pennsylvania State University (https://arxiv.org/pdf/2201.12507v1.pdf). Authors develop a few-shot task-agnostic Neural Architecture Search framework for the NLP domain. They use self-attention distillation to train the SuperLM and demonstrate that this is better than the masked language modeling objective for task-agnostic SuperLM training. Experiments on the GLUE benchmark show that the method achieves a 62.4% reduction in computational cost and a 59.7% reduction in model size over state-of-the-art task-agnostic distillation methods.

  • Fast Neural Architecture Search for Lightweight Dense Prediction Networks by European universities (https://arxiv.org/pdf/2203.01994v3.pdf). The paper proposes a multi-objective LDP method for searching for accurate and lightweight dense prediction architectures (segmentation, depth estimation, super-resolution). It uses a new Assisted Tabu Search to enable fast neural architecture search. The method shows comparable or better results on a variety of tasks.

  • WPNAS: Neural Architecture Search by jointly using Weight Sharing and Predictor by Huawei and Samsung Research China (https://arxiv.org/pdf/2203.02086v1.pdf). Authors propose a method that jointly uses weight sharing and a predictor, and updates architecture parameters with a self-critical policy gradient algorithm with probabilistic sampling. They use a few-shot learning-based predictor for subnets and a weak weight-sharing strategy based on a so-called HyperNet, which is essentially an RNN-based model that generates offsets for the originally shared weights. The method shows results comparable to SOTA on the CIFAR and ImageNet datasets.

  • ONE-NAS: An Online Neuro Evolution based Neural Architecture Search for Time Series Forecasting by Rochester Institute of Technology (https://arxiv.org/pdf/2202.13471v1.pdf). Authors claim that this work is the first attempt to design and train RNNs for time series forecasting in an online setting. Without any pretraining, the method maintains populations of RNNs that are continuously updated with new network structures and weights in response to new multivariate input data. The method outperforms traditional statistical time series forecasting approaches, including naive, moving average, and exponential smoothing methods, as well as state-of-the-art online ARIMA strategies.

  • BINAS: Bilinear Interpretable Neural Architecture Search by Alibaba (https://arxiv.org/pdf/2110.12399v2.pdf). The paper proposes a bilinear accuracy estimator for architecture search. The bilinear form of the proposed estimator allows formulating the latency-constrained NAS problem as an Integer Quadratic Constrained Quadratic Programming (IQCQP) problem. Thanks to this, it can be solved efficiently via a simple algorithm with some off-the-shelf components (see the bilinear estimator sketch below). The method shows comparable results under a similar training setup. Code is available at: https://github.com/Alibaba-MIIL/BINAS.
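
The bilinear estimator at the core of BINAS can be pictured with a toy surrogate over one-hot encodings of per-block design choices: the predicted accuracy is a constant plus linear contributions from each block's choice plus pairwise interaction terms between consecutive blocks. Because the estimate is (bi)linear in the integer decision variables, a latency-constrained search can be posed as an integer quadratic program. The class name, parameter shapes, and random initialization below are our own simplification; in the paper the estimator is fitted from measurements of sub-networks and the resulting IQCQP is solved with off-the-shelf components.

```python
import numpy as np

class BilinearAccuracyEstimator:
    """Toy bilinear accuracy surrogate over one-hot architecture encodings."""

    def __init__(self, num_blocks, num_choices, rng):
        self.base = 0.5                                                    # baseline accuracy
        self.linear = rng.normal(0.0, 0.01, size=(num_blocks, num_choices))
        self.pair = rng.normal(0.0, 0.001, size=(num_blocks - 1, num_choices, num_choices))

    def predict(self, one_hot):
        """one_hot: (num_blocks, num_choices) matrix with one choice selected per block."""
        acc = self.base + np.sum(self.linear * one_hot)
        for i in range(one_hot.shape[0] - 1):                              # consecutive-block interactions
            acc += one_hot[i] @ self.pair[i] @ one_hot[i + 1]
        return acc

rng = np.random.default_rng(0)
estimator = BilinearAccuracyEstimator(num_blocks=12, num_choices=4, rng=rng)
arch = np.eye(4)[rng.integers(0, 4, size=12)]   # a random one-hot architecture
print(estimator.predict(arch))
```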

Deep Learning Software

  • Neural Network Quantization with AI Model Efficiency Toolkit (AIMET) by Qualcomm (https://arxiv.org/pdf/2201.08442v1.pdf). An overview of the DNN optimization toolkit from Qualcomm. The code is open-sourced and contains several state-of-the-art methods from Qualcomm Research.

 

Deep Learning Hardware