Optimizing Latent Consistency Model for Image Generation with OpenVINO™ and NNCF


Authors: Liubov Talamanova, Ekaterina Aidova, Alexander Kozlov

Introduction

Latent Diffusion Models (LDMs) have revolutionized AI-generated art. This technology enables the creation of high-quality images simply by writing a text prompt. While LDMs like Stable Diffusion achieve outstanding generation quality, they often suffer from the slowness of the iterative image denoising process. Latent Consistency Model (LCM) is an optimized version of LDM. Inspired by Consistency Models (CM), Latent Consistency Models (LCMs) enable swift inference with minimal steps on any pre-trained LDM, including Stable Diffusion. Consistency Models are a new family of generative models that enables one-step or few-step generation. More details about the proposed approach and models can be found using the following resources: project page, paper, original repository.

Similar to the original Stable Diffusion pipeline, the LCM pipeline consists of three important parts (sketched after the list below):

  • Text Encoder to create a condition for generating an image from a text prompt.
  • U-Net for step-by-step denoising of the latent image representation.
  • Autoencoder (VAE) for decoding the latent space into an image.
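
Conceptually, these three parts interact as in the following simplified Python sketch (all names here are illustrative, not an exact API):

# Simplified structure of a text-to-image diffusion pipeline (illustrative).
text_embeddings = text_encoder(tokenize(prompt))    # condition from the prompt
latents = initial_random_noise()                    # start from pure noise
for t in scheduler.timesteps:                       # LCM needs only ~4 steps
    noise_pred = unet(latents, t, text_embeddings)  # predict noise to remove
    latents = scheduler.step(noise_pred, t, latents)
image = vae.decode(latents)                         # latent space -> pixels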

In this post, we explain how to optimize LCM inference with OpenVINO for Intel hardware. Since LCM is trained to be resistant to perturbations, we can also apply common optimization methods such as quantization to lower the precision while expecting a consistent generation result. Specifically, we apply 8-bit Post-Training Quantization from the Neural Network Compression Framework (NNCF).

Convert models to OpenVINO format

To leverage efficient inference with the OpenVINO runtime on Intel platforms, the original model should be converted to OpenVINO Intermediate Representation (IR). OpenVINO supports the conversion of PyTorch models directly via the Model Conversion API. The ov.convert_model function accepts an instance of a PyTorch model and example inputs for tracing, and returns an object of the ov.Model class, ready to use or to save on disk with the ov.save_model function. You can find the conversion details of LCM in the OpenVINO LCM Notebook.
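
As a rough sketch, the conversion flow looks as follows (pt_unet and the input shapes here are illustrative for an SD 1.5-based U-Net, not exact values from the notebook):

import torch
import openvino as ov

# `pt_unet` is an already instantiated PyTorch U-Net (illustrative name).
example_input = {
    "sample": torch.randn(2, 4, 64, 64),
    "timestep": torch.tensor(1),
    "encoder_hidden_states": torch.randn(2, 77, 768),
}
# Trace the PyTorch model and get an ov.Model object.
ov_unet = ov.convert_model(pt_unet, example_input=example_input)
ov.save_model(ov_unet, "unet.xml")  # weights are compressed to FP16 by default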

Processing time of the diffusion model

The diffusion pipeline requires multiple iterations to generate an image. Each iteration requires a non-negligible amount of time, depending on your inference device. We benchmarked the LCM-based Stable Diffusion pipeline on an Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz. The number of benchmark iterations was set to 10.

Benchmarking results:

  • Average latency: 6.54 seconds
  • Encoding phase (text encoding): 0.05 seconds
  • Denoising loop: 4.28 seconds (U-Net, 4 iterations: 4.27 seconds; scheduler: 0.01 seconds)
  • Decoding phase (VAE decoding): 2.21 seconds

The U-Net part of the denoising loop takes more than 60% of the full pipeline execution time. That is why the computational cost and speed of U-Net denoising become the critical path in the pipeline.

In this blog, we use the Neural Network Compression Framework (NNCF) Post-Training Quantization (PTQ) API to quantize the U-Net model, which further boosts inference performance while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but can lead to a substantial degradation of accuracy.

Quantization

The quantization process includes the following steps:

  1. Create a calibration dataset for the quantization.
  2. Run nncf.quantize to obtain a quantized model.
  3. Save the INT8 model using the ov.save_model function.

You can find the dataset preparation for the U-Net model in the OpenVINO LCM Notebook, and general rules about dataset preparation in the OpenVINO documentation.
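
As a rough sketch, calibration data can be collected by intercepting the inputs the U-Net receives during real pipeline runs (pipe and calibration_prompts below are illustrative names):

# Wrap the U-Net so that every input it sees during inference is recorded.
unet_calibration_data = []

class UNetInputCollector:
    def __init__(self, unet):
        self.unet = unet

    def __call__(self, *args, **kwargs):
        unet_calibration_data.append(args)  # store inputs for calibration
        return self.unet(*args, **kwargs)

pipe.unet = UNetInputCollector(pipe.unet)
for prompt in calibration_prompts:  # a set of diverse text prompts
    pipe(prompt, num_inference_steps=4)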

For INT8 quantization of LCM, we found several useful tricks to mitigate the accuracy degradation caused by accuracy-sensitive layers:

  • The U-Net part of the LCM pipeline has a backbone with a transformer that operates on latent patches. To better preserve accuracy after NNCF PTQ, pass model_type=nncf.ModelType.TRANSFORMER to the nncf.quantize function. It keeps several accuracy-sensitive layers in FP16 precision.
  • Default symmetric quantization of both weights and activations also leads to accuracy degradation of LCM. We recommend preset=nncf.QuantizationPreset.MIXED, which uses symmetric quantization for weights and asymmetric quantization for activations. Activations are more sensitive and impact the generation results more, so asymmetric quantization represents their values better and leads to better accuracy with no impact on inference latency.
  • It was also discovered that the Fast Bias (error) Correction algorithm (FBC), which is enabled by default in NNCF PTQ, results in unexpected artifacts in the generated images. To disable FBC, pass advanced_parameters=nncf.AdvancedQuantizationParameters(disable_bias_correction=True) to the nncf.quantize function.

Once the dataset is ready and the model object is instantiated, you can apply 8-bit quantization to it using the optimization workflow below:


import nncf
import openvino as ov

core = ov.Core()
# Load the FP16 U-Net IR produced at the conversion step.
unet = core.read_model(UNET_OV_PATH)
quantized_unet = nncf.quantize(
    model=unet,
    # Symmetric quantization for weights, asymmetric for activations.
    preset=nncf.QuantizationPreset.MIXED,
    calibration_dataset=nncf.Dataset(unet_calibration_data),
    # Keep accuracy-sensitive transformer layers in FP16.
    model_type=nncf.ModelType.TRANSFORMER,
    # Fast Bias Correction causes artifacts for LCM, so disable it.
    advanced_parameters=nncf.AdvancedQuantizationParameters(
        disable_bias_correction=True
    )
)
ov.save_model(quantized_unet, UNET_INT8_OV_PATH)
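
The quantized IR can then be compiled and substituted into the pipeline in place of the FP16 U-Net, for example (the device name below is illustrative):

# Compile the INT8 U-Net for the target device and plug it into the pipeline.
compiled_unet = core.compile_model(UNET_INT8_OV_PATH, "CPU")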

Text-to-image generation

The left image was generated using the original PyTorch LCM pipeline. The middle image was generated using the model converted to OpenVINO FP16. The right image was generated using LCM with the quantized INT8 U-Net. The input prompt is “a beautiful pink unicorn, 8k”, the seed is 1234567, and the number of inference steps is 4.

If you would like to generate your own images and compare the original and quantized models, you can run the interactive demo at the end of the OpenVINO LCM Notebook.

We also measured the image generation time of the LCM pipeline with the input prompt “a beautiful pink unicorn, 8k”, seed 1234567, and 4 inference steps.

*Average time across 3 independent runs.

The performance speedup of OpenVINO + NNCF over PyTorch is 1.38x.
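
A minimal sketch of how such a measurement can be taken (assuming a callable pipe as above):

import time

# Average generation time across 3 independent runs.
times = []
for _ in range(3):
    start = time.perf_counter()
    image = pipe("a beautiful pink unicorn, 8k", num_inference_steps=4)
    times.append(time.perf_counter() - start)
print(f"Average generation time: {sum(times) / len(times):.2f} s")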

Notices and Disclaimers:

Performance varies by use, configuration, and other factors. Learn more at www.intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure. Intel technologies may require enabled hardware, software or service activation.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Test Configuration: Intel® Core™ i9-10980XE Processor at 3.00GHz with DDR4 128 GB at 3600MHz, OS: Ubuntu 22.04.2 LTS. Tested with OpenVINO LCM Notebook.

The test was conducted by Intel on November 7, 2023.

Conclusion

In this blog, we showed how to enable and quantize the Latent Consistency Model with the OpenVINO™ runtime and NNCF:

  • The proposed NNCF INT8 PTQ improves the performance of the image generation pipeline while preserving generation quality.
  • The OpenVINO LCM Notebook covers model enabling, quantization, comparison of FP16 and INT8 model inference times, and deployment with OpenVINO™ and NNCF.

As the next step, you can consider migrating to the native OpenVINO C++ API for even faster pipeline inference and the possibility to embed it into a client or edge-device application. You can find an example of such a pipeline here.

Please give a star to NNCF and OpenVINO repositories if you find them useful.