OpenVINO GenAI GGUF Feature Update

Authors: Su Yang, Tianmeng Chen, Hongbo Zhao
This blog focuses on the OpenVINO GenAI GGUF Reader updates. Model links, usage, and technical details are covered in the previous blog: OpenVINO GenAI Supports GGUF Model.

1. New Model: Qwen3

The model graph now supports the Qwen3 architecture: the attn_q_norm and attn_k_norm weights are loaded and the corresponding rms_norm operations are added to the attention block.
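
As a rough illustration (not the GGUF Reader's actual code), the added rms_norm can be expressed with standard OpenVINO ops. The helper below is a minimal sketch of an RMS norm applied to the query/key projections, assuming FP32 activations and normalization over the last axis.

#include <memory>
#include <openvino/openvino.hpp>
#include <openvino/opsets/opset13.hpp>

// Hypothetical helper: sketch of the rms_norm that consumes the loaded
// attn_q_norm / attn_k_norm weights; not the GGUF Reader's implementation.
static std::shared_ptr<ov::Node> make_rms_norm(const ov::Output<ov::Node>& x,
                                               const ov::Output<ov::Node>& weight,
                                               float eps = 1e-6f) {
    using namespace ov::opset13;
    // mean(x^2) over the last axis
    auto square = std::make_shared<Power>(x, Constant::create(ov::element::f32, ov::Shape{}, {2.0f}));
    auto axis = Constant::create(ov::element::i64, ov::Shape{1}, {-1});
    auto mean = std::make_shared<ReduceMean>(square, axis, /*keep_dims=*/true);
    // x / sqrt(mean + eps), then scale by the norm weight
    auto rsqrt = std::make_shared<Power>(
        std::make_shared<Add>(mean, Constant::create(ov::element::f32, ov::Shape{}, {eps})),
        Constant::create(ov::element::f32, ov::Shape{}, {-0.5f}));
    auto normalized = std::make_shared<Multiply>(x, rsqrt);
    return std::make_shared<Multiply>(normalized, weight);
}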

Validated models on CPU/GPU: Qwen3-0.6B-f16, Qwen3-0.6B-Q8_0, Qwen3-4B-Q4_K_M

2. New Pipeline Capabilities: OV Model Serialization and Logging

The new enable_save_ov_model property enables serialization of the OV model generated from a GGUF model (including the tokenizer) in XML/BIN format to the same folder as the GGUF file for reuse. After OV IR serialization, the path to the GGUF-converted OV model can be used directly as the pipeline input.
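
As a minimal sketch (paths taken from the log below, error handling omitted), the first run converts the GGUF file and saves the OV model, and later runs can point directly at the folder containing the generated openvino_model.xml:

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // First run: GGUF file as input; the converted OV model and tokenizer are
    // serialized next to the GGUF file because enable_save_ov_model is set.
    ov::AnyMap config;
    config["enable_save_ov_model"] = true;
    ov::genai::LLMPipeline first_run("gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf", "CPU", config);

    // Subsequent runs: load the generated OV IR directly and skip GGUF unpacking.
    ov::genai::LLMPipeline next_run("gguf_models", "CPU");
    return 0;
}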

Set the OPENVINO_LOG_LEVEL environment variable to print the loading and serialization times. Example output:

[GGUF Reader]: Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
[GGUF Reader]: Loading and unpacking model done. Time: 196ms
[GGUF Reader]: Start generating OpenVINO model...
[GGUF Reader]: Save generated OpenVINO model to: gguf_models/openvino_model.xml done. Time: 466 ms
[GGUF Reader]: Model generation done. Time: 757ms

3. New Feature: GGUF Tokenizers and Detokenizers

The GGUF Reader now builds the tokenizer and detokenizer computation graphs directly from the GGUF file via the tokenizer node factory, so no separate tokenizer files are required.
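
A minimal sketch of exercising the GGUF-built tokenizer through the public Tokenizer API (model path reused from the log above; the prompt is illustrative):

#include <iostream>
#include <vector>
#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // The tokenizer and detokenizer graphs are created from the GGUF file itself,
    // so no separate tokenizer directory is passed here.
    ov::genai::LLMPipeline pipe("gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf", "CPU");
    ov::genai::Tokenizer tokenizer = pipe.get_tokenizer();

    // Encode a prompt and decode it back to verify the round trip.
    ov::genai::TokenizedInputs tokens = tokenizer.encode("Hello, OpenVINO GenAI!");
    int64_t* data = tokens.input_ids.data<int64_t>();
    std::vector<int64_t> ids(data, data + tokens.input_ids.get_size());
    std::cout << "token count: " << ids.size() << "\n";
    std::cout << "round trip: " << tokenizer.decode(ids) << "\n";
    return 0;
}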

Validated models: SmolLM2-135M.F16.gguf, DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf, Llama-3.2-1B-Instruct-Q4_K_M.gguf, qwen2.5-0.5b-instruct-q4_0.gguf.

If generation anomalies occur, downloading the OV tokenizer IR and using it to replace the GGUF-converted tokenizer/detokenizer IR can serve as a workaround.

4. Enhanced Support for GPU Plugin

The zero-point array modification workaround for Q4_0 weights has been removed, and dynamic quantization is now supported for GGUF-converted models on GPU.
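
Plugin properties pass through the pipeline configuration (as with ov::cache_dir in the sample below), so dynamic quantization can be tuned the same way as for regular OV IR. The sketch below assumes the standard ov::hint::dynamic_quantization_group_size property; the group size of 32 is only illustrative.

#include "openvino/genai/llm_pipeline.hpp"
#include "openvino/runtime/properties.hpp"

int main() {
    // GPU plugin properties are forwarded from the pipeline config to compile_model.
    ov::AnyMap config;
    config.insert(ov::hint::dynamic_quantization_group_size(32));  // illustrative value
    ov::genai::LLMPipeline pipe("gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf", "GPU", config);
    return 0;
}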

Note: llama.cpp's GGUF quantization block size differs from OpenVINO NNCF's weight-compression group size, so some deviation in generation results is expected. For better performance, an NNCF-quantized OV model IR is recommended, because the GGUF Reader still needs to dequantize the Q6_K tensors in Q4_0/Q4_K_M models to FP16.

Sample Code Update:

Use the GGUF file as the only input and save the converted model via enable_save_ov_model.

Modify the file samples/cpp/text_generation/greedy_causal_lm.cpp in the OV GenAI package (validated with the OV 25.3 nightly build, 20250727).

// Copyright (C) 2023-2025 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include "openvino/genai/llm_pipeline.hpp"

int main(int argc, char* argv[]) try {

    if (2 > argc)
        throw std::runtime_error(std::string{"Usage: "} + argv[0] + " GGUF_PATH");

    std::string gguf_path = argv[1];
    // std::string tokenizer_path = argv[2];
    std::string prompt;
    std::string device = "GPU";  // CPU can be used as well
    ov::AnyMap pipe_config = {};
    // Serialize the GGUF-converted OV model (and tokenizer) next to the GGUF file for reuse
    pipe_config["enable_save_ov_model"] = true;
    // Cache compiled blobs in "llm_cache" to speed up subsequent runs
    pipe_config.insert({ov::cache_dir("llm_cache")});

    // ov::genai::Tokenizer tokenizer(tokenizer_path);
    ov::genai::LLMPipeline pipe(gguf_path, device, pipe_config);

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 90;
    size_t num_warmup = 1;
    size_t num_iter = 3;

    auto streamer = [](std::string word) {
        std::cout << word << std::flush;
        // The returned status signals whether generation should continue or stop.
        return ov::genai::StreamingStatus::RUNNING;
    };

    std::cout << "question:\n";
    std::getline(std::cin, prompt);
    pipe.generate(prompt, config, streamer);
    std::cout << "\n----------\n"
                 "generation finished\n";

    std::cout << "start benchmarking: once warmup and run 3 interations\n";

    for (size_t i = 0; i < num_warmup; i++)
        pipe.generate(prompt, config);

    ov::genai::DecodedResults res = pipe.generate(prompt, config);
    ov::genai::PerfMetrics metrics = res.perf_metrics;

    for (size_t i = 0; i < num_iter - 1; i++) {
        res = pipe.generate(prompt, config);
        metrics = metrics + res.perf_metrics;
    }
    std::cout << std::fixed << std::setprecision(2);
    std::cout << "Load time: " << metrics.get_load_time() << " ms" << std::endl;
    std::cout << "Generate time: " << metrics.get_generate_duration().mean << " ± " << metrics.get_generate_duration().std << " ms" << std::endl;
    std::cout << "Tokenization time: " << metrics.get_tokenization_duration().mean << " ± " << metrics.get_tokenization_duration().std << " ms" << std::endl;
    std::cout << "Detokenization time: " << metrics.get_detokenization_duration().mean << " ± " << metrics.get_detokenization_duration().std << " ms" << std::endl;
    std::cout << "TTFT: " << metrics.get_ttft().mean  << " ± " << metrics.get_ttft().std << " ms" << std::endl;
    std::cout << "TPOT: " << metrics.get_tpot().mean  << " ± " << metrics.get_tpot().std << " ms/token " << std::endl;
    std::cout << "Throughput: " << metrics.get_throughput().mean  << " ± " << metrics.get_throughput().std << " tokens/s" << std::endl;

} catch (const std::exception& error) {

    try {
        std::cerr << error.what() << '\n';
    } catch (const std::ios_base::failure&) {}

    return EXIT_FAILURE;

} catch (...) {

    try {
        std::cerr << "Non-exception object thrown\n";
    } catch (const std::ios_base::failure&) {}

    return EXIT_FAILURE;
}

Running with GGUF:

For the first run, pass the GGUF file as the only input; for subsequent runs, pass the path to the GGUF-converted OV model instead.
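
For example, assuming the modified sample builds as greedy_causal_lm and the GGUF file sits in gguf_models (paths illustrative):

./greedy_causal_lm gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf   # first run: converts GGUF and saves openvino_model.xml/bin
./greedy_causal_lm gguf_models                                   # later runs: loads the saved OV IR directly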

GIF: GGUF_READER_DEMO_LNL_U9_288V