Enable tokenize and detokenize by creating OpenVINO™ model and CPP runtime pipeline

No items found.

Authors Wenyi Zou, Su Yang


Tokenization is the process of breaking down text into smaller units, such as words or subwords, known as tokens. These tokens are the building blocks for various NLP tasks, including text classification, sentiment analysis, machine translation, and more. Tokenization makes it easier for machines to understand and work with textual data by converting unstructured text into a structured format.

This article will demonstrate a C++ application of the tokenize and detokenize model with Intel’s OpenVINO™ C++ API on Linux/Windows systems.

Model Conversion

The OpenVINO™ Tokenizers project has a tool to convert a HuggingFace tokenizer into OpenVINO™ IR model tokenizer and detokenizer: it provides the convert_tokenizer function that accepts a tokenizer.

Install system dependency and setup environment

Step1: Download code

git clone https://github.com/openvinotoolkit/openvino.genai.git
cd openvino.genai
git submodule update --init

Step2: Install OpenVINO™ Install OpenVINO™ Archives >=2023.3. <INSTALL_DIR> below refers to the extraction location.

For Linux

source <INSTALL_DIR>/setupvars.sh

For Windows


Step3: Create python environment

conda create -n ov_genai python=3.10
cd text_generation/causal_lm/cpp
python -m pip install --upgrade-strategy eager "optimum>=1.14" -r ../../../llm_bench/python/requirements.txt ../../../thirdparty/openvino_contrib/modules/custom_operations/[transformers] --extra-index-url https://download.pytorch.org/whl/cpu

step4: Convert tokenizer to OpenVINO™ IR.

Take the tokenize and detokenize of the chatglm3-6b as an example.

convert_tokenizer ./chatglm3-6b/ --output ./chatglm3-6b/ov/ --with-detokenizer --trust-remote-code

Build custom OpenVINO operation extension library

cd thirdparty/openvino_contrib/modules/custom_operations
mkdir build && cd build
cmake ../ -DCMAKE_BUILD_TYPE=Release
cmake --build . --parallel 4

Load and use custom OpenVINO operation extension library

#include <openvino/openvino.hpp>
namespace {
std::pair<ov::Tensor, ov::Tensor> tokenize(ov::InferRequest& tokenizer, std::string&& prompt) {
    constexpr size_t BATCH_SIZE = 1;
    tokenizer.set_input_tensor(ov::Tensor{ov::element::string, {BATCH_SIZE}, &prompt});
    return {tokenizer.get_tensor("input_ids"), tokenizer.get_tensor("attention_mask")};

std::string detokenize(ov::InferRequest& detokenizer, std::vector<int64_t>& tokens) {
    constexpr size_t BATCH_SIZE = 1;
    detokenizer.set_input_tensor(ov::Tensor{ov::element::i64, {BATCH_SIZE, tokens.size()}, tokens.data()});
    return detokenizer.get_output_tensor().data<std::string>()[0];

int main(int argc, char* argv[]) try {
    if (argc != 3) {
        throw std::runtime_error(std::string{"Usage: "} + argv[0] + " <MODEL_DIR> '<PROMPT>'");

    ov::Core core;

#ifdef _WIN32

    std::cout << "promt " << argv[2] << std::endl;

    // tokenizer and detokenizer work on CPU only
    ov::InferRequest tokenizer = core.compile_model(
        std::string{argv[1]} + "/openvino_tokenizer.xml", "CPU").create_infer_request();
    auto [input_ids, attention_mask] = tokenize(tokenizer, argv[2]);
    ov::InferRequest detokenizer = core.compile_model(
        std::string{argv[1]} + "/openvino_detokenizer.xml", "CPU").create_infer_request();
    std::cout << "input lenghth " << input_ids.get_size() << std::endl;

    std::vector<int64_t> tokens;
    for (size_t idx = 0; idx < input_ids.get_size(); ++idx) {
    std::cout << detokenize(detokenizer, tokens) << std::endl;
} catch (const std::exception& error) {
    std::cerr << error.what() << '\n';
    return EXIT_FAILURE;
} catch (...) {
    std::cerr << "Non-exception object thrown\n";
    return EXIT_FAILURE;