Authors: Wenyi Zou, Su Yang
Tokenization is the process of breaking down text into smaller units, such as words or subwords, known as tokens. These tokens are the building blocks for various NLP tasks, including text classification, sentiment analysis, machine translation, and more. Tokenization makes it easier for machines to understand and work with textual data by converting unstructured text into a structured format.
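To make the idea concrete, here is a minimal sketch of a tokenize/detokenize round trip. It is a toy: the `ToyTokenizer` class is hypothetical, splits on whitespace only, and builds its vocabulary on the fly, whereas real tokenizers such as the one used later in this article rely on a fixed subword vocabulary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy tokenizer: one token per whitespace-separated word.
class ToyTokenizer {
public:
    // Split the text on whitespace and return one id per word.
    // Unknown words are added to the vocabulary with the next free id.
    std::vector<int64_t> encode(const std::string& text) {
        std::vector<int64_t> ids;
        std::istringstream stream(text);
        std::string word;
        while (stream >> word) {
            auto it = word_to_id_.find(word);
            if (it == word_to_id_.end()) {
                const auto new_id = static_cast<int64_t>(id_to_word_.size());
                it = word_to_id_.emplace(word, new_id).first;
                id_to_word_.push_back(word);
            }
            ids.push_back(it->second);
        }
        return ids;
    }

    // Map ids back to words and join them with single spaces.
    std::string decode(const std::vector<int64_t>& ids) const {
        std::string text;
        for (std::size_t i = 0; i < ids.size(); ++i) {
            assert(ids[i] >= 0 && static_cast<std::size_t>(ids[i]) < id_to_word_.size());
            if (i != 0) {
                text += ' ';
            }
            text += id_to_word_[static_cast<std::size_t>(ids[i])];
        }
        return text;
    }

private:
    std::unordered_map<std::string, int64_t> word_to_id_;
    std::vector<std::string> id_to_word_;
};
```

Encoding "the cat saw the dog" with a fresh `ToyTokenizer` gives the ids 0, 1, 2, 0, 3, since the repeated word reuses its id, and decoding those ids returns the original sentence. The round trip is exact here only because the input uses single spaces; a production tokenizer handles whitespace, casing, and unknown words with its own rules.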
This article demonstrates a C++ application that runs tokenizer and detokenizer models with Intel’s OpenVINO™ C++ API on Linux and Windows systems.
Model Conversion
The OpenVINO™ Tokenizers project includes a tool that converts a HuggingFace tokenizer into OpenVINO™ IR tokenizer and detokenizer models. It provides the convert_tokenizer function, which accepts a tokenizer.
Install system dependencies and set up the environment
Step1: Download the code
git clone
cd openvino.genai
git submodule update --init
Step2: Install OpenVINO™
Install OpenVINO™ Archives >=2023.3. <INSTALL_DIR> below refers to the extraction location.
For Linux
source <INSTALL_DIR>/
For Windows
Step3: Create a Python environment
conda create -n ov_genai python=3.10
cd text_generation/causal_lm/cpp
python -m pip install --upgrade-strategy eager "optimum>=1.14" -r ../../../llm_bench/python/requirements.txt ../../../thirdparty/openvino_contrib/modules/custom_operations/[transformers] --extra-index-url
Step4: Convert the tokenizer to OpenVINO™ IR
Take the tokenizer and detokenizer of chatglm3-6b as an example.
convert_tokenizer ./chatglm3-6b/ --output ./chatglm3-6b/ov/ --with-detokenizer --trust-remote-code
Build custom OpenVINO operation extension library
cd thirdparty/openvino_contrib/modules/custom_operations
mkdir build && cd build
cmake ../ -DCMAKE_BUILD_TYPE=Release
cmake --build . --parallel 4
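The build produces a shared library whose file name depends on the platform, and that name is what the application passes to ov::Core::add_extension. The helper below is a small sketch of the usual naming convention on the two platforms this article covers. `shared_library_filename` is a hypothetical helper, and the target name `user_ov_extensions` used in the example call is an assumption about what the custom_operations build emits; check your build directory for the actual file.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: conventional shared-library file name for a build target.
// Windows uses "<name>.dll"; Linux uses "lib<name>.so".
std::string shared_library_filename(const std::string& target_name) {
    assert(!target_name.empty());
#ifdef _WIN32
    return target_name + ".dll";
#else
    return "lib" + target_name + ".so";
#endif
}
```

For example, `shared_library_filename("user_ov_extensions")` yields `libuser_ov_extensions.so` on Linux and `user_ov_extensions.dll` on Windows. The result is only a file name, so the application still needs the directory of the build output, either as a prefix or through the system library search path.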
Load and use custom OpenVINO operation extension library
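#include <openvino/openvino.hpp>

#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

namespace {
// Run the tokenizer model on one prompt and return its input_ids and attention_mask tensors.
std::pair<ov::Tensor, ov::Tensor> tokenize(ov::InferRequest& tokenizer, std::string&& prompt) {
    constexpr size_t BATCH_SIZE = 1;
    tokenizer.set_input_tensor(ov::Tensor{ov::element::string, {BATCH_SIZE}, &prompt});
    tokenizer.infer();
    return {tokenizer.get_tensor("input_ids"), tokenizer.get_tensor("attention_mask")};
}

// Run the detokenizer model on a sequence of token ids and return the decoded text.
std::string detokenize(ov::InferRequest& detokenizer, std::vector<int64_t>& tokens) {
    constexpr size_t BATCH_SIZE = 1;
    detokenizer.set_input_tensor(ov::Tensor{ov::element::i64, {BATCH_SIZE, tokens.size()}, tokens.data()});
    detokenizer.infer();
    return detokenizer.get_output_tensor().data<std::string>()[0];
}
}  // namespace

int main(int argc, char* argv[]) try {
    if (argc != 3) {
        throw std::runtime_error(std::string{"Usage: "} + argv[0] + " <MODEL_DIR> '<PROMPT>'");
    }
    ov::Core core;
    // Register the custom operations before compiling the models.
    // The library file names below are assumed from the custom_operations build; adjust the path to your build output.
#ifdef _WIN32
    core.add_extension("user_ov_extensions.dll");
#else
    core.add_extension("libuser_ov_extensions.so");
#endif
    std::cout << "prompt " << argv[2] << std::endl;
    // tokenizer and detokenizer work on CPU only
    ov::InferRequest tokenizer = core.compile_model(
        std::string{argv[1]} + "/openvino_tokenizer.xml", "CPU").create_infer_request();
    auto [input_ids, attention_mask] = tokenize(tokenizer, argv[2]);
    ov::InferRequest detokenizer = core.compile_model(
        std::string{argv[1]} + "/openvino_detokenizer.xml", "CPU").create_infer_request();
    std::cout << "input length " << input_ids.get_size() << std::endl;
    // Copy the token ids out of the tensor, then decode them back to text.
    std::vector<int64_t> tokens;
    for (size_t idx = 0; idx < input_ids.get_size(); ++idx) {
        tokens.push_back(input_ids.data<int64_t>()[idx]);
    }
    std::cout << detokenize(detokenizer, tokens) << std::endl;
} catch (const std::exception& error) {
    std::cerr << error.what() << '\n';
    return EXIT_FAILURE;
} catch (...) {
    std::cerr << "Non-exception object thrown\n";
    return EXIT_FAILURE;
}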
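The tokenizer returns two tensors, input_ids and attention_mask. With a batch size of 1, as in the application above, there is nothing to pad and the mask is all ones. The sketch below illustrates what the two buffers typically look like for a larger batch, using plain vectors in a row-major [batch, seq_len] layout. It is not part of the OpenVINO™ API: `PaddedBatch` and `pad_batch` are hypothetical names, and right-side padding with a caller-supplied pad id is an assumption, since real tokenizers choose their own padding side and pad token.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for a padded batch, stored row-major as [batch, seq_len].
struct PaddedBatch {
    std::size_t batch = 0;
    std::size_t seq_len = 0;
    std::vector<int64_t> input_ids;       // token ids, padded with pad_id
    std::vector<int64_t> attention_mask;  // 1 for real tokens, 0 for padding
};

// Pad every sequence on the right to the length of the longest one
// and mark real tokens in the attention mask.
PaddedBatch pad_batch(const std::vector<std::vector<int64_t>>& sequences, int64_t pad_id) {
    PaddedBatch out;
    out.batch = sequences.size();
    for (const auto& seq : sequences) {
        out.seq_len = std::max(out.seq_len, seq.size());
    }
    out.input_ids.assign(out.batch * out.seq_len, pad_id);
    out.attention_mask.assign(out.batch * out.seq_len, 0);
    for (std::size_t row = 0; row < out.batch; ++row) {
        assert(sequences[row].size() <= out.seq_len);
        for (std::size_t col = 0; col < sequences[row].size(); ++col) {
            out.input_ids[row * out.seq_len + col] = sequences[row][col];
            out.attention_mask[row * out.seq_len + col] = 1;
        }
    }
    return out;
}
```

Padding the sequences {5, 6, 7} and {8} with pad id 0 gives input_ids 5, 6, 7, 8, 0, 0 and attention_mask 1, 1, 1, 1, 0, 0, both with shape [2, 3]. A downstream model uses the mask to ignore the padded positions.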