Deploying DeepSeek-R1 with OpenVINO Model Server (OVMS) on Bare Metal Windows AIPC

Authors: Kunda Xu, Rafal A. Sapala

DeepSeek-R1 is an open-source reasoning model developed by DeepSeek to address tasks requiring logical inference, mathematical problem-solving, and real-time decision-making. With DeepSeek-R1, you can follow its logic, making it easier to understand and, if necessary, challenge its output. This capability gives reasoning models an edge in fields where outcomes need to be explainable, such as research or complex decision-making.

Distillation in AI creates smaller, more efficient models from larger ones, preserving much of their reasoning power while reducing computational demands. DeepSeek applied this technique to create a suite of distilled models from R1, using Qwen and Llama architectures. This lets us try DeepSeek-R1's capabilities locally on ordinary laptops (AIPCs).

In this tutorial, we deploy deepseek-ai/DeepSeek-R1-Distill-Qwen-7B as a model server on an Intel AIPC or AI workstation running Windows and use it to serve generation requests.

Requirements:

QuickStart Guide

Step 1. Install python dependencies for the conversion script:

pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt

Step 2. Run the export script (which uses optimum-cli under the hood) to download and quantize the model:

curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py
mkdir models
python export_model.py text_generation --source_model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --weight-format int4 --config_file_path models/config.json --model_repository_path models --target_device GPU --cache 2
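The command above targets the GPU. If you prefer to run the model on the CPU (for example, when GPU drivers are not set up), the same script accepts a different --target_device; this is a hedged variant of the command above, with only the device changed:

python export_model.py text_generation --source_model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --weight-format int4 --config_file_path models/config.json --model_repository_path models --target_device CPU --cache 2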

If your network access to Hugging Face is unstable, you can try pulling the model through a mirror endpoint:

set HF_ENDPOINT=https://hf-mirror.com
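The line above is for the classic Windows Command Line. If you run the export from PowerShell instead, the equivalent environment variable assignment (PowerShell syntax, mirroring the command above) is:

# Windows PowerShell
$env:HF_ENDPOINT = "https://hf-mirror.com"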

Step 3. Deploy Model Server (OVMS) on bare metal

Download and unpack the model server archive for Windows:

curl -L -o ovms_windows.zip https://github.com/openvinotoolkit/model_server/releases/download/v2025.0/ovms_windows.zip
tar -xf ovms_windows.zip

Run the setupvars script to set the required environment variables:

# Windows Command Line
.\ovms\setupvars.bat

# Windows PowerShell
.\ovms\setupvars.ps1
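As a quick sanity check that the binary is now reachable in the current shell, you can print the server version (assuming the standard OVMS command-line interface):

ovms --version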

Step 4. Deploy the DeepSeek-R1 model server

Bare metal host deployment. Required: OVMS deployed on bare metal (Step 3).

ovms --rest_port 8000 --config_path ./models/config.json

OpenVINO + OVMS can also be deployed in a Docker container. Required: Docker Engine installed.

docker run -d --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render*) --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json

When using Docker as the deployment method, you need to consider whether the hardware resources of the machine are sufficient, because the Docker container also introduces additional memory overhead.

For example, when deploying on a laptop or AIPC with limited memory, the bare metal deployment method is more reasonable.

Step 5. Check readiness

Wait for the model to load. You can check the status with a simple command:

curl http://localhost:8000/v1/config

Fig 1. Model server status info
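The figure above shows the model server status. A model that has finished loading is typically reported with state AVAILABLE; the exact payload depends on the OVMS version, but the response is roughly of this shape (illustrative sketch, not verbatim output):

{
  "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B": {
    "model_version_status": [
      { "version": "1", "state": "AVAILABLE", "status": { "error_code": "OK", "error_message": "OK" } }
    ]
  }
}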

Step 6. Run a model server generation request

Create a file called request.json and copy the following content into it:

{
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
  "max_tokens": 30,
  "temperature": 0,
  "stream": false,
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What are the 3 main tourist attractions in Paris?" }
  ]
}

Then send the request to the chat completions endpoint:

curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d @request.json | jq .

You will get output like the following.

Fig 2. Model server request response

Note: If you want the response chunks streamed back as they are generated, change the stream parameter in the request to true, as in the sketch below.
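For example, a streaming variant of the request could look like this (the file name request_stream.json is just an illustration; curl's -N flag disables output buffering so chunks are printed as they arrive):

{
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
  "max_tokens": 30,
  "temperature": 0,
  "stream": true,
  "messages": [
    { "role": "user", "content": "What are the 3 main tourist attractions in Paris?" }
  ]
}

curl -s -N http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d @request_stream.json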