Run SLMs on Snapdragon devices with NPUs

Learn how to run SLMs on Snapdragon devices with ONNX Runtime.

Models

The following models are currently supported:

Phi-3.5 mini instruct
Llama 3.2 3B

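Devices with Snapdragon NPUs require models in a specific size and format.

Instructions for generating models in this format can be found in Build models for Snapdragon.

Once you have built or downloaded a model, place the model assets in a known location. These assets include: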

genai_config.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json
quantizer.onnx
dequantizer.onnx
position-processor.onnx
A set of transformer model binaries
- Qualcomm context binaries (*.bin)
- Context binary metadata (*.json)
- ONNX wrapper models (*.onnx)
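
Before running anything, it can help to confirm that the model folder is complete. The following is a minimal sketch and is not part of onnxruntime-genai: the helper name `missing_assets` and the rule that at least one `*.bin` file must be present are assumptions made for illustration, while the fixed file names come from the list above.

```python
from pathlib import Path

# File names taken from the asset list above.
REQUIRED_FILES = [
    "genai_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "quantizer.onnx",
    "dequantizer.onnx",
    "position-processor.onnx",
]


def missing_assets(model_dir):
    """Return the expected assets that are absent from model_dir.

    Hypothetical helper, for illustration only.
    """
    folder = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
    # Assumption: a usable folder holds at least one Qualcomm context binary.
    if not any(folder.glob("*.bin")):
        missing.append("*.bin (Qualcomm context binaries)")
    return missing
```

Calling `missing_assets("models/Phi-3.5-mini-instruct")` returns an empty list when every listed file is present, and otherwise names what is missing.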

Python application

If your device has Python installed, you can run a simple question-answering script to query the model.

Install the runtime

pip install onnxruntime-genai

Download the script

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-qa.py -o model-qa.py

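Run the script

This script assumes that the model assets are in a folder named models\Phi-3.5-mini-instruct. Adjust the path passed to -m so that it points to that folder from your current directory.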

python .\model-qa.py -e cpu -g -v --system_prompt "You are a helpful assistant. Be brief and concise." --chat_template "<|user|>\n{input} <|end|>\n<|assistant|>" -m ..\..\models\Phi-3.5-mini-instruct
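
The `--chat_template` value contains an `{input}` placeholder that is filled with each question. As a rough sketch of how such a template wraps a question, assuming plain `str.format` substitution (the helper `apply_chat_template` is a hypothetical name, not a function from the script):

```python
# Template copied from the command above.
CHAT_TEMPLATE = "<|user|>\n{input} <|end|>\n<|assistant|>"


def apply_chat_template(template, user_input):
    """Insert the user's question into the template (illustrative helper)."""
    if "{input}" not in template:
        raise ValueError("chat template must contain an {input} placeholder")
    return template.format(input=user_input)


prompt = apply_chat_template(CHAT_TEMPLATE, "What is an NPU?")
print(prompt)
```

The resulting string, not the raw question, is what gets tokenized and sent to the model.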

A look inside the Python script

The complete Python script is published here: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/model-qa.py. The script uses the API in the following standard way:

Load the model
```
model = og.Model(config)
```
This loads the model into memory.

Create the tokenizer and tokenize the system prompt

 tokenizer = og.Tokenizer(model)
 tokenizer_stream = tokenizer.create_stream()

 # Optional
 system_tokens = tokenizer.encode(system_prompt)

This creates a tokenizer and a tokenizer stream, which allows tokens to be returned to the user as they are generated.
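
A streaming decoder is useful because a single token may not map to one complete character. The sketch below is only an analogy built on Python's standard `codecs` incremental decoder. It does not show how `tokenizer_stream` is implemented, and the byte chunks standing in for tokens are made up.

```python
import codecs

# Made-up "tokens": the UTF-8 bytes of "héllo", split so that the two-byte
# character "é" (0xC3 0xA9) straddles a chunk boundary.
chunks = [b"h\xc3", b"\xa9l", b"lo"]

decoder = codecs.getincrementaldecoder("utf-8")()
pieces = []
for chunk in chunks:
    # decode() returns only the text that is complete so far and keeps any
    # trailing partial character buffered until the next call.
    text = decoder.decode(chunk)
    pieces.append(text)
    print(repr(text))

full_text = "".join(pieces)
```

Decoding each chunk independently would fail on the split character, which is why output is decoded through a stateful stream rather than one token at a time.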

Interactive input loop

while True:
    # Read prompt
    # Run the generation, streaming the output tokens
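
The outline above can be fleshed out roughly as follows. This is a sketch rather than the script's actual code: `chat_loop`, `read_prompt`, and `answer` are hypothetical names, ending on an empty line is a choice made for this sketch, and the generation step is passed in as a callable so the loop can be run without a model.

```python
def chat_loop(read_prompt, answer, write=print):
    """Answer prompts until input ends or an empty line is entered.

    read_prompt: callable returning the next prompt string (for example input).
    answer: callable that takes a prompt and yields decoded text pieces.
    Returns the number of prompts answered.
    """
    turns = 0
    while True:
        try:
            prompt = read_prompt()
        except (EOFError, KeyboardInterrupt):
            break
        if not prompt.strip():
            break
        # Stream each piece as soon as it is available.
        for piece in answer(prompt):
            write(piece, end="", flush=True)
        write()
        turns += 1
    return turns


# Scripted prompts and an upper-casing stand-in for the model.
prompts = iter(["hello", "what is an NPU?", ""])
turns = chat_loop(lambda: next(prompts), lambda p: [p.upper()])
```

In an interactive session, `read_prompt` could be `lambda: input("Prompt: ")` and `answer` a function that runs the generation loop described next.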

Generation loop

# 1. Pre-process the prompt into tokens
input_tokens = tokenizer.encode(prompt)

# 2. Create parameters and generator (KV cache etc) and process the prompt
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)
# Append the two token arrays separately: tokenizer.encode returns NumPy
# arrays, and `+` on arrays adds element-wise instead of concatenating.
generator.append_tokens(system_tokens)
generator.append_tokens(input_tokens)

# 3. Loop until all output tokens are generated, printing
# out the decoded token
while not generator.is_done():
    generator.generate_next_token()

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)

print()

# Delete the generator to free the captured graph before creating another one
del generator
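
One way to package step 3 is as a Python generator function, so that callers can consume text as it arrives. In this sketch `stream_tokens` and `FakeGenerator` are hypothetical: the stub only mimics the three methods used above (`is_done`, `generate_next_token`, `get_next_tokens`) so the loop can be exercised without a model or an NPU.

```python
def stream_tokens(generator, decode):
    """Yield decoded text for each new token until generation finishes."""
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        yield decode(new_token)


class FakeGenerator:
    """Stand-in that replays a fixed token list; not part of onnxruntime-genai."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._index = -1  # no token produced yet

    def is_done(self):
        return self._index >= len(self._tokens) - 1

    def generate_next_token(self):
        self._index += 1

    def get_next_tokens(self):
        # One-element list, mirroring the [0] indexing used above.
        return [self._tokens[self._index]]


# Toy vocabulary standing in for tokenizer_stream.decode.
vocab = {0: "Hello", 1: ",", 2: " world"}
text = "".join(stream_tokens(FakeGenerator([0, 1, 2]), vocab.get))
print(text)
```

With the real objects, the equivalent call would be `stream_tokens(generator, tokenizer_stream.decode)`.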

C++ application

To run models on the Snapdragon NPU from a C++ application, use the code here: https://github.com/microsoft/onnxruntime-genai/tree/main/examples/c.

Building and running this application requires a Windows PC with a Snapdragon NPU, as well as:

cmake
Visual Studio 2022

Clone the repository

   git clone https://github.com/microsoft/onnxruntime-genai
   cd onnxruntime-genai\examples\c

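Install onnxruntime

A nightly build of onnxruntime is currently required, because QNN support for language models is still changing from day to day.

Download the nightly version of the ONNX Runtime QNN binaries from here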

   mkdir onnxruntime-win-arm64-qnn
   move Microsoft.ML.OnnxRuntime.QNN.1.22.0-dev-20250225-0548-e46c0d8.nupkg onnxruntime-win-arm64-qnn
   cd onnxruntime-win-arm64-qnn
   tar xvzf Microsoft.ML.OnnxRuntime.QNN.1.22.0-dev-20250225-0548-e46c0d8.nupkg
   copy runtimes\win-arm64\native\* ..\lib
   cd ..

Install onnxruntime-genai

   curl -L https://github.com/microsoft/onnxruntime-genai/releases/download/v0.6.0/onnxruntime-genai-0.6.0-win-arm64.zip -o onnxruntime-genai-win-arm64.zip
   tar xvf onnxruntime-genai-win-arm64.zip
   cd onnxruntime-genai-0.6.0-win-arm64
   copy include\* ..\include
   copy lib\* ..\lib
   cd ..
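
Before building, it may be worth confirming that the copy steps populated the folders the build reads from. The sketch below assumes the example's `include` and `lib` folders sit in the directory it is run from. The function name `check_layout` and the file extensions it looks for are illustrative assumptions, not part of the example's build.

```python
from pathlib import Path


def check_layout(root="."):
    """Report whether the include and lib folders under root look populated.

    Assumption for this sketch: headers end in .h, libraries in .dll or .lib.
    """
    base = Path(root)
    return {
        "include": any((base / "include").glob("*.h")),
        "lib": any(
            p.suffix.lower() in {".dll", ".lib"}
            for p in (base / "lib").glob("*")
        ),
    }


for folder, ok in check_layout().items():
    print(f"{folder}: {'ok' if ok else 'empty or missing'}")
```

If either folder is reported as empty or missing, repeat the corresponding copy step before running cmake.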

Build the example

   cmake -A arm64 -S . -B build -DPHI3-QA=ON
   cd build
   cmake --build . --config Release

Run the example

   cd Release
   .\phi3_qa.exe <path_to_model>

A look inside the C++ example

The C++ application is published here: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/c/src/phi3_qa.cpp. The application uses the API in the following standard way:

Load the model
```
auto model = OgaModel::Create(*config);
```
This loads the model into memory.

Create the tokenizer

auto tokenizer = OgaTokenizer::Create(*model);
auto tokenizer_stream = OgaTokenizerStream::Create(*tokenizer);

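This creates a tokenizer and a tokenizer stream, which allows tokens to be returned to the user as they are generated.

Interactive input loop

while (true) {
    // Read prompt
    // Run the generation, streaming the output tokens
}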

Generation loop

// 1. Pre-process the prompt into tokens
auto sequences = OgaSequences::Create();
tokenizer->Encode(prompt.c_str(), *sequences);

// 2. Create parameters and generator (KV cache etc) and process the prompt
auto params = OgaGeneratorParams::Create(*model);
params->SetSearchOption("max_length", 1024);
auto generator = OgaGenerator::Create(*model, *params);
generator->AppendTokenSequences(*sequences);

// 3. Loop until all output tokens are generated, printing
// out the decoded token
// (is_first_token and timing are defined outside this excerpt)
while (!generator->IsDone()) {
  generator->GenerateNextToken();

  if (is_first_token) {
    timing.RecordFirstTokenTimestamp();
    is_first_token = false;
  }

  // The newest token is the last element of sequence 0
  const auto num_tokens = generator->GetSequenceCount(0);
  const auto new_token = generator->GetSequenceData(0)[num_tokens - 1];
  std::cout << tokenizer_stream->Decode(new_token) << std::flush;
}