使用 ONNX Runtime generate() API 執行 Phi-3 語言模型

簡介

Phi-3 和 Phi 3.5 ONNX 模型託管在 HuggingFace 上，您可以使用 ONNX Runtime 的 generate() API 執行它們。

目前提供並支援 mini (3.3B) 和 medium (14B) 版本。 mini 和 medium 版本都具有短上下文 (4k) 版本和長上下文 (128k) 版本。長上下文版本可以接受更長的提示並生成更長的輸出文字，但會消耗更多記憶體。

可用模型為

本教程演示瞭如何下載並執行 Phi 3 模型的短上下文 (4k) mini (3B) 變體。有關其他變體的下載命令，請參閱模型參考。

本教程下載並執行短上下文 (4k) mini (3B) 模型變體。有關其他變體的下載命令，請參閱模型參考。

設定
選擇您的平臺
使用 DirectML 執行
使用 NVIDIA CUDA 執行
在 CPU 上執行
Phi-3 ONNX 模型參考

設定

安裝 git 大型檔案系統擴充套件

HuggingFace 使用 git 進行版本控制。要下載 ONNX 模型，您需要安裝 git lfs，如果尚未安裝。
- Windows: winget install -e --id GitHub.GitLFS (如果您沒有 winget，請從官方來源下載並執行 exe)
- Linux: apt-get install git-lfs
- MacOS: brew install git-lfs
然後執行 git lfs install
安裝 HuggingFace CLI
```
pip install huggingface-hub[cli]
```

選擇您的平臺

您使用的是帶 GPU 的 Windows 機器嗎？

我不知道 → 檢視本指南，瞭解您的 Windows 機器中是否有 GPU，並確認您的 GPU 支援 DirectML。
是 → 按照DirectML的說明進行操作。
否 → 您有 NVIDIA GPU 嗎？
- 我不知道 → 檢視本指南，瞭解您是否有支援 CUDA 的 GPU。
- 是 → 按照NVIDIA CUDA GPU的說明進行操作。
- 否 → 按照CPU的說明進行操作。

注意：根據您的硬體，只需一個包和一個模型。也就是說，只執行以下部分中的一個步驟。

使用 DirectML 執行

下載模型

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include directml/* --local-dir .

此命令將模型下載到名為 directml 的資料夾中。

安裝 generate() API
```
pip install --pre onnxruntime-genai-directml
```
您現在應該在 pip list 中看到 onnxruntime-genai-directml。

執行模型

使用 phi3-qa.py 執行模型。

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m directml\directml-int4-awq-block-128 -e dml

指令碼載入模型後，它會迴圈詢問您的輸入，並流式傳輸模型生成的輸出。例如

Input: Tell me a joke about GPUs

Certainly! Here\'s a light-hearted joke about GPUs:

Why did the GPU go to school? Because it wanted to improve its "processing power"!

This joke plays on the double meaning of "processing power," referring both to the computational abilities of a GPU and the idea of a student wanting to improve their academic skills.

使用 NVIDIA CUDA 執行

下載模型

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .

此命令將模型下載到名為 cuda 的資料夾中。

安裝 generate() API

pip install --pre onnxruntime-genai-cuda

執行模型

使用 phi3-qa.py 執行模型。

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32  -e cuda

指令碼載入模型後，它會迴圈詢問您的輸入，並流式傳輸模型生成的輸出。例如

Input: Tell me a joke about creative writing
 
Output:  Why don't writers ever get lost? Because they always follow the plot! 

在 CPU 上執行

下載模型

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .

此命令將模型下載到名為 cpu_and_mobile 的資料夾中

為 CPU 安裝 generate() API
```
pip install --pre onnxruntime-genai
```

執行模型

使用 phi3-qa.py 執行模型。

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu

指令碼載入模型後，它會迴圈詢問您的輸入，並流式傳輸模型生成的輸出。例如

Input: Tell me a joke about generative AI

Output:  Why did the generative AI go to school?

To improve its "creativity" algorithm!

Phi-3 ONNX 模型參考

Phi-3 mini 4k 上下文 CPU

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
python phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Phi-3 mini 4k 上下文 CUDA

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32 -e cuda

Phi-3 mini 4k 上下文 DirectML

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include directml/* --local-dir .
python phi3-qa.py -m directml\directml-int4-awq-block-128 -e dml

Phi-3 mini 128k 上下文 CPU

huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
python phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Phi-3 mini 128k 上下文 CUDA

huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32 -e cuda

Phi-3 mini 128k 上下文 DirectML

huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include directml/* --local-dir .
python phi3-qa.py -m directml\directml-int4-awq-block-128 -e dml

Phi-3 medium 4k 上下文 CPU

git clone https://huggingface.tw/microsoft/Phi-3-medium-4k-instruct-onnx-cpu
python phi3-qa.py -m Phi-3-medium-4k-instruct-onnx-cpu/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Phi-3 medium 4k 上下文 CUDA

git clone https://huggingface.tw/microsoft/Phi-3-medium-4k-instruct-onnx-cuda
python phi3-qa.py -m Phi-3-medium-4k-instruct-onnx-cuda/cuda-int4-rtn-block-32 -e cuda

Phi-3 medium 4k 上下文 DirectML

git clone https://huggingface.tw/microsoft/Phi-3-medium-4k-instruct-onnx-directml
python phi3-qa.py -m Phi-3-medium-4k-instruct-onnx-directml/directml-int4-awq-block-128 -e dml

Phi-3 medium 128k 上下文 CPU

git clone https://huggingface.tw/microsoft/Phi-3-medium-128k-instruct-onnx-cpu
python phi3-qa.py -m Phi-3-medium-128k-instruct-onnx-cpu/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Phi-3 medium 128k 上下文 CUDA

git clone https://huggingface.tw/microsoft/Phi-3-medium-128k-instruct-onnx-cuda
python phi3-qa.py -m Phi-3-medium-128k-instruct-onnx-cuda/cuda-int4-rtn-block-32 -e cuda

Phi-3 medium 128k 上下文 DirectML

git clone https://huggingface.tw/microsoft/Phi-3-medium-128k-instruct-onnx-directml
python phi3-qa.py -m Phi-3-medium-128k-instruct-onnx-directml/directml-int4-awq-block-128 -e dml

Phi-3.5 mini 128k 上下文 CUDA

huggingface-cli download microsoft/Phi-3.5-mini-instruct-onnx --include cuda/cuda-int4-awq-block-128/* --local-dir .
python phi3-qa.py -m cuda/cuda-int4-awq-block-128 -e cuda

Phi-3.5 mini 128k 上下文 CPU

huggingface-cli download microsoft/Phi-3.5-mini-instruct-onnx --include cpu_and_mobile/cpu-int4-awq-block-128-acc-level-4/* --local-dir .
python phi3-qa.py -m cpu_and_mobile/cpu-int4-awq-block-128-acc-level-4 -e cpu