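Using the WebNN Execution Provider

This document explains how to use the WebNN execution provider in ONNX Runtime.

Basics

What is WebNN? Should I use it?

The Web Neural Network (WebNN) API is a new web standard that lets web apps and frameworks accelerate deep neural networks with on-device hardware such as GPUs, CPUs, or purpose-built AI accelerators (NPUs).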

WebNN is available in the latest versions of Chrome and Edge on Windows, Linux, macOS, Android, and ChromeOS, behind the "Enables WebNN API" flag. Check the WebNN status page for the latest implementation status.
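Because WebNN sits behind a browser flag, a page may want to check for it before requesting the WebNN EP. The following is a minimal sketch, assuming WebNN is reached through `navigator.ml.createContext` (the entry point used by the code later in this document). The helper name `hasWebNN` is hypothetical; it takes the navigator object as a parameter so it can also be exercised outside a browser.

```javascript
// Hypothetical helper: reports whether a navigator-like object exposes WebNN.
// Assumption: WebNN is reached through `navigator.ml.createContext`.
function hasWebNN(nav) {
  return Boolean(nav && nav.ml && typeof nav.ml.createContext === 'function');
}

// In a browser this receives the real `navigator`; elsewhere it may be undefined.
const webnnAvailable = hasWebNN(typeof navigator !== 'undefined' ? navigator : undefined);
console.log('WebNN available:', webnnAvailable);
```

A positive result only shows that the API is exposed. Creating a context for a particular device type may still fail, so the call to `createContext` should be guarded as well.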

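Refer to the WebNN operators page for the current status of operator support in the WebNN execution provider. Consider using the WebNN execution provider if it supports most of the operators in your model (unsupported operators fall back to the WASM EP) and you want the lower power use, faster processing, and smoother performance that on-device accelerators can provide.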

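How to use the WebNN EP in ONNX Runtime Web

This section assumes you have already set up your web application with ONNX Runtime Web. If you have not, follow Get Started for some basic information.

To use the WebNN EP, you only need to make 3 small changes:

Update your import statement:
- For an HTML script tag, change ort.min.js to ort.all.min.js: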
```
<script src="https://example.com/path/ort.all.min.js"></script>
```
- For a JavaScript import statement, change onnxruntime-web to onnxruntime-web/all:
```
import * as ort from 'onnxruntime-web/all';
```
See Conditional Importing for details.
Specify the 'webnn' EP explicitly in the session options:
```
const session = await ort.InferenceSession.create(modelPath, { ..., executionProviders: ['webnn'] });
```
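The WebNN EP also offers a set of options for creating different types of WebNN MLContext.
- deviceType: 'cpu'|'gpu'|'npu' (default is 'cpu'). Specifies the preferred type of device for the MLContext.
- powerPreference: 'default'|'low-power'|'high-performance' (default is 'default'). Specifies the preferred type of power consumption for the MLContext.
- context: of type MLContext. Lets you pass a pre-created MLContext to the WebNN EP; it is required for the I/O binding feature. If this option is provided, the other options are ignored.
Example of using the WebNN EP options: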
```
const options = {
  executionProviders: [
    {
      name: 'webnn',
      deviceType: 'gpu',
      powerPreference: 'default',
    },
  ],
};
```
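The option rules above can be captured in a small builder. This is a sketch, not part of ONNX Runtime Web: the helper name `webnnSessionOptions` is hypothetical, and the allowed values are taken from the list in this document. When a `context` is supplied, the helper omits the other options, since they would be ignored anyway.

```javascript
// Allowed values as listed in this document.
const DEVICE_TYPES = ['cpu', 'gpu', 'npu'];
const POWER_PREFERENCES = ['default', 'low-power', 'high-performance'];

// Hypothetical helper: builds a session-options object for the WebNN EP.
function webnnSessionOptions({ deviceType = 'cpu', powerPreference = 'default', context } = {}) {
  if (context !== undefined) {
    // A pre-created MLContext takes precedence; the other options are ignored.
    return { executionProviders: [{ name: 'webnn', context }] };
  }
  if (!DEVICE_TYPES.includes(deviceType)) {
    throw new RangeError(`Unknown deviceType: ${deviceType}`);
  }
  if (!POWER_PREFERENCES.includes(powerPreference)) {
    throw new RangeError(`Unknown powerPreference: ${powerPreference}`);
  }
  return { executionProviders: [{ name: 'webnn', deviceType, powerPreference }] };
}

console.log(JSON.stringify(webnnSessionOptions({ deviceType: 'gpu' })));
```

Validating the strings up front turns a typo such as 'GPU' into an immediate, readable error instead of a failure at session creation.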
If your model has dynamic shapes, ONNX Runtime Web offers the freeDimensionOverrides session option to override the model's free dimensions. See the freeDimensionOverrides introduction for details.
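As an illustration, freeDimensionOverrides maps each symbolic dimension name to a fixed size. In the sketch below the dimension names `batch_size` and `sequence_length` and the helper `withFixedDimensions` are hypothetical; use the names your model actually declares.

```javascript
// Hypothetical symbolic dimension names; use the ones declared by your model.
const freeDimensionOverrides = { batch_size: 1, sequence_length: 128 };

// Hypothetical helper: merges the overrides into existing session options.
function withFixedDimensions(sessionOptions, overrides) {
  for (const [name, value] of Object.entries(overrides)) {
    if (!Number.isInteger(value) || value <= 0) {
      throw new RangeError(`Dimension "${name}" must be a positive integer, got ${value}`);
    }
  }
  // Copy both objects so the caller's inputs are left unchanged.
  return { ...sessionOptions, freeDimensionOverrides: { ...overrides } };
}

const sessionOptions = withFixedDimensions(
  { executionProviders: ['webnn'] },
  freeDimensionOverrides,
);
console.log(sessionOptions);
```

The resulting object can be passed as the second argument to `ort.InferenceSession.create`.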

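The WebNN API and the WebNN EP are under active development. Consider installing the latest nightly build of ONNX Runtime Web (onnxruntime-web@dev) to benefit from the latest features and improvements.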

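Keep tensor data on WebNN MLTensor (I/O binding)

By default, a model's inputs and outputs are tensors that hold their data in CPU memory. When you run a session with the WebNN EP on the 'gpu' or 'npu' device type, the data is copied to GPU or NPU memory, and the results are copied back to CPU memory. Memory copies between devices, and between sessions, add significant overhead to inference time. WebNN provides a new opaque, device-specific storage type, MLTensor, to address this. If your input data comes from an MLTensor, or you want to keep the output data on an MLTensor for further processing, you can use I/O binding to keep the data on the MLTensor. This is especially helpful for transformer-based models, which usually run a single model many times, using the previous output as the next input.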
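The feedback pattern described above, where one run's outputs become the next run's inputs, can be sketched as a small mapping step between runs. The helper `nextFeeds` and the tensor names in the usage note are hypothetical. The helper only moves tensor objects between dictionaries; it never reads their data, so tensors that live on an MLTensor stay there.

```javascript
// Hypothetical helper: builds the feeds for the next run from the previous results.
// `outputToInput` maps an output name to the input name that should receive it.
function nextFeeds(results, outputToInput, extraFeeds = {}) {
  const feeds = { ...extraFeeds };
  for (const [outputName, inputName] of Object.entries(outputToInput)) {
    if (!(outputName in results)) {
      throw new Error(`Missing output "${outputName}" in results`);
    }
    // The tensor object is passed through untouched, so data that lives
    // on an MLTensor is not downloaded to the CPU here.
    feeds[inputName] = results[outputName];
  }
  return feeds;
}
```

For example, with hypothetical names, `nextFeeds(results, { present_key: 'past_key' }, { input_ids: nextToken })` yields feeds in which `past_key` is the tensor returned as `present_key` by the previous `session.run()`.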

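For model input, if your input data is a WebNN storage MLTensor, you can create an MLTensor tensor and use it as the input tensor.

For model output, there are two ways to use the I/O binding feature:

- Use a pre-allocated MLTensor tensor
- Specify the output data location

Please also check the following topic:

- MLTensor tensor life cycle management

Note: The MLTensor requires a shared MLContext for I/O binding. This means the MLContext should be pre-created, passed in as a WebNN EP option, and reused across all sessions.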
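A minimal sketch of the shared-context requirement follows. The helper `sessionOptionsFor` and the names `encoderOptions` and `decoderOptions` are hypothetical, and a plain object stands in for the real MLContext so that the snippet runs outside a browser. In a page you would pass the result of `navigator.ml.createContext(...)` instead.

```javascript
// Hypothetical helper: session options that reference one shared MLContext.
function sessionOptionsFor(mlContext, extra = {}) {
  return { ...extra, executionProviders: [{ name: 'webnn', context: mlContext }] };
}

// Stand-in for `await navigator.ml.createContext(...)` so the sketch runs anywhere.
const mlContext = { label: 'shared-context' };

// Both sessions are configured with the very same context object.
const encoderOptions = sessionOptionsFor(mlContext);
const decoderOptions = sessionOptionsFor(mlContext, { preferredOutputLocation: 'ml-tensor' });

console.log(
  encoderOptions.executionProviders[0].context === decoderOptions.executionProviders[0].context,
);
```

Creating the context once and routing every session through the same helper makes it harder to end up with two sessions whose MLTensors cannot be exchanged.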

Create input tensor from MLTensor

If your input data is a WebNN storage MLTensor, you can create an MLTensor tensor and use it as the input tensor:

```
// Create WebNN MLContext
const mlContext = await navigator.ml.createContext({deviceType, ...});
// Create a WebNN MLTensor
const inputMLTensor = await mlContext.createTensor({
  dataType: 'float32',
  shape: [1, 3, 224, 224],
  writable: true,
});
// Write data to the MLTensor
const inputArrayBuffer = new Float32Array(1 * 3 * 224 * 224).fill(1.0);
mlContext.writeTensor(inputMLTensor, inputArrayBuffer);

// Create an ORT tensor from the MLTensor
const inputTensor = ort.Tensor.fromMLTensor(inputMLTensor, {
  dataType: 'float32',
  dims: [1, 3, 224, 224],
});
```

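Use this tensor as a model input (feeds) so that the input data stays on the MLTensor.

Use a pre-allocated MLTensor tensor

If you know the output shape in advance, you can create an MLTensor tensor and use it as the output tensor: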

```
// Create a pre-allocated MLTensor and the corresponding ORT tensor. Assuming that the output shape is [10, 1000].
const mlContext = await navigator.ml.createContext({deviceType, ...});
const preallocatedMLTensor = await mlContext.createTensor({
  dataType: 'float32',
  shape: [10, 1000],
  readable: true,
});

const preallocatedOutputTensor = ort.Tensor.fromMLTensor(preallocatedMLTensor, {
  dataType: 'float32',
  dims: [10, 1000],
});

// ...

// Run the session with fetches
const feeds = { 'input_0': inputTensor };
const fetches = { 'output_0': preallocatedOutputTensor };
await session.run(feeds, fetches);

// Read output_0 data from preallocatedMLTensor if needed
const output_0 = await mlContext.readTensor(preallocatedMLTensor);
console.log('output_0 value:', new Float32Array(output_0));
```

When you specify the output tensor in the fetches, ONNX Runtime Web uses the pre-allocated MLTensor as the output tensor. If the shapes do not match, the run() call fails.
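Since a shape mismatch only surfaces when run() fails, it can help to compare shapes before the call. The sketch below uses hypothetical helpers (`elementCount`, `sameShape`, `assertOutputShape`) and assumes you already know the expected output shape, as in the [10, 1000] example above.

```javascript
// Number of elements described by a shape such as [10, 1000].
function elementCount(dims) {
  return dims.reduce((count, dim) => count * dim, 1);
}

// True when two shapes have the same rank and the same size in every dimension.
function sameShape(a, b) {
  return a.length === b.length && a.every((dim, i) => dim === b[i]);
}

// Hypothetical pre-flight check for a pre-allocated output tensor.
function assertOutputShape(name, expectedDims, tensorDims) {
  if (!sameShape(expectedDims, tensorDims)) {
    throw new Error(
      `Output "${name}" expects [${expectedDims}] but the pre-allocated tensor is [${tensorDims}]`,
    );
  }
}

assertOutputShape('output_0', [10, 1000], [10, 1000]); // passes silently
console.log(elementCount([10, 1000])); // 10000 float32 values
```

`elementCount` is also a convenient way to size the typed array that is written to an input MLTensor.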

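Specify the output data location

If you do not want to use pre-allocated MLTensor tensors for outputs, you can instead specify the output data location in the session options: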

```
const sessionOptions1 = {
  ...,
  // keep all output data on MLTensor
  preferredOutputLocation: 'ml-tensor'
};

const sessionOptions2 = {
  ...,
  // alternatively, you can specify the output location for each output tensor
  preferredOutputLocation: {
    'output_0': 'cpu',        // keep output_0 on CPU. This is the default behavior.
    'output_1': 'ml-tensor'   // keep output_1 on MLTensor tensor
  }
};

// ...

// Run the session
const feeds = { 'input_0': inputTensor };
const results = await session.run(feeds);

// Read output_1 data
const output_1 = await results['output_1'].getData();
console.log('output_1 value:', new Float32Array(output_1));
```

When you specify the preferredOutputLocation config, ONNX Runtime Web keeps the output data on the specified device.
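For models with many outputs, the per-output map can be generated rather than written by hand. This is a sketch with a hypothetical helper, `outputLocations`; it uses only the 'cpu' and 'ml-tensor' locations shown above. In practice the list of names would come from the loaded session, for example its `outputNames` property.

```javascript
// Hypothetical helper: builds a per-output `preferredOutputLocation` map.
// Outputs named in `onMLTensor` stay on an MLTensor; the rest use 'cpu'.
function outputLocations(outputNames, onMLTensor) {
  const keep = new Set(onMLTensor);
  for (const name of keep) {
    if (!outputNames.includes(name)) {
      throw new Error(`Unknown output "${name}"`);
    }
  }
  return Object.fromEntries(
    outputNames.map((name) => [name, keep.has(name) ? 'ml-tensor' : 'cpu']),
  );
}

const preferredOutputLocation = outputLocations(['output_0', 'output_1'], ['output_1']);
console.log(preferredOutputLocation);
```

Rejecting unknown names catches typos early, since a misspelled output would otherwise be left at its default location without any warning from this helper.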

See API reference: preferredOutputLocation for details.

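Notes

MLTensor tensor life cycle management

It is important to understand how the underlying MLTensor is managed, so that you can avoid memory leaks and use tensors efficiently.

An MLTensor tensor is created either by user code or by ONNX Runtime Web as a model output.

When it is created by user code, it is always created from an existing MLTensor using Tensor.fromMLTensor(). In this case, the tensor does not "own" the MLTensor.
- It is the user's responsibility to make sure the underlying MLTensor stays valid during inference, and to call mlTensor.destroy() to release the MLTensor when it is no longer needed.
- Avoid calling tensor.getData() and tensor.dispose(). Use the MLTensor directly.
- Using an MLTensor tensor whose MLTensor has been destroyed causes the session run to fail.
When it is created by ONNX Runtime Web as a model output (not a pre-allocated MLTensor tensor), the tensor "owns" the MLTensor.
- You do not need to worry about the MLTensor being destroyed before the tensor is used.
- Call tensor.getData() to download the data from the MLTensor to the CPU and get it as a typed array.
- Call tensor.dispose() explicitly to destroy the underlying MLTensor when it is no longer needed.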
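The ownership rules above suggest a small cleanup step after each run. The following sketch uses a hypothetical helper, `disposeOwnedOutputs`. It assumes each result tensor exposes a `location` property that equals 'ml-tensor' for MLTensor-backed tensors, and that the caller lists any outputs it pre-allocated itself so they are skipped.

```javascript
// Hypothetical helper: disposes output tensors that ONNX Runtime Web created
// on an MLTensor. Names in `userOwned` (for example pre-allocated outputs)
// are skipped, because the caller must release those MLTensors itself.
// Assumption: tensors expose `location` and `dispose()`.
function disposeOwnedOutputs(results, userOwned = []) {
  const skip = new Set(userOwned);
  const disposed = [];
  for (const [name, tensor] of Object.entries(results)) {
    if (skip.has(name) || tensor.location !== 'ml-tensor') continue;
    tensor.dispose();
    disposed.push(name);
  }
  return disposed;
}
```

For user-created tensors the cleanup goes the other way: keep a reference to the original MLTensor and call its `destroy()` method once no session needs it.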