Inference with C# BERT NLP Deep Learning and ONNX Runtime

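In this tutorial we will learn how to run inference for the popular BERT natural language processing deep learning model in C#.

To preprocess text in C#, we will use the open-source BERTTokenizers package, which includes tokenizers for most BERT models. The supported models are listed below.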

BERT Base
BERT Large
BERT German
BERT Multilingual
BERT Base Uncased
BERT Large Uncased

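Many models, including the one in this tutorial, are fine-tuned from these base models. A fine-tuned model keeps the same tokenizer as the base model it was fine-tuned from.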

Prerequisites

This tutorial can be run locally or with Azure Machine Learning compute.

To run locally

To run in the cloud with Azure Machine Learning

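Use Hugging Face to download the BERT model

Hugging Face provides a convenient API for downloading open-source models, which we can then export to ONNX format using Python and PyTorch. This is a good option for open-source models that are not yet in the ONNX Model Zoo.

Steps to download and export the model in Python

Use the transformers API to download the BertForQuestionAnswering model named bert-large-uncased-whole-word-masking-finetuned-squad.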

import torch
from transformers import BertForQuestionAnswering

model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
model_path = "./" + model_name + ".onnx"
model = BertForQuestionAnswering.from_pretrained(model_name)

# Set the model to inference mode.
# It is important to call model.eval() or model.train(False) before exporting the model
# to put it in inference mode. This is required because operators like dropout or batchnorm
# behave differently in inference and training mode.
model.eval()

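Now that we have downloaded the model, we need to export it to ONNX format. This is built into PyTorch through the torch.onnx.export function.

The inputs variable indicates the input shape. You can create dummy inputs as shown below, or use sample inputs from testing the model.
Set opset_version to the highest version that is compatible with the model. More information about opset versions is available in the ONNX Runtime documentation.
Set the input_names and output_names for the model.
Set dynamic_axes for the dynamic-length inputs, because the sentence and context variables will have a different length for each question that is inferenced.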

# Generate dummy inputs to the model. Adjust if necessary.
inputs = {
        # list of numerical ids for the tokenized text
        'input_ids':   torch.randint(32, [1, 32], dtype=torch.long), 
        # dummy list of ones
        'attention_mask': torch.ones([1, 32], dtype=torch.long),     
        # dummy list of ones
        'token_type_ids':  torch.ones([1, 32], dtype=torch.long)     
    }

symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
torch.onnx.export(model,                                         # model being run
                  (inputs['input_ids'],
                   inputs['attention_mask'], 
                   inputs['token_type_ids']),                    # model input (or a tuple for multiple inputs)
                  model_path,                                    # where to save the model (can be a file or file-like object)
                  opset_version=11,                              # the ONNX opset version to export the model to
                  do_constant_folding=True,                      # whether to execute constant folding for optimization
                  input_names=['input_ids',
                               'input_mask', 
                               'segment_ids'],                   # the model's input names
                  output_names=['start_logits', "end_logits"],   # the model's output names
                  dynamic_axes={'input_ids': symbolic_names,
                                'input_mask' : symbolic_names,
                                'segment_ids' : symbolic_names,
                                'start_logits' : symbolic_names, 
                                'end_logits': symbolic_names})   # variable length axes/dynamic input
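
An easy mistake in an export call like the one above is a dynamic_axes key that does not match any entry in input_names or output_names, for example using the PyTorch argument name attention_mask instead of the exported name input_mask. PyTorch typically only warns about such a key and leaves that axis fixed. The following is a minimal sketch in plain Python that checks the names before exporting. The helper find_unmatched_axes is hypothetical and is not part of PyTorch or ONNX Runtime.

```python
# Hypothetical helper (not a PyTorch API): report dynamic_axes keys that are
# missing from the declared input and output names.
def find_unmatched_axes(dynamic_axes, input_names, output_names):
    declared = set(input_names) | set(output_names)
    return sorted(name for name in dynamic_axes if name not in declared)


symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
input_names = ['input_ids', 'input_mask', 'segment_ids']
output_names = ['start_logits', 'end_logits']

# Every key matches a declared name, as in the export call above.
good_axes = {name: symbolic_names for name in input_names + output_names}

# 'attention_mask' is the PyTorch argument name, not the exported ONNX input name.
bad_axes = {'input_ids': symbolic_names, 'attention_mask': symbolic_names}

unmatched_good = find_unmatched_axes(good_axes, input_names, output_names)
unmatched_bad = find_unmatched_axes(bad_axes, input_names, output_names)
print(unmatched_good)
print(unmatched_bad)
```

Running a check like this before torch.onnx.export makes a naming mismatch visible early, instead of surfacing later as a shape error when the C# code feeds a sequence of a different length.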

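Understanding the model in Python

When taking a prebuilt model and operationalizing it, it is useful to take a moment to understand the model's pre- and post-processing, as well as its input/output shapes and labels. Many models come with sample code in Python. We will run inference on the model with C#, but first let's test it and see how it is done in Python. This will help us write the C# logic in the next step.

The code to test the model is provided in this tutorial. Review the source code to see how this model is tested and run in Python. Below are a sample input sentence and the sample output produced by running the model.
Sample input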

input = "{\"question\": \"What is Dolly Parton's middle name?\", \"context\": \"Dolly Rebecca Parton is an American singer-songwriter\"}"

print(run(input))

The output for the question above should look like the following. You can use the input_ids to validate the tokenization in C#.

Output:
{'input_ids': [101, 2054, 2003, 19958, 2112, 2239, 1005, 1055, 2690, 2171, 1029, 102, 19958, 9423, 2112, 2239, 2003, 2019, 2137, 3220, 1011, 6009, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'answer': 'Rebecca'}
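
To make the three arrays in this output easier to read, here is a minimal sketch in plain Python of how a question/context pair is laid out: [CLS], the question tokens, [SEP], the context tokens, and a final [SEP]. The token ids are copied from the sample output above. The function name build_pair_encoding is hypothetical, and the sketch assumes no padding, which is why the attention mask is all ones.

```python
CLS_ID = 101  # [CLS] in the BERT uncased vocabulary
SEP_ID = 102  # [SEP] in the BERT uncased vocabulary


def build_pair_encoding(question_ids, context_ids):
    """Assemble [CLS] question [SEP] context [SEP] plus the two masks."""
    first = [CLS_ID] + list(question_ids) + [SEP_ID]
    second = list(context_ids) + [SEP_ID]
    input_ids = first + second
    return {
        'input_ids': input_ids,
        # 0 marks the question segment, 1 marks the context segment.
        'token_type_ids': [0] * len(first) + [1] * len(second),
        # No padding in this sketch, so every position is attended to.
        'attention_mask': [1] * len(input_ids),
    }


# Token ids copied from the sample output above, with special tokens removed.
question_ids = [2054, 2003, 19958, 2112, 2239, 1005, 1055, 2690, 2171, 1029]
context_ids = [19958, 9423, 2112, 2239, 2003, 2019, 2137, 3220, 1011, 6009]

encoding = build_pair_encoding(question_ids, context_ids)
print(encoding)
```

When the C# tokenizer output for the same sentence is compared against these lists, a mismatch in input_ids usually points to the wrong tokenizer or vocabulary, while a mismatch only in token_type_ids suggests the question and context were not split at the first [SEP].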

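Inference with C#

Now that we have tested the model in Python, it is time to build it out in C#. The first thing we need to do is create the project. This example uses a console app, but the code works in any C# application.

Open Visual Studio and create a console app

Install the NuGet packages

Install the NuGet packages BERTTokenizers, Microsoft.ML.OnnxRuntime, Microsoft.ML.OnnxRuntime.Managed, and Microsoft.ML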

dotnet add package Microsoft.ML.OnnxRuntime --version 1.16.0
dotnet add package Microsoft.ML.OnnxRuntime.Managed --version 1.16.0
dotnet add package Microsoft.ML
dotnet add package BERTTokenizers --version 1.1.0

Create the app

Import the packages

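using BERTTokenizers;
using Microsoft.ML.Data;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using System;
using System.Collections.Generic; // Dictionary<string, OrtValue>
using System.Linq;                // Select, Skip, Take, Count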

Add the namespace, class, and Main function.

namespace MyApp // Note: actual namespace depends on the project name.
{
    internal class BertTokenizeProgram
    {
        static void Main(string[] args)
        {

        }
    }
}

Create the BertInput struct for encoding

Add the BertInput struct

    public struct BertInput
    {
        public long[] InputIds { get; set; }
        public long[] AttentionMask { get; set; }
        public long[] TypeIds { get; set; }
    }

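Tokenize the sentence with the `BertUncasedLargeTokenizer`

Create a sentence (question and context) and tokenize it with the BertUncasedLargeTokenizer. The base model is bert-large-uncased, so we use the BertUncasedLargeTokenizer from the library. Be sure to check which base model your BERT model is built on to confirm you are using the correct tokenizer.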

  var sentence = "{\"question\": \"Where is Bob Dylan From?\", \"context\": \"Bob Dylan is from Duluth, Minnesota and is an American singer-songwriter\"}";
  Console.WriteLine(sentence);

  // Create Tokenizer and tokenize the sentence.
  var tokenizer = new BertUncasedLargeTokenizer();

  // Get the sentence tokens.
  var tokens = tokenizer.Tokenize(sentence);
  // Console.WriteLine(String.Join(", ", tokens));

  // Encode the sentence and pass in the count of the tokens in the sentence.
  var encoded = tokenizer.Encode(tokens.Count(), sentence);

  // Break out encoding to InputIds, AttentionMask and TypeIds from list of (input_id, attention_mask, type_id).
  var bertInput = new BertInput()
  {
      InputIds = encoded.Select(t => t.InputIds).ToArray(),
      AttentionMask = encoded.Select(t => t.AttentionMask).ToArray(),
      TypeIds = encoded.Select(t => t.TokenTypeIds).ToArray(),
  };
 

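Create the `inputs` of `name -> OrtValue` pairs required for inference

Get the model, create 3 OrtValues on top of the input buffers, and wrap them in a Dictionary to pass to Run(). Note that almost all ONNX Runtime classes wrap native data structures and therefore must be disposed to prevent memory leaks.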

  // Path to the model used to create the inference session.
  // Update it to the location of your own exported .onnx file.
  var modelPath = @"C:\code\bert-nlp-csharp\BertNlpTest\BertNlpTest\bert-large-uncased-finetuned-qa.onnx";

  using var runOptions = new RunOptions();
  using var session = new InferenceSession(modelPath);

  // Create input tensors over the input data.
  using var inputIdsOrtValue = OrtValue.CreateTensorValueFromMemory(bertInput.InputIds,
        new long[] { 1, bertInput.InputIds.Length });

  using var attMaskOrtValue = OrtValue.CreateTensorValueFromMemory(bertInput.AttentionMask,
        new long[] { 1, bertInput.AttentionMask.Length });

  using var typeIdsOrtValue = OrtValue.CreateTensorValueFromMemory(bertInput.TypeIds,
        new long[] { 1, bertInput.TypeIds.Length });

  // Create input data for session. Request all outputs in this case.
  var inputs = new Dictionary<string, OrtValue>
  {
      { "input_ids", inputIdsOrtValue },
      { "input_mask", attMaskOrtValue },
      { "segment_ids", typeIdsOrtValue }
  };

Run inference

Run inference using the InferenceSession created in the previous step. The result is printed after post-processing.

  // Run session and send the input data in to get inference output. 
  using var output = session.Run(runOptions, inputs, session.OutputNames);

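Post-process the `output` and print the result

Here we get the index of the start position (startLogits) and the end position (endLogits). We then take the original tokens of the input sentence and look up the vocabulary value for each predicted token ID.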

            // Get the index of the max value from each output.
            // We intentionally read the logits as spans instead of copying them to an
            // array or list: this reads directly from native memory and avoids
            // duplicating data that can be large for some models.
            // Few algorithms operate on spans today, hence the hand-written loop.
            // Local function that returns the index of the largest element in a span.
            int GetMaxValueIndex(ReadOnlySpan<float> span)
            {
                float maxVal = span[0];
                int maxIndex = 0;
                for (int i = 1; i < span.Length; ++i)
                {
                    var v = span[i];
                    if (v > maxVal)
                    {
                        maxVal = v;
                        maxIndex = i;
                    }
                }
                return maxIndex;
            }

            var startLogits = output[0].GetTensorDataAsSpan<float>();
            int startIndex = GetMaxValueIndex(startLogits);

            var endLogits = output[output.Count - 1].GetTensorDataAsSpan<float>();
            int endIndex = GetMaxValueIndex(endLogits);

            var predictedTokens = tokens
                          .Skip(startIndex)
                          .Take(endIndex + 1 - startIndex)
                          .Select(o => tokenizer.IdToToken((int)o.VocabularyIndex))
                          .ToList();

            // Print the result.
            Console.WriteLine(String.Join(" ", predictedTokens));
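
The same post-processing can be sketched in plain Python to make the logic easier to follow: take the argmax of the start logits and of the end logits, then slice the token list inclusively between the two indices. The tokens and logit values below are invented for illustration and are not real model output. The guard for an end index that precedes the start index is an added assumption about how to treat an inconsistent prediction.

```python
def argmax(values):
    """Return the index of the largest value (first one on ties)."""
    best = 0
    for i in range(1, len(values)):
        if values[i] > values[best]:
            best = i
    return best


def extract_answer(tokens, start_logits, end_logits):
    """Slice the tokens between the most likely start and end positions."""
    start = argmax(start_logits)
    end = argmax(end_logits)
    if end < start:
        return []  # inconsistent prediction: treat as "no answer"
    return tokens[start:end + 1]  # end index is inclusive


# Invented example values, one logit per token.
tokens = ['[CLS]', 'where', 'is', 'bob', 'from', '?', '[SEP]',
          'bob', 'is', 'from', 'duluth', ',', 'minnesota', '[SEP]']
start_logits = [-5.0, -4.0, -4.5, -3.0, -4.0, -5.0, -5.0,
                -1.0, -3.0, -2.0, 6.0, -2.5, 1.0, -5.0]
end_logits = [-5.0, -4.5, -4.0, -3.5, -4.0, -5.0, -5.0,
              -2.0, -3.0, -2.5, 2.0, -1.0, 7.0, -4.0]

answer = extract_answer(tokens, start_logits, end_logits)
print(' '.join(answer))
```

Taking the two argmax values independently is the simplest approach. A more careful implementation might score start/end pairs jointly and restrict the span to the context segment.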

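Deploy with Azure Web App

This example is a simple console app, but the same code can easily be used in something like a C# web app. See the documentation for Quickstart: Deploy an ASP.NET web app.

Next steps

Many BERT models have been fine-tuned for different tasks, and you can also fine-tune different base models for your own task. This code will work for most BERT models; just update the inputs, outputs, and pre/post-processing for your specific model.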