Inference Stable Diffusion with C# and ONNX Runtime

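In this tutorial we will learn how to run inference for the popular Stable Diffusion deep learning model in C#. Stable Diffusion models take a text prompt and create an image that represents the text. See the example below: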

"make a picture of green tree with flowers around it and a red sky"

[Image: example output for the prompt above]

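Prerequisites
Use Hugging Face to download the Stable Diffusion models
Understanding the model in Python with Diffusers from Hugging Face
Inference with C#
Main Function
Tokenization with ONNX Runtime Extensions
Text embedding with the CLIP text encoder model
The inference loop: UNet model, timesteps and LMS scheduler
Postprocess the output with the VAEDecoder
Conclusion
Resources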

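Prerequisites

This tutorial can be run locally or in the cloud by leveraging Azure Machine Learning compute.

Download the source code from GitHub

To run locally:

Visual Studio or VS Code
A GPU-enabled machine with CUDA, or with DirectML on Windows
- To configure the CUDA execution provider (EP), follow this tutorial on configuring CUDA and cuDNN for GPU with ONNX Runtime and C# on Windows 11.
- Windows ships with DirectML support, so no additional configuration is needed. If you choose this option, be sure to clone the direct-ML-EP branch of this repo.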
- This was built on an RTX 3070 and has not been tested on anything smaller.

To run in the cloud with Azure Machine Learning:

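Use Hugging Face to download the Stable Diffusion models

The Hugging Face site hosts a large library of open source models. We will download the ONNX Stable Diffusion models from Hugging Face.

Stable Diffusion Models v1.4

Once you have selected a model version repo, click Files and Versions, then select the ONNX branch. If no ONNX model branch is available, use the main branch and convert it to ONNX. See the ONNX conversion tutorial for PyTorch for more information.

Clone the repo: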

git lfs install
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4 -b onnx

Copy the folders containing the ONNX files to the C# project folder \StableDiffusion\StableDiffusion. The folders to copy are: unet, vae_decoder, text_encoder, safety_checker.
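Before running the sample it can help to confirm that the four model folders are where the code will look for them. The following is a minimal sketch, assuming the models are resolved relative to the current working directory (as the Directory.GetCurrentDirectory() calls later in this tutorial suggest). The ModelFolders class and FindMissing method are hypothetical names and are not part of the sample.

```csharp
using System;
using System.IO;
using System.Linq;

// Report any model folder that is missing from the working directory.
var missing = ModelFolders.FindMissing(Directory.GetCurrentDirectory());
Console.WriteLine(missing.Length == 0
    ? "All model folders found."
    : "Missing: " + string.Join(", ", missing));

static class ModelFolders
{
    // Folder names listed in the tutorial text above.
    static readonly string[] Required = { "unet", "vae_decoder", "text_encoder", "safety_checker" };

    // Hypothetical helper: returns the required folders that do not exist under projectDir.
    public static string[] FindMissing(string projectDir) =>
        Required.Where(name => !Directory.Exists(Path.Combine(projectDir, name))).ToArray();
}
```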

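Understanding the model in Python with Diffusers from Hugging Face

When operationalizing a pre-built model, it helps to take a moment to understand the models in the pipeline. This code is based on the Hugging Face Diffusers library and blog. To learn more about how it works, check out this excellent blog post for details.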

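Inference with C#

Now let's break down how to run inference in C#. The unet model takes the text embedding of the user prompt, which is created by the CLIP model that connects text and images. A latent noisy image is created as the starting point. The scheduler algorithm and the unet model work together to denoise the image until it represents the text prompt. Let's look at the code.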

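Main Function

The main function sets the prompt, the number of inference steps, and the guidance scale. It then calls the UNet.Inference function to run inference.

The properties that need to be set are:

prompt - The text prompt to use for the image
num_inference_steps - The number of steps to run inference for. More steps make the inference loop take longer, but the image quality should improve.
guidance_scale - The scale for classifier-free guidance. Higher values push the image closer to the prompt, but image quality may suffer.
batch_size - The number of images to create
height - The height of the image. Default is 512 and must be a multiple of 8.
width - The width of the image. Default is 512 and must be a multiple of 8.

* Note: Check out the Hugging Face blog for more details.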
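The multiple-of-8 rule for height and width follows from the latent shape used later in this tutorial, where each spatial dimension is divided by 8. A minimal sketch of that check is shown here. The ImageSize class and LatentShape method are hypothetical names introduced for illustration and do not appear in the sample.

```csharp
using System;

// A 512x512 request maps to a 64x64 latent grid with 4 channels.
var shape = ImageSize.LatentShape(batchSize: 1, height: 512, width: 512);
Console.WriteLine(string.Join(",", shape)); // prints 1,4,64,64

static class ImageSize
{
    // Hypothetical helper: enforces the multiple-of-8 rule and returns the latent shape.
    public static int[] LatentShape(int batchSize, int height, int width)
    {
        if (height <= 0 || height % 8 != 0)
            throw new ArgumentException("height must be a positive multiple of 8", nameof(height));
        if (width <= 0 || width % 8 != 0)
            throw new ArgumentException("width must be a positive multiple of 8", nameof(width));

        // 4 latent channels; each spatial dimension is 1/8 of the image size.
        return new[] { batchSize, 4, height / 8, width / 8 };
    }
}
```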

//Default args
var prompt = "make a picture of green tree with flowers around it and a red sky";
// Number of steps
var num_inference_steps = 10;

// Scale for classifier-free guidance
var guidance_scale = 7.5;
//num of images requested
var batch_size = 1;
// Load the tokenizer and text encoder to tokenize and encode the text.
var textTokenized = TextProcessing.TokenizeText(prompt);
var textPromptEmbeddings = TextProcessing.TextEncoder(textTokenized).ToArray();
// Create uncond_input of blank tokens
var uncondInputTokens = TextProcessing.CreateUncondInput();
var uncondEmbedding = TextProcessing.TextEncoder(uncondInputTokens).ToArray();
// Concat textEmbeddings and uncondEmbedding
DenseTensor<float> textEmbeddings = new DenseTensor<float>(new[] { 2, 77, 768 });
for (var i = 0; i < textPromptEmbeddings.Length; i++)
{
    textEmbeddings[0, i / 768, i % 768] = uncondEmbedding[i];
    textEmbeddings[1, i / 768, i % 768] = textPromptEmbeddings[i];
}
var height = 512;
var width = 512;
// Run Stable Diffusion inference
var image = UNet.Inference(num_inference_steps, textEmbeddings, guidance_scale, batch_size, height, width);
// If image failed or was unsafe it will return null.
if( image == null )
{
    Console.WriteLine("Unable to create image, please try again.");
}
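The main function calls CreateUncondInput, which is not shown in this tutorial. One plausible sketch is given below, assuming the unconditional input is the start token followed by blank tokens up to the model's maximum length of 77. The token IDs 49406 and 49407 are taken from the tokenizer output shown later in this tutorial; the UncondSketch class name is hypothetical, and the sample's actual implementation may differ.

```csharp
using System;
using System.Linq;

var uncond = UncondSketch.CreateUncondInput();
Console.WriteLine(uncond.Length); // prints 77

static class UncondSketch
{
    // Assumed values, read off the tokenizer output shown in this tutorial.
    const int StartToken = 49406;
    const int BlankToken = 49407;
    const int ModelMaxLength = 77;

    // Start token followed by blank tokens, for a total of ModelMaxLength IDs.
    public static int[] CreateUncondInput() =>
        new[] { StartToken }
            .Concat(Enumerable.Repeat(BlankToken, ModelMaxLength - 1))
            .ToArray();
}
```

Encoding this "empty" prompt gives the unconditional embedding that classifier-free guidance compares against the prompt embedding.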

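Tokenization with ONNX Runtime Extensions

The TextProcessing class has the functions to tokenize the text prompt and encode it with the CLIP model text encoder.

Instead of reimplementing the CLIP tokenizer in C#, we can leverage the cross-platform CLIP tokenizer implementation in ONNX Runtime Extensions. ONNX Runtime Extensions provides a custom_op_cliptok.onnx tokenizer model that is used to tokenize the text prompt. The tokenizer splits the text into words and then converts the words into token IDs.

Text Prompt: a sentence or phrase that represents the image you want to create.

make a picture of green tree with flowers around it and a red sky

Text Tokenization: The text prompt is tokenized into a list of token IDs. Each token ID is a number that represents a word in the sentence. The list is then padded with a blank token up to the maxLength of 77 tokens, and the token IDs are converted to a tensor of shape (1,77).
Below is the code that tokenizes the text prompt with ONNX Runtime Extensions.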

public static int[] TokenizeText(string text)
{
            // Create Tokenizer and tokenize the sentence.
            var tokenizerOnnxPath = Directory.GetCurrentDirectory().ToString() + ("\\text_tokenizer\\custom_op_cliptok.onnx");

            // Create session options for custom op of extensions
            using var sessionOptions = new SessionOptions();
            var customOp = "ortextensions.dll";
            sessionOptions.RegisterCustomOpLibraryV2(customOp, out var libraryHandle);

            // Create an InferenceSession from the onnx clip tokenizer.
            using var tokenizeSession = new InferenceSession(tokenizerOnnxPath, sessionOptions);

            // Create input tensor from text
            using var inputTensor = OrtValue.CreateTensorWithEmptyStrings(OrtAllocator.DefaultInstance, new long[] { 1 });
            inputTensor.StringTensorSetElementAt(text.AsSpan(), 0);

            var inputs = new Dictionary<string, OrtValue>
            {
                {  "string_input", inputTensor }
            };

            // Run session and send the input data in to get inference output. 
            using var runOptions = new RunOptions();
            using var tokens = tokenizeSession.Run(runOptions, inputs, tokenizeSession.OutputNames);

            var inputIds = tokens[0].GetTensorDataAsSpan<long>();

            // Cast inputIds to Int32
            var InputIdsInt = new int[inputIds.Length];
            for(int i = 0; i < inputIds.Length; i++)
            {
                InputIdsInt[i] = (int)inputIds[i];
            }

            Console.WriteLine(String.Join(" ", InputIdsInt));

            var modelMaxLength = 77;
            // Pad array with 49407 until length is modelMaxLength
            if (InputIdsInt.Length < modelMaxLength)
            {
                var pad = Enumerable.Repeat(49407, modelMaxLength - InputIdsInt.Length).ToArray();
                InputIdsInt = InputIdsInt.Concat(pad).ToArray();
            }
            return InputIdsInt;
}

tensor([[49406,  1078,   320,  1674,   539,  1901,  2677,   593,  4023,  1630,
           585,   537,   320,   736,  2390, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407]])
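The TokenizeText function above pads short prompts but does not handle prompts that produce more than 77 token IDs. The sketch below shows one way to cover both cases. The truncation rule (keep the first 77 IDs and force the last one to the end token 49407) is an assumption for illustration, and the TokenPadding class and PadOrTruncate method are hypothetical names.

```csharp
using System;
using System.Linq;

var ids = new[] { 49406, 1078, 320, 49407 };
var padded = TokenPadding.PadOrTruncate(ids);
Console.WriteLine(padded.Length); // prints 77

static class TokenPadding
{
    const int EndToken = 49407;      // also used as the padding token in this tutorial
    const int ModelMaxLength = 77;

    public static int[] PadOrTruncate(int[] inputIds)
    {
        if (inputIds.Length >= ModelMaxLength)
        {
            // Assumption: cut to the model length and end the sequence with the end token.
            var truncated = inputIds.Take(ModelMaxLength).ToArray();
            truncated[ModelMaxLength - 1] = EndToken;
            return truncated;
        }

        // Short prompt: append padding tokens until the model length is reached.
        var pad = Enumerable.Repeat(EndToken, ModelMaxLength - inputIds.Length);
        return inputIds.Concat(pad).ToArray();
    }
}
```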

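Text embedding with the CLIP text encoder model

The tokens are sent to the text encoder model and converted into a tensor of shape (1, 77, 768), where the first dimension is the batch size, the second is the number of tokens, and the third is the embedding size. The text encoder is an OpenAI CLIP model that connects text to images.

The text encoder creates the text embedding, which is trained to encode the text prompt into a vector that guides image generation. The text embedding is then concatenated with the uncond embedding to create the text embeddings that are sent to the unet model for inference.

Text Embedding: A vector of numbers, created from the tokenized result, that represents the text prompt. The text embedding is created by the text_encoder model.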

        public static float[] TextEncoder(int[] tokenizedInput)
        {
            // Create input tensor. OrtValue will not copy, will read from managed memory
            using var input_ids = OrtValue.CreateTensorValueFromMemory<int>(tokenizedInput,
                new long[] { 1, tokenizedInput.Length });

            var textEncoderOnnxPath = Directory.GetCurrentDirectory().ToString() + ("\\text_encoder\\model.onnx");

            using var encodeSession = new InferenceSession(textEncoderOnnxPath);

            // Pre-allocate the output so it goes to a managed buffer
            // we know the shape
            var lastHiddenState = new float[1 * 77 * 768];
            using var outputOrtValue = OrtValue.CreateTensorValueFromMemory<float>(lastHiddenState, new long[] { 1, 77, 768 });

            string[] input_names = { "input_ids" };
            OrtValue[] inputs = { input_ids };

            string[] output_names = { encodeSession.OutputNames[0] };
            OrtValue[] outputs = { outputOrtValue };

            // Run inference.
            using var runOptions = new RunOptions();
            encodeSession.Run(runOptions, input_names, inputs, output_names, outputs);

            return lastHiddenState;
        }

torch.Size([1, 77, 768])
tensor([[[-0.3884,  0.0229, -0.0522,  ..., -0.4899, -0.3066,  0.0675],
         [ 0.0520, -0.6046,  1.9268,  ..., -0.3985,  0.9645, -0.4424],
         [-0.8027, -0.4533,  1.7525,  ..., -1.0365,  0.6296,  1.0712],
         ...,
         [-0.6833,  0.3571, -1.1353,  ..., -1.4067,  0.0142,  0.3566],
         [-0.7049,  0.3517, -1.1524,  ..., -1.4381,  0.0090,  0.3777],
         [-0.6155,  0.4283, -1.1282,  ..., -1.4256, -0.0285,  0.3206]]],

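The inference loop: UNet model, timesteps and LMS scheduler

Scheduler

The scheduler algorithm and the unet model work together to denoise the image until it represents the text prompt. Different scheduler algorithms can be used; to learn more, check out this blog from Hugging Face. In this example we use the `LMSDiscreteScheduler`, which was created based on the Hugging Face scheduling_lms_discrete.py.

Timesteps

The inference loop is the main loop that runs the scheduler algorithm and the unet model. The loop runs once per timestep; the timesteps are calculated by the scheduler algorithm from the number of inference steps and other parameters.

In this example we have 10 inference steps, which results in the following timesteps: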

// Get path to model to create inference session.
var modelPath = Directory.GetCurrentDirectory().ToString() + ("\\unet\\model.onnx");
var scheduler = new LMSDiscreteScheduler();
var timesteps = scheduler.SetTimesteps(numInferenceSteps);

tensor([999., 888., 777., 666., 555., 444., 333., 222., 111.,   0.])
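The values above are consistent with spacing 10 points evenly over the range 0 to 999 and reversing the order. A minimal sketch of that calculation follows, assuming 1000 training timesteps. The TimestepSketch class and Descending method are hypothetical names, and the scheduler's SetTimesteps method may do additional work (such as preparing sigmas) that is not shown here.

```csharp
using System;

var timesteps = TimestepSketch.Descending(numTrainTimesteps: 1000, numInferenceSteps: 10);
Console.WriteLine(string.Join(", ", timesteps));
// prints 999, 888, 777, 666, 555, 444, 333, 222, 111, 0

static class TimestepSketch
{
    // Evenly spaced values from 0 to numTrainTimesteps - 1, returned in descending order.
    public static double[] Descending(int numTrainTimesteps, int numInferenceSteps)
    {
        if (numInferenceSteps < 2)
            throw new ArgumentOutOfRangeException(nameof(numInferenceSteps), "need at least 2 steps");

        double stop = numTrainTimesteps - 1;
        var result = new double[numInferenceSteps];
        for (int i = 0; i < numInferenceSteps; i++)
        {
            result[i] = stop * i / (numInferenceSteps - 1);
        }

        // Denoising starts at the noisiest timestep, so reverse to descending order.
        Array.Reverse(result);
        return result;
    }
}
```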

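Latents

The latents tensor is the noisy image tensor used as the model input. It is created by the GenerateLatentSample function, which produces a random tensor of shape (1,4,64,64). The seed can be set to a random number or a fixed number. With a fixed seed, the same latent tensor is used every time, which is useful for debugging or when you want to create the same image each time.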

var seed = new Random().Next();
var latents = GenerateLatentSample(batchSize, height, width, seed, scheduler.InitNoiseSigma);
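GenerateLatentSample is not shown in this tutorial. The sketch below illustrates the idea under stated assumptions: it draws standard normal noise with the Box-Muller transform, scales it by the scheduler's initial noise sigma, and returns a flat float array instead of a tensor so that it runs without ONNX Runtime. The LatentSketch class is a hypothetical name, and the sample's own implementation may differ.

```csharp
using System;

var a = LatentSketch.GenerateLatentSample(1, 512, 512, seed: 42, initNoiseSigma: 1.0f);
var b = LatentSketch.GenerateLatentSample(1, 512, 512, seed: 42, initNoiseSigma: 1.0f);
Console.WriteLine(a.Length);     // prints 16384 (1 * 4 * 64 * 64)
Console.WriteLine(a[0] == b[0]); // prints True: the same seed gives the same noise

static class LatentSketch
{
    public static float[] GenerateLatentSample(int batchSize, int height, int width, int seed, float initNoiseSigma)
    {
        var random = new Random(seed);
        // 4 latent channels, each spatial dimension 1/8 of the image size.
        var latents = new float[batchSize * 4 * (height / 8) * (width / 8)];

        for (int i = 0; i < latents.Length; i++)
        {
            // Box-Muller transform: two uniform samples give one standard normal sample.
            double u1 = 1.0 - random.NextDouble(); // in (0, 1], avoids Math.Log(0)
            double u2 = random.NextDouble();
            double standardNormal = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);

            // Scale the noise to the level the scheduler expects at the first step.
            latents[i] = (float)standardNormal * initNoiseSigma;
        }
        return latents;
    }
}
```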


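Inference Loop

For each inference step, the latent image is duplicated to create a tensor of shape (2,4,64,64), which is then scaled and run through the unet model. The output tensor of shape (2,4,64,64) is split and guidance is applied. The resulting tensor is passed to the LMSDiscreteScheduler step as part of the denoising process, and the tensor returned by the scheduler step becomes the latents for the next iteration. The loop repeats until num_inference_steps is reached.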

var modelPath = Directory.GetCurrentDirectory().ToString() + ("\\unet\\model.onnx");
var scheduler = new LMSDiscreteScheduler();
var timesteps = scheduler.SetTimesteps(numInferenceSteps);

var seed = new Random().Next();
var latents = GenerateLatentSample(batchSize, height, width, seed, scheduler.InitNoiseSigma);

// Create Inference Session
using var options = new SessionOptions();
using var unetSession = new InferenceSession(modelPath, options);

var latentInputShape = new int[] { 2, 4, height / 8, width / 8 };
var splitTensorsShape = new int[] { 1, 4, height / 8, width / 8 };

for (int t = 0; t < timesteps.Length; t++)
{
    // torch.cat([latents] * 2)
    var latentModelInput = TensorHelper.Duplicate(latents.ToArray(), latentInputShape);

    // Scale the input
    latentModelInput = scheduler.ScaleInput(latentModelInput, timesteps[t]);

    // Create model input of text embeddings, scaled latent image and timestep
    var input = CreateUnetModelInput(textEmbeddings, latentModelInput, timesteps[t]);

    // Run Inference
    using var output = unetSession.Run(input);
    var outputTensor = output[0].Value as DenseTensor<float>;

    // Split tensors from 2,4,64,64 to 1,4,64,64
    var splitTensors = TensorHelper.SplitTensor(outputTensor, splitTensorsShape);
    var noisePred = splitTensors.Item1;
    var noisePredText = splitTensors.Item2;

    // Perform guidance
    noisePred = performGuidance(noisePred, noisePredText, guidanceScale);

    // LMS Scheduler Step
    latents = scheduler.Step(noisePred, timesteps[t], latents);
}
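The loop relies on three helpers that are not shown: duplicating the latents, splitting the model output, and applying guidance. The sketch below illustrates them on flat float arrays so that it runs without ONNX Runtime. It assumes the first half of the batch is the unconditional prediction and the second half is the text-conditioned one, matching the order in which the embeddings were concatenated in the main function, and it uses the usual classifier-free guidance formula. The LoopSketch class is a hypothetical name, not the sample's TensorHelper.

```csharp
using System;

float[] latents = { 1f, 2f };
var doubled = LoopSketch.Duplicate(latents);
Console.WriteLine(string.Join(", ", doubled)); // prints 1, 2, 1, 2

var (uncond, text) = LoopSketch.Split(new[] { 1f, 1f, 3f, 5f });
var guided = LoopSketch.PerformGuidance(uncond, text, 7.5f);
Console.WriteLine(string.Join(", ", guided)); // prints 16, 31

static class LoopSketch
{
    // Equivalent of torch.cat([latents] * 2) on a flat array.
    public static float[] Duplicate(float[] data)
    {
        var result = new float[data.Length * 2];
        data.CopyTo(result, 0);
        data.CopyTo(result, data.Length);
        return result;
    }

    // Splits a batch of two into its unconditional and text-conditioned halves.
    public static (float[] Uncond, float[] Text) Split(float[] batched)
    {
        int half = batched.Length / 2;
        return (batched[..half], batched[half..]);
    }

    // Classifier-free guidance: uncond + scale * (text - uncond), element by element.
    public static float[] PerformGuidance(float[] noisePred, float[] noisePredText, float guidanceScale)
    {
        var result = new float[noisePred.Length];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = noisePred[i] + guidanceScale * (noisePredText[i] - noisePred[i]);
        }
        return result;
    }
}
```

With a guidance scale of 1 the result equals the text-conditioned prediction; larger values push the prediction further away from the unconditional one.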

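Postprocess the `output` with the VAEDecoder

After the inference loop is complete, the resulting tensor is scaled and then sent to the vae_decoder model to decode the image. Lastly, the decoded image tensor is converted to an image and saved to disk.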

public static Tensor<float> Decoder(List<NamedOnnxValue> input)
{
    // Load the model which will be used to decode the latents into image space.
    var vaeDecoderModelPath = Directory.GetCurrentDirectory().ToString() + ("\\vae_decoder\\model.onnx");

    // Create an InferenceSession from the Model Path.
    var vaeDecodeSession = new InferenceSession(vaeDecoderModelPath);

    // Run session and send the input data in to get inference output.
    var output = vaeDecodeSession.Run(input);
    var result = (output.ToList().First().Value as Tensor<float>);
    return result;
}

public static Image<Rgba32> ConvertToImage(Tensor<float> output, int width = 512, int height = 512, string imageName = "sample")
{
    var result = new Image<Rgba32>(width, height);
    for (var y = 0; y < height; y++)
    {
        for (var x = 0; x < width; x++)
        {
            result[x, y] = new Rgba32(
                (byte)(Math.Round(Math.Clamp((output[0, 0, y, x] / 2 + 0.5), 0, 1) * 255)),
                (byte)(Math.Round(Math.Clamp((output[0, 1, y, x] / 2 + 0.5), 0, 1) * 255)),
                (byte)(Math.Round(Math.Clamp((output[0, 2, y, x] / 2 + 0.5), 0, 1) * 255))
            );
        }
    }
    result.Save($@"C:/code/StableDiffusion/{imageName}.png");
    return result;
}
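The scaling step that happens before decoding is not shown in the code above. The sketch below illustrates it together with the pixel mapping used in ConvertToImage. It assumes the latent scaling factor of 0.18215 that Stable Diffusion v1 models use, so the latents are divided by that value before they reach the decoder. The DecodeSketch class and its methods are hypothetical names for illustration.

```csharp
using System;
using System.Linq;

var scaled = DecodeSketch.ScaleLatents(new[] { 0.18215f, -0.18215f });
Console.WriteLine(scaled.Length);            // prints 2
Console.WriteLine(DecodeSketch.ToByte(-1f)); // prints 0
Console.WriteLine(DecodeSketch.ToByte(1f));  // prints 255

static class DecodeSketch
{
    // Assumption: scaling factor of the Stable Diffusion v1 VAE.
    const float VaeScalingFactor = 0.18215f;

    // Undo the latent scaling before the latents are passed to the decoder.
    public static float[] ScaleLatents(float[] latents) =>
        latents.Select(v => v / VaeScalingFactor).ToArray();

    // Map a decoder output value from [-1, 1] to a byte in [0, 255], clamping out-of-range values.
    public static byte ToByte(float value) =>
        (byte)Math.Round(Math.Clamp(value / 2 + 0.5, 0.0, 1.0) * 255);
}
```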

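The result image:

Conclusion

This is a high-level overview of how to run Stable Diffusion in C#. It covers the main concepts and provides examples of how to implement them. For the full code, check out the Stable Diffusion C# Sample.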