Project 5: Diffusion Models by Lisa (Qi) Hou

Part A: Fun with Diffusion Models

Part 1 Sampling Loops

Diffusion Model Overview

Starting with a clean image x₀, we iteratively add noise to generate progressively noisier images x_t, until we reach pure noise at timestep t = T. At t = 0, we have a clean image, and for larger t, the image becomes noisier.

A diffusion model reverses this process by denoising the noisy image x_t at each timestep. Using the predicted noise, we can either remove all noise to estimate x₀ or partially denoise it to get x_{t-1}. The process continues iteratively until we obtain a clean image x₀.

To generate new images, the process begins with pure noise sampled from a Gaussian distribution at timestep T, denoted as x_T. By progressively denoising, we generate a clean image.

The amount of noise added at each step is determined by noise coefficients α_t, predefined during training.

1.1 Implementing the Forward Process

The forward process adds noise to a clean image x₀. The process is defined by:

q(x_t | x₀) = N(x_t; √α̅_t x₀, (1 - α̅_t)I)

This is equivalent to:

x_t = √α̅_t x₀ + √(1 - α̅_t) ε, where ε ~ N(0, I).

The forward process involves scaling x₀ by √α̅_t and adding Gaussian noise scaled by √(1 - α̅_t).

To implement this, use the alphas_cumprod variable, which contains α̅_t values for all t in the range [0, 999]. Remember:

t = 0: clean image, α̅_t → 1.
t close to T: noisier image, α̅_t → 0.

The forward process is tested using a sample image resized to 64x64. For timesteps t in {250, 500, 750}, the results are displayed.
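The forward step above can be sketched as follows. This is a minimal illustration, not the project's code: NumPy stands in for PyTorch, the linear schedule is a hypothetical stand-in for the pretrained model's alphas_cumprod, and the random array is a placeholder for the 64x64 test image.

```python
import numpy as np

def forward(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ q(x_t | x_0) for one integer timestep t."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, I)
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
    return x_t, eps

# Assumption: a stand-in linear schedule; the real values ship with the model.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # placeholder image
noisy = {t: forward(x0, t, alphas_cumprod, rng)[0] for t in (250, 500, 750)}
```

Returning the sampled ε alongside x_t is a convenience for checking the result; only x_t is needed for display.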


1.2 Classical Denoising

The noisy images are denoised using a Gaussian blur filter:

1.3 One-Step Denoising with Diffusion Model

The noisy images are denoised using a pretrained diffusion model with the prompt "a high quality photo":

1.4 Iterative Denoising with Diffusion Model

Diffusion models are designed to denoise iteratively. To achieve this, I could start with noise x₁₀₀₀ at timestep T = 1000 and denoise step by step until reaching x₀. However, this would require running the model 1000 times, which is computationally expensive.

We can speed up the process by skipping steps.

To skip steps, I created a new list of timesteps called strided_timesteps. The first entry, strided_timesteps[0], corresponds to the noisiest image at the largest timestep, and strided_timesteps[-1] corresponds to a clean image. A stride of 30 works well, so I used it to construct the list.
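One way to build such a list is sketched below. The starting timestep of 990 is an assumption (it is the largest multiple of 30 below 1000, so the list ends exactly at 0); only the stride of 30 comes from the text.

```python
# Assumed start at t = 990 so that a stride of 30 lands exactly on t = 0.
START_T = 990
STRIDE = 30

strided_timesteps = list(range(START_T, -1, -STRIDE))

# Consecutive (t, t') pairs that the iterative denoiser would walk through.
step_pairs = list(zip(strided_timesteps[:-1], strided_timesteps[1:]))
```

With these values the model would run 33 times instead of 1000.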

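On the i-th denoising step, I used t = strided_timesteps[i] and denoised to t' = strided_timesteps[i+1]. The formula for this is:

x_t' = (√α̅_t' β_t / (1 - α̅_t)) x₀ + (√α_t (1 - α̅_t') / (1 - α̅_t)) x_t + v_σ

Where:

x_t: Image at timestep t.
x_t': Noisy image at timestep t', where t' < t (less noisy).
α̅_t: Defined by alphas_cumprod.
α_t = α̅_t / α̅_t'.
β_t = 1 - α_t.
x₀: Current estimate of the clean image.
v_σ: Random noise added back at each step.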
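A minimal sketch of a single strided step, assuming the update is the standard DDPM posterior mean, which blends the clean-image estimate x₀ and the current x_t with coefficients built from alphas_cumprod. NumPy stands in for PyTorch; `denoise_step` and `estimate_x0` are hypothetical names, and the added-noise term is left as an optional argument because its exact form is not given here.

```python
import numpy as np

def estimate_x0(x_t, eps_hat, t, alphas_cumprod):
    """Invert the forward process using the predicted noise eps_hat."""
    abar_t = alphas_cumprod[t]
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

def denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod, noise=None):
    """Move from timestep t to the less noisy timestep t_prime < t."""
    abar_t = alphas_cumprod[t]
    abar_tp = alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp   # effective alpha across the skipped steps
    beta_t = 1.0 - alpha_t
    coef_x0 = np.sqrt(abar_tp) * beta_t / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - abar_tp) / (1.0 - abar_t)
    x_tp = coef_x0 * x0_hat + coef_xt * x_t
    if noise is not None:        # optional variance term, supplied by the caller
        x_tp = x_tp + noise
    return x_tp
```

A useful sanity check: if x_t carries no noise and the estimate of x₀ is exact, the step returns √α̅_t' x₀, which is the noiseless forward-process mean at t'.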

A comparison of the results is displayed:

1.5 Diffusion Model Sampling

Using the same iterative denoising function, I generated 5 images from scratch by setting i_start to 0 and passing in random noise with the prompt "a high quality photo":

1.6 Classifier-Free Guidance

I implemented Classifier-Free Guidance (CFG), which can improve the quality of the generated images. In CFG, a conditional and an unconditional noise estimate are computed, and the final noise estimate combines the two using the following equation, where ε_u is the unconditional noise estimate, ε_c is the conditional noise estimate, and ε is the final noise estimate:

ε = ε_u + γ(ε_c - ε_u)

where γ controls the strength of CFG. When γ > 1, the generated images are of higher quality.
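As a small sketch, assuming the usual CFG combination ε = ε_u + γ(ε_c - ε_u), the noise estimates can be mixed as below. `cfg_noise` is a hypothetical helper name, and the two estimates would come from two forward passes of the model (one with the prompt, one with an empty prompt).

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma):
    """Combine unconditional and conditional noise estimates with strength gamma."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

With γ = 0 this returns the unconditional estimate, with γ = 1 the conditional one, and with γ > 1 it extrapolates past the conditional estimate.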

The final noise estimate is used to denoise the image. The results are displayed below:

Part 2: Image-to-Image Translation

1.7 SDEdit

I started with some original test images, added noise to them at different starting indices, and forced them back onto the image manifold without any conditioning.

1.7.1 Hand Drawn and Web Images

I also tested SDEdit on hand-drawn and web images:

1.7.2 Inpainting

For this part, I implemented inpainting in which at each timestep, the generated image x_t is modified so that the regions specified by m = 0 match the original image x_orig. This adjustment ensures consistency with the original image outside the masked region. The formula for this adjustment is:

x_t ← m ⊙ x_t + (1 - m) ⊙ forward(x_orig, t)

This means that everything inside the mask (where m is 1) is filled with new content, while the areas outside the mask (where m is 0) remain unchanged, incorporating the appropriate amount of noise for timestep t.
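The masking step can be sketched as below, under the assumption that the product in the formula is elementwise. `inpaint_step` is a hypothetical name, and `x_orig_noised` stands for the original image already noised to timestep t by the forward process.

```python
import numpy as np

def inpaint_step(x_t, x_orig_noised, mask):
    """Keep generated content where mask == 1, the noised original where mask == 0."""
    return mask * x_t + (1.0 - mask) * x_orig_noised
```

This would be applied after every denoising step, so the region outside the mask always tracks the original image at the current noise level.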

The results are displayed below:

1.7.3 Text-Conditional Image-to-image Translation

I implemented SDEdit with a different prompt for this part:

1. Prompt "a rocket ship"

2. Prompt "a broccoli"

3. Prompt "a ninja"

Part 3 Visual Anagram

Our goal is to generate an image that appears as prompt1 but, when flipped upside down, transforms into the image of a different prompt.

I followed this algorithm:

    ε₁ = UNet(x_t, t, p₁)
    ε₂ = flip(UNet(flip(x_t), t, p₂))
    ε = (ε₁ + ε₂) / 2
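The three lines above translate fairly directly into code. In this sketch `unet` is a hypothetical callable taking (image, timestep, prompt) and returning a noise estimate, and images are assumed to be arrays whose second-to-last axis is height.

```python
import numpy as np

def flip(x):
    """Flip an image array of shape (..., H, W) upside down."""
    return np.flip(x, axis=-2)

def anagram_noise(unet, x_t, t, p1, p2):
    """Average the upright estimate for p1 with the un-flipped estimate for p2."""
    eps1 = unet(x_t, t, p1)
    eps2 = flip(unet(flip(x_t), t, p2))  # estimate on the flipped image, flipped back
    return (eps1 + eps2) / 2.0
```

Flipping the second estimate back is what keeps both estimates aligned to the same pixel grid before averaging.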

The results are displayed below:

Part 4 Hybrid Images

To generate hybrid images using a diffusion model, we will follow a similar approach as described earlier. Specifically, we will estimate the noise using two different text prompts and combine the results to create a composite noise estimate, ε. The composite estimate will merge the low-frequency components from one noise estimate with the high-frequency components from the other. The algorithm is as follows:

        ε₁ = UNet(x_t, t, p₁)
        ε₂ = UNet(x_t, t, p₂)
        ε = f_lowpass(ε₁) + f_highpass(ε₂)
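A sketch of the frequency split is given below. It assumes f_lowpass is a Gaussian blur and f_highpass is the residual after that blur; the kernel size of 33 and σ of 2 are illustrative defaults, not values taken from the text. NumPy is used in place of PyTorch.

```python
import numpy as np

def gaussian_kernel1d(size, sigma):
    """Normalized 1-D Gaussian kernel; size is assumed to be odd."""
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def lowpass(img, size=33, sigma=2.0):
    """Separable Gaussian blur over the last two axes with reflect padding."""
    k = gaussian_kernel1d(size, sigma)
    pad = size // 2
    out = img
    for axis in (-2, -1):
        pad_width = [(0, 0)] * out.ndim
        pad_width[axis] = (pad, pad)
        padded = np.pad(out, pad_width, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded
        )
    return out

def hybrid_noise(eps1, eps2, size=33, sigma=2.0):
    """Low frequencies of eps1 plus high frequencies of eps2."""
    return lowpass(eps1, size, sigma) + (eps2 - lowpass(eps2, size, sigma))
```

Defining the high-pass as "input minus low-pass" means the two filters sum to the identity, so feeding the same estimate in twice returns it unchanged.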

Part B Diffusion Model

Part 1: Training a Single-Step U-Net

An unconditional U-Net is implemented to denoise images:

Building on the objective described earlier, our goal is to solve the denoising problem: given a noisy image z, train a denoiser D_θ such that it reconstructs a clean image x. To achieve this, we optimize the following L2 loss function:

L = 𝔼_z,x[‖D_θ(z) - x‖²]

To train the denoiser, we generate training pairs (z, x), where each x is a clean MNIST digit. The noisy version z is created by adding Gaussian noise to x as follows:

z = x + σϵ, where ϵ ∼ 𝒩(0, 𝐼)

Here, σ is the noise standard deviation, and ϵ is sampled from a standard normal distribution. We assume that x is normalized in the range [0, 1].
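Generating a training pair can be sketched as follows. `add_noise` is a hypothetical helper name, NumPy stands in for PyTorch, and the zero array is only a placeholder for a batch of MNIST digits.

```python
import numpy as np

def add_noise(x, sigma, rng):
    """Return z = x + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(x.shape)
    return x + sigma * eps

rng = np.random.default_rng(0)
x = np.zeros((256, 28, 28))         # placeholder batch, values in [0, 1]
z = add_noise(x, sigma=0.5, rng=rng)
```

Note that z is not clipped back to [0, 1]; whether to clip for visualization is left open here.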

The results of this process for σ values in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0] are visualized:

Training

Objective: Train a denoiser to remove noise from a noisy image z with σ = 0.5, mapping it to a clean image x.
Dataset and Dataloader:
- The training MNIST dataset from torchvision.datasets.MNIST is used.
- Batch size: 256.
- Total epochs: 5.
Model: Unconditional U-Net with hidden size D = 128.
Optimizer: Adam optimizer with a learning rate of 1e-4.
Loss Function:
- Minimize the L2 loss defined as:
  
  L = 𝔼_z,x[‖D_θ(z) - x‖²]

The training loss curve is displayed below; the model had converged by the end of training.

Results

The denoised images sampled from the test set are visualized after epoch 1 and epoch 5. The output after epoch 5 has more defined shapes and less noise.

Denoised results are also visualized for out-of-distribution inputs, i.e. images noised with σ values that the model wasn't trained on.

Part 2: Training a Diffusion Model

In this section, I trained a U-Net model to iteratively denoise images by implementing the Denoising Diffusion Probabilistic Model (DDPM).

Instead of predicting the denoised image, we want the network to predict the noise. Thus, the new objective becomes:

L = 𝔼_z,x[‖ε_θ(z) - ε‖²]

Sampling Process

For diffusion, the goal is to start with a pure noise image ε ~ 𝒩(0, I) and iteratively denoise it to generate a realistic image x. We use iterative denoising by sampling noisy images x_t for timesteps t ∈ {0, 1, ..., T}, where:

x_t = √(α̅_t) x₀ + √(1 - α̅_t) ε

The Variance Schedule is created by:

β₀ = 0.0001 and β_T = 0.02, with evenly spaced values for other timesteps.
α_t = 1 - β_t.
α̅_t = ∏_{s=1}^{t} α_s

To denoise x_t, we condition the U-Net on the timestep t and minimize the loss function:
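The schedule above can be computed in a few lines. The number of timesteps is not stated in this section, so the default of 300 below is purely an assumption, as is the helper name `ddpm_schedule`.

```python
import numpy as np

def ddpm_schedule(beta_start=1e-4, beta_end=0.02, num_timesteps=300):
    """Linear beta schedule with the derived alpha and cumulative-alpha arrays."""
    betas = np.linspace(beta_start, beta_end, num_timesteps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # running product of alphas up to each t
    return betas, alphas, alpha_bars

betas, alphas, alpha_bars = ddpm_schedule()
```

Because every α_t is below 1, the cumulative product decreases monotonically, which is what makes later timesteps noisier.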

L = 𝔼_{x_t,x₀,t}[‖ε_θ(x_t, t) - ε‖²]

2.2 Time-Conditioned U-Net:

Training

Dataset and Dataloader:
- The training MNIST dataset from torchvision.datasets.MNIST is used.
- Batch size: 128.
- Total epochs: 20.
Model: Time-Conditioned U-Net with hidden size D = 64.
Optimizer:
- Adam optimizer with a learning rate of 1e-3
- An exponential learning rate decay scheduler with a gamma value calculated as: gamma = 0.1^(1.0 / num_epochs)
Loss Function:
- Minimize the L2 loss defined as:
  
  L = 𝔼_{x_t,x₀,t}[‖ε_θ(x_t, t) - ε‖²]
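The choice of gamma in the optimizer settings above means the learning rate falls by a factor of 10 over the whole run. The pure-Python sketch below checks this, assuming the scheduler multiplies the rate by gamma once per epoch; `exponential_decay_gamma` is a hypothetical helper name.

```python
def exponential_decay_gamma(num_epochs, total_decay=0.1):
    """Per-epoch factor so that the rate shrinks by total_decay after num_epochs."""
    return total_decay ** (1.0 / num_epochs)

num_epochs = 20
lr = 1e-3
gamma = exponential_decay_gamma(num_epochs)  # about 0.891 for 20 epochs

for _ in range(num_epochs):
    lr *= gamma  # one scheduler step per epoch

# lr is now approximately 1e-4
```

In PyTorch, a factor like this would typically be passed to torch.optim.lr_scheduler.ExponentialLR, though the exact scheduler used is not stated here.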

The training loss curve is displayed below:

Results

The sampling function is implemented according to the DDPM paper:

The generated images after epoch 5 and epoch 20 are displayed. The output after epoch 20 has more defined shapes and less noise.

2.3 Class-Conditioned U-Net:

To improve results and provide more control over image generation, we can condition the U-Net on the class of the digit (0–9). Some modifications are made on top of the implementation of the time-conditioned U-Net:

Each class (0–9) is used as input to the U-Net. Instead of a single scalar value, a one-hot vector is used to represent the class.
Add Fully Connected Blocks (FCBlocks):
- 2 additional FCBlocks are added to the U-Net architecture. These blocks process the class-conditioning vector and integrate it with the U-Net’s other inputs.
Condition on Time and Class:
- The U-Net now accepts both the timestep t and the class-conditioning vector c as inputs.
Unconditioned Scenarios:
- A dropout mechanism is implemented to allow the model to work without class-conditioning. Specifically, 10% of the time (p_uncond = 0.1), the class-conditioning vector c is set to zero.
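The one-hot encoding and conditioning dropout described above could look roughly like this. It is a sketch under assumptions: the drop decision is made per sample, `make_class_condition` is a hypothetical name, and NumPy stands in for PyTorch.

```python
import numpy as np

def make_class_condition(labels, num_classes=10, p_uncond=0.1, rng=None):
    """One-hot encode labels, zeroing each row with probability p_uncond."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    one_hot = np.eye(num_classes)[labels]                # shape (batch, num_classes)
    keep = (rng.random(len(labels)) >= p_uncond).astype(one_hot.dtype)
    return one_hot * keep[:, None]                       # dropped rows become all zeros

c = make_class_condition([3, 1, 4], rng=np.random.default_rng(0))
```

A row of zeros acts as the "no class" signal, which is what lets the same network produce the unconditional estimate needed for classifier-free guidance at sampling time.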

Training

Results

A classifier-free guidance (γ = 0.5) sampling function is implemented according to the DDPM paper:

The generated images after epoch 5 and epoch 20 are displayed. The output after epoch 20 has more defined shapes and less noise.